DEA-C01 — AWS Certified Data Engineer – Associate Quick Review

Last revised: June 29, 2026

Concise independent Quick Review for AWS Certified Data Engineer – Associate (DEA-C01), focused on high-yield services, decision rules, traps, and practice planning.

Quick Review purpose

This Quick Review is for candidates preparing for the real AWS Certified Data Engineer – Associate (DEA-C01) exam who need a fast, practical review before moving into topic drills, mock exams, and detailed explanations.

Use it as an IT Mastery practice guide, not as an AWS publication. The goal is to sharpen service selection, recognize common traps, and connect concepts to original practice questions in a question bank.

The core DEA-C01 mental model

Most questions can be reduced to a data pipeline decision:

Source — application, database, SaaS, stream, log, file, on-premises system.
Ingest — batch, CDC, stream, event, transfer, managed delivery.
Store — S3 data lake, Redshift warehouse, DynamoDB, RDS/Aurora, OpenSearch, or another target.
Catalog and govern — Glue Data Catalog, Lake Formation, IAM, KMS, tags, metadata.
Transform — Glue, EMR, Lambda, Athena SQL, Redshift SQL, Step Functions orchestration.
Serve — Athena, Redshift, QuickSight, APIs, ML, search, downstream applications.
Operate — monitor, retry, checkpoint, validate, secure, audit, optimize cost.

    flowchart LR
        A[Data source] --> B{Ingestion pattern}
        B -->|Batch files| C[S3 / DataSync / Transfer Family]
        B -->|Database migration or CDC| D[AWS DMS]
        B -->|Streaming records| E[Kinesis Data Streams / MSK]
        B -->|Managed delivery| F[Kinesis Data Firehose]
        B -->|SaaS integration| G[AppFlow]
        C --> H[S3 data lake]
        D --> H
        E --> H
        F --> H
        G --> H
        H --> I[Glue Data Catalog]
        I --> J{Processing}
        J -->|Serverless Spark ETL| K[AWS Glue]
        J -->|Custom Spark/Hadoop| L[EMR]
        J -->|SQL over S3| M[Athena]
        J -->|Warehouse analytics| N[Redshift]
        K --> O[Govern, monitor, optimize]
        L --> O
        M --> O
        N --> O

High-yield service map

Need in the question	Strong AWS service candidate	Watch for traps
Serverless Spark ETL, schema-aware jobs, job bookmarks	AWS Glue	Glue Data Catalog stores metadata; it does not transform data by itself.
SQL queries directly over S3 data	Athena	Performance depends heavily on partitions, columnar formats, and scan reduction.
Managed data warehouse for analytics	Amazon Redshift	Not ideal as a raw object data lake; use COPY, Spectrum, external schemas, and proper table design.
Full load and CDC from databases	AWS Database Migration Service (AWS DMS)	DMS is not a general-purpose transformation engine.
High-throughput streaming with custom consumers	Kinesis Data Streams	Ordering is per shard, not global. Partition-key design matters.
Managed stream delivery to S3, Redshift, OpenSearch, or other destinations	Kinesis Data Firehose (renamed Amazon Data Firehose in 2024; either name may appear)	Firehose is delivery-focused; use Kinesis Data Streams when consumers need custom, low-latency stream processing.
Kafka-compatible streaming workloads	Amazon MSK	Do not choose MSK merely because the word “streaming” appears. Look for Kafka compatibility or existing Kafka clients.
Decoupled application messages	SQS	SQS is a queue, not a replayable analytics stream in the same sense as Kinesis or Kafka.
Event routing from AWS services or scheduled events	EventBridge	EventBridge is not a full ETL orchestrator.
Stateful workflow orchestration	Step Functions	Best when explicit states, retries, branches, and service integrations matter.
Airflow-compatible orchestration	Amazon MWAA	Choose when Airflow DAG compatibility or migration is required.
Metadata catalog for data lake tables	AWS Glue Data Catalog	Permissions may still involve IAM, S3 policies, KMS, and Lake Formation.
Centralized data lake permissions and fine-grained access	AWS Lake Formation	Lake Formation does not replace all IAM, network, and KMS considerations.
Sensitive data discovery in S3	Amazon Macie	Macie discovers and classifies; it is not an ETL service.
Secrets for database connections	AWS Secrets Manager	Prefer over hardcoded credentials or plain-text job parameters.
Monitoring jobs, logs, alarms, metrics	CloudWatch	CloudWatch observes; it does not automatically fix bad partition design or failed records.
Audit API activity	CloudTrail	CloudTrail is audit history, not pipeline health monitoring by itself.

Ingestion decision rules

Batch, file, database, SaaS, and stream choices

Scenario clue	Prefer	Why
“Move large files from on-premises storage to S3”	AWS DataSync	Managed transfer, scheduling, verification, bandwidth controls.
“SFTP/FTPS/FTP endpoint for partners”	AWS Transfer Family	Managed file transfer into S3 or EFS.
“Migrate relational database with minimal downtime”	AWS DMS full load plus CDC	Handles initial load and ongoing change replication.
“Capture changes from an operational database into S3”	AWS DMS CDC, sometimes Kinesis targets	CDC is the key phrase.
“SaaS application data into S3 or Redshift”	Amazon AppFlow	Managed SaaS connectors and scheduled/event flows.
“Producers emit real-time clickstream events; multiple consumers process them”	Kinesis Data Streams or MSK	Custom consumers and replay are likely needed.
“Deliver streaming records to S3 with minimal operational overhead”	Kinesis Data Firehose	Managed buffering, batching, delivery, and optional transformation.
“Application components need asynchronous decoupling”	SQS	Queue semantics, retries, DLQs, decoupling.
“Route events from AWS services to targets”	EventBridge	Event pattern matching and event bus routing.

Kinesis Data Streams vs Kinesis Data Firehose

Feature	Kinesis Data Streams	Kinesis Data Firehose
Primary use	Custom stream processing	Managed delivery
Consumers	You build/manage consumers	Firehose manages delivery
Replay	Supported within stream retention	No consumer replay; records are buffered and delivered to the destination
Ordering	Per shard	Delivery batching; do not assume per-record processing order
Transformations	Consumer applications, Lambda, analytics services	Optional lightweight Lambda transformation
Best clue	“Multiple consumers,” “custom processing,” “replay,” “low latency”	“Deliver to S3/Redshift/OpenSearch,” “minimal management,” “buffer”

Streaming traps

Ordering is usually partition-specific. In Kinesis Data Streams, records with the same partition key go to the same shard, so ordering is per shard.
Hot shards come from bad partition keys. A low-cardinality or time-clustered key, such as a coarse timestamp, a region, or a customer type, sends most traffic to a few shards and overloads them.
At-least-once delivery means duplicates can happen. Design idempotent consumers and deduplication where needed.
Firehose buffering affects latency. If a question requires very low latency custom processing, Firehose may not be the best fit.
SQS is not Kinesis. SQS is excellent for decoupling and retries, but it is not usually the right answer for replayable analytics streams with multiple independent consumers.
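
The hot-shard trap above can be illustrated with a minimal sketch. Kinesis Data Streams maps each partition key to a 128-bit value with MD5 and routes the record to the shard whose hash key range contains it. The code below assumes the hash space is split evenly across shards, which is a simplification, and the key values (`premium`, `user-N`) are hypothetical.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def shard_load(partition_keys, shard_count: int) -> Counter:
    """Count how many records land on each shard index."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


# Low-cardinality key: every record shares one value, so one shard takes all traffic.
hot = shard_load(["premium"] * 1000, shard_count=4)

# High-cardinality key: hypothetical per-user IDs spread across all shards.
spread = shard_load([f"user-{i}" for i in range(1000)], shard_count=4)
```

Adding shards does not help the first case, because a single key value always hashes to the same shard. Only a higher-cardinality key changes the distribution.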

Transformation and orchestration review

Choose the processing service

Workload	Prefer	Decision rule
Serverless batch ETL using Spark	AWS Glue	Default for managed Spark ETL, Glue Data Catalog integration, crawlers, bookmarks.
Highly customized Spark/Hadoop ecosystem	Amazon EMR	Choose when cluster-level control, frameworks, or custom tuning matters.
Small event transformation	Lambda	Good for lightweight, short-running transformations; avoid for large ETL.
SQL transformation in warehouse	Redshift	Strong for ELT after data is loaded into warehouse tables.
SQL transformation over S3	Athena	Good for serverless querying and CTAS/INSERT-style transformations over data lake tables.
Multi-step workflow with branching and retries	Step Functions	Explicit state machine, error handling, service integrations.
Airflow DAGs	Amazon MWAA	When Airflow compatibility is central.
Glue-centric job sequence	Glue workflows/triggers	Useful when most steps are Glue crawlers and jobs.
Schedule or event trigger	EventBridge	Good for invoking jobs/functions on a schedule or event pattern.

AWS Glue high-yield points

Concept	What to remember
Glue Data Catalog	Central metadata store for databases, tables, schemas, partitions, and connections.
Crawlers	Infer schema and create/update catalog metadata. They do not clean, join, or transform data.
Jobs	Perform ETL, often Spark-based. Jobs can read from S3, JDBC sources, streams, and catalog tables.
Job bookmarks	Track previously processed data to help avoid reprocessing in incremental workloads.
DynamicFrames	Glue abstraction that can handle semi-structured data and schema inconsistencies.
Connections	Store network and connection information for data stores. Credentials should be protected.
Glue Studio	Visual interface for building and monitoring ETL jobs.
Glue Data Quality	Helps define and evaluate quality rules; failed records still need operational handling.

Common transformation mistakes

Choosing Lambda for heavy joins, large file conversions, or long-running Spark work.
Choosing Glue crawlers when the question asks for transformation logic.
Choosing EMR when the requirement says serverless and minimal infrastructure management.
Ignoring incremental processing. If only new data should be processed, look for bookmarks, CDC, timestamps, watermarks, or checkpoints.
Ignoring bad-record handling. Strong pipelines separate valid records, rejected records, and operational alerts.
Treating orchestration as transformation. Step Functions and MWAA coordinate work; they are not the processing engine by themselves.
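
Incremental processing with a watermark can be sketched in a few lines. This is a simplified illustration, not how Glue job bookmarks are implemented; the `updated_at` field and the record layout are assumptions made for the example.

```python
from datetime import datetime, timezone


def select_new_records(records, watermark):
    """Return records newer than the watermark, plus the advanced watermark."""
    new_records = [r for r in records if r["updated_at"] > watermark]
    # If nothing is new, the watermark stays where it was.
    next_watermark = max((r["updated_at"] for r in new_records), default=watermark)
    return new_records, next_watermark


source = [
    {"id": 1, "updated_at": datetime(2026, 6, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2026, 6, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2026, 6, 3, tzinfo=timezone.utc)},
]

watermark = datetime(2026, 6, 1, tzinfo=timezone.utc)
batch, watermark = select_new_records(source, watermark)  # picks up ids 2 and 3
rerun, watermark = select_new_records(source, watermark)  # nothing left to process
```

A strict `>` comparison can miss late records that share the watermark timestamp, which is one reason real pipelines pair watermarks with deduplication or a small overlap window.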

Storage, data lake design, and query performance

Main storage choices

Storage target	Best for	Watch for
Amazon S3	Data lakes, raw/curated zones, durable object storage	Object layout, partitions, file size, compression, lifecycle, security.
Amazon Redshift	Warehousing, BI, complex analytics	Distribution, sort keys, COPY/UNLOAD, workload management, concurrency.
DynamoDB	Low-latency key-value and document access	Partition-key design, hot keys, GSIs/LSIs, capacity mode, streams.
RDS/Aurora	Transactional relational workloads	Not usually the best answer for large analytical scans.
OpenSearch	Search, log analytics, text search, near-real-time indexing	Not a replacement for a warehouse or data lake.
EFS	Shared file system for compute	Not normally the primary analytical data lake store.

S3 data lake layout

A strong S3 lake design usually has zones:

Zone	Purpose	Example
Raw / landing	Preserve source data with minimal changes	Original JSON, CSV, logs, CDC files.
Cleaned / standardized	Validated, normalized, deduplicated	Parquet with consistent schema.
Curated / serving	Business-ready data sets	Partitioned tables for Athena, Redshift Spectrum, or ML.
Quarantine / rejected	Bad or suspicious records	Schema failures, malformed files, validation errors.
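
The split between the cleaned zone and the quarantine zone can be sketched as a validation step that routes each record. The schema (`order_id`, `amount`) is hypothetical, and a real job would write the two lists to separate S3 prefixes rather than keep them in memory.

```python
REQUIRED_FIELDS = {"order_id": str, "amount": float}  # hypothetical schema


def validate(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: expected {expected_type.__name__}")
    return errors


def route(records):
    """Separate valid records from rejected ones, keeping the reason for rejection."""
    cleaned, quarantined = [], []
    for record in records:
        errors = validate(record)
        if errors:
            quarantined.append({"record": record, "errors": errors})
        else:
            cleaned.append(record)
    return cleaned, quarantined


cleaned, quarantined = route([
    {"order_id": "A-1", "amount": 19.99},
    {"order_id": "A-2"},                    # missing amount
    {"order_id": "A-3", "amount": "12"},    # wrong type
])
```

Keeping the error list next to each rejected record makes later reprocessing and alerting easier than dropping bad rows silently.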

File format and partition decisions

Design choice	Strong exam answer
Querying only selected columns	Use columnar formats such as Parquet or ORC.
Reducing scanned data	Partition by common filters and use compression.
Avoiding excessive S3 requests	Compact small files into larger analytical files.
Handling evolving schema	Use catalog updates, compatible formats, and planned schema evolution.
Frequent Athena queries	Use partition pruning, partition projection where appropriate, and columnar storage.
Raw auditability	Keep immutable raw data before transformation.
Lifecycle cost control	Move older data to lower-cost storage classes when access patterns allow.

Partition traps

Partition by query pattern, not by habit. Date partitions are common, but the best key depends on how users filter data.
Too many tiny partitions can hurt performance. Hour/minute/customer partitions may create partition explosion.
Partition columns are often derived from path structure. The catalog must know the partition values.
Columnar format plus partitions is stronger than either alone.
Athena cost and speed depend on data scanned. Compress, partition, and select only needed columns.
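
Path-derived partitions and pruning can be illustrated with a short sketch. It builds Hive-style `key=value` prefixes for a date-partitioned table and lists only the prefixes a date-range query would need to read. The bucket and table path are hypothetical.

```python
from datetime import date, timedelta


def partition_prefix(table_root: str, day: date) -> str:
    """Build a Hive-style partition prefix for one day."""
    return f"{table_root}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def prefixes_for_range(table_root: str, start: date, end: date) -> list:
    """List the partition prefixes covering an inclusive date range."""
    day_count = (end - start).days + 1
    return [partition_prefix(table_root, start + timedelta(days=i)) for i in range(day_count)]


root = "s3://example-bucket/curated/orders"  # hypothetical location
needed = prefixes_for_range(root, date(2026, 6, 29), date(2026, 7, 1))
```

A query that filters on the partition columns touches three prefixes here instead of the whole table, which is the effect partition pruning aims for.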

Redshift review

Redshift decision points

Topic	Review rule
Loading	COPY from S3 is the standard high-throughput loading pattern.
Exporting	UNLOAD writes query results back to S3.
External data	Redshift Spectrum queries external S3 data through external schemas and catalog metadata.
Distribution	Choose distribution style to reduce data movement, especially for large joins.
Sort keys	Improve range-restricted scans and query pruning when aligned with filters.
Workload isolation	Use workload management, scaling, or separate designs depending on the requirement.
Security	Combine IAM roles, VPC/network controls, encryption, and audit logging.

Distribution and sort-key intuition

Question clue	Likely design
Large fact table frequently joins to dimension table on customer_id	Consider distribution on the join key if it avoids redistribution.
Small dimension table joined often	DISTSTYLE ALL, which keeps a full copy on every node, may help when the table is small and changes infrequently.
Queries filter by date ranges	Date/time sort key may help range scans.
Queries filter by many dimensions unpredictably	Avoid overcommitting to one narrow sort strategy without evidence.
Skewed join key	Bad DISTKEY candidate even if it appears in joins.

Redshift traps

Redshift is not the default place for all raw data; S3 is usually the landing zone for a lake.
Redshift Spectrum still needs access to S3 data and metadata.
COPY is usually better than row-by-row inserts for large loads.
Poor distribution keys cause data skew and network redistribution.
Sort keys help only when query predicates can benefit from them.
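
Why a skewed join key makes a poor DISTKEY can be shown with simple arithmetic. With key-based distribution, all rows that share a key value land on the same slice, so the heaviest key sets a floor on the busiest slice. The sketch below computes that floor as a ratio to the average slice; the row counts and slice count are made-up numbers, and this is not a Redshift API.

```python
from collections import Counter


def min_skew_ratio(key_values, slice_count: int) -> float:
    """Lower bound on (busiest slice rows / average slice rows) for a candidate key."""
    counts = Counter(key_values)
    heaviest_key_rows = max(counts.values())
    average_rows_per_slice = len(key_values) / slice_count
    # A ratio below 1 is not meaningful, so clamp at 1 (perfectly even).
    return max(1.0, heaviest_key_rows / average_rows_per_slice)


# Hypothetical fact table: one customer owns 700 of 1,000 rows.
skewed_keys = [1] * 700 + list(range(2, 302))
# Hypothetical alternative: every row has a distinct key value.
even_keys = list(range(1000))

skewed = min_skew_ratio(skewed_keys, slice_count=4)
even = min_skew_ratio(even_keys, slice_count=4)
```

In the skewed case one slice holds at least 2.8 times the average, so that slice bounds query time no matter how many nodes are added.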

Catalog, schema, and metadata

Requirement	Best concept
Discover schemas in S3 and create table definitions	Glue crawler
Central table definitions for Athena, Glue, and other analytics services	Glue Data Catalog
Fine-grained lake permissions	Lake Formation
Schema compatibility for streaming producers/consumers	Schema registry pattern, such as AWS Glue Schema Registry
Governed access by business classification	LF-tags and Lake Formation permissions
Detect personally identifiable or sensitive data in S3	Macie
Track operational lineage and transformations	Use catalog metadata, job logs, workflow metadata, and governance tools as appropriate

Schema evolution traps

A crawler can detect changes, but automatic schema changes can break downstream consumers.
Adding nullable columns is usually safer than renaming or changing data types.
Streaming producers and consumers need compatibility controls before bad data reaches storage.
Catalog schema and physical data must match. A table definition alone does not fix inconsistent files.
Partition schema drift can cause query errors when old and new files differ.
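
The rule of thumb that adding nullable columns is safer than renaming or retyping can be written as a small compatibility check. This is a simplified sketch, not the compatibility logic of AWS Glue Schema Registry; schemas are represented here as plain dictionaries mapping a column name to a `(type, nullable)` pair, which is an assumption made for the example.

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """List changes that could break consumers written against old_schema."""
    problems = []
    for name, (old_type, _) in old_schema.items():
        if name not in new_schema:
            problems.append(f"column removed or renamed: {name}")
        elif new_schema[name][0] != old_type:
            problems.append(f"type changed for {name}: {old_type} -> {new_schema[name][0]}")
    for name, (_, nullable) in new_schema.items():
        if name not in old_schema and not nullable:
            problems.append(f"new column is not nullable: {name}")
    return problems


v1 = {"order_id": ("string", False), "amount": ("double", False)}
v2_safe = {**v1, "coupon_code": ("string", True)}  # adds a nullable column
v2_risky = {"order_id": ("string", False), "total": ("decimal", False)}  # rename plus new type

safe_result = breaking_changes(v1, v2_safe)
risky_result = breaking_changes(v1, v2_risky)
```

Running a check like this before a crawler or producer change is applied gives downstream consumers a chance to adapt first.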

Security and governance review

IAM, resource policies, KMS, and Lake Formation

Layer	What it controls	Common trap
IAM identity policy	What a user, role, or service principal can do	Granting identity permissions but forgetting resource policy or KMS access.
S3 bucket policy	Who can access bucket and objects	Cross-account access often needs both role permissions and bucket policy.
KMS key policy/grants	Who can use encryption keys	S3 or Glue access can still fail if KMS decrypt is not allowed.
Lake Formation	Table, column, row, and tag-based data lake permissions	Lake Formation grants work alongside IAM, not instead of it; a principal with IAM access can still be denied if the Lake Formation grant is missing.
Secrets Manager	Secure database credentials and rotation	Do not hardcode passwords in Glue scripts or job parameters.
VPC endpoints	Private access to AWS services	Glue jobs in private subnets need a route to S3, KMS, Secrets Manager, and other services they call.
CloudTrail	API audit events	Not the same as application logs or ETL error logs.
CloudWatch	Metrics, logs, alarms	Not a permission system or audit ledger.

Cross-account data access checklist

When a question involves cross-account S3, Glue, Redshift Spectrum, Athena, or Lake Formation, check all of these:

Trust policy — can the principal assume the required role?
Identity policy — does the role allow required actions?
Resource policy — does the bucket, key, queue, or topic allow access?
KMS key policy — can the principal encrypt/decrypt with the key?
Lake Formation grants — if Lake Formation governs the table, are data permissions granted?
Network path — can the service reach the endpoint privately if required?
Catalog sharing — is metadata available to the consuming account?
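
The checklist works as a chain in which every layer must allow the request. A minimal sketch of that reasoning follows; the layer names mirror the checklist above, and the `results` dictionary is a hypothetical summary of what a troubleshooting pass found, not output from any AWS API.

```python
CHECKS = [
    "trust_policy",
    "identity_policy",
    "resource_policy",
    "kms_key_policy",
    "lake_formation_grants",
    "network_path",
    "catalog_sharing",
]


def first_blocker(results: dict):
    """Return the first layer that does not allow access, or None if all layers allow it."""
    for check in CHECKS:
        # A layer that was never verified is treated as blocking.
        if not results.get(check, False):
            return check
    return None


# Hypothetical finding: everything is in place except the KMS key policy.
findings = {check: True for check in CHECKS}
findings["kms_key_policy"] = False
blocker = first_blocker(findings)
```

Treating an unverified layer as a failure matches how these questions are usually written: the correct answer is often the one layer the scenario never mentions.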

Encryption review

Data state	Typical controls
At rest in S3	Server-side encryption with AWS-managed or customer-managed KMS keys, bucket policies.
At rest in Redshift/RDS/DynamoDB	Service encryption settings and KMS keys where applicable.
In transit	TLS/HTTPS/JDBC over TLS, secure endpoints.
Secrets	Secrets Manager or Parameter Store with appropriate encryption and access control.
Logs	Encrypt and restrict access to CloudWatch Logs, S3 log buckets, and audit trails.

Operations, reliability, and troubleshooting

Pipeline reliability patterns

Need	Pattern
Avoid duplicate processing	Idempotent writes, deduplication keys, checkpoints, job bookmarks.
Recover from transient failures	Retries with backoff, DLQs, replayable streams, reprocessing from raw zone.
Detect failed jobs	CloudWatch metrics/logs/alarms, EventBridge failure events, workflow status.
Handle bad records	Quarantine path, validation rules, data quality reports, alerting.
Maintain auditability	Raw immutable landing zone, CloudTrail, job logs, lineage metadata.
Minimize blast radius	Separate environments, least privilege roles, isolated prefixes/buckets/accounts.
Reduce cost	Partition pruning, file compaction, right-sized compute, lifecycle policies.
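
Idempotent writes with a deduplication key can be sketched as a keyed sink that ignores an event it has already applied. The `IdempotentSink` class and the `event_id` field are hypothetical; in practice the same idea appears as a conditional write, a merge on a business key, or an upsert.

```python
class IdempotentSink:
    """In-memory stand-in for a target that supports keyed writes."""

    def __init__(self):
        self.rows = {}

    def write(self, event: dict) -> bool:
        """Apply the event once; return False when it is a duplicate delivery."""
        key = event["event_id"]
        if key in self.rows:
            return False
        self.rows[key] = event
        return True


sink = IdempotentSink()
deliveries = [
    {"event_id": "e-1", "amount": 10},
    {"event_id": "e-2", "amount": 25},
    {"event_id": "e-1", "amount": 10},  # redelivered after a retry
]
applied = [sink.write(event) for event in deliveries]
```

Because the outcome is the same whether an event arrives once or several times, retries and stream replays become safe.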

Service-specific operational points

Service	Operational focus
AWS Glue	Job logs, worker sizing, bookmarks, retries, data quality, connection failures.
Kinesis Data Streams	Shard capacity, iterator age, consumer lag, hot shards, retention.
Kinesis Data Firehose	Delivery failures, backup S3 prefix, transformation errors, buffering settings.
AWS DMS	Replication lag, task errors, table mapping, endpoint connectivity, CDC status.
Athena	Query failures, partition metadata, data format errors, scanned data volume.
Redshift	Query performance, data skew, workload queues, COPY errors, disk/storage pressure.
DynamoDB	Throttling, hot partitions, capacity mode, GSI design, stream consumers.
Step Functions	Failed states, retry/catch behavior, timeout settings, state input/output size.
EventBridge	Rule pattern matching, target permissions, dead-letter or retry configuration.

Monitoring trap list

A passing pipeline can still produce bad data; monitor quality, not just job success.
CloudWatch logs may show the error, but you still need retry, alert, and remediation design.
Duplicate events are normal in many distributed systems; consumers must handle them.
“Near real time” may require watching consumer lag, iterator age, or delivery delay.
A job that scans too much data is both slower and more expensive.

Calculation and capacity review

If a scenario gives per-shard or per-partition limits, calculate from the numbers in the question rather than memorizing changing quotas.

For shard-style capacity planning, use the largest requirement across write throughput, record count, and read throughput:

\[ \text{required shards} = \max( \lceil \text{write throughput} / \text{write capacity per shard} \rceil, \lceil \text{records per second} / \text{records per shard} \rceil, \lceil \text{read throughput} / \text{read capacity per shard} \rceil ) \]
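
A worked instance of the formula, as a minimal sketch. The per-shard limits are passed in as parameters because they should come from the question stem; the numbers used below are hypothetical.

```python
import math


def required_shards(write_mb_s, records_s, read_mb_s, *,
                    write_mb_per_shard, records_per_shard, read_mb_per_shard):
    """Return the shard count that satisfies the tightest of the three constraints."""
    return max(
        math.ceil(write_mb_s / write_mb_per_shard),
        math.ceil(records_s / records_per_shard),
        math.ceil(read_mb_s / read_mb_per_shard),
    )


# Hypothetical stem: 6.5 MB/s written, 4,000 records/s, 15 MB/s read,
# with stated limits of 1 MB/s and 1,000 records/s in, 2 MB/s out per shard.
shards = required_shards(
    6.5, 4000, 15,
    write_mb_per_shard=1, records_per_shard=1000, read_mb_per_shard=2,
)
```

Here the three terms are 7, 4, and 8, so read throughput is the binding constraint and the answer is 8 shards.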

For data-lake query cost/performance questions, focus on scanned data:

\[ \text{data scanned} \approx \text{fraction of columns read} \times \text{fraction of partitions matched} \times \text{stored (compressed) table size} \]

Practical implications:

Parquet/ORC reduces scanned columns.
Partitions reduce scanned rows/files.
Compression reduces physical bytes read.
File compaction reduces request overhead.
Predicate pushdown helps only when data format and query design support it.
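
The scan estimate can be turned into a quick calculation by treating the column and partition terms as fractions of the stored table size. This is a rough sketch: it assumes columns are about equal in size and partitions are about equal in volume, and the table figures are hypothetical.

```python
def estimated_scan_gb(stored_gb: float, column_fraction: float = 1.0,
                      partition_fraction: float = 1.0) -> float:
    """Rough estimate of data scanned, given the stored size and the fractions read."""
    return stored_gb * column_fraction * partition_fraction


# Hypothetical table: 1,000 GB stored as compressed Parquet, 30 columns, 365 daily partitions.
# The query reads 3 columns and filters to 7 days.
columnar_scan = estimated_scan_gb(1000, column_fraction=3 / 30, partition_fraction=7 / 365)

# Same filter on a row-based format such as CSV: every column is read.
row_format_scan = estimated_scan_gb(1000, column_fraction=1.0, partition_fraction=7 / 365)
```

The columnar case scans about 1.9 GB and the row-based case about 19 GB, which shows why format and partitioning multiply rather than add.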

Common DEA-C01 answer-choice traps

Trap	Better reasoning
“Use Lambda for all transformations.”	Lambda is for lightweight event processing; Glue or EMR is better for large ETL.
“Use Glue crawler to transform data.”	Crawlers infer metadata; Glue jobs transform.
“Use SQS for analytics stream replay.”	SQS is a queue; choose Kinesis Data Streams or MSK for stream processing and replay-style consumers.
“Use Firehose for custom multi-consumer stream apps.”	Firehose is managed delivery; Kinesis Data Streams or MSK fits custom consumers.
“Use Redshift as the raw data lake.”	Land raw data in S3; load curated data into Redshift when warehouse analytics are needed.
“Grant S3 access and ignore KMS.”	Encrypted objects require KMS permissions too.
“Lake Formation means IAM no longer matters.”	IAM, Lake Formation, S3, KMS, and service roles can all matter.
“Partition by every possible column.”	Overpartitioning causes metadata and small-file problems.
“Use CSV because it is simple.”	Columnar formats are usually better for analytics over large data.
“Assume exactly-once delivery.”	Many services are at-least-once; design deduplication and idempotency.
“Use DMS for complex ETL.”	DMS is for migration and replication, not rich transformations.
“Use Athena without considering format.”	Athena performance depends on data layout, compression, partitions, and scanned bytes.
“Use public internet paths for private data jobs.”	Look for VPC endpoints, private subnets, security groups, and private connectivity.
“Trust the catalog blindly.”	Catalog metadata must match actual files and permissions.
“Ignore rejected records.”	Real pipelines need quarantine, alerts, and reprocessing strategy.

Quick service-selection drills

Use these as fast mental prompts before attempting original practice questions.

If the stem says…

Stem phrase	Think first
“Serverless ETL”	AWS Glue
“Spark with minimal infrastructure”	AWS Glue
“Existing Hadoop/Spark ecosystem”	EMR
“Run SQL on S3”	Athena
“Data warehouse analytics”	Redshift
“Full load and CDC”	AWS DMS
“Kafka-compatible”	Amazon MSK
“Managed delivery stream”	Kinesis Data Firehose
“Multiple custom stream consumers”	Kinesis Data Streams
“Route events from AWS services”	EventBridge
“State machine, retries, branching”	Step Functions
“Airflow DAGs”	Amazon MWAA
“Discover table schema”	Glue crawler
“Central table metadata”	Glue Data Catalog
“Fine-grained lake permissions”	Lake Formation
“Sensitive data discovery in S3”	Macie
“Store and rotate credentials”	Secrets Manager

Practice plan after this Quick Review

For the AWS Certified Data Engineer – Associate (DEA-C01) exam, do not practice only by memorizing service names. The real skill is choosing among plausible AWS services under constraints.

Use a question bank in this order:

Topic drills: ingestion
- Kinesis Data Streams vs Firehose vs MSK vs SQS.
- DMS full load and CDC.
- DataSync, Transfer Family, and AppFlow scenarios.
Topic drills: storage and catalog
- S3 partitioning and file formats.
- Glue Data Catalog and crawlers.
- Athena, Redshift Spectrum, and Redshift loading.
Topic drills: transformation
- Glue vs EMR vs Lambda vs Athena SQL vs Redshift SQL.
- Incremental processing, bookmarks, checkpoints, and bad-record handling.
Topic drills: security and governance
- IAM plus S3 bucket policies.
- KMS key-policy failures.
- Lake Formation permissions and LF-tags.
- Cross-account access.
Topic drills: operations
- CloudWatch logs and metrics.
- Retry, DLQ, replay, deduplication.
- Cost and performance optimization.
Mixed mock exams
- Force yourself to explain why each wrong answer is wrong.
- Track misses by decision type, not just by service.
- Revisit detailed explanations for every guessed question.

Final quick checklist

Before you start a timed mock exam, confirm you can answer these without notes:

Can I distinguish Kinesis Data Streams, Kinesis Data Firehose, MSK, and SQS?
Can I choose between Glue, EMR, Lambda, Athena, and Redshift for transformations?
Can I explain why S3 layout affects Athena and data-lake performance?
Can I identify when DMS CDC is the right ingestion pattern?
Can I troubleshoot access failures involving IAM, S3 policies, KMS, and Lake Formation?
Can I recognize small-file, overpartitioning, schema drift, and hot partition problems?
Can I design for retries, idempotency, DLQs, checkpoints, and quarantine paths?
Can I connect monitoring tools to the right failure type?

Next step: use this Quick Review as a checklist, then move into DEA-C01 topic drills and original practice questions with detailed explanations until your mistakes are concentrated in a few identifiable decision areas.

Continue in IT Mastery

Use this Quick Review as a final concept map, then move into IT Mastery for focused topic drills, mixed practice sets, timed mock exams, and detailed explanations. The practice questions are original IT Mastery practice items; they are not official AWS questions, copied live-exam content, or exam dumps.
