DEA-C01 — AWS Certified Data Engineer – Associate Exam Blueprint

Last revised: June 29, 2026

Independent exam blueprint for AWS Certified Data Engineer – Associate (DEA-C01) readiness, covering ingestion, transformation, storage, operations, security, and governance.

How to Use This Exam Blueprint

Use this checklist as a practical readiness map for the AWS Certified Data Engineer – Associate (DEA-C01) exam from AWS. It is not a replacement for the official exam guide, and it does not claim exact exam weighting. Instead, it turns likely exam topic areas into concrete review tasks.

For each area, ask:

Can I choose the right AWS service for the scenario?
Can I explain why the wrong options are wrong?
Can I identify security, cost, reliability, and operational tradeoffs?
Can I troubleshoot a broken pipeline from symptoms, logs, permissions, schema changes, or data quality signals?
Can I connect ingestion, storage, transformation, cataloging, orchestration, monitoring, and governance into an end-to-end data architecture?

DEA-C01 readiness areas at a glance

Readiness area	What to review	You are ready when you can…
Data ingestion	Batch, streaming, CDC, event-driven ingestion, managed connectors	Match sources and latency needs to AWS ingestion services
Data storage	Amazon S3, data lakes, warehouses, operational stores, file formats, partitioning	Choose storage patterns for analytics, durability, cost, and performance
Data transformation	AWS Glue, Apache Spark concepts, SQL transforms, ELT/ETL, schema evolution	Design repeatable transformations and handle dirty or changing data
Data cataloging	AWS Glue Data Catalog, crawlers, metadata, partitions, schemas	Make datasets discoverable and queryable by downstream tools
Orchestration	AWS Step Functions, Amazon EventBridge, AWS Glue workflows, Amazon MWAA concepts	Coordinate jobs, retries, dependencies, and failure handling
Analytics services	Amazon Athena, Amazon Redshift, Amazon EMR, Amazon OpenSearch Service where relevant	Select the right query or processing engine for workload needs
Security	IAM, resource policies, AWS KMS, encryption, network boundaries, Lake Formation	Apply least privilege and protect data at rest and in transit
Governance	Data classification, access control, lineage concepts, quality checks, retention	Explain how governed data access works across users and services
Monitoring and operations	Amazon CloudWatch, logs, metrics, alarms, AWS CloudTrail, job run history	Diagnose failed, slow, expensive, or incomplete data pipelines
Performance and cost	Partitioning, compression, file size, query pruning, scaling, lifecycle policies	Improve throughput and cost without weakening reliability or security
Reliability	Idempotency, retries, checkpoints, dead-letter handling, backfills	Recover from failures without duplicating or losing data

Core service selection checklist

Ingestion and movement

Scenario cue	Service or pattern to consider	Readiness check
Continuous application events	Amazon Kinesis Data Streams	Can you reason about producers, consumers, ordering, retention, and scaling?
Near-real-time delivery to S3 or analytics destinations	Amazon Data Firehose	Can you distinguish managed delivery from custom stream processing?
Managed Apache Kafka requirement	Amazon MSK	Can you identify when Kafka compatibility matters?
Database migration or change data capture	AWS Database Migration Service	Can you separate full load, CDC, replication, and transformation responsibilities?
SaaS data ingestion	Amazon AppFlow	Can you identify when a managed connector reduces custom integration work?
File-based batch ingestion	Amazon S3 landing zone	Can you design prefixes, validation, metadata, and downstream triggers?
Event-based trigger after object arrival	Amazon EventBridge or S3 event notification pattern	Can you choose an event-driven pipeline without unnecessary polling?
Cross-account data movement	IAM roles, bucket policies, AWS Lake Formation where applicable	Can you secure producer and consumer access without broad permissions?

Storage and data layout

Topic	What to know	Ready signal
Amazon S3 data lake design	Raw, curated, and consumption zones; prefixes; object lifecycle	You can explain how data moves through zones and who can access each zone
File formats	CSV, JSON, Parquet, ORC, Avro concepts	You can choose columnar formats for analytical query efficiency
Compression	Common compression tradeoffs	You can explain cost, scan reduction, and splittability considerations
Partitioning	Date, region, tenant, source-system, or business keys	You can avoid over-partitioning and under-partitioning
Small files	Compaction and batching	You can identify why many tiny files hurt query and job performance
Schema evolution	Backward-compatible changes, nullable fields, crawler updates	You can predict downstream impact of added, removed, or renamed fields
Amazon Redshift	Warehousing, SQL analytics, loading from S3, external data concepts	You can choose Redshift for structured analytics and performance requirements
Amazon Athena	Serverless SQL over S3 data	You can optimize Athena with partitions, columnar files, and catalog metadata
Operational stores	Amazon DynamoDB, Amazon RDS, purpose-built stores	You can avoid using analytics services for transactional access patterns

Transformation and processing

Need	Review	Ready signal
Serverless ETL	AWS Glue jobs	You can describe Glue job inputs, transforms, outputs, bookmarks, and error handling
Distributed processing	Apache Spark concepts on Glue or EMR	You can reason about partitions, joins, shuffles, skew, and executor resource pressure
SQL transformation	Athena, Redshift SQL, Spark SQL	You can write or read transformations involving joins, filtering, aggregation, and deduplication
Large-scale custom frameworks	Amazon EMR	You can identify when managed clusters or open-source ecosystem compatibility are needed
Lightweight event transforms	AWS Lambda	You can recognize payload size, runtime, retry, and idempotency constraints conceptually
Data preparation	Validation, cleansing, normalization, enrichment	You can detect dirty data and decide where to handle it in the pipeline
Incremental processing	Bookmarks, watermarks, CDC columns, checkpoints	You can explain how to process only new or changed records safely
Backfills	Replay from source, reprocess from raw zone, controlled overwrite	You can backfill without corrupting curated data or double-counting records
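
The incremental-processing row above can be made concrete with a watermark. The sketch below is plain Python rather than Glue or Spark code, and the record shape, field names, and `select_new_records` helper are hypothetical; it only illustrates the idea of processing records newer than the last stored watermark and then advancing it.

```python
from datetime import datetime

def select_new_records(records, watermark):
    """Return records newer than the watermark, plus the advanced watermark."""
    fresh = [r for r in records if r["updated_at"] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark

# Hypothetical source records with a CDC-style "updated_at" column.
records = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 1, 12, 0)},
    {"order_id": 3, "updated_at": datetime(2024, 1, 2, 8, 0)},
]

watermark = datetime(2024, 1, 1, 10, 0)
fresh, watermark = select_new_records(records, watermark)

# A second run with the advanced watermark finds nothing new.
again, _ = select_new_records(records, watermark)
```

The strict `>` comparison skips a late record that carries exactly the same timestamp as the watermark. Whether that is acceptable depends on the source, which is the kind of tradeoff a scenario question may probe.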

“Can you do this?” skills checklist

Architecture and service choice

Given batch, streaming, and CDC requirements, choose an ingestion pattern and justify it.
Distinguish Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon MSK, AWS DMS, and file-based S3 ingestion.
Choose between Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, and Amazon OpenSearch Service for analytics or processing scenarios.
Identify when a serverless approach is simpler than provisioning and operating clusters.
Identify when custom code is justified versus managed connectors or managed ETL.
Design a pipeline with landing, raw, curated, and consumption layers.
Recognize which services require a Data Catalog, table metadata, or partition metadata.
Choose orchestration for multi-step jobs, retries, branching, scheduled runs, and event-driven starts.

Data modeling, formats, and query performance

Choose Parquet or ORC for analytical workloads when column pruning and compression matter.
Explain why JSON or CSV may be acceptable for interchange but inefficient for repeated analytics.
Design partition keys that match common query predicates.
Detect an over-partitioned table from symptoms such as excessive metadata, many tiny files, or slow planning.
Detect an under-partitioned table from symptoms such as high scan volume.
Explain how compaction improves query performance.
Explain when denormalization, star schemas, or curated aggregates may support analytics.
Recognize join skew and high-cardinality grouping as performance risks.
Explain schema evolution risks for crawlers, consumers, and downstream reports.
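
To see why partition keys should match common query predicates, it can help to model a table as a set of partitions with sizes. This is a simplified sketch in plain Python; the partition sizes and the `megabytes_scanned` helper are invented for illustration and do not reflect how any specific engine accounts for scans.

```python
def megabytes_scanned(partitions, wanted_dates=None):
    """Sum the size of every partition a query has to read."""
    if wanted_dates is None:
        # No filter on the partition column: every partition is read.
        return sum(partitions.values())
    return sum(size for day, size in partitions.items() if day in wanted_dates)

# Hypothetical table partitioned by order_date, sizes in MB.
partitions = {"2024-01-01": 400, "2024-01-02": 350, "2024-01-03": 250}

full_scan = megabytes_scanned(partitions)
pruned_scan = megabytes_scanned(partitions, {"2024-01-03"})
```

In this toy model the filtered query reads 250 MB instead of 1000 MB. The benefit disappears if queries rarely filter on the partition column.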

Security and governance

Apply least privilege with IAM roles used by Glue jobs, crawlers, Lambda functions, and orchestration services.
Distinguish identity-based policies, resource-based policies, bucket policies, and service roles.
Explain how AWS KMS keys affect encrypted data access.
Identify missing permissions from access denied symptoms.
Secure S3 data with encryption, restricted public access, bucket policies, and scoped access.
Explain how AWS Lake Formation can centralize data lake permissions.
Recognize when column-level, row-level, or table-level access controls are relevant.
Identify CloudTrail as a source for API activity auditing.
Explain why network controls, VPC endpoints, and private connectivity may matter for data pipelines.

Operations, troubleshooting, and reliability

Read job failure symptoms and identify whether the likely cause is IAM, schema, data quality, network, capacity, dependency, or code.
Use CloudWatch logs and metrics as the first place to investigate managed job failures.
Design retries without creating duplicate records.
Explain idempotent writes and deterministic output paths.
Use checkpoints, bookmarks, offsets, or watermarks to support incremental processing.
Handle late-arriving data in streaming or event-driven pipelines.
Route bad records to a quarantine location or dead-letter path.
Plan a backfill from a raw immutable source.
Separate monitoring for pipeline health, data quality, latency, freshness, and cost.
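
The retry and idempotency items above can be illustrated with a keyed upsert. This is a minimal sketch, assuming an in-memory dictionary stands in for the target table and that `order_id` and `updated_at` are the business key and version column; both names are hypothetical.

```python
def idempotent_write(target, batch):
    """Upsert by business key, keeping the newest version of each record."""
    for record in batch:
        key = record["order_id"]
        current = target.get(key)
        if current is None or record["updated_at"] >= current["updated_at"]:
            target[key] = record
    return target

batch = [
    {"order_id": "A1", "status": "NEW", "updated_at": 1},
    {"order_id": "A1", "status": "SHIPPED", "updated_at": 2},
    {"order_id": "B7", "status": "NEW", "updated_at": 1},
]

table = {}
idempotent_write(table, batch)
after_first_run = dict(table)

# Simulated retry: the same batch is delivered again.
idempotent_write(table, batch)
```

Because the write is keyed and versioned, replaying the batch leaves the table unchanged. An append-only write of the same batch would have doubled the rows.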

Data pipeline lifecycle checklist

Pipeline phase	Candidate tasks to practice	Common exam-style decision
Source analysis	Identify source type, volume, velocity, schema stability, ownership	Is this batch, stream, CDC, or SaaS ingestion?
Landing	Store original data in S3 or appropriate landing store	Should raw data be immutable and replayable?
Validation	Check schema, required fields, ranges, duplicates, referential assumptions	Should invalid records fail the job or be quarantined?
Transformation	Cleanse, normalize, enrich, join, aggregate	Should transformation be ETL before load or ELT after load?
Cataloging	Register tables, schemas, partitions, and metadata	Does the query engine know where the data is and how it is structured?
Consumption	Serve Athena, Redshift, dashboards, ML, APIs, or search	Which access pattern is primary: SQL analytics, low-latency search, or operational reads?
Governance	Apply permissions, classification, encryption, retention	Who is allowed to see which columns, rows, or datasets?
Observability	Monitor job success, freshness, latency, cost, errors	How will teams detect stale, incomplete, or expensive pipelines?
Recovery	Retry, replay, backfill, rollback, or reprocess	Can you recover without duplication or data loss?

Scenario and decision-point checks

Batch versus streaming

If the question says…	Think about…	Avoid jumping to…
Data arrives once per day from files	Batch ingestion to S3, Glue jobs, Athena/Redshift loading	Kinesis just because “data pipeline” is mentioned
Events must be processed continuously	Kinesis Data Streams, Data Firehose, Lambda, stream processing	Nightly batch jobs
Delivery to S3 with minimal custom code	Amazon Data Firehose	Building producers and consumers unless required
Kafka clients or Kafka ecosystem compatibility	Amazon MSK	Kinesis without checking compatibility needs
Source database changes must be captured	AWS DMS with CDC concepts	Periodic full exports if changes must be near-real-time

Athena versus Redshift versus Glue

Requirement	More likely fit	Check your reasoning
Serverless ad hoc SQL over S3	Amazon Athena	Is data already in S3 and cataloged?
Managed data warehouse for structured analytics	Amazon Redshift	Are there repeated analytics workloads and warehouse-style needs?
ETL job to transform large datasets	AWS Glue	Is the task processing/transformation rather than interactive querying?
Open-source big data framework flexibility	Amazon EMR	Is cluster-level control or ecosystem compatibility required?
Search, indexing, log-style exploration	Amazon OpenSearch Service	Is full-text search or near-real-time indexing central?

S3 data lake layout

Decision	Good exam-ready answer includes
Where should raw data go?	Immutable raw zone with source-preserving format and restricted access
Where should cleaned data go?	Curated zone with validated schema, optimized format, and documented partitions
How should output paths be designed?	Deterministic, partition-aware, and safe for retries or overwrites
How should old data be managed?	Lifecycle policies, retention requirements, and cost-aware storage classes
How should sensitive data be handled?	Encryption, access controls, masking or tokenization where appropriate, auditability
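
A deterministic, partition-aware output path can be sketched as a pure function of the dataset and run date. The bucket name, dataset name, and `curated_prefix` helper below are assumptions for illustration, and the `key=value` layout is one common Hive-style convention rather than a requirement.

```python
from datetime import date

def curated_prefix(bucket, dataset, run_date):
    """Build a Hive-style prefix that depends only on the inputs."""
    return (
        f"s3://{bucket}/{dataset}/"
        f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}/"
    )

prefix = curated_prefix("curated-zone", "orders", date(2024, 3, 5))
retry_prefix = curated_prefix("curated-zone", "orders", date(2024, 3, 5))
```

Because the path contains no run ID or wall-clock timestamp, a retry for the same date targets the same prefix and can overwrite it instead of adding a second copy.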

IAM and access troubleshooting

Symptom	Likely checks
Glue job cannot read S3 input	Job role permissions, bucket policy, KMS key permissions, object path
Athena query cannot access table data	Data Catalog permissions, S3 location permissions, Lake Formation controls if used
Crawler creates no tables	Crawler role, S3 path, supported format, file layout, permissions
Cross-account consumer cannot query data	Trust policy, resource policy, bucket policy, Lake Formation grants, KMS access
Encrypted object cannot be read despite S3 permission	KMS key policy or grant missing
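
The last row above is easier to remember with a policy in front of you. The following sketch builds an identity-based policy document as a Python dictionary. The bucket, prefix, account ID, and key ARN are placeholders, and a real setup may also need the KMS key policy to permit the role, which this sketch does not show.

```python
import json

def read_policy(bucket, prefix, kms_key_arn):
    """Least-privilege style read access to one prefix of an encrypted bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListInputPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadInputObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                # Without this, reads of SSE-KMS objects can fail even though
                # the S3 statements above are present.
                "Sid": "DecryptWithKey",
                "Effect": "Allow",
                "Action": "kms:Decrypt",
                "Resource": kms_key_arn,
            },
        ],
    }

# Placeholder identifiers, not real resources.
policy = read_policy(
    "raw-zone",
    "orders/",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
policy_json = json.dumps(policy, indent=2)
```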

Performance troubleshooting

Symptom	Likely cause	Review action
Athena queries scan too much data	Poor partitioning, row-based files, unnecessary columns	Partition, convert to columnar format, select only needed columns
Spark job slow during joins	Data skew, shuffle pressure, large joins	Partitioning, broadcast strategy concepts, filter early
Redshift load or query performance poor	Data layout, sort/distribution concepts, file sizing, workload pressure	Review warehouse loading and query design concepts
Streaming consumer lag increases	Insufficient processing throughput or downstream bottleneck	Review scaling, batching, retries, and destination capacity
Pipeline cost spikes	Excess scans, repeated reprocessing, inefficient file format, no lifecycle policy	Review query scan reduction and storage lifecycle

AWS Glue readiness checklist

Glue concepts

Explain what the AWS Glue Data Catalog stores.
Explain what a Glue crawler does and when not to rely solely on crawlers.
Understand Glue jobs as managed ETL jobs.
Understand Glue job bookmarks as a way to support incremental processing in supported scenarios.
Identify the role of Glue connections for connecting to data stores.
Recognize how Glue workflows or external orchestrators can sequence jobs.
Know that Spark-based jobs may be affected by partitioning, skew, shuffles, and memory pressure.
Identify where to look for Glue job logs and run history.

Glue job failure checklist

Failure clue	What to inspect
`AccessDenied`	IAM role, S3 policy, KMS key policy, Lake Formation grants
Schema mismatch	Data Catalog table, crawler result, source schema changes, transform assumptions
Out-of-memory or executor failures	Dataset size, joins, repartitioning, skew, file sizes
Missing partitions	Partition registration, crawler coverage, MSCK REPAIR TABLE concepts where relevant
Duplicate outputs	Retry behavior, idempotency, output overwrite mode, bookmarks
Empty outputs	Filters, source path, bookmark state, partition predicate, upstream ingestion

Minimal PySpark-style concepts to recognize

You do not need to memorize large code blocks, but you should understand the intent of common transform patterns:

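# Conceptual pattern: read, filter, transform, write curated output
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read raw orders from the raw zone
df = spark.read.parquet("s3://raw-zone/orders/")
# Drop rows without a business key, then keep one row per order_id
clean = (
    df.filter("order_id IS NOT NULL")
      .dropDuplicates(["order_id"])
)
# With Spark's default (static) overwrite mode, this replaces all existing data under the target path
clean.write.mode("overwrite").partitionBy("order_date").parquet("s3://curated-zone/orders/")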

Be ready to identify:

What data is read and written.
Whether the write mode is safe for the scenario.
How partitioning affects downstream query pruning.
Whether duplicate handling is correct for the business key.
Whether overwrite could remove valid data if used incorrectly.
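
The overwrite question above is easier to reason about with a small simulation. Spark exposes a `spark.sql.sources.partitionOverwriteMode` setting with `static` and `dynamic` values; the sketch below does not run Spark. It models a partitioned table as a plain dictionary, with a hypothetical `overwrite` helper, to show the difference in outcome.

```python
def overwrite(table, new_rows, mode="static"):
    """Simulate overwriting a table stored as {partition_value: [rows]}."""
    incoming = {}
    for row in new_rows:
        incoming.setdefault(row["order_date"], []).append(row)
    if mode == "static":
        # Everything previously in the table is replaced.
        return incoming
    if mode == "dynamic":
        # Only partitions present in the new data are replaced.
        return {**table, **incoming}
    raise ValueError(f"unknown mode: {mode}")

existing = {
    "2024-01-01": [{"order_id": 1, "order_date": "2024-01-01"}],
    "2024-01-02": [{"order_id": 2, "order_date": "2024-01-02"}],
}
# A rerun that only contains data for one day.
rerun = [{"order_id": 3, "order_date": "2024-01-02"}]

static_result = overwrite(existing, rerun, "static")
dynamic_result = overwrite(existing, rerun, "dynamic")
```

In the static case the 2024-01-01 partition disappears even though the rerun never touched it. That is the "overwrite could remove valid data" risk in miniature.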

SQL and analytics readiness checks

Query patterns to recognize

Filtering by partition columns to reduce scans.
Aggregating by business dimensions and time windows.
Joining fact and dimension data.
Deduplicating with business keys and timestamps.
Using window functions conceptually for latest-record selection.
Understanding null handling and type casting.
Distinguishing raw ingestion schema from curated analytics schema.

Example pattern:

WITH ranked_orders AS (
  SELECT
    order_id,
    customer_id,
    order_status,
    updated_at,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM curated_orders
)
SELECT order_id, customer_id, order_status, updated_at
FROM ranked_orders
WHERE rn = 1;

Can you explain:

Why ROW_NUMBER() is used?
What happens if updated_at is missing or duplicated?
Whether the table should be partitioned by a date column for common queries?
Whether this logic belongs in a curated table, a view, or a downstream report?
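
One way to check your answers is to run the pattern on a few rows. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Athena or Redshift, assuming the bundled SQLite is version 3.25 or later so that window functions are available. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE curated_orders ("
    "order_id TEXT, customer_id TEXT, order_status TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO curated_orders VALUES (?, ?, ?, ?)",
    [
        ("o1", "c1", "NEW", "2024-01-01T09:00:00"),
        ("o1", "c1", "SHIPPED", "2024-01-02T09:00:00"),
        ("o2", "c2", "NEW", "2024-01-01T10:00:00"),
    ],
)

# Same latest-record pattern as above, with an ORDER BY for stable output.
latest = conn.execute(
    """
    WITH ranked_orders AS (
      SELECT
        order_id,
        customer_id,
        order_status,
        updated_at,
        ROW_NUMBER() OVER (
          PARTITION BY order_id
          ORDER BY updated_at DESC
        ) AS rn
      FROM curated_orders
    )
    SELECT order_id, order_status
    FROM ranked_orders
    WHERE rn = 1
    ORDER BY order_id
    """
).fetchall()
conn.close()
```

If two rows for the same `order_id` shared the same `updated_at`, the row that receives `rn = 1` would not be well defined, so a tie-breaker column in the `ORDER BY` is worth considering.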

Data quality and validation checklist

Quality dimension	What to test	Example failure
Completeness	Required fields exist and are populated	Missing customer ID
Validity	Values match expected type, range, or pattern	Negative quantity where not allowed
Uniqueness	Business keys are not duplicated unexpectedly	Duplicate order ID
Consistency	Related datasets agree	Order references unknown customer
Timeliness	Data arrives within expected freshness window	Daily file missing
Accuracy	Values match source of truth	Aggregates do not reconcile
Schema conformity	Fields and types match contract	String date replaces timestamp
Volume anomaly	Record counts are within expected bounds	Sudden 90% drop in rows

Readiness prompts:

Can you decide whether bad records should fail the pipeline or be quarantined?
Can you design a retry that does not reload already accepted records?
Can you explain how to alert on missing files or stale partitions?
Can you identify whether validation belongs at ingestion, transformation, or consumption?
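
The fail-or-quarantine decision can be sketched as a function that splits a batch into accepted and quarantined records. The rules, field names, and helpers below are hypothetical and cover only the completeness and validity rows of the table.

```python
def validate(record):
    """Return a list of rule violations for one record."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    quantity = record.get("quantity")
    if not isinstance(quantity, int) or quantity < 0:
        errors.append("invalid quantity")
    return errors

def split_records(records):
    """Route clean records onward and bad records to a quarantine list."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the original record with the reasons for later review.
            quarantined.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, quarantined

records = [
    {"order_id": 1, "customer_id": "c1", "quantity": 2},
    {"order_id": 2, "customer_id": None, "quantity": 1},
    {"order_id": 3, "customer_id": "c3", "quantity": -5},
]
accepted, quarantined = split_records(records)
```

Quarantining keeps the pipeline moving while preserving bad rows for inspection. Failing the whole job instead may be the better choice when downstream consumers cannot tolerate partial data.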

Orchestration and workflow checklist

Need	Pattern to review	Readiness prompt
Scheduled daily ETL	EventBridge schedule plus Glue job or workflow	Can you handle missed or failed runs?
Multi-step dependency chain	Step Functions, Glue workflows, or MWAA concepts	Can you model success, failure, retry, and branching?
Event-driven object processing	S3 event pattern or EventBridge	Can you avoid duplicate processing?
Human-readable DAGs	MWAA / Apache Airflow concepts	Can you explain task dependencies and retries?
Conditional routing	Step Functions branching	Can you route validation failures separately?
Long-running distributed transform	Glue or EMR job orchestration	Can the orchestrator monitor completion and failure?

Workflow decision path

    flowchart TD
        A[New data or schedule] --> B{Single simple task?}
        B -->|Yes| C[Trigger job or function directly]
        B -->|No| D{Multiple dependencies or branches?}
        D -->|Yes| E[Use workflow orchestration]
        D -->|No| F[Use scheduled managed job]
        E --> G{Failure handling needed?}
        G -->|Yes| H[Add retries, alerts, quarantine, rollback or replay]
        G -->|No| I[Still log status and outputs]
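
The "add retries" branch of the decision path might look like the following Step Functions definition, written in Amazon States Language and built here as a Python dictionary. The job name, state names, and retry values are assumptions for illustration, not recommended settings.

```python
import json

state_machine = {
    "Comment": "Sketch: run one Glue job with retries and a failure branch",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # Service integration that waits for the Glue job run to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "curate-orders"},  # hypothetical job name
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            # After retries are exhausted, route to an explicit failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "MarkFailed"}],
            "End": True,
        },
        "MarkFailed": {
            "Type": "Fail",
            "Error": "GlueJobFailed",
            "Cause": "Glue job did not succeed after retries",
        },
    },
}

definition = json.dumps(state_machine, indent=2)
```

Retrying on `States.ALL` is only safe when the job itself is idempotent, which ties this pattern back to the output-path and write-mode choices made in the transform.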

Security and governance checklist

IAM and permissions

Control	What to know	Ready signal
IAM role	Service assumes a role to access resources	You can identify the execution role for Glue, Lambda, or Step Functions
Trust policy	Defines who can assume a role	You can troubleshoot role assumption failures
Identity policy	Grants actions to principals	You can scope actions and resources
Resource policy	Grants access at resource level	You can reason about S3 bucket policies and cross-account access
KMS key policy	Controls use of encryption keys	You know S3 access alone may not be enough for encrypted data
Lake Formation permissions	Governs data lake access	You can separate table permissions from raw S3 access concepts

Encryption and network controls

Know where encryption at rest applies: S3, Redshift, databases, streams, logs, and intermediate outputs.
Know why encryption in transit matters for service-to-service and client-to-service traffic.
Recognize when KMS permissions are needed in addition to service permissions.
Understand why private connectivity and VPC endpoints may reduce exposure to public network paths.
Recognize that security groups and subnet routing can affect connectivity for jobs accessing data stores.
Understand audit needs using CloudTrail, service logs, and access logs where applicable.

Governance prompts

Which users can discover the dataset?
Which users can query the dataset?
Which users can access the underlying S3 objects?
Are sensitive columns masked, tokenized, excluded, or restricted?
Is access controlled consistently across Athena, Redshift, Glue, and other consumers?
Are data retention and deletion expectations reflected in lifecycle or pipeline design?
Can you prove who accessed or changed data-related resources?

Monitoring, logging, and operations checklist

Operational question	AWS area to review
Did the job run?	Glue job run history, Step Functions execution history, MWAA task status
Did it process the expected data?	Row counts, file counts, partition checks, data quality metrics
Did it fail because of permissions?	CloudWatch logs, IAM policy simulator concepts, CloudTrail events
Did it fail because of data?	Schema validation, rejected-record logs, bad-record quarantine
Is the pipeline late?	Freshness metrics, schedule monitoring, event arrival tracking
Is the stream falling behind?	Consumer lag concepts, processing throughput, destination errors
Is the query too expensive?	Scan volume, file format, partition pruning, repeated queries
Who changed something?	CloudTrail and configuration history concepts

Alerting checklist

Job failure.
Job duration exceeds expected range.
Data volume anomaly.
Missing partition or file.
Data freshness breach.
Excessive rejected records.
Stream delivery failures.
Query cost or scan-volume anomaly.
Unauthorized access attempts or policy changes.
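
A data freshness breach can be expressed as a simple comparison between the last successful load time and an allowed age. The 24-hour threshold, timestamps, and `is_stale` helper below are hypothetical; in practice the result would feed a metric or alarm rather than a local variable.

```python
from datetime import datetime, timedelta

def is_stale(last_loaded_at, now, max_age):
    """True when the newest data is older than the allowed age."""
    return now - last_loaded_at > max_age

now = datetime(2024, 1, 3, 6, 0)
max_age = timedelta(hours=24)  # hypothetical freshness target

# Last load 7 hours ago: within the window.
on_time = is_stale(datetime(2024, 1, 2, 23, 0), now, max_age)
# Last load 31 hours ago: the window has been missed.
late = is_stale(datetime(2024, 1, 1, 23, 0), now, max_age)
```

A check like this catches the case where every job reports success but no new partition has arrived.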

Cost and performance tradeoff checklist

Decision	Cost/performance issue	Better answer usually considers
Raw JSON queried repeatedly	High scan cost and slower analytics	Convert to Parquet/ORC in curated zone
No partitioning	Full scans	Partition by common filters
Too many tiny files	Planning overhead and inefficient reads	Compact files into larger analytical objects
Reprocessing full history daily	High compute and risk	Incremental processing, bookmarks, CDC, watermarks
Always-on cluster for sporadic workload	Idle cost	Serverless or scheduled processing
Unbounded retention in hot storage	Storage cost	Lifecycle and retention policies
Broad data access	Security and audit risk	Least privilege and governed access
Overly complex custom pipeline	Operational cost	Managed services where requirements fit
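
The tiny-files row above can be illustrated with a compaction plan that groups small files into larger batches. This is a rough sketch in plain Python. The 128 MB target is an arbitrary assumption, not a documented service limit, and the greedy grouping in `plan_compaction` is only one possible strategy.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group files, in order, into batches of at most target_mb."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

small_files = [40, 50, 30, 60, 70, 10]  # hypothetical file sizes in MB
batches = plan_compaction(small_files)
```

Six input files become three output objects in this example, which means fewer objects for a query engine to list and open.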

Common weak areas and traps

Trap	Why it hurts	How to fix your readiness
Memorizing services without scenario fit	DEA-C01 questions often test judgment	Practice “why this service, why not the others”
Treating S3 as just a bucket	Data lake design depends on layout, metadata, and governance	Review zones, prefixes, formats, partitions, cataloging
Ignoring IAM in pipeline failures	Many managed-service failures are permission-related	Trace the execution role and resource access path
Forgetting KMS permissions	Encrypted data may fail even when S3 permission exists	Include key policy and grants in troubleshooting
Overusing crawlers	Crawlers do not replace schema governance	Know when explicit schema control is safer
Partitioning by too many columns	Metadata and small-file problems	Match partitions to common query predicates
Not designing for retries	Duplicate records and partial outputs	Use idempotent writes and deterministic processing
Confusing ETL and orchestration	Transform code and workflow control are separate concerns	Identify job logic versus dependency management
Missing observability	A pipeline can “succeed” but deliver bad or stale data	Monitor freshness, counts, quality, and failures
Choosing streaming for every problem	Streaming adds complexity	Match latency requirement to architecture

Final-week review checklist

7 to 5 days before the exam

Re-read the official AWS exam guide for AWS Certified Data Engineer – Associate (DEA-C01).
Build a one-page service selection chart for ingestion, storage, processing, orchestration, and governance.
Review IAM role assumptions, S3 bucket policies, and KMS access scenarios.
Review Glue Data Catalog, crawlers, jobs, bookmarks, and troubleshooting.
Practice distinguishing Athena, Redshift, Glue, EMR, Kinesis, Data Firehose, MSK, and DMS.
Review data layout: raw/curated zones, file formats, compression, partitioning, and compaction.

4 to 2 days before the exam

Work through mixed scenario questions, not just service flashcards.
For every missed question, write the decision cue you failed to notice.
Practice troubleshooting from symptoms: access denied, slow query, missing partition, duplicate output, schema mismatch.
Review data quality patterns and where validation belongs.
Review orchestration failure handling: retries, branching, alerts, backfills, and dead-letter paths.
Review monitoring tools and what evidence each one provides.

Final 24 hours

Skim this Exam Blueprint and mark any remaining weak areas.
Review your own missed-question notes.
Do not memorize unverified limits, dates, prices, or quotas.
Focus on service selection logic and tradeoffs.
Sleep instead of attempting to learn a new service from scratch.

Quick self-assessment table

Rate each area before you finish review.

Area	Not ready	Almost ready	Ready
Ingestion service selection	☐	☐	☐
S3 data lake layout	☐	☐	☐
File formats and partitioning	☐	☐	☐
Glue jobs and Data Catalog	☐	☐	☐
Athena and Redshift use cases	☐	☐	☐
Streaming and CDC concepts	☐	☐	☐
IAM, KMS, and Lake Formation concepts	☐	☐	☐
Orchestration and retries	☐	☐	☐
Monitoring and troubleshooting	☐	☐	☐
Cost and performance tradeoffs	☐	☐	☐

If any row is still “Not ready,” spend your next study session on scenario questions for that area. If most rows are “Almost ready,” shift from reading to timed mixed practice and post-question review.
