DEA-C01 — AWS Certified Data Engineer – Associate Exam Blueprint

Independent exam blueprint for AWS Certified Data Engineer – Associate (DEA-C01) readiness, covering ingestion, transformation, storage, operations, security, and governance.

How to Use This Exam Blueprint

Use this checklist as a practical readiness map for the AWS Certified Data Engineer – Associate (DEA-C01) exam. It is not a replacement for the official exam guide, and it does not claim exact exam weighting. Instead, it turns likely exam topic areas into concrete review tasks.

For each area, ask:

  • Can I choose the right AWS service for the scenario?
  • Can I explain why the wrong options are wrong?
  • Can I identify security, cost, reliability, and operational tradeoffs?
  • Can I troubleshoot a broken pipeline from symptoms, logs, permissions, schema changes, or data quality signals?
  • Can I connect ingestion, storage, transformation, cataloging, orchestration, monitoring, and governance into an end-to-end data architecture?

DEA-C01 readiness areas at a glance

Readiness area | What to review | You are ready when you can…
Data ingestion | Batch, streaming, CDC, event-driven ingestion, managed connectors | Match sources and latency needs to AWS ingestion services
Data storage | Amazon S3, data lakes, warehouses, operational stores, file formats, partitioning | Choose storage patterns for analytics, durability, cost, and performance
Data transformation | AWS Glue, Apache Spark concepts, SQL transforms, ELT/ETL, schema evolution | Design repeatable transformations and handle dirty or changing data
Data cataloging | AWS Glue Data Catalog, crawlers, metadata, partitions, schemas | Make datasets discoverable and queryable by downstream tools
Orchestration | AWS Step Functions, Amazon EventBridge, AWS Glue workflows, Amazon MWAA concepts | Coordinate jobs, retries, dependencies, and failure handling
Analytics services | Amazon Athena, Amazon Redshift, Amazon EMR, Amazon OpenSearch Service where relevant | Select the right query or processing engine for workload needs
Security | IAM, resource policies, AWS KMS, encryption, network boundaries, Lake Formation | Apply least privilege and protect data at rest and in transit
Governance | Data classification, access control, lineage concepts, quality checks, retention | Explain how governed data access works across users and services
Monitoring and operations | Amazon CloudWatch, logs, metrics, alarms, AWS CloudTrail, job run history | Diagnose failed, slow, expensive, or incomplete data pipelines
Performance and cost | Partitioning, compression, file size, query pruning, scaling, lifecycle policies | Improve throughput and cost without weakening reliability or security
Reliability | Idempotency, retries, checkpoints, dead-letter handling, backfills | Recover from failures without duplicating or losing data

Core service selection checklist

Ingestion and movement

Scenario cue | Service or pattern to consider | Readiness check
Continuous application events | Amazon Kinesis Data Streams | Can you reason about producers, consumers, ordering, retention, and scaling?
Near-real-time delivery to S3 or analytics destinations | Amazon Data Firehose | Can you distinguish managed delivery from custom stream processing?
Managed Apache Kafka requirement | Amazon MSK | Can you identify when Kafka compatibility matters?
Database migration or change data capture | AWS Database Migration Service | Can you separate full load, CDC, replication, and transformation responsibilities?
SaaS data ingestion | Amazon AppFlow | Can you identify when a managed connector reduces custom integration work?
File-based batch ingestion | Amazon S3 landing zone | Can you design prefixes, validation, metadata, and downstream triggers?
Event-based trigger after object arrival | Amazon EventBridge or S3 event notification pattern | Can you choose an event-driven pipeline without unnecessary polling?
Cross-account data movement | IAM roles, bucket policies, AWS Lake Formation where applicable | Can you secure producer and consumer access without broad permissions?

Storage and data layout

Topic | What to know | Ready signal
Amazon S3 data lake design | Raw, curated, and consumption zones; prefixes; object lifecycle | You can explain how data moves through zones and who can access each zone
File formats | CSV, JSON, Parquet, ORC, Avro concepts | You can choose columnar formats for analytical query efficiency
Compression | Common compression tradeoffs | You can explain cost, scan reduction, and splittability considerations
Partitioning | Date, region, tenant, source-system, or business keys | You can avoid over-partitioning and under-partitioning
Small files | Compaction and batching | You can identify why many tiny files hurt query and job performance
Schema evolution | Backward-compatible changes, nullable fields, crawler updates | You can predict downstream impact of added, removed, or renamed fields
Amazon Redshift | Warehousing, SQL analytics, loading from S3, external data concepts | You can choose Redshift for structured analytics and performance requirements
Amazon Athena | Serverless SQL over S3 data | You can optimize Athena with partitions, columnar files, and catalog metadata
Operational stores | Amazon DynamoDB, Amazon RDS, purpose-built stores | You can avoid using analytics services for transactional access patterns

Transformation and processing

Need | Review | Ready signal
Serverless ETL | AWS Glue jobs | You can describe Glue job inputs, transforms, outputs, bookmarks, and error handling
Distributed processing | Apache Spark concepts on Glue or EMR | You can reason about partitions, joins, shuffles, skew, and executor resource pressure
SQL transformation | Athena, Redshift SQL, Spark SQL | You can write or read transformations involving joins, filtering, aggregation, and deduplication
Large-scale custom frameworks | Amazon EMR | You can identify when managed clusters or open-source ecosystem compatibility are needed
Lightweight event transforms | AWS Lambda | You can recognize payload size, runtime, retry, and idempotency constraints conceptually
Data preparation | Validation, cleansing, normalization, enrichment | You can detect dirty data and decide where to handle it in the pipeline
Incremental processing | Bookmarks, watermarks, CDC columns, checkpoints | You can explain how to process only new or changed records safely
Backfills | Replay from source, reprocess from raw zone, controlled overwrite | You can backfill without corrupting curated data or double-counting records
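
The incremental-processing row can be illustrated with a minimal watermark sketch. It is plain Python rather than a Glue bookmark, and the names `select_new_records`, `order_id`, and `updated_at` are assumptions chosen for the example:

```python
from datetime import datetime, timezone


def select_new_records(records, watermark):
    """Return records changed after the watermark, plus the advanced watermark."""
    new_records = [r for r in records if r["updated_at"] > watermark]
    # Advance the watermark only to the newest timestamp actually processed.
    next_watermark = max((r["updated_at"] for r in new_records), default=watermark)
    return new_records, next_watermark


records = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"order_id": 3, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]
watermark = datetime(2024, 1, 1, tzinfo=timezone.utc)

batch, watermark = select_new_records(records, watermark)
print([r["order_id"] for r in batch])  # [2, 3]

# A rerun with the advanced watermark selects nothing, so a retry does not double-count.
rerun, watermark = select_new_records(records, watermark)
print(rerun)  # []
```

The strict `>` comparison is a design choice: a record that arrives late with a timestamp equal to the stored watermark would be skipped, so real pipelines often pair a watermark with deduplication on the business key.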

“Can you do this?” skills checklist

Architecture and service choice

  • Given batch, streaming, and CDC requirements, choose an ingestion pattern and justify it.
  • Distinguish Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon MSK, AWS DMS, and file-based S3 ingestion.
  • Choose between Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, and Amazon OpenSearch Service for analytics or processing scenarios.
  • Identify when a serverless approach is simpler than provisioning and operating clusters.
  • Identify when custom code is justified versus managed connectors or managed ETL.
  • Design a pipeline with landing, raw, curated, and consumption layers.
  • Recognize which services require a Data Catalog, table metadata, or partition metadata.
  • Choose orchestration for multi-step jobs, retries, branching, scheduled runs, and event-driven starts.

Data modeling, formats, and query performance

  • Choose Parquet or ORC for analytical workloads when column pruning and compression matter.
  • Explain why JSON or CSV may be acceptable for interchange but inefficient for repeated analytics.
  • Design partition keys that match common query predicates.
  • Detect an over-partitioned table from symptoms such as excessive metadata, many tiny files, or slow planning.
  • Detect an under-partitioned table from symptoms such as high scan volume.
  • Explain how compaction improves query performance.
  • Explain when denormalization, star schemas, or curated aggregates may support analytics.
  • Recognize join skew and high-cardinality grouping as performance risks.
  • Explain schema evolution risks for crawlers, consumers, and downstream reports.
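
To make the compaction item concrete, here is a rough sketch of grouping small files into larger outputs. The function name and the 128 MB target are illustrative assumptions, not an AWS limit or recommendation:

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of at most roughly target_mb each."""
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current batch when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [10] * 40  # forty 10 MB files in one partition (made-up sizes)
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "compacted outputs")
# 40 files -> 4 compacted outputs
```

Fewer, larger objects mean fewer open and list operations per query, which is the usual reason compaction helps engines that read from S3.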

Security and governance

  • Apply least privilege with IAM roles used by Glue jobs, crawlers, Lambda functions, and orchestration services.
  • Distinguish identity-based policies, resource-based policies, bucket policies, and service roles.
  • Explain how AWS KMS keys affect encrypted data access.
  • Identify missing permissions from access denied symptoms.
  • Secure S3 data with encryption, restricted public access, bucket policies, and scoped access.
  • Explain how AWS Lake Formation can centralize data lake permissions.
  • Recognize when column-level, row-level, or table-level access controls are relevant.
  • Identify CloudTrail as a source for API activity auditing.
  • Explain why network controls, VPC endpoints, and private connectivity may matter for data pipelines.

Operations, troubleshooting, and reliability

  • Read job failure symptoms and identify whether the likely cause is IAM, schema, data quality, network, capacity, dependency, or code.
  • Use CloudWatch logs and metrics as the first place to investigate managed job failures.
  • Design retries without creating duplicate records.
  • Explain idempotent writes and deterministic output paths.
  • Use checkpoints, bookmarks, offsets, or watermarks to support incremental processing.
  • Handle late-arriving data in streaming or event-driven pipelines.
  • Route bad records to a quarantine location or dead-letter path.
  • Plan a backfill from a raw immutable source.
  • Separate monitoring for pipeline health, data quality, latency, freshness, and cost.
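
The retry and idempotency items above can be sketched in a few lines. This is a simulation with an in-memory dictionary standing in for the destination; `write_batch`, `run_with_retries`, and the batch key are hypothetical names:

```python
def write_batch(sink, batch_id, rows):
    """Write rows under a deterministic key so a retry replaces instead of appending."""
    sink[batch_id] = list(rows)


def run_with_retries(task, max_attempts=3):
    """Call task until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError as err:
            last_error = err
    raise last_error


sink = {}
rows = [{"order_id": 1}, {"order_id": 2}]


def flaky_task(attempt):
    # The write lands, then the first attempt fails before it is acknowledged.
    write_batch(sink, "orders/order_date=2024-01-15", rows)
    if attempt == 1:
        raise RuntimeError("simulated timeout after write")
    return attempt


attempts_used = run_with_retries(flaky_task)
total_rows = sum(len(v) for v in sink.values())
print(attempts_used, total_rows)  # 2 2
```

Because the second attempt writes to the same key, the destination still holds two rows. An append-style write would have produced four.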

Data pipeline lifecycle checklist

Pipeline phase | Candidate tasks to practice | Common exam-style decision
Source analysis | Identify source type, volume, velocity, schema stability, ownership | Is this batch, stream, CDC, or SaaS ingestion?
Landing | Store original data in S3 or appropriate landing store | Should raw data be immutable and replayable?
Validation | Check schema, required fields, ranges, duplicates, referential assumptions | Should invalid records fail the job or be quarantined?
Transformation | Cleanse, normalize, enrich, join, aggregate | Should transformation be ETL before load or ELT after load?
Cataloging | Register tables, schemas, partitions, and metadata | Does the query engine know where the data is and how it is structured?
Consumption | Serve Athena, Redshift, dashboards, ML, APIs, or search | Which access pattern is primary: SQL analytics, low-latency search, or operational reads?
Governance | Apply permissions, classification, encryption, retention | Who is allowed to see which columns, rows, or datasets?
Observability | Monitor job success, freshness, latency, cost, errors | How will teams detect stale, incomplete, or expensive pipelines?
Recovery | Retry, replay, backfill, rollback, or reprocess | Can you recover without duplication or data loss?

Scenario and decision-point checks

Batch versus streaming

If the question says… | Think about… | Avoid jumping to…
Data arrives once per day from files | Batch ingestion to S3, Glue jobs, Athena/Redshift loading | Kinesis just because “data pipeline” is mentioned
Events must be processed continuously | Kinesis Data Streams, Data Firehose, Lambda, stream processing | Nightly batch jobs
Delivery to S3 with minimal custom code | Amazon Data Firehose | Building producers and consumers unless required
Kafka clients or Kafka ecosystem compatibility | Amazon MSK | Kinesis without checking compatibility needs
Source database changes must be captured | AWS DMS with CDC concepts | Periodic full exports if changes must be near-real-time

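The cues in the batch versus streaming table can be written down as a first-guess lookup. This is only a study aid under simplifying assumptions: the function name, the pattern labels, and the two flags are invented for the example, and real questions usually add constraints that change the answer:

```python
def suggest_ingestion(pattern, kafka_required=False, custom_processing=False):
    """Map a scenario cue to a first-guess ingestion service."""
    if pattern == "batch_files":
        return "S3 landing zone + Glue job"
    if pattern == "cdc":
        return "AWS DMS"
    if pattern == "streaming":
        if kafka_required:
            return "Amazon MSK"
        if custom_processing:
            return "Amazon Kinesis Data Streams"
        # Delivery to S3 with minimal custom code.
        return "Amazon Data Firehose"
    raise ValueError(f"unknown pattern: {pattern}")


print(suggest_ingestion("batch_files"))
print(suggest_ingestion("streaming"))
print(suggest_ingestion("streaming", kafka_required=True))
```
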
Athena versus Redshift versus Glue

Requirement | More likely fit | Check your reasoning
Serverless ad hoc SQL over S3 | Amazon Athena | Is data already in S3 and cataloged?
Managed data warehouse for structured analytics | Amazon Redshift | Are there repeated analytics workloads and warehouse-style needs?
ETL job to transform large datasets | AWS Glue | Is the task processing/transformation rather than interactive querying?
Open-source big data framework flexibility | Amazon EMR | Is cluster-level control or ecosystem compatibility required?
Search, indexing, log-style exploration | Amazon OpenSearch Service | Is full-text search or near-real-time indexing central?

S3 data lake layout

Decision | Good exam-ready answer includes
Where should raw data go? | Immutable raw zone with source-preserving format and restricted access
Where should cleaned data go? | Curated zone with validated schema, optimized format, and documented partitions
How should output paths be designed? | Deterministic, partition-aware, and safe for retries or overwrites
How should old data be managed? | Lifecycle policies, retention requirements, and cost-aware storage classes
How should sensitive data be handled? | Encryption, access controls, masking or tokenization where appropriate, auditability
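
A deterministic, partition-aware output path might look like the following sketch. The bucket and dataset names mirror the placeholders used elsewhere in this blueprint, and `curated_prefix` is a hypothetical helper:

```python
from datetime import date


def curated_prefix(bucket, dataset, logical_date):
    """Build a Hive-style partition prefix from the logical date, not the wall clock."""
    return f"s3://{bucket}/{dataset}/order_date={logical_date.isoformat()}/"


first_run = curated_prefix("curated-zone", "orders", date(2024, 1, 15))
retry_run = curated_prefix("curated-zone", "orders", date(2024, 1, 15))
print(first_run)               # s3://curated-zone/orders/order_date=2024-01-15/
print(first_run == retry_run)  # True
```

Deriving the path from the data's logical date means a retry or backfill for the same day targets the same prefix, which is what makes a controlled overwrite possible.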

IAM and access troubleshooting

Symptom | Likely checks
Glue job cannot read S3 input | Job role permissions, bucket policy, KMS key permissions, object path
Athena query cannot access table data | Data Catalog permissions, S3 location permissions, Lake Formation controls if used
Crawler creates no tables | Crawler role, S3 path, supported format, file layout, permissions
Cross-account consumer cannot query data | Trust policy, resource policy, bucket policy, Lake Formation grants, KMS access
Encrypted object cannot be read despite S3 permission | KMS key policy or grant missing
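
The last row is easier to remember with a policy in front of you. Below is a sketch of an identity policy for a job role that reads one prefix of SSE-KMS encrypted data, built as a Python dictionary. The bucket, prefix, account ID, and key ARN are placeholders, and `read_policy` is a hypothetical helper:

```python
import json


def read_policy(bucket, prefix, kms_key_arn):
    """Identity policy sketch: list and read one prefix, decrypt with one key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListInputPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadInputObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Sid": "DecryptInputObjects",
                "Effect": "Allow",
                "Action": "kms:Decrypt",
                "Resource": kms_key_arn,
            },
        ],
    }


policy = read_policy(
    "raw-zone",
    "orders/",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```

Removing the third statement reproduces the symptom in the table: the S3 permissions are intact, yet reads of encrypted objects fail. The key policy on the KMS key must also allow the role, which this identity policy alone does not show.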

Performance troubleshooting

Symptom | Likely cause | Review action
Athena queries scan too much data | Poor partitioning, row-based files, unnecessary columns | Partition, convert to columnar format, select only needed columns
Spark job slow during joins | Data skew, shuffle pressure, large joins | Partitioning, broadcast strategy concepts, filter early
Redshift load or query performance poor | Data layout, sort/distribution concepts, file sizing, workload pressure | Review warehouse loading and query design concepts
Streaming consumer lag increases | Insufficient processing throughput or downstream bottleneck | Review scaling, batching, retries, and destination capacity
Pipeline cost spikes | Excess scans, repeated reprocessing, inefficient file format, no lifecycle policy | Review query scan reduction and storage lifecycle
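
For the scan-volume row, a back-of-the-envelope estimate shows why partition pruning and column pruning compound. The sketch assumes equal-size partitions and equal-width columns, which real tables rarely have, so treat the result as an order of magnitude only:

```python
def estimated_scan_fraction(partitions_read, partitions_total, columns_read, columns_total):
    """Rough share of a table scanned, assuming equal-size partitions and columns."""
    partition_share = partitions_read / partitions_total
    column_share = columns_read / columns_total
    return partition_share * column_share


# One day out of 365 daily partitions, 4 of 40 columns in a columnar format.
pruned = estimated_scan_fraction(1, 365, 4, 40)
# The same query against unpartitioned, row-based files reads everything.
unpruned = estimated_scan_fraction(365, 365, 40, 40)
print(f"{pruned:.6f} vs {unpruned:.1f}")  # 0.000274 vs 1.0
```
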

AWS Glue readiness checklist

Glue concepts

  • Explain what the AWS Glue Data Catalog stores.
  • Explain what a Glue crawler does and when not to rely solely on crawlers.
  • Understand Glue jobs as managed ETL jobs.
  • Understand Glue job bookmarks as a way to support incremental processing in supported scenarios.
  • Identify the role of Glue connections for connecting to data stores.
  • Recognize how Glue workflows or external orchestrators can sequence jobs.
  • Know that Spark-based jobs may be affected by partitioning, skew, shuffles, and memory pressure.
  • Identify where to look for Glue job logs and run history.

Glue job failure checklist

Failure clue | What to inspect
AccessDenied | IAM role, S3 policy, KMS key policy, Lake Formation grants
Schema mismatch | Data Catalog table, crawler result, source schema changes, transform assumptions
Out-of-memory or executor failures | Dataset size, joins, repartitioning, skew, file sizes
Missing partitions | Partition registration, crawler coverage, MSCK/repair-style concepts where relevant
Duplicate outputs | Retry behavior, idempotency, output overwrite mode, bookmarks
Empty outputs | Filters, source path, bookmark state, partition predicate, upstream ingestion

Minimal PySpark-style concepts to recognize

You do not need to memorize large code blocks, but you should understand the intent of common transform patterns:

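# Conceptual pattern: read, filter, transform, write curated output
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://raw-zone/orders/")
# Drop rows without a business key, then keep one row per order_id.
clean = (
    df.filter("order_id IS NOT NULL")
      .dropDuplicates(["order_id"])
)
# With default settings, overwrite replaces all existing data under the target path.
clean.write.mode("overwrite").partitionBy("order_date").parquet("s3://curated-zone/orders/")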

Be ready to identify:

  • What data is read and written.
  • Whether the write mode is safe for the scenario.
  • How partitioning affects downstream query pruning.
  • Whether duplicate handling is correct for the business key.
  • Whether overwrite could remove valid data if used incorrectly.

SQL and analytics readiness checks

Query patterns to recognize

  • Filtering by partition columns to reduce scans.
  • Aggregating by business dimensions and time windows.
  • Joining fact and dimension data.
  • Deduplicating with business keys and timestamps.
  • Using window functions conceptually for latest-record selection.
  • Understanding null handling and type casting.
  • Distinguishing raw ingestion schema from curated analytics schema.

Example pattern:

WITH ranked_orders AS (
  SELECT
    order_id,
    customer_id,
    order_status,
    updated_at,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM curated_orders
)
SELECT order_id, customer_id, order_status, updated_at
FROM ranked_orders
WHERE rn = 1;

Can you explain:

  • Why ROW_NUMBER() is used?
  • What happens if updated_at is missing or duplicated?
  • Whether the table should be partitioned by a date column for common queries?
  • Whether this logic belongs in a curated table, a view, or a downstream report?
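
To reason about the tie and missing-value questions, it can help to restate the latest-record logic outside SQL. The sketch below is a plain-Python analogue of the `ROW_NUMBER()` pattern, not how any query engine executes it. The sample rows are invented:

```python
def latest_per_key(rows, key="order_id", ts="updated_at"):
    """Keep the row with the greatest timestamp for each key."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Strict '>' means the first-seen row wins a timestamp tie.
        if current is None or row[ts] > current[ts]:
            latest[row[key]] = row
    return list(latest.values())


rows = [
    {"order_id": 1, "order_status": "NEW", "updated_at": "2024-01-01T10:00:00"},
    {"order_id": 1, "order_status": "SHIPPED", "updated_at": "2024-01-02T09:00:00"},
    {"order_id": 2, "order_status": "NEW", "updated_at": "2024-01-01T12:00:00"},
]
result = latest_per_key(rows)
print([(r["order_id"], r["order_status"]) for r in result])
# [(1, 'SHIPPED'), (2, 'NEW')]
```

Here the tie rule is explicit in the code. In SQL, rows that tie on `updated_at` receive row numbers in an unspecified order unless the `ORDER BY` adds a tiebreaker column, so the "latest" row can differ between runs. A missing `updated_at` would raise an error in this sketch, while SQL engines place NULLs according to their own ordering rules.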

Data quality and validation checklist

Quality dimension | What to test | Example failure
Completeness | Required fields exist and are populated | Missing customer ID
Validity | Values match expected type, range, or pattern | Negative quantity where not allowed
Uniqueness | Business keys are not duplicated unexpectedly | Duplicate order ID
Consistency | Related datasets agree | Order references unknown customer
Timeliness | Data arrives within expected freshness window | Daily file missing
Accuracy | Values match source of truth | Aggregates do not reconcile
Schema conformity | Fields and types match contract | String date replaces timestamp
Volume anomaly | Record counts are within expected bounds | Sudden 90% drop in rows

Readiness prompts:

  • Can you decide whether bad records should fail the pipeline or be quarantined?
  • Can you design a retry that does not reload already accepted records?
  • Can you explain how to alert on missing files or stale partitions?
  • Can you identify whether validation belongs at ingestion, transformation, or consumption?
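
The fail-or-quarantine decision can be sketched as a small validation step that separates accepted records from rejected ones and keeps the reasons. The two rules (a completeness check and a validity check) and the field names are assumptions for illustration:

```python
def validate(record):
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    quantity = record.get("quantity")
    if not isinstance(quantity, int) or quantity < 0:
        errors.append("invalid quantity")
    return errors


def split_records(records):
    """Separate accepted records from quarantined ones, keeping the reasons."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate(record)
        if errors:
            quarantined.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, quarantined


records = [
    {"order_id": 1, "customer_id": "C1", "quantity": 2},
    {"order_id": 2, "customer_id": None, "quantity": 1},
    {"order_id": 3, "customer_id": "C3", "quantity": -5},
]
accepted, quarantined = split_records(records)
print(len(accepted), len(quarantined))  # 1 2
```

Keeping the rejection reasons next to each quarantined record is what makes later triage and replay practical. Whether the pipeline should instead fail outright depends on how much bad data the downstream consumers can tolerate.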

Orchestration and workflow checklist

Need | Pattern to review | Readiness prompt
Scheduled daily ETL | EventBridge schedule plus Glue job or workflow | Can you handle missed or failed runs?
Multi-step dependency chain | Step Functions, Glue workflows, or MWAA concepts | Can you model success, failure, retry, and branching?
Event-driven object processing | S3 event pattern or EventBridge | Can you avoid duplicate processing?
Human-readable DAGs | MWAA / Apache Airflow concepts | Can you explain task dependencies and retries?
Conditional routing | Step Functions branching | Can you route validation failures separately?
Long-running distributed transform | Glue or EMR job orchestration | Can the orchestrator monitor completion and failure?

Workflow decision path

flowchart TD
    A[New data or schedule] --> B{Single simple task?}
    B -->|Yes| C[Trigger job or function directly]
    B -->|No| D{Multiple dependencies or branches?}
    D -->|Yes| E[Use workflow orchestration]
    D -->|No| F[Use scheduled managed job]
    E --> G{Failure handling needed?}
    G -->|Yes| H[Add retries, alerts, quarantine, rollback or replay]
    G -->|No| I[Still log status and outputs]
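
The retry and failure branch of the decision path could be expressed in a Step Functions definition along these lines. The sketch builds an Amazon States Language document as a Python dictionary. The state names and the Glue job name `curate-orders` are hypothetical, and the retry numbers are arbitrary:

```python
import json

# Hypothetical job and state names; the field names follow Amazon States Language.
state_machine = {
    "StartAt": "RunTransform",
    "States": {
        "RunTransform": {
            "Type": "Task",
            # Service integration that starts a Glue job and waits for it to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "curate-orders"},
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            # After retries are exhausted, route to an explicit failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "MarkFailed"}],
            "Next": "Done",
        },
        "MarkFailed": {"Type": "Fail", "Error": "TransformFailed"},
        "Done": {"Type": "Succeed"},
    },
}
print(json.dumps(state_machine, indent=2))
```

In a fuller workflow the catch target would typically send an alert or move inputs to a quarantine location before failing. It is reduced to a single `Fail` state here to keep the example short.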

Security and governance checklist

IAM and permissions

Control | What to know | Ready signal
IAM role | Service assumes a role to access resources | You can identify the execution role for Glue, Lambda, or Step Functions
Trust policy | Defines who can assume a role | You can troubleshoot role assumption failures
Identity policy | Grants actions to principals | You can scope actions and resources
Resource policy | Grants access at resource level | You can reason about S3 bucket policies and cross-account access
KMS key policy | Controls use of encryption keys | You know S3 access alone may not be enough for encrypted data
Lake Formation permissions | Governs data lake access | You can separate table permissions from raw S3 access concepts
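
As a reference point for the trust policy row, this is roughly what a trust policy that lets the AWS Glue service assume a role looks like, written as a Python dictionary. It is a minimal sketch and the variable name is arbitrary:

```python
import json

# Trust policy sketch: names who may assume the role, not what the role may do.
glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(json.dumps(glue_trust_policy, indent=2))
```

A trust policy answers "who can become this role", while the identity policies attached to the role answer "what can this role do". Role assumption failures usually trace back to the former.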

Encryption and network controls

  • Know where encryption at rest applies: S3, Redshift, databases, streams, logs, and intermediate outputs.
  • Know why encryption in transit matters for service-to-service and client-to-service traffic.
  • Recognize when KMS permissions are needed in addition to service permissions.
  • Understand why private connectivity and VPC endpoints may reduce exposure to public network paths.
  • Recognize that security groups and subnet routing can affect connectivity for jobs accessing data stores.
  • Understand audit needs using CloudTrail, service logs, and access logs where applicable.

Governance prompts

  • Which users can discover the dataset?
  • Which users can query the dataset?
  • Which users can access the underlying S3 objects?
  • Are sensitive columns masked, tokenized, excluded, or restricted?
  • Is access controlled consistently across Athena, Redshift, Glue, and other consumers?
  • Are data retention and deletion expectations reflected in lifecycle or pipeline design?
  • Can you prove who accessed or changed data-related resources?

Monitoring, logging, and operations checklist

Operational question | AWS area to review
Did the job run? | Glue job run history, Step Functions execution history, MWAA task status
Did it process the expected data? | Row counts, file counts, partition checks, data quality metrics
Did it fail because of permissions? | CloudWatch logs, IAM simulation concepts, CloudTrail events
Did it fail because of data? | Schema validation, rejected-record logs, bad-record quarantine
Is the pipeline late? | Freshness metrics, schedule monitoring, event arrival tracking
Is the stream falling behind? | Consumer lag concepts, processing throughput, destination errors
Is the query too expensive? | Scan volume, file format, partition pruning, repeated queries
Who changed something? | CloudTrail and configuration history concepts

Alerting checklist

  • Job failure.
  • Job duration exceeds expected range.
  • Data volume anomaly.
  • Missing partition or file.
  • Data freshness breach.
  • Excessive rejected records.
  • Stream delivery failures.
  • Query cost or scan-volume anomaly.
  • Unauthorized access attempts or policy changes.
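
Two of these alerts, the freshness breach and the data volume anomaly, reduce to simple comparisons. The sketch below uses made-up thresholds (a 24-hour window and a 50% tolerance) and hypothetical function names. In practice the results would feed a metric or alarm rather than a print statement:

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_arrival, now, max_age):
    """True when the newest data is older than the allowed freshness window."""
    return now - last_arrival > max_age


def is_volume_anomaly(row_count, baseline, tolerance=0.5):
    """True when the count deviates from the baseline by more than tolerance."""
    return abs(row_count - baseline) > tolerance * baseline


now = datetime(2024, 1, 16, 6, 0, tzinfo=timezone.utc)
last_arrival = datetime(2024, 1, 14, 23, 0, tzinfo=timezone.utc)

print(is_stale(last_arrival, now, timedelta(hours=24)))     # True
print(is_volume_anomaly(row_count=1_000, baseline=10_000))  # True
print(is_volume_anomaly(row_count=9_500, baseline=10_000))  # False
```

Checks like these catch the case where every job reports success but the data is late or incomplete.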

Cost and performance tradeoff checklist

Decision | Cost/performance issue | Better answer usually considers
Raw JSON queried repeatedly | High scan cost and slower analytics | Convert to Parquet/ORC in curated zone
No partitioning | Full scans | Partition by common filters
Too many tiny files | Planning overhead and inefficient reads | Compact files into larger analytical objects
Reprocessing full history daily | High compute and risk | Incremental processing, bookmarks, CDC, watermarks
Always-on cluster for sporadic workload | Idle cost | Serverless or scheduled processing
Unbounded retention in hot storage | Storage cost | Lifecycle and retention policies
Broad data access | Security and audit risk | Least privilege and governed access
Overly complex custom pipeline | Operational cost | Managed services where requirements fit

Common weak areas and traps

Trap | Why it hurts | How to fix your readiness
Memorizing services without scenario fit | DEA-C01 questions often test judgment | Practice “why this service, why not the others”
Treating S3 as just a bucket | Data lake design depends on layout, metadata, and governance | Review zones, prefixes, formats, partitions, cataloging
Ignoring IAM in pipeline failures | Many managed-service failures are permission-related | Trace the execution role and resource access path
Forgetting KMS permissions | Encrypted data may fail even when S3 permission exists | Include key policy and grants in troubleshooting
Overusing crawlers | Crawlers do not replace schema governance | Know when explicit schema control is safer
Partitioning by too many columns | Metadata and small-file problems | Match partitions to common query predicates
Not designing for retries | Duplicate records and partial outputs | Use idempotent writes and deterministic processing
Confusing ETL and orchestration | Transform code and workflow control are separate concerns | Identify job logic versus dependency management
Missing observability | A pipeline can “succeed” but deliver bad or stale data | Monitor freshness, counts, quality, and failures
Choosing streaming for every problem | Streaming adds complexity | Match latency requirement to architecture

Final-week review checklist

7 to 5 days before the exam

  • Re-read the official AWS exam guide for AWS Certified Data Engineer – Associate (DEA-C01).
  • Build a one-page service selection chart for ingestion, storage, processing, orchestration, and governance.
  • Review IAM role assumptions, S3 bucket policies, and KMS access scenarios.
  • Review Glue Data Catalog, crawlers, jobs, bookmarks, and troubleshooting.
  • Practice distinguishing Athena, Redshift, Glue, EMR, Kinesis, Data Firehose, MSK, and DMS.
  • Review data layout: raw/curated zones, file formats, compression, partitioning, and compaction.

4 to 2 days before the exam

  • Work through mixed scenario questions, not just service flashcards.
  • For every missed question, write the decision cue you failed to notice.
  • Practice troubleshooting from symptoms: access denied, slow query, missing partition, duplicate output, schema mismatch.
  • Review data quality patterns and where validation belongs.
  • Review orchestration failure handling: retries, branching, alerts, backfills, and dead-letter paths.
  • Review monitoring tools and what evidence each one provides.

Final 24 hours

  • Skim this Exam Blueprint and mark any remaining weak areas.
  • Review your own missed-question notes.
  • Do not memorize limits, dates, prices, or quotas that you cannot verify against official documentation.
  • Focus on service selection logic and tradeoffs.
  • Sleep instead of attempting to learn a new service from scratch.

Quick self-assessment table

Rate each area before you finish review.

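Area | Not ready | Almost ready | Ready
Ingestion service selection | [ ] | [ ] | [ ]
S3 data lake layout | [ ] | [ ] | [ ]
File formats and partitioning | [ ] | [ ] | [ ]
Glue jobs and Data Catalog | [ ] | [ ] | [ ]
Athena and Redshift use cases | [ ] | [ ] | [ ]
Streaming and CDC concepts | [ ] | [ ] | [ ]
IAM, KMS, and Lake Formation concepts | [ ] | [ ] | [ ]
Orchestration and retries | [ ] | [ ] | [ ]
Monitoring and troubleshooting | [ ] | [ ] | [ ]
Cost and performance tradeoffs | [ ] | [ ] | [ ]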

If any row is still “Not ready,” spend your next study session on scenario questions for that area. If most rows are “Almost ready,” shift from reading to timed mixed practice and post-question review.
