DEA-C01 — AWS Certified Data Engineer – Associate Quick Review

Concise independent Quick Review for AWS Certified Data Engineer – Associate (DEA-C01), focused on high-yield services, decision rules, traps, and practice planning.

Quick Review purpose

This Quick Review is for candidates preparing for the real AWS Certified Data Engineer – Associate (DEA-C01) exam who need a fast, practical review before moving into topic drills, mock exams, and detailed explanations.

Use it as an IT Mastery practice guide, not as an AWS publication. The goal is to sharpen service selection, recognize common traps, and connect concepts to original practice questions in a question bank.

The core DEA-C01 mental model

Most questions can be reduced to a data pipeline decision:

  1. Source — application, database, SaaS, stream, log, file, on-premises system.
  2. Ingest — batch, CDC, stream, event, transfer, managed delivery.
  3. Store — S3 data lake, Redshift warehouse, DynamoDB, RDS/Aurora, OpenSearch, or another target.
  4. Catalog and govern — Glue Data Catalog, Lake Formation, IAM, KMS, tags, metadata.
  5. Transform — Glue, EMR, Lambda, Athena SQL, Redshift SQL, Step Functions orchestration.
  6. Serve — Athena, Redshift, QuickSight, APIs, ML, search, downstream applications.
  7. Operate — monitor, retry, checkpoint, validate, secure, audit, optimize cost.

    flowchart LR
        A[Data source] --> B{Ingestion pattern}
        B -->|Batch files| C[S3 / DataSync / Transfer Family]
        B -->|Database migration or CDC| D[AWS DMS]
        B -->|Streaming records| E[Kinesis Data Streams / MSK]
        B -->|Managed delivery| F[Kinesis Data Firehose]
        B -->|SaaS integration| G[AppFlow]
        C --> H[S3 data lake]
        D --> H
        E --> H
        F --> H
        G --> H
        H --> I[Glue Data Catalog]
        I --> J{Processing}
        J -->|Serverless Spark ETL| K[AWS Glue]
        J -->|Custom Spark/Hadoop| L[EMR]
        J -->|SQL over S3| M[Athena]
        J -->|Warehouse analytics| N[Redshift]
        K --> O[Govern, monitor, optimize]
        L --> O
        M --> O
        N --> O

High-yield service map

| Need in the question | Strong AWS service candidate | Watch for traps |
| --- | --- | --- |
| Serverless Spark ETL, schema-aware jobs, job bookmarks | AWS Glue | Glue Data Catalog stores metadata; it does not transform data by itself. |
| SQL queries directly over S3 data | Athena | Performance depends heavily on partitions, columnar formats, and scan reduction. |
| Managed data warehouse for analytics | Amazon Redshift | Not ideal as a raw object data lake; use COPY, Spectrum, external schemas, and proper table design. |
| Full load and CDC from databases | AWS Database Migration Service (AWS DMS) | DMS is not a general-purpose transformation engine. |
| High-throughput streaming with custom consumers | Kinesis Data Streams | Ordering is per shard, not global. Partition-key design matters. |
| Managed stream delivery to S3, Redshift, OpenSearch, or other destinations | Kinesis Data Firehose | Firehose is delivery-focused; use Kinesis Data Streams when consumers need custom, low-latency stream processing. |
| Kafka-compatible streaming workloads | Amazon MSK | Do not choose MSK merely because the word “streaming” appears. Look for Kafka compatibility or existing Kafka clients. |
| Decoupled application messages | SQS | SQS is a queue, not a replayable analytics stream in the same sense as Kinesis or Kafka. |
| Event routing from AWS services or scheduled events | EventBridge | EventBridge is not a full ETL orchestrator. |
| Stateful workflow orchestration | Step Functions | Best when explicit states, retries, branches, and service integrations matter. |
| Airflow-compatible orchestration | Amazon MWAA | Choose when Airflow DAG compatibility or migration is required. |
| Metadata catalog for data lake tables | AWS Glue Data Catalog | Permissions may still involve IAM, S3 policies, KMS, and Lake Formation. |
| Centralized data lake permissions and fine-grained access | AWS Lake Formation | Lake Formation does not replace all IAM, network, and KMS considerations. |
| Sensitive data discovery in S3 | Amazon Macie | Macie discovers and classifies; it is not an ETL service. |
| Secrets for database connections | AWS Secrets Manager | Prefer over hardcoded credentials or plain-text job parameters. |
| Monitoring jobs, logs, alarms, metrics | CloudWatch | CloudWatch observes; it does not automatically fix bad partition design or failed records. |
| Audit API activity | CloudTrail | CloudTrail is audit history, not pipeline health monitoring by itself. |

Ingestion decision rules

Batch, file, database, SaaS, and stream choices

| Scenario clue | Prefer | Why |
| --- | --- | --- |
| “Move large files from on-premises storage to S3” | AWS DataSync | Managed transfer, scheduling, verification, bandwidth controls. |
| “SFTP/FTPS/FTP endpoint for partners” | AWS Transfer Family | Managed file transfer into S3 or EFS. |
| “Migrate relational database with minimal downtime” | AWS DMS full load plus CDC | Handles initial load and ongoing change replication. |
| “Capture changes from an operational database into S3” | AWS DMS CDC, sometimes Kinesis targets | CDC is the key phrase. |
| “SaaS application data into S3 or Redshift” | Amazon AppFlow | Managed SaaS connectors and scheduled/event flows. |
| “Producers emit real-time clickstream events; multiple consumers process them” | Kinesis Data Streams or MSK | Custom consumers and replay are likely needed. |
| “Deliver streaming records to S3 with minimal operational overhead” | Kinesis Data Firehose | Managed buffering, batching, delivery, and optional transformation. |
| “Application components need asynchronous decoupling” | SQS | Queue semantics, retries, DLQs, decoupling. |
| “Route events from AWS services to targets” | EventBridge | Event pattern matching and event bus routing. |

Kinesis Data Streams vs Kinesis Data Firehose

| Feature | Kinesis Data Streams | Kinesis Data Firehose |
| --- | --- | --- |
| Primary use | Custom stream processing | Managed delivery |
| Consumers | You build/manage consumers | Firehose manages delivery |
| Replay | Supported within stream retention | Not the main pattern |
| Ordering | Per shard | Delivery batching; do not assume per-record processing order |
| Transformations | Consumer applications, Lambda, analytics services | Optional lightweight Lambda transformation |
| Best clue | “Multiple consumers,” “custom processing,” “replay,” “low latency” | “Deliver to S3/Redshift/OpenSearch,” “minimal management,” “buffer” |

Streaming traps

  • Ordering is usually partition-specific. In Kinesis Data Streams, records with the same partition key go to the same shard, so ordering is per shard.
  • Hot shards come from bad partition keys. A key that concentrates traffic, such as a coarse timestamp, a region, or a customer type with only a few distinct values, can overload a single shard.
  • At-least-once delivery means duplicates can happen. Design idempotent consumers and deduplication where needed.
  • Firehose buffering affects latency. If a question requires very low latency custom processing, Firehose may not be the best fit.
  • SQS is not Kinesis. SQS is excellent for decoupling and retries, but it is not usually the right answer for replayable analytics streams with multiple independent consumers.
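
The hot-shard trap can be illustrated with a small simulation. This is a minimal sketch, not the service's implementation: it relies on the documented behavior that Kinesis Data Streams hashes the partition key with MD5 into a 128-bit value, and it assumes the shards split that hash range evenly. The key names (`region-0`, `user-42`, and so on) and the shard count are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    # MD5 of the partition key, read as a 128-bit integer.
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    # Clamp so the top of the hash space still lands in the last shard.
    return min(hash_value // range_size, shard_count - 1)


def shard_load(keys, shard_count: int) -> Counter:
    """Count how many records land on each shard."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# Low-cardinality key: every record carries one of two region values.
low = shard_load([f"region-{i % 2}" for i in range(10_000)], shard_count=8)
# High-cardinality key: one value per user.
high = shard_load([f"user-{i}" for i in range(10_000)], shard_count=8)

print("shards used with region key:", len(low))
print("shards used with user key:", len(high))
```

With only two distinct key values, at most two of the eight shards can ever receive traffic, no matter how many shards are added. That is why resharding alone does not fix a hot shard caused by key design.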

Transformation and orchestration review

Choose the processing service

| Workload | Prefer | Decision rule |
| --- | --- | --- |
| Serverless batch ETL using Spark | AWS Glue | Default for managed Spark ETL, Glue Data Catalog integration, crawlers, bookmarks. |
| Highly customized Spark/Hadoop ecosystem | Amazon EMR | Choose when cluster-level control, frameworks, or custom tuning matters. |
| Small event transformation | Lambda | Good for lightweight, short-running transformations; avoid for large ETL. |
| SQL transformation in warehouse | Redshift | Strong for ELT after data is loaded into warehouse tables. |
| SQL transformation over S3 | Athena | Good for serverless querying and CTAS/INSERT-style transformations over data lake tables. |
| Multi-step workflow with branching and retries | Step Functions | Explicit state machine, error handling, service integrations. |
| Airflow DAGs | Amazon MWAA | When Airflow compatibility is central. |
| Glue-centric job sequence | Glue workflows/triggers | Useful when most steps are Glue crawlers and jobs. |
| Schedule or event trigger | EventBridge | Good for invoking jobs/functions on a schedule or event pattern. |

AWS Glue high-yield points

| Concept | What to remember |
| --- | --- |
| Glue Data Catalog | Central metadata store for databases, tables, schemas, partitions, and connections. |
| Crawlers | Infer schema and create/update catalog metadata. They do not clean, join, or transform data. |
| Jobs | Perform ETL, often Spark-based. Jobs can read from S3, JDBC sources, streams, and catalog tables. |
| Job bookmarks | Track previously processed data to help avoid reprocessing in incremental workloads. |
| DynamicFrames | Glue abstraction that can handle semi-structured data and schema inconsistencies. |
| Connections | Store network and connection information for data stores. Credentials should be protected. |
| Glue Studio | Visual interface for building and monitoring ETL jobs. |
| Glue Data Quality | Helps define and evaluate quality rules; failed records still need operational handling. |

Common transformation mistakes

  • Choosing Lambda for heavy joins, large file conversions, or long-running Spark work.
  • Choosing Glue crawlers when the question asks for transformation logic.
  • Choosing EMR when the requirement says serverless and minimal infrastructure management.
  • Ignoring incremental processing. If only new data should be processed, look for bookmarks, CDC, timestamps, watermarks, or checkpoints.
  • Ignoring bad-record handling. Strong pipelines separate valid records, rejected records, and operational alerts.
  • Treating orchestration as transformation. Step Functions and MWAA coordinate work; they are not the processing engine by themselves.
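
The bad-record point above can be sketched in a few lines. This is an illustrative outline only, assuming a hypothetical two-field schema (`order_id`, `amount`); in a real pipeline the same split would usually happen inside a Glue or Spark job, with the two outputs written to a curated path and a quarantine path.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"order_id": str, "amount": float}


def validate(record: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} is not {expected_type.__name__}")
    return errors


def split_records(records):
    """Separate valid records from rejected ones, keeping the reason for each rejection."""
    valid, rejected = [], []
    for record in records:
        errors = validate(record)
        if errors:
            rejected.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, rejected


batch = [
    {"order_id": "A-1", "amount": 19.99},
    {"order_id": "A-2"},                      # missing amount
    {"order_id": "A-3", "amount": "twelve"},  # wrong type
]
valid, rejected = split_records(batch)
print(len(valid), "valid;", len(rejected), "sent to quarantine")
```

Keeping the rejection reason next to the record is what makes the quarantine zone useful for alerting and later reprocessing.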

Storage, data lake design, and query performance

Main storage choices

| Storage target | Best for | Watch for |
| --- | --- | --- |
| Amazon S3 | Data lakes, raw/curated zones, durable object storage | Object layout, partitions, file size, compression, lifecycle, security. |
| Amazon Redshift | Warehousing, BI, complex analytics | Distribution, sort keys, COPY/UNLOAD, workload management, concurrency. |
| DynamoDB | Low-latency key-value and document access | Partition-key design, hot keys, GSIs/LSIs, capacity mode, streams. |
| RDS/Aurora | Transactional relational workloads | Not usually the best answer for large analytical scans. |
| OpenSearch | Search, log analytics, text search, near-real-time indexing | Not a replacement for a warehouse or data lake. |
| EFS | Shared file system for compute | Not normally the primary analytical data lake store. |

S3 data lake layout

A strong S3 lake design usually has zones:

| Zone | Purpose | Example |
| --- | --- | --- |
| Raw / landing | Preserve source data with minimal changes | Original JSON, CSV, logs, CDC files. |
| Cleaned / standardized | Validated, normalized, deduplicated | Parquet with consistent schema. |
| Curated / serving | Business-ready data sets | Partitioned tables for Athena, Redshift Spectrum, or ML. |
| Quarantine / rejected | Bad or suspicious records | Schema failures, malformed files, validation errors. |

File format and partition decisions

| Design choice | Strong exam answer |
| --- | --- |
| Querying only selected columns | Use columnar formats such as Parquet or ORC. |
| Reducing scanned data | Partition by common filters and use compression. |
| Avoiding excessive S3 requests | Compact small files into larger analytical files. |
| Handling evolving schema | Use catalog updates, compatible formats, and planned schema evolution. |
| Frequent Athena queries | Use partition pruning, partition projection where appropriate, and columnar storage. |
| Raw auditability | Keep immutable raw data before transformation. |
| Lifecycle cost control | Move older data to lower-cost storage classes when access patterns allow. |

Partition traps

  • Partition by query pattern, not by habit. Date partitions are common, but the best key depends on how users filter data.
  • Too many tiny partitions can hurt performance. Hour/minute/customer partitions may create partition explosion.
  • Partition columns are often derived from path structure. The catalog must know the partition values.
  • Columnar format plus partitions is stronger than either alone.
  • Athena cost and speed depend on data scanned. Compress, partition, and select only needed columns.
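
To make "partition columns are often derived from path structure" concrete, here is a minimal sketch of Hive-style `key=value` prefixes and of what pruning means at the path level. The bucket name, table root, and the `year`/`month`/`day` layout are assumptions chosen for illustration; a real engine prunes using catalog partition metadata rather than string matching.

```python
from datetime import date


def partition_prefix(table_root: str, day: date) -> str:
    """Build a Hive-style partition prefix (year=/month=/day=) under a table root."""
    return f"{table_root}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prune(prefixes, year: int, month: int):
    """Keep only prefixes whose partition values match the filter."""
    wanted = f"/year={year:04d}/month={month:02d}/"
    return [prefix for prefix in prefixes if wanted in prefix]


root = "s3://example-bucket/curated/orders"  # hypothetical location
all_prefixes = [
    partition_prefix(root, date(2024, 2, 28)),
    partition_prefix(root, date(2024, 3, 1)),
    partition_prefix(root, date(2024, 3, 2)),
]

# A query filtered to March 2024 only needs to touch two of the three prefixes.
march = prune(all_prefixes, year=2024, month=3)
print(march)
```

A query that does not filter on `year` or `month` gets no benefit from this layout, which is the reason to partition by query pattern rather than by habit.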

Redshift review

Redshift decision points

| Topic | Review rule |
| --- | --- |
| Loading | COPY from S3 is the standard high-throughput loading pattern. |
| Exporting | UNLOAD writes query results back to S3. |
| External data | Redshift Spectrum queries external S3 data through external schemas and catalog metadata. |
| Distribution | Choose distribution style to reduce data movement, especially for large joins. |
| Sort keys | Improve range-restricted scans and query pruning when aligned with filters. |
| Workload isolation | Use workload management, scaling, or separate designs depending on the requirement. |
| Security | Combine IAM roles, VPC/network controls, encryption, and audit logging. |

Distribution and sort-key intuition

| Question clue | Likely design |
| --- | --- |
| Large fact table frequently joins to dimension table on customer_id | Consider distribution on the join key if it avoids redistribution. |
| Small dimension table joined often | Replication-style distribution (DISTSTYLE ALL in Redshift) may help when the table is small and rarely updated. |
| Queries filter by date ranges | Date/time sort key may help range scans. |
| Queries filter by many dimensions unpredictably | Avoid overcommitting to one narrow sort strategy without evidence. |
| Skewed join key | Bad DISTKEY candidate even if it appears in joins. |

Redshift traps

  • Redshift is not the default place for all raw data; S3 is usually the landing zone for a lake.
  • Redshift Spectrum still needs access to S3 data and metadata.
  • COPY is usually better than row-by-row inserts for large loads.
  • Poor distribution keys cause data skew and network redistribution.
  • Sort keys help only when query predicates can benefit from them.
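
The skew trap can be checked with simple arithmetic before committing to a distribution key. The sketch below is an approximation under stated assumptions: it uses MD5 as a stand-in hash (Redshift's internal hash function is not reproduced here), a hypothetical slice count of 4, and made-up key values. It reports the busiest slice relative to the average, so 1.0 means perfectly even.

```python
import hashlib
from collections import Counter


def skew_ratio(key_values, slice_count: int) -> float:
    """Rows on the busiest slice divided by the average rows per slice."""
    rows_per_slice = Counter(
        # Stand-in hash; the real engine uses its own function.
        int(hashlib.md5(str(value).encode("utf-8")).hexdigest(), 16) % slice_count
        for value in key_values
    )
    average = len(key_values) / slice_count
    return max(rows_per_slice.values()) / average


# Hypothetical data: 90% of rows share one key value (for example, a "guest" customer).
skewed_keys = ["guest"] * 9_000 + [f"cust-{i}" for i in range(1_000)]
# Hypothetical data: every row has its own key value.
even_keys = [f"cust-{i}" for i in range(10_000)]

print("skewed candidate:", round(skew_ratio(skewed_keys, 4), 2))
print("even candidate:", round(skew_ratio(even_keys, 4), 2))
```

A ratio well above 1 means one slice does most of the work, which is why a frequently joined column can still be a poor DISTKEY.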

Catalog, schema, and metadata

| Requirement | Best concept |
| --- | --- |
| Discover schemas in S3 and create table definitions | Glue crawler |
| Central table definitions for Athena, Glue, and other analytics services | Glue Data Catalog |
| Fine-grained lake permissions | Lake Formation |
| Schema compatibility for streaming producers/consumers | Schema registry pattern, such as AWS Glue Schema Registry |
| Governed access by business classification | LF-tags and Lake Formation permissions |
| Detect personally identifiable or sensitive data in S3 | Macie |
| Track operational lineage and transformations | Use catalog metadata, job logs, workflow metadata, and governance tools as appropriate |

Schema evolution traps

  • A crawler can detect changes, but automatic schema changes can break downstream consumers.
  • Adding nullable columns is usually safer than renaming or changing data types.
  • Streaming producers and consumers need compatibility controls before bad data reaches storage.
  • Catalog schema and physical data must match. A table definition alone does not fix inconsistent files.
  • Partition schema drift can cause query errors when old and new files differ.
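
The rules of thumb above (nullable additions are usually safe; renames, removals, and type changes usually are not) can be expressed as a small check. This is a simplified sketch, not the compatibility logic of any particular schema registry: the schema representation (column name mapped to a type string and a nullable flag) and the column names are hypothetical.

```python
# Hypothetical schema representation: column name -> (type, nullable).
Schema = dict[str, tuple[str, bool]]


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """List changes likely to break consumers that still expect the old schema."""
    problems = []
    for name, (old_type, _) in old.items():
        if name not in new:
            # A rename looks like a removal plus an addition to downstream readers.
            problems.append(f"removed or renamed column: {name}")
        elif new[name][0] != old_type:
            problems.append(f"type change on {name}: {old_type} -> {new[name][0]}")
    for name, (_, nullable) in new.items():
        if name not in old and not nullable:
            problems.append(f"new required column: {name}")
    return problems


current = {"id": ("bigint", False), "amount": ("double", True)}
safe_update = {**current, "coupon": ("string", True)}
risky_update = {"id": ("string", False), "total": ("double", True)}

print(breaking_changes(current, safe_update))
print(breaking_changes(current, risky_update))
```

Running a check like this before a crawler or producer change reaches production is one way to keep automatic schema updates from surprising downstream consumers.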

Security and governance review

IAM, resource policies, KMS, and Lake Formation

| Layer | What it controls | Common trap |
| --- | --- | --- |
| IAM identity policy | What a user, role, or service principal can do | Granting identity permissions but forgetting resource policy or KMS access. |
| S3 bucket policy | Who can access bucket and objects | Cross-account access often needs both role permissions and bucket policy. |
| KMS key policy/grants | Who can use encryption keys | S3 or Glue access can still fail if KMS decrypt is not allowed. |
| Lake Formation | Table, column, row, and tag-based data lake permissions | Lake Formation permissions apply in addition to IAM; a principal typically needs both. |
| Secrets Manager | Secure database credentials and rotation | Do not hardcode passwords in Glue scripts or job parameters. |
| VPC endpoints | Private access to AWS services | Glue jobs in private subnets need a route to S3, KMS, Secrets Manager, and other services they call. |
| CloudTrail | API audit events | Not the same as application logs or ETL error logs. |
| CloudWatch | Metrics, logs, alarms | Not a permission system or audit ledger. |

Cross-account data access checklist

When a question involves cross-account S3, Glue, Redshift Spectrum, Athena, or Lake Formation, check all of these:

  1. Trust policy — can the principal assume the required role?
  2. Identity policy — does the role allow required actions?
  3. Resource policy — does the bucket, key, queue, or topic allow access?
  4. KMS key policy — can the principal encrypt/decrypt with the key?
  5. Lake Formation grants — if Lake Formation governs the table, are data permissions granted?
  6. Network path — can the service reach the endpoint privately if required?
  7. Catalog sharing — is metadata available to the consuming account?
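
The checklist works as an "every layer must allow" test, which a short sketch can make explicit. Nothing here calls AWS: the layer names simply mirror the seven items above, and the boolean results are hypothetical findings you would fill in while reading a question or troubleshooting. Layers that do not apply to a scenario (for example, Lake Formation when it does not govern the table) are assumed to be marked `True` by the caller.

```python
from typing import Optional

# Layer names mirror the checklist order above; they are labels, not API names.
ACCESS_LAYERS = [
    "trust_policy",
    "identity_policy",
    "resource_policy",
    "kms_key_policy",
    "lake_formation_grant",
    "network_path",
    "catalog_sharing",
]


def first_blocker(results: dict) -> Optional[str]:
    """Return the first layer that does not allow access, or None if all layers pass."""
    for layer in ACCESS_LAYERS:
        # A layer that was never checked is treated as a blocker.
        if not results.get(layer, False):
            return layer
    return None


# Hypothetical scenario: everything is in place except the KMS key policy.
findings = {layer: True for layer in ACCESS_LAYERS}
findings["kms_key_policy"] = False
print("blocked by:", first_blocker(findings))
```

Treating an unchecked layer as a failure reflects the exam pattern: the wrong answers usually fix one layer and silently ignore another.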

Encryption review

| Data state | Typical controls |
| --- | --- |
| At rest in S3 | Server-side encryption with AWS-managed or customer-managed KMS keys, bucket policies. |
| At rest in Redshift/RDS/DynamoDB | Service encryption settings and KMS keys where applicable. |
| In transit | TLS/HTTPS/JDBC over TLS, secure endpoints. |
| Secrets | Secrets Manager or Parameter Store with appropriate encryption and access control. |
| Logs | Encrypt and restrict access to CloudWatch Logs, S3 log buckets, and audit trails. |

Operations, reliability, and troubleshooting

Pipeline reliability patterns

| Need | Pattern |
| --- | --- |
| Avoid duplicate processing | Idempotent writes, deduplication keys, checkpoints, job bookmarks. |
| Recover from transient failures | Retries with backoff, DLQs, replayable streams, reprocessing from raw zone. |
| Detect failed jobs | CloudWatch metrics/logs/alarms, EventBridge failure events, workflow status. |
| Handle bad records | Quarantine path, validation rules, data quality reports, alerting. |
| Maintain auditability | Raw immutable landing zone, CloudTrail, job logs, lineage metadata. |
| Minimize blast radius | Separate environments, least privilege roles, isolated prefixes/buckets/accounts. |
| Reduce cost | Partition pruning, file compaction, right-sized compute, lifecycle policies. |
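
Two of the patterns above, retries with backoff and idempotent writes, fit together and can be sketched briefly. This is a generic illustration rather than any AWS SDK behavior: `TransientError`, `with_retries`, `IdempotentSink`, and the `event_id` deduplication key are all hypothetical names introduced here.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, timeouts)."""


def with_retries(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error (for example, to a DLQ).
            sleep(base_delay * 2 ** (attempt - 1))


class IdempotentSink:
    """Stores each record once, keyed by a deduplication key."""

    def __init__(self):
        self.rows = {}

    def write(self, record) -> bool:
        key = record["event_id"]
        if key in self.rows:
            return False  # Duplicate delivery: ignore it safely.
        self.rows[key] = record
        return True


sink = IdempotentSink()
# The same event delivered twice (at-least-once delivery) is stored once.
for event in [{"event_id": "e-1", "value": 10}, {"event_id": "e-1", "value": 10}]:
    with_retries(lambda: sink.write(event))
print(len(sink.rows), "row stored")
```

Retries make duplicates more likely, so the two patterns are best designed together: a retried write is only safe when the sink ignores repeats.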

Service-specific operational points

| Service | Operational focus |
| --- | --- |
| AWS Glue | Job logs, worker sizing, bookmarks, retries, data quality, connection failures. |
| Kinesis Data Streams | Shard capacity, iterator age, consumer lag, hot shards, retention. |
| Kinesis Data Firehose | Delivery failures, backup S3 prefix, transformation errors, buffering settings. |
| AWS DMS | Replication lag, task errors, table mapping, endpoint connectivity, CDC status. |
| Athena | Query failures, partition metadata, data format errors, scanned data volume. |
| Redshift | Query performance, data skew, workload queues, COPY errors, disk/storage pressure. |
| DynamoDB | Throttling, hot partitions, capacity mode, GSI design, stream consumers. |
| Step Functions | Failed states, retry/catch behavior, timeout settings, state input/output size. |
| EventBridge | Rule pattern matching, target permissions, dead-letter or retry configuration. |

Monitoring trap list

  • A passing pipeline can still produce bad data; monitor quality, not just job success.
  • CloudWatch logs may show the error, but you still need retry, alert, and remediation design.
  • Duplicate events are normal in many distributed systems; consumers must handle them.
  • “Near real time” may require watching consumer lag, iterator age, or delivery delay.
  • A job that scans too much data is both slower and more expensive.

Calculation and capacity review

If a scenario gives per-shard or per-partition limits, calculate from the numbers in the question rather than memorizing changing quotas.

For shard-style capacity planning, use the largest requirement across write throughput, record count, and read throughput:

\[ \text{required shards} = \max( \lceil \text{write throughput} / \text{write capacity per shard} \rceil, \lceil \text{records per second} / \text{records per shard} \rceil, \lceil \text{read throughput} / \text{read capacity per shard} \rceil ) \]
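
The formula translates directly into code. In this sketch every number is a hypothetical value of the kind a question stem might supply; as noted above, take the per-shard limits from the question rather than from memory.

```python
import math


def required_shards(
    write_mb_per_s: float,
    records_per_s: float,
    read_mb_per_s: float,
    write_mb_per_shard: float,
    records_per_shard: float,
    read_mb_per_shard: float,
) -> int:
    """Shards needed to satisfy the tightest of the three constraints."""
    return max(
        math.ceil(write_mb_per_s / write_mb_per_shard),
        math.ceil(records_per_s / records_per_shard),
        math.ceil(read_mb_per_s / read_mb_per_shard),
    )


# Hypothetical stem: 6 MB/s written, 4,500 records/s, 14 MB/s read in total,
# with stated per-shard limits of 1 MB/s write, 1,000 records/s, 2 MB/s read.
shards = required_shards(6, 4_500, 14, 1, 1_000, 2)
print(shards)  # 7: read throughput (14 / 2) is the binding constraint
```

In this example the write side alone would suggest 6 shards and the record rate 5, so sizing from a single dimension would under-provision the stream.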

For data-lake query cost/performance questions, focus on scanned data:

\[ \text{data scanned} \approx \text{selected columns} \times \text{matching partitions} \times \text{uncompressed or effective file size} \]

Practical implications:

  • Parquet/ORC reduces scanned columns.
  • Partitions reduce scanned rows/files.
  • Compression reduces physical bytes read.
  • File compaction reduces request overhead.
  • Predicate pushdown helps only when data format and query design support it.
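
A rough estimate shows how these factors multiply. This is back-of-the-envelope arithmetic under simplifying assumptions, not a model of how any engine bills: it assumes columns are roughly equal in size, partitions are roughly equal in size, and compression shrinks all data by the same ratio. The dataset size and fractions are hypothetical.

```python
def estimated_scan_gb(
    total_gb: float,
    column_fraction: float,
    partition_fraction: float,
    compression_ratio: float = 1.0,
) -> float:
    """Approximate GB scanned: total size reduced by columns, partitions, and compression."""
    return total_gb * column_fraction * partition_fraction * compression_ratio


# Hypothetical table: 1,000 GB of uncompressed row-oriented data.
full_scan = estimated_scan_gb(1_000, column_fraction=1.0, partition_fraction=1.0)

# Same data as compressed columnar files: the query reads 10% of the columns,
# 2% of the partitions, and files are about a quarter of their original size.
tuned_scan = estimated_scan_gb(
    1_000, column_fraction=0.1, partition_fraction=0.02, compression_ratio=0.25
)

print(f"full scan: {full_scan:.1f} GB, tuned scan: {tuned_scan:.1f} GB")
```

Because the factors multiply, modest gains on each one compound, which is the arithmetic behind "columnar format plus partitions is stronger than either alone."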

Common DEA-C01 answer-choice traps

| Trap | Better reasoning |
| --- | --- |
| “Use Lambda for all transformations.” | Lambda is for lightweight event processing; Glue or EMR is better for large ETL. |
| “Use Glue crawler to transform data.” | Crawlers infer metadata; Glue jobs transform. |
| “Use SQS for analytics stream replay.” | SQS is a queue; choose Kinesis Data Streams or MSK for stream processing and replay-style consumers. |
| “Use Firehose for custom multi-consumer stream apps.” | Firehose is managed delivery; Kinesis Data Streams or MSK fits custom consumers. |
| “Use Redshift as the raw data lake.” | Land raw data in S3; load curated data into Redshift when warehouse analytics are needed. |
| “Grant S3 access and ignore KMS.” | Encrypted objects require KMS permissions too. |
| “Lake Formation means IAM no longer matters.” | IAM, Lake Formation, S3, KMS, and service roles can all matter. |
| “Partition by every possible column.” | Overpartitioning causes metadata and small-file problems. |
| “Use CSV because it is simple.” | Columnar formats are usually better for analytics over large data. |
| “Assume exactly-once delivery.” | Many services are at-least-once; design deduplication and idempotency. |
| “Use DMS for complex ETL.” | DMS is for migration and replication, not rich transformations. |
| “Use Athena without considering format.” | Athena performance depends on data layout, compression, partitions, and scanned bytes. |
| “Use public internet paths for private data jobs.” | Look for VPC endpoints, private subnets, security groups, and private connectivity. |
| “Trust the catalog blindly.” | Catalog metadata must match actual files and permissions. |
| “Ignore rejected records.” | Real pipelines need quarantine, alerts, and reprocessing strategy. |

Quick service-selection drills

Use these as fast mental prompts before attempting original practice questions.

If the stem says…

| Stem phrase | Think first |
| --- | --- |
| “Serverless ETL” | AWS Glue |
| “Spark with minimal infrastructure” | AWS Glue |
| “Existing Hadoop/Spark ecosystem” | EMR |
| “Run SQL on S3” | Athena |
| “Data warehouse analytics” | Redshift |
| “Full load and CDC” | AWS DMS |
| “Kafka-compatible” | Amazon MSK |
| “Managed delivery stream” | Kinesis Data Firehose |
| “Multiple custom stream consumers” | Kinesis Data Streams |
| “Route events from AWS services” | EventBridge |
| “State machine, retries, branching” | Step Functions |
| “Airflow DAGs” | Amazon MWAA |
| “Discover table schema” | Glue crawler |
| “Central table metadata” | Glue Data Catalog |
| “Fine-grained lake permissions” | Lake Formation |
| “Sensitive data discovery in S3” | Macie |
| “Store and rotate credentials” | Secrets Manager |

Practice plan after this Quick Review

For the AWS Certified Data Engineer – Associate (DEA-C01) exam, do not practice only by memorizing service names. The real skill is choosing among plausible AWS services under constraints.

Use a question bank in this order:

  1. Topic drills: ingestion

    • Kinesis Data Streams vs Firehose vs MSK vs SQS.
    • DMS full load and CDC.
    • DataSync, Transfer Family, and AppFlow scenarios.
  2. Topic drills: storage and catalog

    • S3 partitioning and file formats.
    • Glue Data Catalog and crawlers.
    • Athena, Redshift Spectrum, and Redshift loading.
  3. Topic drills: transformation

    • Glue vs EMR vs Lambda vs Athena SQL vs Redshift SQL.
    • Incremental processing, bookmarks, checkpoints, and bad-record handling.
  4. Topic drills: security and governance

    • IAM plus S3 bucket policies.
    • KMS key-policy failures.
    • Lake Formation permissions and LF-tags.
    • Cross-account access.
  5. Topic drills: operations

    • CloudWatch logs and metrics.
    • Retry, DLQ, replay, deduplication.
    • Cost and performance optimization.
  6. Mixed mock exams

    • Force yourself to explain why each wrong answer is wrong.
    • Track misses by decision type, not just by service.
    • Revisit detailed explanations for every guessed question.

Final quick checklist

Before you start a timed mock exam, confirm you can answer these without notes:

  • Can I distinguish Kinesis Data Streams, Kinesis Data Firehose, MSK, and SQS?
  • Can I choose between Glue, EMR, Lambda, Athena, and Redshift for transformations?
  • Can I explain why S3 layout affects Athena and data-lake performance?
  • Can I identify when DMS CDC is the right ingestion pattern?
  • Can I troubleshoot access failures involving IAM, S3 policies, KMS, and Lake Formation?
  • Can I recognize small-file, overpartitioning, schema drift, and hot partition problems?
  • Can I design for retries, idempotency, DLQs, checkpoints, and quarantine paths?
  • Can I connect monitoring tools to the right failure type?

Next step: use this Quick Review as a checklist, then move into DEA-C01 topic drills and original practice questions with detailed explanations until your mistakes are concentrated in a few identifiable decision areas.

Continue in IT Mastery

Use this Quick Review as a final concept map, then move into IT Mastery for focused topic drills, mixed practice sets, timed mock exams, and detailed explanations. The practice questions are original IT Mastery practice items; they are not official AWS questions, copied live-exam content, or exam dumps.
