DEA-C01 — AWS Certified Data Engineer – Associate Quick Reference

Last revised: June 29, 2026

Compact AWS DEA-C01 reference for data ingestion, transformation, storage, operations, security, and governance decisions.

Exam identity and quick-use approach

This Quick Reference supports independent preparation for the AWS Certified Data Engineer – Associate (DEA-C01) exam. Use it as a compact decision guide for scenario questions: identify the data source, ingestion pattern, transformation need, storage/query target, governance model, and operational concern.

High-yield DEA-C01 thinking pattern:

Ingest: batch files, database CDC, SaaS, stream, events, or messages.
Store: Amazon S3 data lake, Amazon Redshift warehouse, operational database, or search/time-series target.
Catalog and govern: AWS Glue Data Catalog, AWS Lake Formation, IAM, AWS KMS.
Transform: AWS Glue, Amazon EMR, AWS Lambda, Amazon Managed Service for Apache Flink, SQL CTAS, or Redshift SQL.
Operate: monitor, retry, validate, secure, optimize, and troubleshoot.

Core AWS data engineering architecture

    flowchart LR
        A[Sources: databases, SaaS, files, apps, streams] --> B[Ingestion: DMS, DataSync, AppFlow, Kinesis, Firehose, MSK]
        B --> C[S3 raw zone]
        C --> D[Catalog: AWS Glue Data Catalog]
        C --> E[Transform: AWS Glue, EMR, Lambda, Athena CTAS]
        E --> F[S3 curated zone]
        F --> G[Query: Athena, Redshift Spectrum, EMR, QuickSight]
        E --> H[Warehouse: Amazon Redshift]
        D --> I[Governance: Lake Formation, IAM, KMS]
        G --> J[Consumers]
        H --> J
        B --> K[Ops: CloudWatch, CloudTrail, EventBridge]
        E --> K
        H --> K

Service selection matrix

Need in the scenario	Usually choose	Why	Watch for
Durable object storage for a data lake	Amazon S3	Scalable object storage, integrates with Glue, Athena, Redshift Spectrum, EMR	S3 is not a relational database; design prefixes, partitions, and file sizes
Metadata catalog for S3 tables	AWS Glue Data Catalog	Central table, schema, partition metadata for analytics services	Catalog stores metadata, not the data itself
Serverless SQL over S3	Amazon Athena	Ad hoc SQL using Glue Data Catalog	Query cost/performance depends heavily on scanned data, partitions, and formats
Managed Spark ETL	AWS Glue ETL	Serverless distributed transformations, crawlers, jobs, workflows	Tune partitioning, file sizes, pushdown, and job bookmarks
Big data frameworks with more cluster control	Amazon EMR	Spark, Hive, Presto/Trino, Hudi/Iceberg workloads with configurable clusters	More operational responsibility than Glue
Cloud data warehouse	Amazon Redshift	High-performance analytics, SQL warehouse, COPY/UNLOAD, Spectrum	Model distribution, sort, workload, and external table scans
Database migration and CDC	AWS Database Migration Service (AWS DMS)	Full load plus ongoing change replication	DMS is not a general-purpose ETL engine
File transfer from on-premises storage to S3	AWS DataSync	Managed transfer for file/object storage migrations and recurring sync	Needs network connectivity and correct IAM/S3/KMS permissions
Managed SFTP/FTPS/FTP endpoint	AWS Transfer Family	External partners exchange files into S3 or Amazon EFS	Do not confuse with DataSync migration/sync
SaaS application data ingestion	Amazon AppFlow	Managed SaaS-to-AWS flows	Best for supported SaaS connectors, not arbitrary streaming apps
Custom real-time stream consumers	Amazon Kinesis Data Streams	Ordered records per partition key, replay, custom consumers	Partition key design affects ordering and hot shards
Managed stream delivery to S3/Redshift/OpenSearch	Amazon Data Firehose	Minimal-code delivery pipeline with buffering and optional transform	Not for multiple custom replayable consumers
Kafka-compatible streaming	Amazon MSK	Managed Apache Kafka ecosystem compatibility	Choose when Kafka APIs/tools are required
Stateful streaming analytics	Amazon Managed Service for Apache Flink	Windowing, state, joins, event-time processing	More suitable than Lambda for complex stream computation
Event routing across AWS services/apps	Amazon EventBridge	Event bus, rules, schedules, SaaS events	Not a high-throughput analytics stream replacement
Queue-based decoupling	Amazon SQS	Durable message queue for workers	Not intended for replayable analytics streams
Short event-driven transformation	AWS Lambda	Lightweight transform, validation, routing	Avoid for large distributed ETL
Multi-step orchestration with branching/retries	AWS Step Functions	Coordinates services, handles state, retries, error paths	Prefer over ad hoc scripts for resilient workflows
Airflow DAG compatibility	Amazon Managed Workflows for Apache Airflow (MWAA)	Managed Apache Airflow	Choose when Airflow is a requirement
Fine-grained data lake governance	AWS Lake Formation	Table-, column-, row-, and cell-level access control, plus tag-based access with LF-Tags	IAM/S3/KMS permissions still matter
Encryption key control	AWS Key Management Service (AWS KMS)	Customer managed keys, key policies, grants	Cross-account access needs key policy alignment
Secrets for connections	AWS Secrets Manager	Rotatable database/API credentials	Do not hard-code credentials in Glue jobs or notebooks
Logs, metrics, alarms	Amazon CloudWatch	Operational monitoring	Know service-specific metrics and log locations
API audit history	AWS CloudTrail	Tracks management events and optional data events	CloudTrail is audit, not performance monitoring

Scenario keyword shortcuts

Scenario phrase	Strong candidate answer
“Run SQL directly on files in S3”	Athena
“Catalog S3 data for Athena/Redshift Spectrum/Glue”	AWS Glue Data Catalog
“Discover schema and partitions automatically”	AWS Glue crawler
“Fine-grained access to data lake tables and columns”	Lake Formation
“Move on-premises NFS/SMB data to S3 repeatedly”	DataSync
“Partners upload files through SFTP”	AWS Transfer Family
“Replicate relational database changes continuously”	AWS DMS with CDC
“Ingest supported SaaS data without custom connector code”	Amazon AppFlow
“Custom applications need replayable stream records”	Kinesis Data Streams
“Deliver streaming data to S3 with minimal management”	Data Firehose
“Existing Kafka producers and consumers”	Amazon MSK
“Streaming joins, windows, and stateful processing”	Managed Service for Apache Flink
“Complex workflow with retries and branches”	Step Functions
“Existing Airflow DAGs”	MWAA
“Reduce Athena scanned data”	Parquet/ORC, compression, partitioning, column pruning
“Exact set of files for Redshift load”	COPY with manifest
“Query S3 from Redshift”	Redshift Spectrum external tables

Data ingestion reference

Batch, file, and database ingestion

Source or requirement	Choose	Pattern	Common trap
On-premises file shares or object stores	DataSync	Schedule transfer into S3 raw prefixes	Transfer service does not transform business logic
External users send files over SFTP/FTPS/FTP	AWS Transfer Family	Land files in S3, trigger downstream workflow	Transfer Family is for protocol access, not storage analytics
RDBMS full load to S3/Redshift	AWS DMS	Source endpoint, target endpoint, replication task	Validate data types, constraints, and permissions
RDBMS ongoing changes	AWS DMS CDC	Full load plus change data capture to target	Monitor replication lag and source log retention
SaaS data	Amazon AppFlow	Flow from SaaS connector to S3/Redshift/Salesforce targets as supported	Check connector and field mapping support
AWS service events	EventBridge	Rule routes event to target such as Lambda or Step Functions	EventBridge is not a bulk ETL engine
Application-generated files	S3 direct upload or SDK	Write to raw zone with event notification	Use idempotent object naming and downstream deduplication

Streaming ingestion

Requirement	Choose	Design notes
Multiple custom consumers need independent processing	Kinesis Data Streams	Consumers read from stream; partition key controls ordering and load
Delivery to S3, Redshift, OpenSearch, or third-party endpoint with low code	Data Firehose	Configure destination, buffering, optional Lambda transform, backup for failures
Kafka API compatibility	Amazon MSK	Use Kafka producers/consumers, topics, partitions, consumer groups
SQL/window/stateful analytics on streams	Managed Service for Apache Flink	Use event-time processing, windows, joins, state, checkpoints
Message queue for asynchronous workers	SQS	Worker decoupling, not analytics replay
Event bus integration	EventBridge	Event routing, schedules, SaaS events, cross-account event patterns

Ingestion design rules

Separate raw and curated data. Land immutable source data first, then transform.
Make ingestion idempotent. Assume retries can create duplicate deliveries.
Capture metadata: source, ingestion timestamp, schema version, batch ID, and lineage.
Use partition-friendly timestamps. Common S3 partitions include date or hour, but avoid creating excessive tiny partitions.
Validate early. Use data quality checks before promoting data into curated zones.
Monitor lag and failures. Streams, DMS tasks, Firehose delivery, and Glue jobs all expose operational signals.
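
The idempotency and metadata rules above can be sketched in plain Python. This is a minimal illustration, not an AWS API: the key layout and the function names `raw_object_key` and `dedupe_by_key` are hypothetical, and the sketch assumes each batch carries a logical batch date that does not change on retry.

```python
import hashlib
from datetime import date


def raw_object_key(source: str, batch_id: str, batch_date: date, payload: bytes) -> str:
    """Return a deterministic raw-zone key (hypothetical layout).

    The key depends only on the logical batch and the payload bytes, so a
    retried delivery overwrites the same object instead of adding a new one.
    """
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return (
        f"raw/source={source}/ingest_date={batch_date.isoformat()}/"
        f"batch={batch_id}/{digest}.json"
    )


def dedupe_by_key(records: list[dict], key_field: str) -> list[dict]:
    """Keep the first record seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = record[key_field]
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Deriving the date from the batch rather than from the wall clock is the design choice that matters here: a retry that runs after midnight still lands on the same key.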

S3 data lake design

Data lake zones

Zone	Purpose	Typical properties
Raw / bronze	Immutable copy of source data	Source format, append-only, tightly restricted write access
Staging / silver	Cleaned, normalized, deduplicated	Standardized schema, data quality checks, partitioned layout
Curated / gold	Analytics-ready products	Columnar formats, business dimensions, governed access
Sandbox	Exploration and temporary outputs	Lifecycle policies, limited permissions, not production source of truth
Audit / quarantine	Failed or suspicious records	Preserve rejected records with error reason and batch metadata

File format selection

Format	Best for	Avoid when	Exam notes
CSV	Simple exchange, human-readable exports	Large analytics scans, nested structures, strict typing	Easy but inefficient for Athena/Redshift Spectrum
JSON	Semi-structured events	Heavy repeated scans without conversion	Convert to columnar for curated analytics
Avro	Row-oriented data with schema evolution, streaming ecosystems	Pure SQL scan optimization is the main goal	Often used in streaming pipelines
Parquet	Columnar analytics on S3	Frequent single-row updates without table format support	High-yield choice for Athena, Glue, Redshift Spectrum
ORC	Columnar analytics, Hive-style ecosystems	Tooling standardizes on Parquet	Similar exam value to Parquet
Apache Iceberg (table format)	S3 lakehouse tables needing ACID-style operations and schema evolution	Simple immutable append-only files are enough	A table format layered over data files such as Parquet, not a file format itself; useful for governed, evolving analytical tables

Partitioning and layout rules

Rule	Why it matters
Partition by common filter columns	Enables partition pruning in Athena, Glue, EMR, and Spectrum
Avoid overly high-cardinality partitions	Too many small partitions can hurt planning and metadata operations
Avoid many tiny files	Distributed engines spend too much time opening files instead of scanning data
Use columnar compression	Reduces scanned bytes and improves query performance
Keep partition naming consistent	Hive-style paths such as `dt=2026-06-18/` work well with many tools
Compact streaming outputs	Firehose and streaming jobs can create many small objects
Store raw data immutably	Enables replay, audit, and recovery from bad transforms

Example S3 layout:

s3://company-data-lake/raw/source=salesforce/object=account/ingest_date=2026-06-18/
s3://company-data-lake/curated/domain=sales/table=orders/order_date=2026-06-18/
s3://company-data-lake/quarantine/source=orders/ingest_date=2026-06-18/
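
A small helper can keep Hive-style prefixes like the ones above consistent across jobs. This is a sketch only; `hive_prefix` is a hypothetical name, and it assumes partition values need no URL escaping.

```python
from datetime import date


def hive_prefix(bucket: str, zone: str, partitions: dict) -> str:
    """Build an S3 prefix with Hive-style key=value segments.

    Segment order follows the insertion order of the partitions dict.
    """
    segments = "/".join(f"{name}={value}" for name, value in partitions.items())
    return f"s3://{bucket}/{zone}/{segments}/"


raw_prefix = hive_prefix(
    "company-data-lake",
    "raw",
    {"source": "salesforce", "object": "account", "ingest_date": date(2026, 6, 18)},
)
print(raw_prefix)
```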

AWS Glue and Data Catalog reference

Glue components

Component	What it does	Choose when
AWS Glue Data Catalog	Stores databases, tables, schemas, partitions, connections	Analytics services need shared metadata
Crawler	Infers schema and discovers partitions	Data structure is discoverable and changes need catalog updates
Classifier	Helps crawler interpret data	Custom formats or nonstandard records
Glue ETL job	Runs Spark, Python shell, or other supported job types	Transform, clean, join, and write data
Glue Studio	Visual job authoring	Need low-code ETL development
Glue workflow	Coordinates Glue crawlers, jobs, and triggers	Glue-centered pipeline orchestration
Glue trigger	Starts jobs/crawlers on schedule or condition	Simple Glue workflow automation
Glue connection	Stores connection details for JDBC, network, or marketplace connectors	Jobs need source/target connectivity
Glue Data Quality	Evaluates rules against datasets	Validate completeness, uniqueness, ranges, schema expectations
Glue Schema Registry	Manages schemas for streaming/event data	Producers and consumers need schema validation/evolution

Crawler vs explicit schema

Situation	Better choice
Unknown files arrive and schema must be discovered	Glue crawler
Production schema must be controlled and reviewed	Explicit table definition or IaC-managed catalog
Frequent partition additions only	Partition projection, `ALTER TABLE ADD PARTITION`, crawler, or `MSCK REPAIR TABLE`
Crawler creates unexpected tables	Adjust folder structure, classifiers, grouping behavior, and crawler scope
Sensitive columns need access control	Catalog plus Lake Formation, not crawler alone

Glue ETL exam points

Topic	Remember
DynamicFrame vs DataFrame	DynamicFrames help with semi-structured data and schema ambiguity; DataFrames expose standard Spark APIs
Job bookmarks	Track previously processed source data; still design idempotent writes
Pushdown predicates	Reduce source data read, especially with partitions
Repartition/coalesce	Manage output file count and parallelism
Small files	Compact to improve Athena/Spectrum/EMR performance
Skew	A few hot keys can slow joins and aggregations
Secrets	Use Secrets Manager or Glue connections, not embedded passwords
VPC access	Jobs accessing private databases need subnet/security group routing and access to S3/logs/secrets
Failure handling	Use retries, checkpoints/bookmarks, quarantine outputs, and CloudWatch logs
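
To make the repartition/coalesce and small-files rows concrete, the sketch below estimates how many output files a job might aim for. The 128 MiB target is an assumption chosen for illustration, not an AWS-mandated value, and `target_file_count` is a hypothetical helper.

```python
import math

# Assumed target size per output file; tune for the engine and workload.
DEFAULT_TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes: int, target_file_bytes: int = DEFAULT_TARGET_FILE_BYTES) -> int:
    """Return how many files to write so each is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Example: about 10 GiB of output data.
print(target_file_count(10 * 1024**3))
```

In a Spark job the result would typically feed `df.repartition(n)` or `df.coalesce(n)` before the write. When the write also uses `partitionBy`, each partition directory can receive up to `n` files, so the count may need to be lower.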

Illustrative Glue PySpark pattern:

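from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Reuse the active SparkContext if the Glue runtime already created one.
glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read from the Data Catalog; the pushdown predicate prunes partitions
# before data is loaded (ingest_date must be a partition column).
orders = glue_ctx.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="orders",
    push_down_predicate="ingest_date >= '2026-06-01'"
)

# Convert to a Spark DataFrame and drop duplicates within this run.
df = orders.toDF().dropDuplicates(["order_id"])

# Append partitioned Parquet. Append mode is not idempotent across reruns,
# so pair it with job bookmarks or a downstream deduplication step.
df.write.mode("append") \
    .partitionBy("order_date") \
    .parquet("s3://example-data-lake/curated/orders/")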

Example AWS Glue Data Quality rule style:

Rules = [
  IsComplete "order_id",
  IsUnique "order_id",
  ColumnValues "amount" >= 0,
  ColumnExists "order_date"
]
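
To show what the four rules above are checking, here is a plain-Python approximation over a list of dictionaries. It is not the Glue Data Quality engine, and its handling of nulls and missing columns is an assumption made for illustration; `evaluate_rules` is a hypothetical name.

```python
def evaluate_rules(rows: list[dict]) -> dict:
    """Approximate the four example rules on in-memory rows.

    Returns a mapping of rule label to pass/fail.
    """
    order_ids = [row.get("order_id") for row in rows]
    return {
        # Every row has a non-null order_id.
        "IsComplete order_id": all(value is not None for value in order_ids),
        # No order_id value appears more than once.
        "IsUnique order_id": len(order_ids) == len(set(order_ids)),
        # Every amount is present and non-negative.
        "ColumnValues amount >= 0": all(
            row.get("amount") is not None and row["amount"] >= 0 for row in rows
        ),
        # Every row carries an order_date field.
        "ColumnExists order_date": all("order_date" in row for row in rows),
    }


sample = [
    {"order_id": 1, "amount": 10.0, "order_date": "2026-06-18"},
    {"order_id": 1, "amount": -5.0, "order_date": "2026-06-18"},
]
results = evaluate_rules(sample)
```

With this sample, the duplicate `order_id` and the negative amount fail their rules, which is the kind of outcome that would route a batch to quarantine instead of the curated zone.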

Athena reference

Need	Athena feature or pattern
Query S3 data with SQL	External tables using Glue Data Catalog
Improve performance	Parquet/ORC, compression, partition pruning, avoid `SELECT *`
Create curated columnar data	CTAS or INSERT INTO from raw table
Control query usage	Workgroups, result locations, query settings
Add partitions	Crawler, `ALTER TABLE ADD PARTITION`, partition projection, or `MSCK REPAIR TABLE` for Hive-style partitions
Secure data	IAM, S3 bucket policy, KMS key policy, Lake Formation permissions
Share governed tables	Lake Formation and catalog-based permissions where supported

Athena CTAS pattern:

CREATE TABLE curated_orders
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['order_date'],
  external_location = 's3://example-data-lake/curated/orders/'
) AS
SELECT
  order_id,
  customer_id,
  amount,
  order_status,
  order_date
FROM raw_orders
WHERE order_date >= DATE '2026-01-01';

Partition repair pattern for Hive-style S3 paths:

MSCK REPAIR TABLE raw_orders;

Common Athena traps:

Athena queries data in S3; it does not ingest or move data by itself.
Catalog permissions alone are not enough if S3 or KMS denies access.
Crawlers update metadata; they do not optimize file format or clean data.
Partition projection can reduce partition metadata management, but the S3 path pattern must match the table definition.
Columnar formats help most when queries select only needed columns.
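
The partition projection point above can be illustrated by generating the table properties for a date partition. This is a sketch under stated assumptions: the table name, column name, and S3 template are hypothetical, and the property keys should be checked against the current Athena documentation before use.

```python
def date_projection_properties(column: str, start_date: str, location_template: str) -> dict:
    """Return table properties for a date-typed projected partition column."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.format": "yyyy-MM-dd",
        # NOW keeps the upper bound moving with the current date.
        f"projection.{column}.range": f"{start_date},NOW",
        # The template must mirror the real S3 path layout.
        "storage.location.template": location_template,
    }


def to_tblproperties_sql(table: str, properties: dict) -> str:
    """Render the properties as an ALTER TABLE ... SET TBLPROPERTIES statement."""
    pairs = ",\n  ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {pairs}\n);"


props = date_projection_properties(
    "order_date",
    "2026-01-01",
    "s3://example-data-lake/raw/orders/order_date=${order_date}/",
)
statement = to_tblproperties_sql("raw_orders", props)
print(statement)
```

If the template says `order_date=${order_date}` but objects actually land under `dt=...`, queries return no rows rather than an error, which is why the path pattern has to match.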

Amazon Redshift reference

Redshift vs Athena vs S3 lake

Requirement	Better fit
Ad hoc SQL on raw/curated S3 data	Athena
Managed warehouse with repeated BI queries and modeled tables	Redshift
Query S3 from warehouse without loading all data	Redshift Spectrum
Transform and publish warehouse data to S3	Redshift UNLOAD
Load large S3 datasets into warehouse tables	Redshift COPY
Variable or intermittent warehouse demand	Redshift Serverless may fit
Stable warehouse environment with cluster-level control	Redshift provisioned may fit

Redshift design points

Topic	What to know for DEA-C01
COPY	Preferred bulk load from S3, DynamoDB, EMR, or supported sources
UNLOAD	Writes query results from Redshift to S3
Distribution style	Affects data movement during joins; AUTO can help, but know KEY/EVEN/ALL concepts
Sort keys	Improve range-restricted scans and joins when aligned with query patterns
Compression encoding	Reduces storage and I/O
ANALYZE	Updates statistics for the optimizer
VACUUM	Reclaims space from deleted rows and re-sorts data; Redshift also runs automatic vacuum operations in the background
Spectrum	Queries external S3 data through external schemas/tables
Materialized views	Precompute expensive query results when refresh strategy fits
Workload management	Manage query queues, priorities, and concurrency behavior
Federated query	Query operational databases from Redshift for specific use cases

COPY from S3 pattern:

COPY analytics.orders
FROM 's3://example-data-lake/curated/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;
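
The COPY above loads every object under the prefix. When a scenario asks for an exact set of files, a manifest lists them explicitly. The sketch below builds such a manifest in Python; the file URLs and sizes are made up, `build_copy_manifest` is a hypothetical helper, and the `meta.content_length` entry reflects the requirement for columnar formats as I understand it from the Redshift documentation.

```python
import json


def build_copy_manifest(files: list[tuple[str, int]]) -> dict:
    """Build a COPY manifest from (s3_url, size_in_bytes) pairs.

    mandatory=True asks COPY to fail if a listed file is missing.
    """
    return {
        "entries": [
            {"url": url, "mandatory": True, "meta": {"content_length": size}}
            for url, size in files
        ]
    }


manifest = build_copy_manifest(
    [
        ("s3://example-data-lake/curated/orders/part-0000.parquet", 1048576),
        ("s3://example-data-lake/curated/orders/part-0001.parquet", 2097152),
    ]
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The JSON would be uploaded to S3, and the COPY statement would then point at the manifest object and add the `MANIFEST` keyword instead of pointing at the data prefix.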

UNLOAD to S3 pattern:

UNLOAD ('SELECT order_date, SUM(amount) AS revenue FROM analytics.orders GROUP BY order_date')
TO 's3://example-data-lake/exports/revenue/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;

Redshift troubleshooting shortcuts:

Symptom	Check first
COPY fails	IAM role, S3 path, KMS access, file format, load error views
Query slow after load	Statistics, sort/distribution, skew, scanned external data
External table query slow	S3 file format, partitioning, file sizes, Spectrum pruning
Access denied to S3	Redshift role policy, bucket policy, KMS key policy
BI workload contention	Workload management, query design, materialized views

Streaming and event processing reference

Kinesis, Firehose, MSK, Flink

Dimension	Kinesis Data Streams	Data Firehose	Amazon MSK	Managed Service for Apache Flink
Primary role	Durable stream for custom apps	Managed delivery stream	Managed Kafka	Stateful stream processing
Consumers	Custom consumers	Destination delivery, optional transform	Kafka consumers	Flink application
Replay	Yes, within configured retention	Not a replay stream for custom consumers	Yes, within topic retention	Depends on the source stream's retention; the application resumes from checkpoints
Ordering	Per partition key/shard	Not the main design feature	Per Kafka partition	Depends on source partitioning and app logic
Operations	Stream capacity and consumer design	Destination and delivery config	Kafka cluster/topic/client design	Application state, checkpoints, parallelism
Choose when	You need custom stream processing	You need easy delivery to storage/search/warehouse	You need Kafka compatibility	You need windows, joins, state, event-time logic

Streaming design traps

Partition keys determine ordering and load distribution. Bad keys create hot partitions.
Assume at-least-once delivery in many pipelines; design deduplication and idempotent sinks.
Firehose buffering means it is usually not the answer for the lowest-latency custom consumer requirement.
Lambda works for lightweight stream processing, not complex stateful analytics.
Use dead-letter, backup, or quarantine patterns for failed records.
Monitor consumer lag, delivery failures, throttling, and error logs.
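
The hot-partition trap can be demonstrated with a small simulation. Kinesis Data Streams maps the MD5 hash of a partition key into a 128-bit hash key space divided among shards; the sketch below assumes the shards split that space evenly, which may not hold after resharding. Function names are hypothetical.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2**128  # size of the 128-bit hash key space


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard an evenly split stream would pick."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so values in the final remainder still map to the last shard.
    return min(hash_value // range_size, shard_count - 1)


def shard_load(partition_keys: list[str], shard_count: int) -> Counter:
    """Count how many records land on each shard."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


# One tenant ID used as the key for every record: a single hot shard.
hot = shard_load(["tenant-1"] * 1000, shard_count=4)

# A high-cardinality key such as an order ID spreads records out.
spread = shard_load([f"order-{i}" for i in range(1000)], shard_count=4)
```

A low-cardinality key preserves ordering for that key but concentrates all traffic on one shard; a high-cardinality key balances load at the cost of ordering across keys.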

Transformation and orchestration choices

Requirement	Choose	Why
Distributed ETL over large S3 data	AWS Glue Spark job	Serverless managed Spark
Complex big data stack or custom libraries/configuration	Amazon EMR	More control over cluster/runtime
Lightweight event transform	Lambda	Simple code on event triggers
SQL transformation over S3	Athena CTAS/INSERT	Serverless SQL pipeline step
SQL transformation inside warehouse	Redshift SQL, stored procedures, materialized views	Keep warehouse transformations close to modeled data
Stateful stream transformation	Managed Service for Apache Flink	Windows, joins, state
Glue-only pipeline	Glue workflows and triggers	Native Glue orchestration
Multi-service workflow	Step Functions	Branches, retries, service integrations
Airflow DAG requirement	MWAA	Managed Airflow compatibility
Time-based event trigger	EventBridge Scheduler or rules	Schedules pipeline starts

Orchestration exam distinctions

If the question emphasizes	Prefer
Retry policies, branching, human-readable state machine, AWS SDK integrations	Step Functions
Existing Airflow DAGs and operators	MWAA
Simple scheduled Glue job	Glue trigger or EventBridge schedule
Event from S3 starts processing	S3 event notification to Lambda/EventBridge/queue, then orchestrate
Decoupling ingestion from processing	SQS between producer and worker, or stream where replay/order is needed
Failure notification	EventBridge rule, CloudWatch alarm, SNS notification
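
As an illustration of the retry and failure-notification rows above, the following sketch builds a Step Functions state machine definition as a Python dictionary. The job name, topic ARN, retry values, and state names are placeholders chosen for the example, not recommendations.

```python
import json


def glue_job_state_machine(job_name: str, failure_topic_arn: str) -> dict:
    """Return a definition that runs a Glue job with retries, then notifies on failure."""
    return {
        "Comment": "Run a Glue job, retry on error, notify on final failure",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix waits for the job run to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Retry": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                # Reached only after the retries are exhausted.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": failure_topic_arn,
                    "Message": "Glue job failed after retries",
                },
                "Next": "Failed",
            },
            "Failed": {"Type": "Fail", "Error": "GlueJobFailed"},
        },
    }


definition = glue_job_state_machine(
    "curate-orders",  # hypothetical job name
    "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",  # hypothetical topic
)
definition_json = json.dumps(definition, indent=2)
```

Declaring retries and the error path in the definition, rather than in a wrapper script, is what the exam usually means by resilient orchestration.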

Security, governance, and access control

Data access control layers

Layer	Controls	Common exam point
IAM identity policy	What principals can call AWS APIs	Required but may not be sufficient alone
S3 bucket/access point policy	Resource-level access to objects	Needed for cross-account and centralized lake patterns
S3 Block Public Access	Prevents public exposure	Keep enabled unless a specific approved public pattern exists
KMS key policy/grants	Who can use encryption keys	Access fails if IAM allows S3 but KMS denies decrypt
Lake Formation permissions	Database, table, column, row, and cell-level access to governed data lake resources	Use for fine-grained analytics permissions
Glue Data Catalog resource policy	Catalog-level sharing/access	Often relevant in cross-account catalogs
Secrets Manager	Stores database/API credentials	Use with Glue connections and jobs
VPC security groups/routes/endpoints	Network path to private sources and AWS APIs	Glue/DMS in VPC often need S3 and service endpoint access
CloudTrail	Audit of API activity	Enable relevant data events when object-level audit is needed
Amazon Macie	Sensitive data discovery in S3	Helps identify PII/sensitive data exposure

Governance decision table

Requirement	Strong answer
Analysts can query only approved columns	Lake Formation column permissions or governed views
Grant data lake access by business tags	Lake Formation LF-Tags
Encrypt S3 objects with customer-managed key	SSE-KMS with proper key policy
Cross-account Athena access to encrypted S3 data	Align Lake Formation/catalog, S3 bucket policy, IAM, and KMS key policy
Store JDBC password for Glue	Secrets Manager or Glue connection using a secret
Audit who changed Glue table definitions	CloudTrail management events
Detect sensitive data in S3	Macie
Keep Glue job traffic private to AWS services	VPC endpoints and correct routing/security groups
Prevent accidental public S3 lake access	S3 Block Public Access plus least-privilege bucket policies

Least-privilege reminders

Grant jobs only the S3 prefixes, catalog databases/tables, KMS keys, and logs they need.
Separate roles for ingestion, transformation, catalog administration, and consumption.
For cross-account sharing, check all layers: IAM, resource policy, Lake Formation, S3, and KMS.
Avoid embedding credentials in scripts, notebooks, job parameters, or environment variables when Secrets Manager is appropriate.
Use encryption in transit and at rest for data pipelines.
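
A least-privilege job role along the lines above might be shaped like the policy document built below. The bucket, prefix, and key ARN are placeholders, `job_policy` is a hypothetical helper, and a real role would also need the Glue, catalog, and logging permissions that this sketch omits.

```python
import json


def job_policy(bucket: str, prefix: str, kms_key_arn: str) -> dict:
    """Return an IAM policy document scoped to one S3 prefix and one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the job's prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "UseDataKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            },
        ],
    }


policy = job_policy(
    "example-data-lake",
    "curated/orders",
    "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",  # placeholder
)
policy_json = json.dumps(policy, indent=2)
```

The identity policy is only one layer: the KMS key policy, the bucket policy, and any Lake Formation grants still have to allow the same principal.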

Operations, monitoring, and troubleshooting

Service signals

Service	Monitor
Glue jobs	Job run status, CloudWatch logs, errors, duration, data skew symptoms, output file count
Glue crawlers	Crawler run status, schema changes, tables created, partition discovery
Athena	Query failures, scanned data, workgroup settings, result location, permissions
Redshift	Query performance, load errors, disk/storage pressure, WLM queues, system views
DMS	Task status, table statistics, replication lag, task logs
Kinesis Data Streams	Write/read throttling, iterator age or consumer lag, hot partitions
Firehose	Delivery success/failure, transformation errors, backup records
MSK	Broker health, topic throughput, consumer lag
Flink	Checkpoints, application health, lag, failed records
S3	Object creation, replication/lifecycle status, access errors
Lake Formation	Grant changes, denied access, LF-Tag policy alignment
KMS	Key access denied, disabled key, missing cross-account permissions
Step Functions	Failed states, retries, execution history
EventBridge	Rule matches, target invocation failures, dead-letter targets

Troubleshooting table

Symptom	Likely causes	Fast checks or fixes
Athena scans too much data	Row-oriented format (CSV/JSON), no partitions, no column pruning	Convert to Parquet/ORC, partition by filters, avoid `SELECT *`
Athena access denied	IAM, S3, KMS, Lake Formation, workgroup result location	Test each permission layer
Glue job out of memory or slow	Skew, tiny files, wide shuffle, no predicate pushdown	Repartition, compact, filter early, tune joins
Glue job reprocesses data	Bookmark disabled/reset, changed source path, non-idempotent writes	Enable bookmarks where useful, deduplicate, design idempotent outputs
Crawler creates many tables	Folder structure or classifier mismatch	Narrow crawler scope, fix path layout, adjust classifiers
DMS CDC lag grows	Source log pressure, target bottleneck, network, task config	Check task logs, table stats, target capacity
Firehose delivery fails	Destination permission, KMS, transform error, schema conversion issue	Inspect error logs and backup prefix
Kinesis consumer falls behind	Hot shard, slow consumer, insufficient parallelism	Improve partition key, scale stream/consumer design
Redshift COPY errors	Bad file format, role/KMS issue, incompatible schema	Check load error views and S3 object format
Redshift query slow	Poor distribution/sort, stale stats, queue contention	Analyze tables, review plan, tune WLM and table design
Lake Formation denies query	Missing LF grant, IAM mismatch, location not registered	Check data location registration and table permissions
Glue job cannot reach S3 from VPC	Missing route or VPC endpoint	Add S3 access path and service endpoints as needed
S3 event does not start pipeline	Notification filter mismatch, missing target resource policy, event type not configured	Validate prefix/suffix, destination permissions, event pattern

Data quality, schema, and reliability

Concern	AWS pattern
Validate completeness, uniqueness, ranges	AWS Glue Data Quality rules
Enforce streaming schema compatibility	AWS Glue Schema Registry
Preserve bad records	Quarantine S3 prefix with error metadata
Prevent duplicate records	Idempotent keys, deduplication step, deterministic output paths
Handle schema drift	Crawler review, explicit schema management, compatible schema evolution
Track lineage	Store batch IDs, source metadata, job run IDs, and catalog versions
Recover from bad transform	Reprocess raw immutable data into corrected curated zone
Promote trusted datasets	Raw to curated pipeline with quality gates

Reliability checklist:

Can the pipeline be safely retried?
Are failed records preserved?
Is raw source data immutable?
Are schema changes detected before breaking consumers?
Are job failures visible through alarms or events?
Are downstream writes atomic enough for the service and format used?
Are permissions least-privilege but sufficient across IAM, S3, KMS, and Lake Formation?
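
The first two checklist questions (safe retries, preserved failures) can be sketched as a generic processing loop. This is an illustration only: `process_with_quarantine` is a hypothetical helper, and in a real pipeline the quarantined entries would be written to a quarantine prefix or dead-letter target rather than returned in memory.

```python
import time


def process_with_quarantine(records, handler, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Apply handler to each record with exponential backoff between attempts.

    Records that still fail after max_attempts are kept with their error
    details instead of being dropped.
    """
    succeeded = []
    quarantined = []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                succeeded.append(handler(record))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    quarantined.append(
                        {"record": record, "error": repr(exc), "attempts": attempt}
                    )
                else:
                    # Delay doubles each attempt: base, 2x base, 4x base, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return succeeded, quarantined
```

The `sleep` parameter is injectable so that the backoff can be skipped in tests; the handler itself should be idempotent, since a retry may repeat work that partly succeeded.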

Cost and performance optimization

Area	Optimize by
Athena	Columnar formats, compression, partitions, workgroups, CTAS for repeated transforms
S3 lake	Lifecycle policies, compact files, avoid unnecessary copies, design prefixes logically
Glue	Filter early, avoid shuffles, use bookmarks, write partitioned columnar output
Redshift	COPY from S3, sort/distribution design, statistics, materialized views, workload tuning
Kinesis	Balanced partition keys, right stream capacity mode/design, efficient consumers
Firehose	Destination buffering/format conversion, backup failed records, transform only when needed
DMS	Monitor lag, choose appropriate task settings, avoid unnecessary transformations
Cross-service data movement	Keep pipelines regional and avoid unnecessary intermediate hops
Governance	Use Lake Formation and tags to avoid duplicating governed datasets

High-yield optimization principle: for analytics on S3, reduce bytes scanned and reduce file/partition overhead before scaling compute.
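
A rough back-of-the-envelope calculation shows why this principle holds. The sketch assumes data is spread evenly across partitions and columns and ignores compression, so it is an intuition aid rather than a cost model; `estimated_scanned_bytes` is a hypothetical helper.

```python
def estimated_scanned_bytes(
    total_bytes: int,
    partitions_total: int,
    partitions_read: int,
    columns_total: int,
    columns_read: int,
    columnar: bool = True,
) -> int:
    """Estimate bytes scanned after partition pruning and column pruning.

    Row-oriented formats cannot skip columns, so all columns count as read.
    """
    effective_columns = columns_read if columnar else columns_total
    return (total_bytes * partitions_read * effective_columns) // (
        partitions_total * columns_total
    )


# 1 TB table, 100 daily partitions, 20 columns; the query filters one day
# and selects 4 columns.
columnar_scan = estimated_scanned_bytes(10**12, 100, 1, 20, 4, columnar=True)
row_scan = estimated_scanned_bytes(10**12, 100, 1, 20, 4, columnar=False)
full_scan = estimated_scanned_bytes(10**12, 100, 100, 20, 20, columnar=False)
```

Under these assumptions the partitioned columnar layout reads a small fraction of what an unpartitioned row-format table would, before any change to compute capacity.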

Common DEA-C01 traps

Trap	Correct thinking
“Crawler transforms data”	Crawlers infer/update metadata only
“Data Catalog stores the dataset”	It stores metadata; data remains in S3 or source systems
“IAM allow means Athena can read everything”	S3, KMS, Lake Formation, and workgroup settings can still deny access
“Firehose is the same as Kinesis Data Streams”	Firehose is managed delivery; Data Streams supports custom consumers and replay
“DMS is for complex transformations”	DMS is mainly migration/replication with limited transformation
“CSV is fine for large Athena workloads”	Convert curated analytics data to Parquet/ORC
“More partitions always improve performance”	Too many tiny partitions can degrade planning and metadata operations
“Lambda is best for all ETL”	Use Glue/EMR for distributed data processing
“S3 event notification guarantees a complete batch workflow”	Use orchestration and idempotency for multi-file/batch completion logic
“Redshift Spectrum replaces all warehouse modeling”	Spectrum is useful, but internal Redshift tables can be better for repeated BI workloads
“KMS is only an encryption checkbox”	Key policies and grants are common causes of access failures
“Lake Formation replaces all IAM”	Lake Formation works with IAM, S3, catalog, and KMS controls

Final review checklist

Before the exam, be able to answer these quickly:

Which service ingests files, database CDC, SaaS, events, streams, and Kafka?
When should data be stored in S3, Redshift, or queried by Athena?
How do Glue Data Catalog, crawlers, Lake Formation, and Glue jobs differ?
How do you optimize S3 analytics with Parquet, compression, partitioning, and compaction?
What permission layers can block a query: IAM, S3, KMS, Lake Formation, catalog, or network?
How do you troubleshoot failed Glue, Athena, Redshift COPY, DMS, Kinesis, and Firehose workflows?
Which orchestration service fits: Step Functions, Glue workflows, MWAA, or EventBridge?
How do you make pipelines idempotent, observable, recoverable, and governed?

Next step

Use this Quick Reference as a checklist, then practice scenario questions that force you to choose between similar AWS services, especially Glue vs EMR, Athena vs Redshift, Kinesis Data Streams vs Firehose, DMS vs DataSync, and IAM vs Lake Formation permission issues.

Scenario Guide

Data Ingestion and Transformation