DEA-C01 Cheatsheet — Pipelines, Lakes, Warehouses, Service Pickers, SQL Patterns & Governance (High Yield)

High-signal DEA-C01 reference: ingestion patterns (batch/stream/CDC), ETL and orchestration choices, S3 data lakes + Lake Formation governance, Glue Catalog + partitions, Redshift/Athena analytics trade-offs, monitoring/data quality, and security/privacy controls.

Keep this page open while drilling questions. DEA‑C01 rewards “production data platform realism”: correct service selection, replayability/backfills, partitioning/file formats, monitoring and data quality, and governance-by-default.


Quick facts (DEA-C01)

| Item | Value |
| --- | --- |
| Questions | 65 (multiple choice + multiple response) |
| Time | 130 minutes |
| Passing score | 720 (scaled 100–1000) |
| Cost | 150 USD |
| Domains | D1 Data Ingestion and Transformation 34% • D2 Data Store Management 26% • D3 Data Operations and Support 22% • D4 Data Security and Governance 18% |

Fast strategy (what the exam expects)

  • If the requirement is replayable + backfillable, design for idempotency, checkpoints, and reprocessing (S3 as durable landing is common).
  • If you see “best cost/performance for queries on S3”, think Parquet + partitioning + Athena/Redshift Spectrum, not raw CSV scans.
  • If you see “govern access to S3 data across services”, think Lake Formation + Glue Data Catalog, not just IAM.
  • If you see “batch vs streaming”, focus on latency, ordering, retention, and operational complexity.
  • If you see “audit” or “governance”, include CloudTrail, central log storage, and controlled access to logs.

1) End-to-end data platform on AWS (mental model)

    flowchart LR
	  SRC["Sources<br/>(SaaS, DBs, apps, streams)"] --> ING["Ingest<br/>(DMS, AppFlow, Kinesis, MSK)"]
	  ING --> RAW["S3 data lake<br/>(raw/bronze)"]
	  RAW --> ETL["Transform<br/>(Glue, EMR, Lambda)"]
	  ETL --> CUR["S3 curated<br/>(silver/gold)"]
	  CUR --> CAT["Glue Data Catalog"]
	  CAT --> ATH["Athena<br/>(serverless SQL)"]
	  CUR --> RS["Redshift<br/>(warehouse)"]
	  ATH --> BI["QuickSight / BI"]
	  RS --> BI
	  CUR --> GOV["Lake Formation<br/>(permissions)"]
	  ING --> ORCH["Orchestrate<br/>(MWAA, Step Functions, EventBridge)"]
	  ORCH --> ETL
	  MON["Monitor + audit<br/>(CloudWatch, CloudTrail, Macie)"] --> ORCH
	  MON --> RS
	  MON --> ATH

High-yield framing: DEA‑C01 is about the pipeline + platform, not just one service.


2) Ingestion patterns (Domain 1)

Batch vs streaming vs CDC (picker)

| Pattern | Best for | Typical AWS answers | Common gotcha |
| --- | --- | --- | --- |
| Batch | Daily/hourly loads, predictable schedules | S3 landing + Glue/EMR; EventBridge schedule; AppFlow | Backfills + late-data handling |
| Streaming | Near-real-time events | Kinesis Data Streams; MSK; (optional) Flink | Ordering, retries, consumer lag |
| CDC (change data capture) | Database replication | AWS DMS | Exactly-once isn’t guaranteed; handle duplicates (dedup sketch below) |
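
A common way to handle the CDC duplicate gotcha is to deduplicate on the business key while building the curated layer. A minimal Athena-style sketch, assuming a hypothetical raw.orders_cdc table with DMS-style op codes (I/U/D) and an updated_at column:

    -- Keep only the latest change record per primary key (at-least-once delivery tolerated)
    SELECT id, customer_id, amount, op, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM raw.orders_cdc
    ) AS ranked
    WHERE rn = 1
      AND op <> 'D';   -- drop keys whose most recent change was a delete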

Triggers and scheduling (high yield)

| Need | Typical best-fit |
| --- | --- |
| Run every N minutes | EventBridge schedule |
| Run when a file arrives in S3 | S3 event notifications or EventBridge |
| Complex dependencies + retries | MWAA or Step Functions |

3) ETL and processing choices (Domain 1)

Glue vs EMR vs Lambda vs Redshift (fast picker)

| You need… | Best-fit (typical) | Why |
| --- | --- | --- |
| Managed Spark ETL with less ops | AWS Glue | Serverless Spark ETL with built-in Data Catalog integration |
| Full control over Spark (big jobs) | Amazon EMR | More knobs/control; long-running clusters optional |
| Lightweight transforms or glue code | AWS Lambda | Event-driven, simple steps |
| SQL transforms close to the warehouse | Amazon Redshift | Push compute to the warehouse when appropriate |

File formats (exam-friendly rules)

  • Use Parquet/ORC for analytics on S3 (columnar + compression).
  • Avoid raw CSV/JSON at scale for Athena/Redshift Spectrum scans (cost and performance); see the conversion sketch after this list.
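
One hedged way to apply both rules at once is an Athena CTAS that rewrites a raw landing table into partitioned, compressed Parquet. Table, column, and bucket names here (raw.clickstream, curated.clickstream_parquet, my-curated-bucket) are illustrative:

    -- One-time (or per-backfill) conversion of raw data into columnar storage
    CREATE TABLE curated.clickstream_parquet
    WITH (
        format = 'PARQUET',
        write_compression = 'SNAPPY',
        external_location = 's3://my-curated-bucket/clickstream/',
        partitioned_by = ARRAY['dt']
    )
    AS
    SELECT user_id, page, event_ts, dt    -- partition column must come last
    FROM raw.clickstream;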

4) Catalogs, partitions, and schema drift (Domain 2)

Glue Data Catalog (what it does)

  • Central metadata store for S3 data (databases/tables/partitions).
  • Enables engines like Athena, EMR, and Redshift Spectrum to interpret schema.

Crawlers vs explicit DDL

| Approach | When it’s best | Risk |
| --- | --- | --- |
| Glue crawler | Fast discovery, unknown schemas | Schema drift surprises |
| Explicit DDL | Strong contracts | More manual maintenance |

High-yield rule: keep partitions in sync (MSCK REPAIR, partition projection, or crawler updates), or queries silently miss new data.
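
Hedged SQL examples of keeping partitions in sync (a scheduled crawler run is the non-SQL alternative), assuming a table curated.events partitioned by dt; the S3 paths are illustrative:

    -- Option 1: scan S3 and register any Hive-style partitions missing from the catalog
    MSCK REPAIR TABLE curated.events;

    -- Option 2: register a single partition explicitly (cheaper and more predictable than a full repair)
    ALTER TABLE curated.events ADD IF NOT EXISTS
        PARTITION (dt = '2025-12-12')
        LOCATION 's3://my-curated-bucket/events/dt=2025-12-12/';

    -- Option 3: partition projection, so Athena computes partitions from table properties
    -- (these properties can also be set at CREATE TABLE time or on the Glue table directly)
    ALTER TABLE curated.events SET TBLPROPERTIES (
        'projection.enabled' = 'true',
        'projection.dt.type' = 'date',
        'projection.dt.format' = 'yyyy-MM-dd',
        'projection.dt.range' = '2024-01-01,NOW',
        'storage.location.template' = 's3://my-curated-bucket/events/dt=${dt}/'
    );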


5) Storage and analytics service selection (Domain 2/3)

Athena vs Redshift (exam picker)

| You need… | Best-fit | Why |
| --- | --- | --- |
| Ad hoc SQL on S3 | Athena | Serverless, pay per scan |
| High-concurrency BI dashboards | Redshift | Warehouse optimizations + result caching |
| Query S3 from Redshift | Redshift Spectrum | External tables on S3 (example below) |
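
For the Spectrum row above, the glue between Redshift and the S3 lake is an external schema that points at a Glue Data Catalog database. A minimal sketch, assuming an existing curated catalog database and a hypothetical IAM role ARN:

    -- Expose the Glue Data Catalog database "curated" inside Redshift as schema "spectrum_curated"
    CREATE EXTERNAL SCHEMA spectrum_curated
    FROM DATA CATALOG
    DATABASE 'curated'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Query S3-backed tables alongside local Redshift tables
    SELECT dt, COUNT(*) AS events
    FROM spectrum_curated.events
    WHERE dt = '2025-12-12'
    GROUP BY dt;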

Redshift data loading (high yield)

  • Use COPY from S3 for fast loads (parallel, columnar-friendly).
  • Use UNLOAD to export query results back to S3 (both statements are sketched below).
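
A hedged sketch of both statements, assuming a local table analytics.daily_sales, illustrative bucket names, and an IAM role attached to the cluster or workgroup:

    -- Parallel load from Parquet files on S3 into a local Redshift table (columns match by position)
    COPY analytics.daily_sales
    FROM 's3://my-curated-bucket/daily_sales/dt=2025-12-12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load-role'
    FORMAT AS PARQUET;

    -- Export query results back to S3 as Parquet for downstream consumers
    UNLOAD ('SELECT dt, customer_id, total FROM analytics.daily_sales WHERE dt = ''2025-12-12''')
    TO 's3://my-export-bucket/daily_sales/dt=2025-12-12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load-role'
    FORMAT AS PARQUET;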

6) SQL patterns (Domain 1/3)

Partition pruning (Athena mindset)

If your table is partitioned by dt, always filter by it:

    SELECT *
    FROM curated.events
    WHERE dt = '2025-12-12'   -- partition filter: Athena reads only the S3 prefixes for this dt
      AND event_type = 'purchase';

CTAS for repeatable outputs (Athena)

    CREATE TABLE curated.daily_sales
    WITH (format = 'PARQUET', partitioned_by = ARRAY['dt'])
    AS
    SELECT customer_id, SUM(amount) AS total, dt   -- partitioned_by columns must come last in the SELECT
    FROM raw.sales
    GROUP BY customer_id, dt;

7) Orchestration and reliability (Domain 1/3)

MWAA vs Step Functions (fast picker)

| You need… | Best-fit | Why |
| --- | --- | --- |
| DAGs, complex dependencies, retries | MWAA (Airflow) | Mature DAG patterns |
| Serverless state machine orchestration | Step Functions | Visual state machines, retries, integration patterns |

    flowchart LR
	  E["EventBridge schedule"] --> W["Workflow start"]
	  W --> I["Ingest"]
	  I --> V{"Valid?"}
	  V -->|yes| T["Transform"]
	  V -->|no| Q["Quarantine + alert"]
	  T --> C["Catalog/partitions update"]
	  C --> P["Publish dataset"]
	  P --> N["Notify (SNS)"]

High-yield reliability rules:

  • Design for retries + duplicates (at-least-once is normal).
  • Make steps idempotent (safe re-runs); see the MERGE sketch after this list.
  • Track freshness/latency SLIs (what matters to users).
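
One way to make a load step safely re-runnable is an upsert keyed on the natural key, so replays and duplicate deliveries converge to the same final state. A minimal sketch, assuming curated.orders is an Apache Iceberg table in the Glue Data Catalog (Athena engine v3) and staging.orders_batch holds one run's input; all names are illustrative:

    -- Re-running the same batch produces the same result (idempotent upsert)
    MERGE INTO curated.orders AS t
    USING staging.orders_batch AS s
        ON t.order_id = s.order_id
    WHEN MATCHED THEN
        UPDATE SET status = s.status, amount = s.amount, updated_at = s.updated_at
    WHEN NOT MATCHED THEN
        INSERT (order_id, customer_id, status, amount, updated_at)
        VALUES (s.order_id, s.customer_id, s.status, s.amount, s.updated_at);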

8) Monitoring and troubleshooting (Domain 3)

What to monitor

  • Pipeline health: failures, retries, runtime, backlog/lag
  • Freshness: “is today’s partition present?” (see the SQL probe after this list)
  • Cost: scan volume (Athena), cluster usage (EMR/Redshift), data transfer
  • Security/audit: access logs, permission changes
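
A hedged freshness probe you can run on a schedule against the curated layer (table and column names illustrative); zero rows or an implausibly low count for today's dt is the alert condition:

    -- Freshness check: did today's partition land, and with a plausible row count?
    SELECT dt, COUNT(*) AS row_count
    FROM curated.events
    WHERE dt = CAST(current_date AS VARCHAR)   -- assumes dt is stored as 'yyyy-MM-dd' strings
    GROUP BY dt;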

Common AWS tooling:

  • CloudWatch (metrics/logs/alarms, Logs Insights)
  • CloudTrail (API calls; audit)
  • Macie (PII discovery; policy violations)

9) Data quality (Domain 3)

Data quality dimensions (memorize)

| Dimension | Example check |
| --- | --- |
| Completeness | Required fields are not null |
| Consistency | Same customer_id format across sources |
| Accuracy | Values within expected ranges |
| Integrity | Valid foreign keys / referential relationships |

High-yield pattern: run checks in-pipeline, quarantine bad records, and alert.
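
A minimal in-pipeline check for one partition, expressed as SQL an orchestrator could run before publishing (table, column, and threshold values are illustrative); non-zero failure counts route the batch to quarantine and trigger the alert:

    -- Completeness and accuracy checks for a single daily partition
    SELECT
        COUNT(*)                                   AS total_rows,
        COUNT(*) - COUNT(customer_id)              AS null_customer_ids,     -- completeness
        SUM(CASE WHEN amount < 0 OR amount > 100000
                 THEN 1 ELSE 0 END)                AS out_of_range_amounts   -- accuracy
    FROM raw.sales
    WHERE dt = '2025-12-12';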


10) Security and governance (Domain 4)

Lake Formation (why it’s a big deal)

Lake Formation helps you manage fine-grained permissions for data in S3 across engines like Athena/EMR/Redshift Spectrum, using a consistent governance model.

Encryption and key points

  • Prefer SSE-KMS for S3 and service-level encryption for analytics services.
  • Use TLS for encryption in transit.
  • Don’t log secrets or raw PII; keep logs access-controlled.

Audit readiness checklist

  • CloudTrail enabled and centralized (optionally CloudTrail Lake for queries)
  • CloudWatch Logs retention + encryption set
  • Access to logs is restricted (separation of duties)
  • Data sharing has explicit approvals and is traceable

Next steps

  • Use the Syllabus as your checklist.
  • Use Practice to drill weak tasks fast.
  • Use the Study Plan if you want a 30/60/90‑day schedule.