DEA-C01 Cheatsheet — Pipelines, Lakes, Warehouses, Service Pickers, SQL Patterns & Governance (High Yield)

High-signal DEA-C01 reference: ingestion patterns (batch/stream/CDC), ETL and orchestration choices, S3 data lakes + Lake Formation governance, Glue Catalog + partitions, Redshift/Athena analytics trade-offs, monitoring/data quality, and security/privacy controls.

Keep this page open while drilling questions. DEA‑C01 rewards “production data platform realism”: correct service selection, replayability/backfills, partitioning/file formats, monitoring and data quality, and governance-by-default.


Quick facts (DEA-C01)

| Item | Value |
| --- | --- |
| Questions | 65 (multiple choice + multiple response) |
| Time | 130 minutes |
| Passing score | 720 (scaled 100–1000) |
| Cost | 150 USD |
| Domains | D1 Data Ingestion and Transformation 34% • D2 Data Store Management 26% • D3 Data Operations and Support 22% • D4 Data Security and Governance 18% |

Fast strategy (what the exam expects)

  • If the requirement is replayable + backfillable, design for idempotency, checkpoints, and reprocessing (S3 as durable landing is common).
  • If you see “best cost/performance for queries on S3”, think Parquet + partitioning + Athena/Redshift Spectrum, not raw CSV scans.
  • If you see “govern access to S3 data across services”, think Lake Formation + Glue Data Catalog, not just IAM.
  • If you see “batch vs streaming”, focus on latency, ordering, retention, and operational complexity.
  • If you see “audit” or “governance”, include CloudTrail, central log storage, and controlled access to logs.

1) End-to-end data platform on AWS (mental model)

    flowchart LR
	  SRC["Sources<br/>(SaaS, DBs, apps, streams)"] --> ING["Ingest<br/>(DMS, AppFlow, Kinesis, MSK)"]
	  ING --> RAW["S3 data lake<br/>(raw/bronze)"]
	  RAW --> ETL["Transform<br/>(Glue, EMR, Lambda)"]
	  ETL --> CUR["S3 curated<br/>(silver/gold)"]
	  CUR --> CAT["Glue Data Catalog"]
	  CAT --> ATH["Athena<br/>(serverless SQL)"]
	  CUR --> RS["Redshift<br/>(warehouse)"]
	  ATH --> BI["QuickSight / BI"]
	  RS --> BI
	  CUR --> GOV["Lake Formation<br/>(permissions)"]
	  ING --> ORCH["Orchestrate<br/>(MWAA, Step Functions, EventBridge)"]
	  ORCH --> ETL
	  MON["Monitor + audit<br/>(CloudWatch, CloudTrail, Macie)"] --> ORCH
	  MON --> RS
	  MON --> ATH

High-yield framing: DEA‑C01 is about the pipeline + platform, not just one service.


2) Ingestion patterns (Domain 1)

Batch vs streaming vs CDC (picker)

| Pattern | Best for | Typical AWS answers | Common gotcha |
| --- | --- | --- | --- |
| Batch | Daily/hourly loads, predictable schedules | S3 landing + Glue/EMR; EventBridge schedule; AppFlow | Backfills + late-data handling |
| Streaming | Near-real-time events | Kinesis Data Streams; MSK; (optional) Flink | Ordering, retries, consumer lag |
| CDC (change data capture) | Database replication | AWS DMS | Exactly-once isn’t guaranteed; handle duplicates (dedup sketch below) |
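
A common way to handle the CDC duplicate gotcha is to deduplicate on the business key while building the curated layer. A minimal Athena-style sketch, assuming a hypothetical raw.orders_cdc table with DMS-style op codes (I/U/D) and an updated_at column:

    -- Keep only the latest change record per primary key (at-least-once delivery tolerated)
    SELECT id, customer_id, amount, op, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM raw.orders_cdc
    ) AS ranked
    WHERE rn = 1
      AND op <> 'D';   -- drop keys whose most recent change was a delete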

Triggers and scheduling (high yield)

| Need | Typical best-fit |
| --- | --- |
| Run every N minutes | EventBridge schedule |
| Run when a file arrives in S3 | S3 event notifications or EventBridge |
| Complex dependencies + retries | MWAA or Step Functions |

3) ETL and processing choices (Domain 1)

Glue vs EMR vs Lambda vs Redshift (fast picker)

| You need… | Best-fit (typical) | Why |
| --- | --- | --- |
| Managed Spark ETL with less ops | AWS Glue | Serverless Spark ETL with built-in Data Catalog integration |
| Full control over Spark (big jobs) | Amazon EMR | More knobs/control; long-running clusters optional |
| Lightweight transforms or glue code | AWS Lambda | Event-driven, simple steps |
| SQL transforms close to the warehouse | Amazon Redshift | Push compute to the warehouse when appropriate |

File formats (exam-friendly rules)

  • Use Parquet/ORC for analytics on S3 (columnar + compression).
  • Avoid raw CSV/JSON at scale for Athena/Redshift Spectrum scans (cost and performance); see the conversion sketch after this list.
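
One hedged way to apply both rules at once is an Athena CTAS that rewrites a raw landing table into partitioned, compressed Parquet. Table, column, and bucket names here (raw.clickstream, curated.clickstream_parquet, my-curated-bucket) are illustrative:

    -- One-time (or per-backfill) conversion of raw data into columnar storage
    CREATE TABLE curated.clickstream_parquet
    WITH (
        format = 'PARQUET',
        write_compression = 'SNAPPY',
        external_location = 's3://my-curated-bucket/clickstream/',
        partitioned_by = ARRAY['dt']
    )
    AS
    SELECT user_id, page, event_ts, dt    -- partition column must come last
    FROM raw.clickstream;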

4) Catalogs, partitions, and schema drift (Domain 2)

Glue Data Catalog (what it does)

  • Central metadata store for S3 data (databases/tables/partitions).
  • Enables engines like Athena, EMR, and Redshift Spectrum to interpret schema.

Crawlers vs explicit DDL

| Approach | When it’s best | Risk |
| --- | --- | --- |
| Glue crawler | Fast discovery, unknown schemas | Schema drift surprises |
| Explicit DDL | Strong contracts | More manual maintenance |

High-yield rule: keep partitions in sync (MSCK REPAIR, partition projection, or crawler updates), or queries silently miss new data.
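
Hedged SQL examples of keeping partitions in sync (a scheduled crawler run is the non-SQL alternative), assuming a table curated.events partitioned by dt; the S3 paths are illustrative:

    -- Option 1: scan S3 and register any Hive-style partitions missing from the catalog
    MSCK REPAIR TABLE curated.events;

    -- Option 2: register a single partition explicitly (cheaper and more predictable than a full repair)
    ALTER TABLE curated.events ADD IF NOT EXISTS
        PARTITION (dt = '2025-12-12')
        LOCATION 's3://my-curated-bucket/events/dt=2025-12-12/';

    -- Option 3: partition projection, so Athena computes partitions from table properties
    -- (these properties can also be set at CREATE TABLE time or on the Glue table directly)
    ALTER TABLE curated.events SET TBLPROPERTIES (
        'projection.enabled' = 'true',
        'projection.dt.type' = 'date',
        'projection.dt.format' = 'yyyy-MM-dd',
        'projection.dt.range' = '2024-01-01,NOW',
        'storage.location.template' = 's3://my-curated-bucket/events/dt=${dt}/'
    );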


5) Storage and analytics service selection (Domain 2/3)

Athena vs Redshift (exam picker)

| You need… | Best-fit | Why |
| --- | --- | --- |
| Ad hoc SQL on S3 | Athena | Serverless, pay per scan |
| High-concurrency BI dashboards | Redshift | Warehouse optimizations + result caching |
| Query S3 from Redshift | Redshift Spectrum | External tables on S3 (example below) |
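
For the Spectrum row above, the glue between Redshift and the S3 lake is an external schema that points at a Glue Data Catalog database. A minimal sketch, assuming an existing curated catalog database and a hypothetical IAM role ARN:

    -- Expose the Glue Data Catalog database "curated" inside Redshift as schema "spectrum_curated"
    CREATE EXTERNAL SCHEMA spectrum_curated
    FROM DATA CATALOG
    DATABASE 'curated'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Query S3-backed tables alongside local Redshift tables
    SELECT dt, COUNT(*) AS events
    FROM spectrum_curated.events
    WHERE dt = '2025-12-12'
    GROUP BY dt;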

Redshift data loading (high yield)

  • Use COPY from S3 for fast loads (parallel, columnar-friendly).
  • Use UNLOAD to export query results back to S3 (both statements are sketched below).
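
A hedged sketch of both statements, assuming a local table analytics.daily_sales, illustrative bucket names, and an IAM role attached to the cluster or workgroup:

    -- Parallel load from Parquet files on S3 into a local Redshift table (columns match by position)
    COPY analytics.daily_sales
    FROM 's3://my-curated-bucket/daily_sales/dt=2025-12-12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load-role'
    FORMAT AS PARQUET;

    -- Export query results back to S3 as Parquet for downstream consumers
    UNLOAD ('SELECT dt, customer_id, total FROM analytics.daily_sales WHERE dt = ''2025-12-12''')
    TO 's3://my-export-bucket/daily_sales/dt=2025-12-12/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load-role'
    FORMAT AS PARQUET;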

6) SQL patterns (Domain 1/3)

Partition pruning (Athena mindset)

If your table is partitioned by dt, always filter by it:

    SELECT *
    FROM curated.events
    WHERE dt = '2025-12-12'   -- partition filter: Athena reads only the S3 prefixes for this dt
      AND event_type = 'purchase';

CTAS for repeatable outputs (Athena)

    CREATE TABLE curated.daily_sales
    WITH (format = 'PARQUET', partitioned_by = ARRAY['dt'])
    AS
    SELECT customer_id, SUM(amount) AS total, dt   -- partitioned_by columns must come last in the SELECT
    FROM raw.sales
    GROUP BY customer_id, dt;

7) Orchestration and reliability (Domain 1/3)

MWAA vs Step Functions (fast picker)

| You need… | Best-fit | Why |
| --- | --- | --- |
| DAGs, complex dependencies, retries | MWAA (Airflow) | Mature DAG patterns |
| Serverless state machine orchestration | Step Functions | Visual state machines, retries, integration patterns |

    flowchart LR
	  E["EventBridge schedule"] --> W["Workflow start"]
	  W --> I["Ingest"]
	  I --> V{"Valid?"}
	  V -->|yes| T["Transform"]
	  V -->|no| Q["Quarantine + alert"]
	  T --> C["Catalog/partitions update"]
	  C --> P["Publish dataset"]
	  P --> N["Notify (SNS)"]

High-yield reliability rules:

  • Design for retries + duplicates (at-least-once is normal).
  • Make steps idempotent (safe re-runs); see the MERGE sketch after this list.
  • Track freshness/latency SLIs (what matters to users).
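
One way to make a load step safely re-runnable is an upsert keyed on the natural key, so replays and duplicate deliveries converge to the same final state. A minimal sketch, assuming curated.orders is an Apache Iceberg table in the Glue Data Catalog (Athena engine v3) and staging.orders_batch holds one run's input; all names are illustrative:

    -- Re-running the same batch produces the same result (idempotent upsert)
    MERGE INTO curated.orders AS t
    USING staging.orders_batch AS s
        ON t.order_id = s.order_id
    WHEN MATCHED THEN
        UPDATE SET status = s.status, amount = s.amount, updated_at = s.updated_at
    WHEN NOT MATCHED THEN
        INSERT (order_id, customer_id, status, amount, updated_at)
        VALUES (s.order_id, s.customer_id, s.status, s.amount, s.updated_at);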

8) Monitoring and troubleshooting (Domain 3)

What to monitor

  • Pipeline health: failures, retries, runtime, backlog/lag
  • Freshness: “is today’s partition present?” (see the SQL probe after this list)
  • Cost: scan volume (Athena), cluster usage (EMR/Redshift), data transfer
  • Security/audit: access logs, permission changes
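
A hedged freshness probe you can run on a schedule against the curated layer (table and column names illustrative); zero rows or an implausibly low count for today's dt is the alert condition:

    -- Freshness check: did today's partition land, and with a plausible row count?
    SELECT dt, COUNT(*) AS row_count
    FROM curated.events
    WHERE dt = CAST(current_date AS VARCHAR)   -- assumes dt is stored as 'yyyy-MM-dd' strings
    GROUP BY dt;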

Common AWS tooling:

  • CloudWatch (metrics/logs/alarms, Logs Insights)
  • CloudTrail (API calls; audit)
  • Macie (PII discovery; policy violations)

9) Data quality (Domain 3)

Data quality dimensions (memorize)

| Dimension | Example check |
| --- | --- |
| Completeness | Required fields are not null |
| Consistency | Same customer_id format across sources |
| Accuracy | Values within expected ranges |
| Integrity | Valid foreign keys / referential relationships |

High-yield pattern: run checks in-pipeline, quarantine bad records, and alert.
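
A minimal in-pipeline check for one partition, expressed as SQL an orchestrator could run before publishing (table, column, and threshold values are illustrative); non-zero failure counts route the batch to quarantine and trigger the alert:

    -- Completeness and accuracy checks for a single daily partition
    SELECT
        COUNT(*)                                   AS total_rows,
        COUNT(*) - COUNT(customer_id)              AS null_customer_ids,     -- completeness
        SUM(CASE WHEN amount < 0 OR amount > 100000
                 THEN 1 ELSE 0 END)                AS out_of_range_amounts   -- accuracy
    FROM raw.sales
    WHERE dt = '2025-12-12';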


10) Security and governance (Domain 4)

Lake Formation (why it’s a big deal)

Lake Formation helps you manage fine-grained permissions for data in S3 across engines like Athena/EMR/Redshift Spectrum, using a consistent governance model.

Encryption and key points

  • Prefer SSE-KMS for S3 and service-level encryption for analytics services.
  • Use TLS for encryption in transit.
  • Don’t log secrets or raw PII; keep logs access-controlled.

Audit readiness checklist

  • CloudTrail enabled and centralized (optionally CloudTrail Lake for queries)
  • CloudWatch Logs retention + encryption set
  • Access to logs is restricted (separation of duties)
  • Data sharing has explicit approvals and is traceable

Next steps

  • Use the Syllabus as your checklist.
  • Use Practice to drill weak tasks fast.
  • Use the Study Plan if you want a 30/60/90‑day schedule.