DEA-C01 Cheatsheet — Pipelines, Lakes, Warehouses, Service Pickers, SQL Patterns & Governance (High Yield)

High-signal DEA-C01 reference: ingestion patterns (batch/stream/CDC), ETL and orchestration choices, S3 data lakes + Lake Formation governance, Glue Catalog + partitions, Redshift/Athena analytics trade-offs, monitoring/data quality, and security/privacy controls.

Keep this page open while drilling questions. DEA‑C01 rewards “production data platform realism”: correct service selection, replayability/backfills, partitioning/file formats, monitoring and data quality, and governance-by-default.


Quick facts (DEA-C01)

| Item | Value |
| --- | --- |
| Questions | 65 (multiple choice + multiple response) |
| Time | 130 minutes |
| Passing score | 720 (scaled 100–1000) |
| Cost | 150 USD |
| Domains | D1 34% • D2 26% • D3 22% • D4 18% |

Fast strategy (what the exam expects)

  • If the requirement is replayable + backfillable, design for idempotency, checkpoints, and reprocessing (S3 as durable landing is common).
  • If you see “best cost/performance for queries on S3”, think Parquet + partitioning + Athena/Redshift Spectrum, not raw CSV scans.
  • If you see “govern access to S3 data across services”, think Lake Formation + Glue Data Catalog, not just IAM.
  • If you see “batch vs streaming”, focus on latency, ordering, retention, and operational complexity.
  • If you see “audit” or “governance”, include CloudTrail, central log storage, and controlled access to logs.

1) End-to-end data platform on AWS (mental model)

```mermaid
flowchart LR
  SRC["Sources<br/>(SaaS, DBs, apps, streams)"] --> ING["Ingest<br/>(DMS, AppFlow, Kinesis, MSK)"]
  ING --> RAW["S3 data lake<br/>(raw/bronze)"]
  RAW --> ETL["Transform<br/>(Glue, EMR, Lambda)"]
  ETL --> CUR["S3 curated<br/>(silver/gold)"]
  CUR --> CAT["Glue Data Catalog"]
  CAT --> ATH["Athena<br/>(serverless SQL)"]
  CUR --> RS["Redshift<br/>(warehouse)"]
  ATH --> BI["QuickSight / BI"]
  RS --> BI
  CUR --> GOV["Lake Formation<br/>(permissions)"]
  ING --> ORCH["Orchestrate<br/>(MWAA, Step Functions, EventBridge)"]
  ORCH --> ETL
  MON["Monitor + audit<br/>(CloudWatch, CloudTrail, Macie)"] --> ORCH
  MON --> RS
  MON --> ATH
```

High-yield framing: DEA‑C01 is about the pipeline + platform, not just one service.


2) Ingestion patterns (Domain 1)

Batch vs streaming vs CDC (picker)

| Pattern | Best for | Typical AWS answers | Common gotcha |
| --- | --- | --- | --- |
| Batch | Daily/hourly loads, predictable schedules | S3 landing + Glue/EMR; EventBridge schedule; AppFlow | Backfills + late-data handling |
| Streaming | Near-real-time events | Kinesis Data Streams; MSK; Amazon Managed Service for Apache Flink (optional) | Ordering, retries, consumer lag |
| CDC (change data capture) | Database replication | AWS DMS | Exactly-once isn’t guaranteed; handle duplicates (dedup sketch below) |
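Because DMS-style CDC delivery is at-least-once, deduplicate on read. A minimal Athena-style sketch, assuming a hypothetical `cdc.orders_changes` table keyed by `order_id` with an `updated_at` change timestamp:

```sql
-- Keep only the newest change record per key (table/column names are hypothetical).
SELECT *
FROM (
  SELECT c.*,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM cdc.orders_changes c
) latest
WHERE rn = 1;
```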

Triggers and scheduling (high yield)

| Need | Typical best fit |
| --- | --- |
| Run every N minutes | EventBridge schedule |
| Run when a file arrives in S3 | S3 event notifications or EventBridge |
| Complex dependencies + retries | MWAA or Step Functions |

3) ETL and processing choices (Domain 1)

Glue vs EMR vs Lambda vs Redshift (fast picker)

| You need… | Best fit (typical) | Why |
| --- | --- | --- |
| Managed Spark ETL with less ops | AWS Glue | Serverless-ish ETL + integrations |
| Full control over Spark (big jobs) | Amazon EMR | More knobs/control; long-running clusters optional |
| Lightweight transforms or glue code | AWS Lambda | Event-driven, simple steps |
| SQL transforms close to the warehouse | Amazon Redshift | Push compute to the warehouse when appropriate |

File formats (exam-friendly rules)

  • Use Parquet/ORC for analytics on S3 (columnar + compression); see the CTAS conversion sketch below.
  • Avoid raw CSV/JSON at scale for Athena/Redshift Spectrum scans (cost and performance).
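A common way to apply both rules is a one-time Athena CTAS that rewrites raw CSV as partitioned Parquet. A minimal sketch, assuming a hypothetical `raw.events_csv` source table and `s3://my-bucket/...` output path:

```sql
-- One-time rewrite: CSV in, partitioned Parquet out.
-- Source table, columns, and S3 path are hypothetical.
CREATE TABLE curated.events
WITH (
  format = 'PARQUET',                              -- columnar + compressed (Snappy by default)
  external_location = 's3://my-bucket/curated/events/',
  partitioned_by = ARRAY['dt']
)
AS
SELECT event_id, event_type, amount, dt            -- partition column must come last
FROM raw.events_csv;
```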

4) Catalogs, partitions, and schema drift (Domain 2)

Glue Data Catalog (what it does)

  • Central metadata store for S3 data (databases/tables/partitions).
  • Enables engines like Athena, EMR, and Redshift Spectrum to interpret schema.

Crawlers vs explicit DDL

| Approach | When it’s best | Risk |
| --- | --- | --- |
| Glue crawler | Fast discovery, unknown schemas | Schema-drift surprises |
| Explicit DDL | Strong contracts (example below) | More manual maintenance |
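For the explicit-DDL route, a minimal Athena sketch (table, columns, and bucket are hypothetical):

```sql
-- Schema as an explicit contract over S3 data (names/paths are hypothetical).
CREATE EXTERNAL TABLE curated.clicks (
  click_id string,
  user_id  string,
  url      string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/curated/clicks/';
```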

High-yield rule: keep partitions in sync (MSCK REPAIR, partition projection, or crawler updates; see the commands below), or queries will silently miss new data.
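Both repair commands in Athena SQL (partition projection is instead configured via table properties, so no per-partition DDL is needed):

```sql
-- Scan the table location and register any missing partitions (slow at scale):
MSCK REPAIR TABLE curated.clicks;

-- Or register a single new partition explicitly (fast, targeted):
ALTER TABLE curated.clicks
  ADD IF NOT EXISTS PARTITION (dt = '2025-12-12');
```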


5) Storage and analytics service selection (Domain 2/3)

Athena vs Redshift (exam picker)

| You need… | Best fit | Why |
| --- | --- | --- |
| Ad hoc SQL on S3 | Athena | Serverless, pay per scan |
| High-concurrency BI dashboards | Redshift | Warehouse optimization + caching |
| Query S3 from Redshift | Redshift Spectrum | External tables over S3 (sketch below) |
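A minimal Spectrum sketch: expose a Glue database as an external schema, then query it with regular Redshift SQL (the role ARN and names are hypothetical):

```sql
-- Map the Glue database 'curated' into Redshift (role ARN is hypothetical).
CREATE EXTERNAL SCHEMA spectrum_curated
FROM DATA CATALOG
DATABASE 'curated'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- S3-backed tables now aggregate and join like local ones.
SELECT event_type, COUNT(*) AS events
FROM spectrum_curated.events
WHERE dt = '2025-12-12'
GROUP BY event_type;
```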

Redshift data loading (high yield)

  • Use COPY from S3 for fast loads (parallel, columnar-friendly).
  • Use UNLOAD to export query results back to S3 (both commands sketched below).
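A minimal sketch of both commands (bucket paths and the role ARN are hypothetical):

```sql
-- Parallel load from Parquet files on S3 into a local table.
COPY sales
FROM 's3://my-bucket/curated/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

-- Export query results back to S3 as Parquet.
UNLOAD ('SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id')
TO 's3://my-bucket/exports/sales_totals_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```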

6) SQL patterns (Domain 1/3)

Partition pruning (Athena mindset)

If your table is partitioned by `dt`, always filter on it:

```sql
SELECT *
FROM curated.events
WHERE dt = '2025-12-12'
  AND event_type = 'purchase';
```

CTAS for repeatable outputs (Athena)

```sql
CREATE TABLE curated.daily_sales
WITH (format = 'PARQUET', partitioned_by = ARRAY['dt'])
AS
SELECT customer_id, SUM(amount) AS total, dt  -- partition column must come last
FROM raw.sales
GROUP BY customer_id, dt;
```
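For later days, Athena’s INSERT INTO appends new partitions without rebuilding the table; a sketch continuing the example above:

```sql
-- Append the next day's partition (partition column stays last).
INSERT INTO curated.daily_sales
SELECT customer_id, SUM(amount) AS total, dt
FROM raw.sales
WHERE dt = '2025-12-13'
GROUP BY customer_id, dt;
```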

7) Orchestration and reliability (Domain 1/3)

MWAA vs Step Functions (fast picker)

| You need… | Best fit | Why |
| --- | --- | --- |
| DAGs, complex dependencies, retries | MWAA (Airflow) | Mature DAG patterns |
| Serverless state-machine orchestration | Step Functions | Visual states, retries, integration patterns |

```mermaid
flowchart LR
  E["EventBridge schedule"] --> W["Workflow start"]
  W --> I["Ingest"]
  I --> V{"Valid?"}
  V -->|yes| T["Transform"]
  V -->|no| Q["Quarantine + alert"]
  T --> C["Catalog/partitions update"]
  C --> P["Publish dataset"]
  P --> N["Notify (SNS)"]
```

High-yield reliability rules:

  • Design for retries + duplicates (at-least-once is normal).
  • Make steps idempotent (safe re-runs; see the sketch after this list).
  • Track freshness/latency SLIs (what matters to users).
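One classic idempotency pattern is Redshift’s staging-table merge: delete-then-insert in a single transaction, so a re-run converges to the same state. A minimal sketch, assuming hypothetical `sales` and `staging_sales` tables keyed by `order_id`:

```sql
BEGIN;

-- Remove rows from any earlier (possibly partial) run of this batch.
DELETE FROM sales
USING staging_sales
WHERE sales.order_id = staging_sales.order_id;

-- Re-insert the full batch; re-running the whole block yields the same result.
INSERT INTO sales
SELECT * FROM staging_sales;

COMMIT;
```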

8) Monitoring and troubleshooting (Domain 3)

What to monitor

  • Pipeline health: failures, retries, runtime, backlog/lag
  • Freshness: “is today’s partition present?” (see the probe after this list)
  • Cost: scan volume (Athena), cluster usage (EMR/Redshift), data transfer
  • Security/audit: access logs, permission changes
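A freshness probe can be as simple as a scheduled Athena query against the expected partition, reusing the earlier example table (the alert threshold lives in your alarm, not the SQL):

```sql
-- Alert if this returns 0, or far below the historical average.
SELECT COUNT(*) AS row_count
FROM curated.daily_sales
WHERE dt = '2025-12-12';
```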

Common AWS tooling:

  • CloudWatch (metrics/logs/alarms, Logs Insights)
  • CloudTrail (API calls; audit)
  • Macie (PII discovery; policy violations)

9) Data quality (Domain 3)

Data quality dimensions (memorize)

| Dimension | Example check |
| --- | --- |
| Completeness | Required fields not null |
| Consistency | Same `customer_id` format across sources |
| Accuracy | Values within expected ranges |
| Integrity | Valid foreign keys / referential relationships |

High-yield pattern: run checks in-pipeline, quarantine bad records, and alert.
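A minimal sketch of two in-pipeline checks as Athena queries, reusing the `curated.daily_sales` example (the range bounds are illustrative):

```sql
-- Completeness: required field must not be null.
SELECT COUNT(*) AS null_customer_ids
FROM curated.daily_sales
WHERE dt = '2025-12-12'
  AND customer_id IS NULL;

-- Accuracy: values within an expected range (bounds are illustrative).
SELECT COUNT(*) AS out_of_range_totals
FROM curated.daily_sales
WHERE dt = '2025-12-12'
  AND (total < 0 OR total > 1000000);
```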


10) Security and governance (Domain 4)

Lake Formation (why it’s a big deal)

Lake Formation helps you manage fine-grained permissions for data in S3 across engines like Athena/EMR/Redshift Spectrum, using a consistent governance model.

Encryption and key points

  • Prefer SSE-KMS for S3 and service-level encryption for analytics services.
  • Use TLS for encryption in transit.
  • Don’t log secrets or raw PII; keep logs access-controlled.

Audit readiness checklist

  • CloudTrail enabled and centralized (optionally CloudTrail Lake for queries)
  • CloudWatch Logs retention + encryption set
  • Access to logs is restricted (separation of duties)
  • Data sharing has explicit approvals and is traceable

Next steps

  • Use the Syllabus as your checklist.
  • Use Practice to drill weak tasks fast.
  • Use the Study Plan if you want a 30/60/90‑day schedule.