High-signal DEA-C01 reference: ingestion patterns (batch/stream/CDC), ETL and orchestration choices, S3 data lakes + Lake Formation governance, Glue Catalog + partitions, Redshift/Athena analytics trade-offs, monitoring/data quality, and security/privacy controls.
Keep this page open while drilling questions. DEA‑C01 rewards “production data platform realism”: correct service selection, replayability/backfills, partitioning/file formats, monitoring and data quality, and governance-by-default.
| Item | Value |
|---|---|
| Questions | 65 (multiple-choice + multiple-response) |
| Time | 130 minutes |
| Passing score | 720 (scaled 100–1000) |
| Cost | 150 USD |
| Domains | D1 Data Ingestion and Transformation 34% • D2 Data Store Management 26% • D3 Data Operations and Support 22% • D4 Data Security and Governance 18% |
```mermaid
flowchart LR
  SRC["Sources<br/>(SaaS, DBs, apps, streams)"] --> ING["Ingest<br/>(DMS, AppFlow, Kinesis, MSK)"]
  ING --> RAW["S3 data lake<br/>(raw/bronze)"]
  RAW --> ETL["Transform<br/>(Glue, EMR, Lambda)"]
  ETL --> CUR["S3 curated<br/>(silver/gold)"]
  CUR --> CAT["Glue Data Catalog"]
  CAT --> ATH["Athena<br/>(serverless SQL)"]
  CUR --> RS["Redshift<br/>(warehouse)"]
  ATH --> BI["QuickSight / BI"]
  RS --> BI
  CUR --> GOV["Lake Formation<br/>(permissions)"]
  ING --> ORCH["Orchestrate<br/>(MWAA, Step Functions, EventBridge)"]
  ORCH --> ETL
  MON["Monitor + audit<br/>(CloudWatch, CloudTrail, Macie)"] --> ORCH
  MON --> RS
  MON --> ATH
```
High-yield framing: DEA‑C01 is about the pipeline + platform, not just one service.
| Pattern | Best for | Typical AWS answers | Common gotcha |
|---|---|---|---|
| Batch | Daily/hourly loads, predictable schedules | S3 landing + Glue/EMR; EventBridge schedule; AppFlow | Backfills + late data handling |
| Streaming | Near-real-time events | Kinesis Data Streams; MSK; (optional) Flink | Ordering, retries, consumer lag |
| CDC (change data capture) | Database replication | AWS DMS | Exactly-once isn’t guaranteed; handle duplicates (dedup sketch below) |
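Because DMS can deliver the same change more than once, a common pattern is last-write-wins dedup downstream. A minimal Athena-style sketch; `raw.orders_cdc`, `order_id`, and `updated_at` are illustrative names, not anything DMS mandates:

```sql
-- Keep only the newest change record per key (last-write-wins dedup).
-- raw.orders_cdc, order_id, and updated_at are illustrative names.
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM raw.orders_cdc
) latest
WHERE rn = 1;
```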
| Need | Typical best-fit |
|---|---|
| Run every N minutes | EventBridge schedule |
| Run when file arrives in S3 | S3 event notifications or EventBridge |
| Complex dependencies + retries | MWAA or Step Functions |
| You need… | Best-fit (typical) | Why |
|---|---|---|
| Managed Spark ETL with less ops | AWS Glue | Serverless-ish ETL + integrations |
| Full control over Spark (big jobs) | Amazon EMR | More knobs/control; long-running clusters optional |
| Lightweight transforms or glue code | AWS Lambda | Event-driven, simple steps |
| SQL transforms close to the warehouse | Amazon Redshift | Push compute to the warehouse when appropriate |
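For the “SQL transforms close to the warehouse” row, one hedged sketch is a Redshift materialized view; `sales` and `mv_daily_sales` are illustrative names:

```sql
-- Precompute a dashboard aggregate inside Redshift; AUTO REFRESH keeps it current.
-- sales and mv_daily_sales are illustrative names.
CREATE MATERIALIZED VIEW mv_daily_sales
AUTO REFRESH YES
AS
SELECT sale_date, customer_id, SUM(amount) AS total
FROM sales
GROUP BY sale_date, customer_id;
```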
| Approach | When it’s best | Risk |
|---|---|---|
| Glue crawler | Fast discovery, unknown schemas | Schema drift surprises |
| Explicit DDL | Strong contracts | More manual maintenance |
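What an explicit-DDL contract can look like in Athena; the bucket path and columns are illustrative, shaped to match the `curated.events` table queried later on this page:

```sql
-- Explicit schema contract for an S3-backed table (no crawler involved).
-- Bucket path and columns are illustrative.
CREATE EXTERNAL TABLE curated.events (
  customer_id string,
  event_type  string,
  amount      double
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake/curated/events/';
```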
High-yield rule: keep partitions in sync (MSCK REPAIR / partition projection / crawler updates), or queries “miss” new data.
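Two hedged ways to do that sync in Athena, continuing the illustrative table above (partition projection instead avoids the sync entirely by computing partitions from table properties):

```sql
-- Option 1: scan S3 and register any Hive-style partitions the catalog is missing.
MSCK REPAIR TABLE curated.events;

-- Option 2: register one partition explicitly (cheaper for a single daily load).
ALTER TABLE curated.events
  ADD IF NOT EXISTS PARTITION (dt = '2025-12-12')
  LOCATION 's3://my-data-lake/curated/events/dt=2025-12-12/';
```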
| You need… | Best-fit | Why |
|---|---|---|
| Ad hoc SQL on S3 | Athena | Serverless, pay per scan |
| High concurrency BI dashboards | Redshift | Warehouse optimization + caching |
| Query S3 from Redshift | Redshift Spectrum | External tables on S3 |
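A minimal Spectrum sketch, assuming a Glue database named `curated`; the IAM role ARN is a placeholder:

```sql
-- Map a Glue Data Catalog database into Redshift as an external schema.
-- The role ARN is a placeholder.
CREATE EXTERNAL SCHEMA spectrum_curated
FROM DATA CATALOG
DATABASE 'curated'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

-- S3-backed tables are now queryable alongside local Redshift tables.
SELECT dt, COUNT(*) AS events
FROM spectrum_curated.events
GROUP BY dt;
```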
Use `COPY` from S3 for fast, parallel loads (columnar-friendly) and `UNLOAD` to export query results back to S3; a sketch of both follows the partition-filter example below. If your table is partitioned by `dt`, always filter on it:
```sql
SELECT *
FROM curated.events
WHERE dt = '2025-12-12'
  AND event_type = 'purchase';
```
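A hedged `COPY`/`UNLOAD` round trip in Redshift; the bucket, table, and role names are placeholders:

```sql
-- Parallel load from S3 into a Redshift table; Parquet stays columnar end to end.
COPY sales
FROM 's3://my-data-lake/curated/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;

-- Export query results back to S3 for downstream consumers.
UNLOAD ('SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id')
TO 's3://my-data-lake/exports/sales_totals_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
```

To create a partitioned, Parquet-backed table from query results, use Athena CTAS (note that partition columns must come last in the SELECT):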
```sql
CREATE TABLE curated.daily_sales
WITH (format = 'PARQUET', partitioned_by = ARRAY['dt'])
AS
SELECT customer_id, SUM(amount) AS total, dt  -- partition column must be last in a CTAS SELECT
FROM raw.sales
GROUP BY dt, customer_id;
```
| You need… | Best-fit | Why |
|---|---|---|
| DAGs, complex dependencies, retries | MWAA (Airflow) | Mature DAG patterns |
| Serverless state machine orchestration | Step Functions | Visual state, retries, integration patterns |
```mermaid
flowchart LR
  E["EventBridge schedule"] --> W["Workflow start"]
  W --> I["Ingest"]
  I --> V{"Valid?"}
  V -->|yes| T["Transform"]
  V -->|no| Q["Quarantine + alert"]
  T --> C["Catalog/partitions update"]
  C --> P["Publish dataset"]
  P --> N["Notify (SNS)"]
```
High-yield reliability rules:

- Make every step idempotent so retries and backfills are safe to re-run.
- Quarantine bad records and alert (as in the flow above) instead of failing the whole run silently.
- Keep sources replayable (S3 objects, stream retention) so you can backfill after a fix.
Common AWS tooling: AWS Glue Data Quality for rule-based checks, CloudWatch metrics/alarms for detection, and SNS for alerting.
| Dimension | Example check |
|---|---|
| Completeness | Required fields not null |
| Consistency | Same customer_id format across sources |
| Accuracy | Values within expected ranges |
| Integrity | Valid foreign keys / referential relationships |
High-yield pattern: run checks in-pipeline, quarantine bad records, and alert.
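An in-pipeline check sketched in plain SQL; the quarantine table, source table, and thresholds are illustrative assumptions (Glue Data Quality offers managed rules for the same checks):

```sql
-- Route rows that fail completeness or accuracy checks into a quarantine table,
-- then alert (e.g., via SNS) if anything landed there for this run.
INSERT INTO curated.events_quarantine
SELECT *
FROM raw.events
WHERE dt = '2025-12-12'
  AND (customer_id IS NULL      -- completeness: required field missing
       OR amount < 0            -- accuracy: value outside expected range
       OR amount > 100000);
```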
Lake Formation manages fine-grained permissions (database, table, and column level) for data in S3, enforced consistently across engines like Athena, EMR, and Redshift Spectrum.