Exam identity and quick-use approach
This Quick Reference supports independent preparation for the AWS Certified Data Engineer – Associate (DEA-C01) exam. Use it as a compact decision guide for scenario questions: identify the data source, ingestion pattern, transformation need, storage/query target, governance model, and operational concern.
High-yield DEA-C01 thinking pattern:
- Ingest: batch files, database CDC, SaaS, stream, events, or messages.
- Store: Amazon S3 data lake, Amazon Redshift warehouse, operational database, or search/time-series target.
- Catalog and govern: AWS Glue Data Catalog, AWS Lake Formation, IAM, AWS KMS.
- Transform: AWS Glue, Amazon EMR, AWS Lambda, Amazon Managed Service for Apache Flink, Athena CTAS, or Redshift SQL.
- Operate: monitor, retry, validate, secure, optimize, and troubleshoot.
Core AWS data engineering architecture
```mermaid
flowchart LR
    A[Sources: databases, SaaS, files, apps, streams] --> B[Ingestion: DMS, DataSync, AppFlow, Kinesis, Firehose, MSK]
    B --> C[S3 raw zone]
    C --> D[Catalog: AWS Glue Data Catalog]
    C --> E[Transform: AWS Glue, EMR, Lambda, Athena CTAS]
    E --> F[S3 curated zone]
    F --> G[Query: Athena, Redshift Spectrum, EMR, QuickSight]
    E --> H[Warehouse: Amazon Redshift]
    D --> I[Governance: Lake Formation, IAM, KMS]
    G --> J[Consumers]
    H --> J
    B --> K[Ops: CloudWatch, CloudTrail, EventBridge]
    E --> K
    H --> K
```
Service selection matrix
| Need in the scenario | Usually choose | Why | Watch for |
|---|---|---|---|
| Durable object storage for a data lake | Amazon S3 | Scalable object storage, integrates with Glue, Athena, Redshift Spectrum, EMR | S3 is not a relational database; design prefixes, partitions, and file sizes |
| Metadata catalog for S3 tables | AWS Glue Data Catalog | Central table, schema, partition metadata for analytics services | Catalog stores metadata, not the data itself |
| Serverless SQL over S3 | Amazon Athena | Ad hoc SQL using Glue Data Catalog | Query cost/performance depends heavily on scanned data, partitions, and formats |
| Managed Spark ETL | AWS Glue ETL | Serverless distributed transformations, crawlers, jobs, workflows | Tune partitioning, file sizes, pushdown, and job bookmarks |
| Big data frameworks with more cluster control | Amazon EMR | Spark, Hive, Presto/Trino, Hudi/Iceberg workloads with configurable clusters | More operational responsibility than Glue |
| Cloud data warehouse | Amazon Redshift | High-performance analytics, SQL warehouse, COPY/UNLOAD, Spectrum | Model distribution, sort, workload, and external table scans |
| Database migration and CDC | AWS Database Migration Service (AWS DMS) | Full load plus ongoing change replication | DMS is not a general-purpose ETL engine |
| File transfer from on-premises storage to S3 | AWS DataSync | Managed transfer for file/object storage migrations and recurring sync | Needs network connectivity and correct IAM/S3/KMS permissions |
| Managed SFTP/FTPS/FTP endpoint | AWS Transfer Family | External partners exchange files into S3 or Amazon EFS | Do not confuse with DataSync migration/sync |
| SaaS application data ingestion | Amazon AppFlow | Managed SaaS-to-AWS flows | Best for supported SaaS connectors, not arbitrary streaming apps |
| Custom real-time stream consumers | Amazon Kinesis Data Streams | Ordered records per partition key, replay, custom consumers | Partition key design affects ordering and hot shards |
| Managed stream delivery to S3/Redshift/OpenSearch | Amazon Data Firehose | Minimal-code delivery pipeline with buffering and optional transform | Not for multiple custom replayable consumers |
| Kafka-compatible streaming | Amazon MSK | Managed Apache Kafka ecosystem compatibility | Choose when Kafka APIs/tools are required |
| Stateful streaming analytics | Amazon Managed Service for Apache Flink | Windowing, state, joins, event-time processing | More suitable than Lambda for complex stream computation |
| Event routing across AWS services/apps | Amazon EventBridge | Event bus, rules, schedules, SaaS events | Not a high-throughput analytics stream replacement |
| Queue-based decoupling | Amazon SQS | Durable message queue for workers | Not intended for replayable analytics streams |
| Short event-driven transformation | AWS Lambda | Lightweight transform, validation, routing | Avoid for large distributed ETL |
| Multi-step orchestration with branching/retries | AWS Step Functions | Coordinates services, handles state, retries, error paths | Prefer over ad hoc scripts for resilient workflows |
| Airflow DAG compatibility | Amazon Managed Workflows for Apache Airflow (MWAA) | Managed Apache Airflow | Choose when Airflow is a requirement |
| Fine-grained data lake governance | AWS Lake Formation | Table-, column-, and row-level access control, plus tag-based grants with LF-Tags | IAM/S3/KMS permissions still matter |
| Encryption key control | AWS Key Management Service (AWS KMS) | Customer managed keys, key policies, grants | Cross-account access needs key policy alignment |
| Secrets for connections | AWS Secrets Manager | Rotatable database/API credentials | Do not hard-code credentials in Glue jobs or notebooks |
| Logs, metrics, alarms | Amazon CloudWatch | Operational monitoring | Know service-specific metrics and log locations |
| API audit history | AWS CloudTrail | Tracks management events and optional data events | CloudTrail is audit, not performance monitoring |
Scenario keyword shortcuts
| Scenario phrase | Strong candidate answer |
|---|---|
| “Run SQL directly on files in S3” | Athena |
| “Catalog S3 data for Athena/Redshift Spectrum/Glue” | AWS Glue Data Catalog |
| “Discover schema and partitions automatically” | AWS Glue crawler |
| “Fine-grained access to data lake tables and columns” | Lake Formation |
| “Move on-premises NFS/SMB data to S3 repeatedly” | DataSync |
| “Partners upload files through SFTP” | AWS Transfer Family |
| “Replicate relational database changes continuously” | AWS DMS with CDC |
| “Ingest supported SaaS data without custom connector code” | Amazon AppFlow |
| “Custom applications need replayable stream records” | Kinesis Data Streams |
| “Deliver streaming data to S3 with minimal management” | Data Firehose |
| “Existing Kafka producers and consumers” | Amazon MSK |
| “Streaming joins, windows, and stateful processing” | Managed Service for Apache Flink |
| “Complex workflow with retries and branches” | Step Functions |
| “Existing Airflow DAGs” | MWAA |
| “Reduce Athena scanned data” | Parquet/ORC, compression, partitioning, column pruning |
| “Exact set of files for Redshift load” | COPY with manifest |
| “Query S3 from Redshift” | Redshift Spectrum external tables |
Data ingestion reference
Batch, file, and database ingestion
| Source or requirement | Choose | Pattern | Common trap |
|---|---|---|---|
| On-premises file shares or object stores | DataSync | Schedule transfer into S3 raw prefixes | Transfer service does not transform business logic |
| External users send files over SFTP/FTPS/FTP | AWS Transfer Family | Land files in S3, trigger downstream workflow | Transfer Family is for protocol access, not storage analytics |
| RDBMS full load to S3/Redshift | AWS DMS | Source endpoint, target endpoint, replication task | Validate data types, constraints, and permissions |
| RDBMS ongoing changes | AWS DMS CDC | Full load plus change data capture to target | Monitor replication lag and source log retention |
| SaaS data | Amazon AppFlow | Flow from SaaS connector to S3/Redshift/Salesforce targets as supported | Check connector and field mapping support |
| AWS service events | EventBridge | Rule routes event to target such as Lambda or Step Functions | EventBridge is not a bulk ETL engine |
| Application-generated files | S3 direct upload or SDK | Write to raw zone with event notification | Use idempotent object naming and downstream deduplication |
Streaming ingestion
| Requirement | Choose | Design notes |
|---|---|---|
| Multiple custom consumers need independent processing | Kinesis Data Streams | Consumers read from stream; partition key controls ordering and load |
| Delivery to S3, Redshift, OpenSearch, or third-party endpoint with low code | Data Firehose | Configure destination, buffering, optional Lambda transform, backup for failures |
| Kafka API compatibility | Amazon MSK | Use Kafka producers/consumers, topics, partitions, consumer groups |
| SQL/window/stateful analytics on streams | Managed Service for Apache Flink | Use event-time processing, windows, joins, state, checkpoints |
| Message queue for asynchronous workers | SQS | Worker decoupling, not analytics replay |
| Event bus integration | EventBridge | Event routing, schedules, SaaS events, cross-account event patterns |
Ingestion design rules
- Separate raw and curated data. Land immutable source data first, then transform.
- Make ingestion idempotent. Assume retries can create duplicate deliveries.
- Capture metadata: source, ingestion timestamp, schema version, batch ID, and lineage.
- Use partition-friendly timestamps. Common S3 partitions include date or hour, but avoid creating excessive tiny partitions.
- Validate early. Use data quality checks before promoting data into curated zones.
- Monitor lag and failures. Streams, DMS tasks, Firehose delivery, and Glue jobs all expose operational signals.
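The idempotency and metadata rules above can be sketched in plain Python. This is a minimal illustration, not an AWS API: the `store` dict stands in for an S3 bucket, and `object_key`, `land`, the key layout, and the `schema_version` value are all hypothetical names chosen for the example.

```python
import hashlib
import json


def object_key(source: str, batch_id: str, ingest_date: str, payload: bytes) -> str:
    """Build a deterministic raw-zone key so a retried delivery maps to the same object."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"raw/source={source}/ingest_date={ingest_date}/batch={batch_id}-{digest}.json"


def land(store: dict, source: str, batch_id: str, ingest_date: str, records: list) -> str:
    """Write a batch plus lineage metadata; re-landing the same batch is a no-op."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    key = object_key(source, batch_id, ingest_date, payload)
    if key not in store:  # a retry finds the key already present and changes nothing
        store[key] = {
            "body": payload,
            "metadata": {"source": source, "batch_id": batch_id, "schema_version": "1"},
        }
    return key


store = {}  # stand-in for an S3 bucket (assumption for this sketch)
batch = [{"order_id": 1, "amount": 10.0}]
first = land(store, "orders", "b-001", "2026-06-18", batch)
retry = land(store, "orders", "b-001", "2026-06-18", batch)
print(first == retry, len(store))  # prints: True 1
```

Because the key is derived from the batch ID and a content hash, a duplicate delivery overwrites nothing and creates no second object, which is the property downstream deduplication relies on.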
S3 data lake design
Data lake zones
| Zone | Purpose | Typical properties |
|---|---|---|
| Raw / bronze | Immutable copy of source data | Source format, append-only, tightly restricted write access |
| Staging / silver | Cleaned, normalized, deduplicated | Standardized schema, data quality checks, partitioned layout |
| Curated / gold | Analytics-ready products | Columnar formats, business dimensions, governed access |
| Sandbox | Exploration and temporary outputs | Lifecycle policies, limited permissions, not production source of truth |
| Audit / quarantine | Failed or suspicious records | Preserve rejected records with error reason and batch metadata |
File formats
| Format | Best for | Avoid when | Exam notes |
|---|---|---|---|
| CSV | Simple exchange, human-readable exports | Large analytics scans, nested structures, strict typing | Easy but inefficient for Athena/Redshift Spectrum |
| JSON | Semi-structured events | Heavy repeated scans without conversion | Convert to columnar for curated analytics |
| Avro | Row-oriented data with schema evolution, streaming ecosystems | Pure SQL scan optimization is the main goal | Often used in streaming pipelines |
| Parquet | Columnar analytics on S3 | Frequent single-row updates without table format support | High-yield choice for Athena, Glue, Redshift Spectrum |
| ORC | Columnar analytics, Hive-style ecosystems | Tooling standardizes on Parquet | Similar exam value to Parquet |
| Apache Iceberg table | S3 lakehouse tables needing ACID-style operations and schema evolution | Simple immutable append-only files are enough | Useful for governed, evolving analytical tables |
Partitioning and layout rules
| Rule | Why it matters |
|---|---|
| Partition by common filter columns | Enables partition pruning in Athena, Glue, EMR, and Spectrum |
| Avoid overly high-cardinality partitions | Too many small partitions can hurt planning and metadata operations |
| Avoid many tiny files | Distributed engines spend too much time opening files instead of scanning data |
| Use columnar compression | Reduces scanned bytes and improves query performance |
| Keep partition naming consistent | Hive-style paths such as dt=2026-06-18/ work well with many tools |
| Compact streaming outputs | Firehose and streaming jobs can create many small objects |
| Store raw data immutably | Enables replay, audit, and recovery from bad transforms |
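The "avoid many tiny files" and "compact streaming outputs" rules come down to grouping small objects into fewer, larger ones. The sketch below only plans the grouping. The 128 MB target and the function name `plan_compaction` are assumptions for illustration, and a real job would rewrite the files with Glue, EMR, or Athena CTAS.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of at most roughly target_mb each."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current batch when the next file would push it past the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [5] * 100  # hypothetical: 100 objects of 5 MB each
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "output files")  # prints: 100 files -> 4 output files
```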
Example S3 layout:
```text
s3://company-data-lake/raw/source=salesforce/object=account/ingest_date=2026-06-18/
s3://company-data-lake/curated/domain=sales/table=orders/order_date=2026-06-18/
s3://company-data-lake/quarantine/source=orders/ingest_date=2026-06-18/
```
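Keeping partition naming consistent is easier when every writer builds prefixes through one helper. The following is a small sketch of that idea. `partition_prefix` is a hypothetical helper, not part of any AWS SDK, and it reuses the bucket name from the layout above.

```python
from datetime import date


def partition_prefix(bucket: str, zone: str, partitions: dict) -> str:
    """Return a Hive-style key=value prefix; dict order sets the partition order."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"s3://{bucket}/{zone}/{parts}/"


prefix = partition_prefix(
    "company-data-lake",
    "curated",
    {"domain": "sales", "table": "orders", "order_date": date(2026, 6, 18).isoformat()},
)
print(prefix)
# prints: s3://company-data-lake/curated/domain=sales/table=orders/order_date=2026-06-18/
```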
AWS Glue and Data Catalog reference
Glue components
| Component | What it does | Choose when |
|---|---|---|
| AWS Glue Data Catalog | Stores databases, tables, schemas, partitions, connections | Analytics services need shared metadata |
| Crawler | Infers schema and discovers partitions | Data structure is discoverable and changes need catalog updates |
| Classifier | Helps crawler interpret data | Custom formats or nonstandard records |
| Glue ETL job | Runs Spark, Python shell, or other supported job types | Transform, clean, join, and write data |
| Glue Studio | Visual job authoring | Need low-code ETL development |
| Glue workflow | Coordinates Glue crawlers, jobs, and triggers | Glue-centered pipeline orchestration |
| Glue trigger | Starts jobs/crawlers on schedule or condition | Simple Glue workflow automation |
| Glue connection | Stores connection details for JDBC, network, or marketplace connectors | Jobs need source/target connectivity |
| Glue Data Quality | Evaluates rules against datasets | Validate completeness, uniqueness, ranges, schema expectations |
| Glue Schema Registry | Manages schemas for streaming/event data | Producers and consumers need schema validation/evolution |
Crawler vs explicit schema
| Situation | Better choice |
|---|---|
| Unknown files arrive and schema must be discovered | Glue crawler |
| Production schema must be controlled and reviewed | Explicit table definition or IaC-managed catalog |
| Frequent partition additions only | Partition projection, ALTER TABLE ADD PARTITION, crawler, or repair pattern |
| Crawler creates unexpected tables | Adjust folder structure, classifiers, grouping behavior, and crawler scope |
| Sensitive columns need access control | Catalog plus Lake Formation, not crawler alone |
Glue ETL exam points
| Topic | Remember |
|---|---|
| DynamicFrame vs DataFrame | DynamicFrames help with semi-structured data and schema ambiguity; DataFrames expose standard Spark APIs |
| Job bookmarks | Track previously processed source data; still design idempotent writes |
| Pushdown predicates | Reduce source data read, especially with partitions |
| Repartition/coalesce | Manage output file count and parallelism |
| Small files | Compact to improve Athena/Spectrum/EMR performance |
| Skew | A few hot keys can slow joins and aggregations |
| Secrets | Use Secrets Manager or Glue connections, not embedded passwords |
| VPC access | Jobs accessing private databases need subnet/security group routing and access to S3/logs/secrets |
| Failure handling | Use retries, checkpoints/bookmarks, quarantine outputs, and CloudWatch logs |
Illustrative Glue PySpark pattern:
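```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read only the needed partitions; the predicate is applied to partition columns.
orders = glue_ctx.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="orders",
    push_down_predicate="ingest_date >= '2026-06-01'",
)

# Convert to a Spark DataFrame and drop duplicate orders within this run.
df = orders.toDF().dropDuplicates(["order_id"])

# Append partitioned Parquet to the curated zone. Append mode is not idempotent,
# so a rerun over the same input writes the rows again.
(
    df.write.mode("append")
    .partitionBy("order_date")
    .parquet("s3://example-data-lake/curated/orders/")
)
```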
Example AWS Glue Data Quality rule style:
```text
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "amount" >= 0,
    ColumnExists "order_date"
]
```
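To make concrete what the four rule types above check, here is a plain-Python analog over a list of dicts. It is only an illustration of the intent of each rule, not how AWS Glue Data Quality evaluates them, and its handling of nulls is an assumption that may differ from the service. The function name `evaluate` and the sample rows are hypothetical.

```python
def evaluate(rows, key="order_id", amount="amount", required="order_date"):
    """Return pass/fail per check, mirroring the four rule types above."""
    keys = [row.get(key) for row in rows]
    non_null = [k for k in keys if k is not None]
    return {
        "is_complete": len(non_null) == len(keys),          # no missing keys
        "is_unique": len(set(non_null)) == len(non_null),   # no repeated keys
        "amount_non_negative": all(
            row[amount] >= 0 for row in rows if row.get(amount) is not None
        ),
        "column_exists": all(required in row for row in rows),
    }


rows = [
    {"order_id": 1, "amount": 10.0, "order_date": "2026-06-18"},
    {"order_id": 1, "amount": -5.0, "order_date": "2026-06-18"},
    {"order_id": None, "amount": 3.0, "order_date": "2026-06-18"},
]
result = evaluate(rows)
print(result)
```

Here the batch fails completeness (one null key), uniqueness (key 1 repeats), and the range check (a negative amount), while the column-existence check passes.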
Athena reference
| Need | Athena feature or pattern |
|---|---|
| Query S3 data with SQL | External tables using Glue Data Catalog |
| Improve performance | Parquet/ORC, compression, partition pruning, avoid SELECT * |
| Create curated columnar data | CTAS or INSERT INTO from raw table |
| Control query usage | Workgroups, result locations, query settings |
| Add partitions | Crawler, ALTER TABLE ADD PARTITION, partition projection, or repair for Hive-style partitions |
| Secure data | IAM, S3 bucket policy, KMS key policy, Lake Formation permissions |
| Share governed tables | Lake Formation and catalog-based permissions where supported |
Athena CTAS pattern:
```sql
CREATE TABLE curated_orders
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['order_date'],
    external_location = 's3://example-data-lake/curated/orders/'
) AS
SELECT
    order_id,
    customer_id,
    amount,
    order_status,
    order_date
FROM raw_orders
WHERE order_date >= DATE '2026-01-01';
```
Partition repair pattern for Hive-style S3 paths:
```sql
MSCK REPAIR TABLE raw_orders;
```
Common Athena traps:
- Athena queries data in S3; it does not ingest or move data by itself.
- Catalog permissions alone are not enough if S3 or KMS denies access.
- Crawlers update metadata; they do not optimize file format or clean data.
- Partition projection can reduce partition metadata management, but the S3 path pattern must match the table definition.
- Columnar formats help most when queries select only needed columns.
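The effect of partition pruning and column pruning on scanned data can be estimated with simple arithmetic. The sketch below uses a hypothetical table and assumes a columnar format whose columns are all the same size, which real data rarely satisfies, so treat the numbers as directional only.

```python
def bytes_scanned(partitions, date_from, columns_read, total_columns):
    """Estimate bytes read after partition pruning and column pruning.

    partitions maps an ISO date string to that partition's size in bytes.
    Assumes equally sized columns in a columnar format (a simplification).
    """
    kept = [size for day, size in partitions.items() if day >= date_from]
    return sum(kept) * columns_read // total_columns


# Hypothetical table: 30 daily partitions of 1 GiB each, 20 columns.
GIB = 1024 ** 3
table = {f"2026-06-{day:02d}": GIB for day in range(1, 31)}

full_scan = bytes_scanned(table, "2026-06-01", 20, 20)  # SELECT * over the month
pruned = bytes_scanned(table, "2026-06-24", 5, 20)      # last 7 days, 5 columns
print(f"{full_scan / GIB:.2f} GiB vs {pruned / GIB:.2f} GiB")  # prints: 30.00 GiB vs 1.75 GiB
```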
Amazon Redshift reference
Redshift vs Athena vs S3 lake
| Requirement | Better fit |
|---|---|
| Ad hoc SQL on raw/curated S3 data | Athena |
| Managed warehouse with repeated BI queries and modeled tables | Redshift |
| Query S3 from warehouse without loading all data | Redshift Spectrum |
| Transform and publish warehouse data to S3 | Redshift UNLOAD |
| Load large S3 datasets into warehouse tables | Redshift COPY |
| Variable or intermittent warehouse demand | Redshift Serverless may fit |
| Stable warehouse environment with cluster-level control | Redshift provisioned may fit |
Redshift design points
| Topic | What to know for DEA-C01 |
|---|---|
| COPY | Preferred bulk load from S3, DynamoDB, EMR, or supported sources |
| UNLOAD | Writes query results from Redshift to S3 |
| Distribution style | Affects data movement during joins; AUTO can help, but know KEY/EVEN/ALL concepts |
| Sort keys | Improve range-restricted scans and joins when aligned with query patterns |
| Compression encoding | Reduces storage and I/O |
| ANALYZE | Updates statistics for the optimizer |
| VACUUM | Reclaims/sorts storage where applicable |
| Spectrum | Queries external S3 data through external schemas/tables |
| Materialized views | Precompute expensive query results when refresh strategy fits |
| Workload management | Manage query queues, priorities, and concurrency behavior |
| Federated query | Query operational databases from Redshift for specific use cases |
COPY from S3 pattern:
```sql
COPY analytics.orders
FROM 's3://example-data-lake/curated/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;
```
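When a load must cover an exact set of files rather than everything under a prefix, COPY can read a manifest instead. The sketch below builds one in Python. The object names and sizes are hypothetical, and the `meta.content_length` entry is included because COPY from columnar formats such as Parquet expects it in the manifest.

```python
import json


def build_manifest(objects):
    """objects: iterable of (s3_url, size_in_bytes) pairs to load, and nothing else."""
    return {
        "entries": [
            {"url": url, "mandatory": True, "meta": {"content_length": size}}
            for url, size in objects
        ]
    }


# Hypothetical object names and sizes for illustration.
manifest = build_manifest([
    ("s3://example-data-lake/curated/orders/part-0000.parquet", 1048576),
    ("s3://example-data-lake/curated/orders/part-0001.parquet", 2097152),
])
print(json.dumps(manifest, indent=2))
```

The resulting JSON would be uploaded to S3, and the COPY statement would then point at the manifest object and add the `MANIFEST` keyword. Setting `mandatory` to true makes the load fail if a listed file is missing.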
UNLOAD to S3 pattern:
```sql
UNLOAD ('SELECT order_date, SUM(amount) AS revenue FROM analytics.orders GROUP BY order_date')
TO 's3://example-data-lake/exports/revenue/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
```
Redshift troubleshooting shortcuts:
| Symptom | Check first |
|---|---|
| COPY fails | IAM role, S3 path, KMS access, file format, load error views |
| Query slow after load | Statistics, sort/distribution, skew, scanned external data |
| External table query slow | S3 file format, partitioning, file sizes, Spectrum pruning |
| Access denied to S3 | Redshift role policy, bucket policy, KMS key policy |
| BI workload contention | Workload management, query design, materialized views |
Streaming and event processing reference
Kinesis, Firehose, MSK, Flink
| Dimension | Kinesis Data Streams | Data Firehose | Amazon MSK | Managed Service for Apache Flink |
|---|---|---|---|---|
| Primary role | Durable stream for custom apps | Managed delivery stream | Managed Kafka | Stateful stream processing |
| Consumers | Custom consumers | Destination delivery, optional transform | Kafka consumers | Flink application |
| Replay | Yes, within configured retention | Not a replay stream for custom consumers | Kafka retention model | Reads from stream sources |
| Ordering | Per partition key/shard | Not the main design feature | Per Kafka partition | Depends on source partitioning and app logic |
| Operations | Stream capacity and consumer design | Destination and delivery config | Kafka cluster/topic/client design | Application state, checkpoints, parallelism |
| Choose when | You need custom stream processing | You need easy delivery to storage/search/warehouse | You need Kafka compatibility | You need windows, joins, state, event-time logic |
Streaming design traps
- Partition keys determine ordering and load distribution. Bad keys create hot partitions.
- Assume at-least-once delivery in many pipelines; design deduplication and idempotent sinks.
- Firehose buffering means it is usually not the answer for the lowest-latency custom consumer requirement.
- Lambda works for lightweight stream processing, not complex stateful analytics.
- Use dead-letter, backup, or quarantine patterns for failed records.
- Monitor consumer lag, delivery failures, throttling, and error logs.
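The first trap above, hot partitions caused by poor keys, can be simulated locally. Kinesis Data Streams hashes each partition key with MD5 into a 128-bit value and routes the record to the shard that owns that hash range. The sketch assumes the shards split the range evenly, which is not guaranteed after resharding, and the tenant and order keys are hypothetical workloads.

```python
import hashlib
from collections import Counter


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a key to a shard, assuming shards split the 128-bit MD5 range evenly."""
    hashed = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    return hashed * shard_count // 2 ** 128


def shard_load(keys, shard_count):
    """Count records per shard to expose hot shards."""
    return Counter(shard_for(key, shard_count) for key in keys)


# Hypothetical workloads: one dominant tenant versus per-order keys.
skewed = ["tenant-1"] * 900 + [f"tenant-{i}" for i in range(2, 102)]
balanced = [f"order-{i}" for i in range(1000)]

print("skewed max share:", max(shard_load(skewed, 4).values()) / len(skewed))
print("balanced max share:", max(shard_load(balanced, 4).values()) / len(balanced))
```

All records for one key land on one shard, which preserves per-key ordering but means a dominant key caps throughput at a single shard's limit. Higher-cardinality keys spread load at the cost of ordering across keys.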
Transformation and orchestration reference
| Requirement | Choose | Why |
|---|---|---|
| Distributed ETL over large S3 data | AWS Glue Spark job | Serverless managed Spark |
| Complex big data stack or custom libraries/configuration | Amazon EMR | More control over cluster/runtime |
| Lightweight event transform | Lambda | Simple code on event triggers |
| SQL transformation over S3 | Athena CTAS/INSERT | Serverless SQL pipeline step |
| SQL transformation inside warehouse | Redshift SQL, stored procedures, materialized views | Keep warehouse transformations close to modeled data |
| Stateful stream transformation | Managed Service for Apache Flink | Windows, joins, state |
| Glue-only pipeline | Glue workflows and triggers | Native Glue orchestration |
| Multi-service workflow | Step Functions | Branches, retries, service integrations |
| Airflow DAG requirement | MWAA | Managed Airflow compatibility |
| Time-based event trigger | EventBridge Scheduler or rules | Schedules pipeline starts |
Orchestration exam distinctions
| If the question emphasizes | Prefer |
|---|---|
| Retry policies, branching, human-readable state machine, AWS SDK integrations | Step Functions |
| Existing Airflow DAGs and operators | MWAA |
| Simple scheduled Glue job | Glue trigger or EventBridge schedule |
| Event from S3 starts processing | S3 event notification to Lambda/EventBridge/queue, then orchestrate |
| Decoupling ingestion from processing | SQS between producer and worker, or stream where replay/order is needed |
| Failure notification | EventBridge rule, CloudWatch alarm, SNS notification |
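The retry, branching, and failure-notification rows above map onto fields of an Amazon States Language definition. The sketch below builds one as a Python dict. The job name, topic ARN, state names, and retry numbers are hypothetical, and the definition is only an outline of the pattern, not a production workflow.

```python
import json


def glue_pipeline_definition(job_name: str, alert_topic_arn: str) -> dict:
    """Build a state machine definition: run a Glue job, retry, then alert on failure."""
    return {
        "Comment": "Sketch: Glue job with retries and a failure notification",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Retry": [{
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                # After retries are exhausted, branch to the notification state.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {"TopicArn": alert_topic_arn, "Message": "Glue job failed"},
                "Next": "Failed",
            },
            "Failed": {"Type": "Fail", "Error": "GlueJobFailed"},
        },
    }


def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait before each retry: the interval grows by backoff_rate per attempt."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


definition = glue_pipeline_definition(
    "orders-etl", "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"
)
print(json.dumps(definition, indent=2))
print(retry_delays(30, 3, 2.0))  # prints: [30.0, 60.0, 120.0]
```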
Security, governance, and access control
Data access control layers
| Layer | Controls | Common exam point |
|---|---|---|
| IAM identity policy | What principals can call AWS APIs | Required but may not be sufficient alone |
| S3 bucket/access point policy | Resource-level access to objects | Needed for cross-account and centralized lake patterns |
| S3 Block Public Access | Prevents public exposure | Keep enabled unless a specific approved public pattern exists |
| KMS key policy/grants | Who can use encryption keys | Access fails if IAM allows S3 but KMS denies decrypt |
| Lake Formation permissions | Table-, column-, and row-level access to cataloged data lake tables | Use for fine-grained analytics permissions |
| Glue Data Catalog resource policy | Catalog-level sharing/access | Often relevant in cross-account catalogs |
| Secrets Manager | Stores database/API credentials | Use with Glue connections and jobs |
| VPC security groups/routes/endpoints | Network path to private sources and AWS APIs | Glue/DMS in VPC often need S3 and service endpoint access |
| CloudTrail | Audit of API activity | Enable relevant data events when object-level audit is needed |
| Amazon Macie | Sensitive data discovery in S3 | Helps identify PII/sensitive data exposure |
Governance decision table
| Requirement | Strong answer |
|---|---|
| Analysts can query only approved columns | Lake Formation column permissions or governed views |
| Grant data lake access by business tags | Lake Formation LF-Tags |
| Encrypt S3 objects with customer-managed key | SSE-KMS with proper key policy |
| Cross-account Athena access to encrypted S3 data | Align Lake Formation/catalog, S3 bucket policy, IAM, and KMS key policy |
| Store JDBC password for Glue | Secrets Manager or Glue connection using a secret |
| Audit who changed Glue table definitions | CloudTrail management events |
| Detect sensitive data in S3 | Macie |
| Keep Glue job traffic private to AWS services | VPC endpoints and correct routing/security groups |
| Prevent accidental public S3 lake access | S3 Block Public Access plus least-privilege bucket policies |
Least-privilege reminders
- Grant jobs only the S3 prefixes, catalog databases/tables, KMS keys, and logs they need.
- Separate roles for ingestion, transformation, catalog administration, and consumption.
- For cross-account sharing, check all layers: IAM, resource policy, Lake Formation, S3, and KMS.
- Avoid embedding credentials in scripts, notebooks, job parameters, or environment variables when Secrets Manager is appropriate.
- Use encryption in transit and at rest for data pipelines.
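The first reminder, granting a job only the prefixes and keys it needs, can be sketched as an IAM identity policy built in Python. The bucket, prefix, key ARN, and statement IDs are hypothetical. The policy covers only the identity side, so the bucket policy, KMS key policy, and any Lake Formation grants still have to allow the same access.

```python
import json


def job_policy(bucket: str, prefix: str, kms_key_arn: str) -> dict:
    """Read-only access to one S3 prefix plus decrypt on one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyThePrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "DecryptWithLakeKey",
                "Effect": "Allow",
                "Action": "kms:Decrypt",
                "Resource": kms_key_arn,
            },
        ],
    }


# Hypothetical bucket, prefix, and key ARN for illustration.
policy = job_policy(
    "example-data-lake",
    "curated/orders",
    "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555",
)
print(json.dumps(policy, indent=2))
```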
Operations, monitoring, and troubleshooting
Service signals
| Service | Monitor |
|---|---|
| Glue jobs | Job run status, CloudWatch logs, errors, duration, data skew symptoms, output file count |
| Glue crawlers | Crawler run status, schema changes, tables created, partition discovery |
| Athena | Query failures, scanned data, workgroup settings, result location, permissions |
| Redshift | Query performance, load errors, disk/storage pressure, WLM queues, system views |
| DMS | Task status, table statistics, replication lag, task logs |
| Kinesis Data Streams | Write/read throttling, iterator age or consumer lag, hot partitions |
| Firehose | Delivery success/failure, transformation errors, backup records |
| MSK | Broker health, topic throughput, consumer lag |
| Flink | Checkpoints, application health, lag, failed records |
| S3 | Object creation, replication/lifecycle status, access errors |
| Lake Formation | Grant changes, denied access, LF-Tag policy alignment |
| KMS | Key access denied, disabled key, missing cross-account permissions |
| Step Functions | Failed states, retries, execution history |
| EventBridge | Rule matches, target invocation failures, dead-letter targets |
Troubleshooting table
| Symptom | Likely causes | Fast checks or fixes |
|---|---|---|
| Athena scans too much data | Row-oriented format (CSV/JSON), no partitions, no column pruning | Convert to Parquet/ORC, partition by filters, avoid SELECT * |
| Athena access denied | IAM, S3, KMS, Lake Formation, workgroup result location | Test each permission layer |
| Glue job out of memory or slow | Skew, tiny files, wide shuffle, no predicate pushdown | Repartition, compact, filter early, tune joins |
| Glue job reprocesses data | Bookmark disabled/reset, changed source path, non-idempotent writes | Enable bookmarks where useful, deduplicate, design idempotent outputs |
| Crawler creates many tables | Folder structure or classifier mismatch | Narrow crawler scope, fix path layout, adjust classifiers |
| DMS CDC lag grows | Source log pressure, target bottleneck, network, task config | Check task logs, table stats, target capacity |
| Firehose delivery fails | Destination permission, KMS, transform error, schema conversion issue | Inspect error logs and backup prefix |
| Kinesis consumer falls behind | Hot shard, slow consumer, insufficient parallelism | Improve partition key, scale stream/consumer design |
| Redshift COPY errors | Bad file format, role/KMS issue, incompatible schema | Check load error views and S3 object format |
| Redshift query slow | Poor distribution/sort, stale stats, queue contention | Analyze tables, review plan, tune WLM and table design |
| Lake Formation denies query | Missing LF grant, IAM mismatch, location not registered | Check data location registration and table permissions |
| Glue job cannot reach S3 from VPC | Missing route or VPC endpoint | Add S3 access path and service endpoints as needed |
| S3 event does not start pipeline | Notification filter mismatch, target policy, unsupported event path | Validate prefix/suffix, destination permissions, event pattern |
Data quality, schema, and reliability
| Concern | AWS pattern |
|---|---|
| Validate completeness, uniqueness, ranges | AWS Glue Data Quality rules |
| Enforce streaming schema compatibility | AWS Glue Schema Registry |
| Preserve bad records | Quarantine S3 prefix with error metadata |
| Prevent duplicate records | Idempotent keys, deduplication step, deterministic output paths |
| Handle schema drift | Crawler review, explicit schema management, compatible schema evolution |
| Track lineage | Store batch IDs, source metadata, job run IDs, and catalog versions |
| Recover from bad transform | Reprocess raw immutable data into corrected curated zone |
| Promote trusted datasets | Raw to curated pipeline with quality gates |
Reliability checklist:
- Can the pipeline be safely retried?
- Are failed records preserved?
- Is raw source data immutable?
- Are schema changes detected before breaking consumers?
- Are job failures visible through alarms or events?
- Are downstream writes atomic for the service and format used, so readers never see a partially written batch?
- Are permissions least-privilege but sufficient across IAM, S3, KMS, and Lake Formation?
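The "are failed records preserved?" question is usually answered with a quality gate that routes rejects to quarantine instead of dropping them. Below is a minimal record-level sketch. The validation rules, field names, and the `validate` and `promote` helpers are assumptions for illustration, and a real pipeline would write the two outputs to curated and quarantine prefixes.

```python
def validate(record):
    """Return an error reason, or None when the record passes."""
    if record.get("order_id") is None:
        return "missing order_id"
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        return "invalid amount"
    return None


def promote(records, batch_id):
    """Split a batch into curated rows and quarantined rows with error metadata."""
    curated, quarantined = [], []
    for record in records:
        reason = validate(record)
        if reason is None:
            curated.append(record)
        else:
            # Keep the rejected record intact, with why and where it failed.
            quarantined.append({"record": record, "error": reason, "batch_id": batch_id})
    return curated, quarantined


batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 4.0},
    {"order_id": 3, "amount": -2.0},
]
curated, quarantined = promote(batch, "b-001")
print(len(curated), [item["error"] for item in quarantined])
# prints: 1 ['missing order_id', 'invalid amount']
```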
Performance and cost optimization
| Area | Optimize by |
|---|---|
| Athena | Columnar formats, compression, partitions, workgroups, CTAS for repeated transforms |
| S3 lake | Lifecycle policies, compact files, avoid unnecessary copies, design prefixes logically |
| Glue | Filter early, avoid shuffles, use bookmarks, write partitioned columnar output |
| Redshift | COPY from S3, sort/distribution design, statistics, materialized views, workload tuning |
| Kinesis | Balanced partition keys, right stream capacity mode/design, efficient consumers |
| Firehose | Destination buffering/format conversion, backup failed records, transform only when needed |
| DMS | Monitor lag, choose appropriate task settings, avoid unnecessary transformations |
| Cross-service data movement | Keep pipelines regional and avoid unnecessary intermediate hops |
| Governance | Use Lake Formation and tags to avoid duplicating governed datasets |
High-yield optimization principle: for analytics on S3, reduce bytes scanned and reduce file/partition overhead before scaling compute.
Common DEA-C01 traps
| Trap | Correct thinking |
|---|---|
| “Crawler transforms data” | Crawlers infer/update metadata only |
| “Data Catalog stores the dataset” | It stores metadata; data remains in S3 or source systems |
| “IAM allow means Athena can read everything” | S3, KMS, Lake Formation, and workgroup settings can still deny access |
| “Firehose is the same as Kinesis Data Streams” | Firehose is managed delivery; Data Streams supports custom consumers and replay |
| “DMS is for complex transformations” | DMS is mainly migration/replication with limited transformation |
| “CSV is fine for large Athena workloads” | Convert curated analytics data to Parquet/ORC |
| “More partitions always improve performance” | Too many tiny partitions can degrade planning and metadata operations |
| “Lambda is best for all ETL” | Use Glue/EMR for distributed data processing |
| “S3 event notification guarantees a complete batch workflow” | Use orchestration and idempotency for multi-file/batch completion logic |
| “Redshift Spectrum replaces all warehouse modeling” | Spectrum is useful, but internal Redshift tables can be better for repeated BI workloads |
| “KMS is only an encryption checkbox” | Key policies and grants are common causes of access failures |
| “Lake Formation replaces all IAM” | Lake Formation works with IAM, S3, catalog, and KMS controls |
Final review checklist
Before the exam, be able to answer these quickly:
- Which service ingests files, database CDC, SaaS, events, streams, and Kafka?
- When should data be stored in S3 or loaded into Redshift, and when should it be queried in place with Athena?
- How do Glue Data Catalog, crawlers, Lake Formation, and Glue jobs differ?
- How do you optimize S3 analytics with Parquet, compression, partitioning, and compaction?
- What permission layers can block a query: IAM, S3, KMS, Lake Formation, catalog, or network?
- How do you troubleshoot failed Glue, Athena, Redshift COPY, DMS, Kinesis, and Firehose workflows?
- Which orchestration service fits: Step Functions, Glue workflows, MWAA, or EventBridge?
- How do you make pipelines idempotent, observable, recoverable, and governed?
Next step
Use this Quick Reference as a checklist, then practice scenario questions that force you to choose between similar AWS services, especially Glue vs EMR, Athena vs Redshift, Kinesis Data Streams vs Firehose, DMS vs DataSync, and IAM vs Lake Formation permission issues.