DEA-C01 — AWS Certified Data Engineer – Associate Quick Reference

Compact AWS DEA-C01 reference for data ingestion, transformation, storage, operations, security, and governance decisions.

Exam identity and quick-use approach

This Quick Reference supports independent preparation for the AWS Certified Data Engineer – Associate (DEA-C01) exam. Use it as a compact decision guide for scenario questions: identify the data source, ingestion pattern, transformation need, storage or query target, governance model, and operational concern.

High-yield DEA-C01 thinking pattern:

  1. Ingest: batch files, database CDC, SaaS, stream, events, or messages.
  2. Store: Amazon S3 data lake, Amazon Redshift warehouse, operational database, or search/time-series target.
  3. Catalog and govern: AWS Glue Data Catalog, AWS Lake Formation, IAM, AWS KMS.
  4. Transform: AWS Glue, Amazon EMR, AWS Lambda, Amazon Managed Service for Apache Flink, SQL CTAS, or Redshift SQL.
  5. Operate: monitor, retry, validate, secure, optimize, and troubleshoot.

Core AWS data engineering architecture

    flowchart LR
        A[Sources: databases, SaaS, files, apps, streams] --> B[Ingestion: DMS, DataSync, AppFlow, Kinesis, Firehose, MSK]
        B --> C[S3 raw zone]
        C --> D[Catalog: AWS Glue Data Catalog]
        C --> E[Transform: AWS Glue, EMR, Lambda, Athena CTAS]
        E --> F[S3 curated zone]
        F --> G[Query: Athena, Redshift Spectrum, EMR, QuickSight]
        E --> H[Warehouse: Amazon Redshift]
        D --> I[Governance: Lake Formation, IAM, KMS]
        G --> J[Consumers]
        H --> J
        B --> K[Ops: CloudWatch, CloudTrail, EventBridge]
        E --> K
        H --> K

Service selection matrix

Need in the scenario | Usually choose | Why | Watch for
Durable object storage for a data lake | Amazon S3 | Scalable object storage, integrates with Glue, Athena, Redshift Spectrum, EMR | S3 is not a relational database; design prefixes, partitions, and file sizes
Metadata catalog for S3 tables | AWS Glue Data Catalog | Central table, schema, partition metadata for analytics services | Catalog stores metadata, not the data itself
Serverless SQL over S3 | Amazon Athena | Ad hoc SQL using Glue Data Catalog | Query cost/performance depends heavily on scanned data, partitions, and formats
Managed Spark ETL | AWS Glue ETL | Serverless distributed transformations, crawlers, jobs, workflows | Tune partitioning, file sizes, pushdown, and job bookmarks
Big data frameworks with more cluster control | Amazon EMR | Spark, Hive, Presto/Trino, Hudi/Iceberg workloads with configurable clusters | More operational responsibility than Glue
Cloud data warehouse | Amazon Redshift | High-performance analytics, SQL warehouse, COPY/UNLOAD, Spectrum | Model distribution, sort, workload, and external table scans
Database migration and CDC | AWS Database Migration Service (AWS DMS) | Full load plus ongoing change replication | DMS is not a general-purpose ETL engine
File transfer from on-premises storage to S3 | AWS DataSync | Managed transfer for file/object storage migrations and recurring sync | Needs network connectivity and correct IAM/S3/KMS permissions
Managed SFTP/FTPS/FTP endpoint | AWS Transfer Family | External partners exchange files into S3 or Amazon EFS | Do not confuse with DataSync migration/sync
SaaS application data ingestion | Amazon AppFlow | Managed SaaS-to-AWS flows | Best for supported SaaS connectors, not arbitrary streaming apps
Custom real-time stream consumers | Amazon Kinesis Data Streams | Ordered records per partition key, replay, custom consumers | Partition key design affects ordering and hot shards
Managed stream delivery to S3/Redshift/OpenSearch | Amazon Data Firehose | Minimal-code delivery pipeline with buffering and optional transform | Not for multiple custom replayable consumers
Kafka-compatible streaming | Amazon MSK | Managed Apache Kafka ecosystem compatibility | Choose when Kafka APIs/tools are required
Stateful streaming analytics | Amazon Managed Service for Apache Flink | Windowing, state, joins, event-time processing | More suitable than Lambda for complex stream computation
Event routing across AWS services/apps | Amazon EventBridge | Event bus, rules, schedules, SaaS events | Not a high-throughput analytics stream replacement
Queue-based decoupling | Amazon SQS | Durable message queue for workers | Not intended for replayable analytics streams
Short event-driven transformation | AWS Lambda | Lightweight transform, validation, routing | Avoid for large distributed ETL
Multi-step orchestration with branching/retries | AWS Step Functions | Coordinates services, handles state, retries, error paths | Prefer over ad hoc scripts for resilient workflows
Airflow DAG compatibility | Amazon Managed Workflows for Apache Airflow (MWAA) | Managed Apache Airflow | Choose when Airflow is a requirement
Fine-grained data lake governance | AWS Lake Formation | Table-, column-, and row-level governed access with LF-Tags | IAM/S3/KMS permissions still matter
Encryption key control | AWS Key Management Service (AWS KMS) | Customer managed keys, key policies, grants | Cross-account access needs key policy alignment
Secrets for connections | AWS Secrets Manager | Rotatable database/API credentials | Do not hard-code credentials in Glue jobs or notebooks
Logs, metrics, alarms | Amazon CloudWatch | Operational monitoring | Know service-specific metrics and log locations
API audit history | AWS CloudTrail | Tracks management events and optional data events | CloudTrail is audit, not performance monitoring

Scenario keyword shortcuts

Scenario phrase | Strong candidate answer
“Run SQL directly on files in S3” | Athena
“Catalog S3 data for Athena/Redshift Spectrum/Glue” | AWS Glue Data Catalog
“Discover schema and partitions automatically” | AWS Glue crawler
“Fine-grained access to data lake tables and columns” | Lake Formation
“Move on-premises NFS/SMB data to S3 repeatedly” | DataSync
“Partners upload files through SFTP” | AWS Transfer Family
“Replicate relational database changes continuously” | AWS DMS with CDC
“Ingest supported SaaS data without custom connector code” | Amazon AppFlow
“Custom applications need replayable stream records” | Kinesis Data Streams
“Deliver streaming data to S3 with minimal management” | Data Firehose
“Existing Kafka producers and consumers” | Amazon MSK
“Streaming joins, windows, and stateful processing” | Managed Service for Apache Flink
“Complex workflow with retries and branches” | Step Functions
“Existing Airflow DAGs” | MWAA
“Reduce Athena scanned data” | Parquet/ORC, compression, partitioning, column pruning
“Exact set of files for Redshift load” | COPY with manifest
“Query S3 from Redshift” | Redshift Spectrum external tables

Data ingestion reference

Batch, file, and database ingestion

Source or requirement | Choose | Pattern | Common trap
On-premises file shares or object stores | DataSync | Schedule transfer into S3 raw prefixes | DataSync transfers data; it does not apply business transformation logic
External users send files over SFTP/FTPS/FTP | AWS Transfer Family | Land files in S3, trigger downstream workflow | Transfer Family is for protocol access, not storage analytics
RDBMS full load to S3/Redshift | AWS DMS | Source endpoint, target endpoint, replication task | Validate data types, constraints, and permissions
RDBMS ongoing changes | AWS DMS CDC | Full load plus change data capture to target | Monitor replication lag and source log retention
SaaS data | Amazon AppFlow | Flow from SaaS connector to S3/Redshift/Salesforce targets as supported | Check connector and field mapping support
AWS service events | EventBridge | Rule routes event to target such as Lambda or Step Functions | EventBridge is not a bulk ETL engine
Application-generated files | S3 direct upload or SDK | Write to raw zone with event notification | Use idempotent object naming and downstream deduplication
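
The "full load plus change data capture" pattern can be pictured with a small in-memory sketch. This is not how AWS DMS is implemented; the event shape `(operation, primary_key, row)` and the function name `apply_changes` are hypothetical, chosen only to show why ordered, upsert-style application of changes keeps a replay safe.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical change-event shape: (operation, primary_key, row).
ChangeEvent = Tuple[str, int, dict]


def apply_changes(target: Dict[int, dict], changes: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply ordered insert/update/delete events to an in-memory 'table'."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            target[key] = row  # upsert, so replaying the same event is harmless
        elif op == "delete":
            target.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Full load first, then ongoing changes captured from the source log.
full_load = {1: {"status": "new"}, 2: {"status": "new"}}
cdc_events = [
    ("update", 1, {"status": "shipped"}),
    ("insert", 3, {"status": "new"}),
    ("delete", 2, {}),
]
final_table = apply_changes(dict(full_load), cdc_events)
```

Because every operation is written as an upsert or a tolerant delete, applying `cdc_events` a second time leaves `final_table` unchanged, which is the property a retried replication batch needs.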

Streaming ingestion

Requirement | Choose | Design notes
Multiple custom consumers need independent processing | Kinesis Data Streams | Consumers read from stream; partition key controls ordering and load
Delivery to S3, Redshift, OpenSearch, or third-party endpoint with low code | Data Firehose | Configure destination, buffering, optional Lambda transform, backup for failures
Kafka API compatibility | Amazon MSK | Use Kafka producers/consumers, topics, partitions, consumer groups
SQL/window/stateful analytics on streams | Managed Service for Apache Flink | Use event-time processing, windows, joins, state, checkpoints
Message queue for asynchronous workers | SQS | Worker decoupling, not analytics replay
Event bus integration | EventBridge | Event routing, schedules, SaaS events, cross-account event patterns

Ingestion design rules

  • Separate raw and curated data. Land immutable source data first, then transform.
  • Make ingestion idempotent. Assume retries can create duplicate deliveries.
  • Capture metadata: source, ingestion timestamp, schema version, batch ID, and lineage.
  • Use partition-friendly timestamps. Common S3 partitions include date or hour, but avoid creating excessive tiny partitions.
  • Validate early. Use data quality checks before promoting data into curated zones.
  • Monitor lag and failures. Streams, DMS tasks, Firehose delivery, and Glue jobs all expose operational signals.
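
As a rough illustration of the idempotency and metadata rules, the sketch below derives a deterministic record ID from the payload and skips anything already seen, so a retried batch lands nothing new. The field names (`_record_id`, `_source`, `_batch_id`, `_ingested_at`) and the in-memory `seen_ids` set are assumptions for the example; a real pipeline would keep that state in durable storage.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List, Set


def record_id(record: dict) -> str:
    """Deterministic ID: the same payload always hashes to the same value."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(batch: List[dict], seen_ids: Set[str], source: str, batch_id: str) -> List[dict]:
    """Wrap new records with ingestion metadata; skip records already landed."""
    landed = []
    for record in batch:
        rid = record_id(record)
        if rid in seen_ids:
            continue  # duplicate delivery from a retry
        seen_ids.add(rid)
        landed.append({
            "payload": record,
            "_record_id": rid,
            "_source": source,
            "_batch_id": batch_id,
            "_ingested_at": datetime.now(timezone.utc).isoformat(),
        })
    return landed


seen: Set[str] = set()
batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.5}]
first_attempt = ingest(batch, seen, source="orders", batch_id="batch-001")
retry_attempt = ingest(batch, seen, source="orders", batch_id="batch-001")
```

The first call lands both records with their metadata; the retry returns an empty list because both IDs are already known.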

S3 data lake design

Data lake zones

Zone | Purpose | Typical properties
Raw / bronze | Immutable copy of source data | Source format, append-only, tightly restricted write access
Staging / silver | Cleaned, normalized, deduplicated | Standardized schema, data quality checks, partitioned layout
Curated / gold | Analytics-ready products | Columnar formats, business dimensions, governed access
Sandbox | Exploration and temporary outputs | Lifecycle policies, limited permissions, not production source of truth
Audit / quarantine | Failed or suspicious records | Preserve rejected records with error reason and batch metadata

File format selection

Format | Best for | Avoid when | Exam notes
CSV | Simple exchange, human-readable exports | Large analytics scans, nested structures, strict typing | Easy but inefficient for Athena/Redshift Spectrum
JSON | Semi-structured events | Heavy repeated scans without conversion | Convert to columnar for curated analytics
Avro | Row-oriented data with schema evolution, streaming ecosystems | Pure SQL scan optimization is the main goal | Often used in streaming pipelines
Parquet | Columnar analytics on S3 | Frequent single-row updates without table format support | High-yield choice for Athena, Glue, Redshift Spectrum
ORC | Columnar analytics, Hive-style ecosystems | Tooling standardizes on Parquet | Similar exam value to Parquet
Apache Iceberg table | S3 lakehouse tables needing ACID-style operations and schema evolution | Simple immutable append-only files are enough | Useful for governed, evolving analytical tables

Partitioning and layout rules

Rule | Why it matters
Partition by common filter columns | Enables partition pruning in Athena, Glue, EMR, and Spectrum
Avoid overly high-cardinality partitions | Too many small partitions can hurt planning and metadata operations
Avoid many tiny files | Distributed engines spend too much time opening files instead of scanning data
Use columnar compression | Reduces scanned bytes and improves query performance
Keep partition naming consistent | Hive-style paths such as dt=2026-06-18/ work well with many tools
Compact streaming outputs | Firehose and streaming jobs can create many small objects
Store raw data immutably | Enables replay, audit, and recovery from bad transforms

Example S3 layout:

s3://company-data-lake/raw/source=salesforce/object=account/ingest_date=2026-06-18/
s3://company-data-lake/curated/domain=sales/table=orders/order_date=2026-06-18/
s3://company-data-lake/quarantine/source=orders/ingest_date=2026-06-18/
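
A small helper can keep Hive-style `key=value` prefixes consistent across writers. The sketch below is one possible way to build the layout shown above; the function name `partition_prefix` is hypothetical, and it relies on Python dictionaries preserving insertion order for the partition columns.

```python
from datetime import date


def partition_prefix(bucket: str, zone: str, partitions: dict) -> str:
    """Build a Hive-style S3 prefix; partition order follows dict insertion order."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"s3://{bucket}/{zone}/{parts}/"


raw_prefix = partition_prefix(
    "company-data-lake",
    "raw",
    {
        "source": "salesforce",
        "object": "account",
        "ingest_date": date(2026, 6, 18).isoformat(),
    },
)
print(raw_prefix)
```

Generating prefixes from one function rather than by hand avoids mixed naming such as `dt=` in one job and `ingest_date=` in another.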

AWS Glue and Data Catalog reference

Glue components

Component | What it does | Choose when
AWS Glue Data Catalog | Stores databases, tables, schemas, partitions, connections | Analytics services need shared metadata
Crawler | Infers schema and discovers partitions | Data structure is discoverable and changes need catalog updates
Classifier | Helps crawler interpret data | Custom formats or nonstandard records
Glue ETL job | Runs Spark, Python shell, or other supported job types | Transform, clean, join, and write data
Glue Studio | Visual job authoring | Need low-code ETL development
Glue workflow | Coordinates Glue crawlers, jobs, and triggers | Glue-centered pipeline orchestration
Glue trigger | Starts jobs/crawlers on schedule or condition | Simple Glue workflow automation
Glue connection | Stores connection details for JDBC, network, or marketplace connectors | Jobs need source/target connectivity
Glue Data Quality | Evaluates rules against datasets | Validate completeness, uniqueness, ranges, schema expectations
Glue Schema Registry | Manages schemas for streaming/event data | Producers and consumers need schema validation/evolution

Crawler vs explicit schema

Situation | Better choice
Unknown files arrive and schema must be discovered | Glue crawler
Production schema must be controlled and reviewed | Explicit table definition or IaC-managed catalog
Frequent partition additions only | Partition projection, ALTER TABLE ADD PARTITION, crawler, or repair pattern
Crawler creates unexpected tables | Adjust folder structure, classifiers, grouping behavior, and crawler scope
Sensitive columns need access control | Catalog plus Lake Formation, not crawler alone

Glue ETL exam points

Topic | Remember
DynamicFrame vs DataFrame | DynamicFrames help with semi-structured data and schema ambiguity; DataFrames expose standard Spark APIs
Job bookmarks | Track previously processed source data; still design idempotent writes
Pushdown predicates | Reduce source data read, especially with partitions
Repartition/coalesce | Manage output file count and parallelism
Small files | Compact to improve Athena/Spectrum/EMR performance
Skew | A few hot keys can slow joins and aggregations
Secrets | Use Secrets Manager or Glue connections, not embedded passwords
VPC access | Jobs accessing private databases need subnet/security group routing and access to S3/logs/secrets
Failure handling | Use retries, checkpoints/bookmarks, quarantine outputs, and CloudWatch logs
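
For the repartition/coalesce and small-files rows, a back-of-the-envelope calculation is often enough to pick an output file count. The sketch below assumes a target of about 128 MiB per file, which is an illustrative value rather than an AWS-mandated one, and the function name `target_file_count` is hypothetical.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * MIB) -> int:
    """Number of output files needed so each is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


large_partition_files = target_file_count(10 * GIB)
small_partition_files = target_file_count(5 * MIB)
```

In a Spark-based job, a number like this would typically be passed to `repartition` or `coalesce` before the write, so that a 10 GiB partition becomes tens of files instead of thousands.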

Illustrative Glue PySpark pattern:

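from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Reuse the Spark context that the Glue job runtime provides.
glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read from the Data Catalog; the pushdown predicate limits which partitions are read.
orders = glue_ctx.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="orders",
    push_down_predicate="ingest_date >= '2026-06-01'"
)

# Convert to a Spark DataFrame and deduplicate on the business key.
df = orders.toDF().dropDuplicates(["order_id"])

# Append partitioned Parquet to the curated zone.
# Append mode is not idempotent: rerunning over the same input writes the rows again.
(
    df.write.mode("append")
    .partitionBy("order_date")
    .parquet("s3://example-data-lake/curated/orders/")
)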

Example AWS Glue Data Quality rule style:

Rules = [
  IsComplete "order_id",
  IsUnique "order_id",
  ColumnValues "amount" >= 0,
  ColumnExists "order_date"
]
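
To make the intent of those four rules concrete, the sketch below evaluates equivalent checks over a list of dictionaries in plain Python. It is not the AWS Glue Data Quality engine and makes no claim about how that service treats nulls or types; the function name `evaluate_rules` and the sample rows are assumptions for illustration.

```python
from typing import Dict, List


def evaluate_rules(rows: List[dict]) -> Dict[str, bool]:
    """Plain-Python equivalents of the four rules, evaluated over in-memory rows."""
    ids = [row.get("order_id") for row in rows]
    return {
        "IsComplete order_id": all(value is not None for value in ids),
        "IsUnique order_id": len(ids) == len(set(ids)),
        "ColumnValues amount >= 0": all(
            isinstance(row.get("amount"), (int, float)) and row["amount"] >= 0
            for row in rows
        ),
        "ColumnExists order_date": all("order_date" in row for row in rows),
    }


good_rows = [
    {"order_id": 1, "amount": 5.0, "order_date": "2026-06-18"},
    {"order_id": 2, "amount": 0.0, "order_date": "2026-06-18"},
]
bad_rows = [
    {"order_id": 1, "amount": 5.0, "order_date": "2026-06-18"},
    {"order_id": 1, "amount": -2.0, "order_date": "2026-06-18"},
]
good_result = evaluate_rules(good_rows)
bad_result = evaluate_rules(bad_rows)
```

For `bad_rows`, the uniqueness and value-range checks fail while completeness and column existence still pass, which is the kind of per-rule outcome a quality gate would act on.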

Athena reference

Need | Athena feature or pattern
Query S3 data with SQL | External tables using Glue Data Catalog
Improve performance | Parquet/ORC, compression, partition pruning, avoid SELECT *
Create curated columnar data | CTAS or INSERT INTO from raw table
Control query usage | Workgroups, result locations, query settings
Add partitions | Crawler, ALTER TABLE ADD PARTITION, partition projection, or repair for Hive-style partitions
Secure data | IAM, S3 bucket policy, KMS key policy, Lake Formation permissions
Share governed tables | Lake Formation and catalog-based permissions where supported

Athena CTAS pattern:

CREATE TABLE curated_orders
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['order_date'],
  external_location = 's3://example-data-lake/curated/orders/'
) AS
SELECT
  order_id,
  customer_id,
  amount,
  order_status,
  order_date
FROM raw_orders
WHERE order_date >= DATE '2026-01-01';

Partition repair pattern for Hive-style S3 paths:

MSCK REPAIR TABLE raw_orders;

Common Athena traps:

  • Athena queries data in S3; it does not ingest or move data by itself.
  • Catalog permissions alone are not enough if S3 or KMS denies access.
  • Crawlers update metadata; they do not optimize file format or clean data.
  • Partition projection can reduce partition metadata management, but the S3 path pattern must match the table definition.
  • Columnar formats help most when queries select only needed columns.
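
The effect of partition pruning and column pruning on scanned data can be approximated with simple arithmetic. The model below is deliberately crude: it assumes each partition's size is known, that a columnar format lets the engine read only a fixed fraction of each file, and it ignores compression and file overhead. The function name and the byte figures are hypothetical.

```python
from typing import Dict, Set


def estimated_scan_bytes(
    partition_bytes: Dict[str, int],
    selected_partitions: Set[str],
    column_fraction: float = 1.0,
) -> int:
    """Bytes read after partition pruning, scaled by the share of columns selected."""
    pruned = sum(
        size for name, size in partition_bytes.items() if name in selected_partitions
    )
    return int(pruned * column_fraction)


table = {
    "order_date=2026-06-16": 1000,
    "order_date=2026-06-17": 1000,
    "order_date=2026-06-18": 1000,
}
full_scan = estimated_scan_bytes(table, set(table))
pruned_scan = estimated_scan_bytes(table, {"order_date=2026-06-18"}, column_fraction=0.25)
```

Under these assumptions, filtering to one day and selecting a quarter of the column data cuts the estimate from 3000 bytes to 250, which is why partition filters and explicit column lists matter more than query tuning elsewhere.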

Amazon Redshift reference

Redshift vs Athena vs S3 lake

Requirement | Better fit
Ad hoc SQL on raw/curated S3 data | Athena
Managed warehouse with repeated BI queries and modeled tables | Redshift
Query S3 from warehouse without loading all data | Redshift Spectrum
Transform and publish warehouse data to S3 | Redshift UNLOAD
Load large S3 datasets into warehouse tables | Redshift COPY
Variable or intermittent warehouse demand | Redshift Serverless may fit
Stable warehouse environment with cluster-level control | Redshift provisioned may fit

Redshift design points

Topic | What to know for DEA-C01
COPY | Preferred bulk load from S3, DynamoDB, EMR, or supported sources
UNLOAD | Writes query results from Redshift to S3
Distribution style | Affects data movement during joins; AUTO can help, but know KEY/EVEN/ALL concepts
Sort keys | Improve range-restricted scans and joins when aligned with query patterns
Compression encoding | Reduces storage and I/O
ANALYZE | Updates statistics for the optimizer
VACUUM | Reclaims/sorts storage where applicable
Spectrum | Queries external S3 data through external schemas/tables
Materialized views | Precompute expensive query results when refresh strategy fits
Workload management | Manage query queues, priorities, and concurrency behavior
Federated query | Query operational databases from Redshift for specific use cases

COPY from S3 pattern:

COPY analytics.orders
FROM 's3://example-data-lake/curated/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;
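
When a load must cover an exact set of files rather than a whole prefix, COPY can read a manifest. The sketch below builds a manifest document in the JSON shape that Redshift documents (an `entries` list with `url` and `mandatory`, plus a `meta.content_length` value for columnar files such as Parquet). The object names and sizes are made up for the example, and the helper name `build_manifest` is hypothetical.

```python
import json
from typing import List, Tuple


def build_manifest(objects: List[Tuple[str, int]]) -> str:
    """Return manifest JSON listing each S3 object and its size in bytes."""
    entries = [
        {"url": url, "mandatory": True, "meta": {"content_length": size}}
        for url, size in objects
    ]
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical object keys and sizes, for illustration only.
manifest_json = build_manifest([
    ("s3://example-data-lake/curated/orders/part-0000.parquet", 1048576),
    ("s3://example-data-lake/curated/orders/part-0001.parquet", 2097152),
])
print(manifest_json)
```

The resulting file would be uploaded to S3 and referenced in the COPY statement's FROM clause together with the MANIFEST option; setting `mandatory` to true makes the load fail if a listed file is missing instead of silently skipping it.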

UNLOAD to S3 pattern:

UNLOAD ('SELECT order_date, SUM(amount) AS revenue FROM analytics.orders GROUP BY order_date')
TO 's3://example-data-lake/exports/revenue/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;

Redshift troubleshooting shortcuts:

Symptom | Check first
COPY fails | IAM role, S3 path, KMS access, file format, load error views
Query slow after load | Statistics, sort/distribution, skew, scanned external data
External table query slow | S3 file format, partitioning, file sizes, Spectrum pruning
Access denied to S3 | Redshift role policy, bucket policy, KMS key policy
BI workload contention | Workload management, query design, materialized views

Streaming and event processing reference

Dimension | Kinesis Data Streams | Data Firehose | Amazon MSK | Managed Service for Apache Flink
Primary role | Durable stream for custom apps | Managed delivery stream | Managed Kafka | Stateful stream processing
Consumers | Custom consumers | Destination delivery, optional transform | Kafka consumers | Flink application
Replay | Yes, within configured retention | Not a replay stream for custom consumers | Kafka retention model | Reads from stream sources
Ordering | Per partition key/shard | Not the main design feature | Per Kafka partition | Depends on source partitioning and app logic
Operations | Stream capacity and consumer design | Destination and delivery config | Kafka cluster/topic/client design | Application state, checkpoints, parallelism
Choose when | You need custom stream processing | You need easy delivery to storage/search/warehouse | You need Kafka compatibility | You need windows, joins, state, event-time logic

Streaming design traps

  • Partition keys determine ordering and load distribution. Bad keys create hot partitions.
  • Assume at-least-once delivery in many pipelines; design deduplication and idempotent sinks.
  • Firehose buffering means it is usually not the answer for the lowest-latency custom consumer requirement.
  • Lambda works for lightweight stream processing, not complex stateful analytics.
  • Use dead-letter, backup, or quarantine patterns for failed records.
  • Monitor consumer lag, delivery failures, throttling, and error logs.
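
The hot-partition trap is easy to see in a simulation. Kinesis Data Streams maps each partition key to a shard through an MD5 hash of the key; the sketch below mimics that idea by splitting the 128-bit hash space into equal ranges, which is an assumption (real shards can own uneven ranges after splits and merges). The function names and key values are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index via its 128-bit MD5 hash."""
    hashed = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2 ** 128 // shard_count  # assumes equal hash-key ranges per shard
    return min(hashed // range_size, shard_count - 1)


def shard_load(keys: Iterable[str], shard_count: int) -> Counter:
    """Count how many records each shard would receive."""
    return Counter(shard_for_key(key, shard_count) for key in keys)


# One tenant ID used as the key for every record: all traffic hits one shard.
hot = shard_load(["tenant-1"] * 1000, shard_count=4)

# A high-cardinality key such as an order ID spreads records across shards.
spread = shard_load([f"order-{i}" for i in range(1000)], shard_count=4)
```

Comparing `hot` and `spread` shows the trade-off: a low-cardinality key preserves ordering for that key but concentrates load, while a high-cardinality key balances load at the cost of ordering only within each individual key.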

Transformation and orchestration choices

Requirement | Choose | Why
Distributed ETL over large S3 data | AWS Glue Spark job | Serverless managed Spark
Complex big data stack or custom libraries/configuration | Amazon EMR | More control over cluster/runtime
Lightweight event transform | Lambda | Simple code on event triggers
SQL transformation over S3 | Athena CTAS/INSERT | Serverless SQL pipeline step
SQL transformation inside warehouse | Redshift SQL, stored procedures, materialized views | Keep warehouse transformations close to modeled data
Stateful stream transformation | Managed Service for Apache Flink | Windows, joins, state
Glue-only pipeline | Glue workflows and triggers | Native Glue orchestration
Multi-service workflow | Step Functions | Branches, retries, service integrations
Airflow DAG requirement | MWAA | Managed Airflow compatibility
Time-based event trigger | EventBridge Scheduler or rules | Schedules pipeline starts

Orchestration exam distinctions

If the question emphasizes | Prefer
Retry policies, branching, human-readable state machine, AWS SDK integrations | Step Functions
Existing Airflow DAGs and operators | MWAA
Simple scheduled Glue job | Glue trigger or EventBridge schedule
Event from S3 starts processing | S3 event notification to Lambda/EventBridge/queue, then orchestrate
Decoupling ingestion from processing | SQS between producer and worker, or stream where replay/order is needed
Failure notification | EventBridge rule, CloudWatch alarm, SNS notification
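
Retry policies are easier to reason about once the wait schedule is written out. The sketch below computes delays from three values modeled on the Step Functions retry fields `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`, then runs a task with those waits. It is a local simulation under that assumption, not a state machine definition, and the function and task names are hypothetical.

```python
import time
from typing import Callable, List


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> List[float]:
    """Wait before each retry: interval, interval * rate, interval * rate**2, ..."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


def run_with_retry(task: Callable[[], str], delays: List[float], sleep=time.sleep) -> str:
    """Run task; after each failure wait the next delay, re-raising once retries run out."""
    for delay in [*delays, None]:
        try:
            return task()
        except Exception:
            if delay is None:
                raise
            sleep(delay)


calls = {"count": 0}


def flaky_job() -> str:
    """Hypothetical job that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "succeeded"


waits: List[float] = []
# Record the waits instead of sleeping so the example finishes instantly.
result = run_with_retry(flaky_job, retry_delays(2, 3, 2.0), sleep=waits.append)
```

Injecting `sleep` keeps the example fast and testable; with the default `time.sleep` the same call would actually pause between attempts.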

Security, governance, and access control

Data access control layers

Layer | Controls | Common exam point
IAM identity policy | What principals can call AWS APIs | Required but may not be sufficient alone
S3 bucket/access point policy | Resource-level access to objects | Needed for cross-account and centralized lake patterns
S3 Block Public Access | Prevents public exposure | Keep enabled unless a specific approved public pattern exists
KMS key policy/grants | Who can use encryption keys | Access fails if IAM allows S3 but KMS denies decrypt
Lake Formation permissions | Table, column, and governed data lake access | Use for fine-grained analytics permissions
Glue Data Catalog resource policy | Catalog-level sharing/access | Often relevant in cross-account catalogs
Secrets Manager | Stores database/API credentials | Use with Glue connections and jobs
VPC security groups/routes/endpoints | Network path to private sources and AWS APIs | Glue/DMS in VPC often need S3 and service endpoint access
CloudTrail | Audit of API activity | Enable relevant data events when object-level audit is needed
Amazon Macie | Sensitive data discovery in S3 | Helps identify PII/sensitive data exposure

Governance decision table

Requirement | Strong answer
Analysts can query only approved columns | Lake Formation column permissions or governed views
Grant data lake access by business tags | Lake Formation LF-Tags
Encrypt S3 objects with customer-managed key | SSE-KMS with proper key policy
Cross-account Athena access to encrypted S3 data | Align Lake Formation/catalog, S3 bucket policy, IAM, and KMS key policy
Store JDBC password for Glue | Secrets Manager or Glue connection using a secret
Audit who changed Glue table definitions | CloudTrail management events
Detect sensitive data in S3 | Macie
Keep Glue job traffic private to AWS services | VPC endpoints and correct routing/security groups
Prevent accidental public S3 lake access | S3 Block Public Access plus least-privilege bucket policies

Least-privilege reminders

  • Grant jobs only the S3 prefixes, catalog databases/tables, KMS keys, and logs they need.
  • Separate roles for ingestion, transformation, catalog administration, and consumption.
  • For cross-account sharing, check all layers: IAM, resource policy, Lake Formation, S3, and KMS.
  • Avoid embedding credentials in scripts, notebooks, job parameters, or environment variables when Secrets Manager is appropriate.
  • Use encryption in transit and at rest for data pipelines.
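
The "check all layers" reminder can be turned into a simple debugging habit: walk the layers in a fixed order and stop at the first one that does not allow the request. The sketch below is a mental model only. Real AWS authorization involves explicit denies, conditions, and cross-account rules that it does not attempt to reproduce, and the layer names and function name are hypothetical.

```python
from typing import Dict, Optional

# Order in which to inspect layers when debugging; an illustrative choice.
LAYERS = ("iam", "s3_bucket_policy", "kms_key_policy", "lake_formation")


def first_denying_layer(decisions: Dict[str, bool]) -> Optional[str]:
    """Return the first layer that does not allow access, or None if all allow."""
    for layer in LAYERS:
        if not decisions.get(layer, False):  # a missing entry is treated as not allowed
            return layer
    return None


blocked_by = first_denying_layer({
    "iam": True,
    "s3_bucket_policy": True,
    "kms_key_policy": False,
    "lake_formation": True,
})
all_clear = first_denying_layer({layer: True for layer in LAYERS})
```

Here the request is stopped by the KMS key policy even though IAM and S3 allow it, which mirrors the common scenario where a query fails only because the key denies decrypt.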

Operations, monitoring, and troubleshooting

Service signals

Service | Monitor
Glue jobs | Job run status, CloudWatch logs, errors, duration, data skew symptoms, output file count
Glue crawlers | Crawler run status, schema changes, tables created, partition discovery
Athena | Query failures, scanned data, workgroup settings, result location, permissions
Redshift | Query performance, load errors, disk/storage pressure, WLM queues, system views
DMS | Task status, table statistics, replication lag, task logs
Kinesis Data Streams | Write/read throttling, iterator age or consumer lag, hot partitions
Firehose | Delivery success/failure, transformation errors, backup records
MSK | Broker health, topic throughput, consumer lag
Flink | Checkpoints, application health, lag, failed records
S3 | Object creation, replication/lifecycle status, access errors
Lake Formation | Grant changes, denied access, LF-Tag policy alignment
KMS | Key access denied, disabled key, missing cross-account permissions
Step Functions | Failed states, retries, execution history
EventBridge | Rule matches, target invocation failures, dead-letter targets

Troubleshooting table

Symptom | Likely causes | Fast checks or fixes
Athena scans too much data | Row-oriented format (CSV/JSON), no partitions, no column pruning | Convert to Parquet/ORC, partition by filters, avoid SELECT *
Athena access denied | IAM, S3, KMS, Lake Formation, workgroup result location | Test each permission layer
Glue job out of memory or slow | Skew, tiny files, wide shuffle, no predicate pushdown | Repartition, compact, filter early, tune joins
Glue job reprocesses data | Bookmark disabled/reset, changed source path, non-idempotent writes | Enable bookmarks where useful, deduplicate, design idempotent outputs
Crawler creates many tables | Folder structure or classifier mismatch | Narrow crawler scope, fix path layout, adjust classifiers
DMS CDC lag grows | Source log pressure, target bottleneck, network, task config | Check task logs, table stats, target capacity
Firehose delivery fails | Destination permission, KMS, transform error, schema conversion issue | Inspect error logs and backup prefix
Kinesis consumer falls behind | Hot shard, slow consumer, insufficient parallelism | Improve partition key, scale stream/consumer design
Redshift COPY errors | Bad file format, role/KMS issue, incompatible schema | Check load error views and S3 object format
Redshift query slow | Poor distribution/sort, stale stats, queue contention | Analyze tables, review plan, tune WLM and table design
Lake Formation denies query | Missing LF grant, IAM mismatch, location not registered | Check data location registration and table permissions
Glue job cannot reach S3 from VPC | Missing route or VPC endpoint | Add S3 access path and service endpoints as needed
S3 event does not start pipeline | Notification filter mismatch, target policy, unsupported event path | Validate prefix/suffix, destination permissions, event pattern
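
For the last row, a quick local check of the prefix and suffix filter often explains why an upload never triggered anything. S3 event notification filters compare the object key against a configured prefix and suffix; the sketch below reproduces that comparison with plain string matching. The function name and the sample keys are hypothetical.

```python
def matches_filter(key: str, prefix: str = "", suffix: str = "") -> bool:
    """True when the object key starts with the prefix and ends with the suffix."""
    return key.startswith(prefix) and key.endswith(suffix)


key = "raw/orders/2026-06-18/part-0001.parquet"

expected_match = matches_filter(key, prefix="raw/orders/", suffix=".parquet")
leading_slash_mismatch = matches_filter(key, prefix="/raw/orders/", suffix=".parquet")
wrong_suffix = matches_filter(key, prefix="raw/orders/", suffix=".csv")
```

A stray leading slash in the prefix or a suffix that names the wrong file extension is enough to make the filter miss every object, so these are worth ruling out before looking at target permissions.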

Data quality, schema, and reliability

Concern | AWS pattern
Validate completeness, uniqueness, ranges | AWS Glue Data Quality rules
Enforce streaming schema compatibility | AWS Glue Schema Registry
Preserve bad records | Quarantine S3 prefix with error metadata
Prevent duplicate records | Idempotent keys, deduplication step, deterministic output paths
Handle schema drift | Crawler review, explicit schema management, compatible schema evolution
Track lineage | Store batch IDs, source metadata, job run IDs, and catalog versions
Recover from bad transform | Reprocess raw immutable data into corrected curated zone
Promote trusted datasets | Raw to curated pipeline with quality gates

Reliability checklist:

  • Can the pipeline be safely retried?
  • Are failed records preserved?
  • Is raw source data immutable?
  • Are schema changes detected before breaking consumers?
  • Are job failures visible through alarms or events?
  • Are downstream writes atomic enough for the service and format used?
  • Are permissions least-privilege but sufficient across IAM, S3, KMS, and Lake Formation?
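
The "are failed records preserved?" question can be answered with a quarantine step that never drops input. The sketch below splits a batch into valid and quarantined records and attaches an error reason and batch ID to each rejection. The validation rules, field names, and function name are assumptions made for the example, not a prescribed AWS pattern.

```python
from typing import List, Tuple


def route_records(records: List[dict], batch_id: str) -> Tuple[List[dict], List[dict]]:
    """Split records into (valid, quarantined); rejected records keep reason and batch."""
    valid, quarantined = [], []
    for record in records:
        reason = None
        if record.get("order_id") is None:
            reason = "missing order_id"
        elif not isinstance(record.get("amount"), (int, float)) or record["amount"] < 0:
            reason = "invalid amount"
        if reason is None:
            valid.append(record)
        else:
            quarantined.append(
                {"record": record, "error_reason": reason, "batch_id": batch_id}
            )
    return valid, quarantined


incoming = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 3.0},
    {"order_id": 3, "amount": -1.0},
]
valid_records, quarantined_records = route_records(incoming, batch_id="batch-001")
```

Every input record ends up in exactly one of the two outputs, so the quarantined list can be written to a quarantine prefix and replayed after the underlying issue is fixed.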

Cost and performance optimization

Area | Optimize by
Athena | Columnar formats, compression, partitions, workgroups, CTAS for repeated transforms
S3 lake | Lifecycle policies, compact files, avoid unnecessary copies, design prefixes logically
Glue | Filter early, avoid shuffles, use bookmarks, write partitioned columnar output
Redshift | COPY from S3, sort/distribution design, statistics, materialized views, workload tuning
Kinesis | Balanced partition keys, right stream capacity mode/design, efficient consumers
Firehose | Destination buffering/format conversion, backup failed records, transform only when needed
DMS | Monitor lag, choose appropriate task settings, avoid unnecessary transformations
Cross-service data movement | Keep pipelines regional and avoid unnecessary intermediate hops
Governance | Use Lake Formation and tags to avoid duplicating governed datasets

High-yield optimization principle: for analytics on S3, reduce bytes scanned and reduce file/partition overhead before scaling compute.

Common DEA-C01 traps

Trap | Correct thinking
“Crawler transforms data” | Crawlers infer/update metadata only
“Data Catalog stores the dataset” | It stores metadata; data remains in S3 or source systems
“IAM allow means Athena can read everything” | S3, KMS, Lake Formation, and workgroup settings can still deny access
“Firehose is the same as Kinesis Data Streams” | Firehose is managed delivery; Data Streams supports custom consumers and replay
“DMS is for complex transformations” | DMS is mainly migration/replication with limited transformation
“CSV is fine for large Athena workloads” | Convert curated analytics data to Parquet/ORC
“More partitions always improve performance” | Too many tiny partitions can degrade planning and metadata operations
“Lambda is best for all ETL” | Use Glue/EMR for distributed data processing
“S3 event notification guarantees a complete batch workflow” | Use orchestration and idempotency for multi-file/batch completion logic
“Redshift Spectrum replaces all warehouse modeling” | Spectrum is useful, but internal Redshift tables can be better for repeated BI workloads
“KMS is only an encryption checkbox” | Key policies and grants are common causes of access failures
“Lake Formation replaces all IAM” | Lake Formation works with IAM, S3, catalog, and KMS controls

Final review checklist

Before the exam, be able to answer these quickly:

  • Which service ingests files, database CDC, SaaS, events, streams, and Kafka?
  • When should data be stored in S3, Redshift, or queried by Athena?
  • How do Glue Data Catalog, crawlers, Lake Formation, and Glue jobs differ?
  • How do you optimize S3 analytics with Parquet, compression, partitioning, and compaction?
  • What permission layers can block a query: IAM, S3, KMS, Lake Formation, catalog, or network?
  • How do you troubleshoot failed Glue, Athena, Redshift COPY, DMS, Kinesis, and Firehose workflows?
  • Which orchestration service fits: Step Functions, Glue workflows, MWAA, or EventBridge?
  • How do you make pipelines idempotent, observable, recoverable, and governed?

Next step

Use this Quick Reference as a checklist, then practice scenario questions that force you to choose between similar AWS services, especially Glue vs EMR, Athena vs Redshift, Kinesis Data Streams vs Firehose, DMS vs DataSync, and IAM vs Lake Formation permission issues.
