DEA-C01 Syllabus — Objectives by Domain

Blueprint-aligned learning objectives for AWS Certified Data Engineer — Associate (DEA-C01), organized by domain with quick links to targeted practice.

Use this syllabus as your source of truth for DEA-C01. Work through each domain in order and drill targeted sets after every task.

What’s covered

Domain 1: Data Ingestion and Transformation (34%)

Practice this topic →

Task 1.1 - Perform data ingestion

  • Evaluate throughput and latency requirements and select an ingestion approach that meets service limits and operational needs.
  • Differentiate streaming and batch ingestion patterns and choose based on frequency, historical backfill needs, and timeliness requirements.
  • Design replayable ingestion pipelines (for backfills and reprocessing) and apply idempotency principles to avoid duplicate processing.
  • Differentiate stateful and stateless ingestion transactions and understand implications for ordering, deduplication, and checkpoints.
  • Ingest streaming data from sources such as Amazon Kinesis, Amazon MSK, DynamoDB Streams, AWS DMS, AWS Glue, and Amazon Redshift based on the source type and constraints.
  • Ingest batch data from sources such as Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, and Amazon AppFlow based on the integration pattern.
  • Configure batch ingestion options such as scheduled runs, incremental loads, partitioning, and checkpoints to support repeatability and performance.
  • Consume data APIs safely by handling pagination, throttling, retries, and authentication while maintaining traceability.
  • Set up schedulers using Amazon EventBridge, Apache Airflow (Amazon MWAA), or time-based schedules for ingestion jobs and AWS Glue crawlers.
  • Use event triggers such as Amazon S3 Event Notifications and EventBridge rules to start ingestion and downstream processing.
  • Invoke AWS Lambda from Amazon Kinesis and design fan-in/fan-out distribution patterns for streaming data pipelines (a minimal consumer sketch follows this list).
  • Implement secure connectivity (for example, IP allowlists) and manage throttling and rate limits for services such as DynamoDB, Amazon RDS, and Amazon Kinesis.
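
A minimal sketch of the Lambda-from-Kinesis pattern above with idempotent processing. It assumes a hypothetical DynamoDB table named ingestion-idempotency-ledger keyed on pk, and uses the Kinesis record sequence number as the idempotency key so replays and retries do not create duplicates.

```python
import base64
import json

import boto3
from botocore.exceptions import ClientError

# Hypothetical DynamoDB table used as an idempotency ledger (partition key "pk").
ddb = boto3.resource("dynamodb")
ledger = ddb.Table("ingestion-idempotency-ledger")

def handler(event, context):
    for record in event["Records"]:
        seq = record["kinesis"]["sequenceNumber"]   # unique per shard record
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        try:
            # Conditional write makes reprocessing (replays, retries) a no-op.
            ledger.put_item(
                Item={"pk": seq, "payload": json.dumps(payload)},
                ConditionExpression="attribute_not_exists(pk)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # already processed; skip the duplicate
            raise  # any other error: let Lambda retry the batch
```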

Task 1.2 - Transform and process data

  • Design ETL pipeline steps based on business requirements, target schemas, and downstream analytics or operational needs.
  • Select transformation strategies based on data volume, velocity, and variety (structured, semi-structured, unstructured) and operational constraints.
  • Apply distributed computing concepts and use Apache Spark to process large datasets efficiently.
  • Select intermediate data staging locations (for example, Amazon S3 or temporary tables) to support multi-step processing and recoverability.
  • Optimize container usage for performance needs using Amazon EKS or Amazon ECS and select compute sizing and scaling strategies.
  • Connect to data sources using JDBC/ODBC connectors and manage network access and credentials securely.
  • Integrate data from multiple sources and handle joins, deduplication, normalization, and schema alignment across systems.
  • Optimize costs while processing data by choosing the right compute and execution model (serverless vs provisioned, scaling, and purchasing options).
  • Select and use transformation services such as Amazon EMR, AWS Glue, AWS Lambda, and Amazon Redshift based on workload requirements.
  • Transform data between formats (for example, CSV to Parquet) and apply partitioning and compression for analytics performance (see the sketch after this list).
  • Troubleshoot and debug common transformation failures and performance issues such as data skew, memory pressure, and small-file problems.
  • Create data APIs to make data available to other systems using AWS services while managing schema/version changes.
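
A PySpark sketch of the CSV-to-Parquet objective above. The S3 paths and the order_ts column used to derive the partition key are placeholders; the partition layout and Snappy compression are choices to adapt to your query patterns.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Hypothetical S3 locations; replace with your raw and curated prefixes.
raw_path = "s3://example-raw-bucket/orders/csv/"
curated_path = "s3://example-curated-bucket/orders/parquet/"

df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Derive a partition column and write compressed, columnar output.
(
    df.withColumn("order_date", F.to_date("order_ts"))
    .repartition("order_date")              # reduce small files per partition
    .write.mode("overwrite")
    .partitionBy("order_date")
    .option("compression", "snappy")
    .parquet(curated_path)
)
```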

Task 1.3 - Orchestrate data pipelines

  • Integrate AWS services to build end-to-end data pipelines with explicit dependencies, retries, and repeatable outcomes.
  • Apply event-driven architecture to trigger pipeline steps using services such as Amazon EventBridge and Amazon S3 notifications.
  • Configure pipelines to run on schedules or dependencies, including managing reruns and backfills.
  • Choose serverless workflow patterns and determine when to use AWS Step Functions, AWS Glue workflows, or AWS Lambda orchestration.
  • Build orchestration workflows using services such as AWS Lambda, Amazon MWAA, AWS Step Functions, AWS Glue workflows, and Amazon EventBridge (an Airflow-style sketch follows this list).
  • Implement notifications and alerting for pipeline events using Amazon SNS and Amazon SQS and integrate with failure handling.
  • Design pipelines for performance, availability, scalability, resiliency, and fault tolerance using retries, idempotency, and isolation.
  • Implement and maintain serverless workflows with correct timeouts, error handling, and state management.
  • Define and monitor pipeline SLAs such as data freshness and latency, and configure alarms for breaches.
  • Control parallelism and fan-out patterns safely using workflow constructs (for example, Step Functions Map/Parallel or Airflow task concurrency).
  • Implement checkpointing and understand at-least-once vs exactly-once trade-offs across streaming and batch steps.
  • Secure orchestration components with least-privilege roles and controlled network access for data services and endpoints.
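
A minimal Amazon MWAA (Apache Airflow) sketch showing explicit dependencies, retries with backoff, and backfill-friendly scheduling. The DAG ID, task callables, and schedule are placeholders, not a prescribed pipeline design.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
}

def extract(**_):
    ...  # hypothetical ingestion step

def transform(**_):
    ...  # hypothetical transformation step

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,             # enables backfills for missed intervals
    max_active_runs=1,        # avoid overlapping reruns
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # explicit dependency
```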

Task 1.4 - Apply programming concepts

  • Implement CI/CD for data pipelines, including automated testing, packaging, and safe deployment practices.
  • Write SQL queries for data extraction and transformation, including joins and multi-step transformations that support pipeline requirements.
  • Optimize SQL queries using techniques such as predicate pushdown, partition pruning, and avoiding expensive joins when possible.
  • Use infrastructure as code (AWS CloudFormation or AWS CDK) to deploy repeatable data pipeline infrastructure.
  • Apply distributed computing concepts to optimize data pipeline code and avoid bottlenecks in parallel processing.
  • Use appropriate data structures and algorithms (for example, graph and tree structures) when modeling or traversing complex data relationships.
  • Optimize code to reduce runtime and resource usage for ingestion and transformation, including efficient I/O and batching.
  • Configure AWS Lambda functions for concurrency and performance (memory, timeout, concurrency limits) in data pipeline workloads (see the sketch after this list).
  • Use Amazon Redshift stored procedures or SQL UDFs to encapsulate transformations and improve repeatability.
  • Use Git commands and workflows (clone, branch, merge, tag) to manage pipeline code changes safely.
  • Package and deploy serverless data pipelines using AWS SAM, including Lambda functions, Step Functions state machines, and DynamoDB tables.
  • Mount and use storage volumes from within Lambda functions (for example, Amazon EFS) when required and understand constraints and best practices.
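
A boto3 sketch of the Lambda tuning objective above. The function name and the specific memory, timeout, and concurrency values are placeholders to size against your workload and downstream limits.

```python
import boto3

lambda_client = boto3.client("lambda")
function_name = "orders-transform"  # hypothetical function

# Tune memory and timeout for the workload; memory also scales CPU.
lambda_client.update_function_configuration(
    FunctionName=function_name,
    MemorySize=1024,   # MB
    Timeout=120,       # seconds
)

# Cap concurrency so a burst of events cannot overwhelm downstream databases.
lambda_client.put_function_concurrency(
    FunctionName=function_name,
    ReservedConcurrentExecutions=20,
)
```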

Domain 2: Data Store Management (26%)

Practice this topic →

Task 2.1 - Choose a data store

  • Compare storage platforms (object, file, relational, NoSQL, streaming) and explain how their characteristics affect data engineering designs.
  • Select AWS data stores and configurations that meet performance demands such as throughput, concurrency, and latency.
  • Choose data storage formats (for example, CSV, TXT, Parquet) and apply compression and partitioning aligned to access patterns.
  • Align data storage choices with migration requirements, including cutover approaches, replication needs, and data movement constraints.
  • Determine appropriate storage solutions for access patterns such as point lookups, scans, time-series, OLTP, OLAP, and streaming.
  • Manage locks and concurrency controls (for example, in Amazon Redshift and Amazon RDS) and prevent contention in multi-user environments.
  • Implement appropriate storage services for cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, DynamoDB, Amazon Kinesis Data Streams, Amazon MSK).
  • Configure storage services for access pattern requirements (for example, Redshift distribution/sort design, DynamoDB partition key design, and S3 partition layout); a DynamoDB key-design sketch follows this list.
  • Apply Amazon S3 to appropriate use cases such as data lakes, staging layers, and durable storage of curated datasets.
  • Integrate migration tools such as AWS Transfer Family into data movement and ingestion workflows.
  • Implement data migration or remote access methods such as Amazon Redshift federated queries, materialized views, and Redshift Spectrum.
  • Design hybrid and multi-store architectures with clear responsibilities (for example, lake for raw/curated data, warehouse for BI, stream for real-time ingestion).
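
A boto3 sketch of key design driven by access patterns, assuming a hypothetical orders table that serves point lookups by customer and range queries by order time. The table name, attributes, and billing mode are illustrative choices.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table: point lookups by customer, range queries by order time.
dynamodb.create_table(
    TableName="orders",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_ts", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "order_ts", "KeyType": "RANGE"},     # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity for spiky workloads
)
```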

Task 2.2 - Understand data cataloging systems

  • Explain the purpose of a data catalog and how it enables discovery, governance, and consistent schema usage across analytics services.
  • Create and maintain a data catalog, including databases and tables, and keep metadata consistent as data changes.
  • Classify data based on requirements and use metadata to support access controls and governance workflows.
  • Identify key components of metadata and catalogs (schemas, partitions, tags, lineage) and how they are used by query engines.
  • Build and reference catalogs using AWS Glue Data Catalog or Apache Hive metastore for consistent schema discovery.
  • Discover schemas and populate data catalogs using AWS Glue crawlers and apply controls to manage schema drift (see the crawler sketch after this list).
  • Synchronize partitions with a data catalog and ensure new partitions are available for query engines promptly.
  • Create and manage source and target connections for cataloging using AWS Glue connections and appropriate network settings.
  • Enable catalog-driven consumption for services such as Athena, Amazon EMR, and Redshift Spectrum while maintaining schema consistency.
  • Troubleshoot catalog issues such as stale partitions, crawler misclassification, and insufficient permissions.
  • Apply governance controls through catalog metadata using Lake Formation permissions and LF-tags.
  • Version and document schemas and catalog changes to support reproducibility, collaboration, and audits.
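
A boto3 sketch of a Glue crawler configured to surface schema drift rather than silently rewrite schemas. The crawler name, IAM role ARN, database, S3 path, and schedule are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler that discovers schemas under one S3 prefix.
glue.create_crawler(
    Name="orders-curated-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-curated-bucket/orders/"}]},
    # Control schema drift: log changes instead of updating tables automatically.
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
    Schedule="cron(0 2 * * ? *)",  # daily run to pick up new partitions
)
```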

Task 2.3 - Manage the lifecycle of data

  • Choose storage solutions that meet hot and cold data requirements and design tiered storage based on access frequency.
  • Optimize storage cost across the data lifecycle using mechanisms such as storage class transitions, compression, and partitioning.
  • Define retention policies and archiving strategies aligned to business requirements and legal obligations.
  • Delete and expire data to meet business and legal requirements, including implementing automated expiry processes.
  • Protect data with appropriate resiliency and availability using strategies such as versioning, replication, and backups.
  • Perform load and unload operations to move data between Amazon S3 and Amazon Redshift using efficient patterns.
  • Manage S3 Lifecycle policies to transition objects across storage tiers and enforce retention policies (see the sketch after this list).
  • Expire data when it reaches a specific age using S3 Lifecycle policies and validate outcomes against retention requirements.
  • Manage S3 versioning and understand implications for restore scenarios, rollback, and ongoing storage cost.
  • Use DynamoDB TTL to expire data and design applications that handle eventual deletions safely.
  • Manage lifecycle policies for intermediate and derived datasets to avoid accumulating stale data and unnecessary costs.
  • Apply governance controls to lifecycle operations, including change control, approvals, and audit logs for retention policy changes.
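
A boto3 sketch of an S3 Lifecycle configuration that tiers and then expires raw data. The bucket, prefix, and day thresholds are placeholders to align with your retention requirements.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; transitions cold data and expires it after ~7 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```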

Task 2.4 - Design data models and schema evolution

  • Apply data modeling concepts (normalized/denormalized, star/snowflake) and select a design based on query patterns and constraints.
  • Model structured, semi-structured, and unstructured data and decide when to use schema-on-write versus schema-on-read.
  • Design schemas for Amazon Redshift, DynamoDB, and lake-based tables (governed by Lake Formation) aligned to access patterns.
  • Use indexing, partitioning, sort/distribution strategies, and compression to optimize performance and cost.
  • Plan schema evolution techniques (additive changes, versioning, backfills) and avoid breaking downstream consumers (an additive-change sketch follows this list).
  • Address changes to the characteristics of data (volume, cardinality, skew) and update data models to preserve performance and correctness.
  • Establish data lineage to ensure accuracy and trustworthiness of data and support auditability.
  • Perform schema conversion when migrating databases using AWS Schema Conversion Tool (AWS SCT) and AWS DMS schema conversion.
  • Manage schema drift with controlled discovery (for example, AWS Glue crawlers) and compatibility policies for consumers.
  • Use lineage tools to track transformations and provenance (for example, Amazon SageMaker ML Lineage Tracking) and document pipeline metadata.
  • Design backward and forward compatibility approaches for schema changes and validate with consumer contract testing.
  • Document data models and schema evolution policies to support collaboration, governance, and predictable upgrades.
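
A boto3 sketch of an additive, backward-compatible schema change applied through Athena DDL: existing readers keep working, and new readers can use the column once producers populate it. The database, table, column, and results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Additive change only; dropping or renaming columns would break consumers.
athena.start_query_execution(
    QueryString="ALTER TABLE analytics.orders ADD COLUMNS (discount_code string)",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```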

Domain 3: Data Operations and Support (22%)

Practice this topic →

Task 3.1 - Automate data processing by using AWS services

  • Maintain and troubleshoot automated data processing workflows to ensure repeatable business outcomes.
  • Use API calls and SDKs to automate data processing operations and integrate programmatic control into pipelines.
  • Identify which AWS services accept scripting (for example, Amazon EMR, Amazon Redshift, AWS Glue) and integrate scripts into workflows safely.
  • Orchestrate data pipelines using services such as Amazon MWAA and AWS Step Functions to coordinate processing, retries, and dependencies.
  • Troubleshoot Amazon managed workflows (for example, Amazon MWAA environments) and identify common failure modes in orchestration and scheduling.
  • Use AWS service features to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue) and select the best service for the workload.
  • Consume and maintain data APIs, including versioning and access controls that enable stable downstream integrations.
  • Prepare data transformations using tools such as AWS Glue DataBrew and integrate outputs into downstream analytics.
  • Query data using Amazon Athena and create repeatable datasets using views or CTAS patterns.
  • Use AWS Lambda to automate data processing and connect event-driven triggers to processing steps.
  • Manage events and schedulers using Amazon EventBridge and integrate triggers for batch and stream processing (see the scheduling sketch after this list).
  • Implement automation guardrails (idempotency, retries, backoff, and rate-limit handling) to prevent duplicate or runaway processing.
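
A boto3 sketch of an EventBridge schedule that triggers a processing Lambda. The rule name, schedule, and function ARN are placeholders; the add_permission call is what allows EventBridge to invoke the function.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical hourly schedule that starts a processing Lambda function.
rule_arn = events.put_rule(
    Name="hourly-orders-ingest",
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)["RuleArn"]

function_arn = "arn:aws:lambda:us-east-1:123456789012:function:orders-ingest"  # placeholder

# Allow EventBridge to invoke the function, then attach it as the rule target.
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="allow-eventbridge-hourly-orders",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)
events.put_targets(
    Rule="hourly-orders-ingest",
    Targets=[{"Id": "orders-ingest-lambda", "Arn": function_arn}],
)
```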

Task 3.2 - Analyze data by using AWS services

  • Choose between provisioned and serverless analytics services based on concurrency needs, workload variability, cost, and operational overhead.
  • Write SQL queries with joins, filters, and window functions to satisfy analysis requirements while controlling cost and runtime.
  • Apply cleansing techniques appropriately and document assumptions to ensure repeatable and auditable analysis.
  • Perform aggregations, rolling averages, grouping, and pivoting to create analysis-ready datasets (a rolling-average query sketch follows this list).
  • Use Amazon Athena to query data and create reusable assets such as views and CTAS outputs.
  • Use Athena notebooks with Apache Spark to explore data interactively and validate data assumptions.
  • Visualize data using AWS services such as Amazon QuickSight and design dashboards aligned to business questions.
  • Use AWS Glue DataBrew for profiling, visualizing, and preparing datasets during analysis.
  • Verify and clean data using services and tools such as AWS Lambda, Athena, QuickSight, Jupyter notebooks, and Amazon SageMaker Data Wrangler.
  • Optimize query performance and cost using file formats, partition pruning, statistics, and CTAS patterns where appropriate.
  • Select the appropriate analysis engine (Athena vs Redshift vs EMR) based on latency needs, scale, governance controls, and integration requirements.
  • Share analysis outputs safely by controlling access to derived datasets and applying least-privilege patterns for consumers.
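
An Athena sketch of the rolling-average objective above, assuming a hypothetical analytics.orders table with order_date and amount columns. The workgroup and results location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Rolling 7-day average of daily revenue using a window over an aggregate.
query = """
SELECT
    order_date,
    SUM(amount) AS daily_revenue,
    AVG(SUM(amount)) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_avg
FROM analytics.orders
GROUP BY order_date
ORDER BY order_date
"""

athena.start_query_execution(
    QueryString=query,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```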

Task 3.3 - Maintain and monitor data pipelines

  • Implement application logging for data pipelines and define what operational data to capture (job metadata, row counts, failures).
  • Log access to AWS services used in pipelines using AWS CloudTrail and integrate events into audit and incident workflows.
  • Monitor pipeline health using Amazon CloudWatch metrics, logs, and alarms and define actionable alert thresholds (see the alarm sketch after this list).
  • Apply best practices for performance tuning and validate improvements using observable metrics and controlled changes.
  • Deploy logging and monitoring solutions that enable traceability across ingestion, transformation, and consumption steps.
  • Use notification services (Amazon SNS and Amazon SQS) to send alerts and integrate with operational runbooks.
  • Troubleshoot performance issues using tools such as CloudWatch Logs Insights, Amazon OpenSearch Service, Athena, and EMR log analysis.
  • Troubleshoot and maintain pipelines (for example, AWS Glue jobs and Amazon EMR steps), including dependency and configuration issues.
  • Extract logs for audits and implement log retention and archival policies that meet compliance requirements.
  • Use Amazon Macie findings to detect sensitive data exposure risks and integrate detection into governance workflows.
  • Create dashboards for key pipeline SLIs/SLOs such as freshness, latency, error rates, and backlog.
  • Automate operational responses for common failures (reruns, retries, backfills) with guardrails and approvals when required.
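
A boto3 sketch of a freshness alarm, assuming the pipeline publishes a hypothetical custom metric (MinutesSinceLastLoad) after each successful load. The namespace, dimensions, threshold, and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fires when data freshness exceeds 2 hours and notifies an on-call SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="orders-pipeline-freshness-breach",
    Namespace="DataPipeline",
    MetricName="MinutesSinceLastLoad",
    Dimensions=[{"Name": "Pipeline", "Value": "orders"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=120,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # no metric usually means the pipeline is down
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-oncall"],  # placeholder
)
```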

Task 3.4 - Ensure data quality

  • Define data validation dimensions (completeness, consistency, accuracy, integrity) and select metrics for measuring them.
  • Use data profiling techniques to discover anomalies and guide rule creation.
  • Apply data sampling techniques to validate large datasets efficiently and detect drift in data quality.
  • Detect and mitigate data skew that can impact processing performance and downstream correctness.
  • Run data quality checks during processing (for example, null checks and range checks) and decide when to fail fast versus quarantine records.
  • Define data quality rules using services such as AWS Glue DataBrew and manage rules as versioned assets.
  • Investigate data consistency issues using AWS Glue DataBrew profiling and targeted queries to locate root causes.
  • Route invalid or suspicious records to a quarantine dataset and design remediation workflows for later correction (see the sketch after this list).
  • Configure thresholds and alerting for data quality failures and integrate alerts into orchestration failure handling.
  • Maintain a data quality scorecard over time and track regressions after pipeline changes.
  • Detect schema drift and validate schema conformance early to prevent downstream breakages.
  • Implement continuous improvement loops for recurring quality issues using root cause analysis, remediation, and prevention controls.
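
A PySpark sketch of in-pipeline quality checks with a quarantine path, assuming hypothetical S3 locations and columns (order_id, customer_id, amount). The 5% fail-fast threshold is an illustrative choice.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-quality-checks").getOrCreate()

df = spark.read.parquet("s3://example-curated-bucket/orders/parquet/")  # placeholder path

# Simple rule set: required keys present and amounts within an expected range.
is_valid = (
    F.col("order_id").isNotNull()
    & F.col("customer_id").isNotNull()
    & F.col("amount").between(0, 100000)
)

valid = df.filter(is_valid)
quarantine = df.filter(~is_valid).withColumn("dq_failed_at", F.current_timestamp())

valid.write.mode("append").parquet("s3://example-curated-bucket/orders/validated/")
quarantine.write.mode("append").parquet("s3://example-quarantine-bucket/orders/")

# Fail fast if too many records are bad, rather than silently loading them.
bad_ratio = quarantine.count() / max(df.count(), 1)
if bad_ratio > 0.05:
    raise RuntimeError(f"Data quality failure: {bad_ratio:.1%} of records quarantined")
```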

Domain 4: Data Security and Governance (18%)

Practice this topic →

Task 4.1 - Apply authentication mechanisms

  • Apply VPC networking and security concepts (subnets, routing, security groups) to secure data engineering workloads.
  • Differentiate managed and unmanaged services and understand how that affects authentication responsibilities and control points.
  • Compare authentication methods (password-based, certificate-based, and role-based) and choose mechanisms appropriate to the service and workload.
  • Differentiate AWS managed policies and customer managed policies and determine when a custom policy is required.
  • Update VPC security groups to allow only required traffic for data sources, processing jobs, and analytics endpoints.
  • Create and update IAM groups, roles, endpoints, and services used by data pipelines and analytics workloads.
  • Create and rotate credentials using AWS Secrets Manager and avoid embedding secrets in code or configuration (see the retrieval sketch after this list).
  • Set up IAM roles for access from services such as Lambda, API Gateway, the AWS CLI, and CloudFormation, including correct trust policies.
  • Apply IAM policies to roles, endpoints, and services such as S3 Access Points and AWS PrivateLink endpoints.
  • Configure private connectivity using VPC endpoints or PrivateLink to access AWS data services without traversing the public internet.
  • Troubleshoot authentication issues (for example, invalid credentials or trust policy errors) and implement least-privilege fixes.
  • Implement identity hygiene for pipelines, including short-lived credentials, federation, and role assumption patterns.
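
A boto3 sketch of retrieving database credentials at runtime from Secrets Manager instead of embedding them in code. The secret name and JSON field names are placeholders.

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# Hypothetical secret holding database credentials as a JSON document.
response = secrets.get_secret_value(SecretId="prod/orders-db")
creds = json.loads(response["SecretString"])

# Build connection parameters at runtime; never hardcode them in job config.
conn_params = {
    "host": creds["host"],
    "user": creds["username"],
    "password": creds["password"],
    "dbname": creds["dbname"],
}
```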

Task 4.2 - Apply authorization mechanisms

  • Differentiate authorization methods (role-based, policy-based, tag-based, attribute-based) and map them to AWS services and use cases.
  • Apply the principle of least privilege by scoping actions and resources to the minimum required for each pipeline component.
  • Implement role-based access control (RBAC) patterns that match expected access patterns for producers, consumers, and administrators.
  • Protect data from unauthorized access across services using IAM conditions, resource policies, tag-based controls, and network isolation.
  • Create custom IAM policies when managed policies do not meet requirements, including using conditions for least privilege (a policy sketch follows this list).
  • Store application and database credentials securely using AWS Secrets Manager or AWS Systems Manager Parameter Store.
  • Provide database users, groups, and roles appropriate access in databases (for example, Amazon Redshift) aligned to separation of duties.
  • Manage permissions through AWS Lake Formation for Amazon S3 data and integrated engines such as Redshift, EMR, and Athena.
  • Use Lake Formation governance features such as LF-tags to scale permission management across many datasets.
  • Implement cross-account access patterns safely using role assumption and resource sharing while maintaining governance boundaries.
  • Troubleshoot authorization failures by distinguishing IAM policy issues from resource policy and Lake Formation permission issues.
  • Audit and periodically review permissions, including identifying overly permissive policies and applying remediation.
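
A boto3 sketch of a custom least-privilege policy scoped to one prefix and gated on a VPC endpoint condition. The bucket, prefix, endpoint ID, and policy name are placeholders.

```python
import json

import boto3

iam = boto3.client("iam")

# Read-only access to one prefix in one bucket, and only through the pipeline's
# VPC endpoint (placeholder ID); everything else is implicitly denied.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-curated-bucket/orders/*",
            "Condition": {"StringEquals": {"aws:SourceVpce": "vpce-0abc1234def567890"}},
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-curated-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["orders/*"]}},
        },
    ],
}

iam.create_policy(
    PolicyName="orders-consumer-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```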

Task 4.3 - Ensure data encryption and masking

  • Select data encryption options available in AWS analytics services such as Amazon Redshift, Amazon EMR, and AWS Glue and apply encryption by default.
  • Differentiate client-side encryption and server-side encryption and choose based on trust boundaries and operational requirements.
  • Protect sensitive data by selecting appropriate anonymization, masking, tokenization, and salting approaches.
  • Apply data masking and anonymization according to compliance laws and company policies and verify that controls are effective.
  • Use AWS KMS keys to encrypt and decrypt data and manage key rotation and access grants safely.
  • Enable encryption in transit (TLS) for data movement and service connections across ingestion, processing, and analytics layers.
  • Configure encryption across AWS account boundaries (for example, cross-account KMS use) while maintaining least privilege.
  • Configure Amazon S3 encryption (SSE-KMS), bucket policies, and default encryption for data lake storage (see the sketch after this list).
  • Enable encryption at rest and in transit for Amazon Redshift and understand how encryption interacts with data loading and sharing.
  • Apply encryption and security controls to streaming and migration integrations (for example, Kinesis, MSK, and DMS) at a high level.
  • Implement column-level or field-level masking patterns to limit exposure of sensitive attributes in analytics outputs.
  • Design audit-ready evidence for encryption and masking controls, including key usage logs and periodic validation checks.
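
A boto3 sketch of bucket-level default encryption with SSE-KMS. The bucket name and KMS key ARN are placeholders, and enabling the bucket key is a cost optimization rather than a requirement.

```python
import boto3

s3 = boto3.client("s3")

# Default encryption with a customer managed KMS key (placeholder ARN);
# S3 Bucket Keys reduce KMS request costs for high-volume data lakes.
s3.put_bucket_encryption(
    Bucket="example-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/1111aaaa-22bb-33cc-44dd-5555eeee6666",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)
```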

Task 4.4 - Prepare logs for audit

  • Define what application events to log for data pipelines (parameters, row counts, errors) while minimizing sensitive data exposure.
  • Log access to AWS services used by data platforms and understand how CloudTrail events support investigations and audits.
  • Design centralized logging for AWS logs, including storing logs in durable locations with retention and access controls.
  • Use AWS CloudTrail to track API calls that affect data systems, including changes to permissions, configurations, and data access.
  • Store application logs in Amazon CloudWatch Logs and configure retention, encryption, and controlled access (see the sketch after this list).
  • Use AWS CloudTrail Lake for centralized logging queries and generate audit evidence from event queries.
  • Analyze logs using services such as Athena, CloudWatch Logs Insights, and Amazon OpenSearch Service for audit and incident response needs.
  • Integrate AWS services to process large volumes of log data (for example, using Amazon EMR when needed) and maintain scalability.
  • Implement log integrity and immutability patterns (write-once storage, access separation) for audit readiness.
  • Create dashboards and alarms for suspicious activity and policy violations using monitored audit signals.
  • Implement separation of duties by controlling who can write, change, and read audit logs and audit configurations.
  • Document audit procedures and evidence mapping to compliance requirements and keep logs accessible for required retention periods.
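
A boto3 sketch of two audit-readiness steps: setting CloudWatch Logs retention and querying CloudTrail Lake for permission changes. The log group name, retention period, and event data store ID are placeholders, and CloudTrail Lake must already be set up.

```python
import boto3

logs = boto3.client("logs")
cloudtrail = boto3.client("cloudtrail")

# Keep one year of pipeline application logs for an assumed audit requirement.
logs.put_retention_policy(
    logGroupName="/data-pipelines/orders",
    retentionInDays=365,
)

# CloudTrail Lake query listing recent IAM changes; replace the placeholder
# <event-data-store-id> with your event data store ID.
cloudtrail.start_query(
    QueryStatement=(
        "SELECT eventTime, userIdentity.arn, eventName "
        "FROM <event-data-store-id> "
        "WHERE eventSource = 'iam.amazonaws.com' "
        "AND eventTime > '2024-01-01 00:00:00'"
    )
)
```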

Task 4.5 - Understand data privacy and governance

  • Identify and protect personally identifiable information (PII) and other sensitive data types throughout ingestion, storage, processing, and consumption.
  • Explain data sovereignty and how regional restrictions affect storage, processing, backups, and replication strategies.
  • Grant permissions for data sharing (for example, Amazon Redshift data sharing) while controlling scope, consumers, and auditability.
  • Implement PII identification using Amazon Macie and integrate findings with Lake Formation governance controls.
  • Implement data privacy strategies that prevent backups or replications of data to disallowed AWS Regions.
  • Manage configuration changes using AWS Config to detect, track, and remediate governance drift in data platforms (see the Config rule sketch after this list).
  • Apply governance frameworks using Lake Formation, tags, and catalog metadata to manage dataset ownership and access.
  • Implement data classification and labeling strategies to drive access controls, retention policies, and monitoring priorities.
  • Define purpose limitation and consent-aware data usage policies (where applicable) and document intended uses for datasets.
  • Maintain auditability of governance decisions by logging policy changes, approvals, and access reviews.
  • Apply multi-account governance patterns at a high level (for example, Organizations guardrails and separation of duties) for data platforms.
  • Monitor and remediate governance drift (permissions creep and misconfigurations) using automated checks and operational processes.
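
A boto3 sketch of automated drift detection with an AWS Config managed rule that flags unencrypted S3 buckets. It assumes AWS Config recording is already enabled in the account; the rule name is a placeholder.

```python
import boto3

config = boto3.client("config")

# Managed rule: flag S3 buckets without default server-side encryption,
# one example of detecting governance drift automatically.
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-default-encryption-enabled",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED",
        },
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
    }
)
```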

Tip: After each task, write 5–10 “one-liner rules” from your misses (service selection + trade-offs).