AWS DEA-C01 Practice Test: Data Engineer Associate

Prepare for AWS Certified Data Engineer Associate (DEA-C01) with free sample questions, a full-length diagnostic, topic drills, timed practice, and detailed explanations covering ingestion, transformation, storage, operations, and governance in IT Mastery.

DEA-C01 is AWS’s Data Engineer Associate certification for candidates who need to design and operate ingestion, transformation, storage, monitoring, and governance workflows across modern AWS data platforms. If you are searching for DEA-C01 sample questions, a practice test, mock exam, or exam simulator, this is the main IT Mastery page: start on the web and continue on iOS or Android with the same IT Mastery account.

Interactive Practice Center

Start a practice session for AWS Certified Data Engineer - Associate (DEA-C01) below, or open the full app in a new tab for the best experience and navigate with swipes/gestures or the mouse wheel—just like on your phone or tablet.

Open Full App in a New Tab

A small set of questions is available for free preview. Subscribers can unlock full access by signing in with the same app-family account they use on web and mobile.

Prefer to practice on your phone or tablet? Download the IT Mastery – AWS, Azure, GCP & CompTIA exam prep app for iOS or the IT Mastery app on Google Play (Android) and use the same IT Mastery account across web and mobile.

Free diagnostic: Try the 65-question AWS DEA-C01 full-length practice exam before subscribing. Use it as one data-engineering baseline, then return to IT Mastery for timed mocks, topic drills, explanations, and the full Data Engineer Associate question bank.

What this DEA-C01 practice page gives you

  • a direct route into IT Mastery practice for DEA-C01
  • topic drills and mixed sets across ingestion, transformation, storage, operations, and governance
  • detailed explanations that show why the best AWS data-engineering answer is correct
  • a clear free-preview path before you subscribe
  • the same IT Mastery account across web and mobile

DEA-C01 exam snapshot

  • Vendor: AWS
  • Official exam name: AWS Certified Data Engineer - Associate (DEA-C01)
  • Exam code: DEA-C01
  • Items: 65 total
  • Exam time: 130 minutes
  • Question types: multiple-choice and multiple-response
  • Passing score: 720 scaled

DEA-C01 questions usually reward the option that delivers a replayable, governable, and cost-aware data platform decision rather than a narrow service-first answer.

Topic coverage for DEA-C01 practice

  • Data Ingestion and Transformation: 34%
  • Data Store Management: 26%
  • Data Operations and Support: 22%
  • Data Security and Governance: 18%

DEA-C01 data-platform decision filters

Use these filters when several AWS data services could technically work:

  • Batch vs streaming: identify whether the workload needs scheduled processing, near-real-time ingestion, event streaming, replay, or low-latency analytics.
  • Storage format and layout: look for partitioning, compression, cataloging, file format, schema evolution, and query-pattern clues.
  • Replayability and recovery: prefer designs that can reprocess source data, recover failed steps, and preserve lineage when the scenario requires auditability.
  • Governance boundary: apply encryption, Lake Formation permissions, catalog controls, IAM, data masking, and account boundaries where sensitive data is involved.
  • Operational signal: distinguish pipeline orchestration, monitoring, cost optimization, data-quality checks, and failure handling from pure storage selection.

DEA-C01 readiness map

  • Data ingestion and transformation: You can choose batch, streaming, ETL, ELT, event-driven, and orchestration patterns that match latency and replay needs.
  • Data store management: You can design S3, Glue Data Catalog, Redshift, Athena, OpenSearch, and database storage choices around query and governance requirements.
  • Data operations and support: You can reason through monitoring, retries, data quality, job failures, cost, performance, and pipeline observability.
  • Data security and governance: You can apply least privilege, encryption, auditing, cross-account sharing, and fine-grained access controls to analytics workflows.

How to use the DEA-C01 simulator efficiently

  1. Start with domain drills so you can isolate whether your misses come from ingestion patterns, storage design, operations, or governance.
  2. Review every miss until you can explain why the best answer fits throughput, latency, cost, replayability, and security constraints better than the alternatives.
  3. Move into mixed sets once you can switch between batch, streaming, catalog, partitioning, monitoring, and fine-grained permission scenarios without hesitation.
  4. Finish with timed runs so the 130-minute pace feels normal before exam day.

Final 7-day DEA-C01 practice sequence

  • Day 7: Take the free full-length diagnostic and separate misses into ingestion, storage, operations, and governance buckets.
  • Day 6: Drill ingestion, streaming, ETL/ELT, orchestration, schema, and transformation decisions.
  • Day 5: Drill S3 layout, partitions, file formats, catalogs, Redshift, Athena, and query-performance scenarios.
  • Day 4: Drill data quality, monitoring, retries, cost controls, job failures, and pipeline support cases.
  • Day 3: Drill security, Lake Formation, sharing, encryption, IAM, audit, and cross-account access.
  • Day 2: Complete a timed mixed set and explain the data-flow trade-off behind every miss.
  • Day 1: Review weak service pairs and patterns; avoid cramming unfamiliar analytics features.

When DEA-C01 practice is enough

If several unseen mixed attempts are above roughly 75% and you can explain the data-platform trade-off in each miss, it is usually better to take the exam than keep repeating questions. Readiness means you can choose a reliable, governable AWS data pattern under time pressure.

Focused sample questions

Use these child pages when you want focused IT Mastery practice before returning to mixed sets and timed mocks.

Free study resources

Need concept review first? Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks, topic drills, and full IT Mastery practice.

Free preview vs premium

  • Free preview: a smaller web set so you can validate the question style and explanation depth.
  • Premium: the full DEA-C01 practice bank, focused drills, mixed sets, timed mock exams, detailed explanations, and progress tracking across web and mobile.

24 DEA-C01 sample questions with detailed explanations

These are original IT Mastery practice questions aligned to DEA-C01 data ingestion, storage, processing, transformation, orchestration, governance, monitoring, and security decisions. They are not AWS exam questions and are not copied from any exam sponsor. Use them to check readiness here, then continue in IT Mastery with mixed sets, topic drills, and timed mocks.

Question 1

Topic: Content Domain 4: Data Security and Governance

A data producer team uses Amazon Redshift data sharing to share a curated set of tables with a consumer team in a different Amazon Redshift namespace. Which statement is INCORRECT about how permissions and access boundaries work for this setup?

  • A. After creating a database from the datashare, the consumer must grant its own users access to query it.
  • B. Granting USAGE on a datashare allows the consumer to query any object in the producer database.
  • C. The producer can authorize a consumer by granting USAGE on the datashare to the consumer namespace.
  • D. Consumers have read-only access to shared objects; they cannot modify producer data through the share.

Best answer: B

Explanation: In Redshift data sharing, consumer access is limited to the objects explicitly added to the datashare. Granting USAGE on the datashare authorizes the consumer namespace to create a consumer database from the share, but it does not broaden access to all producer objects. Shared objects are read-only for consumers, and consumer-side user permissions are managed within the consumer namespace.

Amazon Redshift data sharing enforces clear producer/consumer boundaries: a consumer can only see and query the objects that the producer adds to a datashare. The producer grants access to the datashare (for example, GRANT USAGE ON DATASHARE...) to a specific consumer namespace/account, and the consumer creates a database from that datashare.

Access control is split across two planes:

  • Producer controls what is shared by adding/removing objects and authorizing consumer namespaces.
  • Consumer controls which of its users/roles can use the created database and query the shared schemas.

A key security property is that consumers get read-only access to shared objects and cannot use the share to access non-shared producer data.
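
As a hedged illustration of this two-plane flow, the sketch below drives both sides through the Redshift Data API from Python. The cluster identifiers, namespace GUIDs, object names, and secret ARN are placeholder assumptions, not values from the question.

    import boto3

    rsd = boto3.client("redshift-data")

    def run(sql: str, cluster: str, database: str) -> None:
        """Submit one SQL statement through the Redshift Data API."""
        rsd.execute_statement(
            ClusterIdentifier=cluster,
            Database=database,
            SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin",  # placeholder
            Sql=sql,
        )

    # Producer plane: share only selected objects, then authorize the
    # consumer namespace (GUID below stands in for the consumer's).
    run("CREATE DATASHARE curated_share;", "producer-cluster", "prod_db")
    run("ALTER DATASHARE curated_share ADD SCHEMA curated;", "producer-cluster", "prod_db")
    run("GRANT USAGE ON DATASHARE curated_share TO NAMESPACE "
        "'11111111-2222-3333-4444-555555555555';", "producer-cluster", "prod_db")

    # Consumer plane: create a read-only database from the share (GUID
    # stands in for the producer's namespace), then grant local users
    # access to query it (answer choice A).
    run("CREATE DATABASE shared_db FROM DATASHARE curated_share OF NAMESPACE "
        "'66666666-7777-8888-9999-000000000000';", "consumer-cluster", "consumer_db")
    run("GRANT USAGE ON DATABASE shared_db TO analyst_user;", "consumer-cluster", "consumer_db")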


Question 2

Topic: Content Domain 3: Data Operations and Support

A data pipeline is orchestrated with AWS Step Functions. It runs a Glue job and then runs an Athena query only if the Glue job succeeds. Several executions have been failing.

Exhibit: Step Functions execution event (excerpt)

1 Type: TaskStateEntered State: StartGlueJob
2 Type: TaskFailed State: StartGlueJob
3 Error: ThrottlingException
4 Cause: Rate exceeded
5 Type: ExecutionFailed

Based on the exhibit, what is the best next step to make the workflow more resilient while keeping the dependency that Athena runs only after the Glue job completes successfully?

  • A. Trigger the Athena query from the Glue job using a job bookmark
  • B. Add a Step Functions Retry policy on StartGlueJob for ThrottlingException with backoff
  • C. Increase the Glue job worker count to reduce throttling
  • D. Replace Step Functions with an EventBridge schedule for each step

Best answer: B

Explanation: The failure is caused by API throttling, not a deterministic data error. Step Functions can handle transient failures by retrying the failed task with an exponential backoff and a maximum attempt count. This preserves the dependency chain because downstream states (Athena) are reached only after the Glue task succeeds.

The core issue is a transient service/API throttling error, shown by Error: ThrottlingException in the execution history (exhibit line 3) and the immediate ExecutionFailed (line 5). In Step Functions, the right way to improve resiliency for transient, retryable errors is to add a Retry block to the failing task state so the workflow automatically retries with backoff and a capped number of attempts.

This keeps dependencies intact because Step Functions will only transition to the next state (running Athena) after the StartGlueJob task returns success; retries happen within the same state until success or the retry policy is exhausted. Key takeaway: handle transient dependency-step failures in the orchestrator with targeted retries rather than redesigning the pipeline.
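
A minimal sketch of what such a Retry block might look like in the state machine definition (Amazon States Language, expressed here as a Python dict); the state names, job name, and retry numbers are illustrative assumptions.

    # Retry StartGlueJob on the throttling error seen in the exhibit,
    # with exponential backoff and a capped attempt count.
    state_machine = {
        "StartAt": "StartGlueJob",
        "States": {
            "StartGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",  # wait for job completion
                "Parameters": {"JobName": "nightly-etl"},  # hypothetical job name
                "Retry": [
                    {
                        "ErrorEquals": ["ThrottlingException"],
                        "IntervalSeconds": 10,  # first wait before retrying
                        "BackoffRate": 2.0,     # exponential backoff multiplier
                        "MaxAttempts": 5,       # cap total attempts
                    }
                ],
                # Reached only after the Glue task succeeds, preserving the dependency.
                "Next": "RunAthenaQuery",
            },
            "RunAthenaQuery": {
                "Type": "Task",
                "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
                "Parameters": {"QueryString": "SELECT 1", "WorkGroup": "primary"},
                "End": True,
            },
        },
    }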


Question 3

Topic: Content Domain 3: Data Operations and Support

A pipeline ingests application events to Amazon S3 (JSON) every 5 minutes. An AWS Glue Spark job runs hourly to join events to a small reference dataset and compute per-customer hourly aggregates, then writes Parquet to S3 partitioned by event_date for Amazon Athena queries.

The Glue job frequently exceeds its 30-minute SLA and shows one or two shuffle tasks running much longer than the rest. Investigation finds one customer ID accounts for ~55% of events in many hours (a hot key), causing highly unbalanced partitions during the groupBy customer_id aggregation.

Which change will BEST mitigate the skew and improve runtime reliability without changing the aggregate results?

  • A. Increase Glue worker count to add more executors
  • B. Salt customer_id before aggregation, then re-aggregate by customer_id
  • C. Enable Glue job bookmarks for incremental processing
  • D. Partition the output dataset by customer_id instead of event_date

Best answer: B

Explanation: The slowdown is caused by data skew from a hot key, where one customer_id dominates a shuffle partition and creates long-running straggler tasks. Salting the key spreads that customer’s records across multiple partitions for the heavy shuffle stage. A final rollup removes the salt so the output remains semantically identical while improving runtime reliability.

Data skew happens when a shuffle key (for example, customer_id) is not evenly distributed, so one reducer/partition receives most of the rows and becomes a straggler. In Spark-based AWS Glue jobs, this often appears as a small number of shuffle tasks running far longer than the rest and driving overall job runtime and failures.

A common mitigation is key salting for the skewed operation:

  • Add a derived key such as customer_id + salt (where salt is a small random or hash bucket).
  • Aggregate or join using the salted key to distribute the hot customer across many partitions.
  • Perform a second aggregation by the original customer_id to produce identical final results.

This improves performance and operability at the cost of an extra aggregation step and some additional shuffle, but it directly targets the hot-key bottleneck.
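
A minimal PySpark sketch of the two-stage salted aggregation, assuming hypothetical column names (customer_id, amount) and an illustrative S3 path:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("salted-aggregation").getOrCreate()

    events = spark.read.parquet("s3://example-bucket/events/")  # hypothetical path

    NUM_SALTS = 16  # spreads the hot key across up to 16 shuffle partitions

    # Stage 1: aggregate by (customer_id, salt) so the hot customer's rows
    # are split across many tasks instead of landing in one straggler.
    salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
    partial = salted.groupBy("customer_id", "salt").agg(
        F.sum("amount").alias("partial_amount"),
        F.count("*").alias("partial_count"),
    )

    # Stage 2: roll up the partials by the original key; the salt
    # disappears, so the final output is semantically identical.
    final = partial.groupBy("customer_id").agg(
        F.sum("partial_amount").alias("total_amount"),
        F.sum("partial_count").alias("event_count"),
    )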


Question 4

Topic: Content Domain 2: Data Store Management

A company stores clickstream data as partitioned Parquet files in an Amazon S3 data lake (AWS Glue Data Catalog is in place).

Two new workloads must be supported:

  1. Analysts run interactive, ad hoc SQL a few times per day on S3 data and sometimes need to join to an Amazon Aurora PostgreSQL table without copying the Aurora data. The company prefers a serverless, pay-per-query model.
  2. A nightly job must sessionize and deduplicate ~10 TB using Apache Spark with third-party JARs and custom Spark configuration.

Which TWO AWS services best meet these requirements? (Select TWO.)

  • A. Amazon Athena with AWS Glue Data Catalog and Athena federated queries
  • B. Amazon Redshift Spectrum to query S3 without loading data into tables
  • C. Amazon Redshift provisioned cluster with COPY from S3 for analytics
  • D. AWS Glue ETL job (Spark) for the nightly processing
  • E. Amazon Athena CTAS/INSERT queries to perform the nightly sessionization
  • F. Amazon EMR running Apache Spark to process data from S3

Best answers: A, F

Explanation: Interactive, sporadic SQL directly on S3 with a serverless, pay-per-query model and the ability to query Aurora without data movement maps to Amazon Athena with federated queries. Large-scale nightly Spark processing that requires custom JARs and Spark tuning maps to Amazon EMR.

Match the query pattern and operational model to the engine. For interactive SQL on data in S3 with minimal ops and pay-per-query billing, Amazon Athena is the best fit; it uses the AWS Glue Data Catalog and supports federated queries (via connectors) so Aurora data can be queried without copying it into the lake.

For the nightly 10 TB sessionization and deduplication, the requirement is batch Spark with third-party JARs and custom Spark configuration. Amazon EMR provides managed clusters for Spark where you can control Spark/cluster settings and dependencies, making it a better fit than SQL-only approaches.

Redshift (including Spectrum) is strongest when you want a persistent warehouse for high-concurrency, low-latency analytics rather than sporadic pay-per-query access.


Question 5

Topic: Content Domain 1: Data Ingestion and Transformation

When orchestrating a serverless data ingestion workflow (Amazon EventBridge → AWS Step Functions → AWS Lambda/AWS Glue), which THREE statements are FALSE/UNSAFE assumptions or design choices? (Select THREE.)

  • A. Buffer bursty ingestion with SQS before invoking Lambda to smooth concurrency spikes
  • B. Set Lambda reserved concurrency to 0 to pause processing; invocations will queue until increased
  • C. Store large intermediate results in S3 and pass object keys (URIs) between workflow steps
  • D. Pass full 50-MB JSON payloads between Step Functions states to avoid S3
  • E. EventBridge delivery is exactly-once and ordered, so downstream functions do not need idempotency
  • F. Use Step Functions Retry/Catch with backoff for transient Lambda errors

Best answers: B, D, E

Explanation: Serverless orchestration requires designing for service limits and for at-least-once delivery behavior. Step Functions should pass small state payloads (pointers to data), not large datasets. Event sources and Lambda throttling can create retries/duplicates rather than durable queuing, so workflows should be idempotent and use explicit buffers such as SQS when needed.

The core idea is to design Step Functions-based pipelines around small control-plane messages and explicit failure/pressure handling. Step Functions state is best used to carry identifiers (like S3 object keys) and execution context, while the data-plane payload lives in durable storage.

Also assume at-least-once delivery for event-driven invocation paths: duplicates and replays can occur, so ingestion and transformation steps should be idempotent (for example, use deterministic S3 keys, conditional writes, or de-duplication records). Finally, Lambda reserved concurrency controls throughput by throttling; it does not create a backlog. If you need buffering during spikes, place SQS/Kinesis between producers and consumers and use retries/DLQs and Step Functions Retry/Catch for controlled recovery.

The safe pattern is explicit buffering plus retries, not relying on implicit ordering or queuing guarantees.


Question 6

Topic: Content Domain 3: Data Operations and Support

Which AWS service records account activity by capturing API calls (who did what, when, and from where) to support auditing and incident investigations across data platform services such as Amazon S3, AWS Glue, and AWS Lake Formation?

  • A. Amazon CloudWatch Logs
  • B. AWS Config
  • C. VPC Flow Logs
  • D. AWS CloudTrail

Best answer: D

Explanation: AWS CloudTrail provides an event record of AWS API activity across your account, which is the primary data source for auditing and incident investigation. It captures details such as the calling identity, timestamp, source IP, and the API action taken, and can deliver events to Amazon S3 or CloudWatch Logs for retention and analysis.

The core concept is API auditing: CloudTrail records AWS control-plane API calls (management events) and, for supported services, can also record data-plane activity (data events) such as Amazon S3 object-level operations. In a data platform, this lets you investigate changes like bucket policy updates, Glue Data Catalog modifications, Lake Formation permission grants, or other service actions by reviewing CloudTrail events.

CloudTrail answers “who did what, when, and from where” by logging event fields such as eventSource, eventName, userIdentity, sourceIPAddress, and requestParameters, and it can deliver logs to S3 for long-term retention and query (for example, with Athena). The closest confusion is AWS Config, which tracks resource configuration state and changes, not a complete record of API calls.


Question 7

Topic: Content Domain 1: Data Ingestion and Transformation

A data engineering team runs an hourly serverless pipeline orchestrated with AWS Step Functions. Each run processes up to 3,000 customer prefixes in Amazon S3 by invoking AWS Lambda, then writes curated output back to S3.

Requirements:

  • The per-customer transformation must run at most once per hour (downstream billing is not idempotent).
  • The AWS account has 1,000 available Lambda concurrency for this workload.
  • The workflow must handle retries for transient errors without creating runaway cost.

Which TWO design choices should the team AVOID?

  • A. Express workflow Map; rely on retries; no idempotency
  • B. Use DynamoDB conditional writes for per-customer idempotency
  • C. Standard workflow: Distributed Map, MaxConcurrency 200, retries
  • D. Run one Glue job hourly with bookmarks; alert on failures
  • E. Send items to SQS; Lambda consumers with reserved concurrency 200
  • F. Distributed Map with no MaxConcurrency limit (3,000 parallel)

Best answers: A, F

Explanation: To keep a serverless workflow reliable, you must control fan-out to stay within Lambda concurrency and choose orchestration semantics that match correctness requirements. Step Functions Express is at-least-once, so it can duplicate work unless the pipeline is explicitly idempotent. Unbounded parallelism can also trigger throttling and retries that inflate cost and risk missing the hourly SLA.

The core concerns are (1) execution semantics during retries and (2) concurrency limits during fan-out. Step Functions Express workflows are designed for high-volume, short-duration use cases and provide at-least-once execution, meaning a task can run more than once; when the business requirement is “at most once” and the downstream operation is not idempotent, you must not depend on retries alone for correctness.

Separately, fanning out 3,000 parallel Lambda invocations can exceed the stated 1,000 available concurrency, leading to TooManyRequestsException throttles, cascading retries, longer runtimes, and higher Step Functions/Lambda costs. Safer patterns include setting MaxConcurrency in a Map/Distributed Map state, buffering work with SQS plus reserved concurrency, and enforcing idempotency (for example, DynamoDB conditional writes) so retries do not create duplicate billing-impacting outputs.
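
A minimal sketch of the DynamoDB conditional-write idempotency guard mentioned above; the table name and key schema are assumptions (a table with a string partition key run_key).

    import boto3
    from botocore.exceptions import ClientError

    ddb = boto3.resource("dynamodb")
    table = ddb.Table("per-customer-runs")  # hypothetical table

    def claim_run(customer_id: str, hour: str) -> bool:
        """Return True only for the first caller in this customer-hour.

        The conditional write succeeds at most once per key, so retried
        or duplicated invocations skip the non-idempotent billing step.
        """
        try:
            table.put_item(
                Item={"run_key": f"{customer_id}#{hour}"},
                ConditionExpression="attribute_not_exists(run_key)",
            )
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # another invocation already claimed this run
            raise

    if claim_run("cust-123", "2026-02-24T10"):
        pass  # safe to run the per-customer transformation exactly once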


Question 8

Topic: Content Domain 3: Data Operations and Support

A company loads curated dimension and fact tables into Amazon Redshift using an AWS Glue ETL job. The company must ensure that customers.customer_id is unique and that orders.customer_id always references an existing customer.

Which statement is correct about where to enforce these checks?

  • A. Load to Amazon S3 and query with Athena, which enforces primary and foreign keys
  • B. Define PRIMARY KEY and FOREIGN KEY in Redshift to reject violating rows
  • C. Validate uniqueness and referential integrity in the ETL before loading to Redshift
  • D. Use AWS Glue Data Catalog schema to enforce uniqueness and foreign keys at write time

Best answer: C

Explanation: In Amazon Redshift, PRIMARY KEY/UNIQUE/FOREIGN KEY constraints are metadata and are not enforced during inserts or loads. Therefore, if the pipeline must guarantee uniqueness and referential integrity, the validation needs to occur in the ETL process (or another enforcing system) before data is loaded into Redshift.

The core decision is whether the target warehouse will actively enforce referential integrity and uniqueness. In Amazon Redshift, constraints such as PRIMARY KEY, UNIQUE, and FOREIGN KEY are not enforced; they are mainly used for query planning and documentation. That means Redshift will not automatically reject rows that violate these constraints during COPY/INSERT.

To ensure data quality for uniqueness and foreign-key relationships in this setup, implement checks in the ETL layer (for example, in AWS Glue/Spark):

  • Detect duplicate customer_id values before load.
  • Verify each orders.customer_id exists in the customer key set.

Key takeaway: when the warehouse does not enforce constraints, enforce these rules in ETL validation prior to loading.
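
As a hedged sketch of those two ETL checks in PySpark (column and path names are illustrative assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pre-load-validation").getOrCreate()

    # Hypothetical curated inputs staged before the Redshift load.
    customers = spark.read.parquet("s3://example-bucket/curated/customers/")
    orders = spark.read.parquet("s3://example-bucket/curated/orders/")

    # Check 1: customer_id must be unique in the dimension.
    dupes = customers.groupBy("customer_id").count().filter("count > 1")
    if dupes.limit(1).count() > 0:
        raise ValueError("Duplicate customer_id values; aborting load")

    # Check 2: every orders.customer_id must reference an existing customer.
    orphans = orders.join(customers, "customer_id", "left_anti")
    if orphans.limit(1).count() > 0:
        raise ValueError("Orphan orders.customer_id values; aborting load")

    # Only after both checks pass would the job COPY/write into Redshift.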


Question 9

Topic: Content Domain 4: Data Security and Governance

When troubleshooting access failures for AWS data services (for example, Amazon S3, AWS Glue, Amazon Athena) from resources in a VPC, which TWO statements are INCORRECT? (Select TWO.)

  • A. S3 bucket policies can deny access even with IAM Allow.
  • B. Security groups affect reachability, not IAM authorization decisions.
  • C. AccessDeniedException usually indicates a network ACL problem.
  • D. VPC endpoints remove the need for IAM permissions.
  • E. Endpoint policies can restrict access in addition to IAM.
  • F. Timeouts from private subnets often indicate missing NAT/endpoints.

Best answers: C, D

Explanation: Network controls (security groups, NACLs, routing, NAT, VPC endpoints) determine whether you can reach an AWS service endpoint, typically showing up as timeouts or connection errors. IAM and resource-based policies (like S3 bucket policies) determine whether an authenticated request is authorized, typically showing up as AccessDenied-type errors. VPC endpoints do not grant permissions; they only change/restrict the network path.

Troubleshooting starts by classifying the failure symptom: connectivity vs authorization. Network controls such as security groups, NACLs, routes, NAT gateways, and VPC endpoints govern whether traffic can reach the service endpoint; when these are wrong, clients commonly see timeouts, DNS/connect errors, or failed TCP/TLS handshakes. Authorization failures happen after a request reaches the service and are evaluated by IAM and resource policies.

Saying that AccessDeniedException “usually indicates a network ACL problem” is incorrect because AccessDenied is returned by the AWS service after policy evaluation. Saying that “VPC endpoints remove the need for IAM permissions” is also incorrect: endpoints provide private connectivity and can further restrict access via an endpoint policy, but IAM (and resource policies such as S3 bucket policies) must still allow the action. The other statements correctly describe these separations and layered policy controls.


Question 10

Topic: Content Domain 3: Data Operations and Support

A data team uses a nightly AWS Glue job to write Parquet files to Amazon S3 and queries the data using Amazon Athena. Users report frequent query failures.

A data engineer runs an Amazon CloudWatch Logs Insights query over the last 24 hours of the Glue job log group and gets the following aggregated results.

Exhibit: Logs Insights results (last 24 hours)

error | occurrences | sample_message
HIVE_PARTITION_SCHEMA_MISMATCH | 247 | Partition dt=2026-02-24 column customer_id is bigint; table is string
AccessDeniedException | 3 | Access denied for s3:GetObject
ThrottlingException | 1 | Rate exceeded

Which action is the best next step to reduce the recurring failures?

  • A. Standardize column types and update the Glue table schema
  • B. Increase Athena workgroup query timeout and add retries
  • C. Add missing IAM permissions for Athena to start queries
  • D. Request a service quota increase to prevent API throttling

Best answer: A

Explanation: The log aggregation indicates the dominant recurring failure is a Hive partition schema mismatch between stored data and the Athena/Glue table definition. Fixing schema drift (or separating versions) addresses the root cause and will eliminate most failures. The other errors occur only a few times and are unlikely to explain frequent failures.

The key skill is using log aggregation to identify the most frequent root-cause category. In the exhibit, HIVE_PARTITION_SCHEMA_MISMATCH has 247 occurrences, far exceeding the other errors, and the sample message explicitly states a type conflict: partition data has customer_id as bigint while the table expects string.

Best next step is to eliminate schema drift by ensuring the Glue ETL writes a consistent schema per table (or writing to a new location/table for the new schema) and updating the Glue Data Catalog table definition/partitions to match the stored Parquet schema. This targets the error that is clearly recurring in the logs.
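
For context, an aggregation like the exhibit's could come from a CloudWatch Logs Insights query such as the hedged sketch below; the log group name and the parse pattern are assumptions about the log format, not part of the question.

    import time
    import boto3

    logs = boto3.client("logs")

    QUERY = """
    fields @message
    | parse @message /(?<error>[A-Za-z_]+(Exception|MISMATCH))/
    | filter ispresent(error)
    | stats count(*) as occurrences by error
    | sort occurrences desc
    """

    query_id = logs.start_query(
        logGroupName="/aws-glue/jobs/output",  # common Glue log group; assumption
        startTime=int(time.time()) - 24 * 3600,
        endTime=int(time.time()),
        queryString=QUERY,
    )["queryId"]

    # Poll until the query finishes, then print the aggregated rows.
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(1)

    for row in result.get("results", []):
        print({f["field"]: f["value"] for f in row})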


Question 11

Topic: Content Domain 4: Data Security and Governance

A company is adopting a domain-oriented data ownership model (producer/consumer). Each business domain owns its curated datasets in an Amazon S3 data lake and registers tables in the AWS Glue Data Catalog. Other domains in separate AWS accounts need to query these datasets with Amazon Athena while the producing domain keeps control over who can access the data and under what conditions.

Which action best matches the core principle being applied?

  • A. Attach an S3 bucket policy that allows read access to any principal in the AWS Organization
  • B. Enforce write-once permissions on the raw zone prefixes to prevent updates
  • C. Copy curated data into each consumer account’s S3 bucket nightly
  • D. Use AWS Lake Formation to share producer-owned databases/tables cross-account with granular permissions (for example, tag-based access control) and consumer resource links

Best answer: D

Explanation: The scenario is about producer/consumer data sharing with domain-owned control over access. AWS Lake Formation is designed for governed data lake access, including cross-account sharing, fine-grained authorization, and centralized auditing. This directly implements data governance while preserving domain ownership of datasets and policies.

The core principle is data governance applied to a producer/consumer sharing pattern: the producing domain should be able to publish data products while controlling access and enabling consumers to discover and query them. AWS Lake Formation supports this by letting producers grant table-, column-, and row-level permissions and share those permissions across AWS accounts (often via Lake Formation resource sharing and consumer-side resource links), while maintaining a single governed copy of the data in S3.

This approach also improves auditability and consistency because access decisions are managed through a centralized permission system rather than ad hoc S3 policies or data duplication. The key takeaway is to enable cross-account consumption through governed sharing rather than copying data or broadly opening storage access.


Question 12

Topic: Content Domain 4: Data Security and Governance

A company is standardizing a four-level data classification model for its S3-based data lake: public, internal, confidential, and regulated. A new dataset was registered in the AWS Glue Data Catalog for Athena querying.

Exhibit: Glue table DDL (excerpt)

CREATE EXTERNAL TABLE claims_raw (
  claim_id string,
  patient_name string,
  ssn string,
  credit_card_number string
)
LOCATION 's3://acme-datalake-raw/claims/';

Based on the exhibit, which classification and high-level control mapping is most appropriate for this dataset?

  • A. Regulated; enforce least-privilege access; require SSE-KMS encryption.
  • B. Public; allow broad read access; no encryption required.
  • C. Confidential; allow IAM-only access; use SSE-S3 encryption.
  • D. Internal; restrict to corporate network paths; use SSE-S3.

Best answer: A

Explanation: The table schema includes highly sensitive identifiers (ssn and credit_card_number), which fits the regulated classification. Regulated data typically requires strict, auditable least-privilege access controls and customer-managed key encryption such as S3 SSE-KMS.

The core decision is to classify the dataset based on the sensitivity implied by the schema and then apply matching access and encryption controls. In the exhibit, the presence of ssn and credit_card_number in the Glue table definition indicates regulated data (highly sensitive personal/financial data).

At a high level, regulated data should map to:

  • Strong, least-privilege authorization (for example, Lake Formation or tightly scoped IAM/data access roles)
  • Encryption at rest using KMS keys you control (S3 SSE-KMS, ideally with a customer managed KMS key)
  • Clear auditability of access and changes (for example, CloudTrail/Lake Formation logs)

Key takeaway: the specific fields ssn and credit_card_number in the exhibit are the deciding indicators that push this to the regulated tier, not merely internal or confidential.


Question 13

Topic: Content Domain 2: Data Store Management

A company is building an Amazon S3 data lake with AWS Glue Data Catalog tables and Amazon Athena for SQL access. Data is ingested into a raw zone and transformed into a curated zone. Some curated tables contain PII columns that only a compliance team can access; analysts can access non-PII data. The company wants centralized governance using AWS Lake Formation.

Which design choice is NOT recommended for the table and permission model?

  • A. Grant analysts direct s3:GetObject access to curated S3 prefixes and skip Lake Formation registration
  • B. Separate raw and curated into different Glue databases and limit raw access to ETL roles
  • C. Register S3 data locations in Lake Formation and grant database/table permissions to IAM roles
  • D. Use Lake Formation column-level permissions to restrict PII columns to a compliance role

Best answer: A

Explanation: Lake Formation is intended to be the central policy enforcement point for governed tables, including controlling access to the underlying S3 data through registered data locations. Allowing principals to access curated objects directly in S3 undermines those controls and makes fine-grained table/column permissions ineffective.

In a governed data lake, Lake Formation permissions (database/table/column/row) should be the primary way consumers get access to datasets, with the underlying S3 locations registered so Lake Formation can enforce access consistently. If you grant broad or direct s3:GetObject permissions to data consumers and do not register the location with Lake Formation, users can bypass Data Catalog-based controls and retrieve objects outside the intended table/column permission model.

A sound high-level design is to:

  • Model datasets as Glue tables (often separated by zone/database).
  • Register S3 locations with Lake Formation.
  • Grant least-privilege access using Lake Formation (including column-level for PII).

Key takeaway: avoid permission paths that let users read S3 data directly outside Lake Formation governance.


Question 14

Topic: Content Domain 1: Data Ingestion and Transformation

When choosing an AWS orchestration service for a data pipeline based on dependency management and retry needs, which THREE statements are FALSE/UNSAFE? (Select three.)

  • A. Amazon MWAA (Apache Airflow) is well-suited for complex DAGs with task-level retries and backfills.
  • B. AWS Glue workflows provide the same fine-grained retries/timeouts and human-approval steps as Step Functions.
  • C. Amazon EventBridge is ideal for complex DAG dependencies with per-task retries and backoff.
  • D. AWS Step Functions cannot orchestrate AWS Glue jobs because it only works with AWS Lambda.
  • E. AWS Step Functions can orchestrate AWS Glue jobs using service integrations and Retry/Catch.
  • F. Amazon EventBridge is commonly used to trigger pipelines in an event-driven way, but not to model complex multi-step dependencies.

Best answers: B, C, D

Explanation: Use stateful orchestrators when you need explicit dependencies, durable state, and configurable retries per step (for example, Step Functions or Airflow/MWAA). EventBridge is best for event routing and simple triggering, not for modeling complex DAGs. Glue workflows coordinate Glue components but do not match Step Functions for fine-grained retry/timeouts and advanced control flow.

The key distinction is whether you need a stateful workflow engine (tracks step state and dependencies) versus an event bus (routes events).

Step Functions is purpose-built for multi-step orchestration with explicit state, dependency sequencing, and per-state Retry/Catch policies, and it can directly start and monitor many AWS services (including AWS Glue) through service integrations.

MWAA (Airflow) is also a strong fit for complex DAG scheduling, dependencies, and per-task retries/backfills, but it introduces an environment to operate (Airflow workers, scaling, upgrades).

EventBridge is typically used to trigger or fan out pipeline actions based on events, not to manage complex DAG dependencies with rich per-task retry behavior. Glue workflows coordinate Glue jobs/crawlers with triggers but have comparatively limited workflow control and error-handling granularity.

Choose the orchestration tool that matches the complexity of dependencies and the level of retry/control you must express.


Question 15

Topic: Content Domain 3: Data Operations and Support

An AWS Glue workflow starts a Glue ETL job on a fixed schedule of every 30 minutes to meet a 30-minute freshness SLA. The workflow’s run history shows many runs in WAITING while one run is RUNNING, with no job failures.

The Glue ETL job consistently takes 70 minutes to complete.

Assuming the runtime stays constant, what is the minimum Max concurrent runs value needed to prevent a growing queue of waiting workflow runs? Round up to the next whole number.

  • A. Set Max concurrent runs to 2
  • B. Set Max concurrent runs to 4
  • C. Set Max concurrent runs to 3
  • D. Set Max concurrent runs to 1

Best answer: C

Explanation: The workflow is queuing because the schedule interval is shorter than the job runtime, so runs overlap and wait for capacity. The minimum concurrency is the runtime divided by the interval, rounded up. With a 70-minute runtime and 30-minute schedule, the job needs 3 concurrent runs to keep the queue from growing.

This is a scheduling/worker-capacity symptom: when a job is triggered faster than it completes, new runs pile up in WAITING unless the service can run enough copies in parallel.

Compute the minimum parallelism needed:

  • One run arrives every 30 minutes.
  • Each run occupies a slot for 70 minutes.
  • Minimum concurrent slots needed is (runtime / interval), rounded up.

So: 70 / 30 = 2.33, which rounds up to 3. Setting Max concurrent runs to 3 allows the job to absorb the overlap and stop the backlog growth (assuming sufficient DPUs/account capacity).
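
The arithmetic as a one-liner sketch:

    import math

    runtime_minutes = 70   # constant Glue job runtime
    interval_minutes = 30  # workflow schedule interval

    # Minimum concurrent runs so the WAITING queue stops growing.
    max_concurrent_runs = math.ceil(runtime_minutes / interval_minutes)
    print(max_concurrent_runs)  # 3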


Question 16

Topic: Content Domain 3: Data Operations and Support

A data pipeline runs an AWS Glue ETL job every 15 minutes. The job is started by a scheduled AWS Lambda function (no VPC) that calls glue:StartJobRun. Operations reports that 1-2 runs per day are missed and must be started manually. Reliability requirement: the trigger must automatically retry for up to 60 minutes without adding servers to manage.

Exhibit: Lambda log excerpt

ERROR ThrottlingException: Rate exceeded
Task timed out after 3.00 seconds

Which change will fix the root cause with the least operational burden?

  • A. Increase the Lambda timeout to 1 minute and memory to 1,024 MB
  • B. Add more DPUs to the Glue job to reduce the chance of throttling
  • C. Use Amazon EventBridge Scheduler to invoke glue:StartJobRun directly with a retry policy and an SQS dead-letter queue
  • D. Move the schedule to a cron job on an Amazon EC2 instance and call glue:StartJobRun

Best answer: C

Explanation: The failures occur in the scheduling layer, not in the Glue job itself. The Lambda times out after a throttling error, so no subsequent retry occurs and the run is missed. Using EventBridge Scheduler to call StartJobRun provides a managed, serverless trigger with configurable retries (and optional DLQ) to meet the reliability requirement with minimal operations overhead.

Symptom: scheduled runs are occasionally missed, and the Lambda logs show ThrottlingException followed by a short Lambda timeout.

Root cause: the trigger mechanism (Lambda) is not reliably retrying the StartJobRun call when AWS API throttling happens; the function times out quickly, so the job start request is dropped.

Fix: replace the Lambda scheduler with a managed, serverless service that can invoke the Glue API and handle retries for you.

  • Configure Amazon EventBridge Scheduler with an AWS SDK target for glue:StartJobRun
  • Set a retry policy to cover up to 60 minutes
  • (Optional) Configure an SQS dead-letter queue for failed invocations

Key takeaway: use a managed scheduler with retries to improve reliability and reduce operational burden.
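
A hedged boto3 sketch of that EventBridge Scheduler setup; the schedule name, role and queue ARNs, and the Glue job name are placeholder assumptions.

    import boto3

    scheduler = boto3.client("scheduler")

    scheduler.create_schedule(
        Name="start-etl-every-15-min",
        ScheduleExpression="rate(15 minutes)",
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            # Universal target: call glue:StartJobRun directly, no Lambda needed.
            "Arn": "arn:aws:scheduler:::aws-sdk:glue:startJobRun",
            "RoleArn": "arn:aws:iam::123456789012:role/scheduler-glue-role",  # placeholder
            "Input": '{"JobName": "curated-etl-job"}',  # hypothetical job name
            "RetryPolicy": {
                "MaximumEventAgeInSeconds": 3600,  # keep retrying up to 60 minutes
                "MaximumRetryAttempts": 10,
            },
            "DeadLetterConfig": {
                "Arn": "arn:aws:sqs:us-east-1:123456789012:etl-trigger-dlq"  # placeholder
            },
        },
    )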


Question 17

Topic: Content Domain 2: Data Store Management

Which Amazon S3 feature provides object retention (WORM) by preventing an object version from being overwritten or permanently deleted until a defined retention period expires, helping protect against accidental deletions and support recovery?

  • A. Enable S3 Versioning on the bucket
  • B. Add an S3 Lifecycle rule to transition objects to Glacier
  • C. Enable MFA Delete on the bucket
  • D. Enable S3 Object Lock with a retention period

Best answer: D

Explanation: S3 Object Lock is the S3 capability specifically designed for retention: it applies WORM controls to object versions using a retention period (and optionally a legal hold). This prevents permanent deletion or overwrite until the retention constraint is removed or expires, which protects data from accidental or malicious deletions and supports recovery.

S3 Versioning helps with recovery by keeping multiple versions of an object; a “delete” typically creates a delete marker so prior versions can be restored. However, versioning alone does not stop a user from permanently deleting specific versions.

S3 Object Lock is the retention feature for S3. It enforces write-once-read-many behavior at the object version level by applying:

  • A retention period (Governance or Compliance mode)
  • Optional legal holds

With Object Lock, protected versions cannot be overwritten or permanently deleted until the retention rules allow it, which is the core concept behind object retention for protection and recovery. The key takeaway is that versioning enables rollback, while Object Lock enforces retention against permanent removal.
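
A minimal sketch of writing a retained object, assuming a bucket that already has Object Lock enabled; the bucket, key, and retention window are placeholders.

    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")

    s3.put_object(
        Bucket="audit-data-bucket",              # placeholder bucket with Object Lock enabled
        Key="raw/2026/02/24/events.parquet",     # placeholder key
        Body=b"...",
        ObjectLockMode="GOVERNANCE",             # or COMPLIANCE for stricter retention
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
    )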


Question 18

Topic: Content Domain 2: Data Store Management

You are troubleshooting Amazon Athena queries that use the AWS Glue Data Catalog for an S3-backed data lake. Which THREE statements are FALSE/UNSAFE guidance for diagnosing common cataloging issues (missing partitions, schema mis-detection, and permission errors)?

  • A. Run MSCK REPAIR TABLE to discover existing S3 partitions.
  • B. AWS Glue crawlers keep partitions up to date in near real time.
  • C. Schema mis-detection: review sampled files and use a classifier.
  • D. AccessDenied in Athena can be fixed with only S3 permissions.
  • E. Wrong folder naming/case can cause partitions to appear missing.
  • F. Athena returning zero rows usually means a timeout; increase timeout.

Best answers: B, D, F

Explanation: Cataloging problems in Athena typically come from metadata being out of sync (partitions not added), schema inferred incorrectly by a crawler, or missing permissions to Glue Data Catalog/Lake Formation. Crawlers are not continuous, permissions involve more than just S3, and “zero rows” usually indicates partition/filter/schema mismatch rather than timeouts. Effective troubleshooting focuses on validating partitions, schema inference inputs, and metadata access.

When Athena queries an S3 dataset through the Glue Data Catalog, failures usually map to three areas:

  • Partitions: ensure partitions exist in the catalog and match the S3 prefix and partition key values (including case and formatting), then add them with MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION.
  • Schema: Glue crawlers infer types from sampled files; schema drift, mixed file formats, or insufficient sampling can mis-detect columns, so review the input files and use appropriate crawler settings or custom classifiers.
  • Permissions: “AccessDenied” can be due to missing IAM permissions for Glue APIs (and Lake Formation grants if used), even when S3 GetObject is allowed.

The key idea is to validate metadata (partitions/schema) and metadata access, not to treat these as query-runtime timeout issues.
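
As a hedged illustration of the partition fixes in the first bullet, both statements can be issued through the Athena API; the database, table, locations, and results bucket below are placeholder assumptions.

    import boto3

    athena = boto3.client("athena")

    def run_athena(sql: str) -> str:
        resp = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": "analytics_db"},  # placeholder
            ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
        )
        return resp["QueryExecutionId"]

    # Discover all Hive-style partitions already present in S3 ...
    run_athena("MSCK REPAIR TABLE web_logs")

    # ... or add a single known partition explicitly (cheaper at scale).
    run_athena(
        "ALTER TABLE web_logs ADD IF NOT EXISTS "
        "PARTITION (dt='2026-02-24') LOCATION 's3://example-lake/web_logs/dt=2026-02-24/'"
    )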


Question 19

Topic: Content Domain 1: Data Ingestion and Transformation

A data engineering team uses AWS Step Functions to orchestrate a serverless ingestion pipeline. For each object that lands in an Amazon S3 landing/ prefix, the workflow validates the file with AWS Lambda and then starts downstream processing. The team expects occasional bursts (tens of thousands of objects in minutes) and must preserve an immutable raw copy for auditing and reprocessing.

Which approach should the team AVOID?

  • A. Write outputs using unique, idempotent S3 keys per input object
  • B. Use unlimited parallelism and overwrite a single S3 object in raw/
  • C. Buffer events with Amazon SQS and use a DLQ for failed messages
  • D. Set a Step Functions Map MaxConcurrency and add retries for throttling errors

Best answer: B

Explanation: In serverless orchestration, you must control concurrency and design for retries and idempotency. Allowing unbounded parallel tasks while overwriting a single object in the raw zone creates race conditions and data loss, and it increases throttling and partial-failure impact during bursts. An immutable raw layer should be append-only and reproducible.

A key serverless workflow principle is to bound concurrency to downstream limits and make each step safe to retry. With Step Functions, bursty fan-out (such as a Map state over many objects) can quickly hit AWS Lambda concurrency or other service limits, so you typically cap parallelism and add retries/backoff for transient failures.

For data platforms, the raw zone should be immutable (append-only) so you can audit and reprocess exactly what arrived. Tasks should write idempotently (unique keys/partitions) so retries or duplicate events don’t overwrite or corrupt prior data. Unbounded fan-out combined with overwriting a shared raw/ object is a classic anti-pattern: it introduces race conditions, loses lineage, and turns normal retries into destructive writes.

The safest designs combine bounded parallelism, durable buffering, and idempotent S3 writes.


Question 20

Topic: Content Domain 1: Data Ingestion and Transformation

You are configuring event-driven ingestion for files landing in Amazon S3 using Amazon S3 event notifications and Amazon EventBridge rules. The pipeline must tolerate at-least-once delivery.

Which TWO statements are FALSE or unsafe assumptions for this design?

  • A. S3 event notifications deliver exactly once, so consumers do not need deduplication.
  • B. Implement idempotent processing by tracking processed object identifiers (for example, key + version ID) in DynamoDB.
  • C. EventBridge guarantees events are delivered to targets in the same order they occurred.
  • D. If strict per-entity ordering is required, use SQS FIFO with a stable MessageGroupId and still keep processing idempotent.
  • E. Buffer S3 object-created events in Amazon SQS to absorb retries and downstream backpressure, while still handling duplicates.
  • F. For asynchronous Lambda invocations, configure an on-failure destination or DLQ to capture events after retries are exhausted.

Best answers: A, C

Explanation: Both S3 event notifications and EventBridge use at-least-once delivery, so duplicates are possible and ordering is not guaranteed. Safe event-driven ingestion designs therefore avoid assuming exactly-once or ordered delivery, and instead use idempotent consumers plus appropriate buffering and failure handling.

The core requirement is handling at-least-once semantics: event sources and intermediaries can retry, and the same logical event can be delivered more than once. With S3 event notifications, duplicate notifications can occur for the same object, so downstream processing must be idempotent (for example, record a processed object key plus version ID/ETag and ignore repeats). With EventBridge, delivery is also at-least-once and you should not assume a global ordering of events at targets.

Safe patterns include using SQS to buffer bursts and retries, configuring Lambda async failure handling (DLQ/destinations), and using SQS FIFO only when you need ordering within a message group, while still designing processing to be idempotent.


Question 21

Topic: Content Domain 1: Data Ingestion and Transformation

You are integrating multiple sources into a single analytics table (for example, app events, web events, and customer updates). You must choose join keys, align timestamps, and define deduplication rules.

Which TWO statements are INCORRECT/UNSAFE practices for this integration?

  • A. Deduplicate by keeping the first record seen for each key, regardless of event time or versioning
  • B. For deduplication, use a deterministic key (for example, event_id) plus an ordering field to keep a consistent winner
  • C. When dimensions change over time, join facts to the dimension version effective at the fact timestamp
  • D. Use received_time (ingestion time) as the timestamp for joining sources
  • E. Prefer stable, unique business identifiers (for example, user_id or customer_id) as join keys
  • F. Normalize timestamps to a single time zone (for example, UTC) before aligning records

Best answers: A, D

Explanation: Using ingestion/arrival time to align records is unsafe because differing delivery delays across sources break time-based joins. Deduplication also must be deterministic and resilient to late or corrected data; “keep the first one seen” commonly loses the correct version. Safe integration patterns rely on stable join keys, normalized timestamps, and explicit event-time/effective-time logic.

When integrating sources, the core idea is to join on identifiers that represent the same real-world entity and to align records using a timestamp that reflects business/event semantics, not pipeline behavior.

Event-time (or effective-time for slowly changing dimensions) should drive alignment because different sources can arrive late, out of order, or with different buffering. Therefore, normalize time zones (commonly to UTC) and join based on event_time/updated_at as appropriate. For duplicates, define a stable dedup key (such as event_id or a composite natural key) and a deterministic tie-breaker (such as highest version, latest event_time, or a source priority) so reruns and late arrivals converge to the same correct output. The main takeaway is to avoid using ingestion time and “first seen wins” rules, because both depend on unpredictable delivery timing.
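
A minimal PySpark sketch of that deterministic dedup rule, assuming hypothetical columns event_id, version, and event_time, and an illustrative source time zone:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("deterministic-dedup").getOrCreate()

    events = spark.read.parquet("s3://example-bucket/integrated/events/")  # hypothetical path

    # Normalize timestamps to UTC before aligning sources (source zone assumed).
    events = events.withColumn(
        "event_time_utc", F.to_utc_timestamp("event_time", "America/New_York")
    )

    # Keep one deterministic "winner" per event_id: highest version first,
    # then latest event time, so reruns and late arrivals converge.
    w = Window.partitionBy("event_id").orderBy(
        F.col("version").desc(), F.col("event_time_utc").desc()
    )
    deduped = (
        events.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )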


Question 22

Topic: Content Domain 2: Data Store Management

An Amazon Athena workgroup queries a Glue Data Catalog table clickstream stored in Amazon S3 as Parquet with Snappy compression. The table is partitioned as s3://.../year=YYYY/month=MM/day=DD/hour=HH/ and has about 300,000 partitions.

Users run many saved dashboards that filter only on event_ts (not the partition columns). Queries intermittently fail with this Athena error:

HIVE_METASTORE_ERROR: com.amazonaws.services.glue.model.ThrottlingException:
Rate exceeded (Service: AWSGlue; Operation: GetPartitions)

The data engineer must fix the root cause with minimal change and cannot modify the dashboards or rewrite the dataset. What should the data engineer do?

  • A. Create a Glue Data Catalog partition index on the partition keys
  • B. Run MSCK REPAIR TABLE clickstream on a schedule to refresh partitions
  • C. Repartition the dataset to use only event_date partitions (daily)
  • D. Change the S3 objects to GZIP-compressed CSV to reduce query scan time

Best answer: A

Explanation: The failure is caused by Glue Data Catalog GetPartitions throttling when Athena must enumerate a very large number of partitions for queries that do not filter on partition columns. Creating a Glue partition index is an operational change that improves partition metadata retrieval performance and reduces throttling. This preserves the existing dashboards and S3 dataset layout.

Athena relies on the Glue Data Catalog to retrieve partition metadata. Because the dashboards filter only on event_ts, Athena cannot prune partitions by year/month/day/hour and may need to enumerate a very large set of partitions, leading to GetPartitions throttling and query failures.

Creating a Glue partition index on the table’s partition keys is the minimal operational fix because it:

  • Optimizes partition metadata lookups in Glue for large partitioned tables
  • Reduces latency and throttling risk for GetPartitions
  • Requires no changes to saved queries/dashboards and no data rewrite

The key takeaway is that Glue partition indexes address partition-metadata scalability, whereas file format/compression changes or repartitioning are larger changes and don’t directly resolve Glue API throttling.
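
A hedged boto3 sketch of creating such an index; the database and index names are placeholders, and the index keys must be drawn from the table's partition keys in order.

    import boto3

    glue = boto3.client("glue")

    glue.create_partition_index(
        DatabaseName="analytics_db",  # placeholder
        TableName="clickstream",
        PartitionIndex={
            "IndexName": "by_time_parts",  # placeholder
            "Keys": ["year", "month", "day", "hour"],
        },
    )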


Question 23

Topic: Content Domain 1: Data Ingestion and Transformation

A data engineering team uses an LLM (Amazon Bedrock) inside an ETL job to summarize and categorize customer support tickets. The raw tickets can contain PII (names, emails, account numbers). The job writes results to an S3 curated zone for analytics.

Which THREE statements describe appropriate guardrails for this LLM-driven processing of sensitive data?

  • A. Rely on a deterministic temperature setting instead of adding review or filtering
  • B. Fine-tune the model on raw historical tickets to reduce the need for guardrails
  • C. Send full, unredacted tickets because TLS encrypts data in transit to the model
  • D. Redact or mask PII in prompts before invoking the LLM
  • E. Route flagged or low-confidence outputs to human review before publishing
  • F. Ground summaries using only an approved knowledge base and reject uncited claims

Best answers: D, E, F

Explanation: For sensitive data, apply layered guardrails around LLM calls: filter inputs to minimize exposed PII, ground responses to trusted data to constrain what the model can assert, and add human review gates for risky or uncertain outputs. These controls reduce both data leakage risk and incorrect content entering curated analytics datasets.

LLMs can leak sensitive data and generate unsupported text, so guardrails should be applied at multiple points in an ETL workflow handling PII. Start with input filtering to minimize or remove PII before the model sees it (for example, redaction/masking). Then constrain generation by grounding the model on approved enterprise data (RAG from a curated knowledge base) and validating that outputs are supported (such as requiring citations or rejecting ungrounded claims). Finally, use human review for items that are high risk (detected sensitive entities, policy violations) or low confidence before publishing results into a curated zone.

The key idea is to reduce sensitive exposure, constrain what the model is allowed to say, and prevent unsafe outputs from being automatically persisted.
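
As a simple illustration of the input-filtering step, the sketch below masks a few PII shapes before building the prompt. The regex patterns are simplistic assumptions, not a complete PII detector (managed options such as Amazon Comprehend PII detection go much further).

    import re

    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    }

    def redact(text: str) -> str:
        """Replace matched PII spans with typed placeholders."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    ticket = "Customer Jane Doe (jane@example.com, SSN 123-45-6789) reports..."
    prompt = f"Summarize this support ticket:\n{redact(ticket)}"
    # prompt now contains "[EMAIL]" and "[SSN]" instead of raw identifiers.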


Question 24

Topic: Content Domain 2: Data Store Management

A data platform team notices that over time their datasets have changed (more skewed key distribution, higher cardinality in some columns, and higher null rates). They want to adjust partitioning, indexing, and processing strategies at a high level.

Which THREE statements are INCORRECT/unsafe recommendations? (Select three.)

  • A. For skewed joins in Spark, consider broadcast joins for small dimension tables or salting to spread hot keys.
  • B. If a Redshift table becomes skewed on its distribution key, consider changing the distribution strategy/key and rebalancing the table.
  • C. Partition by a high-cardinality user_id to maximize partition pruning.
  • D. If null rates rise in a frequently filtered column, consider filtering nulls upstream or choosing a different partition key to preserve pruning.
  • E. After large data changes, you can skip refreshing statistics because query planners always infer current distributions at runtime.
  • F. Adding more partitions is always beneficial and has no meaningful downside, even if it creates many tiny partitions/files.

Best answers: C, E, F

Explanation: The incorrect statements ignore key tradeoffs in partitioning and query optimization. Extremely high-cardinality partition keys and excessive partition counts often harm performance due to metadata and small-file overhead. Also, when data distributions change, planners typically need refreshed statistics (and sometimes physical reorganization) to choose efficient join and scan strategies.

As data characteristics shift (skew, cardinality, null rates), you often need to revisit how data is laid out and how jobs execute. Partitioning is most effective when the key has manageable cardinality and aligns with common predicates; otherwise, you can create many tiny partitions/files and increase catalog/listing overhead.

Query optimizers generally do not “figure it out live” for every query; statistics (and, depending on the engine, rebalancing/reclustering) help the planner estimate cardinality and choose join and scan plans. For processing, mitigate skew by changing partitioning/repartitioning, using broadcast joins for small tables, or distributing hot keys (salting) rather than assuming more partitions alone will fix it.

The key is balancing pruning benefits against metadata and execution overhead.
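
A short PySpark sketch of two of these responses, a broadcast join for a small dimension and a statistics refresh for a cataloged table; the paths and table name are placeholder assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("skew-mitigation").getOrCreate()

    facts = spark.read.parquet("s3://example-bucket/facts/")     # large, skewed
    dims = spark.read.parquet("s3://example-bucket/dim_small/")  # small dimension

    # Broadcast the small side so the join avoids shuffling the skewed
    # fact table on the hot key.
    joined = facts.join(broadcast(dims), "dim_key")

    # After large data changes, refresh statistics on the cataloged table
    # so the planner sees current distributions.
    spark.sql("ANALYZE TABLE analytics_db.facts COMPUTE STATISTICS")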

DEA-C01 data engineering map

Use this map after the sample questions to connect individual items to the AWS data engineering pipeline, storage, security, and operations decisions these practice samples test.

    flowchart LR
      S1["Data source and requirement"] --> S2
      S2["Ingest batch or streaming data"] --> S3
      S3["Store raw and curated data"] --> S4
      S4["Transform, validate, and catalog datasets"] --> S5
      S5["Secure, govern, and monitor pipelines"] --> S6
      S6["Serve analytics or ML consumers"]

Quick Cheat Sheet

  • Ingestion: Match batch, streaming, CDC, file, API, and event sources to AWS services.
  • Storage layers: Separate raw lake storage, curated data, warehouse, and serving layers.
  • Transformation: Use Glue, EMR, Lambda, SQL, or managed services based on volume, latency, and complexity.
  • Governance: Track catalog, schema, lineage, encryption, IAM, Lake Formation, and data quality.
  • Operations: Monitor failures, retries, partitions, cost, freshness, and downstream impact.

Mini Glossary

  • CDC: Change data capture pattern for streaming source data changes.
  • Data catalog: Metadata inventory describing datasets, schemas, and locations.
  • Data lake: Storage architecture for large raw or curated datasets.
  • ETL: Extract, transform, load data integration pattern.
  • Partition: Data layout technique that improves filtering and processing efficiency.

Revised on Friday, May 15, 2026