Prepare for AWS Certified Data Engineer Associate (DEA-C01) with free sample questions, a full-length diagnostic, topic drills, timed practice, ingestion, transformation, storage, operations, governance, and detailed explanations in IT Mastery.
DEA-C01 is AWS’s Data Engineer Associate certification for candidates who need to design and operate ingestion, transformation, storage, monitoring, and governance workflows across modern AWS data platforms. If you are searching for DEA-C01 sample questions, a practice test, mock exam, or exam simulator, this is the main IT Mastery page to start on web and continue on iOS or Android with the same IT Mastery account.
Start a practice session for AWS Certified Data Engineer - Associate (DEA-C01) below. For the best experience, open the full app in a new tab and navigate with swipes/gestures or the mouse wheel, just like on your phone or tablet.
Open Full App in a New Tab
A small set of questions is available for free preview. Subscribers can unlock full access by signing in with the same app-family account they use on web and mobile.
Prefer to practice on your phone or tablet? Download the IT Mastery – AWS, Azure, GCP & CompTIA exam prep app for iOS or IT Mastery app on Google Play (Android) and use the same IT Mastery account across web and mobile.
Free diagnostic: Try the 65-question AWS DEA-C01 full-length practice exam before subscribing. Use it as one data-engineering baseline, then return to IT Mastery for timed mocks, topic drills, explanations, and the full Data Engineer Associate question bank.
DEA-C01 questions usually reward the option that delivers a replayable, governable, and cost-aware data platform decision rather than a narrow service-first answer.
| Domain | Weight |
|---|---|
| Data Ingestion and Transformation | 34% |
| Data Store Management | 26% |
| Data Operations and Support | 22% |
| Data Security and Governance | 18% |
When several AWS data services could technically work, filter for the replayable, governable, and cost-aware option, then check yourself against what strong readiness looks like in each domain:
| Area | What strong readiness looks like |
|---|---|
| Data ingestion and transformation | You can choose batch, streaming, ETL, ELT, event-driven, and orchestration patterns that match latency and replay needs. |
| Data store management | You can design S3, Glue Data Catalog, Redshift, Athena, OpenSearch, and database storage choices around query and governance requirements. |
| Data operations and support | You can reason through monitoring, retries, data quality, job failures, cost, performance, and pipeline observability. |
| Data security and governance | You can apply least privilege, encryption, auditing, cross-account sharing, and fine-grained access controls to analytics workflows. |
| Day | Practice focus |
|---|---|
| 7 | Take the free full-length diagnostic and separate misses into ingestion, storage, operations, and governance buckets. |
| 6 | Drill ingestion, streaming, ETL/ELT, orchestration, schema, and transformation decisions. |
| 5 | Drill S3 layout, partitions, file formats, catalogs, Redshift, Athena, and query-performance scenarios. |
| 4 | Drill data quality, monitoring, retries, cost controls, job failures, and pipeline support cases. |
| 3 | Drill security, Lake Formation, sharing, encryption, IAM, audit, and cross-account access. |
| 2 | Complete a timed mixed set and explain the data-flow trade-off behind every miss. |
| 1 | Review weak service pairs and patterns; avoid cramming unfamiliar analytics features. |
If several unseen mixed attempts are above roughly 75% and you can explain the data-platform trade-off in each miss, it is usually better to take the exam than keep repeating questions. Readiness means you can choose a reliable, governable AWS data pattern under time pressure.
Use these child pages when you want focused IT Mastery practice before returning to mixed sets and timed mocks.
Need concept review first? Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks, topic drills, and full IT Mastery practice.
These are original IT Mastery practice questions aligned to DEA-C01 data ingestion, storage, processing, transformation, orchestration, governance, monitoring, and security decisions. They are not AWS exam questions and are not copied from any exam sponsor. Use them to check readiness here, then continue in IT Mastery with mixed sets, topic drills, and timed mocks.
Topic: Content Domain 4: Data Security and Governance
A data producer team uses Amazon Redshift data sharing to share a curated set of tables with a consumer team in a different Amazon Redshift namespace. Which statement is INCORRECT about how permissions and access boundaries work for this setup?
Best answer: B
Explanation: In Redshift data sharing, consumer access is limited to the objects explicitly added to the datashare. Granting USAGE on the datashare authorizes the consumer namespace to create a consumer database from the share, but it does not broaden access to all producer objects. Shared objects are read-only for consumers, and consumer-side user permissions are managed within the consumer namespace.
Amazon Redshift data sharing enforces clear producer/consumer boundaries: a consumer can only see and query the objects that the producer adds to a datashare. The producer grants access to the datashare (for example, GRANT USAGE ON DATASHARE...) to a specific consumer namespace/account, and the consumer creates a database from that datashare.
Access control is split across two planes:
- Producer side: the producer decides which objects are added to the datashare and which consumer namespaces or accounts are granted USAGE on it.
- Consumer side: permissions for local users and roles on the database created from the share are managed within the consumer namespace.
A key security property is that consumers get read-only access to shared objects and cannot use the share to access non-shared producer data.
Topic: Content Domain 3: Data Operations and Support
A data pipeline is orchestrated with AWS Step Functions. It runs a Glue job and then runs an Athena query only if the Glue job succeeds. Several executions have been failing.
Exhibit: Step Functions execution event (excerpt)
1 Type: TaskStateEntered State: StartGlueJob
2 Type: TaskFailed State: StartGlueJob
3 Error: ThrottlingException
4 Cause: Rate exceeded
5 Type: ExecutionFailed
Based on the exhibit, what is the best next step to make the workflow more resilient while keeping the dependency that Athena runs only after the Glue job completes successfully?
Best answer: B (a Retry policy on StartGlueJob for ThrottlingException with backoff)
Explanation: The failure is caused by API throttling, not a deterministic data error. Step Functions can handle transient failures by retrying the failed task with an exponential backoff and a maximum attempt count. This preserves the dependency chain because downstream states (Athena) are reached only after the Glue task succeeds.
The core issue is a transient service/API throttling error, shown by Error: ThrottlingException in the execution history (exhibit line 3) and the immediate ExecutionFailed (line 5). In Step Functions, the right way to improve resiliency for transient, retryable errors is to add a Retry block to the failing task state so the workflow automatically retries with backoff and a capped number of attempts.
This keeps dependencies intact because Step Functions will only transition to the next state (running Athena) after the StartGlueJob task returns success; retries happen within the same state until success or the retry policy is exhausted. Key takeaway: handle transient dependency-step failures in the orchestrator with targeted retries rather than redesigning the pipeline.
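A minimal sketch of such a Retry policy, expressed as a Python dict in the shape of an Amazon States Language task state (the job name and downstream state name are illustrative, not from the scenario):

```python
# Sketch of an ASL task state with a Retry policy for the throttling error
# from the exhibit. Job/state names are illustrative.
start_glue_job_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for job completion
    "Parameters": {"JobName": "my-etl-job"},               # hypothetical job name
    "Retry": [
        {
            "ErrorEquals": ["ThrottlingException"],
            "IntervalSeconds": 2,
            "BackoffRate": 2.0,
            "MaxAttempts": 5,
        }
    ],
    "Next": "RunAthenaQuery",  # Athena runs only after this state succeeds
}

def backoff_schedule(interval: int, rate: float, max_attempts: int) -> list:
    """Wait (in seconds) before each retry attempt under exponential backoff."""
    return [interval * rate**i for i in range(max_attempts)]

# With IntervalSeconds=2 and BackoffRate=2.0: waits of 2, 4, 8, 16, 32 seconds.
delays = backoff_schedule(2, 2.0, 5)
```

Because the Retry sits on the task state itself, retries never leak past the state boundary, so the success-only transition to the Athena step is preserved.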
Topic: Content Domain 3: Data Operations and Support
A pipeline ingests application events to Amazon S3 (JSON) every 5 minutes. An AWS Glue Spark job runs hourly to join events to a small reference dataset and compute per-customer hourly aggregates, then writes Parquet to S3 partitioned by event_date for Amazon Athena queries.
The Glue job frequently exceeds its 30-minute SLA and shows one or two shuffle tasks running much longer than the rest. Investigation finds one customer ID accounts for ~55% of events in many hours (a hot key), causing highly unbalanced partitions during the groupBy customer_id aggregation.
Which change will BEST mitigate the skew and improve runtime reliability without changing the aggregate results?
Best answer: B (salt customer_id before aggregation, then re-aggregate by customer_id)
Explanation: The slowdown is caused by data skew from a hot key, where one customer_id dominates a shuffle partition and creates long-running straggler tasks. Salting the key spreads that customer’s records across multiple partitions for the heavy shuffle stage. A final rollup removes the salt so the output remains semantically identical while improving runtime reliability.
Data skew happens when a shuffle key (for example, customer_id) is not evenly distributed, so one reducer/partition receives most of the rows and becomes a straggler. In Spark-based AWS Glue jobs, this often appears as a small number of shuffle tasks running far longer than the rest and driving overall job runtime and failures.
A common mitigation is key salting for the skewed operation:
- Stage 1: aggregate by customer_id + salt (where salt is a small random or hash bucket).
- Stage 2: re-aggregate by customer_id to produce identical final results.

This improves performance and operability at the cost of an extra aggregation step and some additional shuffle, but it directly targets the hot-key bottleneck.
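The two-stage salted aggregation can be illustrated with a small pure-Python simulation; in a real Glue job these would be two Spark groupBy stages, and the data and salt count here are illustrative:

```python
from collections import defaultdict

N_SALTS = 8  # number of buckets to spread the hot key across

def salt_of(record_index: int) -> int:
    # Deterministic bucket; a random salt also works for pure aggregation.
    return record_index % N_SALTS

# Events (customer_id, amount) with one hot customer dominating (~55%).
events = [("hot", 1)] * 55 + [("c2", 1)] * 25 + [("c3", 1)] * 20

# Stage 1: aggregate by (customer_id, salt) -- spreads "hot" over N_SALTS partitions.
stage1 = defaultdict(int)
for i, (cust, amt) in enumerate(events):
    stage1[(cust, salt_of(i))] += amt

# Stage 2: re-aggregate by customer_id alone -- removes the salt.
final = defaultdict(int)
for (cust, _salt), subtotal in stage1.items():
    final[cust] += subtotal

# final is identical to a direct groupBy(customer_id).sum(), but no single
# shuffle partition ever held all 55 hot-key rows at once.
```

The design trade-off is visible in the code: one extra aggregation pass in exchange for the hot key being split across several partial results.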
Topic: Content Domain 2: Data Store Management
A company stores clickstream data as partitioned Parquet files in an Amazon S3 data lake (AWS Glue Data Catalog is in place).
Two new workloads must be supported:
- Interactive, sporadic SQL directly on the S3 data with a serverless, pay-per-query model, including queries against related data in Amazon Aurora without copying it into the lake.
- A nightly ~10 TB Spark batch job for sessionization and deduplication that requires third-party JARs and custom Spark configuration.
Which TWO AWS services best meet these requirements? (Select TWO.)
Best answers: A, F
Explanation: Interactive, sporadic SQL directly on S3 with a serverless, pay-per-query model and the ability to query Aurora without data movement maps to Amazon Athena with federated queries. Large-scale nightly Spark processing that requires custom JARs and Spark tuning maps to Amazon EMR.
Match the query pattern and operational model to the engine. For interactive SQL on data in S3 with minimal ops and pay-per-query billing, Amazon Athena is the best fit; it uses the AWS Glue Data Catalog and supports federated queries (via connectors) so Aurora data can be queried without copying it into the lake.
For the nightly 10 TB sessionization and deduplication, the requirement is batch Spark with third-party JARs and custom Spark configuration. Amazon EMR provides managed clusters for Spark where you can control Spark/cluster settings and dependencies, making it a better fit than SQL-only approaches.
Redshift (including Spectrum) is strongest when you want a persistent warehouse for high-concurrency, low-latency analytics rather than sporadic pay-per-query access.
Topic: Content Domain 1: Data Ingestion and Transformation
When orchestrating a serverless data ingestion workflow (Amazon EventBridge → AWS Step Functions → AWS Lambda/AWS Glue), which THREE statements are FALSE/UNSAFE assumptions or design choices? (Select THREE.)
Best answers: B, D, E
Explanation: Serverless orchestration requires designing for service limits and for at-least-once delivery behavior. Step Functions should pass small state payloads (pointers to data), not large datasets. Event sources and Lambda throttling can create retries/duplicates rather than durable queuing, so workflows should be idempotent and use explicit buffers such as SQS when needed.
The core idea is to design Step Functions-based pipelines around small control-plane messages and explicit failure/pressure handling. Step Functions state is best used to carry identifiers (like S3 object keys) and execution context, while the data-plane payload lives in durable storage.
Also assume at-least-once delivery for event-driven invocation paths: duplicates and replays can occur, so ingestion and transformation steps should be idempotent (for example, use deterministic S3 keys, conditional writes, or de-duplication records). Finally, Lambda reserved concurrency controls throughput by throttling; it does not create a backlog. If you need buffering during spikes, place SQS/Kinesis between producers and consumers and use retries/DLQs and Step Functions Retry/Catch for controlled recovery.
The safe pattern is explicit buffering plus retries, not relying on implicit ordering or queuing guarantees.
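The idempotency side of that pattern can be sketched in pure Python; the deterministic key scheme and the in-memory store are illustrative stand-ins for S3 object keys and a DynamoDB conditional write:

```python
# Simulate at-least-once delivery: the same event may arrive twice.
# A deterministic output key plus a "write once" check turns the duplicate
# into a no-op instead of a second, conflicting write.

processed = {}  # stand-in for an S3 prefix or a DynamoDB idempotency table

def output_key(event: dict) -> str:
    # Deterministic: derived only from the event, never from time/randomness.
    return f"curated/{event['bucket']}/{event['key']}"

def ingest(event: dict) -> bool:
    """Return True if this call did the work, False if it was a duplicate."""
    k = output_key(event)
    if k in processed:          # conditional write: fail if key already exists
        return False
    processed[k] = f"transformed:{event['key']}"
    return True

evt = {"bucket": "landing", "key": "orders/2026-02-24/part-0001.json"}
first = ingest(evt)    # does the work
second = ingest(evt)   # duplicate delivery: safely skipped
```

Note that the Step Functions payload here would carry only the small `evt` pointer, not the file contents, matching the control-plane/data-plane split described above.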
Topic: Content Domain 3: Data Operations and Support
Which AWS service records account activity by capturing API calls (who did what, when, and from where) to support auditing and incident investigations across data platform services such as Amazon S3, AWS Glue, and AWS Lake Formation?
Best answer: D
Explanation: AWS CloudTrail provides an event record of AWS API activity across your account, which is the primary data source for auditing and incident investigation. It captures details such as the calling identity, timestamp, source IP, and the API action taken, and can deliver events to Amazon S3 or CloudWatch Logs for retention and analysis.
The core concept is API auditing: CloudTrail records AWS control-plane API calls (management events) and, for supported services, can also record data-plane activity (data events) such as Amazon S3 object-level operations. In a data platform, this lets you investigate changes like bucket policy updates, Glue Data Catalog modifications, Lake Formation permission grants, or other service actions by reviewing CloudTrail events.
CloudTrail answers “who did what, when, and from where” by logging event fields such as eventSource, eventName, userIdentity, sourceIPAddress, and requestParameters, and it can deliver logs to S3 for long-term retention and query (for example, with Athena). The closest confusion is AWS Config, which tracks resource configuration state and changes, not a complete record of API calls.
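As a sketch, the "who/what/when/where" fields can be pulled out of a CloudTrail event record like this; the sample event is invented, but the field names are standard CloudTrail fields:

```python
import json

# Invented sample event using standard CloudTrail field names.
raw = json.dumps({
    "eventTime": "2026-02-24T10:15:00Z",
    "eventSource": "lakeformation.amazonaws.com",
    "eventName": "GrantPermissions",
    "userIdentity": {"type": "IAMUser", "userName": "data-admin"},
    "sourceIPAddress": "203.0.113.10",
    "requestParameters": {"principal": "analyst-role"},
})

def summarize(event_json: str) -> dict:
    """Reduce a CloudTrail event to the audit questions it answers."""
    e = json.loads(event_json)
    return {
        "who": e["userIdentity"].get("userName", e["userIdentity"]["type"]),
        "what": f'{e["eventSource"]}:{e["eventName"]}',
        "when": e["eventTime"],
        "where": e["sourceIPAddress"],
    }

summary = summarize(raw)
```

In practice the same extraction is typically done at scale with Athena over CloudTrail logs delivered to S3, as noted above.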
Topic: Content Domain 1: Data Ingestion and Transformation
A data engineering team runs an hourly serverless pipeline orchestrated with AWS Step Functions. Each run processes up to 3,000 customer prefixes in Amazon S3 by invoking AWS Lambda, then writes curated output back to S3.
Requirements:
- Finish each hourly run within the SLA.
- Stay within the roughly 1,000 Lambda concurrent executions available to the pipeline.
- Process each prefix at most once, because the outputs affect customer billing.
Which TWO design choices should the team AVOID?
Best answers: A, F
Explanation: To keep a serverless workflow reliable, you must control fan-out to stay within Lambda concurrency and choose orchestration semantics that match correctness requirements. Step Functions Express is at-least-once, so it can duplicate work unless the pipeline is explicitly idempotent. Unbounded parallelism can also trigger throttling and retries that inflate cost and risk missing the hourly SLA.
The core concerns are (1) execution semantics during retries and (2) concurrency limits during fan-out. Step Functions Express workflows are designed for high-volume, short-duration use cases and provide at-least-once execution, meaning a task can run more than once; when the business requirement is “at most once” and the downstream operation is not idempotent, you must not depend on retries alone for correctness.
Separately, fanning out 3,000 parallel Lambda invocations can exceed the stated 1,000 available concurrency, leading to TooManyRequestsException throttles, cascading retries, longer runtimes, and higher Step Functions/Lambda costs. Safer patterns include setting MaxConcurrency in a Map/Distributed Map state, buffering work with SQS plus reserved concurrency, and enforcing idempotency (for example, DynamoDB conditional writes) so retries do not create duplicate billing-impacting outputs.
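The concurrency arithmetic behind the safer pattern can be sketched with the scenario's numbers; the bounded `MaxConcurrency` of 900 is an illustrative value chosen to stay under the 1,000-execution limit:

```python
import math

PREFIXES = 3000                # work items per hourly run (from the scenario)
AVAILABLE_CONCURRENCY = 1000   # Lambda concurrent executions available

def fanout_plan(items: int, max_concurrency: int) -> dict:
    """Waves of work when a Map state caps parallelism at max_concurrency."""
    return {
        "waves": math.ceil(items / max_concurrency),
        "throttled": max_concurrency > AVAILABLE_CONCURRENCY,
    }

unbounded = fanout_plan(PREFIXES, PREFIXES)  # all 3,000 at once: will throttle
bounded = fanout_plan(PREFIXES, 900)         # e.g. MaxConcurrency=900: 4 waves
```

Four sequential waves of up to 900 invocations complete the 3,000 prefixes without ever exceeding the available concurrency, trading a modest serialization cost for predictable, throttle-free runs.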
Topic: Content Domain 3: Data Operations and Support
A company loads curated dimension and fact tables into Amazon Redshift using an AWS Glue ETL job. The company must ensure that customers.customer_id is unique and that orders.customer_id always references an existing customer.
Which statement is correct about where to enforce these checks?
Best answer: C
Explanation: In Amazon Redshift, PRIMARY KEY/UNIQUE/FOREIGN KEY constraints are metadata and are not enforced during inserts or loads. Therefore, if the pipeline must guarantee uniqueness and referential integrity, the validation needs to occur in the ETL process (or another enforcing system) before data is loaded into Redshift.
The core decision is whether the target warehouse will actively enforce referential integrity and uniqueness. In Amazon Redshift, constraints such as PRIMARY KEY, UNIQUE, and FOREIGN KEY are not enforced; they are mainly used for query planning and documentation. That means Redshift will not automatically reject rows that violate these constraints during COPY/INSERT.
To ensure data quality for uniqueness and foreign-key relationships in this setup, implement checks in the ETL layer (for example, in AWS Glue/Spark):
- Deduplicate (or reject duplicate) customers.customer_id values before load.
- Verify every orders.customer_id exists in the customer key set.

Key takeaway: when the warehouse does not enforce constraints, enforce these rules in ETL validation prior to loading.
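A minimal pure-Python sketch of those two checks; in a Glue/Spark job these would be DataFrame operations, and the sample rows are illustrative:

```python
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bo"},
    {"customer_id": 2, "name": "Bo (dup)"},   # violates uniqueness
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},      # violates referential integrity
]

def validate(customers: list, orders: list) -> dict:
    """Find uniqueness and foreign-key violations before loading to Redshift."""
    ids = [c["customer_id"] for c in customers]
    duplicates = {i for i in ids if ids.count(i) > 1}
    key_set = set(ids)
    orphans = [o["order_id"] for o in orders if o["customer_id"] not in key_set]
    return {"duplicate_customer_ids": duplicates, "orphan_order_ids": orphans}

report = validate(customers, orders)
# Rows flagged here are quarantined or rejected BEFORE the COPY/INSERT,
# because Redshift itself will not reject them.
```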
Topic: Content Domain 4: Data Security and Governance
When troubleshooting access failures for AWS data services (for example, Amazon S3, AWS Glue, Amazon Athena) from resources in a VPC, which TWO statements are INCORRECT? (Select TWO.)
Best answers: C, D (for example, the claim that AccessDeniedException usually indicates a network ACL problem)
Explanation: Network controls (security groups, NACLs, routing, NAT, VPC endpoints) determine whether you can reach an AWS service endpoint, typically showing up as timeouts or connection errors. IAM and resource-based policies (like S3 bucket policies) determine whether an authenticated request is authorized, typically showing up as AccessDenied-type errors. VPC endpoints do not grant permissions; they only change/restrict the network path.
Troubleshooting starts by classifying the failure symptom: connectivity vs authorization. Network controls such as security groups, NACLs, routes, NAT gateways, and VPC endpoints govern whether traffic can reach the service endpoint; when these are wrong, clients commonly see timeouts, DNS/connect errors, or failed TCP/TLS handshakes. Authorization failures happen after a request reaches the service and are evaluated by IAM and resource policies.
Saying that AccessDeniedException “usually indicates a network ACL problem” is incorrect because AccessDenied is returned by the AWS service after policy evaluation. Saying that “VPC endpoints remove the need for IAM permissions” is also incorrect: endpoints provide private connectivity and can further restrict access via an endpoint policy, but IAM (and resource policies such as S3 bucket policies) must still allow the action. The other statements correctly describe these separations and layered policy controls.
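The connectivity-versus-authorization triage can be sketched as a small helper; the symptom strings are illustrative examples of each failure class:

```python
# Triage sketch: map a failure symptom to the layer to investigate first.
NETWORK_SYMPTOMS = ("timed out", "connection refused", "could not resolve")
AUTHZ_SYMPTOMS = ("accessdenied", "not authorized", "explicit deny")

def triage(message: str) -> str:
    m = message.lower()
    if any(s in m for s in AUTHZ_SYMPTOMS):
        # Request reached the service: check IAM, resource, and endpoint policies.
        return "authorization"
    if any(s in m for s in NETWORK_SYMPTOMS):
        # Traffic never arrived: check SGs, NACLs, routes, NAT, VPC endpoints.
        return "network"
    return "unknown"

a = triage("AccessDeniedException when calling s3:GetObject")
b = triage("connect to glue.us-east-1.amazonaws.com timed out")
```

Real troubleshooting needs more signals than string matching, but the split it encodes (service-returned denial vs. no connection at all) is the exam-relevant distinction.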
Topic: Content Domain 3: Data Operations and Support
A data team uses a nightly AWS Glue job to write Parquet files to Amazon S3 and queries the data using Amazon Athena. Users report frequent query failures.
A data engineer runs an Amazon CloudWatch Logs Insights query over the last 24 hours of the Glue job log group and gets the following aggregated results.
Exhibit: Logs Insights results (last 24 hours)
| error | occurrences | sample_message |
|---|---|---|
| HIVE_PARTITION_SCHEMA_MISMATCH | 247 | Partition dt=2026-02-24 column customer_id is bigint; table is string |
| AccessDeniedException | 3 | Access denied for s3:GetObject |
| ThrottlingException | 1 | Rate exceeded |
Which action is the best next step to reduce the recurring failures?
Best answer: A
Explanation: The log aggregation indicates the dominant recurring failure is a Hive partition schema mismatch between stored data and the Athena/Glue table definition. Fixing schema drift (or separating versions) addresses the root cause and will eliminate most failures. The other errors occur only a few times and are unlikely to explain frequent failures.
The key skill is using log aggregation to identify the most frequent root-cause category. In the exhibit, HIVE_PARTITION_SCHEMA_MISMATCH has 247 occurrences, far exceeding the other errors, and the sample message explicitly states a type conflict: partition data has customer_id as bigint while the table expects string.
Best next step is to eliminate schema drift by ensuring the Glue ETL writes a consistent schema per table (or writing to a new location/table for the new schema) and updating the Glue Data Catalog table definition/partitions to match the stored Parquet schema. This targets the error that is clearly recurring in the logs.
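The mismatch in the exhibit can be detected mechanically by comparing the catalog table schema against each partition's stored schema; the schemas below are illustrative:

```python
# Catalog table schema vs. per-partition stored (Parquet) schemas.
table_schema = {"customer_id": "string", "event_ts": "timestamp"}

partition_schemas = {
    "dt=2026-02-23": {"customer_id": "string", "event_ts": "timestamp"},
    "dt=2026-02-24": {"customer_id": "bigint", "event_ts": "timestamp"},  # drifted
}

def find_drift(table: dict, partitions: dict) -> list:
    """(partition, column, partition_type, table_type) for each type conflict."""
    conflicts = []
    for part, schema in partitions.items():
        for col, typ in schema.items():
            if col in table and table[col] != typ:
                conflicts.append((part, col, typ, table[col]))
    return conflicts

drift = find_drift(table_schema, partition_schemas)
# Matches the exhibit: the dt=2026-02-24 partition wrote customer_id as bigint
# while the table declares string.
```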
Topic: Content Domain 4: Data Security and Governance
A company is adopting a domain-oriented data ownership model (producer/consumer). Each business domain owns its curated datasets in an Amazon S3 data lake and registers tables in the AWS Glue Data Catalog. Other domains in separate AWS accounts need to query these datasets with Amazon Athena while the producing domain keeps control over who can access the data and under what conditions.
Which action best matches the core principle being applied?
Best answer: D
Explanation: The scenario is about producer/consumer data sharing with domain-owned control over access. AWS Lake Formation is designed for governed data lake access, including cross-account sharing, fine-grained authorization, and centralized auditing. This directly implements data governance while preserving domain ownership of datasets and policies.
The core principle is data governance applied to a producer/consumer sharing pattern: the producing domain should be able to publish data products while controlling access and enabling consumers to discover and query them. AWS Lake Formation supports this by letting producers grant table-, column-, and row-level permissions and share those permissions across AWS accounts (often via Lake Formation resource sharing and consumer-side resource links), while maintaining a single governed copy of the data in S3.
This approach also improves auditability and consistency because access decisions are managed through a centralized permission system rather than ad hoc S3 policies or data duplication. The key takeaway is to enable cross-account consumption through governed sharing rather than copying data or broadly opening storage access.
Topic: Content Domain 4: Data Security and Governance
A company is standardizing a four-level data classification model for its S3-based data lake: public, internal, confidential, and regulated. A new dataset was registered in the AWS Glue Data Catalog for Athena querying.
Exhibit: Glue table DDL (excerpt)
CREATE EXTERNAL TABLE claims_raw (
claim_id string,
patient_name string,
ssn string,
credit_card_number string
)
LOCATION 's3://acme-datalake-raw/claims/';
Based on the exhibit, which classification and high-level control mapping is most appropriate for this dataset?
Best answer: A
Explanation: The table schema includes highly sensitive identifiers (ssn and credit_card_number), which fits the regulated classification. Regulated data typically requires strict, auditable least-privilege access controls and customer-managed key encryption such as S3 SSE-KMS.
The core decision is to classify the dataset based on the sensitivity implied by the schema and then apply matching access and encryption controls. In the exhibit, the presence of ssn and credit_card_number in the Glue table definition indicates regulated data (highly sensitive personal/financial data).
At a high level, regulated data should map to:
- Strict, auditable least-privilege access, limited to explicitly approved principals.
- Encryption with customer-managed keys (for example, S3 SSE-KMS) with key-usage auditing.
Key takeaway: the specific fields ssn and credit_card_number in the exhibit are the deciding indicators that push this to the regulated tier, not merely internal or confidential.
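The column-driven classification decision can be sketched as a simple rule table; the tiers come from the scenario, but the indicator column lists are illustrative, not a compliance standard:

```python
# Classify a dataset by its most sensitive column. Most-sensitive tier first.
SENSITIVITY_RULES = [
    ("regulated", {"ssn", "credit_card_number", "patient_name"}),
    ("confidential", {"salary", "email"}),
    ("internal", {"employee_id"}),
]

def classify(columns: set) -> str:
    for tier, indicators in SENSITIVITY_RULES:
        if columns & indicators:   # any indicator column present
            return tier            # first (most sensitive) matching tier wins
    return "public"

# Columns from the exhibit's claims_raw table:
tier = classify({"claim_id", "patient_name", "ssn", "credit_card_number"})
```

The ordering matters: a table is classified by its most sensitive column, which is why `ssn` and `credit_card_number` pull the whole dataset into the regulated tier.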
Topic: Content Domain 2: Data Store Management
A company is building an Amazon S3 data lake with AWS Glue Data Catalog tables and Amazon Athena for SQL access. Data is ingested into a raw zone and transformed into a curated zone. Some curated tables contain PII columns that only a compliance team can access; analysts can access non-PII data. The company wants centralized governance using AWS Lake Formation.
Which design choice is NOT recommended for the table and permission model?
Best answer: A (grant analysts direct s3:GetObject access to curated S3 prefixes and skip Lake Formation registration)
Explanation: Lake Formation is intended to be the central policy enforcement point for governed tables, including controlling access to the underlying S3 data through registered data locations. Allowing principals to access curated objects directly in S3 undermines those controls and makes fine-grained table/column permissions ineffective.
In a governed data lake, Lake Formation permissions (database/table/column/row) should be the primary way consumers get access to datasets, with the underlying S3 locations registered so Lake Formation can enforce access consistently. If you grant broad or direct s3:GetObject permissions to data consumers and do not register the location with Lake Formation, users can bypass Data Catalog-based controls and retrieve objects outside the intended table/column permission model.
A sound high-level design is to:
- Register the curated S3 locations with Lake Formation.
- Grant database/table/column (and, where needed, row) permissions through Lake Formation: full-table access for the compliance team, column-filtered access for analysts.
- Remove or tightly restrict direct s3:GetObject paths for data consumers.
Key takeaway: avoid permission paths that let users read S3 data directly outside Lake Formation governance.
Topic: Content Domain 1: Data Ingestion and Transformation
When choosing an AWS orchestration service for a data pipeline based on dependency management and retry needs, which THREE statements are FALSE/UNSAFE? (Select THREE.)
Best answers: B, C, D
Explanation: Use stateful orchestrators when you need explicit dependencies, durable state, and configurable retries per step (for example, Step Functions or Airflow/MWAA). EventBridge is best for event routing and simple triggering, not for modeling complex DAGs. Glue workflows coordinate Glue components but do not match Step Functions for fine-grained retry/timeouts and advanced control flow.
The key distinction is whether you need a stateful workflow engine (tracks step state and dependencies) versus an event bus (routes events).
Step Functions is purpose-built for multi-step orchestration with explicit state, dependency sequencing, and per-state Retry/Catch policies, and it can directly start and monitor many AWS services (including AWS Glue) through service integrations.
MWAA (Airflow) is also a strong fit for complex DAG scheduling, dependencies, and per-task retries/backfills, but it introduces an environment to operate (Airflow workers, scaling, upgrades).
EventBridge is typically used to trigger or fan out pipeline actions based on events, not to manage complex DAG dependencies with rich per-task retry behavior. Glue workflows coordinate Glue jobs/crawlers with triggers but have comparatively limited workflow control and error-handling granularity.
Choose the orchestration tool that matches the complexity of dependencies and the level of retry/control you must express.
Topic: Content Domain 3: Data Operations and Support
An AWS Glue workflow starts a Glue ETL job on a fixed schedule of every 30 minutes to meet a 30-minute freshness SLA. The workflow’s run history shows many runs in WAITING while one run is RUNNING, with no job failures.
The Glue ETL job consistently takes 70 minutes to complete.
Assuming the runtime stays constant, what is the minimum Max concurrent runs value needed to prevent a growing queue of waiting workflow runs? Round up to the next whole number.
Best answer: C
Explanation: The workflow is queuing because the schedule interval is shorter than the job runtime, so runs overlap and wait for capacity. The minimum concurrency is the runtime divided by the interval, rounded up. With a 70-minute runtime and 30-minute schedule, the job needs 3 concurrent runs to keep the queue from growing.
This is a scheduling/worker-capacity symptom: when a job is triggered faster than it completes, new runs pile up in WAITING unless the service can run enough copies in parallel.
Compute the minimum parallelism needed:
minimum Max concurrent runs = ceil(job runtime / schedule interval)
So: 70 / 30 = 2.33, which rounds up to 3. Setting Max concurrent runs to 3 allows the job to absorb the overlap and stop the backlog growth (assuming sufficient DPUs/account capacity).
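The arithmetic can be checked directly:

```python
import math

def min_concurrent_runs(runtime_min: float, interval_min: float) -> int:
    """Minimum Max concurrent runs so the WAITING queue stops growing."""
    return math.ceil(runtime_min / interval_min)

needed = min_concurrent_runs(70, 30)   # 70 / 30 = 2.33 -> rounds up to 3
```

Intuitively: at any moment, up to three runs overlap (the one started now plus the two started 30 and 60 minutes ago, both still running), so three concurrent slots absorb the overlap.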
Topic: Content Domain 3: Data Operations and Support
A data pipeline runs an AWS Glue ETL job every 15 minutes. The job is started by a scheduled AWS Lambda function (no VPC) that calls glue:StartJobRun. Operations reports that 1-2 runs per day are missed and must be started manually. Reliability requirement: the trigger must automatically retry for up to 60 minutes without adding servers to manage.
Exhibit: Lambda log excerpt
ERROR ThrottlingException: Rate exceeded
Task timed out after 3.00 seconds
Which change will fix the root cause with the least operational burden?
Best answer: C (use Amazon EventBridge Scheduler to call glue:StartJobRun directly with a retry policy and an SQS dead-letter queue)
Explanation: The failures occur in the scheduling layer, not in the Glue job itself. The Lambda times out after a throttling error, so no subsequent retry occurs and the run is missed. Using EventBridge Scheduler to call StartJobRun provides a managed, serverless trigger with configurable retries (and optional DLQ) to meet the reliability requirement with minimal operations overhead.
Symptom: scheduled runs are occasionally missed, and the Lambda logs show ThrottlingException followed by a short Lambda timeout.
Root cause: the trigger mechanism (Lambda) is not reliably retrying the StartJobRun call when AWS API throttling happens; the function times out quickly, so the job start request is dropped.
Fix: replace the Lambda scheduler with a managed, serverless service that can invoke the Glue API and handle retries for you.
Amazon EventBridge Scheduler can invoke glue:StartJobRun directly as a target, with a configurable retry policy (including a maximum event age that covers the required 60-minute retry window) and an optional dead-letter queue for starts that still fail.

Key takeaway: use a managed scheduler with retries to improve reliability and reduce operational burden.
Topic: Content Domain 2: Data Store Management
Which Amazon S3 feature provides object retention (WORM) by preventing an object version from being overwritten or permanently deleted until a defined retention period expires, helping protect against accidental deletions and support recovery?
Best answer: D
Explanation: S3 Object Lock is the S3 capability specifically designed for retention: it applies WORM controls to object versions using a retention period (and optionally a legal hold). This prevents permanent deletion or overwrite until the retention constraint is removed or expires, which protects data from accidental or malicious deletions and supports recovery.
S3 Versioning helps with recovery by keeping multiple versions of an object; a “delete” typically creates a delete marker so prior versions can be restored. However, versioning alone does not stop a user from permanently deleting specific versions.
S3 Object Lock is the retention feature for S3. It enforces write-once-read-many behavior at the object version level by applying:
- A retention period (in governance or compliance mode) during which the version cannot be overwritten or permanently deleted.
- An optional legal hold, which blocks permanent deletion independently of any retention period.
With Object Lock, protected versions cannot be overwritten or permanently deleted until the retention rules allow it, which is the core concept behind object retention for protection and recovery. The key takeaway is that versioning enables rollback, while Object Lock enforces retention against permanent removal.
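A small simulation of the WORM semantics, as a stand-in for what S3 enforces server-side (this is not an S3 API call; the dates and class are illustrative):

```python
from datetime import datetime, timezone

class LockedVersion:
    """Simulates Object Lock retention on a single object version."""

    def __init__(self, data: bytes, retain_until: datetime):
        self.data = data
        self.retain_until = retain_until

    def permanent_delete(self, now: datetime) -> bool:
        """Return True if the delete is allowed, False if retention blocks it."""
        if now < self.retain_until:
            return False    # S3 would reject the permanent delete
        return True

v = LockedVersion(b"audit-record", datetime(2030, 1, 1, tzinfo=timezone.utc))
early = v.permanent_delete(datetime(2026, 6, 1, tzinfo=timezone.utc))  # blocked
late = v.permanent_delete(datetime(2031, 1, 1, tzinfo=timezone.utc))   # allowed
```

This captures the exam distinction: versioning lets you roll back, but only the retention check shown here prevents permanent removal before the retention date.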
Topic: Content Domain 2: Data Store Management
You are troubleshooting Amazon Athena queries that use the AWS Glue Data Catalog for an S3-backed data lake. Which THREE statements are FALSE/UNSAFE guidance for diagnosing common cataloging issues (missing partitions, schema mis-detection, and permission errors)?
Best answers: B, D, F
Explanation: Cataloging problems in Athena typically come from metadata being out of sync (partitions not added), schema inferred incorrectly by a crawler, or missing permissions to Glue Data Catalog/Lake Formation. Crawlers are not continuous, permissions involve more than just S3, and “zero rows” usually indicates partition/filter/schema mismatch rather than timeouts. Effective troubleshooting focuses on validating partitions, schema inference inputs, and metadata access.
When Athena queries an S3 dataset through the Glue Data Catalog, failures usually map to three areas:
- Missing partitions: new S3 prefixes are invisible to queries until registered; add them with MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION.
- Schema mis-detection: a crawler may infer column types incorrectly; validate the inferred schema against the stored files.
- Permission errors: confirm access to the Glue Data Catalog (and Lake Formation, if in use) in addition to verifying that S3 GetObject is allowed.

The key idea is to validate metadata (partitions/schema) and metadata access, not to treat these as query-runtime timeout issues.
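The missing-partitions check amounts to a set difference between S3 prefixes and catalog partitions; a sketch with illustrative partition names:

```python
# Partitions that exist as S3 prefixes...
s3_prefixes = {"dt=2026-02-22", "dt=2026-02-23", "dt=2026-02-24"}
# ...versus partitions registered in the Glue Data Catalog.
catalog_partitions = {"dt=2026-02-22", "dt=2026-02-23"}

# Any prefix not registered in the catalog returns zero rows in Athena
# until it is added (MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION).
missing = sorted(s3_prefixes - catalog_partitions)
```

In production, the prefix list would come from listing S3 and the partition list from the Glue GetPartitions API; the set difference itself is the whole diagnostic.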
Topic: Content Domain 1: Data Ingestion and Transformation
A data engineering team uses AWS Step Functions to orchestrate a serverless ingestion pipeline. For each object that lands in an Amazon S3 landing/ prefix, the workflow validates the file with AWS Lambda and then starts downstream processing. The team expects occasional bursts (tens of thousands of objects in minutes) and must preserve an immutable raw copy for auditing and reprocessing.
Which approach should the team AVOID?
Best answer: B
Explanation: In serverless orchestration, you must control concurrency and design for retries and idempotency. Allowing unbounded parallel tasks while overwriting a single object in the raw zone creates race conditions and data loss, and it increases throttling and partial-failure impact during bursts. An immutable raw layer should be append-only and reproducible.
A key serverless workflow principle is to bound concurrency to downstream limits and make each step safe to retry. With Step Functions, bursty fan-out (such as a Map state over many objects) can quickly hit AWS Lambda concurrency or other service limits, so you typically cap parallelism and add retries/backoff for transient failures.
For data platforms, the raw zone should be immutable (append-only) so you can audit and reprocess exactly what arrived. Tasks should write idempotently (unique keys/partitions) so retries or duplicate events don’t overwrite or corrupt prior data. Unbounded fan-out combined with overwriting a shared raw/ object is a classic anti-pattern: it introduces race conditions, loses lineage, and turns normal retries into destructive writes.
The safest designs combine bounded parallelism, durable buffering, and idempotent S3 writes.
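The append-only principle can be sketched as a key-naming convention: derive the raw-zone key deterministically from the landing object and its version, so a retry writes the same new key instead of overwriting a shared object. The helper and layout below are illustrative assumptions, not an AWS convention.

```python
def raw_zone_key(source_key, version_id):
    """Derive an append-only raw-zone key from a landing-zone object.

    Embedding the source object's version ID makes the write idempotent:
    a retried or duplicate event produces the same deterministic key, and
    distinct uploads of the same filename never collide or overwrite each
    other, preserving lineage for audit and reprocessing.
    """
    name = source_key.rsplit("/", 1)[-1]
    return f"raw/{name}.{version_id}"

key = raw_zone_key("landing/orders.csv", "v123")
```

Combined with a bounded-concurrency Map state (MaxConcurrency plus retries with backoff), this keeps bursty fan-out from turning normal retries into destructive writes.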
Topic: Content Domain 1: Data Ingestion and Transformation
You are configuring event-driven ingestion for files landing in Amazon S3 using Amazon S3 event notifications and Amazon EventBridge rules. The pipeline must tolerate at-least-once delivery.
Which TWO statements are FALSE or unsafe assumptions for this design?
Best answers: A, C
Explanation: Both S3 event notifications and EventBridge use at-least-once delivery, so duplicates are possible and ordering is not guaranteed. Safe event-driven ingestion designs therefore avoid assuming exactly-once or ordered delivery, and instead use idempotent consumers plus appropriate buffering and failure handling.
The core requirement is handling at-least-once semantics: event sources and intermediaries can retry, and the same logical event can be delivered more than once. With S3 event notifications, duplicate notifications can occur for the same object, so downstream processing must be idempotent (for example, record a processed object key plus version ID/ETag and ignore repeats). With EventBridge, delivery is also at-least-once and you should not assume a global ordering of events at targets.
Safe patterns include using SQS to buffer bursts and retries, configuring Lambda asynchronous failure handling (DLQ/destinations), and using SQS FIFO only when you need ordering within a message group, while still designing processing to be idempotent.
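An idempotent consumer for at-least-once delivery can be sketched as follows. The event shape and handler are illustrative assumptions; in production the seen-set would live in a durable store (for example DynamoDB with a conditional put), not in memory.

```python
def make_idempotent_handler(process):
    """Wrap a processing function so duplicate event deliveries are ignored.

    The dedup key combines the object key with its version ID (falling back
    to ETag), because both S3 event notifications and EventBridge deliver
    at-least-once: the same logical event can arrive more than once.
    """
    seen = set()  # stand-in for a durable, conditionally-written store

    def handler(event):
        dedup_key = (event["key"], event.get("versionId") or event.get("etag"))
        if dedup_key in seen:
            return "skipped-duplicate"
        seen.add(dedup_key)
        process(event)
        return "processed"

    return handler

calls = []
handler = make_idempotent_handler(calls.append)
handler({"key": "landing/a.csv", "versionId": "1"})
handler({"key": "landing/a.csv", "versionId": "1"})  # duplicate delivery ignored
```

Because the handler converges to the same state however many times an event is delivered, retries and duplicates become harmless rather than corrupting.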
Topic: Content Domain 1: Data Ingestion and Transformation
You are integrating multiple sources into a single analytics table (for example, app events, web events, and customer updates). You must choose join keys, align timestamps, and define deduplication rules.
Which TWO statements are INCORRECT/UNSAFE practices for this integration?
Best answers: A, D
Explanation: Using ingestion/arrival time to align records is unsafe because differing delivery delays across sources break time-based joins. Deduplication also must be deterministic and resilient to late or corrected data; “keep the first one seen” commonly loses the correct version. Safe integration patterns rely on stable join keys, normalized timestamps, and explicit event-time/effective-time logic.
When integrating sources, the core idea is to join on identifiers that represent the same real-world entity and to align records using a timestamp that reflects business/event semantics, not pipeline behavior.
Event-time (or effective-time for slowly changing dimensions) should drive alignment because different sources can arrive late, out of order, or with different buffering. Therefore, normalize time zones (commonly to UTC) and join based on event_time/updated_at as appropriate. For duplicates, define a stable dedup key (such as event_id or a composite natural key) and a deterministic tie-breaker (such as highest version, latest event_time, or a source priority) so reruns and late arrivals converge to the same correct output. The main takeaway is to avoid using ingestion time and “first seen wins” rules, because both depend on unpredictable delivery timing.
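The deterministic tie-breaker idea can be shown in a few lines. The record fields are illustrative assumptions; the point is that the winner depends only on (version, event_time), never on arrival order.

```python
def dedupe(events):
    """Deduplicate by event_id with a deterministic tie-breaker.

    The winner is the record with the highest (version, event_time) pair,
    so reruns and late-arriving corrections converge to the same output
    regardless of the order in which records happened to arrive.
    """
    best = {}
    for e in events:
        rank = (e["version"], e["event_time"])
        current = best.get(e["event_id"])
        if current is None or rank > (current["version"], current["event_time"]):
            best[e["event_id"]] = e
    return list(best.values())

events = [
    {"event_id": "e1", "version": 1, "event_time": "2024-05-01T10:00:00Z", "status": "old"},
    {"event_id": "e1", "version": 2, "event_time": "2024-05-01T10:05:00Z", "status": "new"},
]
```

A "first seen wins" rule would return "old" or "new" depending purely on delivery timing; this function returns "new" either way.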
Topic: Content Domain 2: Data Store Management
An Amazon Athena workgroup queries a Glue Data Catalog table clickstream stored in Amazon S3 as Parquet with Snappy compression. The table is partitioned as s3://.../year=YYYY/month=MM/day=DD/hour=HH/ and has about 300,000 partitions.
Users run many saved dashboards that filter only on event_ts (not the partition columns). Queries intermittently fail with this Athena error:
HIVE_METASTORE_ERROR: com.amazonaws.services.glue.model.ThrottlingException:
Rate exceeded (Service: AWSGlue; Operation: GetPartitions)
The data engineer must fix the root cause with minimal change and cannot modify the dashboards or rewrite the dataset. What should the data engineer do?
Best answer: A
Explanation: The failure is caused by Glue Data Catalog GetPartitions throttling when Athena must enumerate a very large number of partitions for queries that do not filter on partition columns. Creating a Glue partition index is an operational change that improves partition metadata retrieval performance and reduces throttling. This preserves the existing dashboards and S3 dataset layout.
Athena relies on the Glue Data Catalog to retrieve partition metadata. Because the dashboards filter only on event_ts, Athena cannot prune partitions by year/month/day/hour and may need to enumerate a very large set of partitions, leading to GetPartitions throttling and query failures.
Creating a Glue partition index on the table's partition keys is the minimal operational fix because it lets the Data Catalog look up matching partitions efficiently instead of enumerating all 300,000 entries, reducing the volume and latency of GetPartitions calls and therefore the throttling. The key takeaway is that Glue partition indexes address partition-metadata scalability, whereas file format/compression changes or repartitioning are larger changes and don't directly resolve Glue API throttling.
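A sketch of the request such a fix would send via boto3's glue client create_partition_index call. The helper function, database name, and index-naming scheme are our assumptions; DatabaseName, TableName, and PartitionIndex (with IndexName and Keys) are the documented request fields.

```python
def partition_index_request(database, table, keys):
    """Build kwargs for glue_client.create_partition_index (boto3).

    The index keys must be partition keys of the table, listed in the
    order they form the index; here we index the full partition hierarchy.
    """
    return {
        "DatabaseName": database,
        "TableName": table,
        "PartitionIndex": {
            "IndexName": "_".join(keys) + "_idx",
            "Keys": keys,
        },
    }

req = partition_index_request("analytics", "clickstream", ["year", "month", "day", "hour"])
# glue_client.create_partition_index(**req)  # requires a live boto3 Glue client
```

The index is built in the background by Glue; nothing about the S3 layout or the dashboards changes, which is why this answer satisfies the "minimal change" constraint.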
Topic: Content Domain 1: Data Ingestion and Transformation
A data engineering team uses an LLM (Amazon Bedrock) inside an ETL job to summarize and categorize customer support tickets. The raw tickets can contain PII (names, emails, account numbers). The job writes results to an S3 curated zone for analytics.
Which THREE statements describe appropriate guardrails for this LLM-driven processing of sensitive data?
Best answers: D, E, F
Explanation: For sensitive data, apply layered guardrails around LLM calls: filter inputs to minimize exposed PII, ground responses to trusted data to constrain what the model can assert, and add human review gates for risky or uncertain outputs. These controls reduce both data leakage risk and incorrect content entering curated analytics datasets.
LLMs can leak sensitive data and generate unsupported text, so guardrails should be applied at multiple points in an ETL workflow handling PII. Start with input filtering to minimize or remove PII before the model sees it (for example, redaction/masking). Then constrain generation by grounding the model on approved enterprise data (RAG from a curated knowledge base) and validating that outputs are supported (such as requiring citations or rejecting ungrounded claims). Finally, use human review for items that are high risk (detected sensitive entities, policy violations) or low confidence before publishing results into a curated zone.
The key idea is to reduce sensitive exposure, constrain what the model is allowed to say, and prevent unsafe outputs from being automatically persisted.
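The input-filtering guardrail can be sketched as a pre-processing step that masks obvious PII before ticket text reaches the model. The regex patterns and placeholders are simplified assumptions for illustration; a real pipeline would use a dedicated detector (such as Amazon Comprehend's PII detection) rather than regexes alone.

```python
import re

# Order matters: mask emails before bare digit runs so overlapping
# matches are handled predictably.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{10,16}\b"), "[ACCOUNT_NUMBER]"),
]

def redact(text):
    """Mask common PII patterns so the LLM never sees the raw values."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

masked = redact("Contact jane.doe@example.com about account 1234567890.")
```

Only the masked text is sent to the model, so even a prompt-injection or logging failure downstream cannot expose the original identifiers.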
Topic: Content Domain 2: Data Store Management
A data platform team notices that over time their datasets have changed (more skewed key distribution, higher cardinality in some columns, and higher null rates). They want to adjust partitioning, indexing, and processing strategies at a high level.
Which THREE statements are INCORRECT/unsafe recommendations? (Select three.)
Best answers: C, E, F
Explanation: The incorrect statements ignore key tradeoffs in partitioning and query optimization. Extremely high-cardinality partition keys and excessive partition counts often harm performance due to metadata and small-file overhead. Also, when data distributions change, planners typically need refreshed statistics (and sometimes physical reorganization) to choose efficient join and scan strategies.
As data characteristics shift (skew, cardinality, null rates), you often need to revisit how data is laid out and how jobs execute. Partitioning is most effective when the key has manageable cardinality and aligns with common predicates; otherwise, you can create many tiny partitions/files and increase catalog/listing overhead.
Query optimizers generally do not "figure it out live" for every query; statistics (and, depending on the engine, rebalancing/reclustering) help the planner estimate cardinality and choose join and scan plans. For processing, mitigate skew by changing partitioning/repartitioning, using broadcast joins for small tables, or distributing hot keys (salting) rather than assuming more partitions alone will fix it.
The key is balancing pruning benefits against metadata and execution overhead.
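The salting technique mentioned above can be sketched in plain Python (the helpers and key format are illustrative assumptions; engines like Spark express the same idea with column expressions):

```python
import random

def salted_key(key, hot_keys, buckets=8):
    """Spread skewed keys across sub-buckets (salting).

    Hot keys get a random salt suffix so a single worker no longer
    receives the entire hot partition; cold keys keep a fixed salt.
    """
    if key in hot_keys:
        return f"{key}#{random.randrange(buckets)}"
    return f"{key}#0"

def explode_small_side(rows, hot_keys, buckets=8):
    """Replicate small-table rows once per salt so salted keys still join."""
    out = []
    for key, value in rows:
        n = buckets if key in hot_keys else 1
        out.extend((f"{key}#{i}", value) for i in range(n))
    return out

# Hot key "u1" is spread over 4 buckets; the small side is expanded to match.
small = explode_small_side([("u1", "gold"), ("u2", "free")], {"u1"}, buckets=4)
```

The trade-off is deliberate: the small side grows by the salt factor, but the hot key's work is now divided across workers instead of serialized on one.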
Use this map after the sample questions to connect individual items to the AWS data engineering pipeline, storage, security, and operations decisions these practice samples test.
flowchart LR
S1["Data source and requirement"] --> S2
S2["Ingest batch or streaming data"] --> S3
S3["Store raw and curated data"] --> S4
S4["Transform, validate, and catalog datasets"] --> S5
S5["Secure, govern, and monitor pipelines"] --> S6
S6["Serve analytics or ML consumers"]
| Cue | What to remember |
|---|---|
| Ingestion | Match batch, streaming, CDC, file, API, and event sources to AWS services. |
| Storage layers | Separate raw lake storage, curated data, warehouse, and serving layers. |
| Transformation | Use Glue, EMR, Lambda, SQL, or managed services based on volume, latency, and complexity. |
| Governance | Track catalog, schema, lineage, encryption, IAM, Lake Formation, and data quality. |
| Operations | Monitor failures, retries, partitions, cost, freshness, and downstream impact. |