Try 10 focused AWS DEA-C01 questions on Data Ingestion and Transformation, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | AWS DEA-C01 |
| Topic area | Data Ingestion and Transformation |
| Blueprint weight | 34% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Data Ingestion and Transformation for AWS DEA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 34% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Data Ingestion and Transformation
A data engineering team is designing a daily ETL workflow orchestrated with AWS Step Functions to build curated analytics tables in Amazon S3 for Athena. The workflow must have clear stages (ingest, validate, transform, load, publish) and must support retries and an alert path when validation fails.
Which design choice should the team AVOID because it is a data platform anti-pattern?
Options:
A. Validate schema and record counts before running transforms, and route failures to an SNS notification
B. Publish by atomically updating a “current” pointer (for example, a manifest or view) only after load succeeds
C. Overwrite the raw S3 landing prefix with cleaned data
D. Ingest to an S3 raw prefix and write transformed output to a separate S3 curated prefix
Best answer: C
Explanation: You should avoid overwriting the raw/landing zone with processed data because it destroys the immutable source-of-truth needed for replay, debugging, and audits. A well-designed ETL workflow keeps ingest data separate from validated/transformed outputs. It also uses explicit success/failure paths so only fully successful runs are published.
A high-level ETL workflow should separate stages and data zones so each step has clear inputs/outputs and you can trace, replay, and troubleshoot runs. The key principle is to keep the ingest/raw stage immutable: land the original data, then validate and transform into downstream zones (staging/curated) without mutating the raw artifacts.
A typical orchestration pattern is:

- Ingest source data into an immutable raw S3 prefix.
- Validate schema and record counts, routing failures to an SNS alert path.
- Transform validated data into a separate curated prefix.
- Load the curated output, then publish by atomically updating a “current” pointer (for example, a manifest or view) only after the load succeeds.

Overwriting the raw prefix violates governance and operational resilience because it removes the ability to reproduce results from the original inputs.
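As a minimal sketch of the publish stage (boto3, with hypothetical bucket and key names): the load stage writes a fresh snapshot prefix, and publishing is a single small manifest update that readers resolve, so the raw zone is never mutated and consumers see either the old snapshot or the new one.

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical snapshot prefix written by the load stage for this run.
snapshot_prefix = "curated/events/run=2026-02-25/"

# Publish: atomically swap the "current" pointer only after load succeeds.
# Readers resolve this manifest first, so a failed run never becomes visible.
s3.put_object(
    Bucket="example-lake-bucket",  # hypothetical bucket
    Key="curated/events/_current.json",
    Body=json.dumps({"current_prefix": snapshot_prefix}),
)
```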
Topic: Data Ingestion and Transformation
You are using AWS Serverless Application Model (AWS SAM) to deploy a serverless data-ingestion workflow that includes AWS Lambda functions, an AWS Step Functions state machine, and a DynamoDB table.
Select TWO statements that are true about packaging and deploying these components with AWS SAM.
Options:
A. sam package uploads local artifacts to Amazon S3 and outputs a template with S3 URIs
B. sam deploy deploys resources without using AWS CloudFormation
C. CodeUri must reference an Amazon S3 location before you can run sam build
D. DynamoDB tables cannot be created by SAM and must be deployed separately
E. AWS::Serverless::SimpleTable automatically creates a DynamoDB global table across Regions
F. SAM supports AWS::Serverless::StateMachine to define and deploy Step Functions workflows
Correct answers: A and F
Explanation: AWS SAM is a CloudFormation transform that can describe and deploy multiple serverless resources, including Lambda functions and Step Functions state machines. When you run sam package, SAM uploads local build artifacts to Amazon S3 and rewrites the template to point to those S3 objects so CloudFormation can deploy them.
At a high level, AWS SAM templates are transformed into standard AWS CloudFormation during deployment. SAM can define Lambda functions and Step Functions workflows in the same application (for example, using AWS::Serverless::Function and AWS::Serverless::StateMachine).
Packaging and deployment typically work like this:
- sam build prepares local artifacts.
- sam package uploads local artifacts to S3 and produces an output template containing S3 URIs.
- sam deploy uses CloudFormation to create/update the stack.

SAM can also create DynamoDB tables (for example, via AWS::DynamoDB::Table or AWS::Serverless::SimpleTable), but features like multi-Region global tables still require explicit configuration rather than being automatic.
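The same flow can be scripted end to end; here is a minimal sketch assuming the SAM CLI is installed, with hypothetical bucket and stack names:

```python
import subprocess

ARTIFACT_BUCKET = "my-artifact-bucket"  # hypothetical S3 bucket

# 1. Build local artifacts (no S3 location is required before this step).
subprocess.run(["sam", "build"], check=True)

# 2. Upload artifacts to S3 and emit a template whose CodeUri values are S3 URIs.
subprocess.run(
    ["sam", "package",
     "--s3-bucket", ARTIFACT_BUCKET,
     "--output-template-file", "packaged.yaml"],
    check=True,
)

# 3. Deploy through CloudFormation (SAM is a CloudFormation transform).
subprocess.run(
    ["sam", "deploy",
     "--template-file", "packaged.yaml",
     "--stack-name", "ingestion-pipeline",  # hypothetical stack name
     "--capabilities", "CAPABILITY_IAM"],
    check=True,
)
```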
- sam package uploading artifacts to S3 matches how SAM produces a deployable template.
- AWS::Serverless::StateMachine reflects SAM’s support for Step Functions resources.
- The claims that CodeUri must point to S3 before sam build and that SimpleTable automatically creates global tables do not match SAM’s workflow and resource behavior.

Topic: Data Ingestion and Transformation
A data engineer is documenting dependencies for an ETL pipeline where a single transformation step can require outputs from multiple upstream steps, and the execution plan must not allow circular dependencies. Which data structure is the most appropriate high-level representation of this processing logic?
Options:
A. Tree
B. Hash table
C. Queue
D. Directed acyclic graph (DAG)
Best answer: D
Explanation: ETL step dependencies are best represented as a directed acyclic graph (DAG). A DAG allows a node to have multiple upstream parents (multiple prerequisites) while enforcing that no circular dependencies exist.
The core requirement is representing step-to-step dependencies where a transformation can depend on multiple prior outputs, and the dependency model must forbid cycles. A directed graph captures ordering via directed edges (upstream → downstream), and making it acyclic ensures there is a valid execution order (a topological sort).
A tree is too restrictive because it typically implies a strict hierarchy where each node has only one parent, which does not fit multi-input transformations. The key takeaway is to use a DAG for workflow/dependency modeling when joins/merges create multiple prerequisites and cycles must be prevented.
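As a minimal illustration, Python’s standard-library graphlib models exactly this: nodes with multiple predecessors plus cycle detection during topological sort (the step names are hypothetical):

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a step to its prerequisites. Note that join_orders_users has
# TWO upstream parents, which a tree (single parent per node) cannot express.
deps = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},
    "publish_curated": {"join_orders_users"},
}

try:
    # static_order() raises CycleError if a circular dependency exists.
    print("Valid execution order:", list(TopologicalSorter(deps).static_order()))
except CycleError as err:
    print("Circular dependency detected:", err)
```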
Topic: Data Ingestion and Transformation
In an Amazon S3 data lake, a table is stored as partitioned Parquet files and queried with Amazon Athena. A daily AWS Glue ETL job processes only the new/changed source records and then rewrites only the impacted date partitions (for example, dt=2026-02-25) in the curated S3 prefix, leaving all other partitions untouched to avoid reprocessing the full dataset while keeping results correct.
Which data engineering term best describes this incremental transformation strategy?
Options:
A. Iceberg snapshot
B. Partition overwrite
C. Partition pruning
D. CDC-based upsert (MERGE)
Best answer: B
Explanation: This pattern is a partition-level incremental rewrite: the ETL job recalculates and replaces only partitions that contain changed data. That minimizes recomputation because unchanged partitions are not rewritten. Correctness is preserved because each affected partition is fully regenerated to reflect the latest inputs.
Partition overwrite is an incremental transformation strategy commonly used in S3-backed, partitioned datasets where updating individual rows in-place is not the primary mechanism. The job identifies which partition keys are impacted by new or changed records (for example, specific dt values), recomputes the full output for just those partitions, and writes them back by replacing the existing files for those partitions.
This reduces work and cost versus rebuilding the entire table while maintaining correctness because each rewritten partition becomes a complete, current representation for that key. A closely related but different concept is partition pruning, which is a query-time optimization that skips scanning irrelevant partitions; it does not describe an ETL write strategy.
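A minimal PySpark sketch of the write side, with hypothetical S3 paths: when dynamic partition-overwrite mode is enabled, only the dt partitions present in the incremental DataFrame are replaced, and every other partition is left untouched.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-overwrite").getOrCreate()

# Replace only the partitions present in the written DataFrame, not the table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Hypothetical staging path holding only the recomputed (impacted) partitions.
changed = spark.read.parquet("s3://example-bucket/staging/changed-records/")

(changed.write
    .mode("overwrite")
    .partitionBy("dt")
    .parquet("s3://example-bucket/curated/events/"))
```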
Topic: Data Ingestion and Transformation
A company runs an AWS Glue Spark ETL job that must extract data from two sources: an Amazon Aurora database in a private subnet, and an on-premises SQL Server reachable over a site-to-site VPN.

The job must authenticate without hardcoding passwords in code and must use JDBC/ODBC-style connectivity. Which THREE actions will meet these requirements?
Options:
A. Create an AWS Glue JDBC connection for Aurora that specifies a reachable private subnet and security group
B. Use the Amazon Athena ODBC driver to query Aurora and SQL Server directly without a VPC connection
C. Export data to Amazon S3 first, then use an ODBC driver to read it from the databases
D. Create an AWS Glue JDBC connection for the on-prem SQL Server and run the job in a subnet that routes to the VPN
E. Store the database credentials in AWS Secrets Manager and configure the Glue connections to use the secret
F. Make Aurora publicly accessible and allow inbound from the internet so Glue can connect from outside the VPC
Correct answers: A, D and E
Explanation: AWS Glue connects to databases through JDBC connections that can be attached to specific VPC subnets and security groups. For private databases and on-premises sources, the Glue job must run with network paths to those endpoints (private subnets for Aurora and VPN routing for on-prem). To avoid hardcoded passwords, use Secrets Manager-backed credentials in the Glue connections.
The core requirement is establishing JDBC/ODBC-style connectivity while honoring network isolation and secure authentication. AWS Glue database access is typically done with JDBC connections, and when you attach a connection to a VPC, Glue creates elastic network interfaces (ENIs) in the selected subnets and applies the selected security groups.
To satisfy the stem:

- Create a Glue JDBC connection for Aurora that specifies a reachable private subnet and security group.
- Create a Glue JDBC connection for the on-premises SQL Server and run the job in a subnet that routes to the VPN.
- Store the database credentials in AWS Secrets Manager and configure both Glue connections to reference the secret.

Options that bypass VPC routing or rely on making private databases public do not meet the stated constraints.
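A hedged boto3 sketch of the Aurora connection (the name, JDBC URL, secret ID, subnet, and security group are all assumptions; the on-premises SQL Server connection is analogous, using a subnet that routes over the VPN):

```python
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "aurora-sales-source",  # hypothetical connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://aurora-endpoint:3306/sales",
            # Reference a Secrets Manager secret instead of embedding
            # USERNAME/PASSWORD in code or job parameters.
            "SECRET_ID": "prod/aurora/etl-user",  # hypothetical secret name
        },
        # Glue creates ENIs in this subnet with these security groups, giving
        # the job a private network path to the database.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```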
Topic: Data Ingestion and Transformation
A data engineering team orchestrates ingestion pipelines with AWS Step Functions and AWS Glue. The team needs near-real-time notifications when a pipeline succeeds or fails, and it also wants to optionally buffer state-change events for downstream automation.
Which statement is INCORRECT about using Amazon SNS and/or Amazon SQS for these pipeline state-change notifications?
Options:
A. Create a CloudWatch alarm on a Glue job failure metric and set the alarm action to an SNS topic.
B. Use Amazon SQS to send email/SMS alerts directly to on-call engineers when a pipeline fails.
C. Subscribe an SQS queue to an SNS topic to buffer pipeline events for consumers, and configure a DLQ/redrive policy.
D. Use Amazon EventBridge rules for Step Functions execution status changes to publish to an SNS topic.
Best answer: B
Explanation: SNS is the AWS service that delivers notifications to endpoints such as email, SMS, HTTP/S, and Lambda, and it integrates cleanly with EventBridge and CloudWatch alarms. SQS is used to durably buffer messages for asynchronous consumers, typically by subscribing a queue to an SNS topic or targeting the queue from EventBridge.
For pipeline state changes, a common pattern is to generate an event (from Step Functions via EventBridge or from CloudWatch alarms) and publish it to an SNS topic for operator notifications and fanout. SNS can deliver directly to email/SMS and invoke Lambda or publish to HTTP/S endpoints.
SQS is not a notification delivery service; it stores messages so worker applications can poll and process them reliably. When you want both human notifications and buffered events for automation, publish once to SNS and add multiple subscriptions (for example, email plus an SQS queue). For the SQS subscription path, configure a dead-letter queue and redrive policy to handle poison messages and processing failures.
Key takeaway: use SNS for alert delivery and SQS for durable buffering/decoupling.
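A hedged boto3 sketch of the fanout pattern (the topic, queue, and email endpoint are assumptions, and the SQS access policy that allows the topic to deliver to the queue is omitted for brevity):

```python
import json

import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

topic_arn = sns.create_topic(Name="pipeline-state-changes")["TopicArn"]

# Human notifications: SNS delivers directly to email/SMS endpoints.
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="oncall@example.com")

# Durable buffering: a DLQ plus redrive policy handles poison messages.
dlq_url = sqs.create_queue(QueueName="pipeline-events-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

queue_url = sqs.create_queue(
    QueueName="pipeline-events",
    Attributes={"RedrivePolicy": json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
    )},
)["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Publish once to SNS; both the email and the queue receive the event.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
```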
Topic: Data Ingestion and Transformation
A data engineer is building an AWS Glue job that integrates two sources into a curated S3 table.
The curated dataset is joined on a customer identifier, aligned by event_time, and must support job retries and backfills without creating duplicate integrated rows.
Which action best reflects the core principle that should guide the deduplication and replay behavior for this integration?
Options:
A. Use separate IAM roles for ETL jobs and data governance admins
B. Encrypt all source and curated data with TLS and AWS KMS
C. Use deterministic join keys and idempotent upserts into the curated table
D. Store source extracts as append-only objects in the raw S3 zone
Best answer: C
Explanation: The guiding principle is idempotent processing: running the same integration multiple times should not change the outcome. Defining stable join keys plus deterministic deduplication (often using event time and a unique record identifier) enables safe retries and backfills. An upsert-style write ensures duplicates are not reintroduced when data is replayed.
Idempotent processing means a pipeline can safely retry or reprocess the same inputs and still produce the same outputs. For multi-source integration, this typically requires (1) a stable join key (often a composite business key), (2) timestamp alignment rules (for example, choosing the correct event_time window or latest version), and (3) deterministic deduplication rules so repeated deliveries or CDC replays don’t create additional curated rows.
A common high-level approach is:

- Stage incoming records from each source with a stable, deterministic join key (often a composite business key).
- Deduplicate per key, using event_time (and possibly a version/sequence) to pick the winning record.
- Upsert the winning records into the curated table so replays overwrite rather than append.

This directly supports retries and backfills, whereas security/governance controls or raw-zone immutability don’t by themselves prevent duplicate integrated results. A sketch of the dedup step follows.
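A minimal PySpark sketch of the deterministic dedup step, using the customer_id and event_time columns from the stem and hypothetical paths (the final upsert, for example a MERGE into an Iceberg/Hudi/Delta table, is noted but not shown):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("idempotent-integration").getOrCreate()

incoming = spark.read.parquet("s3://example-bucket/staging/incoming/")

# Deterministic winner per business key: the latest event_time wins, so
# replaying the same inputs always yields the same single row per customer.
w = Window.partitionBy("customer_id").orderBy(F.col("event_time").desc())

deduped = (incoming
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn"))

# An upsert (MERGE) of `deduped` into the curated table then makes retries
# and backfills a no-op rather than a source of duplicates.
```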
Topic: Data Ingestion and Transformation
A company ingests clickstream events (JSON, ~1 KB each) into Amazon Kinesis Data Streams at a steady 5,000 events/second. The pipeline must enrich events with a small reference dataset and write partitioned Parquet to Amazon S3 within 2 minutes end-to-end. The team wants a managed, serverless operational model (no cluster management). Because the source delivery is at-least-once, the transformation must tolerate retries/reprocessing without creating duplicate outputs.
Which option is the best choice and most directly applies the core principle of idempotent processing?
Options:
A. Run an AWS Glue streaming ETL job with checkpointing
B. Run Spark on an Amazon EMR cluster reading from Kinesis
C. Load to Amazon Redshift and transform using SQL in-place
D. Use AWS Lambda to transform each record from Kinesis
Best answer: A
Explanation: An AWS Glue streaming ETL job provides a managed, serverless way to run continuous transformations at this throughput while meeting a minutes-level SLA. With streaming checkpoints, the job can restart or replay data without producing duplicate outputs, which directly implements the idempotent processing principle for at-least-once sources.
The key principle is idempotent processing: a transformation should be safe to retry so that reprocessing the same input does not create duplicate or inconsistent results. With Kinesis Data Streams, retries and replays can occur, so the transform layer needs built-in support for tracking progress and resuming.
AWS Glue streaming ETL (Spark Structured Streaming) is a good fit here because it is managed/serverless, supports state and continuous processing, and uses checkpoints to track what has been processed so restarts don’t re-emit duplicates.
A per-record function approach can work for simple stateless transforms, but for sustained high event rates plus enrichment and replay safety, a managed streaming ETL job with checkpointing is typically the most direct fit.
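A hedged sketch of the checkpointing idea using Spark Structured Streaming, the engine behind AWS Glue streaming ETL (the stream name, event schema, join key, and S3 paths are assumptions, not a complete Glue job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

# Assumed minimal event schema; real clickstream events carry more fields.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

# Small reference dataset used to enrich each micro-batch.
reference = spark.read.parquet("s3://example-bucket/reference/")

raw = (spark.readStream
    .format("kinesis")                    # Kinesis source available in Glue streaming
    .option("streamName", "clickstream")  # assumed stream name
    .option("startingPosition", "LATEST")
    .load())

# The Kinesis source delivers the payload as a binary `data` column.
events = (raw
    .select(F.from_json(F.col("data").cast("string"), schema).alias("e"))
    .select("e.*"))

enriched = events.join(reference, "customer_id")

# The checkpoint records which input has been processed, so a restart resumes
# from the checkpoint instead of re-emitting already-written output.
(enriched.writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/curated/clickstream/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/clickstream/")
    .start())
```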
Topic: Data Ingestion and Transformation
In an AWS Glue (Apache Spark) ETL job, which observation most strongly indicates partition/data skew as the primary cause of poor performance?
Options:
A. Shuffle read/write metrics are high and evenly distributed across tasks
B. Most tasks finish quickly, but a few tasks run much longer than the rest
C. All tasks take about the same time and scale linearly with input size
D. Executors show low CPU but consistently high S3 read latency
Best answer: B
Explanation: Partition skew happens when a small number of partitions contain much more data than others, causing a “long tail” where only a few tasks keep running while most finish. This leads to poor parallelism because the job’s stage completion is gated by those straggler tasks.
In Spark-based processing (including AWS Glue Spark jobs), each stage is completed only when all tasks for that stage finish. With partition/data skew, some partitions are much larger (often from hot keys in joins/aggregations), so the tasks processing those partitions take far longer than the rest. The cluster can look partially idle near the end of a stage because most executors have finished their smaller partitions while a few executors are stuck on oversized partitions.
A strong skew signal is: most tasks in a stage finish quickly while a few straggler tasks run much longer, often with those same tasks showing far larger shuffle read sizes than their peers.

Key takeaway: skew produces highly uneven task runtimes; general resource shortages or I/O bottlenecks tend to slow tasks more uniformly.
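When you suspect skew, a quick key-distribution check often confirms it; a minimal PySpark sketch with a hypothetical input path and key column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

df = spark.read.parquet("s3://example-bucket/staging/events/")

# If a handful of keys dominate the counts, the partitions holding those hot
# keys become the straggler tasks in joins and aggregations.
(df.groupBy("customer_id")
   .count()
   .orderBy(F.desc("count"))
   .show(10))
```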
Topic: Data Ingestion and Transformation
A team orchestrates a daily ingestion and transformation pipeline using AWS Step Functions and AWS Glue. Auditors require end-to-end traceability so the team can prove which source data and which transformation version produced any curated dataset.
Which statement is NOT correct (unsafe) for designing auditable run metadata?
Options:
A. Keep only the latest successful run metadata because S3 data is immutable.
B. Generate a unique run ID and persist run start/end timestamps and status.
C. Store the transformation version (for example, Git commit or image digest).
D. Record the source version identifiers used by the run.
Best answer: A
Explanation: Auditable pipelines require durable, queryable history of each execution, including run IDs, timestamps, source versions, and transformation versions. Deleting prior run records breaks the ability to reconstruct lineage for older outputs and prevents proving what happened during backfills or reruns. Retaining historical run metadata is a core requirement for traceability.
Traceability depends on being able to reconstruct “what ran, when it ran, what it read, and what it produced” for any point in time. A good design generates a run ID for every execution and persists run-level metadata in a durable store (for example, DynamoDB, a metadata RDS database, or an append-only S3/Glue table) including timestamps and outcome.
To make lineage auditable, each run record should also capture (1) source version identifiers (such as an S3 object version ID, database snapshot ID, or CDC high-watermark/LSN) and (2) transformation version (such as a Git commit hash, Glue job script version, or container image digest). Keeping only the latest run metadata is unsafe because audits often require reconstructing historical outputs and explaining reruns/backfills even when data files are immutable.
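A minimal boto3 sketch of persisting one run record (the DynamoDB table name, key schema, and attribute names are all assumptions):

```python
import uuid
from datetime import datetime, timezone

import boto3

# Hypothetical metadata table keyed by run_id.
table = boto3.resource("dynamodb").Table("pipeline-run-metadata")

table.put_item(Item={
    "run_id": str(uuid.uuid4()),  # unique ID generated per execution
    "started_at": datetime.now(timezone.utc).isoformat(),
    "finished_at": None,          # updated when the run completes
    "status": "RUNNING",
    # What the run read: e.g., S3 object version IDs or a CDC high-watermark.
    "source_versions": {"orders_extract": "s3-version-abc123"},
    # What transformed it: e.g., a Git commit hash or container image digest.
    "transform_version": "git:4f2a9c1",
})
```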
Use the AWS DEA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.