AWS DEA-C01: Data Ingestion and Transformation

Try 10 focused AWS DEA-C01 questions on Data Ingestion and Transformation, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Topic snapshot

  • Exam route: AWS DEA-C01
  • Topic area: Data Ingestion and Transformation
  • Blueprint weight: 34%
  • Page purpose: Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Data Ingestion and Transformation for AWS DEA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

  • First attempt: answer without checking the explanation first. Record the fact, rule, calculation, or judgment point that controlled your answer.
  • Review: read the explanation even when you were correct. Record why the best answer is stronger than the closest distractor.
  • Repair: repeat only missed or uncertain items after a short break. Record the pattern behind misses, not the answer letter.
  • Transfer: return to mixed practice once the topic feels stable. Record whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 34% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Data Ingestion and Transformation

A data engineering team is designing a daily ETL workflow orchestrated with AWS Step Functions to build curated analytics tables in Amazon S3 for Athena. The workflow must have clear stages (ingest, validate, transform, load, publish) and must support retries and an alert path when validation fails.

Which design choice should the team AVOID because it is a data platform anti-pattern?

Options:

  • A. Validate schema and record counts before running transforms, and route failures to an SNS notification

  • B. Publish by atomically updating a “current” pointer (for example, a manifest or view) only after load succeeds

  • C. Overwrite the raw S3 landing prefix with cleaned data

  • D. Ingest to an S3 raw prefix and write transformed output to a separate S3 curated prefix

Best answer: C

Explanation: You should avoid overwriting the raw/landing zone with processed data because doing so destroys the immutable source of truth needed for replay, debugging, and audits. A well-designed ETL workflow keeps ingested raw data separate from validated/transformed outputs, and it uses explicit success/failure paths so only fully successful runs are published.

A high-level ETL workflow should separate stages and data zones so each step has clear inputs/outputs and you can trace, replay, and troubleshoot runs. The key principle is to keep the ingest/raw stage immutable: land the original data, then validate and transform into downstream zones (staging/curated) without mutating the raw artifacts.

A typical orchestration pattern is:

  • Ingest to an S3 raw prefix (original files)
  • Validate (schema, completeness, duplication) with a failure branch for alerts
  • Transform (for example, AWS Glue job) to curated
  • Load (write partitioned curated data)
  • Publish only after successful load (for example, update a manifest/view)

Overwriting the raw prefix violates governance and operational resilience because it removes the ability to reproduce results from the original inputs.

  • Separate raw and curated zones supports clear stage boundaries and reprocessing.
  • Validation with an alert branch defines an explicit failure path without publishing bad data.
  • Publish only after success prevents partial loads from becoming the consumer-facing dataset.
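The zone separation and publish-after-success pattern above can be sketched in plain Python (a toy simulation using local directories in place of S3 prefixes; names like `run_pipeline` are illustrative, not an AWS API):

```python
import json
import tempfile
from pathlib import Path

def run_pipeline(base: Path, run_id: str, records: list) -> None:
    """Simulate one daily run: land raw data, transform to curated,
    then repoint 'current' only after the load succeeds."""
    raw = base / "raw" / run_id
    curated = base / "curated" / run_id
    raw.mkdir(parents=True)
    curated.mkdir(parents=True)

    # Ingest: land the original records untouched (immutable raw zone).
    (raw / "events.json").write_text(json.dumps(records))

    # Validate: fail the run (so nothing is published) on a bad record.
    if any("id" not in r for r in records):
        raise ValueError("validation failed: missing id")

    # Transform + load: write cleaned output only to the curated zone.
    cleaned = [{**r, "id": str(r["id"])} for r in records]
    (curated / "events.json").write_text(json.dumps(cleaned))

    # Publish: update the 'current' pointer only after the load succeeded.
    (base / "current.json").write_text(json.dumps({"run": run_id}))

base = Path(tempfile.mkdtemp())
run_pipeline(base, "dt=2026-02-25", [{"id": 1}, {"id": 2}])
current = json.loads((base / "current.json").read_text())
print(current["run"])
```

If validation raises, `current.json` is never rewritten, so consumers keep seeing the last good run, and the raw zone still holds the original files for replay.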

Question 2

Topic: Data Ingestion and Transformation

You are using AWS Serverless Application Model (AWS SAM) to deploy a serverless data-ingestion workflow that includes AWS Lambda functions, an AWS Step Functions state machine, and a DynamoDB table.

Select TWO statements that are true about packaging and deploying these components with AWS SAM.

Options:

  • A. sam package uploads local artifacts to Amazon S3 and outputs a template with S3 URIs

  • B. sam deploy deploys resources without using AWS CloudFormation

  • C. CodeUri must reference an Amazon S3 location before you can run sam build

  • D. DynamoDB tables cannot be created by SAM and must be deployed separately

  • E. AWS::Serverless::SimpleTable automatically creates a DynamoDB global table across Regions

  • F. SAM supports AWS::Serverless::StateMachine to define and deploy Step Functions workflows

Correct answers: A and F

Explanation: AWS SAM is a CloudFormation transform that can describe and deploy multiple serverless resources, including Lambda functions and Step Functions state machines. When you run sam package, SAM uploads local build artifacts to Amazon S3 and rewrites the template to point to those S3 objects so CloudFormation can deploy them.

At a high level, AWS SAM templates are transformed into standard AWS CloudFormation during deployment. SAM can define Lambda functions and Step Functions workflows in the same application (for example, using AWS::Serverless::Function and AWS::Serverless::StateMachine).

Packaging and deployment typically work like this:

  • Build (optional): sam build prepares local artifacts.
  • Package: sam package uploads local artifacts to S3 and produces an output template containing S3 URIs.
  • Deploy: sam deploy uses CloudFormation to create/update the stack.

SAM can also create DynamoDB tables (for example, via AWS::DynamoDB::Table or AWS::Serverless::SimpleTable), but features like multi-Region global tables still require explicit configuration rather than being automatic.

  • OK: the statement about sam package uploading artifacts to S3 matches how SAM produces a deployable template.
  • OK: the statement about AWS::Serverless::StateMachine reflects SAM’s support for Step Functions resources.
  • NO: the statement claiming SAM can’t create DynamoDB tables is wrong because SAM/CloudFormation can define them in the same stack.
  • NO: the statements claiming deploy bypasses CloudFormation, CodeUri must be S3 pre-build, or SimpleTable auto-creates global tables are incorrect for SAM’s workflow and resource behavior.

Question 3

Topic: Data Ingestion and Transformation

A data engineer is documenting dependencies for an ETL pipeline where a single transformation step can require outputs from multiple upstream steps, and the execution plan must not allow circular dependencies. Which data structure is the most appropriate high-level representation of this processing logic?

Options:

  • A. Tree

  • B. Hash table

  • C. Queue

  • D. Directed acyclic graph (DAG)

Best answer: D

Explanation: ETL step dependencies are best represented as a directed acyclic graph (DAG). A DAG allows a node to have multiple upstream parents (multiple prerequisites) while enforcing that no circular dependencies exist.

The core requirement is representing step-to-step dependencies where a transformation can depend on multiple prior outputs, and the dependency model must forbid cycles. A directed graph captures ordering via directed edges (upstream to downstream), and making it acyclic ensures there is a valid execution order (a topological sort).

A tree is too restrictive because it typically implies a strict hierarchy where each node has only one parent, which does not fit multi-input transformations. The key takeaway is to use a DAG for workflow/dependency modeling when joins/merges create multiple prerequisites and cycles must be prevented.

  • Tree hierarchy is too restrictive because ETL steps can have multiple upstream dependencies.
  • Hash table lookup stores key/value associations but does not represent execution dependencies.
  • Queue ordering represents a linear processing order, not a branching/merging dependency structure.
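Python's standard library can demonstrate both properties of a DAG directly (this is a generic illustration, not tied to any specific orchestrator):

```python
from graphlib import TopologicalSorter, CycleError

# Each step maps to the set of upstream steps it depends on. Note that
# "load" has two parents, which a tree could not express.
deps = {
    "validate": {"ingest"},
    "transform": {"validate"},
    "enrich": {"validate"},
    "load": {"transform", "enrich"},
    "publish": {"load"},
}

# A topological sort yields a valid execution order for the DAG.
order = list(TopologicalSorter(deps).static_order())
print(order)

# Adding a back-edge creates a cycle, which the sorter rejects.
deps["ingest"] = {"publish"}
try:
    list(TopologicalSorter(deps).static_order())
except CycleError:
    print("cycle detected")
```

The same idea underlies workflow engines: a schedulable plan exists exactly when the dependency graph has no cycles.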

Question 4

Topic: Data Ingestion and Transformation

In an Amazon S3 data lake, a table is stored as partitioned Parquet files and queried with Amazon Athena. A daily AWS Glue ETL job processes only the new/changed source records and then rewrites only the impacted date partitions (for example, dt=2026-02-25) in the curated S3 prefix, leaving all other partitions untouched to avoid reprocessing the full dataset while keeping results correct.

Which data engineering term best describes this incremental transformation strategy?

Options:

  • A. Iceberg snapshot

  • B. Partition overwrite

  • C. Partition pruning

  • D. CDC-based upsert (MERGE)

Best answer: B

Explanation: This pattern is a partition-level incremental rewrite: the ETL job recalculates and replaces only partitions that contain changed data. That minimizes recomputation because unchanged partitions are not rewritten. Correctness is preserved because each affected partition is fully regenerated to reflect the latest inputs.

Partition overwrite is an incremental transformation strategy commonly used in S3-backed, partitioned datasets where updating individual rows in-place is not the primary mechanism. The job identifies which partition keys are impacted by new or changed records (for example, specific dt values), recomputes the full output for just those partitions, and writes them back by replacing the existing files for those partitions.

This reduces work and cost versus rebuilding the entire table while maintaining correctness because each rewritten partition becomes a complete, current representation for that key. A closely related but different concept is partition pruning, which is a query-time optimization that skips scanning irrelevant partitions; it does not describe an ETL write strategy.

  • Partition pruning is a read/query optimization that reduces scanned partitions, not a write/update method.
  • CDC-based upsert (MERGE) is row-level change application (common in warehouses/table formats that support merges), not “rewrite the whole partition.”
  • Iceberg snapshot is a table-format metadata/versioning concept; it’s not the name of the ETL technique of replacing only specific partitions.
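A minimal sketch of partition overwrite, using local directories to stand in for `dt=` prefixes in S3 (the helper name `overwrite_partitions` is illustrative):

```python
import json
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path

def overwrite_partitions(table: Path, changed: list) -> set:
    """Group changed records by partition key and fully replace only
    the impacted dt= directories, leaving all others untouched."""
    by_dt = defaultdict(list)
    for rec in changed:
        by_dt[rec["dt"]].append(rec)
    for dt, recs in by_dt.items():
        part = table / f"dt={dt}"
        if part.exists():
            shutil.rmtree(part)        # drop the stale partition files
        part.mkdir(parents=True)
        (part / "part-0.json").write_text(json.dumps(recs))
    return set(by_dt)

table = Path(tempfile.mkdtemp())
# Existing partitions from a previous run.
for dt in ("2026-02-24", "2026-02-25"):
    (table / f"dt={dt}").mkdir(parents=True)
    (table / f"dt={dt}" / "part-0.json").write_text("[]")

touched = overwrite_partitions(table, [{"dt": "2026-02-25", "v": 1}])
print(sorted(touched))
```

Only the `dt=2026-02-25` partition is rewritten; `dt=2026-02-24` keeps its previous files, which is what makes the strategy incremental yet correct per partition.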

Question 5

Topic: Data Ingestion and Transformation

A company runs an AWS Glue Spark ETL job that must extract data from two sources:

  • An Amazon Aurora PostgreSQL DB cluster in private subnets (not publicly accessible)
  • An on-premises Microsoft SQL Server reachable from the VPC through an existing AWS Site-to-Site VPN

The job must authenticate without hardcoding passwords in code and must use JDBC/ODBC-style connectivity. Which THREE actions will meet these requirements?

Options:

  • A. Create an AWS Glue JDBC connection for Aurora that specifies a reachable private subnet and security group

  • B. Use the Amazon Athena ODBC driver to query Aurora and SQL Server directly without a VPC connection

  • C. Export data to Amazon S3 first, then use an ODBC driver to read it from the databases

  • D. Create an AWS Glue JDBC connection for the on-prem SQL Server and run the job in a subnet that routes to the VPN

  • E. Store the database credentials in AWS Secrets Manager and configure the Glue connections to use the secret

  • F. Make Aurora publicly accessible and allow inbound from the internet so Glue can connect from outside the VPC

Correct answers: A, D and E

Explanation: AWS Glue connects to databases through JDBC connections that can be attached to specific VPC subnets and security groups. For private databases and on-premises sources, the Glue job must run with network paths to those endpoints (private subnets for Aurora and VPN routing for on-prem). To avoid hardcoded passwords, use Secrets Manager-backed credentials in the Glue connections.

The core requirement is establishing JDBC/ODBC-style connectivity while honoring network isolation and secure authentication. AWS Glue database access is typically done with JDBC connections, and when you attach a connection to a VPC, Glue creates elastic network interfaces (ENIs) in the selected subnets and applies the selected security groups.

To satisfy the stem:

  • Use a Glue JDBC connection that targets Aurora’s private endpoint and runs in subnets/security groups that can reach the DB port.
  • Use a Glue JDBC connection for the on-prem SQL Server, ensuring the chosen subnets have routes to the Site-to-Site VPN and security rules allow the traffic.
  • Use AWS Secrets Manager for credentials so the job does not store passwords in code.

Options that bypass VPC routing or rely on making private databases public do not meet the stated constraints.

  • OK Create a Glue JDBC connection for Aurora with VPC subnet/SG so the job can reach the private endpoint.
  • OK Create a Glue JDBC connection for on-prem SQL Server and run in subnets that route to the VPN.
  • OK Use Secrets Manager with Glue connections to avoid hardcoding credentials.
  • NO Athena ODBC is for querying with Athena (primarily S3-based data), not direct JDBC access to Aurora/SQL Server in this scenario.
  • NO Making Aurora public violates the private-only connectivity constraint and increases exposure.
  • NO Exporting to S3 first changes the source connectivity approach and does not provide direct database JDBC/ODBC connectivity.
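To see why Secrets Manager removes hardcoded passwords, here is a small sketch that builds a JDBC URL from an RDS-style secret payload (the standard secret shape with `username`, `password`, `engine`, `host`, `port`, `dbname`). In a real job, Glue resolves the secret through the connection itself; the literal JSON below only stands in for that lookup:

```python
import json

def jdbc_url_from_secret(secret_json: str):
    """Build a JDBC URL plus credentials from an RDS-style secret
    payload, so nothing is hardcoded in the job script."""
    s = json.loads(secret_json)
    url = f"jdbc:{s['engine']}://{s['host']}:{s['port']}/{s['dbname']}"
    return url, s["username"], s["password"]

# Stand-in for the payload Secrets Manager would return at run time.
secret = json.dumps({
    "engine": "postgresql", "host": "aurora.internal", "port": 5432,
    "dbname": "sales", "username": "etl_user", "password": "example",
})
url, user, _ = jdbc_url_from_secret(secret)
print(url)
```

Rotating the secret then changes the credentials without touching job code, which is the point of option E.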

Question 6

Topic: Data Ingestion and Transformation

A data engineering team orchestrates ingestion pipelines with AWS Step Functions and AWS Glue. The team needs near-real-time notifications when a pipeline succeeds or fails, and it also wants to optionally buffer state-change events for downstream automation.

Which statement is INCORRECT about using Amazon SNS and/or Amazon SQS for these pipeline state-change notifications?

Options:

  • A. Create a CloudWatch alarm on a Glue job failure metric and set the alarm action to an SNS topic.

  • B. Use Amazon SQS to send email/SMS alerts directly to on-call engineers when a pipeline fails.

  • C. Subscribe an SQS queue to an SNS topic to buffer pipeline events for consumers, and configure a DLQ/redrive policy.

  • D. Use Amazon EventBridge rules for Step Functions execution status changes to publish to an SNS topic.

Best answer: B

Explanation: SNS is the AWS service that delivers notifications to endpoints such as email, SMS, HTTP/S, and Lambda, and it integrates cleanly with EventBridge and CloudWatch alarms. SQS is used to durably buffer messages for asynchronous consumers, typically by subscribing a queue to an SNS topic or targeting the queue from EventBridge.

For pipeline state changes, a common pattern is to generate an event (from Step Functions via EventBridge or from CloudWatch alarms) and publish it to an SNS topic for operator notifications and fanout. SNS can deliver directly to email/SMS and invoke Lambda or publish to HTTP/S endpoints.

SQS is not a notification delivery service; it stores messages so worker applications can poll and process them reliably. When you want both human notifications and buffered events for automation, publish once to SNS and add multiple subscriptions (for example, email plus an SQS queue). For the SQS subscription path, configure a dead-letter queue and redrive policy to handle poison messages and processing failures.

Key takeaway: use SNS for alert delivery and SQS for durable buffering/decoupling.

  • Event-driven notifications using EventBridge to route Step Functions status changes to SNS is a standard pattern.
  • Alarm-based alerting is supported because CloudWatch alarms can trigger SNS actions on state changes.
  • Buffering for automation is appropriate by subscribing SQS to SNS and adding a DLQ/redrive policy for resilience.
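The publish-once, fan-out-to-many shape can be illustrated with a toy in-process model (plain Python objects standing in for SNS and SQS; the `Topic` class is purely illustrative):

```python
from collections import deque

class Topic:
    """Toy stand-in for an SNS topic: one publish fans out to every
    subscription (here, an alert callback and a buffering queue)."""
    def __init__(self):
        self.subscriptions = []

    def subscribe(self, handler):
        self.subscriptions.append(handler)

    def publish(self, message: dict):
        for handler in self.subscriptions:
            handler(message)

alerts = []                    # stands in for email/SMS delivery via SNS
queue = deque()                # stands in for a subscribed SQS queue

topic = Topic()
topic.subscribe(lambda m: alerts.append(f"ALERT: {m['status']}"))
topic.subscribe(queue.append)  # buffered copy for downstream automation

topic.publish({"pipeline": "daily-ingest", "status": "FAILED"})
print(alerts[0], len(queue))
```

One event reaches both the human notification path and the durable buffer, which is exactly why option B is backwards: the queue never delivers email/SMS itself.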

Question 7

Topic: Data Ingestion and Transformation

A data engineer is building an AWS Glue job that integrates two sources into a curated S3 table:

  • Mobile app events (retries can resend the same event)
  • Orders from a transactional database (CDC)

The curated dataset is joined on a customer identifier, aligned by event_time, and must support job retries and backfills without creating duplicate integrated rows.

Which action best reflects the core principle that should guide the deduplication and replay behavior for this integration?

Options:

  • A. Use separate IAM roles for ETL jobs and data governance admins

  • B. Encrypt all source and curated data with TLS and AWS KMS

  • C. Use deterministic join keys and idempotent upserts into the curated table

  • D. Store source extracts as append-only objects in the raw S3 zone

Best answer: C

Explanation: The guiding principle is idempotent processing: running the same integration multiple times should not change the outcome. Defining stable join keys plus deterministic deduplication (often using event time and a unique record identifier) enables safe retries and backfills. An upsert-style write ensures duplicates are not reintroduced when data is replayed.

Idempotent processing means a pipeline can safely retry or reprocess the same inputs and still produce the same outputs. For multi-source integration, this typically requires (1) a stable join key (often a composite business key), (2) timestamp alignment rules (for example, choosing the correct event_time window or latest version), and (3) deterministic deduplication rules so repeated deliveries or CDC replays don’t create additional curated rows.

A common high-level approach is:

  • Define a unique record identifier per source (or composite key).
  • Use event_time (and possibly a version/sequence) to pick the winning record.
  • Write to curated storage with upsert/merge semantics keyed on that identifier.

This directly supports retries and backfills, whereas security/governance controls or raw-zone immutability don’t by themselves prevent duplicate integrated results.

  • Raw zone immutability is good practice, but it doesn’t ensure curated outputs are duplicate-free on replays.
  • Encryption protects confidentiality, not correctness of joins, time alignment, or deduplication.
  • Separation of duties improves governance, but it doesn’t make transformations replay-safe.
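The three-part recipe above (stable key, event-time winner selection, upsert write) can be sketched in a few lines of plain Python (a conceptual model, not Glue/Spark code; the `upsert` helper is illustrative):

```python
def upsert(curated: dict, records: list) -> dict:
    """Idempotent upsert: key each record deterministically and keep
    the latest version, so replaying the same batch changes nothing."""
    for rec in records:
        key = (rec["customer_id"], rec["order_id"])   # stable join key
        current = curated.get(key)
        # event_time picks the winning record on duplicates/updates.
        if current is None or rec["event_time"] > current["event_time"]:
            curated[key] = rec
    return curated

batch = [
    {"customer_id": "c1", "order_id": "o1", "event_time": 1, "status": "placed"},
    {"customer_id": "c1", "order_id": "o1", "event_time": 2, "status": "shipped"},
]
curated = upsert({}, batch)
replayed = upsert(dict(curated), batch)   # simulate a retry/backfill
print(len(curated), curated == replayed)
```

Reapplying the same batch leaves the curated state unchanged, which is the defining test of idempotent processing.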

Question 8

Topic: Data Ingestion and Transformation

A company ingests clickstream events (JSON, ~1 KB each) into Amazon Kinesis Data Streams at a steady 5,000 events/second. The pipeline must enrich events with a small reference dataset and write partitioned Parquet to Amazon S3 within 2 minutes end-to-end. The team wants a managed, serverless operational model (no cluster management). Because the source delivery is at-least-once, the transformation must tolerate retries/reprocessing without creating duplicate outputs.

Which option is the best choice and most directly applies the core principle of idempotent processing?

Options:

  • A. Run an AWS Glue streaming ETL job with checkpointing

  • B. Run Spark on an Amazon EMR cluster reading from Kinesis

  • C. Load to Amazon Redshift and transform using SQL in-place

  • D. Use AWS Lambda to transform each record from Kinesis

Best answer: A

Explanation: An AWS Glue streaming ETL job provides a managed, serverless way to run continuous transformations at this throughput while meeting a minutes-level SLA. With streaming checkpoints, the job can restart or replay data without producing duplicate outputs, which directly implements the idempotent processing principle for at-least-once sources.

The key principle is idempotent processing: a transformation should be safe to retry so that reprocessing the same input does not create duplicate or inconsistent results. With Kinesis Data Streams, retries and replays can occur, so the transform layer needs built-in support for tracking progress and resuming.

AWS Glue streaming ETL (Spark Structured Streaming) is a good fit here because it is managed/serverless, supports state and continuous processing, and uses checkpoints to track what has been processed so restarts don’t re-emit duplicates.

A per-record function approach can work for simple stateless transforms, but for sustained high event rates plus enrichment and replay safety, a managed streaming ETL job with checkpointing is typically the most direct fit.

  • Per-record functions can become costly/operationally noisy at sustained high invocation rates and make replay-safe dedup/state handling more complex.
  • Managed clusters (Spark on EMR) meet throughput but conflict with the “no cluster management” constraint.
  • Warehouse-first transforms (Redshift) add latency and are not the right primary mechanism for streaming enrichment to S3.
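The checkpointing idea that makes at-least-once replays safe can be reduced to a toy consumer (plain Python, not Structured Streaming; sequence numbers stand in for Kinesis positions):

```python
def process(stream: list, checkpoint: dict, sink: list) -> None:
    """Checkpointed consumer: skip anything at or below the last
    committed sequence number, so a restart never re-emits output."""
    for rec in stream:
        if rec["seq"] <= checkpoint.get("seq", -1):
            continue                       # already processed before restart
        sink.append({**rec, "enriched": True})
        checkpoint["seq"] = rec["seq"]     # commit progress after writing

stream = [{"seq": i, "event": f"click-{i}"} for i in range(5)]
checkpoint, sink = {}, []
process(stream[:3], checkpoint, sink)      # first run stops after 3 records
process(stream, checkpoint, sink)          # at-least-once redelivery of all 5
print(len(sink))                           # no duplicates despite the replay
```

The second call redelivers all five records, yet the sink still holds exactly five outputs because the checkpoint gates reprocessing.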

Question 9

Topic: Data Ingestion and Transformation

In an AWS Glue (Apache Spark) ETL job, which observation most strongly indicates partition/data skew as the primary cause of poor performance?

Options:

  • A. Shuffle read/write metrics are high and evenly distributed across tasks

  • B. Most tasks finish quickly, but a few tasks run much longer than the rest

  • C. All tasks take about the same time and scale linearly with input size

  • D. Executors show low CPU but consistently high S3 read latency

Best answer: B

Explanation: Partition skew happens when a small number of partitions contain much more data than others, causing a “long tail” where only a few tasks keep running while most finish. This leads to poor parallelism because the job’s stage completion is gated by those straggler tasks.

In Spark-based processing (including AWS Glue Spark jobs), each stage is completed only when all tasks for that stage finish. With partition/data skew, some partitions are much larger (often from hot keys in joins/aggregations), so the tasks processing those partitions take far longer than the rest. The cluster can look partially idle near the end of a stage because most executors have finished their smaller partitions while a few executors are stuck on oversized partitions.

A strong skew signal is:

  • Many tasks complete quickly
  • A small set of tasks have very large durations (stragglers)
  • Stage completion time is dominated by that tail

Key takeaway: skew produces highly uneven task runtimes; general resource shortages or I/O bottlenecks tend to slow tasks more uniformly.

  • Uniform task durations points to overall throughput limits, not skew-driven stragglers.
  • High S3 read latency is an I/O bottleneck symptom rather than uneven partition sizes.
  • Evenly distributed shuffle suggests widespread shuffle cost, not a few oversized partitions dominating runtime.
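The straggler signature is easy to detect numerically from task durations (a generic heuristic sketch; the `find_stragglers` helper and the 5x threshold are illustrative, not a Spark metric):

```python
from statistics import median

def find_stragglers(durations: list, factor: float = 5.0) -> list:
    """Flag tasks whose runtime is far above the median; a short list
    of such outliers is the classic partition-skew signature."""
    m = median(durations)
    return [i for i, d in enumerate(durations) if d > factor * m]

# 30 tasks finish in ~10s, but two hot-key partitions run for minutes.
durations = [10.0] * 30 + [310.0, 290.0]
stragglers = find_stragglers(durations)
print(len(stragglers))   # the long tail of 2 tasks gates the stage
```

Uniformly slow tasks (an I/O or throughput limit) would produce an empty straggler list, which is how the two failure modes are told apart.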

Question 10

Topic: Data Ingestion and Transformation

A team orchestrates a daily ingestion and transformation pipeline using AWS Step Functions and AWS Glue. Auditors require end-to-end traceability so the team can prove which source data and which transformation version produced any curated dataset.

Which statement is NOT correct (unsafe) for designing auditable run metadata?

Options:

  • A. Keep only the latest successful run metadata because S3 data is immutable.

  • B. Generate a unique run ID and persist run start/end timestamps and status.

  • C. Store the transformation version (for example, Git commit or image digest).

  • D. Record the source version identifiers used by the run.

Best answer: A

Explanation: Auditable pipelines require durable, queryable history of each execution, including run IDs, timestamps, source versions, and transformation versions. Deleting prior run records breaks the ability to reconstruct lineage for older outputs and prevents proving what happened during backfills or reruns. Retaining historical run metadata is a core requirement for traceability.

Traceability depends on being able to reconstruct “what ran, when it ran, what it read, and what it produced” for any point in time. A good design generates a run ID for every execution and persists run-level metadata in a durable store (for example, DynamoDB, a metadata RDS database, or an append-only S3/Glue table) including timestamps and outcome.

To make lineage auditable, each run record should also capture (1) source version identifiers (such as an S3 object version ID, database snapshot ID, or CDC high-watermark/LSN) and (2) transformation version (such as a Git commit hash, Glue job script version, or container image digest). Keeping only the latest run metadata is unsafe because audits often require reconstructing historical outputs and explaining reruns/backfills even when data files are immutable.

  • Run ID + timestamps supports correlation across steps and provides an audit trail of execution timing and status.
  • Source versions are required to prove exactly which input snapshot/offset/object set was processed.
  • Transformation versioning is necessary to show which code/config produced the outputs, especially after logic changes.
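A minimal shape for such a run record, appended to durable history rather than overwritten (field names are illustrative; the store could be DynamoDB, RDS, or an append-only S3/Glue table):

```python
import json
import uuid
from datetime import datetime, timezone

def record_run(history: list, source_versions: dict,
               code_version: str, status: str) -> dict:
    """Append (never overwrite) an auditable run record capturing what
    ran, when, which inputs it read, and which code produced it."""
    run = {
        "run_id": str(uuid.uuid4()),
        "started_at": datetime.now(timezone.utc).isoformat(),
        "source_versions": source_versions,  # e.g. S3 version IDs, CDC LSN
        "code_version": code_version,        # e.g. Git commit or image digest
        "status": status,
    }
    history.append(run)                      # durable history, not latest-only
    return run

history = []
record_run(history, {"orders": "lsn-0001"}, "git:abc123", "SUCCEEDED")
record_run(history, {"orders": "lsn-0002"}, "git:abc123", "SUCCEEDED")
print(len(history))   # both runs remain reconstructable for auditors
```

Because every run stays in the history, an auditor can answer "which source offset and which code version produced this output?" for any past run, including reruns and backfills.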

Continue with full practice

Use the AWS DEA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Free review resource

Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026