Try 10 focused AWS DEA-C01 questions on Data Operations and Support, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | AWS DEA-C01 |
| Topic area | Data Operations and Support |
| Blueprint weight | 22% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Data Operations and Support for AWS DEA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 22% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Data Operations and Support
A data engineering team stores raw events in Amazon S3 and uses Amazon Athena notebooks (Spark) to explore a newly onboarded dataset. They must also build a production pipeline that creates curated Apache Iceberg tables every night with a 2-hour SLA, automatic retries, and run history for audits.
Which proposals should the team AVOID because they misuse interactive notebooks for production ETL? (Select THREE.)
Options:
A. Schedule the notebook to run hourly for incremental processing
B. Use a notebook as the primary nightly ETL implementation
C. Use a notebook to debug a single failed partition interactively
D. Prototype transformations on a small sample partition in a notebook
E. Run an 18-hour historical backfill only from a notebook
F. Port proven logic into an AWS Glue job and orchestrate it
Correct answers: A, B and E
Explanation: Athena notebooks with Spark are best for interactive exploration, prototyping, and targeted troubleshooting. For repeatable, SLA-bound data transformations (nightly loads, hourly incrementals, and large backfills), teams should use scheduled ETL jobs with orchestration, retries, monitoring, and auditable run history.
The core decision is whether the workload is exploratory/interactive or production ETL. Athena notebooks with Spark are optimized for ad hoc analysis: iterating on logic, sampling data, and quickly investigating issues. Production pipelines (nightly curated table builds, frequent incrementals, and long backfills) require operational guarantees such as consistent scheduling, automatic retries, alerting/metrics, version-controlled deployment, and auditable run history.
If the work must be repeatable and meet an SLA, implement it as a managed ETL job (for example AWS Glue or EMR Spark) and orchestrate with a scheduler/orchestrator (for example Step Functions, MWAA, or EventBridge). Keep notebooks for interactive development and investigation, then promote the finalized code into the scheduled ETL system.
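The promotion path above can be sketched as a Step Functions state definition. This is a minimal, illustrative Amazon States Language fragment, assuming a Glue job named `nightly-curated-iceberg-build` (the job name, intervals, and attempt counts are assumptions, not values from the scenario):

```json
{
  "Comment": "Sketch: nightly curated build as an orchestrated Glue job with retries",
  "StartAt": "RunCuratedBuild",
  "States": {
    "RunCuratedBuild": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "nightly-curated-iceberg-build"
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 60,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
```

Triggered nightly by an EventBridge schedule, this gives exactly what the notebook cannot: automatic retries and a queryable execution history for audits.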
Topic: Data Operations and Support
A company runs batch data pipelines using AWS Glue jobs orchestrated by AWS Step Functions. Source and curated data is stored in Amazon S3. Auditors require logs to be retained for 7 years and to provide sufficient context for investigations: who performed an action, what action occurred, when it occurred, and where it originated (account/Region/source IP where applicable).
Which THREE actions will best meet these audit logging requirements?
Options:
A. Enable S3 server access logging on the data lake buckets only
B. Enable CloudTrail data events for S3 object-level access on audited buckets
C. Enable VPC Flow Logs on pipeline subnets and retain in S3
D. Write structured Glue and Step Functions execution logs to CloudWatch Logs and archive to S3
E. Create an organization multi-Region CloudTrail that delivers to central S3
F. Rely on CloudWatch metrics and alarms for audit investigations
Correct answers: B, D and E
Explanation: For audit readiness, you need immutable, long-retained records of API activity and data access that include identity, action, time, and origin. AWS CloudTrail provides this context for both management events and (when enabled) S3 object-level data events. Detailed service execution logs should also be retained to reconstruct pipeline runs and enrich investigations beyond API calls.
The core requirement is end-to-end audit evidence with sufficient context (who/what/when/where) and long-term retention. CloudTrail is the primary audit log for AWS because it records the calling identity, API action, timestamp, Region, and source IP (when applicable). To capture object-level reads/writes in S3, you must enable CloudTrail data events for the specific buckets; management events alone won’t show each object GET/PUT.
Service execution logs (Glue driver/executor logs and Step Functions execution history/logging) add operational detail about what occurred during a run (run IDs, step outcomes, errors). Retaining these logs in CloudWatch Logs with an archival path to S3 supports the 7-year requirement and provides investigation detail that complements CloudTrail’s API-level view.
Network- and metric-only signals can support troubleshooting, but they don’t satisfy identity- and action-level audit requirements by themselves.
Topic: Data Operations and Support
A data pipeline runs an AWS Glue job every hour. The operations team uses Amazon CloudWatch Logs Insights to look for recurring failure causes by counting distinct jobRunId values that logged an error.
Exhibit: Logs Insights results (last 7 days)
| error | failed_runs |
|---|---|
| S3SlowDown | 14 |
| HIVE_PARTITION_SCHEMA_MISMATCH | 7 |
| Total | 21 |
Which issue is the most recurring cause of failures, and what percentage of failed runs does it represent (rounded to the nearest whole percent)?
Options:
A. S3SlowDown, 67%
B. HIVE_PARTITION_SCHEMA_MISMATCH, 67%
C. HIVE_PARTITION_SCHEMA_MISMATCH, 50%
D. S3SlowDown, 33%
Best answer: A
Explanation: The Logs Insights output shows 14 distinct failed Glue runs with S3SlowDown out of 21 total failed runs. Dividing 14 by 21 gives about 0.667, which is 66.7% and rounds to 67%. This identifies S3SlowDown as the most recurring high-level failure cause for the pipeline.
A common way to identify recurring pipeline issues is to aggregate logs by error type and count unique failing executions (here, distinct jobRunId). From the results, compute each error’s share of failed runs by dividing its failed_runs by the total failed runs.
Because S3SlowDown accounts for the largest proportion of failed runs, it is the most recurring failure cause in this time window.
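The arithmetic behind the answer can be reproduced directly from the exhibit counts:

```python
# Compute each error's share of failed runs from the aggregated log counts.
failed_runs = {
    "S3SlowDown": 14,
    "HIVE_PARTITION_SCHEMA_MISMATCH": 7,
}

total = sum(failed_runs.values())  # 21 failed runs in the 7-day window
shares = {err: round(100 * n / total) for err, n in failed_runs.items()}

top_error = max(failed_runs, key=failed_runs.get)
print(top_error, shares[top_error])  # S3SlowDown 67
```

14 / 21 ≈ 0.667, which rounds to 67%, matching option A.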
Topic: Data Operations and Support
A data team is automating a daily batch pipeline using Amazon EventBridge, AWS Step Functions, and AWS Glue. The team must pass run context (for example, process_date, partition, run_id) through multiple steps and preserve end-to-end traceability for audits.
Which statement is INCORRECT or unsafe for this design?
Options:
A. Persist execution ARN and inputs/outputs in DynamoDB for auditing.
B. Include process_date in EventBridge detail and propagate downstream.
C. Pass run_id in Step Functions input and Glue job args.
D. Rely on Glue job bookmarks to carry run context.
Best answer: D
Explanation: Use explicit context propagation (event payloads, Step Functions state input/output, and Glue job parameters) so each step receives the same run_id/date and can record it with outputs. For traceability, persist identifiers such as the Step Functions execution ARN and Glue job run IDs alongside inputs/outputs in a durable store. Glue bookmarks are for incremental reads/writes and do not replace explicit run context passing across services.
The core design pattern is to treat run context as explicit data that is carried by the orchestration layer and passed into each processing step. With Step Functions, you can keep process_date/run_id in the execution input and map those values into task parameters (for example, Glue job arguments), while also using the execution ARN/name as a stable trace identifier. For audits, store a run record (inputs, outputs, execution ARN, Glue job run IDs, timestamps) in a durable metadata store such as DynamoDB and/or embed the run_id in output paths/partitions.
Glue job bookmarks are useful for incremental processing state within a Glue job, but they are not designed to propagate run identifiers/dates across multiple services or to provide a complete, queryable run ledger.
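The explicit-propagation pattern can be sketched as a small helper that maps Step Functions execution input into Glue job arguments. The helper itself is illustrative (not an AWS API); the field names follow the scenario:

```python
# Sketch: explicit run-context propagation from orchestration into a Glue task.
def glue_args_from_execution(execution_input: dict, execution_arn: str) -> dict:
    """Map Step Functions execution input into Glue job arguments."""
    return {
        "--process_date": execution_input["process_date"],
        "--partition": execution_input["partition"],
        "--run_id": execution_input["run_id"],
        # Carry the execution ARN so the job can stamp its outputs with it.
        "--execution_arn": execution_arn,
    }

# In the real pipeline these arguments would be passed to
# glue.start_job_run(JobName=..., Arguments=args), and the same record
# persisted to DynamoDB as the auditable run ledger.
args = glue_args_from_execution(
    {"process_date": "2024-06-01", "partition": "dt=2024-06-01", "run_id": "r-0001"},
    "arn:aws:states:us-east-1:123456789012:execution:daily:r-0001",
)
print(args["--run_id"])  # r-0001
```

Because every step receives the same `run_id` and `process_date`, an auditor can join event payloads, state machine history, and Glue run logs on those identifiers.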
Topic: Data Operations and Support
A data lake pipeline ingests streaming click events through Amazon Kinesis Data Firehose into Amazon S3 (raw). An AWS Glue job transforms the data into S3 (curated) partitioned by dt and hour, and Amazon Athena is used by analysts.
The team has a 15-minute freshness SLA for curated data and wants a high-level pipeline health dashboard that uses metrics, logs, and alerts to detect regressions (late runs, failures, abnormal drops/spikes in processed records).
Which TWO actions should the data engineer take?
Options:
A. Enable SSE-KMS on the curated S3 bucket to prevent regressions
B. Use the Glue crawler last run time as the dataset freshness metric
C. Publish custom CloudWatch metrics for freshness lag and row counts
D. Use Lake Formation permission audits as the primary health signal
E. Configure Athena partition projection to reduce missing-partition issues
F. Create CloudWatch Logs metric filters on Glue job logs for errors
Correct answers: C and F
Explanation: A health dashboard needs time-series signals that map to SLAs and regressions. Emitting CloudWatch custom metrics for freshness lag and processed record counts provides direct, queryable KPIs, while turning log errors into CloudWatch metrics enables alerting on failures without manual log review. Together, these provide high-level visibility and actionable alarms.
Pipeline health dashboards are most effective when they use a small set of operational KPIs as metrics and then back them with logs for root cause. For the freshness SLA, you need a metric that measures end-to-end lag (for example, “now” minus the newest dt/hour successfully written to curated). For regressions, you need run-level volume and error signals you can graph and alarm on.
A practical pattern is:
- Emit a custom CloudWatch metric for freshness lag ("now" minus the newest curated dt/hour) and alarm as it approaches the 15-minute SLA.
- Emit processed record counts per run as a custom metric and alarm on abnormal drops or spikes.
- Add CloudWatch Logs metric filters on the Glue job logs so errors become an alarmable metric rather than something found by manual log review.
- Assemble these metrics and alarms on a single CloudWatch dashboard for high-level pipeline health.
Security and catalog features are important, but they don’t directly provide SLA/regression observability signals.
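The freshness-lag KPI can be sketched as a small function that turns the newest curated partition into a CloudWatch metric payload. The metric and namespace names are assumptions for illustration; the actual `put_metric_data` call is shown in a comment rather than executed:

```python
from datetime import datetime, timezone

# Sketch: compute freshness lag from the newest curated partition and shape it
# as a CloudWatch custom metric (metric/namespace names are illustrative).
def freshness_metric(newest_partition: str, now: datetime) -> dict:
    """newest_partition like 'dt=2024-06-01/hour=13' -> lag-in-minutes metric."""
    dt_part, hour_part = newest_partition.split("/")
    newest = datetime.strptime(
        f"{dt_part.split('=')[1]} {hour_part.split('=')[1]}", "%Y-%m-%d %H"
    ).replace(tzinfo=timezone.utc)
    lag_minutes = (now - newest).total_seconds() / 60
    return {
        "MetricName": "CuratedFreshnessLagMinutes",
        "Value": lag_minutes,
        "Unit": "None",
    }

metric = freshness_metric(
    "dt=2024-06-01/hour=13",
    datetime(2024, 6, 1, 13, 12, tzinfo=timezone.utc),
)
# The payload would be published with
# cloudwatch.put_metric_data(Namespace="DataPipeline", MetricData=[metric])
# and alarmed against the 15-minute SLA.
print(metric["Value"])  # 12.0
```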
Topic: Data Operations and Support
A data engineering team runs AWS Glue jobs and AWS Lambda functions that ingest data to an Amazon S3 data lake. Over the last month, the pipeline has had intermittent failures, and the team needs to quickly identify the most recurring failure patterns from logs.
Which action should the team NOT take while implementing a log-analysis approach using services such as Amazon Athena, Amazon OpenSearch Service, or CloudWatch Logs Insights?
Options:
A. Make the S3 bucket that stores exported logs publicly readable for easier sharing
B. Export CloudWatch Logs to S3 and query them with Athena using partitions by date
C. Use CloudWatch Logs Insights to aggregate errors by message and count occurrences
D. Index structured error fields in OpenSearch and use aggregations to find top errors
Best answer: A
Explanation: To identify recurring pipeline issues, the team should aggregate and query logs with managed analytics tools while keeping logs protected. Making log storage publicly accessible is a high-risk anti-pattern because logs commonly contain sensitive operational details. The correct approach is to analyze logs with least-privilege access and appropriate controls.
The core task is high-level recurring issue detection, which is well served by log aggregation and grouping in CloudWatch Logs Insights, Athena (after exporting logs to S3), or OpenSearch (after indexing structured fields). These approaches let the team quantify repeated failures (for example, counts by error type, job name, or stage) without changing the pipeline behavior.
Making a log bucket publicly readable is an obvious data platform anti-pattern: operational logs can include internal resource names, account identifiers, and sometimes data fragments, so exposing them breaks the least-privilege principle and undermines governance. The key takeaway is to improve observability while maintaining strong access control over log data.
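The guardrail that prevents option A is S3 Block Public Access on the log bucket. A minimal sketch, assuming an illustrative bucket name; the boto3 call is shown in a comment rather than made:

```python
# Sketch: enforce Block Public Access on a log bucket (bucket name is
# hypothetical; all four settings should be enabled for log storage).
public_access_block = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

# boto3.client("s3").put_public_access_block(
#     Bucket="pipeline-logs-archive",
#     PublicAccessBlockConfiguration=public_access_block,
# )
print(all(public_access_block.values()))  # True
```

Sharing log analysis results is then done through the query layer (Athena, Logs Insights, OpenSearch dashboards) with least-privilege IAM access, never by loosening the bucket itself.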
Topic: Data Operations and Support
A data engineering team is automating batch processing for a data lake on Amazon S3. They use AWS Glue, Amazon EMR, and Amazon Redshift.
Which statement is INCORRECT about using these service features to process data effectively?
Options:
A. Amazon EMR managed scaling can add or remove core/task nodes.
B. AWS Glue job bookmarks help process only new data in reruns.
C. AWS Glue crawlers can convert CSV files to Parquet automatically.
D. Redshift Spectrum can query data in S3 without loading it.
Best answer: C
Explanation: AWS Glue crawlers are for metadata discovery and catalog updates, not for transforming file formats. Converting CSV to Parquet requires a processing job (for example, AWS Glue ETL or EMR). The other statements describe commonly used automation features for incremental processing, cluster right-sizing, and querying data in S3.
Choose service features based on what part of the pipeline you are automating: cataloging, transforming, scaling compute, or serving queries. AWS Glue crawlers infer schemas, detect partitions, and update the AWS Glue Data Catalog so downstream jobs and query engines can discover new datasets, but crawlers do not modify data in Amazon S3. File format conversions and other transformations are performed by compute (for example, AWS Glue ETL Spark jobs or EMR Spark).
For ongoing operations, Glue job bookmarks help incremental ETL by tracking what was already processed, EMR managed scaling adjusts cluster capacity automatically to match workload demand, and Redshift Spectrum lets you run SQL over data stored in S3 without loading it into Redshift tables.
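What a job bookmark accomplishes can be simulated in plain Python: a rerun processes only objects not seen in previous runs. This is a conceptual sketch of the behavior, not the Glue implementation:

```python
# Sketch: bookmark-style incremental processing — reruns skip already-seen keys.
def run_incremental(all_keys: list[str], bookmark: set[str]) -> list[str]:
    """Process only keys newer than the bookmark; return what was processed."""
    new_keys = [k for k in all_keys if k not in bookmark]
    bookmark.update(new_keys)  # Glue persists this state between job runs
    return new_keys

bookmark: set[str] = set()
first = run_incremental(["s3://raw/a.csv", "s3://raw/b.csv"], bookmark)
second = run_incremental(
    ["s3://raw/a.csv", "s3://raw/b.csv", "s3://raw/c.csv"], bookmark
)
print(second)  # ['s3://raw/c.csv']
```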
Topic: Data Operations and Support
A data engineering team defines a freshness SLA for a curated Amazon S3 table that is queried by Amazon Athena. They want an automated timeliness check and alert threshold.
Which option best defines a timeliness (freshness) check and an appropriate high-level alerting threshold?
Options:
A. Verify the table schema matches the AWS Glue Data Catalog; alert when a new column is detected.
B. Verify Athena partition pruning occurs; alert when scanned bytes increase beyond a budget.
C. Verify each partition has the expected row count; alert when the count deviates from the historical average.
D. Verify the newest partition’s event_time is within the SLA; alert when age exceeds the SLA (optionally after N consecutive misses).
Best answer: D
Explanation: A timeliness (freshness) check validates that data is recent enough for downstream use by comparing the latest available data timestamp (often the newest partition) to a defined freshness SLA. The alert threshold is therefore driven by “data age” exceeding the SLA, commonly with a small tolerance such as requiring consecutive failures to avoid paging on transient delays.
Timeliness (also called freshness) is a data quality dimension that measures whether a dataset is updated within an agreed freshness SLA (for example, “data must be no more than 30 minutes old”). A practical freshness check computes the dataset’s “age” by finding the most recent available data timestamp (often the newest partition or max event_time) and comparing it to the current time.
Alerting thresholds are typically expressed as:
- Alert when data age exceeds the freshness SLA (for example, age greater than 30 minutes).
- Optionally, require N consecutive SLA misses before paging, so transient delays do not generate noise.
This differs from completeness checks (missing/extra records or partitions) and from operational performance signals like partition pruning or scan cost.
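The age-versus-SLA check with a consecutive-miss tolerance can be sketched as follows (the 30-minute SLA and N=3 are illustrative values, not from the question):

```python
# Sketch: freshness alert with an N-consecutive-misses tolerance.
SLA_MINUTES = 30
CONSECUTIVE_MISSES_TO_ALERT = 3

def should_alert(age_minutes: float, miss_streak: int) -> tuple[bool, int]:
    """Return (alert?, updated miss streak) for one evaluation period."""
    if age_minutes > SLA_MINUTES:
        miss_streak += 1
    else:
        miss_streak = 0  # a fresh period resets the streak
    return miss_streak >= CONSECUTIVE_MISSES_TO_ALERT, miss_streak

streak = 0
alerts = []
for age in [10, 35, 40, 45, 12]:  # minutes of data age per evaluation period
    fired, streak = should_alert(age, streak)
    alerts.append(fired)
print(alerts)  # [False, False, False, True, False]
```

Only the third consecutive miss pages anyone; a single late run is tolerated, which is exactly the "optionally after N consecutive misses" clause in option D.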
Topic: Data Operations and Support
You are defining audit logging for AWS data pipelines. Auditors require logs that answer who performed an action, what was done, when it occurred, and where it originated, and the logs must be retained for more than 90 days.
Select TWO statements that are true.
Options:
A. AWS CloudTrail event records include identity, API action, time, Region, and source IP.
B. To retain CloudTrail activity beyond 90 days, you must configure a CloudTrail trail to deliver logs to a destination such as Amazon S3.
C. Amazon CloudWatch Logs automatically captures all AWS API calls across services without CloudTrail.
D. Amazon VPC Flow Logs identify the IAM user or role that accessed an Amazon S3 object.
E. AWS Glue job logs in Amazon CloudWatch Logs always contain the IAM principal that started the job run.
F. Amazon S3 server access logs record full request and response payloads for object uploads and downloads.
Correct answers: A and B
Explanation: AWS CloudTrail is the primary service for recording AWS API activity with rich audit context (identity, action, time, and network/Region details). Because CloudTrail Event history is not intended for long-term retention, you must configure a trail that delivers events to durable storage such as Amazon S3 to meet audit retention needs.
For audit requirements like who/what/when/where, use AWS CloudTrail because it records AWS API calls with fields such as the calling principal (userIdentity), the API operation (eventName), the timestamp (eventTime), and origin details like sourceIPAddress and awsRegion.
CloudTrail “Event history” in the console is only a short-term view. For audit-grade retention and centralized evidence collection, configure a CloudTrail trail to deliver events to an immutable/durable destination (commonly Amazon S3, optionally also CloudWatch Logs for near-real-time monitoring). This ensures you can retain and retrieve the required context for the full audit window.
Service-specific logs (application logs, network flow logs) can complement CloudTrail, but they do not replace it for authoritative API-level accountability.
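Mapping a CloudTrail record onto the who/what/when/where questions is direct, because the fields named above appear in every record. The record below is a trimmed, illustrative example, not real activity:

```python
import json

# Sketch: extract the audit-relevant fields from a CloudTrail record.
record = json.loads("""{
  "eventTime": "2024-06-01T02:15:03Z",
  "eventName": "GetObject",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {
    "type": "AssumedRole",
    "arn": "arn:aws:sts::123456789012:assumed-role/etl/glue-run"
  }
}""")

audit_view = {
    "who": record["userIdentity"]["arn"],
    "what": record["eventName"],
    "when": record["eventTime"],
    "where": f'{record["awsRegion"]} / {record["sourceIPAddress"]}',
}
print(audit_view["what"])  # GetObject
```

With a trail delivering such records to S3, an Athena query over these same fields answers an auditor's who/what/when/where for the entire retention window.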
Topic: Data Operations and Support
You are optimizing BI dashboards (for example, Amazon QuickSight querying Amazon Redshift) to improve query latency and reduce repeated computation. Which TWO statements are INCORRECT or unsafe guidance for data modeling and join strategy? (Select TWO.)
Options:
A. Design joins to be one-to-many from dimensions to the fact to avoid row multiplication.
B. Precompute common rollups using summary tables or materialized views.
C. Fully normalize into 3NF and rely on many runtime joins for best dashboard performance.
D. Apply filters as early as possible to reduce data before joins and aggregations.
E. If joins create duplicates, add DISTINCT in dashboard queries rather than changing the model.
F. Use a star schema with conformed dimensions and a central fact table.
Correct answers: C and E
Explanation: Dashboards perform best when the model minimizes expensive runtime joins and avoids row multiplication that forces repeated aggregation. Fully normalizing for interactive analytics and relying on DISTINCT to “fix” duplicates both increase work per query and can hide incorrect results. Favor star-like models, controlled join cardinality, and precomputed aggregates to reduce repeated computation.
The core idea is to reduce work the dashboard must redo on every refresh: fewer complex joins, predictable join cardinality, and reusable pre-aggregations.
Two unsafe statements are:
- Fully normalizing into 3NF and relying on many runtime joins: normalization suits transactional writes, but interactive dashboards pay the full join cost on every refresh.
- Adding DISTINCT to compensate for row multiplication: it is expensive and often indicates an incorrect many-to-many relationship that should be remodeled (for example, by correcting grain, introducing a proper bridge table, or aggregating before joining).

In contrast, star schemas with conformed dimensions, one-to-many dimension-to-fact joins, early filtering, and precomputed rollups/materialized views are common strategies to improve dashboard performance and reduce duplicated computation.
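The row-multiplication problem behind option E can be demonstrated with a tiny in-memory join. The data is invented for illustration; the point is that the inflated aggregate is a grain problem, which DISTINCT masks rather than fixes:

```python
# Sketch: a dimension at the wrong grain multiplies fact rows on join,
# inflating aggregates — the model's grain, not DISTINCT, is the real fix.
fact_sales = [{"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 50}]

# Dimension accidentally has two rows per order_id (wrong grain).
dim_bad_grain = [
    {"order_id": 1, "promo": "A"}, {"order_id": 1, "promo": "B"},
    {"order_id": 2, "promo": "A"}, {"order_id": 2, "promo": "B"},
]

joined = [
    {**f, **d}
    for f in fact_sales
    for d in dim_bad_grain
    if f["order_id"] == d["order_id"]
]
inflated_total = sum(r["amount"] for r in joined)   # 300: each fact row doubled
true_total = sum(f["amount"] for f in fact_sales)   # 150: the correct answer
print(inflated_total, true_total)  # 300 150
```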
Use the AWS DEA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try AWS DEA-C01 on Web: View the AWS DEA-C01 Practice Test.
Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.