Try 65 free AWS DEA-C01 questions across the exam domains, with explanations, then continue with full IT Mastery practice.
This free full-length AWS DEA-C01 practice exam includes 65 original IT Mastery questions across the exam domains.
These questions are for self-assessment. They are not official exam questions and do not imply affiliation with the exam sponsor.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Need concept review first? Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks and full IT Mastery practice.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Domain | Weight |
|---|---|
| Data Ingestion and Transformation | 34% |
| Data Store Management | 26% |
| Data Operations and Support | 22% |
| Data Security and Governance | 18% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.
Topic: Data Operations and Support
A data team is defining operational SLIs/SLOs and monitoring signals for an hourly batch pipeline that lands curated data in Amazon S3 and is queried through Amazon Athena.
Which TWO statements are FALSE/unsafe for this purpose?
Options:
A. Track completeness via expected partitions/control totals vs actual.
B. Measure freshness by now - max(event_time) in curated data.
C. Use ETL job duration alone as a freshness SLI.
D. Define latency SLI as event time to Athena queryability.
E. Set SLOs (e.g., 99%) and alarm on sustained breaches.
F. Treat a successful job run as sufficient completeness evidence.
Correct answers: C and F
Explanation: Freshness and completeness SLIs must be tied to the data produced (timestamps, partitions, control totals), not just whether code ran quickly or exited successfully. Job-duration and job-success signals are useful operational metrics, but they do not, by themselves, prove data is current or complete. Good SLOs set explicit targets on these SLIs and drive alerts from measured breaches.
An SLI measures an observable property of the data pipeline output, while an SLO sets a target for that SLI (for example, a percentile target over time). For data pipelines, freshness and latency are usually anchored to event-time or partition-time, and completeness is anchored to expected volumes or expected partitions.
Measuring now - max(event_time) directly reflects how current the data is.
Key takeaway: prioritize data-derived signals (timestamps, partitions, totals) over job-only signals for SLIs.
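For illustration only (not part of the question), here is a minimal sketch that publishes this lag as a CloudWatch metric so an SLO alarm can watch it; the namespace, metric name, and the source of max_event_time are assumptions:

```python
import datetime as dt

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_freshness_lag(max_event_time: dt.datetime) -> None:
    """Publish lag = now - max(event_time) as a data-derived freshness SLI."""
    lag_seconds = (dt.datetime.now(dt.timezone.utc) - max_event_time).total_seconds()
    cloudwatch.put_metric_data(
        Namespace="DataPipeline/Curated",           # hypothetical namespace
        MetricData=[{
            "MetricName": "FreshnessLagSeconds",    # hypothetical metric name
            "Value": lag_seconds,
            "Unit": "Seconds",
        }],
    )

# max_event_time would come from the curated data itself, for example the
# result of SELECT max(event_time) FROM curated_events run in Athena.
```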
Topic: Data Store Management
A company ingests clickstream files to Amazon S3, runs an AWS Glue ETL job to write Parquet to s3://datalake/curated/events/, and queries the curated data with Amazon Athena using the AWS Glue Data Catalog.
The curated prefix is partitioned by dt=YYYY-MM-DD/hour=HH and keeps 24 months of history. A Glue crawler is scheduled hourly on s3://datalake/curated/events/ to discover new partitions and keep the catalog current. The crawler now takes ~50 minutes to run, often overlaps the next scheduled run, and the newest hour is sometimes missing from the catalog when analysts query.
Which change should the data engineer make to improve cost and reliability without changing the ingestion/ETL logic or the S3 layout?
Options:
A. Set the crawler recrawl behavior to crawl new folders only
B. Increase the crawler capacity to reduce crawl duration
C. Disable schema updates in the crawler output settings
D. Run the crawler once per day to avoid overlapping runs
Best answer: A
Explanation: The crawler is slow because it repeatedly scans a very large historical prefix to find a small number of new hourly partitions. Configuring incremental crawling (new folders only) reduces the amount of S3 data the crawler needs to examine each run, lowering cost and making it far more likely the crawler finishes before the next schedule. The tradeoff is that changes in existing folders may require a periodic full recrawl.
AWS Glue crawlers populate the Data Catalog by listing and sampling data in the configured data store. When a crawler is pointed at a large, long-retained, partitioned S3 prefix, a full recrawl can become expensive and may not complete within the required SLA.
Using the crawler’s incremental recrawl setting (for example, crawling only new folders) aligns the crawl work with the operational goal in this scenario: discover newly arrived partition folders each hour and register them as partitions/tables in the Data Catalog. This typically reduces crawl duration and DPU-hours, and it improves reliability by preventing schedule overlap.
Key takeaway: use incremental crawler behavior for fast partition discovery, and perform an occasional full recrawl when you need to re-evaluate older data for schema changes.
Topic: Data Store Management
A data engineer must catalog and run AWS Glue ETL jobs against an Amazon RDS for PostgreSQL database that is reachable only from private subnets in a VPC. The engineer will use an AWS Glue connection.
Which statement is INCORRECT about securely designing this access?
Options:
A. If the Glue job runs in private subnets, it needs a network path (for example, NAT or VPC endpoints) to reach required AWS services.
B. The database must be placed in a public subnet with a public IP address for Glue to connect using a Glue connection.
C. Database credentials can be referenced from AWS Secrets Manager in the Glue connection to avoid hardcoding passwords.
D. A Glue connection can create ENIs in selected private subnets and use security groups to reach the database.
Best answer: B
Explanation: AWS Glue connections are designed to reach private data stores by creating elastic network interfaces (ENIs) in your VPC subnets and applying security groups. You secure access with network controls (subnets, routes, security groups) and by using managed secrets for credentials. Making the database public is not required and is generally less secure.
The core concept is that an AWS Glue connection provides VPC networking context so Glue can access private resources without exposing them publicly. When you configure a connection for a JDBC data store, Glue attaches ENIs in the specified subnets and uses the specified security groups; as long as those subnets have a route to the database and security group rules allow it, the database can remain private.
For credentials, prefer referencing AWS Secrets Manager from the connection so secrets are not embedded in job scripts. Also remember that Glue jobs running in private subnets still need connectivity to AWS control-plane and data-plane endpoints they use (commonly via NAT or VPC endpoints), otherwise jobs may fail even if the database is reachable. The key takeaway is to keep the data source private and control access with VPC networking plus least-privilege secret and IAM access.
Topic: Data Store Management
A company is implementing a business data catalog in Amazon SageMaker Catalog.
Requirements: separate Finance, Marketing, and Operations assets by business unit with clear ownership, restrict PII discovery and access to the Data Governance group, and certify curated datasets only after approval.
Which TWO actions should you AVOID? (Select TWO.)
Options:
A. Create a domain per business unit and assign domain owners and stewards using IAM Identity Center groups
B. Register raw landing-zone datasets that include unmasked PII as discoverable assets to speed up self-service discovery
C. Create one shared domain for all teams and rely only on naming conventions to separate Finance, Marketing, and Operations assets
D. Publish curated S3/Glue-based datasets as assets with business metadata (owner, glossary terms) and mark them certified only after approval
E. Configure PII assets so discovery and access are limited to the Data Governance group, while curated non-PII assets remain broadly discoverable
F. Create projects within each domain and restrict asset publishing permissions to project members
Correct answers: B and C
Explanation: SageMaker Catalog is organized around domains, projects, and assets, and it is commonly used to enable governed discovery. Actions that collapse required domain boundaries or make PII broadly discoverable directly violate the stated separation and governance constraints. The safe choices preserve domain/project isolation and apply least-privilege discovery and certification practices.
In SageMaker Catalog, a domain is the top-level business boundary, projects are the collaboration and permission boundary for publishing/managing assets, and assets represent discoverable data products (with metadata such as owner, glossary terms, and certification state). Given the requirements, you should model each business unit as its own domain, then create projects within each domain to control who can publish and manage assets. For governance, treat PII as a restricted class of assets: limit both discovery and access to the Data Governance group. Also, only promote curated, approved datasets as certified assets; raw landing-zone data (especially with PII) should not be broadly discoverable.
Key takeaway: preserve domain boundaries and restrict PII discovery/access while using certification to signal trusted assets.
Topic: Data Security and Governance
A company runs hundreds of Amazon EMR clusters that generate about 5 TB/day of application and step logs. Security requires immutable, centralized log retention for 7 years, and auditors must be able to query the last 90 days quickly with least-privilege access. The team wants a cost-effective, scalable design.
Which approach should the company AVOID for preparing these logs for audit?
Options:
A. Deliver EMR logs to S3 with SSE-KMS encryption
B. Store logs in a public-read S3 bucket for auditors
C. Apply S3 Object Lock for WORM log retention
D. Partition logs by date in S3 for Athena queries
Best answer: B
Explanation: Audit logs should be centrally stored with strong access controls because they often contain sensitive operational details. For large volumes, S3-based storage with encryption and immutable retention scales well and is cost-effective, while query performance can be maintained with partitioning and serverless analytics. Making logs broadly accessible undermines the security and governance goal even if it seems operationally convenient.
The core principle for audit logging is integrity and confidentiality: logs must be tamper-resistant and accessible only to approved identities under least privilege. At multi-terabyte-per-day scale, Amazon S3 is a common durable, low-cost log store; SSE-KMS protects data at rest, and S3 Object Lock can enforce WORM retention to support audit requirements.
To balance cost and performance for investigations, partition the logs by date (for example, dt=YYYY-MM-DD) and query the recent 90-day window with Athena.
Granting public read access to the log bucket is an audit anti-pattern because it creates uncontrolled disclosure risk and weakens governance regardless of downstream analytics choices.
Topic: Data Ingestion and Transformation
A data engineer is building a pull-based ingestion pipeline from a third-party REST API into an Amazon S3 data lake (raw zone). The API returns up to 1,000 records per call and includes a next_cursor value when more pages are available. The API enforces strict rate limits and sometimes returns HTTP 429 with a Retry-After header.
The pipeline must be restartable after failures and must avoid missing or reprocessing large ranges of data.
Which TWO actions should the data engineer take? (Select TWO.)
Options:
A. Skip checkpointing and deduplicate nightly with Amazon Athena queries on S3
B. Increase the AWS Lambda function memory size to prevent HTTP 429 throttling
C. Run an AWS Glue crawler after each page and use new partitions as the checkpoint
D. Use Amazon Kinesis Data Firehose to pull the REST API and land data in S3
E. Persist the next_cursor checkpoint after each successful page in DynamoDB
F. Retry 429/5xx responses using exponential backoff with jitter and honor Retry-After
Correct answers: E and F
Explanation: Use cursor-based pagination with a durable checkpoint so the pipeline can resume exactly where it left off after a failure. Handle throttling by honoring Retry-After and applying exponential backoff with jitter for retryable errors. Together, these patterns maximize completeness and correctness while staying within API limits.
For pull-based API ingestion, you typically need (1) a pagination strategy and (2) resilience to throttling and transient failures. With cursor pagination, the most reliable checkpoint is the last successfully processed cursor (or page token) written to a durable store (for example, DynamoDB) after each successful write to S3; on restart, the job resumes from that cursor.
When the source enforces rate limits, handle HTTP 429 and transient 5xx errors with controlled retries:
- Honor the Retry-After header when present.
- Apply exponential backoff with jitter and cap the total number of retries.
This combination prevents missed pages and minimizes duplicates without relying on expensive downstream cleanup.
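A minimal sketch of this pattern, assuming a hypothetical API endpoint, a DynamoDB checkpoint table named api_checkpoints, and an illustrative S3 landing location:

```python
import json
import random
import time

import boto3
import requests

s3 = boto3.client("s3")
checkpoints = boto3.resource("dynamodb").Table("api_checkpoints")  # hypothetical table

def write_page_to_s3(records, bucket="datalake", prefix="raw/api/"):  # hypothetical location
    key = f"{prefix}{int(time.time() * 1000)}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(records).encode())

def fetch_with_backoff(url, params, max_retries=8):
    """GET with retries: honor Retry-After on 429, otherwise back off with jitter."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.ok:
            return resp.json()
        if resp.status_code == 429 and "Retry-After" in resp.headers:
            time.sleep(float(resp.headers["Retry-After"]))
        elif resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(min(2 ** attempt, 60) + random.uniform(0, 1))
        else:
            resp.raise_for_status()
    raise RuntimeError("retries exhausted")

def ingest(api_url, pipeline_id="partner-orders"):
    item = checkpoints.get_item(Key={"pipeline_id": pipeline_id}).get("Item", {})
    cursor = item.get("next_cursor")
    while True:
        page = fetch_with_backoff(api_url, {"cursor": cursor} if cursor else {})
        write_page_to_s3(page["records"])
        cursor = page.get("next_cursor")
        if not cursor:
            break
        # Advance the checkpoint only after the page is durably written.
        checkpoints.put_item(Item={"pipeline_id": pipeline_id, "next_cursor": cursor})
```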
Honoring Retry-After and backing off reduces throttling and safely retries transient failures.
Topic: Data Store Management
A company stores 200 TB of clickstream events in Amazon S3 and queries the data using Amazon Athena with tables in the AWS Glue Data Catalog. Files are currently uncompressed JSON under a single prefix (no partitions). Most queries filter on event_date (range) and app_id (equality) and select only a few columns. The team must reduce Athena scan cost and improve query performance.
Which THREE actions follow best practices for indexing, partitioning, and compression in this scenario? (Select THREE.)
Options:
A. Keep JSON format and avoid compression to reduce ETL complexity
B. Partition the S3 layout by event_date and app_id
C. Create an Athena index on event_date using CREATE INDEX
D. Convert the dataset to Parquet with Snappy compression
E. Partition the S3 layout by user_id
F. Create a Glue Data Catalog partition index on event_date and app_id
Correct answers: B, D and F
Explanation: For Athena, the biggest performance and cost gains come from reducing data scanned and improving partition pruning. Storing data in a columnar format (such as Parquet) with compression cuts I/O dramatically, and partitioning on the most common filter columns limits reads to relevant prefixes. At large partition counts, a Glue partition index can also reduce metadata lookup overhead.
Athena is a serverless query engine where cost and latency are strongly influenced by how many bytes are scanned and how efficiently it can skip irrelevant data. Converting row-based JSON into a columnar format (Parquet) and applying compression reduces storage and the amount of data Athena must read, while also enabling better predicate pushdown. Partitioning the dataset by the columns most frequently used in WHERE clauses (here, event_date and app_id) allows partition pruning so Athena reads only the necessary partitions. When a table accumulates many partitions, creating a Glue Data Catalog partition index can speed up partition metadata retrieval and reduce planning time. The key takeaway is to optimize around query access patterns: partition on low-to-moderate cardinality filter keys and use columnar compressed files.
- Partitioning by event_date and app_id: enables effective partition pruning for common filters.
- Glue Data Catalog partition index on event_date and app_id: improves partition metadata lookup at scale.
- Partitioning by user_id: high cardinality creates too many small partitions and excess overhead.
- CREATE INDEX: Athena doesn't support user-managed indexes like relational databases.
Topic: Data Ingestion and Transformation
You are optimizing an AWS Glue Spark job that ingests data from Amazon S3, transforms it, and writes curated data back to S3 for Athena queries. Select TWO statements that correctly describe high-level ways to reduce unnecessary I/O, minimize serialization overhead, and improve parallelism.
Options:
A. Apply filters and select only needed columns as early as possible
B. Write many small JSON files to maximize parallelism and speed Athena queries
C. Convert between DynamicFrames and DataFrames repeatedly to speed up processing
D. Use Python UDFs instead of built-in Spark SQL functions for best performance
E. Use coalesce(1) before writing to create one file for faster writes
F. Prefer Parquet/ORC over JSON/CSV to reduce I/O and serialization
Correct answers: A and F
Explanation: Using efficient storage formats and reducing the amount of data processed are the biggest high-level wins in Glue Spark. Columnar formats (with compression) cut read/write volume and parsing overhead, and early filtering/projection reduces downstream shuffles and materialization. Together these changes typically improve both runtime and cost without changing business logic.
The core optimization idea is to move less data and do less (de)serialization while letting Spark parallelize work across partitions. For S3-based pipelines, file format and when you reduce the dataset matter more than micro-optimizations.
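A rough sketch of these two wins (paths and column names are hypothetical), pushing filtering and projection before heavier work and writing columnar output:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-events").getOrCreate()

raw = spark.read.json("s3://datalake/raw/events/")            # hypothetical input path

curated = (
    raw
    # Project and filter as early as possible to cut shuffle and serialization volume.
    .select("event_id", "event_date", "app_id", "user_id", "event_ts")
    .filter(F.col("event_date") >= "2026-01-01")
    # Prefer built-in functions over Python UDFs.
    .withColumn("event_hour", F.hour("event_ts"))
)

# Write compressed, columnar output; keep multiple files per partition.
(curated.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://datalake/curated/events/"))                # hypothetical output path
```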
By contrast, adding extra conversions, forcing single-file output, or relying on Python UDFs usually increases overhead and reduces parallelism.
coalesce(1) collapses partitions, creating a bottleneck at write time; keep multiple output files and manage size via partitioning/compaction.
Topic: Data Operations and Support
A company uses Amazon Athena to query an Amazon S3 clickstream table in Parquet. The table is partitioned by dt (UTC date). Each dt partition scans about 64 GB.
A dashboard must report clicks for February 1, 2026 in America/Los_Angeles (UTC-8). The current query filters dt = '2026-02-01' and event_ts between 2026-02-01 00:00:00 and 2026-02-01 23:59:59, and it undercounts.
The table retains 90 days of partitions.
Which change will fix the time zone issue and approximately how much data will Athena scan (round to the nearest GB)?
Options:
A. Filter dt in two days and use a UTC window (~128 GB)
B. Filter on date(at_timezone(event_ts)) only (~5,760 GB)
C. Keep dt='2026-02-01'; shift event_ts to UTC (~64 GB)
D. Change Athena session time zone; keep dt='2026-02-01' (~64 GB)
Best answer: A
Explanation: A local day in America/Los_Angeles (UTC-8) crosses a UTC date boundary. To return the correct February 1 local-day results while preserving partition pruning, the query must read both relevant UTC dt partitions and filter event_ts using the corresponding UTC time range. With two daily partitions at 64 GB each, the scan is about 128 GB.
This is a time zone boundary problem combined with partition pruning. The business day is defined in America/Los_Angeles, but the table is partitioned by UTC date (dt).
For February 1, 2026 in UTC-8, the corresponding UTC interval is 2026-02-01 08:00:00 (inclusive) to 2026-02-02 08:00:00 (exclusive).
That interval spans two UTC dt partitions (2026-02-01 and 2026-02-02). If each partition scans 64 GB, scanning both is:
2 partitions × 64 GB/partition = 128 GB
Filtering only one dt value will undercount, while filtering only on a transformed timestamp typically prevents partition pruning and scans many more partitions.
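A small illustrative sketch of deriving the UTC window (and the dt values to keep for pruning) for a local day, using only the Python standard library:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

local_tz = ZoneInfo("America/Los_Angeles")
local_day = datetime(2026, 2, 1, tzinfo=local_tz)

# Half-open UTC window covering the local business day.
utc_start = local_day.astimezone(timezone.utc)
utc_end = (local_day + timedelta(days=1)).astimezone(timezone.utc)

# dt partitions to include in the WHERE clause so pruning still works.
dt_values = sorted({utc_start.date().isoformat(), utc_end.date().isoformat()})
print(utc_start, utc_end, dt_values)
# 2026-02-01 08:00:00+00:00 2026-02-02 08:00:00+00:00 ['2026-02-01', '2026-02-02']
```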
- The local day also covers events stored under dt='2026-02-02' UTC, so both partitions must be read.
- Keeping only dt='2026-02-01' means dt pruning excludes needed partitions and undercounts.
- Filtering on a converted event_ts only is likely to bypass dt partition pruning and can scan all 90 partitions (about 90 × 64 GB).
Topic: Data Ingestion and Transformation
You are selecting AWS services for near-real-time ingestion. Which TWO statements are false or unsafe?
(Select TWO.)
Options:
A. DynamoDB Streams is designed for EC2 log ingestion, unlimited size
B. Kinesis Data Streams can ingest Oracle CDC directly, no CDC tool
C. Amazon MSK fits Kafka-compatible producers/consumers needing Kafka features
D. AWS DMS can do ongoing CDC replication into Kinesis or S3
E. Kinesis Data Streams provides shard-based ordering and scalable throughput
F. DynamoDB Streams captures DynamoDB item changes for downstream processing
Correct answers: A and B
Explanation: Match the ingestion service to the event source and protocol. Database CDC requires a CDC-capable service such as AWS DMS (or a custom CDC connector) before sending events to a stream. DynamoDB Streams only publishes DynamoDB table change events, while Kinesis Data Streams and Amazon MSK are general-purpose streaming platforms for application-produced records.
The core decision is source type versus what the ingestion service can natively capture. Kinesis Data Streams and Amazon MSK ingest records that producers explicitly publish (Kinesis API or Kafka protocol). They do not automatically “tap” databases for CDC. DynamoDB Streams is specialized: it emits change records only from a DynamoDB table.
Practical mapping: use Kinesis Data Streams for application-produced records that need shard-based ordering and scalable throughput; use Amazon MSK when producers and consumers need Kafka compatibility and Kafka features; use AWS DMS for ongoing CDC replication from databases such as Oracle into Kinesis or S3; use DynamoDB Streams only to capture DynamoDB item changes for downstream processing.
The unsafe statements are the ones claiming Kinesis directly ingests Oracle CDC and that DynamoDB Streams is a general log ingestion service.
Topic: Data Security and Governance
Select THREE statements that correctly describe authentication mechanisms for private data sources and how they are applied to AWS data pipeline components.
Options:
A. Attach an IAM role to an AWS Glue job for S3 access.
B. Prefer IAM user access keys in code for non-AWS sources.
C. Put database passwords in Lambda environment variables for simplicity.
D. Store JDBC passwords in Secrets Manager and reference via Glue connection.
E. Use a Redshift IAM role for COPY/UNLOAD when accessing S3.
F. AWS DMS to Amazon S3 targets cannot use IAM roles.
Correct answers: A, D and E
Explanation: Role-based access is the default for AWS-to-AWS access because services can assume IAM roles and use temporary credentials. Secrets-based credentials are appropriate when a pipeline component must authenticate to an external system (for example, a database) with a username/password, ideally stored in AWS Secrets Manager. These patterns reduce long-lived credential exposure and improve rotation and auditing.
Use role-based authentication (IAM roles) when an AWS service needs to call AWS APIs (for example, Glue reading/writing S3, Redshift COPY/UNLOAD to S3). The service assumes a role and obtains short-lived credentials governed by IAM policies.
Use secrets-based authentication (for example, AWS Secrets Manager) when the data source requires a shared secret such as a database username/password. Pipeline components should reference the secret at runtime instead of hardcoding credentials in code, config files, or environment variables.
Key takeaway: prefer IAM roles for AWS resources; use Secrets Manager for external credentials and rotation.
Topic: Data Ingestion and Transformation
A data team uses Amazon EventBridge to start an AWS Step Functions state machine every 5 minutes. The state machine invokes AWS Lambda functions to ingest new records from a partner API, write them to Amazon S3, and update a downstream table. Step Functions is configured with retries and a Catch path for transient failures, and EventBridge can occasionally deliver duplicate events.
Which design action best reflects the core principle that supports safe retries and duplicate event delivery in this serverless orchestration?
Options:
A. Make each Lambda idempotent by recording a processing key and using conditional writes
B. Use IAM least-privilege policies for EventBridge, Step Functions, and each Lambda role
C. Prevent overwrites in the raw S3 prefix by writing each ingest to a new object key
D. Encrypt S3 objects and any state in DynamoDB using AWS KMS customer managed keys
Best answer: A
Explanation: Because Step Functions can retry and EventBridge can deliver the same event more than once, the workflow must tolerate repeated invocations without changing the final result. The core principle is idempotent processing: repeated executions with the same input should produce the same outcome. Tracking a unique processing key and enforcing it with conditional writes prevents duplicates while still allowing retries for resilience.
When EventBridge triggers Step Functions, you should assume at-least-once delivery and that retries may occur during transient errors. The core principle that makes this safe is idempotent processing: each processing attempt must be able to run more than once without producing duplicate side effects.
A common pattern is to generate or extract a stable idempotency key (for example, partner record ID + ingest window or the Step Functions execution input hash) and store it in a durable store such as DynamoDB. Each Lambda checks/claims the key using a conditional write so only the first attempt performs the write/update, while subsequent retries become no-ops or return the previously computed result. This preserves correctness while keeping Step Functions retries enabled for availability.
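A minimal sketch of the claim-the-key pattern, assuming a hypothetical DynamoDB table named ingest_idempotency with partition key processing_key and hypothetical event fields:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("ingest_idempotency")  # hypothetical table

def claim_processing_key(processing_key: str) -> bool:
    """Return True only for the first attempt; retries and duplicates become no-ops."""
    try:
        table.put_item(
            Item={"processing_key": processing_key},
            # Succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(processing_key)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already processed by an earlier attempt
        raise

def handler(event, context):
    key = f"{event['partner_record_id']}:{event['ingest_window']}"  # hypothetical fields
    if claim_processing_key(key):
        # ... perform the S3 write / downstream table update here ...
        pass
```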
Topic: Data Ingestion and Transformation
A data engineer is designing a high-level ETL workflow on AWS with clear stages: ingest, validate, transform, load, and publish. The workflow must have explicit success and failure paths so downstream consumers only read trusted data.
Which TWO statements are INCORRECT (false/unsafe) for this design? (Select TWO.)
Options:
A. Land source data in an immutable S3 raw zone during ingest
B. Treat the workflow as successful if publish runs, even when transform/load errors occurred
C. Validate schema and business rules before transform; quarantine rejects
D. Make the load step idempotent so retries do not duplicate data
E. Publish only after a successful load; stop downstream on failures
F. If some records fail validation, continue without tracking; consumers can filter
Correct answers: B and F
Explanation: A well-designed ETL workflow defines clear stage boundaries and uses gates so only validated, successfully loaded data is published. Failure paths must be explicit (for example, quarantine invalid data, fail the run, and alert) rather than silently passing problems downstream. Success should only be emitted when all required upstream stages complete correctly.
The core design principle is stage gating with explicit branching: ingest writes immutable raw data, validate decides whether data is acceptable (or which records are rejected), transform only runs on accepted inputs, load writes to the target in an idempotent way, and publish exposes the curated result (for example, updating a pointer/manifest, adding a partition, or emitting an event) only after load succeeds.
At a high level, orchestrators such as AWS Step Functions or managed schedulers should model explicit success and failure transitions between stages, with Catch/fallback paths that route bad records to a quarantine location and raise alarms.
The key takeaway is to avoid “silent success” patterns that hide validation or load failures and shift data quality responsibility to downstream users.
Topic: Data Store Management
A company is standardizing on Amazon Redshift for analytics. The data engineering team must choose between Amazon Redshift provisioned and Amazon Redshift Serverless for different workloads.
Select TWO statements that are true.
Options:
A. Redshift Serverless requires selecting a node type and number of nodes to meet performance requirements.
B. In Redshift Serverless, setting a maximum RPU helps limit compute consumption and cost for that workgroup.
C. Redshift Serverless is a good fit for spiky or intermittent workloads because it automatically adjusts capacity and you pay for compute used.
D. Redshift Serverless is always the lowest-cost option for 24/7 steady, high-utilization BI workloads.
E. Redshift provisioned automatically scales compute to zero when the cluster is idle.
F. Redshift provisioned is generally a better fit for consistently high, predictable workloads where you want fixed capacity and stable performance.
Correct answers: B and C
Explanation: Redshift Serverless is intended for workloads with variable demand because it manages capacity for you and charges based on usage. It also provides cost-control guardrails such as a maximum RPU setting to limit peak compute consumption. Provisioned clusters are typically chosen when you want fixed capacity and highly predictable performance and spend for steady workloads.
The core trade-off is how compute capacity is managed and billed. Redshift Serverless abstracts cluster sizing and scales compute based on demand, which is typically cost-effective for unpredictable, spiky, or intermittent usage patterns. You can apply guardrails (such as maximum RPU) to help control peak compute usage and reduce the risk of unexpected spend.
Redshift provisioned requires you to choose and manage cluster capacity (node type/count). That model is commonly preferred for steady, predictable, always-on workloads where you want consistent performance characteristics and more predictable baseline costs, especially when capacity can be planned in advance.
In this question, the true statements are the ones describing serverless auto-capacity with usage-based billing and the use of max RPU as a cost-control mechanism.
Topic: Data Store Management
A company stores clickstream data in Amazon S3 as Parquet with prefixes like s3://datalake/curated/clicks/dt=YYYY-MM-DD/hour=HH/. An Amazon Athena dashboard must include new data within 15 minutes of landing in S3. The dataset creates thousands of new partitions per day. The team must keep Athena query costs low by maximizing partition pruning and must avoid solutions that require frequent full-prefix S3 listing because of governance and operational overhead.
Which solution BEST meets these requirements?
Options:
A. Schedule an AWS Glue crawler to run every 15 minutes
B. Query the data as an unpartitioned Athena table
C. Run Athena MSCK REPAIR TABLE before each dashboard refresh
D. Use S3 events to trigger Lambda to BatchCreatePartition in Glue
Best answer: D
Explanation: Athena relies on AWS Glue Data Catalog partition metadata to find and prune partitioned data efficiently. Creating partitions as new S3 keys arrive synchronizes the catalog with the dataset within the 15-minute SLA without repeatedly listing the full dataset prefix. This preserves correctness (latest partitions are queryable) and reduces query and maintenance cost.
For partitioned Athena tables, the Glue Data Catalog’s partition metadata determines which S3 prefixes Athena reads. If new dt/hour folders exist in S3 but are missing from the catalog, queries can return incomplete results or require expensive discovery steps. An event-driven approach that extracts partition values from the object key and calls Glue BatchCreatePartition keeps the catalog synchronized quickly and at scale.
- S3 sends ObjectCreated events to an SQS queue (or EventBridge).
- A Lambda function parses dt and hour from the object key.
- The function calls Glue BatchCreatePartition to register new partitions.
This avoids frequent full-prefix listing while enabling effective partition pruning for lower Athena scan costs.
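A rough sketch of such a Lambda handler, assuming it is invoked directly from S3 event notifications and that the database and table names are placeholders (error handling and deduplication are omitted):

```python
import re

import boto3

glue = boto3.client("glue")
DATABASE, TABLE = "analytics", "clicks"  # hypothetical catalog names
KEY_PATTERN = re.compile(r"dt=(\d{4}-\d{2}-\d{2})/hour=(\d{2})/")

def handler(event, context):
    # Base each new partition's StorageDescriptor on the table definition.
    table_sd = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]["StorageDescriptor"]
    partitions = []
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        match = KEY_PATTERN.search(key)
        if not match:
            continue
        dt_value, hour_value = match.groups()
        sd = dict(table_sd)
        sd["Location"] = f"s3://datalake/curated/clicks/dt={dt_value}/hour={hour_value}/"
        partitions.append({"Values": [dt_value, hour_value], "StorageDescriptor": sd})
    if partitions:
        glue.batch_create_partition(
            DatabaseName=DATABASE, TableName=TABLE, PartitionInputList=partitions
        )
```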
Querying an unpartitioned table prevents pruning on dt/hour, increasing query cost.
Topic: Data Ingestion and Transformation
A company ingests 5 KB JSON clickstream events into Amazon Kinesis Data Streams. The required transformation is lightweight (add two lookup fields and drop PII) and must complete within 1 second of an event arriving. The team wants a fully managed, event-driven operational model with no long-running clusters to patch or scale.
Which transformation service should the data engineer choose?
Options:
A. AWS Lambda triggered by Kinesis Data Streams
B. Amazon EMR Spark Structured Streaming on EC2
C. AWS Glue streaming ETL job
D. Amazon Redshift ELT after loading events into tables
Best answer: A
Explanation: The deciding factor is the sub-second, per-event latency requirement with an event-driven operational model. AWS Lambda can process each Kinesis record immediately and scales automatically without managing servers or clusters. The other services are better suited to micro-batch/cluster-based streaming or batch ELT into a warehouse, which makes consistently meeting a 1-second SLA harder.
This scenario is best matched to a serverless, record-at-a-time transformation. With Kinesis as the source and a simple enrichment/redaction step, AWS Lambda can be invoked as records arrive and complete processing within the 1-second SLA while keeping operations minimal (no always-on compute to manage).
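A minimal handler sketch under these assumptions (the event field names, lookup helpers, and Firehose delivery stream are hypothetical):

```python
import base64
import json

import boto3

firehose = boto3.client("firehose")

def enrich(event_record: dict) -> dict:
    """Add two lookup fields and drop PII before forwarding."""
    event_record["campaign"] = lookup_campaign(event_record["app_id"])  # hypothetical lookup
    event_record["region"] = lookup_region(event_record["app_id"])      # hypothetical lookup
    for pii_field in ("email", "ip_address"):
        event_record.pop(pii_field, None)
    return event_record

def handler(event, context):
    records = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        records.append({"Data": (json.dumps(enrich(payload)) + "\n").encode()})
    if records:
        firehose.put_record_batch(
            DeliveryStreamName="curated-clicks",  # hypothetical destination
            Records=records,
        )
```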
Glue streaming and EMR streaming are designed for larger, stateful streaming jobs and typically involve micro-batching and/or more operational overhead (job infrastructure or clusters). Redshift transformations generally assume data is loaded first and then transformed with SQL (ELT), which is a batch-oriented pattern and is not intended for per-event, sub-second processing.
Key takeaway: when the discriminator is strict low latency plus “no clusters,” choose Lambda over cluster or warehouse-based approaches.
Topic: Data Store Management
Which statement best defines a business data catalog (as opposed to a technical catalog such as the AWS Glue Data Catalog) and how it supports governance and data sharing workflows?
Options:
A. It stores table schemas, S3 locations, partitions, and SerDe details for query engines and ETL jobs
B. It is the service that enforces fine-grained data permissions using centralized grants and LF-tags on databases and tables
C. It manages KMS envelope encryption keys and rotates data keys used to encrypt datasets at rest
D. It provides business context (owners, glossary, descriptions) and workflow-based sharing (search, request access, approvals) for governed data consumption
Best answer: D
Explanation: A business data catalog is designed for human-facing discovery and governance: it adds business metadata (definitions, owners, stewardship) and enables controlled sharing via access request and approval workflows. A technical catalog primarily serves engines by holding schemas and physical locations, not business context and collaboration processes.
The core distinction is purpose and audience. A technical catalog (for example, AWS Glue Data Catalog) is a metadata store used by services like Glue, Athena, and Redshift Spectrum to resolve database/table definitions, S3 locations, partitions, and serialization formats. A business data catalog targets people and governance: it layers business-friendly metadata (descriptions, glossary terms, ownership/stewardship) on top of technical metadata and supports data sharing workflows such as search and discovery, requesting access, approvals, and tracking who can consume datasets. In AWS, services in the “data discovery and collaboration” space (for example, Amazon DataZone) align to the business-catalog role, while Glue Data Catalog aligns to the technical-catalog role.
Topic: Data Security and Governance
A data pipeline in us-east-1 uses Amazon Kinesis Data Firehose to deliver JSON files to s3://datalake-prod/landing/ (SSE-KMS with CMK key-123). An AWS Glue job reads only from landing/events/, converts to Parquet, writes to s3://datalake-prod/curated/events/, and updates one AWS Glue Data Catalog table analytics.curated_events. Amazon Athena queries only the curated location.
The Glue job’s IAM role currently has these managed policies attached: AmazonS3FullAccess and AWSGlueConsoleFullAccess. Security requires least-privilege scoping and disallows wildcard access to all S3 buckets/keys. The team wants to fix this without breaking the pipeline.
Which change is the best optimization?
Options:
A. Replace the managed policies with a customer-managed policy scoped to the two S3 prefixes, the specific KMS key, and only the required Glue Data Catalog actions for analytics.curated_events.
B. Add an S3 bucket policy that allows the glue.amazonaws.com service principal to read and write anywhere in datalake-prod, then remove all permissions from the Glue role.
C. Replace AmazonS3FullAccess with AmazonS3ReadOnlyAccess and keep AWSGlueConsoleFullAccess for catalog updates.
D. Keep the managed policies and add AWS CloudTrail alerts to detect unexpected S3 access by the Glue role.
Best answer: A
Explanation: The requirement is to apply least privilege when managed policies are too broad. A customer-managed IAM policy lets you restrict access to only landing/events/ read, curated/events/ write, the specific CMK needed for SSE-KMS, and only the necessary Glue Data Catalog actions for the single table. This reduces blast radius and improves security operability while keeping the pipeline functional.
Managed policies like AmazonS3FullAccess and AWSGlueConsoleFullAccess are typically far broader than a production ETL job needs. The least-privilege approach is to replace them with a customer-managed policy that allows only the exact actions and resources required by the Glue job:
- S3: s3:ListBucket on datalake-prod with an s3:prefix condition, s3:GetObject on landing/events/*, and s3:PutObject (plus optional multipart actions) on curated/events/*.
- KMS: Decrypt for reads and Encrypt/GenerateDataKey for writes, scoped to CMK key-123.
- Glue Data Catalog: only the actions needed to read and update analytics.curated_events, scoped to those catalog resources.
This meets the explicit constraint against wildcard access while preserving reliability (no unexpected AccessDenied) and improving security posture through reduced permissions.
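An illustrative sketch of such a customer-managed policy, expressed as a Python dict; the account ID is a placeholder and the exact Glue actions should be trimmed to what the job actually calls:

```python
import json

ACCOUNT = "111122223333"  # placeholder account ID
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": "arn:aws:s3:::datalake-prod",
         "Condition": {"StringLike": {"s3:prefix": ["landing/events/*", "curated/events/*"]}}},
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::datalake-prod/landing/events/*"},
        {"Effect": "Allow", "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::datalake-prod/curated/events/*"},
        {"Effect": "Allow",
         "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
         "Resource": f"arn:aws:kms:us-east-1:{ACCOUNT}:key/key-123"},
        {"Effect": "Allow",
         "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions",
                    "glue:BatchCreatePartition", "glue:UpdateTable"],
         "Resource": [f"arn:aws:glue:us-east-1:{ACCOUNT}:catalog",
                      f"arn:aws:glue:us-east-1:{ACCOUNT}:database/analytics",
                      f"arn:aws:glue:us-east-1:{ACCOUNT}:table/analytics/curated_events"]},
    ],
}
print(json.dumps(policy, indent=2))
```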
Topic: Data Ingestion and Transformation
A data platform uses an Amazon EventBridge schedule to start an AWS Step Functions Standard state machine every 5 minutes. The workflow reads a batch of S3 object keys from an Amazon SQS queue, then fans out processing to AWS Lambda and AWS Glue jobs. During traffic spikes, the team must prevent downstream system overload and ensure failures are handled safely.
Which statement is INCORRECT about designing and operating this serverless workflow?
Options:
A. Use Map MaxConcurrency to cap parallel processing
B. Set Lambda reserved concurrency to protect downstream dependencies
C. Use retries with backoff and a DLQ for poison messages
D. Step Functions Standard is exactly-once, so idempotency isn’t needed
Best answer: D
Explanation: The unsafe statement is the claim that Step Functions Standard is exactly-once and therefore idempotency is unnecessary. In practice, retries, timeouts, and partial failures can cause the same unit of work to be attempted more than once. Designing idempotent processing (or explicit deduplication) is required for correctness in serverless orchestration.
Serverless orchestration is built around controlled concurrency and resilient failure handling. Step Functions Standard can retry failed tasks (or tasks can time out and be retried), and integrations that involve polling/queueing can also result in duplicate attempts. Because of this at-least-once reality, correctness is achieved by making each unit of work idempotent (for example, writing with conditional puts, using idempotency keys, or tracking processed S3 keys/ETags).
Separately, you manage load by capping parallelism (such as Step Functions Map MaxConcurrency) and by limiting Lambda concurrency to protect databases/APIs. For failures, use retries with exponential backoff for transient errors and route non-retryable “poison” messages to a DLQ for later inspection and redrive.
Use Map MaxConcurrency to prevent stampedes during spikes.
Topic: Data Security and Governance
A data engineering team is creating least-privilege IAM permissions for roles that read curated data in Amazon S3 and run analytics jobs. Which statement is INCORRECT?
Options:
A. Use tools like IAM policy validation/Access Analyzer and CloudTrail to confirm the policy’s effective access.
B. To allow s3:GetObject, granting access to only the bucket ARN is sufficient.
C. For S3 least privilege, use bucket ARN for s3:ListBucket and object ARNs for s3:GetObject.
D. If a managed policy is too broad, create a customer-managed policy scoped to required actions and resources.
Best answer: B
Explanation: In IAM, least-privilege scoping requires matching actions to the correct resource types. Amazon S3 object-level actions (for example, s3:GetObject) must be granted on object ARNs, while bucket-level actions (for example, s3:ListBucket) use the bucket ARN and can be further restricted with conditions. Creating a custom customer-managed policy is the right approach when AWS managed policies are too permissive.
The core idea is least-privilege IAM design: when AWS managed policies don’t fit, write a customer-managed policy that grants only the needed actions and scopes Resource precisely.
For Amazon S3, actions map to different resource types:
- Bucket-level actions such as s3:ListBucket use arn:aws:s3:::bucket and can be narrowed with conditions such as s3:prefix.
- Object-level actions such as s3:GetObject require object ARNs such as arn:aws:s3:::bucket/prefix/* (or arn:aws:s3:::bucket/*).
Therefore, granting s3:GetObject on only the bucket ARN will not correctly authorize object reads and is not a valid least-privilege pattern. The key takeaway is to scope both actions and resources accurately and validate the effective permissions before rollout.
Topic: Data Operations and Support
An AWS Glue Spark job that computes daily metrics with groupBy(user_id) is intermittently missing its SLA. You review the stage summary below.
Exhibit: Stage 12 (shuffle) by partition
| Partition | Records read | Task time |
|---|---|---|
| 0 | 820,000,000 | 58 min |
| 1 | 21,000,000 | 2.1 min |
| 2 | 19,000,000 | 1.9 min |
| 3 | 20,000,000 | 2.0 min |
Based only on the exhibit, what is the best next step to mitigate the issue?
Options:
A. Decrease spark.sql.shuffle.partitions to reduce overhead
B. Salt the user_id key before aggregation to spread work
C. Increase the number of Glue workers for the job
D. Coalesce to 1 partition before the groupBy to simplify processing
Best answer: B
Explanation: The exhibit shows extreme data skew: one shuffle partition processes far more records and runs far longer than the others. This indicates hot keys or unbalanced partitioning around user_id, creating a straggler that dictates overall stage runtime. A key-salting approach spreads the skewed key’s rows across multiple partitions to balance the workload.
Data skew in distributed processing appears as a small number of partitions doing most of the work, causing “straggler” tasks and long stage completion times. In the exhibit, Partition 0 reads 820,000,000 records and takes 58 minutes, while the other partitions read ~20,000,000 records and take ~2 minutes, which is a classic hot-key/unbalanced-partition symptom.
A high-level mitigation for skewed groupBy(user_id) is to add a temporary random “salt” to the key (for example, user_id + salt) so the heavy key’s rows are split across many partitions, then perform a second aggregation to combine the salted results back to per-user_id totals. Key takeaway: fix the skewed key distribution rather than only adding more compute.
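A high-level sketch of two-stage salted aggregation for an additive metric (assuming events is the input DataFrame; the column names and salt count are illustrative):

```python
from pyspark.sql import functions as F

NUM_SALTS = 32  # illustrative; size to the observed skew

# Stage 1: spread each user_id's rows across NUM_SALTS shuffle partitions.
salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
partial = (salted
    .groupBy("user_id", "salt")
    .agg(F.count("*").alias("clicks_partial")))

# Stage 2: combine the per-salt partials back to one row per user_id.
daily_metrics = (partial
    .groupBy("user_id")
    .agg(F.sum("clicks_partial").alias("clicks")))
```

This two-stage approach works for decomposable aggregates such as counts and sums; non-decomposable metrics need a different skew strategy.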
Topic: Data Ingestion and Transformation
A data engineer is building an ingestion job that pulls orders from a third-party REST API into Amazon S3 every 5 minutes. The API enforces rate limits and returns 429 Too Many Requests during bursts. The API supports cursor-based pagination by returning a nextPageToken in each response, and new/updated orders can arrive while a full pull is in progress.
Which statement is INCORRECT about designing pagination, backoff, and checkpointing for this ingestion job?
Options:
A. Use offset/page-number pagination and checkpoint only the page number to avoid duplicates even if new orders arrive.
B. Design loads to be idempotent (for example, dedupe by order ID) to handle retries safely.
C. Checkpoint the cursor token only after a page is successfully written downstream.
D. Retry 429 responses using exponential backoff with jitter and a max retry limit.
Best answer: A
Explanation: Cursor-based pagination with a persisted nextPageToken is designed for changing datasets, while offset/page-number pagination can shift when new records arrive. Reliable ingestion also requires controlled retries for throttling and a durable checkpoint that advances only after successful writes. Idempotent processing prevents duplicates when retries or replays occur.
The core design goal is to make API ingestion resilient to throttling and failures while ensuring correct continuity across runs. For APIs that return a nextPageToken, cursor-based pagination plus a durable checkpoint (such as storing the last successfully processed token) is the safer approach because it avoids the shifting-window problem that occurs when new/updated records change the ordering of results.
Operationally, treat 429 and transient 5xx errors as retryable with exponential backoff and jitter, and advance the cursor checkpoint only after each page is successfully written downstream.
Offset/page-number pagination with only a page checkpoint is the risky pattern when the source data can change mid-ingestion.
Topic: Data Ingestion and Transformation
You are transforming nested JSON event data (arrays and objects) into an Amazon S3 data lake for analytics with AWS Glue and Amazon Athena/Redshift Spectrum.
Which THREE statements are INCORRECT or unsafe guidance for handling nested and semi-structured data?
Options:
A. Convert nested JSON to CSV because CSV preserves nesting and boosts performance.
B. Normalize one-to-many arrays into parent and child tables when appropriate.
C. Athena can query nested fields using dot notation and UNNEST.
D. You must fully flatten all nested fields before writing to Amazon S3.
E. Athena/Redshift Spectrum cannot read Parquet struct types without flattening.
F. AWS Glue can relationalize or explode arrays into child tables.
Correct answers: A, D and E
Explanation: Nested data does not always need to be flattened up front; Athena and Redshift Spectrum can query nested fields directly (especially in columnar formats like Parquet). Flattening/normalizing is a design choice driven by query patterns, one-to-many relationships, and performance/cost tradeoffs. CSV is usually a poor target for nested data because it loses structure and is expensive to scan.
The core decision is whether to keep nested structures (schema-on-read) or to flatten/normalize them (schema-on-write) based on how the data will be queried.
Athena and Redshift Spectrum can read Parquet struct/array columns and access them with dot notation, and you can use UNNEST/explode patterns for arrays. AWS Glue can relationalize or explode arrays into child tables when one-to-many normalization is appropriate.
A good default is: keep nested data in columnar storage, then selectively unnest/relationalize where it improves the main access patterns.
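A short sketch of the selective-unnest pattern in PySpark (assuming a SparkSession named spark; the paths and nested field names are hypothetical):

```python
from pyspark.sql import functions as F

orders = spark.read.parquet("s3://datalake/curated/orders/")    # hypothetical path

order_items = (orders
    .select(
        F.col("order_id"),
        F.col("customer.id").alias("customer_id"),               # struct field via dot notation
        F.explode("items").alias("item"))                        # array -> one row per element
    .select("order_id", "customer_id", "item.sku", "item.quantity"))

# Child table for the one-to-many relationship; the parent stays nested.
order_items.write.mode("overwrite").parquet("s3://datalake/curated/order_items/")
```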
Topic: Data Ingestion and Transformation
A team stores an AWS Glue ETL script in a Git repository. A security review requires removing database credentials from source code and enabling managed secret rotation.
Exhibit: Glue script excerpt
1 JDBC_URL = "jdbc:postgresql://sales-db:5432/sales"
2 USER = "etl_user"
3 PASSWORD = "P@ssw0rd!"
Based only on the exhibit, what is the best next step to implement safe configuration management for this pipeline?
Options:
A. Upload the script to an encrypted S3 bucket
B. Pass the password as a Glue job argument in plaintext
C. Base64-encode the password before committing the script
D. Store credentials in AWS Secrets Manager and fetch at runtime
Best answer: D
Explanation: The exhibit shows a plaintext secret embedded directly in the ETL code. Replacing the hardcoded value with a runtime lookup from AWS Secrets Manager keeps secrets out of source control and lets you use managed rotation. Access can be limited to the Glue job IAM role.
Safe configuration management means keeping secrets out of code and retrieving them securely at runtime using managed services. In the exhibit, the issue is explicit: line 3 hardcodes PASSWORD = "P@ssw0rd!", which risks exposure through the repository and makes rotation operationally difficult.
The appropriate fix is to store the credentials in AWS Secrets Manager, retrieve them at runtime (through the Glue connection or an SDK call from the script), grant the Glue job's IAM role permission to read only that secret, and enable managed rotation.
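A minimal sketch of the runtime lookup, assuming a hypothetical secret named sales-db/etl_user that stores the username and password as JSON:

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

def get_db_credentials(secret_id: str = "sales-db/etl_user") -> dict:
    """Fetch the current secret version instead of hardcoding credentials."""
    response = secrets.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

creds = get_db_credentials()
JDBC_URL = "jdbc:postgresql://sales-db:5432/sales"
USER, PASSWORD = creds["username"], creds["password"]
```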
This removes the secret from the script while keeping access controlled and auditable.
Topic: Data Store Management
Which statement is INCORRECT about selecting a data store for vector similarity workloads on AWS?
Options:
A. Amazon OpenSearch Service can provide low-latency ANN vector search but requires capacity planning for shards and memory.
B. Aurora PostgreSQL with pgvector is a good fit when vector search must be combined with relational queries and ACID transactions.
C. Amazon S3 queried with Amazon Athena is a safe choice for millisecond-latency nearest-neighbor search without building vector indexes.
D. For batch/offline similarity scoring, storing embeddings in Amazon S3 and processing with Spark can be more cost-effective than an always-on search cluster.
Best answer: C
Explanation: Online vector search typically needs a purpose-built vector index (often approximate nearest neighbor) hosted in a database/search engine designed for low-latency queries. OpenSearch and Aurora PostgreSQL with pgvector can support these patterns with different scaling and operational tradeoffs. S3 with Athena is primarily for analytical scans and is not an appropriate choice for millisecond-latency similarity retrieval.
The key tradeoff in vector workloads is whether you need interactive, low-latency nearest-neighbor retrieval (usually requiring a vector index such as HNSW/IVF) versus offline/batch processing.
Aurora PostgreSQL with pgvector can be a strong choice when embeddings live alongside relational data and you need SQL joins, constraints, and transactional consistency, but you still must design indexes and consider scaling limits of a relational engine. OpenSearch Service is commonly used for scalable, low-latency similarity search using ANN techniques, with operational responsibilities around cluster sizing, shard layout, and memory.
By contrast, Athena queries data in-place in S3 and is optimized for analytical scans, not interactive nearest-neighbor search, so relying on it for millisecond vector retrieval is unsafe.
Topic: Data Operations and Support
A data pipeline ingests clickstream events hourly into an S3 curated bucket partitioned as dt=YYYY-MM-DD/hour=HH. A dashboard has a freshness SLA: each hourly partition must be queryable in Athena by HH:20 UTC. Traffic volume naturally varies by up to ±30% for the same hour across days.
The on-call team wants alerts that are actionable (a target of at most 6 pages/day). Warnings can go to Slack; pages go to PagerDuty.
Which TWO alerting approaches should you AVOID because they create unsafe/noisy thresholds for timeliness or completeness? (Select TWO.)
Options:
A. Publish a lag metric (now minus newest processed event time) and page only if lag >20 minutes for 2 consecutive 10-minute periods
B. Use CloudWatch anomaly detection on record-count metrics and page only for sustained deviations beyond the learned band
C. Compute hourly record count and alert when it drops more than 50% versus a 7-day baseline for that same hour; send a Slack warning for 20% to 50% drops
D. Page if the hourly record count is not exactly equal to yesterday’s count for the same hour
E. Alert if the expected dt/hour partition prefix is missing or contains only 0-byte objects at HH:20
F. Page when the latest partition arrives later than HH:20 for a single 1-minute evaluation period
Correct answers: D and F
Explanation: Freshness and completeness checks should be tied to the SLA and natural variability, with thresholds that reduce false positives. Paging on single short-lived SLA misses or using exact record-count equality creates noisy, non-actionable alerts. Using sustained-breach windows and baseline/anomaly-based completeness thresholds better matches operational goals.
Timeliness (freshness) checks validate that data is available by the stated SLA time; good alerting adds a short persistence window so you page only when the breach is likely real and ongoing. Completeness checks validate that “enough” data arrived; thresholds must account for expected volume variability (seasonality, day-to-day changes) using baselines (rolling median/average) or anomaly bands.
A practical pattern is to page only on sustained SLA breaches (for example, lag above the threshold for consecutive evaluation periods), compare hourly counts against a same-hour baseline or learned anomaly band, and route moderate deviations to Slack while reserving pages for severe or sustained ones.
The key takeaway is to align thresholds to the SLA and expected variance to avoid alert fatigue while still catching true data quality incidents quickly.
Topic: Data Operations and Support
A daily AWS Glue Studio job reads s3://datalake/raw/orders/, applies a mapping that renames customer_id to cust_id, and then runs an AWS Glue Data Quality EvaluateDataQuality step to catch empty IDs and invalid values before writing Parquet to s3://datalake/curated/orders/.
Since the rename change, the job fails during the data quality step.
Exhibit: CloudWatch Logs (excerpt)
ERROR EvaluateDataQuality: AnalysisException: cannot resolve 'customer_id'
given input columns: [cust_id, order_id, order_status, order_ts]
Ruleset orders_curated_rules:
Completeness "customer_id" > 0.99
Which change will fix the root cause with the least disruption while keeping data quality checks in the job?
Options:
A. Update the ruleset to use cust_id instead of customer_id
B. Grant the job role glue:GetTable on the Data Catalog
C. Increase the Glue job DPUs to avoid Spark executor errors
D. Rerun the Glue crawler to refresh the table schema
Best answer: A
Explanation: The failure occurs inside the EvaluateDataQuality step because the ruleset references customer_id, which no longer exists after the mapping rename. Aligning the ruleset with the transformed schema allows the job to run and still enforce completeness/validity checks before writing curated output. This is the smallest change because it does not alter the pipeline structure or data flow.
Symptom: the Glue job fails at EvaluateDataQuality with an AnalysisException stating it cannot resolve customer_id.
Root cause: the mapping step renamed customer_id to cust_id, but the Glue Data Quality ruleset still evaluates Completeness "customer_id" > 0.99. When the evaluation runs against the post-mapping frame, Spark cannot find the referenced column.
Fix: edit the existing ruleset to reference the current column name (cust_id) and keep the evaluation step in the same position so it continues to validate the transformed/curated schema before writing to S3. Key takeaway: data quality rules must match the schema at the point in the job where they are evaluated.
Missing catalog permissions would surface as an AccessDenied error, not as an unresolved column.
Topic: Data Operations and Support
A data engineer runs Amazon Athena queries on an S3 table stored in Parquet and partitioned by dt (daily). There are 90 dt partitions. Each partition scans 100 GB when using SELECT *. If a query selects only user_id and page_id, Athena scans 10 GB per partition (column pruning).
A new report needs only user_id and page_id for the last 7 days. Athena costs USD 5.00 per TB scanned. Assume 1 TB = 1,000 GB. Round to the nearest cent.
Which option has the lowest estimated Athena cost for one run?
Options:
A. No dt filter and select user_id,page_id (USD 4.50)
B. Filter dt to last 7 days and select user_id,page_id (USD 0.35)
C. Filter dt to last 7 days and SELECT * (USD 3.50)
D. Add LIMIT 10,000 without a dt filter (USD 45.00)
Best answer: B
Explanation: The lowest-cost Athena query is the one that both limits partitions with a dt predicate and projects only the required columns. With Parquet, selecting fewer columns reduces bytes scanned, and partition pruning limits how many partitions are read. Using both together minimizes scanned data and typically improves performance and cost.
Athena charges by data scanned, so the goal is to reduce scanned bytes by (1) pruning partitions with a predicate on the partition key and (2) scanning only the needed columns with a columnar format like Parquet.
For the query that filters to 7 days and selects only user_id and page_id: 7 partitions × 10 GB = 70 GB = 0.07 TB, and 0.07 TB × USD 5.00/TB = USD 0.35 per run.
Any approach that reads more partitions or uses SELECT * increases bytes scanned (and cost).
- SELECT * on 7 days still scans 100 GB per day because it reads all columns.
- Relying on LIMIT is a misconception: it does not prevent scanning the partitions/columns needed to evaluate the query.
Topic: Data Ingestion and Transformation
A data engineer uses AWS Glue to transform clickstream events from an S3 raw zone for querying in Amazon Athena. The source JSON has schema drift: the event_time field is sometimes an ISO-8601 string and sometimes an epoch-milliseconds number, which causes downstream jobs to fail when a consistent timestamp is expected.
A daily Athena report queries the last 7 days of data.
Which solution best enforces a consistent schema and type normalization during transformation and results in the lowest Athena scan cost for the daily report query?
Options:
A. Glue ETL applies explicit mappings to a fixed schema, converts event_time to timestamp, writes Parquet partitioned by normalized event_date (YYYY-MM-DD); USD 2.10/query
B. Athena view uses try_cast to normalize event_time at query time over raw JSON; USD 22.50/query
C. Glue crawler infers schema from raw JSON daily and Athena queries raw data; USD 22.50/query
D. Glue ETL casts event_time to timestamp and writes Parquet with no partitions; USD 9.00/query
Best answer: A
Explanation: Use a transformation step that enforces a fixed schema and normalizes types before writing curated data. Writing Parquet partitioned by a normalized date enables Athena partition pruning and avoids failures from mixed event_time representations. The daily report then scans only the 7 required partitions, minimizing bytes scanned and cost.
The core requirement is to prevent downstream failures from schema drift by enforcing a consistent target schema (including type normalization) during transformation, not at query time. In AWS Glue, this is done by applying an explicit mapping and converting mixed representations (ISO string and epoch ms) into a single timestamp type, then writing curated data.
For Athena cost, partition pruning limits scanned data to the last 7 days of curated Parquet:
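The scan sizes are not restated here, but they can be backed out of the per-query prices in the options (USD 5 per TB), which makes the effect of pruning concrete:
\[ \begin{aligned} \text{Option A (Parquet, 7 partitions)} &: 2.10 / 5 = 0.42\ \text{TB scanned} \\ \text{Option D (Parquet, no partitions)} &: 9.00 / 5 = 1.8\ \text{TB scanned} \\ \text{Options B/C (raw JSON)} &: 22.50 / 5 = 4.5\ \text{TB scanned} \end{aligned} \]
Pruning to the 7 needed daily partitions reads roughly a quarter of the full Parquet dataset and only a small fraction of the raw JSON.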
Key takeaway: enforce types and normalize partition keys during ETL so both correctness and scan efficiency are achieved.
Topic: Data Operations and Support
A data pipeline uses AWS Step Functions to orchestrate an AWS Lambda validation step followed by an AWS Glue ETL job. Each run must carry a run_date and a unique run_id through all steps, and operators must be able to trace outputs back to the specific orchestration run.
Which mechanism best meets this requirement?
Options:
A. Store run_date and run_id as S3 object tags
B. Use CloudTrail event IDs to correlate the steps
C. Pass context in Step Functions input/output with execution ARN
D. Write run_date and run_id into Glue Data Catalog table properties
Best answer: C
Explanation: AWS Step Functions is designed to pass run-specific context between automated processing steps by carrying a JSON payload through the workflow. Each workflow run also has a unique execution identifier (execution ARN), which provides high-level traceability from outputs back to the specific run.
To pass run context such as dates, partitions, and run IDs through automated processing steps, use a mechanism that natively propagates per-execution values across the orchestration. AWS Step Functions does this by storing the execution input and each state’s output as JSON and passing that data to subsequent states (optionally shaping it with state Parameters). Step Functions also assigns each run a unique execution identifier (execution ARN), which can be injected into task inputs for end-to-end traceability and used to correlate logs and artifacts produced by Lambda and Glue for that run.
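A minimal sketch of the pattern, assuming a hypothetical state machine ARN and payload shape (the boto3 call and the `$$.Execution.Id` context path are standard; everything else is illustrative):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Start one orchestration run and pass the run context as execution input.
# Every state receives (and can enrich) this JSON payload.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:IngestPipeline",  # hypothetical
    name="run-2026-02-25-0012",  # naming executions by run_id keeps them easy to find
    input=json.dumps({"run_date": "2026-02-25", "run_id": "0012"}),
)
print(response["executionArn"])  # unique per run; correlate logs and outputs with this ARN

# Inside the state machine, a Task state can inject the execution ARN into its input, e.g.:
# "Parameters": {"run_context.$": "$", "execution_arn.$": "$$.Execution.Id"}
```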
Key takeaway: use the workflow engine’s execution-scoped input/output and execution ID rather than static metadata or audit logs.
Topic: Data Store Management
Which THREE statements are true about AWS-backed approaches for storing embeddings and performing vector similarity search? (Select THREE.)
Options:
A. Aurora PostgreSQL with pgvector can store embeddings and run similarity SQL.
B. Athena can perform k-NN similarity search directly over S3 embeddings.
C. Amazon OpenSearch Service supports vector fields and k-NN queries.
D. RDS for MySQL includes built-in vector indexing and similarity functions.
E. DynamoDB natively supports vector distance queries for similarity search.
F. Self-managed FAISS on EC2 can serve similarity search from stored embeddings.
Correct answers: A, C and F
Explanation: Vector search is used to find semantically similar items by comparing embedding vectors. On AWS, common high-level approaches include managed search engines that support vector indexing, relational databases with vector extensions, or self-managed ANN libraries running on compute. A serverless SQL query engine over files and a key-value database are not, by themselves, vector similarity engines.
The core decision is choosing a store and query engine that can both persist embeddings and execute similarity (nearest-neighbor) queries efficiently. AWS-native options include OpenSearch for managed vector indexing and k-NN retrieval, and PostgreSQL-compatible databases (such as Aurora PostgreSQL) using pgvector to store vectors and compute distances in SQL. If you need custom algorithms or tighter control over indexing behavior, you can run an ANN library (for example, FAISS) on Amazon EC2 and manage the index lifecycle yourself.
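For instance, a minimal pgvector sketch (table, column, and connection details are hypothetical; it assumes the vector extension is available on the Aurora PostgreSQL cluster):

```python
import psycopg2

# Placeholder connection details for an Aurora PostgreSQL endpoint.
conn = psycopg2.connect(host="aurora-cluster.example.com", dbname="vectors",
                        user="app", password="***")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id bigserial PRIMARY KEY,
        embedding vector(3)   -- real embeddings are typically hundreds of dimensions
    );
""")

# k-NN-style similarity: order by L2 distance to the query vector and take the top 5.
cur.execute(
    "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 5;",
    ("[0.12, 0.98, 0.33]",),
)
print(cur.fetchall())
conn.commit()
```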
Amazon OpenSearch Service vector + k-NN: OK
Aurora PostgreSQL + pgvector similarity SQL: OK
FAISS on EC2 (self-managed ANN service): OK
Athena over S3 for k-NN similarity: NO (not a vector engine)
DynamoDB native vector distance queries: NO
RDS for MySQL built-in vector indexing: NO
Pick services that provide vector-aware indexing and similarity query capabilities, not just storage.
Topic: Data Security and Governance
A data lake S3 bucket in Account A is encrypted with SSE-KMS using a customer managed key (CMK) in Account A. A Glue ETL job in Account B assumes the IAM role GlueETLRoleB to read objects from the bucket.
Current setup:
- The S3 bucket policy in Account A allows s3:GetObject to arn:aws:iam::<AccountB>:role/GlueETLRoleB.
- GlueETLRoleB has an IAM policy that allows kms:Decrypt on the CMK ARN in Account A.
The Glue job fails with the following error:
AccessDeniedException: User: arn:aws:sts::<AccountB>:assumed-role/GlueETLRoleB/... is not authorized to perform: kms:Decrypt on resource: arn:aws:kms:us-east-1:<AccountA>:key/<key-id> because no resource-based policy allows the kms:Decrypt action
Which change will fix the root cause with the MINIMAL change while keeping SSE-KMS enabled?
Options:
A. Grant Lake Formation SELECT on the database to GlueETLRoleB
B. Add GlueETLRoleB to the CMK key policy in Account A
C. Switch the bucket to SSE-S3 encryption
D. Add kms:Decrypt to the S3 bucket policy in Account A
Best answer: B
Explanation: The failure occurs at AWS KMS, not S3 or Glue, because the CMK in Account A does not trust the cross-account role. For cross-account decryption, KMS authorization must be granted by the CMK’s resource-based key policy (or a KMS grant) in the key-owning account. Updating the key policy to allow the Account B role resolves the AccessDenied with minimal change.
Symptom: the Glue job can reach S3 but fails with kms:Decrypt AccessDenied stating that no resource-based policy allows the action.
Root cause: in cross-account scenarios, an IAM policy on the caller role is not sufficient by itself for KMS. The CMK in Account A must also allow the principal from Account B via the CMK key policy (or a KMS grant). Without that trust, KMS denies decryption even if S3 access is allowed.
Fix: update the CMK key policy in Account A to allow arn:aws:iam::<AccountB>:role/GlueETLRoleB (or an approved Account B principal) to use the key for required actions such as kms:Decrypt (and commonly kms:GenerateDataKey* for SSE-KMS workflows).
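A minimal sketch of the key-policy statement to add, using the role ARN from the scenario (in practice you merge the statement into the existing key policy rather than replace it):

```python
import json
import boto3

kms = boto3.client("kms", region_name="us-east-1")
key_id = "arn:aws:kms:us-east-1:<AccountA>:key/<key-id>"  # placeholder ARN from the scenario

statement_for_account_b = {
    "Sid": "AllowGlueETLRoleBToDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::<AccountB>:role/GlueETLRoleB"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],  # add kms:GenerateDataKey* if the role also writes
    "Resource": "*",
}

# Fetch the current policy, append the new statement, and put the merged policy back.
current = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
current["Statement"].append(statement_for_account_b)
kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(current))
```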
Key takeaway: cross-account KMS access is controlled by the key policy/grants in the key-owning account.
Topic: Data Ingestion and Transformation
A data platform team deploys the same CloudFormation stack to dev and prod to create an AWS Glue connection used by ingestion jobs. The team must keep environment-specific secrets out of source control and avoid passing secret values through deployment parameters/logs.
Exhibit: CloudFormation snippet
Parameters:
Env: {Type: String}
DbPassword: {Type: String, Default: "P@ssw0rd"}
Resources:
GlueConn:
Type: AWS::Glue::Connection
Properties:
ConnectionInput:
ConnectionProperties: {PASSWORD: !Ref DbPassword}
Based on the exhibit, what is the best next step?
Options:
A. Replace the password with a Secrets Manager dynamic reference per environment
B. Store the password in a CloudFormation Mapping keyed by Env
C. Base64-encode the password in the template default value
D. Keep the parameter but set NoEcho: true and pass the value at deploy time
Best answer: A
Explanation: The exhibit shows a plaintext default password and uses it directly for the Glue connection, which violates safe environment-specific configuration practices. Using a Secrets Manager dynamic reference lets CloudFormation retrieve the secret at deploy time without embedding it in the template or providing it as a parameter value.
The core issue is secret handling in infrastructure as code. In the exhibit, DbPassword is defined with a plaintext Default: "P@ssw0rd" and then used as PASSWORD: !Ref DbPassword, which would place a secret in source control and encourage reuse across environments.
A better pattern is to create a separate AWS Secrets Manager secret for each environment (for example, one for dev and one for prod) and reference it from the template using a CloudFormation dynamic reference (resolved at deploy time). This keeps secrets out of the template and avoids passing secret values through stack parameters or build logs.
Key takeaway: the exhibit’s plaintext Default value is the signal to move the password into a managed secret and reference it dynamically.
- Base64-encoding or a Mapping still embeds the secret in the template, which is the same problem as the plaintext Default shown.
- NoEcho still passes a secret since the value must be provided as a parameter value, which the requirement says to avoid.
Topic: Data Ingestion and Transformation
A data ingestion Lambda function uncompresses large objects from Amazon S3, performs transformations, and writes results back to S3. The function’s local storage is insufficient, so the team is considering mounting Amazon EFS.
Which statements about using additional storage with AWS Lambda are FALSE/UNSAFE? (Select THREE.)
Options:
A. Mount an Amazon EBS volume directly to a Lambda function.
B. EFS throughput can bottleneck; optimize I/O and choose throughput mode.
C. A Lambda function can use EFS without being in a VPC.
D. EFS-mounted files have the same latency as Lambda /tmp.
E. Increase Lambda ephemeral storage when you only need temporary space.
F. EFS allows multiple invocations to share files; manage concurrency.
Correct answers: A, C and D
Explanation: When Lambda needs more disk than local ephemeral storage, it can mount Amazon EFS, but EFS access is over the network and has different performance characteristics. EFS also requires the Lambda function to run in a VPC to reach EFS mount targets. Lambda cannot mount EC2-style block storage such as Amazon EBS.
Lambda has two common ways to handle “more disk” needs: increase Lambda ephemeral storage (for scratch space that is local to the execution environment) or mount Amazon EFS (for shared, persistent file storage). EFS is an NFS file system accessed over the network via VPC mount targets, so I/O is generally higher-latency than local /tmp and can be constrained by EFS throughput and access patterns (for example, many small reads).
Operationally, using EFS with Lambda requires VPC configuration (subnets and security groups) so the function can connect to EFS. Also, because EFS is shared, concurrent invocations can contend for the same files, so coordination (naming, locking, idempotency) may be needed. The key takeaway is: choose ephemeral storage for fast temporary scratch, and EFS for shared/persistent storage with network performance tradeoffs.
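As a rough illustration of the two storage approaches discussed above (function name, sizes, and the EFS access point ARN are placeholders):

```python
import boto3

lam = boto3.client("lambda")

# Option 1: more local scratch space; /tmp can grow up to the configured size.
lam.update_function_configuration(
    FunctionName="ingest-uncompress",      # hypothetical function name
    EphemeralStorage={"Size": 4096},       # MB; default is 512, maximum is 10240
)

# Option 2: shared, persistent storage; requires the function to run in a VPC
# that can reach the EFS mount targets.
lam.update_function_configuration(
    FunctionName="ingest-uncompress",
    FileSystemConfigs=[{
        "Arn": "arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-0123456789abcdef0",
        "LocalMountPath": "/mnt/ingest",
    }],
)
```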
Topic: Data Operations and Support
When troubleshooting incorrect results in Amazon Athena or Amazon Redshift queries (unexpected row counts, missing rows, or day-boundary mismatches), which TWO statements are false or unsafe?
(Select TWO.)
Options:
A. Check join cardinality; 1-to-many inflates counts
B. Convert to local time zone before applying DATE() filters
C. A WHERE filter on the right table keeps LEFT JOIN unmatched rows
D. Athena timestamps are time zone–naive; store offsets/UTC
E. Normalize timestamps to UTC before daily aggregates
F. Adding DISTINCT is a safe fix for join duplicates
Correct answers: C and F
Explanation: Common analytics errors come from JOIN/filter interactions, join cardinality changes, and inconsistent handling of timestamps across time zones. The unsafe statements are the ones that incorrectly describe LEFT JOIN behavior with filters and that treat DISTINCT as a generally safe “fix” for duplicates. Normalizing and converting timestamps deliberately, and validating join cardinality, are reliable troubleshooting steps.
At a high level, troubleshoot query correctness by validating three things: join semantics, filter placement, and time handling. With LEFT JOINs, predicates applied after the join can change which rows survive; filtering on columns from the right side in a WHERE clause removes the NULL-extended rows and often defeats the purpose of the LEFT JOIN (move such predicates into the ON clause when you intend to preserve unmatched rows).
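A small Athena-flavored sketch of the difference (table and column names are made up; the boto3 call shows how such a query is typically submitted):

```python
import boto3

# Keeps unmatched orders: the right-table predicate lives in the ON clause.
preserve_unmatched = """
SELECT o.order_id, r.refund_id
FROM orders o
LEFT JOIN refunds r
  ON o.order_id = r.order_id AND r.status = 'approved'
"""

# Silently drops unmatched orders: the WHERE clause removes NULL-extended rows.
drops_unmatched = """
SELECT o.order_id, r.refund_id
FROM orders o
LEFT JOIN refunds r ON o.order_id = r.order_id
WHERE r.status = 'approved'
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=preserve_unmatched,
    QueryExecutionContext={"Database": "analytics"},                       # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},  # hypothetical bucket
)
```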
Time issues often look like “missing” or “extra” records around midnight; avoid mixing local timestamps and UTC by standardizing storage (commonly UTC) and converting explicitly before bucketing or applying date-based filters. Finally, unexpected count inflation is frequently caused by a 1-to-many join; verify join keys and cardinality rather than masking symptoms.
A quick “dedupe with DISTINCT” can hide the real issue and can remove valid duplicates.
- A WHERE filter on right-table columns removes NULL-extended rows; use the ON clause when unmatched rows must be preserved.
- DISTINCT is not generally safe; it can drop valid duplicates and mask join problems.
- Converting timestamps deliberately (commonly to UTC) before DATE()/bucketing prevents day-boundary errors.
Topic: Data Store Management
Which statement is INCORRECT about using technical catalogs vs business catalogs in an AWS data platform?
Options:
A. Technical catalogs store schema, partitions, and S3 locations for queries.
B. Business catalog entries can support governed sharing via Lake Formation.
C. AWS Glue Data Catalog is primarily a business glossary and approval tool.
D. Business catalogs add glossary, ownership, and request/approval workflows.
Best answer: C
Explanation: AWS Glue Data Catalog is designed to register technical metadata such as table schemas, partitions, and data locations for services like AWS Glue and Amazon Athena. Business catalogs focus on business context (glossary, ownership) and enable governance and data sharing workflows such as access requests and approvals.
A technical catalog is optimized for compute engines: it tracks datasets as technical objects (databases/tables), including schemas, partition keys, and physical locations (for example, S3 prefixes) so ETL and query services can discover and read data consistently. A business catalog sits above that layer to help people find and safely use data: it adds business-friendly descriptions, glossary terms, ownership/stewardship, classifications, and processes for requesting and approving access. In AWS, governed sharing is typically enforced with controls such as Lake Formation permissions, while a business catalog can provide the workflow and context to route access requests to the right owners and document approved sharing.
Key takeaway: Glue Data Catalog is for technical metadata; business catalogs enable governance-oriented discovery and sharing workflows.
Topic: Data Ingestion and Transformation
A data pipeline uses AWS Step Functions to orchestrate a Transform Task state that invokes an AWS Lambda function once per incoming S3 object. If the Lambda fails, Step Functions should retry transient failures and then route the original input to an Amazon SQS dead-letter queue (DLQ) for investigation.
Constraints:
- A failed object must reach the DLQ within 180 seconds.
- Retries wait a fixed IntervalSeconds between attempts.
- A Catch sends the input to the SQS DLQ.
Which configuration meets the requirements?
Options:
A. MaxAttempts 4, IntervalSeconds 20, Catch to SQS DLQ
B. MaxAttempts 4, IntervalSeconds 30, Catch to SQS DLQ
C. MaxAttempts 4, IntervalSeconds 20, no Catch
D. MaxAttempts 5, IntervalSeconds 20, Catch to SQS DLQ
Best answer: A
Explanation: To guarantee a bad object reaches the DLQ within 180 seconds, add up all execution time across attempts plus the retry wait time between attempts. The only viable choice both stays under the 3-minute bound and includes a Catch path that sends the original input to an SQS DLQ after retries are exhausted.
This is a resiliency pattern decision: use Step Functions Retry for transient Lambda failures, then use Catch to divert poison-pill inputs to a DLQ for later analysis.
Compute worst-case time to reach the DLQ as:
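One way to frame the check is total attempt time plus total wait time:
\[ T_{\text{worst}} = N_{\text{attempts}} \times t_{\text{attempt}} + (N_{\text{attempts}} - 1) \times \text{IntervalSeconds} \]
where t_attempt is the per-attempt Lambda duration bound from the constraints; with option A's retry settings this works out to the 160 seconds used below.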
Since 160 seconds is within 180 seconds and the Catch routes the failed input to SQS, this configuration satisfies both the SLA and fault-tolerance requirements.
Topic: Data Ingestion and Transformation
An hourly AWS Glue ETL job reads JSON files from s3://dl/raw/orders/ and writes Parquet to s3://dl/curated/orders/ partitioned by order_date (append mode). The job is occasionally retried or manually re-run after transient failures. After these re-runs, analysts report doubled counts in Athena, and the duplicates match entire raw files being ingested again.
Exhibit: CloudWatch log excerpt
INFO Job bookmark option: job-bookmark-disable
INFO Reading s3://dl/raw/orders/ingest_date=2026-02-25/
INFO Processed files: 12,480
...
ERROR S3Exception: Service Unavailable (503)
INFO Retrying job run
Which change will fix the root cause with the least disruption while still allowing late-arriving files to be picked up in later runs?
Options:
A. Increase the Glue job’s number of workers to reduce runtime
B. Add a DynamoDB table to store every processed S3 object key
C. Change the Glue job to overwrite the entire curated table each run
D. Enable AWS Glue job bookmarks for the S3 source
Best answer: D
Explanation: The duplicates are caused by a stateless ingestion pattern that re-reads the same S3 inputs on job retries or manual re-runs, then appends them again to the curated dataset. Enabling AWS Glue job bookmarks makes the ingestion stateful by persisting what has already been processed. This preserves correctness at scale while still allowing late-arriving files to be processed when they first appear.
Symptom: Athena shows doubled counts after Glue job retries/re-runs, and duplicates align to whole input files.
Root cause: the job is stateless (job-bookmark-disable) and reads a broad S3 prefix each run; when a run is retried or repeated, the same raw objects are ingested again and appended, creating duplicates.
Fix: enable AWS Glue job bookmarks for the S3 source so Glue persists ingestion state (which files/partitions were already processed) and skips them on subsequent runs, while still ingesting any newly arrived (including late) files when they appear. Key takeaway: use stateful ingestion (bookmarks/checkpoints) when correctness requires “process each input once” across failures and replays.
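A minimal Glue PySpark sketch of the bookmark-enabled pattern (paths match the scenario; the transformation_ctx names are illustrative, and the job must be started with --job-bookmark-option job-bookmark-enable):

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # loads the bookmark state for this job

orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://dl/raw/orders/"], "recurse": True},
    format="json",
    transformation_ctx="orders_source",   # bookmark progress is tracked per transformation_ctx
)

glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://dl/curated/orders/", "partitionKeys": ["order_date"]},
    format="parquet",
    transformation_ctx="orders_sink",
)

job.commit()   # persists the bookmark, so retries and re-runs skip files already processed
```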
Topic: Data Ingestion and Transformation
A data engineer must implement a daily transformation for a data lake on AWS.
Exhibit: Ingestion ticket
1) Source: s3://datalake/raw/orders/ (JSON gzip), ~5 TB/day
2) Record example:
{"order_id":"123","ts":"2026-02-20T12:01:02Z",
"customer":{"id":"c7","tier":"gold"},
"items":[{"sku":"A1","qty":2},{"sku":"B9","qty":1}]}
3) Target: one row per item, Parquet, partitioned by dt=YYYY-MM-DD
4) Columns may appear/disappear between days (schema drift)
Which option is the most appropriate language/tool choice to meet the ticket requirements?
Options:
A. AWS Lambda function written in Java
B. AWS Glue ETL job using PySpark (Python)
C. Bash script with jq to flatten JSON
D. Amazon Athena SQL view over the raw JSON
Best answer: B
Explanation: An AWS Glue Spark job written in PySpark is best suited for parsing nested JSON, exploding arrays into multiple rows, and writing partitioned Parquet at scale. The exhibit indicates ~5 TB/day (line 1), an items array that must become one row per item (lines 2–3), and schema drift (line 4), all of which are common Glue ETL use cases.
The core decision is choosing a language/tool that can reliably transform large volumes of semi-structured data into an optimized lake format. The exhibit indicates high throughput (~5 TB/day on line 1), nested JSON with an array (items on line 2) that must be flattened to “one row per item” (line 3), and schema drift (line 4).
A Glue ETL job using Spark with PySpark is appropriate because it can:
- parse large volumes of compressed, nested JSON in parallel,
- explode the items array so each element becomes its own row,
- apply explicit mappings to tolerate columns that appear or disappear between days, and
- write Parquet partitioned by dt=YYYY-MM-DD for efficient Athena queries.
A key takeaway is that set-based SQL-only approaches can work for some transforms, but Spark-based ETL is typically the better fit when nested data, schema drift, and very large daily volumes are all present.
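A compact PySpark sketch of the core transform (field names follow the ticket's record example; the exact write mode and any drift handling would depend on the broader job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-flatten").getOrCreate()

# Spark reads gzip-compressed JSON transparently.
raw = spark.read.json("s3://datalake/raw/orders/")

flat = (
    raw.withColumn("item", F.explode("items"))            # one output row per array element
       .select(
           "order_id",
           F.col("customer.id").alias("customer_id"),
           F.col("customer.tier").alias("customer_tier"),
           F.col("item.sku").alias("sku"),
           F.col("item.qty").alias("qty"),
           F.to_date("ts").alias("dt"),                    # partition key dt=YYYY-MM-DD
       )
)

flat.write.mode("append").partitionBy("dt").parquet("s3://datalake/curated/orders/")
```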
Topic: Data Security and Governance
A team runs AWS Glue ETL jobs every 15 minutes that read from an Amazon Aurora PostgreSQL database by using a JDBC connection. The database password must be rotated every 30 days, and the team wants the rotation to require no job updates and to reduce operational risk (avoid custom scripts and manual coordination).
Which approach best meets these requirements?
Options:
A. Store the password in the Glue job parameters and rotate it by updating and redeploying the job on a schedule
B. Store the credentials in AWS Secrets Manager, enable Aurora hosted rotation, and have the Glue connection reference the secret ARN
C. Store the credentials as an SSM Parameter Store SecureString and rotate it monthly with an EventBridge rule and Lambda
D. Create an IAM user for the ETL jobs, store the access keys in Secrets Manager, and rotate the access keys every 30 days
Best answer: B
Explanation: Using AWS Secrets Manager with Aurora hosted rotation provides a managed, low-touch rotation mechanism and keeps the Glue job configuration stable by referencing the secret rather than a specific password value. The jobs retrieve the current credential at runtime via the secret, avoiding manual updates and reducing the chance of rotation-related outages.
The deciding factor is using a managed, integrated rotation mechanism that automatically updates the database credential and lets clients continue to reference the same secret.
With AWS Secrets Manager you can:
- store the Aurora credentials as a secret and enable hosted rotation on a 30-day schedule,
- have the Glue connection (or job) reference the secret ARN rather than an embedded password, and
- let jobs fetch credentials at runtime so each run simply reads the current AWSCURRENT value.
This reduces operational risk by eliminating custom rotation code and preventing missed redeploys/coordination errors during password changes. A custom rotation built around Parameter Store can work, but it shifts rotation reliability and failure handling onto your team.
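At runtime the job just reads the current secret version; a minimal sketch, assuming a hypothetical secret name and the standard hosted-rotation JSON payload:

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

resp = secrets.get_secret_value(SecretId="prod/aurora/etl-user")   # hypothetical secret name
creds = json.loads(resp["SecretString"])   # hosted rotation stores host, port, username, password, dbname

jdbc_url = f"jdbc:postgresql://{creds['host']}:{creds.get('port', 5432)}/{creds['dbname']}"
# Pass jdbc_url, creds["username"], and creds["password"] to the JDBC read; after each
# 30-day rotation the same call simply returns the new AWSCURRENT values.
```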
Topic: Data Operations and Support
A daily AWS Glue ETL job reads raw data from Amazon S3 and writes Parquet files to an S3 curated bucket. The job began failing right after the curated bucket was changed to use SSE-KMS with a customer managed key (CMK).
A data engineer runs a CloudWatch Logs Insights query on the Glue job log group and sees the following recurring error.
message (top)
------------------------------------------------------------
AccessDeniedException: not authorized to perform kms:GenerateDataKey
on arn:aws:kms:us-east-1:111122223333:key/abcd-...
stats
------------------------------------------------------------
count(*) = 31 (last 7 days)
Which action will fix the root cause with the LEAST change?
Options:
A. Disable SSE-KMS on the curated bucket and use SSE-S3 instead
B. Increase the Glue job timeout and number of DPUs to avoid intermittent failures
C. Add an S3 gateway VPC endpoint for the curated bucket path
D. Grant the Glue job role permission to use the CMK for encrypt operations
Best answer: D
Explanation: The recurring failure in Logs Insights is an authorization error against AWS KMS (kms:GenerateDataKey) that started immediately after enabling SSE-KMS. That indicates the ETL job can reach S3, but cannot use the CMK required to encrypt new objects. The minimal fix is to allow the Glue job role (and the CMK key policy) to use the key for encryption/data key generation.
Symptom: Glue job runs now fail consistently with AccessDeniedException for kms:GenerateDataKey, and Logs Insights shows the same message repeating over multiple days.
Root cause: When an S3 bucket uses SSE-KMS, the writer’s identity must be allowed to use the CMK. The Glue job role lacks the required KMS permissions (and/or the CMK key policy doesn’t trust that role), so S3 cannot obtain a data key on the role’s behalf during PutObject.
Fix: Update permissions to allow the Glue job role to use the CMK for writes (commonly kms:GenerateDataKey, kms:Encrypt, and kms:Decrypt as needed) and ensure the CMK key policy permits the role.
This resolves the authorization failure without changing the storage design or job sizing.
- A networking fix such as an S3 gateway endpoint would address connectivity timeouts, not an AccessDenied to KMS.
Topic: Data Security and Governance
A company collects CloudTrail, VPC Flow Logs, and application logs from 20 AWS accounts (same Region). Logs arrive continuously and are stored in Amazon S3 as newline-delimited JSON, producing millions of small objects per day (~12 TB/day). The security team uses Amazon Athena for audit queries that filter by event_date, account_id, and awsRegion, and they have a 30-minute query SLA. Query costs and runtimes are increasing due to large scan sizes and excessive file counts.
Which TWO actions will best improve Athena cost and performance while keeping the logs suitable for audit? (Select TWO.)
Options:
A. Load the logs into DynamoDB and query with PartiQL for audits
B. Increase AWS Glue crawler frequency to update partitions every 5 minutes
C. Use Amazon EMR Spark to compact and write partitioned Parquet to S3
D. Use Kinesis Data Firehose dynamic partitioning with JSON-to-Parquet conversion
E. Keep JSON in S3 and optimize by creating Athena views only
F. Transition all logs to S3 Glacier Deep Archive after 1 day
Correct answers: C and D
Explanation: Athena performance and cost are dominated by data scanned and file/partition layout. Converting high-volume JSON logs into compact, partitioned Parquet reduces scanned bytes for selective audit queries and avoids excessive overhead from millions of small objects. Using EMR for batch compaction and Firehose for streaming delivery are scalable integrations that address both problems.
For large-scale audit logging in S3, the main levers for Athena are: fewer/larger objects (avoid the small-files problem), columnar formats (Parquet/ORC), and partitions aligned to common filters (such as date/account/Region).
A practical pattern is to keep the original raw logs for audit purposes, and produce a curated/query-optimized copy:
- For the streaming path, use Kinesis Data Firehose dynamic partitioning with JSON-to-Parquet record format conversion so newly arriving data lands partitioned and columnar.
- For existing and batch data, use Amazon EMR Spark to compact the millions of small JSON objects into larger, partitioned Parquet files aligned to the common filters (event_date, account_id, awsRegion).
This reduces total bytes scanned per query and improves runtime by pruning partitions and reading only needed columns, while remaining compatible with audit workflows that rely on S3 + Athena + the Glue Data Catalog.
Topic: Data Store Management
Select TWO statements that are true about configuring Amazon Redshift tables to match common access patterns (joins and filters).
Options:
A. DISTSTYLE EVEN is always best for join-heavy workloads
B. A well-chosen DISTKEY can reduce data redistribution for joins
C. DISTSTYLE ALL is recommended for large fact tables to speed joins
D. INTERLEAVED sort keys are best when filters mostly use the first column
E. Declaring a PRIMARY KEY in Redshift enforces uniqueness automatically
F. A compound sort key helps when filtering by a leading column range
Correct answers: B and F
Explanation: In Redshift, distribution choices primarily affect how much data must move across nodes during joins, while sort keys primarily affect how efficiently Redshift can skip reading irrelevant blocks. Choosing a DISTKEY aligned to frequent join keys and a compound sort key aligned to common range predicates are two standard, high-impact optimizations.
Redshift performance is heavily influenced by where data lives (distribution) and how it is ordered on disk (sort keys). A DISTKEY is most useful when many large joins occur on a single, high-cardinality column; colocating rows that join together reduces network redistribution. Sort keys help Redshift prune blocks using zone maps; a compound sort key is best when queries commonly apply range/equality filters on the leading sort column.
Practical guidance:
- Choose a DISTKEY on a frequent join column to reduce shuffle.
- Choose a compound SORTKEY when filters typically start with the same first column.
The wrong choices either over-replicate data, misapply interleaved sorting, or assume constraints are enforced when they are informational only.
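For illustration, a minimal DDL sketch (table, column, and connection details are invented; submitted here through the redshift_connector driver):

```python
import redshift_connector

conn = redshift_connector.connect(
    host="analytics-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    database="dev", user="etl", password="***",
)
cur = conn.cursor()

# Colocate fact and dimension rows on the join key, and sort by the common filter column.
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id      BIGINT,
        customer_id  BIGINT,
        sale_date    DATE,
        amount       DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    COMPOUND SORTKEY (sale_date);
""")
conn.commit()
```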
Topic: Data Security and Governance
Which THREE statements are true about AWS Key Management Service (AWS KMS) key concepts, key types, and key management models for encrypting and decrypting data? (Select THREE.)
Options:
A. Customer managed KMS keys require importing your own key material
B. With an AWS managed key, you cannot edit the key policy or directly manage its lifecycle
C. KMS Encrypt/Decrypt use symmetric KMS keys, and key material stays protected in KMS
D. Envelope encryption commonly uses GenerateDataKey and stores the encrypted data key with the ciphertext
E. Asymmetric KMS keys can be used with GenerateDataKey for envelope encryption
F. A key policy is optional because IAM policies alone can grant KMS key access
Correct answers: B, C and D
Explanation: KMS typically encrypts data using symmetric KMS keys and supports envelope encryption by generating and protecting data keys. You choose between AWS managed keys and customer managed keys based on how much control you need over permissions and lifecycle. Understanding which APIs apply to symmetric vs. asymmetric keys prevents incorrect designs.
The core concepts are symmetric encryption with KMS, envelope encryption, and key management responsibility. For most data-engineering workloads, KMS Encrypt/Decrypt operations are performed with symmetric KMS keys, while applications encrypt large payloads locally using data keys.
A common envelope encryption flow is:
1. Call GenerateDataKey to get a plaintext data key plus an encrypted data key.
2. Encrypt the payload locally with the plaintext data key, store the encrypted data key with the ciphertext, and discard the plaintext key.
3. Later, call Decrypt on the encrypted data key to re-obtain the plaintext key for decryption.
AWS managed keys are managed by AWS (including key policy and lifecycle), while customer managed keys provide more direct control. A common pitfall is assuming asymmetric keys work with data-key generation or that IAM alone can replace the KMS key policy.
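A minimal boto3 sketch of that flow (the key alias is a placeholder; Fernet is just one convenient local cipher for the example):

```python
import base64
import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")

# 1) Get a data key under the CMK: a plaintext copy for local use, an encrypted copy for storage.
dk = kms.generate_data_key(KeyId="alias/data-platform", KeySpec="AES_256")

# 2) Encrypt the payload locally, keep the encrypted data key next to the ciphertext,
#    and drop the plaintext key from memory as soon as possible.
ciphertext = Fernet(base64.urlsafe_b64encode(dk["Plaintext"])).encrypt(b"large payload ...")
stored_encrypted_key = dk["CiphertextBlob"]

# 3) Later: ask KMS to decrypt the stored data key, then decrypt the payload locally.
plaintext_key = kms.decrypt(CiphertextBlob=stored_encrypted_key)["Plaintext"]
payload = Fernet(base64.urlsafe_b64encode(plaintext_key)).decrypt(ciphertext)
```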
- KMS data protection uses symmetric keys for Encrypt/Decrypt, and KMS does not hand you the CMK key material in plaintext.
- GenerateDataKey plus storing the encrypted data key is the textbook envelope encryption pattern for large objects.
- GenerateDataKey is for symmetric keys, the key policy is always part of KMS authorization, and importing key material is optional (not required) for customer managed keys.
Topic: Data Security and Governance
A company runs a multi-account data lake (Amazon S3, AWS Glue, IAM) and must support governance investigations. Investigators need to view a 1-year history of resource configuration changes across all accounts and Regions from a central audit account. Evidence must be tamper-resistant, and investigators must have read-only access.
Which TWO actions should you AVOID because they violate these requirements? (Select TWO.)
Options:
A. Deliver AWS Config history to per-account S3 buckets that local admins can delete from
B. Disable AWS Config recording for IAM and S3 after capturing an initial baseline snapshot
C. Use AWS Config resource timeline and AWS Config advanced queries in the audit account
D. Use AWS Config managed rules to continuously evaluate key governance controls
E. Enable AWS Config in all accounts and Regions and aggregate data into the audit account
F. Set AWS Config retention to at least 1 year for configuration items
Correct answers: A and B
Explanation: AWS Config supports governance investigations by recording configuration items (changes over time) and allowing centralized querying and timelines through an aggregator. The approach must preserve a complete 1-year history for key resource types and protect the evidence from deletion or modification by workload administrators. Actions that weaken evidence integrity or stop change recording undermine compliance investigations.
AWS Config is designed to record and retain resource configuration changes (configuration items) and provide investigation tooling such as resource timeline and queries. To support governance investigations across multiple accounts/Regions, you typically enable AWS Config recorders everywhere, deliver the data to a controlled destination, and use an AWS Config aggregator in a central audit account for cross-account visibility.
To meet the stated requirements, the solution must:
- record configuration changes continuously in every account and Region (including IAM and S3 resource types) rather than relying on a one-time baseline snapshot,
- retain configuration items for at least 1 year,
- aggregate the data into the central audit account where investigators have read-only access, and
- deliver evidence to a destination that workload administrators cannot delete from or modify.
The key takeaway is that governance investigations depend on both completeness (continuous recording) and integrity (protected, centralized retention).
Topic: Data Security and Governance
A company uses AWS Lake Formation with the AWS Glue Data Catalog to govern a data lake in Amazon S3. Data is queried with Amazon Athena by analysts from Finance and Marketing. The company wants attribute-based access control using LF-tags such as department and classification so access can scale without per-table grants, and access to the curated S3 zone must not bypass Lake Formation.
Which THREE actions should you AVOID?
Options:
A. Grant database-level SELECT to a shared analyst role
B. Apply LF-tags to tables/columns and grant permissions using LF-tag policies
C. Remove IAMAllowedPrincipals access and require Lake Formation permissions
D. Add an S3 bucket policy granting analysts s3:GetObject to curated
E. Set new databases/tables to “Use only IAM access control”
F. Automate LF-tagging of newly crawled tables before analysts query them
Correct answers: A, D and E
Explanation: LF-tag ABAC works only when Lake Formation is the enforcement point and access is granted through LF-tag policies (or other Lake Formation grants) that align to the desired attributes. Any configuration that enables direct S3 access or shifts authorization to IAM alone can bypass governance. Also, overly broad Lake Formation grants are not constrained by tags because tags don’t act as denies.
The core idea of LF-tag ABAC is: tag Data Catalog resources (databases, tables, columns) with LF-tags, then grant permissions to IAM principals using LF-tag policies so access scales as new resources inherit tags. For this to work, Lake Formation must be the gatekeeper for both catalog access and underlying S3 data access.
Avoid configurations that break one of these rules:
- Granting direct S3 access (for example, a bucket policy that gives analysts s3:GetObject on the curated prefix) bypasses Lake Formation enforcement.
- Setting new databases/tables to “Use only IAM access control” shifts authorization back to IAM and away from LF-tag governance.
- Handing out broad grants (such as database-level SELECT) expecting LF-tags to “limit” them does not work; LF-tags grant access but do not restrict existing permissions.
The safe pattern is to tag resources and grant access via LF-tag policies, keeping S3 access aligned to Lake Formation governance.
Topic: Data Store Management
Select TWO statements that are true about choosing a lakehouse table format (for example, Apache Iceberg) versus traditional data warehouse tables (for example, Amazon Redshift managed tables) when considering transactionality, schema evolution, and interoperability.
Options:
A. Redshift tables provide open interoperability across Spark and Presto.
B. Lakehouse formats cannot support row-level updates on S3.
C. Iceberg tables on S3 can provide ACID and schema evolution.
D. Redshift managed tables are directly queryable by Athena in place.
E. Iceberg metadata enables multiple engines to share the same table.
F. Glue crawlers automatically create Iceberg tables from Parquet folders.
Correct answers: C and E
Explanation: Lakehouse table formats such as Apache Iceberg add transactional table metadata on top of files in Amazon S3, enabling ACID-style commits and controlled schema evolution. Because the format is open, multiple query engines can interoperate against the same table definition rather than being tied to a single warehouse engine.
The key distinction is where table semantics live. Traditional warehouse tables (such as Amazon Redshift managed tables) implement transactions and schema enforcement inside the warehouse engine and storage layer, which is optimized for that engine but not designed for other analytics engines to query the same tables directly.
Lakehouse table formats (such as Apache Iceberg) store data files in Amazon S3 and maintain table state (snapshots/manifests/metadata) that provides:
- ACID-style commits and consistent reads over files in S3,
- controlled schema evolution tracked in table metadata, and
- an open specification that multiple engines (for example, Spark, Trino/Presto, and Athena) can use to share the same table.
If you primarily need cross-engine access on S3 with governed evolution and consistent reads, a lakehouse table format is a strong fit compared with engine-specific warehouse tables.
Topic: Data Ingestion and Transformation
You are selecting programming languages/tools for ingestion and transformation tasks in an AWS data lake (Amazon S3, AWS Glue, Amazon Athena, Amazon EMR). Which THREE statements are INCORRECT or unsafe guidance?
Options:
A. Use Python in AWS Glue to transform data and call AWS APIs with boto3.
B. EMR Serverless for Spark supports only PySpark, not Scala or Java.
C. Use Scala or Java for Spark ETL needing typed APIs or custom Spark code.
D. Use Athena SQL to call REST APIs during a query to enrich each row.
E. Use Bash for complex nested JSON transforms and schema evolution at scale.
F. Use SQL in Athena to filter and aggregate partitioned Parquet in S3.
Correct answers: B, D and E
Explanation: Choose languages based on what the managed service actually supports and what the task requires. Bash is primarily for orchestration and shell-level automation, not large-scale structured transformations. Athena SQL runs inside the query engine and does not enrich rows by calling arbitrary REST APIs. Spark on EMR Serverless is not limited to PySpark-only workloads.
A good rule is to match the language to the execution engine and the kind of work being done. SQL is the right tool when you’re using a SQL engine (Athena/Redshift) for set-based transforms and analytics. Python is common for AWS-native ETL and integration work (Glue scripts, lightweight transforms, SDK calls). Scala/Java are appropriate when you need deeper Spark control, typed APIs, or existing JVM libraries.
The unsafe guidance is:
- claiming EMR Serverless Spark supports only PySpark, when it also runs Scala and Java Spark workloads,
- expecting Athena SQL to call REST APIs per row during a query to enrich results, and
- using Bash (for example, with jq) for complex nested JSON transforms and schema evolution at scale.
The key takeaway is to align the language with the service runtime and transformation needs.
Topic: Data Ingestion and Transformation
You are orchestrating a serverless data ingestion pipeline with AWS Step Functions that invokes AWS Lambda functions and starts AWS Glue jobs.
Which TWO statements are false or unsafe design assumptions for this workflow?
Options:
A. Express Workflows are exactly-once, so idempotency and deduplication are unnecessary
B. Step Functions supports configuring a DLQ on the state machine for failed executions
C. Use Retry/Catch with backoff to handle transient task failures
D. Prefer Glue/other service integrations for work that exceeds Lambda runtime limits
E. Tasks should be idempotent because retries can run the same step multiple times
F. Lambda reserved concurrency can cap parallelism and cause throttling when exceeded
Correct answers: A and B
Explanation: Step Functions does not provide a dead-letter queue that you attach to a state machine; you must explicitly model failure handling and routing. Also, Express Workflows can deliver at-least-once execution, so duplicate task attempts are possible and designs that skip idempotency are unsafe. The other statements describe common, recommended patterns for concurrency control and failure handling.
The key concepts are explicit failure handling in Step Functions, concurrency controls in Lambda, and at-least-once behaviors that require idempotent processing.
- Model failure handling explicitly: use Retry with backoff for transient errors and Catch to route failures (for example, to SQS/SNS/EventBridge), and alarm on execution failures.
- Make tasks idempotent, because retries and at-least-once delivery can run the same step more than once.
- Prefer Glue or other service integrations for work that exceeds Lambda runtime limits, and size Lambda reserved concurrency deliberately because it caps parallelism and causes throttling when exceeded.
The unsafe assumptions are the ones that expect built-in DLQ attachment and exactly-once behavior in Express Workflows.
Topic: Data Store Management
A team stores lakehouse tables (for example, Apache Iceberg, Delta Lake, or Apache Hudi) in Amazon S3 and wants to manage table lifecycle costs and performance.
Which THREE statements are true about tiering/compaction and managing snapshots and obsolete files for these tables?
Options:
A. Expiring old snapshots can enable deletion of unreferenced data files
B. Compaction rewrites many small files into fewer larger files
C. Deleting snapshot metadata immediately deletes all snapshot data files
D. Compaction removes the ability to time travel to recent versions
E. Deleting S3 data files outside the table can cause inconsistency
F. S3 Glacier tiering is a safe substitute for snapshot expiration
Correct answers: A, B and E
Explanation: Lakehouse tables rely on metadata (snapshots/commits) that reference immutable data files in S3. Compaction is a maintenance operation that reduces small files for better performance, while snapshot/retention management controls how much history is kept and which unreferenced files can be safely removed. Directly manipulating underlying S3 objects can break metadata consistency.
The core idea is that lakehouse tables separate table metadata (snapshots/commits and manifests) from the underlying data files in S3. Compaction improves performance by rewriting many small files into fewer larger files, but it should preserve logical table contents and snapshot semantics.
Snapshot expiration (or retention cleanup) reduces how much historical metadata is kept for time travel and can make older data files eligible for deletion only when they are no longer referenced by any retained snapshot. Because the metadata is the source of truth, deleting underlying S3 objects “out of band” (outside the table maintenance process) can leave the metadata pointing to missing files and cause failures or incorrect reads. The key takeaway is to use table-aware maintenance for compaction and cleanup instead of treating the table like unmanaged S3 folders.
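An Iceberg-flavored sketch of table-aware maintenance (catalog and table names are invented; maintenance procedures differ slightly between table formats):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "glue_catalog".
spark = SparkSession.builder.getOrCreate()

# Compaction: rewrite many small data files into fewer, larger ones.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'analytics.events')")

# Retention: expire older snapshots; unreferenced data files then become eligible for deletion.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2026-02-18 00:00:00'
    )
""")
```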
Topic: Data Operations and Support
A data engineering team lands daily vendor CSV files in an Amazon S3 raw/ prefix. The files frequently have inconsistent column names, leading/trailing spaces, and occasional missing required fields. The team wants an automated, low-code step that prepares the data for downstream transformation (for example, an AWS Glue ETL job) by standardizing the schema and isolating bad records, while publishing operational signals for monitoring.
Which TWO actions should the team take?
Options:
A. Use AWS Lake Formation permissions to prevent files with missing fields from being written to S3
B. Use an S3-triggered AWS Lambda function to parse and clean each CSV before writing to S3
C. Run a DataBrew data quality job to validate required fields and publish results/metrics for monitoring
D. Use Athena CTAS to overwrite the raw S3 prefix with cleaned CSV files
E. Configure an AWS Glue crawler to correct invalid values while inferring the schema
F. Create a Glue DataBrew recipe and run a scheduled recipe job to write standardized output to S3
Correct answers: C and F
Explanation: AWS Glue DataBrew is designed for low-code data preparation before transformation, using reusable recipes and automated jobs. A DataBrew recipe job standardizes columns and formats into a prepared zone in S3. A DataBrew data quality job validates required fields and produces results that can be monitored and used to drive remediation workflows.
The core idea is to introduce an automated data-prep layer before downstream ETL by using managed, low-code preparation capabilities. Glue DataBrew lets you define repeatable preparation logic (a recipe) to normalize column names, trim whitespace, cast data types, and output to a curated/prepared S3 prefix (often as Parquet for efficient downstream processing). To operationalize correctness, run a DataBrew data quality job (with rules for required fields and basic validity checks) and publish job outcomes/metrics so operations can monitor failures and route bad rows/files to a quarantine prefix for remediation.
This approach prepares data reliably without custom parsing code and integrates cleanly into scheduled or orchestrated processing.
Topic: Data Ingestion and Transformation
A company runs a nightly AWS Glue batch job that writes curated Parquet files to Amazon S3 for querying in Amazon Athena. The curated dataset contains 30 days of data, averaging 40 GB per day (assume 1 TB = 1,000 GB). Analysts run 50 Athena queries per day, and each query filters on exactly one event_date.
Currently, the data is written to a single S3 prefix (no partitions), so each query scans all 1.2 TB. Athena costs USD 5 per TB scanned. The Glue job is sometimes re-run for the same day, so the write must be idempotent and the load should be incremental.
Which ingestion configuration best meets these requirements, and what will the Athena query cost be per 30-day month? Round to the nearest dollar.
Options:
A. Partition by device_type, use bookmarks, overwrite partitions; USD 9,000/month
B. Write to one prefix, infer schema each run, append-only; USD 9,000/month
C. Write to event_date/hour partitions, append new files only; USD 300/month
D. Write Parquet to event_date=YYYY-MM-DD/, use bookmarks, overwrite partitions; USD 300/month
Best answer: D
Explanation: Partitioning the curated S3 layout by event_date lets Athena prune data to only the day being queried, reducing bytes scanned per query from 1.2 TB to 40 GB. With 50 queries per day over 30 days, monthly scanned TB becomes 60 TB, and at USD 5 per TB the monthly cost is USD 300. Using Glue job bookmarks enables incremental loads, and overwriting the target partition makes reruns idempotent.
The core idea is to align S3 partitioning with the most common query predicate (event_date) so Athena can do partition pruning, while configuring the batch job for incremental processing and idempotent writes.
Cost calculation (30-day month):
- With event_date partitioning, each query scans one day: 40 GB = 0.04 TB.
- Queries per month: 50 queries/day × 30 days = 1,500 queries.
- Data scanned per month: 1,500 × 0.04 TB = 60 TB.
- Monthly cost: 60 TB × USD 5/TB = USD 300.
Using Glue job bookmarks (or an equivalent high-water mark) supports incremental loads, and overwriting the event_date partition prevents duplicate data when a day is reprocessed.
- Partitioning by a column that rarely appears in the filter (such as device_type) won’t prune scans for date-filtered queries.
Topic: Data Ingestion and Transformation
A data engineer wants to reduce the cost of running transformation queries in Amazon Athena by avoiding unnecessary data scans. What factor does Athena primarily use to determine the cost of a query?
Options:
A. The amount of data scanned by the query
B. The total query runtime in seconds
C. The number of concurrent queries submitted
D. The number of rows returned in the result set
Best answer: A
Explanation: Amazon Athena query costs are primarily driven by how much data the query scans. Techniques like partitioning, column selection, and using columnar compressed formats reduce bytes scanned. Lower scanned bytes typically translates directly to lower Athena cost.
The key cost-optimization lever for Athena is reducing the amount of data scanned, because Athena pricing is based primarily on bytes scanned per query. To avoid unnecessary scans during transformations, structure data so queries can read less data (for example, partition by common filters such as date, select only required columns, and store data in columnar compressed formats like Parquet or ORC). These approaches reduce the bytes read from Amazon S3, which is what Athena uses to calculate query cost. In contrast, query runtime, rows returned, and concurrency are not the primary billing unit for Athena queries.
Topic: Data Store Management
A data engineering team is selecting between an Amazon Redshift provisioned cluster and Amazon Redshift Serverless for a new analytics platform. The platform runs interactive dashboards during business hours, sporadic ad hoc SQL from analysts, and occasional end-of-month reporting spikes.
Which THREE requirements best support choosing Amazon Redshift Serverless for this workload? (Select THREE.)
Options:
A. Require fixed monthly cost with pre-purchased capacity
B. Unpredictable spikes with long idle periods
C. Minimize capacity planning and cluster administration
D. Pay mainly for intermittent, ad hoc query usage
E. Run steady 24/7 ETL and reporting at constant load
F. Need manual WLM queue tuning for deterministic performance
Correct answers: B, C and D
Explanation: Amazon Redshift Serverless is designed for workloads with variable or unpredictable demand where automatic scaling and reduced administration are priorities. It is typically a better fit when usage is intermittent (including long idle periods) and teams want a pay-for-use model instead of managing and sizing a cluster.
The core decision is whether the workload is variable enough to benefit from on-demand scaling and usage-based pricing (Serverless), or steady enough to justify owning fixed capacity (provisioned).
If the workload is consistently busy and you want tighter cost/performance control, provisioned is usually the better match.
Topic: Data Store Management
A data lake on Amazon S3 is queried with Amazon Athena. The curated dataset currently uses daily partitions and is queried with a rolling 30-day window.
Requirements:
- Schema evolution is expected: new columns, renamed fields, and some type changes (for example, string to int). The team wants to minimize pipeline breakage.
Usage and pricing (assume constant for the month):
- Curated volume: 10 GB per day, queried over the rolling 30-day window.
- Query pattern: 3 queries per day, each scanning the full window.
- Athena pricing: USD 5 per TB scanned (1 TB = 1,000 GB).
- Cost target: monthly Athena spend under USD 200.
Which solution meets the schema-evolution requirement while also meeting the monthly Athena cost target?
Options:
A. Store curated data as JSON and rely on Athena schema-on-read
B. Store curated data as CSV and reject schema drift in Glue ETL
C. Use an Athena-managed Apache Iceberg table in Parquet
D. Use a partitioned Parquet Hive table with a daily Glue crawler
Best answer: C
Explanation: Apache Iceberg is designed for table-level schema evolution, including adding columns, renaming fields, and certain type changes, which reduces downstream breakage. With Parquet at 10 GB/day, the monthly Athena scan cost stays under the USD 200 target given the stated query pattern.
To minimize pipeline breakage during schema evolution (new columns, renamed fields, type changes), a table format that tracks schema and supports evolution operations is needed; Apache Iceberg provides this while still using Parquet files for efficient scans.
Cost check using the given query pattern and units:
\[ \begin{aligned} \text{GB scanned/query} &= 30\ \text{days} \times 10\ \text{GB/day} = 300\ \text{GB} \\ \text{TB/query} &= 300/1000 = 0.3\ \text{TB} \\ \text{Monthly TB} &= 0.3\ \text{TB} \times (3\ \text{queries/day} \times 30\ \text{days}) = 27\ \text{TB} \\ \text{Monthly cost} &= 27\ \text{TB} \times \text{USD } 5/\text{TB} = \text{USD } 135 \end{aligned} \]
This stays under USD 200, and Iceberg’s schema evolution avoids brittle “break on rename/type change” behavior common with simple Hive-style tables.
Topic: Data Operations and Support
A near-real-time analytics pipeline in us-east-1 ingests clickstream events (Kinesis Data Firehose -> Amazon S3 raw), transforms them (AWS Glue streaming job -> Amazon S3 curated), and serves queries (Amazon Athena). The business SLA is curated data must be queryable within 15 minutes of event time, and on-call must be paged within 5 minutes when the SLA is at risk. Today, operators mostly review Glue logs after users report stale dashboards.
Which change is the best way to improve operability and SLA regression detection with minimal added cost and no new third-party tooling?
Options:
A. Increase Glue job workers to reduce processing time variability
B. Rely on CloudWatch Logs Insights queries over Glue logs every 5 minutes
C. Publish stage-level custom CloudWatch metrics and build a CloudWatch dashboard with alarms
D. Enable CloudTrail S3 data events on the raw and curated buckets
Best answer: C
Explanation: Create an SLA-focused health dashboard by turning pipeline signals into CloudWatch metrics, then alerting on those metrics. Use managed service metrics (Glue failures/duration) plus custom metrics (end-to-end freshness/lag and record-count deltas) to detect regressions before consumers notice. The tradeoff is a small amount of engineering to emit metrics and a small ongoing CloudWatch custom metric cost.
High-level pipeline health dashboards work best when they track a few SLA-oriented metrics (freshness/lag, success/failure, and throughput) and drive alarms from those metrics rather than from ad hoc log searching. In this pipeline, you can use native CloudWatch metrics for Glue (job run state, duration) and publish custom metrics at key points (Firehose delivery delay, curated S3 watermark time, and optionally row-count/late-record rates) from the Glue job or a small Lambda triggered by EventBridge/S3. Then create a CloudWatch dashboard and CloudWatch alarms that page via SNS when freshness approaches or breaches 15 minutes.
The key tradeoff is paying for custom metrics (and maintaining the metric emission), but it provides reliable, low-latency detection and a single operational view of the SLA.
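A small sketch of the custom-metric half (namespace, dimension, and the watermark computation are illustrative):

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

# Epoch seconds of the newest event_time visible in the curated S3 prefix,
# computed by the Glue job or a small watermark Lambda (placeholder value here).
curated_watermark = 1_772_000_000

cloudwatch.put_metric_data(
    Namespace="DataPipeline/Clickstream",          # hypothetical namespace
    MetricData=[{
        "MetricName": "CuratedFreshnessSeconds",
        "Dimensions": [{"Name": "Stage", "Value": "curated"}],
        "Value": time.time() - curated_watermark,
        "Unit": "Seconds",
    }],
)
# A CloudWatch alarm on this metric (threshold near 900 seconds) pages via SNS
# before the 15-minute SLA is breached.
```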
Topic: Data Security and Governance
You are troubleshooting common authorization failures in AWS data engineering workloads (for example, Athena/Glue/Redshift Spectrum reading data in Amazon S3, AWS Lake Formation-governed tables, and cross-account access with IAM roles). Which THREE statements describe correct high-level causes and fixes for these authorization failures?
Options:
A. AssumeRole failures are commonly fixed by correcting the target role trust policy to allow the caller
B. If S3 access is allowed, Lake Formation permissions are automatically bypassed for Athena
C. To fix sts:AssumeRole AccessDenied, add sts:AssumeRole to the target role permissions policy
D. An S3 AccessDenied can be caused by bucket policy, SCP, or a permissions boundary overriding an IAM allow
E. Lake Formation permission errors require Lake Formation grants to the querying role or principal
F. For SSE-KMS objects in S3, s3:GetObject alone is sufficient because S3 decrypts without KMS permissions
Correct answers: A, D and E
Explanation: Authorization failures typically come from the specific control plane enforcing access: Lake Formation grants for governed data, multiple IAM/S3 policy layers for object access, and role trust policies for STS role assumption. The best fixes align to the layer producing the denial, rather than adding unrelated permissions. Correct troubleshooting starts by identifying the service error and then adjusting the matching permission mechanism.
Use the error message to identify which authorization system is denying the request, then apply the matching fix.
| Statement | OK/NO | Brief fix / why |
|---|---|---|
| Lake Formation permission errors require LF grants | OK | Grant LF permissions (for example, SELECT on DB/table) to the principal used, and ensure the role can call lakeformation:GetDataAccess. |
| S3 AccessDenied can be caused by bucket policy/SCP/permissions boundary | OK | Check for missing allows or explicit denies across IAM policy, bucket policy, permissions boundary, and SCPs. |
| AssumeRole failures are fixed by trust policy changes | OK | Update the target role trust policy to trust the caller (and satisfy any conditions like external ID). |
| S3 allow bypasses Lake Formation for Athena | NO | Lake Formation governance is enforced independently of S3 object permissions when enabled. |
| Add sts:AssumeRole to the target role permissions policy | NO | The target role’s trust policy controls who can assume it; identity permissions alone don’t grant trust. |
| s3:GetObject is enough for SSE-KMS objects | NO | The caller also needs KMS permissions (and key policy access) to decrypt. |
Key takeaway: fix the specific layer that produced the denial (LF grants, policy evaluation layers, or trust policy).
Topic: Data Store Management
A data lake in Amazon S3 is queried with Amazon Athena through the AWS Glue Data Catalog. Pipeline today:
- New raw files land in s3://lake/raw/ every minute.
- A transformation job writes curated JSON to s3://lake/curated/events/.
- The curated prefix is partitioned by year/month/day/hour and contains ~50,000–100,000 small JSON files per hour (each ~1–5 MB).
- Athena queries filter on event_date (last 7 days) and region, and aggregate by hour.
The team needs to reduce Athena cost and improve query reliability (fewer timeouts) without increasing end-to-end data availability beyond 15 minutes.
Which change is the best improvement to the curated table layout?
Options:
A. Partition by user_id to maximize pruning for selective queries
B. Keep JSON and add minute-level partitions to reduce scanned data
C. Write Parquet and compact into ~256 MB files; partition by day and region
D. Write one file per day per region to minimize file count
Best answer: C
Explanation: Converting the curated dataset to columnar Parquet reduces bytes scanned and improves CPU efficiency in Athena. Compaction to fewer, larger files cuts S3/listing and Glue/Athena planning overhead caused by many small objects. Partitioning by day and region matches the common filters, so pruning remains effective while meeting the 15-minute freshness constraint with frequent compaction.
The core optimization is balancing partition pruning with file sizing. With Athena, too many small files increase query planning time and metadata/listing overhead, and JSON increases scan cost because it is row-based and not efficiently splittable for column pruning.
A good layout for the described access patterns is:
- Convert the curated data to Parquet and compact it into files of roughly 256 MB.
- Partition by the columns the queries actually filter on (event_date by day, and region).
This reduces both “too many objects” overhead and scanned bytes, while still allowing Athena to skip entire partitions for date/region filters. The tradeoff is added ETL/compaction work and slightly more operational complexity.
- High-cardinality partition keys such as user_id typically create an explosion of partitions and small files, increasing overhead and often worsening performance.
Topic: Data Security and Governance
Select TWO true statements about AWS KMS key rotation and access auditing for KMS keys used to encrypt data platform resources (for example, Amazon S3 SSE-KMS, AWS Glue, Amazon Redshift).
Options:
A. Best risk reduction is allowing all IAM principals to use the key.
B. Automatic rotation can be enabled on AWS managed keys.
C. Key rotation changes the key ARN, requiring S3 object re-encryption.
D. Enable automatic rotation for symmetric customer managed keys annually.
E. CloudTrail does not record failed KMS Decrypt attempts.
F. CloudTrail logs KMS API calls like Decrypt and GenerateDataKey.
Correct answers: D and F
Explanation: KMS key usage can be audited by reviewing AWS CloudTrail events for KMS API calls, which show which principal invoked cryptographic operations. For symmetric customer managed keys, enabling automatic rotation reduces long-term exposure by periodically rotating key material without requiring applications to update the key ARN.
The core ideas are (1) rotate key material to limit blast radius over time and (2) audit every use of the key. For symmetric customer managed KMS keys, you can enable automatic rotation so AWS creates new backing key material on a schedule while preserving the same key ARN and alias, keeping integrations stable. For auditing, AWS CloudTrail records KMS events (for example, Decrypt, Encrypt, GenerateDataKey, CreateGrant) so you can investigate which IAM role/user accessed a key and from where.
To reduce misuse risk at a high level, combine rotation and auditing with least privilege on the key policy/IAM policies, and add guardrails such as kms:ViaService, kms:CallerAccount, and encryption context conditions to scope how/where the key can be used. The key takeaway is to limit who can call KMS APIs and continuously monitor those calls.
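For the auditing half, a query along the following lines, run with Athena against a trail delivered to S3, surfaces which principals used a key. It assumes a cloudtrail_logs table created with the standard CloudTrail table definition for Athena; the date filter is illustrative.

SELECT useridentity.arn AS principal,
       eventname,
       errorcode,
       count(*) AS calls
FROM cloudtrail_logs                          -- assumed table over the trail's S3 prefix
WHERE eventsource = 'kms.amazonaws.com'
  AND eventname IN ('Decrypt', 'GenerateDataKey')
  AND eventtime > '2025-01-01T00:00:00Z'      -- narrow the window to keep scans cheap
GROUP BY useridentity.arn, eventname, errorcode
ORDER BY calls DESC;

Denied calls appear with an errorcode such as AccessDenied, which is often the first signal of a misconfigured key policy or of probing against the key.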
Topic: Data Ingestion and Transformation
A team is designing a replay strategy for a streaming ingestion pipeline (for example, Amazon MSK or Amazon Kinesis Data Streams) where consumers may need to reprocess past events after code fixes or downstream outages.
Which statement is NOT correct about designing for replayability?
Options:
A. Track consumer progress with offsets and be able to reset them
B. Make downstream writes idempotent to tolerate duplicates during replay
C. Set stream/topic retention longer than the maximum replay window
D. Commit offsets before processing to maximize throughput and still replay safely
Best answer: D
Explanation: Replayability depends on (1) keeping the source data long enough to reread it and (2) managing offsets so consumers can resume or rewind deterministically. Committing offsets before processing breaks this because a failure after the commit can cause the consumer group to advance past unprocessed events, making recovery and correct reprocessing unreliable.
A replay strategy for streaming ingestion is built on the combination of retention, offsets, and safe reprocessing. Retention (topic/stream) must cover the longest period you might need to reprocess; once events age out, replay is impossible. Offsets (or equivalent sequence tracking) represent consumer progress and must be durable and resettable so you can rewind to a known point for reprocessing.
To keep replay safe:
- Set topic/stream retention longer than the maximum replay window you expect to need.
- Commit offsets only after records are successfully processed, and keep the ability to reset them to a known position.
- Make downstream writes idempotent (for example, dedupe or upsert on a stable event key) so replayed duplicates do not corrupt results.
The key takeaway is that retention enables rereads, but correct offset management and idempotent processing make those rereads trustworthy.
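The idempotency point is the easiest to sketch in SQL. In engines that support MERGE (Amazon Redshift, for example), an upsert keyed on the event ID makes a replayed batch converge to the same result as the original run; the table and column names here are hypothetical.

MERGE INTO curated_events
USING staged_events s
ON curated_events.event_id = s.event_id                    -- stable key carried on every event
WHEN MATCHED THEN
  UPDATE SET event_ts = s.event_ts, payload = s.payload    -- duplicate from a replay: overwrite, don't double-count
WHEN NOT MATCHED THEN
  INSERT (event_id, event_ts, payload)
  VALUES (s.event_id, s.event_ts, s.payload);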
Topic: Data Ingestion and Transformation
When troubleshooting slow Apache Spark stages (for example, on AWS Glue or Amazon EMR), which TWO statements are FALSE/unsafe? (Select TWO.)
Options:
A. High shuffle spill may improve with more memory or fewer cores
B. Repartitioning to very high counts always improves performance
C. Raising shuffle partitions can help when tasks are too large
D. Many small files on S3 can bottleneck I/O; reduce file counts
E. Skewed join keys can create straggler tasks; mitigate skew
F. Broadcast the larger join input to avoid shuffle
Correct answers: B and F
Explanation: Two statements are unsafe because they claim universal performance improvements from actions that often backfire. Broadcasting a large dataset can cause executor OOM and instability, and blindly repartitioning to extremely high partition counts increases shuffle, task scheduling overhead, and S3 I/O amplification. Effective debugging aligns shuffle volume, partition sizing, and parallelism with cluster resources and data characteristics (especially key skew).
Spark performance issues commonly come from expensive shuffles, skewed partitions (a few “hot” keys), and I/O bottlenecks (especially many small files on S3). Broadcasting is meant for the small side of a join; broadcasting a large input is unsafe because it pushes that dataset into executor memory and can trigger OOM or heavy GC.
Similarly, repartitioning does not “always” help: too many partitions creates lots of tiny tasks, increases shuffle metadata, and amplifies read/write overhead. A better approach is to size partitions so each task does a reasonable amount of work, then tune parallelism (for example spark.sql.shuffle.partitions) to match available cores.
The other statements are generally valid: skew mitigation reduces stragglers, spill often indicates memory pressure (or too much concurrency per executor), and reducing small files/using columnar formats helps relieve S3 I/O pressure.
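As a small illustration of the safe versions of those levers, the Spark SQL below sizes shuffle parallelism explicitly and broadcasts only the small dimension side of a join; the tables and the partition count are hypothetical and would be tuned to the cluster.

-- Run in a Glue/EMR Spark SQL session; 400 is an example value, not a recommendation.
SET spark.sql.shuffle.partitions = 400;

SELECT /*+ BROADCAST(d) */                 -- broadcast the SMALL input, never the large fact table
       f.event_id,
       f.event_ts,
       d.region_name
FROM fact_events f
JOIN dim_region d
  ON f.region_id = d.region_id;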
Topic: Data Operations and Support
When using AWS CloudTrail to audit and investigate incidents in an AWS data platform, which THREE statements are false or unsafe?
Options:
A. Store trail logs in Amazon S3 with SSE-KMS and restricted access.
B. Query CloudTrail logs with Athena or CloudTrail Lake during incidents.
C. S3 object-level access is logged by default in CloudTrail.
D. Send CloudTrail events to CloudWatch Logs for near-real-time alerting.
E. CloudTrail Event history provides indefinite retention for investigations.
F. A single-Region trail captures API activity in every AWS Region.
Correct answers: C, E and F
Explanation: CloudTrail is used to audit AWS API activity, but it must be configured correctly for investigation workflows. Event history is not meant for long-term retention, object-level data events (such as S3 objects) are not logged by default, and Region coverage depends on using multi-Region trails or deploying trails per Region.
CloudTrail records AWS API activity primarily through management events (for example, IAM, AWS Glue job updates, Redshift cluster changes). For incident investigations, you typically need durable retention and searchable logs, which means creating a trail that delivers to an S3 bucket (often encrypted with SSE-KMS and tightly permissioned) or using CloudTrail Lake for longer-term storage and querying.
Some high-volume activity is not captured unless explicitly enabled as data events (for example, S3 object-level reads/writes). Also, CloudTrail is Region-aware: to see activity across Regions in one place, configure a multi-Region trail (or collect logs from multiple Region trails). A common operations pattern is streaming CloudTrail events to CloudWatch Logs/EventBridge to alert on sensitive API calls while retaining logs for later forensics.
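If the account uses CloudTrail Lake, incident queries are plain SQL against the event data store. A sketch like the one below lists recent S3 object-level activity, assuming data events are being collected; the event data store ID is a placeholder.

SELECT eventTime,
       eventName,
       userIdentity.arn AS principal,
       errorCode
FROM <event_data_store_id>                  -- placeholder for the CloudTrail Lake event data store ID
WHERE eventSource = 's3.amazonaws.com'
  AND eventName IN ('GetObject', 'PutObject')
  AND eventTime > '2025-01-01 00:00:00'
ORDER BY eventTime DESC
LIMIT 100;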
Topic: Data Operations and Support
A data engineering team stores daily customer interaction data (Parquet) in Amazon S3. A new upstream release introduced data quality issues: duplicate event_id values and invalid event_timestamp formats.
Requirements:
- The data contains PII and must not leave AWS-managed services.
- Cleaned output must be written back to Amazon S3 encrypted with SSE-KMS.
- Compute should run on demand as data arrives; always-on infrastructure should be avoided to control cost.
Which TWO approaches should the team AVOID for verifying and cleaning the data? (Select TWO.)
Options:
A. Use an AWS Lambda function to validate schema/timestamp format on new objects and quarantine bad files to an S3 prefix, emitting CloudWatch metrics
B. Run Amazon Athena queries to identify invalid timestamps, then use CTAS to write cleaned Parquet to an SSE-KMS curated prefix
C. Download the raw S3 objects to a laptop, clean with local Python, and re-upload
D. Use SageMaker Studio Data Wrangler to define cleaning steps on a sample via Athena and run a processing job that writes cleaned SSE-KMS output to S3
E. Create an always-on Amazon EMR cluster to run Spark cleaning jobs continuously, regardless of data arrival
F. Use AWS Glue DataBrew to profile the dataset and apply a recipe that standardizes timestamps and removes duplicates, writing SSE-KMS output to S3
Correct answers: C and E
Explanation: Approaches that export PII outside AWS-managed services or rely on always-on compute violate the explicit security and cost requirements. Serverless/on-demand AWS services can both verify data quality (duplicates, format validity) and produce an encrypted, auditable cleaned dataset in S3.
Match the remediation to the issue type while honoring explicit constraints. For duplicates and invalid formats, common patterns are to (1) validate and quantify issues, (2) quarantine or rewrite bad records, and (3) write a cleaned dataset to a curated S3 location.
Athena, Glue DataBrew, Lambda, and Data Wrangler can all perform verification/cleaning within AWS and write results back to S3 encrypted with SSE-KMS. In contrast, exporting raw data to a local workstation breaks PII handling controls, and always-on clusters are an unnecessary cost when on-demand/serverless options meet the need. Key takeaway: keep PII processing in AWS-managed services and prefer on-demand/serverless execution for routine data quality tasks.
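As one hedged example of the verification step, the Athena queries below quantify the two reported issues before any cleaning job runs; the database, table, and column names are assumptions for illustration.

-- Duplicate event_id values
SELECT event_id, count(*) AS copies
FROM curated_db.interactions
GROUP BY event_id
HAVING count(*) > 1;

-- Rows whose event_timestamp does not parse as ISO-8601
SELECT count(*) AS invalid_timestamps
FROM curated_db.interactions
WHERE TRY(from_iso8601_timestamp(event_timestamp)) IS NULL;

A CTAS over the validated rows (deduplicated, with parseable timestamps) can then write the cleaned Parquet to the curated prefix, as the Athena-based approach describes.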
Topic: Data Operations and Support
A company uses Amazon Athena to query curated clickstream data in Amazon S3. The AWS Glue Data Catalog table is partitioned by dt (format YYYY-MM-DD). Analysts run hourly dashboards that must return exactly correct results and the data must remain protected by existing AWS Lake Formation permissions.
Which TWO actions should you AVOID when trying to reduce Athena query cost and improve performance by limiting scanned data?
Options:
A. Convert curated data to Parquet with Snappy compression
B. Use Athena CTAS to create a smaller, partitioned dashboard table
C. Add TABLESAMPLE to queries to scan fewer bytes
D. Increase partition granularity (dt/hour) for common filters
E. Add dt predicates and select only required columns
F. Grant broad S3 read access to bypass Lake Formation
Correct answers: C and F
Explanation: Avoid approaches that reduce scanned data by changing the data returned or by weakening governance controls. Using sampling can make dashboards incorrect because it intentionally returns only a fraction of rows. Bypassing Lake Formation to “make queries work” breaks the requirement to preserve existing permissions, even if it might speed access.
Athena cost and performance are primarily driven by bytes scanned. To reduce scans safely, use partition pruning (predicates directly on partition columns like dt), select only the needed columns (avoid SELECT *), and store data in compressed columnar formats such as Parquet to enable column projection and efficient reads.
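A minimal before/after sketch of those safe techniques, with an assumed table and columns:

-- Scans every partition and every column:
-- SELECT * FROM clickstream.curated_events;

-- Prunes partitions on dt and projects only the needed columns:
SELECT dt, page_id, count(*) AS views
FROM clickstream.curated_events
WHERE dt BETWEEN '2025-01-01' AND '2025-01-07'
GROUP BY dt, page_id;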
Actions to avoid fall into two categories under the stated requirements:
- Sampling techniques such as TABLESAMPLE intentionally return incomplete data, which violates “exactly correct results” even if they reduce scanned bytes.
- Weakening governance (for example, granting broad S3 read access to route around Lake Formation) breaks the requirement that existing permissions remain in force.

The safest optimizations reduce scanned data without changing semantics or loosening access controls.
- TABLESAMPLE lowers scan volume by reading fewer rows, so dashboards can be wrong.
- Adding dt (and possibly hour) predicates is an appropriate way to limit scanned partitions.

Topic: Data Store Management
A data team is trying to speed up repeated dashboard queries in Amazon Redshift by creating a materialized view.
Exhibit: Redshift SQL and error
CREATE MATERIALIZED VIEW analytics.mv_orders_enriched AS
SELECT o.order_id, o.order_ts, c.segment
FROM spectrum_ext.orders o
JOIN public.customers c ON o.customer_id = c.customer_id;
-- ERROR: Materialized views are not supported for external tables
Based on the exhibit, what is the best next step to achieve materialized-view-style performance for this dataset?
Options:
A. Replace the materialized view with a standard Redshift view over the external table
B. Use Redshift federated queries to read the S3 data and then create the materialized view
C. Create the same materialized view in Amazon Athena over the S3 table
D. Load the external data into a Redshift table, then create the materialized view
Best answer: D
Explanation: The exhibit shows Redshift rejecting the statement because the materialized view references a Spectrum external table (spectrum_ext.orders). To get materialized-view benefits, the referenced data must be stored in Redshift tables, which can then be materialized and refreshed to serve dashboards quickly.
This is a Redshift Spectrum vs. Redshift materialized view compatibility issue. In the exhibit, the FROM spectrum_ext.orders line indicates the query reads an external table, and the final line explicitly states: ERROR: Materialized views are not supported for external tables. That means you cannot use a Redshift materialized view to precompute results that depend on Spectrum external tables.
To achieve MV-like performance, first ingest the needed S3 data into Redshift (for example, using COPY into a native table), then build and refresh the materialized view on Redshift-managed tables. The key takeaway is that Spectrum is for querying S3 in place, while Redshift materialized views require Redshift-resident base data.
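A minimal sketch of that approach, under the assumption that a native table with matching columns already exists; the bucket path and IAM role are placeholders.

-- 1) Land the external data in a Redshift-managed table (created beforehand with matching columns).
COPY analytics.orders_local
FROM 's3://<bucket>/orders/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
FORMAT AS PARQUET;

-- 2) Build the materialized view on Redshift-resident tables only.
CREATE MATERIALIZED VIEW analytics.mv_orders_enriched AS
SELECT o.order_id, o.order_ts, c.segment
FROM analytics.orders_local o
JOIN public.customers c ON o.customer_id = c.customer_id;

-- 3) Refresh on a schedule so dashboards stay current.
REFRESH MATERIALIZED VIEW analytics.mv_orders_enriched;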
- A standard Redshift view still reads spectrum_ext.orders at query time, so it provides no precomputation benefit.
- Redshift materialized views cannot reference external tables such as spectrum_ext.*.

Use the AWS DEA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon for concept review before another timed run.