Try 65 free AWS DEA-C01 questions across the exam domains, with explanations, then continue with full IT Mastery practice.
This free full-length AWS DEA-C01 practice exam includes 65 original IT Mastery questions across the exam domains.
These questions are for self-assessment. They are not official exam questions and do not imply affiliation with the exam sponsor.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Need concept review first? Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks and full IT Mastery practice.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Domain | Weight |
|---|---|
| Data Ingestion and Transformation | 34% |
| Data Store Management | 26% |
| Data Operations and Support | 22% |
| Data Security and Governance | 18% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.
Topic: Data Operations and Support
A data team is defining operational SLIs/SLOs and monitoring signals for an hourly batch pipeline that lands curated data in Amazon S3 and is queried through Amazon Athena.
Which TWO statements are FALSE/unsafe for this purpose?
Options:
A. Track completeness via expected partitions/control totals vs actual.
B. Measure freshness by now - max(event_time) in curated data.
C. Use ETL job duration alone as a freshness SLI.
D. Define latency SLI as event time to Athena queryability.
E. Set SLOs (e.g., 99%) and alarm on sustained breaches.
F. Treat a successful job run as sufficient completeness evidence.
Correct answers: C and F
Explanation: Freshness and completeness SLIs must be tied to the data produced (timestamps, partitions, control totals), not just whether code ran quickly or exited successfully. Job-duration and job-success signals are useful operational metrics, but they do not, by themselves, prove data is current or complete. Good SLOs set explicit targets on these SLIs and drive alerts from measured breaches.
An SLI measures an observable property of the data pipeline output, while an SLO sets a target for that SLI (for example, a percentile target over time). For data pipelines, freshness and latency are usually anchored to event-time or partition-time, and completeness is anchored to expected volumes or expected partitions.
Measuring now - max(event_time) directly reflects how current the data is.
Key takeaway: prioritize data-derived signals (timestamps, partitions, totals) over job-only signals for SLIs.
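For illustration only (not part of the question), here is a minimal sketch that publishes this lag as a CloudWatch metric so an SLO alarm can watch it; the namespace, metric name, and the source of max_event_time are assumptions:

```python
import datetime as dt

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_freshness_lag(max_event_time: dt.datetime) -> None:
    """Publish lag = now - max(event_time) as a data-derived freshness SLI."""
    lag_seconds = (dt.datetime.now(dt.timezone.utc) - max_event_time).total_seconds()
    cloudwatch.put_metric_data(
        Namespace="DataPipeline/Curated",           # hypothetical namespace
        MetricData=[{
            "MetricName": "FreshnessLagSeconds",    # hypothetical metric name
            "Value": lag_seconds,
            "Unit": "Seconds",
        }],
    )

# max_event_time would come from the curated data itself, for example the
# result of SELECT max(event_time) FROM curated_events run in Athena.
```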
Topic: Data Store Management
A company ingests clickstream files to Amazon S3, runs an AWS Glue ETL job to write Parquet to s3://datalake/curated/events/, and queries the curated data with Amazon Athena using the AWS Glue Data Catalog.
The curated prefix is partitioned by dt=YYYY-MM-DD/hour=HH and keeps 24 months of history. A Glue crawler is scheduled hourly on s3://datalake/curated/events/ to discover new partitions and keep the catalog current. The crawler now takes ~50 minutes to run, often overlaps the next scheduled run, and the newest hour is sometimes missing from the catalog when analysts query.
Which change should the data engineer make to improve cost and reliability without changing the ingestion/ETL logic or the S3 layout?
Options:
A. Set the crawler recrawl behavior to crawl new folders only
B. Increase the crawler capacity to reduce crawl duration
C. Disable schema updates in the crawler output settings
D. Run the crawler once per day to avoid overlapping runs
Best answer: A
Explanation: The crawler is slow because it repeatedly scans a very large historical prefix to find a small number of new hourly partitions. Configuring incremental crawling (new folders only) reduces the amount of S3 data the crawler needs to examine each run, lowering cost and making it far more likely the crawler finishes before the next schedule. The tradeoff is that changes in existing folders may require a periodic full recrawl.
AWS Glue crawlers populate the Data Catalog by listing and sampling data in the configured data store. When a crawler is pointed at a large, long-retained, partitioned S3 prefix, a full recrawl can become expensive and may not complete within the required SLA.
Using the crawler’s incremental recrawl setting (for example, crawling only new folders) aligns the crawl work with the operational goal in this scenario: discover newly arrived partition folders each hour and register them as partitions/tables in the Data Catalog. This typically reduces crawl duration and DPU-hours, and it improves reliability by preventing schedule overlap.
Key takeaway: use incremental crawler behavior for fast partition discovery, and perform an occasional full recrawl when you need to re-evaluate older data for schema changes.
Topic: Data Store Management
A data engineer must catalog and run AWS Glue ETL jobs against an Amazon RDS for PostgreSQL database that is reachable only from private subnets in a VPC. The engineer will use an AWS Glue connection.
Which statement is INCORRECT about securely designing this access?
Options:
A. If the Glue job runs in private subnets, it needs a network path (for example, NAT or VPC endpoints) to reach required AWS services.
B. The database must be placed in a public subnet with a public IP address for Glue to connect using a Glue connection.
C. Database credentials can be referenced from AWS Secrets Manager in the Glue connection to avoid hardcoding passwords.
D. A Glue connection can create ENIs in selected private subnets and use security groups to reach the database.
Best answer: B
Explanation: AWS Glue connections are designed to reach private data stores by creating elastic network interfaces (ENIs) in your VPC subnets and applying security groups. You secure access with network controls (subnets, routes, security groups) and by using managed secrets for credentials. Making the database public is not required and is generally less secure.
The core concept is that an AWS Glue connection provides VPC networking context so Glue can access private resources without exposing them publicly. When you configure a connection for a JDBC data store, Glue attaches ENIs in the specified subnets and uses the specified security groups; as long as those subnets have a route to the database and security group rules allow it, the database can remain private.
For credentials, prefer referencing AWS Secrets Manager from the connection so secrets are not embedded in job scripts. Also remember that Glue jobs running in private subnets still need connectivity to AWS control-plane and data-plane endpoints they use (commonly via NAT or VPC endpoints), otherwise jobs may fail even if the database is reachable. The key takeaway is to keep the data source private and control access with VPC networking plus least-privilege secret and IAM access.
Topic: Data Store Management
A company is implementing a business data catalog in Amazon SageMaker Catalog.
Requirements: separate Finance, Marketing, and Operations assets by business unit with clear ownership, restrict PII discovery and access to the Data Governance group, and certify curated datasets only after approval.
Which TWO actions should you AVOID? (Select TWO.)
Options:
A. Create a domain per business unit and assign domain owners and stewards using IAM Identity Center groups
B. Register raw landing-zone datasets that include unmasked PII as discoverable assets to speed up self-service discovery
C. Create one shared domain for all teams and rely only on naming conventions to separate Finance, Marketing, and Operations assets
D. Publish curated S3/Glue-based datasets as assets with business metadata (owner, glossary terms) and mark them certified only after approval
E. Configure PII assets so discovery and access are limited to the Data Governance group, while curated non-PII assets remain broadly discoverable
F. Create projects within each domain and restrict asset publishing permissions to project members
Correct answers: B and C
Explanation: SageMaker Catalog is organized around domains, projects, and assets, and it is commonly used to enable governed discovery. Actions that collapse required domain boundaries or make PII broadly discoverable directly violate the stated separation and governance constraints. The safe choices preserve domain/project isolation and apply least-privilege discovery and certification practices.
In SageMaker Catalog, a domain is the top-level business boundary, projects are the collaboration and permission boundary for publishing/managing assets, and assets represent discoverable data products (with metadata such as owner, glossary terms, and certification state). Given the requirements, you should model each business unit as its own domain, then create projects within each domain to control who can publish and manage assets. For governance, treat PII as a restricted class of assets: limit both discovery and access to the Data Governance group. Also, only promote curated, approved datasets as certified assets; raw landing-zone data (especially with PII) should not be broadly discoverable.
Key takeaway: preserve domain boundaries and restrict PII discovery/access while using certification to signal trusted assets.
Topic: Data Security and Governance
A company runs hundreds of Amazon EMR clusters that generate about 5 TB/day of application and step logs. Security requires immutable, centralized log retention for 7 years, and auditors must be able to query the last 90 days quickly with least-privilege access. The team wants a cost-effective, scalable design.
Which approach should the company AVOID for preparing these logs for audit?
Options:
A. Deliver EMR logs to S3 with SSE-KMS encryption
B. Store logs in a public-read S3 bucket for auditors
C. Apply S3 Object Lock for WORM log retention
D. Partition logs by date in S3 for Athena queries
Best answer: B
Explanation: Audit logs should be centrally stored with strong access controls because they often contain sensitive operational details. For large volumes, S3-based storage with encryption and immutable retention scales well and is cost-effective, while query performance can be maintained with partitioning and serverless analytics. Making logs broadly accessible undermines the security and governance goal even if it seems operationally convenient.
The core principle for audit logging is integrity and confidentiality: logs must be tamper-resistant and accessible only to approved identities under least privilege. At multi-terabyte-per-day scale, Amazon S3 is a common durable, low-cost log store; SSE-KMS protects data at rest, and S3 Object Lock can enforce WORM retention to support audit requirements.
To balance cost and performance for investigations, partition the logs by date (for example, dt=YYYY-MM-DD) and query the recent 90-day window with Athena.
Granting public read access to the log bucket is an audit anti-pattern because it creates uncontrolled disclosure risk and weakens governance regardless of downstream analytics choices.
Topic: Data Ingestion and Transformation
A data engineer is building a pull-based ingestion pipeline from a third-party REST API into an Amazon S3 data lake (raw zone). The API returns up to 1,000 records per call and includes a next_cursor value when more pages are available. The API enforces strict rate limits and sometimes returns HTTP 429 with a Retry-After header.
The pipeline must be restartable after failures and must avoid missing or reprocessing large ranges of data.
Which TWO actions should the data engineer take? (Select TWO.)
Options:
A. Skip checkpointing and deduplicate nightly with Amazon Athena queries on S3
B. Increase the AWS Lambda function memory size to prevent HTTP 429 throttling
C. Run an AWS Glue crawler after each page and use new partitions as the checkpoint
D. Use Amazon Kinesis Data Firehose to pull the REST API and land data in S3
E. Persist the next_cursor checkpoint after each successful page in DynamoDB
F. Retry 429/5xx responses using exponential backoff with jitter and honor Retry-After
Correct answers: E and F
Explanation: Use cursor-based pagination with a durable checkpoint so the pipeline can resume exactly where it left off after a failure. Handle throttling by honoring Retry-After and applying exponential backoff with jitter for retryable errors. Together, these patterns maximize completeness and correctness while staying within API limits.
For pull-based API ingestion, you typically need (1) a pagination strategy and (2) resilience to throttling and transient failures. With cursor pagination, the most reliable checkpoint is the last successfully processed cursor (or page token) written to a durable store (for example, DynamoDB) after each successful write to S3; on restart, the job resumes from that cursor.
When the source enforces rate limits, handle HTTP 429 and transient 5xx errors with controlled retries:
- Honor the Retry-After header when present.
- Apply exponential backoff with jitter and cap the total number of retries.
This combination prevents missed pages and minimizes duplicates without relying on expensive downstream cleanup.
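A minimal sketch of this pattern, assuming a hypothetical API endpoint, a DynamoDB checkpoint table named api_checkpoints, and an illustrative S3 landing location:

```python
import json
import random
import time

import boto3
import requests

s3 = boto3.client("s3")
checkpoints = boto3.resource("dynamodb").Table("api_checkpoints")  # hypothetical table

def write_page_to_s3(records, bucket="datalake", prefix="raw/api/"):  # hypothetical location
    key = f"{prefix}{int(time.time() * 1000)}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(records).encode())

def fetch_with_backoff(url, params, max_retries=8):
    """GET with retries: honor Retry-After on 429, otherwise back off with jitter."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.ok:
            return resp.json()
        if resp.status_code == 429 and "Retry-After" in resp.headers:
            time.sleep(float(resp.headers["Retry-After"]))
        elif resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(min(2 ** attempt, 60) + random.uniform(0, 1))
        else:
            resp.raise_for_status()
    raise RuntimeError("retries exhausted")

def ingest(api_url, pipeline_id="partner-orders"):
    item = checkpoints.get_item(Key={"pipeline_id": pipeline_id}).get("Item", {})
    cursor = item.get("next_cursor")
    while True:
        page = fetch_with_backoff(api_url, {"cursor": cursor} if cursor else {})
        write_page_to_s3(page["records"])
        cursor = page.get("next_cursor")
        if not cursor:
            break
        # Advance the checkpoint only after the page is durably written.
        checkpoints.put_item(Item={"pipeline_id": pipeline_id, "next_cursor": cursor})
```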
Honoring Retry-After and backing off reduces throttling and safely retries transient failures.
Topic: Data Store Management
A company stores 200 TB of clickstream events in Amazon S3 and queries the data using Amazon Athena with tables in the AWS Glue Data Catalog. Files are currently uncompressed JSON under a single prefix (no partitions). Most queries filter on event_date (range) and app_id (equality) and select only a few columns. The team must reduce Athena scan cost and improve query performance.
Which THREE actions follow best practices for indexing, partitioning, and compression in this scenario? (Select THREE.)
Options:
A. Keep JSON format and avoid compression to reduce ETL complexity
B. Partition the S3 layout by event_date and app_id
C. Create an Athena index on event_date using CREATE INDEX
D. Convert the dataset to Parquet with Snappy compression
E. Partition the S3 layout by user_id
F. Create a Glue Data Catalog partition index on event_date and app_id
Correct answers: B, D and F
Explanation: For Athena, the biggest performance and cost gains come from reducing data scanned and improving partition pruning. Storing data in a columnar format (such as Parquet) with compression cuts I/O dramatically, and partitioning on the most common filter columns limits reads to relevant prefixes. At large partition counts, a Glue partition index can also reduce metadata lookup overhead.
Athena is a serverless query engine where cost and latency are strongly influenced by how many bytes are scanned and how efficiently it can skip irrelevant data. Converting row-based JSON into a columnar format (Parquet) and applying compression reduces storage and the amount of data Athena must read, while also enabling better predicate pushdown. Partitioning the dataset by the columns most frequently used in WHERE clauses (here, event_date and app_id) allows partition pruning so Athena reads only the necessary partitions. When a table accumulates many partitions, creating a Glue Data Catalog partition index can speed up partition metadata retrieval and reduce planning time. The key takeaway is to optimize around query access patterns: partition on low-to-moderate cardinality filter keys and use columnar compressed files.
- Partitioning by event_date and app_id: enables effective partition pruning for common filters.
- Glue Data Catalog partition index on event_date and app_id: improves partition metadata lookup at scale.
- Partitioning by user_id: high cardinality creates too many small partitions and excess overhead.
- CREATE INDEX: Athena doesn't support user-managed indexes like relational databases.
Topic: Data Ingestion and Transformation
You are optimizing an AWS Glue Spark job that ingests data from Amazon S3, transforms it, and writes curated data back to S3 for Athena queries. Select TWO statements that correctly describe high-level ways to reduce unnecessary I/O, minimize serialization overhead, and improve parallelism.
Options:
A. Apply filters and select only needed columns as early as possible
B. Write many small JSON files to maximize parallelism and speed Athena queries
C. Convert between DynamicFrames and DataFrames repeatedly to speed up processing
D. Use Python UDFs instead of built-in Spark SQL functions for best performance
E. Use coalesce(1) before writing to create one file for faster writes
F. Prefer Parquet/ORC over JSON/CSV to reduce I/O and serialization
Correct answers: A and F
Explanation: Using efficient storage formats and reducing the amount of data processed are the biggest high-level wins in Glue Spark. Columnar formats (with compression) cut read/write volume and parsing overhead, and early filtering/projection reduces downstream shuffles and materialization. Together these changes typically improve both runtime and cost without changing business logic.
The core optimization idea is to move less data and do less (de)serialization while letting Spark parallelize work across partitions. For S3-based pipelines, file format and when you reduce the dataset matter more than micro-optimizations.
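A rough sketch of these two wins (paths and column names are hypothetical), pushing filtering and projection before heavier work and writing columnar output:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-events").getOrCreate()

raw = spark.read.json("s3://datalake/raw/events/")            # hypothetical input path

curated = (
    raw
    # Project and filter as early as possible to cut shuffle and serialization volume.
    .select("event_id", "event_date", "app_id", "user_id", "event_ts")
    .filter(F.col("event_date") >= "2026-01-01")
    # Prefer built-in functions over Python UDFs.
    .withColumn("event_hour", F.hour("event_ts"))
)

# Write compressed, columnar output; keep multiple files per partition.
(curated.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://datalake/curated/events/"))                # hypothetical output path
```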
By contrast, adding extra conversions, forcing single-file output, or relying on Python UDFs usually increases overhead and reduces parallelism.
coalesce(1) collapses partitions, creating a bottleneck at write time; keep multiple output files and manage size via partitioning/compaction.
Topic: Data Operations and Support
A company uses Amazon Athena to query an Amazon S3 clickstream table in Parquet. The table is partitioned by dt (UTC date). Each dt partition scans about 64 GB.
A dashboard must report clicks for February 1, 2026 in America/Los_Angeles (UTC-8). The current query filters dt = '2026-02-01' and event_ts between 2026-02-01 00:00:00 and 2026-02-01 23:59:59, and it undercounts.
The table retains 90 days of partitions.
Which change will fix the time zone issue and approximately how much data will Athena scan (round to the nearest GB)?
Options:
A. Filter dt in two days and use a UTC window (~128 GB)
B. Filter on date(at_timezone(event_ts)) only (~5,760 GB)
C. Keep dt='2026-02-01'; shift event_ts to UTC (~64 GB)
D. Change Athena session time zone; keep dt='2026-02-01' (~64 GB)
Best answer: A
Explanation: A local day in America/Los_Angeles (UTC-8) crosses a UTC date boundary. To return the correct February 1 local-day results while preserving partition pruning, the query must read both relevant UTC dt partitions and filter event_ts using the corresponding UTC time range. With two daily partitions at 64 GB each, the scan is about 128 GB.
This is a time zone boundary problem combined with partition pruning. The business day is defined in America/Los_Angeles, but the table is partitioned by UTC date (dt).
For February 1, 2026 in UTC-8, the corresponding UTC interval is 2026-02-01 08:00:00 (inclusive) to 2026-02-02 08:00:00 (exclusive).
That interval spans two UTC dt partitions (2026-02-01 and 2026-02-02). If each partition scans 64 GB, scanning both is:
2 partitions × 64 GB/partition = 128 GB
Filtering only one dt value will undercount, while filtering only on a transformed timestamp typically prevents partition pruning and scans many more partitions.
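A small illustrative sketch of deriving the UTC window (and the dt values to keep for pruning) for a local day, using only the Python standard library:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

local_tz = ZoneInfo("America/Los_Angeles")
local_day = datetime(2026, 2, 1, tzinfo=local_tz)

# Half-open UTC window covering the local business day.
utc_start = local_day.astimezone(timezone.utc)
utc_end = (local_day + timedelta(days=1)).astimezone(timezone.utc)

# dt partitions to include in the WHERE clause so pruning still works.
dt_values = sorted({utc_start.date().isoformat(), utc_end.date().isoformat()})
print(utc_start, utc_end, dt_values)
# 2026-02-01 08:00:00+00:00 2026-02-02 08:00:00+00:00 ['2026-02-01', '2026-02-02']
```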
- The local day also covers events stored under dt='2026-02-02' UTC, so both partitions must be read.
- Keeping only dt='2026-02-01' means dt pruning excludes needed partitions and undercounts.
- Filtering on a converted event_ts only is likely to bypass dt partition pruning and can scan all 90 partitions (about 90 × 64 GB).
Topic: Data Ingestion and Transformation
You are selecting AWS services for near-real-time ingestion. Which TWO statements are false or unsafe?
(Select TWO.)
Options:
A. DynamoDB Streams is designed for EC2 log ingestion, unlimited size
B. Kinesis Data Streams can ingest Oracle CDC directly, no CDC tool
C. Amazon MSK fits Kafka-compatible producers/consumers needing Kafka features
D. AWS DMS can do ongoing CDC replication into Kinesis or S3
E. Kinesis Data Streams provides shard-based ordering and scalable throughput
F. DynamoDB Streams captures DynamoDB item changes for downstream processing
Correct answers: A and B
Explanation: Match the ingestion service to the event source and protocol. Database CDC requires a CDC-capable service such as AWS DMS (or a custom CDC connector) before sending events to a stream. DynamoDB Streams only publishes DynamoDB table change events, while Kinesis Data Streams and Amazon MSK are general-purpose streaming platforms for application-produced records.
The core decision is source type versus what the ingestion service can natively capture. Kinesis Data Streams and Amazon MSK ingest records that producers explicitly publish (Kinesis API or Kafka protocol). They do not automatically “tap” databases for CDC. DynamoDB Streams is specialized: it emits change records only from a DynamoDB table.
Practical mapping: use Kinesis Data Streams for application-produced records that need shard-based ordering and scalable throughput; use Amazon MSK when producers and consumers need Kafka compatibility and Kafka features; use AWS DMS for ongoing CDC replication from databases such as Oracle into Kinesis or S3; use DynamoDB Streams only to capture DynamoDB item changes for downstream processing.
The unsafe statements are the ones claiming Kinesis directly ingests Oracle CDC and that DynamoDB Streams is a general log ingestion service.
Topic: Data Security and Governance
Select THREE statements that correctly describe authentication mechanisms for private data sources and how they are applied to AWS data pipeline components.
Options:
A. Attach an IAM role to an AWS Glue job for S3 access.
B. Prefer IAM user access keys in code for non-AWS sources.
C. Put database passwords in Lambda environment variables for simplicity.
D. Store JDBC passwords in Secrets Manager and reference via Glue connection.
E. Use a Redshift IAM role for COPY/UNLOAD when accessing S3.
F. AWS DMS to Amazon S3 targets cannot use IAM roles.
Correct answers: A, D and E
Explanation: Role-based access is the default for AWS-to-AWS access because services can assume IAM roles and use temporary credentials. Secrets-based credentials are appropriate when a pipeline component must authenticate to an external system (for example, a database) with a username/password, ideally stored in AWS Secrets Manager. These patterns reduce long-lived credential exposure and improve rotation and auditing.
Use role-based authentication (IAM roles) when an AWS service needs to call AWS APIs (for example, Glue reading/writing S3, Redshift COPY/UNLOAD to S3). The service assumes a role and obtains short-lived credentials governed by IAM policies.
Use secrets-based authentication (for example, AWS Secrets Manager) when the data source requires a shared secret such as a database username/password. Pipeline components should reference the secret at runtime instead of hardcoding credentials in code, config files, or environment variables.
Key takeaway: prefer IAM roles for AWS resources; use Secrets Manager for external credentials and rotation.
Topic: Data Ingestion and Transformation
A data team uses Amazon EventBridge to start an AWS Step Functions state machine every 5 minutes. The state machine invokes AWS Lambda functions to ingest new records from a partner API, write them to Amazon S3, and update a downstream table. Step Functions is configured with retries and a Catch path for transient failures, and EventBridge can occasionally deliver duplicate events.
Which design action best reflects the core principle that supports safe retries and duplicate event delivery in this serverless orchestration?
Options:
A. Make each Lambda idempotent by recording a processing key and using conditional writes
B. Use IAM least-privilege policies for EventBridge, Step Functions, and each Lambda role
C. Prevent overwrites in the raw S3 prefix by writing each ingest to a new object key
D. Encrypt S3 objects and any state in DynamoDB using AWS KMS customer managed keys
Best answer: A
Explanation: Because Step Functions can retry and EventBridge can deliver the same event more than once, the workflow must tolerate repeated invocations without changing the final result. The core principle is idempotent processing: repeated executions with the same input should produce the same outcome. Tracking a unique processing key and enforcing it with conditional writes prevents duplicates while still allowing retries for resilience.
When EventBridge triggers Step Functions, you should assume at-least-once delivery and that retries may occur during transient errors. The core principle that makes this safe is idempotent processing: each processing attempt must be able to run more than once without producing duplicate side effects.
A common pattern is to generate or extract a stable idempotency key (for example, partner record ID + ingest window or the Step Functions execution input hash) and store it in a durable store such as DynamoDB. Each Lambda checks/claims the key using a conditional write so only the first attempt performs the write/update, while subsequent retries become no-ops or return the previously computed result. This preserves correctness while keeping Step Functions retries enabled for availability.
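A minimal sketch of the claim-the-key pattern, assuming a hypothetical DynamoDB table named ingest_idempotency with partition key processing_key and hypothetical event fields:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("ingest_idempotency")  # hypothetical table

def claim_processing_key(processing_key: str) -> bool:
    """Return True only for the first attempt; retries and duplicates become no-ops."""
    try:
        table.put_item(
            Item={"processing_key": processing_key},
            # Succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(processing_key)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already processed by an earlier attempt
        raise

def handler(event, context):
    key = f"{event['partner_record_id']}:{event['ingest_window']}"  # hypothetical fields
    if claim_processing_key(key):
        # ... perform the S3 write / downstream table update here ...
        pass
```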
Topic: Data Ingestion and Transformation
A data engineer is designing a high-level ETL workflow on AWS with clear stages: ingest, validate, transform, load, and publish. The workflow must have explicit success and failure paths so downstream consumers only read trusted data.
Which TWO statements are INCORRECT (false/unsafe) for this design? (Select TWO.)
Options:
A. Land source data in an immutable S3 raw zone during ingest
B. Treat the workflow as successful if publish runs, even when transform/load errors occurred
C. Validate schema and business rules before transform; quarantine rejects
D. Make the load step idempotent so retries do not duplicate data
E. Publish only after a successful load; stop downstream on failures
F. If some records fail validation, continue without tracking; consumers can filter
Correct answers: B and F
Explanation: A well-designed ETL workflow defines clear stage boundaries and uses gates so only validated, successfully loaded data is published. Failure paths must be explicit (for example, quarantine invalid data, fail the run, and alert) rather than silently passing problems downstream. Success should only be emitted when all required upstream stages complete correctly.
The core design principle is stage gating with explicit branching: ingest writes immutable raw data, validate decides whether data is acceptable (or which records are rejected), transform only runs on accepted inputs, load writes to the target in an idempotent way, and publish exposes the curated result (for example, updating a pointer/manifest, adding a partition, or emitting an event) only after load succeeds.
At a high level, orchestrators such as AWS Step Functions or managed schedulers should model explicit success and failure transitions between stages, with Catch/fallback paths that route bad records to a quarantine location and raise alarms.
The key takeaway is to avoid “silent success” patterns that hide validation or load failures and shift data quality responsibility to downstream users.
Topic: Data Store Management
A company is standardizing on Amazon Redshift for analytics. The data engineering team must choose between Amazon Redshift provisioned and Amazon Redshift Serverless for different workloads.
Select TWO statements that are true.
Options:
A. Redshift Serverless requires selecting a node type and number of nodes to meet performance requirements.
B. In Redshift Serverless, setting a maximum RPU helps limit compute consumption and cost for that workgroup.
C. Redshift Serverless is a good fit for spiky or intermittent workloads because it automatically adjusts capacity and you pay for compute used.
D. Redshift Serverless is always the lowest-cost option for 24/7 steady, high-utilization BI workloads.
E. Redshift provisioned automatically scales compute to zero when the cluster is idle.
F. Redshift provisioned is generally a better fit for consistently high, predictable workloads where you want fixed capacity and stable performance.
Correct answers: B and C
Explanation: Redshift Serverless is intended for workloads with variable demand because it manages capacity for you and charges based on usage. It also provides cost-control guardrails such as a maximum RPU setting to limit peak compute consumption. Provisioned clusters are typically chosen when you want fixed capacity and highly predictable performance and spend for steady workloads.
The core trade-off is how compute capacity is managed and billed. Redshift Serverless abstracts cluster sizing and scales compute based on demand, which is typically cost-effective for unpredictable, spiky, or intermittent usage patterns. You can apply guardrails (such as maximum RPU) to help control peak compute usage and reduce the risk of unexpected spend.
Redshift provisioned requires you to choose and manage cluster capacity (node type/count). That model is commonly preferred for steady, predictable, always-on workloads where you want consistent performance characteristics and more predictable baseline costs, especially when capacity can be planned in advance.
In this question, the true statements are the ones describing serverless auto-capacity with usage-based billing and the use of max RPU as a cost-control mechanism.
Topic: Data Store Management
A company stores clickstream data in Amazon S3 as Parquet with prefixes like s3://datalake/curated/clicks/dt=YYYY-MM-DD/hour=HH/. An Amazon Athena dashboard must include new data within 15 minutes of landing in S3. The dataset creates thousands of new partitions per day. The team must keep Athena query costs low by maximizing partition pruning and must avoid solutions that require frequent full-prefix S3 listing because of governance and operational overhead.
Which solution BEST meets these requirements?
Options:
A. Schedule an AWS Glue crawler to run every 15 minutes
B. Query the data as an unpartitioned Athena table
C. Run Athena MSCK REPAIR TABLE before each dashboard refresh
D. Use S3 events to trigger Lambda to BatchCreatePartition in Glue
Best answer: D
Explanation: Athena relies on AWS Glue Data Catalog partition metadata to find and prune partitioned data efficiently. Creating partitions as new S3 keys arrive synchronizes the catalog with the dataset within the 15-minute SLA without repeatedly listing the full dataset prefix. This preserves correctness (latest partitions are queryable) and reduces query and maintenance cost.
For partitioned Athena tables, the Glue Data Catalog’s partition metadata determines which S3 prefixes Athena reads. If new dt/hour folders exist in S3 but are missing from the catalog, queries can return incomplete results or require expensive discovery steps. An event-driven approach that extracts partition values from the object key and calls Glue BatchCreatePartition keeps the catalog synchronized quickly and at scale.
- S3 sends ObjectCreated events to an SQS queue (or EventBridge).
- A Lambda function parses dt and hour from the object key.
- The function calls Glue BatchCreatePartition to register new partitions.
This avoids frequent full-prefix listing while enabling effective partition pruning for lower Athena scan costs.
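A rough sketch of such a Lambda handler, assuming it is invoked directly from S3 event notifications and that the database and table names are placeholders (error handling and deduplication are omitted):

```python
import re

import boto3

glue = boto3.client("glue")
DATABASE, TABLE = "analytics", "clicks"  # hypothetical catalog names
KEY_PATTERN = re.compile(r"dt=(\d{4}-\d{2}-\d{2})/hour=(\d{2})/")

def handler(event, context):
    # Base each new partition's StorageDescriptor on the table definition.
    table_sd = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]["StorageDescriptor"]
    partitions = []
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        match = KEY_PATTERN.search(key)
        if not match:
            continue
        dt_value, hour_value = match.groups()
        sd = dict(table_sd)
        sd["Location"] = f"s3://datalake/curated/clicks/dt={dt_value}/hour={hour_value}/"
        partitions.append({"Values": [dt_value, hour_value], "StorageDescriptor": sd})
    if partitions:
        glue.batch_create_partition(
            DatabaseName=DATABASE, TableName=TABLE, PartitionInputList=partitions
        )
```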
Querying an unpartitioned table prevents pruning on dt/hour, increasing query cost.
Topic: Data Ingestion and Transformation
A company ingests 5 KB JSON clickstream events into Amazon Kinesis Data Streams. The required transformation is lightweight (add two lookup fields and drop PII) and must complete within 1 second of an event arriving. The team wants a fully managed, event-driven operational model with no long-running clusters to patch or scale.
Which transformation service should the data engineer choose?
Options:
A. AWS Lambda triggered by Kinesis Data Streams
B. Amazon EMR Spark Structured Streaming on EC2
C. AWS Glue streaming ETL job
D. Amazon Redshift ELT after loading events into tables
Best answer: A
Explanation: The deciding factor is the sub-second, per-event latency requirement with an event-driven operational model. AWS Lambda can process each Kinesis record immediately and scales automatically without managing servers or clusters. The other services are better suited to micro-batch/cluster-based streaming or batch ELT into a warehouse, which makes consistently meeting a 1-second SLA harder.
This scenario is best matched to a serverless, record-at-a-time transformation. With Kinesis as the source and a simple enrichment/redaction step, AWS Lambda can be invoked as records arrive and complete processing within the 1-second SLA while keeping operations minimal (no always-on compute to manage).
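A minimal handler sketch under these assumptions (the event field names, lookup helpers, and Firehose delivery stream are hypothetical):

```python
import base64
import json

import boto3

firehose = boto3.client("firehose")

def enrich(event_record: dict) -> dict:
    """Add two lookup fields and drop PII before forwarding."""
    event_record["campaign"] = lookup_campaign(event_record["app_id"])  # hypothetical lookup
    event_record["region"] = lookup_region(event_record["app_id"])      # hypothetical lookup
    for pii_field in ("email", "ip_address"):
        event_record.pop(pii_field, None)
    return event_record

def handler(event, context):
    records = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        records.append({"Data": (json.dumps(enrich(payload)) + "\n").encode()})
    if records:
        firehose.put_record_batch(
            DeliveryStreamName="curated-clicks",  # hypothetical destination
            Records=records,
        )
```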
Glue streaming and EMR streaming are designed for larger, stateful streaming jobs and typically involve micro-batching and/or more operational overhead (job infrastructure or clusters). Redshift transformations generally assume data is loaded first and then transformed with SQL (ELT), which is a batch-oriented pattern and is not intended for per-event, sub-second processing.
Key takeaway: when the discriminator is strict low latency plus “no clusters,” choose Lambda over cluster or warehouse-based approaches.
Topic: Data Store Management
Which statement best defines a business data catalog (as opposed to a technical catalog such as the AWS Glue Data Catalog) and how it supports governance and data sharing workflows?
Options:
A. It stores table schemas, S3 locations, partitions, and SerDe details for query engines and ETL jobs
B. It is the service that enforces fine-grained data permissions using centralized grants and LF-tags on databases and tables
C. It manages KMS envelope encryption keys and rotates data keys used to encrypt datasets at rest
D. It provides business context (owners, glossary, descriptions) and workflow-based sharing (search, request access, approvals) for governed data consumption
Best answer: D
Explanation: A business data catalog is designed for human-facing discovery and governance: it adds business metadata (definitions, owners, stewardship) and enables controlled sharing via access request and approval workflows. A technical catalog primarily serves engines by holding schemas and physical locations, not business context and collaboration processes.
The core distinction is purpose and audience. A technical catalog (for example, AWS Glue Data Catalog) is a metadata store used by services like Glue, Athena, and Redshift Spectrum to resolve database/table definitions, S3 locations, partitions, and serialization formats. A business data catalog targets people and governance: it layers business-friendly metadata (descriptions, glossary terms, ownership/stewardship) on top of technical metadata and supports data sharing workflows such as search and discovery, requesting access, approvals, and tracking who can consume datasets. In AWS, services in the “data discovery and collaboration” space (for example, Amazon DataZone) align to the business-catalog role, while Glue Data Catalog aligns to the technical-catalog role.
Topic: Data Security and Governance
A data pipeline in us-east-1 uses Amazon Kinesis Data Firehose to deliver JSON files to s3://datalake-prod/landing/ (SSE-KMS with CMK key-123). An AWS Glue job reads only from landing/events/, converts to Parquet, writes to s3://datalake-prod/curated/events/, and updates one AWS Glue Data Catalog table analytics.curated_events. Amazon Athena queries only the curated location.
The Glue job’s IAM role currently has these managed policies attached: AmazonS3FullAccess and AWSGlueConsoleFullAccess. Security requires least-privilege scoping and disallows wildcard access to all S3 buckets/keys. The team wants to fix this without breaking the pipeline.
Which change is the best optimization?
Options:
A. Replace the managed policies with a customer-managed policy scoped to the two S3 prefixes, the specific KMS key, and only the required Glue Data Catalog actions for analytics.curated_events.
B. Add an S3 bucket policy that allows the glue.amazonaws.com service principal to read and write anywhere in datalake-prod, then remove all permissions from the Glue role.
C. Replace AmazonS3FullAccess with AmazonS3ReadOnlyAccess and keep AWSGlueConsoleFullAccess for catalog updates.
D. Keep the managed policies and add AWS CloudTrail alerts to detect unexpected S3 access by the Glue role.
Best answer: A
Explanation: The requirement is to apply least privilege when managed policies are too broad. A customer-managed IAM policy lets you restrict access to only landing/events/ read, curated/events/ write, the specific CMK needed for SSE-KMS, and only the necessary Glue Data Catalog actions for the single table. This reduces blast radius and improves security operability while keeping the pipeline functional.
Managed policies like AmazonS3FullAccess and AWSGlueConsoleFullAccess are typically far broader than a production ETL job needs. The least-privilege approach is to replace them with a customer-managed policy that allows only the exact actions and resources required by the Glue job:
- S3: s3:ListBucket on datalake-prod with an s3:prefix condition, s3:GetObject on landing/events/*, and s3:PutObject (plus optional multipart actions) on curated/events/*.
- KMS: Decrypt for reads and Encrypt/GenerateDataKey for writes, scoped to CMK key-123.
- Glue Data Catalog: only the actions needed to read and update analytics.curated_events, scoped to those catalog resources.
This meets the explicit constraint against wildcard access while preserving reliability (no unexpected AccessDenied) and improving security posture through reduced permissions.
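An illustrative sketch of such a customer-managed policy, expressed as a Python dict; the account ID is a placeholder and the exact Glue actions should be trimmed to what the job actually calls:

```python
import json

ACCOUNT = "111122223333"  # placeholder account ID
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:ListBucket",
         "Resource": "arn:aws:s3:::datalake-prod",
         "Condition": {"StringLike": {"s3:prefix": ["landing/events/*", "curated/events/*"]}}},
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::datalake-prod/landing/events/*"},
        {"Effect": "Allow", "Action": "s3:PutObject",
         "Resource": "arn:aws:s3:::datalake-prod/curated/events/*"},
        {"Effect": "Allow",
         "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
         "Resource": f"arn:aws:kms:us-east-1:{ACCOUNT}:key/key-123"},
        {"Effect": "Allow",
         "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions",
                    "glue:BatchCreatePartition", "glue:UpdateTable"],
         "Resource": [f"arn:aws:glue:us-east-1:{ACCOUNT}:catalog",
                      f"arn:aws:glue:us-east-1:{ACCOUNT}:database/analytics",
                      f"arn:aws:glue:us-east-1:{ACCOUNT}:table/analytics/curated_events"]},
    ],
}
print(json.dumps(policy, indent=2))
```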
Topic: Data Ingestion and Transformation
A data platform uses an Amazon EventBridge schedule to start an AWS Step Functions Standard state machine every 5 minutes. The workflow reads a batch of S3 object keys from an Amazon SQS queue, then fans out processing to AWS Lambda and AWS Glue jobs. During traffic spikes, the team must prevent downstream system overload and ensure failures are handled safely.
Which statement is INCORRECT about designing and operating this serverless workflow?
Options:
A. Use Map MaxConcurrency to cap parallel processing
B. Set Lambda reserved concurrency to protect downstream dependencies
C. Use retries with backoff and a DLQ for poison messages
D. Step Functions Standard is exactly-once, so idempotency isn’t needed
Best answer: D
Explanation: The unsafe statement is the claim that Step Functions Standard is exactly-once and therefore idempotency is unnecessary. In practice, retries, timeouts, and partial failures can cause the same unit of work to be attempted more than once. Designing idempotent processing (or explicit deduplication) is required for correctness in serverless orchestration.
Serverless orchestration is built around controlled concurrency and resilient failure handling. Step Functions Standard can retry failed tasks (or tasks can time out and be retried), and integrations that involve polling/queueing can also result in duplicate attempts. Because of this at-least-once reality, correctness is achieved by making each unit of work idempotent (for example, writing with conditional puts, using idempotency keys, or tracking processed S3 keys/ETags).
Separately, you manage load by capping parallelism (such as Step Functions Map MaxConcurrency) and by limiting Lambda concurrency to protect databases/APIs. For failures, use retries with exponential backoff for transient errors and route non-retryable “poison” messages to a DLQ for later inspection and redrive.
Use Map MaxConcurrency to prevent stampedes during spikes.
Topic: Data Security and Governance
A data engineering team is creating least-privilege IAM permissions for roles that read curated data in Amazon S3 and run analytics jobs. Which statement is INCORRECT?
Options:
A. Use tools like IAM policy validation/Access Analyzer and CloudTrail to confirm the policy’s effective access.
B. To allow s3:GetObject, granting access to only the bucket ARN is sufficient.
C. For S3 least privilege, use bucket ARN for s3:ListBucket and object ARNs for s3:GetObject.
D. If a managed policy is too broad, create a customer-managed policy scoped to required actions and resources.
Best answer: B
Explanation: In IAM, least-privilege scoping requires matching actions to the correct resource types. Amazon S3 object-level actions (for example, s3:GetObject) must be granted on object ARNs, while bucket-level actions (for example, s3:ListBucket) use the bucket ARN and can be further restricted with conditions. Creating a custom customer-managed policy is the right approach when AWS managed policies are too permissive.
The core idea is least-privilege IAM design: when AWS managed policies don’t fit, write a customer-managed policy that grants only the needed actions and scopes Resource precisely.
For Amazon S3, actions map to different resource types:
- Bucket-level actions such as s3:ListBucket use arn:aws:s3:::bucket and can be narrowed with conditions such as s3:prefix.
- Object-level actions such as s3:GetObject require object ARNs such as arn:aws:s3:::bucket/prefix/* (or arn:aws:s3:::bucket/*).
Therefore, granting s3:GetObject on only the bucket ARN will not correctly authorize object reads and is not a valid least-privilege pattern. The key takeaway is to scope both actions and resources accurately and validate the effective permissions before rollout.
Topic: Data Operations and Support
An AWS Glue Spark job that computes daily metrics with groupBy(user_id) is intermittently missing its SLA. You review the stage summary below.
Exhibit: Stage 12 (shuffle) by partition
| Partition | Records read | Task time |
|---|---|---|
| 0 | 820,000,000 | 58 min |
| 1 | 21,000,000 | 2.1 min |
| 2 | 19,000,000 | 1.9 min |
| 3 | 20,000,000 | 2.0 min |
Based only on the exhibit, what is the best next step to mitigate the issue?
Options:
A. Decrease spark.sql.shuffle.partitions to reduce overhead
B. Salt the user_id key before aggregation to spread work
C. Increase the number of Glue workers for the job
D. Coalesce to 1 partition before the groupBy to simplify processing
Best answer: B
Explanation: The exhibit shows extreme data skew: one shuffle partition processes far more records and runs far longer than the others. This indicates hot keys or unbalanced partitioning around user_id, creating a straggler that dictates overall stage runtime. A key-salting approach spreads the skewed key’s rows across multiple partitions to balance the workload.
Data skew in distributed processing appears as a small number of partitions doing most of the work, causing “straggler” tasks and long stage completion times. In the exhibit, Partition 0 reads 820,000,000 records and takes 58 minutes, while the other partitions read ~20,000,000 records and take ~2 minutes, which is a classic hot-key/unbalanced-partition symptom.
A high-level mitigation for skewed groupBy(user_id) is to add a temporary random “salt” to the key (for example, user_id + salt) so the heavy key’s rows are split across many partitions, then perform a second aggregation to combine the salted results back to per-user_id totals. Key takeaway: fix the skewed key distribution rather than only adding more compute.
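A high-level sketch of two-stage salted aggregation for an additive metric (assuming events is the input DataFrame; the column names and salt count are illustrative):

```python
from pyspark.sql import functions as F

NUM_SALTS = 32  # illustrative; size to the observed skew

# Stage 1: spread each user_id's rows across NUM_SALTS shuffle partitions.
salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
partial = (salted
    .groupBy("user_id", "salt")
    .agg(F.count("*").alias("clicks_partial")))

# Stage 2: combine the per-salt partials back to one row per user_id.
daily_metrics = (partial
    .groupBy("user_id")
    .agg(F.sum("clicks_partial").alias("clicks")))
```

This two-stage approach works for decomposable aggregates such as counts and sums; non-decomposable metrics need a different skew strategy.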
Topic: Data Ingestion and Transformation
A data engineer is building an ingestion job that pulls orders from a third-party REST API into Amazon S3 every 5 minutes. The API enforces rate limits and returns 429 Too Many Requests during bursts. The API supports cursor-based pagination by returning a nextPageToken in each response, and new/updated orders can arrive while a full pull is in progress.
Which statement is INCORRECT about designing pagination, backoff, and checkpointing for this ingestion job?
Options:
A. Use offset/page-number pagination and checkpoint only the page number to avoid duplicates even if new orders arrive.
B. Design loads to be idempotent (for example, dedupe by order ID) to handle retries safely.
C. Checkpoint the cursor token only after a page is successfully written downstream.
D. Retry 429 responses using exponential backoff with jitter and a max retry limit.
Best answer: A
Explanation: Cursor-based pagination with a persisted nextPageToken is designed for changing datasets, while offset/page-number pagination can shift when new records arrive. Reliable ingestion also requires controlled retries for throttling and a durable checkpoint that advances only after successful writes. Idempotent processing prevents duplicates when retries or replays occur.
The core design goal is to make API ingestion resilient to throttling and failures while ensuring correct continuity across runs. For APIs that return a nextPageToken, cursor-based pagination plus a durable checkpoint (such as storing the last successfully processed token) is the safer approach because it avoids the shifting-window problem that occurs when new/updated records change the ordering of results.
Operationally, treat 429 and transient 5xx errors as retryable with exponential backoff and jitter, and advance the cursor checkpoint only after each page is successfully written downstream.
Offset/page-number pagination with only a page checkpoint is the risky pattern when the source data can change mid-ingestion.
Topic: Data Ingestion and Transformation
You are transforming nested JSON event data (arrays and objects) into an Amazon S3 data lake for analytics with AWS Glue and Amazon Athena/Redshift Spectrum.
Which THREE statements are INCORRECT or unsafe guidance for handling nested and semi-structured data?
Options:
A. Convert nested JSON to CSV because CSV preserves nesting and boosts performance.
B. Normalize one-to-many arrays into parent and child tables when appropriate.
C. Athena can query nested fields using dot notation and UNNEST.
D. You must fully flatten all nested fields before writing to Amazon S3.
E. Athena/Redshift Spectrum cannot read Parquet struct types without flattening.
F. AWS Glue can relationalize or explode arrays into child tables.
Correct answers: A, D and E
Explanation: Nested data does not always need to be flattened up front; Athena and Redshift Spectrum can query nested fields directly (especially in columnar formats like Parquet). Flattening/normalizing is a design choice driven by query patterns, one-to-many relationships, and performance/cost tradeoffs. CSV is usually a poor target for nested data because it loses structure and is expensive to scan.
The core decision is whether to keep nested structures (schema-on-read) or to flatten/normalize them (schema-on-write) based on how the data will be queried.
Athena and Redshift Spectrum can read Parquet struct/array columns and access them with dot notation, and you can use UNNEST/explode patterns for arrays. AWS Glue can relationalize or explode arrays into child tables when one-to-many normalization is appropriate.
A good default is: keep nested data in columnar storage, then selectively unnest/relationalize where it improves the main access patterns.
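A short sketch of the selective-unnest pattern in PySpark (assuming a SparkSession named spark; the paths and nested field names are hypothetical):

```python
from pyspark.sql import functions as F

orders = spark.read.parquet("s3://datalake/curated/orders/")    # hypothetical path

order_items = (orders
    .select(
        F.col("order_id"),
        F.col("customer.id").alias("customer_id"),               # struct field via dot notation
        F.explode("items").alias("item"))                        # array -> one row per element
    .select("order_id", "customer_id", "item.sku", "item.quantity"))

# Child table for the one-to-many relationship; the parent stays nested.
order_items.write.mode("overwrite").parquet("s3://datalake/curated/order_items/")
```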
Topic: Data Ingestion and Transformation
A team stores an AWS Glue ETL script in a Git repository. A security review requires removing database credentials from source code and enabling managed secret rotation.
Exhibit: Glue script excerpt
1 JDBC_URL = "jdbc:postgresql://sales-db:5432/sales"
2 USER = "etl_user"
3 PASSWORD = "P@ssw0rd!"
Based only on the exhibit, what is the best next step to implement safe configuration management for this pipeline?
Options:
A. Upload the script to an encrypted S3 bucket
B. Pass the password as a Glue job argument in plaintext
C. Base64-encode the password before committing the script
D. Store credentials in AWS Secrets Manager and fetch at runtime
Best answer: D
Explanation: The exhibit shows a plaintext secret embedded directly in the ETL code. Replacing the hardcoded value with a runtime lookup from AWS Secrets Manager keeps secrets out of source control and lets you use managed rotation. Access can be limited to the Glue job IAM role.
Safe configuration management means keeping secrets out of code and retrieving them securely at runtime using managed services. In the exhibit, the issue is explicit: line 3 hardcodes PASSWORD = "P@ssw0rd!", which risks exposure through the repository and makes rotation operationally difficult.
The appropriate fix is to store the credentials in AWS Secrets Manager, retrieve them at runtime (through the Glue connection or an SDK call from the script), grant the Glue job's IAM role permission to read only that secret, and enable managed rotation.
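A minimal sketch of the runtime lookup, assuming a hypothetical secret named sales-db/etl_user that stores the username and password as JSON:

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

def get_db_credentials(secret_id: str = "sales-db/etl_user") -> dict:
    """Fetch the current secret version instead of hardcoding credentials."""
    response = secrets.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

creds = get_db_credentials()
JDBC_URL = "jdbc:postgresql://sales-db:5432/sales"
USER, PASSWORD = creds["username"], creds["password"]
```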
This removes the secret from the script while keeping access controlled and auditable.
Topic: Data Store Management
Which statement is INCORRECT about selecting a data store for vector similarity workloads on AWS?
Options:
A. Amazon OpenSearch Service can provide low-latency ANN vector search but requires capacity planning for shards and memory.
B. Aurora PostgreSQL with pgvector is a good fit when vector search must be combined with relational queries and ACID transactions.
C. Amazon S3 queried with Amazon Athena is a safe choice for millisecond-latency nearest-neighbor search without building vector indexes.
D. For batch/offline similarity scoring, storing embeddings in Amazon S3 and processing with Spark can be more cost-effective than an always-on search cluster.
Best answer: C
Explanation: Online vector search typically needs a purpose-built vector index (often approximate nearest neighbor) hosted in a database/search engine designed for low-latency queries. OpenSearch and Aurora PostgreSQL with pgvector can support these patterns with different scaling and operational tradeoffs. S3 with Athena is primarily for analytical scans and is not an appropriate choice for millisecond-latency similarity retrieval.
The key tradeoff in vector workloads is whether you need interactive, low-latency nearest-neighbor retrieval (usually requiring a vector index such as HNSW/IVF) versus offline/batch processing.
Aurora PostgreSQL with pgvector can be a strong choice when embeddings live alongside relational data and you need SQL joins, constraints, and transactional consistency, but you still must design indexes and consider scaling limits of a relational engine. OpenSearch Service is commonly used for scalable, low-latency similarity search using ANN techniques, with operational responsibilities around cluster sizing, shard layout, and memory.
By contrast, Athena queries data in-place in S3 and is optimized for analytical scans, not interactive nearest-neighbor search, so relying on it for millisecond vector retrieval is unsafe.
Topic: Data Operations and Support
A data pipeline ingests clickstream events hourly into an S3 curated bucket partitioned as dt=YYYY-MM-DD/hour=HH. A dashboard has a freshness SLA: each hourly partition must be queryable in Athena by HH:20 UTC. Traffic volume naturally varies by up to ±30% for the same hour across days.
The on-call team wants alerts that are actionable (a target of at most 6 pages/day). Warnings can go to Slack; pages go to PagerDuty.
Which TWO alerting approaches should you AVOID because they create unsafe/noisy thresholds for timeliness or completeness? (Select TWO.)
Options:
A. Publish a lag metric (now minus newest processed event time) and page only if lag >20 minutes for 2 consecutive 10-minute periods
B. Use CloudWatch anomaly detection on record-count metrics and page only for sustained deviations beyond the learned band
C. Compute hourly record count and alert when it drops more than 50% versus a 7-day baseline for that same hour; send a Slack warning for 20% to 50% drops
D. Page if the hourly record count is not exactly equal to yesterday’s count for the same hour
E. Alert if the expected dt/hour partition prefix is missing or contains only 0-byte objects at HH:20
F. Page when the latest partition arrives later than HH:20 for a single 1-minute evaluation period
Correct answers: D and F
Explanation: Freshness and completeness checks should be tied to the SLA and natural variability, with thresholds that reduce false positives. Paging on single short-lived SLA misses or using exact record-count equality creates noisy, non-actionable alerts. Using sustained-breach windows and baseline/anomaly-based completeness thresholds better matches operational goals.
Timeliness (freshness) checks validate that data is available by the stated SLA time; good alerting adds a short persistence window so you page only when the breach is likely real and ongoing. Completeness checks validate that “enough” data arrived; thresholds must account for expected volume variability (seasonality, day-to-day changes) using baselines (rolling median/average) or anomaly bands.
A practical pattern is to page only on sustained SLA breaches (for example, lag above the threshold for consecutive evaluation periods), compare hourly counts against a same-hour baseline or learned anomaly band, and route moderate deviations to Slack while reserving pages for severe or sustained ones.
The key takeaway is to align thresholds to the SLA and expected variance to avoid alert fatigue while still catching true data quality incidents quickly.
Topic: Data Operations and Support
A daily AWS Glue Studio job reads s3://datalake/raw/orders/, applies a mapping that renames customer_id to cust_id, and then runs an AWS Glue Data Quality EvaluateDataQuality step to catch empty IDs and invalid values before writing Parquet to s3://datalake/curated/orders/.
Since the rename change, the job fails during the data quality step.
Exhibit: CloudWatch Logs (excerpt)
ERROR EvaluateDataQuality: AnalysisException: cannot resolve 'customer_id'
given input columns: [cust_id, order_id, order_status, order_ts]
Ruleset orders_curated_rules:
Completeness "customer_id" > 0.99
Which change will fix the root cause with the least disruption while keeping data quality checks in the job?
Options:
A. Update the ruleset to use cust_id instead of customer_id
B. Grant the job role glue:GetTable on the Data Catalog
C. Increase the Glue job DPUs to avoid Spark executor errors
D. Rerun the Glue crawler to refresh the table schema
Best answer: A
Explanation: The failure occurs inside the EvaluateDataQuality step because the ruleset references customer_id, which no longer exists after the mapping rename. Aligning the ruleset with the transformed schema allows the job to run and still enforce completeness/validity checks before writing curated output. This is the smallest change because it does not alter the pipeline structure or data flow.
Symptom: the Glue job fails at EvaluateDataQuality with an AnalysisException stating it cannot resolve customer_id.
Root cause: the mapping step renamed customer_id to cust_id, but the Glue Data Quality ruleset still evaluates Completeness "customer_id" > 0.99. When the evaluation runs against the post-mapping frame, Spark cannot find the referenced column.
Fix: edit the existing ruleset to reference the current column name (cust_id) and keep the evaluation step in the same position so it continues to validate the transformed/curated schema before writing to S3. Key takeaway: data quality rules must match the schema at the point in the job where they are evaluated.
Missing catalog permissions would surface as an AccessDenied error, not as an unresolved column.
Topic: Data Operations and Support
A data engineer runs Amazon Athena queries on an S3 table stored in Parquet and partitioned by dt (daily). There are 90 dt partitions. Each partition scans 100 GB when using SELECT *. If a query selects only user_id and page_id, Athena scans 10 GB per partition (column pruning).
A new report needs only user_id and page_id for the last 7 days. Athena costs USD 5.00 per TB scanned. Assume 1 TB = 1,000 GB. Round to the nearest cent.
Which option has the lowest estimated Athena cost for one run?
Options:
A. No dt filter and select user_id,page_id (USD 4.50)
B. Filter dt to last 7 days and select user_id,page_id (USD 0.35)
C. Filter dt to last 7 days and SELECT * (USD 3.50)
D. Add LIMIT 10,000 without a dt filter (USD 45.00)
Best answer: B
Explanation: The lowest-cost Athena query is the one that both limits partitions with a dt predicate and projects only the required columns. With Parquet, selecting fewer columns reduces bytes scanned, and partition pruning limits how many partitions are read. Using both together minimizes scanned data and typically improves performance and cost.
Athena charges by data scanned, so the goal is to reduce scanned bytes by (1) pruning partitions with a predicate on the partition key and (2) scanning only the needed columns with a columnar format like Parquet.
For the query that filters to 7 days and selects only user_id and page_id: 7 partitions × 10 GB = 70 GB = 0.07 TB, and 0.07 TB × USD 5.00/TB = USD 0.35 per run.
Any approach that reads more partitions or uses SELECT * increases bytes scanned (and cost).
- SELECT * on 7 days still scans 100 GB per day because it reads all columns.
- Relying on LIMIT is a misconception: it does not prevent scanning the partitions/columns needed to evaluate the query.
Topic: Data Ingestion and Transformation
A data engineer uses AWS Glue to transform clickstream events from an S3 raw zone for querying in Amazon Athena. The source JSON has schema drift: the event_time field is sometimes an ISO-8601 string and sometimes an epoch-milliseconds number, which causes downstream jobs to fail when a consistent timestamp is expected.
A daily Athena report queries the last 7 days of data.
Which solution best enforces a consistent schema and type normalization during transformation and results in the lowest Athena scan cost for the daily report query?
Options:
A. Glue ETL applies explicit mappings to a fixed schema, converts event_time to timestamp, writes Parquet partitioned by normalized event_date (YYYY-MM-DD); USD 2.10/query
B. Athena view uses try_cast to normalize event_time at query time over raw JSON; USD 22.50/query
C. Glue crawler infers schema from raw JSON daily and Athena queries raw data; USD 22.50/query
D. Glue ETL casts event_time to timestamp and writes Parquet with no partitions; USD 9.00/query
Best answer: A
Explanation: Use a transformation step that enforces a fixed schema and normalizes types before writing curated data. Writing Parquet partitioned by a normalized date enables Athena partition pruning and avoids failures from mixed event_time representations. The daily report then scans only the 7 required partitions, minimizing bytes scanned and cost.
The core requirement is to prevent downstream failures from schema drift by enforcing a consistent target schema (including type normalization) during transformation, not at query time. In AWS Glue, this is done by applying an explicit mapping and converting mixed representations (ISO string and epoch ms) into a single timestamp type, then writing curated data.
For Athena cost, partition pruning limits scanned data to the last 7 days of curated Parquet:
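The scan sizes are not restated here, but they can be backed out of the per-query prices in the options (USD 5 per TB), which makes the effect of pruning concrete:
\[ \begin{aligned} \text{Option A (Parquet, 7 partitions)} &: 2.10 / 5 = 0.42\ \text{TB scanned} \\ \text{Option D (Parquet, no partitions)} &: 9.00 / 5 = 1.8\ \text{TB scanned} \\ \text{Options B/C (raw JSON)} &: 22.50 / 5 = 4.5\ \text{TB scanned} \end{aligned} \]
Pruning to the 7 needed daily partitions reads roughly a quarter of the full Parquet dataset and only a small fraction of the raw JSON.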
Key takeaway: enforce types and normalize partition keys during ETL so both correctness and scan efficiency are achieved.
Topic: Data Operations and Support
A data pipeline uses AWS Step Functions to orchestrate an AWS Lambda validation step followed by an AWS Glue ETL job. Each run must carry a run_date and a unique run_id through all steps, and operators must be able to trace outputs back to the specific orchestration run.
Which mechanism best meets this requirement?
Options:
A. Store run_date and run_id as S3 object tags
B. Use CloudTrail event IDs to correlate the steps
C. Pass context in Step Functions input/output with execution ARN
D. Write run_date and run_id into Glue Data Catalog table properties
Best answer: C
Explanation: AWS Step Functions is designed to pass run-specific context between automated processing steps by carrying a JSON payload through the workflow. Each workflow run also has a unique execution identifier (execution ARN), which provides high-level traceability from outputs back to the specific run.
To pass run context such as dates, partitions, and run IDs through automated processing steps, use a mechanism that natively propagates per-execution values across the orchestration. AWS Step Functions does this by storing the execution input and each state’s output as JSON and passing that data to subsequent states (optionally shaping it with state Parameters). Step Functions also assigns each run a unique execution identifier (execution ARN), which can be injected into task inputs for end-to-end traceability and used to correlate logs and artifacts produced by Lambda and Glue for that run.
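A minimal sketch of the pattern, assuming a hypothetical state machine ARN and payload shape (the boto3 call and the `$$.Execution.Id` context path are standard; everything else is illustrative):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Start one orchestration run and pass the run context as execution input.
# Every state receives (and can enrich) this JSON payload.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:IngestPipeline",  # hypothetical
    name="run-2026-02-25-0012",  # naming executions by run_id keeps them easy to find
    input=json.dumps({"run_date": "2026-02-25", "run_id": "0012"}),
)
print(response["executionArn"])  # unique per run; correlate logs and outputs with this ARN

# Inside the state machine, a Task state can inject the execution ARN into its input, e.g.:
# "Parameters": {"run_context.$": "$", "execution_arn.$": "$$.Execution.Id"}
```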
Key takeaway: use the workflow engine’s execution-scoped input/output and execution ID rather than static metadata or audit logs.
Topic: Data Store Management
Which THREE statements are true about AWS-backed approaches for storing embeddings and performing vector similarity search? (Select THREE.)
Options:
A. Aurora PostgreSQL with pgvector can store embeddings and run similarity SQL.
B. Athena can perform k-NN similarity search directly over S3 embeddings.
C. Amazon OpenSearch Service supports vector fields and k-NN queries.
D. RDS for MySQL includes built-in vector indexing and similarity functions.
E. DynamoDB natively supports vector distance queries for similarity search.
F. Self-managed FAISS on EC2 can serve similarity search from stored embeddings.
Correct answers: A, C and F
Explanation: Vector search is used to find semantically similar items by comparing embedding vectors. On AWS, common high-level approaches include managed search engines that support vector indexing, relational databases with vector extensions, or self-managed ANN libraries running on compute. A serverless SQL query engine over files and a key-value database are not, by themselves, vector similarity engines.
The core decision is choosing a store and query engine that can both persist embeddings and execute similarity (nearest-neighbor) queries efficiently. AWS-native options include OpenSearch for managed vector indexing and k-NN retrieval, and PostgreSQL-compatible databases (such as Aurora PostgreSQL) using pgvector to store vectors and compute distances in SQL. If you need custom algorithms or tighter control over indexing behavior, you can run an ANN library (for example, FAISS) on Amazon EC2 and manage the index lifecycle yourself.
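For instance, a minimal pgvector sketch (table, column, and connection details are hypothetical; it assumes the vector extension is available on the Aurora PostgreSQL cluster):

```python
import psycopg2

# Placeholder connection details for an Aurora PostgreSQL endpoint.
conn = psycopg2.connect(host="aurora-cluster.example.com", dbname="vectors",
                        user="app", password="***")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id bigserial PRIMARY KEY,
        embedding vector(3)   -- real embeddings are typically hundreds of dimensions
    );
""")

# k-NN-style similarity: order by L2 distance to the query vector and take the top 5.
cur.execute(
    "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 5;",
    ("[0.12, 0.98, 0.33]",),
)
print(cur.fetchall())
conn.commit()
```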
Amazon OpenSearch Service vector + k-NN: OK
Aurora PostgreSQL + pgvector similarity SQL: OK
FAISS on EC2 (self-managed ANN service): OK
Athena over S3 for k-NN similarity: NO (not a vector engine)
DynamoDB native vector distance queries: NO
RDS for MySQL built-in vector indexing: NO
Pick services that provide vector-aware indexing and similarity query capabilities, not just storage.
Topic: Data Security and Governance
A data lake S3 bucket in Account A is encrypted with SSE-KMS using a customer managed key (CMK) in Account A. A Glue ETL job in Account B assumes the IAM role GlueETLRoleB to read objects from the bucket.
Current setup:
- The S3 bucket policy in Account A allows s3:GetObject to arn:aws:iam::<AccountB>:role/GlueETLRoleB.
- GlueETLRoleB has an IAM policy that allows kms:Decrypt on the CMK ARN in Account A.
The Glue job fails with the following error:
AccessDeniedException: User: arn:aws:sts::<AccountB>:assumed-role/GlueETLRoleB/... is not authorized to perform: kms:Decrypt on resource: arn:aws:kms:us-east-1:<AccountA>:key/<key-id> because no resource-based policy allows the kms:Decrypt action
Which change will fix the root cause with the MINIMAL change while keeping SSE-KMS enabled?
Options:
A. Grant Lake Formation SELECT on the database to GlueETLRoleB
B. Add GlueETLRoleB to the CMK key policy in Account A
C. Switch the bucket to SSE-S3 encryption
D. Add kms:Decrypt to the S3 bucket policy in Account A
Best answer: B
Explanation: The failure occurs at AWS KMS, not S3 or Glue, because the CMK in Account A does not trust the cross-account role. For cross-account decryption, KMS authorization must be granted by the CMK’s resource-based key policy (or a KMS grant) in the key-owning account. Updating the key policy to allow the Account B role resolves the AccessDenied with minimal change.
Symptom: the Glue job can reach S3 but fails with kms:Decrypt AccessDenied stating that no resource-based policy allows the action.
Root cause: in cross-account scenarios, an IAM policy on the caller role is not sufficient by itself for KMS. The CMK in Account A must also allow the principal from Account B via the CMK key policy (or a KMS grant). Without that trust, KMS denies decryption even if S3 access is allowed.
Fix: update the CMK key policy in Account A to allow arn:aws:iam::<AccountB>:role/GlueETLRoleB (or an approved Account B principal) to use the key for required actions such as kms:Decrypt (and commonly kms:GenerateDataKey* for SSE-KMS workflows).
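A minimal sketch of the key-policy statement to add, using the role ARN from the scenario (in practice you merge the statement into the existing key policy rather than replace it):

```python
import json
import boto3

kms = boto3.client("kms", region_name="us-east-1")
key_id = "arn:aws:kms:us-east-1:<AccountA>:key/<key-id>"  # placeholder ARN from the scenario

statement_for_account_b = {
    "Sid": "AllowGlueETLRoleBToDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::<AccountB>:role/GlueETLRoleB"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],  # add kms:GenerateDataKey* if the role also writes
    "Resource": "*",
}

# Fetch the current policy, append the new statement, and put the merged policy back.
current = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
current["Statement"].append(statement_for_account_b)
kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(current))
```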
Key takeaway: cross-account KMS access is controlled by the key policy/grants in the key-owning account.
Topic: Data Ingestion and Transformation
A data platform team deploys the same CloudFormation stack to dev and prod to create an AWS Glue connection used by ingestion jobs. The team must keep environment-specific secrets out of source control and avoid passing secret values through deployment parameters/logs.
Exhibit: CloudFormation snippet
Parameters:
Env: {Type: String}
DbPassword: {Type: String, Default: "P@ssw0rd"}
Resources:
GlueConn:
Type: AWS::Glue::Connection
Properties:
ConnectionInput:
ConnectionProperties: {PASSWORD: !Ref DbPassword}
Based on the exhibit, what is the best next step?
Options:
A. Replace the password with a Secrets Manager dynamic reference per environment
B. Store the password in a CloudFormation Mapping keyed by Env
C. Base64-encode the password in the template default value
D. Keep the parameter but set NoEcho: true and pass the value at deploy time
Best answer: A
Explanation: The exhibit shows a plaintext default password and uses it directly for the Glue connection, which violates safe environment-specific configuration practices. Using a Secrets Manager dynamic reference lets CloudFormation retrieve the secret at deploy time without embedding it in the template or providing it as a parameter value.
The core issue is secret handling in infrastructure as code. In the exhibit, DbPassword is defined with a plaintext Default: "P@ssw0rd" and then used as PASSWORD: !Ref DbPassword, which would place a secret in source control and encourage reuse across environments.
A better pattern is to create a separate AWS Secrets Manager secret for each environment (for example, one for dev and one for prod) and reference it from the template using a CloudFormation dynamic reference (resolved at deploy time). This keeps secrets out of the template and avoids passing secret values through stack parameters or build logs.
Key takeaway: the exhibit’s plaintext Default value is the signal to move the password into a managed secret and reference it dynamically.
- Base64-encoding or a Mapping still embeds the secret in the template, which is the same problem as the plaintext Default shown.
- NoEcho still passes a secret since the value must be provided as a parameter value, which the requirement says to avoid.
Topic: Data Ingestion and Transformation
A data ingestion Lambda function uncompresses large objects from Amazon S3, performs transformations, and writes results back to S3. The function’s local storage is insufficient, so the team is considering mounting Amazon EFS.
Which statements about using additional storage with AWS Lambda are FALSE/UNSAFE? (Select THREE.)
Options:
A. Mount an Amazon EBS volume directly to a Lambda function.
B. EFS throughput can bottleneck; optimize I/O and choose throughput mode.
C. A Lambda function can use EFS without being in a VPC.
D. EFS-mounted files have the same latency as Lambda /tmp.
E. Increase Lambda ephemeral storage when you only need temporary space.
F. EFS allows multiple invocations to share files; manage concurrency.
Correct answers: A, C and D
Explanation: When Lambda needs more disk than local ephemeral storage, it can mount Amazon EFS, but EFS access is over the network and has different performance characteristics. EFS also requires the Lambda function to run in a VPC to reach EFS mount targets. Lambda cannot mount EC2-style block storage such as Amazon EBS.
Lambda has two common ways to handle “more disk” needs: increase Lambda ephemeral storage (for scratch space that is local to the execution environment) or mount Amazon EFS (for shared, persistent file storage). EFS is an NFS file system accessed over the network via VPC mount targets, so I/O is generally higher-latency than local /tmp and can be constrained by EFS throughput and access patterns (for example, many small reads).
Operationally, using EFS with Lambda requires VPC configuration (subnets and security groups) so the function can connect to EFS. Also, because EFS is shared, concurrent invocations can contend for the same files, so coordination (naming, locking, idempotency) may be needed. The key takeaway is: choose ephemeral storage for fast temporary scratch, and EFS for shared/persistent storage with network performance tradeoffs.
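As a rough illustration of the two storage approaches discussed above (function name, sizes, and the EFS access point ARN are placeholders):

```python
import boto3

lam = boto3.client("lambda")

# Option 1: more local scratch space; /tmp can grow up to the configured size.
lam.update_function_configuration(
    FunctionName="ingest-uncompress",      # hypothetical function name
    EphemeralStorage={"Size": 4096},       # MB; default is 512, maximum is 10240
)

# Option 2: shared, persistent storage; requires the function to run in a VPC
# that can reach the EFS mount targets.
lam.update_function_configuration(
    FunctionName="ingest-uncompress",
    FileSystemConfigs=[{
        "Arn": "arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-0123456789abcdef0",
        "LocalMountPath": "/mnt/ingest",
    }],
)
```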
Topic: Data Operations and Support
When troubleshooting incorrect results in Amazon Athena or Amazon Redshift queries (unexpected row counts, missing rows, or day-boundary mismatches), which TWO statements are false or unsafe?
(Select TWO.)
Options:
A. Check join cardinality; 1-to-many inflates counts
B. Convert to local time zone before applying DATE() filters
C. A WHERE filter on the right table keeps LEFT JOIN unmatched rows
D. Athena timestamps are time zone–naive; store offsets/UTC
E. Normalize timestamps to UTC before daily aggregates
F. Adding DISTINCT is a safe fix for join duplicates
Correct answers: C and F
Explanation: Common analytics errors come from JOIN/filter interactions, join cardinality changes, and inconsistent handling of timestamps across time zones. The unsafe statements are the ones that incorrectly describe LEFT JOIN behavior with filters and that treat DISTINCT as a generally safe “fix” for duplicates. Normalizing and converting timestamps deliberately, and validating join cardinality, are reliable troubleshooting steps.
At a high level, troubleshoot query correctness by validating three things: join semantics, filter placement, and time handling. With LEFT JOINs, predicates applied after the join can change which rows survive; filtering on columns from the right side in a WHERE clause removes the NULL-extended rows and often defeats the purpose of the LEFT JOIN (move such predicates into the ON clause when you intend to preserve unmatched rows).
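A small Athena-flavored sketch of the difference (table and column names are made up; the boto3 call shows how such a query is typically submitted):

```python
import boto3

# Keeps unmatched orders: the right-table predicate lives in the ON clause.
preserve_unmatched = """
SELECT o.order_id, r.refund_id
FROM orders o
LEFT JOIN refunds r
  ON o.order_id = r.order_id AND r.status = 'approved'
"""

# Silently drops unmatched orders: the WHERE clause removes NULL-extended rows.
drops_unmatched = """
SELECT o.order_id, r.refund_id
FROM orders o
LEFT JOIN refunds r ON o.order_id = r.order_id
WHERE r.status = 'approved'
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=preserve_unmatched,
    QueryExecutionContext={"Database": "analytics"},                       # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/"},  # hypothetical bucket
)
```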
Time issues often look like “missing” or “extra” records around midnight; avoid mixing local timestamps and UTC by standardizing storage (commonly UTC) and converting explicitly before bucketing or applying date-based filters. Finally, unexpected count inflation is frequently caused by a 1-to-many join; verify join keys and cardinality rather than masking symptoms.
A quick “dedupe with DISTINCT” can hide the real issue and can remove valid duplicates.
- A WHERE filter on right-table columns removes NULL-extended rows; use the ON clause when unmatched rows must be preserved.
- DISTINCT is not generally safe; it can drop valid duplicates and mask join problems.
- Converting timestamps deliberately (commonly to UTC) before DATE()/bucketing prevents day-boundary errors.
Topic: Data Store Management
Which statement is INCORRECT about using technical catalogs vs business catalogs in an AWS data platform?
Options:
A. Technical catalogs store schema, partitions, and S3 locations for queries.
B. Business catalog entries can support governed sharing via Lake Formation.
C. AWS Glue Data Catalog is primarily a business glossary and approval tool.
D. Business catalogs add glossary, ownership, and request/approval workflows.
Best answer: C
Explanation: AWS Glue Data Catalog is designed to register technical metadata such as table schemas, partitions, and data locations for services like AWS Glue and Amazon Athena. Business catalogs focus on business context (glossary, ownership) and enable governance and data sharing workflows such as access requests and approvals.
A technical catalog is optimized for compute engines: it tracks datasets as technical objects (databases/tables), including schemas, partition keys, and physical locations (for example, S3 prefixes) so ETL and query services can discover and read data consistently. A business catalog sits above that layer to help people find and safely use data: it adds business-friendly descriptions, glossary terms, ownership/stewardship, classifications, and processes for requesting and approving access. In AWS, governed sharing is typically enforced with controls such as Lake Formation permissions, while a business catalog can provide the workflow and context to route access requests to the right owners and document approved sharing.
Key takeaway: Glue Data Catalog is for technical metadata; business catalogs enable governance-oriented discovery and sharing workflows.
Topic: Data Ingestion and Transformation
A data pipeline uses AWS Step Functions to orchestrate a Transform Task state that invokes an AWS Lambda function once per incoming S3 object. If the Lambda fails, Step Functions should retry transient failures and then route the original input to an Amazon SQS dead-letter queue (DLQ) for investigation.
Constraints:
- A failed object must reach the DLQ within 180 seconds.
- Retries wait a fixed IntervalSeconds between attempts.
- A Catch sends the input to the SQS DLQ.
Which configuration meets the requirements?
Options:
A. MaxAttempts 4, IntervalSeconds 20, Catch to SQS DLQ
B. MaxAttempts 4, IntervalSeconds 30, Catch to SQS DLQ
C. MaxAttempts 4, IntervalSeconds 20, no Catch
D. MaxAttempts 5, IntervalSeconds 20, Catch to SQS DLQ
Best answer: A
Explanation: To guarantee a bad object reaches the DLQ within 180 seconds, add up all execution time across attempts plus the retry wait time between attempts. The only viable choice both stays under the 3-minute bound and includes a Catch path that sends the original input to an SQS DLQ after retries are exhausted.
This is a resiliency pattern decision: use Step Functions Retry for transient Lambda failures, then use Catch to divert poison-pill inputs to a DLQ for later analysis.
Compute worst-case time to reach the DLQ as:
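One way to frame the check is total attempt time plus total wait time:
\[ T_{\text{worst}} = N_{\text{attempts}} \times t_{\text{attempt}} + (N_{\text{attempts}} - 1) \times \text{IntervalSeconds} \]
where t_attempt is the per-attempt Lambda duration bound from the constraints; with option A's retry settings this works out to the 160 seconds used below.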
Since 160 seconds is within 180 seconds and the Catch routes the failed input to SQS, this configuration satisfies both the SLA and fault-tolerance requirements.
Topic: Data Ingestion and Transformation
An hourly AWS Glue ETL job reads JSON files from s3://dl/raw/orders/ and writes Parquet to s3://dl/curated/orders/ partitioned by order_date (append mode). The job is occasionally retried or manually re-run after transient failures. After these re-runs, analysts report doubled counts in Athena, and the duplicates match entire raw files being ingested again.
Exhibit: CloudWatch log excerpt
INFO Job bookmark option: job-bookmark-disable
INFO Reading s3://dl/raw/orders/ingest_date=2026-02-25/
INFO Processed files: 12,480
...
ERROR S3Exception: Service Unavailable (503)
INFO Retrying job run
Which change will fix the root cause with the least disruption while still allowing late-arriving files to be picked up in later runs?
Options:
A. Increase the Glue job’s number of workers to reduce runtime
B. Add a DynamoDB table to store every processed S3 object key
C. Change the Glue job to overwrite the entire curated table each run
D. Enable AWS Glue job bookmarks for the S3 source
Best answer: D
Explanation: The duplicates are caused by a stateless ingestion pattern that re-reads the same S3 inputs on job retries or manual re-runs, then appends them again to the curated dataset. Enabling AWS Glue job bookmarks makes the ingestion stateful by persisting what has already been processed. This preserves correctness at scale while still allowing late-arriving files to be processed when they first appear.
Symptom: Athena shows doubled counts after Glue job retries/re-runs, and duplicates align to whole input files.
Root cause: the job is stateless (job-bookmark-disable) and reads a broad S3 prefix each run; when a run is retried or repeated, the same raw objects are ingested again and appended, creating duplicates.
Fix: enable AWS Glue job bookmarks for the S3 source so Glue persists ingestion state (which files/partitions were already processed) and skips them on subsequent runs, while still ingesting any newly arrived (including late) files when they appear. Key takeaway: use stateful ingestion (bookmarks/checkpoints) when correctness requires “process each input once” across failures and replays.
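A minimal Glue PySpark sketch of the bookmark-enabled pattern (paths match the scenario; the transformation_ctx names are illustrative, and the job must be started with --job-bookmark-option job-bookmark-enable):

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)   # loads the bookmark state for this job

orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://dl/raw/orders/"], "recurse": True},
    format="json",
    transformation_ctx="orders_source",   # bookmark progress is tracked per transformation_ctx
)

glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://dl/curated/orders/", "partitionKeys": ["order_date"]},
    format="parquet",
    transformation_ctx="orders_sink",
)

job.commit()   # persists the bookmark, so retries and re-runs skip files already processed
```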
Topic: Data Ingestion and Transformation
A data engineer must implement a daily transformation for a data lake on AWS.
Exhibit: Ingestion ticket
1) Source: s3://datalake/raw/orders/ (JSON gzip), ~5 TB/day
2) Record example:
{"order_id":"123","ts":"2026-02-20T12:01:02Z",
"customer":{"id":"c7","tier":"gold"},
"items":[{"sku":"A1","qty":2},{"sku":"B9","qty":1}]}
3) Target: one row per item, Parquet, partitioned by dt=YYYY-MM-DD
4) Columns may appear/disappear between days (schema drift)
Which option is the most appropriate language/tool choice to meet the ticket requirements?
Options:
A. AWS Lambda function written in Java
B. AWS Glue ETL job using PySpark (Python)
C. Bash script with jq to flatten JSON
D. Amazon Athena SQL view over the raw JSON
Best answer: B
Explanation: An AWS Glue Spark job written in PySpark is best suited for parsing nested JSON, exploding arrays into multiple rows, and writing partitioned Parquet at scale. The exhibit indicates ~5 TB/day (line 1), an items array that must become one row per item (lines 2–3), and schema drift (line 4), all of which are common Glue ETL use cases.
The core decision is choosing a language/tool that can reliably transform large volumes of semi-structured data into an optimized lake format. The exhibit indicates high throughput (~5 TB/day on line 1), nested JSON with an array (items on line 2) that must be flattened to “one row per item” (line 3), and schema drift (line 4).
A Glue ETL job using Spark with PySpark is appropriate because it can:
- parse large volumes of compressed, nested JSON in parallel,
- explode the items array so each element becomes its own row,
- apply explicit mappings to tolerate columns that appear or disappear between days, and
- write Parquet partitioned by dt=YYYY-MM-DD for efficient Athena queries.
A key takeaway is that set-based SQL-only approaches can work for some transforms, but Spark-based ETL is typically the better fit when nested data, schema drift, and very large daily volumes are all present.
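A compact PySpark sketch of the core transform (field names follow the ticket's record example; the exact write mode and any drift handling would depend on the broader job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-flatten").getOrCreate()

# Spark reads gzip-compressed JSON transparently.
raw = spark.read.json("s3://datalake/raw/orders/")

flat = (
    raw.withColumn("item", F.explode("items"))            # one output row per array element
       .select(
           "order_id",
           F.col("customer.id").alias("customer_id"),
           F.col("customer.tier").alias("customer_tier"),
           F.col("item.sku").alias("sku"),
           F.col("item.qty").alias("qty"),
           F.to_date("ts").alias("dt"),                    # partition key dt=YYYY-MM-DD
       )
)

flat.write.mode("append").partitionBy("dt").parquet("s3://datalake/curated/orders/")
```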
Topic: Data Security and Governance
A team runs AWS Glue ETL jobs every 15 minutes that read from an Amazon Aurora PostgreSQL database by using a JDBC connection. The database password must be rotated every 30 days, and the team wants the rotation to require no job updates and to reduce operational risk (avoid custom scripts and manual coordination).
Which approach best meets these requirements?
Options:
A. Store the password in the Glue job parameters and rotate it by updating and redeploying the job on a schedule
B. Store the credentials in AWS Secrets Manager, enable Aurora hosted rotation, and have the Glue connection reference the secret ARN
C. Store the credentials as an SSM Parameter Store SecureString and rotate it monthly with an EventBridge rule and Lambda
D. Create an IAM user for the ETL jobs, store the access keys in Secrets Manager, and rotate the access keys every 30 days
Best answer: B
Explanation: Using AWS Secrets Manager with Aurora hosted rotation provides a managed, low-touch rotation mechanism and keeps the Glue job configuration stable by referencing the secret rather than a specific password value. The jobs retrieve the current credential at runtime via the secret, avoiding manual updates and reducing the chance of rotation-related outages.
The deciding factor is using a managed, integrated rotation mechanism that automatically updates the database credential and lets clients continue to reference the same secret.
With AWS Secrets Manager you can:
- store the Aurora credentials as a secret and enable hosted rotation on a 30-day schedule,
- have the Glue connection (or job) reference the secret ARN rather than an embedded password, and
- let jobs fetch credentials at runtime so each run simply reads the current AWSCURRENT value.
This reduces operational risk by eliminating custom rotation code and preventing missed redeploys/coordination errors during password changes. A custom rotation built around Parameter Store can work, but it shifts rotation reliability and failure handling onto your team.
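At runtime the job just reads the current secret version; a minimal sketch, assuming a hypothetical secret name and the standard hosted-rotation JSON payload:

```python
import json
import boto3

secrets = boto3.client("secretsmanager")

resp = secrets.get_secret_value(SecretId="prod/aurora/etl-user")   # hypothetical secret name
creds = json.loads(resp["SecretString"])   # hosted rotation stores host, port, username, password, dbname

jdbc_url = f"jdbc:postgresql://{creds['host']}:{creds.get('port', 5432)}/{creds['dbname']}"
# Pass jdbc_url, creds["username"], and creds["password"] to the JDBC read; after each
# 30-day rotation the same call simply returns the new AWSCURRENT values.
```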
Topic: Data Operations and Support
A daily AWS Glue ETL job reads raw data from Amazon S3 and writes Parquet files to an S3 curated bucket. The job began failing right after the curated bucket was changed to use SSE-KMS with a customer managed key (CMK).
A data engineer runs a CloudWatch Logs Insights query on the Glue job log group and sees the following recurring error.
message (top)
------------------------------------------------------------
AccessDeniedException: not authorized to perform kms:GenerateDataKey
on arn:aws:kms:us-east-1:111122223333:key/abcd-...
stats
------------------------------------------------------------
count(*) = 31 (last 7 days)
Which action will fix the root cause with the LEAST change?
Options:
A. Disable SSE-KMS on the curated bucket and use SSE-S3 instead
B. Increase the Glue job timeout and number of DPUs to avoid intermittent failures
C. Add an S3 gateway VPC endpoint for the curated bucket path
D. Grant the Glue job role permission to use the CMK for encrypt operations
Best answer: D
Explanation: The recurring failure in Logs Insights is an authorization error against AWS KMS (kms:GenerateDataKey) that started immediately after enabling SSE-KMS. That indicates the ETL job can reach S3, but cannot use the CMK required to encrypt new objects. The minimal fix is to allow the Glue job role (and the CMK key policy) to use the key for encryption/data key generation.
Symptom: Glue job runs now fail consistently with AccessDeniedException for kms:GenerateDataKey, and Logs Insights shows the same message repeating over multiple days.
Root cause: When an S3 bucket uses SSE-KMS, the writer’s identity must be allowed to use the CMK. The Glue job role lacks the required KMS permissions (and/or the CMK key policy doesn’t trust that role), so S3 cannot obtain a data key on the role’s behalf during PutObject.
Fix: Update permissions to allow the Glue job role to use the CMK for writes (commonly kms:GenerateDataKey, kms:Encrypt, and kms:Decrypt as needed) and ensure the CMK key policy permits the role.
This resolves the authorization failure without changing the storage design or job sizing.
- A networking fix such as an S3 gateway endpoint would address connectivity timeouts, not an AccessDenied to KMS.
Topic: Data Security and Governance
A company collects CloudTrail, VPC Flow Logs, and application logs from 20 AWS accounts (same Region). Logs arrive continuously and are stored in Amazon S3 as newline-delimited JSON, producing millions of small objects per day (~12 TB/day). The security team uses Amazon Athena for audit queries that filter by event_date, account_id, and awsRegion, and they have a 30-minute query SLA. Query costs and runtimes are increasing due to large scan sizes and excessive file counts.
Which TWO actions will best improve Athena cost and performance while keeping the logs suitable for audit? (Select TWO.)
Options:
A. Load the logs into DynamoDB and query with PartiQL for audits
B. Increase AWS Glue crawler frequency to update partitions every 5 minutes
C. Use Amazon EMR Spark to compact and write partitioned Parquet to S3
D. Use Kinesis Data Firehose dynamic partitioning with JSON-to-Parquet conversion
E. Keep JSON in S3 and optimize by creating Athena views only
F. Transition all logs to S3 Glacier Deep Archive after 1 day
Correct answers: C and D
Explanation: Athena performance and cost are dominated by data scanned and file/partition layout. Converting high-volume JSON logs into compact, partitioned Parquet reduces scanned bytes for selective audit queries and avoids excessive overhead from millions of small objects. Using EMR for batch compaction and Firehose for streaming delivery are scalable integrations that address both problems.
For large-scale audit logging in S3, the main levers for Athena are: fewer/larger objects (avoid the small-files problem), columnar formats (Parquet/ORC), and partitions aligned to common filters (such as date/account/Region).
A practical pattern is to keep the original raw logs for audit purposes, and produce a curated/query-optimized copy:
- For the streaming path, use Kinesis Data Firehose dynamic partitioning with JSON-to-Parquet record format conversion so newly arriving data lands partitioned and columnar.
- For existing and batch data, use Amazon EMR Spark to compact the millions of small JSON objects into larger, partitioned Parquet files aligned to the common filters (event_date, account_id, awsRegion).
This reduces total bytes scanned per query and improves runtime by pruning partitions and reading only needed columns, while remaining compatible with audit workflows that rely on S3 + Athena + the Glue Data Catalog.
Topic: Data Store Management
Select TWO statements that are true about configuring Amazon Redshift tables to match common access patterns (joins and filters).
Options:
A. DISTSTYLE EVEN is always best for join-heavy workloads
B. A well-chosen DISTKEY can reduce data redistribution for joins
C. DISTSTYLE ALL is recommended for large fact tables to speed joins
D. INTERLEAVED sort keys are best when filters mostly use the first column
E. Declaring a PRIMARY KEY in Redshift enforces uniqueness automatically
F. A compound sort key helps when filtering by a leading column range
Correct answers: B and F
Explanation: In Redshift, distribution choices primarily affect how much data must move across nodes during joins, while sort keys primarily affect how efficiently Redshift can skip reading irrelevant blocks. Choosing a DISTKEY aligned to frequent join keys and a compound sort key aligned to common range predicates are two standard, high-impact optimizations.
Redshift performance is heavily influenced by where data lives (distribution) and how it is ordered on disk (sort keys). A DISTKEY is most useful when many large joins occur on a single, high-cardinality column; colocating rows that join together reduces network redistribution. Sort keys help Redshift prune blocks using zone maps; a compound sort key is best when queries commonly apply range/equality filters on the leading sort column.
Practical guidance:
- Choose a DISTKEY on a frequent join column to reduce shuffle.
- Choose a compound SORTKEY when filters typically start with the same first column.
The wrong choices either over-replicate data, misapply interleaved sorting, or assume constraints are enforced when they are informational only.
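For illustration, a minimal DDL sketch (table, column, and connection details are invented; submitted here through the redshift_connector driver):

```python
import redshift_connector

conn = redshift_connector.connect(
    host="analytics-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    database="dev", user="etl", password="***",
)
cur = conn.cursor()

# Colocate fact and dimension rows on the join key, and sort by the common filter column.
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id      BIGINT,
        customer_id  BIGINT,
        sale_date    DATE,
        amount       DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    COMPOUND SORTKEY (sale_date);
""")
conn.commit()
```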
Topic: Data Security and Governance
Which THREE statements are true about AWS Key Management Service (AWS KMS) key concepts, key types, and key management models for encrypting and decrypting data? (Select THREE.)
Options:
A. Customer managed KMS keys require importing your own key material
B. With an AWS managed key, you cannot edit the key policy or directly manage its lifecycle
C. KMS Encrypt/Decrypt use symmetric KMS keys, and key material stays protected in KMS
D. Envelope encryption commonly uses GenerateDataKey and stores the encrypted data key with the ciphertext
E. Asymmetric KMS keys can be used with GenerateDataKey for envelope encryption
F. A key policy is optional because IAM policies alone can grant KMS key access
Correct answers: B, C and D
Explanation: KMS typically encrypts data using symmetric KMS keys and supports envelope encryption by generating and protecting data keys. You choose between AWS managed keys and customer managed keys based on how much control you need over permissions and lifecycle. Understanding which APIs apply to symmetric vs. asymmetric keys prevents incorrect designs.
The core concepts are symmetric encryption with KMS, envelope encryption, and key management responsibility. For most data-engineering workloads, KMS Encrypt/Decrypt operations are performed with symmetric KMS keys, while applications encrypt large payloads locally using data keys.
A common envelope encryption flow is:
1. Call GenerateDataKey to get a plaintext data key plus an encrypted data key.
2. Encrypt the payload locally with the plaintext data key, store the encrypted data key with the ciphertext, and discard the plaintext key.
3. Later, call Decrypt on the encrypted data key to re-obtain the plaintext key for decryption.
AWS managed keys are managed by AWS (including key policy and lifecycle), while customer managed keys provide more direct control. A common pitfall is assuming asymmetric keys work with data-key generation or that IAM alone can replace the KMS key policy.
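A minimal boto3 sketch of that flow (the key alias is a placeholder; Fernet is just one convenient local cipher for the example):

```python
import base64
import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")

# 1) Get a data key under the CMK: a plaintext copy for local use, an encrypted copy for storage.
dk = kms.generate_data_key(KeyId="alias/data-platform", KeySpec="AES_256")

# 2) Encrypt the payload locally, keep the encrypted data key next to the ciphertext,
#    and drop the plaintext key from memory as soon as possible.
ciphertext = Fernet(base64.urlsafe_b64encode(dk["Plaintext"])).encrypt(b"large payload ...")
stored_encrypted_key = dk["CiphertextBlob"]

# 3) Later: ask KMS to decrypt the stored data key, then decrypt the payload locally.
plaintext_key = kms.decrypt(CiphertextBlob=stored_encrypted_key)["Plaintext"]
payload = Fernet(base64.urlsafe_b64encode(plaintext_key)).decrypt(ciphertext)
```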
- KMS data protection uses symmetric keys for Encrypt/Decrypt, and KMS does not hand you the CMK key material in plaintext.
- GenerateDataKey plus storing the encrypted data key is the textbook envelope encryption pattern for large objects.
- GenerateDataKey is for symmetric keys, the key policy is always part of KMS authorization, and importing key material is optional (not required) for customer managed keys.
Topic: Data Security and Governance
A company runs a multi-account data lake (Amazon S3, AWS Glue, IAM) and must support governance investigations. Investigators need to view a 1-year history of resource configuration changes across all accounts and Regions from a central audit account. Evidence must be tamper-resistant, and investigators must have read-only access.
Which TWO actions should you AVOID because they violate these requirements? (Select TWO.)
Options:
A. Deliver AWS Config history to per-account S3 buckets that local admins can delete from
B. Disable AWS Config recording for IAM and S3 after capturing an initial baseline snapshot
C. Use AWS Config resource timeline and AWS Config advanced queries in the audit account
D. Use AWS Config managed rules to continuously evaluate key governance controls
E. Enable AWS Config in all accounts and Regions and aggregate data into the audit account
F. Set AWS Config retention to at least 1 year for configuration items
Correct answers: A and B
Explanation: AWS Config supports governance investigations by recording configuration items (changes over time) and allowing centralized querying and timelines through an aggregator. The approach must preserve a complete 1-year history for key resource types and protect the evidence from deletion or modification by workload administrators. Actions that weaken evidence integrity or stop change recording undermine compliance investigations.
AWS Config is designed to record and retain resource configuration changes (configuration items) and provide investigation tooling such as resource timeline and queries. To support governance investigations across multiple accounts/Regions, you typically enable AWS Config recorders everywhere, deliver the data to a controlled destination, and use an AWS Config aggregator in a central audit account for cross-account visibility.
To meet the stated requirements, the solution must:
- record configuration changes continuously in every account and Region (including IAM and S3 resource types) rather than relying on a one-time baseline snapshot,
- retain configuration items for at least 1 year,
- aggregate the data into the central audit account where investigators have read-only access, and
- deliver evidence to a destination that workload administrators cannot delete from or modify.
The key takeaway is that governance investigations depend on both completeness (continuous recording) and integrity (protected, centralized retention).
Topic: Data Security and Governance
A company uses AWS Lake Formation with the AWS Glue Data Catalog to govern a data lake in Amazon S3. Data is queried with Amazon Athena by analysts from Finance and Marketing. The company wants attribute-based access control using LF-tags such as department and classification so access can scale without per-table grants, and access to the curated S3 zone must not bypass Lake Formation.
Which THREE actions should you AVOID?
Options:
A. Grant database-level SELECT to a shared analyst role
B. Apply LF-tags to tables/columns and grant permissions using LF-tag policies
C. Remove IAMAllowedPrincipals access and require Lake Formation permissions
D. Add an S3 bucket policy granting analysts s3:GetObject to curated
E. Set new databases/tables to “Use only IAM access control”
F. Automate LF-tagging of newly crawled tables before analysts query them
Correct answers: A, D and E
Explanation: LF-tag ABAC works only when Lake Formation is the enforcement point and access is granted through LF-tag policies (or other Lake Formation grants) that align to the desired attributes. Any configuration that enables direct S3 access or shifts authorization to IAM alone can bypass governance. Also, overly broad Lake Formation grants are not constrained by tags because tags don’t act as denies.
The core idea of LF-tag ABAC is: tag Data Catalog resources (databases, tables, columns) with LF-tags, then grant permissions to IAM principals using LF-tag policies so access scales as new resources inherit tags. For this to work, Lake Formation must be the gatekeeper for both catalog access and underlying S3 data access.
Avoid configurations that break one of these rules:
- Granting direct S3 access (for example, a bucket policy that gives analysts s3:GetObject on the curated prefix) bypasses Lake Formation enforcement.
- Setting new databases/tables to “Use only IAM access control” shifts authorization back to IAM and away from LF-tag governance.
- Handing out broad grants (such as database-level SELECT) expecting LF-tags to “limit” them does not work; LF-tags grant access but do not restrict existing permissions.
The safe pattern is to tag resources and grant access via LF-tag policies, keeping S3 access aligned to Lake Formation governance.
Topic: Data Store Management
Select TWO statements that are true about choosing a lakehouse table format (for example, Apache Iceberg) versus traditional data warehouse tables (for example, Amazon Redshift managed tables) when considering transactionality, schema evolution, and interoperability.
Options:
A. Redshift tables provide open interoperability across Spark and Presto.
B. Lakehouse formats cannot support row-level updates on S3.
C. Iceberg tables on S3 can provide ACID and schema evolution.
D. Redshift managed tables are directly queryable by Athena in place.
E. Iceberg metadata enables multiple engines to share the same table.
F. Glue crawlers automatically create Iceberg tables from Parquet folders.
Correct answers: C and E
Explanation: Lakehouse table formats such as Apache Iceberg add transactional table metadata on top of files in Amazon S3, enabling ACID-style commits and controlled schema evolution. Because the format is open, multiple query engines can interoperate against the same table definition rather than being tied to a single warehouse engine.
The key distinction is where table semantics live. Traditional warehouse tables (such as Amazon Redshift managed tables) implement transactions and schema enforcement inside the warehouse engine and storage layer, which is optimized for that engine but not designed for other analytics engines to query the same tables directly.
Lakehouse table formats (such as Apache Iceberg) store data files in Amazon S3 and maintain table state (snapshots/manifests/metadata) that provides:
- ACID-style commits and consistent reads over files in S3,
- controlled schema evolution tracked in table metadata, and
- an open specification that multiple engines (for example, Spark, Trino/Presto, and Athena) can use to share the same table.
If you primarily need cross-engine access on S3 with governed evolution and consistent reads, a lakehouse table format is a strong fit compared with engine-specific warehouse tables.
Topic: Data Ingestion and Transformation
You are selecting programming languages/tools for ingestion and transformation tasks in an AWS data lake (Amazon S3, AWS Glue, Amazon Athena, Amazon EMR). Which THREE statements are INCORRECT or unsafe guidance?
Options:
A. Use Python in AWS Glue to transform data and call AWS APIs with boto3.
B. EMR Serverless for Spark supports only PySpark, not Scala or Java.
C. Use Scala or Java for Spark ETL needing typed APIs or custom Spark code.
D. Use Athena SQL to call REST APIs during a query to enrich each row.
E. Use Bash for complex nested JSON transforms and schema evolution at scale.
F. Use SQL in Athena to filter and aggregate partitioned Parquet in S3.
Correct answers: B, D and E
Explanation: Choose languages based on what the managed service actually supports and what the task requires. Bash is primarily for orchestration and shell-level automation, not large-scale structured transformations. Athena SQL runs inside the query engine and does not enrich rows by calling arbitrary REST APIs. Spark on EMR Serverless is not limited to PySpark-only workloads.
A good rule is to match the language to the execution engine and the kind of work being done. SQL is the right tool when you’re using a SQL engine (Athena/Redshift) for set-based transforms and analytics. Python is common for AWS-native ETL and integration work (Glue scripts, lightweight transforms, SDK calls). Scala/Java are appropriate when you need deeper Spark control, typed APIs, or existing JVM libraries.
The unsafe guidance is:
- claiming EMR Serverless Spark supports only PySpark, when it also runs Scala and Java Spark workloads,
- expecting Athena SQL to call REST APIs per row during a query to enrich results, and
- using Bash (for example, with jq) for complex nested JSON transforms and schema evolution at scale.
The key takeaway is to align the language with the service runtime and transformation needs.
Topic: Data Ingestion and Transformation
You are orchestrating a serverless data ingestion pipeline with AWS Step Functions that invokes AWS Lambda functions and starts AWS Glue jobs.
Which TWO statements are false or unsafe design assumptions for this workflow?
Options:
A. Express Workflows are exactly-once, so idempotency and deduplication are unnecessary
B. Step Functions supports configuring a DLQ on the state machine for failed executions
C. Use Retry/Catch with backoff to handle transient task failures
D. Prefer Glue/other service integrations for work that exceeds Lambda runtime limits
E. Tasks should be idempotent because retries can run the same step multiple times
F. Lambda reserved concurrency can cap parallelism and cause throttling when exceeded
Correct answers: A and B
Explanation: Step Functions does not provide a dead-letter queue that you attach to a state machine; you must explicitly model failure handling and routing. Also, Express Workflows can deliver at-least-once execution, so duplicate task attempts are possible and designs that skip idempotency are unsafe. The other statements describe common, recommended patterns for concurrency control and failure handling.
The key concepts are explicit failure handling in Step Functions, concurrency controls in Lambda, and at-least-once behaviors that require idempotent processing.
- Model failure handling explicitly: use Retry with backoff for transient errors and Catch to route failures (for example, to SQS/SNS/EventBridge), and alarm on execution failures.
- Make tasks idempotent, because retries and at-least-once delivery can run the same step more than once.
- Prefer Glue or other service integrations for work that exceeds Lambda runtime limits, and size Lambda reserved concurrency deliberately because it caps parallelism and causes throttling when exceeded.
The unsafe assumptions are the ones that expect built-in DLQ attachment and exactly-once behavior in Express Workflows.
Topic: Data Store Management
A team stores lakehouse tables (for example, Apache Iceberg, Delta Lake, or Apache Hudi) in Amazon S3 and wants to manage table lifecycle costs and performance.
Which THREE statements are true about tiering/compaction and managing snapshots and obsolete files for these tables?
Options:
A. Expiring old snapshots can enable deletion of unreferenced data files
B. Compaction rewrites many small files into fewer larger files
C. Deleting snapshot metadata immediately deletes all snapshot data files
D. Compaction removes the ability to time travel to recent versions
E. Deleting S3 data files outside the table can cause inconsistency
F. S3 Glacier tiering is a safe substitute for snapshot expiration
Correct answers: A, B and E
Explanation: Lakehouse tables rely on metadata (snapshots/commits) that reference immutable data files in S3. Compaction is a maintenance operation that reduces small files for better performance, while snapshot/retention management controls how much history is kept and which unreferenced files can be safely removed. Directly manipulating underlying S3 objects can break metadata consistency.
The core idea is that lakehouse tables separate table metadata (snapshots/commits and manifests) from the underlying data files in S3. Compaction improves performance by rewriting many small files into fewer larger files, but it should preserve logical table contents and snapshot semantics.
Snapshot expiration (or retention cleanup) reduces how much historical metadata is kept for time travel and can make older data files eligible for deletion only when they are no longer referenced by any retained snapshot. Because the metadata is the source of truth, deleting underlying S3 objects “out of band” (outside the table maintenance process) can leave the metadata pointing to missing files and cause failures or incorrect reads. The key takeaway is to use table-aware maintenance for compaction and cleanup instead of treating the table like unmanaged S3 folders.
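An Iceberg-flavored sketch of table-aware maintenance (catalog and table names are invented; maintenance procedures differ slightly between table formats):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "glue_catalog".
spark = SparkSession.builder.getOrCreate()

# Compaction: rewrite many small data files into fewer, larger ones.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'analytics.events')")

# Retention: expire older snapshots; unreferenced data files then become eligible for deletion.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'analytics.events',
        older_than => TIMESTAMP '2026-02-18 00:00:00'
    )
""")
```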
Topic: Data Operations and Support
A data engineering team lands daily vendor CSV files in an Amazon S3 raw/ prefix. The files frequently have inconsistent column names, leading/trailing spaces, and occasional missing required fields. The team wants an automated, low-code step that prepares the data for downstream transformation (for example, an AWS Glue ETL job) by standardizing the schema and isolating bad records, while publishing operational signals for monitoring.
Which TWO actions should the team take?
Options:
A. Use AWS Lake Formation permissions to prevent files with missing fields from being written to S3
B. Use an S3-triggered AWS Lambda function to parse and clean each CSV before writing to S3
C. Run a DataBrew data quality job to validate required fields and publish results/metrics for monitoring
D. Use Athena CTAS to overwrite the raw S3 prefix with cleaned CSV files
E. Configure an AWS Glue crawler to correct invalid values while inferring the schema
F. Create a Glue DataBrew recipe and run a scheduled recipe job to write standardized output to S3
Correct answers: C and F
Explanation: AWS Glue DataBrew is designed for low-code data preparation before transformation, using reusable recipes and automated jobs. A DataBrew recipe job standardizes columns and formats into a prepared zone in S3. A DataBrew data quality job validates required fields and produces results that can be monitored and used to drive remediation workflows.
The core idea is to introduce an automated data-prep layer before downstream ETL by using managed, low-code preparation capabilities. Glue DataBrew lets you define repeatable preparation logic (a recipe) to normalize column names, trim whitespace, cast data types, and output to a curated/prepared S3 prefix (often as Parquet for efficient downstream processing). To operationalize correctness, run a DataBrew data quality job (with rules for required fields and basic validity checks) and publish job outcomes/metrics so operations can monitor failures and route bad rows/files to a quarantine prefix for remediation.
This approach prepares data reliably without custom parsing code and integrates cleanly into scheduled or orchestrated processing.
Topic: Data Ingestion and Transformation
A company runs a nightly AWS Glue batch job that writes curated Parquet files to Amazon S3 for querying in Amazon Athena. The curated dataset contains 30 days of data, averaging 40 GB per day (assume 1 TB = 1,000 GB). Analysts run 50 Athena queries per day, and each query filters on exactly one event_date.
Currently, the data is written to a single S3 prefix (no partitions), so each query scans all 1.2 TB. Athena costs USD 5 per TB scanned. The Glue job is sometimes re-run for the same day, so the write must be idempotent and the load should be incremental.
Which ingestion configuration best meets these requirements, and what will the Athena query cost be per 30-day month? Round to the nearest dollar.
Options:
A. Partition by device_type, use bookmarks, overwrite partitions; USD 9,000/month
B. Write to one prefix, infer schema each run, append-only; USD 9,000/month
C. Write to event_date/hour partitions, append new files only; USD 300/month
D. Write Parquet to event_date=YYYY-MM-DD/, use bookmarks, overwrite partitions; USD 300/month
Best answer: D
Explanation: Partitioning the curated S3 layout by event_date lets Athena prune data to only the day being queried, reducing bytes scanned per query from 1.2 TB to 40 GB. With 50 queries per day over 30 days, monthly scanned TB becomes 60 TB, and at USD 5 per TB the monthly cost is USD 300. Using Glue job bookmarks enables incremental loads, and overwriting the target partition makes reruns idempotent.
The core idea is to align S3 partitioning with the most common query predicate (event_date) so Athena can do partition pruning, while configuring the batch job for incremental processing and idempotent writes.
Cost calculation (30-day month):
- With event_date partitioning, each query scans one day: 40 GB = 0.04 TB.
- Queries per month: 50 queries/day × 30 days = 1,500 queries.
- Data scanned per month: 1,500 × 0.04 TB = 60 TB.
- Monthly cost: 60 TB × USD 5/TB = USD 300.
Using Glue job bookmarks (or an equivalent high-water mark) supports incremental loads, and overwriting the event_date partition prevents duplicate data when a day is reprocessed.
- Partitioning by a column that rarely appears in the filter (such as device_type) won’t prune scans for date-filtered queries.
Topic: Data Ingestion and Transformation
A data engineer wants to reduce the cost of running transformation queries in Amazon Athena by avoiding unnecessary data scans. What factor does Athena primarily use to determine the cost of a query?
Options:
A. The amount of data scanned by the query
B. The total query runtime in seconds
C. The number of concurrent queries submitted
D. The number of rows returned in the result set
Best answer: A
Explanation: Amazon Athena query costs are primarily driven by how much data the query scans. Techniques like partitioning, column selection, and using columnar compressed formats reduce bytes scanned. Lower scanned bytes typically translates directly to lower Athena cost.
The key cost-optimization lever for Athena is reducing the amount of data scanned, because Athena pricing is based primarily on bytes scanned per query. To avoid unnecessary scans during transformations, structure data so queries can read less data (for example, partition by common filters such as date, select only required columns, and store data in columnar compressed formats like Parquet or ORC). These approaches reduce the bytes read from Amazon S3, which is what Athena uses to calculate query cost. In contrast, query runtime, rows returned, and concurrency are not the primary billing unit for Athena queries.
Topic: Data Store Management
A data engineering team is selecting between an Amazon Redshift provisioned cluster and Amazon Redshift Serverless for a new analytics platform. The platform runs interactive dashboards during business hours, sporadic ad hoc SQL from analysts, and occasional end-of-month reporting spikes.
Which THREE requirements best support choosing Amazon Redshift Serverless for this workload? (Select THREE.)
Options:
A. Require fixed monthly cost with pre-purchased capacity
B. Unpredictable spikes with long idle periods
C. Minimize capacity planning and cluster administration
D. Pay mainly for intermittent, ad hoc query usage
E. Run steady 24/7 ETL and reporting at constant load
F. Need manual WLM queue tuning for deterministic performance
Correct answers: B, C and D
Explanation: Amazon Redshift Serverless is designed for workloads with variable or unpredictable demand where automatic scaling and reduced administration are priorities. It is typically a better fit when usage is intermittent (including long idle periods) and teams want a pay-for-use model instead of managing and sizing a cluster.
The core decision is whether the workload is variable enough to benefit from on-demand scaling and usage-based pricing (Serverless), or steady enough to justify owning fixed capacity (provisioned).
If the workload is consistently busy and you want tighter cost/performance control, provisioned is usually the better match.
Topic: Data Store Management
A data lake on Amazon S3 is queried with Amazon Athena. The curated dataset currently uses daily partitions and is queried with a rolling 30-day window.
Requirements:
- Schema evolution is expected: new columns, renamed fields, and some type changes (for example, string to int). The team wants to minimize pipeline breakage.
Usage and pricing (assume constant for the month):
- Curated volume: 10 GB per day, queried over the rolling 30-day window.
- Query pattern: 3 queries per day, each scanning the full window.
- Athena pricing: USD 5 per TB scanned (1 TB = 1,000 GB).
- Cost target: monthly Athena spend under USD 200.
Which solution meets the schema-evolution requirement while also meeting the monthly Athena cost target?
Options:
A. Store curated data as JSON and rely on Athena schema-on-read
B. Store curated data as CSV and reject schema drift in Glue ETL
C. Use an Athena-managed Apache Iceberg table in Parquet
D. Use a partitioned Parquet Hive table with a daily Glue crawler
Best answer: C
Explanation: Apache Iceberg is designed for table-level schema evolution, including adding columns, renaming fields, and certain type changes, which reduces downstream breakage. With Parquet at 10 GB/day, the monthly Athena scan cost stays under the USD 200 target given the stated query pattern.
To minimize pipeline breakage during schema evolution (new columns, renamed fields, type changes), a table format that tracks schema and supports evolution operations is needed; Apache Iceberg provides this while still using Parquet files for efficient scans.
Cost check using the given query pattern and units:
\[ \begin{aligned} \text{GB scanned/query} &= 30\ \text{days} \times 10\ \text{GB/day} = 300\ \text{GB} \\ \text{TB/query} &= 300/1000 = 0.3\ \text{TB} \\ \text{Monthly TB} &= 0.3\ \text{TB} \times (3\ \text{queries/day} \times 30\ \text{days}) = 27\ \text{TB} \\ \text{Monthly cost} &= 27\ \text{TB} \times \text{USD } 5/\text{TB} = \text{USD } 135 \end{aligned} \]
This stays under USD 200, and Iceberg’s schema evolution avoids brittle “break on rename/type change” behavior common with simple Hive-style tables.
Topic: Data Operations and Support
A near-real-time analytics pipeline in us-east-1 ingests clickstream events (Kinesis Data Firehose -> Amazon S3 raw), transforms them (AWS Glue streaming job -> Amazon S3 curated), and serves queries (Amazon Athena). The business SLA is curated data must be queryable within 15 minutes of event time, and on-call must be paged within 5 minutes when the SLA is at risk. Today, operators mostly review Glue logs after users report stale dashboards.
Which change is the best way to improve operability and SLA regression detection with minimal added cost and no new third-party tooling?
Options:
A. Increase Glue job workers to reduce processing time variability
B. Rely on CloudWatch Logs Insights queries over Glue logs every 5 minutes
C. Publish stage-level custom CloudWatch metrics and build a CloudWatch dashboard with alarms
D. Enable CloudTrail S3 data events on the raw and curated buckets
Best answer: C
Explanation: Create an SLA-focused health dashboard by turning pipeline signals into CloudWatch metrics, then alerting on those metrics. Use managed service metrics (Glue failures/duration) plus custom metrics (end-to-end freshness/lag and record-count deltas) to detect regressions before consumers notice. The tradeoff is a small amount of engineering to emit metrics and a small ongoing CloudWatch custom metric cost.
High-level pipeline health dashboards work best when they track a few SLA-oriented metrics (freshness/lag, success/failure, and throughput) and drive alarms from those metrics rather than from ad hoc log searching. In this pipeline, you can use native CloudWatch metrics for Glue (job run state, duration) and publish custom metrics at key points (Firehose delivery delay, curated S3 watermark time, and optionally row-count/late-record rates) from the Glue job or a small Lambda triggered by EventBridge/S3. Then create a CloudWatch dashboard and CloudWatch alarms that page via SNS when freshness approaches or breaches 15 minutes.
The key tradeoff is paying for custom metrics (and maintaining the metric emission), but it provides reliable, low-latency detection and a single operational view of the SLA.
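A small sketch of the custom-metric half (namespace, dimension, and the watermark computation are illustrative):

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

# Epoch seconds of the newest event_time visible in the curated S3 prefix,
# computed by the Glue job or a small watermark Lambda (placeholder value here).
curated_watermark = 1_772_000_000

cloudwatch.put_metric_data(
    Namespace="DataPipeline/Clickstream",          # hypothetical namespace
    MetricData=[{
        "MetricName": "CuratedFreshnessSeconds",
        "Dimensions": [{"Name": "Stage", "Value": "curated"}],
        "Value": time.time() - curated_watermark,
        "Unit": "Seconds",
    }],
)
# A CloudWatch alarm on this metric (threshold near 900 seconds) pages via SNS
# before the 15-minute SLA is breached.
```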
Topic: Data Security and Governance
You are troubleshooting common authorization failures in AWS data engineering workloads (for example, Athena/Glue/Redshift Spectrum reading data in Amazon S3, AWS Lake Formation-governed tables, and cross-account access with IAM roles). Which THREE statements describe correct high-level causes and fixes for these authorization failures?
Options:
A. AssumeRole failures are commonly fixed by correcting the target role trust policy to allow the caller
B. If S3 access is allowed, Lake Formation permissions are automatically bypassed for Athena
C. To fix sts:AssumeRole AccessDenied, add sts:AssumeRole to the target role permissions policy
D. An S3 AccessDenied can be caused by bucket policy, SCP, or a permissions boundary overriding an IAM allow
E. Lake Formation permission errors require Lake Formation grants to the querying role or principal
F. For SSE-KMS objects in S3, s3:GetObject alone is sufficient because S3 decrypts without KMS permissions
Correct answers: A, D and E
Explanation: Authorization failures typically come from the specific control plane enforcing access: Lake Formation grants for governed data, multiple IAM/S3 policy layers for object access, and role trust policies for STS role assumption. The best fixes align to the layer producing the denial, rather than adding unrelated permissions. Correct troubleshooting starts by identifying the service error and then adjusting the matching permission mechanism.
Use the error message to identify which authorization system is denying the request, then apply the matching fix.
| Statement | OK/NO | Brief fix / why |
|---|---|---|
| Lake Formation permission errors require LF grants | OK | Grant LF permissions (for example, SELECT on DB/table) to the principal used, and ensure the role can call lakeformation:GetDataAccess. |
| S3 AccessDenied can be caused by bucket policy/SCP/permissions boundary | OK | Check for missing allows or explicit denies across IAM policy, bucket policy, permissions boundary, and SCPs. |
| AssumeRole failures are fixed by trust policy changes | OK | Update the target role trust policy to trust the caller (and satisfy any conditions like external ID). |
| S3 allow bypasses Lake Formation for Athena | NO | Lake Formation governance is enforced independently of S3 object permissions when enabled. |
| Add sts:AssumeRole to the target role permissions policy | NO | The target role’s trust policy controls who can assume it; identity permissions alone don’t grant trust. |
| s3:GetObject is enough for SSE-KMS objects | NO | The caller also needs KMS permissions (and key policy access) to decrypt. |
Key takeaway: fix the specific layer that produced the denial (LF grants, policy evaluation layers, or trust policy).
Topic: Data Store Management
A data lake in Amazon S3 is queried with Amazon Athena through the AWS Glue Data Catalog. Pipeline today:
- New raw files land in s3://lake/raw/ every minute.
- A transformation job writes curated JSON to s3://lake/curated/events/.
- The curated prefix is partitioned by year/month/day/hour and contains ~50,000–100,000 small JSON files per hour (each ~1–5 MB).
- Athena queries filter on event_date (last 7 days) and region, and aggregate by hour.
The team needs to reduce Athena cost and improve query reliability (fewer timeouts) without increasing end-to-end data availability beyond 15 minutes.
Which change is the best improvement to the curated table layout?
Options:
A. Partition by user_id to maximize pruning for selective queries
B. Keep JSON and add minute-level partitions to reduce scanned data
C. Write Parquet and compact into ~256 MB files; partition by day and region
D. Write one file per day per region to minimize file count
Best answer: C
Explanation: Converting the curated dataset to columnar Parquet reduces bytes scanned and improves CPU efficiency in Athena. Compaction to fewer, larger files cuts S3/listing and Glue/Athena planning overhead caused by many small objects. Partitioning by day and region matches the common filters, so pruning remains effective while meeting the 15-minute freshness constraint with frequent compaction.
The core optimization is balancing partition pruning with file sizing. With Athena, too many small files increase query planning time and metadata/listing overhead, and JSON increases scan cost because it is row-based and not efficiently splittable for column pruning.
A good layout for the described access patterns is:
- Convert the curated data to Parquet and compact it into files of roughly 256 MB.
- Partition by the columns the queries actually filter on (event_date by day, and region).
This reduces both “too many objects” overhead and scanned bytes, while still allowing Athena to skip entire partitions for date/region filters. The tradeoff is added ETL/compaction work and slightly more operational complexity.
- High-cardinality partition keys such as user_id typically create an explosion of partitions and small files, increasing overhead and often worsening performance.
Topic: Data Security and Governance
Select TWO true statements about AWS KMS key rotation and access auditing for KMS keys used to encrypt data platform resources (for example, Amazon S3 SSE-KMS, AWS Glue, Amazon Redshift).
Options:
A. Best risk reduction is allowing all IAM principals to use the key.
B. Automatic rotation can be enabled on AWS managed keys.
C. Key rotation changes the key ARN, requiring S3 object re-encryption.
D. Enable automatic rotation for symmetric customer managed keys annually.
E. CloudTrail does not record failed KMS Decrypt attempts.
F. CloudTrail logs KMS API calls like Decrypt and GenerateDataKey.
Correct answers: D and F
Explanation: KMS key usage can be audited by reviewing AWS CloudTrail events for KMS API calls, which show which principal invoked cryptographic operations. For symmetric customer managed keys, enabling automatic rotation reduces long-term exposure by periodically rotating key material without requiring applications to update the key ARN.
The core ideas are (1) rotate key material to limit blast radius over time and (2) audit every use of the key. For symmetric customer managed KMS keys, you can enable automatic rotation so AWS creates new backing key material on a schedule while preserving the same key ARN and alias, keeping integrations stable. For auditing, AWS CloudTrail records KMS events (for example, Decrypt, Encrypt, GenerateDataKey, CreateGrant) so you can investigate which IAM role/user accessed a key and from where.
To reduce misuse risk at a high level, combine rotation and auditing with least privilege on the key policy/IAM policies, and add guardrails such as kms:ViaService, kms:CallerAccount, and encryption context conditions to scope how/where the key can be used. The key takeaway is to limit who can call KMS APIs and continuously monitor those calls.
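For the auditing half, a query along the following lines, run with Athena against a trail delivered to S3, surfaces which principals used a key. It assumes a cloudtrail_logs table created with the standard CloudTrail table definition for Athena; the date filter is illustrative.

SELECT useridentity.arn AS principal,
       eventname,
       errorcode,
       count(*) AS calls
FROM cloudtrail_logs                          -- assumed table over the trail's S3 prefix
WHERE eventsource = 'kms.amazonaws.com'
  AND eventname IN ('Decrypt', 'GenerateDataKey')
  AND eventtime > '2025-01-01T00:00:00Z'      -- narrow the window to keep scans cheap
GROUP BY useridentity.arn, eventname, errorcode
ORDER BY calls DESC;

Denied calls appear with an errorcode such as AccessDenied, which is often the first signal of a misconfigured key policy or of probing against the key.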
Topic: Data Ingestion and Transformation
A team is designing a replay strategy for a streaming ingestion pipeline (for example, Amazon MSK or Amazon Kinesis Data Streams) where consumers may need to reprocess past events after code fixes or downstream outages.
Which statement is NOT correct about designing for replayability?
Options:
A. Track consumer progress with offsets and be able to reset them
B. Make downstream writes idempotent to tolerate duplicates during replay
C. Set stream/topic retention longer than the maximum replay window
D. Commit offsets before processing to maximize throughput and still replay safely
Best answer: D
Explanation: Replayability depends on (1) keeping the source data long enough to reread it and (2) managing offsets so consumers can resume or rewind deterministically. Committing offsets before processing breaks this because a failure after the commit can cause the consumer group to advance past unprocessed events, making recovery and correct reprocessing unreliable.
A replay strategy for streaming ingestion is built on the combination of retention, offsets, and safe reprocessing. Retention (topic/stream) must cover the longest period you might need to reprocess; once events age out, replay is impossible. Offsets (or equivalent sequence tracking) represent consumer progress and must be durable and resettable so you can rewind to a known point for reprocessing.
To keep replay safe:
- Set topic/stream retention longer than the maximum replay window you expect to need.
- Commit offsets only after records are successfully processed, and keep the ability to reset them to a known position.
- Make downstream writes idempotent (for example, dedupe or upsert on a stable event key) so replayed duplicates do not corrupt results.
The key takeaway is that retention enables rereads, but correct offset management and idempotent processing make those rereads trustworthy.
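The idempotency point is the easiest to sketch in SQL. In engines that support MERGE (Amazon Redshift, for example), an upsert keyed on the event ID makes a replayed batch converge to the same result as the original run; the table and column names here are hypothetical.

MERGE INTO curated_events
USING staged_events s
ON curated_events.event_id = s.event_id                    -- stable key carried on every event
WHEN MATCHED THEN
  UPDATE SET event_ts = s.event_ts, payload = s.payload    -- duplicate from a replay: overwrite, don't double-count
WHEN NOT MATCHED THEN
  INSERT (event_id, event_ts, payload)
  VALUES (s.event_id, s.event_ts, s.payload);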
Topic: Data Ingestion and Transformation
When troubleshooting slow Apache Spark stages (for example, on AWS Glue or Amazon EMR), which TWO statements are FALSE/unsafe? (Select TWO.)
Options:
A. High shuffle spill may improve with more memory or fewer cores
B. Repartitioning to very high counts always improves performance
C. Raising shuffle partitions can help when tasks are too large
D. Many small files on S3 can bottleneck I/O; reduce file counts
E. Skewed join keys can create straggler tasks; mitigate skew
F. Broadcast the larger join input to avoid shuffle
Correct answers: B and F
Explanation: Two statements are unsafe because they claim universal performance improvements from actions that often backfire. Broadcasting a large dataset can cause executor OOM and instability, and blindly repartitioning to extremely high partition counts increases shuffle, task scheduling overhead, and S3 I/O amplification. Effective debugging aligns shuffle volume, partition sizing, and parallelism with cluster resources and data characteristics (especially key skew).
Spark performance issues commonly come from expensive shuffles, skewed partitions (a few “hot” keys), and I/O bottlenecks (especially many small files on S3). Broadcasting is meant for the small side of a join; broadcasting a large input is unsafe because it pushes that dataset into executor memory and can trigger OOM or heavy GC.
Similarly, repartitioning does not “always” help: too many partitions creates lots of tiny tasks, increases shuffle metadata, and amplifies read/write overhead. A better approach is to size partitions so each task does a reasonable amount of work, then tune parallelism (for example spark.sql.shuffle.partitions) to match available cores.
The other statements are generally valid: skew mitigation reduces stragglers, spill often indicates memory pressure (or too much concurrency per executor), and reducing small files/using columnar formats helps relieve S3 I/O pressure.
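As a small illustration of the safe versions of those levers, the Spark SQL below sizes shuffle parallelism explicitly and broadcasts only the small dimension side of a join; the tables and the partition count are hypothetical and would be tuned to the cluster.

-- Run in a Glue/EMR Spark SQL session; 400 is an example value, not a recommendation.
SET spark.sql.shuffle.partitions = 400;

SELECT /*+ BROADCAST(d) */                 -- broadcast the SMALL input, never the large fact table
       f.event_id,
       f.event_ts,
       d.region_name
FROM fact_events f
JOIN dim_region d
  ON f.region_id = d.region_id;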
Topic: Data Operations and Support
When using AWS CloudTrail to audit and investigate incidents in an AWS data platform, which THREE statements are false or unsafe?
Options:
A. Store trail logs in Amazon S3 with SSE-KMS and restricted access.
B. Query CloudTrail logs with Athena or CloudTrail Lake during incidents.
C. S3 object-level access is logged by default in CloudTrail.
D. Send CloudTrail events to CloudWatch Logs for near-real-time alerting.
E. CloudTrail Event history provides indefinite retention for investigations.
F. A single-Region trail captures API activity in every AWS Region.
Correct answers: C, E and F
Explanation: CloudTrail is used to audit AWS API activity, but it must be configured correctly for investigation workflows. Event history is not meant for long-term retention, object-level data events (such as S3 objects) are not logged by default, and Region coverage depends on using multi-Region trails or deploying trails per Region.
CloudTrail records AWS API activity primarily through management events (for example, IAM, AWS Glue job updates, Redshift cluster changes). For incident investigations, you typically need durable retention and searchable logs, which means creating a trail that delivers to an S3 bucket (often encrypted with SSE-KMS and tightly permissioned) or using CloudTrail Lake for longer-term storage and querying.
Some high-volume activity is not captured unless explicitly enabled as data events (for example, S3 object-level reads/writes). Also, CloudTrail is Region-aware: to see activity across Regions in one place, configure a multi-Region trail (or collect logs from multiple Region trails). A common operations pattern is streaming CloudTrail events to CloudWatch Logs/EventBridge to alert on sensitive API calls while retaining logs for later forensics.
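If the account uses CloudTrail Lake, incident queries are plain SQL against the event data store. A sketch like the one below lists recent S3 object-level activity, assuming data events are being collected; the event data store ID is a placeholder.

SELECT eventTime,
       eventName,
       userIdentity.arn AS principal,
       errorCode
FROM <event_data_store_id>                  -- placeholder for the CloudTrail Lake event data store ID
WHERE eventSource = 's3.amazonaws.com'
  AND eventName IN ('GetObject', 'PutObject')
  AND eventTime > '2025-01-01 00:00:00'
ORDER BY eventTime DESC
LIMIT 100;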
Topic: Data Operations and Support
A data engineering team stores daily customer interaction data (Parquet) in Amazon S3. A new upstream release introduced data quality issues: duplicate event_id values and invalid event_timestamp formats.
Requirements:
- The data contains PII and must not leave AWS-managed services.
- Cleaned output must be written back to Amazon S3 encrypted with SSE-KMS.
- Compute should run on demand as data arrives; always-on infrastructure should be avoided to control cost.
Which TWO approaches should the team AVOID for verifying and cleaning the data? (Select TWO.)
Options:
A. Use an AWS Lambda function to validate schema/timestamp format on new objects and quarantine bad files to an S3 prefix, emitting CloudWatch metrics
B. Run Amazon Athena queries to identify invalid timestamps, then use CTAS to write cleaned Parquet to an SSE-KMS curated prefix
C. Download the raw S3 objects to a laptop, clean with local Python, and re-upload
D. Use SageMaker Studio Data Wrangler to define cleaning steps on a sample via Athena and run a processing job that writes cleaned SSE-KMS output to S3
E. Create an always-on Amazon EMR cluster to run Spark cleaning jobs continuously, regardless of data arrival
F. Use AWS Glue DataBrew to profile the dataset and apply a recipe that standardizes timestamps and removes duplicates, writing SSE-KMS output to S3
Correct answers: C and E
Explanation: Approaches that export PII outside AWS-managed services or rely on always-on compute violate the explicit security and cost requirements. Serverless/on-demand AWS services can both verify data quality (duplicates, format validity) and produce an encrypted, auditable cleaned dataset in S3.
Match the remediation to the issue type while honoring explicit constraints. For duplicates and invalid formats, common patterns are to (1) validate and quantify issues, (2) quarantine or rewrite bad records, and (3) write a cleaned dataset to a curated S3 location.
Athena, Glue DataBrew, Lambda, and Data Wrangler can all perform verification/cleaning within AWS and write results back to S3 encrypted with SSE-KMS. In contrast, exporting raw data to a local workstation breaks PII handling controls, and always-on clusters are an unnecessary cost when on-demand/serverless options meet the need. Key takeaway: keep PII processing in AWS-managed services and prefer on-demand/serverless execution for routine data quality tasks.
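As one hedged example of the verification step, the Athena queries below quantify the two reported issues before any cleaning job runs; the database, table, and column names are assumptions for illustration.

-- Duplicate event_id values
SELECT event_id, count(*) AS copies
FROM curated_db.interactions
GROUP BY event_id
HAVING count(*) > 1;

-- Rows whose event_timestamp does not parse as ISO-8601
SELECT count(*) AS invalid_timestamps
FROM curated_db.interactions
WHERE TRY(from_iso8601_timestamp(event_timestamp)) IS NULL;

A CTAS over the validated rows (deduplicated, with parseable timestamps) can then write the cleaned Parquet to the curated prefix, as the Athena-based approach describes.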
Topic: Data Operations and Support
A company uses Amazon Athena to query curated clickstream data in Amazon S3. The AWS Glue Data Catalog table is partitioned by dt (format YYYY-MM-DD). Analysts run hourly dashboards that must return exactly correct results and the data must remain protected by existing AWS Lake Formation permissions.
Which TWO actions should you AVOID when trying to reduce Athena query cost and improve performance by limiting scanned data?
Options:
A. Convert curated data to Parquet with Snappy compression
B. Use Athena CTAS to create a smaller, partitioned dashboard table
C. Add TABLESAMPLE to queries to scan fewer bytes
D. Increase partition granularity (dt/hour) for common filters
E. Add dt predicates and select only required columns
F. Grant broad S3 read access to bypass Lake Formation
Correct answers: C and F
Explanation: Avoid approaches that reduce scanned data by changing the data returned or by weakening governance controls. Using sampling can make dashboards incorrect because it intentionally returns only a fraction of rows. Bypassing Lake Formation to “make queries work” breaks the requirement to preserve existing permissions, even if it might speed access.
Athena cost and performance are primarily driven by bytes scanned. To reduce scans safely, use partition pruning (predicates directly on partition columns like dt), select only the needed columns (avoid SELECT *), and store data in compressed columnar formats such as Parquet to enable column projection and efficient reads.
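A minimal before/after sketch of those safe techniques, with an assumed table and columns:

-- Scans every partition and every column:
-- SELECT * FROM clickstream.curated_events;

-- Prunes partitions on dt and projects only the needed columns:
SELECT dt, page_id, count(*) AS views
FROM clickstream.curated_events
WHERE dt BETWEEN '2025-01-01' AND '2025-01-07'
GROUP BY dt, page_id;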
Actions to avoid fall into two categories under the stated requirements:
- Sampling techniques such as TABLESAMPLE intentionally return incomplete data, which violates “exactly correct results” even if they reduce scanned bytes.
- Weakening governance (for example, granting broad S3 read access to route around Lake Formation) breaks the requirement that existing permissions remain in force.

The safest optimizations reduce scanned data without changing semantics or loosening access controls.
- TABLESAMPLE lowers scan volume by reading fewer rows, so dashboards can be wrong.
- Adding dt (and possibly hour) predicates is an appropriate way to limit scanned partitions.

Topic: Data Store Management
A data team is trying to speed up repeated dashboard queries in Amazon Redshift by creating a materialized view.
Exhibit: Redshift SQL and error
CREATE MATERIALIZED VIEW analytics.mv_orders_enriched AS
SELECT o.order_id, o.order_ts, c.segment
FROM spectrum_ext.orders o
JOIN public.customers c ON o.customer_id = c.customer_id;
-- ERROR: Materialized views are not supported for external tables
Based on the exhibit, what is the best next step to achieve materialized-view-style performance for this dataset?
Options:
A. Replace the materialized view with a standard Redshift view over the external table
B. Use Redshift federated queries to read the S3 data and then create the materialized view
C. Create the same materialized view in Amazon Athena over the S3 table
D. Load the external data into a Redshift table, then create the materialized view
Best answer: D
Explanation: The exhibit shows Redshift rejecting the statement because the materialized view references a Spectrum external table (spectrum_ext.orders). To get materialized-view benefits, the referenced data must be stored in Redshift tables, which can then be materialized and refreshed to serve dashboards quickly.
This is a Redshift Spectrum vs. Redshift materialized view compatibility issue. In the exhibit, the FROM spectrum_ext.orders line indicates the query reads an external table, and the final line explicitly states: ERROR: Materialized views are not supported for external tables. That means you cannot use a Redshift materialized view to precompute results that depend on Spectrum external tables.
To achieve MV-like performance, first ingest the needed S3 data into Redshift (for example, using COPY into a native table), then build and refresh the materialized view on Redshift-managed tables. The key takeaway is that Spectrum is for querying S3 in place, while Redshift materialized views require Redshift-resident base data.
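A minimal sketch of that approach, under the assumption that a native table with matching columns already exists; the bucket path and IAM role are placeholders.

-- 1) Land the external data in a Redshift-managed table (created beforehand with matching columns).
COPY analytics.orders_local
FROM 's3://<bucket>/orders/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
FORMAT AS PARQUET;

-- 2) Build the materialized view on Redshift-resident tables only.
CREATE MATERIALIZED VIEW analytics.mv_orders_enriched AS
SELECT o.order_id, o.order_ts, c.segment
FROM analytics.orders_local o
JOIN public.customers c ON o.customer_id = c.customer_id;

-- 3) Refresh on a schedule so dashboards stay current.
REFRESH MATERIALIZED VIEW analytics.mv_orders_enriched;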
- A standard Redshift view still reads spectrum_ext.orders at query time, so it provides no precomputation benefit.
- Redshift materialized views cannot reference external tables such as spectrum_ext.*.

Use the AWS DEA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon for concept review before another timed run.