Try 10 focused AWS DEA-C01 questions on Data Store Management, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | AWS DEA-C01 |
| Topic area | Data Store Management |
| Blueprint weight | 26% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Data Store Management for AWS DEA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 26% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Data Store Management
A team is building a semantic search feature that finds support articles similar to a user question (vector similarity search). The team also wants to follow the core principle of keeping the raw data zone immutable so embeddings can be regenerated later if the model changes.
Which approach best matches this principle while enabling similarity search on AWS?
Options:
A. Overwrite raw S3 articles with embeddings to reduce storage
B. Store embeddings in RDS MySQL and use full-text search
C. Store embeddings as CSV in S3 and query with Athena
D. Store raw articles in S3; index embeddings in OpenSearch vectors
Best answer: D
Explanation: Semantic search is a vector search use case that requires storing embeddings and running k-NN similarity queries. Keeping the raw zone immutable means raw articles should remain unchanged so you can regenerate embeddings and rebuild indexes when models or preprocessing change. A managed vector-capable service provides efficient similarity search without mutating raw data.
The core principle is immutability of the raw zone: preserve original source data (for example, in an Amazon S3 raw prefix) and treat derived artifacts as rebuildable. Embeddings are derived from raw text and may need to be regenerated when the embedding model, chunking, or normalization changes, so they should be stored separately from raw content.
For the similarity search access pattern (nearest-neighbor over high-dimensional vectors), use a store designed for vector indexing and k-NN queries, such as Amazon OpenSearch Service with vector search, while keeping the raw articles unchanged in S3. This cleanly separates durable source-of-truth data from replaceable derived indexes and supports reprocessing without data loss.
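As a rough sketch of this separation (domain endpoint, index name, and a 3-dimension embedding are illustrative placeholders, not values from the question), the raw article stays untouched in S3 and only the derived embedding plus a pointer back to the source object go into an OpenSearch k-NN index:

```python
from opensearchpy import OpenSearch

# Hypothetical OpenSearch Service endpoint; authentication/signing omitted for brevity.
client = OpenSearch(
    hosts=[{"host": "search-articles-xxxx.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# Create a k-NN index for the derived embeddings. Dimension 3 is for illustration;
# real embedding models produce hundreds or thousands of dimensions.
client.indices.create(
    index="article-embeddings",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 3},
                "raw_s3_key": {"type": "keyword"},  # lineage back to the immutable raw zone
            }
        },
    },
)

# Index one derived record; re-running this after a model change rebuilds the index
# without ever modifying objects under the raw/ prefix.
client.index(
    index="article-embeddings",
    body={"embedding": [0.12, -0.04, 0.33], "raw_s3_key": "raw/articles/article-123.txt"},
)

# Nearest-neighbor (k-NN) query against the vector field.
results = client.search(
    index="article-embeddings",
    body={"size": 5, "query": {"knn": {"embedding": {"vector": [0.10, -0.02, 0.30], "k": 5}}}},
)
```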
Topic: Data Store Management
A company stores internal help articles as text files in Amazon S3. The company is building a semantic search feature for a support chatbot.
Requirements:
- Return semantically similar articles with low latency, scaling to millions of documents
- Keep search results fresh shortly after articles are added or updated
- Isolate access by business unit and encrypt data at rest
- Allow metadata and filters to evolve as the content schema changes
Which solution best meets these requirements with the least operational overhead?
Options:
A. Load documents into Amazon Redshift and compute similarity using SQL over an embeddings table for each query.
B. Use AWS Glue to generate embeddings into S3 Parquet and run cosine-similarity calculations in Amazon Athena at query time.
C. Use Knowledge Bases for Amazon Bedrock with the S3 bucket as the data source and Amazon OpenSearch Serverless as the vector store; configure frequent incremental sync and isolate access by business unit using separate collections/knowledge bases with IAM and KMS.
D. Store embeddings and metadata in Amazon DynamoDB and use Scan with application-side cosine similarity to find nearest matches.
Best answer: C
Explanation: Knowledge Bases for Amazon Bedrock automates the vectorization workflow: it chunks content, generates embeddings, and stores vectors in a supported vector store for low-latency semantic retrieval. Using OpenSearch Serverless provides managed vector indexes that scale for millions of documents and support metadata filters as schemas evolve. Separate collections/knowledge bases with IAM policies and KMS meet the access-control and encryption requirements with low operational effort.
Vectorization pipelines have three core parts: generate embeddings from content, store those vectors with stable document IDs plus metadata, and retrieve relevant items by nearest-neighbor (vector similarity) with optional metadata filtering. A managed knowledge base with OpenSearch Serverless directly matches this pattern: S3 acts as the source of truth, the service performs embedding generation and incremental sync to meet freshness, and OpenSearch Serverless provides scalable low-latency vector search for millions of documents.
For governance and schema evolution, store metadata alongside each vector and use it for filters; adding new metadata fields is typically an additive change (new fields) rather than re-embedding all content. Business-unit isolation is best handled by separating collections/knowledge bases (or equivalent isolation boundaries) and enforcing access with IAM-based policies plus KMS encryption.
Query-time full scans or application-side similarity calculations generally cannot meet the latency and cost goals at this scale.
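For context only, a hedged sketch of how a consumer might query such a knowledge base with boto3 (the knowledge base ID and result count are placeholders); ingestion, chunking, embedding, and sync are handled by the managed service itself:

```python
import boto3

# The managed knowledge base handles chunking, embedding generation, and incremental
# sync from the S3 data source into the OpenSearch Serverless vector store.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="KBEXAMPLE123",  # placeholder ID
    retrievalQuery={"text": "How do I reset my VPN client?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for result in response["retrievalResults"]:
    # Each result carries the retrieved chunk and its source location in S3.
    print(result["content"]["text"], result["location"])
```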
Topic: Data Store Management
A company stores source application events in an Amazon S3 data lake with zones raw/, staged/, and curated/. Events can arrive up to 14 days late, and the data team must run occasional historical backfills without changing previously published curated datasets except when truly corrected.
Which approach should you AVOID to manage backfills and late-arriving updates while keeping downstream outputs consistent?
Options:
A. Write curated outputs with snapshot/versioned prefixes and publish a pointer
B. Upsert late records into curated/ using a stable record key
C. Use an append-only raw/ zone and reprocess into curated/
D. Overwrite existing objects in the raw/ zone during backfills
Best answer: D
Explanation: You should avoid changing the raw zone in-place because it removes the ability to reproduce past results and audit what data was originally received. For late-arriving updates and backfills, the common pattern is immutable ingestion followed by deterministic reprocessing and controlled publishing. This keeps curated outputs stable and only changes them when a defined correction is applied.
The core principle is immutability of the source-of-truth ingestion layer so pipelines can be replayed and backfills can be performed deterministically. If you overwrite the raw/ zone during a backfill, you lose lineage: you can no longer prove what was originally received, reproduce an earlier curated output, or safely troubleshoot late-arriving changes.
A consistent approach is:
- Keep raw/ append-only (often time-partitioned) and retain it per policy.
- Rebuild staged/ and curated/ deterministically for the affected dates.

The key takeaway is that late data handling should change curated outputs through governed reprocessing, not by mutating raw inputs.
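A minimal sketch of the controlled-publishing step from option A (bucket name, prefixes, and pointer key are illustrative assumptions): the backfill job writes a new curated snapshot, and publishing is just an atomic update of a small pointer object that readers resolve first, so earlier snapshots remain available for audit or rollback.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"  # hypothetical bucket

# The reprocessing job has already written the corrected output under this new prefix;
# raw/ was only read, never overwritten.
snapshot_prefix = "curated/events/snapshot=2024-06-01T0600/"

# Publish by replacing a tiny pointer object. Consumers (or a view/table definition)
# resolve the pointer to find the currently published snapshot.
s3.put_object(
    Bucket=bucket,
    Key="curated/events/_current_snapshot.json",
    Body=json.dumps({"prefix": snapshot_prefix}).encode("utf-8"),
    ContentType="application/json",
)
```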
Topic: Data Store Management
A company stores clickstream events for analytics. New events arrive continuously (about 1 TB/day) and must be queryable within 15 minutes. Analysts run interactive SQL queries throughout the day, mostly filtering the last 14 days, and the team wants to minimize Athena query cost by reducing bytes scanned. Data must be retained for 5 years, but data older than 90 days is queried only once per quarter and can tolerate slower queries. Some columns contain PII and require column-level access control.
Which solution BEST meets these requirements?
Options:
A. Store Parquet on Amazon S3 and transition objects older than 90 days to S3 Glacier Deep Archive
B. Write partitioned Parquet with Snappy to Amazon S3, use lifecycle to Intelligent-Tiering for older partitions, and query with Athena using Lake Formation and result reuse
C. Store uncompressed JSON in Amazon S3 Standard and query it directly with Athena
D. Load all events into a provisioned Amazon Redshift cluster and keep 5 years of data on the cluster
Best answer: B
Explanation: Using S3 with partitioned, compressed columnar formats (Parquet + Snappy) is the highest-leverage way to cut Athena bytes scanned and improve query performance. Keeping recent data in an online S3 tier and transitioning older partitions to a lower-cost online tier meets long retention at lower cost. Lake Formation provides fine-grained (including column-level) access controls for PII, and Athena result reuse can reduce repeat-query latency and cost.
For Athena, the key storage optimizations are columnar format, compression, and partitioning so queries read fewer bytes. Writing events to S3 as Parquet with Snappy compression and partitions such as event_date=YYYY-MM-DD lets common “last 14 days” filters prune partitions and scan only needed columns, lowering cost and improving performance.
To reduce storage cost while preserving periodic query access, keep recent partitions in S3 Standard and use an S3 lifecycle policy to transition older partitions to an online lower-cost tier (for example, S3 Intelligent-Tiering). Use AWS Lake Formation on top of the Glue Data Catalog to enforce table/column permissions for PII, and enable Athena workgroup query result reuse to speed up repeated queries over the same data.
Archival tiers that require long restores break the “query on demand” requirement for older data.
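A hedged sketch of the two storage-side pieces, assuming an awswrangler-based writer and illustrative bucket, database, and table names: write partitioned, Snappy-compressed Parquet registered in the Glue Data Catalog, then add a lifecycle rule that transitions objects older than 90 days to S3 Intelligent-Tiering.

```python
import awswrangler as wr
import boto3
import pandas as pd

bucket = "example-clickstream"  # hypothetical bucket

# Write events as Snappy-compressed Parquet, partitioned by event_date, and register
# the table in the Glue Data Catalog so Athena (governed by Lake Formation) can query it.
events = pd.DataFrame({"user_id": ["u1"], "page": ["/home"], "event_date": ["2024-06-01"]})
wr.s3.to_parquet(
    df=events,
    path=f"s3://{bucket}/curated/clickstream/",
    dataset=True,
    partition_cols=["event_date"],
    compression="snappy",
    database="analytics",  # assumed Glue database
    table="clickstream",
)

# Lifecycle rule: keep recent data in S3 Standard, move objects older than 90 days to
# Intelligent-Tiering so quarterly queries still work without restore operations.
boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "clickstream-to-intelligent-tiering",
                "Filter": {"Prefix": "curated/clickstream/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)
```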
Topic: Data Store Management
A company is building an analytics platform on AWS. Source systems are OLTP databases where rows are frequently inserted, updated, and occasionally deleted. The analytics team needs a storage and data modeling approach that supports incremental processing from CDC feeds and can reconstruct current state and (when needed) prior state.
Which THREE approaches best support incremental processing and CDC requirements? (Select THREE.)
Options:
A. Model slowly changing dimensions as SCD Type 2 with effective start/end timestamps and a current-row indicator
B. Rewrite the full target dataset each day as a new snapshot and delete the prior snapshot
C. Persist CDC events as an immutable, partitioned change-log table with operation type and commit timestamp
D. Optimize only for query speed by denormalizing into one wide table without business keys or change timestamps
E. Store curated tables on Amazon S3 using Apache Hudi with a record key to support upserts and deletes
F. Store tables as a single CSV object per table and overwrite the object on every change batch
Correct answers: A, C and E
Explanation: Incremental processing with CDC works best when storage formats and schemas provide stable keys and change/commit metadata. Table formats like Hudi enable efficient upserts/deletes and incremental pulls, while dimensional modeling patterns like SCD Type 2 preserve historical versions. An append-only change log is a robust foundation for replay and incremental materialization of current-state tables.
To support CDC-driven incremental processing, your storage and model must let you (1) identify the record being changed and (2) know the ordering/version of changes (and optionally deletes). Table formats such as Apache Hudi on S3 are designed for this by storing record keys and commit metadata so pipelines can apply upserts/deletes and consumers can read only new commits.
At the schema/modeling layer, SCD Type 2 captures row-version history using effective dating (or similar versioning), which makes “as of time” queries possible without full reloads. A complementary pattern is persisting CDC as an immutable change-log (with operation type and commit timestamp) so you can incrementally materialize “current” tables and also replay changes to rebuild derived datasets. Approaches that overwrite whole datasets or omit keys/timestamps undermine reliable incremental processing.
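A short PySpark sketch of the Hudi side of this pattern (table name, keys, and S3 paths are assumptions, not part of the question): the record key identifies which row to upsert, the precombine field decides which competing change wins, and the immutable change log remains the replayable source.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-upsert").getOrCreate()

# cdc_batch is an incremental read of the immutable change-log table
# (operation type + commit timestamp), e.g. the latest ingest-date partition.
cdc_batch = spark.read.parquet("s3://example-lake/changelog/orders/ingest_date=2024-06-01/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",    # stable business key
    "hoodie.datasource.write.precombine.field": "commit_ts",  # latest change wins
    "hoodie.datasource.write.operation": "upsert",
}

# Apply inserts/updates to the curated Hudi table; the raw change log is never mutated.
(cdc_batch.filter("op IN ('I', 'U')")
    .write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lake/curated/orders/"))
```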
Topic: Data Store Management
A company stores clickstream events in Amazon S3 and queries them with Amazon Athena through AWS Lake Formation (column-level permissions). New data must be queryable within 15 minutes of landing, and queries for the last 7 days must stay under 30 seconds while keeping a pay-per-query cost model.
The dataset grew to 5 TB/day, and the current layout is partitioned by device_id. Cardinality has increased (millions of device_id values/day) and key distributions are now skewed, causing too many partitions and slow queries. New columns are added frequently.
Which solution BEST addresses the change in data characteristics?
Options:
A. Rewrite to an Apache Iceberg table in S3, partitioned by event_date (and another low-cardinality column), using an AWS Glue job for compaction and schema evolution; query with Athena and keep Lake Formation permissions
B. Index the events by device_id in Amazon DynamoDB and run analytics by scanning the DynamoDB table with PartiQL
C. Load the events into a provisioned Amazon Redshift cluster and use an interleaved sort key on device_id to handle skewed access patterns
D. Keep the S3 layout partitioned by device_id and use Athena partition projection to avoid storing partition metadata in the Data Catalog
Best answer: A
Explanation: The core issue is a high-cardinality, skew-prone partition key (device_id) that creates excessive partitions and small files, degrading Athena performance. Rewriting the dataset with a low-cardinality, time-based partition strategy and compacting files restores partition pruning and scan efficiency. Using Iceberg also supports frequent schema changes while remaining compatible with Athena and Lake Formation governance.
When data characteristics change (cardinality, skew, null rates), the physical layout must change too. Partitioning by a high-cardinality key like device_id creates millions of partitions and many small objects, which hurts Athena planning and increases data scanned.
A better strategy is to rewrite the table to a format that supports evolution (Iceberg) and choose partitions that align with common filters and stay bounded in count:
- Partition by a time-based key such as event_date, optionally adding another low-cardinality dimension
- Compact small files on a schedule (for example, with the AWS Glue job) so partitions stay scan-efficient

This meets the 15-minute freshness goal (incremental writes) while restoring sub-30-second performance for recent-window queries. Athena partition projection alone would not fix the underlying device_id-based layout.
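As an illustration only (database, table, columns, and S3 location are placeholders), the rewritten layout could be defined as an Athena Iceberg table partitioned by event_date, with periodic compaction driven by the Glue job or by Athena table-maintenance statements:

```python
import boto3

# Assumes the Athena workgroup already has a query result location configured.
athena = boto3.client("athena")

# Hypothetical Iceberg table partitioned by a bounded, time-based key instead of device_id.
ddl = """
CREATE TABLE analytics.clickstream_iceberg (
  event_id   string,
  device_id  string,
  event_time timestamp,
  event_date date
)
PARTITIONED BY (event_date)
LOCATION 's3://example-lake/curated/clickstream_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",
)

# Periodic small-file compaction on recent partitions.
athena.start_query_execution(
    QueryString="OPTIMIZE analytics.clickstream_iceberg REWRITE DATA USING BIN_PACK "
                "WHERE event_date >= date '2024-06-01'",
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",
)
```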
Topic: Data Store Management
When selecting between Amazon Kinesis Data Streams and Amazon MSK for streaming transport and storage, which THREE statements are false or unsafe assumptions about ordering, replay, throughput, or the consumer model? (Select THREE.)
Options:
A. Kinesis Data Streams guarantees ordering across all shards
B. Kinesis provides ordered records only within a shard
C. MSK uses partitions and consumer groups for parallel consumption
D. Amazon MSK deletes records immediately after consumers read them
E. Kinesis Data Streams requires you to manage Kafka brokers
F. Both services support replay within configured retention
Correct answers: A, D and E
Explanation: Ordering guarantees are scoped to a shard in Kinesis and to a partition in Kafka/MSK, not to the entire stream/topic. Both services retain data for a configured period, which enables replay by re-reading from a prior position (sequence number or offset). MSK is Kafka-compatible, while Kinesis is fully managed and does not require operating Kafka brokers.
Choose between Kinesis Data Streams and MSK by matching service semantics to your requirements. For ordering, Kinesis preserves order within a shard (driven by partition key), and Kafka/MSK preserves order within a partition; neither provides inherent global ordering across all shards/partitions without adding application-side constraints. For replay, both retain records for a configured retention window, so consumers can reprocess data by reading again from an earlier sequence/offset. For the consumer model, MSK follows Kafka consumer groups and partition assignment, while Kinesis supports shared-shard consumption (and optional fan-out patterns) without you managing Kafka infrastructure. The key takeaway is to scope ordering and replay to shard/partition boundaries and not assume delete-on-read semantics.
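A minimal boto3 sketch of these two points (stream name and partition key are illustrative): ordering is controlled by the partition key that maps records to a shard, and replay works by re-reading from an earlier position within the retention window rather than relying on delete-on-read.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
stream = "clickstream-events"  # hypothetical stream

# Records sharing a partition key land on the same shard, so they stay ordered relative
# to each other; there is no ordering guarantee across shards.
kinesis.put_record(
    StreamName=stream,
    PartitionKey="user-42",
    Data=json.dumps({"user_id": "user-42", "action": "click"}).encode("utf-8"),
)

# Replay: records are not deleted when read. Within the configured retention window,
# a consumer can start again from the oldest available record or a saved sequence number.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
```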
Topic: Data Store Management
A company stores source data in an Amazon S3 “raw” zone and uses AWS Glue ETL jobs to produce Parquet tables in an S3 “curated” zone for Amazon Athena. The curated tables undergo schema evolution, and auditors require end-to-end data lineage showing which upstream datasets and transformations produced each curated dataset, with minimal custom code.
Which TWO actions will best establish this lineage using AWS tools? (Select TWO.)
Options:
A. Use Athena partition projection instead of updating partitions in the catalog
B. Encrypt S3 objects with KMS CMKs and rotate keys regularly
C. Enable S3 versioning on the raw and curated buckets
D. Enable CloudTrail S3 data events to reconstruct lineage from access logs
E. Catalog raw and curated tables and run Glue ETL using catalog tables
F. Use Amazon DataZone to capture and visualize lineage for data assets
Correct answers: E and F
Explanation: Data lineage is established by recording metadata relationships between datasets and the transformations that produce them. Registering datasets in the AWS Glue Data Catalog and running Glue ETL jobs against catalog tables creates explicit, traceable input/output links. A data catalog/portal layer such as Amazon DataZone can then present that lineage to auditors and users across the platform.
The core requirement is dataset-level lineage (upstream sources → transformations → downstream tables), not just storage history or security/audit logs. The most direct way to establish lineage with minimal custom code is to standardize on managed metadata: register raw and curated datasets as AWS Glue Data Catalog tables and ensure ETL jobs use the catalog as their sources and targets, so relationships are captured at the table level even as schemas evolve.
A governance/consumption layer such as Amazon DataZone can then aggregate and visualize lineage for published data assets, making it easier for auditors and consumers to trace where curated datasets came from. Storage versioning, encryption, partition projection, and access logs help with other concerns (recovery, security, performance, auditing) but do not create transformation lineage graphs by themselves.
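A small Glue ETL sketch of the catalog-centric pattern (database and table names are assumptions): reading and writing through Data Catalog tables, rather than raw S3 paths, is what makes each job's input/output relationships traceable.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Source and target are both Data Catalog tables (hypothetical names), so the job's
# inputs and outputs are recorded as catalog entities rather than anonymous S3 paths.
raw_events = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="support_events_raw"
)

# ...transformations on raw_events would go here...

glue_context.write_dynamic_frame.from_catalog(
    frame=raw_events, database="curated_db", table_name="support_events_curated"
)
```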
Topic: Data Store Management
A company stores curated clickstream data in an Amazon S3 data lake and queries it with Amazon Athena. AWS Glue jobs write the curated tables and downstream dashboards expect stable column names and types. The source team frequently introduces new columns, renames fields, and occasionally changes types (for example, int to bigint).
Which TWO actions will best minimize pipeline breakage while allowing controlled schema evolution? (Select TWO.)
Options:
A. Use Apache Iceberg tables in S3 with Glue Catalog
B. Standardize on SELECT * in Athena/ETL queries
C. Create separate tables per schema version and union later
D. Expose a stable Athena view with aliases/casts for consumers
E. Let Glue crawlers automatically overwrite schemas on each run
F. Enforce SSE-KMS via S3 bucket policy for all writes
Correct answers: A and D
Explanation: Using a table format that natively supports schema evolution lets you apply changes intentionally without rewriting all historical data or breaking readers. Pairing that with a stable consumer-facing interface (a view) decouples downstream queries from frequent upstream renames and type shifts while you transition clients on your schedule.
To minimize breakage from new columns, renamed fields, and type changes, separate the physical storage evolution from the logical contract consumers depend on. Apache Iceberg (backed by the AWS Glue Data Catalog) is designed for data-lake table evolution, allowing controlled ALTER TABLE operations (for example, adding columns and renaming columns) while tracking schema in table metadata.
On the consumption side, an Athena view can provide a stable schema contract by:
- Aliasing renamed source columns back to the names consumers expect
- Casting changed types (for example, int to bigint) to the types dashboards rely on

This combination reduces emergency fixes in Glue jobs and dashboards when upstream schemas drift. In contrast, automatic schema replacement and SELECT * tend to propagate breaking changes immediately: SELECT * makes consumers sensitive to renames, reordering, and incompatible type changes.
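A hedged example of such a consumer-facing view (database, table, and column names are illustrative): the view aliases a renamed source column and casts a widened type back to what dashboards expect, so upstream drift is absorbed in one place.

```python
import boto3

athena = boto3.client("athena")

# The underlying Iceberg table evolved (a column was renamed, an int widened to bigint);
# the view keeps the contract that dashboards were built against.
view_sql = """
CREATE OR REPLACE VIEW analytics.clickstream_stable AS
SELECT
  event_identifier AS event_id,                       -- source column was renamed upstream
  CAST(session_length AS bigint) AS session_length,   -- type widened; cast keeps it explicit
  event_date
FROM analytics.clickstream_iceberg
"""

athena.start_query_execution(
    QueryString=view_sql,
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",  # assumes the workgroup has a query result location configured
)
```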
Topic: Data Store Management
In an AWS data lake that uses AWS Glue Data Catalog and AWS Lake Formation, the team is documenting why a data catalog is needed.
Select THREE statements about a data catalog that are FALSE or unsafe.
Options:
A. It centralizes table, column, and partition metadata for services like Athena and Redshift Spectrum.
B. If a principal has Lake Formation access to a table, no Amazon S3 or AWS KMS permissions are needed to read the data.
C. It stores the actual dataset bytes, so deleting a catalog table deletes the data in Amazon S3.
D. It automatically prevents breaking schema drift in downstream queries without any updates or validation.
E. It can support governance by letting Lake Formation manage permissions on cataloged databases and tables.
F. It improves discovery by providing consistent schemas and searchable metadata such as owners and tags.
Correct answers: B, C and D
Explanation: A data catalog is a centralized metadata layer that enables dataset discovery and consistent schema usage across analytics and ETL tools. It also acts as a governance anchor (for example, via Lake Formation) by defining cataloged resources that can be permissioned and audited. The catalog does not replace storage-layer permissions or automatically solve schema evolution risks.
The core purpose of a data catalog is to store and publish metadata: dataset locations, table definitions, columns, partitions, and business context (owners/tags/descriptions). This enables discovery (people and tools can find and understand datasets) and consistent schema usage (multiple engines reference the same definitions instead of duplicating them).
In AWS, the Glue Data Catalog is commonly used by Athena, Glue jobs, and Redshift Spectrum. Lake Formation builds governance on top of catalog resources, but it does not magically eliminate the need for underlying data-plane controls: access to S3 objects (and KMS keys for encrypted data) must still be allowed. Also, catalogs record schema; preventing downstream breaks from schema drift still requires managed schema evolution, testing, and/or controlled updates (for example, crawler settings and versioning practices).
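A tiny boto3 check that makes the "metadata, not data" point concrete (database and table names are placeholders): the catalog entry returns a schema and an S3 location, while the bytes themselves, and the S3/KMS permissions needed to read them, live in the storage layer.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical catalog table: only metadata comes back, never the dataset bytes.
table = glue.get_table(DatabaseName="analytics", Name="clickstream")["Table"]

print(table["StorageDescriptor"]["Location"])                            # where the data lives in S3
print([col["Name"] for col in table["StorageDescriptor"]["Columns"]])    # the registered schema

# Reading the actual objects still requires S3 (and KMS, if encrypted) permissions,
# regardless of any Lake Formation grant on the catalog table.
```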
Use the AWS DEA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try AWS DEA-C01 on Web
View AWS DEA-C01 Practice Test
Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.