Try 10 focused AWS DEA-C01 questions on Data Store Management, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | AWS DEA-C01 |
| Topic area | Data Store Management |
| Blueprint weight | 26% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Data Store Management for AWS DEA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 26% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Data Store Management
A team is building a semantic search feature that finds support articles similar to a user question (vector similarity search). The team also wants to follow the core principle of keeping the raw data zone immutable so embeddings can be regenerated later if the model changes.
Which approach best matches this principle while enabling similarity search on AWS?
Options:
A. Overwrite raw S3 articles with embeddings to reduce storage
B. Store embeddings in RDS MySQL and use full-text search
C. Store embeddings as CSV in S3 and query with Athena
D. Store raw articles in S3; index embeddings in OpenSearch vectors
Best answer: D
Explanation: Semantic search is a vector search use case that requires storing embeddings and running k-NN similarity queries. Keeping the raw zone immutable means raw articles should remain unchanged so you can regenerate embeddings and rebuild indexes when models or preprocessing change. A managed vector-capable service provides efficient similarity search without mutating raw data.
The core principle is immutability of the raw zone: preserve original source data (for example, in an Amazon S3 raw prefix) and treat derived artifacts as rebuildable. Embeddings are derived from raw text and may need to be regenerated when the embedding model, chunking, or normalization changes, so they should be stored separately from raw content.
For the similarity search access pattern (nearest-neighbor over high-dimensional vectors), use a store designed for vector indexing and k-NN queries, such as Amazon OpenSearch Service with vector search, while keeping the raw articles unchanged in S3. This cleanly separates durable source-of-truth data from replaceable derived indexes and supports reprocessing without data loss.
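As a rough sketch of this separation (domain endpoint, index name, and a 3-dimension embedding are illustrative placeholders, not values from the question), the raw article stays untouched in S3 and only the derived embedding plus a pointer back to the source object go into an OpenSearch k-NN index:

```python
from opensearchpy import OpenSearch

# Hypothetical OpenSearch Service endpoint; authentication/signing omitted for brevity.
client = OpenSearch(
    hosts=[{"host": "search-articles-xxxx.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# Create a k-NN index for the derived embeddings. Dimension 3 is for illustration;
# real embedding models produce hundreds or thousands of dimensions.
client.indices.create(
    index="article-embeddings",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "dimension": 3},
                "raw_s3_key": {"type": "keyword"},  # lineage back to the immutable raw zone
            }
        },
    },
)

# Index one derived record; re-running this after a model change rebuilds the index
# without ever modifying objects under the raw/ prefix.
client.index(
    index="article-embeddings",
    body={"embedding": [0.12, -0.04, 0.33], "raw_s3_key": "raw/articles/article-123.txt"},
)

# Nearest-neighbor (k-NN) query against the vector field.
results = client.search(
    index="article-embeddings",
    body={"size": 5, "query": {"knn": {"embedding": {"vector": [0.10, -0.02, 0.30], "k": 5}}}},
)
```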
Topic: Data Store Management
A company stores internal help articles as text files in Amazon S3. The company is building a semantic search feature for a support chatbot.
Requirements:
- Return semantically similar articles with low latency, scaling to millions of documents
- Keep search results fresh shortly after articles are added or updated
- Isolate access by business unit and encrypt data at rest
- Allow metadata and filters to evolve as the content schema changes
Which solution best meets these requirements with the least operational overhead?
Options:
A. Load documents into Amazon Redshift and compute similarity using SQL over an embeddings table for each query.
B. Use AWS Glue to generate embeddings into S3 Parquet and run cosine-similarity calculations in Amazon Athena at query time.
C. Use Knowledge Bases for Amazon Bedrock with the S3 bucket as the data source and Amazon OpenSearch Serverless as the vector store; configure frequent incremental sync and isolate access by business unit using separate collections/knowledge bases with IAM and KMS.
D. Store embeddings and metadata in Amazon DynamoDB and use Scan with application-side cosine similarity to find nearest matches.
Best answer: C
Explanation: Knowledge Bases for Amazon Bedrock automates the vectorization workflow: it chunks content, generates embeddings, and stores vectors in a supported vector store for low-latency semantic retrieval. Using OpenSearch Serverless provides managed vector indexes that scale for millions of documents and support metadata filters as schemas evolve. Separate collections/knowledge bases with IAM policies and KMS meet the access-control and encryption requirements with low operational effort.
Vectorization pipelines have three core parts: generate embeddings from content, store those vectors with stable document IDs plus metadata, and retrieve relevant items by nearest-neighbor (vector similarity) with optional metadata filtering. A managed knowledge base with OpenSearch Serverless directly matches this pattern: S3 acts as the source of truth, the service performs embedding generation and incremental sync to meet freshness, and OpenSearch Serverless provides scalable low-latency vector search for millions of documents.
For governance and schema evolution, store metadata alongside each vector and use it for filters; adding new metadata fields is typically an additive change (new fields) rather than re-embedding all content. Business-unit isolation is best handled by separating collections/knowledge bases (or equivalent isolation boundaries) and enforcing access with IAM-based policies plus KMS encryption.
Query-time full scans or application-side similarity calculations generally cannot meet the latency and cost goals at this scale.
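For context only, a hedged sketch of how a consumer might query such a knowledge base with boto3 (the knowledge base ID and result count are placeholders); ingestion, chunking, embedding, and sync are handled by the managed service itself:

```python
import boto3

# The managed knowledge base handles chunking, embedding generation, and incremental
# sync from the S3 data source into the OpenSearch Serverless vector store.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId="KBEXAMPLE123",  # placeholder ID
    retrievalQuery={"text": "How do I reset my VPN client?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for result in response["retrievalResults"]:
    # Each result carries the retrieved chunk and its source location in S3.
    print(result["content"]["text"], result["location"])
```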
Topic: Data Store Management
A company stores source application events in an Amazon S3 data lake with zones raw/, staged/, and curated/. Events can arrive up to 14 days late, and the data team must run occasional historical backfills without changing previously published curated datasets except when truly corrected.
Which approach should you AVOID to manage backfills and late-arriving updates while keeping downstream outputs consistent?
Options:
A. Write curated outputs with snapshot/versioned prefixes and publish a pointer
B. Upsert late records into curated/ using a stable record key
C. Use an append-only raw/ zone and reprocess into curated/
D. Overwrite existing objects in the raw/ zone during backfills
Best answer: D
Explanation: You should avoid changing the raw zone in-place because it removes the ability to reproduce past results and audit what data was originally received. For late-arriving updates and backfills, the common pattern is immutable ingestion followed by deterministic reprocessing and controlled publishing. This keeps curated outputs stable and only changes them when a defined correction is applied.
The core principle is immutability of the source-of-truth ingestion layer so pipelines can be replayed and backfills can be performed deterministically. If you overwrite the raw/ zone during a backfill, you lose lineage: you can no longer prove what was originally received, reproduce an earlier curated output, or safely troubleshoot late-arriving changes.
A consistent approach is:
- Keep raw/ append-only (often time-partitioned) and retain it per policy.
- Rebuild staged/ and curated/ deterministically for the affected dates.

The key takeaway is that late data handling should change curated outputs through governed reprocessing, not by mutating raw inputs.
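A minimal sketch of the controlled-publishing step from option A (bucket name, prefixes, and pointer key are illustrative assumptions): the backfill job writes a new curated snapshot, and publishing is just an atomic update of a small pointer object that readers resolve first, so earlier snapshots remain available for audit or rollback.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"  # hypothetical bucket

# The reprocessing job has already written the corrected output under this new prefix;
# raw/ was only read, never overwritten.
snapshot_prefix = "curated/events/snapshot=2024-06-01T0600/"

# Publish by replacing a tiny pointer object. Consumers (or a view/table definition)
# resolve the pointer to find the currently published snapshot.
s3.put_object(
    Bucket=bucket,
    Key="curated/events/_current_snapshot.json",
    Body=json.dumps({"prefix": snapshot_prefix}).encode("utf-8"),
    ContentType="application/json",
)
```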
Topic: Data Store Management
A company stores clickstream events for analytics. New events arrive continuously (about 1 TB/day) and must be queryable within 15 minutes. Analysts run interactive SQL queries throughout the day, mostly filtering the last 14 days, and the team wants to minimize Athena query cost by reducing bytes scanned. Data must be retained for 5 years, but data older than 90 days is queried only once per quarter and can tolerate slower queries. Some columns contain PII and require column-level access control.
Which solution BEST meets these requirements?
Options:
A. Store Parquet on Amazon S3 and transition objects older than 90 days to S3 Glacier Deep Archive
B. Write partitioned Parquet with Snappy to Amazon S3, use lifecycle to Intelligent-Tiering for older partitions, and query with Athena using Lake Formation and result reuse
C. Store uncompressed JSON in Amazon S3 Standard and query it directly with Athena
D. Load all events into a provisioned Amazon Redshift cluster and keep 5 years of data on the cluster
Best answer: B
Explanation: Using S3 with partitioned, compressed columnar formats (Parquet + Snappy) is the highest-leverage way to cut Athena bytes scanned and improve query performance. Keeping recent data in an online S3 tier and transitioning older partitions to a lower-cost online tier meets long retention at lower cost. Lake Formation provides fine-grained (including column-level) access controls for PII, and Athena result reuse can reduce repeat-query latency and cost.
For Athena, the key storage optimizations are columnar format, compression, and partitioning so queries read fewer bytes. Writing events to S3 as Parquet with Snappy compression and partitions such as event_date=YYYY-MM-DD lets common “last 14 days” filters prune partitions and scan only needed columns, lowering cost and improving performance.
To reduce storage cost while preserving periodic query access, keep recent partitions in S3 Standard and use an S3 lifecycle policy to transition older partitions to an online lower-cost tier (for example, S3 Intelligent-Tiering). Use AWS Lake Formation on top of the Glue Data Catalog to enforce table/column permissions for PII, and enable Athena workgroup query result reuse to speed up repeated queries over the same data.
Archival tiers that require long restores break the “query on demand” requirement for older data.
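A hedged sketch of the two storage-side pieces, assuming an awswrangler-based writer and illustrative bucket, database, and table names: write partitioned, Snappy-compressed Parquet registered in the Glue Data Catalog, then add a lifecycle rule that transitions objects older than 90 days to S3 Intelligent-Tiering.

```python
import awswrangler as wr
import boto3
import pandas as pd

bucket = "example-clickstream"  # hypothetical bucket

# Write events as Snappy-compressed Parquet, partitioned by event_date, and register
# the table in the Glue Data Catalog so Athena (governed by Lake Formation) can query it.
events = pd.DataFrame({"user_id": ["u1"], "page": ["/home"], "event_date": ["2024-06-01"]})
wr.s3.to_parquet(
    df=events,
    path=f"s3://{bucket}/curated/clickstream/",
    dataset=True,
    partition_cols=["event_date"],
    compression="snappy",
    database="analytics",  # assumed Glue database
    table="clickstream",
)

# Lifecycle rule: keep recent data in S3 Standard, move objects older than 90 days to
# Intelligent-Tiering so quarterly queries still work without restore operations.
boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "clickstream-to-intelligent-tiering",
                "Filter": {"Prefix": "curated/clickstream/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)
```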
Topic: Data Store Management
A company is building an analytics platform on AWS. Source systems are OLTP databases where rows are frequently inserted, updated, and occasionally deleted. The analytics team needs a storage and data modeling approach that supports incremental processing from CDC feeds and can reconstruct current state and (when needed) prior state.
Which THREE approaches best support incremental processing and CDC requirements? (Select THREE.)
Options:
A. Model slowly changing dimensions as SCD Type 2 with effective start/end timestamps and a current-row indicator
B. Rewrite the full target dataset each day as a new snapshot and delete the prior snapshot
C. Persist CDC events as an immutable, partitioned change-log table with operation type and commit timestamp
D. Optimize only for query speed by denormalizing into one wide table without business keys or change timestamps
E. Store curated tables on Amazon S3 using Apache Hudi with a record key to support upserts and deletes
F. Store tables as a single CSV object per table and overwrite the object on every change batch
Correct answers: A, C and E
Explanation: Incremental processing with CDC works best when storage formats and schemas provide stable keys and change/commit metadata. Table formats like Hudi enable efficient upserts/deletes and incremental pulls, while dimensional modeling patterns like SCD Type 2 preserve historical versions. An append-only change log is a robust foundation for replay and incremental materialization of current-state tables.
To support CDC-driven incremental processing, your storage and model must let you (1) identify the record being changed and (2) know the ordering/version of changes (and optionally deletes). Table formats such as Apache Hudi on S3 are designed for this by storing record keys and commit metadata so pipelines can apply upserts/deletes and consumers can read only new commits.
At the schema/modeling layer, SCD Type 2 captures row-version history using effective dating (or similar versioning), which makes “as of time” queries possible without full reloads. A complementary pattern is persisting CDC as an immutable change-log (with operation type and commit timestamp) so you can incrementally materialize “current” tables and also replay changes to rebuild derived datasets. Approaches that overwrite whole datasets or omit keys/timestamps undermine reliable incremental processing.
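A short PySpark sketch of the Hudi side of this pattern (table name, keys, and S3 paths are assumptions, not part of the question): the record key identifies which row to upsert, the precombine field decides which competing change wins, and the immutable change log remains the replayable source.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-upsert").getOrCreate()

# cdc_batch is an incremental read of the immutable change-log table
# (operation type + commit timestamp), e.g. the latest ingest-date partition.
cdc_batch = spark.read.parquet("s3://example-lake/changelog/orders/ingest_date=2024-06-01/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",    # stable business key
    "hoodie.datasource.write.precombine.field": "commit_ts",  # latest change wins
    "hoodie.datasource.write.operation": "upsert",
}

# Apply inserts/updates to the curated Hudi table; the raw change log is never mutated.
(cdc_batch.filter("op IN ('I', 'U')")
    .write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lake/curated/orders/"))
```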
Topic: Data Store Management
A company stores clickstream events in Amazon S3 and queries them with Amazon Athena through AWS Lake Formation (column-level permissions). New data must be queryable within 15 minutes of landing, and queries for the last 7 days must stay under 30 seconds while keeping a pay-per-query cost model.
The dataset grew to 5 TB/day, and the current layout is partitioned by device_id. Cardinality has increased (millions of device_id values/day) and key distributions are now skewed, causing too many partitions and slow queries. New columns are added frequently.
Which solution BEST addresses the change in data characteristics?
Options:
A. Rewrite to an Apache Iceberg table in S3, partitioned by event_date (and another low-cardinality column), using an AWS Glue job for compaction and schema evolution; query with Athena and keep Lake Formation permissions
B. Index the events by device_id in Amazon DynamoDB and run analytics by scanning the DynamoDB table with PartiQL
C. Load the events into a provisioned Amazon Redshift cluster and use an interleaved sort key on device_id to handle skewed access patterns
D. Keep the S3 layout partitioned by device_id and use Athena partition projection to avoid storing partition metadata in the Data Catalog
Best answer: A
Explanation: The core issue is a high-cardinality, skew-prone partition key (device_id) that creates excessive partitions and small files, degrading Athena performance. Rewriting the dataset with a low-cardinality, time-based partition strategy and compacting files restores partition pruning and scan efficiency. Using Iceberg also supports frequent schema changes while remaining compatible with Athena and Lake Formation governance.
When data characteristics change (cardinality, skew, null rates), the physical layout must change too. Partitioning by a high-cardinality key like device_id creates millions of partitions and many small objects, which hurts Athena planning and increases data scanned.
A better strategy is to rewrite the table to a format that supports evolution (Iceberg) and choose partitions that align with common filters and stay bounded in count:
- Partition by a time-based key such as event_date, optionally adding another low-cardinality dimension
- Compact small files on a schedule (for example, with the AWS Glue job) so partitions stay scan-efficient

This meets the 15-minute freshness goal (incremental writes) while restoring sub-30-second performance for recent-window queries. Athena partition projection alone would not fix the underlying device_id-based layout.
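As an illustration only (database, table, columns, and S3 location are placeholders), the rewritten layout could be defined as an Athena Iceberg table partitioned by event_date, with periodic compaction driven by the Glue job or by Athena table-maintenance statements:

```python
import boto3

# Assumes the Athena workgroup already has a query result location configured.
athena = boto3.client("athena")

# Hypothetical Iceberg table partitioned by a bounded, time-based key instead of device_id.
ddl = """
CREATE TABLE analytics.clickstream_iceberg (
  event_id   string,
  device_id  string,
  event_time timestamp,
  event_date date
)
PARTITIONED BY (event_date)
LOCATION 's3://example-lake/curated/clickstream_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",
)

# Periodic small-file compaction on recent partitions.
athena.start_query_execution(
    QueryString="OPTIMIZE analytics.clickstream_iceberg REWRITE DATA USING BIN_PACK "
                "WHERE event_date >= date '2024-06-01'",
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",
)
```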
Topic: Data Store Management
When selecting between Amazon Kinesis Data Streams and Amazon MSK for streaming transport and storage, which THREE statements are false or unsafe assumptions about ordering, replay, throughput, or the consumer model? (Select THREE.)
Options:
A. Kinesis Data Streams guarantees ordering across all shards
B. Kinesis provides ordered records only within a shard
C. MSK uses partitions and consumer groups for parallel consumption
D. Amazon MSK deletes records immediately after consumers read them
E. Kinesis Data Streams requires you to manage Kafka brokers
F. Both services support replay within configured retention
Correct answers: A, D and E
Explanation: Ordering guarantees are scoped to a shard in Kinesis and to a partition in Kafka/MSK, not to the entire stream/topic. Both services retain data for a configured period, which enables replay by re-reading from a prior position (sequence number or offset). MSK is Kafka-compatible, while Kinesis is fully managed and does not require operating Kafka brokers.
Choose between Kinesis Data Streams and MSK by matching service semantics to your requirements. For ordering, Kinesis preserves order within a shard (driven by partition key), and Kafka/MSK preserves order within a partition; neither provides inherent global ordering across all shards/partitions without adding application-side constraints. For replay, both retain records for a configured retention window, so consumers can reprocess data by reading again from an earlier sequence/offset. For the consumer model, MSK follows Kafka consumer groups and partition assignment, while Kinesis supports shared-shard consumption (and optional fan-out patterns) without you managing Kafka infrastructure. The key takeaway is to scope ordering and replay to shard/partition boundaries and not assume delete-on-read semantics.
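A minimal boto3 sketch of these two points (stream name and partition key are illustrative): ordering is controlled by the partition key that maps records to a shard, and replay works by re-reading from an earlier position within the retention window rather than relying on delete-on-read.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
stream = "clickstream-events"  # hypothetical stream

# Records sharing a partition key land on the same shard, so they stay ordered relative
# to each other; there is no ordering guarantee across shards.
kinesis.put_record(
    StreamName=stream,
    PartitionKey="user-42",
    Data=json.dumps({"user_id": "user-42", "action": "click"}).encode("utf-8"),
)

# Replay: records are not deleted when read. Within the configured retention window,
# a consumer can start again from the oldest available record or a saved sequence number.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
```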
Topic: Data Store Management
A company stores source data in an Amazon S3 “raw” zone and uses AWS Glue ETL jobs to produce Parquet tables in an S3 “curated” zone for Amazon Athena. The curated tables undergo schema evolution, and auditors require end-to-end data lineage showing which upstream datasets and transformations produced each curated dataset, with minimal custom code.
Which TWO actions will best establish this lineage using AWS tools? (Select TWO.)
Options:
A. Use Athena partition projection instead of updating partitions in the catalog
B. Encrypt S3 objects with KMS CMKs and rotate keys regularly
C. Enable S3 versioning on the raw and curated buckets
D. Enable CloudTrail S3 data events to reconstruct lineage from access logs
E. Catalog raw and curated tables and run Glue ETL using catalog tables
F. Use Amazon DataZone to capture and visualize lineage for data assets
Correct answers: E and F
Explanation: Data lineage is established by recording metadata relationships between datasets and the transformations that produce them. Registering datasets in the AWS Glue Data Catalog and running Glue ETL jobs against catalog tables creates explicit, traceable input/output links. A data catalog/portal layer such as Amazon DataZone can then present that lineage to auditors and users across the platform.
The core requirement is dataset-level lineage (upstream sources → transformations → downstream tables), not just storage history or security/audit logs. The most direct way to establish lineage with minimal custom code is to standardize on managed metadata: register raw and curated datasets as AWS Glue Data Catalog tables and ensure ETL jobs use the catalog as their sources and targets, so relationships are captured at the table level even as schemas evolve.
A governance/consumption layer such as Amazon DataZone can then aggregate and visualize lineage for published data assets, making it easier for auditors and consumers to trace where curated datasets came from. Storage versioning, encryption, partition projection, and access logs help with other concerns (recovery, security, performance, auditing) but do not create transformation lineage graphs by themselves.
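A small Glue ETL sketch of the catalog-centric pattern (database and table names are assumptions): reading and writing through Data Catalog tables, rather than raw S3 paths, is what makes each job's input/output relationships traceable.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Source and target are both Data Catalog tables (hypothetical names), so the job's
# inputs and outputs are recorded as catalog entities rather than anonymous S3 paths.
raw_events = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="support_events_raw"
)

# ...transformations on raw_events would go here...

glue_context.write_dynamic_frame.from_catalog(
    frame=raw_events, database="curated_db", table_name="support_events_curated"
)
```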
Topic: Data Store Management
A company stores curated clickstream data in an Amazon S3 data lake and queries it with Amazon Athena. AWS Glue jobs write the curated tables and downstream dashboards expect stable column names and types. The source team frequently introduces new columns, renames fields, and occasionally changes types (for example, int to bigint).
Which TWO actions will best minimize pipeline breakage while allowing controlled schema evolution? (Select TWO.)
Options:
A. Use Apache Iceberg tables in S3 with Glue Catalog
B. Standardize on SELECT * in Athena/ETL queries
C. Create separate tables per schema version and union later
D. Expose a stable Athena view with aliases/casts for consumers
E. Let Glue crawlers automatically overwrite schemas on each run
F. Enforce SSE-KMS via S3 bucket policy for all writes
Correct answers: A and D
Explanation: Using a table format that natively supports schema evolution lets you apply changes intentionally without rewriting all historical data or breaking readers. Pairing that with a stable consumer-facing interface (a view) decouples downstream queries from frequent upstream renames and type shifts while you transition clients on your schedule.
To minimize breakage from new columns, renamed fields, and type changes, separate the physical storage evolution from the logical contract consumers depend on. Apache Iceberg (backed by the AWS Glue Data Catalog) is designed for data-lake table evolution, allowing controlled ALTER TABLE operations (for example, adding columns and renaming columns) while tracking schema in table metadata.
On the consumption side, an Athena view can provide a stable schema contract by:
- Aliasing renamed source columns back to the names consumers expect
- Casting changed types (for example, int to bigint) to the types dashboards rely on

This combination reduces emergency fixes in Glue jobs and dashboards when upstream schemas drift. In contrast, automatic schema replacement and SELECT * tend to propagate breaking changes immediately: SELECT * makes consumers sensitive to renames, reordering, and incompatible type changes.
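A hedged example of such a consumer-facing view (database, table, and column names are illustrative): the view aliases a renamed source column and casts a widened type back to what dashboards expect, so upstream drift is absorbed in one place.

```python
import boto3

athena = boto3.client("athena")

# The underlying Iceberg table evolved (a column was renamed, an int widened to bigint);
# the view keeps the contract that dashboards were built against.
view_sql = """
CREATE OR REPLACE VIEW analytics.clickstream_stable AS
SELECT
  event_identifier AS event_id,                       -- source column was renamed upstream
  CAST(session_length AS bigint) AS session_length,   -- type widened; cast keeps it explicit
  event_date
FROM analytics.clickstream_iceberg
"""

athena.start_query_execution(
    QueryString=view_sql,
    QueryExecutionContext={"Database": "analytics"},
    WorkGroup="primary",  # assumes the workgroup has a query result location configured
)
```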
Topic: Data Store Management
In an AWS data lake that uses AWS Glue Data Catalog and AWS Lake Formation, the team is documenting why a data catalog is needed.
Select THREE statements about a data catalog that are FALSE or unsafe.
Options:
A. It centralizes table, column, and partition metadata for services like Athena and Redshift Spectrum.
B. If a principal has Lake Formation access to a table, no Amazon S3 or AWS KMS permissions are needed to read the data.
C. It stores the actual dataset bytes, so deleting a catalog table deletes the data in Amazon S3.
D. It automatically prevents breaking schema drift in downstream queries without any updates or validation.
E. It can support governance by letting Lake Formation manage permissions on cataloged databases and tables.
F. It improves discovery by providing consistent schemas and searchable metadata such as owners and tags.
Correct answers: B, C and D
Explanation: A data catalog is a centralized metadata layer that enables dataset discovery and consistent schema usage across analytics and ETL tools. It also acts as a governance anchor (for example, via Lake Formation) by defining cataloged resources that can be permissioned and audited. The catalog does not replace storage-layer permissions or automatically solve schema evolution risks.
The core purpose of a data catalog is to store and publish metadata: dataset locations, table definitions, columns, partitions, and business context (owners/tags/descriptions). This enables discovery (people and tools can find and understand datasets) and consistent schema usage (multiple engines reference the same definitions instead of duplicating them).
In AWS, the Glue Data Catalog is commonly used by Athena, Glue jobs, and Redshift Spectrum. Lake Formation builds governance on top of catalog resources, but it does not magically eliminate the need for underlying data-plane controls: access to S3 objects (and KMS keys for encrypted data) must still be allowed. Also, catalogs record schema; preventing downstream breaks from schema drift still requires managed schema evolution, testing, and/or controlled updates (for example, crawler settings and versioning practices).
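A tiny boto3 check that makes the "metadata, not data" point concrete (database and table names are placeholders): the catalog entry returns a schema and an S3 location, while the bytes themselves, and the S3/KMS permissions needed to read them, live in the storage layer.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical catalog table: only metadata comes back, never the dataset bytes.
table = glue.get_table(DatabaseName="analytics", Name="clickstream")["Table"]

print(table["StorageDescriptor"]["Location"])                            # where the data lives in S3
print([col["Name"] for col in table["StorageDescriptor"]["Columns"]])    # the registered schema

# Reading the actual objects still requires S3 (and KMS, if encrypted) permissions,
# regardless of any Lake Formation grant on the catalog table.
```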
Use the AWS DEA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try AWS DEA-C01 on Web
View AWS DEA-C01 Practice Test
Read the AWS DEA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.