Try 10 focused SnowPro Data Engineer DEA-C02 questions on Sourcing and Ingestion, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
Try SnowPro Data Engineer DEA-C02 on Web
View the full SnowPro Data Engineer DEA-C02 practice page
| Field | Detail |
|---|---|
| Exam route | SnowPro Data Engineer DEA-C02 |
| Topic area | Data Sourcing, Storage, and Ingestion |
| Blueprint weight | 22% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Data Sourcing, Storage, and Ingestion for SnowPro Data Engineer DEA-C02. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 22% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Data Sourcing, Storage, and Ingestion
A retail company has a custom service that emits order events one record at a time. The data engineering team needs those events in a Snowflake table within a few seconds for downstream processing with streams and tasks, and they want to avoid landing micro-files in cloud object storage first. Which Snowflake design is the BEST fit?
Options:
A. Schedule a task to run COPY INTO from a stage every minute
B. Write files to a stage and query them with an external table
C. Use Snowpipe Streaming from the service into a target table
D. Batch micro-files in cloud storage and trigger Snowpipe loads
Best answer: C
Explanation: This is a continuous row-ingestion requirement, not a cloud-storage selection problem. Snowpipe Streaming is designed for application-driven ingestion into Snowflake tables with seconds-level latency and no file-staging step.
Snowpipe Streaming is the best fit when an application or service produces records continuously and the target is a Snowflake table that must be updated within seconds. In this scenario, the deciding requirements are row-at-a-time ingestion, low latency for downstream processing with streams and tasks, and avoiding staged micro-files in object storage. Snowpipe Streaming is purpose-built for that pattern.
File-based Snowpipe is still useful when files already land in a stage, but it is not the best match when the source emits individual events and file staging is explicitly unwanted.
The COPY INTO option adds polling latency and extra orchestration, so it misses the continuous, seconds-level ingestion goal.

Topic: Data Sourcing, Storage, and Ingestion
An engineering team started with the following definition for lake data already used by Spark:
```sql
CREATE EXTERNAL TABLE lake_orders (
    order_id NUMBER,
    order_ts TIMESTAMP_NTZ
)
LOCATION = @orders_stage
FILE_FORMAT = (TYPE = PARQUET);
```
The same object storage path is already maintained as an Apache Iceberg table by another platform, and both Snowflake and that platform must read the same table metadata without copying the data. What is the best next step?
Options:
A. Keep the external table and enable auto-refresh
B. Create an externally managed Iceberg table
C. Load the files into a Snowflake native table
D. Create a dynamic table on top of the external table
Best answer: B
Explanation: This scenario is driven by open table format interoperability, not just reading Parquet files. Because another platform already maintains Apache Iceberg metadata and Snowflake must use that same metadata without copying data, an externally managed Iceberg table is the best fit.
Iceberg-related table patterns fit when the requirement is to preserve an open table format and share table semantics across engines. Here, the lake path is already managed as an Apache Iceberg table by another platform, and Snowflake must read that same table metadata in place. That points to an externally managed Iceberg table.
An external table can expose files in object storage, but it is still a file-oriented pattern and does not make Snowflake participate in the shared Iceberg metadata layer. Loading the data into a native table breaks the no-copy requirement, and a dynamic table solves transformation refresh, not cross-engine table interoperability.
When the deciding constraint is shared open-format metadata, choose an Iceberg table pattern instead of a basic external table.
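As a rough sketch of the recommended pattern, an externally managed Iceberg table points Snowflake at metadata owned by an external catalog. The object names here (`ext_vol`, `glue_cat`) are illustrative, not from the scenario:

```sql
-- Sketch only: assumes an external volume and a catalog integration
-- (e.g. to AWS Glue) have already been created. The external catalog
-- owns the Iceberg metadata; Snowflake reads the same table in place.
CREATE ICEBERG TABLE lake_orders
  EXTERNAL_VOLUME = 'ext_vol'
  CATALOG = 'glue_cat'
  CATALOG_TABLE_NAME = 'lake_orders';
```

Because the table is externally managed, Snowflake reads but does not write the Iceberg metadata, which is exactly the cross-engine, no-copy requirement in the question.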
Topic: Data Sourcing, Storage, and Ingestion
A team ingests clickstream events by buffering records in an application and writing one staged JSON file every 60 seconds. A Snowpipe with auto-ingest loads each file within 5–10 seconds after the file lands, but the product team requires each event to be queryable within 2 seconds of creation. Pipe health is normal. What is the best next step?
Options:
A. Add a stream on the target table and run a task every minute.
B. Use Snowpipe Streaming from the event producer or connector.
C. Split each 60-second batch into smaller files before staging.
D. Increase the warehouse size for the target table workloads.
Best answer: B
Explanation: The problem is not Snowpipe performance after files arrive; it is the file-based buffering before Snowflake can ingest anything. A true near real-time requirement of 2 seconds needs row-level ingestion, not staged files created once per minute.
Continuous file loading is still limited by when a file is closed and staged. In this scenario, Snowpipe is already loading promptly after file arrival, so the dominant delay is the application’s 60-second batching window before Snowflake even sees the data. That means the design is continuous, but not truly near real-time.
Changing compute size or downstream CDC objects will not remove latency that happens before data reaches Snowflake.
Topic: Data Sourcing, Storage, and Ingestion
A partner delivers daily CSV files with a header row. Existing field names stay the same, but the partner occasionally reorders columns and adds new ones. The team wants the raw load to keep working without repeated manual redesign.
```sql
create or replace table raw_orders (
    order_id number,
    customer_id number,
    order_ts timestamp_ntz,
    amount number(10,2)
);

copy into raw_orders
from @partner_stage/orders/
file_format = (type = csv skip_header = 1);
```
Which change best makes this ingestion design resilient to those file structure changes?
Options:
A. Enable schema evolution and load CSV columns by header name
B. Switch ingestion to an external table over the stage
C. Add a stream and task to detect layout changes
D. Increase warehouse size and keep positional COPY INTO
Best answer: A
Explanation: The current load is positional because it skips the CSV header and does not match columns by name. To tolerate reordered and newly added columns, use Snowflake schema evolution on the target table and load from parsed headers with column-name matching.
This COPY INTO pattern is fragile because skip_header = 1 ignores the header row and loads CSV fields by position. If the partner reorders columns, values can land in the wrong target columns; if new columns appear, the table still needs manual changes.
The Snowflake-native design for this case is to enable schema evolution on the target table and load by column name instead of position. For CSV files, use a file format that parses headers, typically PARSE_HEADER = TRUE and ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE, and use MATCH_BY_COLUMN_NAME in COPY INTO or a pipe. With ENABLE_SCHEMA_EVOLUTION = TRUE, Snowflake can add compatible new columns during load and map existing fields by header name.
More compute or downstream automation can help operations, but they do not make the core ingestion pattern self-adapting.
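A minimal sketch of the name-based, schema-evolving load described above (the file format name `csv_hdr` is illustrative):

```sql
-- Parse the header row instead of skipping it, and tolerate
-- files that carry extra columns.
CREATE OR REPLACE FILE FORMAT csv_hdr
  TYPE = CSV
  PARSE_HEADER = TRUE
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE;

-- Allow Snowflake to add compatible new columns during load.
ALTER TABLE raw_orders SET ENABLE_SCHEMA_EVOLUTION = TRUE;

-- Map CSV fields to target columns by header name, not position.
COPY INTO raw_orders
FROM @partner_stage/orders/
FILE_FORMAT = (FORMAT_NAME = 'csv_hdr')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```

With this pattern, reordered columns land correctly by name, and newly added partner columns can appear on the table without a manual DDL change.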
Topic: Data Sourcing, Storage, and Ingestion
Orders land every minute in cloud storage and Snowpipe loads them into RAW_ORDERS. A standard stream on RAW_ORDERS feeds a task that MERGEs into CURATED_ORDERS, which is shared to another account. After a weekend outage, the task fails with Stream RAW_ORDERS_STRM is stale, while COPY_HISTORY shows Snowpipe kept loading files. What is the best next step?
Options:
A. Recreate the secure share for CURATED_ORDERS.
B. Upsize the task warehouse and resume the task.
C. Re-run Snowpipe for every missed source file.
D. Recreate the stream and backfill missed changes from RAW_ORDERS.
Best answer: D
Explanation: The ingestion layer is healthy because Snowpipe continued loading into RAW_ORDERS. The failure is in change capture: once a stream becomes stale, its missed offset is no longer usable, so the pipeline must rebuild CDC state and backfill from the raw table before downstream delivery can be current again.
This pipeline has four stages: ingestion with Snowpipe, change capture with a stream, transformation with a task and MERGE, and delivery through a secure share. The observability clues isolate the problem: COPY_HISTORY confirms ingestion succeeded, and the stale-stream error shows the CDC layer failed after the outage.
When a stream is stale, Snowflake cannot replay the missed change records from that old stream position. The practical recovery is to recreate the stream, then backfill the missed changes from RAW_ORDERS into CURATED_ORDERS before resuming the task.

The key takeaway is that the raw ingested table remains the recovery source of truth; rerunning ingestion or changing delivery settings does not repair a stale CDC cursor.
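A recovery sketch using the scenario's object names; the column names (`order_id`, `order_ts`) and the load-timestamp predicate are illustrative assumptions:

```sql
-- Recreating the stream resets its offset to the current table version,
-- so it only captures changes from this point forward.
CREATE OR REPLACE STREAM raw_orders_strm ON TABLE raw_orders;

-- Backfill rows that landed during the outage directly from RAW_ORDERS
-- (e.g. filtered on an assumed load-time column), then resume the task.
MERGE INTO curated_orders c
USING (
  SELECT * FROM raw_orders
  WHERE load_ts >= '<outage start>'   -- placeholder, left unresolved
) r
  ON c.order_id = r.order_id
WHEN MATCHED THEN UPDATE SET c.order_ts = r.order_ts
WHEN NOT MATCHED THEN INSERT (order_id, order_ts)
  VALUES (r.order_id, r.order_ts);
```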
COPY_HISTORY already shows the source files were loaded into RAW_ORDERS.

Topic: Data Sourcing, Storage, and Ingestion
A retail company already publishes clickstream events to Apache Kafka. Analysts need the events available in Snowflake within a few seconds for near-real-time dashboards, and the data engineering team wants to avoid building and maintaining a custom ingestion application. Which approach is the BEST fit?
Options:
A. Build a custom service with the Snowpipe Streaming SDK
B. Run scheduled COPY INTO commands every minute from an external stage
C. Use Snowpipe auto-ingest on files written from Kafka to cloud storage
D. Configure the Snowflake Connector for Kafka with Snowpipe Streaming
Best answer: D
Explanation: When Kafka is already the source and the latency target is only a few seconds, the Snowflake Connector for Kafka with Snowpipe Streaming is the best fit. It provides low-latency ingestion through a supported connector pattern and avoids unnecessary custom application development.
The key decision is to match the source pattern and latency requirement while minimizing operational complexity. Because the events already exist in Kafka, the best Snowflake-native design is to use the Snowflake Connector for Kafka with Snowpipe Streaming. That approach is designed for low-latency ingestion from Kafka and avoids the extra engineering effort of writing and operating a custom streaming application.
File-based patterns such as staging files and triggering Snowpipe, or polling with scheduled COPY INTO, add batching and storage steps that increase latency. A custom service built on the Snowpipe Streaming SDK could also achieve low latency, but it is not the best choice when a supported connector already fits the source system. The deciding factor is low latency from Kafka without unnecessary custom code.
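A sketch of the key connector settings for this pattern; the account URL, topic, database, and schema values are illustrative placeholders:

```properties
# Snowflake Connector for Kafka, configured to use Snowpipe Streaming
# rather than file-based Snowpipe ingestion.
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
topics=clickstream
snowflake.ingestion.method=SNOWPIPE_STREAMING
snowflake.url.name=myaccount.snowflakecomputing.com:443
snowflake.database.name=ANALYTICS
snowflake.schema.name=RAW
```

The `snowflake.ingestion.method` setting is the deciding switch: with `SNOWPIPE_STREAMING`, the connector writes rows directly into the target table instead of buffering files in a stage.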
COPY INTO is a batch ingestion pattern and is not the best fit for near-real-time dashboards.

Topic: Data Sourcing, Storage, and Ingestion
A team lands small JSON files in an external stage every few minutes. They want those files loaded into a Snowflake table automatically with minimal manual intervention, and the source system produces files rather than application row streams. Which Snowflake feature best fits this requirement?
Options:
A. Snowpipe Streaming
B. External table
C. Snowpipe
D. Stream with a task
Best answer: C
Explanation: Snowpipe is the right choice when files arrive incrementally in a stage and should be loaded automatically into a Snowflake table. It is designed for continuous file ingestion without repeated manual COPY INTO execution.
The key concept is continuous file ingestion from stages. Snowpipe watches for new files in a stage and loads them into a target table as they arrive, which matches the requirement for incremental files and minimal operational effort. This is different from features that process records before they become files, track changes after data is already in Snowflake, or leave data in external storage.
In this scenario:

- The source system produces files, not application row streams.
- Small JSON files arrive in an external stage every few minutes.
- Loading should be automatic, with minimal manual intervention.

That combination points to Snowpipe. The closest distractor is Snowpipe Streaming, but that is for low-latency row ingestion from applications or connectors rather than staged files.
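A minimal auto-ingest pipe for this scenario might look like the following; the object names are illustrative:

```sql
-- AUTO_INGEST = TRUE loads new files as cloud event notifications
-- arrive, with no scheduled COPY INTO required.
CREATE PIPE clicks_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO clicks_raw
FROM @clicks_stage
FILE_FORMAT = (TYPE = JSON);
```

This assumes event notifications from the cloud provider are routed to the pipe's notification channel, which is the remaining setup step outside Snowflake.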
Topic: Data Sourcing, Storage, and Ingestion
A data engineer uses the following design for files in an external cloud location:
```sql
CREATE OR REPLACE STAGE doc_stage
  URL = 's3://acme-legal/contracts/'
  DIRECTORY = (ENABLE = TRUE);

CREATE OR REPLACE EXTERNAL TABLE ext_contracts (...)
  WITH LOCATION = @doc_stage
  AUTO_REFRESH = TRUE
  FILE_FORMAT = (TYPE = PARQUET);
```
The location will stop receiving Parquet files and will instead receive scanned PDF contracts. Analysts only need file inventory in Snowflake before downstream document processing. What is the best next design choice?
Options:
A. Switch the external table to TYPE=JSON for the PDFs.
B. Use the stage directory table for metadata; process PDFs separately.
C. Load the PDFs with COPY INTO and column matching.
D. Keep the external table and add search optimization.
Best answer: B
Explanation: The key change is that Parquet is a queryable file format, while scanned PDFs are unstructured files. For an inventory-only requirement, keep the files in a stage and use the directory table for metadata instead of an external table.
External tables are designed for externally stored structured or semi-structured files whose contents can be projected into columns. When the source changes from Parquet to scanned PDFs, the design choice changes because PDFs are unstructured data. Changing the file format or trying to load them with tabular load options does not make the document contents queryable as rows.
Because the requirement is only to inventory files inside Snowflake, the existing stage with DIRECTORY=(ENABLE=TRUE) is the correct anchor. Query the stage’s directory table for metadata such as file path, size, and last modified time, and use a separate unstructured-data processing step later if document content must be extracted.
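The inventory query against the existing stage is a one-liner over its directory table; the `.pdf` filter is illustrative:

```sql
-- RELATIVE_PATH, SIZE, and LAST_MODIFIED are standard
-- directory-table columns for a stage with DIRECTORY=(ENABLE=TRUE).
SELECT relative_path, size, last_modified
FROM DIRECTORY(@doc_stage)
WHERE relative_path ILIKE '%.pdf';
```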
The closest distractor is switching file format, but file formats do not convert binary documents into VARIANT records.
COPY INTO with column matching applies to tabular file loads, not binary document ingestion.

Topic: Data Sourcing, Storage, and Ingestion
A team stores JSON and Parquet files in cloud object storage and must leave the files in place for existing data-lake processes. They want Snowflake to query the file contents and file metadata without first loading the files into native Snowflake storage. Which Snowflake object is designed for this use case?
Options:
A. External table
B. Dynamic table
C. Snowpipe
D. Iceberg table
Best answer: A
Explanation: An external table is the Snowflake object built to query supported files in external stages while the data stays in cloud storage. That makes it the right fit when semi-structured lake files such as JSON or Parquet must remain outside native Snowflake storage.
The core concept is query in place over external files. An external table stores metadata about files in an external stage and lets Snowflake query supported file formats such as JSON and Parquet without first ingesting them into native table storage. That design matters when storage location is a requirement, such as keeping lake files in object storage for shared access or existing lake workflows.
An Iceberg table also involves external storage, but it relies on Apache Iceberg table metadata and represents an Iceberg-managed table format, not just a set of raw files to expose directly. Snowpipe is for loading files into tables, and dynamic tables materialize transformed query results inside Snowflake.
When the requirement is to leave semi-structured files in the lake and still query them from Snowflake, external tables are the direct fit.
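A query-in-place sketch for this pattern; the object names and the `event_id` field are illustrative:

```sql
-- An external table exposes each file's record through the VALUE
-- VARIANT column, while the files stay in object storage.
CREATE EXTERNAL TABLE lake_events
  LOCATION = @lake_stage/events/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = TRUE;

-- Project semi-structured fields, plus file metadata via the
-- METADATA$FILENAME pseudocolumn.
SELECT value:event_id::NUMBER AS event_id,
       metadata$filename      AS source_file
FROM lake_events;
```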
Topic: Data Sourcing, Storage, and Ingestion
An on-premises ERP exposes order changes only through a REST API. The current pipeline runs an hourly export job that writes JSON files to cloud object storage, and Snowpipe loads them into Snowflake. When the export job retries, downstream streams and tasks see order updates 30–60 minutes late. The business now requires near-real-time ingestion with fewer missed changes. What is the best next step?
Options:
A. Use a connector service to read the API and ingest with Snowpipe Streaming.
B. Create an external table on the staged JSON files.
C. Run scheduled COPY INTO from the same stage every minute.
D. Increase the warehouse size used by downstream transformation tasks.
Best answer: A
Explanation: This is a source-ingestion pattern problem, not a warehouse or file-query problem. Because the source is an on-premises API, the hourly file export is creating the lag, so the design should shift to a connector or ingestion service that reads the API directly and streams data into Snowflake.
Snowpipe and staged-file ingestion are strong choices when data naturally arrives in cloud object storage. In this scenario, the real bottleneck is upstream: an on-premises REST API is being converted into hourly files before Snowflake ever sees the changes. That file-export step causes the stale CDC behavior.
A better design is to use a source-specific connector or custom ingestion service that reads the API directly and delivers rows into Snowflake, ideally with Snowpipe Streaming for low-latency ingestion. This moves the integration pattern closer to the actual source system and reduces dependence on batch file drops. Tuning warehouses, polling the same stage faster, or querying the staged files differently only optimizes after the slowest step has already occurred.
When the source is an API or on-premises system, choose an ingestion pattern built for that source rather than relying on cloud-object storage alone.
COPY INTO still waits for files to appear in object storage, so the stale-change problem remains upstream.

Use the SnowPro Data Engineer DEA-C02 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try SnowPro Data Engineer DEA-C02 on Web
View the SnowPro Data Engineer DEA-C02 Practice Test
Read the SnowPro Data Engineer DEA-C02 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.