SnowPro Data Engineer DEA-C02: Sourcing and Ingestion

Try 10 focused SnowPro Data Engineer DEA-C02 questions on Sourcing and Ingestion, with explanations, then continue with IT Mastery.


Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.


Topic snapshot

  • Exam route: SnowPro Data Engineer DEA-C02

  • Topic area: Data Sourcing, Storage, and Ingestion

  • Blueprint weight: 22%

  • Page purpose: Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Data Sourcing, Storage, and Ingestion for SnowPro Data Engineer DEA-C02. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

  • First attempt: answer without checking the explanation first. Record the fact, rule, calculation, or judgment point that controlled your answer.

  • Review: read the explanation even when you were correct. Record why the best answer is stronger than the closest distractor.

  • Repair: repeat only missed or uncertain items after a short break. Record the pattern behind misses, not the answer letter.

  • Transfer: return to mixed practice once the topic feels stable. Record whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 22% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Data Sourcing, Storage, and Ingestion

A retail company has a custom service that emits order events one record at a time. The data engineering team needs those events in a Snowflake table within a few seconds for downstream processing with streams and tasks, and they want to avoid landing micro-files in cloud object storage first. Which Snowflake design is the BEST fit?

Options:

  • A. Schedule a task to run COPY INTO from a stage every minute

  • B. Write files to a stage and query them with an external table

  • C. Use Snowpipe Streaming from the service into a target table

  • D. Batch micro-files in cloud storage and trigger Snowpipe loads

Best answer: C

Explanation: This is a continuous row-ingestion requirement, not a cloud-storage selection problem. Snowpipe Streaming is designed for application-driven ingestion into Snowflake tables with seconds-level latency and no file-staging step.

Snowpipe Streaming is the best fit when an application or service produces records continuously and the target is a Snowflake table that must be updated within seconds. In this scenario, the deciding requirements are row-at-a-time ingestion, low latency for downstream processing with streams and tasks, and avoiding staged micro-files in object storage. Snowpipe Streaming is purpose-built for that pattern.

  • It ingests records directly from the producer.
  • It avoids the operational overhead of creating and loading many small files.
  • It supports near-real-time table availability for Snowflake-native pipelines.

File-based Snowpipe is still useful when files already land in a stage, but it is not the best match when the source emits individual events and file staging is explicitly unwanted.

  • The external table option keeps data in object storage and is meant for querying staged files, not for loading row events into a Snowflake table for CDC-style processing.
  • The file-based Snowpipe option can automate file ingestion, but it still requires creating and managing staged micro-files first.
  • The scheduled COPY INTO option adds polling latency and extra orchestration, so it misses the continuous seconds-level ingestion goal.

Question 2

Topic: Data Sourcing, Storage, and Ingestion

An engineering team started with the following definition for lake data already used by Spark:

CREATE EXTERNAL TABLE lake_orders (
  order_id NUMBER AS (VALUE:order_id::NUMBER),
  order_ts TIMESTAMP_NTZ AS (VALUE:order_ts::TIMESTAMP_NTZ)
)
LOCATION=@orders_stage
FILE_FORMAT=(TYPE=PARQUET);

The same object storage path is already maintained as an Apache Iceberg table by another platform, and both Snowflake and that platform must read the same table metadata without copying the data. What is the best next step?

Options:

  • A. Keep the external table and enable auto-refresh

  • B. Create an externally managed Iceberg table

  • C. Load the files into a Snowflake native table

  • D. Create a dynamic table on top of the external table

Best answer: B

Explanation: This scenario is driven by open table format interoperability, not just reading Parquet files. Because another platform already maintains Apache Iceberg metadata and Snowflake must use that same metadata without copying data, an externally managed Iceberg table is the best fit.

Iceberg-related table patterns fit when the requirement is to preserve an open table format and share table semantics across engines. Here, the lake path is already managed as an Apache Iceberg table by another platform, and Snowflake must read that same table metadata in place. That points to an externally managed Iceberg table.
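
A minimal sketch of that pattern, assuming a catalog integration to the other platform’s Iceberg catalog already exists (the integration, volume, and table names here are illustrative):

CREATE ICEBERG TABLE lake_orders_iceberg
  EXTERNAL_VOLUME = 'lake_ext_volume'   -- external volume over the existing storage path
  CATALOG = 'iceberg_catalog_int'       -- catalog integration to the externally managed catalog
  CATALOG_TABLE_NAME = 'lake_orders';   -- table name as registered in that catalog

Snowflake then reads the Iceberg metadata maintained by the other platform instead of tracking raw files itself.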

An external table can expose files in object storage, but it is still a file-oriented pattern and does not make Snowflake participate in the shared Iceberg metadata layer. Loading the data into a native table breaks the no-copy requirement, and a dynamic table solves transformation refresh, not cross-engine table interoperability.

When the deciding constraint is shared open-format metadata, choose an Iceberg table pattern instead of a basic external table.

  • Auto-refresh is insufficient: external tables can track file changes, but they do not use shared Iceberg table metadata.
  • Loading the files copies data into Snowflake storage, which violates the requirement to keep the data in place.
  • A dynamic table is the wrong abstraction: it manages derived results and refresh logic, not open table format interoperability.

Question 3

Topic: Data Sourcing, Storage, and Ingestion

A team ingests clickstream events by buffering records in an application and writing one staged JSON file every 60 seconds. A Snowpipe with auto-ingest loads each file within 5–10 seconds after the file lands, but the product team requires each event to be queryable within 2 seconds of creation. Pipe health is normal. What is the best next step?

Options:

  • A. Add a stream on the target table and run a task every minute.

  • B. Use Snowpipe Streaming from the event producer or connector.

  • C. Split each 60-second batch into smaller files before staging.

  • D. Increase the warehouse size for the target table workloads.

Best answer: B

Explanation: The problem is not Snowpipe performance after files arrive; it is the file-based buffering before Snowflake can ingest anything. A true near-real-time requirement of 2 seconds needs row-level ingestion, not staged files created once per minute.

Continuous file loading is still limited by when a file is closed and staged. In this scenario, Snowpipe is already loading promptly after file arrival, so the dominant delay is the application’s 60-second batching window before Snowflake even sees the data. That means the design is continuous, but not truly near real-time.

  • Records can wait almost a full minute before a file exists.
  • Snowpipe only begins after the file lands in the stage.
  • Snowpipe Streaming is designed for low-latency row ingestion from producers or supported connectors.

Changing compute size or downstream CDC objects will not remove latency that happens before data reaches Snowflake.

  • A bigger warehouse helps compute after data is available, but it does not shorten the application’s 60-second buffering period.
  • A stream plus a task works only after rows are loaded, so it cannot reduce source-to-ingest latency.
  • Smaller files may tune file loading, but they still depend on periodic file creation instead of event-level ingestion.

Question 4

Topic: Data Sourcing, Storage, and Ingestion

A partner delivers daily CSV files with a header row. Existing field names stay the same, but the partner occasionally reorders columns and adds new ones. The team wants the raw load to keep working without repeated manual redesign.

create or replace table raw_orders (
  order_id number,
  customer_id number,
  order_ts timestamp_ntz,
  amount number(10,2)
);

copy into raw_orders
from @partner_stage/orders/
file_format = (type = csv skip_header = 1);

Which change best makes this ingestion design resilient to those file structure changes?

Options:

  • A. Enable schema evolution and load CSV columns by header name

  • B. Switch ingestion to an external table over the stage

  • C. Add a stream and task to detect layout changes

  • D. Increase warehouse size and keep positional COPY INTO

Best answer: A

Explanation: The current load is positional because it skips the CSV header and does not match columns by name. To tolerate reordered and newly added columns, use Snowflake schema evolution on the target table and load from parsed headers with column-name matching.

This COPY INTO pattern is fragile because skip_header = 1 ignores the header row and loads CSV fields by position. If the partner reorders columns, values can land in the wrong target columns; if new columns appear, the table still needs manual changes.

The Snowflake-native design for this case is to enable schema evolution on the target table and load by column name instead of position. For CSV files, use a file format that parses headers, typically PARSE_HEADER = TRUE and ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE, and use MATCH_BY_COLUMN_NAME in COPY INTO or a pipe. With ENABLE_SCHEMA_EVOLUTION = TRUE, Snowflake can add compatible new columns during load and map existing fields by header name.
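
A minimal sketch of that combination, reusing the objects from the question (the file format name is illustrative):

alter table raw_orders set enable_schema_evolution = true;

create or replace file format csv_named_cols
  type = csv
  parse_header = true                      -- read column names from the header row
  error_on_column_count_mismatch = false;  -- tolerate files with extra columns

copy into raw_orders
from @partner_stage/orders/
file_format = (format_name = 'csv_named_cols')
match_by_column_name = case_insensitive;   -- map fields by header name, not position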

More compute or downstream automation can help operations, but they do not make the core ingestion pattern self-adapting.

  • More compute changes load speed, not how CSV fields are mapped into table columns.
  • Stream and task logic adds custom detection and DDL work instead of using built-in schema evolution during load.
  • External table usage can expose staged files, but it does not automatically evolve the schema of a loaded raw table.

Question 5

Topic: Data Sourcing, Storage, and Ingestion

Orders land every minute in cloud storage, and Snowpipe loads them into RAW_ORDERS. A standard stream on RAW_ORDERS feeds a task that runs a MERGE into CURATED_ORDERS, which is shared to another account. After a weekend outage, the task fails with "Stream RAW_ORDERS_STRM is stale", while COPY_HISTORY shows that Snowpipe kept loading files. What is the best next step?

Options:

  • A. Recreate the secure share for CURATED_ORDERS.

  • B. Upsize the task warehouse and resume the task.

  • C. Re-run Snowpipe for every missed source file.

  • D. Recreate the stream and backfill missed changes from RAW_ORDERS.

Best answer: D

Explanation: The ingestion layer is healthy because Snowpipe continued loading into RAW_ORDERS. The failure is in change capture: once a stream becomes stale, its missed offset is no longer usable, so the pipeline must rebuild CDC state and backfill from the raw table before downstream delivery can be current again.

This pipeline has four stages: ingestion with Snowpipe, change capture with a stream, transformation with a task and MERGE, and delivery through a secure share. The observability clues isolate the problem: COPY_HISTORY confirms ingestion succeeded, and the stale-stream error shows the CDC layer failed after the outage.

When a stream is stale, Snowflake cannot replay the missed change records from that old stream position. The practical recovery is:

  • create or replace the stream at a new current offset
  • backfill the missing period from RAW_ORDERS into CURATED_ORDERS
  • resume the downstream task

The key takeaway is that the raw ingested table remains the recovery source of truth; rerunning ingestion or changing delivery settings does not repair a stale CDC cursor.
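
A hedged sketch of that recovery sequence, using the object names from the question (the column names, task name, and backfill window are hypothetical):

-- 1. Rebuild the CDC cursor at the current table version.
CREATE OR REPLACE STREAM RAW_ORDERS_STRM ON TABLE RAW_ORDERS;

-- 2. Backfill the outage window directly from the raw table.
MERGE INTO CURATED_ORDERS t
USING (
  SELECT order_id, order_status, amount
  FROM RAW_ORDERS
  WHERE load_ts >= DATEADD(day, -3, CURRENT_TIMESTAMP())  -- cover the weekend gap
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.order_status = s.order_status, t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (order_id, order_status, amount)
  VALUES (s.order_id, s.order_status, s.amount);

-- 3. Resume the downstream task.
ALTER TASK CURATED_ORDERS_TASK RESUME;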

  • Re-running ingestion fails because COPY_HISTORY already shows the source files were loaded into RAW_ORDERS.
  • Recreating the share fails because a secure share exposes current table contents but does not fix upstream CDC state.
  • Adding compute fails because a stale stream is a change-capture problem, not a warehouse-capacity problem.

Question 6

Topic: Data Sourcing, Storage, and Ingestion

A retail company already publishes clickstream events to Apache Kafka. Analysts need the events available in Snowflake within a few seconds for near-real-time dashboards, and the data engineering team wants to avoid building and maintaining a custom ingestion application. Which approach is the BEST fit?

Options:

  • A. Build a custom service with the Snowpipe Streaming SDK

  • B. Run scheduled COPY INTO commands every minute from an external stage

  • C. Use Snowpipe auto-ingest on files written from Kafka to cloud storage

  • D. Configure the Snowflake Connector for Kafka with Snowpipe Streaming

Best answer: D

Explanation: When Kafka is already the source and the latency target is only a few seconds, the Snowflake Connector for Kafka with Snowpipe Streaming is the best fit. It provides low-latency ingestion through a supported connector pattern and avoids unnecessary custom application development.

The key decision is to match the source pattern and latency requirement while minimizing operational complexity. Because the events already exist in Kafka, the best Snowflake-native design is to use the Snowflake Connector for Kafka with Snowpipe Streaming. That approach is designed for low-latency ingestion from Kafka and avoids the extra engineering effort of writing and operating a custom streaming application.

File-based patterns such as staging files and triggering Snowpipe, or polling with scheduled COPY INTO, add batching and storage steps that increase latency. A custom service built on the Snowpipe Streaming SDK could also achieve low latency, but it is not the best choice when a supported connector already fits the source system. The deciding factor is low latency from Kafka without unnecessary custom code.

  • File staging delay: writing Kafka data to cloud storage and then using Snowpipe adds a file-based hop, which is less suitable for a within-seconds target.
  • Custom app overhead: the Snowpipe Streaming SDK can work technically, but it adds application code and maintenance the scenario explicitly wants to avoid.
  • Batch polling: scheduled COPY INTO is a batch ingestion pattern and is not the best fit for near-real-time dashboards.

Question 7

Topic: Data Sourcing, Storage, and Ingestion

A team lands small JSON files in an external stage every few minutes. They want those files loaded into a Snowflake table automatically with minimal manual intervention, and the source system produces files rather than application row streams. Which Snowflake feature best fits this requirement?

Options:

  • A. Snowpipe Streaming

  • B. External table

  • C. Snowpipe

  • D. Stream with a task

Best answer: C

Explanation: Snowpipe is the right choice when files arrive incrementally in a stage and should be loaded automatically into a Snowflake table. It is designed for continuous file ingestion without repeated manual COPY INTO execution.

The key concept is continuous file ingestion from stages. Snowpipe watches for new files in a stage and loads them into a target table as they arrive, which matches the requirement for incremental files and minimal operational effort. This is different from features that process records before they become files, track changes after data is already in Snowflake, or leave data in external storage.
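
A minimal sketch of that setup, assuming cloud event notifications are already wired to the stage (object names are illustrative, and the target table is assumed to have a single VARIANT column):

CREATE OR REPLACE PIPE events_pipe
  AUTO_INGEST = TRUE   -- load new files on event notifications, no manual COPY INTO
AS
COPY INTO raw_events
FROM @events_stage
FILE_FORMAT = (TYPE = JSON);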

In this scenario:

  • Files arrive every few minutes
  • The source delivers files, not row-by-row events
  • The goal is automatic loading into a Snowflake table

That combination points to Snowpipe. The closest distractor is Snowpipe Streaming, but that is for low-latency row ingestion from applications or connectors rather than staged files.

  • Snowpipe Streaming is for ingesting rows directly from applications or connectors, not for auto-loading files that land in a stage.
  • Stream with a task helps capture and process table changes after ingestion, but it does not replace staged file loading.
  • External table lets Snowflake query files in external storage, but it does not automatically load those files into a native table.

Question 8

Topic: Data Sourcing, Storage, and Ingestion

A data engineer uses the following design for files in an external cloud location:

CREATE OR REPLACE STAGE doc_stage
  URL='s3://acme-legal/contracts/'
  DIRECTORY=(ENABLE=TRUE);

CREATE OR REPLACE EXTERNAL TABLE ext_contracts (...)
  WITH LOCATION=@doc_stage
  AUTO_REFRESH=TRUE
  FILE_FORMAT=(TYPE=PARQUET);

The location will stop receiving Parquet files and will instead receive scanned PDF contracts. Analysts only need file inventory in Snowflake before downstream document processing. What is the best next design choice?

Options:

  • A. Switch the external table to TYPE=JSON for the PDFs.

  • B. Use the stage directory table for metadata; process PDFs separately.

  • C. Load the PDFs with COPY INTO and column matching.

  • D. Keep the external table and add search optimization.

Best answer: B

Explanation: The key change is that Parquet is a queryable file format, while scanned PDFs are unstructured files. For an inventory-only requirement, keep the files in a stage and use the directory table for metadata instead of an external table.

External tables are designed for externally stored structured or semi-structured files whose contents can be projected into columns. When the source changes from Parquet to scanned PDFs, the design choice changes because PDFs are unstructured data. Changing the file format or trying to load them with tabular load options does not make the document contents queryable as rows.

Because the requirement is only to inventory files inside Snowflake, the existing stage with DIRECTORY=(ENABLE=TRUE) is the correct anchor. Query the stage’s directory table for metadata such as file path, size, and last modified time, and use a separate unstructured-data processing step later if document content must be extracted.
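
For example, the inventory query can run directly against the stage’s directory table (the refresh is only needed when auto-refresh is not configured; the filter is illustrative):

ALTER STAGE doc_stage REFRESH;   -- sync the directory table with the current files

SELECT relative_path, size, last_modified
FROM DIRECTORY(@doc_stage)
WHERE relative_path ILIKE '%.pdf';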

The closest distractor is switching file format, but file formats do not convert binary documents into VARIANT records.

  • Switching the file format fails because file formats parse supported structured or semi-structured encodings, not scanned PDFs.
  • Search optimization is a performance feature and does not solve the mismatch between external tables and unstructured files.
  • COPY INTO with column matching applies to tabular file loads, not binary document ingestion.

Question 9

Topic: Data Sourcing, Storage, and Ingestion

A team stores JSON and Parquet files in cloud object storage and must leave the files in place for existing data-lake processes. They want Snowflake to query the file contents and file metadata without first loading the files into native Snowflake storage. Which Snowflake object is designed for this use case?

Options:

  • A. External table

  • B. Dynamic table

  • C. Snowpipe

  • D. Iceberg table

Best answer: A

Explanation: An external table is the Snowflake object built to query supported files in external stages while the data stays in cloud storage. That makes it the right fit when semi-structured lake files such as JSON or Parquet must remain outside native Snowflake storage.

The core concept is query in place over external files. An external table stores metadata about files in an external stage and lets Snowflake query supported file formats such as JSON and Parquet without first ingesting them into native table storage. That design matters when storage location is a requirement, such as keeping lake files in object storage for shared access or existing lake workflows.
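
A minimal sketch of that query-in-place pattern (the stage, table, and projected column are illustrative):

CREATE OR REPLACE EXTERNAL TABLE lake_events (
  order_id NUMBER AS (VALUE:order_id::NUMBER)   -- projected from the implicit VALUE column
)
WITH LOCATION = @lake_stage/events/
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = PARQUET);

-- File contents and metadata are then queryable without loading:
SELECT METADATA$FILENAME, VALUE, order_id FROM lake_events LIMIT 10;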

An Iceberg table also involves external storage, but it relies on Apache Iceberg table metadata and represents an Iceberg-managed table format, not just a set of raw files to expose directly. Snowpipe is for loading files into tables, and dynamic tables materialize transformed query results inside Snowflake.

When the requirement is to leave semi-structured files in the lake and still query them from Snowflake, external tables are the direct fit.

  • The Iceberg option sounds plausible because the data can stay external, but an Iceberg table requires Apache Iceberg table metadata rather than loose JSON or Parquet lake files.
  • Snowpipe loads rather than queries: it is an ingestion service that moves files into a target table.
  • A dynamic table is a transformation layer that maintains query results, not direct access to external files.

Question 10

Topic: Data Sourcing, Storage, and Ingestion

An on-premises ERP exposes order changes only through a REST API. The current pipeline runs an hourly export job that writes JSON files to cloud object storage, and Snowpipe loads them into Snowflake. When the export job retries, downstream streams and tasks see order updates 30–60 minutes late. The business now requires near-real-time ingestion with fewer missed changes. What is the best next step?

Options:

  • A. Use a connector service to read the API and ingest with Snowpipe Streaming.

  • B. Create an external table on the staged JSON files.

  • C. Run scheduled COPY INTO from the same stage every minute.

  • D. Increase the warehouse size used by downstream transformation tasks.

Best answer: A

Explanation: This is a source-ingestion pattern problem, not a warehouse or file-query problem. Because the source is an on-premises API, the hourly file export is creating the lag, so the design should shift to a connector or ingestion service that reads the API directly and streams data into Snowflake.

Snowpipe and staged-file ingestion are strong choices when data naturally arrives in cloud object storage. In this scenario, the real bottleneck is upstream: an on-premises REST API is being converted into hourly files before Snowflake ever sees the changes. That file-export step causes the stale CDC behavior.

A better design is to use a source-specific connector or custom ingestion service that reads the API directly and delivers rows into Snowflake, ideally with Snowpipe Streaming for low-latency ingestion. This moves the integration pattern closer to the actual source system and reduces dependence on batch file drops. Tuning warehouses, polling the same stage faster, or querying the staged files differently only optimizes after the slowest step has already occurred.

When the source is an API or on-premises system, choose an ingestion pattern built for that source rather than relying on cloud-object storage alone.

  • A bigger warehouse helps downstream compute, but it does not remove the hourly API export bottleneck.
  • More frequent COPY INTO still waits for files to appear in object storage, so the stale-change problem remains upstream.
  • External tables can read staged files without loading them, but they still depend on the same delayed batch exports.

Continue with full practice

Use the SnowPro Data Engineer DEA-C02 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.


Free review resource

Read the SnowPro Data Engineer DEA-C02 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026