Try 60 free SnowPro Data Engineer DEA-C02 questions across the exam domains, with explanations, then continue with full IT Mastery practice.
This free full-length SnowPro Data Engineer DEA-C02 practice exam includes 60 original IT Mastery questions across the exam domains.
These questions are for self-assessment. They are not official exam questions and do not imply affiliation with the exam sponsor.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Need concept review first? Read the SnowPro Data Engineer DEA-C02 Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks and full IT Mastery practice.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
Try SnowPro Data Engineer DEA-C02 on Web
View full SnowPro Data Engineer DEA-C02 practice page
| Domain | Weight |
|---|---|
| Data Sourcing, Storage, and Ingestion | 22% |
| Transformations, Programmability, and Developer Workflows | 24% |
| Streaming, Orchestration, and Near Real-Time Pipeline Design | 20% |
| Sharing, Replication, and Cross-Platform Delivery | 18% |
| Compute, Governance, Observability, and Performance | 16% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.
Topic: Compute, Governance, Observability, and Performance
A data engineering team already ingests hourly order files with Snowpipe and merges them into a curated table with a task. Analysts only notice bad loads hours later, even though the load and merge complete successfully. They want a Snowflake-native way to detect missing ORDER_ID values or abnormally low row counts earlier, without redesigning the pipeline.
create or replace pipe raw.orders_pipe auto_ingest=true as
copy into raw.orders_stage
from @raw.orders_ext_stage
file_format = (type = json);
create or replace task etl.merge_orders
warehouse = etl_wh
schedule = 'USING CRON 5 * * * * UTC'
as
merge into prod.orders t
using raw.orders_stage s
on t.order_id = s.order_id
when matched then update set order_total = s.order_total
when not matched then insert (order_id, order_total)
values (s.order_id, s.order_total);
Which Snowflake feature is the best next step?
Options:
A. Create an alert that only checks task status
B. Replace the task with a dynamic table
C. Create a stream on RAW.ORDERS_STAGE
D. Associate data metric functions with PROD.ORDERS
Best answer: D
Explanation: Data metric functions are designed for table-level data quality observability. Here, the pipeline already runs, but the team needs earlier detection of bad data such as missing keys or unexpectedly low volumes without replacing Snowpipe or the task flow.
The core need is data quality observability on an existing pipeline, not a new ingestion or transformation design. Snowpipe and the merge task can both succeed even when upstream files contain missing business keys or unusually small batches. Associating data metric functions to PROD.ORDERS lets the team measure conditions like row counts or nulls on the table they already load, so issues can be surfaced earlier without rebuilding the pipeline.
This fits because it adds monitoring rather than changing processing:
- Data metric functions attach to PROD.ORDERS without modifying the pipe or the merge task.
- They can evaluate conditions such as NULL ORDER_ID counts and row counts on a schedule, so bad loads surface earlier.
The closest alternative only checks whether the task ran, but execution success does not prove the data is healthy. A minimal association is sketched below.
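A minimal sketch of the association, assuming a 15-minute evaluation schedule (the schedule value and the exact metric picks are illustrative):
alter table prod.orders set data_metric_schedule = '15 MINUTE';
-- built-in completeness and volume metrics from SNOWFLAKE.CORE
alter table prod.orders add data metric function snowflake.core.null_count on (order_id);
alter table prod.orders add data metric function snowflake.core.row_count on ();
-- metric results can then drive threshold-based alerts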
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
A data engineering team has validated the SQL in each task. However, publish_task sometimes reads curated.events before the latest refresh is complete.
Exhibit:
create or replace task ingest_task
warehouse = etl_wh
schedule = '5 MINUTES'
as
copy into raw.events from @landing/events;
create or replace task enrich_task
warehouse = etl_wh
after ingest_task
as
merge into curated.events c
using raw.events r on c.id = r.id
when matched then update set payload = r.payload
when not matched then
insert (id, payload) values (r.id, r.payload);
create or replace task publish_task
warehouse = etl_wh
after ingest_task
as
insert into mart.event_counts
select current_timestamp(), count(*) from curated.events;
What is the best next step?
Options:
A. Rewrite enrich_task as a stored procedure
B. Replace COPY INTO with Snowpipe Streaming
C. Make publish_task depend on enrich_task
D. Increase the warehouse size for enrich_task
Best answer: C
Explanation: This is a task-orchestration issue, not a transformation-logic issue. Because enrich_task and publish_task both use AFTER ingest_task, they are sibling tasks that can run independently after ingestion completes. To guarantee fresh curated data before publishing, publish_task must depend on enrich_task.
In Snowflake, the SQL inside a task defines what the task does, but the task graph defines when each step can run. Here, ingest_task is the scheduled root task. Both enrich_task and publish_task are configured with AFTER ingest_task, which makes them parallel child branches rather than a required sequence.
- ingest_task starts the graph every 5 minutes.
- enrich_task runs after ingestion finishes.
- publish_task also runs after ingestion finishes.
- Nothing forces publish_task to wait for enrich_task.
To enforce dependency ordering, change publish_task to AFTER enrich_task, as sketched below. The key takeaway is that stale downstream reads here are caused by task-graph design, not by the MERGE logic.
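A minimal sketch of the rewiring (the graph must be suspended before its dependencies are modified):
alter task ingest_task suspend;
alter task publish_task suspend;
alter task publish_task remove after ingest_task;
alter task publish_task add after enrich_task;
alter task publish_task resume;
alter task ingest_task resume;  -- resume the root last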
Topic: Transformations, Programmability, and Developer Workflows
A data engineering team loads CDC files into a staging table every 5 minutes. After each load, one task must iterate through a list of target tables, run MERGE statements, write batch results to an audit table, and stop the downstream share-refresh task if any step fails. Which Snowflake design is the best fit?
Options:
A. Use a scalar UDF inside each MERGE to write audit rows.
B. Use a Python UDF to loop through tables and execute SQL.
C. Use a Snowflake Scripting stored procedure called by the task.
D. Use an external function to send the batch to an API.
Best answer: C
Explanation: This workload needs statement orchestration, not reusable inline expression logic. A stored procedure is the right Snowflake abstraction for loops, MERGE and audit DML, and exception handling so the task can fail the batch and prevent downstream execution.
Choose a stored procedure when the logic must orchestrate SQL statements rather than compute a value inside a query. Here, the workflow must loop through multiple target tables, execute MERGE operations, insert audit records, and raise an error so the task graph stops before the share-refresh step runs.
If the requirement were reusable inline transformation logic inside SELECT, MERGE, or a dynamic table definition, a UDF would be the better abstraction.
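A hedged sketch of that shape, assuming hypothetical target tables, a simple id/payload MERGE template, and an etl.batch_audit table (all names illustrative):
create or replace procedure etl.run_cdc_batch()
returns string
language sql
as
$$
declare
  targets array default array_construct('ORDERS', 'CUSTOMERS');
  last_idx integer;
  tbl string;
  stmt string;
begin
  last_idx := array_size(targets) - 1;
  for i in 0 to last_idx do
    tbl := get(targets, i)::string;
    -- hypothetical per-table MERGE template
    stmt := 'merge into prod.' || tbl || ' t using stage.' || tbl || ' s'
         || ' on t.id = s.id'
         || ' when matched then update set t.payload = s.payload'
         || ' when not matched then insert (id, payload) values (s.id, s.payload)';
    execute immediate stmt;
    insert into etl.batch_audit(table_name, status, run_at)
      values (:tbl, 'OK', current_timestamp());
  end for;
  return 'BATCH_OK';
exception
  when other then
    insert into etl.batch_audit(table_name, status, run_at)
      values (:tbl, 'FAILED', current_timestamp());
    raise;  -- fail the calling task so the downstream share-refresh does not run
end;
$$;
The task body then reduces to call etl.run_cdc_batch();.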
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
Source-system change rows already land in RAW.CUSTOMER_CHANGES every 30 seconds. A task refreshes CORE.CUSTOMERS once per minute by scanning the raw table and filtering on an EVENT_TS watermark. Ingestion into RAW.CUSTOMER_CHANGES is current, but the task is lagging and some late-arriving updates with older EVENT_TS are skipped. What is the best next step?
Options:
A. Replace the table with an external table and process newly discovered files.
B. Increase the task warehouse size and keep using the EVENT_TS watermark.
C. Load smaller micro-batches into the raw table more frequently with Snowpipe.
D. Create a stream on RAW.CUSTOMER_CHANGES and MERGE from the stream.
Best answer: D
Explanation: This is a CDC design problem, not an ingestion problem. A stream on the landed table lets Snowflake track table changes directly, so the task can consume only new rows without repeated full-table scans or missed late arrivals. That is the right boundary when the data is already in Snowflake.
When source data has already landed in a Snowflake table, the main design choice is how to capture table changes reliably, not how to ingest files faster. A raw-table pattern that filters by EVENT_TS is only an approximation of CDC: it requires rescanning the table and can miss late-arriving records whose event time is older than the stored watermark.
A stream is Snowflake’s native change-tracking object for this situation. The downstream task should read from the stream and MERGE those rows into the target table. Snowflake then tracks which changes have been consumed by stream offset, instead of relying on a custom timestamp rule.
Scaling compute may reduce runtime, but it does not fix the incorrect CDC boundary or the skipped late-arriving updates.
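A minimal sketch of the corrected boundary (key and payload column names, and the warehouse, are assumptions):
create or replace stream raw.customer_changes_strm on table raw.customer_changes;

create or replace task core.apply_customer_changes
  warehouse = etl_wh
  schedule = '1 MINUTE'
  when system$stream_has_data('RAW.CUSTOMER_CHANGES_STRM')
as
merge into core.customers c
using raw.customer_changes_strm s
  on c.customer_id = s.customer_id
when matched then update set c.payload = s.payload
when not matched then insert (customer_id, payload)
values (s.customer_id, s.payload);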
Topic: Data Sourcing, Storage, and Ingestion
A retail application emits individual order events continuously from its services. The team needs data loaded into a Snowflake table with latency measured in seconds and does not want to create or manage staged files. Which Snowflake-native ingestion pattern is the best fit?
Options:
A. An external table on the source storage
B. A scheduled COPY INTO load from staged files
C. Snowpipe Streaming into the target table
D. Snowpipe auto-ingest from a stage
Best answer: C
Explanation: Snowpipe Streaming fits continuous, low-latency ingestion when records are pushed directly from an application or connector into Snowflake. It is the Snowflake-native choice when the delivery style is event-by-event rather than staged files.
The deciding concept is file-based versus row-based ingestion. Snowpipe Streaming is built for near-real-time ingestion of records that arrive continuously from applications, services, or connectors, and it avoids the extra step of writing micro-files to a stage first.
In this scenario, the requirements are:
- individual order events emitted continuously by application services,
- load latency measured in seconds, and
- no staged files to create or manage.
That combination maps directly to Snowpipe Streaming. By contrast, standard Snowpipe and COPY INTO patterns are file-oriented: data must first land in a stage and then be loaded. External tables are also not a load mechanism into native Snowflake tables; they let Snowflake query files where they already live. The closest distractor is standard Snowpipe, but it still assumes staged files.
COPY INTO is a batch or micro-batch pattern and adds orchestration plus file handling.
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
Orders arrive continuously into RAW_ORDERS through Snowpipe Streaming. The team needs a transformed CURATED_ORDERS dataset that joins reference tables and is usually no more than 5 minutes behind source data. The logic is SQL-only, no custom procedural branching is required, and refresh compute must run on a dedicated warehouse. Which design is the BEST fit?
Options:
A. Create a dynamic table with TARGET_LAG = '5 minutes' on a dedicated warehouse.
B. Create a stream and a 1-minute task graph to MERGE changes into a table.
C. Create a standard view and let analyst queries recompute the transformed results.
D. Add search optimization to RAW_ORDERS and query the raw table directly.
Best answer: A
Explanation: A dynamic table is the best balance here because the requirement is near-real-time freshness, SQL-only transformation, and low operational overhead. Setting a 5-minute TARGET_LAG lets Snowflake manage incremental refresh while a dedicated warehouse isolates refresh compute from analyst workloads.
Dynamic tables are designed for declarative incremental pipelines when you want transformed table outputs without hand-building change-processing orchestration. In this scenario, the team needs a persisted curated table, about 5-minute freshness, and isolated compute. A dynamic table fits because you define the SQL once, assign a warehouse, and use TARGET_LAG to express acceptable staleness. Snowflake then manages refresh planning and incremental maintenance for the dependency graph.
A stream-and-task alternative would require hand-built MERGE behavior. The key distinction is declarative freshness versus hand-managed orchestration.
The closest distractor, a stream with a 1-minute task graph, adds MERGE orchestration that the requirements do not need.
Topic: Compute, Governance, Observability, and Performance
An engineering team wants to monitor sales.orders with one built-in completeness metric and one company-specific metric. Review the exhibit and choose the best interpretation.
ALTER TABLE sales.orders
ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT
ON (customer_id);
CREATE OR REPLACE DATA METRIC FUNCTION governance.high_value_order_pct(
t TABLE(amount NUMBER)
)
RETURNS NUMBER
AS 'SELECT 100 * COUNT_IF(amount > 1000) / COUNT(*) FROM t';
Options:
A. Both metrics are custom because each must be declared for monitoring.
B. SNOWFLAKE.CORE.NULL_COUNT is system-provided; high_value_order_pct is custom business logic.
C. high_value_order_pct is system-provided; NULL_COUNT is a custom table check.
D. Both metrics are system-provided because Snowflake evaluates both.
Best answer: B
Explanation: System-provided data metric functions are supplied by Snowflake, typically from SNOWFLAKE.CORE. A metric created with CREATE DATA METRIC FUNCTION is custom, and the amount > 1000 rule shows business-specific logic rather than a built-in platform metric.
Snowflake distinguishes system-provided metrics from custom measurements by where the metric logic originates. SNOWFLAKE.CORE.NULL_COUNT is a built-in data metric function that Snowflake provides for standard data-quality checks such as completeness. governance.high_value_order_pct is different: the team defines it with CREATE DATA METRIC FUNCTION, and its logic depends on a business rule, amount > 1000, so it is a custom measurement.
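As a sketch, the custom metric can be attached alongside the built-in one with an evaluation schedule (the schedule value is an assumption):
ALTER TABLE sales.orders SET DATA_METRIC_SCHEDULE = '15 MINUTE';
ALTER TABLE sales.orders
  ADD DATA METRIC FUNCTION governance.high_value_order_pct ON (amount);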
Attaching a metric to a table does not make it custom, and executing a custom metric inside Snowflake does not make it system-provided.
A SNOWFLAKE.CORE metric is built in; CREATE DATA METRIC FUNCTION defines account-specific logic, while NULL_COUNT already exists as a Snowflake-supplied metric.
Topic: Sharing, Replication, and Cross-Platform Delivery
An e-commerce platform emits order events continuously from application services, not as files. Data engineers need curated order tables refreshed within 5 minutes, and they want to avoid managing change-processing logic or complex orchestration where possible. A finance team in a separate Snowflake account in the same region must query a masked subset using its own warehouse. Which architecture is the best fit?
Options:
A. Use external tables on landed files, scheduled transformations, and daily file unloads.
B. Use Snowpipe Streaming, streams with task graphs, secure views with masking, and Secure Data Sharing.
C. Use Snowpipe auto-ingest from staged files, materialized views, and database replication.
D. Use Snowpipe Streaming, dynamic tables, secure views with masking, and Secure Data Sharing.
Best answer: D
Explanation: Snowpipe Streaming is the right ingestion fit for continuous application events without file staging. Dynamic tables meet the 5-minute freshness goal with less orchestration than streams and task graphs, and Secure Data Sharing lets the finance account query governed data using its own warehouse.
The core trade-off here is choosing the simplest Snowflake-native architecture that still satisfies low-latency ingestion, managed transformation freshness, and governed delivery. Snowpipe Streaming is built for continuous row-based ingestion from applications, so it fits better than file-based loading. Dynamic tables are the best match for keeping curated tables current within a target lag while reducing custom orchestration compared with building and operating streams plus task graphs. For delivery, Secure Data Sharing is the right mechanism because the finance team already has its own Snowflake account in the same region and needs separate compute. Exposing a secure view with masking preserves governance while sharing live data.
A streams-and-tasks design could also work technically, but it adds operational complexity the scenario explicitly wants to avoid.
Topic: Compute, Governance, Observability, and Performance
A team uses one LARGE warehouse for a CDC pipeline and BI dashboards. The pipeline reads from a stream and runs MERGE tasks every minute. Query History shows dashboard queries spend 20-30 seconds in queued_overload_time during task runs, but only 4-6 seconds executing. Requirements: keep dashboards under 10 seconds, keep the pipeline schedule unchanged, and track credits separately for BI and ingestion. What is the best next action?
Options:
A. Create separate warehouses for CDC tasks and BI dashboards
B. Increase the warehouse size to XLARGE
C. Convert the current warehouse to multi-cluster
D. Enable Query Acceleration Service for dashboard queries
Best answer: A
Explanation: The dashboards are not slow because of execution time; they are waiting for compute. Since the delay happens when CDC tasks overlap and the team also needs separate credit tracking, the main concern is workload isolation, so separate warehouses are the best fit.
queued_overload_time indicates queries are waiting for warehouse resources, not spending that time executing. In this scenario, dashboard queries finish in 4-6 seconds once they start, so the real issue is contention from mixing minute-by-minute CDC MERGE tasks with BI traffic on the same warehouse. Because the pipeline schedule must stay unchanged and finance wants BI and ingestion credits tracked separately, the strongest Snowflake design is to isolate the workloads on different warehouses.
A multi-cluster warehouse can help with concurrency on a shared warehouse, but it still keeps both workloads on the same compute boundary. When predictable dashboard latency and per-workload cost visibility are explicit requirements, separate warehouses are the better design.
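A minimal sketch of the split (warehouse names and sizes assumed):
create warehouse cdc_wh warehouse_size = 'MEDIUM' auto_suspend = 60 auto_resume = true;
create warehouse bi_wh  warehouse_size = 'MEDIUM' auto_suspend = 60 auto_resume = true;
-- point the CDC tasks at cdc_wh (alter task ... set warehouse = cdc_wh)
-- and route dashboard connections to bi_wh; credits then separate
-- per warehouse in ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY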
Topic: Data Sourcing, Storage, and Ingestion
A task that runs every 5 minutes began failing after a partner changed the JSON payload shape. The current load extracts nested paths and casts them into a typed table:
COPY INTO curated.events(order_id, customer_id, amount, device_os)
FROM (
SELECT
$1:order.id::NUMBER,
$1:customer.id::NUMBER,
$1:amount::NUMBER(10,2),
$1:device.os::STRING
FROM @partner_stage
)
FILE_FORMAT = (TYPE = JSON);
New attributes can appear at any nested level, and the team must keep every record plus the full original payload for later reprocessing. What is the best next step?
Options:
A. Set ON_ERROR = CONTINUE on the COPY INTO
B. Load raw JSON into VARIANT, then transform downstream
C. Switch the pipeline to Snowpipe Streaming
D. Enable schema evolution with MATCH_BY_COLUMN_NAME
Best answer: B
Explanation: Because the source is semi-structured JSON with changing nested shape, the fragile part is the path-based typed COPY, not the file transport. Landing each record in VARIANT preserves all attributes and lets downstream transformations handle evolving schema safely.
When nested JSON structure changes, a COPY INTO that extracts specific paths and casts them into relational columns becomes brittle. Snowflake can ingest the full document into a VARIANT column without forcing those nested attributes to match a fixed schema at load time. That is the safest ingestion pattern here because the requirement is to keep every record and preserve the original payload for replay or backfill.
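As a sketch, the load can keep each full document in one VARIANT column and move typed extraction downstream (table and column names assumed):
COPY INTO raw.events_json(payload)
FROM (SELECT $1 FROM @partner_stage)
FILE_FORMAT = (TYPE = JSON);

-- downstream, extract paths tolerantly; missing paths yield NULL instead of load failures
SELECT payload:order.id::NUMBER     AS order_id,
       payload:amount::NUMBER(10,2) AS amount
FROM raw.events_json;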
Schema evolution and name-based matching help only in narrower direct-to-column mapping scenarios; they do not make path-based transformed JSON loads resilient to arbitrary nested drift.
ON_ERROR = CONTINUE would skip problematic rows, which breaks the requirement to keep every record and full payload.
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
An events pipeline uses Snowpipe Streaming to ingest into raw.events. Two independent tasks run every minute and both read the same stream, raw.events_strm: one writes error events to ops.alerts, and the other writes minute aggregates to mart.usage_minute. Ingestion latency is under 30 seconds, but the alerts task often sees no rows while the aggregate task succeeds. Both consumers must receive all changes within 2 minutes with the smallest Snowflake-native change. What should the team do?
Options:
A. Create a second stream on raw.events so each task consumes its own stream.
B. Recreate the stream on a view used by both tasks.
C. Replace Snowpipe Streaming with Snowpipe and keep one shared stream.
D. Increase task frequency and warehouse size for both tasks.
Best answer: A
Explanation: This is a downstream consumption problem, not a freshness problem. The data is arriving on time, but two independent tasks are competing to consume the same stream, so each consumer needs its own stream offset.
The issue is downstream consumption. The stem says ingestion is already meeting the latency target, so freshness is not the bottleneck. It also does not describe missing changes caused by tracking the wrong object; both tasks need the same changes from the same base table.
A Snowflake stream is not a fan-out queue for multiple independent consumers. When a task consumes a stream in a committed DML transaction, that stream advances its offset. If two tasks share one stream, one task can consume the available changes before the other task processes them.
For this pattern, create separate streams on the same source table so each downstream task has its own independent offset. Creating a view-based stream would only help if the problem were object scope, such as needing to track changes in a different derived dataset.
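A minimal sketch of the fix (stream names assumed):
create or replace stream raw.events_alerts_strm on table raw.events;
create or replace stream raw.events_usage_strm  on table raw.events;
-- each task reads its own stream, so each offset advances independently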
Topic: Sharing, Replication, and Cross-Platform Delivery
A provider account currently runs hourly COPY INTO unloads to cloud storage so a partner can ingest a curated sales dataset into its own Snowflake account. The partner reports stale data between unloads, and the provider’s security team notes that revoking access does not remove files the partner already copied. Both Snowflake accounts are in the same region. What is the best next step?
Options:
A. Create a secure share of the curated objects for the partner account
B. Increase the COPY INTO unload frequency and tighten storage access
C. Have the partner create an external table on the exported files
D. Replicate the database to the partner account each hour
Best answer: A
Explanation: This issue comes from using copied files as a sharing mechanism. Secure data sharing is designed for controlled consumer access: the provider keeps the source of truth, the consumer sees current data, and access can be revoked without managing exported file copies.
Use secure data sharing when the goal is controlled, read-only access to Snowflake data rather than distributing copies. In this scenario, hourly unloads create two built-in problems: the partner only sees data as fresh as the last export, and any files already copied remain outside the provider’s control after access is revoked. A secure share exposes selected Snowflake objects directly from the provider account to the consumer account without copying the underlying data. The consumer queries the live shared data with its own compute, while the provider retains governance over access. Database replication is for maintaining another copy, often for availability or distribution needs, and more frequent unloads or exported-file external tables still rely on copies.
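A minimal sketch of the share (object and account identifiers assumed; only secure views and tables can be exposed this way):
create share sales_share;
grant usage on database curated to share sales_share;
grant usage on schema curated.publish to share sales_share;
grant select on view curated.publish.v_sales to share sales_share;  -- a secure view
alter share sales_share add accounts = partner_org.partner_account;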
Topic: Transformations, Programmability, and Developer Workflows
Order files land every minute and load into RAW_ORDERS on time. Query history shows no warehouse queuing for the downstream task, but the curated table is still wrong because the next step must apply several business rules in one run: keep the latest event per order_id, MERGE valid rows into ORDERS_CURATED, write invalid rows to ORDERS_REJECTED, and roll back if any step fails. What is the best next step?
Options:
A. Upsize the task warehouse and enable more clusters.
B. Call a Snowflake Scripting stored procedure from the task.
C. Replace the file load with Snowpipe Streaming.
D. Use a dynamic table for all downstream processing.
Best answer: B
Explanation: This is primarily a transformation-design problem, not a compute or ingestion problem. The files already load on time and the warehouse is not queued; the missing capability is multi-step business logic with conditional DML, reject handling, and transactional rollback, which fits a Snowflake Scripting stored procedure invoked by a task.
The key signal is that ingestion and warehouse capacity are healthy, yet the pipeline still cannot enforce the required business rules. When a downstream step must execute ordered logic such as deduplication, MERGE, reject-table writes, and rollback on failure, the right design is to encapsulate that logic in a Snowflake Scripting stored procedure and invoke it from the task. This gives the pipeline one governed unit that can use variables, multiple SQL statements, control flow, exception handling, and transaction control.
Inside the procedure, a single transactional unit can deduplicate, MERGE valid rows, and insert rejected rows. More warehouse capacity might reduce runtime, but it does not add the missing business-logic structure.
Changing ingestion does not help either, because RAW_ORDERS is already current.
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
Exhibit:
create or replace dynamic table order_enriched_dt
target_lag = '5 minutes'
warehouse = etl_wh
as
select order_id, region, amount
from raw_orders;
create or replace dynamic table sales_kpi_dt
target_lag = '15 minutes'
warehouse = etl_wh
as
select region, sum(amount) as total_amount
from order_enriched_dt
group by region;
A dashboard reads only sales_kpi_dt, and up to 15 minutes of staleness is acceptable. order_enriched_dt is not queried directly. Which change best aligns incremental maintenance with the downstream freshness expectation?
Options:
A. Keep the lags and use a larger warehouse.
B. Set order_enriched_dt to TARGET_LAG = DOWNSTREAM.
C. Lower both dynamic tables to TARGET_LAG = '1 minute'.
D. Set sales_kpi_dt to TARGET_LAG = DOWNSTREAM.
Best answer: B
Explanation: The freshness SLA exists at sales_kpi_dt, not at the intermediate table. Using TARGET_LAG = DOWNSTREAM on order_enriched_dt lets Snowflake refresh that upstream table only when needed for the 15-minute downstream requirement, reducing unnecessary refresh work.
In a dynamic table pipeline, the explicit TARGET_LAG should usually sit where the business freshness expectation actually exists. Here, the dashboard consumes only sales_kpi_dt and allows data to be up to 15 minutes old, so that table should remain the freshness anchor. Since order_enriched_dt is only an intermediate step, TARGET_LAG = DOWNSTREAM is the better pattern: Snowflake refreshes it as needed so dependent dynamic tables can meet their own lag.
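A minimal sketch of the change:
alter dynamic table order_enriched_dt set target_lag = DOWNSTREAM;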
Reserve DOWNSTREAM for upstream-only dynamic tables in the chain. Making the final table DOWNSTREAM removes the clear SLA anchor, while lowering lags or adding more compute increases work without improving the required business outcome. Setting DOWNSTREAM on the final table is flawed because the consumer-facing object should carry the explicit freshness target.
Topic: Sharing, Replication, and Cross-Platform Delivery
Which statement correctly differentiates Snowflake replication from secure data sharing when the requirement is resilience, data locality, or promoted failover behavior?
Options:
A. Secure data sharing creates a failover-ready secondary copy; replication is mainly for consumer read access.
B. Replication creates a secondary copy that can be promoted for failover; secure data sharing exposes provider-managed data for read access.
C. Replication is for live no-copy access; secure data sharing is for cross-region resilience and locality.
D. Both features support promoted failover because both create the same secondary-database behavior.
Best answer: B
Explanation: For resilience, locality, and explicit failover, Snowflake replication is the relevant mechanism because it maintains a secondary copy that can be promoted. Secure data sharing is designed for governed read access to provider-managed data, not for creating a failover target.
The core distinction is copied secondary versus shared access. Replication creates and synchronizes a secondary database or account in another region or cloud, which supports disaster recovery, locality, and promoted failover behavior when configured with Snowflake replication and failover features. Secure data sharing lets another account query shared objects without taking ownership of a failover-ready secondary copy. That makes sharing excellent for collaboration and data distribution, but it does not by itself satisfy requirements such as regional resilience or promoting a secondary after an outage. When the requirement names recovery, locality, or failover promotion, think replication; when it names governed consumer access to the same data, think secure data sharing.
Topic: Sharing, Replication, and Cross-Platform Delivery
A provider account ingests partner order files as JSON into a raw table with Snowpipe every 5 minutes. Loads finish within 3 minutes, so the 15-minute SLA is already met. Two consumer accounts need the data for BI with consistent customer, product, and calendar definitions, and the shared dataset cannot expose raw email addresses. Which design is the BEST next step?
Options:
A. Add search optimization to the raw table and share it directly.
B. Replace Snowpipe with Snowpipe Streaming for lower-latency ingestion.
C. Replicate the raw database so each consumer can model independently.
D. Build a curated dimensional model and share secure views with masked emails.
Best answer: D
Explanation: This is primarily a modeling and delivery problem, not an ingestion problem. The current Snowpipe pipeline already meets the latency target, so the best next step is to create a curated analytics-ready model and share governed secure views from that layer.
When ingestion already satisfies the SLA, the remaining gap is how data is organized for downstream use. Snowpipe gets files into Snowflake, but it does not create conformed customer, product, and calendar definitions for BI consumers. The better design is to keep the raw landing layer, transform it into curated fact and dimension tables, and publish secure views that expose only the governed, analytics-ready shape.
Changing ingestion speed, tuning raw-table access, or pushing modeling to each consumer does not solve the consistency and governance requirements in the stem.
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
A table stream on RAW_ORDERS feeds a task that runs every minute and MERGEs changes into CURATED_ORDERS. The task succeeds, and the new rows are visible in CURATED_ORDERS. Immediately after the task run, an engineer queries the stream and gets no rows.
Which explanation best fits this behavior?
Options:
A. Freshness lag is delaying new changes from appearing in the stream
B. Downstream consumption advanced the stream offset
C. Object scope shifted from the source table to the target table
D. The stream requires a manual refresh before it can be queried again
Best answer: B
Explanation: This is a downstream consumption scenario, not a freshness or scope problem. Because the task successfully used the stream in a MERGE and the target table already has the new rows, the stream offset was advanced by that consuming DML operation.
Snowflake streams track a position, or offset, over changes on their source object. When a downstream DML operation such as MERGE, INSERT, UPDATE, or DELETE reads from the stream in a transaction, those change records are considered consumed and the stream advances to the new offset.
In this scenario, the decisive fact is that the task succeeded and CURATED_ORDERS already contains the new rows. That means the stream changes were available and were used downstream. This rules out a freshness delay. It also rules out an object-scope issue, because the stream does not switch to tracking the target table after being used in a task.
The key takeaway is that an empty stream after a successful downstream load often indicates normal consumption behavior, not a problem.
Topic: Sharing, Replication, and Cross-Platform Delivery
A data engineering team lands source data as Parquet in cloud object storage. Snowflake must transform the data into curated datasets, and Spark and Trino users must query those same curated tables without a separate export pipeline. Which Snowflake-oriented architecture best fits?
Options:
A. Use native tables with dynamic tables
B. Use Iceberg tables for the curated layer
C. Use external tables on the landing files
D. Use native tables with secure data sharing
Best answer: B
Explanation: Iceberg tables are the best fit when curated data must be usable both in Snowflake and in other query engines. They balance storage, transformation, and delivery by keeping the curated layer in an open table format instead of creating a separate export path.
The key concept is choosing an architecture that supports cross-platform delivery without duplicating the curated dataset. Iceberg tables are designed for this pattern: data stays in object storage in an open table format, Snowflake can transform and manage the curated layer, and engines such as Spark or Trino can read the same tables.
In this scenario, the deciding factor is not just transformation automation but the delivery model. External tables expose files in place, but they are not a Snowflake-managed curated table architecture. Native tables work well for Snowflake-only consumption, and dynamic tables help maintain derived data, but both keep the curated layer inside Snowflake. The best trade-off here is an open table architecture that serves both Snowflake and non-Snowflake consumers.
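A hedged sketch of a Snowflake-managed Iceberg curated table (the external volume name and base location are assumptions):
create or replace iceberg table curated.orders (
  order_id number,
  order_ts timestamp_ntz,
  amount   number(10,2)
)
catalog = 'SNOWFLAKE'
external_volume = 'lake_vol'
base_location = 'curated/orders';
-- Spark or Trino can then read the same Iceberg data and metadata from object storage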
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
A team continuously loads RAW_ORDERS into Snowflake. Downstream transformations are implemented as a chain of streams and tasks, but when one task fails or is suspended, curated tables can stay stale until engineers manually resume and replay the chain. The business only needs data within 10 minutes, transformations are SQL-based, and the team wants less orchestration code rather than exact cron control. What is the best Snowflake-native fix?
Options:
A. Use chained dynamic tables with an appropriate TARGET_LAG
B. Point BI queries at the stream instead of curated tables
C. Build a materialized view directly on RAW_ORDERS
D. Keep streams and tasks, but run them more frequently
Best answer: A
Explanation: Dynamic tables fit SQL-based incremental pipelines that need a freshness SLA more than exact scheduling control. In this scenario, setting TARGET_LAG addresses stale downstream data while removing much of the fragile stream-and-task orchestration.
Dynamic tables are the Snowflake-native choice when a pipeline is primarily SQL, needs incremental results, and can be governed by a freshness target instead of explicit task schedules. Here, the core problem is orchestration fragility: a failed task leaves downstream tables stale until someone intervenes. By defining each transformation as a dynamic table and setting TARGET_LAG to the 10-minute requirement, Snowflake manages dependency-aware refreshes and incremental maintenance automatically. That gives a simpler operating model than streams plus tasks while still keeping near-real-time data current. The closest alternative is keeping streams and tasks, but that preserves the monitoring, scheduling, and replay burden the team wants to reduce.
Topic: Sharing, Replication, and Cross-Platform Delivery
A data provider shares sales data with dozens of external consumers. Replication to the required regions is already in place. Most consumer queries are dashboards filtered by month, region, and category, and the provider wants a stable long-term contract.
Exhibit: current shared object
create or replace secure view publish.v_customer_orders as
select
o.order_ts,
c.region,
p.category,
oi.quantity,
oi.unit_price
from oltp.orders o
join oltp.order_items oi on o.order_id = oi.order_id
join oltp.customers c on o.customer_id = c.customer_id
join oltp.products p on oi.product_id = p.product_id;
What is the best interpretation?
Options:
A. Replace sharing with a failover group
B. Add search optimization to the OLTP tables
C. Expose a curated analytics model instead of OLTP joins
D. Increase the provider warehouse for shared queries
Best answer: C
Explanation: The main issue is architectural fit, not a missing tuning switch. The shared object exposes normalized operational joins to many analytic consumers, so a curated reporting model is the better pattern for stable contracts and predictable consumption.
This scenario points to a data-modeling decision. The secure view is built directly on OLTP-style tables (orders, order_items, customers, products), while the consumers are running repeated analytical queries by business dimensions such as month, region, and category. For shared analytics, the stronger pattern is to publish a consumer-facing model at the intended reporting grain, such as fact/dimension tables or another curated reporting layer, and share that layer.
Feature-level tuning does not solve the core mismatch:
- Search optimization can speed selective lookups, but it does not fix a normalized, join-heavy shape served to analytic dashboards.
- A larger provider warehouse spends more compute on the same repeated joins.
- A failover group addresses resilience, and replication is already in place.
The key takeaway is to choose the right published data product shape first, then tune it if needed.
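A sketch of a consumer-facing shape built from the same sources (the grain and object names are assumptions):
create or replace table publish.fct_sales_monthly as
select date_trunc('month', o.order_ts) as order_month,
       c.region,
       p.category,
       sum(oi.quantity * oi.unit_price) as revenue
from oltp.orders o
join oltp.order_items oi on o.order_id = oi.order_id
join oltp.customers c on o.customer_id = c.customer_id
join oltp.products p on oi.product_id = p.product_id
group by 1, 2, 3;

create or replace secure view publish.v_sales_monthly as
select order_month, region, category, revenue
from publish.fct_sales_monthly;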
Topic: Compute, Governance, Observability, and Performance
A BI team runs dashboard queries in large bursts for about 5 minutes at the top of each hour. They want low latency during those bursts without keeping oversized compute running the rest of the hour.
Exhibit:
Warehouse: BI_WH
Size: MEDIUM
Min clusters: 1
Max clusters: 1
Auto-suspend: 60
09:00-09:05 queries: 1,180
Avg execution time: 2.1 sec
Avg queued overload time: 28.4 sec
Avg bytes scanned/query: 35 MB
Based on the exhibit, which warehouse design best matches this workload?
Options:
A. Resize BI_WH to a single X-Large warehouse.
B. Run the dashboard workload with serverless tasks.
C. Use a multi-cluster warehouse with auto-scale from 1 to 4 clusters.
D. Enable query acceleration on BI_WH.
Best answer: C
Explanation: The warehouse is spending far more time queued than executing, and each query scans relatively little data. That points to concurrency pressure from bursty dashboard traffic, so a multi-cluster warehouse is the best fit.
The key design choice is to match compute to the bottleneck. In the exhibit, average execution time is only 2.1 seconds, but average queued overload time is 28.4 seconds during a short burst window. That means the main problem is many concurrent queries competing for slots, not slow individual query processing.
For dashboard spikes that last only a few minutes, auto-scaling clusters up and back down is the most appropriate compute design.
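A minimal sketch of the change (the scaling policy is shown for completeness):
alter warehouse bi_wh set
  min_cluster_count = 1
  max_cluster_count = 4
  scaling_policy = 'STANDARD';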
Topic: Data Sourcing, Storage, and Ingestion
A data lake team stores event data as Apache Iceberg tables in cloud object storage. Spark and Snowflake must query the same tables, the lake platform must continue to own Iceberg metadata maintenance, and the team does not want to copy data into Snowflake-managed storage. What is the best Snowflake design choice?
Options:
A. Load the files with Snowpipe into a native table
B. Create a Snowflake-managed Iceberg table and migrate the data
C. Create an externally managed Iceberg table in Snowflake
D. Create an external table over the data files
Best answer: C
Explanation: This is a storage-pattern decision, not an ingestion decision. An externally managed Iceberg table lets Snowflake query the same Iceberg data in object storage while another platform continues managing the table metadata and maintenance.
The deciding requirement is metadata ownership. When data already exists as Iceberg tables in object storage and another platform must keep managing snapshots, manifests, and table maintenance, Snowflake should attach to that table as an externally managed Iceberg table. That preserves a shared table definition across engines and avoids copying data into Snowflake-managed storage.
An external table can read files in place, but it does not use the existing Iceberg table metadata as the authoritative shared table definition. Snowflake-managed Iceberg tables keep the Iceberg table under Snowflake’s management, which conflicts with the stated ownership requirement. Snowpipe is an ingestion mechanism for loading data into Snowflake tables, so it solves the wrong problem here.
The key takeaway is to choose the storage pattern first when the real issue is where data lives and who maintains metadata.
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
A team loads staged files into RAW.ORDERS and then merges into CURATED.ORDERS with these tasks:
CREATE TASK load_raw
WAREHOUSE = etl_wh
SCHEDULE = '1 MINUTE'
AS COPY INTO RAW.ORDERS FROM @landing/orders;
CREATE TASK merge_curated
WAREHOUSE = etl_wh
SCHEDULE = '1 MINUTE'
AS MERGE INTO CURATED.ORDERS ...
load_raw finishes in 20 to 30 seconds, merge_curated in about 10 seconds, and query history shows no warehouse queuing. CURATED.ORDERS must refresh within 1 minute of a successful load, and the merge must not run if the load fails. What is the best next action?
Options:
A. Make load_raw the scheduled root and run merge_curated AFTER load_raw.
B. Keep both schedules and offset merge_curated by 45 seconds.
C. Increase etl_wh size for both tasks.
D. Run merge_curated on a separate warehouse.
Best answer: A
Explanation: This is a dependency-management problem, not a warehouse-capacity problem. A task graph with load_raw as the scheduled root and merge_curated as a child using AFTER ensures the merge starts only after a successful load and still fits the latency target.
The key issue is that the two tasks are independently scheduled, so Snowflake is free to start them separately even when one step logically depends on the other. Because runtimes are already short and query history shows no warehouse queuing, warehouse selection is not the bottleneck.
In this scenario, the right design is to use a task graph:
- load_raw as the scheduled root task.
- merge_curated as a child task with AFTER load_raw.
That directly satisfies both requirements: near-real-time sequencing and failure-aware dependency control, as sketched below. A larger or separate warehouse may help with queuing or long runtimes, but it does not enforce execution order or success-based orchestration.
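A hedged sketch of the conversion, assuming the schedule is removed before the dependency is added (tasks must be suspended while the graph is modified):
ALTER TASK merge_curated SUSPEND;
ALTER TASK merge_curated UNSET SCHEDULE;
ALTER TASK merge_curated ADD AFTER load_raw;
ALTER TASK merge_curated RESUME;
ALTER TASK load_raw RESUME;  -- resume the root last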
Topic: Data Sourcing, Storage, and Ingestion
A retail team currently ingests order files with the following Snowflake object. They are redesigning the source so the application emits one order event at a time, and the business wants the data queryable within seconds without first landing files in cloud storage.
Exhibit:
create or replace pipe orders_pipe
auto_ingest = true
as
copy into raw.orders
from @landing/orders/
file_format = (type = json);
Which is the best interpretation of the new requirement?
Options:
A. Keep the pipe because AUTO_INGEST already streams events
B. Run COPY INTO every minute with a task
C. Use an external table with auto-refresh
D. Switch to Snowpipe Streaming for direct event ingestion
Best answer: D
Explanation: AUTO_INGEST on a pipe means Snowpipe continuously loads new files that arrive in a stage. The new requirement removes staged files and asks for per-event data available within seconds, which points to Snowpipe Streaming.
The key distinction is the ingestion unit and the latency target. A PIPE with AUTO_INGEST and COPY INTO represents continuous ingestion of staged files: Snowflake reacts when new files land and loads them into a table. That is different from a streaming design where an application sends individual records directly and expects very low latency without writing files first.
For that new pattern, Snowpipe Streaming is the native Snowflake fit. A task that runs COPY INTO every minute is still batch ingestion on a schedule, and it still depends on files being available to load. An external table can expose staged files for querying, but it does not replace direct row or event ingestion into a Snowflake table. The closest distractor is keeping the existing pipe, but AUTO_INGEST still depends on files arriving in the stage.
AUTO_INGEST watches for new files in a stage; it does not ingest application events directly. COPY INTO is still batch-oriented and still assumes files exist to be copied.
Topic: Data Sourcing, Storage, and Ingestion
A vendor delivers small CSV files to an external stage at irregular times. Snowflake currently loads them with a scheduled task:
CREATE TASK load_vendor_files
SCHEDULE = '15 MINUTE'
AS
COPY INTO raw.vendor_events
FROM @vendor_stage;
Dashboards are often 10-15 minutes behind. The SLA is data available in under 1 minute after each file arrives, and the team wants the most Snowflake-native, low-operations approach while keeping file-based delivery. What is the best next step?
Options:
A. Configure Snowpipe auto-ingest on the stage for event-driven file loading.
B. Create an external table with AUTO_REFRESH and query the staged files.
C. Increase task frequency and warehouse size for scheduled COPY INTO loads.
D. Rebuild the feed to use Snowpipe Streaming for direct row ingestion.
Best answer: A
Explanation: Snowpipe is designed for continuous, file-based ingestion when files land in a stage and low operational overhead is required. It fits the under-1-minute target better than a polling task because it loads on file arrival instead of waiting for the next schedule.
The deciding concept is matching the ingestion mechanism to the delivery model and latency target. The source is still delivering files, so the best Snowflake-native fix is to move from scheduled COPY INTO polling to Snowpipe auto-ingest, which uses event-driven loading when new files arrive in the stage.
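A minimal sketch of the event-driven replacement (the cloud notification integration setup is environment-specific and omitted):
create or replace pipe raw.vendor_pipe
  auto_ingest = true
as
copy into raw.vendor_events
from @vendor_stage;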
This fits the stated requirements:
- file-based delivery from the vendor is preserved,
- loading starts on file arrival, which supports the under-1-minute SLA, and
- operations stay low because Snowflake manages the continuous load service.
Snowpipe Streaming is for direct row ingestion from applications or streaming clients, not for a vendor that is already sending files. The key takeaway is that continuous file arrival plus a near-real-time SLA usually points to Snowpipe, not scheduled tasks. Running the task more often would still poll on a schedule, paying for repeated COPY INTO execution instead of event-driven file ingestion.
Topic: Data Sourcing, Storage, and Ingestion
A team uses Snowpipe to auto-ingest partner CSV files into a raw table. After a vendor release, ingestion lag grows even though file sizes and arrival rate are unchanged. COPY_HISTORY shows repeated parse errors:
unexpected end of record
found '\n' while expecting record delimiter
The partner confirms a comments field now contains embedded line breaks inside double quotes. What is the best next step?
Options:
A. Switch to Snowpipe Streaming for lower-latency ingestion.
B. Update the CSV file format for quoted multi-line fields, then reload failed files.
C. Enable schema evolution on the target table for new columns.
D. Increase the warehouse size used by downstream transformations.
Best answer: B
Explanation: This is a parsing problem that looks like a latency problem. The files are arriving, but Snowflake cannot correctly identify CSV record boundaries after the source started sending quoted fields with embedded line breaks.
The key clue is the parse error text, not the lag symptom. Snowpipe is receiving files, but the load is failing because the CSV shape changed: a field now contains embedded newlines inside quotes. That means the file format must correctly handle quoted multi-line fields before ingestion can resume normally.
In this situation, the right action is to update the CSV file format settings to match the source layout, then replay the failed files. Schema evolution is for column changes, not broken row parsing. More compute also does not help, because the problem occurs while Snowflake is interpreting the file contents. The main takeaway is to check source-data shape and file-format assumptions before treating backlog as a throughput issue.
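A hedged sketch of the repair (the format and pipe names are assumptions; the enclosure option is the key change):
create or replace file format raw.vendor_csv
  type = CSV
  skip_header = 1
  field_optionally_enclosed_by = '"';
-- the pipe's COPY statement is assumed to reference this named format;
-- after the fix, replay the failed files (for example, ALTER PIPE ... REFRESH
-- within its limited lookback window, or a targeted COPY of the failed files)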
Topic: Transformations, Programmability, and Developer Workflows
An analytics team ingests JSON order events into raw Snowflake tables with Snowpipe every minute. The next step deduplicates records, flattens VARIANT, joins to dimension tables already in Snowflake, and upserts a curated fact table by using window functions and MERGE. The curated table must refresh within 5 minutes, keep existing masking policies and access history, and use isolated compute so BI queries are not affected. Which design is the BEST fit?
Options:
A. Replace the pipeline with external tables over files and expose secure views to consumers.
B. Unload data each minute to cloud storage, transform it in Spark, then reload the curated table.
C. Send rows to a remote service through external functions for cleansing before loading the curated table.
D. Use streams and a task graph to run SQL/Snowflake Scripting MERGE logic in Snowflake on a dedicated warehouse.
Best answer: D
Explanation: Because the source data, reference tables, and transformation logic are already well served by Snowflake SQL, the pipeline should stay in Snowflake. Streams and tasks with Snowflake Scripting meet the 5-minute SLA, isolate compute with a dedicated warehouse, and avoid needless unload/reload steps.
This scenario fits Snowflake-native transformation processing: data already lands in Snowflake, the required work is relational and semi-structured SQL, and governance must stay close to the curated table. Exporting data to Spark or a remote service would add data movement, orchestration, failure points, and reload delay without addressing a capability gap.
Streams capture the changed rows, a task graph on a dedicated warehouse orders the steps, and SQL or Snowflake Scripting performs the dedupe, flatten, join, and MERGE into the curated table. That design keeps latency low while preserving Snowflake-native governance and observability. External engines are better only when there is a proven requirement Snowflake SQL or Snowflake Scripting cannot reasonably meet.
Masking policies and access history also remain intact because the data never leaves Snowflake during MERGE processing.
Topic: Transformations, Programmability, and Developer Workflows
A data engineering team receives newline-delimited JSON order events and related warranty PDFs in cloud object storage. Requirements: JSON attributes must be queryable within 5 minutes, PDF files must remain governed and accessible from Snowflake, and the team wants to avoid a separate Spark pipeline. Which design is the BEST way to preserve Snowflake-native processing for both content types?
Options:
A. Flatten JSON into relational columns at ingest and store PDF binary content in tables.
B. Use object-store events to trigger Spark parsing, then load curated rows into Snowflake.
C. Create external tables over JSON and PDF files, then query both directly with SQL.
D. Snowpipe JSON into VARIANT; stage PDFs with a directory table; process with SQL/Snowpark.
Best answer: D
Explanation: The best fit is to use Snowpipe for continuous JSON ingestion into VARIANT and keep PDFs as unstructured files on a stage with a directory table. That preserves Snowflake-native processing, meets the low-latency query requirement, and avoids introducing an external compute platform.
For this scenario, Snowflake-native processing means using the platform’s built-in patterns for each data type instead of pushing parsing and transformation outside Snowflake. Semi-structured JSON is best ingested continuously with Snowpipe into a table that includes a VARIANT column, so attributes can be queried quickly and flexibly. Unstructured files such as PDFs should remain on a Snowflake stage, where a directory table can expose file metadata and Snowpark can be used when file-aware processing is needed.
This approach fits the stated requirements:
- JSON attributes become queryable within the 5-minute window,
- the PDFs stay governed and accessible from a Snowflake stage, and
- no separate Spark pipeline is introduced.
The closest distractor is the external-table approach, which is appropriate for structured or semi-structured lake data, but not for treating PDFs as directly queryable tabular content.
Topic: Data Sourcing, Storage, and Ingestion
A data engineer uses the following design for files in an external cloud location:
CREATE OR REPLACE STAGE doc_stage
URL='s3://acme-legal/contracts/'
DIRECTORY=(ENABLE=TRUE);
CREATE OR REPLACE EXTERNAL TABLE ext_contracts (...)
WITH LOCATION=@doc_stage
AUTO_REFRESH=TRUE
FILE_FORMAT=(TYPE=PARQUET);
The location will stop receiving Parquet files and will instead receive scanned PDF contracts. Analysts only need file inventory in Snowflake before downstream document processing. What is the best next design choice?
Options:
A. Load the PDFs with COPY INTO and column matching.
B. Switch the external table to TYPE=JSON for the PDFs.
C. Keep the external table and add search optimization.
D. Use the stage directory table for metadata; process PDFs separately.
Best answer: D
Explanation: The key change is that Parquet is a queryable file format, while scanned PDFs are unstructured files. For an inventory-only requirement, keep the files in a stage and use the directory table for metadata instead of an external table.
External tables are designed for externally stored structured or semi-structured files whose contents can be projected into columns. When the source changes from Parquet to scanned PDFs, the design choice changes because PDFs are unstructured data. Changing the file format or trying to load them with tabular load options does not make the document contents queryable as rows.
Because the requirement is only to inventory files inside Snowflake, the existing stage with DIRECTORY=(ENABLE=TRUE) is the correct anchor. Query the stage’s directory table for metadata such as file path, size, and last modified time, and use a separate unstructured-data processing step later if document content must be extracted.
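A minimal sketch of the inventory query (the listed columns are standard directory-table metadata):
SELECT relative_path, size, last_modified
FROM DIRECTORY(@doc_stage);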
The closest distractor is switching file format, but file formats do not convert binary documents into VARIANT records.
COPY INTO with column matching applies to tabular file loads, not binary document ingestion.
Topic: Data Sourcing, Storage, and Ingestion
An engineering team started with the following definition for lake data already used by Spark:
CREATE EXTERNAL TABLE lake_orders (
order_id NUMBER,
order_ts TIMESTAMP_NTZ
)
LOCATION=@orders_stage
FILE_FORMAT=(TYPE=PARQUET);
The same object storage path is already maintained as an Apache Iceberg table by another platform, and both Snowflake and that platform must read the same table metadata without copying the data. What is the best next step?
Options:
A. Create a dynamic table on top of the external table
B. Create an externally managed Iceberg table
C. Load the files into a Snowflake native table
D. Keep the external table and enable auto-refresh
Best answer: B
Explanation: This scenario is driven by open table format interoperability, not just reading Parquet files. Because another platform already maintains Apache Iceberg metadata and Snowflake must use that same metadata without copying data, an externally managed Iceberg table is the best fit.
Iceberg-related table patterns fit when the requirement is to preserve an open table format and share table semantics across engines. Here, the lake path is already managed as an Apache Iceberg table by another platform, and Snowflake must read that same table metadata in place. That points to an externally managed Iceberg table.
An external table can expose files in object storage, but it is still a file-oriented pattern and does not make Snowflake participate in the shared Iceberg metadata layer. Loading the data into a native table breaks the no-copy requirement, and a dynamic table solves transformation refresh, not cross-engine table interoperability.
When the deciding constraint is shared open-format metadata, choose an Iceberg table pattern instead of a basic external table.
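A hedged sketch of the replacement (the external volume and catalog integration names are assumptions; the integration points at the platform that owns the Iceberg metadata):
CREATE OR REPLACE ICEBERG TABLE lake_orders
  EXTERNAL_VOLUME = 'lake_vol'
  CATALOG = 'lake_catalog_int'
  CATALOG_TABLE_NAME = 'orders';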
Topic: Data Sourcing, Storage, and Ingestion
A retail team lands small CSV files in an external stage at unpredictable times throughout the day. Operators currently run this command whenever files arrive:
COPY INTO raw.orders
FROM @landing/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
They want each new file loaded automatically, without manual reruns or scheduled polling. What is the best next step?
Options:
A. Schedule a task to rerun COPY INTO every 5 minutes.
B. Create a Snowpipe with auto-ingest for the stage.
C. Use Snowpipe Streaming to load the staged CSV files.
D. Create a stream on raw.orders to detect new files.
Best answer: B
Explanation: Snowpipe is the Snowflake-native choice when files arrive incrementally in a stage and should load with minimal operational effort. The team already has a working COPY INTO pattern, so turning that pattern into a pipe is the best fit.
Snowpipe is Snowflake’s continuous file ingestion service for data that lands in an internal or external stage. It is the right choice when files arrive one by one or in small batches over time and the goal is to avoid manually rerunning COPY INTO.
In this scenario, the files are already staged, the load statement already exists, and the team explicitly wants automatic loading without a polling schedule. A Snowpipe can use the same loading logic and ingest only newly arrived files as they appear.
A scheduled task can automate repeated execution, but Snowpipe is the purpose-built feature for incremental staged-file ingestion.
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
raw.customer_landing already receives inserts, updates, and deletes from an upstream application every minute. A downstream curated.customer_current table must reflect those changes within 2 minutes, and the pipeline should process only rows changed since the last successful run. The team does not want to manage custom watermark logic or rescan the full landing table. Which design is the BEST fit?
Options:
A. Create a Snowpipe on the source stage and load the same files again into curated.customer_current.
B. Replace the target with a dynamic table and rely on refreshes as the CDC consumption mechanism.
C. Track the highest processed ingest_ts in a control table and run a scheduled MERGE from raw.customer_landing.
D. Create a standard stream on raw.customer_landing and a triggered task to MERGE its changes into curated.customer_current.
Best answer: D
Explanation: Because the data has already landed in Snowflake, this is a downstream CDC consumption problem, not a raw ingestion problem. A standard stream plus triggered task processes only new row changes, supports inserts/updates/deletes, and avoids custom watermark tracking or full rescans.
When source data is already in a Snowflake table, the best CDC pattern is a stream plus task rather than another ingestion mechanism. A standard stream on the landing table records row-level changes since the last consumption point, including the information needed to propagate updates and deletes. A triggered task can then MERGE those changes into the curated table quickly, which matches the low-latency requirement.
This design fits the stated needs:
- The stream exposes only rows changed since the last consumption point, including updates and deletes.
- A triggered task runs as soon as change data is available, which supports the 2-minute latency target.
- Snowflake manages the stream offset, so no custom watermark logic or full-table rescans are needed.
Re-ingesting files with Snowpipe solves file loading, not post-landing CDC, and timestamp-based scans push state management and edge cases back onto the engineering team.
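A sketch of the pattern with illustrative object and column names; the source filter keeps insert images and true deletes from the standard stream so each key arrives once per batch:
CREATE OR REPLACE STREAM raw.customer_landing_stm
  ON TABLE raw.customer_landing;

-- No SCHEDULE makes this a triggered task: it runs when the stream has data.
CREATE OR REPLACE TASK etl.apply_customer_changes
  WAREHOUSE = etl_wh
  WHEN SYSTEM$STREAM_HAS_DATA('RAW.CUSTOMER_LANDING_STM')
AS
MERGE INTO curated.customer_current t
USING (
    SELECT customer_id, email, METADATA$ACTION AS action
    FROM raw.customer_landing_stm
    WHERE METADATA$ACTION = 'INSERT'
       OR (METADATA$ACTION = 'DELETE' AND METADATA$ISUPDATE = FALSE)
) s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.action = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED AND s.action = 'INSERT' THEN
    INSERT (customer_id, email) VALUES (s.customer_id, s.email);

ALTER TASK etl.apply_customer_changes RESUME;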
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
A team needs new application events queryable in Snowflake within 2 minutes. Reliability is the higher priority: raw events must stay replayable if downstream transforms fail. Which architecture best fits?
Options:
A. Query the landing files through an external table only.
B. Stage micro-batch files, load with Snowpipe, then use streams and tasks.
C. Ingest rows with Snowpipe Streaming directly into target tables.
D. Build a dynamic table over an external table on the files.
Best answer: B
Explanation: Snowpipe is the best match when the latency target is a minute or two and reliability is the priority. Landing files in a stage keeps a durable replay point, and Snowflake can still load them continuously enough to meet the stated near real-time goal.
The core concept is choosing file-based continuous ingestion when you need both near real-time delivery and strong recoverability. Snowpipe loads new staged files automatically into a raw table, so the original files remain available for replay, audit, or backfill if a later transformation fails.
A strong pattern is:
- Stage micro-batch event files so there is a durable, replayable landing point.
- Let Snowpipe load a raw table continuously as files arrive.
- Use streams and tasks to transform the raw rows downstream.
Snowpipe Streaming is designed for lower-latency row ingestion, but the stem prioritizes preserving a durable raw replay point over minimizing latency to seconds. External tables and dynamic tables can be useful, but they do not replace this simple, reliable raw-ingestion pattern.
Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
A team needs to process only changed rows from a staging table, but its main requirement is to control exactly when each step runs and in what order validation, enrichment, and load logic execute. Which Snowflake approach best fits this need?
Options:
A. Materialized views
B. Snowpipe Streaming
C. Streams with tasks
D. Dynamic tables
Best answer: C
Explanation: This requirement is about explicit orchestration, not just incremental refresh. Streams capture change data, and tasks let engineers decide the schedule, dependencies, and execution order of downstream processing.
Use streams and tasks when the pipeline must react to changed rows and engineers need direct control over execution timing and sequencing. A stream records inserts, updates, and deletes since the last consumption point, which makes it suitable for CDC-style processing. Tasks then run SQL or procedure logic on a defined schedule or in a task graph, so validation, enrichment, and load steps can be orchestrated in an exact order.
Dynamic tables are better when the goal is declarative, automatically maintained transformations with Snowflake managing refresh behavior. That is different from a requirement to explicitly control when each step runs. The key distinction is automatic refresh versus engineer-controlled execution.
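A sketch of engineer-controlled ordering with a task graph; the stored procedures are hypothetical placeholders for the validation, enrichment, and load steps:
-- Root task controls when the pipeline starts.
CREATE OR REPLACE TASK t_validate
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
AS
  CALL validate_staging();

-- AFTER gives an exact execution order: validate -> enrich -> load.
CREATE OR REPLACE TASK t_enrich
  WAREHOUSE = etl_wh
  AFTER t_validate
AS
  CALL enrich_staging();

CREATE OR REPLACE TASK t_load
  WAREHOUSE = etl_wh
  AFTER t_enrich
AS
  CALL load_curated();

-- Resume children before the root so the whole graph is runnable.
ALTER TASK t_load RESUME;
ALTER TASK t_enrich RESUME;
ALTER TASK t_validate RESUME;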
Topic: Sharing, Replication, and Cross-Platform Delivery
A data engineering team must keep a production pipeline available if its primary Snowflake account becomes unavailable. The secondary account is in a different cloud region, and the requirement explicitly includes controlled failover and failback.
Exhibit:
ALTER DATABASE prod_etl
ENABLE REPLICATION TO ACCOUNTS orgname.dr_account;
Based on the exhibit, what is the best next step?
Options:
A. Continue database replication and promote the secondary database if needed
B. Use Secure Data Sharing from the primary account to the DR account
C. Create a failover group for prod_etl and supported account objects
D. Publish prod_etl through a listing with auto-fulfillment
Best answer: C
Explanation: The exhibit shows database replication for a single database. Because the requirement is business continuity across regions or clouds with controlled failover and failback, the better Snowflake mechanism is a failover group, which is built for that outcome.
ALTER DATABASE ... ENABLE REPLICATION sets up replication for that database, but the stem asks for a continuity solution, not just a replicated copy. In Snowflake, failover groups are the mechanism intended for cross-region or cross-cloud business continuity because they support controlled failover and failback and can include the replicated database plus supported account objects needed by the workload.
A good way to separate the concepts is:
- Database replication keeps a current copy of one database in the secondary account.
- A failover group replicates the database plus supported account objects and adds controlled failover and failback.
The closest alternative is continuing with database replication alone, but that does not provide the full continuity model requested in the scenario.
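A hedged sketch of the failover-group setup; the object types and schedule vary by requirement, and the account names mirror the exhibit:
-- On the primary account:
CREATE FAILOVER GROUP prod_fg
  OBJECT_TYPES = DATABASES, ROLES, USERS, WAREHOUSES
  ALLOWED_DATABASES = prod_etl
  ALLOWED_ACCOUNTS = orgname.dr_account
  REPLICATION_SCHEDULE = '10 MINUTE';

-- On the DR account (the primary account name is illustrative):
CREATE FAILOVER GROUP prod_fg
  AS REPLICA OF orgname.primary_account.prod_fg;

-- During an outage, promote the secondary; failback later reverses the roles:
ALTER FAILOVER GROUP prod_fg PRIMARY;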
Topic: Compute, Governance, Observability, and Performance
A data engineering team runs a 5-minute CDC task graph and interactive dashboards on the same warehouse. From 8:00 to 10:00, CDC tasks miss SLA and many dashboard queries show high queued_overload_time. Once queries start, execution time is normal and there is little spill. The dashboard pattern is hundreds of short concurrent queries; the CDC pattern is a small number of MERGE statements that must run predictably every 5 minutes, even if dashboard traffic doubles. Which compute design is the best next step?
Options:
A. Enable query acceleration service on the shared warehouse
B. Replace the CDC task graph with dynamic tables on the same warehouse
C. Split workloads across warehouses and make the dashboard warehouse multi-cluster
D. Increase the shared warehouse size and keep all workloads together
Best answer: C
Explanation: The symptom points to compute contention, not slow query execution. A dedicated warehouse for CDC protects predictable task runtimes, while a multi-cluster warehouse is the better fit for many short concurrent dashboard queries.
queued_overload_time with normal execution after start usually means the main problem is warehouse concurrency pressure. Here, two very different workloads are sharing compute: bursty BI queries and time-sensitive CDC MERGE statements. The best design is to isolate them on separate warehouses so the CDC pipeline keeps a stable SLA, then configure the dashboard warehouse as multi-cluster because that pattern benefits from scaling out for many simultaneous short queries.
A larger single warehouse may help somewhat, but it still mixes workloads and does not give the CDC process predictable isolation if dashboard demand grows. The key design choice is matching scale-out concurrency handling to BI while reserving dedicated compute for pipeline reliability.
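A minimal sketch of the split, with illustrative sizes and cluster bounds:
-- Dedicated, predictable compute for the CDC task graph.
CREATE OR REPLACE WAREHOUSE cdc_wh
  WAREHOUSE_SIZE = 'SMALL';

-- Multi-cluster warehouse that scales out for bursty short BI queries.
CREATE OR REPLACE WAREHOUSE dashboard_wh
  WAREHOUSE_SIZE = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';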
After the split, falling queued_overload_time on the dashboard warehouse is the signal that the contention is resolved.
Topic: Sharing, Replication, and Cross-Platform Delivery
A data engineering team wants data to remain in cloud object storage in an open table format, while Snowflake and non-Snowflake engines query the same table metadata. Which Snowflake architectural pattern best fits this requirement?
Options:
A. Iceberg tables
B. Materialized views
C. Dynamic tables
D. External tables
Best answer: A
Explanation: Iceberg tables are Snowflake’s architectural pattern for open-format tables stored in object storage and used across engines. The key requirement is shared table metadata and interoperability, not simply querying files or enabling a refresh feature.
The core concept is choosing an interoperable table architecture. Iceberg tables are designed for data that remains in object storage while being represented as a table through open metadata and snapshots that compatible engines can understand. That makes them the right fit when the requirement is cross-platform table access, not just reading external files or optimizing Snowflake-only queries.
The closest distractor is external tables, but they query file-based data rather than provide the same open table architecture.
Topic: Transformations, Programmability, and Developer Workflows
A task scheduled every 5 minutes now misses its SLA. It calls a Python stored procedure that reads rows from raw_events, loops through them, and performs only LATERAL FLATTEN, type casts, joins, ROW_NUMBER() deduplication, and a final MERGE into a curated table. What is the best next step?
Options:
A. Move the transformation into external functions that call a REST service
B. Rewrite the transformation as set-based Snowflake SQL run directly by the task
C. Increase the task warehouse size and keep the Python procedure unchanged
D. Split the procedure into per-row UDF calls for each shaping step
Best answer: B
Explanation: The transformation described is a natural fit for Snowflake SQL: flatten semi-structured data, join, deduplicate with window functions, and upsert with MERGE. Rewriting it as set-based SQL is the most direct fix because it removes Python handler and row-by-row overhead instead of adding more layers.
When the required shaping is already expressible with native Snowflake SQL constructs, SQL is usually the best transformation choice. In this scenario, LATERAL FLATTEN, casts, joins, ROW_NUMBER() with QUALIFY, and MERGE are all set-based operations that Snowflake can optimize directly. A Python stored procedure adds runtime startup and often encourages row-wise logic, which is a poor fit for this workload and can cause missed task SLAs.
A better design is to keep orchestration in the task and move the transformation into SQL, for example:
- LATERAL FLATTEN to shred the semi-structured rows.
- ROW_NUMBER() with QUALIFY to deduplicate.
- A single MERGE to upsert the curated table.
Adding more compute may hide the symptom temporarily, but it does not fix the implementation mismatch.
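A sketch of the set-based statement the task could run directly; the table and column names are illustrative assumptions:
MERGE INTO curated_events t
USING (
    SELECT
        e.payload:event_id::NUMBER AS event_id,
        f.value:sku::STRING        AS sku,
        f.value:qty::NUMBER        AS qty
    FROM raw_events e,
         LATERAL FLATTEN(INPUT => e.payload:line_items) f
    -- Keep the latest row per key instead of looping in Python.
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY e.payload:event_id, f.value:sku
        ORDER BY e.load_ts DESC
    ) = 1
) s
ON t.event_id = s.event_id AND t.sku = s.sku
WHEN MATCHED THEN UPDATE SET t.qty = s.qty
WHEN NOT MATCHED THEN
    INSERT (event_id, sku, qty) VALUES (s.event_id, s.sku, s.qty);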
Topic: Transformations, Programmability, and Developer Workflows
A retailer already loads JSON order documents into a raw Snowflake table every minute. Each document contains nested line_items and discounts arrays. Analysts need item-level relational tables within 5 minutes, and the original document must remain available for replay if transformation logic changes. Which Snowflake design is the best fit?
Options:
A. Create an external table over the staged JSON files
B. Store raw rows in VARIANT and use LATERAL FLATTEN downstream
C. Load each document into VARCHAR columns for string parsing
D. Treat the payloads as unstructured files with a directory table
Best answer: B
Explanation: The main concern is nested semi-structured transformation, not file discovery or unstructured file processing. Keeping the raw JSON in VARIANT and flattening arrays downstream preserves replayability and efficiently produces relational rows from nested elements.
When the source is JSON documents with nested arrays, the primary Snowflake concern is semi-structured transformation. VARIANT keeps the original payload intact for replay or reprocessing, and LATERAL FLATTEN turns arrays such as line_items into one row per element for relational analytics. That directly matches the need for item-level tables while retaining the original document.
The resulting pattern is:
- Keep the raw document in VARIANT.
- Read top-level fields with paths such as payload:order_id.
- Use LATERAL FLATTEN on nested arrays to create item-level rows.
File-oriented features are better for staged-file access or unstructured file inventories, not for the core task of transforming nested JSON already loaded into Snowflake.
Topic: Compute, Governance, Observability, and Performance
A data engineering team uses Snowpipe Streaming to load customer interactions into a curated table with sub-minute latency. Support engineers need full email and phone values, but marketing analysts must query the same table and see those fields only in masked form. The team wants centralized enforcement without copying data or redesigning downstream tasks. Which Snowflake feature is the BEST fit?
Options:
A. Use a secure view that removes the sensitive columns
B. Apply a row access policy to the table
C. Apply masking policies to the sensitive columns
D. Create a dynamic table with pre-masked data
Best answer: C
Explanation: Masking policies are Snowflake’s column-level governance control for exposing sensitive data differently by role at query time. They let authorized users see clear values while other users see masked values from the same base table, with no extra copy of the data.
The core concept is column-level controlled exposure. A masking policy is designed for sensitive fields such as email and phone, where some roles need full values and others should see masked results from the same table. Because the policy is enforced at query time on the existing columns, the Snowpipe Streaming ingestion pattern, table design, and downstream task flow do not need to change. This gives centralized governance with one governed copy of the data.
A row access policy governs which rows are visible, not how individual column values are obscured. A secure view or a separate dynamic table could work around the requirement, but both add extra objects and access patterns when Snowflake already provides a native policy feature for this exact use case.
The key takeaway is to use masking policies when the requirement is selective exposure of sensitive column values.
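A minimal sketch for one column; the role name, mask format, and table name are illustrative:
CREATE OR REPLACE MASKING POLICY mask_email AS (val STRING)
RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() = 'SUPPORT_ENG' THEN val  -- support sees clear values
        ELSE '***MASKED***'                           -- everyone else sees a mask
    END;

ALTER TABLE curated_interactions
  MODIFY COLUMN email
  SET MASKING POLICY mask_email;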
With the policy attached, support roles see full values while analyst queries return email or phone values only in masked form.
Topic: Data Sourcing, Storage, and Ingestion
Which Snowflake feature is built for low-latency row ingestion from a client application directly into a table, rather than detecting or processing changes only after the data is already stored in Snowflake?
Options:
A. Task
B. Dynamic table
C. Stream on a table
D. Snowpipe Streaming
Best answer: D
Explanation: Snowpipe Streaming is the ingestion mechanism in this set. It lets applications push rows into Snowflake tables with very low latency, while the other choices act only after data has already landed and then track, transform, or orchestrate work on it.
The key distinction is ingestion versus post-ingestion change processing. Snowpipe Streaming is designed to move rows from an application or connector into a Snowflake table with low latency, without depending on staged files as the primary pattern. By contrast, a stream records change data for rows that already exist in a Snowflake object, a task runs scheduled or triggered SQL after data is present, and a dynamic table maintains a derived result from source objects already in Snowflake. If the requirement is to get events into Snowflake first, choose the ingestion feature. If the requirement begins only after rows have landed, then streams, tasks, or dynamic tables are the relevant tools.
Topic: Sharing, Replication, and Cross-Platform Delivery
Which Snowflake feature is intended for cross-region disaster recovery by replicating eligible databases together with supported account objects and allowing promotion of the secondary environment during an outage?
Options:
A. Failover group
B. Reader account
C. Secure data sharing
D. Replication group
Best answer: A
Explanation: A failover group is the Snowflake construct designed for recovery-oriented replication when you need both replicated objects and the ability to fail over. It covers eligible databases plus supported account objects and lets the secondary environment be promoted during an outage.
The key concept is the difference between replication-only designs and failover-capable designs. A failover group is used for business continuity and disaster recovery across regions or clouds when the secondary environment must be promotable. It replicates eligible databases together with supported account objects, then allows failover when the primary environment is unavailable.
A replication group is similar in scope for replicated objects, but it does not provide the failover behavior the requirement explicitly calls for. Secure data sharing is for live data access, not for creating a recovery copy you can promote. A reader account helps consumers access shared data without their own Snowflake account, but it is not a disaster recovery mechanism.
If the requirement includes both replication and promotion during an outage, use a failover group.
Topic: Data Sourcing, Storage, and Ingestion
A team loads order files from an external stage. New JSON files land every 1–2 minutes, analysts need them queryable within 5 minutes, and the team wants to stop rerunning loads manually when urgent files arrive.
CREATE OR REPLACE TASK load_orders_task
WAREHOUSE = ingest_wh
SCHEDULE = '60 MINUTE'
AS
COPY INTO raw.orders
FROM @orders_ext_stage
FILE_FORMAT = (TYPE = JSON);
Which next step best fits the requirement?
Options:
A. Use Snowpipe Streaming for the files
B. Increase the size of ingest_wh
C. Replace the task with Snowpipe auto-ingest
D. Create a stream on raw.orders
Best answer: C
Explanation: The shown design is still an hourly batch pattern, so it cannot reliably meet a 5-minute availability target. For files arriving continuously in an external stage, Snowpipe auto-ingest is the Snowflake-native fit because it loads new files as they land.
This is a file-based continuous-ingestion requirement, not just a faster batch job. The task runs COPY INTO only once every 60 minutes, so even if the warehouse is powerful, data can still wait nearly an hour before loading. That makes the pattern both too slow for the SLA and likely to trigger manual reruns when urgent files arrive.
For files landing in an external stage every 1–2 minutes, Snowpipe auto-ingest is the appropriate design because it uses event-driven loading for newly arrived files. Streams help only after data is already in a table, and Snowpipe Streaming is intended for pushing records directly from applications rather than reading staged files.
The key takeaway is to use event-driven file ingestion when latency requirements are tighter than the batch schedule.
A larger warehouse may shorten COPY INTO runtime, but it does not remove the 60-minute schedule delay.
Topic: Transformations, Programmability, and Developer Workflows
Which requirement is the clearest fit for Snowflake AI text-processing functions in a data engineering pipeline, rather than a generic analytics or data-science workload?
Options:
A. Enable search optimization to speed selective filters on case IDs.
B. Add sentiment and short-summary columns to each incoming case note for downstream routing.
C. Build a dynamic table of daily case counts by product and region.
D. Train a model to forecast weekly case volume from historical patterns.
Best answer: B
Explanation: Pipeline enrichment uses AI text functions to turn unstructured text into structured attributes during ingestion or transformation. Adding sentiment and summary columns to incoming case notes is row-level text enrichment, while the other choices are reporting, forecasting, or performance tuning tasks.
The key distinction is whether the workload extracts new structured information from text as part of the pipeline. Adding sentiment and a summary to each case note is row-level enrichment: the pipeline writes derived text attributes back into Snowflake so downstream SQL, tasks, or dynamic tables can use them immediately. That differs from summarizing already-structured metrics, training a predictive model from historical data, or tuning query performance.
The closest distractor is the dynamic-table aggregate, which is still analytics over structured results rather than text enrichment.
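A sketch of that enrichment using Snowflake Cortex AISQL functions; the table and column names are illustrative:
SELECT
    case_id,
    SNOWFLAKE.CORTEX.SENTIMENT(note_text)  AS sentiment_score,  -- numeric sentiment signal
    SNOWFLAKE.CORTEX.SUMMARIZE(note_text)  AS note_summary      -- short text summary
FROM case_notes;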
Topic: Sharing, Replication, and Cross-Platform Delivery
A company curates sales data in Snowflake and must deliver it to two external consumers: a distributor that already uses Snowflake in another cloud, and a reseller that does not have a Snowflake account. Requirements: publish one governed business model, enforce consumer-specific row filtering and PII masking, and avoid building separate ETL copies for each recipient. Which design is the BEST fit?
Options:
A. Publish a curated share layer with secure views and policies, using a private listing with auto-fulfillment for the distributor and a reader account for the reseller.
B. Replicate raw databases to each consumer environment and let each consumer model and secure its own data.
C. Unload curated tables to object storage and require both consumers to ingest them with Snowpipe.
D. Share base tables directly and depend on tags and separate warehouses for consumer governance.
Best answer: A
Explanation: This is a data-product architecture decision, not a single-feature setting. The best pattern is to model a curated share layer once, enforce security in Snowflake with secure views and policies, and then choose the sharing mechanism that matches each consumer type.
When requirements combine reusable business modeling, per-consumer governance, and external delivery, the right answer is usually an architectural pattern: create a provider-owned curated layer and expose only governed projections of it. In Snowflake, secure views are a common boundary for publishing a stable business model, while row access and masking policies enforce consumer-specific restrictions.
For a partner that already has Snowflake in another cloud, a private listing with auto-fulfillment fits cross-cloud delivery. For a recipient without Snowflake, a reader account avoids building a separate export pipeline. Replicating raw data or unloading files pushes modeling and governance downstream, creating duplication and reducing control. The key takeaway is to separate internal storage from the governed share layer.
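A sketch of the provider-side share layer; the database, schema, and view names are illustrative, and the private listing or reader account is configured on top of this governed layer:
-- Publish a stable business model through a secure view.
CREATE OR REPLACE SECURE VIEW share_db.model.sales_v AS
SELECT order_id, region, order_total
FROM prod.curated.sales;

-- Expose only the governed projection through a share.
CREATE OR REPLACE SHARE sales_share;
GRANT USAGE ON DATABASE share_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA share_db.model TO SHARE sales_share;
GRANT SELECT ON VIEW share_db.model.sales_v TO SHARE sales_share;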
Topic: Transformations, Programmability, and Developer Workflows
A data engineering team ingests customer chat transcripts into RAW_CHAT every minute. They need a summary for each transcript within 5 minutes, raw text must stay inside Snowflake-governed processing, and they want to extend an existing stream/task pipeline without adding new infrastructure. Which approach is BEST?
Options:
A. Use a stream and task with a Snowflake Cortex AISQL summarization function.
B. Use an external function to send transcript text to a third-party LLM.
C. Run a custom summarizer in Snowpark Container Services for new rows.
D. Enable search optimization on RAW_CHAT and summarize in dashboards.
Best answer: A
Explanation: A stream plus task with Snowflake Cortex keeps summarization where the data already lives and stays within Snowflake governance. It also matches the team’s existing pipeline pattern and meets the near-real-time requirement without adding external services or heavier custom infrastructure.
The key design choice is to use Snowflake-native AI processing directly in the data pipeline. A stream tracks newly ingested transcript rows, and a task can call a Snowflake Cortex AISQL summarization function to write results into a downstream table on a frequent schedule. That keeps raw text inside Snowflake’s governance boundary, supports incremental processing, and avoids operational overhead from managing separate services or exporting data.
This is a better fit than sending text to an external model endpoint, which breaks the boundary requirement, or using search optimization, which improves lookup performance but does not create summaries. A custom containerized summarizer could work, but it is unnecessarily complex when built-in summarization already satisfies the stated needs.
Topic: Transformations, Programmability, and Developer Workflows
A team stores customer case notes in Snowflake and needs row-level summaries generated as part of a SQL pipeline. The design goal is to keep text processing near the data and avoid sending the text to an external service. Which Snowflake-native approach best fits?
Options:
A. An external function that calls a third-party LLM
B. A Snowflake Cortex AI summarization function in SQL
C. Snowpipe Streaming into a summary table
D. A dynamic table that refreshes from the notes table
Best answer: B
Explanation: The best fit is a Snowflake Cortex AI summarization function used directly in SQL. It keeps the text and processing inside Snowflake instead of exporting data to an outside service.
The core concept is using Snowflake-native AI functions for text work directly where the data already lives. A Cortex AI summarization function can be called from SQL, and it can be incorporated into pipeline logic such as tasks or downstream transformations without moving the text outside Snowflake. That satisfies the requirement to keep summarization near the data and inside the pipeline boundary.
A dynamic table can persist results from a query, but it is not itself the summarization capability. Snowpipe Streaming is for low-latency ingestion, not text generation. An external function can reach an outside model, but that sends processing outside Snowflake, which the stem explicitly wants to avoid.
Topic: Transformations, Programmability, and Developer Workflows
A data engineering team runs CDC every minute with a task:
SELECT apply_cdc();
apply_cdc() is a Python UDF intended to read a stream, run two MERGE statements into target tables, and insert audit rows. Task history shows failures when the function tries to execute SQL. What is the best next step?
Options:
A. Replace the task with a dynamic table on the stream
B. Convert apply_cdc to a UDTF and keep the MERGE logic there
C. Rewrite apply_cdc as a stored procedure and CALL it from the task
D. Increase the warehouse size for the task
Best answer: C
Explanation: This is a programmability-boundary problem, not a compute problem. UDFs are for reusable computation that returns values inside SQL, while stored procedures are for procedural workflows that execute multi-step DML and can be invoked by tasks.
In Snowflake, use a UDF when you need reusable logic that computes and returns a value to a SQL statement, such as normalizing text or deriving a column from inputs. This scenario is different: the routine must coordinate CDC processing by reading change data, running multiple MERGE statements, and writing audit records. That is procedural orchestration with database side effects, which fits a stored procedure.
A task can CALL a stored procedure so the procedure can manage multi-statement SQL, control flow, and status reporting for the CDC step. A UDF or UDTF may help transform data returned to a query, but they are not the right abstraction for orchestrating multi-target updates and audit writes.
The key takeaway is to use UDFs for expression-level computation and stored procedures for operational pipeline steps.
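A hedged sketch of the restructured step; the object names are illustrative, and some clients require the procedure body wrapped in $$ delimiters:
CREATE OR REPLACE PROCEDURE apply_cdc()
RETURNS STRING
LANGUAGE SQL
AS
BEGIN
    -- One of the CDC upserts; the real routine runs two MERGE statements.
    MERGE INTO prod.orders t
    USING etl.orders_stm s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.order_total = s.order_total
    WHEN NOT MATCHED THEN
        INSERT (order_id, order_total) VALUES (s.order_id, s.order_total);

    INSERT INTO etl.audit_log (step_name, run_ts)
        VALUES ('apply_cdc', CURRENT_TIMESTAMP());

    RETURN 'ok';
END;

CREATE OR REPLACE TASK etl.run_cdc
  WAREHOUSE = etl_wh
  SCHEDULE = '1 MINUTE'
AS
  CALL apply_cdc();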
Topic: Transformations, Programmability, and Developer Workflows
A Snowflake task populates ORDER_LINES from an API landing table. Each row contains one JSON document in a VARIANT column, and line_items is an array of objects. The task inserts NULL for sku and qty, even though order_id loads correctly.
SELECT
payload:order_id::NUMBER,
payload:line_items:sku::STRING AS sku,
payload:line_items:qty::NUMBER AS qty
FROM raw_orders;
Which action best addresses the issue?
Options:
A. Reload the stage with a different JSON file format
B. Use LATERAL FLATTEN on payload:line_items before selecting child fields
C. Apply document extraction to the source files
D. Run the task on a larger warehouse
Best answer: B
Explanation: This is a nested semi-structured transformation problem, not a file-loading or document-extraction problem. The payload is already in VARIANT, and only the array fields return NULL, which points to missing array flattening in the SQL.
When JSON is loaded into a VARIANT column, top-level object paths such as payload:order_id can be selected directly. Arrays are different: payload:line_items contains multiple elements, so child fields like sku and qty are not projected correctly by simple path notation. The symptom here—top-level extraction works but array attributes become NULL—shows the main concern is nested data handling. The fix is to use LATERAL FLATTEN on payload:line_items, then read each element through the flattened value column. Changing file-loading settings or treating the source as unstructured documents would not address the transformation logic, and adding compute would only make the wrong query run faster. The key distinction is nested JSON traversal versus file-based or document-like processing.
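The corrected projection, flattening the array so each line item becomes its own row:
SELECT
    o.payload:order_id::NUMBER AS order_id,
    li.value:sku::STRING       AS sku,
    li.value:qty::NUMBER       AS qty
FROM raw_orders o,
     LATERAL FLATTEN(INPUT => o.payload:line_items) li;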
The payload is already usable in VARIANT; only the query logic needs to change.
Topic: Compute, Governance, Observability, and Performance
A team ingests orders with Snowpipe Streaming into raw_orders, then uses a stream and task to populate curated_orders, which is shared to analysts. Requirements: curated data must be available within 5 minutes, and the pipeline uses a dedicated X-Small warehouse.
Monitoring:
- raw_orders: 1 minute old
- curated_orders: 18 minutes old
- Latest task run: SUCCEEDED
How should the on-call engineer classify this incident?
Options:
A. A data quality incident
B. A cost control incident
C. A freshness incident
D. An execution failure incident
Best answer: C
Explanation: This is a freshness problem because the curated table is older than the 5-minute SLA while upstream ingestion is still current. The monitoring signals show a timeliness gap between raw arrival and curated availability.
Freshness monitoring is about whether data is available when consumers expect it. In this scenario, new rows are reaching raw_orders within 1 minute, but curated_orders is 18 minutes behind even though the SLA is 5 minutes. That means the pipeline is producing data too late for downstream use.
In Snowflake terms, this usually points to delay between ingestion and publication, such as stream consumption, task logic, or downstream transformation latency. The first step is to treat the alert as a freshness breach and trace where the delay is introduced between raw_orders and curated_orders.
A succeeded task run does not by itself prove that the data is current.
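A hedged freshness probe, assuming the curated table carries a load timestamp column; the column name is illustrative:
SELECT DATEDIFF('minute', MAX(loaded_at), CURRENT_TIMESTAMP()) AS minutes_stale
FROM curated_orders;  -- alert when minutes_stale exceeds the 5-minute SLA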
Topic: Sharing, Replication, and Cross-Platform Delivery
An organization replicates a production database from one Snowflake account to a secondary account in another cloud for disaster recovery. During failover tests, the tables are current in the secondary account, but the application cannot resume quickly because users, roles, and warehouse settings must be recreated manually. The continuity requirement is cross-region or cross-cloud failover with minimal reconfiguration. What is the best next step?
Options:
A. Convert the database to a secure share for the DR account.
B. Unload data to external storage and reload it after an outage.
C. Use a failover group to replicate databases and account objects.
D. Keep database replication only and rebuild account objects manually.
Best answer: C
Explanation: This scenario shows that data replication alone is not enough for continuity. A failover group is the Snowflake mechanism built to replicate supported databases plus supported account objects, which reduces manual work during cross-region or cross-cloud recovery.
The key issue is that the secondary environment has current table data but lacks the surrounding Snowflake objects required to resume service quickly. Database replication addresses database contents, but continuity plans often also need supported account objects such as users, roles, and warehouses. Failover groups are intended for disaster recovery and business continuity because they can replicate both supported databases and supported account objects between Snowflake accounts, including across regions or clouds where supported.
Using a secure share would expose data, not create a recoverable standby environment. Keeping database replication alone leaves manual rebuild steps in the recovery path, which violates the minimal-reconfiguration requirement. Exporting and reloading data is slower and more operationally complex than using Snowflake’s native continuity features.
When recovery must preserve more than just data, choose failover groups over database replication by itself.
Topic: Data Sourcing, Storage, and Ingestion
A retailer’s custom checkout service emits one JSON order event at a time, and the application team can modify the producer code. The data engineering team needs each event queryable in Snowflake within about 5 seconds, wants to avoid managing staged files, and will handle downstream merges only after the raw events land. Which Snowflake design BEST meets these requirements?
Options:
A. Use a table stream and tasks to ingest newly arriving events.
B. Use scheduled COPY INTO jobs from a stage every minute.
C. Use Snowpipe Streaming to ingest rows directly into a raw table.
D. Use Snowpipe auto-ingest on staged micro-batch event files.
Best answer: C
Explanation: This is primarily a streaming-ingestion requirement, not a post-load CDC requirement. Snowpipe Streaming lets a custom producer send rows directly into Snowflake with very low latency and no staged files, while downstream change processing can be added later if needed.
The key decision is whether the problem is getting data into Snowflake or reacting to changes after it is already there. In this scenario, the source is a custom application sending row-level events, the latency target is only a few seconds, and the team does not want to manage files in a stage. That makes Snowpipe Streaming the best fit because it ingests rows directly into a Snowflake table with low latency.
A stream is not an ingestion mechanism from an external producer. It records row changes for a Snowflake table after inserts, updates, or deletes have already occurred, and tasks orchestrate follow-on SQL or procedures. Standard Snowpipe and scheduled COPY INTO are file-based or batch-oriented approaches, so they add staging overhead or more delay than required. The stream-and-task idea is the closest distractor, but it starts after ingestion, not before.
COPY INTO from a stage is a micro-batch pattern and is weaker for a 5-second direct-ingestion target.
Topic: Transformations, Programmability, and Developer Workflows
A stream-driven task is failing after a team wrapped CDC logic in a SQL UDF:
CREATE OR REPLACE TASK apply_orders
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
SELECT APPLY_CDC();
Inside APPLY_CDC, the team tries to run a MERGE into fact_orders, delete bad rows from order_errors, and return a status message. What is the most appropriate Snowflake fix?
Options:
A. Use a stored procedure and let the task CALL it.
B. Keep the UDF and upsize the warehouse.
C. Use a UDTF and select from it in the task.
D. Use an external function for the CDC logic.
Best answer: A
Explanation: This CDC step needs to execute Snowflake DML, not just return a computed value. A stored procedure is the right extensibility mechanism for multi-statement pipeline logic, and the task can invoke it directly with CALL.
The key distinction is between computing query results and performing actions. UDFs and UDTFs are meant to be used inside SQL queries to return a value or a row set. They are not the right mechanism when the logic itself must perform side effects such as MERGE, INSERT, UPDATE, or DELETE.
In this scenario, the task is orchestrating CDC application from a stream. That usually requires multi-statement execution, conditional logic, and status handling. A stored procedure fits that pattern because it can run DML and encapsulate procedural workflow, while the task still runs a single SQL statement:
CALL APPLY_CDC();
An external function is for calling an external service from SQL, not for internal table-maintenance logic. The practical rule is simple: use functions for reusable query-time computation, and use procedures for pipeline actions.
A stored procedure can own the MERGE and DELETE side effects that a UDF cannot perform.
Topic: Transformations, Programmability, and Developer Workflows
A Snowflake pipeline ingests customer comments every minute. A downstream task runs a Python stored procedure that sends each comment to an external NLP API for issue-type classification. The task is now missing its SLA because the API rate-limits requests, and the team wants the enrichment to stay inside Snowflake with minimal new components. What is the best next step?
Options:
A. Increase the task warehouse size and keep the external API calls.
B. Add search optimization on the comment column before classification.
C. Replace the stored procedure with an external function that calls the same NLP API.
D. Use a Cortex AISQL text-classification function in the task SQL.
Best answer: D
Explanation: The problem is caused by an external dependency, not by Snowflake compute. Using a Snowflake Cortex AISQL text-classification function directly in the pipeline keeps data in-platform and removes the rate-limited API call that is delaying the task.
When text enrichment or classification must happen inside an existing Snowflake pipeline, the best fit is an in-platform text-processing function that runs directly where the data already resides. In this scenario, the task misses its SLA because every batch depends on a rate-limited external NLP service. Replacing that step with a Snowflake Cortex AISQL classification function lets the team perform the enrichment inside task SQL, reducing operational complexity and removing the external bottleneck.
The closest distractor is adding warehouse compute, but more Snowflake compute cannot overcome throttling imposed by an outside service.
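A sketch of the in-platform replacement, assuming the account has access to Cortex text classification; the labels, table, and column names are illustrative:
SELECT
    comment_id,
    SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
        comment_text,
        ['billing', 'shipping', 'product defect', 'other']
    ):label::STRING AS issue_type
FROM customer_comments;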
Topic: Compute, Governance, Observability, and Performance
A data engineering team ingests clickstream events into a 30 TB EVENTS table with Snowpipe Streaming. Product dashboards run ad hoc queries like the one below every few minutes. The queries must finish in under 15 seconds, filters change often across several dimensions, and each run still scans tens of millions of rows before aggregating to fewer than 20 output rows. Query history shows most elapsed time in scan/filter/aggregate steps, not queuing.
SELECT campaign_id, SUM(revenue)
FROM events
WHERE event_ts >= DATEADD(day,-7,CURRENT_TIMESTAMP())
AND region IN ('US','CA')
AND device_type = 'mobile'
GROUP BY campaign_id
ORDER BY SUM(revenue) DESC
LIMIT 20;
What is the best Snowflake performance lever?
Options:
A. Create a minute-refresh dynamic table mirroring EVENTS
B. Define a clustering key across all filter columns
C. Add Search Optimization Service to the filtered columns
D. Enable Query Acceleration Service on the dashboard warehouse
Best answer: D
Explanation: Query Acceleration Service is the best fit because these queries are scan-heavy and spend most of their time filtering and aggregating large volumes of data before returning a tiny result set. It improves that warehouse-level execution pattern without requiring table-specific search structures for many changing predicates.
Search Optimization Service and Query Acceleration Service address different bottlenecks. Search optimization is strongest when Snowflake needs to find a very small set of rows quickly, such as point lookups, highly selective joins, or targeted searches on specific columns or semi-structured paths. Here, the workload is not a row-location problem: each query still reads tens of millions of rows from a recent time window and spends most of its time in scan, filter, and aggregate work before producing a short ranked output. That pattern aligns better with Query Acceleration Service, which adds shared compute to speed eligible selective analytical queries without redesigning the table.
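A minimal sketch of enabling the service on the dashboard warehouse; the scale factor is an illustrative cap:
ALTER WAREHOUSE dashboard_wh SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;

-- Optionally estimate the benefit for a recent query first:
SELECT SYSTEM$ESTIMATE_QUERY_ACCELERATION(LAST_QUERY_ID());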
The closest distractor is search optimization, but it is usually stronger for pinpoint row retrieval than for broad analytical scans like this one.
EVENTS.Topic: Streaming, Orchestration, and Near Real-Time Pipeline Design
A data engineering team already lands source records into RAW.ORDERS_LANDING in Snowflake. A downstream process must consume only the rows that changed since its last run, including inserts, updates, and deletes, without rescanning the full table. Which Snowflake object is designed for this CDC pattern?
Options:
A. A stream on RAW.ORDERS_LANDING
B. A Snowpipe on the source stage
C. A task that queries RAW.ORDERS_LANDING
D. A dynamic table built from RAW.ORDERS_LANDING
Best answer: A
Explanation: A stream is Snowflake’s CDC object for data that is already stored in a table. It exposes row-level changes since the stream’s current offset, so downstream processing can read only new inserts, updates, and deletes.
Use a stream when the data has already landed in Snowflake and downstream logic must process only the changed rows. A stream uses change tracking on the underlying table to expose a CDC delta from its current offset, which avoids rereading the entire landing table each time.
A dynamic table can maintain a derived target table, but it is not the object that provides a consumable CDC feed from the source table.
Topic: Data Sourcing, Storage, and Ingestion
A team lands raw order JSON in RAW_ORDERS and currently uses an external ETL job to build a curated table every few minutes. They are evaluating this Snowflake design:
CREATE OR REPLACE DYNAMIC TABLE curated_orders
TARGET_LAG = '5 minutes'
WAREHOUSE = etl_wh
AS
SELECT
payload:order_id::NUMBER AS order_id,
payload:customer_id::NUMBER AS customer_id,
payload:order_total::NUMBER(10,2) AS order_total
FROM raw_orders;
What is the best interpretation of this design?
Options:
A. It uses Snowpipe to continuously load staged files without warehouse compute.
B. It uses a dynamic table to keep transformed results fresh in Snowflake.
C. It uses an external table to query object-store files in place.
D. It uses zero-copy cloning to keep a copy synchronized automatically.
Best answer: B
Explanation: The exhibit defines a dynamic table with a freshness target and a transformation query over RAW_ORDERS. That means Snowflake manages the refresh of the curated result inside the platform, reducing the need to export data to a separate ETL or Spark system for this common engineering task.
This is a dynamic table pattern. A dynamic table stores the result of a query and Snowflake refreshes it to stay within the specified TARGET_LAG, using the named warehouse for the refresh work. In this case, the query parses JSON from a raw landing table into typed relational columns, which is a common transformation that teams often push to external ETL tools.
Because the transformation and refresh both happen inside Snowflake, data does not need to be moved out to a separate processing platform just to keep the curated table current. That is the key Snowflake-native service being demonstrated here. The closest distractor is Snowpipe, but Snowpipe is for ingesting files, not for maintaining transformed query results.
Topic: Transformations, Programmability, and Developer Workflows
A data engineering team needs a SQL-native transformation step that runs every 5 minutes, skips work when no CDC rows are available, and returns a simple status to an orchestrating task. They review this procedure body:
DECLARE has_rows BOOLEAN;
BEGIN
SELECT SYSTEM$STREAM_HAS_DATA('ORDERS_STM')
INTO :has_rows;
IF (has_rows) THEN
MERGE INTO CURATED_ORDERS t
USING ORDERS_STM s
ON t.ORDER_ID = s.ORDER_ID
WHEN MATCHED THEN UPDATE SET STATUS = s.STATUS
WHEN NOT MATCHED THEN
INSERT (ORDER_ID, STATUS)
VALUES (s.ORDER_ID, s.STATUS);
END IF;
RETURN 'ok';
END;
Which Snowflake design choice best matches this fragment and the requirements?
Options:
A. A dynamic table for automatic incremental refresh
B. Snowpipe Streaming into the curated target table
C. An external table over staged CDC files
D. A SQL stored procedure using Snowflake Scripting
Best answer: D
Explanation: This fragment demonstrates Snowflake Scripting used for a SQL stored procedure pattern. The code assigns a value to a variable, conditionally runs a MERGE, and returns a status, which fits a SQL-native orchestration step that can be called by a task.
Snowflake Scripting is Snowflake’s SQL procedural language for SQL stored procedures and anonymous blocks. In this fragment, SYSTEM$STREAM_HAS_DATA checks whether the stream has CDC rows, INTO stores that result in a variable, IF decides whether to run the MERGE, and RETURN sends a status back to the caller. That is a procedural wrapper around SQL transformations, which is useful when a team wants branching logic while staying entirely SQL-native.
The MERGE performs the CDC upsert only when the stream has rows, so empty cycles skip work.
A declarative refresh feature could maintain derived data, but it would not be the main concept shown by this scripting fragment.
Topic: Compute, Governance, Observability, and Performance
A governance team plans to drop a column from a shared table. They need to identify which users queried that column recently and which downstream views depend on it. They do not need to monitor null rates, freshness, or row-count drift. Which Snowflake capability best matches this observability requirement?
Options:
A. Access history and object dependency metadata
B. Object tags
C. Data metric functions
D. Masking policies
Best answer: A
Explanation: The requirement is to audit usage and understand downstream impact before a schema change. In Snowflake, access history covers who queried data, and object dependency metadata covers which objects rely on it; data metric functions are for data-quality signals instead.
This is an access observability and impact analysis problem, not a data-quality problem. Access history is used to investigate who accessed a table or column and when, while object dependency metadata is used to find downstream objects such as views that reference the source object. Together, they answer the two questions in the stem: “who used it” and “what depends on it.”
Data metric functions are the wrong fit because they evaluate data characteristics such as freshness, null counts, or row-level quality expectations. Those metrics help validate data quality, but they do not provide an audit trail of access or a dependency graph for schema-change planning. The key takeaway is to choose access and dependency metadata when the goal is governance observability, not quality measurement.
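A hedged sketch of both lookups against ACCOUNT_USAGE; the column name is a placeholder, and these views carry some ingestion latency:
-- Who queried the column recently:
SELECT DISTINCT ah.user_name, ah.query_start_time
FROM snowflake.account_usage.access_history ah,
     LATERAL FLATTEN(INPUT => ah.base_objects_accessed) obj,
     LATERAL FLATTEN(INPUT => obj.value:"columns") col
WHERE obj.value:"objectName"::STRING = 'PROD.SALES.ORDERS'
  AND col.value:"columnName"::STRING = 'ORDER_NOTES'
  AND ah.query_start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP());

-- Which downstream objects depend on the table:
SELECT referencing_database, referencing_schema, referencing_object_name
FROM snowflake.account_usage.object_dependencies
WHERE referenced_database = 'PROD'
  AND referenced_schema = 'SALES'
  AND referenced_object_name = 'ORDERS';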
Topic: Compute, Governance, Observability, and Performance
A team wants Snowflake-native data-quality monitoring on PROD.SALES.ORDERS without writing custom validation pipelines. Checks must run every 15 minutes, track null and duplicate counts, and support rules such as ORDER_ID null count = 0.
ALTER TABLE prod.sales.orders
SET DATA_METRIC_SCHEDULE = '15 MINUTE';
Based on the exhibit, what is the best next step?
Options:
A. Associate Data Metric Functions and expectations
B. Enable search optimization on ORDER_ID
C. Create an alert with validation SQL
D. Create a stream and task graph
Best answer: A
Explanation: The exhibit only schedules when Snowflake evaluates data-quality checks; it does not define any checks by itself. To monitor nulls, duplicates, and threshold-based rules natively, the engineer should associate Data Metric Functions and add expectations.
Snowflake’s native feature for metric-level and rule-level data-quality monitoring is Data Metric Functions (DMFs). DATA_METRIC_SCHEDULE sets the cadence for evaluation, but the table still needs metric associations such as SNOWFLAKE.CORE.NULL_COUNT or SNOWFLAKE.CORE.DUPLICATE_COUNT. If the team also wants pass/fail logic like “null count must be 0,” expectations can be attached so Snowflake evaluates that rule directly. This keeps the monitoring definition inside Snowflake and avoids building custom audit pipelines for common quality checks.
Streams, tasks, or alerts can automate custom logic, but they are not the purpose-built feature the exhibit is setting up.
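A minimal sketch of the association step that completes the exhibit; expectation rules such as "null count = 0" can be layered on where the feature is available:
ALTER TABLE prod.sales.orders
  ADD DATA METRIC FUNCTION snowflake.core.null_count ON (order_id);

ALTER TABLE prod.sales.orders
  ADD DATA METRIC FUNCTION snowflake.core.duplicate_count ON (order_id);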
Use the SnowPro Data Engineer DEA-C02 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try SnowPro Data Engineer DEA-C02 on Web View SnowPro Data Engineer DEA-C02 Practice Test
Read the SnowPro Data Engineer DEA-C02 Cheat Sheet on Tech Exam Lexicon for concept review before another timed run.