SnowPro DEA-C02 sample questions, mock-exam practice, and simulator access with detailed explanations in IT Mastery on web, iOS, and Android.
Snowflake SnowPro Advanced: Data Engineer (DEA-C02) targets pipeline design, transformations, streaming, delivery, governance, and performance judgment for teams building serious Snowflake data-engineering workflows. If you are searching for DEA-C02 sample questions, a practice test, mock exam, or exam simulator, this is the main IT Mastery page to start on web and continue on iOS or Android with the same account.
Start a practice session for Snowflake SnowPro Advanced: Data Engineer (DEA-C02) below. For the best experience, open the full app in a new tab and navigate with swipes/gestures or the mouse wheel—just like on your phone or tablet.
Open Full App in a New Tab
A small set of questions is available for free preview. Subscribers can unlock full access by signing in with the same account used on mobile.
Prefer to practice on your phone or tablet? Download the IT Mastery – AWS, Azure, GCP & CompTIA exam prep app for iOS or IT Mastery app on Google Play (Android) and then sign in with the same account on web to continue your sessions on desktop.
Scenario-heavy Snowflake Advanced Data Engineer practice with inferred domain weighting and companion formats aligned to the current 65-question DEA-C02 exam.
| Domain | Weight |
|---|---|
| Data sourcing, storage, and ingestion | 22% |
| Transformations, programmability, and developer workflows | 24% |
| Streaming, orchestration, and near real-time pipeline design | 20% |
| Sharing, replication, and cross-platform delivery | 18% |
| Compute, governance, observability, and performance | 16% |
These sample questions are drawn from the current local bank for this exact exam code. Use them to check your readiness here, then continue into the full IT Mastery question bank for broader timed coverage.
A provider currently delivers a nightly file snapshot to a partner-managed stage.
COPY INTO @partner_stage/sales/
FROM prod.curated.sales_orders
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE;
The partner already has its own Snowflake account in the same region. The provider now wants the partner to query the latest data continuously while the provider retains control over access and can revoke it at any time. What is the best next step?
Options:
Best answer: C
Explanation: The requirement is ongoing access to current data without handing off a separate copy. Secure Data Sharing fits because the consumer queries provider-managed shared objects directly, and the provider can revoke access at any time.
Secure Data Sharing is the best fit when a consumer already has its own Snowflake account and needs ongoing access to current provider data. The consumer queries shared objects exposed from the provider account, so updates are visible without unloading files or maintaining a separate replicated dataset. That also preserves provider-side control because the provider decides which objects are shared and can change or revoke access centrally.
The key distinction here is live shared access versus delivering or maintaining copies.
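As a rough sketch of the provider-side setup (the share name and consumer account identifier are hypothetical), the provider grants the curated table to a share and adds the partner account:

```sql
-- Hypothetical share name and consumer account identifier.
CREATE SHARE partner_share;
GRANT USAGE ON DATABASE prod TO SHARE partner_share;
GRANT USAGE ON SCHEMA prod.curated TO SHARE partner_share;
GRANT SELECT ON TABLE prod.curated.sales_orders TO SHARE partner_share;

-- The consumer account can be removed at any time to revoke access.
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;
```

The partner then creates a database from the share and queries the provider-managed objects directly, with no files unloaded and no second copy to maintain.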
A task that populates CURATED_EVENTS now fails at compile time. The team wrote the SQL below, but normalize_agent was created as a Python stored procedure. The logic returns one normalized string for each input row, does not need external network access, and analysts also need to call it directly in ad hoc SQL queries.
INSERT INTO curated_events
SELECT event_id,
       normalize_agent(user_agent) AS ua_norm
FROM raw_events;
What is the best next step?
Options:
normalize_agent as a Python UDF
normalize_agent as a Python UDTF
Best answer: B
Explanation: This is a SQL-callable logic problem, not a Snowpark pipeline problem. Because the code must return one scalar value per row inside a SELECT and be reusable in SQL, a Python UDF fits directly.
The core issue is choosing the right SQL-integrated callable object. The task needs logic that can be invoked inside INSERT ... SELECT, produce exactly one value for each input row, and remain reusable in ad hoc SQL. That is the design point of a scalar UDF, including a Python UDF when the business logic is best written in Python.
Stored procedures are invoked with CALL and are intended for orchestration or multi-step operations, not inline row-by-row expressions. A UDTF is only appropriate when the code returns a table. Snowpark is useful for full DataFrame-style programs, but it is not the best fit when the requirement is a SQL expression callable directly from queries. External functions are for remote service calls, which the stem explicitly does not require.
When SQL needs to call custom logic inline, pick the callable object that matches the SQL usage pattern and output shape.
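A minimal sketch of a scalar Python UDF for this pattern (the normalization logic shown is a placeholder, not the team's actual code):

```sql
CREATE OR REPLACE FUNCTION normalize_agent(ua STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'
HANDLER = 'normalize'
AS
$$
def normalize(ua):
    # Placeholder logic: lowercase and collapse whitespace.
    return ' '.join(ua.lower().split()) if ua is not None else None
$$;
```

Because it is a scalar UDF, the failing INSERT ... SELECT works unchanged, and analysts can call normalize_agent(user_agent) in ad hoc queries.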
A Snowpipe auto-ingests partner CSV files with a header row into orders_raw, a typed table queried directly by downstream SQL. The partner periodically adds nullable columns and sometimes reorders existing columns. Pipe history now shows load failures from column mismatches, and the team wants unattended ingestion to keep working whenever compatible columns are introduced. What is the best next step?
Options:
ON_ERROR = CONTINUE
VARIANT column
ALTER TABLE before each load
Best answer: D
Explanation: The failures come from positional loading against a table whose source columns now move and grow over time. The most reliable Snowflake-native fix is to enable schema evolution on the target and have the pipe match columns by name so compatible additions can be absorbed automatically.
This scenario is a classic schema-drift problem in file ingestion. Because the source sends CSV headers, sometimes reorders fields, and occasionally adds nullable columns, the robust fix is to keep the relational target but enable Snowflake schema evolution and use name-based column matching in the pipe or COPY INTO logic. That allows Snowflake to map reordered columns correctly and add compatible new columns without manual intervention.
Landing everything as VARIANT is flexible, but it changes the downstream contract when consumers already depend on typed columns.
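A sketch of the name-based, evolution-enabled load path (table, stage, and pipe names are hypothetical; for CSV, matching by name requires a header-parsing file format):

```sql
-- Let compatible new columns be added to the target automatically.
ALTER TABLE orders_raw SET ENABLE_SCHEMA_EVOLUTION = TRUE;

-- Match columns by header name instead of position.
CREATE OR REPLACE PIPE orders_pipe AUTO_INGEST = TRUE AS
COPY INTO orders_raw
FROM @partner_stage
FILE_FORMAT = (TYPE = CSV, PARSE_HEADER = TRUE)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```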
VARIANT landing handles drift, but it changes the downstream table contract instead of preserving the typed relational load.
ON_ERROR = CONTINUE may skip problematic data, but it does not solve reordered columns or compatible schema additions.
A data engineer says the JSON load format is wrong because one field returns NULL. The data is already stored in Snowflake as RAW_ORDERS(src VARIANT).
Exhibit:
SELECT
    src:order_id::NUMBER AS order_id,
    src:items.sku::STRING AS sku
FROM RAW_ORDERS;
Sample src value already in the table:
{"order_id":101,"items":[{"sku":"A1"},{"sku":"B2"}]}
order_id is populated, but sku is NULL for every row. What is the best interpretation?
Options:
items is an array and should be indexed or flattened.
MATCH_BY_COLUMN_NAME to extract sku.
STRIP_OUTER_ARRAY = TRUE.
Best answer: A
Explanation: Because order_id is extracted successfully from the same VARIANT column, the data was loaded and parsed correctly. The problem is the query pattern: items is an array of objects, so you must index into it or use LATERAL FLATTEN to get each sku.
This is a semi-structured transformation problem, not an ingestion-format problem. In Snowflake, once JSON is stored in a VARIANT column, object fields and array elements are accessed with path notation. Here, order_id works because it is a top-level scalar field. The items field is an array, so src:items.sku does not iterate through its elements.
Use one of these patterns:
- src:items[0].sku to read a specific element
- LATERAL FLATTEN(input => src:items) to return one row per item

STRIP_OUTER_ARRAY is a load-time setting for source files whose top-level content is an array, not for nested arrays already stored inside VARIANT. The key takeaway is to fix the semi-structured query logic before changing ingestion settings.
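For example, the flattened form of the failing query would be (same table and fields as the exhibit):

```sql
SELECT
    src:order_id::NUMBER   AS order_id,
    item.value:sku::STRING AS sku
FROM RAW_ORDERS,
     LATERAL FLATTEN(input => src:items) item;
```

With the sample src value, this returns two rows for order 101: one with sku A1 and one with sku B2.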
STRIP_OUTER_ARRAY fails because the nested array is already inside a successfully loaded VARIANT row.
MATCH_BY_COLUMN_NAME applies during loading into table columns, not when traversing arrays in existing semi-structured data.
A source system drops new CSV files into a stage at unpredictable times all day. Data must be loaded into a Snowflake table within a few minutes of file arrival, and the team wants to avoid running scheduled batch load statements. Which Snowflake feature best fits this requirement?
Options:
Snowpipe
COPY INTO every night
Snowpipe Streaming
Best answer: B
Explanation: Snowpipe is the Snowflake-native choice for continuous file ingestion when files arrive throughout the day and freshness is measured in minutes. Bulk loading with COPY INTO is better for scheduled batch loads, not event-driven file arrival.
The key concept is matching the ingestion mechanism to the file arrival pattern and freshness target. When files land unpredictably and should be loaded soon after arrival, Snowpipe is built for continuous file ingestion from a stage into a Snowflake table. It reduces the need to manage repeated polling or scheduled batch jobs.
By contrast, bulk loading with COPY INTO is typically used when files are loaded in larger batches on a defined schedule. Snowpipe Streaming is for streaming row-based data into Snowflake without relying on staged files, and external tables let you query files in place rather than ingesting them into a native table.
A minutes-level freshness requirement for arriving files points to Snowpipe, not a batch COPY INTO pattern.
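A minimal auto-ingest sketch (stage, table, and pipe names are hypothetical; auto-ingest also needs a cloud event notification wired to the pipe's notification channel):

```sql
CREATE PIPE load_events_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO raw_events
FROM @incoming_stage
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1);
```

Once the notification is in place, files are loaded shortly after arrival with no warehouse scheduled to poll for them.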
COPY INTO job is bulk loading, not continuous ingestion.
Snowpipe Streaming targets streamed records, not staged files arriving as CSV objects.
A team uses Snowpipe Streaming to land support tickets into RAW_TICKETS. They need each new ticket routed to one of five categories and a short summary available within 5 minutes for an operations dashboard. Ticket text must remain in Snowflake, and the team wants to avoid a separate model-training project. What is the best design choice?
Options:
Best answer: C
Explanation: Snowflake Cortex AI functions fit lightweight text-enrichment tasks such as classification and summarization without turning the solution into a full ML project. Using a stream and task keeps the workflow near real time, inside Snowflake, and operationally simple.
The key concept is choosing Snowflake-native AI functions for an engineering workflow when the requirement is straightforward text enrichment rather than model development. Here, the team already has continuous ingestion, needs category labels and short summaries quickly, must keep ticket text in Snowflake, and wants low operational overhead. A stream captures newly arrived rows, and a task can process those rows every few minutes with Snowflake Cortex AI functions in SQL, writing results to an enriched table for the dashboard.
This meets the latency target while avoiding the extra training, deployment, and data-movement complexity of a separate ML solution.
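A sketch of the stream-plus-task enrichment loop (object names, columns, and the category list are hypothetical; the AI calls use the SNOWFLAKE.CORTEX SQL functions):

```sql
CREATE STREAM tickets_stream ON TABLE raw_tickets;

CREATE TASK enrich_tickets
  WAREHOUSE = ops_wh
  SCHEDULE = '2 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('TICKETS_STREAM')
AS
INSERT INTO tickets_enriched
SELECT ticket_id,
       SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
           body, ['billing', 'outage', 'login', 'shipping', 'other']
       ):label::STRING AS category,
       SNOWFLAKE.CORTEX.SUMMARIZE(body) AS summary
FROM tickets_stream;
```

Ticket text never leaves Snowflake, and the only moving parts are a stream, a task, and two SQL function calls.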
A team built a three-layer dynamic table pipeline: dt_clean -> dt_enriched -> dt_mart. Each dynamic table uses TARGET_LAG = '5 minutes'. Source data lands every 2 minutes, but the only business requirement is that dt_mart be no more than 15 minutes behind the source. Query history shows frequent refresh overlap and warehouse queuing. They want the simplest Snowflake-native fix. What should they do?
Options:
dt_mart to a materialized view
DOWNSTREAM upstream and 15 minutes on dt_mart
5 minutes and resize the warehouse
Best answer: B
Explanation: The pipeline is refreshing more often than the business actually requires. Setting intermediate dynamic tables to DOWNSTREAM and keeping the explicit freshness target only on the final table lets Snowflake coordinate refreshes across the graph while reducing overlap and compute pressure.
Dynamic tables are designed for declarative freshness management. In a multi-layer graph, giving every layer an aggressive explicit TARGET_LAG can cause upstream tables to refresh more often than the final consumer needs, which increases compute usage and can lead to warehouse queuing.
When the real SLA is only on the final output, a better design is to set intermediate dynamic tables to TARGET_LAG = DOWNSTREAM and give the final dynamic table the required lag, here 15 minutes. Snowflake then refreshes dependencies based on downstream need instead of treating each layer as independently urgent.
This keeps the pipeline simpler than rebuilding it with streams and tasks, while aligning refresh effort to the actual freshness requirement. Increasing warehouse size alone may reduce symptoms, but it does not fix the over-refresh design.
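The fix reduces to three statements (table names from the stem; shown as a sketch):

```sql
-- Intermediate layers refresh only when a downstream table needs them.
ALTER DYNAMIC TABLE dt_clean    SET TARGET_LAG = DOWNSTREAM;
ALTER DYNAMIC TABLE dt_enriched SET TARGET_LAG = DOWNSTREAM;

-- The only real SLA lives on the final table.
ALTER DYNAMIC TABLE dt_mart     SET TARGET_LAG = '15 minutes';
```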
A Snowpark stored procedure invokes several SQL statements and external function calls. The team sees intermittent latency spikes and needs visibility into the execution path and timing of each step within a single run. Which Snowflake capability is most relevant?
Options:
Best answer: C
Explanation: Tracing is designed to follow request flow and span timing across a single execution. When the requirement is to diagnose where latency occurs inside one Snowpark procedure run, tracing provides the needed path-level visibility.
Tracing is the best fit when an operational visibility requirement is about understanding how a single execution moved through multiple steps and how long each step took. In this case, the team needs per-step timing and call-path detail across SQL work and external function calls inside one stored procedure run. Logging records emitted events or messages, and alerts evaluate a condition to notify or act, but neither one reconstructs end-to-end execution flow. Data metric functions focus on data quality measurements on table data, not procedural runtime behavior.
If the need were simply to notify on failures or SLA breaches, alerts would be more appropriate.
A central Snowflake team distributes customer data to Finance and Sales in separate consumer accounts. To prepare for later cross-region delivery, an architect proposes distributing only raw tables and letting each team maintain its own secure views and task logic locally; analysts already report different active_customer counts, and each schema change triggers fixes in multiple accounts. Which next step best addresses the problem while preserving maintainability and cross-team clarity?
Options:
Best answer: D
Explanation: The failure is semantic drift, not data movement. When each consumer account owns its own view logic, KPI definitions diverge and every upstream change must be implemented multiple times; a producer-managed curated layer keeps one authoritative definition for all teams.
When the stated need is maintainability and cross-team clarity, distributing raw tables and letting every team build its own secure views is the wrong architecture. Even if shared or replicated data is current, local modeling creates duplicate business logic, conflicting KPI definitions, and repeated remediation after every schema change. A better Snowflake pattern is to publish a producer-managed curated database, often with standardized secure views, and deliver that same governed layer to each consumer. Then business-rule updates and schema adjustments happen once, and all teams see the same meaning for active_customer. Replication can still help later with regional delivery, but it should move the curated layer rather than multiply independent semantic definitions.
An insurer is onboarding policy and claims data from six operational systems into Snowflake. Requirements: retain full source history with source-system traceability for 7 years, tolerate frequent source schema changes, and publish different consumer-specific datasets through secure sharing. A 5-minute refresh is sufficient. The team is debating Snowpipe versus Snowpipe Streaming. What is the BEST next action?
Options:
Best answer: D
Explanation: The deciding factor is long-term integration design, not sub-minute ingest mechanics. A Data Vault-style core layer best fits full history, source traceability, and flexible downstream sharing, while a 5-minute SLA leaves multiple acceptable ingestion options.
When the requirements center on history retention, source traceability, schema-change tolerance, and reuse across many downstream products, the primary design decision is the core data model. In Snowflake, a Data Vault-style core layer is a strong fit because it preserves historized relationships and business-key lineage while letting teams publish different marts or secure shares without redesigning ingestion each time.
Choosing between Snowpipe and Snowpipe Streaming matters only after the architecture is clear; with a 5-minute target, immediate load mechanics are not the main constraint.
A data engineering team runs CDC every minute with a task:
SELECT apply_cdc();
apply_cdc() is a Python UDF intended to read a stream, run two MERGE statements into target tables, and insert audit rows. Task history shows failures when the function tries to execute SQL. What is the best next step?
Options:
apply_cdc to a UDTF and keep the MERGE logic there
apply_cdc as a stored procedure and CALL it from the task
Best answer: C
Explanation: This is a programmability-boundary problem, not a compute problem. UDFs are for reusable computation that returns values inside SQL, while stored procedures are for procedural workflows that execute multi-step DML and can be invoked by tasks.
In Snowflake, use a UDF when you need reusable logic that computes and returns a value to a SQL statement, such as normalizing text or deriving a column from inputs. This scenario is different: the routine must coordinate CDC processing by reading change data, running multiple MERGE statements, and writing audit records. That is procedural orchestration with database side effects, which fits a stored procedure.
A task can CALL a stored procedure so the procedure can manage multi-statement SQL, control flow, and status reporting for the CDC step. A UDF or UDTF may help transform data returned to a query, but they are not the right abstraction for orchestrating multi-target updates and audit writes.
The key takeaway is to use UDFs for expression-level computation and stored procedures for operational pipeline steps.
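A condensed sketch of the corrected shape (one MERGE shown; the stream, target, and audit names are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE apply_cdc()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
    -- First of the two MERGE statements; the second follows the same pattern.
    MERGE INTO dim_orders t
    USING orders_stream s ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status);

    INSERT INTO cdc_audit VALUES (CURRENT_TIMESTAMP(), 'apply_cdc', 'ok');
    RETURN 'ok';
END;
$$;

-- The task invokes the procedure instead of selecting a UDF.
ALTER TASK cdc_task MODIFY AS CALL apply_cdc();
```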
A vendor delivers small CSV files to an external stage at irregular times. Snowflake currently loads them with a scheduled task:
CREATE TASK load_vendor_files
  SCHEDULE = '15 MINUTE'
AS
COPY INTO raw.vendor_events
FROM @vendor_stage;
Dashboards are often 10-15 minutes behind. The SLA is data available in under 1 minute after each file arrives, and the team wants the most Snowflake-native, low-operations approach while keeping file-based delivery. What is the best next step?
Options:
COPY INTO loads.
Best answer: A
Explanation: Snowpipe is designed for continuous, file-based ingestion when files land in a stage and low operational overhead is required. It fits the under-1-minute target better than a polling task because it loads on file arrival instead of waiting for the next schedule.
The deciding concept is matching the ingestion mechanism to the delivery model and latency target. The source is still delivering files, so the best Snowflake-native fix is to move from scheduled COPY INTO polling to Snowpipe auto-ingest, which uses event-driven loading when new files arrive in the stage.
This fits the stated requirements:
Snowpipe Streaming is for direct row ingestion from applications or streaming clients, not for a vendor that is already sending files. The key takeaway is that continuous file arrival plus a near-real-time SLA usually points to Snowpipe, not scheduled tasks.
COPY INTO execution instead of event-driven file ingestion.
A company ingests order events into Snowflake with Snowpipe Streaming and refreshes a curated ORDERS_CURATED table every few minutes. An external logistics partner already has its own Snowflake account and needs query access with under 5-minute latency. The partner must see only shipment-related columns for its assigned region, and the provider wants to avoid copying data into another database. Which Snowflake design is the BEST fit?
Options:
Best answer: C
Explanation: The best choice is to enforce least privilege in the provider account with a secure view, then deliver that governed dataset through sharing. This meets the low-latency requirement and avoids creating another physical copy of the data.
When a scenario mixes consumer delivery with governance, first decide how the provider will restrict what the consumer can see. In Snowflake, a secure view can expose only the required shipment columns and filter rows to the partner’s allowed region. That governed object can then be delivered through Secure Data Sharing to the partner’s existing Snowflake account.
This design fits the stated constraints:
Replication is mainly for data movement, availability, or disaster recovery, not for minimizing consumer exposure. The key takeaway is to govern the shared dataset first, then choose the delivery mechanism.
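A governed-first sketch (database, schema, view, and filter values are hypothetical):

```sql
-- Expose only shipment columns for the partner's region.
CREATE SECURE VIEW curated_db.share_objs.shipments_emea_v AS
SELECT order_id, shipment_status, carrier, updated_at
FROM curated_db.core.orders_curated
WHERE region = 'EMEA';

-- Deliver the governed view through a share; no second copy is created.
CREATE SHARE logistics_share;
GRANT USAGE ON DATABASE curated_db TO SHARE logistics_share;
GRANT USAGE ON SCHEMA curated_db.share_objs TO SHARE logistics_share;
GRANT SELECT ON VIEW curated_db.share_objs.shipments_emea_v TO SHARE logistics_share;
```

Because the share exposes the live curated table through the view, the partner sees updates within the existing refresh cadence.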
A task runs every 10 minutes and calls a Snowpark Python stored procedure that flattens VARIANT event payloads into a curated table. After a source schema change, the task now fails intermittently with cast errors on a small subset of rows. Engineers need to inspect those records, try revised parsing logic, and validate the output before changing the production pipeline. What is the best next step?
Options:
Best answer: A
Explanation: Snowflake notebooks are designed for interactive development and validation. In this case, the team needs to explore problematic rows, experiment with revised Snowpark parsing, and confirm the output before promoting a fix to the scheduled pipeline.
The core issue is transformation logic validation, not throughput or orchestration. Because the failures started after a schema change and only affect some records, engineers need an interactive workspace where they can query the bad payloads, run Snowpark or SQL step by step, and compare outputs on representative samples. A Snowflake notebook fits that workflow well and can use separate compute so troubleshooting stays isolated from production runs.
Changing warehouse size or CDC objects does not solve the need for iterative experimentation and validation.
Which Snowflake capability is specifically designed to provide consumers with controlled, read-only access to current provider data without copying or unloading the underlying data?
Options:
COPY INTO <location>
Best answer: C
Explanation: Secure data sharing is the Snowflake feature for live, controlled consumer access to provider data. It lets consumers query shared data directly while the provider manages what is exposed, without creating exported files or duplicated datasets.
The key concept is that secure data sharing is an access pattern, not a data movement pattern. When the main goal is to let another party query current data under provider control, Snowflake sharing is the native choice because it exposes shared objects directly to the consumer without unloading files or creating a separate copied dataset.
Replication is for maintaining copies across accounts or regions, and unloading with COPY INTO <location> is file export. External tables help query data stored outside Snowflake, but they do not provide Snowflake-native controlled sharing of live provider-managed objects.
If the requirement is controlled consumer access to up-to-date data, choose sharing rather than copy or export mechanisms.
COPY INTO <location> exports files and shifts access control to the storage location.
A data engineer reviews the current setup before altering a view.
ALTER TABLE sales.raw_orders
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (customer_id);
ALTER TABLE sales.raw_orders
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.ROW_COUNT ON ();
The new change request says to identify downstream objects that depend on sales.v_orders and which roles queried sensitive columns in the last 14 days. What is the best next step?
Options:
SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY and OBJECT_DEPENDENCIES.
QUERY_HISTORY.
Best answer: B
Explanation: The existing SQL adds data metric functions, which monitor data-quality signals such as nulls and row counts. The new requirement is different: it asks who accessed sensitive data and what downstream objects will be affected by a view change, so access history and dependency metadata are the right observability tools.
Data metric functions help detect data-quality issues such as missing values, row-count shifts, or freshness problems. They do not answer who queried specific data or which downstream objects depend on a view. In this scenario, the engineer needs governance observability and change-impact analysis.
SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY is used for detailed access auditing, including which queries and roles accessed objects and columns. Dependency metadata such as OBJECT_DEPENDENCIES is used to understand lineage-style relationships so the engineer can see which views or other objects may be impacted before changing sales.v_orders.
The key distinction is simple: data-quality monitoring answers whether the data looks healthy, while access history and dependency tracking answer who used it and what depends on it.
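Sketches of both lookups (view name from the stem; ACCESS_HISTORY stores accessed objects as a JSON array, so it is flattened here; exact columns should be checked against the ACCOUNT_USAGE documentation):

```sql
-- Which objects depend on sales.v_orders?
SELECT referencing_database, referencing_schema, referencing_object_name
FROM SNOWFLAKE.ACCOUNT_USAGE.OBJECT_DEPENDENCIES
WHERE referenced_schema = 'SALES'
  AND referenced_object_name = 'V_ORDERS';

-- Which roles touched which objects in the last 14 days?
SELECT DISTINCT qh.role_name, obj.value:objectName::STRING AS object_name
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY ah
JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY qh
  ON qh.query_id = ah.query_id,
LATERAL FLATTEN(input => ah.direct_objects_accessed) obj
WHERE ah.query_start_time >= DATEADD('day', -14, CURRENT_TIMESTAMP());
```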
A team receives small Parquet files continuously in cloud object storage. Their proposal is to run COPY INTO from an external stage every 30 minutes by using a task. Requirements: data must be available in a native Snowflake table within 5 minutes, no manual runs are allowed, and compute usage should stay low when no files arrive. Which design is BEST?
Options:
COPY INTO every 30 minutes with a task
Best answer: A
Explanation: The 5-minute SLA and no-manual-ops requirement make a 30-minute task-based batch pattern too slow and too operationally heavy. Snowpipe auto-ingest is designed for continuously arriving files in object storage and loads them into a native table without warehouse polling.
For continuously arriving files in cloud object storage, Snowpipe auto-ingest is the Snowflake-native fit when latency is measured in minutes, operations must be hands-off, and you do not want a warehouse polling for work. It uses cloud event notifications to detect new files on an external stage and loads them into a target table as files arrive. A task that runs COPY INTO every 30 minutes is still a polling batch pattern, so it misses the 5-minute SLA, and increasing the schedule frequency adds avoidable warehouse overhead. External tables query files in place rather than loading them into a native table, and Snowpipe Streaming is intended for streaming rows from producers rather than staged files. When the source is files and the requirement is near-real-time with low operational effort, choose event-driven file ingestion.
COPY INTO task remains a polling batch design, so 30-minute runs violate the 5-minute target and consume warehouse time just checking for files.
A data engineering team already publishes CDC events to Kafka and needs near-real-time ingestion into Snowflake. They want the lowest practical latency without writing and operating a custom ingestion service.
Exhibit:
Source: Apache Kafka topic `orders_cdc`
Target: table `RAW.ORDERS_CDC`
Latency target: under 10 seconds
Constraint: avoid custom application code
Which approach best fits the requirement?
Options:
COPY INTO every minute
Best answer: D
Explanation: The key requirements are an existing Kafka source, sub-10-second latency, and no custom ingestion application. The Snowflake Connector for Kafka is designed for exactly this pattern and uses Snowpipe Streaming to load records continuously.
The deciding factors are the Kafka source and the requirement to avoid custom code while keeping latency very low. The Snowflake Connector for Kafka is the best fit because it connects directly to Kafka topics and ingests records continuously into Snowflake by using Snowpipe Streaming. That gives a lower-latency, more operationally efficient pattern than file-based ingestion or scheduled batch loads.
A custom app built on the Snowpipe Streaming SDK could also be low latency, but it adds engineering and operational overhead that the exhibit explicitly says to avoid.
COPY INTO is a micro-batch approach and is usually not the best match for this latency target.
A data engineering team already builds transformations inside Snowflake with streams, tasks, and secure views. A separate delivery service must send a curated result set to a partner application every hour.
Exhibit:
POST /api/v2/statements
{
  "statement": "SELECT order_id, status, updated_at FROM partner_feed_v",
  "warehouse": "DELIVERY_WH",
  "role": "PARTNER_DELIVERY_ROLE"
}
What is the best interpretation of this exhibit in the overall design?
Options:
Best answer: A
Explanation: This exhibit shows an external service submitting a SQL statement to Snowflake through the SQL API. Because it selects from an already curated view using a delivery-specific warehouse and role, it belongs to the delivery/integration layer rather than the main transformation pipeline.
The core concept is separating transformation logic from delivery surfaces. Here, the exhibit is a SQL API request that reads from partner_feed_v, which implies the data has already been prepared inside Snowflake. That makes the API call part of how an external application or service consumes Snowflake data for downstream delivery.
Snowflake-native transformation and orchestration would typically live in objects such as streams, tasks, and secure views, the components this team already uses.
The SQL API is simply the access surface the delivery service uses to submit SQL and retrieve results. Scheduling an API caller outside Snowflake is not the same thing as defining the transformation workflow inside Snowflake.
A support application runs the following query thousands of times per hour with different order_id values. Engineers are debating a materialized view, clustering on order_id, or Query Acceleration Service.
Exhibit:
SELECT order_id, customer_id, status
FROM raw.orders
WHERE order_id = 'A123456789';
Query profile summary:
order_id values are randomly distributed
Which interpretation is best?
Options:
order_id.
order_id.
Best answer: D
Explanation: This workload is a classic selective point-lookup problem on a very large table. The query returns one row, but it still scans most partitions, so Search Optimization Service is the best fit. Clustering, materialized views, and Query Acceleration Service target different performance patterns.
Search Optimization Service is the right choice when a very large table is queried with highly selective equality predicates and only a few rows are returned. In the exhibit, the query returns 1 row but still scans 94% of partitions, which means the main problem is locating a tiny set of matching values efficiently. The randomly distributed order_id values also make clustering a poor fit, because clustering works best when data can be organized to improve pruning for common range or grouping patterns.
The key takeaway is to match the optimization to the access pattern shown by the query profile.
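Enabling it for this access pattern is a single statement (shown as a sketch; the EQUALITY clause scopes the optimization to the lookup column):

```sql
ALTER TABLE raw.orders
  ADD SEARCH OPTIMIZATION ON EQUALITY(order_id);
```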
A team is configuring cross-region disaster recovery and assumes the same setup will also keep a near-real-time pipeline current after failover.
Exhibit:
FAILOVER GROUP fg_orders
  OBJECT_TYPES = USERS, ROLES, WAREHOUSES, DATABASES
  ALLOWED_DATABASES = orders_db
  REPLICATION_SCHEDULE = '5 MINUTE'
orders_db receives events continuously, and downstream dashboards require data to be under 1 minute behind the source after failover.
What is the best interpretation of this design?
Options:
orders_db is included.
Best answer: B
Explanation: This setup is a continuity design, not a freshness guarantee. The failover group includes account objects and the database, but the 5-minute replication schedule means the secondary can still be behind when failover occurs.
Failover groups are used for business continuity across regions or clouds. In this exhibit, the protected scope includes account-level objects such as users, roles, and warehouses, plus the orders_db database, so the design addresses recovery and failover availability of those objects.
Pipeline freshness is a separate concern. The REPLICATION_SCHEDULE = '5 MINUTE' indicates asynchronous replication, so after failover the secondary may only contain data from the last completed replication cycle, not the most recent continuously ingested events. A near-real-time freshness target of under 1 minute is therefore not guaranteed by this configuration alone.
The key takeaway is to separate continuity scope from freshness objectives such as replication lag or RPO.
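A complete definition corresponding to the exhibit might look like the following sketch; the target account identifier is a placeholder, and `ALLOWED_ACCOUNTS` is required when creating a failover group in the primary account:

```sql
-- Hypothetical primary-account definition matching the exhibit.
-- myorg.dr_account is a placeholder for the secondary account.
CREATE FAILOVER GROUP fg_orders
  OBJECT_TYPES = USERS, ROLES, WAREHOUSES, DATABASES
  ALLOWED_DATABASES = orders_db
  ALLOWED_ACCOUNTS = myorg.dr_account
  REPLICATION_SCHEDULE = '5 MINUTE';

-- On the secondary, inspect the last refresh time to estimate replication lag.
SHOW FAILOVER GROUPS;
```

Even with this in place, data ingested after the last completed refresh cycle is not present on the secondary, which is why the schedule, not the group definition, bounds post-failover freshness.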
A retail team drops 5-20 MB CSV files into an external stage at unpredictable times throughout the day. A downstream stream and task read from a native landing table and require new rows within 2 minutes. The team wants minimal operational overhead and does not want to manage a warehouse just to poll for new files.
What is the best Snowflake design choice?
Options:
COPY INTO every minute on a dedicated warehouse

Best answer: A
Explanation: Snowpipe with auto-ingest is Snowflake’s continuous file ingestion pattern for files arriving irregularly in cloud storage. It supports near-real-time freshness and avoids a task-plus-warehouse polling design while loading directly into the native landing table.
The key distinction is bulk loading versus continuous file ingestion. When files arrive unpredictably throughout the day and freshness is measured in minutes, Snowpipe is the Snowflake-native fit for automatically loading new staged files into a table. With auto-ingest, cloud storage events trigger file loading as files land, so the team does not need to run scheduled polling logic or size a warehouse just for ingestion.
This also matches the requirement that downstream processing reads from a native landing table. A frequent COPY INTO remains a polling-based batch pattern, even if scheduled every minute. Snowpipe Streaming is designed for row-based streaming producers rather than files already written to cloud storage. The deciding factor is the file arrival pattern plus the freshness target.
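A pipe definition for this pattern might look like the sketch below; the stage, table, and pipe names are placeholders, and `AUTO_INGEST = TRUE` assumes cloud storage event notifications (for example, S3 events routed to SQS) have been configured for the stage's location:

```sql
-- Hypothetical pipe loading new CSV files as they land in the external stage.
CREATE PIPE retail_csv_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO landing.raw_orders
  FROM @retail_ext_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```

Because Snowpipe uses serverless compute, no user-managed warehouse needs to be sized or resumed for ingestion, which satisfies the low-operational-overhead requirement directly.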
COPY INTO is still a batch or micro-batch pattern and requires warehouse management for recurring loads.

A data engineering team loads partner CSV files through Snowpipe into raw tables and runs tasks every 5 minutes to populate curated tables. Analysts need to know within 10 minutes if load freshness degrades or if customer_id null rates spike, but the team does not want to redesign the existing task graph. Which Snowflake feature choice is the BEST fit?
Options:
Best answer: B
Explanation: Data metric functions are built to measure table-level quality and freshness without changing the ingestion or transformation pattern. Pairing them with an alert lets the team detect null spikes or stale loads within the required window.
This is an observability requirement, not a reason to redesign the pipeline. Data metric functions let you attach data quality or freshness checks to the curated table itself, such as monitoring null behavior for customer_id or whether recent data arrived on time. An alert can then evaluate those metric results on a schedule and notify the team quickly when thresholds are breached. That adds early detection while preserving the current Snowpipe ingestion and task-based transformation design. The closest alternative is adding a stream and validation task, but that pushes custom monitoring logic into the pipeline and increases operational complexity.
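Wiring this up might look like the following sketch. The table, integration, and threshold values are placeholders; `SNOWFLAKE.CORE.NULL_COUNT` is a built-in system data metric function, and metric results land in the `SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS` view:

```sql
-- Attach a built-in null-count metric to the curated table (names are placeholders).
ALTER TABLE curated.customers
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (customer_id);

-- Evaluate attached metrics every 5 minutes.
ALTER TABLE curated.customers
  SET DATA_METRIC_SCHEDULE = '5 MINUTE';

-- Serverless alert that notifies the team when null counts breach a threshold.
CREATE ALERT customer_id_null_alert
  SCHEDULE = '5 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
    WHERE table_name = 'CUSTOMERS'
      AND metric_name = 'NULL_COUNT'
      AND value > 100           -- hypothetical threshold
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'my_email_integration', 'team@example.com',
    'Data quality alert', 'customer_id null count exceeded threshold');
```

Nothing in the existing Snowpipe-plus-tasks graph changes; the checks and the alert sit alongside the pipeline rather than inside it.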
A 25 TB events table supports an operational dashboard. Most queries use an equality filter on one of several high-cardinality columns such as event_id, session_id, or user_email, return fewer than 20 rows, and the query profile shows many micro-partitions scanned. Which performance-improvement action best aligns with this symptom?
Options:
Best answer: D
Explanation: Search Optimization Service best fits selective equality lookups on very large tables when result sets are tiny but scans are still broad. It adds search access paths so Snowflake can locate matching rows more efficiently than relying only on normal micro-partition pruning.
The key is matching the optimization to the query pattern. Search Optimization Service is intended for selective lookup queries on large tables, especially equality predicates on high-cardinality columns where only a few rows are returned. In this scenario, the symptom is not heavy aggregation or underpowered compute; it is inefficient row location across many micro-partitions. Search optimization adds persistent access paths that help Snowflake find matching values faster and reduce unnecessary scanning.
Query acceleration is more appropriate when a query still must process a large amount of data after pruning. Clustering is more useful for predictable range-based pruning patterns, and materialized views help when the same transformed or aggregated result is repeatedly reused. For point-lookups across several high-cardinality columns, search optimization is the best-aligned action.
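For this multi-column lookup pattern, the feature can be scoped to exactly the columns used in the equality filters; the table and column names below come from the scenario:

```sql
-- Search access paths limited to the high-cardinality lookup columns.
ALTER TABLE events ADD SEARCH OPTIMIZATION
  ON EQUALITY(event_id, session_id, user_email);
```

Listing the columns explicitly avoids paying maintenance costs for columns that are never used as point-lookup predicates.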