SnowPro DEA-C02 sample questions, mock-exam practice, and simulator access with detailed explanations in IT Mastery on web, iOS, and Android.
Snowflake SnowPro Advanced: Data Engineer (DEA-C02) targets pipeline design, transformations, streaming, delivery, governance, and performance judgment for teams building serious Snowflake data-engineering workflows. If you are searching for DEA-C02 sample questions, a practice test, mock exam, or exam simulator, this is the main IT Mastery page to start on web and continue on iOS or Android with the same account.
Start a practice session for Snowflake SnowPro Advanced: Data Engineer (DEA-C02) below. For the best experience, open the full app in a new tab and navigate with swipes/gestures or the mouse wheel—just like on your phone or tablet.
Open Full App in a New Tab
A small set of questions is available for free preview. Subscribers can unlock full access by signing in with the same account used on mobile.
Prefer to practice on your phone or tablet? Download the IT Mastery – AWS, Azure, GCP & CompTIA exam prep app for iOS or IT Mastery app on Google Play (Android) and then sign in with the same account on web to continue your sessions on desktop.
Scenario-heavy Snowflake Advanced Data Engineer practice with inferred domain weighting and companion formats aligned to the current 65-question DEA-C02 exam.
| Domain | Weight |
|---|---|
| Data sourcing, storage, and ingestion | 22% |
| Transformations, programmability, and developer workflows | 24% |
| Streaming, orchestration, and near real-time pipeline design | 20% |
| Sharing, replication, and cross-platform delivery | 18% |
| Compute, governance, observability, and performance | 16% |
These sample questions are drawn from the current local bank for this exact exam code. Use them to check your readiness here, then continue into the full IT Mastery question bank for broader timed coverage.
A provider currently delivers a nightly file snapshot to a partner-managed stage.
COPY INTO @partner_stage/sales/
FROM prod.curated.sales_orders
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE;
The partner already has its own Snowflake account in the same region. The provider now wants the partner to query the latest data continuously while the provider retains control over access and can revoke it at any time. What is the best next step?
Options:
Best answer: C
Explanation: The requirement is ongoing access to current data without handing off a separate copy. Secure Data Sharing fits because the consumer queries provider-managed shared objects directly, and the provider can revoke access at any time.
Secure Data Sharing is the best fit when a consumer already has its own Snowflake account and needs ongoing access to current provider data. The consumer queries shared objects exposed from the provider account, so updates are visible without unloading files or maintaining a separate replicated dataset. That also preserves provider-side control because the provider decides which objects are shared and can change or revoke access centrally.
The key distinction here is live shared access versus delivering or maintaining copies.
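As a rough sketch of the provider-side setup (the share name and consumer account identifier are hypothetical), the provider grants the curated table to a share and adds the partner account:

```sql
-- Hypothetical share name and consumer account identifier.
CREATE SHARE partner_share;
GRANT USAGE ON DATABASE prod TO SHARE partner_share;
GRANT USAGE ON SCHEMA prod.curated TO SHARE partner_share;
GRANT SELECT ON TABLE prod.curated.sales_orders TO SHARE partner_share;

-- The consumer account can be removed at any time to revoke access.
ALTER SHARE partner_share ADD ACCOUNTS = partner_org.partner_account;
```

The partner then creates a database from the share and queries the provider-managed objects directly, with no files unloaded and no second copy to maintain.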
A task that populates CURATED_EVENTS now fails at compile time. The team wrote the SQL below, but normalize_agent was created as a Python stored procedure. The logic returns one normalized string for each input row, does not need external network access, and analysts also need to call it directly in ad hoc SQL queries.
INSERT INTO curated_events
SELECT event_id,
       normalize_agent(user_agent) AS ua_norm
FROM raw_events;
What is the best next step?
Options:
normalize_agent as a Python UDF
normalize_agent as a Python UDTF
Best answer: B
Explanation: This is a SQL-callable logic problem, not a Snowpark pipeline problem. Because the code must return one scalar value per row inside a SELECT and be reusable in SQL, a Python UDF fits directly.
The core issue is choosing the right SQL-integrated callable object. The task needs logic that can be invoked inside INSERT ... SELECT, produce exactly one value for each input row, and remain reusable in ad hoc SQL. That is the design point of a scalar UDF, including a Python UDF when the business logic is best written in Python.
Stored procedures are invoked with CALL and are intended for orchestration or multi-step operations, not inline row-by-row expressions. A UDTF is only appropriate when the code returns a table. Snowpark is useful for full DataFrame-style programs, but it is not the best fit when the requirement is a SQL expression callable directly from queries. External functions are for remote service calls, which the stem explicitly does not require.
When SQL needs to call custom logic inline, pick the callable object that matches the SQL usage pattern and output shape.
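A minimal sketch of a scalar Python UDF for this pattern (the normalization logic shown is a placeholder, not the team's actual code):

```sql
CREATE OR REPLACE FUNCTION normalize_agent(ua STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'
HANDLER = 'normalize'
AS
$$
def normalize(ua):
    # Placeholder logic: lowercase and collapse whitespace.
    return ' '.join(ua.lower().split()) if ua is not None else None
$$;
```

Because it is a scalar UDF, the failing INSERT ... SELECT works unchanged, and analysts can call normalize_agent(user_agent) in ad hoc queries.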
A Snowpipe auto-ingests partner CSV files with a header row into orders_raw, a typed table queried directly by downstream SQL. The partner periodically adds nullable columns and sometimes reorders existing columns. Pipe history now shows load failures from column mismatches, and the team wants unattended ingestion to keep working whenever compatible columns are introduced. What is the best next step?
Options:
ON_ERROR = CONTINUE
VARIANT column
ALTER TABLE before each load
Best answer: D
Explanation: The failures come from positional loading against a table whose source columns now move and grow over time. The most reliable Snowflake-native fix is to enable schema evolution on the target and have the pipe match columns by name so compatible additions can be absorbed automatically.
This scenario is a classic schema-drift problem in file ingestion. Because the source sends CSV headers, sometimes reorders fields, and occasionally adds nullable columns, the robust fix is to keep the relational target but enable Snowflake schema evolution and use name-based column matching in the pipe or COPY INTO logic. That allows Snowflake to map reordered columns correctly and add compatible new columns without manual intervention.
Landing everything as VARIANT is flexible, but it changes the downstream contract when consumers already depend on typed columns.
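A sketch of the name-based, evolution-enabled load path (table, stage, and pipe names are hypothetical; for CSV, matching by name requires a header-parsing file format):

```sql
-- Let compatible new columns be added to the target automatically.
ALTER TABLE orders_raw SET ENABLE_SCHEMA_EVOLUTION = TRUE;

-- Match columns by header name instead of position.
CREATE OR REPLACE PIPE orders_pipe AUTO_INGEST = TRUE AS
COPY INTO orders_raw
FROM @partner_stage
FILE_FORMAT = (TYPE = CSV, PARSE_HEADER = TRUE)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```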
VARIANT landing handles drift, but it changes the downstream table contract instead of preserving the typed relational load.
ON_ERROR = CONTINUE may skip problematic data, but it does not solve reordered columns or compatible schema additions.
A data engineer says the JSON load format is wrong because one field returns NULL. The data is already stored in Snowflake as RAW_ORDERS(src VARIANT).
Exhibit:
SELECT
    src:order_id::NUMBER AS order_id,
    src:items.sku::STRING AS sku
FROM RAW_ORDERS;
Sample src value already in the table:
{"order_id":101,"items":[{"sku":"A1"},{"sku":"B2"}]}
order_id is populated, but sku is NULL for every row. What is the best interpretation?
Options:
items is an array and should be indexed or flattened.
MATCH_BY_COLUMN_NAME to extract sku.
STRIP_OUTER_ARRAY = TRUE.
Best answer: A
Explanation: Because order_id is extracted successfully from the same VARIANT column, the data was loaded and parsed correctly. The problem is the query pattern: items is an array of objects, so you must index into it or use LATERAL FLATTEN to get each sku.
This is a semi-structured transformation problem, not an ingestion-format problem. In Snowflake, once JSON is stored in a VARIANT column, object fields and array elements are accessed with path notation. Here, order_id works because it is a top-level scalar field. The items field is an array, so src:items.sku does not iterate through its elements.
Use one of these patterns:
- src:items[0].sku to read a specific element
- LATERAL FLATTEN(input => src:items) to return one row per item

STRIP_OUTER_ARRAY is a load-time setting for source files whose top-level content is an array, not for nested arrays already stored inside VARIANT. The key takeaway is to fix the semi-structured query logic before changing ingestion settings.
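For example, the flattened form of the failing query would be (same table and fields as the exhibit):

```sql
SELECT
    src:order_id::NUMBER   AS order_id,
    item.value:sku::STRING AS sku
FROM RAW_ORDERS,
     LATERAL FLATTEN(input => src:items) item;
```

With the sample src value, this returns two rows for order 101: one with sku A1 and one with sku B2.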
STRIP_OUTER_ARRAY fails because the nested array is already inside a successfully loaded VARIANT row.
MATCH_BY_COLUMN_NAME applies during loading into table columns, not when traversing arrays in existing semi-structured data.
A source system drops new CSV files into a stage at unpredictable times all day. Data must be loaded into a Snowflake table within a few minutes of file arrival, and the team wants to avoid running scheduled batch load statements. Which Snowflake feature best fits this requirement?
Options:
Snowpipe
COPY INTO every night
Snowpipe Streaming
Best answer: B
Explanation: Snowpipe is the Snowflake-native choice for continuous file ingestion when files arrive throughout the day and freshness is measured in minutes. Bulk loading with COPY INTO is better for scheduled batch loads, not event-driven file arrival.
The key concept is matching the ingestion mechanism to the file arrival pattern and freshness target. When files land unpredictably and should be loaded soon after arrival, Snowpipe is built for continuous file ingestion from a stage into a Snowflake table. It reduces the need to manage repeated polling or scheduled batch jobs.
By contrast, bulk loading with COPY INTO is typically used when files are loaded in larger batches on a defined schedule. Snowpipe Streaming is for streaming row-based data into Snowflake without relying on staged files, and external tables let you query files in place rather than ingesting them into a native table.
A minutes-level freshness requirement for arriving files points to Snowpipe, not a batch COPY INTO pattern.
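A minimal auto-ingest sketch (stage, table, and pipe names are hypothetical; auto-ingest also needs a cloud event notification wired to the pipe's notification channel):

```sql
CREATE PIPE load_events_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO raw_events
FROM @incoming_stage
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1);
```

Once the notification is in place, files are loaded shortly after arrival with no warehouse scheduled to poll for them.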
COPY INTO job is bulk loading, not continuous ingestion.
Snowpipe Streaming targets streamed records, not staged files arriving as CSV objects.
A team uses Snowpipe Streaming to land support tickets into RAW_TICKETS. They need each new ticket routed to one of five categories and a short summary available within 5 minutes for an operations dashboard. Ticket text must remain in Snowflake, and the team wants to avoid a separate model-training project. What is the best design choice?
Options:
Best answer: C
Explanation: Snowflake Cortex AI functions fit lightweight text-enrichment tasks such as classification and summarization without turning the solution into a full ML project. Using a stream and task keeps the workflow near real time, inside Snowflake, and operationally simple.
The key concept is choosing Snowflake-native AI functions for an engineering workflow when the requirement is straightforward text enrichment rather than model development. Here, the team already has continuous ingestion, needs category labels and short summaries quickly, must keep ticket text in Snowflake, and wants low operational overhead. A stream captures newly arrived rows, and a task can process those rows every few minutes with Snowflake Cortex AI functions in SQL, writing results to an enriched table for the dashboard.
This meets the latency target while avoiding the extra training, deployment, and data-movement complexity of a separate ML solution.
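A sketch of the stream-plus-task enrichment loop (object names, columns, and the category list are hypothetical; the AI calls use the SNOWFLAKE.CORTEX SQL functions):

```sql
CREATE STREAM tickets_stream ON TABLE raw_tickets;

CREATE TASK enrich_tickets
  WAREHOUSE = ops_wh
  SCHEDULE = '2 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('TICKETS_STREAM')
AS
INSERT INTO tickets_enriched
SELECT ticket_id,
       SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
           body, ['billing', 'outage', 'login', 'shipping', 'other']
       ):label::STRING AS category,
       SNOWFLAKE.CORTEX.SUMMARIZE(body) AS summary
FROM tickets_stream;
```

Ticket text never leaves Snowflake, and the only moving parts are a stream, a task, and two SQL function calls.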
A team built a three-layer dynamic table pipeline: dt_clean -> dt_enriched -> dt_mart. Each dynamic table uses TARGET_LAG = '5 minutes'. Source data lands every 2 minutes, but the only business requirement is that dt_mart be no more than 15 minutes behind the source. Query history shows frequent refresh overlap and warehouse queuing. They want the simplest Snowflake-native fix. What should they do?
Options:
dt_mart to a materialized view
DOWNSTREAM upstream and 15 minutes on dt_mart
5 minutes and resize the warehouse
Best answer: B
Explanation: The pipeline is refreshing more often than the business actually requires. Setting intermediate dynamic tables to DOWNSTREAM and keeping the explicit freshness target only on the final table lets Snowflake coordinate refreshes across the graph while reducing overlap and compute pressure.
Dynamic tables are designed for declarative freshness management. In a multi-layer graph, giving every layer an aggressive explicit TARGET_LAG can cause upstream tables to refresh more often than the final consumer needs, which increases compute usage and can lead to warehouse queuing.
When the real SLA is only on the final output, a better design is to set intermediate dynamic tables to TARGET_LAG = DOWNSTREAM and give the final dynamic table the required lag, here 15 minutes. Snowflake then refreshes dependencies based on downstream need instead of treating each layer as independently urgent.
This keeps the pipeline simpler than rebuilding it with streams and tasks, while aligning refresh effort to the actual freshness requirement. Increasing warehouse size alone may reduce symptoms, but it does not fix the over-refresh design.
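The fix reduces to three statements (table names from the stem; shown as a sketch):

```sql
-- Intermediate layers refresh only when a downstream table needs them.
ALTER DYNAMIC TABLE dt_clean    SET TARGET_LAG = DOWNSTREAM;
ALTER DYNAMIC TABLE dt_enriched SET TARGET_LAG = DOWNSTREAM;

-- The only real SLA lives on the final table.
ALTER DYNAMIC TABLE dt_mart     SET TARGET_LAG = '15 minutes';
```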
A Snowpark stored procedure invokes several SQL statements and external function calls. The team sees intermittent latency spikes and needs visibility into the execution path and timing of each step within a single run. Which Snowflake capability is most relevant?
Options:
Best answer: C
Explanation: Tracing is designed to follow request flow and span timing across a single execution. When the requirement is to diagnose where latency occurs inside one Snowpark procedure run, tracing provides the needed path-level visibility.
Tracing is the best fit when an operational visibility requirement is about understanding how a single execution moved through multiple steps and how long each step took. In this case, the team needs per-step timing and call-path detail across SQL work and external function calls inside one stored procedure run. Logging records emitted events or messages, and alerts evaluate a condition to notify or act, but neither one reconstructs end-to-end execution flow. Data metric functions focus on data quality measurements on table data, not procedural runtime behavior.
If the need were simply to notify on failures or SLA breaches, alerts would be more appropriate.
A central Snowflake team distributes customer data to Finance and Sales in separate consumer accounts. To prepare for later cross-region delivery, an architect proposes distributing only raw tables and letting each team maintain its own secure views and task logic locally; analysts already report different active_customer counts, and each schema change triggers fixes in multiple accounts. Which next step best addresses the problem while preserving maintainability and cross-team clarity?
Options:
Best answer: D
Explanation: The failure is semantic drift, not data movement. When each consumer account owns its own view logic, KPI definitions diverge and every upstream change must be implemented multiple times; a producer-managed curated layer keeps one authoritative definition for all teams.
When the stated need is maintainability and cross-team clarity, distributing raw tables and letting every team build its own secure views is the wrong architecture. Even if shared or replicated data is current, local modeling creates duplicate business logic, conflicting KPI definitions, and repeated remediation after every schema change. A better Snowflake pattern is to publish a producer-managed curated database, often with standardized secure views, and deliver that same governed layer to each consumer. Then business-rule updates and schema adjustments happen once, and all teams see the same meaning for active_customer. Replication can still help later with regional delivery, but it should move the curated layer rather than multiply independent semantic definitions.
An insurer is onboarding policy and claims data from six operational systems into Snowflake. Requirements: retain full source history with source-system traceability for 7 years, tolerate frequent source schema changes, and publish different consumer-specific datasets through secure sharing. A 5-minute refresh is sufficient. The team is debating Snowpipe versus Snowpipe Streaming. What is the BEST next action?
Options:
Best answer: D
Explanation: The deciding factor is long-term integration design, not sub-minute ingest mechanics. A Data Vault-style core layer best fits full history, source traceability, and flexible downstream sharing, while a 5-minute SLA leaves multiple acceptable ingestion options.
When the requirements center on history retention, source traceability, schema-change tolerance, and reuse across many downstream products, the primary design decision is the core data model. In Snowflake, a Data Vault-style core layer is a strong fit because it preserves historized relationships and business-key lineage while letting teams publish different marts or secure shares without redesigning ingestion each time.
Choosing between Snowpipe and Snowpipe Streaming matters only after the architecture is clear; with a 5-minute target, immediate load mechanics are not the main constraint.
A data engineering team runs CDC every minute with a task:
SELECT apply_cdc();
apply_cdc() is a Python UDF intended to read a stream, run two MERGE statements into target tables, and insert audit rows. Task history shows failures when the function tries to execute SQL. What is the best next step?
Options:
apply_cdc to a UDTF and keep the MERGE logic there
apply_cdc as a stored procedure and CALL it from the task
Best answer: C
Explanation: This is a programmability-boundary problem, not a compute problem. UDFs are for reusable computation that returns values inside SQL, while stored procedures are for procedural workflows that execute multi-step DML and can be invoked by tasks.
In Snowflake, use a UDF when you need reusable logic that computes and returns a value to a SQL statement, such as normalizing text or deriving a column from inputs. This scenario is different: the routine must coordinate CDC processing by reading change data, running multiple MERGE statements, and writing audit records. That is procedural orchestration with database side effects, which fits a stored procedure.
A task can CALL a stored procedure so the procedure can manage multi-statement SQL, control flow, and status reporting for the CDC step. A UDF or UDTF may help transform data returned to a query, but they are not the right abstraction for orchestrating multi-target updates and audit writes.
The key takeaway is to use UDFs for expression-level computation and stored procedures for operational pipeline steps.
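A condensed sketch of the corrected shape (one MERGE shown; the stream, target, and audit names are hypothetical):

```sql
CREATE OR REPLACE PROCEDURE apply_cdc()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
    -- First of the two MERGE statements; the second follows the same pattern.
    MERGE INTO dim_orders t
    USING orders_stream s ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.status = s.status
    WHEN NOT MATCHED THEN INSERT (order_id, status) VALUES (s.order_id, s.status);

    INSERT INTO cdc_audit VALUES (CURRENT_TIMESTAMP(), 'apply_cdc', 'ok');
    RETURN 'ok';
END;
$$;

-- The task invokes the procedure instead of selecting a UDF.
ALTER TASK cdc_task MODIFY AS CALL apply_cdc();
```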
A vendor delivers small CSV files to an external stage at irregular times. Snowflake currently loads them with a scheduled task:
CREATE TASK load_vendor_files
  SCHEDULE = '15 MINUTE'
AS
COPY INTO raw.vendor_events
FROM @vendor_stage;
Dashboards are often 10-15 minutes behind. The SLA is data available in under 1 minute after each file arrives, and the team wants the most Snowflake-native, low-operations approach while keeping file-based delivery. What is the best next step?
Options:
COPY INTO loads.
Best answer: A
Explanation: Snowpipe is designed for continuous, file-based ingestion when files land in a stage and low operational overhead is required. It fits the under-1-minute target better than a polling task because it loads on file arrival instead of waiting for the next schedule.
The deciding concept is matching the ingestion mechanism to the delivery model and latency target. The source is still delivering files, so the best Snowflake-native fix is to move from scheduled COPY INTO polling to Snowpipe auto-ingest, which uses event-driven loading when new files arrive in the stage.
This fits the stated requirements:
Snowpipe Streaming is for direct row ingestion from applications or streaming clients, not for a vendor that is already sending files. The key takeaway is that continuous file arrival plus a near-real-time SLA usually points to Snowpipe, not scheduled tasks.
COPY INTO execution instead of event-driven file ingestion.
A company ingests order events into Snowflake with Snowpipe Streaming and refreshes a curated ORDERS_CURATED table every few minutes. An external logistics partner already has its own Snowflake account and needs query access with under 5-minute latency. The partner must see only shipment-related columns for its assigned region, and the provider wants to avoid copying data into another database. Which Snowflake design is the BEST fit?
Options:
Best answer: C
Explanation: The best choice is to enforce least privilege in the provider account with a secure view, then deliver that governed dataset through sharing. This meets the low-latency requirement and avoids creating another physical copy of the data.
When a scenario mixes consumer delivery with governance, first decide how the provider will restrict what the consumer can see. In Snowflake, a secure view can expose only the required shipment columns and filter rows to the partner’s allowed region. That governed object can then be delivered through Secure Data Sharing to the partner’s existing Snowflake account.
This design fits the stated constraints:
Replication is mainly for data movement, availability, or disaster recovery, not for minimizing consumer exposure. The key takeaway is to govern the shared dataset first, then choose the delivery mechanism.
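A governed-first sketch (database, schema, view, and filter values are hypothetical):

```sql
-- Expose only shipment columns for the partner's region.
CREATE SECURE VIEW curated_db.share_objs.shipments_emea_v AS
SELECT order_id, shipment_status, carrier, updated_at
FROM curated_db.core.orders_curated
WHERE region = 'EMEA';

-- Deliver the governed view through a share; no second copy is created.
CREATE SHARE logistics_share;
GRANT USAGE ON DATABASE curated_db TO SHARE logistics_share;
GRANT USAGE ON SCHEMA curated_db.share_objs TO SHARE logistics_share;
GRANT SELECT ON VIEW curated_db.share_objs.shipments_emea_v TO SHARE logistics_share;
```

Because the share exposes the live curated table through the view, the partner sees updates within the existing refresh cadence.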
A task runs every 10 minutes and calls a Snowpark Python stored procedure that flattens VARIANT event payloads into a curated table. After a source schema change, the task now fails intermittently with cast errors on a small subset of rows. Engineers need to inspect those records, try revised parsing logic, and validate the output before changing the production pipeline. What is the best next step?
Options:
Best answer: A
Explanation: Snowflake notebooks are designed for interactive development and validation. In this case, the team needs to explore problematic rows, experiment with revised Snowpark parsing, and confirm the output before promoting a fix to the scheduled pipeline.
The core issue is transformation logic validation, not throughput or orchestration. Because the failures started after a schema change and only affect some records, engineers need an interactive workspace where they can query the bad payloads, run Snowpark or SQL step by step, and compare outputs on representative samples. A Snowflake notebook fits that workflow well and can use separate compute so troubleshooting stays isolated from production runs.
Changing warehouse size or CDC objects does not solve the need for iterative experimentation and validation.
Which Snowflake capability is specifically designed to provide consumers with controlled, read-only access to current provider data without copying or unloading the underlying data?
Options:
COPY INTO <location>
Best answer: C
Explanation: Secure data sharing is the Snowflake feature for live, controlled consumer access to provider data. It lets consumers query shared data directly while the provider manages what is exposed, without creating exported files or duplicated datasets.
The key concept is that secure data sharing is an access pattern, not a data movement pattern. When the main goal is to let another party query current data under provider control, Snowflake sharing is the native choice because it exposes shared objects directly to the consumer without unloading files or creating a separate copied dataset.
Replication is for maintaining copies across accounts or regions, and unloading with COPY INTO <location> is file export. External tables help query data stored outside Snowflake, but they do not provide Snowflake-native controlled sharing of live provider-managed objects.
If the requirement is controlled consumer access to up-to-date data, choose sharing rather than copy or export mechanisms.
COPY INTO <location> exports files and shifts access control to the storage location.
A data engineer reviews the current setup before altering a view.
ALTER TABLE sales.raw_orders
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (customer_id);
ALTER TABLE sales.raw_orders
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.ROW_COUNT ON ();
The new change request says to identify downstream objects that depend on sales.v_orders and which roles queried sensitive columns in the last 14 days. What is the best next step?
Options:
SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY and OBJECT_DEPENDENCIES.
QUERY_HISTORY.
Best answer: B
Explanation: The existing SQL adds data metric functions, which monitor data-quality signals such as nulls and row counts. The new requirement is different: it asks who accessed sensitive data and what downstream objects will be affected by a view change, so access history and dependency metadata are the right observability tools.
Data metric functions help detect data-quality issues such as missing values, row-count shifts, or freshness problems. They do not answer who queried specific data or which downstream objects depend on a view. In this scenario, the engineer needs governance observability and change-impact analysis.
SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY is used for detailed access auditing, including which queries and roles accessed objects and columns. Dependency metadata such as OBJECT_DEPENDENCIES is used to understand lineage-style relationships so the engineer can see which views or other objects may be impacted before changing sales.v_orders.
The key distinction is simple: data-quality monitoring answers whether the data looks healthy, while access history and dependency tracking answer who used it and what depends on it.
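Sketches of both lookups (view name from the stem; ACCESS_HISTORY stores accessed objects as a JSON array, so it is flattened here; exact columns should be checked against the ACCOUNT_USAGE documentation):

```sql
-- Which objects depend on sales.v_orders?
SELECT referencing_database, referencing_schema, referencing_object_name
FROM SNOWFLAKE.ACCOUNT_USAGE.OBJECT_DEPENDENCIES
WHERE referenced_schema = 'SALES'
  AND referenced_object_name = 'V_ORDERS';

-- Which roles touched which objects in the last 14 days?
SELECT DISTINCT qh.role_name, obj.value:objectName::STRING AS object_name
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY ah
JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY qh
  ON qh.query_id = ah.query_id,
LATERAL FLATTEN(input => ah.direct_objects_accessed) obj
WHERE ah.query_start_time >= DATEADD('day', -14, CURRENT_TIMESTAMP());
```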
A team receives small Parquet files continuously in cloud object storage. Their proposal is to run COPY INTO from an external stage every 30 minutes by using a task. Requirements: data must be available in a native Snowflake table within 5 minutes, no manual runs are allowed, and compute usage should stay low when no files arrive. Which design is BEST?
Options:
COPY INTO every 30 minutes with a task
Best answer: A
Explanation: The 5-minute SLA and no-manual-ops requirement make a 30-minute task-based batch pattern too slow and too operationally heavy. Snowpipe auto-ingest is designed for continuously arriving files in object storage and loads them into a native table without warehouse polling.
For continuously arriving files in cloud object storage, Snowpipe auto-ingest is the Snowflake-native fit when latency is measured in minutes, operations must be hands-off, and you do not want a warehouse polling for work. It uses cloud event notifications to detect new files on an external stage and loads them into a target table as files arrive. A task that runs COPY INTO every 30 minutes is still a polling batch pattern, so it misses the 5-minute SLA, and increasing the schedule frequency adds avoidable warehouse overhead. External tables query files in place rather than loading them into a native table, and Snowpipe Streaming is intended for streaming rows from producers rather than staged files. When the source is files and the requirement is near-real-time with low operational effort, choose event-driven file ingestion.
COPY INTO task remains a polling batch design, so 30-minute runs violate the 5-minute target and consume warehouse time just checking for files.
A data engineering team already publishes CDC events to Kafka and needs near-real-time ingestion into Snowflake. They want the lowest practical latency without writing and operating a custom ingestion service.
Exhibit:
Source: Apache Kafka topic `orders_cdc`
Target: table `RAW.ORDERS_CDC`
Latency target: under 10 seconds
Constraint: avoid custom application code
Which approach best fits the requirement?
Options:
COPY INTO every minute
Best answer: D
Explanation: The key requirements are an existing Kafka source, sub-10-second latency, and no custom ingestion application. The Snowflake Connector for Kafka is designed for exactly this pattern and uses Snowpipe Streaming to load records continuously.
The deciding factors are the Kafka source and the requirement to avoid custom code while keeping latency very low. The Snowflake Connector for Kafka is the best fit because it connects directly to Kafka topics and ingests records continuously into Snowflake by using Snowpipe Streaming. That gives a lower-latency, more operationally efficient pattern than file-based ingestion or scheduled batch loads.
A custom app built on the Snowpipe Streaming SDK could also be low latency, but it adds engineering and operational overhead that the exhibit explicitly says to avoid.
COPY INTO is a micro-batch approach and is usually not the best match for this latency target.
A data engineering team already builds transformations inside Snowflake with streams, tasks, and secure views. A separate delivery service must send a curated result set to a partner application every hour.
Exhibit:
POST /api/v2/statements
{
  "statement": "SELECT order_id, status, updated_at FROM partner_feed_v",
  "warehouse": "DELIVERY_WH",
  "role": "PARTNER_DELIVERY_ROLE"
}
What is the best interpretation of this exhibit in the overall design?
Options:
Best answer: A
Explanation: This exhibit shows an external service submitting a SQL statement to Snowflake through the SQL API. Because it selects from an already curated view using a delivery-specific warehouse and role, it belongs to the delivery/integration layer rather than the main transformation pipeline.
The core concept is separating transformation logic from delivery surfaces. Here, the exhibit is a SQL API request that reads from partner_feed_v, which implies the data has already been prepared inside Snowflake. That makes the API call part of how an external application or service consumes Snowflake data for downstream delivery.
Snowflake-native transformation and orchestration would typically live in objects such as streams, tasks, and secure views, the components this team already uses.
The SQL API is simply the access surface the delivery service uses to submit SQL and retrieve results. Scheduling an API caller outside Snowflake is not the same thing as defining the transformation workflow inside Snowflake.
A support application runs the following query thousands of times per hour with different order_id values. Engineers are debating a materialized view, clustering on order_id, or Query Acceleration Service.
Exhibit:
SELECT order_id, customer_id, status
FROM raw.orders
WHERE order_id = 'A123456789';
Query profile summary:
order_id values are randomly distributed
Which interpretation is best?
Options:
order_id.
order_id.
Best answer: D
Explanation: This workload is a classic selective point-lookup problem on a very large table. The query returns one row, but it still scans most partitions, so Search Optimization Service is the best fit. Clustering, materialized views, and Query Acceleration Service target different performance patterns.
Search Optimization Service is the right choice when a very large table is queried with highly selective equality predicates and only a few rows are returned. In the exhibit, the query returns 1 row but still scans 94% of partitions, which means the main problem is locating a tiny set of matching values efficiently. The randomly distributed order_id values also make clustering a poor fit, because clustering works best when data can be organized to improve pruning for common range or grouping patterns.
The key takeaway is to match the optimization to the access pattern shown by the query profile.
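Enabling it for this access pattern is a single statement (shown as a sketch; the EQUALITY clause scopes the optimization to the lookup column):

```sql
ALTER TABLE raw.orders
  ADD SEARCH OPTIMIZATION ON EQUALITY(order_id);
```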
A team is configuring cross-region disaster recovery and assumes the same setup will also keep a near-real-time pipeline current after failover.
Exhibit:
FAILOVER GROUP fg_orders
  OBJECT_TYPES = USERS, ROLES, WAREHOUSES, DATABASES
  ALLOWED_DATABASES = orders_db
  REPLICATION_SCHEDULE = '5 MINUTE'
orders_db receives events continuously, and downstream dashboards require data to be under 1 minute behind the source after failover.
What is the best interpretation of this design?
Options:
orders_db is included.
Best answer: B
Explanation: This setup is a continuity design, not a freshness guarantee. The failover group includes account objects and the database, but the 5-minute replication schedule means the secondary can still be behind when failover occurs.
Failover groups are used for business continuity across regions or clouds. In this exhibit, the protected scope includes account-level objects such as users, roles, and warehouses, plus the orders_db database, so the design addresses recovery and failover availability of those objects.
Pipeline freshness is a separate concern. The REPLICATION_SCHEDULE = '5 MINUTE' indicates asynchronous replication, so after failover the secondary may only contain data from the last completed replication cycle, not the most recent continuously ingested events. A near-real-time freshness target of under 1 minute is therefore not guaranteed by this configuration alone.
The key takeaway is to separate continuity scope from freshness objectives such as replication lag or RPO.
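A complete definition corresponding to the exhibit might look like the following sketch; the target account identifier is a placeholder, and `ALLOWED_ACCOUNTS` is required when creating a failover group in the primary account:

```sql
-- Hypothetical primary-account definition matching the exhibit.
-- myorg.dr_account is a placeholder for the secondary account.
CREATE FAILOVER GROUP fg_orders
  OBJECT_TYPES = USERS, ROLES, WAREHOUSES, DATABASES
  ALLOWED_DATABASES = orders_db
  ALLOWED_ACCOUNTS = myorg.dr_account
  REPLICATION_SCHEDULE = '5 MINUTE';

-- On the secondary, inspect the last refresh time to estimate replication lag.
SHOW FAILOVER GROUPS;
```

Even with this in place, data ingested after the last completed refresh cycle is not present on the secondary, which is why the schedule, not the group definition, bounds post-failover freshness.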
A retail team drops 5-20 MB CSV files into an external stage at unpredictable times throughout the day. A downstream stream and task read from a native landing table and require new rows within 2 minutes. The team wants minimal operational overhead and does not want to manage a warehouse just to poll for new files.
What is the best Snowflake design choice?
Options:
COPY INTO every minute on a dedicated warehouse

Best answer: A
Explanation: Snowpipe with auto-ingest is Snowflake’s continuous file ingestion pattern for files arriving irregularly in cloud storage. It supports near-real-time freshness and avoids a task-plus-warehouse polling design while loading directly into the native landing table.
The key distinction is bulk loading versus continuous file ingestion. When files arrive unpredictably throughout the day and freshness is measured in minutes, Snowpipe is the Snowflake-native fit for automatically loading new staged files into a table. With auto-ingest, cloud storage events trigger file loading as files land, so the team does not need to run scheduled polling logic or size a warehouse just for ingestion.
This also matches the requirement that downstream processing reads from a native landing table. A frequent COPY INTO remains a polling-based batch pattern, even if scheduled every minute. Snowpipe Streaming is designed for row-based streaming producers rather than files already written to cloud storage. The deciding factor is the file arrival pattern plus the freshness target.
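A pipe definition for this pattern might look like the sketch below; the stage, table, and pipe names are placeholders, and `AUTO_INGEST = TRUE` assumes cloud storage event notifications (for example, S3 events routed to SQS) have been configured for the stage's location:

```sql
-- Hypothetical pipe loading new CSV files as they land in the external stage.
CREATE PIPE retail_csv_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO landing.raw_orders
  FROM @retail_ext_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```

Because Snowpipe uses serverless compute, no user-managed warehouse needs to be sized or resumed for ingestion, which satisfies the low-operational-overhead requirement directly.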
COPY INTO is still a batch or micro-batch pattern and requires warehouse management for recurring loads.

A data engineering team loads partner CSV files through Snowpipe into raw tables and runs tasks every 5 minutes to populate curated tables. Analysts need to know within 10 minutes if load freshness degrades or if customer_id null rates spike, but the team does not want to redesign the existing task graph. Which Snowflake feature choice is the BEST fit?
Options:
Best answer: B
Explanation: Data metric functions are built to measure table-level quality and freshness without changing the ingestion or transformation pattern. Pairing them with an alert lets the team detect null spikes or stale loads within the required window.
This is an observability requirement, not a reason to redesign the pipeline. Data metric functions let you attach data quality or freshness checks to the curated table itself, such as monitoring null behavior for customer_id or whether recent data arrived on time. An alert can then evaluate those metric results on a schedule and notify the team quickly when thresholds are breached. That adds early detection while preserving the current Snowpipe ingestion and task-based transformation design. The closest alternative is adding a stream and validation task, but that pushes custom monitoring logic into the pipeline and increases operational complexity.
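Wiring this up might look like the following sketch. The table, integration, and threshold values are placeholders; `SNOWFLAKE.CORE.NULL_COUNT` is a built-in system data metric function, and metric results land in the `SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS` view:

```sql
-- Attach a built-in null-count metric to the curated table (names are placeholders).
ALTER TABLE curated.customers
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (customer_id);

-- Evaluate attached metrics every 5 minutes.
ALTER TABLE curated.customers
  SET DATA_METRIC_SCHEDULE = '5 MINUTE';

-- Serverless alert that notifies the team when null counts breach a threshold.
CREATE ALERT customer_id_null_alert
  SCHEDULE = '5 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
    WHERE table_name = 'CUSTOMERS'
      AND metric_name = 'NULL_COUNT'
      AND value > 100           -- hypothetical threshold
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'my_email_integration', 'team@example.com',
    'Data quality alert', 'customer_id null count exceeded threshold');
```

Nothing in the existing Snowpipe-plus-tasks graph changes; the checks and the alert sit alongside the pipeline rather than inside it.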
A 25 TB events table supports an operational dashboard. Most queries use an equality filter on one of several high-cardinality columns such as event_id, session_id, or user_email, return fewer than 20 rows, and the query profile shows many micro-partitions scanned. Which performance-improvement action best aligns with this symptom?
Options:
Best answer: D
Explanation: Search Optimization Service best fits selective equality lookups on very large tables when result sets are tiny but scans are still broad. It adds search access paths so Snowflake can locate matching rows more efficiently than relying only on normal micro-partition pruning.
The key is matching the optimization to the query pattern. Search Optimization Service is intended for selective lookup queries on large tables, especially equality predicates on high-cardinality columns where only a few rows are returned. In this scenario, the symptom is not heavy aggregation or underpowered compute; it is inefficient row location across many micro-partitions. Search optimization adds persistent access paths that help Snowflake find matching values faster and reduce unnecessary scanning.
Query acceleration is more appropriate when a query still must process a large amount of data after pruning. Clustering is more useful for predictable range-based pruning patterns, and materialized views help when the same transformed or aggregated result is repeatedly reused. For point-lookups across several high-cardinality columns, search optimization is the best-aligned action.
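For this multi-column lookup pattern, the feature can be scoped to exactly the columns used in the equality filters; the table and column names below come from the scenario:

```sql
-- Search access paths limited to the high-cardinality lookup columns.
ALTER TABLE events ADD SEARCH OPTIMIZATION
  ON EQUALITY(event_id, session_id, user_email);
```

Listing the columns explicitly avoids paying maintenance costs for columns that are never used as point-lookup predicates.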