SnowPro Data Engineer DEA-C02: Transformations and Workflows

Try 10 focused SnowPro Data Engineer DEA-C02 questions on Transformations and Workflows, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Try SnowPro Data Engineer DEA-C02 on Web | View the full SnowPro Data Engineer DEA-C02 practice page

Topic snapshot

Exam route: SnowPro Data Engineer DEA-C02
Topic area: Transformations, Programmability, and Developer Workflows
Blueprint weight: 24%
Page purpose: Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Transformations, Programmability, and Developer Workflows for SnowPro Data Engineer DEA-C02. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

  • First attempt: answer without checking the explanation first. Record the fact, rule, calculation, or judgment point that controlled your answer.
  • Review: read the explanation even when you were correct. Record why the best answer is stronger than the closest distractor.
  • Repair: repeat only missed or uncertain items after a short break. Record the pattern behind misses, not the answer letter.
  • Transfer: return to mixed practice once the topic feels stable. Record whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 24% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Transformations, Programmability, and Developer Workflows

A data engineer reviews a task that populates a curated transactions table. Ingestion already arrives within 10 seconds, and the task averages 3 seconds per run. Business users now need additional risk_band rules, and the same logic must be reused in several SQL transformations.

create or replace task classify_txn
  warehouse = etl_wh
  schedule = '1 minute'
as
merge into curated_txn t
using raw_txn_stream s
on t.txn_id = s.txn_id
when matched then update set
  risk_band = case
    when s.amount > 10000 then 'HIGH'
    when s.channel = 'CARD_PRESENT' then 'LOW'
  end;

Which action best addresses the primary issue?

Options:

  • A. Enable multi-cluster auto-scaling on etl_wh

  • B. Resize etl_wh to a larger warehouse

  • C. Encapsulate the classification rules in a SQL UDF

  • D. Replace the current load path with Snowpipe Streaming

Best answer: C

Explanation: This is a transformation-design problem, not a compute or ingestion problem. The pipeline is already fast enough; the missing piece is a reusable way to apply evolving business rules inside SQL transformations, which fits a SQL UDF.

The key clue is that the operational metrics are already acceptable: ingestion arrives within 10 seconds and the task finishes in 3 seconds. That rules out warehouse sizing and lower-latency ingestion as the primary issue. The requirement is to implement business logic for risk_band and reuse it across multiple SQL transformations.

A SQL UDF is well suited for reusable scalar logic such as transaction classification. The MERGE statement can call the UDF directly, so the rule is defined once and maintained centrally as business conditions change. This keeps transformation logic consistent across tasks, views, or other SQL pipelines.

Infrastructure tuning only helps when throughput, concurrency, or latency is the bottleneck, which the stem explicitly says is not the case.

  • A bigger warehouse adds compute capacity, but the task already runs quickly, and resizing does nothing to centralize the classification rules.
  • Snowpipe Streaming is for lower ingestion latency, but the stem says the current arrival time already meets needs.
  • Multi-cluster scaling helps concurrent warehouse demand, not reusable business-rule implementation inside SQL transformations.
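
For reference, a minimal sketch of the recommended pattern in Snowflake SQL. The function name, the ELSE default, and anything beyond the rules shown in the exhibit are illustrative assumptions, not details from the scenario.

create or replace function classify_risk_band(amount number, channel varchar)
returns varchar
as
$$
  case
    when amount > 10000 then 'HIGH'
    when channel = 'CARD_PRESENT' then 'LOW'
    else 'MEDIUM'
  end
$$;

-- the task's MERGE, and any other SQL transformation, can now call the same rule set
merge into curated_txn t
using raw_txn_stream s
on t.txn_id = s.txn_id
when matched then update set
  risk_band = classify_risk_band(s.amount, s.channel);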

Question 2

Topic: Transformations, Programmability, and Developer Workflows

A task graph uses an external table and SQL on VARIANT to process new JSON claim files from an external stage. The source system now also lands scanned PDF claim forms in the same stage. Since that change, the extraction task fails on the PDF files, but the team still wants Snowflake-native detection of new arrivals and in-platform processing. What is the best next step?

Options:

  • A. Use a directory table stream and Snowpark with scoped file URLs.

  • B. Enable schema evolution on the existing external table.

  • C. Replace the task graph with dynamic table refreshes.

  • D. Increase warehouse size for the extraction task.

Best answer: A

Explanation: PDFs change the design from a semi-structured SQL pattern to an unstructured-data workflow. The best fix is to detect new files with a directory table and stream, then process file contents in Snowflake through Snowpark using scoped file access.

The core issue is that the original pipeline assumes staged files can be queried as semi-structured data and parsed with SQL from VARIANT. That works for JSON, but scanned PDFs are unstructured files, so the transformation mechanism must change.

  • Maintain a directory table on the stage to expose file metadata.
  • Use a stream on the directory table to capture newly arrived files.
  • Pass scoped file URLs into Snowpark or a Python procedure/UDF.
  • Read the files with Snowflake file APIs and write extracted results to tables.

This keeps orchestration inside Snowflake while switching to the correct processing model for unstructured content. More refresh logic or more compute does not change the fact that the new files are not JSON that SQL can query as VARIANT.

  • Schema evolution helps when structured fields drift, not when the source becomes scanned PDF files.
  • Dynamic tables maintain query results, but they do not replace unstructured file extraction logic.
  • More compute may reduce runtime pressure, but it will not make SQL parse binary documents as VARIANT.
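
A minimal sketch of the detection half of this pattern, assuming a stage named claims_stage; the object names are illustrative, and the Snowpark handler that reads each file is omitted:

-- expose file metadata for the stage as a directory table
alter stage claims_stage set directory = (enable = true);

-- capture newly arrived files as rows
create or replace stream claims_file_stream on stage claims_stage;

-- a task can then hand scoped URLs for the new PDFs to a Snowpark procedure or Python UDF
select build_scoped_file_url(@claims_stage, relative_path) as scoped_url
from claims_file_stream
where relative_path ilike '%.pdf';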

Question 3

Topic: Transformations, Programmability, and Developer Workflows

A data engineer built the following transformation in a Snowflake notebook. The team wants to keep the Python DataFrame workflow, run it nightly, and materialize the result into a target table.

from snowflake.snowpark.functions import col, sum as sum_

def build_daily_sales(session):
    return (
        session.table("RAW.ORDERS")
        .filter(col("STATUS") == "COMPLETE")
        .group_by("ORDER_DATE")
        .agg(sum_("AMOUNT").alias("SALES"))
    )

build_daily_sales(session).write.mode("overwrite").save_as_table("CURATED.DAILY_SALES")

Which implementation is the best fit for productionizing this logic?

Options:

  • A. A dynamic table containing the Python function

  • B. A scalar Python UDF used in a SELECT

  • C. A Python stored procedure invoked by a task

  • D. A Python UDTF called from the FROM clause

Best answer: C

Explanation: This exhibit shows a Snowpark transformation workflow: reusable Python function, DataFrame operations, and a table write. That pattern is best productionized as a Python stored procedure, with a task providing the nightly schedule.

Snowpark is the right developer workflow here because the transformation is already expressed as Python DataFrame code. The production wrapper should be a Python stored procedure, because stored procedures are designed to execute multi-step program logic in Snowflake and can perform actions such as writing the final result to a table.

  • The notebook code is set-based Snowpark transformation logic.
  • The save_as_table call means the code is materializing output, not just returning a value inside a query.
  • A task is the normal way to schedule recurring execution of that procedure.

The closest distractor is a UDTF because the output is tabular, but UDTFs are meant to be consumed in SQL queries, not to orchestrate a nightly table-writing pipeline.

  • Scalar UDF trap: scalar UDFs return a value per invocation inside a query and are not used to materialize a table.
  • UDTF trap: tempting because the result is tabular, but UDTFs are consumed in SQL FROM clauses rather than used for table-write orchestration.
  • Dynamic table trap: misses the workflow requirement because dynamic tables are defined by SQL queries, not arbitrary Snowpark Python functions.
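
One hedged way to productionize the exhibit, written as Snowflake SQL DDL with the notebook code reused inside a Python handler; the procedure name, runtime version, warehouse, and schedule are illustrative assumptions:

create or replace procedure refresh_daily_sales()
returns string
language python
runtime_version = '3.11'
packages = ('snowflake-snowpark-python')
handler = 'run'
as
$$
from snowflake.snowpark.functions import col, sum as sum_

def run(session):
    # same DataFrame logic as the notebook, now wrapped for scheduled execution
    df = (
        session.table("RAW.ORDERS")
        .filter(col("STATUS") == "COMPLETE")
        .group_by("ORDER_DATE")
        .agg(sum_("AMOUNT").alias("SALES"))
    )
    df.write.mode("overwrite").save_as_table("CURATED.DAILY_SALES")
    return "CURATED.DAILY_SALES refreshed"
$$;

create or replace task nightly_daily_sales
  warehouse = etl_wh
  schedule = 'USING CRON 0 2 * * * UTC'
as
  call refresh_daily_sales();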

Question 4

Topic: Transformations, Programmability, and Developer Workflows

A data engineering team stores API responses in raw_orders(src VARIANT). Downstream reconciliation must keep every order_id, even when the lines array is empty.

Exhibit:

SELECT
  src:order_id::STRING AS order_id,
  f.value:sku::STRING AS sku,
  f.value:qty::NUMBER AS qty
FROM raw_orders,
LATERAL FLATTEN(INPUT => src:lines) f;

What is the best next step?

Options:

  • A. Cast src:lines to text and split it with string functions

  • B. Add OUTER => TRUE to the FLATTEN call

  • C. Add RECURSIVE => TRUE to the FLATTEN call

  • D. Replace FLATTEN with direct indexing on src:lines[0]

Best answer: B

Explanation: FLATTEN is the correct Snowflake feature for turning a JSON array in a VARIANT column into relational rows. Because the requirement is to retain orders whose lines array is empty, OUTER => TRUE is needed so those parent rows are still returned.

LATERAL FLATTEN is Snowflake’s native pattern for extracting array elements from semi-structured data. In the exhibit, src:lines is an array, and f.value exposes each line item so fields like sku and qty can be projected as columns. The extra requirement is the deciding detail: some orders have an empty lines array, but reconciliation still needs the order_id returned. Adding OUTER => TRUE makes FLATTEN emit a single row for that zero-row expansion, with NULL values for the flattened fields.

  • Use FLATTEN when you need one output row per array element.
  • Use OUTER => TRUE when empty arrays should not remove the parent row.
  • Keep extracting nested fields from f.value after flattening.

The closest distractor changes traversal behavior, but it does not solve the row-retention requirement.

  • RECURSIVE => TRUE walks deeper nested structures, but it does not address keeping rows for empty arrays.
  • Direct indexing on src:lines[0] only targets one array element and does not reshape all line items into rows.
  • Converting JSON to text and splitting it discards structure and is not the Snowflake-native way to query nested arrays.
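
For reference, the corrected query from the exhibit with the single change applied:

SELECT
  src:order_id::STRING AS order_id,
  f.value:sku::STRING AS sku,
  f.value:qty::NUMBER AS qty
FROM raw_orders,
LATERAL FLATTEN(INPUT => src:lines, OUTER => TRUE) f;
-- orders whose lines array is empty now return one row with NULL sku and qty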

Question 5

Topic: Transformations, Programmability, and Developer Workflows

A data engineering team ingests IoT events with Snowpipe Streaming into a raw table. They are testing a new Snowpark Python enrichment that must be iteratively tuned, validated with ad hoc queries and quick visual checks, and kept inside Snowflake. They also need experimentation compute isolated from the production task graph. What is the BEST next action?

Options:

  • A. Materialize the enrichment as a dynamic table with target lag

  • B. Use a Snowflake Notebook with Snowpark on a dedicated warehouse

  • C. Run the enrichment in a serverless task and inspect task history

  • D. Share the raw table to another account for independent testing

Best answer: B

Explanation: A Snowflake Notebook is the best fit when engineers need interactive development, ad hoc validation, and quick experimentation without moving data out of Snowflake. Using a dedicated warehouse keeps exploratory runs separate from the production task graph while the transformation is still being refined.

Notebooks are most useful in the pre-production phase, when a team needs to iterate on Snowpark or SQL logic, inspect intermediate results, and validate output quality before operationalizing the code. Here, the enrichment is still being tuned, the team wants ad hoc queries and quick visual checks, and the data must remain in Snowflake. A notebook directly supports that workflow, and assigning it a separate warehouse prevents experimental runs from contending with scheduled ETL resources.

After the logic is proven, the team can move it into a task, procedure, or another production pipeline object. The key distinction is that notebooks support interactive development and experimentation, while tasks and dynamic tables support steady-state automation.

  • Serverless task automates recurring execution, but it is not the best workspace for iterative coding, ad hoc checks, and rapid tuning.
  • Dynamic table maintains a declarative transformation after the design is settled, not an experimental Snowpark workflow.
  • Secure sharing helps deliver data to consumers, but it does not provide an in-account development environment for validating new logic.
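
A small sketch of the compute-isolation piece; the warehouse name and sizing are assumptions, and the notebook would simply be assigned this warehouse instead of the production one:

-- separate, auto-suspending warehouse for notebook experimentation
create warehouse if not exists enrich_dev_wh
  warehouse_size = 'XSMALL'
  auto_suspend = 60
  auto_resume = true
  initially_suspended = true;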

Question 6

Topic: Transformations, Programmability, and Developer Workflows

A data engineering team wants to reuse the same row-level cleanup logic in several hourly pipelines.

Exhibit:

create or replace task hourly_load_orders
  warehouse = etl_wh
  schedule = 'USING CRON 0 * * * * UTC'
as
insert into curated.orders
select
  order_id,
  upper(trim(customer_email)) as customer_email,
  try_to_timestamp_ntz(order_ts_raw) as order_ts
from landing.orders_raw;

A developer says, "This task is our reusable transformation layer; we'll copy it for other pipelines." Which response is best?

Options:

  • A. Use a stored procedure as the scheduler and remove the task.

  • B. This confuses orchestration with reuse; extract the cleanup into a SQL UDF.

  • C. This is correct; tasks are Snowflake’s reusable objects for shared SQL cleanup.

  • D. Replace the task with an external function for reusable cleanup logic.

Best answer: B

Explanation: The exhibit shows reusable row-level expressions embedded inside a scheduled task. In Snowflake, tasks orchestrate when work runs, while reusable transformation logic should be encapsulated separately and then invoked by that orchestration layer.

The key distinction is between reusable transformation logic and orchestration logic. In the exhibit, upper(trim(customer_email)) and try_to_timestamp_ntz(order_ts_raw) are row-level transformations that may need to be reused across many queries or pipelines. The task is not the reuse boundary; it only provides the schedule and execution wrapper.

For this pattern, a SQL UDF is the best fit because the shared logic is expression-based and can be called from multiple statements. If the shared logic later becomes multi-step or procedural, a stored procedure could encapsulate that logic, but a task would still be the object used to schedule it. The main mistake is treating copied tasks as a reusable transformation layer instead of centralizing the transformation itself.

  • Tasks are not reuse units: duplicating tasks copies scheduling and execution wrappers, not a clean shared transformation abstraction.
  • External function mismatch: external functions are for invoking external services, not simple in-Snowflake string and timestamp normalization.
  • Procedures do not schedule: stored procedures can encapsulate logic, but Snowflake tasks are the native scheduling and orchestration mechanism.
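
A hedged sketch of that centralization, with illustrative function names; the task keeps its schedule and simply calls the shared logic:

create or replace function normalize_email(raw_email varchar)
returns varchar
as
$$
  upper(trim(raw_email))
$$;

create or replace function parse_order_ts(raw_ts varchar)
returns timestamp_ntz
as
$$
  try_to_timestamp_ntz(raw_ts)
$$;

create or replace task hourly_load_orders
  warehouse = etl_wh
  schedule = 'USING CRON 0 * * * * UTC'
as
insert into curated.orders
select
  order_id,
  normalize_email(customer_email) as customer_email,
  parse_order_ts(order_ts_raw) as order_ts
from landing.orders_raw;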

Question 7

Topic: Transformations, Programmability, and Developer Workflows

A stream-driven task is failing after a team wrapped CDC logic in a SQL UDF:

CREATE OR REPLACE TASK apply_orders
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  SELECT APPLY_CDC();

Inside APPLY_CDC, the team tries to run a MERGE into fact_orders, delete bad rows from order_errors, and return a status message. What is the most appropriate Snowflake fix?

Options:

  • A. Keep the UDF and upsize the warehouse.

  • B. Use a stored procedure and let the task CALL it.

  • C. Use an external function for the CDC logic.

  • D. Use a UDTF and select from it in the task.

Best answer: B

Explanation: This CDC step needs to execute Snowflake DML, not just return a computed value. A stored procedure is the right extensibility mechanism for multi-statement pipeline logic, and the task can invoke it directly with CALL.

The key distinction is between computing query results and performing actions. UDFs and UDTFs are meant to be used inside SQL queries to return a value or a row set. They are not the right mechanism when the logic itself must perform side effects such as MERGE, INSERT, UPDATE, or DELETE.

In this scenario, the task is orchestrating CDC application from a stream. That usually requires multi-statement execution, conditional logic, and status handling. A stored procedure fits that pattern because it can run DML and encapsulate procedural workflow, while the task still runs a single SQL statement:

  • CALL APPLY_CDC();

An external function is for calling an external service from SQL, not for internal table-maintenance logic. The practical rule is simple: use functions for reusable query-time computation, and use procedures for pipeline actions.

  • UDTF mismatch: a UDTF returns rows to a query, but it is not the mechanism for executing MERGE and DELETE side effects.
  • External service focus: an external function is for invoking remote endpoints, not for orchestrating internal CDC DML in Snowflake.
  • More compute: a larger warehouse can ease runtime pressure, but it does not make a UDF valid for multi-statement DML logic.
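
A minimal sketch of that split; the column names, the MERGE clauses, and the predicate for bad rows are placeholders, since the stem does not define them:

create or replace procedure apply_cdc()
returns string
language sql
as
$$
begin
  merge into fact_orders t
  using orders_stream s
  on t.order_id = s.order_id
  when matched then update set t.amount = s.amount
  when not matched then insert (order_id, amount) values (s.order_id, s.amount);

  -- placeholder predicate for the bad rows
  delete from order_errors
  where status = 'RESOLVED';

  return 'CDC applied';
end;
$$;

create or replace task apply_orders
  when system$stream_has_data('orders_stream')
as
  call apply_cdc();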

Question 8

Topic: Transformations, Programmability, and Developer Workflows

A data engineering team wants an hourly, human-readable digest of failed ingestion messages for on-call review. They want to stay inside Snowflake and avoid building a separate ML training pipeline.

CREATE OR REPLACE TABLE failed_load_digest AS
SELECT
  DATE_TRUNC('HOUR', event_ts) AS load_hour,
  AI_SUMMARIZE_AGG(error_message) AS failure_summary
FROM ingest_errors
WHERE status = 'FAILED'
GROUP BY 1;

What is the best interpretation of this design?

Options:

  • A. Infer staged-file schema before running COPY INTO loads.

  • B. Create embeddings for vector search across similar error messages.

  • C. Train and host a custom model before scheduling this pipeline.

  • D. Use a Snowflake-managed AI SQL function for operational text summaries.

Best answer: D

Explanation: AI_SUMMARIZE_AGG is a Snowflake AI SQL function that summarizes grouped text where the data already lives. Here it turns many failed-load messages into one hourly digest, which fits an engineering workflow without the overhead of a separate ML project.

The core concept is using Snowflake AI/text functions for lightweight enrichment inside a data pipeline rather than building a full machine-learning lifecycle. In the exhibit, the team already has operational text in a Snowflake table and needs a concise summary for humans. AI_SUMMARIZE_AGG is designed for that pattern: aggregate rows of text and return a readable summary directly in SQL.

This is a good fit when the requirement is:

  • summarize or classify text already in Snowflake
  • keep the workflow SQL-native
  • avoid custom training, feature engineering, and model hosting

A full ML project would be more appropriate only if the team needed a custom predictive model or specialized model management. The closest trap is confusing summarization with embeddings or vector-search preparation.

  • Embeddings mismatch: this confuses summarization with vectorization; the query returns one readable summary per hour, not similarity vectors.
  • Custom model overhead: unnecessary, because Snowflake AI SQL functions can be called directly in SQL transformations.
  • Schema inference mismatch: unrelated; nothing in the exhibit inspects staged files or load-time schema handling.
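
If the team also wants the hourly cadence automated inside Snowflake, one hedged option is to wrap the exhibit query in a scheduled task; the task name, warehouse, and cron expression are assumptions:

create or replace task refresh_failed_load_digest
  warehouse = etl_wh
  schedule = 'USING CRON 5 * * * * UTC'
as
  create or replace table failed_load_digest as
  select
    date_trunc('HOUR', event_ts) as load_hour,
    ai_summarize_agg(error_message) as failure_summary
  from ingest_errors
  where status = 'FAILED'
  group by 1;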

Question 9

Topic: Transformations, Programmability, and Developer Workflows

A media company lands nested JSON event files with arrays and occasional new keys every minute. The pipeline must preserve raw payloads for replay, publish flattened event rows within 5 minutes, and keep governed data inside Snowflake for downstream sharing. Which design is the BEST fit?

Options:

  • A. Expose the files through an external table and have consumers flatten the JSON in each query.

  • B. Parse the JSON on an external Spark platform first, then reload normalized tables into Snowflake.

  • C. Snowpipe into raw VARIANT, then use a dynamic table with LATERAL FLATTEN to publish curated rows in Snowflake.

  • D. Map each JSON field to fixed VARCHAR columns during load and recreate tables when keys change.

Best answer: C

Explanation: Snowflake natively handles nested JSON through VARIANT, semi-structured functions, and LATERAL FLATTEN. Loading raw data with Snowpipe and materializing curated rows with a dynamic table satisfies replay, latency, and governance requirements without exporting complex data to an external platform.

The key concept is that complex or semi-structured data does not require an external processing platform by default in Snowflake. Nested JSON with arrays and evolving keys can be loaded directly into a raw VARIANT table, preserving the full payload for replay. From there, Snowflake can flatten arrays and extract attributes natively, and a dynamic table can maintain a curated relational projection for near-real-time use.

  • Use Snowpipe to ingest the arriving files continuously.
  • Keep the raw JSON in VARIANT for replay and schema flexibility.
  • Use LATERAL FLATTEN and semi-structured functions in the dynamic table query.
  • Share the curated Snowflake objects downstream.

The closest distractor is the external-table approach, but that pushes transformation work to each consumer instead of producing governed curated data.

  • External Spark: adds data movement and operational overhead; nested arrays and objects can be parsed and flattened natively in Snowflake.
  • External table only: leaves transformation to each consumer and does not create governed, low-latency curated rows.
  • Fixed columns on load: brittle for schema drift and a poor fit for nested structures that belong naturally in VARIANT.
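
A hedged sketch of the curated layer in option C, assuming a raw table raw_events(src VARIANT) populated by Snowpipe; the column paths, warehouse, and 5-minute target lag are illustrative:

create or replace dynamic table curated_events
  target_lag = '5 minutes'
  warehouse = etl_wh
as
select
  src:event_id::string as event_id,
  src:event_ts::timestamp_ntz as event_ts,
  f.value:name::string as attr_name,
  f.value:value::string as attr_value
from raw_events,
lateral flatten(input => src:attributes) f;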

Question 10

Topic: Transformations, Programmability, and Developer Workflows

Which proposed implementation requires configuring Snowflake external network access, such as an EXTERNAL_ACCESS_INTEGRATIONS setting?

Options:

  • A. An external function backed by an API integration

  • B. A SQL UDF that extracts fields from VARIANT

  • C. A JavaScript stored procedure that runs MERGE

  • D. A Python UDF that calls a third-party REST API

Best answer: D

Explanation: A Python UDF needs external network access when its handler must call a service outside Snowflake at runtime. That is the purpose of network rules, secrets, and an EXTERNAL_ACCESS_INTEGRATIONS reference.

Snowflake external network access is for handler code inside supported UDFs and stored procedures when that code must connect to something outside Snowflake, such as a REST API. In that pattern, you configure network rules, secrets, and an external access integration, then associate it with the object.

An external function is different: it reaches an external service through an API integration, so it does not use the external-network-access feature for handler code. A SQL UDF that only transforms VARIANT data and a procedure that only runs SQL inside Snowflake both stay fully in-platform and do not need outbound connectivity. The deciding test is whether the handler itself must open an external connection.

  • External function confusion: an external function reaches an outside service, but it uses an API integration rather than EXTERNAL_ACCESS_INTEGRATIONS.
  • In-database SQL only: applies to the VARIANT parsing UDF, so no outbound network call is involved.
  • Internal orchestration only: applies to the procedure running MERGE, which stays inside Snowflake execution.
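
A hedged sketch of the option D setup; the host, rule, integration, and function names are illustrative, and a real deployment would typically also attach a secret for the API credential:

-- allow outbound traffic to the third-party API host
create or replace network rule vendor_api_rule
  mode = egress
  type = host_port
  value_list = ('api.example.com:443');

create or replace external access integration vendor_api_access
  allowed_network_rules = (vendor_api_rule)
  enabled = true;

create or replace function enrich_order(order_id varchar)
returns varchar
language python
runtime_version = '3.11'
handler = 'enrich'
external_access_integrations = (vendor_api_access)
packages = ('requests')
as
$$
import requests

def enrich(order_id):
    # the handler itself opens the outbound connection,
    # which is why external network access must be configured
    resp = requests.get(f"https://api.example.com/orders/{order_id}", timeout=5)
    return resp.text
$$;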

Continue with full practice

Use the SnowPro Data Engineer DEA-C02 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Try SnowPro Data Engineer DEA-C02 on Web | View the SnowPro Data Engineer DEA-C02 Practice Test

Free review resource

Read the SnowPro Data Engineer DEA-C02 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026