Try 10 focused SnowPro Data Engineer DEA-C02 questions on Transformations and Workflows, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
Try SnowPro Data Engineer DEA-C02 on Web View full SnowPro Data Engineer DEA-C02 practice page
| Field | Detail |
|---|---|
| Exam route | SnowPro Data Engineer DEA-C02 |
| Topic area | Transformations, Programmability, and Developer Workflows |
| Blueprint weight | 24% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Transformations, Programmability, and Developer Workflows for SnowPro Data Engineer DEA-C02. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 24% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Transformations, Programmability, and Developer Workflows
A data engineer reviews a task that populates a curated transactions table. Ingestion already arrives within 10 seconds, and the task averages 3 seconds per run. Business users now need additional risk_band rules, and the same logic must be reused in several SQL transformations.
create or replace task classify_txn
warehouse = etl_wh
schedule = '1 minute'
as
merge into curated_txn t
using raw_txn_stream s
on t.txn_id = s.txn_id
when matched then update set
risk_band = case
when s.amount > 10000 then 'HIGH'
when s.channel = 'CARD_PRESENT' then 'LOW'
end;
Which action best addresses the primary issue?
Options:
A. Enable multi-cluster auto-scaling on etl_wh
B. Resize etl_wh to a larger warehouse
C. Encapsulate the classification rules in a SQL UDF
D. Replace the current load path with Snowpipe Streaming
Best answer: C
Explanation: This is a transformation-design problem, not a compute or ingestion problem. The pipeline is already fast enough; the missing piece is a reusable way to apply evolving business rules inside SQL transformations, which fits a SQL UDF.
The key clue is that the operational metrics are already acceptable: ingestion arrives within 10 seconds and the task finishes in 3 seconds. That rules out warehouse sizing and lower-latency ingestion as the primary issue. The requirement is to implement business logic for risk_band and reuse it across multiple SQL transformations.
A SQL UDF is well suited for reusable scalar logic such as transaction classification. The MERGE statement can call the UDF directly, so the rule is defined once and maintained centrally as business conditions change. This keeps transformation logic consistent across tasks, views, or other SQL pipelines.
Infrastructure tuning only helps when throughput, concurrency, or latency is the bottleneck, which the stem explicitly says is not the case.
Topic: Transformations, Programmability, and Developer Workflows
A task graph uses an external table and SQL on VARIANT to process new JSON claim files from an external stage. The source system now also lands scanned PDF claim forms in the same stage. Since that change, the extraction task fails on the PDF files, but the team still wants Snowflake-native detection of new arrivals and in-platform processing. What is the best next step?
Options:
A. Use a directory table stream and Snowpark with scoped file URLs.
B. Enable schema evolution on the existing external table.
C. Replace the task graph with dynamic table refreshes.
D. Increase warehouse size for the extraction task.
Best answer: A
Explanation: PDFs change the design from a semi-structured SQL pattern to an unstructured-data workflow. The best fix is to detect new files with a directory table and stream, then process file contents in Snowflake through Snowpark using scoped file access.
The core issue is that the original pipeline assumes staged files can be queried as semi-structured data and parsed with SQL from VARIANT. That works for JSON, but scanned PDFs are unstructured files, so the transformation mechanism must change.
This keeps orchestration inside Snowflake while switching to the correct processing model for unstructured content. More refresh logic or more compute does not solve the fact that the new files are no longer SQL-queryable JSON.
VARIANT.Topic: Transformations, Programmability, and Developer Workflows
A data engineer built the following transformation in a Snowflake notebook. The team wants to keep the Python DataFrame workflow, run it nightly, and materialize the result into a target table.
from snowflake.snowpark.functions import col, sum as sum_
def build_daily_sales(session):
return (
session.table("RAW.ORDERS")
.filter(col("STATUS") == "COMPLETE")
.group_by("ORDER_DATE")
.agg(sum_("AMOUNT").alias("SALES"))
)
build_daily_sales(session).write.mode("overwrite").save_as_table("CURATED.DAILY_SALES")
Which implementation is the best fit for productionizing this logic?
Options:
A. A dynamic table containing the Python function
B. A scalar Python UDF used in a SELECT
C. A Python stored procedure invoked by a task
D. A Python UDTF called from the FROM clause
Best answer: C
Explanation: This exhibit shows a Snowpark transformation workflow: reusable Python function, DataFrame operations, and a table write. That pattern is best productionized as a Python stored procedure, with a task providing the nightly schedule.
Snowpark is the right developer workflow here because the transformation is already expressed as Python DataFrame code. The production wrapper should be a Python stored procedure, because stored procedures are designed to execute multi-step program logic in Snowflake and can perform actions such as writing the final result to a table.
save_as_table call means the code is materializing output, not just returning a value inside a query.The closest distractor is a UDTF because the output is tabular, but UDTFs are meant to be consumed in SQL queries, not to orchestrate a nightly table-writing pipeline.
FROM clauses rather than used for table-write orchestration.Topic: Transformations, Programmability, and Developer Workflows
A data engineering team stores API responses in raw_orders(src VARIANT). Downstream reconciliation must keep every order_id, even when the lines array is empty.
Exhibit:
SELECT
src:order_id::STRING AS order_id,
f.value:sku::STRING AS sku,
f.value:qty::NUMBER AS qty
FROM raw_orders,
LATERAL FLATTEN(INPUT => src:lines) f;
What is the best next step?
Options:
A. Cast src:lines to text and split it with string functions
B. Add OUTER => TRUE to the FLATTEN call
C. Add RECURSIVE => TRUE to the FLATTEN call
D. Replace FLATTEN with direct indexing on src:lines[0]
Best answer: B
Explanation: FLATTEN is the correct Snowflake feature for turning a JSON array in a VARIANT column into relational rows. Because the requirement is to retain orders whose lines array is empty, OUTER => TRUE is needed so those parent rows are still returned.
LATERAL FLATTEN is Snowflake’s native pattern for extracting array elements from semi-structured data. In the exhibit, src:lines is an array, and f.value exposes each line item so fields like sku and qty can be projected as columns. The extra requirement is the deciding detail: some orders have an empty lines array, but reconciliation still needs the order_id returned. Adding OUTER => TRUE makes FLATTEN emit a single row for that zero-row expansion, with NULL values for the flattened fields.
FLATTEN when you need one output row per array element.OUTER => TRUE when empty arrays should not remove the parent row.f.value after flattening.The closest distractor changes traversal behavior, but it does not solve the row-retention requirement.
RECURSIVE => TRUE walks deeper nested structures, but it does not address keeping rows for empty arrays.src:lines[0] only targets one array element and does not reshape all line items into rows.Topic: Transformations, Programmability, and Developer Workflows
A data engineering team ingests IoT events with Snowpipe Streaming into a raw table. They are testing a new Snowpark Python enrichment that must be iteratively tuned, validated with ad hoc queries and quick visual checks, and kept inside Snowflake. They also need experimentation compute isolated from the production task graph. What is the BEST next action?
Options:
A. Materialize the enrichment as a dynamic table with target lag
B. Use a Snowflake Notebook with Snowpark on a dedicated warehouse
C. Run the enrichment in a serverless task and inspect task history
D. Share the raw table to another account for independent testing
Best answer: B
Explanation: A Snowflake Notebook is the best fit when engineers need interactive development, ad hoc validation, and quick experimentation without moving data out of Snowflake. Using a dedicated warehouse keeps exploratory runs separate from the production task graph while the transformation is still being refined.
Notebooks are most useful in the pre-production phase, when a team needs to iterate on Snowpark or SQL logic, inspect intermediate results, and validate output quality before operationalizing the code. Here, the enrichment is still being tuned, the team wants ad hoc queries and quick visual checks, and the data must remain in Snowflake. A notebook directly supports that workflow, and assigning it a separate warehouse prevents experimental runs from contending with scheduled ETL resources.
After the logic is proven, the team can move it into a task, procedure, or another production pipeline object. The key distinction is that notebooks support interactive development and experimentation, while tasks and dynamic tables support steady-state automation.
Topic: Transformations, Programmability, and Developer Workflows
A data engineering team wants to reuse the same row-level cleanup logic in several hourly pipelines.
Exhibit:
create or replace task hourly_load_orders
warehouse = etl_wh
schedule = 'USING CRON 0 * * * * UTC'
as
insert into curated.orders
select
order_id,
upper(trim(customer_email)) as customer_email,
try_to_timestamp_ntz(order_ts_raw) as order_ts
from landing.orders_raw;
A developer says, This task is our reusable transformation layer; we'll copy it for other pipelines. Which response is best?
Options:
A. Use a stored procedure as the scheduler and remove the task.
B. This confuses orchestration with reuse; extract the cleanup into a SQL UDF.
C. This is correct; tasks are Snowflake’s reusable objects for shared SQL cleanup.
D. Replace the task with an external function for reusable cleanup logic.
Best answer: B
Explanation: The exhibit shows reusable row-level expressions embedded inside a scheduled task. In Snowflake, tasks orchestrate when work runs, while reusable transformation logic should be encapsulated separately and then invoked by that orchestration layer.
The key distinction is between reusable transformation logic and orchestration logic. In the exhibit, upper(trim(customer_email)) and try_to_timestamp_ntz(order_ts_raw) are row-level transformations that may need to be reused across many queries or pipelines. The task is not the reuse boundary; it only provides the schedule and execution wrapper.
For this pattern, a SQL UDF is the best fit because the shared logic is expression-based and can be called from multiple statements. If the shared logic later becomes multi-step or procedural, a stored procedure could encapsulate that logic, but a task would still be the object used to schedule it. The main mistake is treating copied tasks as a reusable transformation layer instead of centralizing the transformation itself.
Topic: Transformations, Programmability, and Developer Workflows
A stream-driven task is failing after a team wrapped CDC logic in a SQL UDF:
CREATE OR REPLACE TASK apply_orders
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
SELECT APPLY_CDC();
Inside APPLY_CDC, the team tries to run a MERGE into fact_orders, delete bad rows from order_errors, and return a status message. What is the most appropriate Snowflake fix?
Options:
A. Keep the UDF and upsize the warehouse.
B. Use a stored procedure and let the task CALL it.
C. Use an external function for the CDC logic.
D. Use a UDTF and select from it in the task.
Best answer: B
Explanation: This CDC step needs to execute Snowflake DML, not just return a computed value. A stored procedure is the right extensibility mechanism for multi-statement pipeline logic, and the task can invoke it directly with CALL.
The key distinction is between computing query results and performing actions. UDFs and UDTFs are meant to be used inside SQL queries to return a value or a row set. They are not the right mechanism when the logic itself must perform side effects such as MERGE, INSERT, UPDATE, or DELETE.
In this scenario, the task is orchestrating CDC application from a stream. That usually requires multi-statement execution, conditional logic, and status handling. A stored procedure fits that pattern because it can run DML and encapsulate procedural workflow, while the task still runs a single SQL statement:
CALL APPLY_CDC();An external function is for calling an external service from SQL, not for internal table-maintenance logic. The practical rule is simple: use functions for reusable query-time computation, and use procedures for pipeline actions.
MERGE and DELETE side effects.Topic: Transformations, Programmability, and Developer Workflows
A data engineering team wants an hourly, human-readable digest of failed ingestion messages for on-call review. They want to stay inside Snowflake and avoid building a separate ML training pipeline.
CREATE OR REPLACE TABLE failed_load_digest AS
SELECT
DATE_TRUNC('HOUR', event_ts) AS load_hour,
AI_SUMMARIZE_AGG(error_message) AS failure_summary
FROM ingest_errors
WHERE status = 'FAILED'
GROUP BY 1;
What is the best interpretation of this design?
Options:
A. Infer staged-file schema before running COPY INTO loads.
B. Create embeddings for vector search across similar error messages.
C. Train and host a custom model before scheduling this pipeline.
D. Use a Snowflake-managed AI SQL function for operational text summaries.
Best answer: D
Explanation: AI_SUMMARIZE_AGG is a Snowflake AI SQL function that summarizes grouped text where the data already lives. Here it turns many failed-load messages into one hourly digest, which fits an engineering workflow without the overhead of a separate ML project.
The core concept is using Snowflake AI/text functions for lightweight enrichment inside a data pipeline rather than building a full machine-learning lifecycle. In the exhibit, the team already has operational text in a Snowflake table and needs a concise summary for humans. AI_SUMMARIZE_AGG is designed for that pattern: aggregate rows of text and return a readable summary directly in SQL.
This is a good fit when the requirement is:
A full ML project would be more appropriate only if the team needed a custom predictive model or specialized model management. The closest trap is confusing summarization with embeddings or vector-search preparation.
Topic: Transformations, Programmability, and Developer Workflows
A media company lands nested JSON event files with arrays and occasional new keys every minute. The pipeline must preserve raw payloads for replay, publish flattened event rows within 5 minutes, and keep governed data inside Snowflake for downstream sharing. Which design is the BEST fit?
Options:
A. Expose the files through an external table and have consumers flatten the JSON in each query.
B. Parse the JSON on an external Spark platform first, then reload normalized tables into Snowflake.
C. Snowpipe into raw VARIANT, then use a dynamic table with LATERAL FLATTEN to publish curated rows in Snowflake.
D. Map each JSON field to fixed VARCHAR columns during load and recreate tables when keys change.
Best answer: C
Explanation: Snowflake natively handles nested JSON through VARIANT, semi-structured functions, and LATERAL FLATTEN. Loading raw data with Snowpipe and materializing curated rows with a dynamic table satisfies replay, latency, and governance requirements without exporting complex data to an external platform.
The key concept is that complex or semi-structured data does not require an external processing platform by default in Snowflake. Nested JSON with arrays and evolving keys can be loaded directly into a raw VARIANT table, preserving the full payload for replay. From there, Snowflake can flatten arrays and extract attributes natively, and a dynamic table can maintain a curated relational projection for near-real-time use.
VARIANT for replay and schema flexibility.LATERAL FLATTEN and semi-structured functions in the dynamic table query.The closest distractor is the external-table approach, but that pushes transformation work to each consumer instead of producing governed curated data.
VARIANT.Topic: Transformations, Programmability, and Developer Workflows
Which proposed implementation requires configuring Snowflake external network access, such as an EXTERNAL_ACCESS_INTEGRATIONS setting?
Options:
A. An external function backed by an API integration
B. A SQL UDF that extracts fields from VARIANT
C. A JavaScript stored procedure that runs MERGE
D. A Python UDF that calls a third-party REST API
Best answer: D
Explanation: A Python UDF needs external network access when its handler must call a service outside Snowflake at runtime. That is the purpose of network rules, secrets, and an EXTERNAL_ACCESS_INTEGRATIONS reference.
Snowflake external network access is for handler code inside supported UDFs and stored procedures when that code must connect to something outside Snowflake, such as a REST API. In that pattern, you configure network rules, secrets, and an external access integration, then associate it with the object.
An external function is different: it reaches an external service through an API integration, so it does not use the external-network-access feature for handler code. A SQL UDF that only transforms VARIANT data and a procedure that only runs SQL inside Snowflake both stay fully in-platform and do not need outbound connectivity. The deciding test is whether the handler itself must open an external connection.
EXTERNAL_ACCESS_INTEGRATIONS.VARIANT parsing UDF, so no outbound network call is involved.MERGE, which stays inside Snowflake execution.Use the SnowPro Data Engineer DEA-C02 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try SnowPro Data Engineer DEA-C02 on Web View SnowPro Data Engineer DEA-C02 Practice Test
Read the SnowPro Data Engineer DEA-C02 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.