DEA-C02 — Snowflake SnowPro Advanced: Data Engineer Quick Reference

Last revised: June 29, 2026

Compact DEA-C02 quick reference for Snowflake data engineering patterns, pipelines, performance, security, and operations.

How to Use This Quick Reference

This independent Quick Reference supports preparation for the Snowflake SnowPro Advanced: Data Engineer (DEA-C02) exam from Snowflake. It focuses on practical decisions a data engineer must make in Snowflake: ingestion pattern, transformation design, stream/task orchestration, table optimization, governance, and troubleshooting.

Use it to review:

Which Snowflake feature fits a scenario.
How ingestion, CDC, and orchestration components interact.
High-yield SQL patterns and common traps.
Performance and reliability signals to check during troubleshooting.

Core Snowflake Architecture Map

Layer / Concept	What It Does	Data Engineer Exam Relevance
Cloud services layer	Authentication, metadata, optimizer, access control, transactions	Explains metadata pruning, query compilation, RBAC, Time Travel metadata, and object management
Storage layer	Compressed columnar data in immutable micro-partitions	Drives pruning, clustering, cloning, Time Travel, and storage design
Virtual warehouses	User-managed compute for SQL, loading, transformations, Snowpark	Tune size for query complexity; scale out for concurrency
Serverless compute	Snowflake-managed compute for selected services	Used by features such as Snowpipe, automatic services, and some task/service patterns
Micro-partitions	Internal storage units with metadata such as ranges and distinct values	Enables pruning; clustering can improve pruning for large tables
Metadata	File load history, table statistics, object definitions, access metadata	Used for idempotent loads, query optimization, auditing, and troubleshooting
Result cache	Reuses prior query result when eligible	Useful but should not be treated as a pipeline correctness mechanism
Warehouse cache	Local data cache on a running warehouse	Can improve repeated scans; lost when warehouse suspends or changes depending on execution context

Feature Selection Matrix

Need	Prefer	Avoid / Watch For
Batch load staged files	`COPY INTO <table>`	Do not build custom loaders unless you need special processing outside Snowflake
Continuous file ingestion	Snowpipe	Snowpipe is not the same as transactional streaming; design for notification and ingestion latency
Low-latency row ingestion	Snowpipe Streaming or connector-based streaming pattern	Do not force tiny files through batch `COPY` if continuous row ingestion is required
Declarative refresh of derived tables	Dynamic tables	Not ideal for procedural branching, complex side effects, or custom retry logic
Incremental CDC from Snowflake table changes	Streams + tasks	Streams are offsets over change data, not physical event queues
Scheduled SQL orchestration	Tasks	Remember task graphs need resumed tasks and dependency design
Procedural orchestration in Snowflake	Stored procedures called by tasks	Avoid putting simple set-based SQL into procedural loops
Python/Java/Scala transformation pushed to Snowflake	Snowpark	Watch for code that pulls large data to the client instead of pushing down
Point lookup acceleration	Search Optimization Service	Do not use as a default substitute for good table design
Repeated expensive aggregate/query pattern	Materialized view, dynamic table, or precomputed table	Confirm freshness and maintenance tradeoffs
Scan-heavy outlier queries	Query Acceleration Service where eligible	Not a fix for bad joins, poor filters, or under-sized architecture
Cross-region/cloud DR	Replication / failover groups where supported	Do not confuse Time Travel with disaster recovery
Share data without copying	Secure Data Sharing	Secure views/policies may be needed to restrict exposed rows/columns

Data Loading and Ingestion

Loading Pattern Decision Table

Pattern	Best For	Key Objects	Exam Traps
Manual or scheduled batch load	Periodic files, controlled windows	Stage, file format, `COPY INTO`	`COPY` load metadata prevents duplicate file loads unless forced; understand `ON_ERROR` behavior
Snowpipe auto-ingest	Event-driven files from cloud storage	External stage, pipe, cloud notification integration	It is continuous ingestion, not instant transformation orchestration
Snowpipe REST API	Application-triggered file ingest	Pipe, REST call	Application still stages files; not row-by-row streaming
Snowpipe Streaming	Low-latency streaming rows	SDK/connector/channel pattern	Different mental model from staged-file `COPY`
External tables	Query data in external storage	External stage, external table, metadata refresh	Querying external data is not the same as loading into Snowflake-managed storage
Iceberg tables	Open table format / lakehouse interoperability	Catalog integration, external volume, Iceberg table	Understand ownership of metadata and storage before choosing
Connector ingestion	SaaS/app/system ingestion	Snowflake connectors or partner tools	Focus on governance, idempotency, and operational monitoring

Stage and File Format Reference

Object / Option	Purpose	Practical Notes
Internal stage	Snowflake-managed file staging	Good for controlled uploads and transient staging
External stage	Cloud object storage location	Prefer storage integrations over embedded cloud credentials
Named file format	Reusable parsing definition	Keeps `COPY` statements consistent and auditable
`PATTERN`	Regex filter for staged files	Useful but easy to overcomplicate; test carefully
`FILES`	Explicit list of files	Good for deterministic loads
`VALIDATION_MODE`	Validate load without committing rows	Useful before production loads
`ON_ERROR`	Controls behavior for bad rows/files	`CONTINUE` can hide data quality issues if not monitored
`MATCH_BY_COLUMN_NAME`	Map columns by name where supported	Useful when file column order changes
Metadata columns	Capture filename, row number, timestamps	Use for lineage, replay, and quarantine workflows

High-Yield `COPY INTO` Pattern

CREATE OR REPLACE FILE FORMAT ff_json
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE;

COPY INTO raw.events
  (payload, source_file, loaded_at)
FROM (
  SELECT
    $1,
    METADATA$FILENAME,
    CURRENT_TIMESTAMP()
  FROM @raw_event_stage
)
FILE_FORMAT = (FORMAT_NAME = ff_json)
ON_ERROR = 'CONTINUE';

Key points:

Use named stages and file formats for repeatability.
Persist source file metadata for lineage and replay.
Monitor rejected rows/files; ON_ERROR = 'CONTINUE' is not a data quality solution by itself.
COPY maintains load metadata, so reloading the same file requires intentional design.

Snowpipe Reference

CREATE OR REPLACE PIPE ingest.event_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO raw.events
FROM @raw_event_stage
FILE_FORMAT = (FORMAT_NAME = ff_json);

Topic	Remember
Auto-ingest	Uses cloud notifications to trigger ingestion from a stage
Pipe definition	Contains a `COPY INTO` statement
Monitoring	Use pipe history and copy history views/functions
Transformations	Keep pipe transformations simple; use downstream tasks/dynamic tables for complex logic
Idempotency	File naming, load metadata, and replay procedure matter
Errors	Check pipe status and load history; failed file handling must be operationalized

Transformation and Orchestration

Streams, Tasks, and Dynamic Tables

Feature	Use When	Core Concept	Common Trap
Stream	Need change data from table/view changes	Tracks offset of changes since last consumption	A stream is not a standalone queue; it depends on source data retention and consumption
Standard stream	Need inserts, deletes, updates	Updates appear with change metadata	Must handle delete/update semantics correctly
Append-only stream	Only new inserts matter	More efficient insert-only change tracking	Wrong choice if updates/deletes must be captured
Task	Need scheduled or event-like SQL execution	Runs SQL, stored procedure, or graph step	Tasks must be resumed; child tasks depend on graph state
Task graph	Need ordered multi-step pipeline	Root task plus child dependencies	Design for retries, failure isolation, and idempotency
Dynamic table	Need declarative refreshed result table	Snowflake refreshes toward target lag	Not a replacement for every task/procedure workflow
Stored procedure	Need procedural control flow	JavaScript, Snowpark Python/Java/Scala, SQL procedures	Avoid row-by-row procedural processing for set operations

Stream Metadata

Metadata Column	Meaning	Use
`METADATA$ACTION`	Insert or delete action	Drive `MERGE`, deletes, audit logic
`METADATA$ISUPDATE`	Whether row is part of an update operation	Distinguish update pairs from simple insert/delete behavior
`METADATA$ROW_ID`	Stable row identifier for change tracking	Useful for dedup/change handling patterns

Task and Stream Pattern

CREATE OR REPLACE STREAM stg.orders_stream
  ON TABLE stg.orders;

CREATE OR REPLACE TASK mart.merge_orders_task
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 * * * * UTC'
  WHEN SYSTEM$STREAM_HAS_DATA('stg.orders_stream')
AS
MERGE INTO mart.orders AS t
USING (
  SELECT *
  FROM stg.orders_stream
  WHERE METADATA$ACTION = 'INSERT'
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  t.status = s.status,
  t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT
  (order_id, status, updated_at)
VALUES
  (s.order_id, s.status, s.updated_at);

ALTER TASK mart.merge_orders_task RESUME;

Exam reminders:

Consuming a stream occurs when the DML statement using it commits.
Wrap multi-step stream consumption in a transaction when consistency matters.
SYSTEM$STREAM_HAS_DATA is useful as a task WHEN condition, but the downstream SQL still needs to be idempotent.
Separate ingestion, validation, merge, and serving layers when troubleshooting clarity matters.

Dynamic Table Pattern

CREATE OR REPLACE DYNAMIC TABLE mart.daily_sales
  TARGET_LAG = '1 hour'
  WAREHOUSE = etl_wh
AS
SELECT
  order_date,
  SUM(net_amount) AS net_sales,
  COUNT(*) AS order_count
FROM clean.orders
GROUP BY order_date;

Dynamic Table Choice	Choose It When
`TARGET_LAG`	You need Snowflake-managed refresh toward a freshness objective
`TARGET_LAG = DOWNSTREAM`	Refresh should be driven by downstream dependencies
Incremental refresh	Query is eligible and incremental maintenance is efficient
Full refresh	Logic is not incrementally maintainable or full recompute is acceptable
Tasks instead	You need procedural steps, conditional branches, external calls, or custom retry logic

Table Design and Storage

Table Type Reference

Table Type	Best For	Key Behavior
Permanent table	Durable production data	Supports Snowflake data protection features such as Time Travel and Fail-safe behavior
Transient table	Rebuildable data, staging, derived data	No Fail-safe; useful when recovery requirements are lower
Temporary table	Session-scoped intermediate work	Exists only for the session; can shadow permanent object names in the session
External table	Query files in external storage	Metadata must reflect external file changes
Dynamic table	Managed refreshed table from a query	Stores derived results maintained by Snowflake
Iceberg table	Open table format interoperability	Requires understanding catalog/storage ownership

Micro-Partition and Clustering Reference

Concept	Practical Meaning
Micro-partition	Immutable internal storage unit; Snowflake stores metadata for pruning
Pruning	Skips micro-partitions that cannot satisfy query predicates
Natural clustering	Data order created by load patterns may already support pruning
Clustering key	Expression list used to improve organization for pruning
Clustering depth	Indicator of how well data is clustered for selected keys
Automatic Clustering	Snowflake-managed maintenance for defined clustering keys
Re-clustering cost	More clustering is not always better; use for large tables with selective filters

Choose clustering when:

Table is large enough that pruning matters.
Queries repeatedly filter on predictable columns.
Natural load order does not already cluster data well.
Maintenance cost is justified by query improvement.

Avoid clustering when:

Table is small.
Filters are unpredictable.
High-churn DML causes excessive maintenance.
Query bottleneck is joins, spilling, or concurrency rather than scan pruning.

Optimization Feature Comparison

Feature	Helps With	Not For
Clustering key	Range/filter pruning on large tables	Small tables or random access workloads alone
Search Optimization Service	Highly selective point lookups, some semi-structured search patterns	Broad scans and general warehouse sizing problems
Materialized view	Repeated expensive query patterns with maintained results	Arbitrary complex transformations or every dashboard query
Dynamic table	Fresh, declarative derived table pipeline	Procedural orchestration
Query Acceleration Service	Eligible scan-heavy queries with selective processing	Bad SQL logic, missing joins, or universal acceleration
Warehouse scale-up	Complex single queries, memory pressure, large joins	High concurrency alone
Multi-cluster warehouse	Concurrent user/query demand	Making one individual query faster
Result cache	Repeated identical eligible queries	ETL correctness, freshness guarantees, or parameterized workload design

Semi-Structured Data

Data Types and Functions

Item	Use
`VARIANT`	Store semi-structured values such as JSON
`OBJECT`	Key-value structure
`ARRAY`	Ordered list
`PARSE_JSON`	Convert JSON text to `VARIANT`
`TO_VARIANT`	Convert SQL value to `VARIANT`
`FLATTEN`	Explode arrays/objects into rows
Dot / bracket notation	Navigate nested values
Casts such as `::STRING`, `::NUMBER`	Convert variant values for typed processing

SELECT
  payload:customer.id::STRING AS customer_id,
  item.value:sku::STRING AS sku,
  item.value:quantity::NUMBER AS quantity
FROM raw.events,
LATERAL FLATTEN(input => payload:items) AS item;

High-yield distinctions:

JSON null and SQL NULL are not always equivalent in processing.
Cast extracted values before joining, grouping, or applying numeric logic.
Flatten only the level required; excessive flattening can explode row counts.
For frequently queried attributes, consider projecting into typed columns or derived tables.

Snowpark, UDFs, and Stored Procedures

Tool	Best Use	Exam Distinction
Snowpark DataFrame API	Pushdown transformations in Python, Java, or Scala	Lazy execution; operations are planned for Snowflake execution
Scalar UDF	Reusable row-level function	Good for deterministic expressions, not orchestration
UDTF	Function returning rows/table output	Useful for expanding or parsing custom structures
Stored procedure	Control flow, orchestration, multi-step logic	Can be called from tasks
External function	Call external service through configured integration	Requires security/network design
External access integration	Allows governed outbound access from supported code	Do not hard-code secrets or network assumptions
Packages/imports	Bring dependencies into Snowpark code	Versioning and allowed packages matter operationally

Practical design rules:

Prefer SQL set operations for relational transformations.
Use Snowpark when code libraries, complex logic, or developer language preference justify it.
Keep large datasets inside Snowflake; avoid collecting data to the client.
Use stored procedures for orchestration, not as a substitute for set-based SQL performance.

Security, Governance, and Data Protection

RBAC and Grants

Concept	Meaning	Exam Focus
Role-based access control	Privileges granted to roles, roles granted to users/roles	Design least-privilege access
`USAGE` privilege	Allows use of database, schema, warehouse, integration	Required but not sufficient for object access
Object privileges	`SELECT`, `INSERT`, `UPDATE`, `DELETE`, etc.	Grant only what the pipeline needs
`OWNERSHIP`	Full control over object and grant management	Transferring ownership can affect grants
Future grants	Apply grants to future objects in a schema/database	Useful for repeatable pipeline object creation
Managed access schema	Centralizes grant management through schema owner	Good for controlled environments
Secondary roles	Additional active roles for a session	Know how effective privileges are determined

Example least-privilege pattern:

GRANT USAGE ON WAREHOUSE etl_wh TO ROLE data_engineer_role;
GRANT USAGE ON DATABASE analytics TO ROLE data_engineer_role;
GRANT USAGE ON SCHEMA analytics.stg TO ROLE data_engineer_role;
GRANT SELECT, INSERT, UPDATE, DELETE
  ON ALL TABLES IN SCHEMA analytics.stg
  TO ROLE data_engineer_role;

Governance Feature Reference

Feature	Purpose	Use When
Masking policy	Dynamically mask column values	Sensitive fields require role/context-based display
Row access policy	Filter rows based on policy logic	Multi-tenant, regional, or entitlement-based access
Tag	Classify objects/columns	Governance, lineage, policy automation
Tag-based masking	Apply masking based on classification	Many sensitive columns need consistent control
Secure view	Hide query definition and restrict optimization exposure	Controlled data sharing or sensitive logic
Secure UDF	Protect function definition	Sensitive business logic
Access History	Audit who accessed what	Compliance, investigations, least-privilege review
Object dependencies	Understand upstream/downstream impact	Change management and pipeline safety

Data Protection and Recovery

Feature	Use	Do Not Confuse With
Time Travel	Query, clone, or restore prior object state within retention	Long-term backup strategy by itself
`UNDROP`	Restore recently dropped objects when eligible	Fixing corrupted logic after retention expires
Zero-copy clone	Create metadata-based copy for dev/test/recovery	Full physical copy at creation time
Fail-safe	Snowflake-managed last-resort recovery for permanent data	User-queryable recovery workflow
Replication	Copy supported databases/account objects across regions/clouds	Query cache or Time Travel
Failover group	Coordinated failover for supported objects	Simple table clone

CREATE TABLE dev.orders_clone
  CLONE prod.orders;

CREATE TABLE recovery.orders_before_change
  CLONE prod.orders
  AT (TIMESTAMP => '2026-06-18 10:00:00'::TIMESTAMP);

Feature	Best For	Notes
Secure Data Sharing	Share live data without copying	Provider controls shared objects and grants
Reader account	Share with consumers without their own Snowflake account	Provider administers reader account
Listing / Marketplace	Discoverable data product distribution	Can be private or public depending on setup
Secure view in share	Restrict columns/rows before sharing	Combine with masking/row access where needed
Replicated share	Cross-region/cloud consumer access patterns	Requires replication design and operational awareness

Exam traps:

Sharing exposes objects through grants; it does not copy table data into the consumer account.
Secure views are common when the provider must restrict shared data.
Consumer performance depends on consumer-side compute for querying shared data.
Governance policies must be tested from the consumer’s effective role/context.

Performance and Troubleshooting

Symptom-to-Action Table

Symptom	Check First	Likely Actions
Query scans too much data	Query Profile, partitions scanned, filter predicates	Add selective predicates, improve clustering, project typed columns
Query spills to disk	Query Profile spill indicators, join size, warehouse size	Scale up warehouse, reduce data earlier, optimize joins
Many queued queries	Warehouse load, concurrency, multi-cluster settings	Scale out, separate workloads, adjust task/user warehouses
One complex query is slow	Execution plan, joins, aggregates, memory	Scale up, rewrite SQL, precompute, use materialized view/dynamic table
Dashboard repeats same expensive query	Query history, result reuse, freshness needs	Consider materialized view, dynamic table, aggregate table
Point lookups slow on huge table	Filter selectivity, search predicates	Consider Search Optimization Service
Pipeline reprocesses files	Copy history, file names, `FORCE` use	Fix idempotency and replay design
Task did not run	Task state, schedule, dependencies, `WHEN` condition	Resume task, inspect task history, check stream has data
Stream stale or empty unexpectedly	Source retention, consumption transaction, task failures	Consume regularly; monitor stream/task health
Snowpipe not loading	Pipe status, notification setup, copy history	Validate stage path, file format, cloud events, pipe errors

Query Profile Review Checklist

Query Profile Area	What It Tells You
Compilation time	Excessive optimization/metadata overhead or complex SQL generation
Bytes scanned	Whether pruning/projection is effective
Partitions scanned	Micro-partition pruning quality
Join nodes	Join type, build/probe imbalance, missing filters
Aggregate nodes	High-cardinality grouping or late reduction
Sort nodes	Large ordering/window operations
Spill indicators	Memory pressure; warehouse sizing or SQL rewrite needed
Remote vs local I/O	Cache effectiveness and scan cost
Queuing	Warehouse concurrency or resource availability issue

Warehouse Sizing Decisions

Scenario	Better Lever
Single long-running complex query	Larger warehouse, SQL rewrite, precompute
Many simultaneous queries	Multi-cluster warehouse or workload separation
ETL and BI interfere with each other	Separate warehouses by workload
Intermittent workloads	Auto-suspend/auto-resume with appropriate settings
Repeated transformations	Tasks/dynamic tables with dedicated warehouse
Cost governance	Resource monitors, query review, workload isolation

Operational Monitoring Reference

Need	Snowflake Area to Check
Query execution details	Query History, Query Profile
Warehouse credit and load patterns	Warehouse metering/load history
File loads	Copy history, load history
Snowpipe health	Pipe status, pipe usage/history
Task runs	Task history and task graph state
Stream consumption	Stream metadata, task history, source table changes
Access auditing	Access History, Login History
Object changes	Account Usage views, object dependency metadata
Data quality issues	Rejected files, quarantine tables, validation queries
Replication/failover	Replication/failover group status and history

Useful monitoring query patterns:

SELECT
  query_id,
  warehouse_name,
  execution_status,
  start_time,
  bytes_scanned
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'ETL_WH'
ORDER BY start_time DESC;

SELECT *
FROM TABLE(information_schema.task_history(
  TASK_NAME => 'MERGE_ORDERS_TASK'
))
ORDER BY scheduled_time DESC;

Data Engineering Design Patterns

Bronze / Silver / Gold Mapping

Layer	Snowflake Objects	Design Goal
Raw / Bronze	Raw tables, `VARIANT`, source metadata	Preserve source fidelity and lineage
Clean / Silver	Typed tables, validation results, deduped records	Standardize schema and enforce quality
Curated / Gold	Marts, aggregates, dynamic tables, materialized views	Serve analytics, ML, and downstream apps
Governance	Policies, tags, secure views	Enforce access and classification
Operations	Tasks, streams, monitoring tables	Make pipeline health visible

Idempotent Pipeline Checklist

Use deterministic file naming and load tracking.
Capture METADATA$FILENAME and load timestamp.
Separate raw ingestion from business transformation.
Use MERGE for upserts where duplicate events are possible.
Process streams in transactions when multiple target updates must stay consistent.
Design replay procedures before production failure.
Quarantine bad records instead of silently discarding them.
Monitor task, pipe, and copy history.
Keep transformations set-based wherever possible.

Data Quality Controls

Control	Example
Type validation	`TRY_TO_NUMBER`, `TRY_TO_DATE`, explicit casts
Required fields	Reject or quarantine rows with missing keys
Range checks	Validate amounts, timestamps, status codes
Deduplication	Window functions over business keys and event timestamps
Referential checks	Compare staged keys to dimension tables
Schema drift detection	Compare inferred file schema to expected contract
Audit counts	Source count, loaded count, rejected count, merged count
Lineage fields	Source file, row number, batch ID, load time

High-Yield Exam Traps

Trap	Correct Mental Model
“Bigger warehouse always fixes performance.”	Scale up helps some single-query bottlenecks; concurrency may need scale-out or workload isolation
“A stream stores all change rows forever.”	A stream tracks change offsets over retained source data; monitor staleness and consumption
“Snowpipe replaces transformations.”	Snowpipe loads files; downstream tasks/dynamic tables usually transform
“Temporary tables are harmless with duplicate names.”	Temporary objects can shadow permanent names in a session
“Fail-safe is for normal user recovery.”	Use Time Travel, cloning, and `UNDROP` for routine recovery
“Clustering should be added to every large table.”	Add clustering only when query pruning benefit justifies maintenance
“Semi-structured data requires no modeling.”	Frequently queried fields often belong in typed columns or curated tables
“Secure Data Sharing copies provider data.”	Sharing allows live access without copying into the consumer account
“Stored procedures are faster than SQL.”	Set-based SQL is usually preferred for relational transformations
“Result cache proves the query is optimized.”	Cache can hide inefficient scans; inspect cold-query behavior and profile

Final DEA-C02 Review Checklist

Before taking DEA-C02, make sure you can quickly answer:

Which ingestion option fits batch files, continuous files, and low-latency streaming rows?
How do COPY INTO, Snowpipe, stages, file formats, and load history interact?
When should you choose streams/tasks versus dynamic tables?
How do stream metadata columns affect MERGE logic?
What table type fits production, staging, session-only, and externally stored data?
How do micro-partitions, pruning, clustering, and search optimization differ?
When should you scale a warehouse up versus scale out?
How do masking policies, row access policies, secure views, and tags work together?
How do Time Travel, cloning, UNDROP, replication, and Fail-safe differ?
What monitoring view or history area would you check for a failed load, task, pipe, or slow query?

Next step: turn each table above into scenario drills, then practice timed Snowflake data engineering questions that force you to choose the best ingestion, transformation, security, and performance pattern for the stated constraints.

Scenario Guide

Sourcing and Ingestion