DEA-C02 — Snowflake SnowPro Advanced: Data Engineer Quick Reference

Compact DEA-C02 quick reference for Snowflake data engineering patterns, pipelines, performance, security, and operations.

How to Use This Quick Reference

This independent Quick Reference supports preparation for the Snowflake SnowPro Advanced: Data Engineer (DEA-C02) exam from Snowflake. It focuses on practical decisions a data engineer must make in Snowflake: ingestion pattern, transformation design, stream/task orchestration, table optimization, governance, and troubleshooting.

Use it to review:

  • Which Snowflake feature fits a scenario.
  • How ingestion, CDC, and orchestration components interact.
  • High-yield SQL patterns and common traps.
  • Performance and reliability signals to check during troubleshooting.

Core Snowflake Architecture Map

Layer / ConceptWhat It DoesData Engineer Exam Relevance
Cloud services layerAuthentication, metadata, optimizer, access control, transactionsExplains metadata pruning, query compilation, RBAC, Time Travel metadata, and object management
Storage layerCompressed columnar data in immutable micro-partitionsDrives pruning, clustering, cloning, Time Travel, and storage design
Virtual warehousesUser-managed compute for SQL, loading, transformations, SnowparkTune size for query complexity; scale out for concurrency
Serverless computeSnowflake-managed compute for selected servicesUsed by features such as Snowpipe, automatic services, and some task/service patterns
Micro-partitionsInternal storage units with metadata such as ranges and distinct valuesEnables pruning; clustering can improve pruning for large tables
MetadataFile load history, table statistics, object definitions, access metadataUsed for idempotent loads, query optimization, auditing, and troubleshooting
Result cacheReuses prior query result when eligibleUseful but should not be treated as a pipeline correctness mechanism
Warehouse cacheLocal data cache on a running warehouseCan improve repeated scans; lost when warehouse suspends or changes depending on execution context

Feature Selection Matrix

NeedPreferAvoid / Watch For
Batch load staged filesCOPY INTO <table>Do not build custom loaders unless you need special processing outside Snowflake
Continuous file ingestionSnowpipeSnowpipe is not the same as transactional streaming; design for notification and ingestion latency
Low-latency row ingestionSnowpipe Streaming or connector-based streaming patternDo not force tiny files through batch COPY if continuous row ingestion is required
Declarative refresh of derived tablesDynamic tablesNot ideal for procedural branching, complex side effects, or custom retry logic
Incremental CDC from Snowflake table changesStreams + tasksStreams are offsets over change data, not physical event queues
Scheduled SQL orchestrationTasksRemember task graphs need resumed tasks and dependency design
Procedural orchestration in SnowflakeStored procedures called by tasksAvoid putting simple set-based SQL into procedural loops
Python/Java/Scala transformation pushed to SnowflakeSnowparkWatch for code that pulls large data to the client instead of pushing down
Point lookup accelerationSearch Optimization ServiceDo not use as a default substitute for good table design
Repeated expensive aggregate/query patternMaterialized view, dynamic table, or precomputed tableConfirm freshness and maintenance tradeoffs
Scan-heavy outlier queriesQuery Acceleration Service where eligibleNot a fix for bad joins, poor filters, or under-sized architecture
Cross-region/cloud DRReplication / failover groups where supportedDo not confuse Time Travel with disaster recovery
Share data without copyingSecure Data SharingSecure views/policies may be needed to restrict exposed rows/columns

Data Loading and Ingestion

Loading Pattern Decision Table

PatternBest ForKey ObjectsExam Traps
Manual or scheduled batch loadPeriodic files, controlled windowsStage, file format, COPY INTOCOPY load metadata prevents duplicate file loads unless forced; understand ON_ERROR behavior
Snowpipe auto-ingestEvent-driven files from cloud storageExternal stage, pipe, cloud notification integrationIt is continuous ingestion, not instant transformation orchestration
Snowpipe REST APIApplication-triggered file ingestPipe, REST callApplication still stages files; not row-by-row streaming
Snowpipe StreamingLow-latency streaming rowsSDK/connector/channel patternDifferent mental model from staged-file COPY
External tablesQuery data in external storageExternal stage, external table, metadata refreshQuerying external data is not the same as loading into Snowflake-managed storage
Iceberg tablesOpen table format / lakehouse interoperabilityCatalog integration, external volume, Iceberg tableUnderstand ownership of metadata and storage before choosing
Connector ingestionSaaS/app/system ingestionSnowflake connectors or partner toolsFocus on governance, idempotency, and operational monitoring

Stage and File Format Reference

Object / OptionPurposePractical Notes
Internal stageSnowflake-managed file stagingGood for controlled uploads and transient staging
External stageCloud object storage locationPrefer storage integrations over embedded cloud credentials
Named file formatReusable parsing definitionKeeps COPY statements consistent and auditable
PATTERNRegex filter for staged filesUseful but easy to overcomplicate; test carefully
FILESExplicit list of filesGood for deterministic loads
VALIDATION_MODEValidate load without committing rowsUseful before production loads
ON_ERRORControls behavior for bad rows/filesCONTINUE can hide data quality issues if not monitored
MATCH_BY_COLUMN_NAMEMap columns by name where supportedUseful when file column order changes
Metadata columnsCapture filename, row number, timestampsUse for lineage, replay, and quarantine workflows

High-Yield COPY INTO Pattern

CREATE OR REPLACE FILE FORMAT ff_json
  TYPE = JSON
  STRIP_OUTER_ARRAY = TRUE;

COPY INTO raw.events
  (payload, source_file, loaded_at)
FROM (
  SELECT
    $1,
    METADATA$FILENAME,
    CURRENT_TIMESTAMP()
  FROM @raw_event_stage
)
FILE_FORMAT = (FORMAT_NAME = ff_json)
ON_ERROR = 'CONTINUE';

Key points:

  • Use named stages and file formats for repeatability.
  • Persist source file metadata for lineage and replay.
  • Monitor rejected rows/files; ON_ERROR = 'CONTINUE' is not a data quality solution by itself.
  • COPY maintains load metadata, so reloading the same file requires intentional design.

Snowpipe Reference

CREATE OR REPLACE PIPE ingest.event_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO raw.events
FROM @raw_event_stage
FILE_FORMAT = (FORMAT_NAME = ff_json);
TopicRemember
Auto-ingestUses cloud notifications to trigger ingestion from a stage
Pipe definitionContains a COPY INTO statement
MonitoringUse pipe history and copy history views/functions
TransformationsKeep pipe transformations simple; use downstream tasks/dynamic tables for complex logic
IdempotencyFile naming, load metadata, and replay procedure matter
ErrorsCheck pipe status and load history; failed file handling must be operationalized

Transformation and Orchestration

Streams, Tasks, and Dynamic Tables

FeatureUse WhenCore ConceptCommon Trap
StreamNeed change data from table/view changesTracks offset of changes since last consumptionA stream is not a standalone queue; it depends on source data retention and consumption
Standard streamNeed inserts, deletes, updatesUpdates appear with change metadataMust handle delete/update semantics correctly
Append-only streamOnly new inserts matterMore efficient insert-only change trackingWrong choice if updates/deletes must be captured
TaskNeed scheduled or event-like SQL executionRuns SQL, stored procedure, or graph stepTasks must be resumed; child tasks depend on graph state
Task graphNeed ordered multi-step pipelineRoot task plus child dependenciesDesign for retries, failure isolation, and idempotency
Dynamic tableNeed declarative refreshed result tableSnowflake refreshes toward target lagNot a replacement for every task/procedure workflow
Stored procedureNeed procedural control flowJavaScript, Snowpark Python/Java/Scala, SQL proceduresAvoid row-by-row procedural processing for set operations

Stream Metadata

Metadata ColumnMeaningUse
METADATA$ACTIONInsert or delete actionDrive MERGE, deletes, audit logic
METADATA$ISUPDATEWhether row is part of an update operationDistinguish update pairs from simple insert/delete behavior
METADATA$ROW_IDStable row identifier for change trackingUseful for dedup/change handling patterns

Task and Stream Pattern

CREATE OR REPLACE STREAM stg.orders_stream
  ON TABLE stg.orders;

CREATE OR REPLACE TASK mart.merge_orders_task
  WAREHOUSE = etl_wh
  SCHEDULE = 'USING CRON 0 * * * * UTC'
  WHEN SYSTEM$STREAM_HAS_DATA('stg.orders_stream')
AS
MERGE INTO mart.orders AS t
USING (
  SELECT *
  FROM stg.orders_stream
  WHERE METADATA$ACTION = 'INSERT'
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  t.status = s.status,
  t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT
  (order_id, status, updated_at)
VALUES
  (s.order_id, s.status, s.updated_at);

ALTER TASK mart.merge_orders_task RESUME;

Exam reminders:

  • Consuming a stream occurs when the DML statement using it commits.
  • Wrap multi-step stream consumption in a transaction when consistency matters.
  • SYSTEM$STREAM_HAS_DATA is useful as a task WHEN condition, but the downstream SQL still needs to be idempotent.
  • Separate ingestion, validation, merge, and serving layers when troubleshooting clarity matters.

Dynamic Table Pattern

CREATE OR REPLACE DYNAMIC TABLE mart.daily_sales
  TARGET_LAG = '1 hour'
  WAREHOUSE = etl_wh
AS
SELECT
  order_date,
  SUM(net_amount) AS net_sales,
  COUNT(*) AS order_count
FROM clean.orders
GROUP BY order_date;
Dynamic Table ChoiceChoose It When
TARGET_LAGYou need Snowflake-managed refresh toward a freshness objective
TARGET_LAG = DOWNSTREAMRefresh should be driven by downstream dependencies
Incremental refreshQuery is eligible and incremental maintenance is efficient
Full refreshLogic is not incrementally maintainable or full recompute is acceptable
Tasks insteadYou need procedural steps, conditional branches, external calls, or custom retry logic

Table Design and Storage

Table Type Reference

Table TypeBest ForKey Behavior
Permanent tableDurable production dataSupports Snowflake data protection features such as Time Travel and Fail-safe behavior
Transient tableRebuildable data, staging, derived dataNo Fail-safe; useful when recovery requirements are lower
Temporary tableSession-scoped intermediate workExists only for the session; can shadow permanent object names in the session
External tableQuery files in external storageMetadata must reflect external file changes
Dynamic tableManaged refreshed table from a queryStores derived results maintained by Snowflake
Iceberg tableOpen table format interoperabilityRequires understanding catalog/storage ownership

Micro-Partition and Clustering Reference

ConceptPractical Meaning
Micro-partitionImmutable internal storage unit; Snowflake stores metadata for pruning
PruningSkips micro-partitions that cannot satisfy query predicates
Natural clusteringData order created by load patterns may already support pruning
Clustering keyExpression list used to improve organization for pruning
Clustering depthIndicator of how well data is clustered for selected keys
Automatic ClusteringSnowflake-managed maintenance for defined clustering keys
Re-clustering costMore clustering is not always better; use for large tables with selective filters

Choose clustering when:

  • Table is large enough that pruning matters.
  • Queries repeatedly filter on predictable columns.
  • Natural load order does not already cluster data well.
  • Maintenance cost is justified by query improvement.

Avoid clustering when:

  • Table is small.
  • Filters are unpredictable.
  • High-churn DML causes excessive maintenance.
  • Query bottleneck is joins, spilling, or concurrency rather than scan pruning.

Optimization Feature Comparison

FeatureHelps WithNot For
Clustering keyRange/filter pruning on large tablesSmall tables or random access workloads alone
Search Optimization ServiceHighly selective point lookups, some semi-structured search patternsBroad scans and general warehouse sizing problems
Materialized viewRepeated expensive query patterns with maintained resultsArbitrary complex transformations or every dashboard query
Dynamic tableFresh, declarative derived table pipelineProcedural orchestration
Query Acceleration ServiceEligible scan-heavy queries with selective processingBad SQL logic, missing joins, or universal acceleration
Warehouse scale-upComplex single queries, memory pressure, large joinsHigh concurrency alone
Multi-cluster warehouseConcurrent user/query demandMaking one individual query faster
Result cacheRepeated identical eligible queriesETL correctness, freshness guarantees, or parameterized workload design

Semi-Structured Data

Data Types and Functions

ItemUse
VARIANTStore semi-structured values such as JSON
OBJECTKey-value structure
ARRAYOrdered list
PARSE_JSONConvert JSON text to VARIANT
TO_VARIANTConvert SQL value to VARIANT
FLATTENExplode arrays/objects into rows
Dot / bracket notationNavigate nested values
Casts such as ::STRING, ::NUMBERConvert variant values for typed processing
SELECT
  payload:customer.id::STRING AS customer_id,
  item.value:sku::STRING AS sku,
  item.value:quantity::NUMBER AS quantity
FROM raw.events,
LATERAL FLATTEN(input => payload:items) AS item;

High-yield distinctions:

  • JSON null and SQL NULL are not always equivalent in processing.
  • Cast extracted values before joining, grouping, or applying numeric logic.
  • Flatten only the level required; excessive flattening can explode row counts.
  • For frequently queried attributes, consider projecting into typed columns or derived tables.

Snowpark, UDFs, and Stored Procedures

ToolBest UseExam Distinction
Snowpark DataFrame APIPushdown transformations in Python, Java, or ScalaLazy execution; operations are planned for Snowflake execution
Scalar UDFReusable row-level functionGood for deterministic expressions, not orchestration
UDTFFunction returning rows/table outputUseful for expanding or parsing custom structures
Stored procedureControl flow, orchestration, multi-step logicCan be called from tasks
External functionCall external service through configured integrationRequires security/network design
External access integrationAllows governed outbound access from supported codeDo not hard-code secrets or network assumptions
Packages/importsBring dependencies into Snowpark codeVersioning and allowed packages matter operationally

Practical design rules:

  • Prefer SQL set operations for relational transformations.
  • Use Snowpark when code libraries, complex logic, or developer language preference justify it.
  • Keep large datasets inside Snowflake; avoid collecting data to the client.
  • Use stored procedures for orchestration, not as a substitute for set-based SQL performance.

Security, Governance, and Data Protection

RBAC and Grants

ConceptMeaningExam Focus
Role-based access controlPrivileges granted to roles, roles granted to users/rolesDesign least-privilege access
USAGE privilegeAllows use of database, schema, warehouse, integrationRequired but not sufficient for object access
Object privilegesSELECT, INSERT, UPDATE, DELETE, etc.Grant only what the pipeline needs
OWNERSHIPFull control over object and grant managementTransferring ownership can affect grants
Future grantsApply grants to future objects in a schema/databaseUseful for repeatable pipeline object creation
Managed access schemaCentralizes grant management through schema ownerGood for controlled environments
Secondary rolesAdditional active roles for a sessionKnow how effective privileges are determined

Example least-privilege pattern:

GRANT USAGE ON WAREHOUSE etl_wh TO ROLE data_engineer_role;
GRANT USAGE ON DATABASE analytics TO ROLE data_engineer_role;
GRANT USAGE ON SCHEMA analytics.stg TO ROLE data_engineer_role;
GRANT SELECT, INSERT, UPDATE, DELETE
  ON ALL TABLES IN SCHEMA analytics.stg
  TO ROLE data_engineer_role;

Governance Feature Reference

FeaturePurposeUse When
Masking policyDynamically mask column valuesSensitive fields require role/context-based display
Row access policyFilter rows based on policy logicMulti-tenant, regional, or entitlement-based access
TagClassify objects/columnsGovernance, lineage, policy automation
Tag-based maskingApply masking based on classificationMany sensitive columns need consistent control
Secure viewHide query definition and restrict optimization exposureControlled data sharing or sensitive logic
Secure UDFProtect function definitionSensitive business logic
Access HistoryAudit who accessed whatCompliance, investigations, least-privilege review
Object dependenciesUnderstand upstream/downstream impactChange management and pipeline safety

Data Protection and Recovery

FeatureUseDo Not Confuse With
Time TravelQuery, clone, or restore prior object state within retentionLong-term backup strategy by itself
UNDROPRestore recently dropped objects when eligibleFixing corrupted logic after retention expires
Zero-copy cloneCreate metadata-based copy for dev/test/recoveryFull physical copy at creation time
Fail-safeSnowflake-managed last-resort recovery for permanent dataUser-queryable recovery workflow
ReplicationCopy supported databases/account objects across regions/cloudsQuery cache or Time Travel
Failover groupCoordinated failover for supported objectsSimple table clone
CREATE TABLE dev.orders_clone
  CLONE prod.orders;

CREATE TABLE recovery.orders_before_change
  CLONE prod.orders
  AT (TIMESTAMP => '2026-06-18 10:00:00'::TIMESTAMP);

Data Sharing and Collaboration

FeatureBest ForNotes
Secure Data SharingShare live data without copyingProvider controls shared objects and grants
Reader accountShare with consumers without their own Snowflake accountProvider administers reader account
Listing / MarketplaceDiscoverable data product distributionCan be private or public depending on setup
Secure view in shareRestrict columns/rows before sharingCombine with masking/row access where needed
Replicated shareCross-region/cloud consumer access patternsRequires replication design and operational awareness

Exam traps:

  • Sharing exposes objects through grants; it does not copy table data into the consumer account.
  • Secure views are common when the provider must restrict shared data.
  • Consumer performance depends on consumer-side compute for querying shared data.
  • Governance policies must be tested from the consumer’s effective role/context.

Performance and Troubleshooting

Symptom-to-Action Table

SymptomCheck FirstLikely Actions
Query scans too much dataQuery Profile, partitions scanned, filter predicatesAdd selective predicates, improve clustering, project typed columns
Query spills to diskQuery Profile spill indicators, join size, warehouse sizeScale up warehouse, reduce data earlier, optimize joins
Many queued queriesWarehouse load, concurrency, multi-cluster settingsScale out, separate workloads, adjust task/user warehouses
One complex query is slowExecution plan, joins, aggregates, memoryScale up, rewrite SQL, precompute, use materialized view/dynamic table
Dashboard repeats same expensive queryQuery history, result reuse, freshness needsConsider materialized view, dynamic table, aggregate table
Point lookups slow on huge tableFilter selectivity, search predicatesConsider Search Optimization Service
Pipeline reprocesses filesCopy history, file names, FORCE useFix idempotency and replay design
Task did not runTask state, schedule, dependencies, WHEN conditionResume task, inspect task history, check stream has data
Stream stale or empty unexpectedlySource retention, consumption transaction, task failuresConsume regularly; monitor stream/task health
Snowpipe not loadingPipe status, notification setup, copy historyValidate stage path, file format, cloud events, pipe errors

Query Profile Review Checklist

Query Profile AreaWhat It Tells You
Compilation timeExcessive optimization/metadata overhead or complex SQL generation
Bytes scannedWhether pruning/projection is effective
Partitions scannedMicro-partition pruning quality
Join nodesJoin type, build/probe imbalance, missing filters
Aggregate nodesHigh-cardinality grouping or late reduction
Sort nodesLarge ordering/window operations
Spill indicatorsMemory pressure; warehouse sizing or SQL rewrite needed
Remote vs local I/OCache effectiveness and scan cost
QueuingWarehouse concurrency or resource availability issue

Warehouse Sizing Decisions

ScenarioBetter Lever
Single long-running complex queryLarger warehouse, SQL rewrite, precompute
Many simultaneous queriesMulti-cluster warehouse or workload separation
ETL and BI interfere with each otherSeparate warehouses by workload
Intermittent workloadsAuto-suspend/auto-resume with appropriate settings
Repeated transformationsTasks/dynamic tables with dedicated warehouse
Cost governanceResource monitors, query review, workload isolation

Operational Monitoring Reference

NeedSnowflake Area to Check
Query execution detailsQuery History, Query Profile
Warehouse credit and load patternsWarehouse metering/load history
File loadsCopy history, load history
Snowpipe healthPipe status, pipe usage/history
Task runsTask history and task graph state
Stream consumptionStream metadata, task history, source table changes
Access auditingAccess History, Login History
Object changesAccount Usage views, object dependency metadata
Data quality issuesRejected files, quarantine tables, validation queries
Replication/failoverReplication/failover group status and history

Useful monitoring query patterns:

SELECT
  query_id,
  warehouse_name,
  execution_status,
  start_time,
  bytes_scanned
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'ETL_WH'
ORDER BY start_time DESC;
SELECT *
FROM TABLE(information_schema.task_history(
  TASK_NAME => 'MERGE_ORDERS_TASK'
))
ORDER BY scheduled_time DESC;

Data Engineering Design Patterns

Bronze / Silver / Gold Mapping

LayerSnowflake ObjectsDesign Goal
Raw / BronzeRaw tables, VARIANT, source metadataPreserve source fidelity and lineage
Clean / SilverTyped tables, validation results, deduped recordsStandardize schema and enforce quality
Curated / GoldMarts, aggregates, dynamic tables, materialized viewsServe analytics, ML, and downstream apps
GovernancePolicies, tags, secure viewsEnforce access and classification
OperationsTasks, streams, monitoring tablesMake pipeline health visible

Idempotent Pipeline Checklist

  • Use deterministic file naming and load tracking.
  • Capture METADATA$FILENAME and load timestamp.
  • Separate raw ingestion from business transformation.
  • Use MERGE for upserts where duplicate events are possible.
  • Process streams in transactions when multiple target updates must stay consistent.
  • Design replay procedures before production failure.
  • Quarantine bad records instead of silently discarding them.
  • Monitor task, pipe, and copy history.
  • Keep transformations set-based wherever possible.

Data Quality Controls

ControlExample
Type validationTRY_TO_NUMBER, TRY_TO_DATE, explicit casts
Required fieldsReject or quarantine rows with missing keys
Range checksValidate amounts, timestamps, status codes
DeduplicationWindow functions over business keys and event timestamps
Referential checksCompare staged keys to dimension tables
Schema drift detectionCompare inferred file schema to expected contract
Audit countsSource count, loaded count, rejected count, merged count
Lineage fieldsSource file, row number, batch ID, load time

High-Yield Exam Traps

TrapCorrect Mental Model
“Bigger warehouse always fixes performance.”Scale up helps some single-query bottlenecks; concurrency may need scale-out or workload isolation
“A stream stores all change rows forever.”A stream tracks change offsets over retained source data; monitor staleness and consumption
“Snowpipe replaces transformations.”Snowpipe loads files; downstream tasks/dynamic tables usually transform
“Temporary tables are harmless with duplicate names.”Temporary objects can shadow permanent names in a session
“Fail-safe is for normal user recovery.”Use Time Travel, cloning, and UNDROP for routine recovery
“Clustering should be added to every large table.”Add clustering only when query pruning benefit justifies maintenance
“Semi-structured data requires no modeling.”Frequently queried fields often belong in typed columns or curated tables
“Secure Data Sharing copies provider data.”Sharing allows live access without copying into the consumer account
“Stored procedures are faster than SQL.”Set-based SQL is usually preferred for relational transformations
“Result cache proves the query is optimized.”Cache can hide inefficient scans; inspect cold-query behavior and profile

Final DEA-C02 Review Checklist

Before taking DEA-C02, make sure you can quickly answer:

  • Which ingestion option fits batch files, continuous files, and low-latency streaming rows?
  • How do COPY INTO, Snowpipe, stages, file formats, and load history interact?
  • When should you choose streams/tasks versus dynamic tables?
  • How do stream metadata columns affect MERGE logic?
  • What table type fits production, staging, session-only, and externally stored data?
  • How do micro-partitions, pruning, clustering, and search optimization differ?
  • When should you scale a warehouse up versus scale out?
  • How do masking policies, row access policies, secure views, and tags work together?
  • How do Time Travel, cloning, UNDROP, replication, and Fail-safe differ?
  • What monitoring view or history area would you check for a failed load, task, pipe, or slow query?

Next step: turn each table above into scenario drills, then practice timed Snowflake data engineering questions that force you to choose the best ingestion, transformation, security, and performance pattern for the stated constraints.