Databricks Certified Data Engineer Associate Quick Reference

Last revised: June 29, 2026

Compact exam-prep reference for Databricks Certified Data Engineer Associate candidates: Delta Lake, Spark SQL, ingestion, governance, jobs, and troubleshooting.

This Quick Reference is an independent study aid for candidates preparing for the Databricks Certified Data Engineer Associate exam, code Databricks DEA. It focuses on practical distinctions, commands, and design choices that commonly appear in data engineering scenarios on Databricks.

High-Yield Exam Map

Area	Know how to answer
Lakehouse architecture	Bronze/silver/gold design, Delta Lake as the table format, batch vs streaming, data quality layers
Delta Lake	ACID transactions, transaction log, time travel, schema enforcement/evolution, MERGE, OPTIMIZE, VACUUM
Spark SQL and DataFrames	Transformations vs actions, joins, aggregations, windows, null handling, deduplication, file/table reads
Ingestion	COPY INTO, Auto Loader, batch reads, streaming reads, schema inference, checkpoints
Orchestration	Databricks Workflows, jobs, tasks, dependencies, job clusters, scheduling, retries
Governance	Unity Catalog object hierarchy, catalogs, schemas, tables, views, volumes, privileges, managed vs external data
Performance and troubleshooting	Partitioning, file sizes, data skipping, caching, broadcast joins, shuffle, skew, query plans
Production behavior	Idempotency, incremental loads, checkpointing, permissions, alerts, parameterization

Lakehouse Object Model

Concept	Exam-ready meaning	Common trap
Workspace	User-facing Databricks environment for notebooks, jobs, clusters, repos, SQL assets	Workspace is not the same as the Unity Catalog metastore
Metastore	Governance container for Unity Catalog metadata	A workspace can be attached to a metastore; permissions are still object-level
Catalog	Top-level namespace in Unity Catalog	Catalogs contain schemas, not directly arbitrary notebooks
Schema	Namespace inside a catalog; similar to a database	In SQL, `USE SCHEMA` or fully qualify names to avoid wrong object references
Table	Structured data object, commonly Delta	A table may be managed or external
View	Saved query over tables/views	Standard views do not store data; permissions and lineage matter
Materialized view	Stores query results and can be refreshed	Not the same as a regular view
Volume	Unity Catalog object for non-tabular files	Use volumes for governed file access, not table queries
External location	Governed reference to cloud storage	Requires appropriate storage credential and privileges
Storage credential	Identity/credential Databricks uses to access cloud storage	Do not confuse with a user secret or personal access token

Naming Pattern

Prefer fully qualified object names in exam scenarios involving governance or multiple environments:

SELECT *
FROM catalog_name.schema_name.table_name;

Managed vs External Tables

Decision	Managed table	External table
Data location	Databricks-managed storage location	User-specified external path
Lifecycle	Dropping the table can remove both metadata and managed data	Dropping the table removes metadata, not necessarily underlying files
Governance	Governed through Unity Catalog	Governed through Unity Catalog plus external location controls
Best for	Default lakehouse tables, simplified lifecycle	Shared storage, existing data lakes, cross-system data ownership
Exam signal	“Let Databricks manage storage”	“Data already exists in cloud storage” or “retain files after dropping table”

Example DDL:

-- Managed Delta table
CREATE TABLE main.sales.orders (
  order_id STRING,
  order_ts TIMESTAMP,
  amount DECIMAL(10,2)
);

-- External Delta table
CREATE TABLE main.sales.orders_ext
LOCATION 's3://example-bucket/path/orders';

Delta Lake Essentials

Feature	What it does	Exam use
ACID transactions	Reliable concurrent reads/writes	Avoid corrupt partial writes
Transaction log	`_delta_log` records table versions	Enables time travel, rollback-style reads, metadata tracking
Schema enforcement	Rejects incompatible writes	Protects table quality
Schema evolution	Allows approved schema changes	Useful for evolving source data, but should be explicit in production
Time travel	Query older table versions or timestamps	Audit, reproduce, recover from bad writes
MERGE	Upsert/delete based on match condition	Incremental CDC-style loads
OPTIMIZE	Compacts small files	Improve scan performance
Z-ordering / clustering concepts	Co-locates related data for skipping	Useful for common filter columns; do not use blindly
VACUUM	Removes old unused data files	Can limit time travel and rollback options
Change Data Feed	Exposes row-level changes when enabled	Incremental downstream processing

Delta Time Travel

SELECT *
FROM main.sales.orders VERSION AS OF 12;

SELECT *
FROM main.sales.orders TIMESTAMP AS OF '2026-06-01T00:00:00Z';

High-yield distinction:

Need	Choose
Query previous table state	Time travel
Undo bad write manually	Read old version, then overwrite/restore using approved pattern
Free old file storage	VACUUM
Track row-level inserts/updates/deletes	Change Data Feed

Medallion Architecture

Layer	Purpose	Typical operations	Quality expectation
Bronze	Raw or lightly processed landing data	Ingest, append, capture metadata, preserve source fidelity	Low; keep source as received
Silver	Cleaned, conformed, deduplicated data	Parse, cast, validate, standardize, deduplicate, merge	Medium to high
Gold	Business-ready aggregates or marts	Aggregate, join dimensions, serve BI/ML use cases	High; curated and query-optimized

Common exam pattern:

Land source files into bronze.
Apply schema, validation, deduplication into silver.
Build aggregated or dimensional outputs in gold.
Orchestrate dependencies with a Databricks job or declarative pipeline.
Govern access by catalog/schema/table privileges.

Ingestion Selection Matrix

Requirement	Best fit	Why
Load files once or periodically with SQL	`COPY INTO`	Simple incremental file ingestion into Delta
Continuously ingest new files from cloud storage	Auto Loader	Scalable file discovery, schema handling, checkpointing
Read static files for ad hoc transformation	Spark batch read	Direct, flexible, not automatically incremental
Process event streams incrementally	Structured Streaming	Stateful streaming engine with checkpoints
Build managed declarative ETL with quality rules	Databricks declarative pipeline concepts	Pipeline orchestration, dependencies, expectations
Ingest small manual datasets	UI upload or simple table creation	Convenience, not production-scale ingestion

COPY INTO

Use when the source is file-based and the target is a Delta table.

COPY INTO main.bronze.orders_raw
FROM 's3://example-bucket/incoming/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');

Exam reminders:

Point	Remember
Incremental behavior	`COPY INTO` tracks previously loaded files for the target table
Best use	Simple file ingestion without custom streaming logic
Target	Typically a Delta table
Trap	It is not the same as `INSERT INTO SELECT` from an already registered table

Auto Loader

Use Auto Loader for scalable incremental file ingestion.

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/ops/checkpoints/orders_schema")
    .load("/Volumes/main/landing/orders")
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/Volumes/main/ops/checkpoints/orders_stream")
    .toTable("main.bronze.orders_raw")
)

Auto Loader concept	Exam meaning
`cloudFiles`	Format used by Auto Loader
Schema location	Stores inferred/evolving schema metadata
Checkpoint location	Tracks streaming progress and state
Rescue data	Captures unexpected columns or malformed fields depending on configuration
Incremental discovery	Processes new files without re-reading all old files

Batch vs Streaming

Scenario clue	Use batch	Use streaming
Files arrive once per day and can be loaded as a batch	Yes	Optional
Data must be processed continuously as it arrives	No	Yes
Query needs watermarking for late events	No	Yes
Need exact same transformation logic on bounded data	Yes	Sometimes with available-now style trigger
Job should terminate after processing available data	Yes	Use a bounded/available trigger if streaming ingestion is still desired
Stateful deduplication over time	Limited	Yes, with watermark/checkpoint

Structured Streaming Checkpoints

Rule	Why it matters
Each streaming query needs its own checkpoint	Prevents state/progress conflicts
Do not casually delete checkpoints	Can cause reprocessing or state loss
Use stable storage for checkpoints	Required for reliable recovery
Changing query logic may require checkpoint planning	State schema and output behavior can be incompatible

Spark SQL and DataFrame Core

Transformations vs Actions

Type	Examples	Behavior
Transformation	`select`, `filter`, `withColumn`, `join`, `groupBy`, `orderBy`	Lazy; builds logical plan
Action	`count`, `collect`, `show`, `write`, `display`	Triggers execution
Wide transformation	`groupBy`, `join`, `distinct`, `orderBy`	Often causes shuffle
Narrow transformation	`select`, simple `filter`, many column expressions	Usually no shuffle

Common trap: caching a DataFrame is lazy. It is materialized only after an action.

df_cached = df.filter("amount > 0").cache()
df_cached.count()  # materializes cache

SQL Operations to Know

Operation	Pattern	Exam note
Filter	`WHERE amount > 0`	Applied before aggregation
Aggregate filter	`HAVING count(*) > 1`	Applied after `GROUP BY`
Null comparison	`IS NULL`, `IS NOT NULL`	Do not use `= NULL`
Conditional logic	`CASE WHEN ... THEN ... END`	Useful for derived fields
Join	`INNER`, `LEFT`, `RIGHT`, `FULL`, `CROSS`, `ANTI`, `SEMI`	Know output semantics
Dedup	`DISTINCT`, `dropDuplicates`, window `row_number`	Window pattern gives deterministic survivor
Explode	`explode(array_col)`	Converts array elements to rows
Window	`OVER (PARTITION BY ... ORDER BY ...)`	Ranking, running totals, latest record selection

Join Types

Join	Returns
Inner	Matching rows from both sides
Left outer	All left rows plus matching right rows
Right outer	All right rows plus matching left rows
Full outer	All rows from both sides, matched where possible
Left semi	Left rows that have a match; only left columns
Left anti	Left rows with no match; only left columns
Cross	Cartesian product; usually avoid unless intentional

Example anti join for new records:

new_customers = incoming.join(existing, on="customer_id", how="left_anti")

Deduplication Patterns

Need	Pattern	Notes
Remove exact duplicate rows	`SELECT DISTINCT *`	Simple but may shuffle heavily
Deduplicate by key, arbitrary survivor	`dropDuplicates(["id"])`	Survivor may not be deterministic
Keep latest record per key	Window with `row_number()`	Most exam-safe when order column exists
Streaming deduplication	`dropDuplicates` with watermark	Controls state growth and late data handling
Upsert latest changes into target	`MERGE INTO`	Best for incremental table maintenance

Latest record per key:

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM main.bronze.customers_raw
)
SELECT *
FROM ranked
WHERE rn = 1;

MERGE for Upserts

Use MERGE when records may be new, changed, or deleted.

MERGE INTO main.silver.customers AS target
USING main.bronze.customers_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;

Clause	Meaning
`WHEN MATCHED THEN UPDATE`	Existing target row is changed
`WHEN MATCHED THEN DELETE`	Existing target row is removed
`WHEN NOT MATCHED THEN INSERT`	New target row is inserted
`ON` condition	Defines business key match
Source duplicates	Can cause ambiguous matches; deduplicate source first

Exam-safe CDC flow:

Read source changes.
Cast and validate fields.
Deduplicate changes by business key and sequence/timestamp.
MERGE into silver table.
Write audit metrics or job status.

Schema Handling

Feature	Purpose	Trap
Schema inference	Detects schema from source data	Convenient but risky for production consistency
Explicit schema	Defines expected columns and types	Preferred for stable pipelines
Schema enforcement	Prevents invalid writes to Delta	Does not automatically fix bad data
Schema evolution	Adds or changes schema when allowed	Should be controlled; avoid accidental drift
Casts	Convert strings to dates, timestamps, decimals	Bad casts may produce nulls or errors depending on mode
Constraints	Enforce table-level data rules	Use for quality guarantees where supported

Example explicit schema:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DecimalType

schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("order_ts", TimestampType(), True),
    StructField("amount", DecimalType(10, 2), True)
])

Data Quality Patterns

Requirement	Pattern
Reject records with missing primary key	Filter invalid rows or use expectations/constraints
Quarantine malformed data	Write invalid records to separate error table
Track ingestion lineage	Add source file name, ingestion timestamp, batch ID
Prevent duplicate business keys	Deduplicate before merge; enforce uniqueness through pipeline logic
Validate referential quality	Join to dimension/reference tables and isolate non-matches
Monitor row counts	Compare source, accepted, rejected, inserted, updated counts

Useful metadata columns:

SELECT
  *,
  current_timestamp() AS ingestion_ts,
  _metadata.file_name AS source_file
FROM read_files('/Volumes/main/landing/orders', format => 'json');

Unity Catalog Security Quick Reference

Object	Typical privilege idea	Exam use
Metastore	Administrative governance boundary	Usually not granted broadly
Catalog	Access top-level namespace	Need catalog access before schema/table work
Schema	Use namespace and create objects	Required for table/view creation in that schema
Table	Select, modify, manage depending on role	Grant least privilege
View	Provide restricted access to query results	Use to hide columns/rows or simplify access
Volume	Read/write governed files	For non-tabular data access
External location	Access cloud storage path	Needed for external tables/volumes
Storage credential	Cloud identity abstraction	Secured tightly; not for general users

Governance Decision Table

Requirement	Prefer
Govern tabular data with SQL permissions	Unity Catalog tables/views
Govern raw files that are not tables	Unity Catalog volumes
Share a subset of columns or rows	Views with appropriate grants
Isolate dev/test/prod namespaces	Separate catalogs or schemas
Avoid hard-coded cloud credentials	Storage credentials, external locations, secrets
Grant only read access	`SELECT` on table/view, plus required namespace usage
Let analysts query without modifying data	Read-only grants on curated gold tables/views

Example grants:

GRANT USE CATALOG ON CATALOG main TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data_analysts`;
GRANT SELECT ON TABLE main.gold.sales_summary TO `data_analysts`;

Compute Selection

Need	Choose	Why
Interactive notebook development	All-purpose compute	Supports iterative exploration
Scheduled production task	Job compute	Created for job run, easier lifecycle control
SQL dashboards and BI queries	SQL warehouse	Optimized SQL serving experience
Isolate workloads and reduce idle cost	Job clusters / task-specific compute	Runs only when needed
Enforce standardized settings	Cluster policies	Governance and cost control
Faster SQL/DataFrame execution where available	Photon-enabled compute	Vectorized execution engine for supported workloads
Avoid installing libraries manually each run	Job/task library configuration	Reproducible production setup

Common traps:

Trap	Correction
Using all-purpose clusters for every scheduled workload	Prefer job compute for production jobs
Assuming driver memory solves all performance issues	Large shuffles/skew need query/data design fixes
Installing libraries interactively only	Configure libraries on job/cluster for repeatability
Giving broad cluster permissions	Use least privilege and cluster policies

Databricks Workflows and Jobs

Concept	What to know
Job	Production unit for scheduled or triggered work
Task	Step inside a job, such as notebook, Python script, SQL task, pipeline, or JAR
Dependency	Defines task order; downstream tasks wait for upstream success
Job cluster	Compute created for a job/task run
Parameters	Pass runtime values into notebooks/scripts/SQL
Retry	Automatically rerun failed tasks based on configuration
Alert/notification	Inform operators on failure, success, or duration conditions
Repair run	Rerun failed/skipped tasks without rerunning everything where supported
Task values	Pass small values between tasks in multi-task jobs

Workflow design pattern:

    flowchart LR
	    A[Ingest bronze] --> B[Validate and clean silver]
	    B --> C[MERGE dimensions]
	    B --> D[Build facts]
	    C --> E[Refresh gold marts]
	    D --> E
	    E --> F[Run quality checks]

Exam decision points:

Requirement	Answer pattern
Run notebook A before notebook B	Multi-task job with dependency
Reuse same pipeline with different dates	Job parameters
Use isolated compute for production	Job cluster
Notify on failure	Job notification/alert
Avoid rerunning successful upstream tasks after partial failure	Repair failed tasks where appropriate
Control who can edit/run job	Job permissions

SQL Warehouse vs Cluster

Feature	SQL warehouse	Interactive/job cluster
Primary use	SQL queries, dashboards, BI	Notebooks, Spark jobs, ML/data engineering code
Interface	Databricks SQL editor, BI integrations	Notebooks, jobs, Spark APIs
Language focus	SQL	SQL, Python, Scala, R depending on context
Production serving	Good for SQL analytics	Good for ETL and programmatic pipelines
Exam clue	“Dashboard,” “analyst,” “BI query”	“Notebook ETL,” “PySpark,” “library,” “job task”

Performance Reference

Symptom	Likely cause	Practical fix
Many tiny files	Frequent small writes, streaming micro-batches	`OPTIMIZE`, tune write patterns, compact periodically
Query scans too much data	Poor filters, no useful layout, unselective partitions	Filter early, select needed columns, optimize layout
Slow join	Large shuffle, missing broadcast opportunity, skew	Broadcast small dimension, handle skew, filter before join
Out-of-memory on driver	`collect()` or large result to driver	Avoid `collect`; write/query distributed results
Slow aggregation	High-cardinality shuffle	Pre-filter, aggregate in stages, review partitioning
Repeated computation	No caching/materialization	Cache selectively, materialize intermediate Delta table
Stale or inefficient plans	Missing stats or poor query design	Analyze/explain query, optimize tables
Streaming state grows	No watermark or unbounded keys	Add watermark, deduplicate carefully, manage state

Partitioning, OPTIMIZE, and Data Skipping

Technique	Use when	Avoid when
Table partitioning	Low/moderate-cardinality column commonly used for filters	High-cardinality columns like unique IDs
OPTIMIZE	Table has many small files or frequent incremental writes	Tiny tables with no performance issue
Z-ordering / clustering-style layout	Queries repeatedly filter on certain columns	Columns are not used in filters or are too random
Caching	Same data reused repeatedly in active session/job	Data is huge, rarely reused, or memory pressure is high
Materialized gold table	Many users need same transformed result	Source changes constantly and freshness requirements conflict

Troubleshooting Checklist

Problem	Check first
Permission denied	Unity Catalog grants, catalog/schema/table privileges, external location privileges
Table not found	Current catalog/schema, object name, workspace/metastore attachment
Stream reprocessed data	Checkpoint path changed/deleted, source semantics, output mode
Schema mismatch	Source schema drift, table schema, explicit schema, evolution settings
Duplicate rows after load	Non-idempotent append, missing merge key, source duplicates
Slow job after data growth	File count, skew, shuffle stages, join order, filters, OPTIMIZE need
Notebook works manually but job fails	Job parameters, cluster libraries, permissions, secrets, current working context
`collect()` crashes driver	Too much data returned to driver; use distributed write or limited sample
VACUUM removed needed files	Retention/time travel expectations were not considered before cleanup

Common Exam Traps

Trap	Correct understanding
“Delta Lake is just Parquet files”	Delta uses Parquet data files plus a transaction log for reliability and metadata
“Views store data”	Standard views store query definitions; materialized views store results
“A DataFrame transformation immediately runs”	Spark transformations are lazy until an action
“`count()` is harmless”	It is an action and can scan large data
“Streaming means real-time only”	Structured Streaming can also process available data incrementally with bounded-style triggers
“Append is safe for all incremental loads”	Updates/deletes require merge or CDC-aware logic
“Partition by unique ID for faster lookup”	High-cardinality partitioning often creates too many small partitions/files
“Delete checkpoint to fix stream”	This can cause data duplication or state loss
“External table means ungoverned”	External tables can still be governed through Unity Catalog
“VACUUM improves query speed directly”	It removes obsolete files; compaction/layout are separate performance concerns
“Cache guarantees faster queries”	Cache helps only if reused and materialized; it can also create memory pressure
“Job success means data quality is correct”	Jobs can succeed while loading bad data unless quality checks are implemented

Compact Snippet Bank

Create and Use Namespaces

CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.silver;

USE CATALOG main;
USE SCHEMA silver;

Create Table from Query

CREATE OR REPLACE TABLE main.gold.daily_sales AS
SELECT
  date(order_ts) AS order_date,
  sum(amount) AS total_amount,
  count(*) AS order_count
FROM main.silver.orders
GROUP BY date(order_ts);

Append Cleaned Data

clean_df = (
    raw_df
    .filter("order_id IS NOT NULL")
    .withColumn("ingestion_ts", current_timestamp())
)

clean_df.write.mode("append").format("delta").saveAsTable("main.silver.orders")

Overwrite Safely for Rebuildable Gold Table

CREATE OR REPLACE TABLE main.gold.customer_metrics AS
SELECT
  customer_id,
  count(*) AS order_count,
  sum(amount) AS lifetime_value
FROM main.silver.orders
GROUP BY customer_id;

Watermarked Streaming Deduplication

deduped = (
    stream_df
    .withWatermark("event_ts", "1 day")
    .dropDuplicates(["event_id"])
)

Explain a Query Plan

EXPLAIN
SELECT customer_id, sum(amount)
FROM main.silver.orders
WHERE order_ts >= current_date() - INTERVAL 30 DAYS
GROUP BY customer_id;

Decision Trees

Ingestion Choice

    flowchart TD
	    A[Need to ingest data?] --> B{Source is files?}
	    B -- No --> C{Source is continuous events?}
	    C -- Yes --> D[Structured Streaming connector]
	    C -- No --> E[Batch read or connector-specific load]
	    B -- Yes --> F{Need scalable continuous file discovery?}
	    F -- Yes --> G[Auto Loader]
	    F -- No --> H{Prefer SQL incremental load?}
	    H -- Yes --> I[COPY INTO]
	    H -- No --> J[Spark batch read]

Table Maintenance Choice

    flowchart TD
	    A[Need to update target table?] --> B{Only new rows?}
	    B -- Yes --> C[Append]
	    B -- No --> D{Need inserts and updates?}
	    D -- Yes --> E[MERGE INTO]
	    D -- No --> F{Rebuild full result?}
	    F -- Yes --> G[CREATE OR REPLACE TABLE]
	    F -- No --> H[Use DELETE/UPDATE with clear predicate]

Final Review Checklist

Before exam day, make sure you can:

Explain Delta Lake transaction log, time travel, schema enforcement, and VACUUM.
Choose between COPY INTO, Auto Loader, batch reads, and Structured Streaming.
Write or recognize MERGE INTO for upserts.
Distinguish managed tables, external tables, views, materialized views, and volumes.
Apply Unity Catalog hierarchy and least-privilege grants.
Identify when a Spark operation is lazy, an action, narrow, or wide.
Use SQL windows for latest-record deduplication.
Choose job compute, all-purpose compute, or SQL warehouses based on workload.
Diagnose common failures involving checkpoints, permissions, schema drift, small files, and driver collection.
Recognize production patterns: idempotency, parameterization, retries, alerts, and quality checks.

For the next step, practice with scenario questions that force you to choose the right Databricks service, SQL command, Delta Lake operation, or production troubleshooting action under exam-style constraints.

Scenario Guide

Intelligence Platform