This Quick Reference is an independent study aid for candidates preparing for the Databricks Certified Data Engineer Associate exam, code Databricks DEA. It focuses on practical distinctions, commands, and design choices that commonly appear in data engineering scenarios on Databricks.
High-Yield Exam Map
| Area | Know how to answer |
|---|
| Lakehouse architecture | Bronze/silver/gold design, Delta Lake as the table format, batch vs streaming, data quality layers |
| Delta Lake | ACID transactions, transaction log, time travel, schema enforcement/evolution, MERGE, OPTIMIZE, VACUUM |
| Spark SQL and DataFrames | Transformations vs actions, joins, aggregations, windows, null handling, deduplication, file/table reads |
| Ingestion | COPY INTO, Auto Loader, batch reads, streaming reads, schema inference, checkpoints |
| Orchestration | Databricks Workflows, jobs, tasks, dependencies, job clusters, scheduling, retries |
| Governance | Unity Catalog object hierarchy, catalogs, schemas, tables, views, volumes, privileges, managed vs external data |
| Performance and troubleshooting | Partitioning, file sizes, data skipping, caching, broadcast joins, shuffle, skew, query plans |
| Production behavior | Idempotency, incremental loads, checkpointing, permissions, alerts, parameterization |
Lakehouse Object Model
| Concept | Exam-ready meaning | Common trap |
|---|
| Workspace | User-facing Databricks environment for notebooks, jobs, clusters, repos, SQL assets | Workspace is not the same as the Unity Catalog metastore |
| Metastore | Governance container for Unity Catalog metadata | A workspace can be attached to a metastore; permissions are still object-level |
| Catalog | Top-level namespace in Unity Catalog | Catalogs contain schemas, not directly arbitrary notebooks |
| Schema | Namespace inside a catalog; similar to a database | In SQL, USE SCHEMA or fully qualify names to avoid wrong object references |
| Table | Structured data object, commonly Delta | A table may be managed or external |
| View | Saved query over tables/views | Standard views do not store data; permissions and lineage matter |
| Materialized view | Stores query results and can be refreshed | Not the same as a regular view |
| Volume | Unity Catalog object for non-tabular files | Use volumes for governed file access, not table queries |
| External location | Governed reference to cloud storage | Requires appropriate storage credential and privileges |
| Storage credential | Identity/credential Databricks uses to access cloud storage | Do not confuse with a user secret or personal access token |
Naming Pattern
Prefer fully qualified object names in exam scenarios involving governance or multiple environments:
SELECT *
FROM catalog_name.schema_name.table_name;
Managed vs External Tables
| Decision | Managed table | External table |
|---|
| Data location | Databricks-managed storage location | User-specified external path |
| Lifecycle | Dropping the table can remove both metadata and managed data | Dropping the table removes metadata, not necessarily underlying files |
| Governance | Governed through Unity Catalog | Governed through Unity Catalog plus external location controls |
| Best for | Default lakehouse tables, simplified lifecycle | Shared storage, existing data lakes, cross-system data ownership |
| Exam signal | “Let Databricks manage storage” | “Data already exists in cloud storage” or “retain files after dropping table” |
Example DDL:
-- Managed Delta table
CREATE TABLE main.sales.orders (
order_id STRING,
order_ts TIMESTAMP,
amount DECIMAL(10,2)
);
-- External Delta table
CREATE TABLE main.sales.orders_ext
LOCATION 's3://example-bucket/path/orders';
Delta Lake Essentials
| Feature | What it does | Exam use |
|---|
| ACID transactions | Reliable concurrent reads/writes | Avoid corrupt partial writes |
| Transaction log | _delta_log records table versions | Enables time travel, rollback-style reads, metadata tracking |
| Schema enforcement | Rejects incompatible writes | Protects table quality |
| Schema evolution | Allows approved schema changes | Useful for evolving source data, but should be explicit in production |
| Time travel | Query older table versions or timestamps | Audit, reproduce, recover from bad writes |
| MERGE | Upsert/delete based on match condition | Incremental CDC-style loads |
| OPTIMIZE | Compacts small files | Improve scan performance |
| Z-ordering / clustering concepts | Co-locates related data for skipping | Useful for common filter columns; do not use blindly |
| VACUUM | Removes old unused data files | Can limit time travel and rollback options |
| Change Data Feed | Exposes row-level changes when enabled | Incremental downstream processing |
Delta Time Travel
SELECT *
FROM main.sales.orders VERSION AS OF 12;
SELECT *
FROM main.sales.orders TIMESTAMP AS OF '2026-06-01T00:00:00Z';
High-yield distinction:
| Need | Choose |
|---|
| Query previous table state | Time travel |
| Undo bad write manually | Read old version, then overwrite/restore using approved pattern |
| Free old file storage | VACUUM |
| Track row-level inserts/updates/deletes | Change Data Feed |
Medallion Architecture
| Layer | Purpose | Typical operations | Quality expectation |
|---|
| Bronze | Raw or lightly processed landing data | Ingest, append, capture metadata, preserve source fidelity | Low; keep source as received |
| Silver | Cleaned, conformed, deduplicated data | Parse, cast, validate, standardize, deduplicate, merge | Medium to high |
| Gold | Business-ready aggregates or marts | Aggregate, join dimensions, serve BI/ML use cases | High; curated and query-optimized |
Common exam pattern:
- Land source files into bronze.
- Apply schema, validation, deduplication into silver.
- Build aggregated or dimensional outputs in gold.
- Orchestrate dependencies with a Databricks job or declarative pipeline.
- Govern access by catalog/schema/table privileges.
Ingestion Selection Matrix
| Requirement | Best fit | Why |
|---|
| Load files once or periodically with SQL | COPY INTO | Simple incremental file ingestion into Delta |
| Continuously ingest new files from cloud storage | Auto Loader | Scalable file discovery, schema handling, checkpointing |
| Read static files for ad hoc transformation | Spark batch read | Direct, flexible, not automatically incremental |
| Process event streams incrementally | Structured Streaming | Stateful streaming engine with checkpoints |
| Build managed declarative ETL with quality rules | Databricks declarative pipeline concepts | Pipeline orchestration, dependencies, expectations |
| Ingest small manual datasets | UI upload or simple table creation | Convenience, not production-scale ingestion |
COPY INTO
Use when the source is file-based and the target is a Delta table.
COPY INTO main.bronze.orders_raw
FROM 's3://example-bucket/incoming/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');
Exam reminders:
| Point | Remember |
|---|
| Incremental behavior | COPY INTO tracks previously loaded files for the target table |
| Best use | Simple file ingestion without custom streaming logic |
| Target | Typically a Delta table |
| Trap | It is not the same as INSERT INTO SELECT from an already registered table |
Auto Loader
Use Auto Loader for scalable incremental file ingestion.
df = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/Volumes/main/ops/checkpoints/orders_schema")
.load("/Volumes/main/landing/orders")
)
(
df.writeStream
.format("delta")
.option("checkpointLocation", "/Volumes/main/ops/checkpoints/orders_stream")
.toTable("main.bronze.orders_raw")
)
| Auto Loader concept | Exam meaning |
|---|
cloudFiles | Format used by Auto Loader |
| Schema location | Stores inferred/evolving schema metadata |
| Checkpoint location | Tracks streaming progress and state |
| Rescue data | Captures unexpected columns or malformed fields depending on configuration |
| Incremental discovery | Processes new files without re-reading all old files |
Batch vs Streaming
| Scenario clue | Use batch | Use streaming |
|---|
| Files arrive once per day and can be loaded as a batch | Yes | Optional |
| Data must be processed continuously as it arrives | No | Yes |
| Query needs watermarking for late events | No | Yes |
| Need exact same transformation logic on bounded data | Yes | Sometimes with available-now style trigger |
| Job should terminate after processing available data | Yes | Use a bounded/available trigger if streaming ingestion is still desired |
| Stateful deduplication over time | Limited | Yes, with watermark/checkpoint |
Structured Streaming Checkpoints
| Rule | Why it matters |
|---|
| Each streaming query needs its own checkpoint | Prevents state/progress conflicts |
| Do not casually delete checkpoints | Can cause reprocessing or state loss |
| Use stable storage for checkpoints | Required for reliable recovery |
| Changing query logic may require checkpoint planning | State schema and output behavior can be incompatible |
Spark SQL and DataFrame Core
| Type | Examples | Behavior |
|---|
| Transformation | select, filter, withColumn, join, groupBy, orderBy | Lazy; builds logical plan |
| Action | count, collect, show, write, display | Triggers execution |
| Wide transformation | groupBy, join, distinct, orderBy | Often causes shuffle |
| Narrow transformation | select, simple filter, many column expressions | Usually no shuffle |
Common trap: caching a DataFrame is lazy. It is materialized only after an action.
df_cached = df.filter("amount > 0").cache()
df_cached.count() # materializes cache
SQL Operations to Know
| Operation | Pattern | Exam note |
|---|
| Filter | WHERE amount > 0 | Applied before aggregation |
| Aggregate filter | HAVING count(*) > 1 | Applied after GROUP BY |
| Null comparison | IS NULL, IS NOT NULL | Do not use = NULL |
| Conditional logic | CASE WHEN ... THEN ... END | Useful for derived fields |
| Join | INNER, LEFT, RIGHT, FULL, CROSS, ANTI, SEMI | Know output semantics |
| Dedup | DISTINCT, dropDuplicates, window row_number | Window pattern gives deterministic survivor |
| Explode | explode(array_col) | Converts array elements to rows |
| Window | OVER (PARTITION BY ... ORDER BY ...) | Ranking, running totals, latest record selection |
Join Types
| Join | Returns |
|---|
| Inner | Matching rows from both sides |
| Left outer | All left rows plus matching right rows |
| Right outer | All right rows plus matching left rows |
| Full outer | All rows from both sides, matched where possible |
| Left semi | Left rows that have a match; only left columns |
| Left anti | Left rows with no match; only left columns |
| Cross | Cartesian product; usually avoid unless intentional |
Example anti join for new records:
new_customers = incoming.join(existing, on="customer_id", how="left_anti")
Deduplication Patterns
| Need | Pattern | Notes |
|---|
| Remove exact duplicate rows | SELECT DISTINCT * | Simple but may shuffle heavily |
| Deduplicate by key, arbitrary survivor | dropDuplicates(["id"]) | Survivor may not be deterministic |
| Keep latest record per key | Window with row_number() | Most exam-safe when order column exists |
| Streaming deduplication | dropDuplicates with watermark | Controls state growth and late data handling |
| Upsert latest changes into target | MERGE INTO | Best for incremental table maintenance |
Latest record per key:
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC
) AS rn
FROM main.bronze.customers_raw
)
SELECT *
FROM ranked
WHERE rn = 1;
MERGE for Upserts
Use MERGE when records may be new, changed, or deleted.
MERGE INTO main.silver.customers AS target
USING main.bronze.customers_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *;
| Clause | Meaning |
|---|
WHEN MATCHED THEN UPDATE | Existing target row is changed |
WHEN MATCHED THEN DELETE | Existing target row is removed |
WHEN NOT MATCHED THEN INSERT | New target row is inserted |
ON condition | Defines business key match |
| Source duplicates | Can cause ambiguous matches; deduplicate source first |
Exam-safe CDC flow:
- Read source changes.
- Cast and validate fields.
- Deduplicate changes by business key and sequence/timestamp.
MERGE into silver table.- Write audit metrics or job status.
Schema Handling
| Feature | Purpose | Trap |
|---|
| Schema inference | Detects schema from source data | Convenient but risky for production consistency |
| Explicit schema | Defines expected columns and types | Preferred for stable pipelines |
| Schema enforcement | Prevents invalid writes to Delta | Does not automatically fix bad data |
| Schema evolution | Adds or changes schema when allowed | Should be controlled; avoid accidental drift |
| Casts | Convert strings to dates, timestamps, decimals | Bad casts may produce nulls or errors depending on mode |
| Constraints | Enforce table-level data rules | Use for quality guarantees where supported |
Example explicit schema:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DecimalType
schema = StructType([
StructField("order_id", StringType(), False),
StructField("order_ts", TimestampType(), True),
StructField("amount", DecimalType(10, 2), True)
])
Data Quality Patterns
| Requirement | Pattern |
|---|
| Reject records with missing primary key | Filter invalid rows or use expectations/constraints |
| Quarantine malformed data | Write invalid records to separate error table |
| Track ingestion lineage | Add source file name, ingestion timestamp, batch ID |
| Prevent duplicate business keys | Deduplicate before merge; enforce uniqueness through pipeline logic |
| Validate referential quality | Join to dimension/reference tables and isolate non-matches |
| Monitor row counts | Compare source, accepted, rejected, inserted, updated counts |
Useful metadata columns:
SELECT
*,
current_timestamp() AS ingestion_ts,
_metadata.file_name AS source_file
FROM read_files('/Volumes/main/landing/orders', format => 'json');
Unity Catalog Security Quick Reference
| Object | Typical privilege idea | Exam use |
|---|
| Metastore | Administrative governance boundary | Usually not granted broadly |
| Catalog | Access top-level namespace | Need catalog access before schema/table work |
| Schema | Use namespace and create objects | Required for table/view creation in that schema |
| Table | Select, modify, manage depending on role | Grant least privilege |
| View | Provide restricted access to query results | Use to hide columns/rows or simplify access |
| Volume | Read/write governed files | For non-tabular data access |
| External location | Access cloud storage path | Needed for external tables/volumes |
| Storage credential | Cloud identity abstraction | Secured tightly; not for general users |
Governance Decision Table
| Requirement | Prefer |
|---|
| Govern tabular data with SQL permissions | Unity Catalog tables/views |
| Govern raw files that are not tables | Unity Catalog volumes |
| Share a subset of columns or rows | Views with appropriate grants |
| Isolate dev/test/prod namespaces | Separate catalogs or schemas |
| Avoid hard-coded cloud credentials | Storage credentials, external locations, secrets |
| Grant only read access | SELECT on table/view, plus required namespace usage |
| Let analysts query without modifying data | Read-only grants on curated gold tables/views |
Example grants:
GRANT USE CATALOG ON CATALOG main TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data_analysts`;
GRANT SELECT ON TABLE main.gold.sales_summary TO `data_analysts`;
Compute Selection
| Need | Choose | Why |
|---|
| Interactive notebook development | All-purpose compute | Supports iterative exploration |
| Scheduled production task | Job compute | Created for job run, easier lifecycle control |
| SQL dashboards and BI queries | SQL warehouse | Optimized SQL serving experience |
| Isolate workloads and reduce idle cost | Job clusters / task-specific compute | Runs only when needed |
| Enforce standardized settings | Cluster policies | Governance and cost control |
| Faster SQL/DataFrame execution where available | Photon-enabled compute | Vectorized execution engine for supported workloads |
| Avoid installing libraries manually each run | Job/task library configuration | Reproducible production setup |
Common traps:
| Trap | Correction |
|---|
| Using all-purpose clusters for every scheduled workload | Prefer job compute for production jobs |
| Assuming driver memory solves all performance issues | Large shuffles/skew need query/data design fixes |
| Installing libraries interactively only | Configure libraries on job/cluster for repeatability |
| Giving broad cluster permissions | Use least privilege and cluster policies |
Databricks Workflows and Jobs
| Concept | What to know |
|---|
| Job | Production unit for scheduled or triggered work |
| Task | Step inside a job, such as notebook, Python script, SQL task, pipeline, or JAR |
| Dependency | Defines task order; downstream tasks wait for upstream success |
| Job cluster | Compute created for a job/task run |
| Parameters | Pass runtime values into notebooks/scripts/SQL |
| Retry | Automatically rerun failed tasks based on configuration |
| Alert/notification | Inform operators on failure, success, or duration conditions |
| Repair run | Rerun failed/skipped tasks without rerunning everything where supported |
| Task values | Pass small values between tasks in multi-task jobs |
Workflow design pattern:
flowchart LR
A[Ingest bronze] --> B[Validate and clean silver]
B --> C[MERGE dimensions]
B --> D[Build facts]
C --> E[Refresh gold marts]
D --> E
E --> F[Run quality checks]
Exam decision points:
| Requirement | Answer pattern |
|---|
| Run notebook A before notebook B | Multi-task job with dependency |
| Reuse same pipeline with different dates | Job parameters |
| Use isolated compute for production | Job cluster |
| Notify on failure | Job notification/alert |
| Avoid rerunning successful upstream tasks after partial failure | Repair failed tasks where appropriate |
| Control who can edit/run job | Job permissions |
SQL Warehouse vs Cluster
| Feature | SQL warehouse | Interactive/job cluster |
|---|
| Primary use | SQL queries, dashboards, BI | Notebooks, Spark jobs, ML/data engineering code |
| Interface | Databricks SQL editor, BI integrations | Notebooks, jobs, Spark APIs |
| Language focus | SQL | SQL, Python, Scala, R depending on context |
| Production serving | Good for SQL analytics | Good for ETL and programmatic pipelines |
| Exam clue | “Dashboard,” “analyst,” “BI query” | “Notebook ETL,” “PySpark,” “library,” “job task” |
| Symptom | Likely cause | Practical fix |
|---|
| Many tiny files | Frequent small writes, streaming micro-batches | OPTIMIZE, tune write patterns, compact periodically |
| Query scans too much data | Poor filters, no useful layout, unselective partitions | Filter early, select needed columns, optimize layout |
| Slow join | Large shuffle, missing broadcast opportunity, skew | Broadcast small dimension, handle skew, filter before join |
| Out-of-memory on driver | collect() or large result to driver | Avoid collect; write/query distributed results |
| Slow aggregation | High-cardinality shuffle | Pre-filter, aggregate in stages, review partitioning |
| Repeated computation | No caching/materialization | Cache selectively, materialize intermediate Delta table |
| Stale or inefficient plans | Missing stats or poor query design | Analyze/explain query, optimize tables |
| Streaming state grows | No watermark or unbounded keys | Add watermark, deduplicate carefully, manage state |
Partitioning, OPTIMIZE, and Data Skipping
| Technique | Use when | Avoid when |
|---|
| Table partitioning | Low/moderate-cardinality column commonly used for filters | High-cardinality columns like unique IDs |
| OPTIMIZE | Table has many small files or frequent incremental writes | Tiny tables with no performance issue |
| Z-ordering / clustering-style layout | Queries repeatedly filter on certain columns | Columns are not used in filters or are too random |
| Caching | Same data reused repeatedly in active session/job | Data is huge, rarely reused, or memory pressure is high |
| Materialized gold table | Many users need same transformed result | Source changes constantly and freshness requirements conflict |
Troubleshooting Checklist
| Problem | Check first |
|---|
| Permission denied | Unity Catalog grants, catalog/schema/table privileges, external location privileges |
| Table not found | Current catalog/schema, object name, workspace/metastore attachment |
| Stream reprocessed data | Checkpoint path changed/deleted, source semantics, output mode |
| Schema mismatch | Source schema drift, table schema, explicit schema, evolution settings |
| Duplicate rows after load | Non-idempotent append, missing merge key, source duplicates |
| Slow job after data growth | File count, skew, shuffle stages, join order, filters, OPTIMIZE need |
| Notebook works manually but job fails | Job parameters, cluster libraries, permissions, secrets, current working context |
collect() crashes driver | Too much data returned to driver; use distributed write or limited sample |
| VACUUM removed needed files | Retention/time travel expectations were not considered before cleanup |
Common Exam Traps
| Trap | Correct understanding |
|---|
| “Delta Lake is just Parquet files” | Delta uses Parquet data files plus a transaction log for reliability and metadata |
| “Views store data” | Standard views store query definitions; materialized views store results |
| “A DataFrame transformation immediately runs” | Spark transformations are lazy until an action |
“count() is harmless” | It is an action and can scan large data |
| “Streaming means real-time only” | Structured Streaming can also process available data incrementally with bounded-style triggers |
| “Append is safe for all incremental loads” | Updates/deletes require merge or CDC-aware logic |
| “Partition by unique ID for faster lookup” | High-cardinality partitioning often creates too many small partitions/files |
| “Delete checkpoint to fix stream” | This can cause data duplication or state loss |
| “External table means ungoverned” | External tables can still be governed through Unity Catalog |
| “VACUUM improves query speed directly” | It removes obsolete files; compaction/layout are separate performance concerns |
| “Cache guarantees faster queries” | Cache helps only if reused and materialized; it can also create memory pressure |
| “Job success means data quality is correct” | Jobs can succeed while loading bad data unless quality checks are implemented |
Compact Snippet Bank
Create and Use Namespaces
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.silver;
USE CATALOG main;
USE SCHEMA silver;
Create Table from Query
CREATE OR REPLACE TABLE main.gold.daily_sales AS
SELECT
date(order_ts) AS order_date,
sum(amount) AS total_amount,
count(*) AS order_count
FROM main.silver.orders
GROUP BY date(order_ts);
Append Cleaned Data
clean_df = (
raw_df
.filter("order_id IS NOT NULL")
.withColumn("ingestion_ts", current_timestamp())
)
clean_df.write.mode("append").format("delta").saveAsTable("main.silver.orders")
Overwrite Safely for Rebuildable Gold Table
CREATE OR REPLACE TABLE main.gold.customer_metrics AS
SELECT
customer_id,
count(*) AS order_count,
sum(amount) AS lifetime_value
FROM main.silver.orders
GROUP BY customer_id;
Watermarked Streaming Deduplication
deduped = (
stream_df
.withWatermark("event_ts", "1 day")
.dropDuplicates(["event_id"])
)
Explain a Query Plan
EXPLAIN
SELECT customer_id, sum(amount)
FROM main.silver.orders
WHERE order_ts >= current_date() - INTERVAL 30 DAYS
GROUP BY customer_id;
Decision Trees
Ingestion Choice
flowchart TD
A[Need to ingest data?] --> B{Source is files?}
B -- No --> C{Source is continuous events?}
C -- Yes --> D[Structured Streaming connector]
C -- No --> E[Batch read or connector-specific load]
B -- Yes --> F{Need scalable continuous file discovery?}
F -- Yes --> G[Auto Loader]
F -- No --> H{Prefer SQL incremental load?}
H -- Yes --> I[COPY INTO]
H -- No --> J[Spark batch read]
Table Maintenance Choice
flowchart TD
A[Need to update target table?] --> B{Only new rows?}
B -- Yes --> C[Append]
B -- No --> D{Need inserts and updates?}
D -- Yes --> E[MERGE INTO]
D -- No --> F{Rebuild full result?}
F -- Yes --> G[CREATE OR REPLACE TABLE]
F -- No --> H[Use DELETE/UPDATE with clear predicate]
Final Review Checklist
Before exam day, make sure you can:
- Explain Delta Lake transaction log, time travel, schema enforcement, and VACUUM.
- Choose between
COPY INTO, Auto Loader, batch reads, and Structured Streaming. - Write or recognize
MERGE INTO for upserts. - Distinguish managed tables, external tables, views, materialized views, and volumes.
- Apply Unity Catalog hierarchy and least-privilege grants.
- Identify when a Spark operation is lazy, an action, narrow, or wide.
- Use SQL windows for latest-record deduplication.
- Choose job compute, all-purpose compute, or SQL warehouses based on workload.
- Diagnose common failures involving checkpoints, permissions, schema drift, small files, and driver collection.
- Recognize production patterns: idempotency, parameterization, retries, alerts, and quality checks.
For the next step, practice with scenario questions that force you to choose the right Databricks service, SQL command, Delta Lake operation, or production troubleshooting action under exam-style constraints.