Databricks Certified Data Engineer Associate Quick Reference

Compact exam-prep reference for Databricks Certified Data Engineer Associate candidates: Delta Lake, Spark SQL, ingestion, governance, jobs, and troubleshooting.

This Quick Reference is an independent study aid for candidates preparing for the Databricks Certified Data Engineer Associate exam, code Databricks DEA. It focuses on practical distinctions, commands, and design choices that commonly appear in data engineering scenarios on Databricks.

High-Yield Exam Map

AreaKnow how to answer
Lakehouse architectureBronze/silver/gold design, Delta Lake as the table format, batch vs streaming, data quality layers
Delta LakeACID transactions, transaction log, time travel, schema enforcement/evolution, MERGE, OPTIMIZE, VACUUM
Spark SQL and DataFramesTransformations vs actions, joins, aggregations, windows, null handling, deduplication, file/table reads
IngestionCOPY INTO, Auto Loader, batch reads, streaming reads, schema inference, checkpoints
OrchestrationDatabricks Workflows, jobs, tasks, dependencies, job clusters, scheduling, retries
GovernanceUnity Catalog object hierarchy, catalogs, schemas, tables, views, volumes, privileges, managed vs external data
Performance and troubleshootingPartitioning, file sizes, data skipping, caching, broadcast joins, shuffle, skew, query plans
Production behaviorIdempotency, incremental loads, checkpointing, permissions, alerts, parameterization

Lakehouse Object Model

ConceptExam-ready meaningCommon trap
WorkspaceUser-facing Databricks environment for notebooks, jobs, clusters, repos, SQL assetsWorkspace is not the same as the Unity Catalog metastore
MetastoreGovernance container for Unity Catalog metadataA workspace can be attached to a metastore; permissions are still object-level
CatalogTop-level namespace in Unity CatalogCatalogs contain schemas, not directly arbitrary notebooks
SchemaNamespace inside a catalog; similar to a databaseIn SQL, USE SCHEMA or fully qualify names to avoid wrong object references
TableStructured data object, commonly DeltaA table may be managed or external
ViewSaved query over tables/viewsStandard views do not store data; permissions and lineage matter
Materialized viewStores query results and can be refreshedNot the same as a regular view
VolumeUnity Catalog object for non-tabular filesUse volumes for governed file access, not table queries
External locationGoverned reference to cloud storageRequires appropriate storage credential and privileges
Storage credentialIdentity/credential Databricks uses to access cloud storageDo not confuse with a user secret or personal access token

Naming Pattern

Prefer fully qualified object names in exam scenarios involving governance or multiple environments:

SELECT *
FROM catalog_name.schema_name.table_name;

Managed vs External Tables

DecisionManaged tableExternal table
Data locationDatabricks-managed storage locationUser-specified external path
LifecycleDropping the table can remove both metadata and managed dataDropping the table removes metadata, not necessarily underlying files
GovernanceGoverned through Unity CatalogGoverned through Unity Catalog plus external location controls
Best forDefault lakehouse tables, simplified lifecycleShared storage, existing data lakes, cross-system data ownership
Exam signal“Let Databricks manage storage”“Data already exists in cloud storage” or “retain files after dropping table”

Example DDL:

-- Managed Delta table
CREATE TABLE main.sales.orders (
  order_id STRING,
  order_ts TIMESTAMP,
  amount DECIMAL(10,2)
);

-- External Delta table
CREATE TABLE main.sales.orders_ext
LOCATION 's3://example-bucket/path/orders';

Delta Lake Essentials

FeatureWhat it doesExam use
ACID transactionsReliable concurrent reads/writesAvoid corrupt partial writes
Transaction log_delta_log records table versionsEnables time travel, rollback-style reads, metadata tracking
Schema enforcementRejects incompatible writesProtects table quality
Schema evolutionAllows approved schema changesUseful for evolving source data, but should be explicit in production
Time travelQuery older table versions or timestampsAudit, reproduce, recover from bad writes
MERGEUpsert/delete based on match conditionIncremental CDC-style loads
OPTIMIZECompacts small filesImprove scan performance
Z-ordering / clustering conceptsCo-locates related data for skippingUseful for common filter columns; do not use blindly
VACUUMRemoves old unused data filesCan limit time travel and rollback options
Change Data FeedExposes row-level changes when enabledIncremental downstream processing

Delta Time Travel

SELECT *
FROM main.sales.orders VERSION AS OF 12;

SELECT *
FROM main.sales.orders TIMESTAMP AS OF '2026-06-01T00:00:00Z';

High-yield distinction:

NeedChoose
Query previous table stateTime travel
Undo bad write manuallyRead old version, then overwrite/restore using approved pattern
Free old file storageVACUUM
Track row-level inserts/updates/deletesChange Data Feed

Medallion Architecture

LayerPurposeTypical operationsQuality expectation
BronzeRaw or lightly processed landing dataIngest, append, capture metadata, preserve source fidelityLow; keep source as received
SilverCleaned, conformed, deduplicated dataParse, cast, validate, standardize, deduplicate, mergeMedium to high
GoldBusiness-ready aggregates or martsAggregate, join dimensions, serve BI/ML use casesHigh; curated and query-optimized

Common exam pattern:

  1. Land source files into bronze.
  2. Apply schema, validation, deduplication into silver.
  3. Build aggregated or dimensional outputs in gold.
  4. Orchestrate dependencies with a Databricks job or declarative pipeline.
  5. Govern access by catalog/schema/table privileges.

Ingestion Selection Matrix

RequirementBest fitWhy
Load files once or periodically with SQLCOPY INTOSimple incremental file ingestion into Delta
Continuously ingest new files from cloud storageAuto LoaderScalable file discovery, schema handling, checkpointing
Read static files for ad hoc transformationSpark batch readDirect, flexible, not automatically incremental
Process event streams incrementallyStructured StreamingStateful streaming engine with checkpoints
Build managed declarative ETL with quality rulesDatabricks declarative pipeline conceptsPipeline orchestration, dependencies, expectations
Ingest small manual datasetsUI upload or simple table creationConvenience, not production-scale ingestion

COPY INTO

Use when the source is file-based and the target is a Delta table.

COPY INTO main.bronze.orders_raw
FROM 's3://example-bucket/incoming/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');

Exam reminders:

PointRemember
Incremental behaviorCOPY INTO tracks previously loaded files for the target table
Best useSimple file ingestion without custom streaming logic
TargetTypically a Delta table
TrapIt is not the same as INSERT INTO SELECT from an already registered table

Auto Loader

Use Auto Loader for scalable incremental file ingestion.

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/ops/checkpoints/orders_schema")
    .load("/Volumes/main/landing/orders")
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/Volumes/main/ops/checkpoints/orders_stream")
    .toTable("main.bronze.orders_raw")
)
Auto Loader conceptExam meaning
cloudFilesFormat used by Auto Loader
Schema locationStores inferred/evolving schema metadata
Checkpoint locationTracks streaming progress and state
Rescue dataCaptures unexpected columns or malformed fields depending on configuration
Incremental discoveryProcesses new files without re-reading all old files

Batch vs Streaming

Scenario clueUse batchUse streaming
Files arrive once per day and can be loaded as a batchYesOptional
Data must be processed continuously as it arrivesNoYes
Query needs watermarking for late eventsNoYes
Need exact same transformation logic on bounded dataYesSometimes with available-now style trigger
Job should terminate after processing available dataYesUse a bounded/available trigger if streaming ingestion is still desired
Stateful deduplication over timeLimitedYes, with watermark/checkpoint

Structured Streaming Checkpoints

RuleWhy it matters
Each streaming query needs its own checkpointPrevents state/progress conflicts
Do not casually delete checkpointsCan cause reprocessing or state loss
Use stable storage for checkpointsRequired for reliable recovery
Changing query logic may require checkpoint planningState schema and output behavior can be incompatible

Spark SQL and DataFrame Core

Transformations vs Actions

TypeExamplesBehavior
Transformationselect, filter, withColumn, join, groupBy, orderByLazy; builds logical plan
Actioncount, collect, show, write, displayTriggers execution
Wide transformationgroupBy, join, distinct, orderByOften causes shuffle
Narrow transformationselect, simple filter, many column expressionsUsually no shuffle

Common trap: caching a DataFrame is lazy. It is materialized only after an action.

df_cached = df.filter("amount > 0").cache()
df_cached.count()  # materializes cache

SQL Operations to Know

OperationPatternExam note
FilterWHERE amount > 0Applied before aggregation
Aggregate filterHAVING count(*) > 1Applied after GROUP BY
Null comparisonIS NULL, IS NOT NULLDo not use = NULL
Conditional logicCASE WHEN ... THEN ... ENDUseful for derived fields
JoinINNER, LEFT, RIGHT, FULL, CROSS, ANTI, SEMIKnow output semantics
DedupDISTINCT, dropDuplicates, window row_numberWindow pattern gives deterministic survivor
Explodeexplode(array_col)Converts array elements to rows
WindowOVER (PARTITION BY ... ORDER BY ...)Ranking, running totals, latest record selection

Join Types

JoinReturns
InnerMatching rows from both sides
Left outerAll left rows plus matching right rows
Right outerAll right rows plus matching left rows
Full outerAll rows from both sides, matched where possible
Left semiLeft rows that have a match; only left columns
Left antiLeft rows with no match; only left columns
CrossCartesian product; usually avoid unless intentional

Example anti join for new records:

new_customers = incoming.join(existing, on="customer_id", how="left_anti")

Deduplication Patterns

NeedPatternNotes
Remove exact duplicate rowsSELECT DISTINCT *Simple but may shuffle heavily
Deduplicate by key, arbitrary survivordropDuplicates(["id"])Survivor may not be deterministic
Keep latest record per keyWindow with row_number()Most exam-safe when order column exists
Streaming deduplicationdropDuplicates with watermarkControls state growth and late data handling
Upsert latest changes into targetMERGE INTOBest for incremental table maintenance

Latest record per key:

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM main.bronze.customers_raw
)
SELECT *
FROM ranked
WHERE rn = 1;

MERGE for Upserts

Use MERGE when records may be new, changed, or deleted.

MERGE INTO main.silver.customers AS target
USING main.bronze.customers_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
ClauseMeaning
WHEN MATCHED THEN UPDATEExisting target row is changed
WHEN MATCHED THEN DELETEExisting target row is removed
WHEN NOT MATCHED THEN INSERTNew target row is inserted
ON conditionDefines business key match
Source duplicatesCan cause ambiguous matches; deduplicate source first

Exam-safe CDC flow:

  1. Read source changes.
  2. Cast and validate fields.
  3. Deduplicate changes by business key and sequence/timestamp.
  4. MERGE into silver table.
  5. Write audit metrics or job status.

Schema Handling

FeaturePurposeTrap
Schema inferenceDetects schema from source dataConvenient but risky for production consistency
Explicit schemaDefines expected columns and typesPreferred for stable pipelines
Schema enforcementPrevents invalid writes to DeltaDoes not automatically fix bad data
Schema evolutionAdds or changes schema when allowedShould be controlled; avoid accidental drift
CastsConvert strings to dates, timestamps, decimalsBad casts may produce nulls or errors depending on mode
ConstraintsEnforce table-level data rulesUse for quality guarantees where supported

Example explicit schema:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DecimalType

schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("order_ts", TimestampType(), True),
    StructField("amount", DecimalType(10, 2), True)
])

Data Quality Patterns

RequirementPattern
Reject records with missing primary keyFilter invalid rows or use expectations/constraints
Quarantine malformed dataWrite invalid records to separate error table
Track ingestion lineageAdd source file name, ingestion timestamp, batch ID
Prevent duplicate business keysDeduplicate before merge; enforce uniqueness through pipeline logic
Validate referential qualityJoin to dimension/reference tables and isolate non-matches
Monitor row countsCompare source, accepted, rejected, inserted, updated counts

Useful metadata columns:

SELECT
  *,
  current_timestamp() AS ingestion_ts,
  _metadata.file_name AS source_file
FROM read_files('/Volumes/main/landing/orders', format => 'json');

Unity Catalog Security Quick Reference

ObjectTypical privilege ideaExam use
MetastoreAdministrative governance boundaryUsually not granted broadly
CatalogAccess top-level namespaceNeed catalog access before schema/table work
SchemaUse namespace and create objectsRequired for table/view creation in that schema
TableSelect, modify, manage depending on roleGrant least privilege
ViewProvide restricted access to query resultsUse to hide columns/rows or simplify access
VolumeRead/write governed filesFor non-tabular data access
External locationAccess cloud storage pathNeeded for external tables/volumes
Storage credentialCloud identity abstractionSecured tightly; not for general users

Governance Decision Table

RequirementPrefer
Govern tabular data with SQL permissionsUnity Catalog tables/views
Govern raw files that are not tablesUnity Catalog volumes
Share a subset of columns or rowsViews with appropriate grants
Isolate dev/test/prod namespacesSeparate catalogs or schemas
Avoid hard-coded cloud credentialsStorage credentials, external locations, secrets
Grant only read accessSELECT on table/view, plus required namespace usage
Let analysts query without modifying dataRead-only grants on curated gold tables/views

Example grants:

GRANT USE CATALOG ON CATALOG main TO `data_analysts`;
GRANT USE SCHEMA ON SCHEMA main.gold TO `data_analysts`;
GRANT SELECT ON TABLE main.gold.sales_summary TO `data_analysts`;

Compute Selection

NeedChooseWhy
Interactive notebook developmentAll-purpose computeSupports iterative exploration
Scheduled production taskJob computeCreated for job run, easier lifecycle control
SQL dashboards and BI queriesSQL warehouseOptimized SQL serving experience
Isolate workloads and reduce idle costJob clusters / task-specific computeRuns only when needed
Enforce standardized settingsCluster policiesGovernance and cost control
Faster SQL/DataFrame execution where availablePhoton-enabled computeVectorized execution engine for supported workloads
Avoid installing libraries manually each runJob/task library configurationReproducible production setup

Common traps:

TrapCorrection
Using all-purpose clusters for every scheduled workloadPrefer job compute for production jobs
Assuming driver memory solves all performance issuesLarge shuffles/skew need query/data design fixes
Installing libraries interactively onlyConfigure libraries on job/cluster for repeatability
Giving broad cluster permissionsUse least privilege and cluster policies

Databricks Workflows and Jobs

ConceptWhat to know
JobProduction unit for scheduled or triggered work
TaskStep inside a job, such as notebook, Python script, SQL task, pipeline, or JAR
DependencyDefines task order; downstream tasks wait for upstream success
Job clusterCompute created for a job/task run
ParametersPass runtime values into notebooks/scripts/SQL
RetryAutomatically rerun failed tasks based on configuration
Alert/notificationInform operators on failure, success, or duration conditions
Repair runRerun failed/skipped tasks without rerunning everything where supported
Task valuesPass small values between tasks in multi-task jobs

Workflow design pattern:

    flowchart LR
	    A[Ingest bronze] --> B[Validate and clean silver]
	    B --> C[MERGE dimensions]
	    B --> D[Build facts]
	    C --> E[Refresh gold marts]
	    D --> E
	    E --> F[Run quality checks]

Exam decision points:

RequirementAnswer pattern
Run notebook A before notebook BMulti-task job with dependency
Reuse same pipeline with different datesJob parameters
Use isolated compute for productionJob cluster
Notify on failureJob notification/alert
Avoid rerunning successful upstream tasks after partial failureRepair failed tasks where appropriate
Control who can edit/run jobJob permissions

SQL Warehouse vs Cluster

FeatureSQL warehouseInteractive/job cluster
Primary useSQL queries, dashboards, BINotebooks, Spark jobs, ML/data engineering code
InterfaceDatabricks SQL editor, BI integrationsNotebooks, jobs, Spark APIs
Language focusSQLSQL, Python, Scala, R depending on context
Production servingGood for SQL analyticsGood for ETL and programmatic pipelines
Exam clue“Dashboard,” “analyst,” “BI query”“Notebook ETL,” “PySpark,” “library,” “job task”

Performance Reference

SymptomLikely causePractical fix
Many tiny filesFrequent small writes, streaming micro-batchesOPTIMIZE, tune write patterns, compact periodically
Query scans too much dataPoor filters, no useful layout, unselective partitionsFilter early, select needed columns, optimize layout
Slow joinLarge shuffle, missing broadcast opportunity, skewBroadcast small dimension, handle skew, filter before join
Out-of-memory on drivercollect() or large result to driverAvoid collect; write/query distributed results
Slow aggregationHigh-cardinality shufflePre-filter, aggregate in stages, review partitioning
Repeated computationNo caching/materializationCache selectively, materialize intermediate Delta table
Stale or inefficient plansMissing stats or poor query designAnalyze/explain query, optimize tables
Streaming state growsNo watermark or unbounded keysAdd watermark, deduplicate carefully, manage state

Partitioning, OPTIMIZE, and Data Skipping

TechniqueUse whenAvoid when
Table partitioningLow/moderate-cardinality column commonly used for filtersHigh-cardinality columns like unique IDs
OPTIMIZETable has many small files or frequent incremental writesTiny tables with no performance issue
Z-ordering / clustering-style layoutQueries repeatedly filter on certain columnsColumns are not used in filters or are too random
CachingSame data reused repeatedly in active session/jobData is huge, rarely reused, or memory pressure is high
Materialized gold tableMany users need same transformed resultSource changes constantly and freshness requirements conflict

Troubleshooting Checklist

ProblemCheck first
Permission deniedUnity Catalog grants, catalog/schema/table privileges, external location privileges
Table not foundCurrent catalog/schema, object name, workspace/metastore attachment
Stream reprocessed dataCheckpoint path changed/deleted, source semantics, output mode
Schema mismatchSource schema drift, table schema, explicit schema, evolution settings
Duplicate rows after loadNon-idempotent append, missing merge key, source duplicates
Slow job after data growthFile count, skew, shuffle stages, join order, filters, OPTIMIZE need
Notebook works manually but job failsJob parameters, cluster libraries, permissions, secrets, current working context
collect() crashes driverToo much data returned to driver; use distributed write or limited sample
VACUUM removed needed filesRetention/time travel expectations were not considered before cleanup

Common Exam Traps

TrapCorrect understanding
“Delta Lake is just Parquet files”Delta uses Parquet data files plus a transaction log for reliability and metadata
“Views store data”Standard views store query definitions; materialized views store results
“A DataFrame transformation immediately runs”Spark transformations are lazy until an action
count() is harmless”It is an action and can scan large data
“Streaming means real-time only”Structured Streaming can also process available data incrementally with bounded-style triggers
“Append is safe for all incremental loads”Updates/deletes require merge or CDC-aware logic
“Partition by unique ID for faster lookup”High-cardinality partitioning often creates too many small partitions/files
“Delete checkpoint to fix stream”This can cause data duplication or state loss
“External table means ungoverned”External tables can still be governed through Unity Catalog
“VACUUM improves query speed directly”It removes obsolete files; compaction/layout are separate performance concerns
“Cache guarantees faster queries”Cache helps only if reused and materialized; it can also create memory pressure
“Job success means data quality is correct”Jobs can succeed while loading bad data unless quality checks are implemented

Compact Snippet Bank

Create and Use Namespaces

CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.silver;

USE CATALOG main;
USE SCHEMA silver;

Create Table from Query

CREATE OR REPLACE TABLE main.gold.daily_sales AS
SELECT
  date(order_ts) AS order_date,
  sum(amount) AS total_amount,
  count(*) AS order_count
FROM main.silver.orders
GROUP BY date(order_ts);

Append Cleaned Data

clean_df = (
    raw_df
    .filter("order_id IS NOT NULL")
    .withColumn("ingestion_ts", current_timestamp())
)

clean_df.write.mode("append").format("delta").saveAsTable("main.silver.orders")

Overwrite Safely for Rebuildable Gold Table

CREATE OR REPLACE TABLE main.gold.customer_metrics AS
SELECT
  customer_id,
  count(*) AS order_count,
  sum(amount) AS lifetime_value
FROM main.silver.orders
GROUP BY customer_id;

Watermarked Streaming Deduplication

deduped = (
    stream_df
    .withWatermark("event_ts", "1 day")
    .dropDuplicates(["event_id"])
)

Explain a Query Plan

EXPLAIN
SELECT customer_id, sum(amount)
FROM main.silver.orders
WHERE order_ts >= current_date() - INTERVAL 30 DAYS
GROUP BY customer_id;

Decision Trees

Ingestion Choice

    flowchart TD
	    A[Need to ingest data?] --> B{Source is files?}
	    B -- No --> C{Source is continuous events?}
	    C -- Yes --> D[Structured Streaming connector]
	    C -- No --> E[Batch read or connector-specific load]
	    B -- Yes --> F{Need scalable continuous file discovery?}
	    F -- Yes --> G[Auto Loader]
	    F -- No --> H{Prefer SQL incremental load?}
	    H -- Yes --> I[COPY INTO]
	    H -- No --> J[Spark batch read]

Table Maintenance Choice

    flowchart TD
	    A[Need to update target table?] --> B{Only new rows?}
	    B -- Yes --> C[Append]
	    B -- No --> D{Need inserts and updates?}
	    D -- Yes --> E[MERGE INTO]
	    D -- No --> F{Rebuild full result?}
	    F -- Yes --> G[CREATE OR REPLACE TABLE]
	    F -- No --> H[Use DELETE/UPDATE with clear predicate]

Final Review Checklist

Before exam day, make sure you can:

  • Explain Delta Lake transaction log, time travel, schema enforcement, and VACUUM.
  • Choose between COPY INTO, Auto Loader, batch reads, and Structured Streaming.
  • Write or recognize MERGE INTO for upserts.
  • Distinguish managed tables, external tables, views, materialized views, and volumes.
  • Apply Unity Catalog hierarchy and least-privilege grants.
  • Identify when a Spark operation is lazy, an action, narrow, or wide.
  • Use SQL windows for latest-record deduplication.
  • Choose job compute, all-purpose compute, or SQL warehouses based on workload.
  • Diagnose common failures involving checkpoints, permissions, schema drift, small files, and driver collection.
  • Recognize production patterns: idempotency, parameterization, retries, alerts, and quality checks.

For the next step, practice with scenario questions that force you to choose the right Databricks service, SQL command, Delta Lake operation, or production troubleshooting action under exam-style constraints.