DP-750 — Microsoft Certified: Azure Databricks Data Engineer Associate Quick Reference

Last revised: June 18, 2026

Compact DP-750 reference for Azure Databricks data engineering patterns, Delta Lake, Unity Catalog, pipelines, security, and optimization.

Use this Quick Reference as independent review support for Microsoft Certified: Azure Databricks Data Engineer Associate (DP-750). The exam rewards practical decisions: how to ingest data, transform it with Delta Lake, govern it with Unity Catalog, run reliable pipelines, and troubleshoot Azure Databricks workloads.

DP-750 Mental Model

Azure Databricks data engineering questions usually combine four layers:

Layer	What to decide	High-yield exam cues
Ingestion	How data enters the lakehouse	Auto Loader vs COPY INTO vs streaming source vs batch read
Storage and format	How data is stored and modeled	Delta tables, managed vs external, medallion architecture, schema evolution
Processing	How data is transformed and scheduled	Spark SQL, PySpark, jobs, Lakeflow Declarative Pipelines / Delta Live Tables
Governance and operations	How data is secured, monitored, and optimized	Unity Catalog, permissions, external locations, lineage, jobs UI, Spark UI, OPTIMIZE

    flowchart LR
	    A[Sources: files, databases, events, APIs] --> B[Landing storage in ADLS Gen2]
	    B --> C[Ingestion: Auto Loader, COPY INTO, Structured Streaming]
	    C --> D[Bronze Delta tables]
	    D --> E[Silver cleansing and conformance]
	    E --> F[Gold marts and aggregates]
	    F --> G[BI, ML, apps, sharing]
	    H[Unity Catalog] -. governs .-> D
	    H -. governs .-> E
	    H -. governs .-> F
	    I[Jobs / Workflows / Pipelines] -. orchestrate .-> C
	    I -. orchestrate .-> E
	    I -. orchestrate .-> F

Core Azure Databricks Object Map

Object	What it is	Exam point
Workspace	Azure Databricks environment for users, compute, notebooks, jobs	Workspace is not the primary data governance boundary when Unity Catalog is used
Metastore	Unity Catalog governance container	Assigned to workspaces; contains catalogs
Catalog	Top-level namespace in Unity Catalog	Use for environment, business domain, or governance boundary
Schema	Namespace inside a catalog	Also called database in older Spark/Hive terminology
Table	Structured data object, usually Delta	Prefer Delta for reliability, transactions, and optimization
View	Saved query	Useful for abstraction, row filtering, column masking, and simplified access
Volume	Unity Catalog object for non-tabular files	Prefer over legacy mounts for governed file access
External location	Governed reference to cloud storage path	Grants file access without exposing storage credentials
Storage credential	Unity Catalog credential for cloud storage	In Azure, commonly backed by managed identity / access connector patterns
Cluster	Spark compute for notebooks and jobs	Choose access mode and policies carefully for governance
SQL warehouse	Compute for Databricks SQL	Best for SQL analytics, dashboards, BI, SQL queries
Job	Orchestrated workflow	Use task dependencies, parameters, retries, schedules, and job clusters
Pipeline	Declarative data pipeline	Use Lakeflow Declarative Pipelines / Delta Live Tables for managed dependencies and quality rules
Notebook	Interactive or scheduled code unit	Good for development; production needs parameters, source control, and job orchestration
Secret scope	Secure reference to secrets	Prefer managed identities and Unity Catalog storage credentials where possible

Medallion Architecture Reference

Layer	Typical contents	Common operations	Quality expectation
Bronze	Raw or lightly parsed data	Ingest, append, preserve source metadata	Minimal transformation; keep recoverability
Silver	Cleaned, deduplicated, conformed data	Type casting, validation, joins, CDC handling, deduplication	Business-ready entity tables
Gold	Aggregated or serving data	Star schemas, marts, KPIs, feature tables, BI extracts	Optimized for consumption

High-yield distinctions:

Decision	Choose this when	Avoid this trap
Bronze stores raw records	You need replay, audit, or schema recovery	Do not overwrite raw history without a retention strategy
Silver applies business rules	You need reusable clean entities	Do not bury cleansing logic only in gold reports
Gold serves consumers	You need fast BI or domain-specific outputs	Do not make every downstream team read raw bronze data
Delta for all layers	You need ACID, schema enforcement, time travel, MERGE, optimization	Do not use plain Parquet when transactional updates are required

Feature Selection Matrix

Requirement	Best fit	Why
Incrementally ingest new files from cloud storage	Auto Loader	Tracks discovered files, supports schema inference/evolution, works with streaming
Load known files idempotently into Delta	COPY INTO	Simple SQL pattern for batch file ingestion
Process continuously arriving events	Structured Streaming	Handles streaming sources, checkpoints, state, watermarks
Build declarative, managed ETL with quality checks	Lakeflow Declarative Pipelines / Delta Live Tables	Manages dependencies, expectations, event logs, pipeline execution
Upsert changed records into a Delta table	MERGE INTO	Supports inserts, updates, deletes in one atomic operation
Read only changed rows from a Delta table	Change Data Feed	Efficient downstream incremental processing
Govern data and file access centrally	Unity Catalog	Catalog/schema/table permissions, external locations, lineage
Secure access to ADLS Gen2	Unity Catalog external locations and storage credentials	Avoid hard-coded keys and legacy mount-first designs
Run production Spark tasks	Jobs with job clusters	Repeatable, isolated, schedulable execution
Run BI SQL workloads	SQL warehouse	Optimized for SQL queries, dashboards, and BI tools
Explore interactively	All-purpose compute	Flexible for development, less ideal for production cost control
Standardize compute settings	Cluster policies	Enforce allowed node types, access modes, tags, and security settings
Reduce repeated cluster startup latency	Pools or serverless options where available	Use only when supported and appropriate for workload

Ingestion Quick Reference

Batch and File Ingestion

Pattern	Use when	Key details
`COPY INTO`	Periodic batch loads from files into Delta	SQL-friendly; tracks previously loaded files for the target
Auto Loader	New files arrive continuously or unpredictably	Uses `cloudFiles`; requires checkpoint and schema location
Direct Spark read	One-time exploration or controlled batch	Not ideal for incremental file tracking
External table over files	Query data in place	Use when data lifecycle is managed outside Databricks
Managed Delta table	Databricks manages table storage	Dropping table removes managed data
External Delta table	Data remains in external location	Dropping table removes metadata, not underlying files

COPY INTO Pattern

COPY INTO prod.bronze.orders_raw
FROM 'abfss://landing@storageacct.dfs.core.windows.net/orders/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

Exam cues:

Cue	Interpretation
“Simple SQL ingestion from files”	Consider `COPY INTO`
“Need automatic incremental file discovery”	Consider Auto Loader
“Need streaming semantics and checkpoints”	Use Structured Streaming / Auto Loader
“Files may arrive late or in nested directories”	Auto Loader is usually stronger than manual file lists

Auto Loader Pattern

raw_path = "abfss://landing@storageacct.dfs.core.windows.net/orders/"
schema_path = "abfss://checkpoint@storageacct.dfs.core.windows.net/schemas/orders/"
checkpoint_path = "abfss://checkpoint@storageacct.dfs.core.windows.net/checkpoints/orders/"

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(raw_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("prod.bronze.orders_raw")
)

Auto Loader exam traps:

Trap	Correct approach
Reusing the same checkpoint for multiple streams	Use a separate checkpoint per streaming query
Deleting checkpoint state casually	Expect duplicates or reprocessing unless designed for it
Omitting schema location	Schema inference/evolution becomes harder to manage
Treating Auto Loader like a one-time file read	It is designed for incremental discovery and streaming-style processing
Writing to non-idempotent sinks	Use Delta tables and deterministic logic when possible

Streaming Ingestion and Processing

Concept	Meaning	Exam use
Checkpoint	Stores stream progress and state	Required for fault tolerance
Trigger	Controls when micro-batches run	Scheduled-like incremental processing or continuous-like workloads
Output mode	Append, update, or complete	Depends on aggregation/stateful logic
Watermark	Bound on how long to keep state for late data	Required for many deduplication and time-window scenarios
State store	Maintains streaming aggregation/join/dedup state	Watch for state growth and late data
`foreachBatch`	Applies batch logic to each micro-batch	Useful for MERGE/upsert patterns

from pyspark.sql.functions import col

updates = (
    spark.readStream.table("prod.bronze.orders_raw")
    .filter(col("order_id").isNotNull())
)

def upsert_orders(batch_df, batch_id):
    batch_df.createOrReplaceTempView("orders_updates")
    spark.sql("""
        MERGE INTO prod.silver.orders AS t
        USING orders_updates AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

(
    updates.writeStream
    .foreachBatch(upsert_orders)
    .option("checkpointLocation", "abfss://checkpoint@storageacct.dfs.core.windows.net/checkpoints/orders_merge/")
    .start()
)

Delta Lake Operations Reference

Operation	Use when	Syntax cue
Create Delta table	Persist reliable lakehouse data	`CREATE TABLE ... USING DELTA` or DataFrame write
Append	Add new records	`INSERT INTO`, `.mode("append")`
Overwrite	Replace a dataset or partition carefully	`.mode("overwrite")`, `INSERT OVERWRITE`
MERGE	Upsert or delete based on keys	`MERGE INTO target USING source`
Time travel	Query previous table version	`VERSION AS OF` or `TIMESTAMP AS OF`
Restore	Roll table back to a prior version	`RESTORE TABLE ... TO VERSION AS OF`
History	Inspect operations and versions	`DESCRIBE HISTORY`
Change Data Feed	Read row-level changes	`table_changes()` or read options
OPTIMIZE	Compact small files	`OPTIMIZE table`
ZORDER	Improve data skipping for selected columns	`OPTIMIZE table ZORDER BY (...)`
VACUUM	Remove unreferenced old files	Be careful with time travel and lagging streams

MERGE Pattern

MERGE INTO prod.silver.customers AS target
USING prod.bronze.customers_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

MERGE traps:

Trap	Why it matters
Duplicate source rows match one target row	Can cause ambiguous updates; deduplicate source first
No stable business key	Upserts become unreliable
Using MERGE for pure append	Adds unnecessary complexity
Not handling deletes from CDC source	Target keeps records that should be removed or expired
Schema drift not planned	MERGE may fail or create inconsistent expectations

Time Travel and History

DESCRIBE HISTORY prod.silver.orders;

SELECT *
FROM prod.silver.orders VERSION AS OF 12;

RESTORE TABLE prod.silver.orders TO VERSION AS OF 12;

Feature	Use for	Watch out
Time travel	Auditing, debugging, reproducibility	Limited by retained Delta log/data files
Restore	Operational rollback	It creates a new table version
VACUUM	Storage cleanup	Can remove files needed for older time travel or delayed streaming readers

Change Data Feed

ALTER TABLE prod.silver.orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

SELECT *
FROM table_changes('prod.silver.orders', 10);

Use CDF when	Do not use CDF when
Downstream jobs need incremental changes	Full refresh is simpler and small enough
You need inserts, updates, and deletes from Delta	Source is not Delta or CDF is not enabled
You want efficient propagation to gold tables	Consumers cannot keep up with retention expectations

Schema, Tables, and Storage Decisions

Managed vs External Tables

Table type	Storage controlled by	Drop behavior	Best use
Managed table	Databricks / Unity Catalog managed storage	Dropping table removes managed data	Standard curated lakehouse tables
External table	External cloud storage location	Dropping table removes metadata only	Data shared with other systems or lifecycle managed externally

Schema Enforcement and Evolution

Requirement	Option	Exam note
Reject unexpected schema	Schema enforcement	Good for trusted silver/gold layers
Allow new columns during ingestion	Schema evolution	Common for bronze Auto Loader
Overwrite schema intentionally	`overwriteSchema` patterns	Use carefully; can break consumers
Merge with new columns	`mergeSchema` / controlled evolution	Validate before using in curated layers
Store rescued data	Rescue column patterns	Useful when raw records may contain unexpected fields

Partitioning, Clustering, and File Layout

Technique	Use when	Avoid
Partitioning	Large tables frequently filtered by low/moderate-cardinality columns	High-cardinality partitions that create many tiny files
OPTIMIZE	Table has many small files	Running constantly without need
ZORDER	Queries filter on selective columns	Too many columns or columns rarely used in filters
Liquid clustering	Supported environment and evolving query patterns	Combining blindly with older partition/ZORDER assumptions
Auto compaction / optimized writes	Need better file sizing with less manual work	Assuming they fix bad table design

Lakeflow Declarative Pipelines / Delta Live Tables

Microsoft and Databricks materials may reference Delta Live Tables (DLT) and newer Lakeflow Declarative Pipelines terminology. For exam purposes, focus on the concepts: declarative tables, dependencies, streaming tables, data quality expectations, pipeline monitoring, and managed execution.

Concept	Meaning	Exam use
Pipeline	Managed set of table definitions and flows	Use for reliable ETL with dependency management
Live table / materialized table	Declarative table created from query logic	Good for batch transformations
Streaming live table	Table fed by streaming input	Good for incremental file/event ingestion
Expectation	Data quality rule	Warn, drop invalid rows, or fail pipeline
Event log	Pipeline operational log	Troubleshooting and audit of pipeline runs
Development mode	Faster iteration	Not the same as production reliability settings
Triggered pipeline	Runs then stops	Batch-like scheduled processing
Continuous pipeline	Keeps processing	Streaming-style processing

Example DLT-style SQL:

CREATE OR REFRESH STREAMING LIVE TABLE bronze_orders
AS
SELECT *
FROM cloud_files(
  "abfss://landing@storageacct.dfs.core.windows.net/orders/",
  "json"
);

CREATE OR REFRESH LIVE TABLE silver_orders
(
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_amount EXPECT (amount >= 0) ON VIOLATION DROP ROW
)
AS
SELECT
  order_id,
  customer_id,
  CAST(amount AS DECIMAL(18,2)) AS amount,
  CAST(order_timestamp AS TIMESTAMP) AS order_timestamp
FROM STREAM(LIVE.bronze_orders);

Expectation actions:

Action	Result	Use when
Warn	Records are kept; metric is recorded	You need visibility without blocking
Drop	Invalid records are removed	Bad rows should not enter curated data
Fail	Pipeline stops	Data quality failure should block publication

DLT / Lakeflow traps:

Trap	Correct thinking
Treating pipeline tables as manually updated tables	They are managed by pipeline definitions
Hiding quality rules in notebook code only	Use expectations when pipeline-level metrics matter
Ignoring event logs	Event logs are key for troubleshooting
Using continuous mode for everything	Triggered mode is often enough for scheduled batch-style ingestion
Mixing dev and prod assumptions	Production pipelines need stable configuration, permissions, and monitoring

Unity Catalog and Governance

Namespace and Securable Hierarchy

Level	Example	Notes
Metastore	`main metastore`	Assigned to one or more workspaces
Catalog	`prod`	Top-level data namespace
Schema	`prod.silver`	Groups tables, views, functions, volumes
Object	`prod.silver.orders`	Table, view, volume, model, function, etc.

Use three-level names for clarity:

SELECT *
FROM prod.silver.orders;

Common Unity Catalog Privileges

Privilege	Grants ability to
`USE CATALOG`	Access objects inside a catalog, subject to lower-level grants
`USE SCHEMA`	Access objects inside a schema, subject to object grants
`SELECT`	Read table or view data
`MODIFY`	Insert, update, delete, merge, or otherwise change table data
`CREATE TABLE`	Create tables in a schema
`CREATE VOLUME`	Create volumes in a schema
`READ VOLUME`	Read files in a volume
`WRITE VOLUME`	Write files in a volume
`EXECUTE`	Run functions or models where applicable
`MANAGE`	Manage privileges or object settings, depending on object type

Basic grant pattern:

GRANT USE CATALOG ON CATALOG prod TO `data-engineers`;
GRANT USE SCHEMA ON SCHEMA prod.silver TO `data-engineers`;
GRANT SELECT ON TABLE prod.silver.orders TO `analysts`;
GRANT SELECT, MODIFY ON TABLE prod.silver.orders TO `etl-service-principal`;

High-yield permission rule: object-level access is not enough if the principal lacks USE CATALOG and USE SCHEMA.

External Locations and Volumes

Object	Purpose	Exam distinction
Storage credential	Authenticates to cloud storage	Do not expose raw keys in notebooks
External location	Governs access to a cloud path	Grants `READ FILES` / `WRITE FILES`
External table	Table metadata over data in external storage	For structured tabular data
Volume	Governed access to files	For non-tabular files, libraries, ML files, landing files
Legacy mount	Workspace-level storage mount	Avoid as a UC-first governance design

Example external location pattern:

CREATE EXTERNAL LOCATION raw_orders_location
URL 'abfss://landing@storageacct.dfs.core.windows.net/orders/'
WITH (STORAGE CREDENTIAL adls_storage_credential);

GRANT READ FILES ON EXTERNAL LOCATION raw_orders_location TO `ingestion-engineers`;

Security Decision Table

Requirement	Prefer	Avoid
Govern table access across workspaces	Unity Catalog grants	Workspace-local ACL-only design
Secure ADLS Gen2 access	Managed identity/access connector with UC storage credential	Hard-coded account keys in notebooks
Restrict rows by user/group	Row filters or dynamic views	Duplicating many physical tables
Restrict sensitive columns	Column masks or secure views	Giving broad table access then relying on consumers
Store passwords/API keys	Secret scopes or managed identities	Plain text in notebooks, jobs, or repos
Production service identity	Service principal or managed identity pattern	Personal user identity for scheduled jobs
Govern non-tabular files	Volumes and external locations	Ungoverned DBFS or ad hoc mounts

Unity Catalog Traps

Symptom	Likely cause
User can see table name but query fails	Missing `SELECT`, `USE SCHEMA`, or `USE CATALOG`
Job works for developer but fails in production	Job identity lacks UC or storage permissions
External table cannot read files	External location or storage credential issue
Notebook path works in one workspace only	Workspace-local mount or DBFS dependency
Data appears outside lineage/governance	Legacy metastore, unmanaged path, or direct cloud access bypassing UC
Drop table removed data unexpectedly	It was a managed table

Compute Selection

Compute type	Best for	Exam notes
All-purpose compute	Interactive notebooks, exploration, development	Flexible but not ideal as default production runtime
Job compute / job cluster	Scheduled production tasks	Ephemeral, repeatable, easier cost and dependency control
SQL warehouse	SQL queries, dashboards, BI, Databricks SQL	Not a general PySpark notebook cluster
Serverless compute where available	Reduced infrastructure management	Confirm workload and governance support in scenario
Cluster pool	Faster cluster startup	Useful when many similar clusters start frequently
Photon-enabled compute	SQL and Delta-heavy workloads	Often improves query performance for supported operations

Access Modes

Access mode terminology	Use case	Exam point
Standard / shared	Multiple users with governance controls	Common for UC-enabled collaborative workloads
Dedicated / single user	One user or assigned identity	Useful for isolation and certain workloads
No isolation shared / legacy	Older less-isolated mode	Avoid for modern governed UC workloads

Cluster Configuration Cues

Requirement	Setting or feature
Enforce approved settings	Cluster policy
Control library versions	Job cluster config, init scripts only when needed, pinned dependencies
Minimize idle cost	Auto-termination for interactive clusters
Handle variable load	Autoscaling
Separate dev/test/prod	Separate workspaces, catalogs, schemas, or policies as appropriate
Improve repeatability	Jobs, parameters, source-controlled code

Jobs, Workflows, and Deployment

Feature	Use for
Task dependencies	Build DAG-style workflows
Job parameters	Avoid hard-coded environment paths and dates
Retries	Handle transient failures
Run-if conditions	Control downstream behavior after success/failure
Job clusters	Isolated production compute per job or task
Shared job cluster	Reuse compute among tasks in same job when appropriate
Schedule trigger	Time-based orchestration
File arrival trigger	Start when new data arrives, where supported
Continuous trigger	Always-on processing pattern
Alerts/notifications	Operational awareness
Git integration / source control	Version notebooks, code, SQL, pipeline definitions
Databricks Asset Bundles or deployment tooling	Promote repeatable assets across environments

Production readiness checklist:

Use service principals or managed identities for scheduled workloads.
Parameterize catalog, schema, storage path, and processing date.
Separate development and production data namespaces.
Store secrets outside code.
Use job clusters or governed compute policies.
Define retry behavior and failure notifications.
Log row counts, rejected records, and important pipeline metrics.
Avoid relying on an interactive user’s cluster, credentials, or notebook state.

Performance and Optimization Reference

Table and Query Tuning

Symptom	Likely cause	Corrective action
Query scans too much data	Poor filters, no data skipping, bad layout	Partition appropriately, ZORDER/liquid clustering, collect stats where relevant
Many small files	Frequent small writes or streaming micro-batches	OPTIMIZE, optimized writes, compaction strategy
Slow joins	Shuffle-heavy join, skew, missing broadcast opportunity	Broadcast small dimension, repartition, handle skew, use AQE
OOM during transformation	Large shuffle or wide operation	Reduce data earlier, repartition carefully, avoid collecting to driver
Slow Python UDFs	Row-by-row Python execution	Prefer built-in Spark SQL functions or vectorized patterns
Streaming state grows	No watermark or broad aggregation key	Add watermark, reduce state, tune late-data assumptions
Dashboard slow	Gold table not serving-shaped	Pre-aggregate, use SQL warehouse, optimize table layout
Repeated full recomputation	No incremental design	Use CDF, Auto Loader, MERGE, or pipeline incremental patterns

Spark Execution Concepts

Concept	Why it matters
Lazy evaluation	Transformations run only when an action executes
Narrow transformation	No shuffle; generally cheaper
Wide transformation	Requires shuffle; often expensive
Shuffle	Data redistributed across partitions
Partition	Unit of parallel data processing
Driver	Coordinates Spark application
Executor	Runs tasks and stores/cache partitions
Cache/persist	Useful for reused intermediate data; can become stale or consume memory
Adaptive Query Execution	Runtime query optimization for supported workloads
Broadcast join	Sends small table to workers to avoid large shuffle

Optimization Traps

Trap	Better answer
“Partition by every filter column”	Partition only when cardinality and access patterns justify it
“Cache everything”	Cache reused data only; unpersist when done
“Use bigger clusters first”	Fix data layout, shuffles, and query logic before scaling blindly
“VACUUM aggressively”	Respect time travel, rollback, and streaming readers
“One huge notebook does everything”	Break into tasks/pipelines with observable boundaries
“Use UDFs for simple expressions”	Use native Spark functions for optimizer support

Data Quality and Reliability

Requirement	Feature or pattern
Reject invalid rows in pipeline	Expectations with drop/fail action
Track invalid rows for review	Quarantine table pattern
Enforce non-null or valid values	Constraints or expectations
Deduplicate events	Window functions, watermark-based streaming deduplication
Handle late data	Watermarks and business-defined lateness rules
Idempotent reruns	MERGE by key, deterministic outputs, checkpoint discipline
Audit load metadata	Add source file, ingestion timestamp, batch ID
Backfill historical data	Parameterized jobs, controlled overwrite/merge, separate checkpoint strategy

Useful metadata columns:

Column	Purpose
`ingestion_timestamp`	When the platform ingested the row
`source_file`	File lineage and troubleshooting
`batch_id`	Rerun and reconciliation
`record_hash`	Change detection or deduplication
`is_quarantined` / error fields	Data quality review

CDC and Slowly Changing Dimensions

Pattern	Use when	Core logic
Type 1 SCD	Keep only latest value	MERGE update overwrites existing row
Type 2 SCD	Preserve history	Expire current row, insert new current row
Delete propagation	Source emits deletes	MERGE with `WHEN MATCHED ... DELETE` or expire record
CDF downstream	Delta source changes should feed another table	Read changes since last version
Pipeline CDC helpers	Declarative CDC in Lakeflow/DLT scenarios	Use when exam scenario emphasizes managed CDC pipeline

SCD Type 2 cues:

Field	Purpose
Business key	Identifies entity, such as `customer_id`
Surrogate key	Unique dimension row key
Effective start/end	Validity range
Current flag	Identifies active record
Hash diff	Detects attribute changes

Monitoring and Troubleshooting

Where to Look

Problem	First places to inspect
Job failed	Jobs run output, task logs, cluster logs
Spark query slow	Spark UI, SQL query profile, stages, shuffle metrics
SQL dashboard slow	Query history, warehouse size/state, table layout
Pipeline failed	Pipeline event log, expectation metrics, failed flow/table
Stream stalled	Streaming query progress, checkpoint path, source backlog
Permission denied	Unity Catalog grants, external location grants, cloud IAM
Data missing	Ingestion file discovery, filters, expectations, checkpoints
Duplicates	Checkpoint reset, non-idempotent writes, source replay, missing dedup key
Schema error	Auto Loader schema location, rescued data, evolution settings

Common Error Patterns

Error pattern	Likely cause	Fix direction
Cannot access `abfss://...`	Storage credential, external location, or cloud permission missing	Validate UC external location and identity
Table not found	Wrong catalog/schema or current context	Use three-level names
Permission denied on table	Missing grants	Grant `USE CATALOG`, `USE SCHEMA`, and object privilege
Stream already active or checkpoint conflict	Reused checkpoint/query	Use distinct checkpoint and stop old query
Schema mismatch	Source changed	Apply controlled schema evolution or rescue/quarantine
Slow stage with high shuffle	Skew or wide transformation	Repartition, broadcast, filter earlier, handle skew
Driver out of memory	Collected too much data to driver	Avoid `.collect()` for large data; write distributed outputs
Unexpected old results	Cached data or stale table/view	Refresh/unpersist or rerun after invalidating cache

Triage Order

Identify whether the failure is permissions, code, data, compute, or orchestration.
Check the job or pipeline run output.
Inspect table names, catalog/schema context, and identity used by the run.
Validate source paths, external locations, and grants.
Review schema changes and data quality expectation failures.
Use Spark UI or SQL profile for performance bottlenecks.
Fix idempotency before rerunning failed writes.

Azure-Specific Integration Points

Azure component	Azure Databricks role	Exam note
ADLS Gen2	Primary lakehouse storage	Use `abfss://` paths and governed access
Microsoft Entra ID	Users, groups, service principals	Prefer group-based access management
Managed identities / access connectors	Secure Azure resource access	Avoid embedded credentials
Azure Key Vault	Secret backing for secret scopes	Useful for external secrets, but not a substitute for UC governance
Azure Data Factory / Synapse pipelines	External orchestration	Can trigger Databricks jobs/notebooks
Event Hubs	Streaming source	Often used with Structured Streaming
Microsoft Purview	Broader data catalog/governance integration	Unity Catalog handles Databricks-native governance and lineage

SQL and PySpark Syntax Mini-Reference

Task	SQL / PySpark cue
Set catalog	`USE CATALOG prod;`
Set schema	`USE SCHEMA silver;`
Create table as select	`CREATE TABLE prod.gold.sales AS SELECT ...`
Insert data	`INSERT INTO prod.silver.orders SELECT ...`
Overwrite table	`CREATE OR REPLACE TABLE ... AS SELECT ...`
Merge updates	`MERGE INTO ... USING ... ON ...`
Inspect history	`DESCRIBE HISTORY catalog.schema.table`
Optimize table	`OPTIMIZE catalog.schema.table`
Read table in PySpark	`spark.table("prod.silver.orders")`
Write table in PySpark	`df.write.mode("append").saveAsTable("prod.silver.orders")`
Streaming read table	`spark.readStream.table("prod.bronze.orders")`
Streaming write table	`df.writeStream.toTable("prod.silver.orders")`

High-Yield Exam Traps Checklist

Do not choose legacy mounts when the scenario emphasizes Unity Catalog governance.
Do not grant SELECT only and forget USE CATALOG and USE SCHEMA.
Do not use an all-purpose interactive cluster as the default production answer.
Do not reset or share streaming checkpoints without understanding replay and duplicates.
Do not use direct file reads when the requirement is incremental file discovery.
Do not use COPY INTO for continuous event streams.
Do not use MERGE without a stable key and deduplicated source.
Do not overpartition high-cardinality columns.
Do not run VACUUM casually when rollback, time travel, or lagging streams matter.
Do not store production data in DBFS root as a governance strategy.
Do not assume a notebook user’s permissions are the same as the job’s service identity.
Do not hide all data quality logic downstream in BI; validate in silver/pipeline layers.
Do not choose Python UDFs for simple transformations that Spark SQL functions can handle.
Do not ignore pipeline event logs, job task logs, Spark UI, and SQL query profiles during troubleshooting.

Last-Mile Practice Plan

Practice DP-750 scenarios by forcing yourself to choose: ingestion pattern, Delta table design, Unity Catalog permissions, compute type, orchestration method, and troubleshooting path. Then implement small end-to-end exercises: Auto Loader to bronze, MERGE to silver, aggregate to gold, secure with Unity Catalog grants, schedule as a job, and diagnose one intentional failure.

Scenario Guide

Azure Databricks Setup