DP-750 — Microsoft Certified: Azure Databricks Data Engineer Associate Quick Reference

Compact DP-750 reference for Azure Databricks data engineering patterns, Delta Lake, Unity Catalog, pipelines, security, and optimization.

Use this Quick Reference as independent review support for Microsoft Certified: Azure Databricks Data Engineer Associate (DP-750). The exam rewards practical decisions: how to ingest data, transform it with Delta Lake, govern it with Unity Catalog, run reliable pipelines, and troubleshoot Azure Databricks workloads.

DP-750 Mental Model

Azure Databricks data engineering questions usually combine four layers:

Layer | What to decide | High-yield exam cues
Ingestion | How data enters the lakehouse | Auto Loader vs COPY INTO vs streaming source vs batch read
Storage and format | How data is stored and modeled | Delta tables, managed vs external, medallion architecture, schema evolution
Processing | How data is transformed and scheduled | Spark SQL, PySpark, jobs, Lakeflow Declarative Pipelines / Delta Live Tables
Governance and operations | How data is secured, monitored, and optimized | Unity Catalog, permissions, external locations, lineage, jobs UI, Spark UI, OPTIMIZE

    flowchart LR
        A[Sources: files, databases, events, APIs] --> B[Landing storage in ADLS Gen2]
        B --> C[Ingestion: Auto Loader, COPY INTO, Structured Streaming]
        C --> D[Bronze Delta tables]
        D --> E[Silver cleansing and conformance]
        E --> F[Gold marts and aggregates]
        F --> G[BI, ML, apps, sharing]
        H[Unity Catalog] -. governs .-> D
        H -. governs .-> E
        H -. governs .-> F
        I[Jobs / Workflows / Pipelines] -. orchestrate .-> C
        I -. orchestrate .-> E
        I -. orchestrate .-> F

Core Azure Databricks Object Map

Object | What it is | Exam point
Workspace | Azure Databricks environment for users, compute, notebooks, jobs | Workspace is not the primary data governance boundary when Unity Catalog is used
Metastore | Unity Catalog governance container | Assigned to workspaces; contains catalogs
Catalog | Top-level namespace in Unity Catalog | Use for environment, business domain, or governance boundary
Schema | Namespace inside a catalog | Also called database in older Spark/Hive terminology
Table | Structured data object, usually Delta | Prefer Delta for reliability, transactions, and optimization
View | Saved query | Useful for abstraction, row filtering, column masking, and simplified access
Volume | Unity Catalog object for non-tabular files | Prefer over legacy mounts for governed file access
External location | Governed reference to cloud storage path | Grants file access without exposing storage credentials
Storage credential | Unity Catalog credential for cloud storage | In Azure, commonly backed by managed identity / access connector patterns
Cluster | Spark compute for notebooks and jobs | Choose access mode and policies carefully for governance
SQL warehouse | Compute for Databricks SQL | Best for SQL analytics, dashboards, BI, SQL queries
Job | Orchestrated workflow | Use task dependencies, parameters, retries, schedules, and job clusters
Pipeline | Declarative data pipeline | Use Lakeflow Declarative Pipelines / Delta Live Tables for managed dependencies and quality rules
Notebook | Interactive or scheduled code unit | Good for development; production needs parameters, source control, and job orchestration
Secret scope | Secure reference to secrets | Prefer managed identities and Unity Catalog storage credentials where possible

Medallion Architecture Reference

Layer | Typical contents | Common operations | Quality expectation
Bronze | Raw or lightly parsed data | Ingest, append, preserve source metadata | Minimal transformation; keep recoverability
Silver | Cleaned, deduplicated, conformed data | Type casting, validation, joins, CDC handling, deduplication | Business-ready entity tables
Gold | Aggregated or serving data | Star schemas, marts, KPIs, feature tables, BI extracts | Optimized for consumption

High-yield distinctions:

Decision | Choose this when | Avoid this trap
Bronze stores raw records | You need replay, audit, or schema recovery | Do not overwrite raw history without a retention strategy
Silver applies business rules | You need reusable clean entities | Do not bury cleansing logic only in gold reports
Gold serves consumers | You need fast BI or domain-specific outputs | Do not make every downstream team read raw bronze data
Delta for all layers | You need ACID, schema enforcement, time travel, MERGE, optimization | Do not use plain Parquet when transactional updates are required

Feature Selection Matrix

Requirement | Best fit | Why
Incrementally ingest new files from cloud storage | Auto Loader | Tracks discovered files, supports schema inference/evolution, works with streaming
Load known files idempotently into Delta | COPY INTO | Simple SQL pattern for batch file ingestion
Process continuously arriving events | Structured Streaming | Handles streaming sources, checkpoints, state, watermarks
Build declarative, managed ETL with quality checks | Lakeflow Declarative Pipelines / Delta Live Tables | Manages dependencies, expectations, event logs, pipeline execution
Upsert changed records into a Delta table | MERGE INTO | Supports inserts, updates, deletes in one atomic operation
Read only changed rows from a Delta table | Change Data Feed | Efficient downstream incremental processing
Govern data and file access centrally | Unity Catalog | Catalog/schema/table permissions, external locations, lineage
Secure access to ADLS Gen2 | Unity Catalog external locations and storage credentials | Avoid hard-coded keys and legacy mount-first designs
Run production Spark tasks | Jobs with job clusters | Repeatable, isolated, schedulable execution
Run BI SQL workloads | SQL warehouse | Optimized for SQL queries, dashboards, and BI tools
Explore interactively | All-purpose compute | Flexible for development, less ideal for production cost control
Standardize compute settings | Cluster policies | Enforce allowed node types, access modes, tags, and security settings
Reduce repeated cluster startup latency | Pools or serverless options where available | Use only when supported and appropriate for workload

Ingestion Quick Reference

Batch and File Ingestion

Pattern | Use when | Key details
COPY INTO | Periodic batch loads from files into Delta | SQL-friendly; tracks previously loaded files for the target
Auto Loader | New files arrive continuously or unpredictably | Uses cloudFiles; requires checkpoint and schema location
Direct Spark read | One-time exploration or controlled batch | Not ideal for incremental file tracking
External table over files | Query data in place | Use when data lifecycle is managed outside Databricks
Managed Delta table | Databricks manages table storage | Dropping table removes managed data
External Delta table | Data remains in external location | Dropping table removes metadata, not underlying files

COPY INTO Pattern

COPY INTO prod.bronze.orders_raw
FROM 'abfss://landing@storageacct.dfs.core.windows.net/orders/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

Exam cues:

Cue | Interpretation
“Simple SQL ingestion from files” | Consider COPY INTO
“Need automatic incremental file discovery” | Consider Auto Loader
“Need streaming semantics and checkpoints” | Use Structured Streaming / Auto Loader
“Files may arrive late or in nested directories” | Auto Loader is usually stronger than manual file lists

Auto Loader Pattern

raw_path = "abfss://landing@storageacct.dfs.core.windows.net/orders/"
schema_path = "abfss://checkpoint@storageacct.dfs.core.windows.net/schemas/orders/"
checkpoint_path = "abfss://checkpoint@storageacct.dfs.core.windows.net/checkpoints/orders/"

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(raw_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("prod.bronze.orders_raw")
)

Auto Loader exam traps:

Trap | Correct approach
Reusing the same checkpoint for multiple streams | Use a separate checkpoint per streaming query
Deleting checkpoint state casually | Expect duplicates or reprocessing unless designed for it
Omitting schema location | Set cloudFiles.schemaLocation whenever the schema is inferred; without it, inference and evolution cannot be tracked across runs
Treating Auto Loader like a one-time file read | It is designed for incremental discovery and streaming-style processing
Writing to non-idempotent sinks | Use Delta tables and deterministic logic when possible
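
The checkpoint traps above are easier to reason about with a toy model. The sketch below is plain Python, not Auto Loader itself: it treats the checkpoint as a set of already-processed file names. The function name `discover_new_files` and the file paths are hypothetical and exist only for illustration.

```python
def discover_new_files(available: list[str], processed: set[str]) -> list[str]:
    """Return files not yet recorded in the checkpoint, then record them."""
    new_files = sorted(f for f in set(available) if f not in processed)
    processed.update(new_files)
    return new_files


landing = ["orders/001.json", "orders/002.json"]
checkpoint: set[str] = set()

first_run = discover_new_files(landing, checkpoint)    # both files are new
landing.append("orders/003.json")
second_run = discover_new_files(landing, checkpoint)   # only the newly arrived file

checkpoint.clear()                                     # simulate deleting the checkpoint
after_reset = discover_new_files(landing, checkpoint)  # every file is picked up again
```

The last call shows why a casual checkpoint reset tends to produce duplicates downstream unless the write is idempotent.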

Streaming Ingestion and Processing

Concept | Meaning | Exam use
Checkpoint | Stores stream progress and state | Required for fault tolerance
Trigger | Controls when micro-batches run | Scheduled-like incremental processing or continuous-like workloads
Output mode | Append, update, or complete | Depends on aggregation/stateful logic
Watermark | Bound on how long to keep state for late data | Required for many deduplication and time-window scenarios
State store | Maintains streaming aggregation/join/dedup state | Watch for state growth and late data
foreachBatch | Applies batch logic to each micro-batch | Useful for MERGE/upsert patterns

from pyspark.sql.functions import col

updates = (
    spark.readStream.table("prod.bronze.orders_raw")
    .filter(col("order_id").isNotNull())
)

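def upsert_orders(batch_df, batch_id):
    # One row per key so MERGE never sees several source rows for one target row.
    # dropDuplicates keeps an arbitrary row; order by a timestamp if the latest must win.
    batch_df.dropDuplicates(["order_id"]).createOrReplaceTempView("orders_updates")
    # Run the MERGE on the micro-batch's own session, where the temp view is registered.
    batch_df.sparkSession.sql("""
        MERGE INTO prod.silver.orders AS t
        USING orders_updates AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)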

(
    updates.writeStream
    .foreachBatch(upsert_orders)
    .option("checkpointLocation", "abfss://checkpoint@storageacct.dfs.core.windows.net/checkpoints/orders_merge/")
    .start()
)
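
The watermark and state store rows in the table above can be illustrated with a simplified model. This is a plain-Python sketch, not Structured Streaming: it advances the watermark per event (Spark advances it per micro-batch), uses integer minutes as event times, and the name `dedupe_with_watermark` is hypothetical.

```python
def dedupe_with_watermark(events: list[tuple[str, int]], delay: int) -> tuple[list[str], int]:
    """Deduplicate (event_id, event_time) pairs in arrival order.

    Returns the emitted ids and the number of ids still held in state.
    """
    max_event_time = None
    seen: dict[str, int] = {}  # dedup state: event_id -> event_time
    emitted: list[str] = []
    for event_id, event_time in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        # State older than the watermark is evicted, which bounds memory use.
        seen = {i: t for i, t in seen.items() if t >= watermark}
        if event_time < watermark:
            continue  # too late: dropped
        if event_id in seen:
            continue  # duplicate within the watermark window
        seen[event_id] = event_time
        emitted.append(event_id)
    return emitted, len(seen)


events = [("a", 0), ("a", 1), ("b", 5), ("c", 20), ("a", 2), ("d", 21)]
emitted, state_size = dedupe_with_watermark(events, delay=10)
```

Without the eviction step the `seen` dictionary would grow forever, which is the state-growth problem a watermark is meant to bound.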

Delta Lake Operations Reference

Operation | Use when | Syntax cue
Create Delta table | Persist reliable lakehouse data | CREATE TABLE ... USING DELTA or DataFrame write
Append | Add new records | INSERT INTO, .mode("append")
Overwrite | Replace a dataset or partition carefully | .mode("overwrite"), INSERT OVERWRITE
MERGE | Upsert or delete based on keys | MERGE INTO target USING source
Time travel | Query previous table version | VERSION AS OF or TIMESTAMP AS OF
Restore | Roll table back to a prior version | RESTORE TABLE ... TO VERSION AS OF
History | Inspect operations and versions | DESCRIBE HISTORY
Change Data Feed | Read row-level changes | table_changes() or read options
OPTIMIZE | Compact small files | OPTIMIZE table
ZORDER | Improve data skipping for selected columns | OPTIMIZE table ZORDER BY (...)
VACUUM | Remove unreferenced old files | Be careful with time travel and lagging streams

MERGE Pattern

MERGE INTO prod.silver.customers AS target
USING prod.bronze.customers_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

MERGE traps:

Trap | Why it matters
Duplicate source rows match one target row | The update is ambiguous and Delta rejects the MERGE with an error; deduplicate the source first
No stable business key | Upserts become unreliable
Using MERGE for pure append | Adds unnecessary complexity
Not handling deletes from CDC source | Target keeps records that should be removed or expired
Schema drift not planned | MERGE may fail or create inconsistent expectations
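
The duplicate-source trap above is usually handled by keeping only the latest record per key before merging. The following is a minimal plain-Python sketch of that idea, modeling rows as dicts; it is not Delta Lake code. The field names `customer_id` and `updated_at` and both helper names are assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def dedupe_latest(rows: list[Row], key: str, order_by: str) -> list[Row]:
    """Keep one row per key: the one with the greatest order_by value."""
    latest: dict[Any, Row] = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


def merge_upsert(target: list[Row], source: list[Row], key: str) -> list[Row]:
    """Model of WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT semantics."""
    keys = [row[key] for row in source]
    if len(keys) != len(set(keys)):
        # Several source rows for one key would make the update ambiguous.
        raise ValueError("source has duplicate keys; deduplicate first")
    merged = {row[key]: row for row in target}
    for row in source:
        merged[row[key]] = row  # update if present, insert otherwise
    return list(merged.values())


target = [{"customer_id": 1, "name": "Ada", "updated_at": 1}]
updates = [
    {"customer_id": 1, "name": "Ada L.", "updated_at": 2},
    {"customer_id": 1, "name": "Ada Lovelace", "updated_at": 3},
    {"customer_id": 2, "name": "Grace", "updated_at": 1},
]
result = merge_upsert(target, dedupe_latest(updates, "customer_id", "updated_at"), "customer_id")
```

In PySpark the same deduplication is commonly written with a window function that ranks rows per key by a timestamp and keeps the first.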

Time Travel and History

DESCRIBE HISTORY prod.silver.orders;

SELECT *
FROM prod.silver.orders VERSION AS OF 12;

RESTORE TABLE prod.silver.orders TO VERSION AS OF 12;

Feature | Use for | Watch out
Time travel | Auditing, debugging, reproducibility | Limited by retained Delta log/data files
Restore | Operational rollback | It creates a new table version
VACUUM | Storage cleanup | Can remove files needed for older time travel or delayed streaming readers
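
One way to respect the VACUUM caution above is to derive the retention window from the longest requirement rather than picking a number by habit. The helper below is a hypothetical sketch that only builds the SQL text; the function name and its parameters are assumptions, while `RETAIN ... HOURS` and `DRY RUN` are standard VACUUM clauses.

```python
def vacuum_statement(table: str, time_travel_hours: int, max_reader_lag_hours: int,
                     dry_run: bool = True) -> str:
    """Build a VACUUM statement whose retention covers both requirements."""
    # Files must outlive the oldest version anyone still needs to read.
    retention = max(time_travel_hours, max_reader_lag_hours)
    statement = f"VACUUM {table} RETAIN {retention} HOURS"
    return statement + " DRY RUN" if dry_run else statement


preview = vacuum_statement("prod.silver.orders", time_travel_hours=168, max_reader_lag_hours=48)
actual = vacuum_statement("prod.silver.orders", 168, 240, dry_run=False)
```

Defaulting to `DRY RUN` lists the files that would be deleted without removing anything, which is a cautious first step on a production table.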

Change Data Feed

ALTER TABLE prod.silver.orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

SELECT *
FROM table_changes('prod.silver.orders', 10);

Use CDF when | Do not use CDF when
Downstream jobs need incremental changes | Full refresh is simpler and small enough
You need inserts, updates, and deletes from Delta | Source is not Delta or CDF is not enabled
You want efficient propagation to gold tables | Consumers cannot keep up with retention expectations
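
A downstream consumer of the change feed typically applies change rows in commit order and ignores pre-images. The sketch below models that in plain Python and is not a Databricks API: `_change_type` and `_commit_version` are the metadata columns the change feed adds, while `apply_changes`, `order_id`, and `amount` are illustrative assumptions.

```python
from typing import Any

Row = dict[str, Any]


def apply_changes(target: dict[Any, Row], changes: list[Row], key: str) -> int:
    """Apply CDF-style change rows to a keyed target; return the last version applied."""
    last_version = -1
    for change in sorted(changes, key=lambda c: c["_commit_version"]):
        kind = change["_change_type"]
        # Strip metadata columns (leading underscore) to recover the data row.
        data = {k: v for k, v in change.items() if not k.startswith("_")}
        if kind in ("insert", "update_postimage"):
            target[data[key]] = data
        elif kind == "delete":
            target.pop(data[key], None)
        # "update_preimage" rows describe the old value and are skipped here.
        last_version = max(last_version, change["_commit_version"])
    return last_version


gold = {1: {"order_id": 1, "amount": 10}}
changes = [
    {"order_id": 1, "amount": 10, "_change_type": "update_preimage", "_commit_version": 11},
    {"order_id": 1, "amount": 15, "_change_type": "update_postimage", "_commit_version": 11},
    {"order_id": 2, "amount": 7, "_change_type": "insert", "_commit_version": 10},
    {"order_id": 2, "amount": 7, "_change_type": "delete", "_commit_version": 12},
]
next_start_version = apply_changes(gold, changes, "order_id") + 1
```

Persisting `next_start_version` between runs is what makes the next read incremental rather than a full refresh.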

Schema, Tables, and Storage Decisions

Managed vs External Tables

Table type | Storage controlled by | Drop behavior | Best use
Managed table | Databricks / Unity Catalog managed storage | Dropping table removes managed data | Standard curated lakehouse tables
External table | External cloud storage location | Dropping table removes metadata only | Data shared with other systems or lifecycle managed externally

Schema Enforcement and Evolution

Requirement | Option | Exam note
Reject unexpected schema | Schema enforcement | Good for trusted silver/gold layers
Allow new columns during ingestion | Schema evolution | Common for bronze Auto Loader
Overwrite schema intentionally | overwriteSchema patterns | Use carefully; can break consumers
Merge with new columns | mergeSchema / controlled evolution | Validate before using in curated layers
Store rescued data | Rescue column patterns | Useful when raw records may contain unexpected fields

Partitioning, Clustering, and File Layout

Technique | Use when | Avoid
Partitioning | Large tables frequently filtered by low/moderate-cardinality columns | High-cardinality partitions that create many tiny files
OPTIMIZE | Table has many small files | Running constantly without need
ZORDER | Queries filter on selective columns | Too many columns or columns rarely used in filters
Liquid clustering | Supported environment and evolving query patterns | Combining blindly with older partition/ZORDER assumptions
Auto compaction / optimized writes | Need better file sizing with less manual work | Assuming they fix bad table design

Lakeflow Declarative Pipelines / Delta Live Tables

Microsoft and Databricks materials may reference Delta Live Tables (DLT) and newer Lakeflow Declarative Pipelines terminology. For exam purposes, focus on the concepts: declarative tables, dependencies, streaming tables, data quality expectations, pipeline monitoring, and managed execution.

Concept | Meaning | Exam use
Pipeline | Managed set of table definitions and flows | Use for reliable ETL with dependency management
Live table / materialized table | Declarative table created from query logic | Good for batch transformations
Streaming live table | Table fed by streaming input | Good for incremental file/event ingestion
Expectation | Data quality rule | Warn, drop invalid rows, or fail pipeline
Event log | Pipeline operational log | Troubleshooting and audit of pipeline runs
Development mode | Faster iteration | Not the same as production reliability settings
Triggered pipeline | Runs then stops | Batch-like scheduled processing
Continuous pipeline | Keeps processing | Streaming-style processing

Example DLT-style SQL:

CREATE OR REFRESH STREAMING LIVE TABLE bronze_orders
AS
SELECT *
FROM cloud_files(
  "abfss://landing@storageacct.dfs.core.windows.net/orders/",
  "json"
);

CREATE OR REFRESH STREAMING LIVE TABLE silver_orders
(
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_amount EXPECT (amount >= 0) ON VIOLATION DROP ROW
)
AS
SELECT
  order_id,
  customer_id,
  CAST(amount AS DECIMAL(18,2)) AS amount,
  CAST(order_timestamp AS TIMESTAMP) AS order_timestamp
FROM STREAM(LIVE.bronze_orders);

Expectation actions:

Action | Result | Use when
Warn | Records are kept; metric is recorded | You need visibility without blocking
Drop | Invalid records are removed | Bad rows should not enter curated data
Fail | Pipeline stops | Data quality failure should block publication
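
The three actions above differ only in what happens to violating rows. The sketch below models that difference in plain Python; it is not the pipeline engine, and `apply_expectation`, `ExpectationFailed`, and the sample rows are hypothetical names chosen for illustration.

```python
from typing import Any, Callable

Row = dict[str, Any]


class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation sees at least one violating row."""


def apply_expectation(rows: list[Row], name: str,
                      predicate: Callable[[Row], bool], action: str) -> tuple[list[Row], int]:
    """Return (kept_rows, violation_count) under 'warn', 'drop', or 'fail'."""
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action}")
    violations = [row for row in rows if not predicate(row)]
    if action == "fail" and violations:
        raise ExpectationFailed(f"{name}: {len(violations)} violating row(s)")
    if action == "drop":
        kept = [row for row in rows if predicate(row)]
    else:
        kept = list(rows)  # warn keeps everything; only the metric changes
    return kept, len(violations)


rows = [{"order_id": 1, "amount": 5}, {"order_id": None, "amount": 3}]
has_id = lambda row: row["order_id"] is not None

warned, warn_count = apply_expectation(rows, "valid_order_id", has_id, "warn")
dropped, drop_count = apply_expectation(rows, "valid_order_id", has_id, "drop")
```

Note that the violation count is recorded in every mode, which is why a warn expectation still gives visibility in pipeline metrics.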

DLT / Lakeflow traps:

Trap | Correct thinking
Treating pipeline tables as manually updated tables | They are managed by pipeline definitions
Hiding quality rules in notebook code only | Use expectations when pipeline-level metrics matter
Ignoring event logs | Event logs are key for troubleshooting
Using continuous mode for everything | Triggered mode is often enough for scheduled batch-style ingestion
Mixing dev and prod assumptions | Production pipelines need stable configuration, permissions, and monitoring

Unity Catalog and Governance

Namespace and Securable Hierarchy

Level | Example | Notes
Metastore | main metastore | Assigned to one or more workspaces
Catalog | prod | Top-level data namespace
Schema | prod.silver | Groups tables, views, functions, volumes
Object | prod.silver.orders | Table, view, volume, model, function, etc.

Use three-level names for clarity:

SELECT *
FROM prod.silver.orders;

Common Unity Catalog Privileges

Privilege | Grants ability to
USE CATALOG | Access objects inside a catalog, subject to lower-level grants
USE SCHEMA | Access objects inside a schema, subject to object grants
SELECT | Read table or view data
MODIFY | Insert, update, delete, merge, or otherwise change table data
CREATE TABLE | Create tables in a schema
CREATE VOLUME | Create volumes in a schema
READ VOLUME | Read files in a volume
WRITE VOLUME | Write files in a volume
EXECUTE | Run functions or models where applicable
MANAGE | Manage privileges or object settings, depending on object type

Basic grant pattern:

GRANT USE CATALOG ON CATALOG prod TO `data-engineers`;
GRANT USE SCHEMA ON SCHEMA prod.silver TO `data-engineers`;
GRANT SELECT ON TABLE prod.silver.orders TO `analysts`;
GRANT SELECT, MODIFY ON TABLE prod.silver.orders TO `etl-service-principal`;

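High-yield permission rule: an object-level grant such as SELECT is not enough if the principal lacks USE CATALOG and USE SCHEMA on the parent catalog and schema. In the pattern above, `analysts` and `etl-service-principal` would also need those two grants before their table privileges take effect.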
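
Because the three grants always travel together, some teams generate them from the three-level table name. The helper below is a hypothetical sketch that only formats SQL text in the same style as the grant pattern above; `grants_for_table` is an assumed name, and it does no identifier validation or escaping.

```python
def grants_for_table(table: str, principal: str, privileges: list[str]) -> list[str]:
    """Return the catalog, schema, and table grants a principal needs."""
    catalog, schema, _name = table.split(".")  # expects catalog.schema.table
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
        f"GRANT {', '.join(privileges)} ON TABLE {table} TO `{principal}`;",
    ]


analyst_grants = grants_for_table("prod.silver.orders", "analysts", ["SELECT"])
etl_grants = grants_for_table("prod.silver.orders", "etl-service-principal", ["SELECT", "MODIFY"])
```

Granting to groups rather than individual users keeps the generated statements stable as people join or leave a team.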

External Locations and Volumes

Object | Purpose | Exam distinction
Storage credential | Authenticates to cloud storage | Do not expose raw keys in notebooks
External location | Governs access to a cloud path | Grants READ FILES / WRITE FILES
External table | Table metadata over data in external storage | For structured tabular data
Volume | Governed access to files | For non-tabular files, libraries, ML files, landing files
Legacy mount | Workspace-level storage mount | Avoid as a UC-first governance design

Example external location pattern:

CREATE EXTERNAL LOCATION raw_orders_location
URL 'abfss://landing@storageacct.dfs.core.windows.net/orders/'
WITH (STORAGE CREDENTIAL adls_storage_credential);

GRANT READ FILES ON EXTERNAL LOCATION raw_orders_location TO `ingestion-engineers`;

Security Decision Table

Requirement | Prefer | Avoid
Govern table access across workspaces | Unity Catalog grants | Workspace-local ACL-only design
Secure ADLS Gen2 access | Managed identity/access connector with UC storage credential | Hard-coded account keys in notebooks
Restrict rows by user/group | Row filters or dynamic views | Duplicating many physical tables
Restrict sensitive columns | Column masks or secure views | Giving broad table access then relying on consumers
Store passwords/API keys | Secret scopes or managed identities | Plain text in notebooks, jobs, or repos
Production service identity | Service principal or managed identity pattern | Personal user identity for scheduled jobs
Govern non-tabular files | Volumes and external locations | Ungoverned DBFS or ad hoc mounts

Unity Catalog Traps

Symptom | Likely cause
User can see table name but query fails | Missing SELECT, USE SCHEMA, or USE CATALOG
Job works for developer but fails in production | Job identity lacks UC or storage permissions
External table cannot read files | External location or storage credential issue
Notebook path works in one workspace only | Workspace-local mount or DBFS dependency
Data appears outside lineage/governance | Legacy metastore, unmanaged path, or direct cloud access bypassing UC
Drop table removed data unexpectedly | It was a managed table

Compute Selection

Compute type | Best for | Exam notes
All-purpose compute | Interactive notebooks, exploration, development | Flexible but not ideal as default production runtime
Job compute / job cluster | Scheduled production tasks | Ephemeral, repeatable, easier cost and dependency control
SQL warehouse | SQL queries, dashboards, BI, Databricks SQL | Not a general PySpark notebook cluster
Serverless compute where available | Reduced infrastructure management | Confirm workload and governance support in scenario
Cluster pool | Faster cluster startup | Useful when many similar clusters start frequently
Photon-enabled compute | SQL and Delta-heavy workloads | Often improves query performance for supported operations

Access Modes

Access mode terminology | Use case | Exam point
Standard / shared | Multiple users with governance controls | Common for UC-enabled collaborative workloads
Dedicated / single user | One user or assigned identity | Useful for isolation and certain workloads
No isolation shared / legacy | Older less-isolated mode | Avoid for modern governed UC workloads

Cluster Configuration Cues

Requirement | Setting or feature
Enforce approved settings | Cluster policy
Control library versions | Job cluster config, init scripts only when needed, pinned dependencies
Minimize idle cost | Auto-termination for interactive clusters
Handle variable load | Autoscaling
Separate dev/test/prod | Separate workspaces, catalogs, schemas, or policies as appropriate
Improve repeatability | Jobs, parameters, source-controlled code

Jobs, Workflows, and Deployment

Feature | Use for
Task dependencies | Build DAG-style workflows
Job parameters | Avoid hard-coded environment paths and dates
Retries | Handle transient failures
Run-if conditions | Control downstream behavior after success/failure
Job clusters | Isolated production compute per job or task
Shared job cluster | Reuse compute among tasks in same job when appropriate
Schedule trigger | Time-based orchestration
File arrival trigger | Start when new data arrives, where supported
Continuous trigger | Always-on processing pattern
Alerts/notifications | Operational awareness
Git integration / source control | Version notebooks, code, SQL, pipeline definitions
Databricks Asset Bundles or deployment tooling | Promote repeatable assets across environments

Production readiness checklist:

  • Use service principals or managed identities for scheduled workloads.
  • Parameterize catalog, schema, storage path, and processing date.
  • Separate development and production data namespaces.
  • Store secrets outside code.
  • Use job clusters or governed compute policies.
  • Define retry behavior and failure notifications.
  • Log row counts, rejected records, and important pipeline metrics.
  • Avoid relying on an interactive user’s cluster, credentials, or notebook state.
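
The parameterization item in the checklist above can be sketched for a Python script task that receives its parameters as command-line arguments. This is one possible shape, not a prescribed layout: `RunConfig`, `parse_args`, and the flag names are assumptions, and a notebook task would more likely read the same values from widgets.

```python
import argparse
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RunConfig:
    catalog: str
    schema: str
    processing_date: date

    def table(self, name: str) -> str:
        """Build a three-level name so no environment is hard-coded."""
        return f"{self.catalog}.{self.schema}.{name}"


def parse_args(argv: list[str]) -> RunConfig:
    parser = argparse.ArgumentParser(description="Parameterized job entry point (sketch)")
    parser.add_argument("--catalog", required=True)
    parser.add_argument("--schema", required=True)
    parser.add_argument("--processing-date", required=True, help="ISO date, e.g. 2024-01-31")
    args = parser.parse_args(argv)
    # date.fromisoformat raises ValueError on a malformed date, failing the run early.
    return RunConfig(args.catalog, args.schema, date.fromisoformat(args.processing_date))


config = parse_args(["--catalog", "dev", "--schema", "silver", "--processing-date", "2024-01-31"])
target_table = config.table("orders")
```

Switching the same code from development to production then only requires different job parameters, not a code change.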

Performance and Optimization Reference

Table and Query Tuning

Symptom | Likely cause | Corrective action
Query scans too much data | Poor filters, no data skipping, bad layout | Partition appropriately, ZORDER/liquid clustering, collect stats where relevant
Many small files | Frequent small writes or streaming micro-batches | OPTIMIZE, optimized writes, compaction strategy
Slow joins | Shuffle-heavy join, skew, missing broadcast opportunity | Broadcast small dimension, repartition, handle skew, use AQE
OOM during transformation | Large shuffle or wide operation | Reduce data earlier, repartition carefully, avoid collecting to driver
Slow Python UDFs | Row-by-row Python execution | Prefer built-in Spark SQL functions or vectorized patterns
Streaming state grows | No watermark or broad aggregation key | Add watermark, reduce state, tune late-data assumptions
Dashboard slow | Gold table not serving-shaped | Pre-aggregate, use SQL warehouse, optimize table layout
Repeated full recomputation | No incremental design | Use CDF, Auto Loader, MERGE, or pipeline incremental patterns

Spark Execution Concepts

Concept | Why it matters
Lazy evaluation | Transformations run only when an action executes
Narrow transformation | No shuffle; generally cheaper
Wide transformation | Requires shuffle; often expensive
Shuffle | Data redistributed across partitions
Partition | Unit of parallel data processing
Driver | Coordinates Spark application
Executor | Runs tasks and stores cached partitions
Cache/persist | Useful for reused intermediate data; can become stale or consume memory
Adaptive Query Execution | Runtime query optimization for supported workloads
Broadcast join | Sends small table to workers to avoid large shuffle

Optimization Traps

Trap | Better answer
“Partition by every filter column” | Partition only when cardinality and access patterns justify it
“Cache everything” | Cache reused data only; unpersist when done
“Use bigger clusters first” | Fix data layout, shuffles, and query logic before scaling blindly
“VACUUM aggressively” | Respect time travel, rollback, and streaming readers
“One huge notebook does everything” | Break into tasks/pipelines with observable boundaries
“Use UDFs for simple expressions” | Use native Spark functions for optimizer support

Data Quality and Reliability

Requirement | Feature or pattern
Reject invalid rows in pipeline | Expectations with drop/fail action
Track invalid rows for review | Quarantine table pattern
Enforce non-null or valid values | Constraints or expectations
Deduplicate events | Window functions, watermark-based streaming deduplication
Handle late data | Watermarks and business-defined lateness rules
Idempotent reruns | MERGE by key, deterministic outputs, checkpoint discipline
Audit load metadata | Add source file, ingestion timestamp, batch ID
Backfill historical data | Parameterized jobs, controlled overwrite/merge, separate checkpoint strategy

Useful metadata columns:

Column | Purpose
ingestion_timestamp | When the platform ingested the row
source_file | File lineage and troubleshooting
batch_id | Rerun and reconciliation
record_hash | Change detection or deduplication
is_quarantined / error fields | Data quality review

CDC and Slowly Changing Dimensions

Pattern | Use when | Core logic
Type 1 SCD | Keep only latest value | MERGE update overwrites existing row
Type 2 SCD | Preserve history | Expire current row, insert new current row
Delete propagation | Source emits deletes | MERGE with WHEN MATCHED ... DELETE or expire record
CDF downstream | Delta source changes should feed another table | Read changes since last version
Pipeline CDC helpers | Declarative CDC in Lakeflow/DLT scenarios | Use when exam scenario emphasizes managed CDC pipeline

SCD Type 2 cues:

Field | Purpose
Business key | Identifies entity, such as customer_id
Surrogate key | Unique dimension row key
Effective start/end | Validity range
Current flag | Identifies active record
Hash diff | Detects attribute changes
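
The fields above fit together in the "expire current row, insert new current row" logic. Below is a minimal plain-Python sketch of that logic over an in-memory list; it is not a MERGE implementation. The column names, the `apply_scd2` and `hash_diff` helpers, and the sequential surrogate key are all illustrative assumptions.

```python
import hashlib
from typing import Any

Row = dict[str, Any]


def hash_diff(attrs: dict[str, Any]) -> str:
    """Stable hash of the tracked attributes, used to detect changes."""
    payload = "|".join(f"{name}={attrs[name]}" for name in sorted(attrs))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def apply_scd2(dimension: list[Row], business_key: Any,
               attrs: dict[str, Any], effective_date: str) -> None:
    """Expire the current row if attributes changed, then insert a new current row."""
    new_hash = hash_diff(attrs)
    current = next((row for row in dimension
                    if row["business_key"] == business_key and row["is_current"]), None)
    if current is not None:
        if current["hash_diff"] == new_hash:
            return  # nothing changed: keep history as it is
        current["effective_end"] = effective_date
        current["is_current"] = False
    dimension.append({
        "surrogate_key": len(dimension) + 1,  # simplistic key for the sketch
        "business_key": business_key,
        **attrs,
        "hash_diff": new_hash,
        "effective_start": effective_date,
        "effective_end": None,
        "is_current": True,
    })


customers: list[Row] = []
apply_scd2(customers, 42, {"city": "Oslo"}, "2024-01-01")
apply_scd2(customers, 42, {"city": "Oslo"}, "2024-03-01")    # unchanged: no new row
apply_scd2(customers, 42, {"city": "Bergen"}, "2024-06-01")  # changed: expire and insert
```

Comparing a single hash instead of every attribute keeps the change check cheap when a dimension has many tracked columns.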

Monitoring and Troubleshooting

Where to Look

Problem | First places to inspect
Job failed | Jobs run output, task logs, cluster logs
Spark query slow | Spark UI, SQL query profile, stages, shuffle metrics
SQL dashboard slow | Query history, warehouse size/state, table layout
Pipeline failed | Pipeline event log, expectation metrics, failed flow/table
Stream stalled | Streaming query progress, checkpoint path, source backlog
Permission denied | Unity Catalog grants, external location grants, cloud IAM
Data missing | Ingestion file discovery, filters, expectations, checkpoints
Duplicates | Checkpoint reset, non-idempotent writes, source replay, missing dedup key
Schema error | Auto Loader schema location, rescued data, evolution settings

Common Error Patterns

Error pattern | Likely cause | Fix direction
Cannot access abfss://... | Storage credential, external location, or cloud permission missing | Validate UC external location and identity
Table not found | Wrong catalog/schema or current context | Use three-level names
Permission denied on table | Missing grants | Grant USE CATALOG, USE SCHEMA, and object privilege
Stream already active or checkpoint conflict | Reused checkpoint/query | Use distinct checkpoint and stop old query
Schema mismatch | Source changed | Apply controlled schema evolution or rescue/quarantine
Slow stage with high shuffle | Skew or wide transformation | Repartition, broadcast, filter earlier, handle skew
Driver out of memory | Collected too much data to driver | Avoid .collect() for large data; write distributed outputs
Unexpected old results | Cached data or stale table/view | Refresh/unpersist or rerun after invalidating cache

Triage Order

  1. Identify whether the failure is permissions, code, data, compute, or orchestration.
  2. Check the job or pipeline run output.
  3. Inspect table names, catalog/schema context, and identity used by the run.
  4. Validate source paths, external locations, and grants.
  5. Review schema changes and data quality expectation failures.
  6. Use Spark UI or SQL profile for performance bottlenecks.
  7. Fix idempotency before rerunning failed writes.

Azure-Specific Integration Points

Azure component | Azure Databricks role | Exam note
ADLS Gen2 | Primary lakehouse storage | Use abfss:// paths and governed access
Microsoft Entra ID | Users, groups, service principals | Prefer group-based access management
Managed identities / access connectors | Secure Azure resource access | Avoid embedded credentials
Azure Key Vault | Secret backing for secret scopes | Useful for external secrets, but not a substitute for UC governance
Azure Data Factory / Synapse pipelines | External orchestration | Can trigger Databricks jobs/notebooks
Event Hubs | Streaming source | Often used with Structured Streaming
Microsoft Purview | Broader data catalog/governance integration | Unity Catalog handles Databricks-native governance and lineage

SQL and PySpark Syntax Mini-Reference

Task | SQL / PySpark cue
Set catalog | USE CATALOG prod;
Set schema | USE SCHEMA silver;
Create table as select | CREATE TABLE prod.gold.sales AS SELECT ...
Insert data | INSERT INTO prod.silver.orders SELECT ...
Overwrite table | CREATE OR REPLACE TABLE ... AS SELECT ...
Merge updates | MERGE INTO ... USING ... ON ...
Inspect history | DESCRIBE HISTORY catalog.schema.table
Optimize table | OPTIMIZE catalog.schema.table
Read table in PySpark | spark.table("prod.silver.orders")
Write table in PySpark | df.write.mode("append").saveAsTable("prod.silver.orders")
Streaming read table | spark.readStream.table("prod.bronze.orders")
Streaming write table | df.writeStream.toTable("prod.silver.orders")

High-Yield Exam Traps Checklist

  • Do not choose legacy mounts when the scenario emphasizes Unity Catalog governance.
  • Do not grant SELECT only and forget USE CATALOG and USE SCHEMA.
  • Do not use an all-purpose interactive cluster as the default production answer.
  • Do not reset or share streaming checkpoints without understanding replay and duplicates.
  • Do not use direct file reads when the requirement is incremental file discovery.
  • Do not use COPY INTO for continuous event streams.
  • Do not use MERGE without a stable key and deduplicated source.
  • Do not overpartition high-cardinality columns.
  • Do not run VACUUM casually when rollback, time travel, or lagging streams matter.
  • Do not store production data in DBFS root as a governance strategy.
  • Do not assume a notebook user’s permissions are the same as the job’s service identity.
  • Do not hide all data quality logic downstream in BI; validate in silver/pipeline layers.
  • Do not choose Python UDFs for simple transformations that Spark SQL functions can handle.
  • Do not ignore pipeline event logs, job task logs, Spark UI, and SQL query profiles during troubleshooting.

Last-Mile Practice Plan

Practice DP-750 scenarios by forcing yourself to choose: ingestion pattern, Delta table design, Unity Catalog permissions, compute type, orchestration method, and troubleshooting path. Then implement small end-to-end exercises: Auto Loader to bronze, MERGE to silver, aggregate to gold, secure with Unity Catalog grants, schedule as a job, and diagnose one intentional failure.
