DP-750 — Microsoft Certified: Azure Databricks Data Engineer Associate Exam Blueprint

Practical DP-750 exam blueprint for Microsoft Azure Databricks data engineer candidates: Delta Lake, Spark, ingestion, governance, orchestration, monitoring, and production readiness.

How to Use This Exam Blueprint

Use this independent Exam Blueprint as a practical study map for Microsoft DP-750, the exam for Microsoft Certified: Azure Databricks Data Engineer Associate (DP-750). It is designed for final review and gap-finding, not as a replacement for hands-on Azure Databricks practice.

For each topic area, ask:

  • Can I explain the concept without notes?
  • Can I choose the right Azure Databricks feature for a scenario?
  • Can I recognize a bad design and correct it?
  • Can I troubleshoot a failed pipeline, slow query, permission error, or data-quality issue?
  • Can I connect Azure Databricks concepts with Azure storage, security, monitoring, and production operations?

If an area feels familiar but you cannot apply it in a scenario, mark it for more practice.

Readiness areas at a glance

Readiness areaWhat to be ready forYou are ready when you can…Common weak signal
Azure Databricks architectureWorkspaces, compute, notebooks, jobs, SQL warehouses, storage integrationExplain how data engineers use Azure Databricks in a lakehouse workflowYou know notebooks but not production deployment patterns
Lakehouse designBronze, silver, gold layers; managed and external tables; medallion architectureDesign a layered data flow from raw ingestion to curated analytics tablesYou treat all tables as the same type of storage object
Delta LakeDelta tables, ACID transactions, schema enforcement, time travel, merge, optimize, vacuumPick the right Delta operation for incremental updates, corrections, retention, and performanceYou memorize commands but cannot explain side effects
Spark and transformationsSpark SQL, PySpark, DataFrames, joins, aggregations, windowing, UDF tradeoffsRead or write transformation logic and reason about distributed executionYou write SQL but cannot diagnose shuffle or join problems
Data ingestionBatch files, streaming, Auto Loader-style patterns, incremental loads, checkpointsChoose ingestion patterns for new files, event data, schema drift, and replayYou forget checkpointing, idempotency, or schema evolution
OrchestrationJobs, tasks, dependencies, parameters, retries, schedules, pipeline promotionBuild and operate multi-step pipelines with failure handlingYou rely only on manual notebook execution
Governance and securityUnity Catalog concepts, access control, storage credentials, secrets, lineageDistinguish workspace access, data permissions, and cloud storage permissionsYou assume Azure role assignment alone grants table access
Performance tuningFile size, partitioning, clustering, query plans, caching, compaction, statisticsIdentify the likely cause of slow reads, joins, streaming, or writesYou try random tuning instead of reading symptoms
Monitoring and troubleshootingJob runs, logs, metrics, Spark UI, query history, data-quality alertsTrace failures from symptoms to root cause and propose corrective actionYou only check the final error message
DevOps and productionRepos, source control, deployment, environment separation, configuration managementMove logic safely from development to production with repeatable artifactsYou hard-code paths, secrets, or environment names
Cost and operational controlCompute sizing, job clusters, SQL warehouses, autoscaling, cluster policiesBalance performance, reliability, and cost in a scenarioYou always choose larger compute without justification

Core DP-750 skill checklist

Azure Databricks platform foundations

Be ready to show that you understand how Azure Databricks is used as a data engineering platform, not just as a notebook tool.

  • Explain the role of an Azure Databricks workspace.
  • Distinguish interactive development from scheduled production execution.
  • Compare all-purpose compute, job compute, and SQL warehouse-style query execution at a practical level.
  • Identify when notebooks, jobs, workflows, and SQL are appropriate.
  • Explain how Azure Databricks interacts with cloud storage such as Azure Data Lake Storage.
  • Recognize why compute configuration affects performance, cost, isolation, and reliability.
  • Understand the difference between workspace-level objects and governed data objects.
  • Explain how libraries, runtime versions, policies, and environment configuration can affect pipeline behavior.
  • Identify where secrets, credentials, and service identities should be handled rather than hard-coded.

Can you do this?

PromptReady answer should include
A team manually runs a notebook every morning. What should change for production?Schedule or workflow, task dependencies, parameters, retry policy, monitoring, source control, controlled compute
A query runs well in development but fails in production. What do you check?Data volume, permissions, cluster/runtime differences, libraries, configs, paths, parameters, job logs
A user can open the workspace but cannot query a table. Why?Workspace access is not the same as catalog/schema/table permissions or storage access

Lakehouse design and data organization

DP-750 readiness requires more than knowing that Delta tables exist. You should be able to design a maintainable data layout.

  • Explain bronze, silver, and gold layers and what each layer typically stores.
  • Identify raw, cleansed, conformed, aggregated, and serving data responsibilities.
  • Choose managed or external tables based on ownership, lifecycle, governance, and storage needs.
  • Explain how catalogs, schemas, and tables organize data assets.
  • Use consistent naming, pathing, and environment separation.
  • Identify when to preserve raw data versus overwrite or transform in place.
  • Explain how incremental processing reduces cost and improves timeliness.
  • Recognize when a pipeline should be replayable from raw data.
  • Plan for schema drift, late-arriving data, duplicates, and bad records.
  • Connect data modeling choices to downstream analytics and reporting needs.
Design choiceUse when…Watch for…
Bronze/raw tableYou need durable landing data, replay, audit, or troubleshootingDo not over-clean raw data and lose original evidence
Silver/cleansed tableYou need validated, deduplicated, conformed recordsDefine data-quality rules and key business logic clearly
Gold/serving tableYou need aggregated or business-ready data for consumersAvoid embedding one-off report logic that is hard to maintain
Managed tablePlatform should manage table metadata and storage lifecycleUnderstand deletion and ownership implications
External tableData location/lifecycle is managed outside the table definitionEnsure permissions and storage credentials are correct

Delta Lake table operations

Know the purpose and risks of Delta operations. Scenario questions often test judgment: when to merge, when to overwrite, when to optimize, and when not to vacuum.

  • Explain ACID transaction benefits for concurrent reads and writes.
  • Use Delta tables for reliable batch and streaming workloads.
  • Explain schema enforcement and schema evolution.
  • Use MERGE logic for upserts, slowly changing records, or deduplication scenarios.
  • Understand append, overwrite, update, delete, and merge semantics.
  • Use table history to investigate changes.
  • Explain time travel at a practical level.
  • Understand the purpose of compaction and optimization.
  • Explain why small files can degrade performance.
  • Understand the purpose and risk of vacuuming old files.
  • Recognize when partitioning, clustering, data skipping, or statistics can improve query performance.
  • Avoid over-partitioning high-cardinality columns.
  • Understand constraints and expectations as data-quality controls where applicable.

Delta operation readiness table

Operation or conceptKnow how to use it for…Exam-style trap
MERGEUpserts, deduplication, CDC-style processingMissing match keys or creating duplicate target rows
DESCRIBE HISTORYAuditing table changes and troubleshootingChecking only the current table state
Time travelRecovering or comparing prior table versionsAssuming old data remains available after cleanup
OPTIMIZE or compactionReducing small-file overheadOptimizing blindly without understanding workload
VACUUMCleaning obsolete filesBreaking historical reads or recovery expectations
Schema evolutionHandling new columns safelyAllowing uncontrolled schema drift into curated layers
Partitioning/clusteringReducing scan volume for common filtersPartitioning by unique IDs or very high-cardinality fields

Data ingestion and incremental processing

Be ready to choose an ingestion pattern from scenario details: source type, latency, file arrival pattern, schema drift, volume, idempotency, and failure recovery.

  • Distinguish batch ingestion from streaming ingestion.
  • Explain when incremental ingestion is preferable to full reloads.
  • Use checkpointing concepts for streaming or incremental file processing.
  • Identify why idempotent writes matter when jobs retry.
  • Handle duplicate files, duplicate records, and late-arriving data.
  • Explain schema inference, schema location, and controlled schema evolution.
  • Choose raw ingestion into bronze before cleansing when replay and audit matter.
  • Know how bad records can be captured, quarantined, or rejected.
  • Explain when event streaming is appropriate versus scheduled file ingestion.
  • Recognize when watermarks, windows, and state management are relevant.
  • Understand how secrets and credentials should be used for source connections.
  • Validate source-to-target row counts and quality checks after ingestion.

Ingestion decision cues

Scenario cueLikely design direction
New files arrive continuously in cloud storageIncremental file ingestion with checkpointing
Source sends events with low-latency needsStreaming ingestion with state and failure recovery design
Source sends a daily full extractBatch load, compare to existing target, possibly merge or replace
Source schema changes unpredictablyBronze capture plus controlled schema evolution into silver
Source data has duplicatesDeduplicate using keys, timestamps, or business rules before serving
Pipeline may rerun after failureMake writes idempotent and checkpoint-aware
    flowchart TD
	    A[New data source] --> B{Latency requirement?}
	    B -->|Near real time| C[Streaming or continuous incremental ingestion]
	    B -->|Scheduled batch| D[Batch or trigger-based ingestion]
	    C --> E{Can records arrive late or out of order?}
	    E -->|Yes| F[Use event-time logic, watermarking, deduplication]
	    E -->|No| G[Use simpler append or merge pattern]
	    D --> H{Full reload or changed data only?}
	    H -->|Changed data only| I[Incremental load with merge/upsert]
	    H -->|Full extract| J[Replace, compare, or merge based on target needs]
	    F --> K[Write bronze, validate, promote to silver/gold]
	    G --> K
	    I --> K
	    J --> K

Spark SQL, PySpark, and transformation logic

For DP-750, you should be comfortable reading both SQL-style and DataFrame-style logic. You do not need to turn every problem into code, but you should understand what the code is doing.

  • Write and interpret joins, filters, aggregations, and window functions.
  • Distinguish inner, left, right, full, semi, and anti join behavior.
  • Use window functions for ranking, deduplication, running totals, and latest-record selection.
  • Explain when dropDuplicates is insufficient without a deterministic rule.
  • Understand null handling in filters, joins, and aggregations.
  • Apply business rules consistently across batch and streaming paths.
  • Avoid unnecessary UDFs when built-in Spark functions can perform better.
  • Understand lazy evaluation at a practical level.
  • Recognize transformations that cause shuffles.
  • Explain when broadcast joins may help and when they may be unsafe.
  • Validate transformation outputs with row counts, constraints, and sample checks.
  • Understand how to parameterize paths, catalog names, dates, and environments.

Transformation checks

Can you do this?Why it matters
Select the latest record per key using a window functionCommon deduplication and CDC pattern
Explain why a join creates unexpected duplicatesTests data modeling and key understanding
Rewrite a Python loop as a Spark transformationTests distributed-processing judgment
Identify why null values disappear from a filterPrevents silent data loss
Explain why a UDF slows a pipelineTests Spark execution awareness

Orchestration, jobs, and pipeline operations

The exam identity is data engineering, so be ready for production pipeline thinking: dependencies, retries, monitoring, permissions, parameters, and promotion.

  • Create a mental model of a multi-task workflow.
  • Explain task dependencies and failure behavior.
  • Use parameters instead of hard-coded values.
  • Know why job compute is often preferable for repeatable scheduled workloads.
  • Configure retries and notifications based on failure impact.
  • Separate development, test, and production environments.
  • Track job run history and logs.
  • Understand how pipeline code is promoted through source control.
  • Identify when a task should fail fast versus quarantine bad data.
  • Explain how to rerun failed tasks without corrupting target data.
  • Recognize when a pipeline should be split into independent tasks.
  • Understand the role of declarative or managed pipeline features where used in Azure Databricks.
Pipeline concernReadiness question
DependenciesCan you explain what should happen if an upstream bronze load fails?
ParametersCan the same notebook run for dev/test/prod without editing code?
RetriesWill retrying create duplicate records or inconsistent state?
NotificationsWho should be alerted, and for which failure severity?
ObservabilityWhere would you find the failed task, logs, and input parameters?
RecoveryCan you replay from raw data or checkpoints safely?

Governance, security, and Unity Catalog concepts

Security questions often test boundaries: workspace access, data access, storage access, identity, and secret handling are related but not identical.

  • Explain the purpose of centralized data governance in Azure Databricks.
  • Understand catalogs, schemas, tables, views, volumes, and other governed objects at a conceptual level.
  • Distinguish data-plane access from workspace UI access.
  • Explain why least privilege matters for pipelines and users.
  • Use groups or service identities rather than assigning broad permissions to individuals.
  • Recognize when a storage credential or external location pattern is needed.
  • Avoid hard-coding secrets, tokens, connection strings, or keys in notebooks.
  • Understand how views can be used to restrict or shape data access.
  • Recognize row-level, column-level, masking, or filtering patterns where applicable.
  • Explain lineage and audit value for production data pipelines.
  • Know how ownership affects the ability to manage objects.
  • Troubleshoot permission errors by checking both Databricks permissions and Azure resource access.

Security decision checks

ScenarioWhat to check
User can see a workspace but cannot query a tableCatalog/schema/table privileges, ownership, group membership
Job fails when reading cloud storageService identity, storage credential, external location, Azure storage permissions
Notebook contains a storage keyReplace with secret management or governed credential pattern
Analysts need limited fields onlyView, column masking, column selection, or separate serving table
Pipeline needs production write accessUse controlled service identity and least-privilege permissions

Performance tuning and optimization

Performance readiness means reading symptoms and choosing targeted fixes. Avoid “bigger cluster” as the default answer.

  • Explain how partition pruning, data skipping, clustering, and file organization affect reads.
  • Identify small-file problems and when compaction helps.
  • Recognize over-partitioning and skewed partitioning.
  • Use query plans or Spark UI clues to reason about slow jobs.
  • Understand shuffle-heavy operations such as joins, aggregations, and repartitions.
  • Explain skew and possible mitigations.
  • Know when caching can help repeated reads and when it wastes memory.
  • Compare scaling compute with optimizing data layout.
  • Identify when statistics or table maintenance can improve query planning.
  • Understand that streaming performance depends on source rate, trigger interval, state, checkpointing, and sink throughput.
  • Avoid collecting large datasets to the driver.
  • Avoid Python-side loops over large distributed datasets.
SymptomLikely causesBetter next step
Query scans too much dataPoor filters, no pruning, bad layoutReview predicates, layout, clustering/partitioning, statistics
Job has many tiny tasks/filesSmall-file problemCompact files and tune write strategy
Join is very slowShuffle, skew, missing filtering, wrong join typeInspect plan, reduce data, handle skew, consider broadcast where safe
Driver crashesCollecting too much data, oversized metadata, bad loopKeep processing distributed, reduce driver-side operations
Streaming backlog growsSource rate exceeds processing, slow sink, state growthCheck metrics, scale/tune, optimize sink, review state/watermark logic

Monitoring, troubleshooting, and reliability

Be ready to diagnose pipeline failures from available artifacts: job history, task logs, Spark UI, query history, table history, event logs, and data-quality results.

  • Locate job run failures and identify failed tasks.
  • Read error messages for permission, schema, path, memory, timeout, and dependency issues.
  • Use table history to investigate recent writes.
  • Compare expected and actual row counts.
  • Validate that checkpoints are not accidentally deleted or reused incorrectly.
  • Identify malformed records or schema drift as ingestion failure causes.
  • Explain how retries interact with idempotent writes.
  • Recognize when a failure is data-related versus infrastructure-related.
  • Use logging and metrics to support root-cause analysis.
  • Define alerts for failed jobs, late data, quality failures, and performance regressions.
  • Know how to communicate impact: failed load, stale table, partial refresh, or incorrect data.
Failure typeFirst things to inspect
Permission deniedUser/service identity, catalog privileges, storage access, secret scope
Schema mismatchSource schema, target schema, evolution settings, recent upstream change
File not foundPath parameter, environment, lifecycle policy, external location
Duplicate recordsRetry behavior, merge keys, checkpoint reuse, source duplicates
Slow jobSpark UI, query plan, file layout, skew, compute sizing
Stale dashboard dataJob schedule, last successful run, downstream dependency, refresh logic

DevOps, deployment, and lifecycle management

A data engineer associate should be able to reason about repeatable delivery, not just interactive exploration.

  • Use source control for notebooks, scripts, SQL, and configuration.
  • Separate code from environment-specific settings.
  • Promote artifacts through development, test, and production.
  • Use job definitions or deployment artifacts rather than manual recreation.
  • Parameterize catalog, schema, storage path, checkpoint path, and run date.
  • Manage dependencies and libraries consistently.
  • Use reviews and testing for transformation logic.
  • Validate data quality before publishing curated tables.
  • Protect production data from accidental overwrite.
  • Plan rollback or recovery using versioned code and table history where applicable.
  • Understand how automated deployment reduces configuration drift.

SQL and PySpark artifact checks

You do not need to memorize every syntax variation, but you should recognize the intent, risks, and expected result of common data engineering patterns.

Delta merge pattern

MERGE INTO catalog.silver.customers AS target
USING catalog.bronze.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *

Be ready to explain:

  • What column or business key determines a match.
  • Why the timestamp condition matters.
  • What happens if the source has multiple rows for the same key.
  • Whether the operation is idempotent if rerun.
  • How you would validate the target after the merge.

Incremental file ingestion pattern

(
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_path)
        .load(raw_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable("catalog.bronze.events")
)

Be ready to explain:

  • Why a schema location is used.
  • Why a checkpoint location is required.
  • What happens if the checkpoint is deleted.
  • Why the bronze target is useful before cleansing.
  • How schema drift should be controlled before promotion to silver.

Window-based deduplication pattern

CREATE OR REPLACE TABLE catalog.silver.latest_orders AS
SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM catalog.bronze.orders
)
WHERE rn = 1;

Be ready to explain:

  • Why ROW_NUMBER provides deterministic latest-record selection only if the ordering is sufficient.
  • What additional tie-breaker may be needed.
  • Whether this is a full rebuild or should be adapted for incremental processing.
  • How late-arriving records could affect the result.

Table investigation commands

ArtifactWhat it helps you determine
DESCRIBE HISTORY table_nameRecent writes, operations, users/jobs, operation metrics
Query historySlow SQL statements, duration, user, warehouse context
Job run logsTask failure details, parameters, cluster/runtime information
Spark UIStages, tasks, shuffles, skew, spill, executor behavior
Table metadataLocation, schema, provider, properties, partition columns
Lineage viewUpstream and downstream dependencies where available

Scenario and decision-point checks

Use these prompts to test whether you can choose the best option under constraints.

ScenarioStrong answer should considerWeak answer
New JSON files land throughout the day and must be processed incrementallyIncremental ingestion, schema tracking, checkpointing, bronze table, retriesReread the whole folder every hour
A table has slow queries filtering by date and regionFile layout, partitioning/clustering, pruning, stats, query predicatesIncrease compute without checking scan pattern
A scheduled job sometimes creates duplicate recordsIdempotency, merge keys, checkpointing, source duplicates, retry behaviorDisable retries only
Analysts need access to curated sales data but not raw PIICatalog permissions, views/masking/filtering, gold tables, least privilegeGive broad access to the storage account
A schema change in the source breaks the silver pipelineCapture schema in bronze, controlled evolution, validation, alertingAutomatically allow all schema changes into gold
Pipeline fails after a cluster/runtime changeLibrary versions, runtime compatibility, configuration drift, test environmentAssume the data source is wrong
Streaming query falls behind during peak loadSource rate, micro-batch duration, state, sink performance, compute sizingRestart repeatedly without checking metrics
Data must be reproducible for auditRaw retention, table history, versioned code, controlled writes, lineageOverwrite raw data to save space
Developers need separate test runsParameterized catalogs/schemas/paths, dev/test/prod separationEdit production notebook paths manually
A service account can write to storage but cannot create a tableDatabricks object privileges, catalog/schema permissions, ownershipCheck only Azure RBAC

Common weak areas and traps

TrapWhy it hurts readinessWhat to review
Confusing workspace permissions with data permissionsUsers may access the workspace but not governed dataUnity Catalog-style privilege model and object hierarchy
Treating Delta as just Parquet filesMisses transactions, history, schema enforcement, and maintenanceDelta table operations and transaction log concepts
Forgetting checkpointsStreaming and incremental jobs may reread or duplicate dataCheckpoint design and replay behavior
Using overwrite when merge is requiredCan destroy history or cause downstream inconsistenciesIncremental load patterns and idempotent writes
Vacuuming without understanding retention needsCan remove files needed for history or recoveryTable maintenance and time-travel implications
Over-partitioningCreates too many small files and slow metadata operationsPartition strategy and clustering alternatives
Caching everythingWastes memory and may hide stale-data issuesCache only repeated expensive reads with clear lifecycle
Relying on driver-side logicBreaks distributed processing and can crash at scaleSpark transformations and avoiding collect misuse
Ignoring null semanticsJoins and filters may silently drop or misclassify dataSQL null behavior and data-quality tests
Hard-coding secrets or pathsCreates security and deployment riskSecrets, parameters, and environment configuration
No data-quality gate before goldBad data reaches consumersExpectations, constraints, validation, quarantine patterns
No recovery planFailures become manual rebuildsReplayable raw data, idempotent writes, versioned code

Final-week DP-750 review checklist

TimingFocusWhat to complete
7 days outBaseline readinessMark every checklist item as confident, partial, or weak
6 days outDelta and lakehousePractice merge, history, schema evolution, table layout, optimization scenarios
5 days outIngestion and streamingReview checkpoints, incremental files, schema drift, deduplication, late data
4 days outGovernance and securityDrill catalog permissions, storage access, secrets, service identities, least privilege
3 days outPerformance and troubleshootingPractice symptoms: slow joins, small files, skew, failed jobs, permission errors
2 days outOrchestration and productionReview jobs, tasks, dependencies, retries, parameters, monitoring, deployment
1 day outMixed scenariosAnswer “what would you choose and why?” prompts without notes
Exam dayLight reviewRevisit traps, decision tables, and weak areas only

Quick self-assessment scorecard

Rate each area from 1 to 3:

  • 1 = I recognize terms but cannot apply them
  • 2 = I can solve basic questions but struggle with scenarios
  • 3 = I can explain tradeoffs and troubleshoot
AreaScore
Azure Databricks workspace, compute, and jobs
Lakehouse architecture and medallion design
Delta Lake operations and table maintenance
Batch ingestion and incremental processing
Streaming concepts, checkpointing, and late data
Spark SQL, PySpark, joins, windows, and aggregations
Data quality, schema drift, and validation
Unity Catalog, permissions, secrets, and governance
Performance tuning and query troubleshooting
Monitoring, reliability, and production operations
DevOps, deployment, and environment separation

If any area is a 1, review it before taking full-length practice. If an area is a 2, focus on scenario questions that force you to choose between two plausible designs.

Practical next step

Turn this Exam Blueprint into a study tracker. For each missed practice question, tag it to one of the readiness areas above, write the reason you missed it, and practice a similar Azure Databricks scenario until you can explain the correct choice, the rejected alternatives, and the production tradeoff.

Browse Certification Practice Tests by Exam Family