DP-750 — Microsoft Certified: Azure Databricks Data Engineer Associate Exam Blueprint
Practical DP-750 exam blueprint for Microsoft Azure Databricks data engineer candidates: Delta Lake, Spark, ingestion, governance, orchestration, monitoring, and production readiness.
How to Use This Exam Blueprint
Use this independent Exam Blueprint as a practical study map for Microsoft DP-750, the exam for Microsoft Certified: Azure Databricks Data Engineer Associate (DP-750). It is designed for final review and gap-finding, not as a replacement for hands-on Azure Databricks practice.
For each topic area, ask:
- Can I explain the concept without notes?
- Can I choose the right Azure Databricks feature for a scenario?
- Can I recognize a bad design and correct it?
- Can I troubleshoot a failed pipeline, slow query, permission error, or data-quality issue?
- Can I connect Azure Databricks concepts with Azure storage, security, monitoring, and production operations?
If an area feels familiar but you cannot apply it in a scenario, mark it for more practice.
Readiness areas at a glance
| Readiness area | What to be ready for | You are ready when you can… | Common weak signal |
|---|---|---|---|
| Azure Databricks architecture | Workspaces, compute, notebooks, jobs, SQL warehouses, storage integration | Explain how data engineers use Azure Databricks in a lakehouse workflow | You know notebooks but not production deployment patterns |
| Lakehouse design | Bronze, silver, gold layers; managed and external tables; medallion architecture | Design a layered data flow from raw ingestion to curated analytics tables | You treat all tables as the same type of storage object |
| Delta Lake | Delta tables, ACID transactions, schema enforcement, time travel, merge, optimize, vacuum | Pick the right Delta operation for incremental updates, corrections, retention, and performance | You memorize commands but cannot explain side effects |
| Spark and transformations | Spark SQL, PySpark, DataFrames, joins, aggregations, windowing, UDF tradeoffs | Read or write transformation logic and reason about distributed execution | You write SQL but cannot diagnose shuffle or join problems |
| Data ingestion | Batch files, streaming, Auto Loader-style patterns, incremental loads, checkpoints | Choose ingestion patterns for new files, event data, schema drift, and replay | You forget checkpointing, idempotency, or schema evolution |
| Orchestration | Jobs, tasks, dependencies, parameters, retries, schedules, pipeline promotion | Build and operate multi-step pipelines with failure handling | You rely only on manual notebook execution |
| Governance and security | Unity Catalog concepts, access control, storage credentials, secrets, lineage | Distinguish workspace access, data permissions, and cloud storage permissions | You assume Azure role assignment alone grants table access |
| Performance tuning | File size, partitioning, clustering, query plans, caching, compaction, statistics | Identify the likely cause of slow reads, joins, streaming, or writes | You try random tuning instead of reading symptoms |
| Monitoring and troubleshooting | Job runs, logs, metrics, Spark UI, query history, data-quality alerts | Trace failures from symptoms to root cause and propose corrective action | You only check the final error message |
| DevOps and production | Repos, source control, deployment, environment separation, configuration management | Move logic safely from development to production with repeatable artifacts | You hard-code paths, secrets, or environment names |
| Cost and operational control | Compute sizing, job clusters, SQL warehouses, autoscaling, cluster policies | Balance performance, reliability, and cost in a scenario | You always choose larger compute without justification |
Core DP-750 skill checklist
Azure Databricks platform foundations
Be ready to show that you understand how Azure Databricks is used as a data engineering platform, not just as a notebook tool.
- Explain the role of an Azure Databricks workspace.
- Distinguish interactive development from scheduled production execution.
- Compare all-purpose compute, job compute, and SQL warehouse-style query execution at a practical level.
- Identify when notebooks, jobs, workflows, and SQL are appropriate.
- Explain how Azure Databricks interacts with cloud storage such as Azure Data Lake Storage.
- Recognize why compute configuration affects performance, cost, isolation, and reliability.
- Understand the difference between workspace-level objects and governed data objects.
- Explain how libraries, runtime versions, policies, and environment configuration can affect pipeline behavior.
- Identify where secrets, credentials, and service identities should be handled rather than hard-coded.
Can you do this?
| Prompt | Ready answer should include |
|---|---|
| A team manually runs a notebook every morning. What should change for production? | Schedule or workflow, task dependencies, parameters, retry policy, monitoring, source control, controlled compute |
| A query runs well in development but fails in production. What do you check? | Data volume, permissions, cluster/runtime differences, libraries, configs, paths, parameters, job logs |
| A user can open the workspace but cannot query a table. Why? | Workspace access is not the same as catalog/schema/table permissions or storage access |
Lakehouse design and data organization
DP-750 readiness requires more than knowing that Delta tables exist. You should be able to design a maintainable data layout.
- Explain bronze, silver, and gold layers and what each layer typically stores.
- Identify raw, cleansed, conformed, aggregated, and serving data responsibilities.
- Choose managed or external tables based on ownership, lifecycle, governance, and storage needs.
- Explain how catalogs, schemas, and tables organize data assets.
- Use consistent naming, pathing, and environment separation.
- Identify when to preserve raw data versus overwrite or transform in place.
- Explain how incremental processing reduces cost and improves timeliness.
- Recognize when a pipeline should be replayable from raw data.
- Plan for schema drift, late-arriving data, duplicates, and bad records.
- Connect data modeling choices to downstream analytics and reporting needs.
| Design choice | Use when… | Watch for… |
|---|---|---|
| Bronze/raw table | You need durable landing data, replay, audit, or troubleshooting | Do not over-clean raw data and lose original evidence |
| Silver/cleansed table | You need validated, deduplicated, conformed records | Define data-quality rules and key business logic clearly |
| Gold/serving table | You need aggregated or business-ready data for consumers | Avoid embedding one-off report logic that is hard to maintain |
| Managed table | Platform should manage table metadata and storage lifecycle | Understand deletion and ownership implications |
| External table | Data location/lifecycle is managed outside the table definition | Ensure permissions and storage credentials are correct |
Delta Lake table operations
Know the purpose and risks of Delta operations. Scenario questions often test judgment: when to merge, when to overwrite, when to optimize, and when not to vacuum.
- Explain ACID transaction benefits for concurrent reads and writes.
- Use Delta tables for reliable batch and streaming workloads.
- Explain schema enforcement and schema evolution.
- Use
MERGElogic for upserts, slowly changing records, or deduplication scenarios. - Understand append, overwrite, update, delete, and merge semantics.
- Use table history to investigate changes.
- Explain time travel at a practical level.
- Understand the purpose of compaction and optimization.
- Explain why small files can degrade performance.
- Understand the purpose and risk of vacuuming old files.
- Recognize when partitioning, clustering, data skipping, or statistics can improve query performance.
- Avoid over-partitioning high-cardinality columns.
- Understand constraints and expectations as data-quality controls where applicable.
Delta operation readiness table
| Operation or concept | Know how to use it for… | Exam-style trap |
|---|---|---|
MERGE | Upserts, deduplication, CDC-style processing | Missing match keys or creating duplicate target rows |
DESCRIBE HISTORY | Auditing table changes and troubleshooting | Checking only the current table state |
| Time travel | Recovering or comparing prior table versions | Assuming old data remains available after cleanup |
OPTIMIZE or compaction | Reducing small-file overhead | Optimizing blindly without understanding workload |
VACUUM | Cleaning obsolete files | Breaking historical reads or recovery expectations |
| Schema evolution | Handling new columns safely | Allowing uncontrolled schema drift into curated layers |
| Partitioning/clustering | Reducing scan volume for common filters | Partitioning by unique IDs or very high-cardinality fields |
Data ingestion and incremental processing
Be ready to choose an ingestion pattern from scenario details: source type, latency, file arrival pattern, schema drift, volume, idempotency, and failure recovery.
- Distinguish batch ingestion from streaming ingestion.
- Explain when incremental ingestion is preferable to full reloads.
- Use checkpointing concepts for streaming or incremental file processing.
- Identify why idempotent writes matter when jobs retry.
- Handle duplicate files, duplicate records, and late-arriving data.
- Explain schema inference, schema location, and controlled schema evolution.
- Choose raw ingestion into bronze before cleansing when replay and audit matter.
- Know how bad records can be captured, quarantined, or rejected.
- Explain when event streaming is appropriate versus scheduled file ingestion.
- Recognize when watermarks, windows, and state management are relevant.
- Understand how secrets and credentials should be used for source connections.
- Validate source-to-target row counts and quality checks after ingestion.
Ingestion decision cues
| Scenario cue | Likely design direction |
|---|---|
| New files arrive continuously in cloud storage | Incremental file ingestion with checkpointing |
| Source sends events with low-latency needs | Streaming ingestion with state and failure recovery design |
| Source sends a daily full extract | Batch load, compare to existing target, possibly merge or replace |
| Source schema changes unpredictably | Bronze capture plus controlled schema evolution into silver |
| Source data has duplicates | Deduplicate using keys, timestamps, or business rules before serving |
| Pipeline may rerun after failure | Make writes idempotent and checkpoint-aware |
flowchart TD
A[New data source] --> B{Latency requirement?}
B -->|Near real time| C[Streaming or continuous incremental ingestion]
B -->|Scheduled batch| D[Batch or trigger-based ingestion]
C --> E{Can records arrive late or out of order?}
E -->|Yes| F[Use event-time logic, watermarking, deduplication]
E -->|No| G[Use simpler append or merge pattern]
D --> H{Full reload or changed data only?}
H -->|Changed data only| I[Incremental load with merge/upsert]
H -->|Full extract| J[Replace, compare, or merge based on target needs]
F --> K[Write bronze, validate, promote to silver/gold]
G --> K
I --> K
J --> K
Spark SQL, PySpark, and transformation logic
For DP-750, you should be comfortable reading both SQL-style and DataFrame-style logic. You do not need to turn every problem into code, but you should understand what the code is doing.
- Write and interpret joins, filters, aggregations, and window functions.
- Distinguish inner, left, right, full, semi, and anti join behavior.
- Use window functions for ranking, deduplication, running totals, and latest-record selection.
- Explain when
dropDuplicatesis insufficient without a deterministic rule. - Understand null handling in filters, joins, and aggregations.
- Apply business rules consistently across batch and streaming paths.
- Avoid unnecessary UDFs when built-in Spark functions can perform better.
- Understand lazy evaluation at a practical level.
- Recognize transformations that cause shuffles.
- Explain when broadcast joins may help and when they may be unsafe.
- Validate transformation outputs with row counts, constraints, and sample checks.
- Understand how to parameterize paths, catalog names, dates, and environments.
Transformation checks
| Can you do this? | Why it matters |
|---|---|
| Select the latest record per key using a window function | Common deduplication and CDC pattern |
| Explain why a join creates unexpected duplicates | Tests data modeling and key understanding |
| Rewrite a Python loop as a Spark transformation | Tests distributed-processing judgment |
| Identify why null values disappear from a filter | Prevents silent data loss |
| Explain why a UDF slows a pipeline | Tests Spark execution awareness |
Orchestration, jobs, and pipeline operations
The exam identity is data engineering, so be ready for production pipeline thinking: dependencies, retries, monitoring, permissions, parameters, and promotion.
- Create a mental model of a multi-task workflow.
- Explain task dependencies and failure behavior.
- Use parameters instead of hard-coded values.
- Know why job compute is often preferable for repeatable scheduled workloads.
- Configure retries and notifications based on failure impact.
- Separate development, test, and production environments.
- Track job run history and logs.
- Understand how pipeline code is promoted through source control.
- Identify when a task should fail fast versus quarantine bad data.
- Explain how to rerun failed tasks without corrupting target data.
- Recognize when a pipeline should be split into independent tasks.
- Understand the role of declarative or managed pipeline features where used in Azure Databricks.
| Pipeline concern | Readiness question |
|---|---|
| Dependencies | Can you explain what should happen if an upstream bronze load fails? |
| Parameters | Can the same notebook run for dev/test/prod without editing code? |
| Retries | Will retrying create duplicate records or inconsistent state? |
| Notifications | Who should be alerted, and for which failure severity? |
| Observability | Where would you find the failed task, logs, and input parameters? |
| Recovery | Can you replay from raw data or checkpoints safely? |
Governance, security, and Unity Catalog concepts
Security questions often test boundaries: workspace access, data access, storage access, identity, and secret handling are related but not identical.
- Explain the purpose of centralized data governance in Azure Databricks.
- Understand catalogs, schemas, tables, views, volumes, and other governed objects at a conceptual level.
- Distinguish data-plane access from workspace UI access.
- Explain why least privilege matters for pipelines and users.
- Use groups or service identities rather than assigning broad permissions to individuals.
- Recognize when a storage credential or external location pattern is needed.
- Avoid hard-coding secrets, tokens, connection strings, or keys in notebooks.
- Understand how views can be used to restrict or shape data access.
- Recognize row-level, column-level, masking, or filtering patterns where applicable.
- Explain lineage and audit value for production data pipelines.
- Know how ownership affects the ability to manage objects.
- Troubleshoot permission errors by checking both Databricks permissions and Azure resource access.
Security decision checks
| Scenario | What to check |
|---|---|
| User can see a workspace but cannot query a table | Catalog/schema/table privileges, ownership, group membership |
| Job fails when reading cloud storage | Service identity, storage credential, external location, Azure storage permissions |
| Notebook contains a storage key | Replace with secret management or governed credential pattern |
| Analysts need limited fields only | View, column masking, column selection, or separate serving table |
| Pipeline needs production write access | Use controlled service identity and least-privilege permissions |
Performance tuning and optimization
Performance readiness means reading symptoms and choosing targeted fixes. Avoid “bigger cluster” as the default answer.
- Explain how partition pruning, data skipping, clustering, and file organization affect reads.
- Identify small-file problems and when compaction helps.
- Recognize over-partitioning and skewed partitioning.
- Use query plans or Spark UI clues to reason about slow jobs.
- Understand shuffle-heavy operations such as joins, aggregations, and repartitions.
- Explain skew and possible mitigations.
- Know when caching can help repeated reads and when it wastes memory.
- Compare scaling compute with optimizing data layout.
- Identify when statistics or table maintenance can improve query planning.
- Understand that streaming performance depends on source rate, trigger interval, state, checkpointing, and sink throughput.
- Avoid collecting large datasets to the driver.
- Avoid Python-side loops over large distributed datasets.
| Symptom | Likely causes | Better next step |
|---|---|---|
| Query scans too much data | Poor filters, no pruning, bad layout | Review predicates, layout, clustering/partitioning, statistics |
| Job has many tiny tasks/files | Small-file problem | Compact files and tune write strategy |
| Join is very slow | Shuffle, skew, missing filtering, wrong join type | Inspect plan, reduce data, handle skew, consider broadcast where safe |
| Driver crashes | Collecting too much data, oversized metadata, bad loop | Keep processing distributed, reduce driver-side operations |
| Streaming backlog grows | Source rate exceeds processing, slow sink, state growth | Check metrics, scale/tune, optimize sink, review state/watermark logic |
Monitoring, troubleshooting, and reliability
Be ready to diagnose pipeline failures from available artifacts: job history, task logs, Spark UI, query history, table history, event logs, and data-quality results.
- Locate job run failures and identify failed tasks.
- Read error messages for permission, schema, path, memory, timeout, and dependency issues.
- Use table history to investigate recent writes.
- Compare expected and actual row counts.
- Validate that checkpoints are not accidentally deleted or reused incorrectly.
- Identify malformed records or schema drift as ingestion failure causes.
- Explain how retries interact with idempotent writes.
- Recognize when a failure is data-related versus infrastructure-related.
- Use logging and metrics to support root-cause analysis.
- Define alerts for failed jobs, late data, quality failures, and performance regressions.
- Know how to communicate impact: failed load, stale table, partial refresh, or incorrect data.
| Failure type | First things to inspect |
|---|---|
| Permission denied | User/service identity, catalog privileges, storage access, secret scope |
| Schema mismatch | Source schema, target schema, evolution settings, recent upstream change |
| File not found | Path parameter, environment, lifecycle policy, external location |
| Duplicate records | Retry behavior, merge keys, checkpoint reuse, source duplicates |
| Slow job | Spark UI, query plan, file layout, skew, compute sizing |
| Stale dashboard data | Job schedule, last successful run, downstream dependency, refresh logic |
DevOps, deployment, and lifecycle management
A data engineer associate should be able to reason about repeatable delivery, not just interactive exploration.
- Use source control for notebooks, scripts, SQL, and configuration.
- Separate code from environment-specific settings.
- Promote artifacts through development, test, and production.
- Use job definitions or deployment artifacts rather than manual recreation.
- Parameterize catalog, schema, storage path, checkpoint path, and run date.
- Manage dependencies and libraries consistently.
- Use reviews and testing for transformation logic.
- Validate data quality before publishing curated tables.
- Protect production data from accidental overwrite.
- Plan rollback or recovery using versioned code and table history where applicable.
- Understand how automated deployment reduces configuration drift.
SQL and PySpark artifact checks
You do not need to memorize every syntax variation, but you should recognize the intent, risks, and expected result of common data engineering patterns.
Delta merge pattern
MERGE INTO catalog.silver.customers AS target
USING catalog.bronze.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *
Be ready to explain:
- What column or business key determines a match.
- Why the timestamp condition matters.
- What happens if the source has multiple rows for the same key.
- Whether the operation is idempotent if rerun.
- How you would validate the target after the merge.
Incremental file ingestion pattern
(
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", schema_path)
.load(raw_path)
.writeStream
.option("checkpointLocation", checkpoint_path)
.trigger(availableNow=True)
.toTable("catalog.bronze.events")
)
Be ready to explain:
- Why a schema location is used.
- Why a checkpoint location is required.
- What happens if the checkpoint is deleted.
- Why the bronze target is useful before cleansing.
- How schema drift should be controlled before promotion to silver.
Window-based deduplication pattern
CREATE OR REPLACE TABLE catalog.silver.latest_orders AS
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY updated_at DESC
) AS rn
FROM catalog.bronze.orders
)
WHERE rn = 1;
Be ready to explain:
- Why
ROW_NUMBERprovides deterministic latest-record selection only if the ordering is sufficient. - What additional tie-breaker may be needed.
- Whether this is a full rebuild or should be adapted for incremental processing.
- How late-arriving records could affect the result.
Table investigation commands
| Artifact | What it helps you determine |
|---|---|
DESCRIBE HISTORY table_name | Recent writes, operations, users/jobs, operation metrics |
| Query history | Slow SQL statements, duration, user, warehouse context |
| Job run logs | Task failure details, parameters, cluster/runtime information |
| Spark UI | Stages, tasks, shuffles, skew, spill, executor behavior |
| Table metadata | Location, schema, provider, properties, partition columns |
| Lineage view | Upstream and downstream dependencies where available |
Scenario and decision-point checks
Use these prompts to test whether you can choose the best option under constraints.
| Scenario | Strong answer should consider | Weak answer |
|---|---|---|
| New JSON files land throughout the day and must be processed incrementally | Incremental ingestion, schema tracking, checkpointing, bronze table, retries | Reread the whole folder every hour |
| A table has slow queries filtering by date and region | File layout, partitioning/clustering, pruning, stats, query predicates | Increase compute without checking scan pattern |
| A scheduled job sometimes creates duplicate records | Idempotency, merge keys, checkpointing, source duplicates, retry behavior | Disable retries only |
| Analysts need access to curated sales data but not raw PII | Catalog permissions, views/masking/filtering, gold tables, least privilege | Give broad access to the storage account |
| A schema change in the source breaks the silver pipeline | Capture schema in bronze, controlled evolution, validation, alerting | Automatically allow all schema changes into gold |
| Pipeline fails after a cluster/runtime change | Library versions, runtime compatibility, configuration drift, test environment | Assume the data source is wrong |
| Streaming query falls behind during peak load | Source rate, micro-batch duration, state, sink performance, compute sizing | Restart repeatedly without checking metrics |
| Data must be reproducible for audit | Raw retention, table history, versioned code, controlled writes, lineage | Overwrite raw data to save space |
| Developers need separate test runs | Parameterized catalogs/schemas/paths, dev/test/prod separation | Edit production notebook paths manually |
| A service account can write to storage but cannot create a table | Databricks object privileges, catalog/schema permissions, ownership | Check only Azure RBAC |
Common weak areas and traps
| Trap | Why it hurts readiness | What to review |
|---|---|---|
| Confusing workspace permissions with data permissions | Users may access the workspace but not governed data | Unity Catalog-style privilege model and object hierarchy |
| Treating Delta as just Parquet files | Misses transactions, history, schema enforcement, and maintenance | Delta table operations and transaction log concepts |
| Forgetting checkpoints | Streaming and incremental jobs may reread or duplicate data | Checkpoint design and replay behavior |
| Using overwrite when merge is required | Can destroy history or cause downstream inconsistencies | Incremental load patterns and idempotent writes |
| Vacuuming without understanding retention needs | Can remove files needed for history or recovery | Table maintenance and time-travel implications |
| Over-partitioning | Creates too many small files and slow metadata operations | Partition strategy and clustering alternatives |
| Caching everything | Wastes memory and may hide stale-data issues | Cache only repeated expensive reads with clear lifecycle |
| Relying on driver-side logic | Breaks distributed processing and can crash at scale | Spark transformations and avoiding collect misuse |
| Ignoring null semantics | Joins and filters may silently drop or misclassify data | SQL null behavior and data-quality tests |
| Hard-coding secrets or paths | Creates security and deployment risk | Secrets, parameters, and environment configuration |
| No data-quality gate before gold | Bad data reaches consumers | Expectations, constraints, validation, quarantine patterns |
| No recovery plan | Failures become manual rebuilds | Replayable raw data, idempotent writes, versioned code |
Final-week DP-750 review checklist
| Timing | Focus | What to complete |
|---|---|---|
| 7 days out | Baseline readiness | Mark every checklist item as confident, partial, or weak |
| 6 days out | Delta and lakehouse | Practice merge, history, schema evolution, table layout, optimization scenarios |
| 5 days out | Ingestion and streaming | Review checkpoints, incremental files, schema drift, deduplication, late data |
| 4 days out | Governance and security | Drill catalog permissions, storage access, secrets, service identities, least privilege |
| 3 days out | Performance and troubleshooting | Practice symptoms: slow joins, small files, skew, failed jobs, permission errors |
| 2 days out | Orchestration and production | Review jobs, tasks, dependencies, retries, parameters, monitoring, deployment |
| 1 day out | Mixed scenarios | Answer “what would you choose and why?” prompts without notes |
| Exam day | Light review | Revisit traps, decision tables, and weak areas only |
Quick self-assessment scorecard
Rate each area from 1 to 3:
- 1 = I recognize terms but cannot apply them
- 2 = I can solve basic questions but struggle with scenarios
- 3 = I can explain tradeoffs and troubleshoot
| Area | Score |
|---|---|
| Azure Databricks workspace, compute, and jobs | |
| Lakehouse architecture and medallion design | |
| Delta Lake operations and table maintenance | |
| Batch ingestion and incremental processing | |
| Streaming concepts, checkpointing, and late data | |
| Spark SQL, PySpark, joins, windows, and aggregations | |
| Data quality, schema drift, and validation | |
| Unity Catalog, permissions, secrets, and governance | |
| Performance tuning and query troubleshooting | |
| Monitoring, reliability, and production operations | |
| DevOps, deployment, and environment separation |
If any area is a 1, review it before taking full-length practice. If an area is a 2, focus on scenario questions that force you to choose between two plausible designs.
Practical next step
Turn this Exam Blueprint into a study tracker. For each missed practice question, tag it to one of the readiness areas above, write the reason you missed it, and practice a similar Azure Databricks scenario until you can explain the correct choice, the rejected alternatives, and the production tradeoff.