DP-700 — Microsoft Fabric Data Engineer Associate Exam Blueprint
Practical DP-700 exam blueprint for Microsoft Fabric Data Engineer Associate candidates: Fabric architecture, ingestion, transformation, governance, monitoring, and optimization readiness.
How to Use This Exam Blueprint
Use this independent checklist to prepare for the Microsoft Fabric Data Engineer Associate (DP-700) exam. It is a practical study map for DP-700, not an official Microsoft skills outline and not a claim about exact exam weights.
Mark a topic as “ready” only when you can do three things:
- Choose the right Fabric service or artifact for a scenario.
- Explain the tradeoff: security, performance, governance, cost, maintainability, or operational impact.
- Troubleshoot a realistic failure without relying only on memorized UI steps.
For final review, focus less on definitions and more on scenario judgment: lakehouse vs warehouse, notebook vs Dataflow Gen2, copy vs shortcut, full load vs incremental load, and successful run vs trustworthy data.
DP-700 readiness map
| Readiness area | What to review | What “ready” looks like |
|---|---|---|
| Microsoft Fabric platform foundation | Workspaces, capacities, items, OneLake, tenant and workspace concepts, item relationships | You can explain how Fabric organizes data engineering work and how a data pipeline, lakehouse, warehouse, notebook, semantic model, and OneLake path relate to each other. |
| Lakehouse and OneLake architecture | Lakehouse tables, files, Delta Lake concepts, shortcuts, medallion layers, schemas, SQL analytics endpoint | You can design a lakehouse layout for raw, cleansed, and curated data and explain when to avoid copying data by using a shortcut or other integration pattern. |
| Warehouse and SQL engineering | Warehouse use cases, relational modeling, T-SQL transformations, dimensional models, views, stored logic, serving layers | You can decide when a SQL-first warehouse is better than a Spark/lakehouse-first design and can model fact and dimension tables for analytics. |
| Data ingestion | Pipelines, Copy activity, Dataflow Gen2, notebooks, source connections, gateways, parameters, incremental patterns | You can design a repeatable ingestion flow with authentication, schema handling, error handling, and full or incremental load logic. |
| Data transformation | Spark notebooks, Spark SQL, PySpark, Dataflow Gen2 transformations, warehouse SQL, Delta table writes | You can transform raw data into validated serving tables and justify the tool choice for code-first, low-code, or SQL-first work. |
| Orchestration and scheduling | Pipeline activities, dependencies, parameters, variables, retries, triggers, run history | You can chain ingestion, validation, transformation, and notification steps into an operational workflow. |
| Data quality and reliability | Idempotent loads, deduplication, validation rules, watermarks, late-arriving data, schema drift, error tables | You can make a pipeline safe to rerun and can detect when a “successful” run produced incomplete or invalid data. |
| Security and governance | Workspace roles, item permissions, data permissions, sensitivity labels, lineage, endorsements, gateway credentials | You can apply least privilege and explain the difference between access to a Fabric workspace, access to an item, and access to the data inside that item. |
| Monitoring and troubleshooting | Monitoring hub, pipeline run details, Spark logs, refresh/run failures, capacity signals, lineage, query diagnostics | You can identify whether a failure is caused by credentials, schema mismatch, source limits, Spark performance, SQL logic, or capacity pressure. |
| Performance optimization | Partitioning, file size management, predicate pushdown, shuffle reduction, query design, table maintenance, workload scheduling | You can improve a slow load or query using evidence rather than guessing. |
| Deployment and lifecycle | Git integration concepts, deployment pipelines, workspace promotion, parameterization, environment separation | You can explain how to move data engineering artifacts from development to test or production with minimal manual rework. |
| Analytics handoff | Semantic models, Direct Lake serving concepts where appropriate, Power BI consumption patterns, curated gold layer | You can prepare data so downstream analysts and reports consume governed, documented, reliable tables. |
Core service and artifact selection
| Scenario cue | Strong candidate choice | Why it fits | Common trap |
|---|---|---|---|
| Need a file-based analytical store with Spark transformations | Lakehouse | Supports files, tables, Delta patterns, notebooks, and flexible data engineering | Treating every lakehouse table like a fully modeled warehouse table from day one |
| Need SQL-first relational analytics and curated dimensional structures | Warehouse | Better fit for SQL developers, relational modeling, and serving structured analytical data | Using Spark notebooks for transformations that are simpler and clearer in SQL |
| Need to orchestrate multiple steps with dependencies | Data pipeline | Coordinates copy, notebook, dataflow, validation, branching, and retry logic | Hiding orchestration inside a long notebook with no clear operational visibility |
| Need low-code data shaping from common sources | Dataflow Gen2 | Useful for visual transformation, mapping, and repeatable data preparation | Choosing it for highly custom code-heavy logic better suited to notebooks |
| Need Python, PySpark, custom libraries, or complex data engineering logic | Notebook or Spark job pattern | Supports code-first transformation and advanced processing | Using notebooks without parameterization, logging, or rerun safety |
| Need to expose curated data to reporting | Semantic model or curated serving tables | Separates engineering from consumption and supports governed analytics | Pointing reports directly at raw or unstable tables |
| Need to reference data without physically copying it | Shortcut or equivalent integration pattern | Reduces duplication and can simplify lake architecture | Forgetting that permissions, source availability, and governance still matter |
| Need near-real-time or event-oriented processing | Eventstream or another Real-Time Intelligence item, when in scope for the solution | Fits telemetry, streams, and time-sensitive ingestion patterns | Forcing a batch pipeline onto a streaming requirement without latency analysis |
| Need environment promotion | Deployment pipeline, Git integration, parameters | Supports repeatable movement across workspaces or stages | Hard-coding source paths, workspace names, or credentials |
Can you do this? High-value DP-700 skills
Fabric architecture and platform judgment
- Explain the role of OneLake in a Fabric data estate.
- Distinguish a workspace, capacity-backed environment, item, lakehouse, warehouse, pipeline, notebook, and semantic model.
- Choose between lakehouse, warehouse, Dataflow Gen2, pipeline, notebook, and shortcut based on a scenario.
- Design a workspace layout for development, test, and production without hard-coding environment-specific values.
- Explain how lineage helps troubleshoot dependencies and downstream impact.
- Identify which artifact owns storage, which artifact transforms data, and which artifact serves data.
- Recognize when a successful pipeline run still requires data validation before publishing results.
Lakehouse, OneLake, and Delta readiness
- Organize data into raw, cleaned, and curated zones or medallion-style layers.
- Explain the difference between files and managed tables in a lakehouse-style design.
- Describe why Delta Lake concepts matter for reliability, schema management, and analytical reads.
- Use partitioning deliberately rather than by habit.
- Recognize small-file, skew, and over-partitioning symptoms.
- Explain when to preserve raw source data unchanged.
- Implement deduplication and upsert logic using business keys and timestamps.
- Handle source schema changes without silently breaking downstream tables.
- Explain how shortcuts can reduce duplication and where they add dependency risk.
- Validate row counts, null rates, duplicate counts, and referential assumptions across layers.
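The validation checks in the last item can be sketched in plain Python. This is a minimal illustration on in-memory rows, not a Fabric or Spark API; the function names `profile_rows` and `orphan_keys` and the sample tables are hypothetical.

```python
from collections import Counter


def profile_rows(rows, key, required_columns):
    """Return row count, duplicate-key count, and null rates for a list of dict rows."""
    total = len(rows)
    key_counts = Counter(row.get(key) for row in rows)
    # Every occurrence of a key beyond the first counts as a duplicate row.
    duplicate_rows = sum(count - 1 for count in key_counts.values() if count > 1)
    null_rates = {
        column: (sum(1 for row in rows if row.get(column) is None) / total if total else 0.0)
        for column in required_columns
    }
    return {"row_count": total, "duplicate_rows": duplicate_rows, "null_rates": null_rates}


def orphan_keys(child_rows, child_key, parent_rows, parent_key):
    """Return child key values that have no matching parent row."""
    parent_values = {row[parent_key] for row in parent_rows}
    return sorted({row[child_key] for row in child_rows if row[child_key] not in parent_values})


# Hypothetical sample data for illustration only.
silver_orders = [
    {"OrderId": 1, "CustomerId": "C1", "Amount": 10.0},
    {"OrderId": 2, "CustomerId": "C2", "Amount": None},
    {"OrderId": 2, "CustomerId": "C2", "Amount": 5.0},
    {"OrderId": 3, "CustomerId": "C9", "Amount": 7.5},
]
customers = [{"CustomerId": "C1"}, {"CustomerId": "C2"}]

metrics = profile_rows(silver_orders, key="OrderId", required_columns=["Amount"])
orphans = orphan_keys(silver_orders, "CustomerId", customers, "CustomerId")
print(metrics, orphans)
```

In a real lakehouse the same checks would usually run as Spark or SQL aggregations, with thresholds agreed with the data owners.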
Warehouse and SQL engineering readiness
- Define the grain of a fact table before building it.
- Choose star schema patterns for reporting-friendly curated data.
- Distinguish business keys, surrogate keys, natural keys, and composite keys.
- Handle slowly changing dimension requirements at a conceptual level.
- Choose between a view and a materialized/physical table based on performance, freshness, and maintainability.
- Write SQL transformations that are clear, testable, and rerunnable.
- Avoid building reports directly on staging tables unless the scenario explicitly supports it.
- Explain how SQL serving layers relate to lakehouse and warehouse choices.
Ingestion and orchestration readiness
- Choose full load, append load, incremental load, or change-based load based on source capability and business requirements.
- Design a watermark strategy for incremental ingestion.
- Parameterize source path, destination path, date range, environment, and table name where appropriate.
- Configure connection and credential patterns securely.
- Recognize when an on-premises or private source may require a gateway or equivalent connectivity pattern.
- Add retry, timeout, failure branch, and notification logic to operational pipelines.
- Make loads idempotent so a retry does not duplicate data.
- Capture rejected rows or invalid records for review.
- Separate ingestion, transformation, validation, and publishing steps when operational clarity matters.
- Use run history and activity outputs to determine where a pipeline failed.
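One way to reason about idempotent loads is to tag every row with a batch identifier and replace that batch on each attempt. The sketch below is a plain Python illustration with an in-memory list standing in for the destination table; `load_batch` and the batch IDs are hypothetical.

```python
def load_batch(target, batch_id, rows):
    """Replace any rows previously written for batch_id, then write the new rows."""
    kept = [row for row in target if row["batch_id"] != batch_id]
    kept.extend({**row, "batch_id": batch_id} for row in rows)
    return kept


# Hypothetical batches for illustration only.
target = []
batch = [{"OrderId": 1}, {"OrderId": 2}]
target = load_batch(target, "2024-01-01", batch)
target = load_batch(target, "2024-01-01", batch)  # retry of the same batch
target = load_batch(target, "2024-01-02", [{"OrderId": 3}])
print(len(target))
```

Because the retry removes the earlier attempt before writing, running the same batch twice leaves the same rows as running it once.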
Transformation readiness
- Choose Spark/PySpark when transformations need distributed processing or custom code.
- Choose Dataflow Gen2 when a visual, low-code transformation is more maintainable.
- Choose SQL when the logic is relational, set-based, and close to the serving model.
- Convert semi-structured data into structured tables.
- Normalize date, time, currency, and identifier fields.
- Enforce data types before data reaches curated tables.
- Implement deduplication by key and precedence rule.
- Detect late-arriving records and decide whether to restate downstream tables.
- Validate transformation outputs against source totals and expected business rules.
- Document assumptions that downstream report authors depend on.
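Deduplication by key and precedence rule means choosing which row wins when a key repeats, rather than keeping an arbitrary one. A minimal plain Python sketch, assuming the latest modification time wins; `deduplicate`, the column names, and the sample rows are hypothetical.

```python
def deduplicate(rows, key, precedence):
    """Keep one row per key: the row with the highest precedence value."""
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or precedence(row) > precedence(current):
            best[row[key]] = row
    return list(best.values())


# Hypothetical rows; ISO-8601 timestamps in one format sort chronologically as strings.
rows = [
    {"OrderId": 1, "ModifiedAt": "2024-01-01T08:00:00", "Status": "new"},
    {"OrderId": 1, "ModifiedAt": "2024-01-02T09:30:00", "Status": "shipped"},
    {"OrderId": 2, "ModifiedAt": "2024-01-01T10:00:00", "Status": "new"},
]
latest = deduplicate(rows, key="OrderId", precedence=lambda row: row["ModifiedAt"])
print(latest)
```

The precedence function is the part to agree with the business: latest timestamp, highest version number, or a preferred source system are all plausible rules.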
Security, governance, and access control readiness
- Explain least privilege for Fabric workspaces and data artifacts.
- Distinguish workspace roles from item-level permissions and data-level permissions.
- Identify when SQL object-level security or data access controls are needed in addition to workspace access.
- Protect credentials used by pipelines, dataflows, notebooks, and gateways.
- Apply sensitivity and endorsement concepts appropriately.
- Use lineage to understand downstream effects before changing or deleting data assets.
- Recognize governance risks from unmanaged shortcuts, copied data, and duplicated curated tables.
- Explain why production data engineering work should avoid personal credentials where possible.
- Review sharing decisions from both convenience and data exposure perspectives.
Monitoring, troubleshooting, and optimization readiness
- Use pipeline run details to locate the failed activity and inspect error messages.
- Use notebook and Spark logs to identify failed cells, package issues, executor errors, skew, or memory pressure.
- Check source authentication and destination permissions before rewriting transformation logic.
- Diagnose schema mismatch, missing columns, changed data types, and malformed files.
- Explain why a query may be slow due to file layout, partitioning, joins, filters, or workload concurrency.
- Identify small-file issues and when table maintenance or compaction-style actions may help.
- Use filters and column pruning to reduce unnecessary reads.
- Avoid expensive transformations in the serving path when they can be precomputed.
- Schedule heavy jobs to reduce contention where the business allows.
- Use monitoring signals to distinguish data failure, compute failure, and capacity pressure.
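Schema mismatch diagnosis often starts with a plain comparison of the expected and incoming column definitions. The sketch below assumes schemas are available as simple column-to-type mappings; `compare_schema`, the type names, and the rule that only missing or retyped columns are breaking are assumptions for illustration.

```python
def compare_schema(expected, actual):
    """Compare two {column: type_name} mappings and report the differences."""
    missing = sorted(set(expected) - set(actual))
    added = sorted(set(actual) - set(expected))
    type_changed = sorted(
        column for column in set(expected) & set(actual)
        if expected[column] != actual[column]
    )
    return {"missing": missing, "added": added, "type_changed": type_changed}


# Hypothetical schemas for illustration only.
expected = {"OrderId": "int", "OrderDate": "date", "Amount": "decimal"}
actual = {"OrderId": "int", "OrderDate": "string", "Amount": "decimal", "Channel": "string"}

drift = compare_schema(expected, actual)
# Assumed policy: added columns are tolerated, missing or retyped columns are not.
breaking = bool(drift["missing"] or drift["type_changed"])
print(drift, breaking)
```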
Medallion and serving-layer checklist
| Layer | Purpose | Candidate checks |
|---|---|---|
| Bronze/raw | Preserve source data with minimal transformation | Can you reload from source? Did you capture load time, source file/table name, and ingestion batch metadata where useful? |
| Silver/cleansed | Standardize, type, deduplicate, validate | Did you enforce schemas, remove duplicates, handle invalid records, and apply consistent business keys? |
| Gold/curated | Serve analytics-ready facts, dimensions, aggregates, or domain tables | Is the grain clear? Are measures and dimensions report-friendly? Are joins predictable? |
| Semantic/reporting layer | Provide governed consumption for analysts and business users | Are table names, relationships, permissions, and refresh/serving choices appropriate? |
Ingestion decision checks
| Question | If yes, consider | If no, consider |
|---|---|---|
| Does the source support reliable change detection? | Incremental or change-based load with watermark/checkpoint logic | Full load, snapshot comparison, or source-side export pattern |
| Is the source large or slow to extract? | Incremental copy, partitioned extraction, staged loads | Simpler full load may be acceptable for small reference data |
| Is transformation mostly visual and repeatable? | Dataflow Gen2 | Notebook or SQL if logic is complex or code-heavy |
| Do you need multiple dependent steps? | Pipeline orchestration | Single dataflow or notebook schedule may be enough for simple jobs |
| Is data already available in a compatible cloud location? | Shortcut or direct integration pattern | Copy into OneLake/lakehouse when isolation or performance requires it |
| Are credentials or network access complex? | Connection, gateway, managed access pattern, or service identity approach | Standard cloud connector may be sufficient |
| Must the process be safe to rerun? | Idempotent design, staging, merge/upsert, batch IDs | Append-only may be acceptable only for immutable event data |
Transformation patterns to recognize
Incremental watermark pattern
A strong DP-700 candidate can explain the purpose of a watermark even if syntax differs by tool.
```sql
-- Conceptual incremental filter pattern (SourceTable is a placeholder name)
SELECT *
FROM SourceTable
WHERE SourceModifiedDate > @LastSuccessfulWatermark
  AND SourceModifiedDate <= @CurrentWatermark;
```
Readiness checks:
- You know where the previous successful watermark is stored.
- You know when the new watermark is committed.
- You avoid advancing the watermark before validation succeeds.
- You have a plan for late-arriving records.
- You can rerun a failed batch without duplicating rows.
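The readiness checks above can be sketched in plain Python to show the order of operations: select the window, validate, and only then commit the watermark. This is an illustration with in-memory state, not a Fabric API; `run_incremental_load`, the `state` dictionary, and the sample rows are hypothetical.

```python
from datetime import datetime


def run_incremental_load(source_rows, state, current_watermark, validate):
    """Select rows in (last watermark, current watermark] and commit only after validation."""
    last_watermark = state["watermark"]
    batch = [
        row for row in source_rows
        if last_watermark < row["SourceModifiedDate"] <= current_watermark
    ]
    if not validate(batch):
        # Leave the stored watermark untouched so the same window can be rerun.
        return batch, False
    state["watermark"] = current_watermark
    return batch, True


# Hypothetical source rows and watermark state for illustration only.
source_rows = [
    {"OrderId": 1, "SourceModifiedDate": datetime(2023, 12, 31)},
    {"OrderId": 2, "SourceModifiedDate": datetime(2024, 1, 1, 12)},
    {"OrderId": 3, "SourceModifiedDate": datetime(2024, 1, 2)},
]
state = {"watermark": datetime(2024, 1, 1)}
cutoff = datetime(2024, 1, 2)

failed_batch, failed_ok = run_incremental_load(source_rows, state, cutoff, lambda b: False)
watermark_after_failure = state["watermark"]
retry_batch, retry_ok = run_incremental_load(source_rows, state, cutoff, lambda b: len(b) > 0)
next_batch, _ = run_incremental_load(source_rows, state, cutoff, lambda b: True)
print(len(failed_batch), len(retry_batch), len(next_batch))
```

The sketch does not handle late-arriving records. A row that appears later with a modification date at or before the committed watermark would be missed unless the window overlaps or a separate reconciliation runs.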
PySpark table transformation pattern
You do not need to memorize every API call, but you should recognize the intent of common Spark/Delta operations.
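```python
# Read the bronze Delta table (assumes an active `spark` session, as in a notebook)
orders = spark.read.format("delta").table("bronze_orders")

# Keep one row per OrderId and rename the raw date column
clean_orders = (
    orders
    .dropDuplicates(["OrderId"])
    .withColumnRenamed("OrderDateText", "OrderDate")
)

# Replace the contents of the silver table with the cleaned result
clean_orders.write.format("delta").mode("overwrite").saveAsTable("silver_orders")
```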
Readiness checks:
- Can you explain what table is read and what table is written?
- Can you identify whether the write pattern is append, overwrite, or merge/upsert?
- Can you explain why overwrite may be risky for large or production tables?
- Can you add validation before publishing the result?
- Can you parameterize table names for different environments?
Merge/upsert reasoning
You should be able to explain when an upsert is safer than a blind append.
| Situation | Better pattern | Why |
|---|---|---|
| New immutable event rows | Append | Events are not expected to change after arrival |
| Source sends corrections for existing records | Merge/upsert | Existing rows may need updates |
| Source sends full daily snapshot | Replace snapshot or compare-and-merge | Avoid duplicate active records |
| Dimension attributes change over time | Type 1 or Type 2 dimension approach | Business requirement determines whether to preserve history |
| Deletions must be reflected | Change detection with delete handling | Append-only loads will leave stale rows |
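
The merge/upsert and delete-handling rows of the table can be illustrated with a plain Python sketch. It applies Type 1 style overwrites to a dictionary keyed by business key and is not Delta Lake `MERGE` syntax; `merge_changes`, the `is_deleted` flag, and the sample rows are hypothetical.

```python
def merge_changes(target, changes, key):
    """Apply insert, update, and delete changes to a dict of rows keyed by business key."""
    merged = dict(target)
    for change in changes:
        row_key = change[key]
        if change.get("is_deleted"):
            # Remove the row if present; a repeated delete is a no-op.
            merged.pop(row_key, None)
        else:
            # Insert a new row or overwrite the existing one (no history kept).
            merged[row_key] = {k: v for k, v in change.items() if k != "is_deleted"}
    return merged


# Hypothetical target table and change feed for illustration only.
target = {
    1: {"CustomerId": 1, "City": "Oslo"},
    2: {"CustomerId": 2, "City": "Lima"},
}
changes = [
    {"CustomerId": 1, "City": "Bergen"},     # correction to an existing row
    {"CustomerId": 2, "is_deleted": True},   # deletion reported by the source
    {"CustomerId": 3, "City": "Quito"},      # new row
]

result = merge_changes(target, changes, key="CustomerId")
rerun = merge_changes(result, changes, key="CustomerId")
print(result)
```

Applying the same change set twice gives the same result, which is the property a blind append lacks.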
Security and governance decision checks
| Scenario | What to think through |
|---|---|
| A developer needs to edit a notebook but not administer the workspace | Workspace role selection, item permissions, and separation of duties |
| A reporting user needs to view curated data only | Semantic model permissions, warehouse/lakehouse data permissions, and avoiding raw data exposure |
| A pipeline connects to a production source | Credential storage, identity choice, gateway requirements, and auditability |
| A shortcut references data owned by another team | Source permissions, lineage, ownership, availability, and change coordination |
| Sensitive columns appear in raw data | Classification, access controls, masking or exclusion in curated layers, and downstream exposure |
| A dataset is certified or endorsed | Ownership, quality expectations, lineage, and controlled change process |
| Production and development share data assets | Risk of accidental modification, credential leakage, and environment contamination |
Monitoring and troubleshooting checklist
| Symptom | Likely areas to inspect | Candidate response |
|---|---|---|
| Pipeline activity fails immediately | Credentials, connection, gateway, source path, permissions | Check authentication and connectivity before changing transformation code. |
| Copy succeeds but destination table is empty | Filters, parameters, source query, date range, destination mapping | Inspect activity inputs/outputs and validate source row counts. |
| Notebook fails after running for a long time | Spark logs, data skew, memory, shuffle, package dependency, bad record | Identify the failing stage or transformation and reduce data movement. |
| Schema mismatch error appears | Source column changes, data type changes, destination schema enforcement | Decide whether to update schema, handle drift, or reject the batch. |
| Query against curated table is slow | File layout, partitions, joins, filters, table size, unnecessary columns | Use pruning, precomputation, and table maintenance where appropriate. |
| Refresh/report output is wrong but pipeline succeeded | Data quality checks, business logic, duplicates, joins, late records | Treat operational success and data correctness as separate checks. |
| Users lost access after workspace change | Workspace role, item sharing, SQL/data permissions, semantic model permissions | Trace access from workspace to item to data. |
| Job performance varies by time of day | Capacity pressure, concurrency, scheduled workloads, source throttling | Review monitoring signals and schedule/resource tradeoffs. |
Performance optimization checks
Spark and lakehouse performance
- Filter early and select only needed columns.
- Avoid unnecessary shuffles and wide transformations.
- Understand when joins create skew or memory pressure.
- Use partitioning only when it supports common filters and does not create excessive small files.
- Recognize symptoms of too many small files.
- Use table maintenance or optimization features where appropriate for the Fabric item and table format.
- Cache only when reused data justifies it.
- Avoid collecting large datasets to the driver.
- Precompute curated tables rather than repeatedly transforming raw data for every report.
- Validate that performance improvements preserve correct results.
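A rough calculation helps show why over-partitioning produces small files. The sketch below assumes data is spread evenly across partitions, which real tables rarely are; the table size, partition counts, and the function `average_file_size_mb` are hypothetical.

```python
def average_file_size_mb(table_size_gb, partition_count, files_per_partition=1):
    """Estimate the average file size if data is spread evenly across partitions."""
    return table_size_gb * 1024 / (partition_count * files_per_partition)


# Hypothetical 50 GB table partitioned by day for one year.
by_date = average_file_size_mb(table_size_gb=50, partition_count=365)

# Same table partitioned by day and by 10,000 customers.
by_date_and_customer = average_file_size_mb(table_size_gb=50, partition_count=365 * 10_000)

print(round(by_date, 1), round(by_date_and_customer, 3))
```

Under these assumptions the first layout averages roughly 140 MB per file, while the second averages about 0.014 MB, which is the kind of small-file symptom the list above describes.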
SQL and warehouse performance
- Write set-based transformations instead of row-by-row logic where possible.
- Push filters close to the source or staging layer.
- Avoid selecting unused columns in large transformations.
- Use clear join keys and validate join cardinality.
- Materialize expensive repeated logic when the scenario justifies it.
- Separate staging, transformation, and serving objects for maintainability.
- Check whether slow performance is caused by data design, query design, or resource contention.
Lifecycle, deployment, and maintainability
| Area | Checklist |
|---|---|
| Environment separation | Development, test, and production should not depend on manually edited paths or personal connections. |
| Parameterization | Pipelines and notebooks should accept environment-specific values instead of hard-coded constants. |
| Source control | Know why Git integration or versioning helps with collaboration, rollback, and review. |
| Deployment | Understand the purpose of deployment pipelines or promotion patterns across workspaces. |
| Secrets and credentials | Do not place secrets directly in notebooks or scripts. Use secure connection and credential patterns. |
| Documentation | Document table purpose, grain, ownership, refresh/load pattern, and known data quality rules. |
| Impact analysis | Use lineage and dependency review before renaming, deleting, or changing tables. |
| Operational ownership | Know who responds to failed runs, bad data, source changes, and access requests. |
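
The parameterization row can be sketched as a small environment lookup that a notebook or script reads instead of hard-coded names. This is a plain Python illustration; the workspace names, lakehouse names, and the `layer_entity` naming convention are hypothetical, and in Fabric the environment value would typically arrive as a pipeline or notebook parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConfig:
    workspace: str
    lakehouse: str


# Hypothetical environment names and values for illustration only.
ENVIRONMENTS = {
    "dev": EnvironmentConfig(workspace="sales-dev", lakehouse="lh_sales_dev"),
    "test": EnvironmentConfig(workspace="sales-test", lakehouse="lh_sales_test"),
    "prod": EnvironmentConfig(workspace="sales-prod", lakehouse="lh_sales_prod"),
}


def qualified_table(environment, layer, entity):
    """Build a table name from parameters; an unknown environment raises KeyError."""
    config = ENVIRONMENTS[environment]
    return f"{config.lakehouse}.{layer}_{entity}"


dev_table = qualified_table("dev", "silver", "orders")
prod_table = qualified_table("prod", "silver", "orders")
print(dev_table, prod_table)
```

Secrets do not belong in a structure like this. It should hold only non-sensitive names, with credentials kept in secure connections.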
Common weak areas and traps
| Trap | Why it hurts exam readiness | Better habit |
|---|---|---|
| Memorizing UI clicks only | DP-700 scenarios test judgment, not just navigation | Learn the purpose of each Fabric artifact and when to use it. |
| Confusing workspace access with data access | A user may be able to see an item yet lack permission to query some of its data, and access to data can be granted separately from workspace access | Trace access at workspace, item, and data levels. |
| Treating lakehouse and warehouse as interchangeable | They support different engineering and consumption patterns | Choose based on workload, skill set, modeling, and transformation needs. |
| Using append for every ingestion job | Corrections and updates create duplicates or stale data | Use keys, watermarks, merge/upsert logic, or snapshot handling. |
| Advancing a watermark before validation | Failed or partial loads can cause permanent gaps | Commit watermarks only after the batch is verified. |
| Ignoring schema drift | Source changes can silently break curated outputs | Add schema checks and controlled evolution. |
| Over-partitioning | Too many partitions can create small-file and management problems | Partition for common filters and data volume, not every column. |
| Copying data unnecessarily | Duplication increases storage, governance, and freshness problems | Consider shortcuts or direct integration patterns when appropriate. |
| Hiding orchestration inside notebooks | Operations teams lose visibility into dependencies and failures | Use pipelines for multi-step control flow. |
| Equating pipeline success with data quality | A job can complete while producing wrong numbers | Add row counts, duplicate checks, null checks, and business validations. |
| Hard-coding environment values | Deployment becomes fragile | Parameterize workspace, lakehouse, table, path, and connection values. |
| Skipping lineage review | Changes can break downstream reports and semantic models | Review dependencies before changing shared assets. |
Scenario practice prompts
Use these prompts to test whether you can reason through DP-700-style scenarios.
Scenario 1: Daily sales ingestion
A sales system exports daily files. Business users need updated reports each morning. Files may be resent with corrections.
Can you answer?
- Would you treat the files as append-only, full snapshots, or correction-capable inputs?
- Where would you land raw files?
- How would you prevent duplicate sales records?
- What validation checks would you run before publishing curated tables?
- How would you alert the team if the export is missing or malformed?
- Would the serving layer be a lakehouse table, warehouse table, semantic model, or combination?
Scenario 2: Slow notebook transformation
A notebook that joins large order and customer datasets has become slow and unreliable.
Can you answer?
- Which logs or monitoring views would you inspect first?
- Could the issue be skew, shuffle, unnecessary columns, poor partitioning, or small files?
- Can filters be applied earlier?
- Should a reusable intermediate table be materialized?
- Would SQL be clearer for part of the transformation?
- How would you prove the optimized output is still correct?
Scenario 3: Secure curated reporting
Analysts should see only curated sales metrics, not raw customer data.
Can you answer?
- Which workspace and item permissions are needed?
- Where should raw data live relative to curated data?
- Should analysts query a warehouse, lakehouse SQL endpoint, or semantic model?
- How are sensitive columns excluded, masked, or controlled?
- How would lineage show the relationship between raw and curated assets?
- What happens when a new analyst joins the team?
Scenario 4: Source schema change
A source system adds a nullable column and changes the type of an existing field.
Can you answer?
- Which ingestion or transformation step detects the change?
- Should the pipeline fail fast or tolerate the change?
- How does the bronze layer preserve the source state?
- What changes are needed in silver and gold tables?
- Which downstream reports or semantic models are affected?
- How would you prevent silent incorrect results?
Final-week DP-700 checklist
| Final-review task | Done |
|---|---|
| Compare this checklist with the current Microsoft DP-700 skills outline and mark any missing official topics for review. | [ ] |
| Build or rehearse one end-to-end Fabric data engineering flow: ingest, transform, validate, publish, and monitor. | [ ] |
| Practice choosing between lakehouse, warehouse, pipeline, Dataflow Gen2, notebook, shortcut, and semantic model. | [ ] |
| Review workspace roles, item permissions, data permissions, credentials, and lineage. | [ ] |
| Rehearse incremental load, watermark, deduplication, and merge/upsert scenarios. | [ ] |
| Review Spark troubleshooting: logs, skew, shuffle, small files, partitioning, and failed notebook runs. | [ ] |
| Review SQL modeling: fact grain, dimensions, keys, views vs tables, and curated serving layers. | [ ] |
| Practice reading error messages from pipeline, notebook, dataflow, and query scenarios. | [ ] |
| Create a one-page artifact selection sheet in your own words. | [ ] |
| Rework missed practice questions by explaining why each wrong option is wrong. | [ ] |
| Do a mixed timed practice set rather than studying one topic at a time. | [ ] |
| Stop memorizing exact UI paths unless they reinforce an architectural concept. | [ ] |
Final readiness self-check
| If asked to… | You are ready when you can… |
|---|---|
| Design a Fabric data engineering solution | Select artifacts, data layers, ingestion pattern, transformation tool, security model, and monitoring approach. |
| Fix a failed pipeline | Isolate the failed activity, inspect credentials and parameters, read run outputs, and propose a safe retry strategy. |
| Improve a slow workload | Identify whether the cause is data layout, query design, Spark behavior, source bottleneck, or capacity pressure. |
| Secure shared data | Apply least privilege across workspace, item, and data layers without blocking legitimate analytics use. |
| Build reliable incremental ingestion | Use watermarks, validation, idempotency, and error handling to avoid gaps and duplicates. |
| Prepare curated analytics data | Model facts and dimensions, validate business rules, and expose stable tables or semantic models for reporting. |
Practical next step
After you mark weak areas, do targeted hands-on review before taking more practice questions. Build a small Fabric solution that includes a lakehouse, an ingestion pipeline, a transformation step, validation checks, and a curated serving table. Then use DP-700 practice questions to test whether you can choose the right Microsoft Fabric artifact and explain the operational tradeoffs under exam-style time pressure.