DP-700 — Microsoft Fabric Data Engineer Associate Exam Blueprint

Practical DP-700 exam blueprint for Microsoft Fabric Data Engineer Associate candidates: Fabric architecture, ingestion, transformation, governance, monitoring, and optimization readiness.

How to Use This Exam Blueprint

Use this independent checklist to prepare for the Microsoft Fabric Data Engineer Associate (DP-700) exam. It is a practical study map for DP-700, not an official Microsoft skills outline and not a claim about exact exam weights.

Mark a topic as “ready” only when you can do three things:

  1. Choose the right Fabric service or artifact for a scenario.
  2. Explain the tradeoff: security, performance, governance, cost, maintainability, or operational impact.
  3. Troubleshoot a realistic failure without relying only on memorized UI steps.

For final review, focus less on definitions and more on scenario judgment: lakehouse vs warehouse, notebook vs Dataflow Gen2, copy vs shortcut, full load vs incremental load, and successful run vs trustworthy data.

DP-700 readiness map

| Readiness area | What to review | What “ready” looks like |
| --- | --- | --- |
| Microsoft Fabric platform foundation | Workspaces, capacities, items, OneLake, tenant and workspace concepts, item relationships | You can explain how Fabric organizes data engineering work and how a data pipeline, lakehouse, warehouse, notebook, semantic model, and OneLake path relate to each other. |
| Lakehouse and OneLake architecture | Lakehouse tables, files, Delta Lake concepts, shortcuts, medallion layers, schemas, SQL analytics endpoint | You can design a lakehouse layout for raw, cleansed, and curated data and explain when to avoid copying data by using a shortcut or other integration pattern. |
| Warehouse and SQL engineering | Warehouse use cases, relational modeling, T-SQL transformations, dimensional models, views, stored logic, serving layers | You can decide when a SQL-first warehouse is better than a Spark/lakehouse-first design and can model fact and dimension tables for analytics. |
| Data ingestion | Pipelines, Copy activity, Dataflow Gen2, notebooks, source connections, gateways, parameters, incremental patterns | You can design a repeatable ingestion flow with authentication, schema handling, error handling, and full or incremental load logic. |
| Data transformation | Spark notebooks, Spark SQL, PySpark, Dataflow Gen2 transformations, warehouse SQL, Delta table writes | You can transform raw data into validated serving tables and justify the tool choice for code-first, low-code, or SQL-first work. |
| Orchestration and scheduling | Pipeline activities, dependencies, parameters, variables, retries, triggers, run history | You can chain ingestion, validation, transformation, and notification steps into an operational workflow. |
| Data quality and reliability | Idempotent loads, deduplication, validation rules, watermarks, late-arriving data, schema drift, error tables | You can make a pipeline safe to rerun and can detect when a “successful” run produced incomplete or invalid data. |
| Security and governance | Workspace roles, item permissions, data permissions, sensitivity labels, lineage, endorsements, gateway credentials | You can apply least privilege and explain the difference between access to a Fabric workspace, access to an item, and access to the data inside that item. |
| Monitoring and troubleshooting | Monitoring hub, pipeline run details, Spark logs, refresh/run failures, capacity signals, lineage, query diagnostics | You can identify whether a failure is caused by credentials, schema mismatch, source limits, Spark performance, SQL logic, or capacity pressure. |
| Performance optimization | Partitioning, file size management, predicate pushdown, shuffle reduction, query design, table maintenance, workload scheduling | You can improve a slow load or query using evidence rather than guessing. |
| Deployment and lifecycle | Git integration concepts, deployment pipelines, workspace promotion, parameterization, environment separation | You can explain how to move data engineering artifacts from development to test or production with minimal manual rework. |
| Analytics handoff | Semantic models, Direct Lake-style serving concepts where appropriate, Power BI consumption patterns, curated gold layer | You can prepare data so downstream analysts and reports consume governed, documented, reliable tables. |

Core service and artifact selection

| Scenario cue | Strong candidate choice | Why it fits | Common trap |
| --- | --- | --- | --- |
| Need a file-based analytical store with Spark transformations | Lakehouse | Supports files, tables, Delta patterns, notebooks, and flexible data engineering | Treating every lakehouse table like a fully modeled warehouse table from day one |
| Need SQL-first relational analytics and curated dimensional structures | Warehouse | Better fit for SQL developers, relational modeling, and serving structured analytical data | Using Spark notebooks for transformations that are simpler and clearer in SQL |
| Need to orchestrate multiple steps with dependencies | Data pipeline | Coordinates copy, notebook, dataflow, validation, branching, and retry logic | Hiding orchestration inside a long notebook with no clear operational visibility |
| Need low-code data shaping from common sources | Dataflow Gen2 | Useful for visual transformation, mapping, and repeatable data preparation | Choosing it for highly custom code-heavy logic better suited to notebooks |
| Need Python, PySpark, custom libraries, or complex data engineering logic | Notebook or Spark job pattern | Supports code-first transformation and advanced processing | Using notebooks without parameterization, logging, or rerun safety |
| Need to expose curated data to reporting | Semantic model or curated serving tables | Separates engineering from consumption and supports governed analytics | Pointing reports directly at raw or unstable tables |
| Need to reference data without physically copying it | Shortcut or equivalent integration pattern | Reduces duplication and can simplify lake architecture | Forgetting that permissions, source availability, and governance still matter |
| Need near-real-time or event-oriented processing | Fabric real-time or event-oriented item, when in scope for the solution | Fits telemetry, streams, and time-sensitive ingestion patterns | Forcing a batch pipeline onto a streaming requirement without latency analysis |
| Need environment promotion | Deployment pipeline, Git integration, parameters | Supports repeatable movement across workspaces or stages | Hard-coding source paths, workspace names, or credentials |

Can you do this? High-value DP-700 skills

Fabric architecture and platform judgment

  • Explain the role of OneLake in a Fabric data estate.
  • Distinguish a workspace, capacity-backed environment, item, lakehouse, warehouse, pipeline, notebook, and semantic model.
  • Choose between lakehouse, warehouse, Dataflow Gen2, pipeline, notebook, and shortcut based on a scenario.
  • Design a workspace layout for development, test, and production without hard-coding environment-specific values.
  • Explain how lineage helps troubleshoot dependencies and downstream impact.
  • Identify which artifact owns storage, which artifact transforms data, and which artifact serves data.
  • Recognize when a successful pipeline run still requires data validation before publishing results.

Lakehouse, OneLake, and Delta readiness

  • Organize data into raw, cleaned, and curated zones or medallion-style layers.
  • Explain the difference between files and managed tables in a lakehouse-style design.
  • Describe why Delta Lake concepts matter for reliability, schema management, and analytical reads.
  • Use partitioning deliberately rather than by habit.
  • Recognize small-file, skew, and over-partitioning symptoms.
  • Explain when to preserve raw source data unchanged.
  • Implement deduplication and upsert logic using business keys and timestamps.
  • Handle source schema changes without silently breaking downstream tables.
  • Explain how shortcuts can reduce duplication and where they add dependency risk.
  • Validate row counts, null rates, duplicate counts, and referential assumptions across layers.
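The validation bullet above can be made concrete with a small, tool-agnostic sketch. It uses plain Python lists of dictionaries rather than Spark DataFrames, and the names `profile`, `orphans`, and the sample rows are hypothetical illustrations, not Fabric APIs.

```python
def profile(rows, key, required):
    """Return row count, duplicate-key count, and null rate per required column."""
    total = len(rows)
    keys = [r[key] for r in rows]
    null_rates = {
        col: (sum(1 for r in rows if r.get(col) is None) / total if total else 0.0)
        for col in required
    }
    return {
        "row_count": total,
        "duplicate_keys": total - len(set(keys)),
        "null_rates": null_rates,
    }


def orphans(rows, fk, parent_keys):
    """Return foreign-key values that are present but missing from the parent set."""
    return sorted({r[fk] for r in rows if r[fk] is not None and r[fk] not in parent_keys})


# Illustrative layer contents (assumed data, not from a real source).
orders = [
    {"OrderId": 1, "CustomerId": "C1"},
    {"OrderId": 1, "CustomerId": "C1"},
    {"OrderId": 2, "CustomerId": None},
    {"OrderId": 3, "CustomerId": "C9"},
]
report = profile(orders, key="OrderId", required=["CustomerId"])
missing_customers = orphans(orders, "CustomerId", parent_keys={"C1"})
```

Running the same profile on each layer and comparing the reports is one way to notice rows lost or duplicated between layers.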

Warehouse and SQL engineering readiness

  • Define the grain of a fact table before building it.
  • Choose star schema patterns for reporting-friendly curated data.
  • Distinguish business keys, surrogate keys, natural keys, and composite keys.
  • Handle slowly changing dimension requirements at a conceptual level.
  • Choose between a view and a materialized/physical table based on performance, freshness, and maintainability.
  • Write SQL transformations that are clear, testable, and rerunnable.
  • Avoid building reports directly on staging tables unless the scenario explicitly supports it.
  • Explain how SQL serving layers relate to lakehouse and warehouse choices.
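Defining the grain of a fact table also gives you something to test. The sketch below assumes a hypothetical order-line grain of (`OrderId`, `LineNo`) and checks that no combination appears twice. It is written in plain Python for illustration; in a warehouse the same idea is usually a `GROUP BY ... HAVING COUNT(*) > 1` query.

```python
from collections import Counter


def grain_violations(rows, grain):
    """Return grain-key combinations that appear more than once."""
    counts = Counter(tuple(r[c] for c in grain) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}


# Illustrative fact rows; the third row repeats an existing order line.
fact_sales = [
    {"OrderId": 1, "LineNo": 1, "Amount": 10},
    {"OrderId": 1, "LineNo": 2, "Amount": 4},
    {"OrderId": 1, "LineNo": 2, "Amount": 4},
]
violations = grain_violations(fact_sales, grain=("OrderId", "LineNo"))
```

An empty result means the table holds at most one row per declared grain. Any entry points at either a duplicate load or a grain that was defined too coarsely.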

Ingestion and orchestration readiness

  • Choose full load, append load, incremental load, or change-based load based on source capability and business requirements.
  • Design a watermark strategy for incremental ingestion.
  • Parameterize source path, destination path, date range, environment, and table name where appropriate.
  • Configure connection and credential patterns securely.
  • Recognize when an on-premises or private source may require a gateway or equivalent connectivity pattern.
  • Add retry, timeout, failure branch, and notification logic to operational pipelines.
  • Make loads idempotent so a retry does not duplicate data.
  • Capture rejected rows or invalid records for review.
  • Separate ingestion, transformation, validation, and publishing steps when operational clarity matters.
  • Use run history and activity outputs to determine where a pipeline failed.
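Capturing rejected rows, as listed above, can be sketched without any Fabric-specific API. The function name `split_valid_rejected`, the rule names, and the `_rejected_because` column are assumptions made for this example. A real pipeline might write the rejected set to an error table instead of a list.

```python
def split_valid_rejected(rows, rules):
    """Route each row to valid or rejected; rejected rows carry the failed rule names."""
    valid, rejected = [], []
    for row in rows:
        failures = [name for name, check in rules.items() if not check(row)]
        if failures:
            rejected.append({**row, "_rejected_because": failures})
        else:
            valid.append(row)
    return valid, rejected


# Hypothetical validation rules keyed by a readable name.
rules = {
    "has_order_id": lambda r: r.get("OrderId") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("Amount"), (int, float))
    and r["Amount"] >= 0,
}
incoming = [
    {"OrderId": 1, "Amount": 20.0},
    {"OrderId": None, "Amount": 5.0},
    {"OrderId": 3, "Amount": -1.0},
]
valid, rejected = split_valid_rejected(incoming, rules)
```

Keeping the failure reason next to each rejected row makes later review faster than re-deriving why the row was dropped.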

Transformation readiness

  • Choose Spark/PySpark when transformations need distributed processing or custom code.
  • Choose Dataflow Gen2 when a visual, low-code transformation is more maintainable.
  • Choose SQL when the logic is relational, set-based, and close to the serving model.
  • Convert semi-structured data into structured tables.
  • Normalize date, time, currency, and identifier fields.
  • Enforce data types before data reaches curated tables.
  • Implement deduplication by key and precedence rule.
  • Detect late-arriving records and decide whether to restate downstream tables.
  • Validate transformation outputs against source totals and expected business rules.
  • Document assumptions that downstream report authors depend on.
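Deduplication by key and precedence rule, mentioned in the list above, is easiest to reason about with a tiny example. This sketch assumes the precedence rule is "highest `Version` wins". The function and column names are hypothetical, and in Spark the equivalent is usually a window ordered by the precedence column.

```python
def deduplicate(rows, key, order_by):
    """Keep one row per key, preferring the highest order_by value.

    On a tie, the row seen first is kept.
    """
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())


# Illustrative input with two versions of customer C1.
customers = [
    {"CustomerId": "C1", "Email": "old@example.com", "Version": 1},
    {"CustomerId": "C2", "Email": "b@example.com", "Version": 1},
    {"CustomerId": "C1", "Email": "new@example.com", "Version": 2},
]
deduped = deduplicate(customers, key="CustomerId", order_by="Version")
```

Unlike dropping duplicates on the key alone, an explicit precedence rule makes the surviving row predictable across reruns.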

Security, governance, and access control readiness

  • Explain least privilege for Fabric workspaces and data artifacts.
  • Distinguish workspace roles from item-level permissions and data-level permissions.
  • Identify when SQL object-level security or data access controls are needed in addition to workspace access.
  • Protect credentials used by pipelines, dataflows, notebooks, and gateways.
  • Apply sensitivity and endorsement concepts appropriately.
  • Use lineage to understand downstream effects before changing or deleting data assets.
  • Recognize governance risks from unmanaged shortcuts, copied data, and duplicated curated tables.
  • Explain why production data engineering work should avoid personal credentials where possible.
  • Review sharing decisions from both convenience and data exposure perspectives.

Monitoring, troubleshooting, and optimization readiness

  • Use pipeline run details to locate the failed activity and inspect error messages.
  • Use notebook and Spark logs to identify failed cells, package issues, executor errors, skew, or memory pressure.
  • Check source authentication and destination permissions before rewriting transformation logic.
  • Diagnose schema mismatch, missing columns, changed data types, and malformed files.
  • Explain why a query may be slow due to file layout, partitioning, joins, filters, or workload concurrency.
  • Identify small-file issues and when table maintenance or compaction-style actions may help.
  • Use filters and column pruning to reduce unnecessary reads.
  • Avoid expensive transformations in the serving path when they can be precomputed.
  • Schedule heavy jobs to reduce contention where the business allows.
  • Use monitoring signals to distinguish data failure, compute failure, and capacity pressure.

Medallion and serving-layer checklist

| Layer | Purpose | Candidate checks |
| --- | --- | --- |
| Bronze/raw | Preserve source data with minimal transformation | Can you reload from source? Did you capture load time, source file/table name, and ingestion batch metadata where useful? |
| Silver/cleansed | Standardize, type, deduplicate, validate | Did you enforce schemas, remove duplicates, handle invalid records, and apply consistent business keys? |
| Gold/curated | Serve analytics-ready facts, dimensions, aggregates, or domain tables | Is the grain clear? Are measures and dimensions report-friendly? Are joins predictable? |
| Semantic/reporting layer | Provide governed consumption for analysts and business users | Are table names, relationships, permissions, and refresh/serving choices appropriate? |

Ingestion decision checks

| Question | If yes, consider | If no, consider |
| --- | --- | --- |
| Does the source support reliable change detection? | Incremental or change-based load with watermark/checkpoint logic | Full load, snapshot comparison, or source-side export pattern |
| Is the source large or slow to extract? | Incremental copy, partitioned extraction, staged loads | Simpler full load may be acceptable for small reference data |
| Is transformation mostly visual and repeatable? | Dataflow Gen2 | Notebook or SQL if logic is complex or code-heavy |
| Do you need multiple dependent steps? | Pipeline orchestration | Single dataflow or notebook schedule may be enough for simple jobs |
| Is data already available in a compatible cloud location? | Shortcut or direct integration pattern | Copy into OneLake/lakehouse when isolation or performance requires it |
| Are credentials or network access complex? | Connection, gateway, managed access pattern, or service identity approach | Standard cloud connector may be sufficient |
| Must the process be safe to rerun? | Idempotent design, staging, merge/upsert, batch IDs | Append-only may be acceptable only for immutable event data |

Transformation patterns to recognize

Incremental watermark pattern

A strong DP-700 candidate can explain the purpose of a watermark even if syntax differs by tool.

-- Conceptual incremental filter pattern (SourceTable is an illustrative name)
SELECT *
FROM SourceTable
WHERE SourceModifiedDate >  @LastSuccessfulWatermark
  AND SourceModifiedDate <= @CurrentWatermark;

Readiness checks:

  • You know where the previous successful watermark is stored.
  • You know when the new watermark is committed.
  • You avoid advancing the watermark before validation succeeds.
  • You have a plan for late-arriving records.
  • You can rerun a failed batch without duplicating rows.
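The readiness checks above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Fabric API code: `run_incremental_batch`, the `state` dictionary, and the `validate` callback are hypothetical names, and a real solution would keep the watermark in a control table or similar durable store.

```python
from datetime import datetime


def run_incremental_batch(rows, state, current_watermark, validate):
    """Select rows in (last watermark, current watermark] and commit the
    new watermark only if validation passes."""
    last = state["watermark"]
    batch = [r for r in rows if last < r["modified"] <= current_watermark]
    if not validate(batch):
        # Leave the watermark untouched so the same window can be rerun.
        return []
    state["watermark"] = current_watermark
    return batch


# Illustrative source rows and starting watermark.
source_rows = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 2)},
    {"id": 3, "modified": datetime(2024, 1, 3)},
]
state = {"watermark": datetime(2024, 1, 1)}

# A failed validation keeps the old watermark.
failed = run_incremental_batch(source_rows, state, datetime(2024, 1, 3), lambda b: False)
watermark_after_failure = state["watermark"]

# A later successful run picks up the same window.
loaded = run_incremental_batch(source_rows, state, datetime(2024, 1, 3), lambda b: len(b) > 0)
```

Because the watermark moves only after validation, a failed run leaves no gap: the next attempt reads the same window again.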

PySpark table transformation pattern

You do not need to memorize every API call, but you should recognize the intent of common Spark/Delta operations.

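# `spark` is the session that a Fabric notebook provides.
# Read the bronze Delta table.
orders = spark.read.format("delta").table("bronze_orders")

clean_orders = (
    orders
    # Keep one row per OrderId; which duplicate survives is not deterministic.
    .dropDuplicates(["OrderId"])
    # Rename only; the column keeps its original data type until it is cast.
    .withColumnRenamed("OrderDateText", "OrderDate")
)

# Overwrite replaces the full contents of silver_orders on every run.
clean_orders.write.format("delta").mode("overwrite").saveAsTable("silver_orders")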

Readiness checks:

  • Can you explain what table is read and what table is written?
  • Can you identify whether the write pattern is append, overwrite, or merge/upsert?
  • Can you explain why overwrite may be risky for large or production tables?
  • Can you add validation before publishing the result?
  • Can you parameterize table names for different environments?
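For the last check, one possible way to keep environment-specific names out of transformation code is a small configuration object. Everything here is an assumption made for illustration: the `EnvConfig` class, the lakehouse names `lh_dev` and `lh_prod`, and the `lakehouse.table` naming used to build the identifier. In practice the values would more likely arrive as pipeline or notebook parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    lakehouse: str
    source_table: str
    target_table: str


# Hypothetical per-environment settings; not constants to copy into a notebook.
CONFIGS = {
    "dev": EnvConfig("lh_dev", "bronze_orders", "silver_orders"),
    "prod": EnvConfig("lh_prod", "bronze_orders", "silver_orders"),
}


def qualified(env, which):
    """Return 'lakehouse.table' for the source or target table of an environment."""
    cfg = CONFIGS[env]
    table = cfg.source_table if which == "source" else cfg.target_table
    return f"{cfg.lakehouse}.{table}"


dev_source = qualified("dev", "source")
prod_target = qualified("prod", "target")
```

The transformation logic then refers only to `qualified(env, ...)`, so promoting it to another environment means changing one parameter rather than editing table names.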

Merge/upsert reasoning

You should be able to explain when an upsert is safer than a blind append.

| Situation | Better pattern | Why |
| --- | --- | --- |
| New immutable event rows | Append | Events are not expected to change after arrival |
| Source sends corrections for existing records | Merge/upsert | Existing rows may need updates |
| Source sends full daily snapshot | Replace snapshot or compare-and-merge | Avoid duplicate active records |
| Dimension attributes change over time | Type 1 or Type 2 dimension approach | Business requirement determines whether to preserve history |
| Deletions must be reflected | Change detection with delete handling | Append-only loads will leave stale rows |
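The difference between a blind append and an upsert shows up as soon as a batch is replayed. The sketch below models a table as a Python list and is only an illustration of the reasoning. The function names are hypothetical, and in a lakehouse the upsert would typically be a Delta merge rather than a dictionary update.

```python
def append(target, batch):
    """Blind append: every incoming row is added, even if its key already exists."""
    return target + batch


def upsert(target, batch, key="OrderId"):
    """Merge by key: update rows whose key matches, insert the rest."""
    merged = {row[key]: row for row in target}
    for row in batch:
        merged[row[key]] = row
    return list(merged.values())


# Illustrative data: the batch corrects order 1 and adds order 2.
target = [{"OrderId": 1, "Amount": 10}]
batch = [{"OrderId": 1, "Amount": 12}, {"OrderId": 2, "Amount": 5}]

# Simulate a retry by applying the same batch twice.
appended_twice = append(append(target, batch), batch)
upserted_twice = upsert(upsert(target, batch), batch)
```

After the replay, the appended table holds five rows, including the stale and duplicated versions of order 1, while the upserted table holds one current row per key.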

Security and governance decision checks

| Scenario | What to think through |
| --- | --- |
| A developer needs to edit a notebook but not administer the workspace | Workspace role selection, item permissions, and separation of duties |
| A reporting user needs to view curated data only | Semantic model permissions, warehouse/lakehouse data permissions, and avoiding raw data exposure |
| A pipeline connects to a production source | Credential storage, identity choice, gateway requirements, and auditability |
| A shortcut references data owned by another team | Source permissions, lineage, ownership, availability, and change coordination |
| Sensitive columns appear in raw data | Classification, access controls, masking or exclusion in curated layers, and downstream exposure |
| A dataset is certified or endorsed | Ownership, quality expectations, lineage, and controlled change process |
| Production and development share data assets | Risk of accidental modification, credential leakage, and environment contamination |

Monitoring and troubleshooting checklist

| Symptom | Likely areas to inspect | Candidate response |
| --- | --- | --- |
| Pipeline activity fails immediately | Credentials, connection, gateway, source path, permissions | Check authentication and connectivity before changing transformation code. |
| Copy succeeds but destination table is empty | Filters, parameters, source query, date range, destination mapping | Inspect activity inputs/outputs and validate source row counts. |
| Notebook fails after running for a long time | Spark logs, data skew, memory, shuffle, package dependency, bad record | Identify the failing stage or transformation and reduce data movement. |
| Schema mismatch error appears | Source column changes, data type changes, destination schema enforcement | Decide whether to update schema, handle drift, or reject the batch. |
| Query against curated table is slow | File layout, partitions, joins, filters, table size, unnecessary columns | Use pruning, precomputation, and table maintenance where appropriate. |
| Refresh/report output is wrong but pipeline succeeded | Data quality checks, business logic, duplicates, joins, late records | Treat operational success and data correctness as separate checks. |
| Users lost access after workspace change | Workspace role, item sharing, SQL/data permissions, semantic model permissions | Trace access from workspace to item to data. |
| Job performance varies by time of day | Capacity pressure, concurrency, scheduled workloads, source throttling | Review monitoring signals and schedule/resource tradeoffs. |

Performance optimization checks

Spark and lakehouse performance

  • Filter early and select only needed columns.
  • Avoid unnecessary shuffles and wide transformations.
  • Understand when joins create skew or memory pressure.
  • Use partitioning only when it supports common filters and does not create excessive small files.
  • Recognize symptoms of too many small files.
  • Use table maintenance or optimization features where appropriate for the Fabric item and table format.
  • Cache only when reused data justifies it.
  • Avoid collecting large datasets to the driver.
  • Precompute curated tables rather than repeatedly transforming raw data for every report.
  • Validate that performance improvements preserve correct results.
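Recognizing a small-file problem starts with measuring it. The sketch below summarizes a list of file sizes against a threshold. The function name is hypothetical, and the 32 MiB default is an arbitrary value chosen for illustration, not a Fabric or Delta recommendation. File sizes would come from whatever listing your environment provides.

```python
MIB = 1024 * 1024


def small_file_summary(sizes_bytes, threshold_bytes=32 * MIB):
    """Summarize how many files fall under a size threshold.

    The default threshold is an assumption for this example only.
    """
    small = [s for s in sizes_bytes if s < threshold_bytes]
    total = len(sizes_bytes)
    return {
        "files": total,
        "small_files": len(small),
        "small_share": len(small) / total if total else 0.0,
    }


# Illustrative sizes for the files behind one table.
summary = small_file_summary([2 * MIB, 4 * MIB, 1 * MIB, 256 * MIB])
```

Capturing the same summary before and after a maintenance or compaction step gives you evidence that the change helped, in line with the last bullet above.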

SQL and warehouse performance

  • Write set-based transformations instead of row-by-row logic where possible.
  • Push filters close to the source or staging layer.
  • Avoid selecting unused columns in large transformations.
  • Use clear join keys and validate join cardinality.
  • Materialize expensive repeated logic when the scenario justifies it.
  • Separate staging, transformation, and serving objects for maintainability.
  • Check whether slow performance is caused by data design, query design, or resource contention.

Lifecycle, deployment, and maintainability

| Area | Checklist |
| --- | --- |
| Environment separation | Development, test, and production should not depend on manually edited paths or personal connections. |
| Parameterization | Pipelines and notebooks should accept environment-specific values instead of hard-coded constants. |
| Source control | Know why Git integration or versioning helps with collaboration, rollback, and review. |
| Deployment | Understand the purpose of deployment pipelines or promotion patterns across workspaces. |
| Secrets and credentials | Do not place secrets directly in notebooks or scripts. Use secure connection and credential patterns. |
| Documentation | Document table purpose, grain, ownership, refresh/load pattern, and known data quality rules. |
| Impact analysis | Use lineage and dependency review before renaming, deleting, or changing tables. |
| Operational ownership | Know who responds to failed runs, bad data, source changes, and access requests. |

Common weak areas and traps

| Trap | Why it hurts exam readiness | Better habit |
| --- | --- | --- |
| Memorizing UI clicks only | DP-700 scenarios test judgment, not just navigation | Learn the purpose of each Fabric artifact and when to use it. |
| Confusing workspace access with data access | A user may be able to see an item but not query its data, and data access can also be granted without a workspace role | Trace access at workspace, item, and data levels. |
| Treating lakehouse and warehouse as interchangeable | They support different engineering and consumption patterns | Choose based on workload, skill set, modeling, and transformation needs. |
| Using append for every ingestion job | Corrections and updates create duplicates or stale data | Use keys, watermarks, merge/upsert logic, or snapshot handling. |
| Advancing a watermark before validation | Failed or partial loads can cause permanent gaps | Commit watermarks only after the batch is verified. |
| Ignoring schema drift | Source changes can silently break curated outputs | Add schema checks and controlled evolution. |
| Over-partitioning | Too many partitions can create small-file and management problems | Partition for common filters and data volume, not every column. |
| Copying data unnecessarily | Duplication increases storage, governance, and freshness problems | Consider shortcuts or direct integration patterns when appropriate. |
| Hiding orchestration inside notebooks | Operations teams lose visibility into dependencies and failures | Use pipelines for multi-step control flow. |
| Equating pipeline success with data quality | A job can complete while producing wrong numbers | Add row counts, duplicate checks, null checks, and business validations. |
| Hard-coding environment values | Deployment becomes fragile | Parameterize workspace, lakehouse, table, path, and connection values. |
| Skipping lineage review | Changes can break downstream reports and semantic models | Review dependencies before changing shared assets. |

Scenario practice prompts

Use these prompts to test whether you can reason like a DP-700 candidate.

Scenario 1: Daily sales ingestion

A sales system exports daily files. Business users need updated reports each morning. Files may be resent with corrections.

Can you answer?

  • Would you treat the files as append-only, full snapshots, or correction-capable inputs?
  • Where would you land raw files?
  • How would you prevent duplicate sales records?
  • What validation checks would you run before publishing curated tables?
  • How would you alert the team if the export is missing or malformed?
  • Would the serving layer be a lakehouse table, warehouse table, semantic model, or combination?

Scenario 2: Slow notebook transformation

A notebook that joins large order and customer datasets has become slow and unreliable.

Can you answer?

  • Which logs or monitoring views would you inspect first?
  • Could the issue be skew, shuffle, unnecessary columns, poor partitioning, or small files?
  • Can filters be applied earlier?
  • Should a reusable intermediate table be materialized?
  • Would SQL be clearer for part of the transformation?
  • How would you prove the optimized output is still correct?
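For the last question, one simple way to show that an optimized transformation still produces the same result is an order-insensitive comparison of the two outputs. The sketch below works on small in-memory samples with flat, hashable values. The function name and sample rows are hypothetical, and for large tables you would compare counts and aggregates instead of every row.

```python
from collections import Counter


def same_rows(a, b):
    """True when both outputs hold the same rows with the same multiplicity,
    regardless of row order."""

    def as_key(row):
        # Sort by column name so dictionaries with different key order match.
        return tuple(sorted(row.items()))

    return Counter(map(as_key, a)) == Counter(map(as_key, b))


# Illustrative outputs from the original and the optimized notebook.
baseline = [{"OrderId": 1, "Total": 10}, {"OrderId": 2, "Total": 7}]
optimized = [{"Total": 7, "OrderId": 2}, {"OrderId": 1, "Total": 10}]
with_duplicate = optimized + [{"OrderId": 1, "Total": 10}]

matches = same_rows(baseline, optimized)
duplicate_detected = not same_rows(baseline, with_duplicate)
```

Using a multiset rather than a set matters here: a join change that silently duplicates rows would pass a set comparison but fails this one.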

Scenario 3: Secure curated reporting

Analysts should see only curated sales metrics, not raw customer data.

Can you answer?

  • Which workspace and item permissions are needed?
  • Where should raw data live relative to curated data?
  • Should analysts query a warehouse, lakehouse SQL endpoint, or semantic model?
  • How are sensitive columns excluded, masked, or controlled?
  • How would lineage show the relationship between raw and curated assets?
  • What happens when a new analyst joins the team?

Scenario 4: Source schema change

A source system adds a nullable column and changes the type of an existing field.

Can you answer?

  • Which ingestion or transformation step detects the change?
  • Should the pipeline fail fast or tolerate the change?
  • How does the bronze layer preserve the source state?
  • What changes are needed in silver and gold tables?
  • Which downstream reports or semantic models are affected?
  • How would you prevent silent incorrect results?
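The scenario above, one added column and one changed type, can be detected by comparing the expected schema with the observed one before loading. This sketch represents schemas as plain `{column: type}` dictionaries. The function name, the column names, and the fail-fast policy are assumptions for illustration, not prescribed Fabric behavior.

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report drift."""
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        c for c in set(expected) & set(observed) if expected[c] != observed[c]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Illustrative schemas: Channel is new, Amount changed type.
expected = {"OrderId": "int", "Amount": "decimal", "OrderDate": "date"}
observed = {
    "OrderId": "int",
    "Amount": "string",
    "OrderDate": "date",
    "Channel": "string",
}
drift = diff_schema(expected, observed)

# One possible policy: tolerate added columns, fail fast on removed or retyped ones.
should_fail = bool(drift["removed"] or drift["retyped"])
```

Whatever policy you choose, making it explicit is what prevents silent incorrect results: the batch either passes a known rule or stops with a named reason.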

Final-week DP-700 checklist

| Final-review task | Done |
| --- | --- |
| Compare this checklist with the current Microsoft DP-700 skills outline and mark any missing official topics for review. | [ ] |
| Build or rehearse one end-to-end Fabric data engineering flow: ingest, transform, validate, publish, and monitor. | [ ] |
| Practice choosing between lakehouse, warehouse, pipeline, Dataflow Gen2, notebook, shortcut, and semantic model. | [ ] |
| Review workspace roles, item permissions, data permissions, credentials, and lineage. | [ ] |
| Rehearse incremental load, watermark, deduplication, and merge/upsert scenarios. | [ ] |
| Review Spark troubleshooting: logs, skew, shuffle, small files, partitioning, and failed notebook runs. | [ ] |
| Review SQL modeling: fact grain, dimensions, keys, views vs tables, and curated serving layers. | [ ] |
| Practice reading error messages from pipeline, notebook, dataflow, and query scenarios. | [ ] |
| Create a one-page artifact selection sheet in your own words. | [ ] |
| Rework missed practice questions by explaining why each wrong option is wrong. | [ ] |
| Do a mixed timed practice set rather than studying one topic at a time. | [ ] |
| Stop memorizing exact UI paths unless they reinforce an architectural concept. | [ ] |

Final readiness self-check

| If asked to… | You are ready when you can… |
| --- | --- |
| Design a Fabric data engineering solution | Select artifacts, data layers, ingestion pattern, transformation tool, security model, and monitoring approach. |
| Fix a failed pipeline | Isolate the failed activity, inspect credentials and parameters, read run outputs, and propose a safe retry strategy. |
| Improve a slow workload | Identify whether the cause is data layout, query design, Spark behavior, source bottleneck, or capacity pressure. |
| Secure shared data | Apply least privilege across workspace, item, and data layers without blocking legitimate analytics use. |
| Build reliable incremental ingestion | Use watermarks, validation, idempotency, and error handling to avoid gaps and duplicates. |
| Prepare curated analytics data | Model facts and dimensions, validate business rules, and expose stable tables or semantic models for reporting. |

Practical next step

After you mark weak areas, do targeted hands-on review before taking more practice questions. Build a small Fabric solution that includes a lakehouse, an ingestion pipeline, a transformation step, validation checks, and a curated serving table. Then use DP-700 practice questions to test whether you can choose the right Microsoft Fabric artifact and explain the operational tradeoffs under exam-style time pressure.
