Browse Certification Practice Tests by Exam Family

Microsoft DP-700: Ingest and Transform Data

Try 10 focused Microsoft DP-700 questions on Ingest and Transform Data, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Try Microsoft DP-700 on Web
View full Microsoft DP-700 practice page

Topic snapshot

  • Exam route: Microsoft DP-700
  • Topic area: Ingest and Transform Data
  • Blueprint weight: 33%
  • Page purpose: Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Ingest and Transform Data for Microsoft DP-700. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

  • First attempt: Answer without checking the explanation first. Record the fact, rule, calculation, or judgment point that controlled your answer.
  • Review: Read the explanation even when you were correct. Record why the best answer is stronger than the closest distractor.
  • Repair: Repeat only missed or uncertain items after a short break. Record the pattern behind misses, not the answer letter.
  • Transfer: Return to mixed practice once the topic feels stable. Record whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 33% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Ingest and Transform Data

A manufacturing company sends sensor telemetry to Azure Event Hubs. You must evaluate proposed Microsoft Fabric streaming load designs. A valid design must provide all of the following:

  • Events available for operational dashboards within 20 seconds of arrival.
  • Fabric-managed streaming ingestion with recovery from transient failures; no scheduled polling jobs.
  • KQL access for windowed aggregations in Real-Time Intelligence.

Which TWO designs satisfy the requirements?

Options:

  • A. Use Eventstream with an Event Hubs source and Eventhouse destination.

  • B. Use a Spark structured streaming notebook to write a Lakehouse table.

  • C. Use Dataflows Gen2 incremental refresh into a Lakehouse table.

  • D. Mirror the Event Hubs namespace to a Fabric Warehouse.

  • E. Schedule a pipeline to copy new Event Hubs data to a Warehouse.

  • F. Use Eventstream to filter events, then write valid events to Eventhouse.

Correct answers: A and F

Explanation: The required pattern is continuous streaming into a Real-Time Intelligence store that supports KQL. Eventstreams can ingest from Event Hubs and route events to Eventhouse with low latency, while optional event processing can filter or transform events before loading.

The core concept is matching the streaming load design to both freshness and consumption requirements. In Fabric, Eventstreams are the managed ingestion and event-processing capability for continuous event sources such as Azure Event Hubs. Eventhouse is the Real-Time Intelligence destination designed for high-volume, low-latency event analytics and KQL queries, including windowed aggregations. A design that routes directly to Eventhouse, or filters in Eventstream before writing to Eventhouse, preserves the continuous ingestion pattern and the required KQL consumption path.

Batch refresh and scheduled copy patterns can be useful for analytical loading, but they do not satisfy a sub-20-second streaming freshness requirement with KQL access in Eventhouse.

  • Notebook output can stream to Delta tables, but it uses custom notebook code and does not provide the required Eventhouse KQL consumption path.
  • Scheduled pipeline copy is a polling-style batch pattern, so it fails the continuous managed streaming requirement.
  • Dataflows Gen2 refresh is refresh-oriented transformation, not a continuous streaming load for sub-20-second telemetry.
  • Mirroring Event Hubs chooses the wrong Fabric capability: Fabric mirroring replicates supported databases into OneLake rather than an Event Hubs namespace, and a Warehouse does not provide the required KQL analytics path.

Question 2

Topic: Ingest and Transform Data

A Microsoft Fabric notebook uses Spark structured streaming to aggregate IoT events from an Eventstream into a Lakehouse Delta table. The business requirement is to include events that arrive up to 20 minutes after their eventTime in 10-minute aggregates.

The aggregate table is missing events from devices that reconnect 12 to 18 minutes late. The notebook contains:

df.withWatermark("eventTime", "5 minutes") \
  .groupBy(window("eventTime", "10 minutes"), "deviceId") \
  .count()

What is the best fix?

Options:

  • A. Increase the watermark to at least 20 minutes.

  • B. Window by ingestion time instead of event time.

  • C. Partition the Delta table by deviceId.

  • D. Change the trigger interval to 20 minutes.

Best answer: A

Explanation: The symptom matches a watermark that is shorter than the allowed late-arrival period. Spark structured streaming drops late records after the watermark advances past the window, so a 5-minute watermark cannot satisfy a 20-minute lateness requirement.

In Spark structured streaming, event-time windows use the watermark to decide how long to retain state for late-arriving records. The query is correctly grouping by eventTime, but withWatermark("eventTime", "5 minutes") allows only about 5 minutes of lateness before older window state can be closed and later records discarded. Because the requirement says events up to 20 minutes late must be included, the watermark duration should be increased to at least 20 minutes. This may increase state retention and resource use, but it matches the stated correctness requirement.

  • Trigger interval affects how often micro-batches run, not how long late event-time state is retained.
  • Ingestion-time windows would change the business meaning of the aggregate and would not report by the original event time.
  • Delta partitioning can help storage/query performance, but it does not recover records dropped by watermarking.
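
As a minimal illustration of the fix, a sketch of the corrected aggregation follows. It reuses the streaming DataFrame df with the eventTime and deviceId columns from the exhibit; the checkpoint path and output table name are hypothetical:

from pyspark.sql.functions import window

# Retain enough event-time state to accept records up to 20 minutes late,
# matching the stated business requirement.
late_tolerant = (
    df.withWatermark("eventTime", "20 minutes")
      .groupBy(window("eventTime", "10 minutes"), "deviceId")
      .count()
)

# Illustrative sink: emit each 10-minute window once the watermark closes it.
query = (
    late_tolerant.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "Files/checkpoints/device_counts")  # hypothetical path
        .toTable("device_counts_10min")  # hypothetical table name
)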

Question 3

Topic: Ingest and Transform Data

A Fabric workspace uses a Lakehouse bronze layer with OneLake shortcuts to mirrored sales tables. A nightly pipeline loads a Warehouse that feeds a semantic model. Analysts need only daily revenue and units by product category and sales region, and semantic model refresh should scan the fewest rows without losing that detail. Which design should you implement?

Options:

  • A. Load order-line rows with category and region columns.

  • B. Load a gold table grouped by day, category, and region.

  • C. Load a monthly table grouped by category and region.

  • D. Load a daily table grouped by customer and order.

Best answer: B

Explanation: The correct grain is the lowest level needed by downstream analytics: day, product category, and sales region. Aggregating to that grain preserves the requested detail while reducing scan volume for the Warehouse and semantic model.

The core concept is selecting the aggregation grain before loading curated data. Because analysts need daily revenue and units by category and region, the Fabric transformation should join the needed reference data, then group order-line facts into one row per day, category, and region. This creates a gold aggregate table that is still detailed enough for the required analysis but avoids carrying unnecessary order-line detail into the Warehouse. Grouping above that level, such as by month, loses required detail; grouping below it, such as by order or customer, keeps avoidable volume.

  • Order-line grain preserves unnecessary detail and increases refresh work when drill-through is not required.
  • Monthly aggregation loses the required daily level for downstream analysis.
  • Customer and order grouping uses an unrelated grain and does not match the requested category-region analytics.
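
A minimal PySpark sketch of the gold-grain load follows; the bronze table, reference table names, join keys, and column names are hypothetical placeholders for the scenario:

from pyspark.sql import functions as F

order_lines = spark.read.table("bronze_order_lines")   # hypothetical order-line fact
products = spark.read.table("dim_product")             # hypothetical product reference
regions = spark.read.table("dim_region")               # hypothetical region reference

# One row per day, product category, and sales region keeps the required detail
# while minimizing the rows scanned by the Warehouse and semantic model.
daily_sales = (
    order_lines
    .join(products, "product_id")
    .join(regions, "region_id")
    .groupBy(
        F.to_date("order_timestamp").alias("order_date"),
        "product_category",
        "sales_region",
    )
    .agg(
        F.sum("line_amount").alias("revenue"),
        F.sum("quantity").alias("units"),
    )
)

daily_sales.write.mode("overwrite").saveAsTable("gold_daily_sales")  # hypothetical gold table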

Question 4

Topic: Ingest and Transform Data

A company uses Microsoft Fabric and must choose transformation tools for two workloads.

  • Customer dimension: staged relational tables are in a Fabric Warehouse; the load runs nightly, uses set-based joins and updates, and the team supports only T-SQL.
  • Clickstream: raw nested JSON files are in a Lakehouse; the load runs hourly, requires custom Python parsing and array flattening at scale, and the team supports PySpark.

Which TWO tool choices should you recommend?

Options:

  • A. Use Dataflows Gen2 for the clickstream transformation.

  • B. Use KQL in an Eventhouse for the customer dimension.

  • C. Use a Fabric notebook with PySpark for the customer dimension.

  • D. Use a Fabric notebook with PySpark for the clickstream transformation.

  • E. Use T-SQL in the Warehouse for the customer dimension.

  • F. Use T-SQL against the Lakehouse SQL analytics endpoint for clickstream.

Correct answers: D and E

Explanation: Match the transformation tool to the data store, complexity, latency, and team skills. T-SQL is the natural fit for relational transformations inside a Fabric Warehouse. A Fabric notebook with PySpark fits complex Lakehouse file transformations that require custom Python parsing at scale.

The core decision is to choose the transformation engine closest to each data store and team skill. Fabric Warehouse workloads are best handled with T-SQL when the transformations are relational, set-based, and maintained by a SQL team. Lakehouse file transformations that require custom parsing, nested JSON handling, and distributed processing align with Fabric notebooks because Spark can read files, flatten arrays, and write curated Delta tables. Dataflows Gen2 is strong for low-code Power Query-style transformations, and KQL is for Eventhouse and Real-Time Intelligence data, but neither matches these stated constraints. Avoid moving data to a different engine when the native transformation tool already fits the workload.

  • Wrong store: KQL is optimized for Eventhouse and real-time event analytics, not relational Warehouse dimension updates.
  • Low-code mismatch: Dataflows Gen2 does not fit the stated need for custom Python parsing and scalable nested JSON handling.
  • Read-oriented endpoint: The Lakehouse SQL analytics endpoint is not the right surface for custom Python transformations over raw files.
  • Skill mismatch: PySpark for the Warehouse dimension ignores the T-SQL-only team requirement and native Warehouse processing model.
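
For the clickstream side, a minimal PySpark sketch of the kind of parsing involved is shown below; the file path and the nested event schema are assumptions made for illustration:

from pyspark.sql import functions as F

# Read raw nested JSON clickstream files from the Lakehouse (hypothetical path).
raw = spark.read.json("Files/clickstream/raw/*.json")

# Flatten a nested array of events into one row per event (hypothetical schema).
events = (
    raw.select("session_id", "user_id", F.explode("events").alias("event"))
       .select(
           "session_id",
           "user_id",
           F.col("event.type").alias("event_type"),
           F.col("event.url").alias("page_url"),
           F.col("event.timestamp").cast("timestamp").alias("event_time"),
       )
)

# Write the curated output as a Delta table for downstream consumption.
events.write.mode("append").saveAsTable("clickstream_events")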

Question 5

Topic: Ingest and Transform Data

A Fabric pipeline loads sales files into a Lakehouse every 30 minutes. Auditors must validate each run against schedule, row-count, source watermark freshness, and alerting requirements. They must not read raw files or sales table data. Which approach should you implement?

Options:

  • A. Write run metrics to an audit table and grant SELECT only there.

  • B. Share the Lakehouse and apply sensitivity labels to sales tables.

  • C. Provide Fabric audit log access only for pipeline activity events.

  • D. Add auditors as workspace Viewers so they can inspect pipeline runs.

Best answer: A

Explanation: The validation evidence should be separated from the protected sales data. Persisting schedule, volume, freshness, and alert status metrics to an audit table lets auditors verify ingestion without workspace-wide or Lakehouse data access.

For governed ingestion validation, the pipeline should record operational metadata such as scheduled time, actual run time, rows loaded, maximum source timestamp, freshness status, and alert status. Auditors can then receive object-level access, such as SELECT on only that audit table or schema. This follows least privilege because the audit evidence is available while raw files and sales tables remain protected. Fabric audit logs are useful for reviewing activities and access events, but they do not replace purpose-built ingestion validation metrics such as row counts and source watermarks.

  • Workspace Viewer access can expose more workspace items and operational details than auditors need.
  • Sensitivity labels classify and help protect data, but they do not provide ingestion schedule, volume, and freshness validation by themselves.
  • Audit logs show activity evidence but do not reliably contain business validation metrics such as row counts and source watermarks.
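
A minimal sketch of the audit-metrics pattern follows, written as a notebook step; the table names, column names, and the simple health flag are hypothetical choices for illustration:

from datetime import datetime, timezone
from pyspark.sql import functions as F

# Rows loaded in this run (hypothetical staging table holding the run's output).
loaded = spark.read.table("stg_sales_current_run")
row_count = loaded.count()
max_source_ts = loaded.agg(F.max("source_modified_at")).first()[0]  # source watermark

# Persist one metrics row per run; auditors get SELECT on this table only.
metrics = spark.createDataFrame(
    [(
        "ingest_sales_files",           # pipeline name
        datetime.now(timezone.utc),     # run completion time
        row_count,                      # rows loaded
        max_source_ts,                  # freshest source timestamp seen
        row_count > 0,                  # simple alert/health flag
    )],
    "pipeline_name string, run_completed_at timestamp, rows_loaded long, "
    "max_source_timestamp timestamp, run_healthy boolean",
)
metrics.write.mode("append").saveAsTable("ingestion_run_metrics")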

Question 6

Topic: Ingest and Transform Data

You run a Fabric notebook that loads a Lakehouse fact_orders table. The raw orders DataFrame has 18,420 rows, but the output has 17,905 rows. Both join columns are strings, and the 515 missing orders all have customer IDs that are absent from the current customer dimension. The requirement is to keep every order and use CustomerKey = -1 for unknown customers.

fact = (orders
    .join(dimCustomer,
          orders.customer_id == dimCustomer.CustomerID,
          "left")
    .filter(dimCustomer.IsCurrent == True)
    .select("OrderID", "OrderDate", "CustomerKey", "Amount"))

What is the best fix?

Options:

  • A. Cast orders.customer_id to integer in the join.

  • B. Use a full outer join for all customer rows.

  • C. Change the join type to inner.

  • D. Filter dimCustomer first, left join, and default null keys.

Best answer: D

Explanation: The filter on dimCustomer.IsCurrent is applied after the left join. For unmatched orders, the right-side columns are null, so the filter removes those rows. Filtering the dimension first and filling null keys preserves every order as required.

A left outer join preserves every left-side row, but a later filter on right-side columns can silently remove the unmatched ones. Here, orders without a matching current customer have null values for dimCustomer.IsCurrent; the expression null == True is not true, so those orders are filtered out. The current-customer condition should be applied to dimCustomer before the join, or as part of the join condition, and CustomerKey should then be defaulted to -1 for unmatched rows. This keeps the fact table grain aligned with the raw orders while still using only current dimension rows.

  • Inner join would intentionally keep only matching customers and lose the unknown-customer orders.
  • Full outer join can add customer rows with no orders and does not fix the post-join filter.
  • Casting the key is not supported by the evidence because both join columns are strings and the missing count matches absent dimension IDs.
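
One way the corrected logic could look, reusing the DataFrame names from the exhibit; this is a sketch, and it assumes CustomerKey is a column of dimCustomer:

from pyspark.sql import functions as F

# Apply the current-customer condition to the dimension before joining,
# so unmatched orders are not dropped by a post-join filter.
current_customers = dimCustomer.filter(F.col("IsCurrent") == True)

fact = (
    orders
    .join(
        current_customers,
        orders.customer_id == current_customers.CustomerID,
        "left",
    )
    # Default the surrogate key to -1 for orders with no matching current customer.
    .withColumn("CustomerKey", F.coalesce(F.col("CustomerKey"), F.lit(-1)))
    .select("OrderID", "OrderDate", "CustomerKey", "Amount")
)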

Question 7

Topic: Ingest and Transform Data

A retail company sends point-of-sale events to a Fabric workspace from Azure Event Hubs. You must continuously discard test events, keep the full stream in a Lakehouse table, and deliver the filtered stream to an Eventhouse for KQL queries. The solution must minimize custom code and support routing to both destinations from the same stream.

Which implementation should you use?

Options:

  • A. Run a Spark Structured Streaming notebook to write both destinations.

  • B. Create an Eventstream with a filter transform and two destinations.

  • C. Schedule a Dataflow Gen2 refresh to load both destinations.

  • D. Use a pipeline Copy activity with tumbling-window parameters.

Best answer: B

Explanation: Eventstreams are designed for real-time event ingestion, lightweight transformation, routing, and destination delivery in Fabric. The requirement is primarily to filter and route the same continuous stream to a Lakehouse and an Eventhouse while minimizing custom code.

Use Eventstreams when the main task is to process streaming events by routing, transforming, or delivering them to Fabric destinations. In this scenario, the stream must be filtered to remove test events, preserved in a Lakehouse, and delivered to an Eventhouse for KQL analysis. An Eventstream can connect to the event source, apply a filter transformation, and fan out to multiple destinations from the same event flow. Spark Structured Streaming is valid for code-based streaming transformations, but it is not the simplest fit when built-in routing and destination delivery are the primary requirements.

  • Notebook streaming adds custom PySpark code and operational overhead when the requirement favors managed event routing.
  • Dataflow refresh is a batch-style transformation pattern, not a continuous streaming routing solution.
  • Pipeline copy is suited to orchestrated movement, not continuous event filtering and fan-out from one live stream.

Question 8

Topic: Ingest and Transform Data

You are implementing a Fabric solution for finance reporting. Source files contain nested JSON, and the team must keep an existing PySpark notebook that parses and standardizes the files. The serving layer must let engineers load and update conformed fact and dimension tables by using T-SQL and support SQL-first warehouse queries. Which implementation should you use?

Options:

  • A. Publish the Lakehouse SQL analytics endpoint as the serving layer.

  • B. Use Dataflows Gen2 as the serving data store.

  • C. Use an Eventhouse and rewrite the transformations in KQL.

  • D. Orchestrate the PySpark notebook and load conformed tables into a Warehouse.

Best answer: D

Explanation: The data store should be chosen for the serving and access requirements, not just for the transformation language. A Fabric Warehouse is the appropriate serving store for T-SQL loading, updates, and SQL-first warehouse queries, while the existing PySpark notebook can still perform the parsing and standardization work.

Store selection and transformation-engine selection are separate design decisions in Fabric. In this scenario, PySpark is required for transformation because the team already has notebook logic for nested JSON. However, the serving layer requirements point to a Warehouse because engineers need T-SQL-based loading and updates over conformed fact and dimension tables, and analysts need SQL-first warehouse queries. The PySpark notebook can be orchestrated in a pipeline and its curated output loaded into the Warehouse. Choosing a Lakehouse only because PySpark is used would confuse the processing engine with the serving store.

  • Lakehouse endpoint is tempting because it works with Spark output, but it does not satisfy the T-SQL update-oriented warehouse serving requirement.
  • Eventhouse with KQL fits real-time or time-series analytics, not this finance star-schema serving layer.
  • Dataflows Gen2 can transform and load data, but it is not itself the serving data store.

Question 9

Topic: Ingest and Transform Data

A retail company receives point-of-sale events in Azure Event Hubs. The data engineering team must continuously deliver all events to a Lakehouse for archive and deliver only fraud-risk events to an Eventhouse table within seconds. The solution should avoid custom streaming code and should not wait for a schedule.

Which orchestration pattern should you implement?

Options:

  • A. Run a Spark structured streaming notebook from a pipeline.

  • B. Create an Eventstream with filtering and two destinations.

  • C. Create a scheduled Dataflow Gen2 refresh.

  • D. Create a scheduled pipeline with Copy activities.

Best answer: B

Explanation: An Eventstream is the best fit when the primary requirement is continuous event routing, filtering, and destination delivery. It can read from an event source, apply transformations such as filters, and send events to multiple Fabric destinations without relying on a schedule.

Fabric Eventstreams support real-time ingestion patterns where events must be routed, transformed, and delivered continuously. In this scenario, the team needs all events sent to a Lakehouse and only a filtered subset sent to an Eventhouse within seconds. That matches Eventstreams because the routing and filtering are part of the streaming flow, not a batch orchestration workflow. Dataflows Gen2 and scheduled pipelines are better for batch or refresh-based processing, while a Spark structured streaming notebook can work but adds custom code and operational complexity that the requirement explicitly avoids.

  • Scheduled refresh fails because Dataflows Gen2 are refresh-oriented and do not provide continuous event routing within seconds.
  • Pipeline schedule fails because scheduled Copy activities are batch orchestration, not always-on stream processing.
  • Streaming notebook is plausible but does not meet the requirement to avoid custom streaming code.

Question 10

Topic: Ingest and Transform Data

A retail company receives point-of-sale events through a Fabric Eventstream. The team must continuously load a curated Lakehouse Delta table. During loading, each event must be deduplicated by transactionId using an event-time watermark and enriched with a product dimension that already exists as a Lakehouse Delta table. Which loading pattern should you implement?

Options:

  • A. Use Spark Structured Streaming in a notebook.

  • B. Use Dataflows Gen2 with scheduled refresh.

  • C. Create a OneLake shortcut with Query acceleration.

  • D. Route the Eventstream to an Eventhouse native table.

Best answer: A

Explanation: The requirement is a continuous Lakehouse loading pattern with stateful streaming logic. Spark Structured Streaming is the best fit because it can apply event-time watermarking, deduplicate records, join to Delta-based reference data, and write the curated output to a Lakehouse Delta table.

Spark Structured Streaming is designed for continuous processing in Spark, including stateful operations such as watermark-based deduplication and streaming joins or enrichment with Delta tables. In this scenario, the target is a curated Lakehouse Delta table, and the transformation depends on transactionId deduplication plus enrichment from an existing Lakehouse dimension. A notebook-based Structured Streaming job can process micro-batches and write the cleaned, enriched data to Delta.

Eventhouse and KQL are excellent for native Real-Time Intelligence storage and interactive time-window queries, but they do not best match the stated Lakehouse Delta enrichment and loading requirement. Scheduled and shortcut-based approaches also fail the continuous transformation requirement.

  • Eventhouse routing is useful for KQL analytics over real-time tables, but it does not target the curated Lakehouse Delta loading pattern described.
  • Scheduled Dataflows Gen2 can transform batches, but scheduled refresh is not the right pattern for continuous event-time deduplication.
  • OneLake shortcut provides access to existing data, but it does not perform streaming transformation, deduplication, or loading.
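
A minimal Structured Streaming sketch of the described pattern follows; the raw events table, dimension table name, column names, watermark duration, and checkpoint path are all assumptions made for illustration:

from pyspark.sql import functions as F

# Hypothetical raw events table that the Eventstream continuously loads in the Lakehouse.
events = spark.readStream.table("raw_pos_events")

# Static product dimension that already exists as a Lakehouse Delta table.
dim_product = spark.read.table("dim_product")

curated = (
    events
    .withWatermark("eventTime", "15 minutes")         # bound deduplication state by event time
    .dropDuplicates(["transactionId", "eventTime"])   # watermark-aware deduplication
    .join(dim_product, "productId")                   # stream-static enrichment join
)

query = (
    curated.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "Files/checkpoints/curated_pos")  # hypothetical path
        .toTable("curated_pos_transactions")   # curated Lakehouse Delta table
)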

Continue with full practice

Use the Microsoft DP-700 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Try Microsoft DP-700 on Web
View Microsoft DP-700 Practice Test

Free review resource

Read the Microsoft DP-700 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026