Try 10 focused Microsoft DP-700 questions on Ingest and Transform Data, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | Microsoft DP-700 |
| Topic area | Ingest and Transform Data |
| Blueprint weight | 33% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Ingest and Transform Data for Microsoft DP-700. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 33% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Ingest and Transform Data
A manufacturing company sends sensor telemetry to Azure Event Hubs. You must validate proposed Microsoft Fabric streaming loading designs. The design must provide all of the following:

- Continuous ingestion with end-to-end freshness under 20 seconds
- Storage in a Real-Time Intelligence store that supports KQL queries, including windowed aggregations

Which TWO designs satisfy the requirements?
Options:
A. Use Eventstream with an Event Hubs source and Eventhouse destination.
B. Use a Spark structured streaming notebook to write a Lakehouse table.
C. Use Dataflows Gen2 incremental refresh into a Lakehouse table.
D. Mirror the Event Hubs namespace to a Fabric Warehouse.
E. Schedule a pipeline to copy new Event Hubs data to a Warehouse.
F. Use Eventstream to filter events, then write valid events to Eventhouse.
Correct answers: A and F
Explanation: The required pattern is continuous streaming into a Real-Time Intelligence store that supports KQL. Eventstreams can ingest from Event Hubs and route events to Eventhouse with low latency, while optional event processing can filter or transform events before loading.
The core concept is matching the streaming load design to both freshness and consumption requirements. In Fabric, Eventstreams are the managed ingestion and event-processing capability for continuous event sources such as Azure Event Hubs. Eventhouse is the Real-Time Intelligence destination designed for high-volume, low-latency event analytics and KQL queries, including windowed aggregations. A design that routes directly to Eventhouse, or filters in Eventstream before writing to Eventhouse, preserves the continuous ingestion pattern and the required KQL consumption path.
Batch refresh and scheduled copy patterns can be useful for analytical loading, but they do not satisfy a sub-20-second streaming freshness requirement with KQL access in Eventhouse.
Topic: Ingest and Transform Data
A Microsoft Fabric notebook uses Spark structured streaming to aggregate IoT events from an Eventstream into a Lakehouse Delta table. The business requirement is to include events that arrive up to 20 minutes after their eventTime in 10-minute aggregates.
The aggregate table is missing events from devices that reconnect 12 to 18 minutes late. The notebook contains:
```python
df.withWatermark("eventTime", "5 minutes") \
  .groupBy(window("eventTime", "10 minutes"), "deviceId") \
  .count()
```
What is the best fix?
Options:
A. Increase the watermark to at least 20 minutes.
B. Window by ingestion time instead of event time.
C. Partition the Delta table by deviceId.
D. Change the trigger interval to 20 minutes.
Best answer: A
Explanation: The symptom matches a watermark that is shorter than the allowed late-arrival period. Spark structured streaming drops late records after the watermark advances past the window, so a 5-minute watermark cannot satisfy a 20-minute lateness requirement.
In Spark structured streaming, event-time windows use the watermark to decide how long to retain state for late-arriving records. The query is correctly grouping by eventTime, but withWatermark("eventTime", "5 minutes") allows only about 5 minutes of lateness before older window state can be closed and later records discarded. Because the requirement says events up to 20 minutes late must be included, the watermark duration should be increased to at least 20 minutes. This may increase state retention and resource use, but it matches the stated correctness requirement.
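The watermark rule above can be illustrated with a small pure-Python sketch. This is not the Spark API itself, only a simplified model of the decision Spark makes: a late record can still update its window only while the watermark (maximum event time seen minus the watermark delay) has not passed the window's end.

```python
from datetime import datetime, timedelta

def keeps_late_event(event_time, max_event_time_seen, watermark_delay,
                     window_minutes=10):
    """Can a late event still update its tumbling window? (simplified model)"""
    # Tumbling window containing the event: floor the minute to the window size.
    floored_minute = (event_time.minute // window_minutes) * window_minutes
    window_end = (event_time.replace(minute=floored_minute, second=0, microsecond=0)
                  + timedelta(minutes=window_minutes))
    # Spark drops a window's state once the watermark passes its end time.
    watermark = max_event_time_seen - watermark_delay
    return watermark < window_end

event = datetime(2024, 1, 1, 12, 9)        # belongs to the 12:00-12:10 window
stream_now = datetime(2024, 1, 1, 12, 24)  # event arrives 15 minutes late
print(keeps_late_event(event, stream_now, timedelta(minutes=5)))   # False: dropped
print(keeps_late_event(event, stream_now, timedelta(minutes=20)))  # True: included
```

With a 5-minute watermark the 12:00-12:10 window is closed by 12:15, so the 15-minute-late event is dropped; a 20-minute watermark keeps that window's state open until 12:30.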
Topic: Ingest and Transform Data
A Fabric workspace uses a Lakehouse bronze layer with OneLake shortcuts to mirrored sales tables. A nightly pipeline loads a Warehouse that feeds a semantic model. Analysts need only daily revenue and units by product category and sales region, and semantic model refresh should scan the fewest rows without losing that detail. Which design should you implement?
Options:
A. Load order-line rows with category and region columns.
B. Load a gold table grouped by day, category, and region.
C. Load a monthly table grouped by category and region.
D. Load a daily table grouped by customer and order.
Best answer: B
Explanation: The correct grain is the lowest level needed by downstream analytics: day, product category, and sales region. Aggregating to that grain preserves the requested detail while reducing scan volume for the Warehouse and semantic model.
The core concept is selecting the aggregation grain before loading curated data. Because analysts need daily revenue and units by category and region, the Fabric transformation should join the needed reference data, then group order-line facts into one row per day, category, and region. This creates a gold aggregate table that is still detailed enough for the required analysis but avoids carrying unnecessary order-line detail into the Warehouse. Grouping above that level, such as by month, loses required detail; grouping below it, such as by order or customer, keeps avoidable volume.
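The grain decision can be sketched in plain Python (field names and values here are illustrative, not a real schema): order lines collapse to one row per day, category, and region, which is exactly the detail the analysts need and nothing more.

```python
from collections import defaultdict

# Hypothetical order-line rows (names and values are illustrative only).
order_lines = [
    {"date": "2024-05-01", "category": "Audio", "region": "West", "revenue": 120.0, "units": 2},
    {"date": "2024-05-01", "category": "Audio", "region": "West", "revenue": 60.0,  "units": 1},
    {"date": "2024-05-01", "category": "Video", "region": "East", "revenue": 300.0, "units": 3},
]

# Group to the lowest grain analysts need: day x category x region.
gold = defaultdict(lambda: {"revenue": 0.0, "units": 0})
for row in order_lines:
    key = (row["date"], row["category"], row["region"])
    gold[key]["revenue"] += row["revenue"]
    gold[key]["units"] += row["units"]

print(dict(gold))  # three order lines collapse into two gold rows
```

The same grouping would be expressed as a groupBy/sum in the actual Fabric transformation before loading the gold table.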
Topic: Ingest and Transform Data
A company uses Microsoft Fabric and must choose transformation tools for two workloads. The clickstream workload parses large volumes of nested JSON files in a Lakehouse and requires custom Python parsing at scale. The customer dimension workload is relational, set-based, and maintained by a SQL team in a Fabric Warehouse.
Which TWO tool choices should you recommend?
Options:
A. Use Dataflows Gen2 for the clickstream transformation.
B. Use KQL in an Eventhouse for the customer dimension.
C. Use a Fabric notebook with PySpark for the customer dimension.
D. Use a Fabric notebook with PySpark for the clickstream transformation.
E. Use T-SQL in the Warehouse for the customer dimension.
F. Use T-SQL against the Lakehouse SQL analytics endpoint for clickstream.
Correct answers: D and E
Explanation: Match the transformation tool to the data store, complexity, latency, and team skills. T-SQL is the natural fit for relational transformations inside a Fabric Warehouse. A Fabric notebook with PySpark fits complex Lakehouse file transformations that require custom Python parsing at scale.
The core decision is to choose the transformation engine closest to each data store and team skill. Fabric Warehouse workloads are best handled with T-SQL when the transformations are relational, set-based, and maintained by a SQL team. Lakehouse file transformations that require custom parsing, nested JSON handling, and distributed processing align with Fabric notebooks because Spark can read files, flatten arrays, and write curated Delta tables. Dataflows Gen2 is strong for low-code Power Query-style transformations, and KQL is for Eventhouse and Real-Time Intelligence data, but neither matches these stated constraints. Avoid moving data to a different engine when the native transformation tool already fits the workload.
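As a small illustration of the flattening work the clickstream side involves, here is a pure-Python sketch (the payload shape and field names are hypothetical): one nested session record becomes one row per event.

```python
import json

# Hypothetical clickstream payload with a nested events array (illustrative only).
raw = ('{"sessionId": "s1", "user": {"id": "u42"}, '
       '"events": [{"type": "click", "ts": 1}, {"type": "view", "ts": 2}]}')

record = json.loads(raw)

# Flatten: one output row per nested event, carrying session and user context.
rows = [
    {"sessionId": record["sessionId"],
     "userId": record["user"]["id"],
     "eventType": event["type"],
     "ts": event["ts"]}
    for event in record["events"]
]

print(rows)  # two flat rows from one nested session record
```

In a Fabric notebook, Spark achieves the same shape with schema parsing and array explosion before writing the curated Delta table, which is why the code-first engine fits this workload.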
Topic: Ingest and Transform Data
A Fabric pipeline loads sales files into a Lakehouse every 30 minutes. Auditors must validate each run against schedule, row-count, source watermark freshness, and alerting requirements. They must not read raw files or sales table data. Which approach should you implement?
Options:
A. Write run metrics to an audit table and grant SELECT only there.
B. Share the Lakehouse and apply sensitivity labels to sales tables.
C. Provide Fabric audit log access only for pipeline activity events.
D. Add auditors as workspace Viewers so they can inspect pipeline runs.
Best answer: A
Explanation: The validation evidence should be separated from the protected sales data. Persisting schedule, volume, freshness, and alert status metrics to an audit table lets auditors verify ingestion without workspace-wide or Lakehouse data access.
For governed ingestion validation, the pipeline should record operational metadata such as scheduled time, actual run time, rows loaded, maximum source timestamp, freshness status, and alert status. Auditors can then receive object-level access, such as SELECT on only that audit table or schema. This follows least privilege because the audit evidence is available while raw files and sales tables remain protected. Fabric audit logs are useful for reviewing activities and access events, but they do not replace purpose-built ingestion validation metrics such as row counts and source watermarks.
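The operational metadata described above might be shaped like the following sketch. The function name, field names, and the 45-minute freshness threshold are all assumptions for illustration, not a Fabric API.

```python
from datetime import datetime, timezone

def build_audit_row(run_id, scheduled, started, rows_loaded,
                    max_source_ts, freshness_limit_minutes=45):
    """Build one audit-table row of operational metadata for an ingestion run."""
    lag_minutes = (started - max_source_ts).total_seconds() / 60
    return {
        "run_id": run_id,
        "scheduled_time": scheduled.isoformat(),
        "actual_start": started.isoformat(),
        "rows_loaded": rows_loaded,
        "max_source_timestamp": max_source_ts.isoformat(),
        "freshness_ok": lag_minutes <= freshness_limit_minutes,
        "alerted": lag_minutes > freshness_limit_minutes,
    }

row = build_audit_row(
    run_id="run-0130",
    scheduled=datetime(2024, 5, 1, 8, 0, tzinfo=timezone.utc),
    started=datetime(2024, 5, 1, 8, 2, tzinfo=timezone.utc),
    rows_loaded=18_420,
    max_source_ts=datetime(2024, 5, 1, 7, 45, tzinfo=timezone.utc),
)
print(row["freshness_ok"])  # True: source is 17 minutes behind, within the limit
```

The pipeline would append rows like this to the audit table at the end of each run, and auditors would receive SELECT on that table only.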
Topic: Ingest and Transform Data
You run a Fabric notebook that loads a Lakehouse fact_orders table. The raw orders DataFrame has 18,420 rows, but the output has 17,905 rows. Both join columns are strings, and the 515 missing orders all have customer IDs that are absent from the current customer dimension. The requirement is to keep every order and use CustomerKey = -1 for unknown customers.
```python
fact = (orders
    .join(dimCustomer,
          orders.customer_id == dimCustomer.CustomerID,
          "left")
    .filter(dimCustomer.IsCurrent == True)
    .select("OrderID", "OrderDate", "CustomerKey", "Amount"))
```
What is the best fix?
Options:
A. Cast orders.customer_id to integer in the join.
B. Use a full outer join for all customer rows.
C. Change the join type to inner.
D. Filter dimCustomer first, left join, and default null keys.
Best answer: D
Explanation: The filter on dimCustomer.IsCurrent is applied after the left join. For unmatched orders, the right-side columns are null, so the filter removes those rows. Filtering the dimension first and filling null keys preserves every order as required.
A left outer join preserves every left-side row, but a later filter can still remove rows the join kept. Here, orders without a matching current customer have null values for dimCustomer.IsCurrent; the expression null == True does not evaluate to true, so those orders are filtered out. The current-customer condition should be applied to dimCustomer before the join, or made part of the join condition, and CustomerKey should then be defaulted to -1 for unmatched rows. This keeps the fact table grain aligned with the raw orders while still using only current dimension rows.
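The fix in option D can be sketched in plain Python, using the column names from the question's snippet (the data values are illustrative): filter the dimension first, then left join and default unmatched keys to -1.

```python
# Pure-Python sketch of option D (column names follow the question's snippet;
# the data values are illustrative only).
orders = [
    {"OrderID": 1, "customer_id": "C1", "Amount": 10.0},
    {"OrderID": 2, "customer_id": "C9", "Amount": 25.0},  # unknown customer
]
dimCustomer = [
    {"CustomerID": "C1", "CustomerKey": 101, "IsCurrent": True},
    {"CustomerID": "C1", "CustomerKey": 100, "IsCurrent": False},
]

# Step 1: filter to current dimension rows BEFORE the join.
current = {d["CustomerID"]: d["CustomerKey"] for d in dimCustomer if d["IsCurrent"]}

# Step 2: left join, defaulting missing keys to -1 so every order survives.
fact = [
    {"OrderID": o["OrderID"],
     "CustomerKey": current.get(o["customer_id"], -1),
     "Amount": o["Amount"]}
    for o in orders
]

print(fact)  # both orders kept; order 2 gets the unknown-member key -1
```

In PySpark the same logic is a filter on dimCustomer, a left join, and a coalesce of CustomerKey to the unknown-member key.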
Topic: Ingest and Transform Data
A retail company sends point-of-sale events to a Fabric workspace from Azure Event Hubs. You must continuously discard test events, keep the full stream in a Lakehouse table, and deliver the filtered stream to an Eventhouse for KQL queries. The solution must minimize custom code and support routing to both destinations from the same stream.
Which implementation should you use?
Options:
A. Run a Spark Structured Streaming notebook to write both destinations.
B. Create an Eventstream with a filter transform and two destinations.
C. Schedule a Dataflow Gen2 refresh to load both destinations.
D. Use a pipeline Copy activity with tumbling-window parameters.
Best answer: B
Explanation: Eventstreams are designed for real-time event ingestion, lightweight transformation, routing, and destination delivery in Fabric. The requirement is primarily to filter and route the same continuous stream to a Lakehouse and an Eventhouse while minimizing custom code.
Use Eventstreams when the main task is to process streaming events by routing, transforming, or delivering them to Fabric destinations. In this scenario, the stream must be filtered to remove test events, preserved in a Lakehouse, and delivered to an Eventhouse for KQL analysis. An Eventstream can connect to the event source, apply a filter transformation, and fan out to multiple destinations from the same event flow. Spark Structured Streaming is valid for code-based streaming transformations, but it is not the simplest fit when built-in routing and destination delivery are the primary requirements.
Topic: Ingest and Transform Data
You are implementing a Fabric solution for finance reporting. Source files contain nested JSON, and the team must keep an existing PySpark notebook that parses and standardizes the files. The serving layer must let engineers load and update conformed fact and dimension tables by using T-SQL and support SQL-first warehouse queries. Which implementation should you use?
Options:
A. Publish the Lakehouse SQL analytics endpoint as the serving layer.
B. Use Dataflows Gen2 as the serving data store.
C. Use an Eventhouse and rewrite the transformations in KQL.
D. Orchestrate the PySpark notebook and load conformed tables into a Warehouse.
Best answer: D
Explanation: The data store should be chosen for the serving and access requirements, not just for the transformation language. A Fabric Warehouse is the appropriate serving store for T-SQL loading, updates, and SQL-first warehouse queries, while the existing PySpark notebook can still perform the parsing and standardization work.
Store selection and transformation-engine selection are separate design decisions in Fabric. In this scenario, PySpark is required for transformation because the team already has notebook logic for nested JSON. However, the serving layer requirements point to a Warehouse because engineers need T-SQL-based loading and updates over conformed fact and dimension tables, and analysts need SQL-first warehouse queries. The PySpark notebook can be orchestrated in a pipeline and its curated output loaded into the Warehouse. Choosing a Lakehouse only because PySpark is used would confuse the processing engine with the serving store.
Topic: Ingest and Transform Data
A retail company receives point-of-sale events in Azure Event Hubs. The data engineering team must continuously deliver all events to a Lakehouse for archive and deliver only fraud-risk events to an Eventhouse table within seconds. The solution should avoid custom streaming code and should not wait for a schedule.
Which orchestration pattern should you implement?
Options:
A. Run a Spark structured streaming notebook from a pipeline.
B. Create an Eventstream with filtering and two destinations.
C. Create a scheduled Dataflow Gen2 refresh.
D. Create a scheduled pipeline with Copy activities.
Best answer: B
Explanation: An Eventstream is the best fit when the primary requirement is continuous event routing, filtering, and destination delivery. It can read from an event source, apply transformations such as filters, and send events to multiple Fabric destinations without relying on a schedule.
Fabric Eventstreams support real-time ingestion patterns where events must be routed, transformed, and delivered continuously. In this scenario, the team needs all events sent to a Lakehouse and only a filtered subset sent to an Eventhouse within seconds. That matches Eventstreams because the routing and filtering are part of the streaming flow, not a batch orchestration workflow. Dataflows Gen2 and scheduled pipelines are better for batch or refresh-based processing, while a Spark structured streaming notebook can work but adds custom code and operational complexity that the requirement explicitly avoids.
Topic: Ingest and Transform Data
A retail company receives point-of-sale events through a Fabric Eventstream. The team must continuously load a curated Lakehouse Delta table. During loading, each event must be deduplicated by transactionId using an event-time watermark and enriched with a product dimension that already exists as a Lakehouse Delta table. Which loading pattern should you implement?
Options:
A. Use Spark Structured Streaming in a notebook.
B. Use Dataflows Gen2 with scheduled refresh.
C. Create a OneLake shortcut with Query acceleration.
D. Route the Eventstream to an Eventhouse native table.
Best answer: A
Explanation: The requirement is a continuous Lakehouse loading pattern with stateful streaming logic. Spark Structured Streaming is the best fit because it can apply event-time watermarking, deduplicate records, join to Delta-based reference data, and write the curated output to a Lakehouse Delta table.
Spark Structured Streaming is designed for continuous processing in Spark, including stateful operations such as watermark-based deduplication and streaming joins or enrichment with Delta tables. In this scenario, the target is a curated Lakehouse Delta table, and the transformation depends on transactionId deduplication plus enrichment from an existing Lakehouse dimension. A notebook-based Structured Streaming job can process micro-batches and write the cleaned, enriched data to Delta.
Eventhouse and KQL are excellent for native Real-Time Intelligence storage and interactive time-window queries, but they do not best match the stated Lakehouse Delta enrichment and loading requirement. Scheduled and shortcut-based approaches also fail the continuous transformation requirement.
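The stateful logic this question describes can be modeled in a small single-threaded simulation. This is not the Spark Structured Streaming API, only an illustration of the two steps: drop duplicate transactionIds seen within the watermark window, then enrich surviving events from a product dimension (all identifiers and values here are hypothetical).

```python
# Simplified simulation of watermark-bounded dedup plus dimension enrichment
# (not the Spark API; identifiers and values are illustrative only).
WATERMARK_SECONDS = 600  # keep dedup state for 10 minutes of event time

product_dim = {"P1": "Espresso Machine", "P2": "Grinder"}

def process(events):
    seen = {}      # transactionId -> event time when first seen
    curated = []
    for e in sorted(events, key=lambda e: e["eventTime"]):
        # Expire dedup state older than the watermark, as Spark would.
        horizon = e["eventTime"] - WATERMARK_SECONDS
        seen = {tx: t for tx, t in seen.items() if t >= horizon}
        if e["transactionId"] in seen:
            continue  # duplicate within the watermark: drop it
        seen[e["transactionId"]] = e["eventTime"]
        curated.append({**e, "productName": product_dim.get(e["productId"], "Unknown")})
    return curated

events = [
    {"transactionId": "t1", "productId": "P1", "eventTime": 0},
    {"transactionId": "t1", "productId": "P1", "eventTime": 120},  # duplicate
    {"transactionId": "t2", "productId": "P2", "eventTime": 200},
]
print(len(process(events)))  # 2 curated rows after dedup and enrichment
```

In the actual notebook, the same pattern would use withWatermark with dropDuplicates on transactionId, a join to the Lakehouse dimension table, and a streaming write to the curated Delta table.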
Use the Microsoft DP-700 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Read the Microsoft DP-700 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.