Free Microsoft DP-750 Practice Questions: Prepare and Process Data
Practice 10 free Microsoft Certified: Azure Databricks Data Engineer Associate (Microsoft DP-750) questions on Prepare and Process Data, with answers, explanations, and the IT Mastery next step.
Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.
Topic snapshot
| Field | Detail |
|---|---|
| Exam route | Microsoft DP-750 |
| Topic area | Prepare and Process Data |
| Blueprint weight | 32% |
| Page purpose | Focused sample questions before returning to mixed practice |
How to use this topic drill
Use this page to isolate Prepare and Process Data for Microsoft DP-750. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 32% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
Sample questions
These are original IT Mastery practice questions aligned to this topic area. They are not official Microsoft questions, copied live-exam content, or exam dumps. Use them for self-assessment, scope review, and deciding what to drill next.
Question 1
Topic: Prepare and Process Data
A Lakeflow Spark Declarative Pipeline loads raw order records into a silver Delta table. The target model requires order_id and order_timestamp; records missing either value must be excluded from the silver table, but the pipeline should continue and expose data-quality metrics for monitoring. What should you configure?
Options:
A. A column mask on the required fields
B. An expectation that drops rows when required fields are null
C. An Auto Loader rescued data column
D. A
NOT NULLtable constraint on the silver columns
Best answer: B
Explanation: Lakeflow Spark Declarative Pipelines expectations are designed for row-level data-quality rules such as required-value checks. When the requirement is to keep processing valid records while excluding rows that fail a rule, configure an expectation with a drop behavior for conditions like order_id IS NOT NULL AND order_timestamp IS NOT NULL. This prevents incomplete records from entering the target table and keeps quality metrics available for monitoring. A hard table constraint is better when failed writes should be rejected, not when the pipeline should continue processing valid rows.
- Hard constraint may reject or fail writes instead of continuing with only valid rows.
- Rescued data column captures unexpected schema fields, not missing required values.
- Column mask controls sensitive data visibility and does not validate row completeness.
Question 2
Topic: Prepare and Process Data
A Lakeflow Spark Declarative Pipeline that builds a gold.order_facts table failed after a new batch arrived. The target model requires order_id, customer_id, and order_ts to be present. The data-quality check failed with customer_id IS NOT NULL. A quick profile of the raw batch shows no new columns or type changes, but 1,842 rows have a null or blank customer_id. Business rules prohibit creating synthetic customer IDs. What is the best next step?
Options:
A. Update the schema to allow nullable
customer_idB. Grant the pipeline owner access to
gold.order_factsC. Replace missing
customer_idvalues withUNKNOWND. Quarantine incomplete rows and load only valid records
Best answer: D
Explanation: This failure is a missing-data problem, not schema drift or an access issue. The target data model requires customer_id, and the raw profile confirms that some incoming records lack that required value. Because the business rule prohibits synthetic IDs, filling the field with a placeholder would create invalid facts. The appropriate handling pattern is to separate incomplete records into a quarantine or error table for investigation and remediation, and allow only records that satisfy the required-value rule into the curated target. This preserves data quality without silently discarding evidence needed to fix the source issue.
- Relaxing nullability fails because it changes the target model instead of handling records that violate it.
- Using placeholders fails because
UNKNOWNwould invent a customer identifier that business rules explicitly prohibit. - Changing permissions fails because the visible error is a quality-check failure, not a Unity Catalog access denial.
Question 3
Topic: Prepare and Process Data
A data engineering team must ingest daily files from an SFTP source into a Unity Catalog table. Before writing to Delta Lake, the process must call a custom Python library to decrypt each file, parse a proprietary record format, and apply source-specific exception handling. Which ingestion configuration should the team use?
Options:
A. Lakeflow Connect with a managed connector
B. A notebook task on job compute
C. COPY INTO from a volume
D. Azure Data Factory copy activity only
Best answer: B
Explanation: Notebooks are the best fit when ingestion requires custom code, custom libraries, or source-specific processing that cannot be expressed as a standard connector or simple file load. In this scenario, the decisive requirements are Python-based decryption, proprietary parsing, and custom exception handling before writing to Delta Lake. A notebook task can run on job compute, install or reference the required library, and implement the ingestion logic in Python or PySpark. Managed ingestion options are preferable for standardized sources and simple movement, but they are not the primary choice when the ingest path depends on bespoke code.
- Managed connector is attractive for supported SaaS or database sources, but it does not address proprietary parsing and custom Python decryption.
- Copy activity only moves data between systems, but the stem requires code-driven processing before the Delta write.
- COPY INTO is useful for loading files into Delta tables, but it is not designed for custom pre-load decryption and parsing logic.
Question 4
Topic: Prepare and Process Data
A team changed a bronze ingestion job from reading raw Parquet exports in ADLS to reading the data/ folder under an upstream Apache Iceberg table. The Auto Loader job still succeeds, but the target Delta table now includes rows that the upstream system reports as deleted and shows occasional duplicates after upstream compaction. There are no schema changes or permission errors.
What is the best root cause or next diagnostic step?
Options:
A. The target Delta table needs a wider schema evolution mode.
B. The job is bypassing Iceberg metadata and reading data files directly.
C. Auto Loader does not support Parquet ingestion.
D. The Unity Catalog volume lacks read permission on the folder.
Best answer: B
Explanation: Iceberg is a table format, not just a collection of Parquet files. Its metadata tracks snapshots, manifests, deletes, and file replacements from operations such as compaction. If a pipeline reads the underlying data/ files directly, it can ignore which files are current and which rows have been deleted. The next diagnostic step is to confirm whether the source is an Iceberg table and use an ingestion or table-access approach that respects the Iceberg metadata instead of treating the data folder as raw Parquet input. The visible symptoms point to table-format semantics being bypassed, not to schema drift or access failure.
- Parquet support fails because Auto Loader can ingest Parquet files; the issue is reading Iceberg-managed files as raw files.
- Schema evolution fails because the stem states there are no schema changes, and widening the schema would not explain deleted rows reappearing.
- Volume permission fails because the job succeeds and reads data, so missing read access is not supported by the evidence.
Question 5
Topic: Prepare and Process Data
An Azure Databricks team is creating a managed Delta table in Unity Catalog for 18 months of clickstream data. Most queries filter by tenant_id and a small date range, but tenant_id has tens of thousands of values. The team wants faster data skipping and co-location for related rows without creating many small partition directories. Which table layout configuration should they choose?
Options:
A. Create one schema per tenant and a union view
B. Partition the table by
tenant_idand event dateC. Use liquid clustering on
tenant_idand event dateD. Partition only by event date and cache the table
Best answer: C
Explanation: For Delta tables in Azure Databricks, clustering is appropriate when queries benefit from physically co-locating related data but partitioning would create too many small directories or files. Here, tenant_id is a common filter but has very high cardinality, so partitioning by it would likely over-partition the table. Liquid clustering lets Databricks organize data around the selected columns and improve data skipping without requiring a rigid partition directory structure. It is a better managed table layout choice for evolving query patterns and high-cardinality filter columns than traditional partitioning.
- High-cardinality partitioning fails because tens of thousands of tenant partitions can create many small files and maintenance overhead.
- Date-only partitioning may help date pruning but does not co-locate tenant-specific rows for the main query pattern.
- Schema per tenant changes object organization, not physical layout, and adds unnecessary governance and query complexity.
Question 6
Topic: Prepare and Process Data
A retail team receives clickstream events from Azure Event Hubs into Azure Databricks. The pipeline must continuously parse each event, remove duplicates by event_id, compute finalized 5-minute event-time window counts with a 15-minute watermark, and write the results to a Unity Catalog Delta table. The workload must recover correctly after job restarts. Which implementation should the team use?
Options:
A. A Unity Catalog materialized view over Event Hubs
B.
COPY INTOscheduled every 5 minutes from Event HubsC. Auto Loader with a batch aggregation notebook
D. Structured Streaming with watermarking and checkpointed Delta writes
Best answer: D
Explanation: Structured Streaming is the appropriate Azure Databricks capability when the workload needs continuous processing logic over streaming data. In this scenario, the decisive requirements are stateful event-time processing, deduplication, watermarking for late events, and restart recovery. A Structured Streaming query can read the event stream, apply withWatermark, drop duplicates by event_id, aggregate by event-time windows, and write finalized results to a Unity Catalog Delta table with a checkpoint location. The checkpoint tracks progress and state so the job can resume safely after failures or restarts.
Batch-oriented ingestion tools can load data, but they do not provide the same continuous stateful streaming semantics needed here.
- Scheduled copy fails because
COPY INTOis for batch file ingestion, not continuous Event Hubs processing with stateful watermarks. - Batch notebook aggregation can transform data, but it does not preserve streaming state and offsets across continuous event processing.
- Materialized view is not the implementation for consuming Event Hubs streams with checkpointed Structured Streaming logic.
Question 7
Topic: Prepare and Process Data
A Lakeflow Job task in Azure Databricks reshapes sales data for a monthly reporting table. The downstream report expects one row per store_id, but the pivoted output contains many rows per store.
Exhibit: SQL used in the task
CREATE OR REPLACE TABLE reporting.store_monthly AS
SELECT *
FROM (
SELECT store_id, order_id, sales_month, amount
FROM silver.sales_orders
)
PIVOT (
SUM(amount) FOR sales_month IN ('Jan', 'Feb', 'Mar')
);
What is the best root cause indicated by the evidence?
Options:
A. The
amountcolumn must be cast before aggregationB. The pivot values need explicit column aliases
C. The target table needs a Unity Catalog primary key
D.
order_idis included in the pivot input
Best answer: D
Explanation: In Spark SQL, a PIVOT aggregates values and groups by the remaining columns from the input subquery that are not the pivot column or aggregate expression. Here, sales_month is the pivot column and amount is aggregated, but store_id and order_id remain as grouping columns. That produces one row per store and order rather than one row per store. The diagnostic fix is to project only the intended grouping columns before the pivot, such as store_id, sales_month, and amount, or aggregate to store-month grain before pivoting. Aliases may improve column names, but they do not change row granularity.
- Alias focus fails because naming pivoted columns does not reduce duplicate rows.
- Cast focus fails because the symptom is incorrect granularity, not an aggregation type error.
- Primary key focus fails because Unity Catalog constraints do not reshape pivot output.
Question 8
Topic: Prepare and Process Data
A data engineering team is defining a Lakeflow Spark Declarative Pipeline that creates a silver orders table from streaming bronze data. Records must be validated during pipeline execution, and if any record has a missing order_id or a negative amount, the pipeline update must fail so operators can investigate. Which configuration should the team use?
Options:
A. Add a Unity Catalog row filter on the silver table.
B. Add a Unity Catalog informational primary key constraint.
C. Add a pipeline expectation with
ON VIOLATION FAIL UPDATE.D. Add a pipeline expectation with
ON VIOLATION DROP ROW.
Best answer: C
Explanation: Lakeflow Spark Declarative Pipelines expectations validate data as the pipeline runs. When invalid records must stop processing, configure the expectation to fail the update, such as checking that order_id IS NOT NULL AND amount >= 0 with ON VIOLATION FAIL UPDATE. This keeps bad data from being silently omitted or passed through and gives operators an explicit pipeline failure to investigate. Use drop-row behavior only when discarding invalid records is acceptable; use Unity Catalog governance features for access control or metadata constraints, not as the primary pipeline-execution validation behavior.
- Row filter limits which rows users can query; it does not validate incoming records during pipeline execution.
- Drop row validates records but silently excludes invalid rows instead of failing the update.
- Primary key constraint can document relationships in Unity Catalog but is not the pipeline expectation behavior needed to stop the update.
Question 9
Topic: Prepare and Process Data
A data engineering team must ingest customer records from a SaaS application that is supported by Lakeflow Connect. The target must be governed Unity Catalog bronze tables in main.bronze, and the team wants minimal custom code with maintainable scheduled loads. Which implementation best preserves these constraints?
Options:
A. Mount the storage account to the workspace and write tables through the mount path.
B. Build a notebook that calls the SaaS API and writes Delta files to DBFS.
C. Configure Lakeflow Connect to sync into
main.bronzeusing a service principal with required Unity Catalog privileges.D. Use Azure Data Factory to copy files to an unmanaged storage path, then query them directly.
Best answer: C
Explanation: For a supported SaaS source with a goal of minimal custom code, maintainable scheduling, and governed Unity Catalog targets, Lakeflow Connect is the best fit. It provides managed connector-based ingestion and can land data directly into Unity Catalog tables when the running principal has the needed catalog and schema privileges. This keeps the data under Unity Catalog governance instead of depending on ad hoc files, mounts, or custom API code. Notebooks and Azure Data Factory can be valid ingestion tools in other scenarios, but they add more custom orchestration or may bypass the governed table model if implemented as raw file copies outside Unity Catalog.
- Custom notebook code adds API maintenance and writes to DBFS, which does not satisfy the governed Unity Catalog table target.
- Unmanaged file copy lands data outside the requested managed Unity Catalog bronze tables.
- Mount-based writes are not the preferred governed Unity Catalog pattern and can bypass catalog-level access controls.
Question 10
Topic: Prepare and Process Data
A team loads silver.orders in Azure Databricks from an operational SQL database. The initial load succeeded, but the nightly ingestion only inserts rows where last_modified is greater than the previous run. Downstream users report that canceled orders remain active and address corrections are missing.
Evidence:
| Signal | Value |
|---|---|
| SQL source CDC | Enabled for dbo.orders |
| Change table rows | Operations for updates and deletes exist |
| Event Hubs producer | None configured |
| Current job result | Appends new rows only |
What is the best root cause?
Options:
A. The job uses a SQL append pattern instead of CDC ingestion.
B. COPY INTO is failing to infer schema changes.
C. The Event Hubs checkpoint is replaying stale messages.
D. Unity Catalog denies writes to the target table.
Best answer: A
Explanation: This symptom points to a pattern mismatch. A SQL append load can be appropriate for simple inserts, but it does not automatically apply updates or deletes from the source. The source explicitly has CDC enabled and contains update/delete change records, so the ingestion should read and apply those CDC operations to keep the Delta table synchronized. Event Hubs is not supported by the visible evidence because no producer is configured. The successful initial and nightly writes also make a target write-permission issue unlikely.
- Event Hubs replay fails because the exhibit says no Event Hubs producer is configured.
- Schema inference fails because the reported issue is missing updates and deletes, not new columns or type drift.
- Write permission fails because the current job is successfully appending rows to the target table.
Continue in the web app
Use IT Mastery for interactive Microsoft DP-750 practice with mixed sets, timed mocks, topic drills, explanations, and progress tracking.