Free DP-750 Practice Exam: Microsoft Certified: Azure Databricks Data Engineer Associate
Try 50 free Microsoft Certified: Azure Databricks Data Engineer Associate (Microsoft DP-750) questions across the exam domains, with explanations, then continue with IT Mastery practice.
This free full-length Microsoft DP-750 practice exam includes 50 original IT Mastery questions across the exam domains.
These are original IT Mastery practice questions. They are not official Microsoft questions, copied live-exam content, or exam dumps. Use them for self-assessment, scope review, and deciding what to drill next.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.
Exam snapshot
- Exam route: Microsoft DP-750
- Practice-set question count: 50
- Time limit: 120 minutes
- Practice style: mixed-domain diagnostic run with answer explanations
Full-length exam mix
| Domain | Weight |
|---|---|
| Set Up and Configure an Azure Databricks Environment | 18% |
| Secure and Govern Unity Catalog Objects | 18% |
| Prepare and Process Data | 32% |
| Deploy and Maintain Data Pipelines and Workloads | 32% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and interactive practice.
Practice questions
Questions 1-25
Question 1
Topic: Prepare and Process Data
A retail team receives clickstream JSON files in cloud storage every few seconds. The data must be ingested continuously into the Unity Catalog table prod_bronze.web.clickstream_raw, and the process must recover after compute restarts without reprocessing files. Which implementation should you use?
Options:
A. A scheduled
COPY INTOcommand every 15 minutesB. A batch notebook that overwrites the table each hour
C. Auto Loader streaming read with a checkpoint in a Unity Catalog volume
D. A streaming query that writes only to a temporary view
Best answer: C
Explanation: For continuously arriving files, Auto Loader is the appropriate Azure Databricks ingestion pattern. It uses Structured Streaming to discover and process new files incrementally, and a checkpoint location stores progress so the stream can resume after a restart without starting over. Writing the stream to a fully qualified Unity Catalog table keeps the ingested data governed in the target catalog and schema. A Unity Catalog volume is a suitable governed location for checkpoint files when configured for the pipeline. Batch approaches can ingest files, but they do not meet the continuous or near-real-time requirement as directly.
- Scheduled batch misses the near-real-time requirement because
COPY INTOruns as discrete loads. - Hourly overwrite risks data loss and does not preserve continuous ingestion state.
- Temporary view output does not persist the events into the required Unity Catalog table.
Question 2
Topic: Deploy and Maintain Data Pipelines and Workloads
A Lakeflow Jobs workflow has a critical load_bronze_orders task. The operations team must be notified if this task fails or runs longer than 30 minutes, without changing the job schedule or retry behavior. Which configuration should you apply?
Options:
A. Create a Unity Catalog audit log alert
B. Add an automatic restart policy to the task
C. Change the job trigger to a continuous trigger
D. Add task alerts for failure and duration warning
Best answer: D
Explanation: Lakeflow Jobs supports operational notifications through job or task alerts. Because the requirement is specific to one task and includes both failure and a long-running condition, the configuration should be applied as task alerts with an appropriate duration warning threshold, such as 30 minutes. This sends notification without altering when the job runs or whether it retries after a failure.
Automatic restart and trigger settings affect execution behavior, not notification intent. Unity Catalog audit logs are useful for governance and access activity, not for alerting on task runtime states.
- Automatic restart changes recovery behavior and does not satisfy the notification requirement by itself.
- Continuous trigger changes how the job is invoked, not how operators are notified about task states.
- Audit log alert targets governance events rather than Lakeflow Jobs task failure or duration states.
Question 3
Topic: Set Up and Configure an Azure Databricks Environment
A team uses Azure Databricks notebooks for development against Unity Catalog tables. The group data-engineers already has edit permission on the notebooks and the required catalog, schema, and table privileges. The notebooks must run on the existing compute dev-etl-cluster, but users should not be able to change the compute configuration. What should you configure?
Options:
A. Grant
data-engineersUSE CATALOGon the workspace catalog.B. Grant
data-engineersCan Attach To ondev-etl-cluster.C. Grant
data-engineersCan Manage ondev-etl-cluster.D. Assign
data-engineersto the cluster policy owner role.
Best answer: B
Explanation: Notebook execution requires both data/object permissions and permission to use the target compute. In this scenario, Unity Catalog and notebook permissions are already in place, so the missing control is compute access. Granting the group Can Attach To on the existing compute allows members to attach notebooks and run development or processing workloads. It preserves the constraint because it does not give them administrative control over the compute configuration. Higher compute permissions are unnecessary when the goal is only to run notebooks on a managed cluster.
- Can Manage overgrants because it allows compute administration, which violates the requirement not to change configuration.
- Catalog privileges mismatch because Unity Catalog access is already granted and does not authorize attaching notebooks to compute.
- Cluster policy ownership controls policy administration, not routine notebook attachment to an existing compute resource.
Question 4
Topic: Secure and Govern Unity Catalog Objects
A data engineering team publishes a Unity Catalog table for finance analysts. The analysts can query main.curated.customer_lifetime_value when given the full three-part name, and no permission errors appear. However, they cannot find the dataset by searching for “CLV” or browsing business terms in Catalog Explorer. What is the best next diagnostic step?
Options:
A. Grant
SELECTon the table againB. Rotate the service principal secret
C. Remove any row filters from the table
D. Check and update the table description and tags
Best answer: D
Explanation: Unity Catalog privileges and descriptive metadata solve different problems. Privileges such as USAGE and SELECT determine whether a principal can access an object. Descriptions, comments, and tags help users understand and discover objects in tools such as Catalog Explorer. In this scenario, analysts can already query the table by its full name and there are no permission errors, so the evidence points to a discoverability metadata issue, not an access-control failure. The next diagnostic step is to inspect whether the table has useful descriptions, comments, and tags that include the expected business terminology such as “CLV.”
- Granting access again does not address search relevance when users already can query the table.
- Removing row filters is unrelated because no row-level access symptom or filtered query result is described.
- Rotating secrets targets authentication for automated identities, not human discoverability in Catalog Explorer.
Question 5
Topic: Set Up and Configure an Azure Databricks Environment
An analytics team uses Unity Catalog to govern curated sales data. Business users have SELECT only on approved gold views in prod.sales, but AI/BI Genie suggestions sometimes mention similarly named raw ingestion tables. You need to guide discovery toward the intended governed context without expanding user access. What should you configure?
Options:
A. Grant users
SELECTon the raw ingestion tablesB. Move the raw tables to a separate workspace
C. Add Lakeflow Jobs task parameters for the gold pipeline
D. Add Genie instructions that prioritize the approved
prod.salesviews
Best answer: D
Explanation: AI/BI Genie instructions are used to shape how Genie interprets business terms and which governed data context it should use for discovery and answers. In this scenario, the users already have the correct Unity Catalog privileges on approved gold views, so the implementation should not broaden access. Configure the Genie space/instructions to prefer the approved prod.sales views, define business terminology, and clarify which objects represent the certified context. Unity Catalog remains the enforcement layer for permissions; Genie instructions guide discovery and interpretation, not security boundaries. The key distinction is guidance versus access control.
- Granting access to raw ingestion tables violates the constraint to avoid expanding user access.
- Moving raw tables is an operational workaround and does not configure Genie’s discovery behavior.
- Lakeflow Jobs parameters affect pipeline execution, not AI/BI Genie data-discovery context.
Question 6
Topic: Secure and Govern Unity Catalog Objects
A Lakeflow Jobs task in Azure Databricks retrieves a database password with dbutils.secrets.get("kv-prod", "sql-password") from an Azure Key Vault-backed secret scope. The notebook succeeds when run interactively by a developer, but the scheduled job fails after its Run as identity is changed to the service principal spn-etl-prod.
Error excerpt:
PERMISSION_DENIED: Principal spn-etl-prod does not have READ permission on secret scope kv-prod
What is the most likely root cause?
Options:
A. The secret value must be embedded in the notebook
B. The service principal lacks READ on the secret scope
C. The Unity Catalog table grant is missing
D. The Key Vault secret was rotated
Best answer: B
Explanation: Azure Databricks secret access is evaluated for the identity running the workload. In this case, the developer can read the Azure Key Vault-backed secret scope, but the scheduled job now runs as spn-etl-prod. The error specifically identifies that service principal and the missing READ permission on the secret scope. Granting the workload identity access to the Databricks secret scope allows governed retrieval without placing credentials in notebook code. This is separate from Unity Catalog table access and separate from whether the secret value itself changed.
- Embedding credentials is the opposite of governed secret retrieval and is not required to fix scope access.
- Table grants control data objects in Unity Catalog, not reading a Databricks secret scope.
- Secret rotation would not explain an explicit missing
READpermission for the run-as principal.
Question 7
Topic: Deploy and Maintain Data Pipelines and Workloads
A Lakeflow Job run was interrupted during an outage after the first two tasks completed. The outage is fixed, and a corrected input file has been placed in the source location. The engineer must execute the workflow again from the first task so all downstream results are rebuilt from the corrected input. What should the engineer do next?
Options:
A. Restart the Lakeflow Job run
B. Repair the failed job run
C. Rerun only the failed task
D. Stop the current job run
Best answer: A
Explanation: Use a Lakeflow Jobs restart when execution must begin again after an interruption or correction. In this scenario, completed upstream tasks used input that has since been corrected, so the workflow must start from the beginning to rebuild all dependent outputs consistently. A repair operation is better when you want to continue from a failed point by rerunning failed or skipped tasks, not when previously successful tasks must be re-executed.
- Repair run fails because it is intended to resume from failed or skipped tasks, not rebuild from the first task.
- Failed task only fails because upstream completed tasks also need to process the corrected input.
- Stop run fails because the interruption is already resolved and stopping does not start processing again.
Question 8
Topic: Deploy and Maintain Data Pipelines and Workloads
A Lakeflow Job has tasks ingest_raw → build_silver → publish_gold. During an operational interruption, the run was stopped after ingest_raw completed. After correcting the source file, the team needs the workflow to begin again from ingest_raw using the current job definition. Which Lakeflow Jobs operation should be used?
Options:
A. Enable automatic restart for the job
B. Repair the stopped job run
C. Restart the stopped job run
D. Run only the downstream tasks
Best answer: C
Explanation: Use a Lakeflow Jobs restart when an interrupted or corrected workflow must begin again from the start of the task graph. In this scenario, the corrected source data affects the initial ingest task and everything downstream, so continuing from a later task or repairing only failed tasks would risk using stale intermediate output. Restarting keeps the current job definition but starts execution again at the beginning. Repair is better when only failed or skipped portions of a run should be retried.
- Repair run is tempting, but repair focuses on rerunning failed or skipped tasks rather than starting the whole workflow again.
- Downstream-only execution misses the corrected ingest step, so dependent results could still reflect the old source data.
- Automatic restart is a future resilience setting and does not explicitly begin this corrected workflow again now.
Question 9
Topic: Prepare and Process Data
A data engineering team needs to ingest customer records from a SaaS CRM into Unity Catalog bronze Delta tables. The source is supported by Lakeflow Connect. The team needs hourly incremental ingestion with no custom pre-ingestion transformations and wants to minimize notebook code and connector maintenance. Which implementation best satisfies these constraints?
Options:
A. Create a Lakeflow Spark Declarative Pipeline for custom CRM API ingestion.
B. Use Azure Data Factory to copy full extracts to storage, then run COPY INTO.
C. Build a notebook with custom API calls and schedule it in Lakeflow Jobs.
D. Configure Lakeflow Connect to load the CRM into bronze tables on an hourly schedule.
Best answer: D
Explanation: Lakeflow Connect is designed for managed ingestion from supported sources into Azure Databricks, including scheduling and incremental loading patterns without requiring custom connector code. In this scenario, the source is explicitly supported, the target is a Unity Catalog bronze layer, and there are no custom transformations before ingestion. Those facts favor a managed connector over notebooks or external copy orchestration. Notebooks and Lakeflow Jobs are better when ingestion requires custom code or transformations; Azure Data Factory is useful for broader orchestration or unsupported movement patterns but adds unnecessary maintenance here.
- Custom notebook code fails because the scenario prioritizes managed ingestion and minimal connector maintenance.
- Full extract pattern fails because copying full files and then using
COPY INTOdoes not match the incremental managed-ingestion requirement. - Declarative pipeline misuse fails because the source connector requirement is already satisfied by Lakeflow Connect, not custom API ingestion logic.
Question 10
Topic: Set Up and Configure an Azure Databricks Environment
A sales analytics team uses AI/BI Genie in Azure Databricks to help analysts discover data in Unity Catalog tables. The tables, schemas, and SELECT privileges are already correct. Analysts use business phrases such as “bookings” and “active customers,” which map to existing columns and filters. You must guide Genie’s interpretation without creating new Unity Catalog data objects, precomputing results, or changing table schemas. What should you implement?
Options:
A. Create a view with renamed business columns
B. Create a materialized view for the common metrics
C. Alter table definitions to add alias columns
D. Add AI/BI Genie instructions for the business terms
Best answer: D
Explanation: AI/BI Genie instructions are the right fit when the requirement is data discovery guidance: helping Genie interpret business vocabulary, synonyms, metric meanings, and preferred relationships over existing Unity Catalog objects. In this scenario, governance and table access are already correct, and the constraint explicitly avoids new objects, precomputed results, or schema changes. Views and materialized views are data objects that define or store query results. Table definition changes modify the physical or logical schema. Those choices may be useful for modeling or performance, but they are not needed just to guide natural-language discovery behavior.
- View creation fails because it introduces a new Unity Catalog object rather than only guiding Genie’s interpretation.
- Materialized results fail because a materialized view precomputes data, which the requirement excludes.
- Schema changes fail because adding alias columns changes table definitions instead of providing discovery guidance.
Question 11
Topic: Prepare and Process Data
A Delta table in Azure Databricks receives nightly MERGE operations that update corrected rows and delete invalid rows. After deletion vectors are enabled, the nightly job finishes faster, but several full-scan analytics queries are slower. The query profile shows similar file counts as before, no spill or skew, and increased time applying deletion vectors during scans. What is the best explanation?
Options:
A. Deletion vectors are adding read-time row filtering
B. VACUUM removed active data files too aggressively
C. Partition pruning stopped working for all queries
D. Unity Catalog row filters were applied automatically
Best answer: A
Explanation: Deletion vectors let Delta Lake record row-level deletes and some update effects without immediately rewriting the underlying Parquet files. That can make DELETE, UPDATE, and MERGE workloads faster because fewer files are rewritten. The tradeoff is that readers may need to apply deletion-vector metadata during scans to exclude logically removed rows. The evidence points to this behavior: DML is faster, file counts are similar, and the query profile specifically shows time spent applying deletion vectors. If physical cleanup is needed, use appropriate table maintenance to rewrite/materialize those changes before deleted rows can be purged according to retention rules.
- VACUUM issue fails because the evidence shows slower scans, not missing active files or read failures.
- Partition pruning fails because full scans and similar file counts do not indicate a pruning regression.
- Row filters fails because Unity Catalog row filters are explicit governance controls, not automatically created by enabling deletion vectors.
Question 12
Topic: Secure and Govern Unity Catalog Objects
A Unity Catalog metastore contains customer tables across several schemas. Compliance requires any column tagged classification=restricted to be masked from general analysts in current and future tables. The data_engineering_sp service principal must continue unmasked reads and writes for Lakeflow Jobs. Which configuration best meets the governance objective without disrupting the pipeline workflow?
Options:
A. Revoke
SELECTon all customer tables from the analysts group.B. Create an ABAC policy using the governed tag and exempt the service principal.
C. Apply the tag as discovery metadata without an enforcement policy.
D. Add table-specific column masks only to the existing customer tables.
Best answer: B
Explanation: Unity Catalog ABAC is designed for governance rules that follow data attributes, such as governed tags, rather than individual table-by-table grants. In this scenario, the control must apply to current and future restricted columns while allowing the Lakeflow Jobs service principal to keep processing unmasked data. A tag-based ABAC masking policy can target classification=restricted columns and include an exception for authorized principals, such as data_engineering_sp. This supports compliance without breaking the data engineering workflow. Object grants alone are too broad, and metadata tags alone do not enforce access behavior.
- Broad revocation blocks analyst access at the table level instead of masking only restricted columns.
- Table-specific masks miss future tagged columns unless every new table is manually configured.
- Metadata-only tagging helps discovery but does not enforce masking or access decisions.
Question 13
Topic: Prepare and Process Data
A team changed a bronze ingestion job from reading raw Parquet exports in ADLS to reading the data/ folder under an upstream Apache Iceberg table. The Auto Loader job still succeeds, but the target Delta table now includes rows that the upstream system reports as deleted and shows occasional duplicates after upstream compaction. There are no schema changes or permission errors.
What is the best root cause or next diagnostic step?
Options:
A. The job is bypassing Iceberg metadata and reading data files directly.
B. The Unity Catalog volume lacks read permission on the folder.
C. Auto Loader does not support Parquet ingestion.
D. The target Delta table needs a wider schema evolution mode.
Best answer: A
Explanation: Iceberg is a table format, not just a collection of Parquet files. Its metadata tracks snapshots, manifests, deletes, and file replacements from operations such as compaction. If a pipeline reads the underlying data/ files directly, it can ignore which files are current and which rows have been deleted. The next diagnostic step is to confirm whether the source is an Iceberg table and use an ingestion or table-access approach that respects the Iceberg metadata instead of treating the data folder as raw Parquet input. The visible symptoms point to table-format semantics being bypassed, not to schema drift or access failure.
- Parquet support fails because Auto Loader can ingest Parquet files; the issue is reading Iceberg-managed files as raw files.
- Schema evolution fails because the stem states there are no schema changes, and widening the schema would not explain deleted rows reappearing.
- Volume permission fails because the job succeeds and reads data, so missing read access is not supported by the evidence.
Question 14
Topic: Prepare and Process Data
A team maintains a Unity Catalog managed Delta table prod.sales.order_events loaded daily by a Lakeflow pipeline. The table is partitioned by event_date. Analysts report slow queries for one customer over a 7-day range. You must improve the layout without changing the table name or creating per-customer tables.
Evidence:
| Signal | Observation |
|---|---|
| Table size | 3.8 TB, 1,460 daily partitions |
| Common filter | event_date BETWEEN ... AND ... AND customer_id = ... |
customer_id | Millions of distinct values |
| Query profile | Date partitions pruned, but many files scanned inside each day |
Which implementation is most appropriate?
Options:
A. Keep date partitioning and apply Z-order clustering on
customer_id.B. Remove partitioning and cache the full table.
C. Create one managed table per
customer_id.D. Repartition the table by
customer_idonly.
Best answer: A
Explanation: The evidence shows that partition pruning on event_date is already working, so changing the partition column is not the likely fix. The remaining cost is scanning many files within the selected date partitions to find rows for a high-cardinality customer_id. Clustering, such as Z-ordering on customer_id for an existing partitioned Delta table, improves data skipping by colocating related values in files without creating millions of partitions or changing the table contract. Repartitioning by a high-cardinality customer key would create an impractical layout, and table-per-customer redesign would violate the stated constraint. The key signal is: partition pruning works, but file skipping within partitions needs improvement.
- Customer partitioning fails because millions of distinct customer values would create excessive partitions and poor manageability.
- Table redesign fails because per-customer tables change the table contract and operational model.
- Full-table caching does not fix persistent table layout and is unsuitable for a multi-terabyte managed table.
Question 15
Topic: Set Up and Configure an Azure Databricks Environment
A team is configuring compute for a new medallion ingestion workload in Azure Databricks. The workload is defined as a Lakeflow Spark Declarative Pipeline with streaming tables and pipeline expectations. It must run under pipeline-managed execution rather than as an interactive notebook or an ad hoc SQL workload.
Which compute configuration should the team use?
Options:
A. Job compute for a scheduled notebook task
B. All-purpose compute attached to a shared notebook
C. Pipeline compute for Lakeflow Spark Declarative Pipelines
D. SQL warehouse compute for Databricks SQL queries
Best answer: C
Explanation: Compute type should match the workload execution model. Interactive development typically uses all-purpose or shared compute, Databricks SQL workloads use warehouse compute, scheduled notebook or task runs use job compute, and Lakeflow Spark Declarative Pipelines use pipeline-managed compute. In this scenario, the defining requirement is not just streaming ingestion; it is that the workload is authored and operated as a Lakeflow Spark Declarative Pipeline with streaming tables and pipeline expectations. That makes pipeline compute the correct fit. Job compute can run scheduled tasks, but it does not make a notebook behave like a declarative pipeline with pipeline-managed execution semantics.
- All-purpose compute fits interactive development, not a pipeline-managed production execution requirement.
- SQL warehouse compute supports SQL queries and BI workloads, not Spark declarative pipeline execution.
- Job compute fits scheduled job tasks, but the stem requires Lakeflow Spark Declarative Pipeline execution.
Question 16
Topic: Secure and Govern Unity Catalog Objects
A data engineering team must investigate an unexpected change in a Unity Catalog table, prod.finance.transactions, and identify downstream tables that may have been affected. The compliance team also needs evidence of which principal ran the write operation. Which implementation best supports the investigation without changing table permissions?
Options:
A. Review only the Lakeflow Jobs run output for the failed task
B. Grant
SELECTon all downstream tables and compare row counts manuallyC. Use audit logs for the write event and Unity Catalog lineage for downstream impact
D. Add a row filter to
prod.finance.transactionsand rerun the pipeline
Best answer: C
Explanation: Audit logs and Unity Catalog lineage provide different governance evidence. Audit logs are used to determine who performed an operation, when it occurred, and what object or action was involved. Lineage is used to understand how data flows between tables, views, notebooks, jobs, and pipelines, including downstream objects that may depend on a changed table. In this scenario, the team needs both accountability evidence and impact analysis, but does not need to alter privileges or data-access controls. The key distinction is that audit logs prove activity, while lineage explains dependency and propagation risk.
- Manual row counts may help validate data values, but granting broad
SELECTchanges access and does not identify the writer. - Job output only can show task status, but it is not complete governance evidence for principal activity or downstream dependencies.
- Row filters enforce access rules, but they do not investigate who wrote data or which downstream objects were affected.
Question 17
Topic: Deploy and Maintain Data Pipelines and Workloads
A Lakeflow Job is configured to refresh a Gold table each day at 02:00 and pass run_mode=prod to its notebook task. The run history shows that last night’s scheduled run started on time but failed a data-quality check. The data operations group did not receive any notification.
Current setup:
| Setting | Value |
|---|---|
| Schedule | Daily at 02:00, enabled |
| Job parameter | run_mode=prod |
| Alerts | Notify job owner on success |
Which setup action best addresses the symptom?
Options:
A. Update the
run_modejob parameter value.B. Add a failure alert for the data operations group.
C. Change the daily schedule to a file arrival trigger.
D. Disable the schedule and run the job manually.
Best answer: B
Explanation: Lakeflow Jobs setup settings serve different purposes. A trigger or schedule controls when a job run starts, parameters pass values into tasks at run time, and alerts notify recipients about job events such as success, failure, or duration conditions. In this scenario, the run history proves that the schedule worked because the job started at 02:00. The parameter is also unrelated to the missing notification because the failure occurred after the task began. The visible gap is that alerts are configured only for success and only to the job owner, so the data operations group would not receive a message for a failed run.
The key diagnostic move is to match the symptom to the configuration category that can produce it.
- Changing the trigger does not address the missing notification because the scheduled run already started on time.
- Updating the parameter is unsupported by the evidence because
run_modeaffects task input, not alert delivery. - Manual runs avoid the configured schedule but do not create failure notifications for the operations group.
Question 18
Topic: Prepare and Process Data
A Lakeflow Jobs task loads immutable CDC files into a Silver Delta table. A run failed after partially writing the Silver table, and a repair run reprocessed the same source batch. The table now contains duplicate rows with the same order_id, change_seq, and source_file.
Exhibit:
Silver write mode: append
Retry behavior: reprocess failed batch
Source files: immutable CDC files
Duplicate keys: order_id + change_seq
Which remediation best preserves data correctness and repeatable processing behavior?
Options:
A. Increase the streaming watermark duration for the Silver task.
B. Run
OPTIMIZEbefore repairing the failed job.C. Enable schema evolution on the Silver table write.
D. Use an idempotent
MERGEkeyed byorder_idandchange_seq.
Best answer: D
Explanation: The visible issue is not late data, schema drift, or file layout; it is a non-idempotent append being retried after a partial write. For repeatable processing, the Silver load should be safe to run more than once for the same source changes. A deterministic MERGE keyed by the CDC business/change identifier, often with batch-level deduplication, makes retries update or ignore already-applied changes instead of appending duplicates. This preserves correctness when a repair run reprocesses the same immutable input batch.
The key takeaway is that retry-safe pipelines need idempotent write logic, not just faster or more permissive execution.
- Watermark tuning addresses late-arriving event-time data, not duplicate rows created by retrying an append.
- Schema evolution handles new or changed columns, but the duplicate keys show a retry/idempotency problem.
- OPTIMIZE first can improve file layout and query performance, but it does not change duplicate-producing write semantics.
Question 19
Topic: Deploy and Maintain Data Pipelines and Workloads
A data engineering team runs a nightly Lakeflow Job on Azure Databricks job compute. The team must continuously detect whether the cluster is underused, saturated, or inefficiently used, and notify operators without changing the transformation code. Which implementation best meets the requirement?
Options:
A. Stream cluster metrics/logs to Log Analytics and create Azure Monitor alerts
B. Grant Unity Catalog lineage access on the pipeline tables
C. Review the Spark UI manually after each failed run
D. Enable Photon and increase the maximum worker count
Best answer: A
Explanation: Cluster resource consumption monitoring should use runtime evidence such as CPU, memory, executor activity, disk or spill indicators, and logs from the compute running the workload. Streaming Azure Databricks logs and metrics to Log Analytics in Azure Monitor lets the team build queries and alerts that distinguish idle capacity, sustained saturation, and inefficient execution patterns without modifying the job logic. This fits an operational requirement because it is continuous and can notify operators automatically. Unity Catalog lineage explains data dependencies, while compute tuning changes such as adding workers or enabling Photon should be driven by monitoring evidence rather than used as the monitoring mechanism.
- Lineage access helps trace table dependencies but does not measure cluster CPU, memory, or executor behavior.
- Scaling first may mask the issue and does not identify whether the workload is underusing, overusing, or misusing compute.
- Manual Spark UI review can help troubleshoot a run but does not provide continuous alerting for operators.
Question 20
Topic: Prepare and Process Data
An Azure Databricks pipeline must ingest JSON order files that arrive every few minutes in a Unity Catalog volume. The target is a Delta table, and the pipeline must discover only newly arrived files after the initial load without manually listing the folder each run. Which implementation should you use?
Options:
A. Create an external table directly over the JSON files.
B. Run
COPY INTOon a fixed schedule against the folder.C. Use Auto Loader with
readStreamandcloudFiles, then write to the Delta table.D. Use a batch
spark.read.jsonjob that overwrites the table.
Best answer: C
Explanation: Auto Loader is designed for file-based ingestion when new files continue to arrive in cloud storage or Unity Catalog volumes. In Azure Databricks, Auto Loader uses the cloudFiles source with Structured Streaming to incrementally discover new files, track progress with checkpointing, and load records into a Delta table. This avoids repeatedly scanning or manually listing the full directory and supports reliable ongoing ingestion patterns. COPY INTO can be useful for simpler batch file loads, but Auto Loader is the better fit when files arrive continuously and should be discovered incrementally.
- Scheduled batch load can ingest files, but it does not best match continuous incremental file discovery.
- Batch overwrite risks replacing prior data and does not maintain streaming progress for new arrivals.
- External table exposes files in place but does not implement an ingestion process into a managed Delta target.
Question 21
Topic: Prepare and Process Data
An Azure Databricks ingestion job uses Auto Loader to process JSON order files into Unity Catalog. The job lands the parsed output under /Volumes/ops/raw/orders/. Downstream requirements are to query the ingested records with SQL, apply column masks, enforce data-quality constraints, and capture table lineage. The files arrive successfully, but the team cannot attach the required governance rules to the output object. What is the most likely root cause?
Options:
A. Auto Loader cannot ingest JSON files
B. The output is a volume, not a Delta table
C. Column masks require warehouse compute
D. The source path must be a foreign catalog
Best answer: B
Explanation: The core issue is the target object pattern for ingested data. Unity Catalog volumes are appropriate for governing access to files, such as raw landing files or unstructured data. They are not the right target when the processed records must be queried as relational data and governed with table features. For this requirement, the ingestion should write parsed records to a managed or external Delta table in a Unity Catalog catalog and schema, often as a bronze table. The original files can remain in a volume if file retention is required. The closest trap is thinking this is only a permission issue, but the visible evidence shows the object type cannot support the required table-level governance.
- JSON support is not the issue because Auto Loader can process JSON files.
- Compute choice does not explain why governance rules cannot attach to a file-based volume object.
- Foreign catalog is unrelated because the requirement is to store and govern ingested records in Unity Catalog.
Question 22
Topic: Deploy and Maintain Data Pipelines and Workloads
A Unity Catalog managed Delta table prod.sales.order_events is queried by Lakeflow Jobs. Most slow queries filter on customer_id and recent event_date, then join related customer rows. The table is not partitioned, customer_id has high cardinality, and ingestion must continue without rewriting the table into many physical partitions. Which implementation should you use?
Options:
A. Partition the table by
customer_idand compact daily files.B. Cache the table in each job before running the queries.
C. Enable liquid clustering by
customer_idandevent_date, then runOPTIMIZE.D. Run
VACUUMwith a shorter retention interval after ingestion.
Best answer: C
Explanation: Liquid clustering is the best fit when Delta queries repeatedly filter or join on columns whose related rows benefit from co-location, especially when those columns are too high-cardinality for traditional partitioning. In this scenario, clustering on customer_id and event_date helps Azure Databricks organize data for the dominant access pattern while preserving the table as a Unity Catalog managed Delta table. Running OPTIMIZE applies the clustering layout to improve file skipping and reduce unnecessary reads. The key distinction is that clustering improves data layout for query pruning, while partitioning by a high-cardinality column can create too many small partitions and operational overhead.
- High-cardinality partitioning fails because
customer_idwould create too many physical partitions and violate the stated constraint. - VACUUM cleanup removes old unreferenced files; it does not co-locate data for query access patterns.
- Job-level caching may help repeated reads in one run, but it does not optimize the persistent Delta table layout.
Question 23
Topic: Secure and Govern Unity Catalog Objects
A Unity Catalog table main.crm.customers contains customer_id, region, and tax_id. Analysts must be able to query all customer rows, but tax_id must be protected unless the user is in the pii_readers group. After a recent governance change, analysts report that row counts dropped from 50,000 to 8,000 even though they still have SELECT on the table.
What is the most likely root cause?
Options:
A. The analysts need ownership of the table
B. The
pii_readersgroup needsUSE CATALOGonlyC. The table must be converted to a materialized view
D. A row filter was used instead of a column mask
Best answer: D
Explanation: Unity Catalog column masks are used when users should retain access to rows and non-sensitive columns, but a specific column value must be hidden, redacted, or transformed based on identity or group membership. The symptom is a reduced row count, which points to row-level filtering rather than column-level masking. For this requirement, analysts should keep SELECT access to the table and see all customer records, while the tax_id column returns a protected value unless the user belongs to pii_readers.
The key diagnostic distinction is whether the policy changes rows or column values. Row filters affect which records are visible; column masks affect how sensitive field values are displayed.
- Table ownership is not required for analysts to query governed data when appropriate privileges and policies are already in place.
- Materialized view conversion does not address identity-based protection of one sensitive column in the base table.
- Catalog-only access is insufficient by itself and does not explain why visible row counts dropped.
Question 24
Topic: Prepare and Process Data
A data engineering team added a correction step to an Azure Databricks pipeline that updates existing customer records and must support rollback to prior table versions. The job fails on the first correction batch.
Exhibit: Failure details
Step: apply_customer_corrections
Statement: MERGE INTO raw.customers ...
Error: MERGE destination only supports Delta tables
DESCRIBE DETAIL raw.customers: format = parquet
What is the best root cause based on the visible evidence?
Options:
A. The target table uses Parquet instead of Delta
B. Photon acceleration is not enabled
C. The pipeline lacks Unity Catalog MODIFY privileges
D. Auto Loader schema evolution is disabled
Best answer: A
Explanation: Delta format is required when a table needs ACID table behavior, reliable updates such as MERGE, and time travel. The failure message states that MERGE supports Delta table targets, and DESCRIBE DETAIL shows the current target format is parquet. That directly explains why the correction step fails and why rollback to previous table versions would not be available for this table as designed. The diagnostic evidence points to a table-format mismatch, not a permissions, schema drift, or compute-performance issue. The table should be created or converted as a Delta table before using these behaviors.
- Permission issue is not supported because the visible error complains about table format, not denied privileges.
- Schema evolution is unrelated because the failing operation is an update/merge into a non-Delta target.
- Photon setting would affect performance, not whether
MERGEand time travel are supported.
Question 25
Topic: Set Up and Configure an Azure Databricks Environment
A team publishes an object for analysts in Unity Catalog. Analysts must receive access only to the published object, and the dashboard workload must use a precomputed aggregate that the pipeline can refresh.
Exhibit: DDL and job symptom
CREATE VIEW analytics.reporting.qtr_revenue AS
SELECT region, date_trunc('quarter', order_date) AS qtr, sum(amount) AS revenue
FROM prod.sales.orders
GROUP BY region, date_trunc('quarter', order_date);
GRANT SELECT ON VIEW analytics.reporting.qtr_revenue TO `analyst_group`;
Pipeline step: REFRESH MATERIALIZED VIEW analytics.reporting.qtr_revenue fails because the object cannot be refreshed as a materialized view.
What is the best root cause?
Options:
A. The object was created as a standard view, not a materialized view.
B. The aggregate query requires a foreign catalog to refresh.
C. The object is missing a
GRANT SELECTon the base table.D. The object should be created as a volume for governed access.
Best answer: A
Explanation: The DDL defines a standard view, which stores a query definition and evaluates against underlying data when queried. That can be useful for governed abstraction, because analysts can be granted access to the view instead of the base table. However, the processing requirement says the dashboard must use a precomputed aggregate that the pipeline can refresh. That requirement points to a materialized view, which stores computed results and supports refresh behavior. The failed REFRESH MATERIALIZED VIEW step matches the visible mismatch: the object name exists, but its object type is not materialized view. The fix is to create the object with materialized view DDL, not to change grants or storage location.
- Base-table grant is not indicated by the symptom; the failure is during a materialized-view refresh operation.
- Volume object fails because volumes govern file storage, not SQL aggregate query results.
- Foreign catalog is unrelated; the source table is already referenced and the issue is the published object’s type.
Questions 26-50
Question 26
Topic: Prepare and Process Data
A pipeline reads silver.sales_lines, where each row is an order line. Finance needs a Unity Catalog object named gold.daily_product_sales for a dashboard. The output must contain one row per sale_date, store_id, and product_id. The net_sales metric is extended line amount minus discounts and refunds; tax is excluded. The dashboard should query a precomputed object refreshed by the pipeline. Which output configuration should you use?
Options:
A. A materialized view grouped by
sale_date,store_id,product_id, andorder_idwith the samenet_salesformulaB. A gold table that appends order-line rows and leaves
net_salescalculation to the dashboardC. A materialized view grouped by
sale_date,store_id, andproduct_idwithnet_sales = SUM(quantity * unit_price - discount_amount - refund_amount)D. A view grouped by
sale_dateandstore_idwithnet_sales = SUM(quantity * unit_price + tax_amount - discount_amount)
Best answer: C
Explanation: The output object must match three requirements at the same time: table grain, metric definition, and consumption pattern. The stated grain is one row per date, store, and product, so those columns must be the grouping keys and no lower-level keys such as order_id should remain. The metric excludes tax and subtracts both discounts and refunds from the extended line amount. Because the dashboard needs a precomputed object refreshed by the pipeline, a materialized view is a better fit than a normal view or raw appended detail table. The closest traps either change the grouping level or push metric logic downstream.
- Wrong grain fails when
product_idis omitted ororder_idis added because the row level no longer matches the stated table grain. - Wrong metric fails when tax is included because the metric definition explicitly excludes tax.
- Wrong consumption pattern fails when raw line rows are exposed because the dashboard would recompute the governed aggregate itself.
Question 27
Topic: Deploy and Maintain Data Pipelines and Workloads
A Lakeflow Job has three tasks: ingest_orders → transform_orders → publish_orders. The ingest_orders task succeeded and loaded new files into a Unity Catalog Delta table. The transform_orders task failed because a required table privilege was missing. After the privilege is granted, you must complete the same failed job run without rerunning the successful ingestion task. Which action should you choose?
Options:
A. Stop the failed job run and wait for the next schedule.
B. Restart the job run from the beginning.
C. Repair the failed run starting at
transform_orders.D. Run the job now as a new run.
Best answer: C
Explanation: For a Lakeflow Job failure, use repair when the goal is to continue a failed multi-task run without repeating tasks that already succeeded. In this scenario, ingestion already loaded files into a Unity Catalog Delta table, so rerunning it could duplicate work or increase cost. Repair targets the failed task and any downstream tasks that need to run after it. Restarting or running the job again starts broader execution as a new or restarted run, which does not preserve the successful ingestion step in the same way. Stop is used to cancel active work, not to resume a failed run.
- Restarting broadly fails because it can rerun tasks that already succeeded, including the ingestion task.
- Stopping the run fails because the run is already failed and stopping does not complete remaining tasks.
- Running now fails because it starts a new job run instead of repairing the existing failed run.
Question 28
Topic: Prepare and Process Data
An Azure Databricks job loads main.gold.region_revenue from Unity Catalog tables main.silver.customers and main.silver.orders.
Requirement: output one row for every region in customers, sum only completed order amounts, and write 0 for regions with no completed orders. Which implementation preserves the required join and aggregation behavior?
Options:
A. Aggregate completed orders by region first, then overwrite the gold table with only the aggregated rows.
B. Left join from customers to orders, filter completed orders in a
WHEREclause, and group by region.C. Left join from customers to orders, filter completed orders in the join condition, group by region, and
COALESCE(SUM(amount), 0).D. Inner join customers to completed orders, group by region, and replace null sums with zero.
Best answer: C
Explanation: For this load, the preserved dataset is the list of regions from customers, not the list of matching orders. A LEFT JOIN from customers to orders keeps every customer-side row. The completed-order filter must be part of the join condition so that missing or noncompleted orders do not remove the customer region. After grouping by region, SUM(amount) is null only when no completed orders matched, so COALESCE converts that null to 0. Placing the order-status filter in WHERE would eliminate the null-extended rows and behave like an inner join for those regions.
- WHERE status filter fails because it removes rows where no order matched, losing regions with no completed orders.
- Inner join fails because it can only return regions that have at least one completed order.
- Aggregate orders first fails because regions without completed orders are absent unless joined back to the customer region set.
Question 29
Topic: Deploy and Maintain Data Pipelines and Workloads
An Azure Databricks team expects a Lakeflow Job to start when a vendor lands a file in a Unity Catalog volume. A file was added at 09:10, but no run began until the normal midnight run.
Job trigger summary
Trigger type: Scheduled
Schedule: Daily at 00:00 UTC
File arrival trigger: Not configured
Target path: /Volumes/raw/partner/incoming
What is the best root cause?
Options:
A. The job needs an email alert for missed runs
B. Only a time-based schedule is configured
C. The target path must be passed as a task parameter
D. The task needs a larger job compute cluster
Best answer: B
Explanation: Lakeflow Jobs start according to their configured triggers. In the visible configuration, the only trigger is a daily schedule at 00:00 UTC, and the file arrival trigger is explicitly not configured. Therefore, a file landing at 09:10 is not enough to start the workload immediately. To start from this event, configure an event-based file arrival trigger for the relevant Unity Catalog volume path. Alerts can notify about run states, and parameters can pass values to tasks, but neither changes when the job is launched.
- Compute size does not explain why no run was created when the file arrived.
- Email alerts notify operators but do not initiate workload execution.
- Task parameters can supply the path to code, but they do not create an event-based trigger.
Question 30
Topic: Prepare and Process Data
A Unity Catalog managed Delta table has deletion vectors enabled. A nightly Lakeflow Job runs a MERGE that updates or deletes about 1% of rows, and the team wants to keep that job fast. After several weeks, query profiles show extra read work on files with many deletion vectors. You need to implement maintenance that preserves fast row-level DML and eventually removes deleted row data according to the table retention policy. What should you do?
Options:
A. Disable deletion vectors before each nightly
MERGEso updates rewrite affected files immediately.B. Run only
OPTIMIZEon the table after each nightlyMERGE.C. Keep deletion vectors enabled; schedule
REORG TABLE ... APPLY (PURGE), then runVACUUMafter retention.D. Run
VACUUMimmediately after each nightlyMERGEto remove deleted rows.
Best answer: C
Explanation: Deletion vectors let Delta Lake record row-level deletes and updates without immediately rewriting data files, which helps small DELETE, UPDATE, and MERGE workloads finish faster. The tradeoff is that readers must apply those deletion vectors until affected files are rewritten. For maintenance, keep deletion vectors enabled for the nightly row-level workload, then periodically run REORG TABLE ... APPLY (PURGE) to rewrite files containing deletion vectors and physically remove the deleted rows from the rewritten data files. After the table’s retention period allows old files to be removed, run VACUUM to clean up obsolete files. This separates fast operational DML from scheduled physical cleanup.
- Disabling deletion vectors undermines the requirement to keep the nightly row-level
MERGEfast because affected files are rewritten during DML. - Using only OPTIMIZE may improve file layout, but it is not the targeted purge step for physically removing deletion-vector-marked rows.
- Immediate VACUUM does not rewrite files that contain deletion vectors and can conflict with the retention policy.
Question 31
Topic: Set Up and Configure an Azure Databricks Environment
A data engineering team has Unity Catalog Delta tables main.silver.orders and main.silver.customers. Analysts repeatedly need a joined, filtered projection for reporting. The team must avoid persisting another copy of the data, reflect changes in the underlying tables at query time, and let analysts query a named object without granting direct SELECT on the base tables. Which implementation should you use?
Options:
A. Create a materialized view and schedule refreshes for analysts.
B. Create a new Delta table with
CREATE TABLE AS SELECT.C. Create a notebook temporary view for the reporting query.
D. Create a Unity Catalog view and grant analysts access to the view.
Best answer: D
Explanation: A Unity Catalog view is the right object when users need a reusable logical query layer over underlying tables. The view stores the query definition, not a separate copy of the result data, so queries evaluate against the current state of the referenced tables. In Unity Catalog, analysts can be granted the required catalog/schema usage and SELECT on the view while the view owner maintains access to the underlying tables. This supports governed access to a curated projection without exposing the base tables directly.
Materialized views and CTAS tables persist derived data, while temporary views are session-scoped and not suitable as a shared reusable object.
- Materialized view fails because it stores precomputed results and depends on refresh behavior rather than purely querying current base data.
- CTAS table fails because it creates another persisted Delta table copy that can become stale.
- Temporary view fails because it is not a durable Unity Catalog object for shared analyst access.
Question 32
Topic: Secure and Govern Unity Catalog Objects
An external analytics vendor needs read-only access to two curated Unity Catalog objects: prod.sales.daily_orders_public and prod.ref.calendar. The vendor will use a non-Databricks client. The team must avoid granting Azure storage credentials, workspace access, or access to any other catalog objects. Which implementation should you use?
Options:
A. Provide a read-only Azure storage SAS token for the underlying table paths
B. Create a Delta Sharing share with only the two objects and assign it to a recipient
C. Grant the vendor
USE CATALOG,USE SCHEMA, andSELECTon the schemasD. Share the entire
prodcatalog and document which tables the vendor may query
Best answer: B
Explanation: Delta Sharing is the appropriate governance mechanism for sharing selected Unity Catalog data with an external recipient, including non-Databricks clients. The provider creates a recipient, creates a share, adds only the required tables or views, and grants that share to the recipient. This keeps access at the Unity Catalog object boundary and avoids exposing workspace permissions or cloud storage credentials. Granting catalog or schema privileges is for identities governed inside the metastore/workspace boundary, not for an external non-Databricks consumer. Sharing a broader catalog or raw storage paths breaks the least-privilege requirement.
- Workspace grants fail because the vendor should not receive workspace or metastore-level access.
- Whole-catalog sharing fails because it exposes more objects than the two required curated assets.
- Storage SAS access fails because it bypasses Unity Catalog governance and reveals underlying storage paths.
Question 33
Topic: Deploy and Maintain Data Pipelines and Workloads
An operations team triages production Azure Databricks workloads from Azure Monitor. Diagnostic logs are already streaming to Log Analytics. They need a proactive alert when a Lakeflow Jobs task reaches a failed terminal state. The alert must not depend on parsing notebook print output and must not be based on SQL query performance or Spark executor diagnostics. Which configuration best matches this requirement?
Options:
A. Create an alert by parsing job run output text.
B. Create an alert from query profile spill metrics.
C. Create an alert from pipeline event log expectations.
D. Create an Azure Monitor alert on failed task state events.
Best answer: D
Explanation: During workload triage, choose the evidence source that matches the operational question. A Lakeflow Jobs task reaching a failed terminal state is a task-state condition, so the alert should be based on job or task state events available through streamed diagnostics in Log Analytics and evaluated by Azure Monitor. Job run output is useful for debugging details after a failure, but parsing printed text is brittle and not the state source. Pipeline event logs help diagnose Lakeflow Spark Declarative Pipelines behavior, such as expectation failures. Query profiles and Spark UI evidence are for performance investigation, such as shuffles, spills, skew, and executor behavior. The key is to alert on the state signal, then use deeper evidence sources for root-cause analysis.
- Run output parsing fails because notebook output is diagnostic text, not the authoritative task-state signal.
- Pipeline event logs are useful for pipeline-specific events, not general Lakeflow Jobs task terminal states.
- Query profile metrics support SQL performance triage, not proactive failed-task alerting.
Question 34
Topic: Set Up and Configure an Azure Databricks Environment
A data engineering team must register Delta files that are written and lifecycle-managed by an external ingestion system in ADLS Gen2. An Azure Databricks job used this DDL:
CREATE TABLE bronze.raw_orders
USING DELTA
AS SELECT * FROM delta.`abfss://landing@acct.dfs.core.windows.net/orders/`;
The table is created, but DESCRIBE DETAIL bronze.raw_orders shows a Unity Catalog managed storage location instead of the abfss://landing.../orders/ path. What is the most likely root cause?
Options:
A. The table should have been created as a view.
B. The DDL omitted a
LOCATIONclause.C. The source path must be stored in a volume first.
D. The job needs a warehouse instead of cluster compute.
Best answer: B
Explanation: In Unity Catalog, a managed table stores its data in the catalog or schema managed storage location. If table metadata must reference data that is externally managed, create an unmanaged, or external, table by specifying the external data path in the table DDL with LOCATION. The shown CREATE TABLE AS SELECT statement reads from the abfss:// path but does not register that path as the table location, so Databricks writes the new table data to managed storage. The diagnostic evidence is the table location returned by DESCRIBE DETAIL, which confirms where the table is physically stored. The key takeaway is that reading from an external path is not the same as creating table metadata that points to that path.
- View workaround fails because a view stores query logic, not table metadata for Delta files at an external location.
- Volume requirement is a trap because volumes are for file access governance, not a prerequisite for registering an external Delta table.
- Compute type does not explain why Unity Catalog selected a managed storage location for the table.
Question 35
Topic: Deploy and Maintain Data Pipelines and Workloads
A nightly Lakeflow Job runs a PySpark notebook on Azure Databricks job compute to transform Unity Catalog Delta tables. The job must finish within 60 minutes and cannot use interactive all-purpose compute. You need to reduce compute cost without increasing failure risk.
Cluster consumption evidence:
| Metric | Observation |
|---|---|
| Current size | Fixed 16 workers |
| Run duration | 42-48 minutes |
| First 15 minutes | 14-16 active workers, high CPU |
| Remaining time | 2-4 active workers, CPU below 25% |
| Queue/start delays | None observed |
Options:
A. Reschedule the job to a later off-peak window.
B. Enable autoscaling from 4 to 16 workers on job compute.
C. Increase the fixed cluster size to 24 workers.
D. Move the notebook to an interactive shared cluster.
Best answer: B
Explanation: The evidence shows a variable workload, not a consistently undersized cluster. The first phase needs near the current 16-worker capacity, but most of the run uses only a few workers with low CPU. For job compute, autoscaling is the appropriate cost-control change because it can scale up for the initial parallel processing and scale down during lighter stages. The SLA is already met, and there are no queue or startup-delay symptoms, so changing job timing does not address the wasted capacity. Increasing the cluster size would raise cost without evidence of resource pressure. Moving to interactive shared compute violates the stated constraint and is not the right implementation for a scheduled production job.
- Larger fixed cluster fails because the job is not missing its SLA or showing sustained saturation.
- Interactive shared compute fails because the workload must remain on non-interactive job compute.
- Later schedule fails because no queueing or time-window contention is shown in the evidence.
Question 36
Topic: Set Up and Configure an Azure Databricks Environment
A Lakeflow Jobs task runs a PySpark enrichment workload on job compute. The job meets its SLA on normal days but misses it when the input volume doubles. The run does not fail.
Exhibit: Run evidence
| Signal | Value |
|---|---|
| Worker type | 4 vCPU, 16 GB |
| Workers | Fixed at 2 |
| Autoscaling | Disabled |
| CPU utilization | 90-98% for most of run |
| Memory pressure | Low; no OOM |
| Task backlog | High until near end |
Which compute configuration issue is the most likely root cause?
Options:
A. Worker nodes are too memory-constrained
B. Auto termination is stopping active runs
C. Too few fixed workers for CPU demand
D. The pool is causing slow task execution
Best answer: C
Explanation: The workload is CPU-bound, not memory-bound. Sustained CPU utilization near 100%, a high task backlog, and no out-of-memory or spill symptoms point to insufficient parallel CPU capacity. Because the cluster is fixed at two workers and autoscaling is disabled, the job cannot add workers when daily volume increases. The next configuration change would be to increase worker count, use a larger CPU-capable node type, or enable autoscaling with an appropriate maximum size for peak input days.
Pools mainly reduce cluster startup time, and auto termination affects idle clusters rather than actively running tasks.
- Memory constraint is not supported because the exhibit shows low memory pressure and no OOM condition.
- Pool issue is unlikely because pools affect startup latency, not sustained CPU saturation during execution.
- Auto termination does not explain a completed but slow active run because termination applies when compute is idle.
Question 37
Topic: Deploy and Maintain Data Pipelines and Workloads
A Lakeflow Job failed when the validate_orders task rejected a batch because of a data-quality expectation. The source file was corrected, and the job was repaired from the failed task. Before closing the incident, which evidence best demonstrates that the workload is healthy?
Options:
A. The job compute is running with no terminated workers.
B. The failed task now has a green status in the run graph.
C. The repaired run shows all dependent tasks succeeded and expectation metrics are within tolerance.
D. The corrected source file exists in the landing volume.
Best answer: C
Explanation: After a repair, the strongest health evidence is not just that the failed step restarted; it is that the repaired run completed through the dependent task path and the metric tied to the original failure is acceptable. Here, the incident was caused by a data-quality expectation, so the closing evidence should include successful downstream execution and expectation results within the required tolerance. Compute status or file presence can support triage, but they do not prove the workload produced valid outputs after the repair.
- Compute-only evidence is insufficient because healthy workers do not prove the repaired data path completed correctly.
- File presence confirms availability of an input, not successful validation or downstream processing.
- Single-task success can miss failures in dependent tasks or continued data-quality issues later in the run.
Question 38
Topic: Prepare and Process Data
A nightly Azure Databricks job updates gold.dim_customer. Analysts require reports to preserve the customer tier that was valid when each sale occurred. The job succeeds, but after a customer changes tier, historical sales now report the new tier.
Exhibit: Current merge logic
MERGE INTO gold.dim_customer AS t
USING silver.customer_changes AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
UPDATE SET tier = s.tier, updated_at = s.change_ts
WHEN NOT MATCHED THEN
INSERT (customer_id, tier, effective_from, is_current)
VALUES (s.customer_id, s.tier, s.change_ts, true);
What is the best root cause?
Options:
A. The source change feed is missing schema evolution.
B. The sales fact table is partitioned incorrectly.
C. The dimension table needs Unity Catalog row filters.
D. The merge applies Type 1 overwrite logic.
Best answer: D
Explanation: The analytical requirement calls for a slowly changing dimension that preserves historical attribute values, typically SCD Type 2. The shown MERGE updates the existing tier value when a matching customer_id arrives, so previous states are lost. A Type 2 implementation would expire the current row, set an effective_to value or is_current = false, and insert a new row with a new effective_from value. Historical fact joins can then use the sale timestamp against the effective date range. The key issue is not job failure; it is that the model’s change-handling logic does not match the required history behavior.
- Schema evolution is not indicated because the columns used by the merge are present and the job succeeds.
- Row filters control access to rows; they do not preserve prior dimension attribute values.
- Partitioning might affect performance, but it would not cause historical tier values to be overwritten.
Question 39
Topic: Secure and Govern Unity Catalog Objects
A Lakeflow Jobs task runs a reconciliation query against a Unity Catalog table. The task does not fail with a permission error, but the row-count check fails.
Principal: sp_recon_job
Grants: USE CATALOG, USE SCHEMA, SELECT on finance.transactions
Job query result: 12,640 rows, only region = 'EMEA'
Table owner result with same query: 48,902 rows, multiple regions
Selected columns: txn_id, region, amount
What is the best root cause to investigate?
Options:
A. Missing USE CATALOG on finance
B. A column mask on amount
C. Missing SELECT on the table
D. A row filter on the table
Best answer: D
Explanation: Unity Catalog fine-grained access controls can change query results without producing a permission error. A row filter applies a predicate to each query against a table, often based on the current user or group membership, so different principals can see different row sets from the same SQL. In this scenario, the service principal can resolve and query the table, so basic table and namespace privileges are present. The symptom is fewer rows and only one region, while the table owner sees all regions. A column mask would change values in a selected column, not remove entire regions from the result. The key diagnostic is to inspect the table’s row filter definition and the principal’s group membership or attributes used by that filter.
- Missing table access would cause an authorization error instead of a successful query with reduced rows.
- Column masking affects returned column values, not which regional rows are included.
- Missing namespace access would prevent resolving the table rather than allowing a partial result set.
Question 40
Topic: Deploy and Maintain Data Pipelines and Workloads
A data engineering team needs centralized operational analysis of Azure Databricks workload failures in Azure Monitor. A Lakeflow Job failed, but this Log Analytics query returns no Databricks job records.
Azure Databricks diagnostic setting
Destination: Log Analytics workspace law-prod
Selected categories: AllMetrics
Log query result: 0 rows
Job run output: failure visible in Azure Databricks UI
What is the best next diagnostic step?
Options:
A. Create an Azure Monitor alert on the job metric
B. Grant SELECT on the affected Unity Catalog tables
C. Enable Databricks log categories in the diagnostic setting
D. Increase the job cluster driver size
Best answer: C
Explanation: Log streaming to Log Analytics is configured through the Azure Databricks workspace diagnostic setting. In the exhibit, the destination is present, but only AllMetrics is selected. Metrics can support alerting, but they do not provide the workload log records needed for centralized operational analysis. The next diagnostic step is to verify and enable the relevant Azure Databricks log categories, such as job and cluster-related logs, to the Log Analytics workspace, then rerun or wait for ingestion and query again.
An alert can notify on signals after telemetry exists, but it does not make missing log records appear.
- Unity Catalog grant addresses table access, not whether operational logs are streamed to Azure Monitor.
- Driver sizing may affect runtime stability, but the visible symptom is missing centralized telemetry.
- Metric alert depends on available telemetry and does not enable log ingestion into Log Analytics.
Question 41
Topic: Secure and Govern Unity Catalog Objects
A Lakeflow Jobs task runs as service principal sp_ingest. The task reads JSON files from the Unity Catalog volume path /Volumes/retail/bronze/landing_files/orders/ and appends rows to the existing Delta table retail.bronze.orders_raw. The principal already has USE CATALOG on retail and USE SCHEMA on retail.bronze, but the run fails with access denied on the source volume and target table.
Which grants should you add with the least privilege?
Options:
A. Grant
SELECTon the volume andSELECTon the table.B. Grant
CREATE VOLUMEon the schema andCREATE TABLEon the schema.C. Grant
READ VOLUMEon the volume andMODIFYon the table.D. Grant
WRITE VOLUMEon the volume andOWNERSHIPon the table.
Best answer: C
Explanation: Unity Catalog separates object discovery, file access, and table modification privileges. The stem already provides the namespace privileges needed to resolve objects: USE CATALOG and USE SCHEMA. To read files stored in a UC volume, the service principal needs READ VOLUME on that volume. To append rows to an existing Delta table, it needs MODIFY on the table. It does not need ownership, schema-level create privileges, or volume write access because the task is not creating objects or writing files back to the volume.
The key is to grant the privilege that matches the securable object and action being denied.
- Table read privilege fails because
SELECTallows reading table data, not appending rows or reading a volume. - Create privileges fail because the task uses existing objects and does not create a table or volume.
- Excessive control fails because
OWNERSHIPandWRITE VOLUMEexceed the stated append-from-volume requirement.
Question 42
Topic: Prepare and Process Data
A Lakeflow Spark Declarative Pipeline uses Auto Loader to stream JSON order files from a Unity Catalog volume into bronze.ops.orders. Previous files loaded successfully, the pipeline principal has the required source and target privileges, and the team must keep streaming ingestion while retaining unexpected fields for later review.
Exhibit: Failure excerpt
Source: /Volumes/raw/ops/orders
Target: bronze.ops.orders
Mode: streaming Auto Loader
Error: UnknownFieldException
Detail: Encountered unknown field 'promoCode' in a new JSON file.
Current option: cloudFiles.schemaEvolutionMode = failOnNewColumns
Options:
A. Grant
WRITE VOLUMEon the source volume.B. Grant
MODIFYonbronze.ops.orders.C. Replace Auto Loader with scheduled
COPY INTO.D. Set
cloudFiles.schemaEvolutionModetorescue.
Best answer: D
Explanation: This failure points to schema handling, not source access, target permissions, or the ingestion processing mode. The pipeline already read earlier files and wrote to the Unity Catalog target, so the privileges and locations are not the deciding issue. The new file contains an unexpected JSON field, and the current Auto Loader setting fails the stream when new columns appear. Setting cloudFiles.schemaEvolutionMode to rescue preserves streaming ingestion and stores unexpected fields in rescued data for later review instead of failing the pipeline or changing the target schema immediately. Switching to batch ingestion would violate the stated streaming requirement.
- Source privilege change fails because the stream can already read the volume and the error identifies an unknown field, not denied access.
- Target privilege change fails because previous writes succeeded and the error occurs during schema interpretation.
- Batch replacement fails because
COPY INTOchanges the requested streaming Auto Loader pattern.
Question 43
Topic: Deploy and Maintain Data Pipelines and Workloads
A team uses a Lakeflow Job to ingest, validate, and publish daily orders. The validation notebook raises an exception when null order_id values are found, but the publish task still writes the day’s data.
Run summary:
ingest_orders: Succeeded
validate_orders: Failed - null order_id check
publish_orders: Succeeded
publish_orders dependency: ingest_orders
publish_orders run condition: All done
What is the best root cause?
Options:
A. The null check should run after publishing
B. The validation notebook lacks catalog privileges
C. The job compute terminated before publishing
D. The publish task bypasses validation success
Best answer: D
Explanation: Lakeflow Jobs task logic should make each task represent a clear unit of pipeline work and use dependencies plus run conditions to enforce the intended sequence. In this run, publishing depends on ingest_orders, not validate_orders, and All done permits the downstream task to start after the dependency completes regardless of success or failure. To prevent partial or invalid data from being published, the publish task should depend on the validation task and run only when validation succeeds. The key diagnostic evidence is the task dependency and run condition, not the validation logic itself.
- Privilege issue does not fit because the validation task failed on a data check, and publishing succeeded.
- Compute termination does not fit because the publish task completed successfully.
- Late validation reverses the intended pipeline control point; validation should gate publishing, not follow it.
Question 44
Topic: Prepare and Process Data
A Lakeflow Spark Declarative Pipeline ingests JSON order events into a Delta target table. The source system can republish the same event in a later file, and duplicates have the same event_id. Valid late duplicates can arrive up to 48 hours after the first event. Which configuration should you apply before writing to the target table?
Options:
A. Watermark on
event_timeand deduplicate byevent_idB. Partition the target table by
event_dateC. Use the Auto Loader checkpoint only
D. Add a
NOT NULLconstraint onevent_id
Best answer: A
Explanation: Duplicate handling should use a business key that identifies repeated records, such as event_id, before rows are written to the target table. In a streaming pipeline, adding a watermark on the event-time column lets Azure Databricks keep enough state to catch late duplicates while eventually removing old state. The 48-hour lateness requirement makes a watermark appropriate because it defines how long late duplicate records must still be considered. Checkpoints help track stream progress, but they do not remove duplicate records that the source republishes in different files.
- Checkpoint only tracks processed input progress, but it does not deduplicate records that appear again in later files.
- NOT NULL constraint validates presence of
event_id, but it does not enforce uniqueness across ingested rows. - Partitioning by date can improve pruning for some queries, but it does not prevent repeated records in the target.
Question 45
Topic: Prepare and Process Data
A data engineering team must build a medallion pipeline in Azure Databricks. New JSON files arrive continuously in a cloud storage path, and the team wants Azure Databricks to manage ingestion, transformation dependencies, data-quality expectations, and Unity Catalog target tables as one declared pipeline. Which implementation best preserves these constraints?
Options:
A. Schedule
COPY INTOcommands for each bronze and silver tableB. Run ad hoc Auto Loader notebooks on shared interactive compute
C. Create a Lakeflow Job that runs separate ingestion and transformation notebooks
D. Define a Lakeflow Spark Declarative Pipeline with streaming ingestion and expectations
Best answer: D
Explanation: Lakeflow Spark Declarative Pipelines fit scenarios where the pipeline should own and maintain the data flow definition: source ingestion, transformation dependencies, quality expectations, and target Unity Catalog tables. In this case, the requirement is not just to run tasks on a schedule; it is to declare a managed pipeline that continuously ingests files and maintains bronze and silver outputs with built-in dependency handling. Lakeflow Jobs can orchestrate tasks, but they do not by themselves provide the same pipeline-managed declaration model for tables and expectations. The key distinction is pipeline declaration and management, not simply execution scheduling.
- Job orchestration fails because a Lakeflow Job coordinates tasks but does not inherently define managed tables, expectations, and dependencies as one declarative pipeline.
- Scheduled COPY INTO fails because it is batch-oriented loading and does not provide the requested managed transformation pipeline.
- Ad hoc notebooks fail because interactive notebook runs do not preserve the requirement for a declared, maintained pipeline.
Question 46
Topic: Deploy and Maintain Data Pipelines and Workloads
A data engineering team packages Lakeflow Jobs and notebooks as a Databricks Asset Bundle. An Azure DevOps pipeline must deploy the same bundle to dev and prod from a noninteractive shell and trigger validation runs. The team wants to avoid writing custom HTTP calls. Which configuration choice best supports this requirement?
Options:
A. Use workspace UI import and manually run each job
B. Use Git integration to sync notebooks into the workspace
C. Use Databricks CLI bundle commands with service principal authentication
D. Call the Databricks REST API directly from custom scripts
Best answer: C
Explanation: The key requirement is automated command-line deployment and operation without custom HTTP code. Databricks CLI is designed for shell-based automation, including Databricks Asset Bundle deployment, target selection, validation, and job operations when authenticated with a noninteractive identity such as a service principal. REST API calls can automate the same platform, but they require custom request handling. Git integration helps manage source code, not deploy and operate the bundle by itself.
- Manual UI steps fail because the pipeline must deploy and trigger runs noninteractively.
- Direct REST calls are possible but contradict the requirement to avoid custom HTTP calls.
- Git sync alone manages notebook source, but it does not provide bundle deployment and operational commands.
Question 47
Topic: Deploy and Maintain Data Pipelines and Workloads
A team stores an Azure Databricks Lakeflow Jobs workflow and notebooks in Git and packages them with Databricks Asset Bundles. A change to a transformation notebook must be validated before it can be deployed to the production catalog. Tests must run against a nonproduction Unity Catalog schema, and production deployment must occur only after the pull request is approved and checks pass. Which implementation should the team use?
Options:
A. Run bundle tests in a PR target, then deploy production from main
B. Run local sample-data tests, then deploy production from the feature branch
C. Deploy to production, then run a scheduled validation job afterward
D. Merge first, then run manual tests against the production catalog
Best answer: A
Explanation: For Azure Databricks code and pipeline changes, tests should be part of the pull request validation path before production deployment. With Databricks Asset Bundles, define separate targets such as test and production, point the test target to a nonproduction Unity Catalog schema, and run automated test tasks or test notebooks during CI. The pull request should require successful checks and approval before merge. Production deployment should then run only from the protected main branch using the production bundle target. This preserves environment isolation and prevents unvalidated code from reaching production.
- Testing after merge fails because production can receive unvalidated code before the manual test result is known.
- Local-only testing fails because it does not validate the deployed Databricks workflow against the required nonproduction Unity Catalog schema.
- Post-deployment validation fails because it detects problems only after the production deployment has already occurred.
Question 48
Topic: Set Up and Configure an Azure Databricks Environment
A data engineer attaches a notebook to an existing shared compute resource and runs a PySpark cell that imports a third-party parsing package before reading a Unity Catalog table. The same notebook can run SELECT current_catalog() and can query main.sales.orders, but this cell fails.
ModuleNotFoundError: No module named 'xmltodict'
Which configuration change should the engineer make?
Options:
A. Grant
SELECTonmain.sales.ordersB. Install
xmltodicton the attached computeC. Create a new schema for the parsed data
D. Give the user permission to attach to the compute
Best answer: B
Explanation: A missing library problem is indicated by an import failure such as ModuleNotFoundError before the data read is the deciding issue. In this scenario, the notebook is already attached to compute because other commands run, and the user already has data access because the Unity Catalog table can be queried. The configuration fix is to make the required Python package available to the notebook’s execution environment, such as installing it on the attached compute according to the workspace’s library management approach. Compute-access issues usually prevent attaching or running commands, while data-permission issues appear when accessing protected objects.
- Table grant is unnecessary because the stem states the notebook can already query
main.sales.orders. - Compute attach access does not fit because the notebook is already attached and can run commands.
- New schema setup does not address a Python package import failure.
Question 49
Topic: Deploy and Maintain Data Pipelines and Workloads
A team uses Git-backed Azure Databricks notebooks and Databricks Asset Bundles. A developer opens a pull request from feature/order-tests into main, but the pull request validation tests do not start.
Exhibit: Pull request status
Status: Cannot merge automatically
Conflict: resources/jobs/orders_pipeline.yml
main: changed task parameter env=prod
feature/order-tests: changed the same task parameter env=dev-test
CI: skipped because the pull request is not mergeable
What is the best next diagnostic step?
Options:
A. Approve the pull request without running tests
B. Grant Unity Catalog privileges to the CI principal
C. Repair the last production Lakeflow Job run
D. Resolve the conflicting YAML in the feature branch
Best answer: D
Explanation: The core issue is a Git merge conflict, not a failed Databricks job or data-access problem. A pull request compares a source branch with a target branch and typically runs validation tests only when the proposed change is mergeable. Here, both main and feature/order-tests changed the same field in the same Databricks Asset Bundle job file, and the CI system explicitly skipped tests because the pull request is not mergeable. The next step is to resolve that conflict in the feature branch, update the pull request, and then allow the PR validation tests to run. Approval, job repair, or Unity Catalog troubleshooting would come after evidence points to those areas.
- Reviewer approval does not make a pull request mergeable when Git reports a file conflict.
- Job repair addresses a failed Lakeflow Job run, but the symptom is PR validation being skipped.
- Privilege grants would fit an access error, but no Unity Catalog permission failure is shown.
Question 50
Topic: Secure and Govern Unity Catalog Objects
A data governance team has published several Unity Catalog tables in prod.sales. Analysts already have the required privileges to read the approved tables, but they are choosing the wrong tables because names are similar. The team wants the tables to be easier to discover and understand in Catalog Explorer and search results without changing who can query the data. Which configuration should the team apply?
Options:
A. Add table descriptions, column comments, and tags
B. Grant
SELECTon the tables to the analystsC. Create row filters on the sensitive tables
D. Apply column masks to customer identifier columns
Best answer: A
Explanation: Unity Catalog discovery metadata helps users understand and find governed data assets, but it does not replace access-control enforcement. For a discoverability problem, table descriptions, column comments, and tags are the appropriate configuration because they add business context and classification information visible through discovery experiences. The stem states that analysts already have the needed read privileges and that the goal is not to change query access. Grants, row filters, and column masks are enforcement mechanisms used to allow, restrict, or transform access to data. Use metadata when the problem is clarity and searchability; use privileges, filters, or masks when the problem is authorization or data protection.
- Extra grants fail because the analysts already have the required read access, and granting
SELECTchanges authorization. - Row filters fail because they enforce row-level access rules rather than improving table discoverability.
- Column masks fail because they transform sensitive column values at query time, which is an access-control behavior.
Continue in the web app
Use IT Mastery for interactive Microsoft DP-750 practice with mixed sets, timed mocks, topic drills, explanations, and progress tracking.