Try 45 free Databricks Data Engineer Associate questions across the exam domains, with explanations, then continue with full IT Mastery practice.
This free full-length Databricks Data Engineer Associate practice exam includes 45 original IT Mastery questions across the exam domains.
These questions are for self-assessment. They are not official exam questions and do not imply affiliation with the exam sponsor.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Need concept review first? Read the Databricks Data Engineer Associate Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks and full IT Mastery practice.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
Try Databricks Data Engineer Associate on Web · View the full Databricks Data Engineer Associate practice page.
| Domain | Weight |
|---|---|
| Databricks Intelligence Platform | 10% |
| Development and Ingestion | 17% |
| Data Processing & Transformations | 21% |
| Productionizing Data Pipelines | 17% |
| Data Governance & Quality | 35% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.
Topic: Data Governance & Quality
A data engineer created a gold table with the following SQL.
CREATE OR REPLACE TABLE main.gold.daily_store_sales AS
SELECT o.store_id,
date(o.order_ts) AS order_date,
SUM(oi.quantity * oi.unit_price) AS sales
FROM main.silver.orders o
JOIN main.silver.order_items oi
ON o.order_id = oi.order_id
GROUP BY o.store_id, date(o.order_ts);
Which follow-up question is best answered with Unity Catalog lineage?
Options:
A. Which SQL warehouse size would reduce runtime?
B. If main.silver.order_items changes, what downstream tables or views are impacted?
C. Which workflow task failed and should be repaired?
D. Why did the join spend most time shuffling data?
Best answer: B
Explanation: Unity Catalog lineage is used for dependency tracing and impact analysis across governed data assets. The question about downstream tables or views affected by a source-table change is a lineage question, while shuffle analysis, warehouse sizing, and repairing failed tasks are performance or workflow topics.
Unity Catalog lineage is for tracing relationships between upstream and downstream data assets. In this SQL, main.gold.daily_store_sales is built from main.silver.orders and main.silver.order_items, so asking what other tables or views would be affected by a change to one of those sources is a dependency and impact-analysis question.
That differs from other Databricks question types:
A good rule is: if the question is “what depends on what?”, use lineage; if it is “why was it slow?” or “what run failed?”, use performance or workflow tools.
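The "what depends on what?" walk can be sketched in plain Python. The edge map and helper below are illustrative only, not a Unity Catalog API; real lineage answers would come from the lineage system tables or the Catalog Explorer UI, and the weekly table is a hypothetical extra downstream asset.

```python
from collections import deque

# Illustrative lineage edges: source table -> tables built from it.
# In Unity Catalog this information comes from lineage metadata,
# not from a hand-maintained dict like this one.
downstream = {
    "main.silver.orders": ["main.gold.daily_store_sales"],
    "main.silver.order_items": ["main.gold.daily_store_sales"],
    "main.gold.daily_store_sales": ["main.gold.weekly_store_sales"],  # hypothetical
}

def impacted_assets(table: str) -> set[str]:
    """Return every table reachable downstream of `table`."""
    seen, queue = set(), deque(downstream.get(table, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(downstream.get(current, []))
    return seen

print(impacted_assets("main.silver.order_items"))
```

The answer to option B is exactly this reachable set: every asset that would need revalidation if the source table changed.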
Topic: Data Processing & Transformations
A data engineer reviews this requirement note:
Target table: main.sales.daily_orders
Current state: already exists with historical data
Source: temp view new_orders
Requirement: add all rows from new_orders
Constraint: keep existing rows unchanged
Which SQL statement pattern best matches this requirement?
Options:
A. INSERT INTO main.sales.daily_orders SELECT * FROM new_orders
B. CREATE OR REPLACE TABLE main.sales.daily_orders AS SELECT * FROM new_orders
C. INSERT OVERWRITE main.sales.daily_orders SELECT * FROM new_orders
D. CREATE TABLE main.sales.daily_orders AS SELECT * FROM new_orders
Best answer: A
Explanation: INSERT INTO ... SELECT is the standard append pattern for an existing table. The exhibit says the table already exists and that historical rows must remain, so the correct choice must add new rows without replacing current data.
In Databricks SQL, INSERT INTO target SELECT ... is used to append query results to an existing table. That matches the exhibit exactly: main.sales.daily_orders already exists, and the requirement is to add rows from new_orders while leaving historical rows in place.
INSERT OVERWRITE does not append; it replaces the target table’s current contents with the query output. CREATE OR REPLACE TABLE ... AS SELECT also rebuilds the table from the select results, so it is a replace pattern rather than an append pattern. CREATE TABLE ... AS SELECT is meant for creating a new table, not writing additional rows into one that already exists.
For a simple add-rows requirement, the append form is the best match.
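The append-versus-replace distinction can be mimicked with a plain-Python sketch, with lists standing in for tables. This is an analogy for the semantics discussed above, not Spark SQL.

```python
# The "table" is just a list of rows; dates and amounts are made up.
existing_rows = [("2024-01-01", 10), ("2024-01-02", 12)]
new_orders = [("2024-01-03", 9)]

# INSERT INTO ... SELECT: append, so historical rows survive.
insert_into = existing_rows + new_orders

# INSERT OVERWRITE / CREATE OR REPLACE TABLE ... AS SELECT: replace,
# so only the query output remains and history is lost.
insert_overwrite = list(new_orders)

print(len(insert_into), len(insert_overwrite))  # 3 1
```

Only the append form satisfies the constraint that existing rows stay unchanged.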
Topic: Productionizing Data Pipelines
A data engineering team is setting up a scheduled ETL workflow. Based on the note below, which compute choice is the best fit?
Exhibit:
Workflow: orders_etl
Tasks: 3 notebook tasks
Schedule: every hour
Requirements:
- minimal operational overhead
- no manual cluster sizing or tuning
- Databricks should optimize compute automatically
- no special Spark configuration needed
Options:
A. Run it from a SQL warehouse
B. Run it on serverless compute
C. Use a new autoscaling jobs cluster
D. Use a shared all-purpose cluster
Best answer: B
Explanation: Serverless compute is the best match when a workflow should be hands-off and automatically optimized by Databricks. The exhibit describes standard scheduled notebook tasks with no special configuration needs, which is a strong serverless use case.
The core concept is choosing serverless compute when the team wants Databricks to manage provisioning, scaling, and optimization for a routine production workload. In the exhibit, the workflow is scheduled, uses standard notebook tasks, and explicitly says the team does not want to size or tune clusters.
That combination points to serverless compute because it reduces operational work while still supporting production ETL execution. A manually configured jobs cluster can run the workload, but it still requires decisions about cluster setup and lifecycle. A shared all-purpose cluster is aimed more at interactive development, and a SQL warehouse is not the right compute for a notebook-based ETL workflow.
The key takeaway is that serverless is the best fit when the goal is a managed, low-overhead workflow rather than custom cluster control.
Topic: Productionizing Data Pipelines
A data engineering team stores a Databricks job and a Lakeflow Spark Declarative Pipeline in one Databricks Asset Bundle. They need to deploy the same bundle to both a development workspace and a production workspace, use different workspace hosts, and keep production-specific overrides such as a schedule. Which bundle element best matches this deployment-structure requirement?
Options:
A. variables
B. resources
C. targets
D. workspace
Best answer: C
Explanation: targets define separate deployment environments in a Databricks Asset Bundle. They let the team reuse one set of resource definitions while changing environment-specific settings such as workspace hosts and production overrides. That matches the requirement to keep a single bundle for both dev and prod.
In Databricks Asset Bundles, targets are used to model deployment environments such as development and production. This is the right structural element when the same bundle should deploy the same job and Lakeflow Spark Declarative Pipeline to different workspaces while allowing environment-specific settings or overrides. A target can hold deployment-specific values like workspace configuration and resource overrides, so the shared resource definitions stay in one place.
The other elements solve different problems: resources describe what gets deployed, such as jobs or pipelines; workspace sets workspace-related configuration, but by itself it does not organize multiple environments in one bundle; and variables help parameterize values, but they do not replace the environment structure that targets provide. The key takeaway is to use targets for multi-environment deployment layout.
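The layout described above can be sketched in a databricks.yml. The bundle name, hosts, job name, and cron expression below are placeholders, not values from the question:

```yaml
bundle:
  name: orders_bundle            # hypothetical bundle name

targets:
  dev:
    workspace:
      host: https://dev-workspace.cloud.databricks.com    # placeholder host
  prod:
    workspace:
      host: https://prod-workspace.cloud.databricks.com   # placeholder host
    resources:
      jobs:
        orders_job:              # production-only override: add a schedule
          schedule:
            quartz_cron_expression: "0 0 6 * * ?"
            timezone_id: UTC
```

One set of resource definitions is shared, and each target layers in only what differs per environment.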
The workspace element sets workspace configuration, but it is not the main structure for separate dev and prod deployments.
The resources element defines jobs and pipelines, not the environment layout for deploying them.
The variables element can substitute values, but it does not create target-specific deployment sections and overrides.
Topic: Data Processing & Transformations
A Lakeflow Spark Declarative Pipeline update fails with AnalysisException: Table or view active_customers not found. One function creates active_customers as a temporary view, and a later dataset definition queries that view. The team says the temp-view code appears earlier in the notebook, so it should run first. What is the best fix?
Options:
A. Add Workflow dependencies between notebook tasks.
B. Declare the intermediate result as a Lakeflow dataset and reference it.
C. Move the temp-view code earlier in the notebook.
D. Use a larger cluster and rerun the update.
Best answer: B
Explanation: Lakeflow Spark Declarative Pipelines build execution order from the dependency graph between declared datasets. If a downstream transformation needs an upstream result, that dependency should be expressed in the pipeline definition instead of relying on notebook order or a temporary-view side effect.
Lakeflow Spark Declarative Pipelines are declarative, so Databricks plans execution from dataset references rather than from top-to-bottom notebook order. In this scenario, the downstream definition reads a temporary view that is created as a side effect, so the pipeline planner does not have a reliable declared dependency to follow.
Reordering code, adding workflow sequencing, or increasing compute does not fix the core issue: the transformation dependency is not explicitly represented in the pipeline.
Topic: Data Processing & Transformations
In a Lakeflow Spark Declarative Pipeline, a customers_silver transformation must use the output of customers_bronze. How should the pipeline definition express this dependency so execution is reliable?
Options:
A. Attach separate source notebooks in bronze-then-silver order.
B. Place the customers_bronze definition before customers_silver.
C. Prefix dataset names so bronze sorts before silver.
D. Define customers_silver to read from customers_bronze.
Best answer: D
Explanation: Lakeflow Spark Declarative Pipelines are declarative: they run transformations based on dataset dependencies, not on the order code appears. A downstream definition should explicitly read from the upstream dataset it needs so Databricks can build the correct execution graph.
In Lakeflow Spark Declarative Pipelines, the important signal is lineage in the transformation definitions. If customers_silver depends on customers_bronze, the downstream dataset should reference the upstream dataset in its query or DataFrame logic. Databricks then builds the dependency graph and executes transformations in the required sequence.
The key takeaway is that declarative pipelines use declared dependencies to determine execution, not implicit code ordering.
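That planning behavior can be illustrated with a small topological sort in plain Python. The dataset names echo the question, but the graph and ordering code are a conceptual sketch of dependency-driven execution, not the Lakeflow planner itself; the gold table is a hypothetical extra step.

```python
from graphlib import TopologicalSorter

# Each dataset lists the datasets it reads from; a declarative pipeline
# plans execution from references like these, not from where the
# definitions appear in a notebook.
reads_from = {
    "customers_bronze": set(),
    "customers_silver": {"customers_bronze"},
    "customers_gold": {"customers_silver"},  # hypothetical downstream table
}

order = list(TopologicalSorter(reads_from).static_order())
print(order)  # bronze before silver before gold, regardless of code order
```

Because customers_silver declares that it reads from customers_bronze, the bronze table is always materialized first, no matter how the source files are ordered.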
Topic: Data Governance & Quality
A Databricks provider on AWS shares a 20 TB Delta table with an Azure recipient using Delta Sharing. The recipient runs read-only queries each day, metadata exchanged by the share is small, and assume the recipient’s compute cost would be the same either way. In this scenario, what is the main cost driver?
Options:
A. Asset Bundle deployment of the sharing configuration
B. Unity Catalog permission checks on each shared query
C. Cross-cloud data egress for the bytes the recipient reads
D. Delta log metadata exchanged for the share
Best answer: C
Explanation: In cross-cloud Delta Sharing, the biggest variable cost usually comes from the data volume that leaves the provider’s cloud storage. Because the stem says metadata is small and compute cost is unchanged, the main cost driver is the egress for the bytes the recipient reads.
Delta Sharing is designed for secure read-only sharing, but in a cross-cloud scenario the major ongoing cost consideration is usually the data that must move between clouds. Here, the provider is on AWS, the recipient is on Azure, the table is large, and the recipient reads it regularly. The stem also removes two common distractions by stating that metadata is small and compute cost is unchanged.
That means the primary cost driver is cloud egress for the shared data read by the recipient. Permission checks and Delta log exchange are control-plane activities, but they involve much less data than repeated multi-terabyte reads. The key takeaway is that for simple cross-cloud sharing, data volume transferred is typically the dominant cost factor.
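A rough back-of-envelope shows why volume dominates. Both numbers below are assumptions chosen for illustration, not quoted cloud prices or figures from the question:

```python
# Assumed daily read volume from the shared table, in GB (illustrative).
gb_read_per_day = 500
# Assumed cross-cloud egress list price per GB (illustrative, not a real quote).
egress_rate_usd_per_gb = 0.09

monthly_egress_usd = gb_read_per_day * egress_rate_usd_per_gb * 30
print(round(monthly_egress_usd, 2))  # 1350.0
```

Even modest daily reads of a multi-terabyte table accumulate into a transfer bill that dwarfs the control-plane metadata exchanged by the share.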
Topic: Databricks Intelligence Platform
A company uses separate Databricks workspaces for data engineering and BI in the same Databricks account. An engineering workflow creates bronze.orders, but a downstream BI workflow task in the other workspace fails with:
TABLE_OR_VIEW_NOT_FOUND: bronze.orders
Current setup: table created in a workspace-local metastore
Need: one governed table, centralized permissions, lineage across both workspaces
Which next step best fixes the root problem?
Options:
A. Manage bronze.orders in Unity Catalog and grant access there.
B. Recreate bronze.orders in the BI workspace metastore.
C. Publish bronze.orders as a global temp view from the engineering job.
D. Repair the BI workflow run after the engineering job finishes.
Best answer: A
Explanation: This failure comes from workspace-scoped metadata, not from the workflow logic itself. Unity Catalog is the platform-level fix because it provides one governed table definition, centralized permissions, and lineage across workspaces in the same Databricks account.
The core issue is that the table exists only in a workspace-local metastore, so its metadata and governance are tied to one workspace. That makes this a platform-scope problem rather than a notebook, job, or SQL problem. Repairing the failed BI task might rerun compute, but it does not change where the table is registered or how access is governed.
Unity Catalog is Databricks’ centralized governance layer for data objects. Placing the table there lets multiple workspaces in the same account reference the same governed object, with consistent permissions and lineage. That directly addresses the stated goal of one table, centralized access control, and shared visibility. Recreating metadata separately in another workspace would still fragment governance instead of using the platform’s shared control plane.
Topic: Data Governance & Quality
A cleanup workflow drops and recreates a Unity Catalog table each night. After DROP TABLE, a legacy application that reads the same cloud storage directory starts failing because the files are gone. The team still wants Unity Catalog governance, but Databricks must not own the file lifecycle for this dataset. What is the best next step?
Options:
A. Register the dataset as an external table in Unity Catalog
B. Grant the legacy application more Unity Catalog permissions
C. Share the dataset through Delta Sharing instead
D. Recreate the dataset as a managed table in Unity Catalog
Best answer: A
Explanation: An external table is the right fit when Unity Catalog should govern a dataset but the data files must stay in externally owned storage. Because another application still needs those files after DROP TABLE, Databricks should not manage their lifecycle.
Managed and external tables differ mainly by who owns the underlying data files. A managed table stores data in Databricks-managed storage under Unity Catalog, so Databricks controls the file lifecycle; when the table is dropped, the underlying data is also removed. An external table registers data that already lives in an external cloud location, allowing Unity Catalog governance without transferring storage ownership to Databricks.
In this scenario, the key requirement is that the files must remain available to another application even after the table is dropped and recreated. That makes an external table the best choice. The closest distractor is the managed-table option, because it also provides governance, but it conflicts with the requirement that Databricks must not own or delete the files.
For a managed table, DROP TABLE can remove the underlying data.
Topic: Data Governance & Quality
A data engineering team stores a gold Delta table in Unity Catalog on AWS. Another team uses Databricks on Azure and needs read-only access to the latest data every day. The provider wants to avoid managing duplicate copies, and finance wants the design to account for both compute cost and any cross-cloud transfer cost. Which approach is best?
Options:
A. Use Delta Sharing and plan for cross-cloud egress costs.
B. Use Auto Loader to copy the table into Azure daily.
C. Use Databricks Asset Bundles to deploy the sharing workflow.
D. Use Lakehouse Federation so Azure queries the table in place.
Best answer: A
Explanation: Delta Sharing is the Databricks feature designed for governed, read-only sharing across clouds without maintaining extra copies. Even when it is the right sharing mechanism, cross-cloud access can still introduce egress or other data-movement charges, so cost planning must include more than compute.
Delta Sharing is the best fit because the data is already governed in Unity Catalog, the consumer only needs read-only access, and the provider wants to avoid duplicating datasets. It enables live sharing to another Databricks environment on a different cloud while keeping provider-side control of the shared data. The key concept is that cross-cloud sharing decisions are not based only on cluster or serverless compute cost. When data is read across cloud boundaries, network transfer or egress charges may also apply, depending on where the data resides and how much is read. That makes data location and access pattern part of the architecture decision, not just the sharing feature choice.
Topic: Development and Ingestion
A workflow task runs this notebook cell and fails immediately.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/Volumes/main/bronze/landing/events"))
Run details show: AnalysisException: Auto Loader can infer schema, but you must set cloudFiles.schemaLocation to store the schema.
The team wants to keep using Auto Loader with inferred schema. What is the best next step?
Options:
A. Replace the stream with COPY INTO
B. Configure cloudFiles.schemaLocation for the Auto Loader source
C. Configure only checkpointLocation on the sink
D. Increase cluster size for the workflow task
Best answer: B
Explanation: The run details already identify the root cause: Auto Loader is inferring schema without a cloudFiles.schemaLocation. Adding that option is the correct next step because it gives Auto Loader a persistent place to store schema metadata and support later schema changes.
This is a direct Auto Loader configuration issue, not a compute or permissions issue. When Auto Loader reads files and infers schema, it needs cloudFiles.schemaLocation so Databricks can store the discovered schema and manage future schema evolution for that source. The notebook output provides enough evidence on its own because it explicitly names the missing option, so the best next step is to add that option and rerun the task.
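A minimal sketch of the corrected options, assuming a hypothetical schema path; in a notebook these would be applied to the cloudFiles reader shown in the question:

```python
# Hypothetical schema location; any durable path the pipeline owns works.
autoloader_options = {
    "cloudFiles.format": "json",
    # The missing option named in the error: a persistent place for the
    # inferred schema and its later evolution.
    "cloudFiles.schemaLocation": "/Volumes/main/bronze/_schemas/events",
}

# In the notebook this would be applied as:
# spark.readStream.format("cloudFiles").options(**autoloader_options) \
#     .load("/Volumes/main/bronze/landing/events")
print(sorted(autoloader_options))
```

The key point is that the schema location belongs on the source options, separate from any checkpoint configured on the sink.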
Set the cloudFiles option cloudFiles.schemaLocation to a durable location. A sink checkpoint tracks processing state, but it does not replace Auto Loader’s schema storage.
COPY INTO changes the ingestion pattern instead of fixing the Auto Loader configuration the team wants to keep using.
Topic: Development and Ingestion
A bronze ingestion job runs every 15 minutes using the code below. New JSON files land in the folder throughout the day. Last night the source system added a new optional column, and the next run failed when appending to the bronze table. The team wants to process only new files and avoid frequent manual schema updates.
df = spark.read.json("/Volumes/main/raw/orders/")
df.write.mode("append").saveAsTable("main.bronze.orders_raw")
What is the best next step?
Options:
A. Use Auto Loader with cloudFiles and a schemaLocation.
B. Set mergeSchema on the write and keep spark.read.
C. Overwrite the bronze table after inferring schema each run.
D. Replace the job with COPY INTO on a schedule.
Best answer: A
Explanation: Auto Loader is the best fit when new files arrive continuously and the source schema can change over time. It incrementally discovers only new files and stores schema information so compatible schema evolution can be managed more safely than repeated full-directory scans.
This scenario matches Auto Loader’s core use case: ongoing file ingestion plus evolving source schemas. A plain spark.read batch job scans the directory again each run and does not maintain incremental file-discovery state for newly arrived files. When the source adds a column, the pipeline also needs a better way to track and evolve schema over time.
With Auto Loader, you typically use cloudFiles and specify a schemaLocation so Databricks can persist inferred schema metadata and process new files incrementally. That reduces manual operational work and avoids redesigning the ingestion pattern each time the source adds a compatible field.
The key takeaway is that schema evolution plus continuous file arrival is a strong signal to use Auto Loader.
COPY INTO on a schedule can be incremental, but it is not the best match when the main need is managed continuous file discovery with schema tracking.
mergeSchema on the write can help Delta accept new columns, but it does not solve incremental discovery of only new files.
Topic: Data Processing & Transformations
In a Databricks Medallion Architecture, which type of data typically belongs in the silver layer?
Options:
A. Cleansed, validated, and conformed data for downstream processes
B. Raw source data landed with minimal transformation
C. Business-level aggregates prepared for reporting and dashboards
D. Read-only external data shared directly with other organizations
Best answer: A
Explanation: The silver layer is the stage where bronze data is refined into higher-quality datasets. It is typically cleansed, validated, and conformed so downstream transformations, analytics, and data products can use it reliably.
In Medallion Architecture, the silver layer sits between raw ingestion and business-ready presentation. Bronze usually stores source data with minimal changes, while silver improves that data by applying cleaning, validation, deduplication, standardization, and conformance rules. This makes silver the common place for trusted intermediate datasets that downstream processes can build on.
The gold layer is typically where data is further shaped into curated, business-level tables such as aggregates, KPIs, or reporting models. A sharing mechanism or external source access pattern is not itself a Medallion layer. The key idea is that silver is where data quality and consistency are established before broader consumption.
Topic: Development and Ingestion
A data engineer wants to write and test PySpark code in a local IDE but have the code execute on Databricks compute instead of the local machine. Which Databricks feature is designed for this workflow?
Options:
A. Delta Sharing
B. Databricks Connect
C. Databricks Asset Bundles
D. Auto Loader
Best answer: B
Explanation: Databricks Connect is the workflow for using local development tools such as an IDE while executing code on Databricks compute. It supports local development against remote Databricks resources rather than local Spark execution.
The core idea is separating where you write code from where it runs. Databricks Connect lets a developer use local tools, such as an IDE, for coding and testing while the actual execution happens on Databricks compute. That makes it the right choice when a team wants familiar local development workflows but still needs Databricks-backed execution.
This is different from features that package deployments, ingest files, or share datasets. The deciding clue is the combination of local development tools and remote Databricks compute. If the need is interactive development from a local environment against Databricks, Databricks Connect is the intended workflow.
The closest distractor is deployment tooling, which helps ship resources to Databricks but does not provide this local-to-remote development experience.
Topic: Data Governance & Quality
A data engineering team publishes a gold sales table in Unity Catalog. An external partner on a different analytics platform needs read-only access to the latest data, and the team wants centralized governance without recurring export or full-copy workflows. What is the best next action?
Options:
A. Schedule daily Parquet exports to partner-owned cloud storage
B. Create a Delta Sharing share for the gold table
C. Use Lakehouse Federation so the partner can query the table
D. Deploy Auto Loader to sync files to the partner
Best answer: B
Explanation: Delta Sharing is designed for governed data sharing when consumers need fresh access without manual copy pipelines. It lets the provider share a Unity Catalog table directly while keeping control over the shared data.
Delta Sharing is the Databricks capability for sharing live governed data with internal or external recipients. In this scenario, the partner needs read-only access to the latest version of a gold table, and the provider wants to avoid building recurring exports or maintaining duplicate full copies. A Delta Sharing share meets those requirements because the provider manages access centrally and the recipient reads current shared data through the sharing mechanism rather than through a hand-built file-delivery workflow. This is the simplest governed solution for cross-platform consumption.
The key takeaway is that Delta Sharing solves outbound governed data access, while ingestion and federation features solve different problems.
Topic: Data Processing & Transformations
A team ingests raw order events into orders_bronze and defines this Lakeflow Spark Declarative Pipelines table:
CREATE OR REFRESH STREAMING TABLE orders_target AS
SELECT
order_id,
customer_id,
CAST(order_ts AS TIMESTAMP) AS order_ts,
UPPER(country_code) AS country_code,
CAST(total_amount AS DECIMAL(10,2)) AS total_amount
FROM STREAM orders_bronze
WHERE order_id IS NOT NULL
AND total_amount >= 0;
Which description best fits orders_target in the Medallion Architecture?
Options:
A. A gold table with aggregated business metrics for reporting.
B. A silver table with cleansed, validated data for downstream use.
C. A bronze table holding raw source records with minimal changes.
D. A source table queried in place through Lakehouse Federation.
Best answer: B
Explanation: The table is built from bronze data but applies validation and standardization before storing it. That matches the silver layer, which holds refined data that downstream transformations and analytics can reliably reuse.
In the Medallion Architecture, bronze stores raw ingested data, silver stores cleansed, validated, or conformed data, and gold stores business-ready aggregates or serving tables. This query reads from orders_bronze, filters out invalid rows, converts fields to useful data types, and standardizes country_code. Those are classic silver-layer actions because they improve data quality while keeping the data at a detailed record level for later joins, enrichment, and analysis. A gold table would usually summarize or model the data for a specific reporting need, while a bronze table would preserve the source data with far fewer changes.
The table is built from orders_bronze, not queried in place through federation.
Topic: Data Processing & Transformations
A Databricks Workflow task calculates daily sales metrics from a PySpark DataFrame with one row per order line. A downstream validation task fails because daily_order_count is larger than the true number of orders on days when a single order has multiple lines.
daily = (transactions
    .groupBy("sales_date")
    .agg(
        F.count("order_id").alias("daily_order_count"),
        F.sum("line_total").alias("daily_revenue")
    )
)
What is the best fix?
Options:
A. Use countDistinct("line_total") for daily_revenue.
B. Use countDistinct("order_id") for daily_order_count.
C. Use count("line_total") for daily_revenue.
D. Use sum("order_id") for daily_order_count.
Best answer: B
Explanation: The metric is wrong because the data is at order-line grain, not order grain. Here, count("order_id") counts line rows, so the order-count metric should use countDistinct("order_id"), while sum("line_total") remains the correct way to total revenue.
In PySpark aggregations, count counts non-null values, countDistinct counts unique non-null values, and sum adds numeric values. Here, each row represents an order line, so multiple rows can share the same order_id. That means count("order_id") overstates the number of orders whenever an order has more than one line item.
In short, count("order_id") measures non-null order ID entries, countDistinct("order_id") measures unique orders, and sum("line_total") measures total revenue. The right fix is to change only the order-count metric to use distinct order IDs; revenue should remain a sum of the line amounts.
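The grain problem can be reproduced with plain Python over a few sample order-line rows; the data is illustrative, not PySpark, but the three aggregates mirror count, countDistinct, and sum.

```python
# Order-line rows: two lines share order_id "A-1", so the true order count is 2.
lines = [
    {"order_id": "A-1", "line_total": 10.0},
    {"order_id": "A-1", "line_total": 5.0},
    {"order_id": "B-2", "line_total": 7.5},
]

# What count("order_id") does: counts non-null values, i.e. line rows.
count_rows = sum(1 for r in lines if r["order_id"] is not None)
# What countDistinct("order_id") does: counts unique orders.
count_orders = len({r["order_id"] for r in lines if r["order_id"] is not None})
# What sum("line_total") does: totals revenue across lines.
revenue = sum(r["line_total"] for r in lines)

print(count_rows, count_orders, revenue)  # 3 2 22.5
```

At order-line grain the plain count overstates orders (3 vs. 2), while the revenue sum is already correct.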
order_id is an identifier, not a numeric measure to total. count("line_total") returns how many non-null amounts exist, not the revenue amount. countDistinct("line_total") counts unique prices or line totals, not money earned.
Topic: Productionizing Data Pipelines
A Databricks workflow run failed after a Unity Catalog permission issue. The engineer grants the missing privilege and wants to avoid rerunning work that already succeeded.
Task status
ingest_bronze SUCCESS
transform_silver FAILED (PERMISSION_DENIED)
publish_gold UPSTREAM_FAILED
send_summary UPSTREAM_FAILED
refresh_lookup SUCCESS
Dependencies
transform_silver <- ingest_bronze
publish_gold <- transform_silver
send_summary <- publish_gold
refresh_lookup <- none
What is the best next step?
Options:
A. Rerun ingest_bronze first, then manually rerun all later tasks.
B. Rerun only transform_silver and leave downstream tasks untouched.
C. Start a new run so every task executes from the beginning.
D. Repair the run so transform_silver and its downstream tasks rerun.
Best answer: D
Explanation: In Databricks Workflows, rerun scope should follow the dependency graph. After the permission problem is fixed, the right approach is to repair the run starting from the failed task so only that branch of dependent work is rerun.
The key concept is that Databricks workflow reruns should be driven by task dependencies, not by rerunning everything. Here, transform_silver failed, and publish_gold and send_summary did not complete because they depend on it. ingest_bronze already succeeded, and refresh_lookup is independent, so neither needs to run again.
A repair run is the best fit because it targets the failed path: it reruns transform_silver after the permission is fixed, along with the downstream tasks that were blocked by it. That is safer and more efficient than restarting the entire workflow, and it avoids missing downstream work that still depends on the repaired task.
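The repair scope can be sketched as a downstream walk from the failed task. The edge map mirrors the exhibit, but the helper is a conceptual illustration of how repair scope follows the dependency graph, not the Workflows API.

```python
# Dependency edges from the exhibit: task -> tasks that depend on it.
dependents = {
    "ingest_bronze": ["transform_silver"],
    "transform_silver": ["publish_gold"],
    "publish_gold": ["send_summary"],
    "refresh_lookup": [],
}

def repair_scope(failed: str) -> set[str]:
    """Failed task plus everything downstream of it; successes stay untouched."""
    scope, stack = set(), [failed]
    while stack:
        task = stack.pop()
        if task not in scope:
            scope.add(task)
            stack.extend(dependents.get(task, []))
    return scope

print(sorted(repair_scope("transform_silver")))
```

Note that ingest_bronze and the independent refresh_lookup fall outside the scope, which is exactly why a repair run beats a full rerun here.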
Topic: Productionizing Data Pipelines
A daily Databricks workflow has the task dependency chain ingest_orders -> build_bronze -> build_silver -> publish_gold -> notify_bi. The team confirms outputs from completed tasks are still valid and should not be recomputed.
Exhibit:
ingest_orders SUCCESS
build_bronze SUCCESS
build_silver SUCCESS
publish_gold FAILED
notify_bi UPSTREAM_FAILED
Cause: SQL warehouse temporarily unavailable
What is the best next action?
Options:
A. Redeploy the workflow with Databricks Asset Bundles
B. Delete successful outputs and rerun from build_bronze
C. Use Repair run for publish_gold and notify_bi
D. Start a new run from ingest_orders
Best answer: C
Explanation: This is a partial workflow failure with valid upstream outputs. In Databricks Workflows, the best recovery action is a Repair run so only the failed task and the task blocked by it are rerun.
A Repair run is designed for cases where part of a workflow succeeded and only later tasks need recovery. Here, ingest_orders, build_bronze, and build_silver already finished successfully, and the stem says their outputs are still valid. publish_gold failed because of a temporary availability issue, and notify_bi did not run only because its dependency failed.
Using Repair run lets Databricks rerun the failed task and the downstream task affected by that failure, without recomputing the successful upstream work. A brand-new run would repeat unnecessary processing. Redeploying a Databricks Asset Bundle is for changing or promoting workflow definitions, not for recovering one transient failed run. Deleting good outputs and rebuilding earlier stages adds cost and risk without improving correctness.
When upstream results remain valid, prefer repairing the run over rerunning the whole workflow.
Topic: Data Governance & Quality
A data engineering team uses Auto Loader to ingest JSON files into bronze tables and a scheduled SQL workflow to update silver tables in Unity Catalog. During a governance review, the security team asks which users read or modified data-related assets during the last 30 days. The team needs evidence of user actions, not upstream/downstream relationships. Which Databricks capability should they use?
Options:
A. Audit logs
B. Delta table history
C. Lakeflow pipeline event logs
D. Unity Catalog lineage
Best answer: A
Explanation: Audit logs are the best fit because the request is about who accessed or changed assets over time. That is an activity-tracking need, which audit logs are built to capture across Databricks environments and governed assets.
The key concept is distinguishing activity auditing from data dependency tracking. When a team needs to know which users or services accessed or changed data-related assets, the right governance artifact is audit logs. Audit logs capture recorded actions and events, making them appropriate for security reviews, investigations, and compliance reporting.
Lineage answers a different question: how data moves between sources, tables, notebooks, and jobs. Delta table history is narrower because it focuses on changes to a specific Delta table, not broad access activity across assets. Pipeline event logs help troubleshoot pipeline execution, not user access patterns. The best match is the artifact that records user actions across the platform.
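The filtering the security team needs can be sketched as follows. This is an illustrative model only: real Databricks audit logs are delivered as JSON with service, action, and identity fields, and the field names below are simplified stand-ins, not the actual audit log schema.

```python
# Hedged sketch: find users who read or modified data assets in the
# last 30 days. Action names and record layout are invented for
# illustration; they do not match the real audit log schema.
from datetime import datetime, timedelta

DATA_ACTIONS = {"getTable", "updateTable", "deleteTable"}

def recent_actors(events, now, days=30):
    """Return users with data read/modify events inside the window."""
    cutoff = now - timedelta(days=days)
    return {
        e["user"]
        for e in events
        if e["action"] in DATA_ACTIONS and e["time"] >= cutoff
    }

now = datetime(2024, 6, 30)
events = [
    {"user": "ana@example.com", "action": "getTable",    "time": datetime(2024, 6, 20)},
    {"user": "bob@example.com", "action": "updateTable", "time": datetime(2024, 6, 25)},
    {"user": "old@example.com", "action": "getTable",    "time": datetime(2024, 4, 1)},   # outside window
    {"user": "ci@example.com",  "action": "createJob",   "time": datetime(2024, 6, 28)},  # not a data action
]
print(sorted(recent_actors(events, now)))
# ['ana@example.com', 'bob@example.com']
```

The point of the sketch is the shape of the question: it is answered by scanning recorded activity over time, which is what audit logs hold and lineage does not.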
Topic: Data Governance & Quality
A scheduled workflow that reads a Unity Catalog table started failing. The team needs evidence of who attempted the access and when so they can confirm the failure came from a governed action inside Databricks.
Task state: Failed
Error: PERMISSION_DENIED
Message: User does not have SELECT on table main.finance.payroll
What should the data engineer check first?
Options:
A. Current grants on the main.finance.payroll table
B. Unity Catalog lineage for the main.finance.payroll table
C. Delta table history for the main.finance.payroll table
D. Databricks audit logs for the denied access event
Best answer: D
Explanation: Audit logs are designed to capture governed activity in Databricks, including access attempts and denials. For a Unity Catalog PERMISSION_DENIED failure, they provide the evidence trail for who tried to access the table and when the event occurred.
When the issue is a Unity Catalog permission error and the team needs evidence of the governed action, audit logs are the right source. They record activity in the Databricks environment, including authorization-related events, with details such as principal, object, timestamp, and outcome. That makes them appropriate for confirming that a specific access attempt was denied.
Current grants help compare intended permissions against the failure, but they do not prove that an access attempt happened. Lineage explains data dependencies, not access events. Delta table history tracks committed changes to the table, not read attempts or permission denials. The key distinction is evidence of activity versus metadata about the object.
Only audit logs record the denied SELECT attempts.
Topic: Data Governance & Quality
A team wants to separate data-governance decisions from production workflow decisions in Databricks. Which task is a Unity Catalog permission concern rather than a workflow or deployment concern?
Options:
A. Deploy job definitions across environments with Databricks Asset Bundles.
B. Grant SELECT on a Unity Catalog table to an analyst group.
C. Repair a failed workflow run without rerunning successful tasks.
D. Schedule a Databricks workflow to run hourly.
Best answer: B
Explanation: Unity Catalog privileges control access to governed data objects such as catalogs, schemas, and tables. Scheduling, deployment, and repair behavior belong to Databricks workflow or deployment features, not to Unity Catalog permissions.
Unity Catalog is the governance layer for Databricks data objects. Its privileges determine who can discover or use objects and what actions they can perform, such as reading a table with SELECT. That makes granting table access a governance decision. By contrast, Databricks Workflows handle job scheduling and run management, and Databricks Asset Bundles package and deploy resources across environments. Those tools control how data pipelines are executed or promoted, not who is authorized to access a catalog, schema, or table.
The key distinction is that Unity Catalog answers access-control questions, while workflow and deployment features answer execution and release-management questions.
Topic: Data Governance & Quality
In Unity Catalog, which role is primarily used for governance administration at the metastore level, such as creating catalogs and overseeing access across workspaces attached to that metastore?
Options:
A. Metastore admin
B. Catalog owner
C. Workspace admin
D. Schema owner
Best answer: A
Explanation: Metastore admin is the top Unity Catalog governance role for a metastore. When the requirement is centralized administration across attached workspaces, including catalog creation and broad access oversight, that scope is higher than a workspace role or ownership of a single catalog or schema.
Unity Catalog uses scopes, and the metastore is the highest governance boundary for data objects managed together across attached workspaces. Because of that, the metastore admin role is the one associated with centralized governance administration at that level, including tasks such as creating catalogs and managing metastore-wide governance configuration.
The key clue is scope: if the scenario emphasizes top-level governance for many data assets or multiple attached workspaces, think metastore admin.
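The scope rule can be modeled in a few lines. This is a simplified mental model, not the real Unity Catalog permission engine: a role at one level of the object hierarchy covers everything nested beneath it.

```python
# Hedged sketch of Unity Catalog's scope hierarchy
# (metastore > catalog > schema > table). The rule modeled: an
# admin/owner role at one level governs objects at or below it.
SCOPES = ["metastore", "catalog", "schema", "table"]

def governs(role_level: str, object_level: str) -> bool:
    """True when a role at role_level covers objects at object_level."""
    return SCOPES.index(role_level) <= SCOPES.index(object_level)

assert governs("metastore", "catalog")    # metastore admin can create catalogs
assert governs("catalog", "schema")       # catalog owner covers its schemas
assert not governs("schema", "catalog")   # schema owner cannot govern a catalog
```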
Topic: Databricks Intelligence Platform
An engineering lead shares this planning note:
Need one governed environment to:
- ingest files and streams
- build bronze, silver, and gold tables
- schedule production pipelines
- let analysts use SQL and engineers use Python
- share curated data with external partners
Which Databricks concept best matches this need?
Options:
A. Build one Lakeflow Spark Declarative Pipeline
B. Create one shared ETL notebook
C. Tune one Databricks SQL query
D. Adopt the Databricks Data Intelligence Platform end to end
Best answer: D
Explanation: The note covers multiple stages of the data lifecycle and multiple user groups. That points to the Databricks Data Intelligence Platform as an end-to-end governed solution, not to a single notebook, pipeline, job, or SQL statement.
The key clue is scope. A single notebook, pipeline, job, or SQL query solves one task, but the exhibit combines ingestion, medallion-style transformations, governance, production scheduling, SQL access, Python development, and external sharing. In Databricks, that is a platform-level scenario: the Databricks Data Intelligence Platform brings these capabilities together in one governed environment.
When a question mentions several teams and several workflow stages at once, the best interpretation is usually the overall platform value. Specific features such as Lakeflow Spark Declarative Pipelines, notebooks, or Databricks SQL may be used within that platform, but none of them alone explains the full request.
Topic: Data Processing & Transformations
A team uses Auto Loader to ingest partner JSON files directly into a single Delta table that is also queried by dashboards. After a source change adds new fields and a few malformed records, the ingestion job continues, but a finance dashboard fails and another team can no longer reliably reuse the data for detailed analysis or reprocessing.
What is the best next step to reduce these recurring issues?
Options:
A. Separate the pipeline into Bronze, Silver, and Gold tables
B. Move the dashboard workload to a larger SQL warehouse
C. Allow analysts to correct bad records directly in the ingestion table
D. Delete the saved Auto Loader schema and re-run ingestion
Best answer: A
Explanation: The best fix is to stop using one table for raw ingestion, cleanup, and reporting. Medallion layers improve clarity by separating purposes, improve reuse by preserving raw records, and improve quality by applying validation and standardization before business users query the data.
This scenario shows why combining raw ingestion and reporting in one table creates fragile pipelines. In a medallion design, Bronze stores the raw landed data with minimal changes, Silver applies cleansing, schema standardization, and deduplication, and Gold exposes business-ready tables for dashboards. That separation means source changes or malformed records can be handled in earlier layers without immediately breaking downstream consumers.
It also improves reuse because different teams can use the right layer for their needs: engineers can reprocess from the raw Bronze data, analysts can run detailed analysis on cleaned Silver tables, and dashboards read stable, business-ready Gold tables.
The key idea is not bigger compute or manual fixes in place, but clearer layer boundaries so data quality and consumer expectations are managed explicitly.
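The layer separation can be sketched as a small pipeline. The record shape, validation rule, and field names are invented for illustration; the point is only the division of responsibility, not a real Delta implementation.

```python
# Minimal medallion sketch: Bronze keeps every landed record as-is,
# Silver keeps only records that pass validation (with standardized
# fields), Gold aggregates for the dashboard.

def build_layers(raw_records):
    bronze = list(raw_records)  # raw landing zone, unmodified
    silver = [
        {"store": r["store"].strip().upper(), "amount": float(r["amount"])}
        for r in bronze
        if r.get("store") and r.get("amount") is not None  # drop malformed rows
    ]
    gold = {}  # business-ready aggregate: sales per store
    for r in silver:
        gold[r["store"]] = gold.get(r["store"], 0.0) + r["amount"]
    return bronze, silver, gold

raw = [
    {"store": " s1 ", "amount": "10.0"},
    {"store": "S1", "amount": "5.5"},
    {"store": None, "amount": "3.0"},   # malformed: survives in Bronze only
]
bronze, silver, gold = build_layers(raw)
print(len(bronze), len(silver), gold)
# 3 2 {'S1': 15.5}
```

A malformed record no longer breaks the dashboard: it is preserved in Bronze for reprocessing while Silver and Gold stay clean.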
Topic: Data Governance & Quality
Which situation is the best reason to grant a Unity Catalog privilege to a group instead of granting it separately to individual users?
Options:
A. An external partner needs read-only access without being added internally.
B. One engineer needs short-term access to a single table.
C. Each analyst needs different privileges on different tables.
D. Many analysts need the same table access, and team membership changes often.
Best answer: D
Explanation: In Unity Catalog, groups are the preferred way to manage the same access for many users. You grant the privilege once to the group, then handle onboarding and offboarding by changing group membership instead of repeating grants for each person.
The core concept is group-based access control in Unity Catalog. When multiple users need the same privilege on the same data objects, granting access to a group is simpler and more consistent than granting the same privilege to each user individually. The privilege is assigned once to the group principal, and user access changes are handled by adding or removing users from that group. This reduces repetitive administration and lowers the risk that someone keeps access after changing teams.
This approach fits best when access needs are shared and membership changes over time. It is less useful when only one person needs access or when each person needs a different set of privileges. For external recipients, Delta Sharing is the better pattern than internal Unity Catalog group membership.
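The administrative difference can be made concrete with a toy access model. Names are hypothetical, and this is not Unity Catalog's actual API; it only shows why one group grant plus membership changes beats many per-user grants.

```python
# Hedged sketch of group-based access: the privilege is granted once
# to the group, and user turnover is handled by membership changes.
grants = {("analysts", "main.gold.daily_store_sales"): "SELECT"}
membership = {"analysts": {"ana@example.com", "bob@example.com"}}

def can_select(user: str, table: str) -> bool:
    """Resolve a user's effective SELECT access via group grants."""
    return any(
        user in membership.get(group, set())
        for (group, t), priv in grants.items()
        if t == table and priv == "SELECT"
    )

table = "main.gold.daily_store_sales"
assert can_select("ana@example.com", table)

# Offboarding is one membership change; no grant is touched anywhere:
membership["analysts"].discard("bob@example.com")
assert not can_select("bob@example.com", table)
```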
Topic: Productionizing Data Pipelines
A team wants to deploy the same Databricks Asset Bundle to development and production. A teammate says the job should be duplicated under each target.
Exhibit:
bundle:
name: orders-pipeline
resources:
jobs:
ingest_job:
name: ingest-job
tasks:
- task_key: load_bronze
notebook_task:
notebook_path: ../src/load_bronze.py
targets:
dev:
default: true
workspace:
host: https://dev.acme.databricks.example
prod:
workspace:
host: https://prod.acme.databricks.example
run_as:
user_name: prod-service@example.com
What is the best response?
Options:
A. Each target must repeat the full job definition to deploy it.
B. resources only names the bundle; jobs are created from targets.
C. The job belongs in resources; targets hold environment-specific deployment settings.
D. The notebook path should move to targets because code location is environment-specific.
Best answer: C
Explanation: In a Databricks Asset Bundle, resources contains the reusable asset definition, such as the job, tasks, and notebook path. targets provide deployment-specific context for each environment, such as workspace host, default target, or run_as identity. This lets the same job be promoted without copying the job definition.
In Databricks Asset Bundles, resources is where you define the deployable object itself. In this fragment, the job name, task key, and notebook path are all part of the job resource definition. targets describes environment-specific deployment context so the same bundle can be deployed to different environments without rewriting the resource.
In this fragment, workspace.host, default, and run_as live under targets because they are deployment context, while the job itself lives under resources. The closest distractor is the idea that every target needs its own full job block, which defeats the reuse pattern that bundles are designed to provide.
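The reuse pattern can be sketched as a merge of one shared resource definition with per-target overrides. This mimics the YAML fragment above in plain Python; it is not the actual bundle deployment engine.

```python
# Hedged sketch: one job definition, merged with environment-specific
# deployment context per target. Values mirror the exhibit.
RESOURCE = {  # defined once under resources.jobs
    "name": "ingest-job",
    "tasks": [{"task_key": "load_bronze",
               "notebook_path": "../src/load_bronze.py"}],
}

TARGETS = {  # environment-specific deployment context only
    "dev":  {"host": "https://dev.acme.databricks.example"},
    "prod": {"host": "https://prod.acme.databricks.example",
             "run_as": "prod-service@example.com"},
}

def deploy_plan(target: str) -> dict:
    """Same job definition, different deployment context per target."""
    return {**RESOURCE, **TARGETS[target]}

dev, prod = deploy_plan("dev"), deploy_plan("prod")
assert dev["tasks"] == prod["tasks"]   # job definition is shared
assert dev["host"] != prod["host"]     # only deployment context differs
```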
The notebook path stays where it is because notebook_task is part of the job definition under resources.jobs.
Topic: Databricks Intelligence Platform
A team lands raw files in cloud object storage and transforms them in Databricks. Their nightly task failed again because an external scheduler called the Databricks job with an expired token. Table permissions are also managed in a separate catalog tool, so engineers must check multiple systems whenever the pipeline breaks. What is the best next step to reduce this operational friction?
Options:
A. Run the pipeline in Databricks Workflows on job compute with Unity Catalog
B. Publish the tables with Delta Sharing
C. Increase job cluster size and autoscaling limits
D. Keep the external scheduler and rotate tokens more often
Best answer: A
Explanation: The best fix is to consolidate orchestration, compute, and governance inside Databricks. Using Databricks Workflows with job compute and Unity Catalog reduces cross-tool handoffs, avoids external scheduler token failures, and makes troubleshooting more centralized.
This scenario shows operational friction caused by splitting orchestration, compute access, and governance across different tools. A unified Databricks approach reduces that friction by letting the team run scheduled tasks with Databricks Workflows, execute them on Databricks job compute, and govern the resulting tables with Unity Catalog. That means scheduling, execution history, permissions, and data objects are managed together instead of being spread across separate systems.
When a run fails, engineers can investigate from one platform rather than tracing an external scheduler, separate credential flow, and separate catalog configuration. This does not guarantee every pipeline bug disappears, but it removes an avoidable integration point and simplifies day-to-day operations. The closest distractor only treats the token symptom without addressing the fragmented tooling.
Topic: Data Governance & Quality
A team manages a Unity Catalog table with daily sales data. An external partner that does not use the team’s Databricks account needs read-only access, and the team must keep access governed without copying the data or sharing cloud storage credentials. What is the best next action?
Options:
A. Convert the table to an external table and send storage credentials.
B. Create a Delta Sharing share, add the table, and create a recipient.
C. Configure Lakehouse Federation so the partner can query the table.
D. Grant SELECT on the table to the partner in the current workspace.
Best answer: B
Explanation: Delta Sharing is built for governed, read-only sharing of Databricks data with recipients, including external organizations. The provider should create a share, add the needed table, and create a recipient so access remains controlled through Databricks governance.
Delta Sharing is the correct Databricks mechanism when a team needs to expose data to another consumer in a governed way, especially outside its own Databricks account. It lets the data provider share selected tables or views as read-only data products without copying the underlying data or handing out cloud storage credentials. In Databricks, the normal flow is to create a share, add the approved data objects, and then create or assign a recipient that can access that share.
Lakehouse Federation is for querying external systems from Databricks, not for publishing Databricks data to another consumer.
Granting SELECT in the current workspace only fits principals governed inside that environment, not an external partner outside the account.
Topic: Development and Ingestion
A data engineer is exploring daily JSON files that land in cloud object storage. The team needs to read a sample with PySpark, run SQL checks after a simple cleanup, add Markdown notes for reviewers, and inspect outputs together before deciding how to productionize the pipeline. What is the best next action in Databricks?
Options:
A. Publish the landing files with Delta Sharing for external review.
B. Use a Databricks notebook for PySpark, SQL, Markdown, and shared result review.
C. Create a Lakeflow Spark Declarative Pipeline and schedule it immediately.
D. Use Databricks Connect so each engineer develops locally in separate files.
Best answer: B
Explanation: A Databricks notebook is the best fit for interactive, collaborative development. It lets the team combine PySpark ingestion, SQL validation, Markdown documentation, and immediate output inspection in one shared workflow before turning the logic into a production pipeline.
Databricks notebooks are designed for iterative data-engineering work when code, documentation, and results need to stay together. In this scenario, the team wants to ingest sample JSON data with PySpark, run SQL checks after cleanup, explain findings with Markdown, and review outputs collaboratively before committing to a production design. A notebook supports all of that in a single artifact, including cell results that can be inspected directly and shared with teammates. That makes notebooks the right first step for exploration and validation. By contrast, deployment tools are more appropriate once the logic is stable, and sharing features solve distribution needs rather than mixed-language development. The key takeaway is that notebooks combine code, SQL, Markdown, and result inspection in one collaborative workflow.
Topic: Data Governance & Quality
A Databricks workflow task reads a partner dataset through a foreign catalog. It now fails with:
Task: refresh_partner_orders
Status: FAILED
Message: Connection to source database timed out
The partner says they will not allow direct queries against their operational database. They only want to expose a few approved tables as read-only data and revoke access centrally when needed. What is the best next step?
Options:
A. Create external tables that point directly to the partner database.
B. Scale up the compute that runs the federated query.
C. Ask the partner to share the approved tables with Delta Sharing.
D. Request a read-only database account and keep using Lakehouse Federation.
Best answer: C
Explanation: This is a governed sharing requirement, not an in-place query requirement. When a provider wants to publish selected read-only tables and retain central control over revocation, Delta Sharing is the appropriate Databricks solution.
Lakehouse Federation is used for in-place querying of external systems when Databricks can connect directly to those systems. In this scenario, the partner explicitly refuses direct database access from customer environments and wants to expose only a small approved subset of data with centrally managed revocation. That maps to Delta Sharing, which is designed for secure, controlled, read-only sharing between organizations without giving the consumer direct access to the provider’s operational database.
The key distinction is the access model: federation queries the source in place, while sharing publishes approved data for governed consumption. Continuing to troubleshoot connectivity or resize compute would not address the provider’s stated access requirement.
Topic: Data Processing & Transformations
A nightly Databricks batch job uses built-in SQL to join and aggregate Delta tables, and the results are correct. After source data volume doubled, the job now misses its SLA on a small job compute configuration. Which change is most appropriate before rewriting the transformations?
Options:
A. Increase the job compute size, such as adding workers
B. Add extra medallion layers to the pipeline
C. Replace the built-in SQL logic with Python UDFs
D. Rewrite the SQL transformations in PySpark
Best answer: A
Explanation: This is a workload-fit issue, not a logic-correctness issue. When built-in Spark SQL transformations already work correctly and data volume increases, the first response is usually to give the job more appropriate compute resources.
Compute fit means matching the Databricks compute configuration to the workload’s size and pattern. In this case, the transformation logic is already correct and uses built-in SQL operations, so the new symptom points to insufficient capacity after data growth rather than a need to redesign the code.
The closest distractor is rewriting the SQL in PySpark, but built-in SQL and DataFrame-style operations both use Spark’s optimizer, so changing APIs is not the first fix for a simple capacity problem.
Topic: Development and Ingestion
A data engineering team uses a Databricks Workflow task to run a notebook that starts an Auto Loader stream. The notebook should continuously ingest CSV files from cloud object storage into a Unity Catalog bronze table. After deployment, the task fails immediately and no files are processed. The team wants the fastest built-in Databricks way to see whether the failure is caused by an option typo, a missing storage permission, or a schema issue. What should they use first?
Options:
A. Review the failed task error output in Workflows
B. Inspect Spark UI stage and shuffle metrics
C. Download driver and executor logs first
D. Check Unity Catalog lineage for the bronze table
Best answer: A
Explanation: For a job that fails immediately, the failed task output in Databricks Workflows is the quickest built-in place to inspect the actual exception. It typically surfaces Auto Loader option errors, access failures, and schema-related exceptions directly in the run details.
When an Auto Loader task fails before any data is processed, start with the failed notebook or workflow task output. Databricks captures the exception message and stack trace there, which is usually enough to identify common Associate-level issues such as an invalid cloudFiles option, missing permission to read the source path, or a schema mismatch when inferring or writing data.
This is the best first debugging aid because it points directly to the failing command without extra setup. Heavier tools are useful later only if the error output is not specific enough.
For simple Auto Loader debugging, start where Databricks first reports the exception.
Topic: Development and Ingestion
A Databricks Workflow runs an Auto Loader notebook every hour to ingest new CSV files from cloud storage into a bronze Delta table. Last night’s run failed during the ingestion step, and the team needs to identify the specific failing stage before rerunning. Unity Catalog permissions and lineage are already in place. What is the best next action?
Options:
A. Redeploy the workflow with Databricks Asset Bundles
B. Review Unity Catalog lineage for the bronze table
C. Search audit logs for external location access events
D. Inspect the failed workflow run, driver logs, and Spark UI
Best answer: D
Explanation: To troubleshoot a failed Auto Loader step, the most direct Databricks tools are the failed run output, driver logs, and Spark UI. They reveal the actual error and where execution failed, which is what the team needs before rerunning.
When a specific Auto Loader ingestion step fails, start with the diagnostics from that failed run. In Databricks, the workflow run output, driver logs, and Spark UI are built-in debugging tools that help locate the failing stage, task, and exception. That is the fastest way to determine whether the issue is a bad option, schema problem, file-format issue, or storage-path problem.
Governance features serve different purposes. Unity Catalog lineage helps trace data relationships, and audit logs help review access or administrative events, but neither is designed to explain why one ingestion run failed at a particular stage. Databricks Asset Bundles help package and deploy workflows, not debug a completed failed run. For this scenario, use run-level diagnostics first, then correct the issue and rerun.
Topic: Data Governance & Quality
A data engineering team must give a business partner read-only access to a Delta table. The partner is outside the team’s Unity Catalog environment and should not receive direct table privileges in the producer’s catalog. Which statement is accurate?
Options:
A. Lakehouse Federation is used to publish Databricks tables to outside recipients.
B. Delta Sharing is designed for read-only sharing to recipients outside the producer’s Unity Catalog context.
C. A GRANT SELECT on the table is the standard way to share with external partners.
D. Creating an external table automatically makes it available to outside recipients.
Best answer: B
Explanation: Delta Sharing is the Databricks capability for governed, read-only sharing when the recipient is outside the producer’s immediate Unity Catalog permission boundary. Local table grants are meant for principals inside that governed environment, not for external partner sharing.
Delta Sharing is used when a provider wants to share live, read-only data with recipients outside the producer’s local access-control scope, such as another company or another platform context. Instead of granting direct access on the underlying table, the provider creates a share and authorizes recipients to consume that share. By contrast, Unity Catalog privileges like GRANT SELECT are for users, groups, or service principals governed within the producer’s Databricks permission model. Lakehouse Federation addresses querying external systems from Databricks, not distributing your Databricks tables outward. External tables describe where data is stored, but they do not by themselves create an external sharing mechanism. The key distinction is internal access uses grants, while external governed sharing uses Delta Sharing.
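The provider-side flow can be sketched conceptually: create a share, add approved tables, then grant the share to a recipient. Object names below are invented, and the sketch only models visibility; in Databricks the real steps use SQL such as CREATE SHARE, ALTER SHARE ... ADD TABLE, and CREATE RECIPIENT.

```python
# Hedged model of the share/recipient indirection: recipients never
# receive grants on the table itself, only on a share that contains it.
shares: dict[str, set[str]] = {}
recipients: dict[str, set[str]] = {}

def create_share(name: str) -> None:
    shares[name] = set()

def add_table(share: str, table: str) -> None:
    shares[share].add(table)

def grant_share(share: str, recipient: str) -> None:
    recipients.setdefault(recipient, set()).add(share)

def readable_tables(recipient: str) -> set[str]:
    """Recipients see only tables placed in shares granted to them."""
    return {t for s in recipients.get(recipient, set()) for t in shares[s]}

create_share("partner_sales")
add_table("partner_sales", "main.gold.daily_store_sales")
grant_share("partner_sales", "acme_partner")

print(readable_tables("acme_partner"))
# {'main.gold.daily_store_sales'}
```

Revocation is central and clean in this model: removing the grant or the table from the share cuts the recipient off without touching the underlying data or storage credentials.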
Topic: Productionizing Data Pipelines
A team is deploying a scheduled ETL workflow in Databricks. They want the most hands-off compute option: no node type selection, no worker sizing, and Databricks should automatically provision, scale, and optimize the compute for each run. Which compute choice is the best fit?
Options:
A. Single-node compute
B. Job compute with manually configured workers
C. All-purpose compute
D. Serverless compute
Best answer: D
Explanation: Serverless compute is the best fit when a team wants Databricks to handle infrastructure management and performance optimization automatically. It is intended for low-overhead execution, especially for production workloads where manual compute setup is unnecessary.
The key concept is choosing compute based on how much operational control the team wants to manage. For a scheduled ETL workload that should be as hands-off as possible, serverless compute is the best fit because Databricks manages provisioning, scaling, and many performance optimizations automatically.
This matches the stem’s requirements: the team does not want to choose node types, size workers, or tune compute settings for each run. Serverless compute reduces that overhead so engineers can focus on pipeline logic instead of cluster management.
A manually configured job compute resource can still run scheduled pipelines, but it requires more setup and sizing decisions than the stem allows.
Topic: Productionizing Data Pipelines
A nightly Databricks job orchestrates three dependent tasks on a job cluster:
ingest_bronze SUCCESS
build_silver FAILED
publish_gold UPSTREAM_FAILED
ingest_bronze uses Auto Loader. The cluster has been meeting SLA, and the failure was caused by a temporary credential issue that is now fixed. The team wants to continue the workflow quickly, avoid re-ingesting files, and leave compute settings unchanged. What is the best next action?
Options:
A. Increase job-cluster workers and rerun the entire workflow.
B. Run the failed notebook manually on all-purpose compute.
C. Redeploy the job with Databricks Asset Bundles first.
D. Run a repair for the failed job run.
Best answer: D
Explanation: This is a workflow recovery problem, not a compute-sizing problem. Because the upstream ingestion task already succeeded and the issue was fixed outside compute, a repair run is the fastest way to resume from the failure point without reprocessing bronze data.
Databricks Workflows Repair Run is intended for cases where part of a job run completed successfully and only failed or blocked tasks need to run again. In this scenario, ingest_bronze already finished, the root cause was a temporary credential issue, and the cluster was already meeting performance targets. That means changing workers or switching compute does not solve the real problem.
A repair run lets the workflow rerun the failed build_silver task and the downstream task that was skipped because of that failure. This keeps the recovery inside the job’s orchestration logic, preserves run history, and avoids unnecessary upstream reprocessing. The key distinction is that rerun/repair behavior is an orchestration concern, while worker counts and compute type are configuration concerns.
Topic: Development and Ingestion
A data engineering team receives JSON files in cloud object storage throughout the day from an upstream application. New files arrive at unpredictable times, and the application occasionally adds new optional columns. The team wants a Databricks ingestion approach that automatically discovers new files and minimizes manual schema maintenance before loading a bronze table. Which approach is the best fit?
Options:
A. Schedule COPY INTO runs against the storage directory
B. Query the source with Lakehouse Federation
C. Share the source files with Delta Sharing
D. Configure Auto Loader with cloudFiles and schema evolution
Best answer: D
Explanation: Auto Loader is the Databricks feature built for ingesting files that keep arriving in cloud storage. It is the best match when the source schema may evolve and the team wants automatic file discovery with less operational overhead.
The core concept is choosing the ingestion pattern that matches both the arrival pattern and the schema behavior of the source data. Auto Loader is designed for cloud object storage sources where files land continuously or unpredictably, and it supports schema inference and schema evolution for changing file structures such as JSON.
In this scenario, Auto Loader fits because it discovers newly arriving files automatically, infers and evolves the schema when the source adds optional columns, and minimizes manual maintenance before the bronze load.
COPY INTO is useful for incremental batch loads, but it is not the best fit when continuous file discovery and schema evolution are the main requirements. Federation and sharing address access to existing external data, not ingestion of newly arriving raw files.
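The schema-evolution behavior the question relies on can be sketched in miniature. Field names are invented, and the real mechanism is different: Auto Loader tracks inferred schemas in a schema location and is configured through cloudFiles options rather than code like this.

```python
# Hedged sketch of schema widening: when later files introduce new
# optional columns, the tracked schema grows instead of requiring
# manual DDL. Type names are simplified to Python type names.

def merge_schema(known: dict[str, str], record: dict) -> dict[str, str]:
    """Add newly seen fields to the tracked schema."""
    merged = dict(known)
    for field, value in record.items():
        merged.setdefault(field, type(value).__name__)
    return merged

schema: dict[str, str] = {}
files = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": 4.50, "coupon": "SPRING"},  # new optional column
]
for rec in files:
    schema = merge_schema(schema, rec)

print(schema)
# {'order_id': 'int', 'amount': 'float', 'coupon': 'str'}
```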
The COPY INTO option is a batch-oriented loading pattern and is less appropriate when ongoing file discovery and schema changes are central requirements.
Topic: Productionizing Data Pipelines
A team runs an hourly Databricks Workflow. Task 1 uses Auto Loader to ingest CSV files into a bronze Unity Catalog table, and Task 2 joins that data to a dimension table and writes a silver table. The schedule and Unity Catalog governance must stay unchanged. Recent runs succeed, but Task 2 has slowed from 5 minutes to 38 minutes.
Workflow status: Succeeded
Permission errors: none
Spark UI, Stage 14: 199 tasks < 10 sec, 1 task = 24 min
Shuffle read: highly uneven across tasks
What is the best next action?
Options:
A. Repair the workflow run so Task 2 reruns by itself
B. Move the workflow to serverless compute to eliminate orchestration delay
C. Investigate data skew in the transformation and adjust the join or partition strategy
D. Grant additional Unity Catalog privileges on the silver table
Best answer: C
Explanation: This is a Spark execution bottleneck, not a permission or scheduling problem. The workflow succeeds, no access errors appear, and the Spark UI shows one task doing far more shuffled work than the others, which points to skew in the transformation.
When a Databricks Workflow can read and write Unity Catalog tables and completes successfully, permission and orchestration problems are less likely. The deciding clue here is the Spark UI: almost all tasks finish quickly, but one task in the stage runs much longer and handles a disproportionate amount of shuffle data. That pattern points to a Spark performance issue, commonly data skew or an inefficient join or partitioning choice in the transformation.
The right next step is to analyze and tune the transformation based on the Spark UI evidence, not change permissions or workflow mechanics.
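One low-effort way to start, sketched below in Spark SQL: confirm whether a small number of join keys dominate the shuffle, and let adaptive query execution split skewed partitions. The table and column names are illustrative, not from the scenario.

```sql
-- Let adaptive query execution detect and split skewed shuffle partitions
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;

-- Check whether a handful of keys carry most of the rows
-- (main.silver.events and join_key are hypothetical names)
SELECT join_key, COUNT(*) AS rows_per_key
FROM main.silver.events
GROUP BY join_key
ORDER BY rows_per_key DESC
LIMIT 10;
```

If one key dwarfs the rest, follow-up options include salting that key or repartitioning before the join; if the skew is mild, adaptive skew-join handling alone may be enough.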
Topic: Databricks Intelligence Platform
A data engineer is building a new bronze-to-silver notebook. They need to ingest a sample of JSON files, inspect schema changes, and rerun PySpark cells repeatedly with a teammate during development. The data is governed in Unity Catalog, and the workload will be scheduled only after the logic is finalized. Which compute choice is most appropriate now?
Options:
A. All-purpose compute attached to the notebook
B. Job compute in Databricks Workflows
C. A SQL warehouse for the PySpark notebook
D. A Lakeflow Spark Declarative Pipelines pipeline
Best answer: A
Explanation: All-purpose compute is the best fit for interactive notebook development and data exploration. The engineer needs to inspect data, test PySpark logic, and rerun cells iteratively before moving to scheduled production execution.
The key decision is matching compute to the current stage of the workload. All-purpose compute is intended for interactive notebook use, exploratory analysis, and iterative development. In this scenario, the engineer is still validating ingestion behavior, checking schema changes, and refining PySpark transformations with repeated notebook runs. Unity Catalog governance still applies, but it does not make a production-oriented compute choice preferable.
The strongest clue is that scheduling comes later, after the notebook logic is finalized.
Topic: Data Governance & Quality
Which statement best describes Delta Sharing in Databricks?
Options:
A. A governance model for organizing lakehouse data into bronze, silver, and gold layers
B. A virtual-query layer for external databases without moving them into Databricks
C. An ingestion service that incrementally discovers and loads new files from cloud storage
D. A secure open-sharing protocol for governed data, often without copying it into each recipient platform
Best answer: D
Explanation: Delta Sharing is a Databricks capability for secure, governed data sharing across teams, organizations, and platforms. Its purpose is to expose shared data to recipients without requiring every consumer to first create a full separate copy of that data.
Delta Sharing is used when a data provider wants to share governed data with another internal team or an external recipient in a controlled way. It uses an open sharing protocol so the provider can publish specific data assets while maintaining governance over what is shared. The important idea in this question is that Delta Sharing is about distribution of access to data, not file ingestion, data modeling, or querying an external operational system in place.
It differs from nearby concepts:
- The medallion architecture (bronze, silver, gold) is a data-modeling convention, not a sharing mechanism.
- Lakehouse Federation queries external databases in place; it does not publish governed data to recipients.
- Auto Loader ingests newly arriving files from cloud storage; it does not distribute access to existing data.
The key takeaway is that Delta Sharing solves governed data sharing, often without forcing full duplication for each consumer.
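On the provider side, the flow can be sketched with Databricks SQL share commands. This is a minimal sketch; the share, table, and recipient names are illustrative placeholders.

```sql
-- Create a share and add a governed table to it (names are hypothetical)
CREATE SHARE IF NOT EXISTS sales_share;
ALTER SHARE sales_share ADD TABLE main.gold.daily_store_sales;

-- Register a recipient and grant read access to the share
CREATE RECIPIENT IF NOT EXISTS partner_org;
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;
```

Note that nothing here copies data: the share publishes access to governed assets, and the provider can later revoke the grant or remove objects from the share.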
Topic: Data Processing & Transformations
A team has a batch ETL job whose SQL transformation is already correct. The job runs briefly a few times per day, does not require custom cluster settings, and the team wants to avoid managing clusters or keeping idle compute running. Which Databricks capability is the best fit to solve this by changing compute instead of rewriting the transformations?
Options:
A. Auto Loader
B. Databricks Connect
C. Serverless compute
D. Lakeflow Spark Declarative Pipelines
Best answer: C
Explanation: This is a compute-fit problem, not a transformation-logic problem. Serverless compute is designed for workloads that can run on managed Databricks compute without cluster administration, so the team can keep the existing SQL and change only how it runs.
The key concept is choosing the right compute for the workload. When a batch transformation is already correct and the main issue is that the job is short-lived, intermittent, and does not need custom cluster settings, serverless compute is usually the better fix than rewriting the pipeline logic. It lets Databricks manage the underlying compute so the team does not need to size, start, or maintain clusters for that workload.
The important distinction is that serverless compute changes how the workload is executed, not what the transformation does.
Topic: Data Governance & Quality
A data engineer is creating a new Unity Catalog table for curated customer data. The team wants Databricks to manage the table’s storage lifecycle, including handling the underlying data files and cleaning them up when the table is dropped. Which object should the engineer create?
Options:
A. A volume
B. An external table
C. A managed table
D. A view
Best answer: C
Explanation: A managed table is the best fit when Databricks should control the table’s storage and lifecycle. It simplifies administration because Databricks manages the underlying data files instead of just registering data stored elsewhere.
In Unity Catalog, a managed table is used when you want Databricks to manage both the table metadata and the underlying data lifecycle. This is the simpler choice for new lakehouse data when there is no requirement to keep the files in a separately controlled external location. Managed tables align with the requirement that Databricks handle storage behavior and cleanup of the data it manages when the table is dropped.
External tables are different: they reference data in an external location that remains outside full Databricks lifecycle control. The key decision is whether Databricks should own the table’s data lifecycle or only point to externally managed files.
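The contrast can be sketched in DDL. A sketch with hypothetical catalog, schema, and storage-path values; the external path in particular is a placeholder, not a working location.

```sql
-- Managed table: Unity Catalog places the data files and deletes them on DROP
CREATE TABLE main.curated.customers (
  customer_id BIGINT,
  email       STRING
);

-- External table (for contrast): Databricks only references files that live
-- in an external location and remain under separate lifecycle control
CREATE TABLE main.curated.customers_ext (
  customer_id BIGINT,
  email       STRING
)
LOCATION 'abfss://container@account.dfs.core.windows.net/customers/';  -- placeholder path
```

The visible difference is the LOCATION clause: omitting it lets Unity Catalog own storage placement and cleanup, which is exactly the requirement in this scenario.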
Topic: Data Governance & Quality
A team publishes a Unity Catalog table through Delta Sharing. Finance reports higher costs.
Spark UI for the publish job:
- Runtime: unchanged
- Spill/skew warnings: none
- Cluster autoscaling range: unchanged
Recent change:
- A consumer on another cloud now refreshes a dashboard from the share every hour
What is the best next step?
Options:
A. Review cross-cloud data-transfer costs and data locality
B. Rebuild ingestion with Auto Loader
C. Convert the shared table to a managed table
D. Increase the publish job cluster size
Best answer: A
Explanation: The Spark UI shows the publishing compute is behaving the same as before, so this is unlikely to be a compute-cost problem. The new hourly cross-cloud reads are the key change, making data-transfer cost from sharing the first thing to investigate.
Differentiate the cost type by checking what changed. Spark UI symptoms such as unchanged runtime, no spill or skew, and the same autoscaling range suggest the publish job’s compute profile is stable. When the new event is a consumer in another cloud reading the shared data through Delta Sharing, the most likely new cost category is cross-cloud data transfer, not DBU consumption from the job cluster.
A good next step is to review:
- how much data the consumer pulls from the share per refresh, and whether an hourly dashboard refresh is actually needed
- the cross-cloud egress charges incurred by serving the share to a recipient in a different cloud
- data-locality options, such as replicating or staging the shared data closer to the consumer if transfer costs dominate
Cluster tuning, ingestion changes, or table-type changes can affect performance or governance, but they do not directly solve a cost created by moving data across cloud boundaries.
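As a starting point for that review, the provider can inspect what the share exposes and who consumes it. A sketch using share-inspection commands; the share and recipient names are illustrative.

```sql
-- What objects does the share expose? (sales_share is a hypothetical name)
DESCRIBE SHARE sales_share;

-- Which recipients have access to it?
SHOW GRANTS ON SHARE sales_share;

-- Recipient details, useful when checking which cloud/region the consumer is in
DESCRIBE RECIPIENT partner_org;
```

Pairing this inventory with the cloud provider's egress billing breakdown makes it possible to attribute the new cost to the cross-cloud consumer rather than to the publish job.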
Topic: Development and Ingestion
Which capability is provided by Databricks notebooks for day-to-day development, rather than by Databricks Connect, Workflows, or Databricks Asset Bundles?
Options:
A. Scheduled task orchestration with retries and dependencies
B. Interactive cell execution with inline results and charts
C. Local IDE execution against Databricks compute
D. Declarative resource packaging and deployment with YAML
Best answer: B
Explanation: Databricks notebooks are the interactive authoring surface for everyday development in the workspace. They let engineers run code cell by cell, inspect immediate output, and build quick visualizations inline, which is different from local IDE connectivity, orchestration, or deployment tooling.
Databricks notebooks are designed for interactive development inside the Databricks workspace. Their core value is fast iteration: write code, run a cell, review the result immediately, and optionally add lightweight visualizations in the same notebook. That makes notebooks the right choice when the task is about day-to-day exploration, debugging, or incremental development.
A useful way to separate these tools is:
- Databricks notebooks: interactive cell execution with inline results and charts in the workspace
- Databricks Connect: running code from a local IDE against Databricks compute
- Workflows: scheduled task orchestration with dependencies and retries
- Databricks Asset Bundles: declarative resource packaging and deployment with YAML
If the need is interactive execution with immediate feedback in the workspace, think notebooks first.
Use the Databricks Data Engineer Associate Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try Databricks Data Engineer Associate on Web View Databricks Data Engineer Associate Practice Test
Read the Databricks Data Engineer Associate Cheat Sheet on Tech Exam Lexicon for concept review before another timed run.