Try 45 free Databricks Data Engineer Associate questions across the exam domains, with explanations, then continue with full IT Mastery practice.
This free full-length Databricks Data Engineer Associate practice exam includes 45 original IT Mastery questions across the exam domains.
These questions are for self-assessment. They are not official exam questions and do not imply affiliation with the exam sponsor.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Need concept review first? Read the Databricks Data Engineer Associate Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks and full IT Mastery practice.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
Try Databricks Data Engineer Associate on Web · View the full Databricks Data Engineer Associate practice page.
| Domain | Weight |
|---|---|
| Databricks Intelligence Platform | 10% |
| Development and Ingestion | 17% |
| Data Processing & Transformations | 21% |
| Productionizing Data Pipelines | 17% |
| Data Governance & Quality | 35% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.
Topic: Data Governance & Quality
A data engineer created a gold table with the following SQL.
CREATE OR REPLACE TABLE main.gold.daily_store_sales AS
SELECT o.store_id,
date(o.order_ts) AS order_date,
SUM(oi.quantity * oi.unit_price) AS sales
FROM main.silver.orders o
JOIN main.silver.order_items oi
ON o.order_id = oi.order_id
GROUP BY o.store_id, date(o.order_ts);
Which follow-up question is best answered with Unity Catalog lineage?
Options:
A. Which SQL warehouse size would reduce runtime?
B. If main.silver.order_items changes, what downstream tables or views are impacted?
C. Which workflow task failed and should be repaired?
D. Why did the join spend most time shuffling data?
Best answer: B
Explanation: Unity Catalog lineage is used for dependency tracing and impact analysis across governed data assets. The question about downstream tables or views affected by a source-table change is a lineage question, while shuffle analysis, warehouse sizing, and repairing failed tasks are performance or workflow topics.
Unity Catalog lineage is for tracing relationships between upstream and downstream data assets. In this SQL, main.gold.daily_store_sales is built from main.silver.orders and main.silver.order_items, so asking what other tables or views would be affected by a change to one of those sources is a dependency and impact-analysis question.
That differs from other Databricks question types:
A good rule is: if the question is “what depends on what?”, use lineage; if it is “why was it slow?” or “what run failed?”, use performance or workflow tools.
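The "what depends on what?" walk can be sketched in plain Python. The edge map and helper below are illustrative only, not a Unity Catalog API; real lineage answers would come from the lineage system tables or the Catalog Explorer UI, and the weekly table is a hypothetical extra downstream asset.

```python
from collections import deque

# Illustrative lineage edges: source table -> tables built from it.
# In Unity Catalog this information comes from lineage metadata,
# not from a hand-maintained dict like this one.
downstream = {
    "main.silver.orders": ["main.gold.daily_store_sales"],
    "main.silver.order_items": ["main.gold.daily_store_sales"],
    "main.gold.daily_store_sales": ["main.gold.weekly_store_sales"],  # hypothetical
}

def impacted_assets(table: str) -> set[str]:
    """Return every table reachable downstream of `table`."""
    seen, queue = set(), deque(downstream.get(table, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(downstream.get(current, []))
    return seen

print(impacted_assets("main.silver.order_items"))
```

The answer to option B is exactly this reachable set: every asset that would need revalidation if the source table changed.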
Topic: Data Processing & Transformations
A data engineer reviews this requirement note:
Target table: main.sales.daily_orders
Current state: already exists with historical data
Source: temp view new_orders
Requirement: add all rows from new_orders
Constraint: keep existing rows unchanged
Which SQL statement pattern best matches this requirement?
Options:
A. INSERT INTO main.sales.daily_orders SELECT * FROM new_orders
B. CREATE OR REPLACE TABLE main.sales.daily_orders AS SELECT * FROM new_orders
C. INSERT OVERWRITE main.sales.daily_orders SELECT * FROM new_orders
D. CREATE TABLE main.sales.daily_orders AS SELECT * FROM new_orders
Best answer: A
Explanation: INSERT INTO ... SELECT is the standard append pattern for an existing table. The exhibit says the table already exists and that historical rows must remain, so the correct choice must add new rows without replacing current data.
In Databricks SQL, INSERT INTO target SELECT ... is used to append query results to an existing table. That matches the exhibit exactly: main.sales.daily_orders already exists, and the requirement is to add rows from new_orders while leaving historical rows in place.
INSERT OVERWRITE does not append; it replaces the target table’s current contents with the query output. CREATE OR REPLACE TABLE ... AS SELECT also rebuilds the table from the select results, so it is a replace pattern rather than an append pattern. CREATE TABLE ... AS SELECT is meant for creating a new table, not writing additional rows into one that already exists.
For a simple add-rows requirement, the append form is the best match.
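The append-versus-replace distinction can be mimicked with a plain-Python sketch, with lists standing in for tables. This is an analogy for the semantics discussed above, not Spark SQL.

```python
# The "table" is just a list of rows; dates and amounts are made up.
existing_rows = [("2024-01-01", 10), ("2024-01-02", 12)]
new_orders = [("2024-01-03", 9)]

# INSERT INTO ... SELECT: append, so historical rows survive.
insert_into = existing_rows + new_orders

# INSERT OVERWRITE / CREATE OR REPLACE TABLE ... AS SELECT: replace,
# so only the query output remains and history is lost.
insert_overwrite = list(new_orders)

print(len(insert_into), len(insert_overwrite))  # 3 1
```

Only the append form satisfies the constraint that existing rows stay unchanged.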
Topic: Productionizing Data Pipelines
A data engineering team is setting up a scheduled ETL workflow. Based on the note below, which compute choice is the best fit?
Exhibit:
Workflow: orders_etl
Tasks: 3 notebook tasks
Schedule: every hour
Requirements:
- minimal operational overhead
- no manual cluster sizing or tuning
- Databricks should optimize compute automatically
- no special Spark configuration needed
Options:
A. Run it from a SQL warehouse
B. Run it on serverless compute
C. Use a new autoscaling jobs cluster
D. Use a shared all-purpose cluster
Best answer: B
Explanation: Serverless compute is the best match when a workflow should be hands-off and automatically optimized by Databricks. The exhibit describes standard scheduled notebook tasks with no special configuration needs, which is a strong serverless use case.
The core concept is choosing serverless compute when the team wants Databricks to manage provisioning, scaling, and optimization for a routine production workload. In the exhibit, the workflow is scheduled, uses standard notebook tasks, and explicitly says the team does not want to size or tune clusters.
That combination points to serverless compute because it reduces operational work while still supporting production ETL execution. A manually configured jobs cluster can run the workload, but it still requires decisions about cluster setup and lifecycle. A shared all-purpose cluster is aimed more at interactive development, and a SQL warehouse is not the right compute for a notebook-based ETL workflow.
The key takeaway is that serverless is the best fit when the goal is a managed, low-overhead workflow rather than custom cluster control.
Topic: Productionizing Data Pipelines
A data engineering team stores a Databricks job and a Lakeflow Spark Declarative Pipeline in one Databricks Asset Bundle. They need to deploy the same bundle to both a development workspace and a production workspace, use different workspace hosts, and keep production-specific overrides such as a schedule. Which bundle element best matches this deployment-structure requirement?
Options:
A. variables
B. resources
C. targets
D. workspace
Best answer: C
Explanation: targets define separate deployment environments in a Databricks Asset Bundle. They let the team reuse one set of resource definitions while changing environment-specific settings such as workspace hosts and production overrides. That matches the requirement to keep a single bundle for both dev and prod.
In Databricks Asset Bundles, targets are used to model deployment environments such as development and production. This is the right structural element when the same bundle should deploy the same job and Lakeflow Spark Declarative Pipeline to different workspaces while allowing environment-specific settings or overrides. A target can hold deployment-specific values like workspace configuration and resource overrides, so the shared resource definitions stay in one place.
The other elements solve different problems: resources describe what gets deployed, such as jobs or pipelines; workspace sets workspace-related configuration, but by itself it does not organize multiple environments in one bundle; and variables help parameterize values, but they do not replace the environment structure that targets provide. The key takeaway is to use targets for multi-environment deployment layout.
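The layout described above can be sketched in a databricks.yml. The bundle name, hosts, job name, and cron expression below are placeholders, not values from the question:

```yaml
bundle:
  name: orders_bundle            # hypothetical bundle name

targets:
  dev:
    workspace:
      host: https://dev-workspace.cloud.databricks.com    # placeholder host
  prod:
    workspace:
      host: https://prod-workspace.cloud.databricks.com   # placeholder host
    resources:
      jobs:
        orders_job:              # production-only override: add a schedule
          schedule:
            quartz_cron_expression: "0 0 6 * * ?"
            timezone_id: UTC
```

One set of resource definitions is shared, and each target layers in only what differs per environment.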
The workspace element sets workspace configuration, but it is not the main structure for separate dev and prod deployments.
The resources element defines jobs and pipelines, not the environment layout for deploying them.
The variables element can substitute values, but it does not create target-specific deployment sections and overrides.
Topic: Data Processing & Transformations
A Lakeflow Spark Declarative Pipeline update fails with AnalysisException: Table or view active_customers not found. One function creates active_customers as a temporary view, and a later dataset definition queries that view. The team says the temp-view code appears earlier in the notebook, so it should run first. What is the best fix?
Options:
A. Add Workflow dependencies between notebook tasks.
B. Declare the intermediate result as a Lakeflow dataset and reference it.
C. Move the temp-view code earlier in the notebook.
D. Use a larger cluster and rerun the update.
Best answer: B
Explanation: Lakeflow Spark Declarative Pipelines build execution order from the dependency graph between declared datasets. If a downstream transformation needs an upstream result, that dependency should be expressed in the pipeline definition instead of relying on notebook order or a temporary-view side effect.
Lakeflow Spark Declarative Pipelines are declarative, so Databricks plans execution from dataset references rather than from top-to-bottom notebook order. In this scenario, the downstream definition reads a temporary view that is created as a side effect, so the pipeline planner does not have a reliable declared dependency to follow.
Reordering code, adding workflow sequencing, or increasing compute does not fix the core issue: the transformation dependency is not explicitly represented in the pipeline.
Topic: Data Processing & Transformations
In a Lakeflow Spark Declarative Pipeline, a customers_silver transformation must use the output of customers_bronze. How should the pipeline definition express this dependency so execution is reliable?
Options:
A. Attach separate source notebooks in bronze-then-silver order.
B. Place the customers_bronze definition before customers_silver.
C. Prefix dataset names so bronze sorts before silver.
D. Define customers_silver to read from customers_bronze.
Best answer: D
Explanation: Lakeflow Spark Declarative Pipelines are declarative: they run transformations based on dataset dependencies, not on the order code appears. A downstream definition should explicitly read from the upstream dataset it needs so Databricks can build the correct execution graph.
In Lakeflow Spark Declarative Pipelines, the important signal is lineage in the transformation definitions. If customers_silver depends on customers_bronze, the downstream dataset should reference the upstream dataset in its query or DataFrame logic. Databricks then builds the dependency graph and executes transformations in the required sequence.
The key takeaway is that declarative pipelines use declared dependencies to determine execution, not implicit code ordering.
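That planning behavior can be illustrated with a small topological sort in plain Python. The dataset names echo the question, but the graph and ordering code are a conceptual sketch of dependency-driven execution, not the Lakeflow planner itself; the gold table is a hypothetical extra step.

```python
from graphlib import TopologicalSorter

# Each dataset lists the datasets it reads from; a declarative pipeline
# plans execution from references like these, not from where the
# definitions appear in a notebook.
reads_from = {
    "customers_bronze": set(),
    "customers_silver": {"customers_bronze"},
    "customers_gold": {"customers_silver"},  # hypothetical downstream table
}

order = list(TopologicalSorter(reads_from).static_order())
print(order)  # bronze before silver before gold, regardless of code order
```

Because customers_silver declares that it reads from customers_bronze, the bronze table is always materialized first, no matter how the source files are ordered.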
Topic: Data Governance & Quality
A Databricks provider on AWS shares a 20 TB Delta table with an Azure recipient using Delta Sharing. The recipient runs read-only queries each day, metadata exchanged by the share is small, and assume the recipient’s compute cost would be the same either way. In this scenario, what is the main cost driver?
Options:
A. Asset Bundle deployment of the sharing configuration
B. Unity Catalog permission checks on each shared query
C. Cross-cloud data egress for the bytes the recipient reads
D. Delta log metadata exchanged for the share
Best answer: C
Explanation: In cross-cloud Delta Sharing, the biggest variable cost usually comes from the data volume that leaves the provider’s cloud storage. Because the stem says metadata is small and compute cost is unchanged, the main cost driver is the egress for the bytes the recipient reads.
Delta Sharing is designed for secure read-only sharing, but in a cross-cloud scenario the major ongoing cost consideration is usually the data that must move between clouds. Here, the provider is on AWS, the recipient is on Azure, the table is large, and the recipient reads it regularly. The stem also removes two common distractions by stating that metadata is small and compute cost is unchanged.
That means the primary cost driver is cloud egress for the shared data read by the recipient. Permission checks and Delta log exchange are control-plane activities, but they involve much less data than repeated multi-terabyte reads. The key takeaway is that for simple cross-cloud sharing, data volume transferred is typically the dominant cost factor.
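A rough back-of-envelope shows why volume dominates. Both numbers below are assumptions chosen for illustration, not quoted cloud prices or figures from the question:

```python
# Assumed daily read volume from the shared table, in GB (illustrative).
gb_read_per_day = 500
# Assumed cross-cloud egress list price per GB (illustrative, not a real quote).
egress_rate_usd_per_gb = 0.09

monthly_egress_usd = gb_read_per_day * egress_rate_usd_per_gb * 30
print(round(monthly_egress_usd, 2))  # 1350.0
```

Even modest daily reads of a multi-terabyte table accumulate into a transfer bill that dwarfs the control-plane metadata exchanged by the share.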
Topic: Databricks Intelligence Platform
A company uses separate Databricks workspaces for data engineering and BI in the same Databricks account. An engineering workflow creates bronze.orders, but a downstream BI workflow task in the other workspace fails with:
TABLE_OR_VIEW_NOT_FOUND: bronze.orders
Current setup: table created in a workspace-local metastore
Need: one governed table, centralized permissions, lineage across both workspaces
Which next step best fixes the root problem?
Options:
A. Manage bronze.orders in Unity Catalog and grant access there.
B. Recreate bronze.orders in the BI workspace metastore.
C. Publish bronze.orders as a global temp view from the engineering job.
D. Repair the BI workflow run after the engineering job finishes.
Best answer: A
Explanation: This failure comes from workspace-scoped metadata, not from the workflow logic itself. Unity Catalog is the platform-level fix because it provides one governed table definition, centralized permissions, and lineage across workspaces in the same Databricks account.
The core issue is that the table exists only in a workspace-local metastore, so its metadata and governance are tied to one workspace. That makes this a platform-scope problem rather than a notebook, job, or SQL problem. Repairing the failed BI task might rerun compute, but it does not change where the table is registered or how access is governed.
Unity Catalog is Databricks’ centralized governance layer for data objects. Placing the table there lets multiple workspaces in the same account reference the same governed object, with consistent permissions and lineage. That directly addresses the stated goal of one table, centralized access control, and shared visibility. Recreating metadata separately in another workspace would still fragment governance instead of using the platform’s shared control plane.
Topic: Data Governance & Quality
A cleanup workflow drops and recreates a Unity Catalog table each night. After DROP TABLE, a legacy application that reads the same cloud storage directory starts failing because the files are gone. The team still wants Unity Catalog governance, but Databricks must not own the file lifecycle for this dataset. What is the best next step?
Options:
A. Register the dataset as an external table in Unity Catalog
B. Grant the legacy application more Unity Catalog permissions
C. Share the dataset through Delta Sharing instead
D. Recreate the dataset as a managed table in Unity Catalog
Best answer: A
Explanation: An external table is the right fit when Unity Catalog should govern a dataset but the data files must stay in externally owned storage. Because another application still needs those files after DROP TABLE, Databricks should not manage their lifecycle.
Managed and external tables differ mainly by who owns the underlying data files. A managed table stores data in Databricks-managed storage under Unity Catalog, so Databricks controls the file lifecycle; when the table is dropped, the underlying data is also removed. An external table registers data that already lives in an external cloud location, allowing Unity Catalog governance without transferring storage ownership to Databricks.
In this scenario, the key requirement is that the files must remain available to another application even after the table is dropped and recreated. That makes an external table the best choice. The closest distractor is the managed-table option, because it also provides governance, but it conflicts with the requirement that Databricks must not own or delete the files.
For a managed table, DROP TABLE can remove the underlying data.
Topic: Data Governance & Quality
A data engineering team stores a gold Delta table in Unity Catalog on AWS. Another team uses Databricks on Azure and needs read-only access to the latest data every day. The provider wants to avoid managing duplicate copies, and finance wants the design to account for both compute cost and any cross-cloud transfer cost. Which approach is best?
Options:
A. Use Delta Sharing and plan for cross-cloud egress costs.
B. Use Auto Loader to copy the table into Azure daily.
C. Use Databricks Asset Bundles to deploy the sharing workflow.
D. Use Lakehouse Federation so Azure queries the table in place.
Best answer: A
Explanation: Delta Sharing is the Databricks feature designed for governed, read-only sharing across clouds without maintaining extra copies. Even when it is the right sharing mechanism, cross-cloud access can still introduce egress or other data-movement charges, so cost planning must include more than compute.
Delta Sharing is the best fit because the data is already governed in Unity Catalog, the consumer only needs read-only access, and the provider wants to avoid duplicating datasets. It enables live sharing to another Databricks environment on a different cloud while keeping provider-side control of the shared data. The key concept is that cross-cloud sharing decisions are not based only on cluster or serverless compute cost. When data is read across cloud boundaries, network transfer or egress charges may also apply, depending on where the data resides and how much is read. That makes data location and access pattern part of the architecture decision, not just the sharing feature choice.
Topic: Development and Ingestion
A workflow task runs this notebook cell and fails immediately.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/Volumes/main/bronze/landing/events"))
Run details show: AnalysisException: Auto Loader can infer schema, but you must set cloudFiles.schemaLocation to store the schema.
The team wants to keep using Auto Loader with inferred schema. What is the best next step?
Options:
A. Replace the stream with COPY INTO
B. Configure cloudFiles.schemaLocation for the Auto Loader source
C. Configure only checkpointLocation on the sink
D. Increase cluster size for the workflow task
Best answer: B
Explanation: The run details already identify the root cause: Auto Loader is inferring schema without a cloudFiles.schemaLocation. Adding that option is the correct next step because it gives Auto Loader a persistent place to store schema metadata and support later schema changes.
This is a direct Auto Loader configuration issue, not a compute or permissions issue. When Auto Loader reads files and infers schema, it needs cloudFiles.schemaLocation so Databricks can store the discovered schema and manage future schema evolution for that source. The notebook output provides enough evidence on its own because it explicitly names the missing option, so the best next step is to add that option and rerun the task.
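A minimal sketch of the corrected options, assuming a hypothetical schema path; in a notebook these would be applied to the cloudFiles reader shown in the question:

```python
# Hypothetical schema location; any durable path the pipeline owns works.
autoloader_options = {
    "cloudFiles.format": "json",
    # The missing option named in the error: a persistent place for the
    # inferred schema and its later evolution.
    "cloudFiles.schemaLocation": "/Volumes/main/bronze/_schemas/events",
}

# In the notebook this would be applied as:
# spark.readStream.format("cloudFiles").options(**autoloader_options) \
#     .load("/Volumes/main/bronze/landing/events")
print(sorted(autoloader_options))
```

The key point is that the schema location belongs on the source options, separate from any checkpoint configured on the sink.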
Set the cloudFiles option cloudFiles.schemaLocation to a durable location. A sink checkpoint tracks processing state, but it does not replace Auto Loader’s schema storage.
COPY INTO changes the ingestion pattern instead of fixing the Auto Loader configuration the team wants to keep using.
Topic: Development and Ingestion
A bronze ingestion job runs every 15 minutes using the code below. New JSON files land in the folder throughout the day. Last night the source system added a new optional column, and the next run failed when appending to the bronze table. The team wants to process only new files and avoid frequent manual schema updates.
df = spark.read.json("/Volumes/main/raw/orders/")
df.write.mode("append").saveAsTable("main.bronze.orders_raw")
What is the best next step?
Options:
A. Use Auto Loader with cloudFiles and a schemaLocation.
B. Set mergeSchema on the write and keep spark.read.
C. Overwrite the bronze table after inferring schema each run.
D. Replace the job with COPY INTO on a schedule.
Best answer: A
Explanation: Auto Loader is the best fit when new files arrive continuously and the source schema can change over time. It incrementally discovers only new files and stores schema information so compatible schema evolution can be managed more safely than repeated full-directory scans.
This scenario matches Auto Loader’s core use case: ongoing file ingestion plus evolving source schemas. A plain spark.read batch job scans the directory again each run and does not maintain incremental file-discovery state for newly arrived files. When the source adds a column, the pipeline also needs a better way to track and evolve schema over time.
With Auto Loader, you typically use cloudFiles and specify a schemaLocation so Databricks can persist inferred schema metadata and process new files incrementally. That reduces manual operational work and avoids redesigning the ingestion pattern each time the source adds a compatible field.
The key takeaway is that schema evolution plus continuous file arrival is a strong signal to use Auto Loader.
COPY INTO on a schedule can be incremental, but it is not the best match when the main need is managed continuous file discovery with schema tracking.
mergeSchema on the write can help Delta accept new columns, but it does not solve incremental discovery of only new files.
Topic: Data Processing & Transformations
In a Databricks Medallion Architecture, which type of data typically belongs in the silver layer?
Options:
A. Cleansed, validated, and conformed data for downstream processes
B. Raw source data landed with minimal transformation
C. Business-level aggregates prepared for reporting and dashboards
D. Read-only external data shared directly with other organizations
Best answer: A
Explanation: The silver layer is the stage where bronze data is refined into higher-quality datasets. It is typically cleansed, validated, and conformed so downstream transformations, analytics, and data products can use it reliably.
In Medallion Architecture, the silver layer sits between raw ingestion and business-ready presentation. Bronze usually stores source data with minimal changes, while silver improves that data by applying cleaning, validation, deduplication, standardization, and conformance rules. This makes silver the common place for trusted intermediate datasets that downstream processes can build on.
The gold layer is typically where data is further shaped into curated, business-level tables such as aggregates, KPIs, or reporting models. A sharing mechanism or external source access pattern is not itself a Medallion layer. The key idea is that silver is where data quality and consistency are established before broader consumption.
Topic: Development and Ingestion
A data engineer wants to write and test PySpark code in a local IDE but have the code execute on Databricks compute instead of the local machine. Which Databricks feature is designed for this workflow?
Options:
A. Delta Sharing
B. Databricks Connect
C. Databricks Asset Bundles
D. Auto Loader
Best answer: B
Explanation: Databricks Connect is the workflow for using local development tools such as an IDE while executing code on Databricks compute. It supports local development against remote Databricks resources rather than local Spark execution.
The core idea is separating where you write code from where it runs. Databricks Connect lets a developer use local tools, such as an IDE, for coding and testing while the actual execution happens on Databricks compute. That makes it the right choice when a team wants familiar local development workflows but still needs Databricks-backed execution.
This is different from features that package deployments, ingest files, or share datasets. The deciding clue is the combination of local development tools and remote Databricks compute. If the need is interactive development from a local environment against Databricks, Databricks Connect is the intended workflow.
The closest distractor is deployment tooling, which helps ship resources to Databricks but does not provide this local-to-remote development experience.
Topic: Data Governance & Quality
A data engineering team publishes a gold sales table in Unity Catalog. An external partner on a different analytics platform needs read-only access to the latest data, and the team wants centralized governance without recurring export or full-copy workflows. What is the best next action?
Options:
A. Schedule daily Parquet exports to partner-owned cloud storage
B. Create a Delta Sharing share for the gold table
C. Use Lakehouse Federation so the partner can query the table
D. Deploy Auto Loader to sync files to the partner
Best answer: B
Explanation: Delta Sharing is designed for governed data sharing when consumers need fresh access without manual copy pipelines. It lets the provider share a Unity Catalog table directly while keeping control over the shared data.
Delta Sharing is the Databricks capability for sharing live governed data with internal or external recipients. In this scenario, the partner needs read-only access to the latest version of a gold table, and the provider wants to avoid building recurring exports or maintaining duplicate full copies. A Delta Sharing share meets those requirements because the provider manages access centrally and the recipient reads current shared data through the sharing mechanism rather than through a hand-built file-delivery workflow. This is the simplest governed solution for cross-platform consumption.
The key takeaway is that Delta Sharing solves outbound governed data access, while ingestion and federation features solve different problems.
Topic: Data Processing & Transformations
A team ingests raw order events into orders_bronze and defines this Lakeflow Spark Declarative Pipelines table:
CREATE OR REFRESH STREAMING TABLE orders_target AS
SELECT
order_id,
customer_id,
CAST(order_ts AS TIMESTAMP) AS order_ts,
UPPER(country_code) AS country_code,
CAST(total_amount AS DECIMAL(10,2)) AS total_amount
FROM STREAM orders_bronze
WHERE order_id IS NOT NULL
AND total_amount >= 0;
Which description best fits orders_target in the Medallion Architecture?
Options:
A. A gold table with aggregated business metrics for reporting.
B. A silver table with cleansed, validated data for downstream use.
C. A bronze table holding raw source records with minimal changes.
D. A source table queried in place through Lakehouse Federation.
Best answer: B
Explanation: The table is built from bronze data but applies validation and standardization before storing it. That matches the silver layer, which holds refined data that downstream transformations and analytics can reliably reuse.
In the Medallion Architecture, bronze stores raw ingested data, silver stores cleansed, validated, or conformed data, and gold stores business-ready aggregates or serving tables. This query reads from orders_bronze, filters out invalid rows, converts fields to useful data types, and standardizes country_code. Those are classic silver-layer actions because they improve data quality while keeping the data at a detailed record level for later joins, enrichment, and analysis. A gold table would usually summarize or model the data for a specific reporting need, while a bronze table would preserve the source data with far fewer changes.
The table is built from orders_bronze, not queried in place through federation.
Topic: Data Processing & Transformations
A Databricks Workflow task calculates daily sales metrics from a PySpark DataFrame with one row per order line. A downstream validation task fails because daily_order_count is larger than the true number of orders on days when a single order has multiple lines.
daily = (transactions
    .groupBy("sales_date")
    .agg(
        F.count("order_id").alias("daily_order_count"),
        F.sum("line_total").alias("daily_revenue")
    )
)
What is the best fix?
Options:
A. Use countDistinct("line_total") for daily_revenue.
B. Use countDistinct("order_id") for daily_order_count.
C. Use count("line_total") for daily_revenue.
D. Use sum("order_id") for daily_order_count.
Best answer: B
Explanation: The metric is wrong because the data is at order-line grain, not order grain. Here, count("order_id") counts line rows, so the order-count metric should use countDistinct("order_id"), while sum("line_total") remains the correct way to total revenue.
In PySpark aggregations, count counts non-null values, countDistinct counts unique non-null values, and sum adds numeric values. Here, each row represents an order line, so multiple rows can share the same order_id. That means count("order_id") overstates the number of orders whenever an order has more than one line item.
In short, count("order_id") measures non-null order ID entries, countDistinct("order_id") measures unique orders, and sum("line_total") measures total revenue. The right fix is to change only the order-count metric to use distinct order IDs; revenue should remain a sum of the line amounts.
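The grain problem can be reproduced with plain Python over a few sample order-line rows; the data is illustrative, not PySpark, but the three aggregates mirror count, countDistinct, and sum.

```python
# Order-line rows: two lines share order_id "A-1", so the true order count is 2.
lines = [
    {"order_id": "A-1", "line_total": 10.0},
    {"order_id": "A-1", "line_total": 5.0},
    {"order_id": "B-2", "line_total": 7.5},
]

# What count("order_id") does: counts non-null values, i.e. line rows.
count_rows = sum(1 for r in lines if r["order_id"] is not None)
# What countDistinct("order_id") does: counts unique orders.
count_orders = len({r["order_id"] for r in lines if r["order_id"] is not None})
# What sum("line_total") does: totals revenue across lines.
revenue = sum(r["line_total"] for r in lines)

print(count_rows, count_orders, revenue)  # 3 2 22.5
```

At order-line grain the plain count overstates orders (3 vs. 2), while the revenue sum is already correct.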
order_id is an identifier, not a numeric measure to total. count("line_total") returns how many non-null amounts exist, not the revenue amount. countDistinct("line_total") counts unique prices or line totals, not money earned.
Topic: Productionizing Data Pipelines
A Databricks workflow run failed after a Unity Catalog permission issue. The engineer grants the missing privilege and wants to avoid rerunning work that already succeeded.
Task status
ingest_bronze SUCCESS
transform_silver FAILED (PERMISSION_DENIED)
publish_gold UPSTREAM_FAILED
send_summary UPSTREAM_FAILED
refresh_lookup SUCCESS
Dependencies
transform_silver <- ingest_bronze
publish_gold <- transform_silver
send_summary <- publish_gold
refresh_lookup <- none
What is the best next step?
Options:
A. Rerun ingest_bronze first, then manually rerun all later tasks.
B. Rerun only transform_silver and leave downstream tasks untouched.
C. Start a new run so every task executes from the beginning.
D. Repair the run so transform_silver and its downstream tasks rerun.
Best answer: D
Explanation: In Databricks Workflows, rerun scope should follow the dependency graph. After the permission problem is fixed, the right approach is to repair the run starting from the failed task so only that branch of dependent work is rerun.
The key concept is that Databricks workflow reruns should be driven by task dependencies, not by rerunning everything. Here, transform_silver failed, and publish_gold and send_summary did not complete because they depend on it. ingest_bronze already succeeded, and refresh_lookup is independent, so neither needs to run again.
A repair run is the best fit because it targets the failed path: it reruns transform_silver after the permission is fixed, along with the downstream tasks that were blocked by it. That is safer and more efficient than restarting the entire workflow, and it avoids missing downstream work that still depends on the repaired task.
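The repair scope can be sketched as a downstream walk from the failed task. The edge map mirrors the exhibit, but the helper is a conceptual illustration of how repair scope follows the dependency graph, not the Workflows API.

```python
# Dependency edges from the exhibit: task -> tasks that depend on it.
dependents = {
    "ingest_bronze": ["transform_silver"],
    "transform_silver": ["publish_gold"],
    "publish_gold": ["send_summary"],
    "refresh_lookup": [],
}

def repair_scope(failed: str) -> set[str]:
    """Failed task plus everything downstream of it; successes stay untouched."""
    scope, stack = set(), [failed]
    while stack:
        task = stack.pop()
        if task not in scope:
            scope.add(task)
            stack.extend(dependents.get(task, []))
    return scope

print(sorted(repair_scope("transform_silver")))
```

Note that ingest_bronze and the independent refresh_lookup fall outside the scope, which is exactly why a repair run beats a full rerun here.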
Topic: Productionizing Data Pipelines
A daily Databricks workflow has the task dependency chain ingest_orders -> build_bronze -> build_silver -> publish_gold -> notify_bi. The team confirms outputs from completed tasks are still valid and should not be recomputed.
Exhibit:
ingest_orders SUCCESS
build_bronze SUCCESS
build_silver SUCCESS
publish_gold FAILED
notify_bi UPSTREAM_FAILED
Cause: SQL warehouse temporarily unavailable
What is the best next action?
Options:
A. Redeploy the workflow with Databricks Asset Bundles
B. Delete successful outputs and rerun from build_bronze
C. Use Repair run for publish_gold and notify_bi
D. Start a new run from ingest_orders
Best answer: C
Explanation: This is a partial workflow failure with valid upstream outputs. In Databricks Workflows, the best recovery action is a Repair run so only the failed task and the task blocked by it are rerun.
A Repair run is designed for cases where part of a workflow succeeded and only later tasks need recovery. Here, ingest_orders, build_bronze, and build_silver already finished successfully, and the stem says their outputs are still valid. publish_gold failed because of a temporary availability issue, and notify_bi did not run only because its dependency failed.
Using Repair run lets Databricks rerun the failed task and the downstream task affected by that failure, without recomputing the successful upstream work. A brand-new run would repeat unnecessary processing. Redeploying a Databricks Asset Bundle is for changing or promoting workflow definitions, not for recovering one transient failed run. Deleting good outputs and rebuilding earlier stages adds cost and risk without improving correctness.
When upstream results remain valid, prefer repairing the run over rerunning the whole workflow.
Topic: Data Governance & Quality
A data engineering team uses Auto Loader to ingest JSON files into bronze tables and a scheduled SQL workflow to update silver tables in Unity Catalog. During a governance review, the security team asks which users read or modified data-related assets during the last 30 days. The team needs evidence of user actions, not upstream/downstream relationships. Which Databricks capability should they use?
Options:
A. Audit logs
B. Delta table history
C. Lakeflow pipeline event logs
D. Unity Catalog lineage
Best answer: A
Explanation: Audit logs are the best fit because the request is about who accessed or changed assets over time. That is an activity-tracking need, which audit logs are built to capture across Databricks environments and governed assets.
The key concept is distinguishing activity auditing from data dependency tracking. When a team needs to know which users or services accessed or changed data-related assets, the right governance artifact is audit logs. Audit logs capture recorded actions and events, making them appropriate for security reviews, investigations, and compliance reporting.
Lineage answers a different question: how data moves between sources, tables, notebooks, and jobs. Delta table history is narrower because it focuses on changes to a specific Delta table, not broad access activity across assets. Pipeline event logs help troubleshoot pipeline execution, not user access patterns. The best match is the artifact that records user actions across the platform.
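The filtering the security team needs can be sketched as follows. This is an illustrative model only: real Databricks audit logs are delivered as JSON with service, action, and identity fields, and the field names below are simplified stand-ins, not the actual audit log schema.

```python
# Hedged sketch: find users who read or modified data assets in the
# last 30 days. Action names and record layout are invented for
# illustration; they do not match the real audit log schema.
from datetime import datetime, timedelta

DATA_ACTIONS = {"getTable", "updateTable", "deleteTable"}

def recent_actors(events, now, days=30):
    """Return users with data read/modify events inside the window."""
    cutoff = now - timedelta(days=days)
    return {
        e["user"]
        for e in events
        if e["action"] in DATA_ACTIONS and e["time"] >= cutoff
    }

now = datetime(2024, 6, 30)
events = [
    {"user": "ana@example.com", "action": "getTable",    "time": datetime(2024, 6, 20)},
    {"user": "bob@example.com", "action": "updateTable", "time": datetime(2024, 6, 25)},
    {"user": "old@example.com", "action": "getTable",    "time": datetime(2024, 4, 1)},   # outside window
    {"user": "ci@example.com",  "action": "createJob",   "time": datetime(2024, 6, 28)},  # not a data action
]
print(sorted(recent_actors(events, now)))
# ['ana@example.com', 'bob@example.com']
```

The point of the sketch is the shape of the question: it is answered by scanning recorded activity over time, which is what audit logs hold and lineage does not.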
Topic: Data Governance & Quality
A scheduled workflow that reads a Unity Catalog table started failing. The team needs evidence of who attempted the access and when so they can confirm the failure came from a governed action inside Databricks.
Task state: Failed
Error: PERMISSION_DENIED
Message: User does not have SELECT on table main.finance.payroll
What should the data engineer check first?
Options:
A. Current grants on the main.finance.payroll table
B. Unity Catalog lineage for the main.finance.payroll table
C. Delta table history for the main.finance.payroll table
D. Databricks audit logs for the denied access event
Best answer: D
Explanation: Audit logs are designed to capture governed activity in Databricks, including access attempts and denials. For a Unity Catalog PERMISSION_DENIED failure, they provide the evidence trail for who tried to access the table and when the event occurred.
When the issue is a Unity Catalog permission error and the team needs evidence of the governed action, audit logs are the right source. They record activity in the Databricks environment, including authorization-related events, with details such as principal, object, timestamp, and outcome. That makes them appropriate for confirming that a specific access attempt was denied.
Current grants help compare intended permissions against the failure, but they do not prove that an access attempt happened. Lineage explains data dependencies, not access events. Delta table history tracks committed changes to the table, not read attempts or permission denials. The key distinction is evidence of activity versus metadata about the object.
Only audit logs record the denied SELECT attempts.
Topic: Data Governance & Quality
A team wants to separate data-governance decisions from production workflow decisions in Databricks. Which task is a Unity Catalog permission concern rather than a workflow or deployment concern?
Options:
A. Deploy job definitions across environments with Databricks Asset Bundles.
B. Grant SELECT on a Unity Catalog table to an analyst group.
C. Repair a failed workflow run without rerunning successful tasks.
D. Schedule a Databricks workflow to run hourly.
Best answer: B
Explanation: Unity Catalog privileges control access to governed data objects such as catalogs, schemas, and tables. Scheduling, deployment, and repair behavior belong to Databricks workflow or deployment features, not to Unity Catalog permissions.
Unity Catalog is the governance layer for Databricks data objects. Its privileges determine who can discover or use objects and what actions they can perform, such as reading a table with SELECT. That makes granting table access a governance decision. By contrast, Databricks Workflows handle job scheduling and run management, and Databricks Asset Bundles package and deploy resources across environments. Those tools control how data pipelines are executed or promoted, not who is authorized to access a catalog, schema, or table.
The key distinction is that Unity Catalog answers access-control questions, while workflow and deployment features answer execution and release-management questions.
Topic: Data Governance & Quality
In Unity Catalog, which role is primarily used for governance administration at the metastore level, such as creating catalogs and overseeing access across workspaces attached to that metastore?
Options:
A. Metastore admin
B. Catalog owner
C. Workspace admin
D. Schema owner
Best answer: A
Explanation: Metastore admin is the top Unity Catalog governance role for a metastore. When the requirement is centralized administration across attached workspaces, including catalog creation and broad access oversight, that scope is higher than a workspace role or ownership of a single catalog or schema.
Unity Catalog uses scopes, and the metastore is the highest governance boundary for data objects managed together across attached workspaces. Because of that, the metastore admin role is the one associated with centralized governance administration at that level, including tasks such as creating catalogs and managing metastore-wide governance configuration.
The key clue is scope: if the scenario emphasizes top-level governance for many data assets or multiple attached workspaces, think metastore admin.
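The scope rule can be modeled in a few lines. This is a simplified mental model, not the real Unity Catalog permission engine: a role at one level of the object hierarchy covers everything nested beneath it.

```python
# Hedged sketch of Unity Catalog's scope hierarchy
# (metastore > catalog > schema > table). The rule modeled: an
# admin/owner role at one level governs objects at or below it.
SCOPES = ["metastore", "catalog", "schema", "table"]

def governs(role_level: str, object_level: str) -> bool:
    """True when a role at role_level covers objects at object_level."""
    return SCOPES.index(role_level) <= SCOPES.index(object_level)

assert governs("metastore", "catalog")    # metastore admin can create catalogs
assert governs("catalog", "schema")       # catalog owner covers its schemas
assert not governs("schema", "catalog")   # schema owner cannot govern a catalog
```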
Topic: Databricks Intelligence Platform
An engineering lead shares this planning note:
Need one governed environment to:
- ingest files and streams
- build bronze, silver, and gold tables
- schedule production pipelines
- let analysts use SQL and engineers use Python
- share curated data with external partners
Which Databricks concept best matches this need?
Options:
A. Build one Lakeflow Spark Declarative Pipeline
B. Create one shared ETL notebook
C. Tune one Databricks SQL query
D. Adopt the Databricks Data Intelligence Platform end to end
Best answer: D
Explanation: The note covers multiple stages of the data lifecycle and multiple user groups. That points to the Databricks Data Intelligence Platform as an end-to-end governed solution, not to a single notebook, pipeline, job, or SQL statement.
The key clue is scope. A single notebook, pipeline, job, or SQL query solves one task, but the exhibit combines ingestion, medallion-style transformations, governance, production scheduling, SQL access, Python development, and external sharing. In Databricks, that is a platform-level scenario: the Databricks Data Intelligence Platform brings these capabilities together in one governed environment.
When a question mentions several teams and several workflow stages at once, the best interpretation is usually the overall platform value. Specific features such as Lakeflow Spark Declarative Pipelines, notebooks, or Databricks SQL may be used within that platform, but none of them alone explains the full request.
Topic: Data Processing & Transformations
A team uses Auto Loader to ingest partner JSON files directly into a single Delta table that is also queried by dashboards. After a source change adds new fields and a few malformed records, the ingestion job continues, but a finance dashboard fails and another team can no longer reliably reuse the data for detailed analysis or reprocessing.
What is the best next step to reduce these recurring issues?
Options:
A. Separate the pipeline into Bronze, Silver, and Gold tables
B. Move the dashboard workload to a larger SQL warehouse
C. Allow analysts to correct bad records directly in the ingestion table
D. Delete the saved Auto Loader schema and re-run ingestion
Best answer: A
Explanation: The best fix is to stop using one table for raw ingestion, cleanup, and reporting. Medallion layers improve clarity by separating purposes, improve reuse by preserving raw records, and improve quality by applying validation and standardization before business users query the data.
This scenario shows why combining raw ingestion and reporting in one table creates fragile pipelines. In a medallion design, Bronze stores the raw landed data with minimal changes, Silver applies cleansing, schema standardization, and deduplication, and Gold exposes business-ready tables for dashboards. That separation means source changes or malformed records can be handled in earlier layers without immediately breaking downstream consumers.
It also improves reuse because different teams can use the right layer for their needs: engineers can reprocess from the raw Bronze data, analysts can run detailed analysis on cleaned Silver tables, and dashboards read stable, business-ready Gold tables.
The key idea is not bigger compute or manual fixes in place, but clearer layer boundaries so data quality and consumer expectations are managed explicitly.
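The layer separation can be sketched as a small pipeline. The record shape, validation rule, and field names are invented for illustration; the point is only the division of responsibility, not a real Delta implementation.

```python
# Minimal medallion sketch: Bronze keeps every landed record as-is,
# Silver keeps only records that pass validation (with standardized
# fields), Gold aggregates for the dashboard.

def build_layers(raw_records):
    bronze = list(raw_records)  # raw landing zone, unmodified
    silver = [
        {"store": r["store"].strip().upper(), "amount": float(r["amount"])}
        for r in bronze
        if r.get("store") and r.get("amount") is not None  # drop malformed rows
    ]
    gold = {}  # business-ready aggregate: sales per store
    for r in silver:
        gold[r["store"]] = gold.get(r["store"], 0.0) + r["amount"]
    return bronze, silver, gold

raw = [
    {"store": " s1 ", "amount": "10.0"},
    {"store": "S1", "amount": "5.5"},
    {"store": None, "amount": "3.0"},   # malformed: survives in Bronze only
]
bronze, silver, gold = build_layers(raw)
print(len(bronze), len(silver), gold)
# 3 2 {'S1': 15.5}
```

A malformed record no longer breaks the dashboard: it is preserved in Bronze for reprocessing while Silver and Gold stay clean.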
Topic: Data Governance & Quality
Which situation is the best reason to grant a Unity Catalog privilege to a group instead of granting it separately to individual users?
Options:
A. An external partner needs read-only access without being added internally.
B. One engineer needs short-term access to a single table.
C. Each analyst needs different privileges on different tables.
D. Many analysts need the same table access, and team membership changes often.
Best answer: D
Explanation: In Unity Catalog, groups are the preferred way to manage the same access for many users. You grant the privilege once to the group, then handle onboarding and offboarding by changing group membership instead of repeating grants for each person.
The core concept is group-based access control in Unity Catalog. When multiple users need the same privilege on the same data objects, granting access to a group is simpler and more consistent than granting the same privilege to each user individually. The privilege is assigned once to the group principal, and user access changes are handled by adding or removing users from that group. This reduces repetitive administration and lowers the risk that someone keeps access after changing teams.
This approach fits best when access needs are shared and membership changes over time. It is less useful when only one person needs access or when each person needs a different set of privileges. For external recipients, Delta Sharing is the better pattern than internal Unity Catalog group membership.
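The administrative difference can be made concrete with a toy access model. Names are hypothetical, and this is not Unity Catalog's actual API; it only shows why one group grant plus membership changes beats many per-user grants.

```python
# Hedged sketch of group-based access: the privilege is granted once
# to the group, and user turnover is handled by membership changes.
grants = {("analysts", "main.gold.daily_store_sales"): "SELECT"}
membership = {"analysts": {"ana@example.com", "bob@example.com"}}

def can_select(user: str, table: str) -> bool:
    """Resolve a user's effective SELECT access via group grants."""
    return any(
        user in membership.get(group, set())
        for (group, t), priv in grants.items()
        if t == table and priv == "SELECT"
    )

table = "main.gold.daily_store_sales"
assert can_select("ana@example.com", table)

# Offboarding is one membership change; no grant is touched anywhere:
membership["analysts"].discard("bob@example.com")
assert not can_select("bob@example.com", table)
```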
Topic: Productionizing Data Pipelines
A team wants to deploy the same Databricks Asset Bundle to development and production. A teammate says the job should be duplicated under each target.
Exhibit:
bundle:
name: orders-pipeline
resources:
jobs:
ingest_job:
name: ingest-job
tasks:
- task_key: load_bronze
notebook_task:
notebook_path: ../src/load_bronze.py
targets:
dev:
default: true
workspace:
host: https://dev.acme.databricks.example
prod:
workspace:
host: https://prod.acme.databricks.example
run_as:
user_name: prod-service@example.com
What is the best response?
Options:
A. Each target must repeat the full job definition to deploy it.
B. resources only names the bundle; jobs are created from targets.
C. The job belongs in resources; targets hold environment-specific deployment settings.
D. The notebook path should move to targets because code location is environment-specific.
Best answer: C
Explanation: In a Databricks Asset Bundle, resources contains the reusable asset definition, such as the job, tasks, and notebook path. targets provide deployment-specific context for each environment, such as workspace host, default target, or run_as identity. This lets the same job be promoted without copying the job definition.
In Databricks Asset Bundles, resources is where you define the deployable object itself. In this fragment, the job name, task key, and notebook path are all part of the job resource definition. targets describes environment-specific deployment context so the same bundle can be deployed to different environments without rewriting the resource.
In this fragment, workspace.host, default, and run_as live under targets because they are deployment context, while the job itself lives under resources. The closest distractor is the idea that every target needs its own full job block, which defeats the reuse pattern that bundles are designed to provide.
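The reuse pattern can be sketched as a merge of one shared resource definition with per-target overrides. This mimics the YAML fragment above in plain Python; it is not the actual bundle deployment engine.

```python
# Hedged sketch: one job definition, merged with environment-specific
# deployment context per target. Values mirror the exhibit.
RESOURCE = {  # defined once under resources.jobs
    "name": "ingest-job",
    "tasks": [{"task_key": "load_bronze",
               "notebook_path": "../src/load_bronze.py"}],
}

TARGETS = {  # environment-specific deployment context only
    "dev":  {"host": "https://dev.acme.databricks.example"},
    "prod": {"host": "https://prod.acme.databricks.example",
             "run_as": "prod-service@example.com"},
}

def deploy_plan(target: str) -> dict:
    """Same job definition, different deployment context per target."""
    return {**RESOURCE, **TARGETS[target]}

dev, prod = deploy_plan("dev"), deploy_plan("prod")
assert dev["tasks"] == prod["tasks"]   # job definition is shared
assert dev["host"] != prod["host"]     # only deployment context differs
```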
The notebook path stays where it is because notebook_task is part of the job definition under resources.jobs.
Topic: Databricks Intelligence Platform
A team lands raw files in cloud object storage and transforms them in Databricks. Their nightly task failed again because an external scheduler called the Databricks job with an expired token. Table permissions are also managed in a separate catalog tool, so engineers must check multiple systems whenever the pipeline breaks. What is the best next step to reduce this operational friction?
Options:
A. Run the pipeline in Databricks Workflows on job compute with Unity Catalog
B. Publish the tables with Delta Sharing
C. Increase job cluster size and autoscaling limits
D. Keep the external scheduler and rotate tokens more often
Best answer: A
Explanation: The best fix is to consolidate orchestration, compute, and governance inside Databricks. Using Databricks Workflows with job compute and Unity Catalog reduces cross-tool handoffs, avoids external scheduler token failures, and makes troubleshooting more centralized.
This scenario shows operational friction caused by splitting orchestration, compute access, and governance across different tools. A unified Databricks approach reduces that friction by letting the team run scheduled tasks with Databricks Workflows, execute them on Databricks job compute, and govern the resulting tables with Unity Catalog. That means scheduling, execution history, permissions, and data objects are managed together instead of being spread across separate systems.
When a run fails, engineers can investigate from one platform rather than tracing an external scheduler, separate credential flow, and separate catalog configuration. This does not guarantee every pipeline bug disappears, but it removes an avoidable integration point and simplifies day-to-day operations. The closest distractor only treats the token symptom without addressing the fragmented tooling.
Topic: Data Governance & Quality
A team manages a Unity Catalog table with daily sales data. An external partner that does not use the team’s Databricks account needs read-only access, and the team must keep access governed without copying the data or sharing cloud storage credentials. What is the best next action?
Options:
A. Convert the table to an external table and send storage credentials.
B. Create a Delta Sharing share, add the table, and create a recipient.
C. Configure Lakehouse Federation so the partner can query the table.
D. Grant SELECT on the table to the partner in the current workspace.
Best answer: B
Explanation: Delta Sharing is built for governed, read-only sharing of Databricks data with recipients, including external organizations. The provider should create a share, add the needed table, and create a recipient so access remains controlled through Databricks governance.
Delta Sharing is the correct Databricks mechanism when a team needs to expose data to another consumer in a governed way, especially outside its own Databricks account. It lets the data provider share selected tables or views as read-only data products without copying the underlying data or handing out cloud storage credentials. In Databricks, the normal flow is to create a share, add the approved data objects, and then create or assign a recipient that can access that share.
Lakehouse Federation is for querying external systems from Databricks, not for publishing Databricks data to another consumer.
Granting SELECT in the current workspace only fits principals governed inside that environment, not an external partner outside the account.
Topic: Development and Ingestion
A data engineer is exploring daily JSON files that land in cloud object storage. The team needs to read a sample with PySpark, run SQL checks after a simple cleanup, add Markdown notes for reviewers, and inspect outputs together before deciding how to productionize the pipeline. What is the best next action in Databricks?
Options:
A. Publish the landing files with Delta Sharing for external review.
B. Use a Databricks notebook for PySpark, SQL, Markdown, and shared result review.
C. Create a Lakeflow Spark Declarative Pipeline and schedule it immediately.
D. Use Databricks Connect so each engineer develops locally in separate files.
Best answer: B
Explanation: A Databricks notebook is the best fit for interactive, collaborative development. It lets the team combine PySpark ingestion, SQL validation, Markdown documentation, and immediate output inspection in one shared workflow before turning the logic into a production pipeline.
Databricks notebooks are designed for iterative data-engineering work when code, documentation, and results need to stay together. In this scenario, the team wants to ingest sample JSON data with PySpark, run SQL checks after cleanup, explain findings with Markdown, and review outputs collaboratively before committing to a production design. A notebook supports all of that in a single artifact, including cell results that can be inspected directly and shared with teammates. That makes notebooks the right first step for exploration and validation. By contrast, deployment tools are more appropriate once the logic is stable, and sharing features solve distribution needs rather than mixed-language development. The key takeaway is that notebooks combine code, SQL, Markdown, and result inspection in one collaborative workflow.
Topic: Data Governance & Quality
A Databricks workflow task reads a partner dataset through a foreign catalog. It now fails with:
Task: refresh_partner_orders
Status: FAILED
Message: Connection to source database timed out
The partner says they will not allow direct queries against their operational database. They only want to expose a few approved tables as read-only data and revoke access centrally when needed. What is the best next step?
Options:
A. Create external tables that point directly to the partner database.
B. Scale up the compute that runs the federated query.
C. Ask the partner to share the approved tables with Delta Sharing.
D. Request a read-only database account and keep using Lakehouse Federation.
Best answer: C
Explanation: This is a governed sharing requirement, not an in-place query requirement. When a provider wants to publish selected read-only tables and retain central control over revocation, Delta Sharing is the appropriate Databricks solution.
Lakehouse Federation is used for in-place querying of external systems when Databricks can connect directly to those systems. In this scenario, the partner explicitly refuses direct database access from customer environments and wants to expose only a small approved subset of data with centrally managed revocation. That maps to Delta Sharing, which is designed for secure, controlled, read-only sharing between organizations without giving the consumer direct access to the provider’s operational database.
The key distinction is the access model: federation queries the source in place, while sharing publishes approved data for governed consumption. Continuing to troubleshoot connectivity or resize compute would not address the provider’s stated access requirement.
Topic: Data Processing & Transformations
A nightly Databricks batch job uses built-in SQL to join and aggregate Delta tables, and the results are correct. After source data volume doubled, the job now misses its SLA on a small job compute configuration. Which change is most appropriate before rewriting the transformations?
Options:
A. Increase the job compute size, such as adding workers
B. Add extra medallion layers to the pipeline
C. Replace the built-in SQL logic with Python UDFs
D. Rewrite the SQL transformations in PySpark
Best answer: A
Explanation: This is a workload-fit issue, not a logic-correctness issue. When built-in Spark SQL transformations already work correctly and data volume increases, the first response is usually to give the job more appropriate compute resources.
Compute fit means matching the Databricks compute configuration to the workload’s size and pattern. In this case, the transformation logic is already correct and uses built-in SQL operations, so the new symptom points to insufficient capacity after data growth rather than a need to redesign the code.
The closest distractor is rewriting the SQL in PySpark, but built-in SQL and DataFrame-style operations both use Spark’s optimizer, so changing APIs is not the first fix for a simple capacity problem.
Topic: Development and Ingestion
A data engineering team uses a Databricks Workflow task to run a notebook that starts an Auto Loader stream. The notebook should continuously ingest CSV files from cloud object storage into a Unity Catalog bronze table. After deployment, the task fails immediately and no files are processed. The team wants the fastest built-in Databricks way to see whether the failure is caused by an option typo, a missing storage permission, or a schema issue. What should they use first?
Options:
A. Review the failed task error output in Workflows
B. Inspect Spark UI stage and shuffle metrics
C. Download driver and executor logs first
D. Check Unity Catalog lineage for the bronze table
Best answer: A
Explanation: For a job that fails immediately, the failed task output in Databricks Workflows is the quickest built-in place to inspect the actual exception. It typically surfaces Auto Loader option errors, access failures, and schema-related exceptions directly in the run details.
When an Auto Loader task fails before any data is processed, start with the failed notebook or workflow task output. Databricks captures the exception message and stack trace there, which is usually enough to identify common Associate-level issues such as an invalid cloudFiles option, missing permission to read the source path, or a schema mismatch when inferring or writing data.
This is the best first debugging aid because it points directly to the failing command without extra setup. Heavier tools are useful later only if the error output is not specific enough.
For simple Auto Loader debugging, start where Databricks first reports the exception.
Topic: Development and Ingestion
A Databricks Workflow runs an Auto Loader notebook every hour to ingest new CSV files from cloud storage into a bronze Delta table. Last night’s run failed during the ingestion step, and the team needs to identify the specific failing stage before rerunning. Unity Catalog permissions and lineage are already in place. What is the best next action?
Options:
A. Redeploy the workflow with Databricks Asset Bundles
B. Review Unity Catalog lineage for the bronze table
C. Search audit logs for external location access events
D. Inspect the failed workflow run, driver logs, and Spark UI
Best answer: D
Explanation: To troubleshoot a failed Auto Loader step, the most direct Databricks tools are the failed run output, driver logs, and Spark UI. They reveal the actual error and where execution failed, which is what the team needs before rerunning.
When a specific Auto Loader ingestion step fails, start with the diagnostics from that failed run. In Databricks, the workflow run output, driver logs, and Spark UI are built-in debugging tools that help locate the failing stage, task, and exception. That is the fastest way to determine whether the issue is a bad option, schema problem, file-format issue, or storage-path problem.
Governance features serve different purposes. Unity Catalog lineage helps trace data relationships, and audit logs help review access or administrative events, but neither is designed to explain why one ingestion run failed at a particular stage. Databricks Asset Bundles help package and deploy workflows, not debug a completed failed run. For this scenario, use run-level diagnostics first, then correct the issue and rerun.
Topic: Data Governance & Quality
A data engineering team must give a business partner read-only access to a Delta table. The partner is outside the team’s Unity Catalog environment and should not receive direct table privileges in the producer’s catalog. Which statement is accurate?
Options:
A. Lakehouse Federation is used to publish Databricks tables to outside recipients.
B. Delta Sharing is designed for read-only sharing to recipients outside the producer’s Unity Catalog context.
C. A GRANT SELECT on the table is the standard way to share with external partners.
D. Creating an external table automatically makes it available to outside recipients.
Best answer: B
Explanation: Delta Sharing is the Databricks capability for governed, read-only sharing when the recipient is outside the producer’s immediate Unity Catalog permission boundary. Local table grants are meant for principals inside that governed environment, not for external partner sharing.
Delta Sharing is used when a provider wants to share live, read-only data with recipients outside the producer’s local access-control scope, such as another company or another platform context. Instead of granting direct access on the underlying table, the provider creates a share and authorizes recipients to consume that share. By contrast, Unity Catalog privileges like GRANT SELECT are for users, groups, or service principals governed within the producer’s Databricks permission model. Lakehouse Federation addresses querying external systems from Databricks, not distributing your Databricks tables outward. External tables describe where data is stored, but they do not by themselves create an external sharing mechanism. The key distinction is internal access uses grants, while external governed sharing uses Delta Sharing.
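The provider-side flow can be sketched conceptually: create a share, add approved tables, then grant the share to a recipient. Object names below are invented, and the sketch only models visibility; in Databricks the real steps use SQL such as CREATE SHARE, ALTER SHARE ... ADD TABLE, and CREATE RECIPIENT.

```python
# Hedged model of the share/recipient indirection: recipients never
# receive grants on the table itself, only on a share that contains it.
shares: dict[str, set[str]] = {}
recipients: dict[str, set[str]] = {}

def create_share(name: str) -> None:
    shares[name] = set()

def add_table(share: str, table: str) -> None:
    shares[share].add(table)

def grant_share(share: str, recipient: str) -> None:
    recipients.setdefault(recipient, set()).add(share)

def readable_tables(recipient: str) -> set[str]:
    """Recipients see only tables placed in shares granted to them."""
    return {t for s in recipients.get(recipient, set()) for t in shares[s]}

create_share("partner_sales")
add_table("partner_sales", "main.gold.daily_store_sales")
grant_share("partner_sales", "acme_partner")

print(readable_tables("acme_partner"))
# {'main.gold.daily_store_sales'}
```

Revocation is central and clean in this model: removing the grant or the table from the share cuts the recipient off without touching the underlying data or storage credentials.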
Topic: Productionizing Data Pipelines
A team is deploying a scheduled ETL workflow in Databricks. They want the most hands-off compute option: no node type selection, no worker sizing, and Databricks should automatically provision, scale, and optimize the compute for each run. Which compute choice is the best fit?
Options:
A. Single-node compute
B. Job compute with manually configured workers
C. All-purpose compute
D. Serverless compute
Best answer: D
Explanation: Serverless compute is the best fit when a team wants Databricks to handle infrastructure management and performance optimization automatically. It is intended for low-overhead execution, especially for production workloads where manual compute setup is unnecessary.
The key concept is choosing compute based on how much operational control the team wants to manage. For a scheduled ETL workload that should be as hands-off as possible, serverless compute is the best fit because Databricks manages provisioning, scaling, and many performance optimizations automatically.
This matches the stem’s requirements: the team does not want to choose node types, size workers, or tune compute settings for each run. Serverless compute reduces that overhead so engineers can focus on pipeline logic instead of cluster management.
A manually configured job compute resource can still run scheduled pipelines, but it requires more setup and sizing decisions than the stem allows.
Topic: Productionizing Data Pipelines
A nightly Databricks job orchestrates three dependent tasks on a job cluster:
ingest_bronze SUCCESS
build_silver FAILED
publish_gold UPSTREAM_FAILED
ingest_bronze uses Auto Loader. The cluster has been meeting SLA, and the failure was caused by a temporary credential issue that is now fixed. The team wants to continue the workflow quickly, avoid re-ingesting files, and leave compute settings unchanged. What is the best next action?
Options:
A. Increase job-cluster workers and rerun the entire workflow.
B. Run the failed notebook manually on all-purpose compute.
C. Redeploy the job with Databricks Asset Bundles first.
D. Run a repair for the failed job run.
Best answer: D
Explanation: This is a workflow recovery problem, not a compute-sizing problem. Because the upstream ingestion task already succeeded and the issue was fixed outside compute, a repair run is the fastest way to resume from the failure point without reprocessing bronze data.
Databricks Workflows Repair Run is intended for cases where part of a job run completed successfully and only failed or blocked tasks need to run again. In this scenario, ingest_bronze already finished, the root cause was a temporary credential issue, and the cluster was already meeting performance targets. That means changing workers or switching compute does not solve the real problem.
A repair run lets the workflow rerun the failed build_silver task and the downstream task that was skipped because of that failure. This keeps the recovery inside the job’s orchestration logic, preserves run history, and avoids unnecessary upstream reprocessing. The key distinction is that rerun/repair behavior is an orchestration concern, while worker counts and compute type are configuration concerns.
Topic: Development and Ingestion
A data engineering team receives JSON files in cloud object storage throughout the day from an upstream application. New files arrive at unpredictable times, and the application occasionally adds new optional columns. The team wants a Databricks ingestion approach that automatically discovers new files and minimizes manual schema maintenance before loading a bronze table. Which approach is the best fit?
Options:
A. Schedule COPY INTO runs against the storage directory
B. Query the source with Lakehouse Federation
C. Share the source files with Delta Sharing
D. Configure Auto Loader with cloudFiles and schema evolution
Best answer: D
Explanation: Auto Loader is the Databricks feature built for ingesting files that keep arriving in cloud storage. It is the best match when the source schema may evolve and the team wants automatic file discovery with less operational overhead.
The core concept is choosing the ingestion pattern that matches both the arrival pattern and the schema behavior of the source data. Auto Loader is designed for cloud object storage sources where files land continuously or unpredictably, and it supports schema inference and schema evolution for changing file structures such as JSON.
In this scenario, Auto Loader fits because it discovers newly arriving files automatically, infers and evolves the schema when the source adds optional columns, and minimizes manual maintenance before the bronze load.
COPY INTO is useful for incremental batch loads, but it is not the best fit when continuous file discovery and schema evolution are the main requirements. Federation and sharing address access to existing external data, not ingestion of newly arriving raw files.
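The schema-evolution behavior the question relies on can be sketched in miniature. Field names are invented, and the real mechanism is different: Auto Loader tracks inferred schemas in a schema location and is configured through cloudFiles options rather than code like this.

```python
# Hedged sketch of schema widening: when later files introduce new
# optional columns, the tracked schema grows instead of requiring
# manual DDL. Type names are simplified to Python type names.

def merge_schema(known: dict[str, str], record: dict) -> dict[str, str]:
    """Add newly seen fields to the tracked schema."""
    merged = dict(known)
    for field, value in record.items():
        merged.setdefault(field, type(value).__name__)
    return merged

schema: dict[str, str] = {}
files = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": 4.50, "coupon": "SPRING"},  # new optional column
]
for rec in files:
    schema = merge_schema(schema, rec)

print(schema)
# {'order_id': 'int', 'amount': 'float', 'coupon': 'str'}
```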
The COPY INTO option is a batch-oriented loading pattern and is less appropriate when ongoing file discovery and schema changes are central requirements.
Topic: Productionizing Data Pipelines
A team runs an hourly Databricks Workflow. Task 1 uses Auto Loader to ingest CSV files into a bronze Unity Catalog table, and Task 2 joins that data to a dimension table and writes a silver table. The schedule and Unity Catalog governance must stay unchanged. Recent runs succeed, but Task 2 has slowed from 5 minutes to 38 minutes.
Workflow status: Succeeded
Permission errors: none
Spark UI, Stage 14: 199 tasks < 10 sec, 1 task = 24 min
Shuffle read: highly uneven across tasks
What is the best next action?
Options:
A. Repair the workflow run so Task 2 reruns by itself
B. Move the workflow to serverless compute to eliminate orchestration delay
C. Investigate data skew in the transformation and adjust the join or partition strategy
D. Grant additional Unity Catalog privileges on the silver table
Best answer: C
Explanation: This is a Spark execution bottleneck, not a permission or scheduling problem. The workflow succeeds, no access errors appear, and the Spark UI shows one task doing far more shuffled work than the others, which points to skew in the transformation.
When a Databricks Workflow can read and write Unity Catalog tables and completes successfully, permission and orchestration problems are less likely. The deciding clue here is the Spark UI: almost all tasks finish quickly, but one task in the stage runs much longer and handles a disproportionate amount of shuffle data. That pattern points to a Spark performance issue, commonly data skew or an inefficient join or partitioning choice in the transformation.
The right next step is to analyze and tune the transformation based on the Spark UI evidence, not change permissions or workflow mechanics.
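One low-effort way to start, sketched below in Spark SQL: confirm whether a small number of join keys dominate the shuffle, and let adaptive query execution split skewed partitions. The table and column names are illustrative, not from the scenario.

```sql
-- Let adaptive query execution detect and split skewed shuffle partitions
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;

-- Check whether a handful of keys carry most of the rows
-- (main.silver.events and join_key are hypothetical names)
SELECT join_key, COUNT(*) AS rows_per_key
FROM main.silver.events
GROUP BY join_key
ORDER BY rows_per_key DESC
LIMIT 10;
```

If one key dwarfs the rest, follow-up options include salting that key or repartitioning before the join; if the skew is mild, adaptive skew-join handling alone may be enough.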
Topic: Databricks Intelligence Platform
A data engineer is building a new bronze-to-silver notebook. They need to ingest a sample of JSON files, inspect schema changes, and rerun PySpark cells repeatedly with a teammate during development. The data is governed in Unity Catalog, and the workload will be scheduled only after the logic is finalized. Which compute choice is most appropriate now?
Options:
A. All-purpose compute attached to the notebook
B. Job compute in Databricks Workflows
C. A SQL warehouse for the PySpark notebook
D. A Lakeflow Spark Declarative Pipelines pipeline
Best answer: A
Explanation: All-purpose compute is the best fit for interactive notebook development and data exploration. The engineer needs to inspect data, test PySpark logic, and rerun cells iteratively before moving to scheduled production execution.
The key decision is matching compute to the current stage of the workload. All-purpose compute is intended for interactive notebook use, exploratory analysis, and iterative development. In this scenario, the engineer is still validating ingestion behavior, checking schema changes, and refining PySpark transformations with repeated notebook runs. Unity Catalog governance still applies, but it does not make a production-oriented compute choice preferable.
The strongest clue is that scheduling comes later, after the notebook logic is finalized.
Topic: Data Governance & Quality
Which statement best describes Delta Sharing in Databricks?
Options:
A. A governance model for organizing lakehouse data into bronze, silver, and gold layers
B. A virtual-query layer for external databases without moving them into Databricks
C. An ingestion service that incrementally discovers and loads new files from cloud storage
D. A secure open-sharing protocol for governed data, often without copying it into each recipient platform
Best answer: D
Explanation: Delta Sharing is a Databricks capability for secure, governed data sharing across teams, organizations, and platforms. Its purpose is to expose shared data to recipients without requiring every consumer to first create a full separate copy of that data.
Delta Sharing is used when a data provider wants to share governed data with another internal team or an external recipient in a controlled way. It uses an open sharing protocol so the provider can publish specific data assets while maintaining governance over what is shared. The important idea in this question is that Delta Sharing is about distribution of access to data, not file ingestion, data modeling, or querying an external operational system in place.
It differs from nearby concepts:
- The medallion architecture (bronze, silver, gold) is a data-modeling convention, not a sharing mechanism.
- Lakehouse Federation queries external databases in place; it does not publish governed data to recipients.
- Auto Loader ingests newly arriving files from cloud storage; it does not distribute access to existing data.
The key takeaway is that Delta Sharing solves governed data sharing, often without forcing full duplication for each consumer.
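On the provider side, the flow can be sketched with Databricks SQL share commands. This is a minimal sketch; the share, table, and recipient names are illustrative placeholders.

```sql
-- Create a share and add a governed table to it (names are hypothetical)
CREATE SHARE IF NOT EXISTS sales_share;
ALTER SHARE sales_share ADD TABLE main.gold.daily_store_sales;

-- Register a recipient and grant read access to the share
CREATE RECIPIENT IF NOT EXISTS partner_org;
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;
```

Note that nothing here copies data: the share publishes access to governed assets, and the provider can later revoke the grant or remove objects from the share.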
Topic: Data Processing & Transformations
A team has a batch ETL job whose SQL transformation is already correct. The job runs briefly a few times per day, does not require custom cluster settings, and the team wants to avoid managing clusters or keeping idle compute running. Which Databricks capability is the best fit to solve this by changing compute instead of rewriting the transformations?
Options:
A. Auto Loader
B. Databricks Connect
C. Serverless compute
D. Lakeflow Spark Declarative Pipelines
Best answer: C
Explanation: This is a compute-fit problem, not a transformation-logic problem. Serverless compute is designed for workloads that can run on managed Databricks compute without cluster administration, so the team can keep the existing SQL and change only how it runs.
The key concept is choosing the right compute for the workload. When a batch transformation is already correct and the main issue is that the job is short-lived, intermittent, and does not need custom cluster settings, serverless compute is usually the better fix than rewriting the pipeline logic. It lets Databricks manage the underlying compute so the team does not need to size, start, or maintain clusters for that workload.
The important distinction is that serverless compute changes how the workload is executed, not what the transformation does.
Topic: Data Governance & Quality
A data engineer is creating a new Unity Catalog table for curated customer data. The team wants Databricks to manage the table’s storage lifecycle, including handling the underlying data files and cleaning them up when the table is dropped. Which object should the engineer create?
Options:
A. A volume
B. An external table
C. A managed table
D. A view
Best answer: C
Explanation: A managed table is the best fit when Databricks should control the table’s storage and lifecycle. It simplifies administration because Databricks manages the underlying data files instead of just registering data stored elsewhere.
In Unity Catalog, a managed table is used when you want Databricks to manage both the table metadata and the underlying data lifecycle. This is the simpler choice for new lakehouse data when there is no requirement to keep the files in a separately controlled external location. Managed tables align with the requirement that Databricks handle storage behavior and cleanup of the data it manages when the table is dropped.
External tables are different: they reference data in an external location that remains outside full Databricks lifecycle control. The key decision is whether Databricks should own the table’s data lifecycle or only point to externally managed files.
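The contrast can be sketched in DDL. A sketch with hypothetical catalog, schema, and storage-path values; the external path in particular is a placeholder, not a working location.

```sql
-- Managed table: Unity Catalog places the data files and deletes them on DROP
CREATE TABLE main.curated.customers (
  customer_id BIGINT,
  email       STRING
);

-- External table (for contrast): Databricks only references files that live
-- in an external location and remain under separate lifecycle control
CREATE TABLE main.curated.customers_ext (
  customer_id BIGINT,
  email       STRING
)
LOCATION 'abfss://container@account.dfs.core.windows.net/customers/';  -- placeholder path
```

The visible difference is the LOCATION clause: omitting it lets Unity Catalog own storage placement and cleanup, which is exactly the requirement in this scenario.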
Topic: Data Governance & Quality
A team publishes a Unity Catalog table through Delta Sharing. Finance reports higher costs.
Spark UI for the publish job:
- Runtime: unchanged
- Spill/skew warnings: none
- Cluster autoscaling range: unchanged
Recent change:
- A consumer on another cloud now refreshes a dashboard from the share every hour
What is the best next step?
Options:
A. Review cross-cloud data-transfer costs and data locality
B. Rebuild ingestion with Auto Loader
C. Convert the shared table to a managed table
D. Increase the publish job cluster size
Best answer: A
Explanation: The Spark UI shows the publishing compute is behaving the same as before, so this is unlikely to be a compute-cost problem. The new hourly cross-cloud reads are the key change, making data-transfer cost from sharing the first thing to investigate.
Differentiate the cost type by checking what changed. Spark UI symptoms such as unchanged runtime, no spill or skew, and the same autoscaling range suggest the publish job’s compute profile is stable. When the new event is a consumer in another cloud reading the shared data through Delta Sharing, the most likely new cost category is cross-cloud data transfer, not DBU consumption from the job cluster.
A good next step is to review:
- how much data the consumer pulls from the share per refresh, and whether an hourly dashboard refresh is actually needed
- the cross-cloud egress charges incurred by serving the share to a recipient in a different cloud
- data-locality options, such as replicating or staging the shared data closer to the consumer if transfer costs dominate
Cluster tuning, ingestion changes, or table-type changes can affect performance or governance, but they do not directly solve a cost created by moving data across cloud boundaries.
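As a starting point for that review, the provider can inspect what the share exposes and who consumes it. A sketch using share-inspection commands; the share and recipient names are illustrative.

```sql
-- What objects does the share expose? (sales_share is a hypothetical name)
DESCRIBE SHARE sales_share;

-- Which recipients have access to it?
SHOW GRANTS ON SHARE sales_share;

-- Recipient details, useful when checking which cloud/region the consumer is in
DESCRIBE RECIPIENT partner_org;
```

Pairing this inventory with the cloud provider's egress billing breakdown makes it possible to attribute the new cost to the cross-cloud consumer rather than to the publish job.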
Topic: Development and Ingestion
Which capability is provided by Databricks notebooks for day-to-day development, rather than by Databricks Connect, Workflows, or Databricks Asset Bundles?
Options:
A. Scheduled task orchestration with retries and dependencies
B. Interactive cell execution with inline results and charts
C. Local IDE execution against Databricks compute
D. Declarative resource packaging and deployment with YAML
Best answer: B
Explanation: Databricks notebooks are the interactive authoring surface for everyday development in the workspace. They let engineers run code cell by cell, inspect immediate output, and build quick visualizations inline, which is different from local IDE connectivity, orchestration, or deployment tooling.
Databricks notebooks are designed for interactive development inside the Databricks workspace. Their core value is fast iteration: write code, run a cell, review the result immediately, and optionally add lightweight visualizations in the same notebook. That makes notebooks the right choice when the task is about day-to-day exploration, debugging, or incremental development.
A useful way to separate these tools is:
- Databricks notebooks: interactive cell execution with inline results and charts in the workspace
- Databricks Connect: running code from a local IDE against Databricks compute
- Workflows: scheduled task orchestration with dependencies and retries
- Databricks Asset Bundles: declarative resource packaging and deployment with YAML
If the need is interactive execution with immediate feedback in the workspace, think notebooks first.
Use the Databricks Data Engineer Associate Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try Databricks Data Engineer Associate on Web View Databricks Data Engineer Associate Practice Test
Read the Databricks Data Engineer Associate Cheat Sheet on Tech Exam Lexicon for concept review before another timed run.