Databricks Data Engineer Associate Practice Test

Prepare for Databricks Certified Data Engineer Associate with free sample questions, a full-length diagnostic, topic drills, timed practice, Lakehouse Platform usage, ingestion, transformations, production workflows, governance, Spark and Delta scenarios, and detailed explanations in IT Mastery.

Databricks Certified Data Engineer Associate targets practical lakehouse engineering judgment across platform usage, ingestion, transformations, workflows, and governance. If you are searching for Databricks Data Engineer Associate sample questions, a practice test, mock exam, or exam simulator, this is the main IT Mastery page to start on web and continue on iOS or Android with the same IT Mastery account.

Interactive Practice Center

Start a practice session for Databricks Certified Data Engineer Associate below, or open the full app in a new tab. For the best experience, use the full-app view and navigate with swipes/gestures or the mouse wheel, just like on your phone or tablet.

Open Full App in a New Tab

A small set of questions is available for free preview. Subscribers can unlock full access by signing in with the same app-family account they use on web and mobile.

Prefer to practice on your phone or tablet? Download the IT Mastery – AWS, Azure, GCP & CompTIA exam prep app for iOS or the IT Mastery app on Google Play (Android), and use the same IT Mastery account across web and mobile.

Free diagnostic: Try the 45-question Databricks Data Engineer Associate full-length practice exam before subscribing. Use it as one baseline run, then return to IT Mastery for timed mocks, topic drills, explanations, and the full Databricks question bank.

What this practice page gives you

  • a direct route into IT Mastery practice for Databricks Certified Data Engineer Associate
  • 24 sample questions with detailed explanations across platform, ingestion, transformations, production workflows, and governance
  • focused practice around Databricks-native engineering decisions instead of generic Spark memorization
  • a clear free-preview path before you subscribe
  • the same IT Mastery account across web and mobile

Data Engineer Associate exam snapshot

  • Vendor: Databricks
  • Official exam name: Databricks Certified Data Engineer Associate
  • Exam code: Databricks DEA
  • Items: 45 total
  • Exam time: 90 minutes
  • Practice format on this page: 24 original IT Mastery sample questions with detailed explanations

This page offers introductory Databricks data engineering companion practice across platform fundamentals, ingestion, transformations, productionization, and Unity Catalog and sharing workflows, aligned to the 45-question, 90-minute exam.

Topic coverage for Data Engineer Associate practice

  • Databricks Intelligence Platform: 10%
  • Development and Ingestion: 17%
  • Data Processing & Transformations: 21%
  • Productionizing Data Pipelines: 17%
  • Data Governance & Quality: 35%

How to use the Data Engineer Associate simulator efficiently

  1. Start with platform and ingestion questions so Unity Catalog, Auto Loader, notebooks, and job orchestration choices feel clear.
  2. Review every miss until you can explain why the best Databricks-native pattern is safer, cleaner, or more maintainable than the distractors.
  3. Move into mixed sets once you can switch between Spark transformations, Delta Lake behavior, and production pipeline decisions without hesitation.
  4. Finish with timed runs so the 90-minute pace feels controlled before test day.

Databricks Data Engineer decision filters

Use these filters when several lakehouse answers look plausible:

  • Workload shape: decide whether the task is SQL analytics, batch ETL, streaming, orchestration, governance, or pipeline troubleshooting.
  • Object boundary: distinguish catalogs, schemas, tables, views, jobs, notebooks, clusters, warehouses, and Delta Live Tables.
  • Delta behavior: look for ACID transactions, schema evolution, time travel, optimization, partitioning, and file-management clues.
  • Governance signal: apply Unity Catalog permissions, lineage, sharing, and data-access controls before optimizing performance.
  • Production readiness: prefer the answer that makes pipelines observable, repeatable, recoverable, and maintainable.

Final 7-day Databricks practice sequence

  • Day 7: Take the free full-length diagnostic and tag misses by platform, ingestion, processing, pipeline, or governance.
  • Day 6: Drill workspace, compute, SQL warehouse, catalog, schema, table, and lakehouse concepts.
  • Day 5: Drill Delta tables, ingestion, transformation, file layout, optimization, and data-quality scenarios.
  • Day 4: Drill production jobs, orchestration, dependencies, monitoring, retries, and deployment patterns.
  • Day 3: Drill Unity Catalog, permissions, lineage, sharing, and governance cases.
  • Day 2: Complete a timed mixed set and explain the lakehouse object boundary behind each miss.
  • Day 1: Review weak Databricks service and object distinctions; avoid cramming low-value UI trivia.

When Databricks practice is enough

If several unseen mixed attempts are above roughly 75% and you can explain the lakehouse, Delta, pipeline, or governance reason behind your answers, you are likely ready. More practice should improve data-engineering judgment, not turn Databricks scenarios into memorized stems.

Focused sample questions

Use these child pages when you want focused IT Mastery practice before returning to mixed sets and timed mocks.

Free study resources

Need concept review first? Read the Databricks Data Engineer Associate Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks, topic drills, and full IT Mastery practice.

Free preview vs premium

  • Free preview: a smaller web set so you can validate the question style and explanation depth.
  • Premium: the full Data Engineer Associate practice bank, focused drills, mixed sets, timed mock exams, detailed explanations, and progress tracking across web and mobile.

Good next pages after Data Engineer Associate

  • Databricks DE-PRO if you are moving deeper into production pipelines and operations at scale
  • Databricks exam pages if you are still choosing between data engineering, analytics, machine learning, and GenAI tracks

24 Data Engineer Associate sample questions with detailed explanations

These are original IT Mastery practice questions aligned to Databricks Lakehouse concepts, Delta Lake, Spark SQL, data pipelines, jobs, and governance decisions. They are not Databricks exam questions and are not copied from any exam sponsor. Use them to check readiness here, then continue in IT Mastery with mixed sets, topic drills, and timed mocks.

Question 1

Topic: Data Governance & Quality

A producer team stores curated Delta tables in Unity Catalog on AWS us-east-1. Two recipient teams use separate Databricks accounts on AWS us-east-1, and one external partner uses Databricks on Azure. The producer needs governed read-only sharing and wants the lowest ongoing data-transfer cost without creating extra copies. What should the producer do?

Options:

  • A. Configure Lakehouse Federation so each recipient queries the producer tables in place.
  • B. Replicate all shared data to Azure first, then share that single copy with every recipient.
  • C. Use Databricks-to-Databricks Delta Sharing to the AWS recipients, and create a separate cross-cloud share only for the Azure partner.
  • D. Deliver nightly Parquet exports to each recipient’s cloud storage.

Best answer: C

Explanation: Delta Sharing is generally lowest cost when the provider and recipient are in the same cloud and region. Because two recipients are on AWS us-east-1, the producer should share with them directly in that region and use a cross-cloud share only for the Azure partner that actually requires it.

The key concept is that Delta Sharing data-transfer cost depends on where the provider and recipients are located. Same-cloud, same-region sharing is typically the lower-cost pattern because it avoids unnecessary cross-cloud or cross-region egress.

In this scenario, the producer and two recipients are all on AWS us-east-1, so those consumers should receive the share directly from that location. The Azure partner is the only consumer that requires cross-cloud access, so only that access path should incur cross-cloud transfer.

Using one Azure-centered pattern for everyone would force the AWS recipients onto a more expensive path, and exporting files would add copies and operational work instead of governed sharing.

  • Replicating everything to Azure first adds unnecessary transfer for the AWS recipients, which already match the producer’s cloud and region.
  • Lakehouse Federation is for querying external systems, not the standard governed sharing pattern for Databricks Delta tables.
  • Nightly Parquet exports create extra copies and pipeline overhead when the requirement is governed read-only sharing.

Question 2

Topic: Data Processing & Transformations

A data engineer reviews this requirement note:

Target table: main.sales.daily_orders
Current state: already exists with historical data
Source: temp view new_orders
Requirement: add all rows from new_orders
Constraint: keep existing rows unchanged

Which SQL statement pattern best matches this requirement?

Options:

  • A. INSERT INTO main.sales.daily_orders SELECT * FROM new_orders
  • B. INSERT OVERWRITE main.sales.daily_orders SELECT * FROM new_orders
  • C. CREATE TABLE main.sales.daily_orders AS SELECT * FROM new_orders
  • D. CREATE OR REPLACE TABLE main.sales.daily_orders AS SELECT * FROM new_orders

Best answer: A

Explanation: INSERT INTO ... SELECT is the standard append pattern for an existing table. The exhibit says the table already exists and that historical rows must remain, so the correct choice must add new rows without replacing current data.

In Databricks SQL, INSERT INTO target SELECT ... is used to append query results to an existing table. That matches the exhibit exactly: main.sales.daily_orders already exists, and the requirement is to add rows from new_orders while leaving historical rows in place.

INSERT OVERWRITE does not append; it replaces the target table’s current contents with the query output. CREATE OR REPLACE TABLE ... AS SELECT also rebuilds the table from the select results, so it is a replace pattern rather than an append pattern. CREATE TABLE ... AS SELECT is meant for creating a new table, not writing additional rows into one that already exists.

For a simple add-rows requirement, the append form is the best match.

  • The overwrite pattern fails because it replaces existing rows instead of preserving history.
  • The create-or-replace pattern fails because it recreates the table from the new query results.
  • The CTAS pattern is for initial table creation, so it does not fit an already existing target table.
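The append-versus-replace distinction holds in standard SQL, not just Databricks SQL, so it can be sketched locally with Python's sqlite3 module. The table and column names below are illustrative stand-ins for the scenario, not Databricks objects:

```python
import sqlite3

# In-memory database standing in for the warehouse; names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_orders (order_id INTEGER, order_total REAL)")
con.execute("CREATE TABLE new_orders (order_id INTEGER, order_total REAL)")

# Historical rows already in the target table.
con.execute("INSERT INTO daily_orders VALUES (1, 10.0), (2, 25.5)")
# New rows staged in the source.
con.execute("INSERT INTO new_orders VALUES (3, 7.25)")

# The append pattern: add source rows, keep existing rows unchanged.
con.execute("INSERT INTO daily_orders SELECT * FROM new_orders")

rows = con.execute("SELECT order_id FROM daily_orders ORDER BY order_id").fetchall()
print(rows)  # [(1,), (2,), (3,)] - rows 1 and 2 survive alongside the new row 3
```

An overwrite or create-or-replace pattern would instead leave only row 3 in the target, which is exactly the history loss the constraint rules out.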

Question 3

Topic: Data Processing & Transformations

A Delta table contains customer orders after schema enforcement, null handling, duplicate removal, and a join to a product reference table. The data is still stored at the individual order level so downstream teams can build different aggregates from it later. Which Medallion Architecture layer is the best fit for this table?

Options:

  • A. Gold layer
  • B. Silver layer
  • C. Bronze layer
  • D. Raw file landing area

Best answer: B

Explanation: The Silver layer is used for refined data that has already been cleaned, validated, deduplicated, and often joined to reference data. Because the table is still at transaction-level detail rather than a business summary, it fits Silver instead of Bronze or Gold.

Medallion Architecture separates datasets by how refined they are. Bronze usually keeps raw ingested data with minimal transformation so the original source can be preserved. Silver is for cleaned and standardized data: common examples include schema enforcement, null handling, deduplication, quality checks, and joins to reference data. Gold is for business-ready outputs such as KPIs, dashboards, aggregates, or department-focused data products.

In this case, the orders table has already been refined and enriched, but it is still at the individual order grain. That makes Silver the best match. Gold would be more appropriate after the data is aggregated or modeled for a specific reporting use case.

  • Bronze is too early because Bronze typically stores raw source data with minimal cleanup.
  • Gold is too far downstream because Gold usually contains business-ready aggregates or presentation-focused tables.
  • A raw landing area is not the target because the stem describes an already transformed Delta table within the Medallion flow.

Question 4

Topic: Productionizing Data Pipelines

A team promotes the same workflow from dev to test to prod by manually recreating jobs in each workspace. The latest production release failed:

Task: transform_sales
Status: Failed
Error: Notebook not found: /Shared/etl/transform_v1

In test, the task uses /Repos/team/etl/transform_v2. The team wants repeatable, versioned promotion across environments and fewer drift-related failures. What is the best next step?

Options:

  • A. Export notebooks from test and import them into prod
  • B. Repair the failed run after updating the notebook path
  • C. Use Databricks Repos, then update each job manually
  • D. Use Databricks Asset Bundles with environment targets

Best answer: D

Explanation: This failure came from deployment drift between environments, not from a one-time runtime issue. Databricks Asset Bundles are designed for repeatable, versioned promotion of workflows and related resources across dev, test, and prod.

Databricks Asset Bundles are the best fit when teams need consistent promotion of jobs, pipelines, and related resources across environments. In the scenario, production failed because the workflow was recreated manually and still referenced an outdated notebook path, which is a classic sign of environment drift. A bundle-based approach stores the deployment definition in version control, lets the team define targets such as dev, test, and prod, and deploys the same versioned resources to each workspace while allowing controlled environment-specific values.

  • Keep deployment definitions with the project code.
  • Use targets for environment-specific settings.
  • Promote the same tested version across workspaces.

That is a stronger long-term fix than correcting each workspace manually after releases fail.

  • Repairing the run addresses the immediate failure but does not prevent future drift between environments.
  • Exporting and importing notebooks is still a manual release process and does not reliably version workflow resources.
  • Using Repos with manual job edits versions code, but the job configuration can still diverge across workspaces.
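As a rough illustration of the bundle approach, a minimal databricks.yml might look like the sketch below. The bundle name, job, notebook path, and workspace hosts are hypothetical placeholders, not values from the scenario:

```yaml
bundle:
  name: etl-pipeline

resources:
  jobs:
    sales_job:
      name: sales_job
      tasks:
        - task_key: transform_sales
          notebook_task:
            # One versioned path, deployed identically to every target.
            notebook_path: ./etl/transform.py

targets:
  test:
    workspace:
      host: https://test-workspace.cloud.databricks.com
  prod:
    workspace:
      host: https://prod-workspace.cloud.databricks.com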

Question 5

Topic: Productionizing Data Pipelines

A team runs an hourly Databricks Workflow with one PySpark notebook task that uses only built-in Spark and SQL features on Unity Catalog Delta tables. Serverless compute is available in the workspace. Spark UI from successful runs shows the stages finish in about 90 seconds, but total runs often take 7–8 minutes because a jobs cluster must start first. Some runs also fail before code starts due to cluster provisioning, and the team wants less operational overhead. What is the best next step?

Options:

  • A. Run the task on a larger all-purpose cluster
  • B. Configure the workflow task to use serverless compute
  • C. Increase the jobs cluster autoscaling maximum
  • D. Place the jobs cluster in a cluster pool

Best answer: B

Explanation: The notebook work is already short, so the main problem is not Spark execution. Because startup delays, provisioning failures, and cluster maintenance are the real bottlenecks, moving this supported workflow task to serverless compute is the best way to reduce operational overhead.

This is a compute-management problem, not a Spark-processing problem. The Spark UI shows the actual stages complete quickly, so adding workers or tuning execution is unlikely to help much. The bigger issue is that the workflow depends on a user-managed jobs cluster, and the stem says startup delays, provisioning failures, and cluster maintenance are the pain points. For a supported PySpark notebook workload, serverless compute reduces operational overhead because Databricks manages the compute lifecycle for the task.

  • Use serverless when the workload is supported and cluster administration is the main bottleneck.
  • Compare Spark UI stage time with total run time to separate slow execution from startup overhead.
  • Keep user-managed clusters mainly when you need custom cluster-level control.

A cluster pool can reduce startup latency, but it does not remove cluster management as directly as serverless compute.

  • More autoscaling targets execution capacity, but the stages already finish quickly, so it misses the startup issue.
  • All-purpose cluster can keep compute available, but it still requires ongoing cluster management for a scheduled production task.
  • Cluster pool may shorten provisioning time, but the team still manages cluster configuration instead of offloading that work.

Question 6

Topic: Data Processing & Transformations

When choosing compute in Databricks, which statement correctly distinguishes an all-purpose development workload from a job-oriented production workload?

Options:

  • A. All-purpose compute is for interactive development; jobs compute is for automated runs.
  • B. All-purpose compute is required for Unity Catalog; jobs compute is required for Delta tables.
  • C. All-purpose compute is only for SQL queries; jobs compute is only for notebooks.
  • D. All-purpose compute is for scheduled production runs; jobs compute is for ad hoc exploration.

Best answer: A

Explanation: All-purpose compute is designed for interactive, user-driven development such as notebook editing, testing, and exploration. Jobs compute is intended for automated production execution, such as scheduled or triggered workflow runs.

The key difference is workload pattern, not language or governance feature. All-purpose compute is intended for collaborative, iterative work where a user stays attached to compute, runs commands manually, and inspects results while developing or debugging notebooks. Jobs compute is intended for operationalized workloads that run through automation, such as Databricks Workflows, on a schedule or in response to a trigger.

When the main need is development, troubleshooting, or ad hoc analysis, all-purpose compute is the better fit. When the main need is repeatable, production-style execution, jobs compute is the better fit. A common mistake is to treat this as a SQL-versus-notebook choice or a Unity Catalog-versus-Delta choice, but those are separate concerns.

  • The reversed usage fails because scheduled or triggered production execution maps to jobs compute, not all-purpose compute.
  • The language-based distinction fails because both compute types can support notebook and Spark workloads.
  • The governance-based distinction fails because Unity Catalog and Delta tables are not exclusive to one compute type.

Question 7

Topic: Data Governance & Quality

A data engineering team stores a curated gold table in Unity Catalog. A logistics partner needs read-only access from its own non-Databricks BI tool, the provider must keep governance on the shared data, and the team wants to avoid file exports or duplicate copies. Which Databricks capability is the best fit?

Options:

  • A. Create an open-sharing recipient for the partner
  • B. Create a Databricks-to-Databricks recipient
  • C. Use Lakehouse Federation for partner access
  • D. Grant the partner SELECT in Unity Catalog

Best answer: A

Explanation: Open sharing is the Delta Sharing pattern for recipients that do not use Databricks. It lets the provider share governed, read-only data without exporting files or granting direct Unity Catalog access to the partner.

Delta Sharing has different patterns for different consumers. In this scenario, the recipient is outside Databricks and needs to read the data from its own BI tool, so open sharing is the best fit. The provider can share the governed gold table through a share and recipient configuration without creating duplicate copies or giving the partner direct access to the provider’s Databricks environment. Unity Catalog permissions are better for users and groups that already work directly inside the governed Databricks environment. Databricks-to-Databricks sharing is appropriate when the receiving team also uses Databricks. Lakehouse Federation solves a different problem: querying external systems from Databricks rather than publishing Databricks data outward.

  • Granting SELECT in Unity Catalog assumes the partner has direct access to the provider’s Databricks environment.
  • Creating a Databricks-to-Databricks recipient fits only when the receiving team also uses Databricks.
  • Using Lakehouse Federation is for querying external data into Databricks, not sharing Databricks-managed data out to a partner.

Question 8

Topic: Data Processing & Transformations

An engineer models a bronze-to-silver flow as two Databricks Workflow notebook tasks. The bronze notebook uses Auto Loader, and the silver notebook transforms the bronze output. After the bronze task fails because of a schema change, operators sometimes repair only the silver task later and get inconsistent results. They want Databricks to manage the upstream/downstream relationship instead of relying on notebook task order. What is the best fix?

Options:

  • A. Create bronze_orders and silver_orders in one Lakeflow pipeline, with silver_orders reading bronze_orders.
  • B. Add more Workflow task dependencies and retries between the notebooks.
  • C. Run both notebooks on the same cluster to share state.
  • D. Control execution by notebook order in the pipeline source list.

Best answer: A

Explanation: Lakeflow Spark Declarative Pipelines manage related datasets through dataset references, not notebook task sequencing. Putting bronze and silver in the same pipeline lets Databricks understand the dependency and update them in the correct order.

The issue is that Databricks Workflows task order only sequences notebook runs; it does not define a dataset dependency for Lakeflow to manage. For related bronze and silver transformations, the better pattern is to declare both datasets in the same Lakeflow Spark Declarative Pipeline and make the downstream dataset read the upstream dataset. Lakeflow then builds the dependency graph from those references and uses it for execution order, lineage, and downstream recomputation when upstream data changes.

Adding more task dependencies or changing compute can affect run timing, but those steps still treat the work as manually sequenced notebooks rather than a declarative, dependency-aware pipeline. The key takeaway is to declare data dependencies, not just notebook order.

  • More Workflow dependencies still manually sequence notebook tasks instead of declaring a dataset dependency inside Lakeflow.
  • Shared cluster changes where code runs, not how Databricks tracks upstream and downstream datasets.
  • Notebook order in pipeline sources is not the mechanism Lakeflow uses to resolve dependent datasets.
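The dataset-reference idea behind the best answer can be illustrated with a Delta Live Tables-style Python sketch. This is a hedged illustration, not the team's code: the table names, source path, and schema details are assumptions, and it only runs inside a Databricks pipeline runtime, where the dlt module and the spark session are provided:

```python
import dlt
from pyspark.sql import functions as F

# Bronze dataset: Auto Loader ingestion (the source path is illustrative).
@dlt.table(name="bronze_orders")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/Volumes/main/raw/orders")
    )

# Silver dataset: reading the bronze dataset IS the dependency declaration,
# so the pipeline engine always updates bronze before silver.
@dlt.table(name="silver_orders")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
           .withColumn("ingested_at", F.current_timestamp())
    )
```

Because the graph is derived from these references rather than from task order, operators cannot accidentally refresh silver without a consistent bronze upstream.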

Question 9

Topic: Data Processing & Transformations

A data engineering team runs three notebooks in a Databricks Workflow to build bronze, silver, and gold tables. After a new transformation was added, the next scheduled run failed because the downstream task dependency was never updated in the job DAG. The team wants future pipeline changes to require minimal manual orchestration.

What is the best next implementation step?

Options:

  • A. Add retries and repair runs to the existing workflow.
  • B. Merge all stages into one scheduled notebook task.
  • C. Deploy the current notebooks with Databricks Asset Bundles.
  • D. Define the pipeline declaratively with Lakeflow Spark Declarative Pipelines.

Best answer: D

Explanation: The failure was caused by manually maintained task dependencies, not by a transient runtime issue. Lakeflow Spark Declarative Pipelines are built for creating and updating data pipelines with Databricks-managed orchestration, so changes require less manual DAG maintenance.

Lakeflow Spark Declarative Pipelines are the best fit when a team wants Databricks to manage pipeline sequencing instead of hand-editing workflow dependencies. In this scenario, the run failed because the pipeline changed, but the Workflow DAG did not. A declarative pipeline lets engineers define transformations and target datasets in code, and Databricks resolves dependencies and runs the steps in the correct order.

This is safer for ongoing pipeline updates because the orchestration logic stays tied to the pipeline definition rather than separate manual job edits. Databricks Asset Bundles help package and deploy resources, but they do not replace pipeline dependency management. Retries and repair runs help recover failures after they happen, and a single notebook task is more brittle, not less.

  • Asset Bundles only help package and deploy Databricks resources; they do not infer dataflow dependencies for pipeline execution.
  • Retries and repair runs can recover from failures, but they keep the same manually maintained orchestration problem.
  • One notebook task removes modularity and still does not provide declarative dependency management for evolving pipelines.

Question 10

Topic: Databricks Intelligence Platform

A small data engineering team has an hourly Databricks workflow with two existing notebook tasks: an Auto Loader ingestion step and a PySpark transformation step that writes to Unity Catalog managed tables. They want this production workflow to run with minimal compute administration and do not need long-running interactive compute. What is the best compute choice?

Options:

  • A. Use persistent all-purpose compute for both notebooks.
  • B. Use a serverless SQL warehouse for both notebook tasks.
  • C. Use serverless compute for the workflow jobs.
  • D. Use Lakehouse Federation instead of the workflow.

Best answer: C

Explanation: Serverless compute for jobs is the best fit for scheduled production notebook tasks when the team wants minimal operational overhead. It supports the existing Auto Loader and PySpark workflow without requiring engineers to manage long-running compute.

This scenario is about choosing the simplest production compute for an existing workflow. The team already has notebook tasks for file ingestion and PySpark transformation, needs scheduled execution, and explicitly wants as little compute administration as possible. Serverless compute for jobs is designed for that pattern: Databricks provisions and manages the compute for workflow tasks, so the team does not have to keep compute running, size a persistent cluster for a small hourly job, or handle routine lifecycle management.

Persistent all-purpose compute is better suited to interactive development, while a SQL warehouse is intended for SQL workloads rather than notebook tasks that use Auto Loader and PySpark. Lakehouse Federation is for querying external systems, not replacing a file-ingestion and transformation workflow.

  • The persistent all-purpose compute option adds ongoing lifecycle management that the team explicitly wants to avoid.
  • The serverless SQL warehouse option is for SQL execution, not notebook tasks built with Auto Loader and PySpark.
  • The Lakehouse Federation option solves external query access, not scheduled file ingestion and transformation.

Question 11

Topic: Data Governance & Quality

A scheduled Databricks job failed while reading a Unity Catalog table. The engineer must determine which principal removed access and when it happened.

SELECT count(*) FROM main.finance.payroll
Error: PERMISSION_DENIED:
User does not have SELECT on table main.finance.payroll

Which Databricks source is the best next step for this investigation?

Options:

  • A. Re-run the notebook and inspect cell output
  • B. Run SHOW GRANTS on the table
  • C. Open Spark UI for the failed run
  • D. Review audit logs for access and privilege events

Best answer: D

Explanation: This is a governance investigation, not a query-debugging or performance task. Audit logs are the right source to identify the actor, action, and timestamp for permission changes or access activity on Unity Catalog objects.

Use audit logs when the question is about governance activity: who accessed an object, who changed permissions, and when that happened. The notebook output only shows the failure that reached the query. Spark UI is for execution details such as jobs, stages, and performance, not access-control history. SHOW GRANTS can confirm the table’s current privilege state, but it does not tell you which user or service principal changed access earlier or the exact event time.

For Unity Catalog access investigations, audit logs are the source that ties together the principal, action, object, and timestamp. That makes them more appropriate than notebook output or Spark UI for this scenario.

  • Notebook output confirms the permission error, but it does not provide the historical actor and timestamp behind the access change.
  • Spark UI helps analyze execution behavior and performance, not governance or privilege-change events.
  • Current grants from SHOW GRANTS show only the present state of permissions, not who removed access earlier.

Question 12

Topic: Data Processing & Transformations

In a Databricks notebook, a DataFrame named orders contains customer_id, status, and order_total. What does avg_delivered_total represent in this PySpark aggregation?

from pyspark.sql import functions as F

metrics = (
    orders.groupBy("customer_id")
          .agg(
              F.avg(
                  F.when(F.col("status") == "delivered", F.col("order_total"))
              ).alias("avg_delivered_total")
          )
)

Options:

  • A. Delivered-order percentage per customer_id
  • B. Average delivered order_total per customer_id
  • C. Sum of delivered order_total per customer_id
  • D. Average order_total per customer_id, with other statuses as 0

Best answer: B

Explanation: This is a conditional average by customer. Because when has no otherwise, non-delivered rows become null, so Spark averages only delivered order_total values for each customer_id group rather than treating other rows as zero or computing a sum.

F.when(condition, value) returns value when the condition is true and null when no otherwise clause is provided. In this aggregation, only rows where status is delivered contribute order_total. Since Spark’s avg ignores nulls, avg_delivered_total is the mean delivered order amount for each customer_id.

  • groupBy("customer_id") creates one aggregate row per customer.
  • when(...) makes the aggregation conditional.
  • avg(...) calculates a mean, not a total or a percentage.

The closest confusion is assuming unmatched rows become 0, but they remain null unless otherwise(0) is added.

  • Zero fill fails because unmatched rows are null, not 0, without an otherwise clause.
  • Total amount fails because avg computes a mean; sum would be needed for a total.
  • Rate metric fails because the expression aggregates order_total values, not counts of delivered rows.
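Spark's null-ignoring avg can be mimicked in plain Python to see why the zero-fill reading is wrong. This sketch uses made-up order values for a single customer group:

```python
# Toy orders for one customer; values are illustrative.
orders = [
    {"status": "delivered", "order_total": 10.0},
    {"status": "delivered", "order_total": 30.0},
    {"status": "cancelled", "order_total": 99.0},
]

# when(status == 'delivered', order_total) with no otherwise: unmatched rows -> None
conditional = [o["order_total"] if o["status"] == "delivered" else None for o in orders]

# avg ignores nulls, so only the two delivered totals enter the mean.
delivered = [v for v in conditional if v is not None]
avg_delivered_total = sum(delivered) / len(delivered)
print(avg_delivered_total)  # 20.0, i.e. (10 + 30) / 2

# Adding otherwise(0) would change both the numerator rows and the denominator:
zero_filled = [v if v is not None else 0.0 for v in conditional]
avg_with_zero_fill = sum(zero_filled) / len(zero_filled)
print(avg_with_zero_fill)  # (10 + 30 + 0) / 3, a different metric
```

The two printed values differ, which is the whole distinction between option B and the zero-fill distractor.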

Question 13

Topic: Data Processing & Transformations

Which Medallion layer is the best fit for a Delta table that has already been cleaned, deduplicated, validated, and joined with reference data, but is not yet aggregated into business metrics?

Options:

  • A. Gold layer
  • B. External table
  • C. Bronze layer
  • D. Silver layer

Best answer: D

Explanation: Silver is the Medallion layer for curated operational data. The table in the stem has moved beyond raw ingestion and now contains cleaned, validated, deduplicated, and enriched records that can support multiple downstream uses before business-specific summarization.

In the Medallion Architecture, Silver is the layer for data that has been improved from its raw state and made trustworthy for broader reuse. Typical Silver processing includes schema standardization, data quality checks, deduplication, filtering bad records, and joining to reference data to create conformed datasets. That matches the stem exactly: the table is no longer raw, but it also is not yet a business-facing summary or KPI table. Bronze is usually the initial landing layer with minimal transformation, while Gold is usually the serving layer for business metrics, aggregates, and report-ready datasets. A storage choice such as managed versus external is separate from the Medallion layer decision. The key distinction here is reusable curated data versus business-level summarized data.

  • Bronze confusion fails because Bronze is for raw or minimally transformed ingested data, not cleaned and conformed records.
  • Gold confusion fails because Gold is typically for business-level aggregates, KPIs, or report-ready outputs, which the stem explicitly excludes.
  • Storage vs. layer fails because an external table describes where data is stored and governed, not its Medallion stage.

Question 14

Topic: Data Governance & Quality

An engineering team runs a nightly Databricks workflow that copies a Unity Catalog managed gold table to a separate storage location for an external business partner. The export_partner_copy task now fails because the target storage is out of space. The partner only needs read-only access to current data, and the team wants controlled sharing without creating a full duplicate unless required. What is the best next step?

Options:

  • A. Expose the managed table files through an external location
  • B. Keep the export job and add OPTIMIZE plus VACUUM
  • C. Use Lakehouse Federation for the partner connection
  • D. Create a Delta Share for the gold table

Best answer: D

Explanation: Delta Sharing is built for secure, controlled data sharing without defaulting to a separate exported copy. Because the partner only needs read-only access to current data, replacing the failing export pattern with a Delta Share is the best fit.

Delta Sharing is the Databricks capability for governed, read-only sharing of data with internal or external recipients. In this scenario, the real problem is not table performance or schema handling; it is that the team is maintaining a separate copy just to share data, and that copy workflow is now failing because of storage limits. Creating a Delta Share lets the provider share the curated table while keeping access controlled from Databricks and avoiding unnecessary duplication when it is not needed.

  • Share the required table from Unity Catalog.
  • Add the partner as a recipient.
  • Grant only the needed read access.
  • Stop depending on a full exported copy for this use case.
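The provider-side steps above correspond roughly to the following Databricks Delta Sharing SQL; the share, recipient, and table names here are hypothetical, so treat this as a sketch rather than a drop-in script:

```python
# Hypothetical provider-side flow; names are assumptions, syntax follows
# Databricks Delta Sharing SQL (CREATE SHARE / RECIPIENT, GRANT ON SHARE).
statements = [
    "CREATE SHARE IF NOT EXISTS partner_share",
    "ALTER SHARE partner_share ADD TABLE main.gold.partner_orders",
    "CREATE RECIPIENT IF NOT EXISTS partner_co",
    "GRANT SELECT ON SHARE partner_share TO RECIPIENT partner_co",
]
for stmt in statements:
    print(stmt)  # in a Databricks workspace, each would run via spark.sql(stmt)
```

The grant is read-only by design, which matches the partner's requirement without maintaining an exported copy.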

The closest wrong idea is direct storage exposure, which is not the intended governed sharing model for managed tables.

  • Direct file exposure is not the intended governed way to share a Unity Catalog managed table with an external partner.
  • Federation mismatch fails because Lakehouse Federation is for querying external data systems from Databricks, not publishing Databricks tables outward.
  • Job tuning only misses the root issue because OPTIMIZE and VACUUM do not remove the need for a separate duplicate export.

Question 15

Topic: Development and Ingestion

Which PySpark statement uses the basic Auto Loader syntax pattern to read JSON files from /mnt/raw/events through the cloudFiles mechanism?

Options:

  • A. spark.readStream.format('cloudFiles').option('format', 'json').load('/mnt/raw/events')
  • B. spark.readStream.format('json').option('cloudFiles.format', 'json').load('/mnt/raw/events')
  • C. spark.readStream.format('cloudFiles').option('cloudFiles.format', 'json').load('/mnt/raw/events')
  • D. spark.writeStream.format('cloudFiles').option('cloudFiles.format', 'json').start('/mnt/raw/events')

Best answer: C

Explanation: Auto Loader reads files by setting the source format to cloudFiles and then specifying the underlying file type with cloudFiles.format. In PySpark, the common pattern is a streaming read followed by load(path).

The core Auto Loader pattern uses Structured Streaming as the reader, cloudFiles as the source, and cloudFiles.format to identify the incoming file type. For JSON input, the basic form is spark.readStream.format('cloudFiles').option('cloudFiles.format', 'json').load(path). This tells Databricks to use Auto Loader to discover and ingest files incrementally from cloud storage.

A common mistake is putting json in format(); that performs a normal file read instead of using the Auto Loader source. Another common mistake is using format as an option name or using writeStream, which handles output rather than ingestion. The key pattern is readStream + cloudFiles + cloudFiles.format.

  • The option using format('json') selects the JSON source directly, so it does not use Auto Loader.
  • The option using option('format', 'json') places the file type in the wrong option name.
  • The option using writeStream uses the output API, not the input API for reading arriving files.

Question 16

Topic: Productionizing Data Pipelines

Which statement correctly distinguishes workflow repair or rerun from redeploying a Databricks Asset Bundle?

Options:

  • A. Repair or rerun republishes local bundle files before tasks run again.
  • B. Repair or rerun re-executes workflow tasks, while bundle deployment updates target artifacts.
  • C. Repair or rerun and bundle deployment both replace workspace artifacts.
  • D. Bundle deployment is the normal way to retry failed tasks from a prior run.

Best answer: B

Explanation: Workflow repair or rerun is for execution recovery, not for shipping new artifacts. Databricks Asset Bundle deployment updates resources in the target environment, while repair or rerun simply runs tasks again for an existing workflow.

The core difference is run recovery versus artifact deployment. In Databricks, repairing or rerunning a workflow run is an execution action: it re-executes failed tasks or reruns the workflow without changing local project files or republishing bundle contents. By contrast, deploying a Databricks Asset Bundle pushes the bundle-defined resources, such as workflows and related configuration, into the target workspace.

If a job failed because of a temporary data or infrastructure issue, repair or rerun is the relevant action. If code or configuration changed locally and those changes must exist in the target environment, the bundle must be redeployed first. The closest trap is assuming a rerun also republishes artifacts; it does not.

  • Republish confusion mixes execution recovery with deployment; repair or rerun does not push local bundle files.
  • Retry by deployment is incorrect because deployment updates resources, not the state of a previous run.
  • Same purpose fails because rerun focuses on execution, while deployment changes target definitions.
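To make the distinction concrete, here is a minimal, hypothetical databricks.yml sketch (bundle name, target host, and paths are all assumptions): repair or rerun never re-reads this file, while a bundle deploy pushes these definitions into the target workspace.

```yaml
# Hypothetical minimal bundle config; names and paths are illustrative.
bundle:
  name: nightly_etl

targets:
  prod:
    workspace:
      host: https://example.cloud.databricks.com

resources:
  jobs:
    nightly_etl_job:
      name: nightly-etl
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./notebooks/ingest
```

Editing this file locally changes nothing in the workspace until the bundle is deployed again; conversely, repairing a failed run of nightly-etl re-executes tasks against whatever was last deployed.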

Question 17

Topic: Data Processing & Transformations

A Databricks job loads a daily customer change file into a staging Delta table. The silver customer table in Unity Catalog must update existing customers, insert brand-new customers, and leave all other rows unchanged. Which Databricks SQL operation is the best next step?

Options:

  • A. Use DELETE on matching silver rows
  • B. Use INSERT INTO from staging to silver
  • C. Use UPDATE on matching silver rows
  • D. Use MERGE INTO from staging to silver

Best answer: D

Explanation: This is an upsert scenario: existing customer rows must be updated, and new customer rows must be inserted. In Databricks SQL, MERGE INTO is the DML operation designed to do both against a Delta table while leaving unrelated rows unchanged.

MERGE INTO is the standard Delta Lake DML operation for applying change data from a staging table into a target table. Here, the target silver table needs three outcomes at once: update rows when the customer key already exists, insert rows when the customer key is new, and keep all other target rows as they are. A merge handles those outcomes in a single statement by checking a match condition and applying different actions for matched and unmatched rows.

  • Match the staging and target tables on a key such as customer_id.
  • Use WHEN MATCHED THEN UPDATE for existing customers.
  • Use WHEN NOT MATCHED THEN INSERT for new customers.
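The three outcomes above fit in one statement. As a plain-Python sketch of the row-level semantics (table and column names are assumed from the question):

```python
# Plain-Python sketch of MERGE INTO upsert semantics. The Databricks SQL
# form (names assumed) would be:
#   MERGE INTO silver.customers AS t
#   USING staging.customer_changes AS s
#     ON t.customer_id = s.customer_id
#   WHEN MATCHED THEN UPDATE SET *
#   WHEN NOT MATCHED THEN INSERT *
def merge_upsert(target, staging):
    merged = dict(target)          # unmatched target rows stay unchanged
    for customer_id, row in staging.items():
        merged[customer_id] = row  # matched -> update, not matched -> insert
    return merged

target = {"c1": {"name": "Ada"}, "c2": {"name": "Bo"}}
staging = {"c2": {"name": "Bob"}, "c3": {"name": "Cy"}}
print(merge_upsert(target, staging))
# {'c1': {'name': 'Ada'}, 'c2': {'name': 'Bob'}, 'c3': {'name': 'Cy'}}
```

c1 is untouched, c2 is updated, and c3 is inserted, which is exactly the mix of outcomes the scenario requires.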

UPDATE alone only changes existing rows, and INSERT INTO alone only appends rows, so neither fully satisfies the requirement.

  • Appending with INSERT INTO can add new rows, but it does not update existing customer records and can create duplicates.
  • Using UPDATE on matching rows handles only existing customers; it does nothing for new customer IDs.
  • Removing matching rows with DELETE is incomplete and would still require additional steps to reload changed and new records.

Question 18

Topic: Development and Ingestion

A data engineer is building a new PySpark silver-table transformation. The source data already exists in Unity Catalog Delta tables, and the production job will still run in Databricks after development. The engineer wants to write code in a local IDE, run local tests, and execute the transformation against Databricks compute during development. Which Databricks capability is the best fit?

Options:

  • A. Create a Lakeflow pipeline with read_files
  • B. Use Databricks Connect from the local IDE
  • C. Configure Auto Loader on the upstream storage path
  • D. Develop only in a Databricks notebook

Best answer: B

Explanation: This is a development workflow question, not an ingestion question. Databricks Connect lets an engineer keep coding and testing in a local IDE while running the PySpark workload against Databricks compute and governed Unity Catalog data.

The key concept is that Databricks Connect supports local development against remote Databricks compute. In the scenario, the data is already available in Unity Catalog Delta tables, so the missing capability is not file discovery or source ingestion. The engineer specifically needs a local IDE workflow with local tests while still executing Spark code in Databricks.

Auto Loader and read_files are used when ingesting files from cloud storage or another file source. Lakeflow Spark Declarative Pipelines help define managed ingestion and transformation pipelines, but they do not replace the local IDE connectivity requirement. A Databricks notebook is a valid workspace authoring tool, yet it does not best satisfy the explicit requirement to develop from a local IDE. The deciding clue is that the problem is about where development happens, not how new data arrives.

  • Auto Loader mismatch fails because Auto Loader is for incremental file ingestion, but the source data already exists in Unity Catalog tables.
  • read_files mismatch adds an ingestion pattern for object storage instead of solving local IDE execution against Databricks compute.
  • Notebook-only workflow supports interactive development in Databricks, but it does not best meet the stated local IDE and local test requirements.

Question 19

Topic: Data Governance & Quality

A data engineer must create a Unity Catalog table over files that must remain in a partner-owned cloud storage path. They run:

CREATE TABLE main.ops.shipments
USING DELTA
LOCATION 's3://partner-data/shipments/'
AS SELECT * FROM main.staging.shipments_clean;

The SELECT query succeeds by itself, but the command fails with PERMISSION_DENIED: Missing CREATE EXTERNAL TABLE on External Location partner_data. The engineer already has USE CATALOG, USE SCHEMA, and CREATE TABLE on main.ops. What is the best next step?

Options:

  • A. Grant CREATE EXTERNAL TABLE on the matching external location.
  • B. Recreate the table as managed without LOCATION.
  • C. Grant MANAGE on schema main.ops and retry.
  • D. Grant SELECT on the target table in main.ops.

Best answer: A

Explanation: This failure is about storage governance, not table access. Because the statement uses an explicit LOCATION and the error names an external location, the missing privilege is the Unity Catalog permission required to create an external table on that path.

In Unity Catalog, a table created with an explicit LOCATION is an external table, so the operation depends on privileges for the external storage path as well as normal catalog and schema permissions. Here, the engineer already has CREATE TABLE on the schema and the SELECT portion runs successfully, which rules out a source-table read problem. The error specifically names the external location and the missing CREATE EXTERNAL TABLE privilege, so the next step is to grant that privilege on the external location that governs the path.

Managed tables are different: Databricks manages their storage, so you would not specify an external LOCATION for a partner-owned path. The key distinction is that table permissions control access to an existing table, while external location privileges control who can register external storage as a Unity Catalog table.

  • Granting SELECT on the target table does not help because the table was never created, and table permissions apply after registration.
  • Recreating the table as managed changes the storage-ownership model and conflicts with the requirement to keep data in the partner-owned path.
  • Granting MANAGE on the schema affects schema administration, but it does not replace the external location privilege required for external storage.

Question 20

Topic: Development and Ingestion

An Auto Loader workflow task starts failing after a new JSON field appears in source files.

Exhibit:

Task: ingest_customers
Status: Failed
Error: Unknown field `loyalty_tier` in incoming data
Need: inspect source schema, compare it with the target Delta table,
test a small fix, and share findings with a teammate

Which next step is best?

Options:

  • A. Use the SQL editor only and document findings in email.
  • B. Use one notebook with Markdown, PySpark, SQL, and inline results.
  • C. Rebuild the workflow into separate troubleshooting tasks first.
  • D. Use Databricks Connect and keep notes in a separate document.

Best answer: B

Explanation: Databricks notebooks support a collaborative workflow where engineers can combine Markdown, code, SQL, and output inspection in one place. That makes them a strong fit for troubleshooting an ingestion failure that requires both technical checks and clear shared notes.

The core notebook capability here is combining multiple investigation steps in a single collaborative artifact. For this failure, an engineer can use PySpark cells to inspect the incoming schema, SQL cells to query the target Delta table, Markdown cells to record observations and next actions, and inline results or visualizations to confirm what changed.

This is the fastest Associate-level troubleshooting flow because it keeps the technical checks and the team handoff together:

  • inspect source data and schema
  • compare with the target table
  • test a small change
  • save findings beside the outputs

Local IDE workflows, SQL-only tools, or reworking workflow tasks can be useful in other cases, but they do not match the need for one shared, mixed-language investigation space.

  • Databricks Connect split is useful for local IDE development, but it separates the investigation from the shared notes and inline outputs.
  • SQL-only approach is too narrow because source-file and schema troubleshooting often needs PySpark or DataFrame inspection in addition to SQL.
  • Workflow rebuild first adds overhead before the team even understands the failure, so it is not the best immediate troubleshooting step.
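The first two investigation steps reduce to a field comparison that fits in one notebook cell. The field sets below are hypothetical, chosen to mirror the exhibit's error; in practice they would come from the source schema and the target table's schema:

```python
# Hypothetical field sets; in a notebook these might come from reading the
# incoming JSON's schema and the target Delta table's schema field names.
source_fields = {"customer_id", "email", "loyalty_tier"}
target_fields = {"customer_id", "email"}

new_fields = sorted(source_fields - target_fields)
print(new_fields)  # ['loyalty_tier'] -> the field the error complains about
```

Running this beside Markdown notes and the follow-up fix keeps the whole investigation in one shareable artifact, which is the point of the notebook answer.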

Question 21

Topic: Productionizing Data Pipelines

A Databricks notebook job joins a 20 GB Delta table of events with a 5 MB country lookup table. In the Spark UI, the join stages show very high shuffle read and shuffle write, and those stages dominate the runtime. What is the best next step to reduce this likely inefficiency?

Options:

  • A. Repartition the events table to one partition.
  • B. Run the job on a larger cluster without changing the code.
  • C. Broadcast the 5 MB lookup table in the join.
  • D. Cache the 20 GB events table before the join.

Best answer: C

Explanation: The Spark UI symptom is a shuffle-heavy join where one side is very small. In that case, broadcasting the small lookup table is the best conceptual fix because it lets executors join locally instead of exchanging both datasets across the cluster.

High shuffle read and shuffle write in join stages usually means Spark is moving a lot of data across the network. When one input is only 5 MB, that pattern is often inefficient because the small table can be broadcast to executors. Then each partition of the large events table can join locally, which greatly reduces shuffle overhead and often shortens the slowest stages shown in the Spark UI.

Caching is most useful when data is reused across multiple actions, not for fixing a single shuffle-heavy join. Repartitioning to one partition removes parallelism, and adding more compute may help somewhat but does not address the inefficient execution pattern. The key takeaway is that a small lookup table plus heavy join shuffle is a strong hint to use a broadcast join.

  • Cache mismatch fails because caching the large table does not remove the expensive data exchange during the join.
  • Lost parallelism fails because forcing one partition usually makes processing slower, not more efficient.
  • Scale only fails because larger compute can mask the symptom but does not fix the shuffle-heavy join strategy.
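Why broadcasting helps can be shown with a plain-Python sketch (the data is made up): the small side becomes a local hash map, so each event row is joined in place with no network exchange. The PySpark equivalent is events.join(broadcast(lookup), "country_code"), with broadcast imported from pyspark.sql.functions.

```python
# Broadcast hash join sketch: the small (5 MB) side becomes a hash map
# that every executor holds locally...
lookup = {"US": "United States", "DE": "Germany"}    # broadcast side
events = [("e1", "US"), ("e2", "DE"), ("e3", "US")]  # large side (sketch)

# ...and each event row probes it locally -- no shuffle of the events data.
joined = [(event_id, code, lookup[code])
          for event_id, code in events if code in lookup]
print(joined)
# [('e1', 'US', 'United States'), ('e2', 'DE', 'Germany'),
#  ('e3', 'US', 'United States')]
```

In real Spark, the optimizer often picks a broadcast join automatically below the broadcast threshold, but an explicit hint is the direct fix when the UI shows a shuffle-heavy join against a tiny table.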

Question 22

Topic: Development and Ingestion

Which use case is the best fit for Auto Loader?

Options:

  • A. A team needs a one-time backfill from a fixed set of Parquet files
  • B. An engineer wants to read one CSV file a single time in a notebook for exploration
  • C. A cloud storage path receives new JSON files throughout the day, and the pipeline must ingest only new files with minimal custom logic
  • D. An analyst needs to query tables in an external database without copying the data

Best answer: C

Explanation: Auto Loader is best when files arrive continuously in cloud storage and the pipeline should discover and process only new files. It is more appropriate than a one-time manual read or a custom script for repeated incremental ingestion.

Auto Loader is Databricks’ file ingestion mechanism for data that lands over time in cloud object storage. It is especially useful when a directory keeps receiving new files and the team wants reliable incremental processing without writing custom code to repeatedly list files, track what was already loaded, or manage schema changes manually.

  • Use Auto Loader for ongoing file arrival patterns.
  • Use a simple manual read for a one-time, static file.
  • Use a one-time batch approach for a fixed backfill.
  • Use Lakehouse Federation when the goal is to query external databases without ingesting files.

The key distinction is continuous incremental file ingestion versus a single read or a non-file access pattern.

  • Single file exploration fits a basic one-time read, not a persistent ingestion pattern.
  • Fixed backfill is a bounded batch load, so Auto Loader is usually unnecessary overhead.
  • External database access describes federation, not file ingestion from cloud storage.

Question 23

Topic: Data Governance & Quality

A data steward needs to answer two governance questions about a new table: which source tables fed it, and which user executed the transformation.

Exhibit:

CREATE OR REPLACE TABLE main.analytics.customer_spend AS
SELECT c.customer_id, SUM(o.amount) AS total_spend
FROM main.bronze.customers c
JOIN main.silver.orders o
  ON c.customer_id = o.customer_id
GROUP BY c.customer_id;

Which interpretation is correct?

Options:

  • A. Use audit logs for source tables and lineage for who ran it
  • B. Use lineage for both the source tables and the user
  • C. Use lineage for source tables and audit logs for who ran it
  • D. Use audit logs for both the source tables and the user

Best answer: C

Explanation: Lineage and audit logs answer different governance questions. For this SQL, lineage identifies that main.bronze.customers and main.silver.orders feed main.analytics.customer_spend, while audit logs identify who performed the action and when.

The core distinction is data flow versus activity. Lineage is used to trace how data moves across tables, columns, and downstream assets, so it answers questions like “what upstream tables produced this table?” In the exhibit, lineage would show main.bronze.customers and main.silver.orders as upstream inputs to main.analytics.customer_spend.

Audit logs answer a different question: who did what, and when. If the team needs the identity of the user or service principal that executed the statement, that is an action record rather than a dependency relationship. Treating audit logs and lineage as interchangeable leads to incomplete governance answers because one describes data dependencies and the other describes events.

  • Swapped roles fails because audit logs record actions, not upstream table relationships.
  • Lineage for both fails because lineage does not serve as the authoritative record of who executed an action.
  • Audit logs for both fails because event records do not replace lineage for tracing upstream and downstream data flow.

Question 24

Topic: Data Governance & Quality

A platform team has replicated the same gold table into two Unity Catalog locations and will publish it as read-only with Delta Sharing to consumer workspaces in other accounts.

Exhibit:

Available provider copies
- `main.sales_gold` on AWS us-east-1
- `main.sales_gold` on Azure East US

Consumer workspaces
- BI on AWS us-east-1
- Finance on AWS us-east-1

Which sharing pattern should the team choose for the lowest expected data-transfer cost?

Options:

  • A. Publish the AWS us-east-1 copy through Delta Sharing.
  • B. Copy the table into each consumer workspace.
  • C. Query the Azure copy with Lakehouse Federation.
  • D. Publish the Azure East US copy through Delta Sharing.

Best answer: A

Explanation: For cost-aware sharing, the lowest-cost pattern is usually to publish from the provider location closest to the consumers. Because both consumer workspaces are on AWS us-east-1 and an AWS us-east-1 copy already exists, sharing that copy avoids unnecessary cross-cloud data movement.

The key concept is producer-consumer colocation for shared data access. Delta Sharing is the right mechanism when consumers need read-only access, but the expected transfer cost depends on where the provider copy sits relative to the recipients. In this exhibit, both consumers are on AWS us-east-1, and the table is also available in AWS us-east-1, so publishing that copy is the lowest-cost pattern.

Choosing the Azure East US copy would make both consumers read across clouds. Lakehouse Federation is meant for querying external systems, not as the best fit for selecting between two Databricks-hosted copies for sharing. Copying the table into each consumer workspace adds duplication and extra pipeline/storage cost when a live shared copy is already available.

When multiple provider locations exist, prefer the one that matches the consumers’ cloud and region.

  • Azure provider copy is tempting because the same table exists there, but both consumers would read from a different cloud.
  • Lakehouse Federation is the wrong fit because the scenario is about choosing the best Databricks sharing location for a shared table.
  • Copy into each workspace can work functionally, but it adds duplication and operational overhead that Delta Sharing avoids.

Databricks Data Engineer Associate map

Use this map after the sample questions to connect individual items to the Databricks lakehouse, Spark, Delta, pipeline, governance, and operations decisions these practice samples test.

    flowchart LR
      S1["Data engineering requirement"] --> S2
      S2["Ingest raw data"] --> S3
      S3["Transform with Spark SQL or DataFrames"] --> S4
      S4["Store reliable Delta tables"] --> S5
      S5["Orchestrate jobs and pipelines"] --> S6
      S6["Govern, monitor, and optimize the lakehouse"]

Quick Cheat Sheet

  • Delta Lake: understand ACID tables, time travel, schema enforcement, and optimization.
  • Spark execution: know transformations vs. actions, partitions, joins, caching, and shuffle symptoms.
  • Pipelines: use jobs, workflows, and Delta Live Tables where appropriate.
  • Governance: control catalogs, schemas, tables, permissions, lineage, and data quality.
  • Optimization: watch file sizes, partitioning, Z-ordering, caching, cluster sizing, and job failures.

Mini Glossary

  • Delta table: Lakehouse table using Delta Lake transaction log and reliability features.
  • DataFrame: Distributed data abstraction used in Spark.
  • Shuffle: Redistribution of data across partitions during Spark processing.
  • Unity Catalog: Databricks governance layer for data and AI assets.
  • Workflow: Databricks orchestration mechanism for jobs and tasks.

Revised on Thursday, May 14, 2026