Try 10 focused Databricks Data Engineer Associate questions on Processing and Transforms, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
Try Databricks Data Engineer Associate on Web · View the full Databricks Data Engineer Associate practice page
| Field | Detail |
|---|---|
| Exam route | Databricks Data Engineer Associate |
| Topic area | Data Processing & Transformations |
| Blueprint weight | 21% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Data Processing & Transformations for Databricks Data Engineer Associate. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 21% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Data Processing & Transformations
A data engineer is reviewing this Lakeflow Spark Declarative Pipelines SQL.
Exhibit:
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM STREAM read_files(
"/data/orders",
format => "json"
);
CREATE OR REFRESH MATERIALIZED VIEW silver_daily_totals
AS SELECT order_date, sum(amount) AS daily_total
FROM bronze_orders
GROUP BY order_date;
What is the main advantage of expressing the ETL logic this way?
Options:
A. Databricks requires full recomputation for every pipeline update.
B. Databricks automatically grants access to all downstream datasets.
C. Databricks converts the SQL into a Databricks Asset Bundle.
D. Databricks infers the dependency graph and manages refreshes for dependent datasets.
Best answer: D
Explanation: Lakeflow Spark Declarative Pipelines are declarative because the engineer defines target datasets and transformations instead of manually coding job order. From the SQL, Databricks can infer that the silver dataset depends on the bronze dataset and manage execution accordingly.
The core advantage of Lakeflow Spark Declarative Pipelines is that you declare what datasets should exist and how they are derived, while Databricks manages the execution flow between them. In the exhibit, the bronze streaming table is defined from files, and the silver materialized view is defined from the bronze table. Because those relationships are expressed in the dataset definitions, Databricks can build the pipeline DAG, run upstream logic before downstream logic, and manage refresh behavior without the engineer hand-coding notebook sequencing or separate orchestration steps. This makes ETL pipelines simpler to maintain and less error-prone. The closest distractor confuses pipeline orchestration with governance; permissions still come from Unity Catalog.
Topic: Data Processing & Transformations
Which statement best describes the advantage of Lakeflow Spark Declarative Pipelines for ETL development in Databricks?
Options:
A. Declare datasets and transformations; Databricks manages execution dependencies.
B. Deploy Databricks resources across environments as code.
C. Run local IDE code directly on Databricks compute.
D. Detect new cloud files automatically during ingestion.
Best answer: A
Explanation: Lakeflow Spark Declarative Pipelines let engineers describe what datasets and transformations should exist instead of manually scripting each execution step. Databricks then manages dependency ordering and pipeline execution, which is the core advantage of declarative ETL.
Lakeflow Spark Declarative Pipelines are built for expressing ETL logic declaratively in Databricks. Instead of writing imperative code that manually controls step order, engineers define target datasets and transformations, typically in SQL or Python, and Databricks manages how the pipeline runs based on those relationships. This makes pipelines easier to read, maintain, and operate because the focus stays on the data flow rather than orchestration details.
The main idea is that you declare the datasets and transformations, and Databricks manages the execution dependencies between them. Features such as Auto Loader, Databricks Connect, and Databricks Asset Bundles are useful, but they address ingestion, local development connectivity, and deployment packaging rather than the declarative ETL model itself.
Topic: Data Processing & Transformations
A data team builds a medallion ETL flow with three notebooks in a Databricks Workflow. After an engineer changes upstream logic and reruns only the middle task, the run fails.
Task: transform_silver
Status: FAILED
Error: [TABLE_OR_VIEW_NOT_FOUND] bronze_orders
Note: task order and reruns are managed manually in notebooks
The team wants to reduce this kind of orchestration failure and define ETL logic as dataset transformations instead of imperative task sequencing. What is the best next step?
Options:
A. Convert bronze_orders to an external table in Unity Catalog.
B. Increase the workflow cluster size so downstream tables become available sooner.
C. Keep the notebook workflow and manage the order with more task dependencies and repairs.
D. Rebuild the flow with Lakeflow Spark Declarative Pipelines and declare the bronze, silver, and gold datasets.
Best answer: D
Explanation: Lakeflow Spark Declarative Pipelines are built for ETL defined as datasets and transformations, not manual notebook sequencing. In this case, declaring bronze, silver, and gold dependencies lets Databricks determine execution order and reduces failures caused by partial reruns.
The key advantage here is declarative ETL. Instead of writing notebooks that both perform transformations and control when downstream steps should run, you declare the target datasets and the relationships between them. Databricks then builds the dependency graph and updates the datasets in the correct order.
That fits this scenario because the failure happened after a partial rerun left a downstream transformation expecting an upstream dataset that was not available. Declarative pipelines reduce this kind of manual orchestration problem, which is common in medallion-style ETL. Workflows are still useful for task orchestration, but they do not replace the benefit of expressing table dependencies declaratively inside the pipeline itself.
The main takeaway is that Lakeflow Spark Declarative Pipelines simplify ETL maintenance by letting Databricks manage dependency-aware execution.
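As a rough illustration of that rebuild (a sketch only: the dataset names, file path, and columns other than bronze_orders are assumptions, not details from the scenario):

```sql
-- Bronze: ingest the raw order files incrementally
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM STREAM read_files("/data/orders", format => "json");

-- Silver: cleaned order-level records, defined FROM bronze
CREATE OR REFRESH MATERIALIZED VIEW silver_orders
AS SELECT order_id, customer_id, order_date, amount
FROM bronze_orders
WHERE order_id IS NOT NULL;

-- Gold: business-level aggregate, defined FROM silver
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_sales
AS SELECT order_date, sum(amount) AS daily_total
FROM silver_orders
GROUP BY order_date;
```

Because each dataset is defined from its upstream dataset, a pipeline update always refreshes bronze before silver and silver before gold, so a partial rerun can no longer hit a missing upstream table.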
Topic: Data Processing & Transformations
A retail team ingests hourly JSON orders from cloud storage into Databricks. Data scientists need cleaned order-level records for feature engineering, analysts need business-ready daily sales tables, and operations must keep original records unchanged for troubleshooting. The team also wants the pipeline to be easy to understand and reusable across downstream workloads. What is the best next design?
Options:
A. Create Bronze raw tables, Silver cleaned orders, and Gold sales tables.
B. Load the JSON directly into Gold sales tables.
C. Use Delta Sharing to distribute raw and refined data internally.
D. Use one Delta table and create views for each audience.
Best answer: A
Explanation: A Bronze-Silver-Gold layout best fits these requirements because it preserves raw records, creates a reusable refined layer, and publishes purpose-built business tables. That separation makes each stage easier to understand and allows quality checks before downstream consumption.
This scenario is a classic fit for Medallion Architecture. Bronze stores the incoming source data in its original form so the team can trace issues, replay ingestion, or inspect raw records later. Silver holds cleaned, standardized, and validated order-level data that multiple downstream users can reuse, including data scientists who need refined but still detailed records. Gold contains business-ready tables such as daily sales metrics for analysts and dashboards.
Separating these layers improves pipeline clarity because each layer has one clear purpose. It improves reuse because several consumers can build from the Silver layer instead of repeating cleaning logic. It improves quality because validation, deduplication, and standardization happen before business-facing outputs are published. A single-table or direct-to-Gold design mixes concerns and weakens traceability.
Topic: Data Processing & Transformations
A data engineer is reviewing a design request. Which requirement most clearly shows the problem is about Lakeflow Spark Declarative Pipelines rather than one standalone SQL statement?
Options:
A. The solution must store output in a Unity Catalog table.
B. The solution must join two source tables in one query.
C. The solution must calculate daily totals with GROUP BY.
D. The solution must refresh multiple dependent datasets in dependency order as upstream data changes.
Best answer: D
Explanation: Lakeflow Spark Declarative Pipelines are the right fit when the requirement is about managing related datasets over time. A join, aggregation, or table write can all exist in one SQL statement, but dependency-aware refresh across datasets is pipeline behavior.
Lakeflow Spark Declarative Pipelines are for declaring datasets as parts of a pipeline and letting Databricks manage how those datasets are updated based on their dependencies. When a requirement says several downstream datasets must stay current as upstream data changes, the problem is no longer just about SQL syntax. It is about pipeline behavior: coordinating dependent updates, maintaining execution order, and handling repeated refreshes over time.
By contrast, joins, GROUP BY aggregations, and writing to a Unity Catalog table are all things a single standalone SQL statement can already do. Those actions may appear inside a pipeline, but by themselves they do not make the problem a Lakeflow design question. The key signal is the need to manage multiple related datasets together.
A GROUP BY requirement is just a calculation pattern that one SQL statement can handle directly.
Topic: Data Processing & Transformations
A data engineering team already uses Auto Loader to ingest new order files. They now need a production process that builds bronze, silver, and gold datasets, deploys as one pipeline, and determines execution order from dataset relationships instead of manually sequencing notebooks. What is the best next step in Databricks?
Options:
A. Deploy three notebooks with Databricks Asset Bundles and run them sequentially.
B. Create a Databricks Workflow with notebook tasks and set task dependencies manually.
C. Schedule separate notebooks and use Unity Catalog lineage to verify run order.
D. Build one Lakeflow Spark Declarative Pipeline with each dataset defined from its upstream dataset.
Best answer: D
Explanation: Lakeflow Spark Declarative Pipelines are built for related data transformations whose dependencies can be declared through dataset definitions. When bronze feeds silver and silver feeds gold, Databricks can infer the pipeline graph and run the steps in the needed order without manual notebook sequencing.
The core concept is declarative dependency management. In a Lakeflow Spark Declarative Pipeline, you define each target dataset and reference its upstream dataset. Databricks uses those references to build the dependency graph and execute the pipeline in the correct order.
This is the best fit for a medallion flow such as Auto Loader ingestion into bronze, cleansing into silver, and aggregation into gold because these are related transformations in one data pipeline. A Databricks Workflow is useful for orchestrating separate tasks, but it still relies on task-by-task sequencing. Databricks Asset Bundles help package and deploy resources, and Unity Catalog lineage helps you observe upstream and downstream relationships, but neither replaces declarative pipeline dependency definition.
The key takeaway is that related dataset dependencies belong in the pipeline definition, not in manually ordered notebook runs.
Topic: Data Processing & Transformations
A data engineer reviews this requirement note:
Target table: main.sales.daily_orders
Current state: already exists with historical data
Source: temp view new_orders
Requirement: add all rows from new_orders
Constraint: keep existing rows unchanged
Which SQL statement pattern best matches this requirement?
Options:
A. INSERT INTO main.sales.daily_orders SELECT * FROM new_orders
B. CREATE OR REPLACE TABLE main.sales.daily_orders AS SELECT * FROM new_orders
C. CREATE TABLE main.sales.daily_orders AS SELECT * FROM new_orders
D. INSERT OVERWRITE main.sales.daily_orders SELECT * FROM new_orders
Best answer: A
Explanation: INSERT INTO ... SELECT is the standard append pattern for an existing table. The exhibit says the table already exists and that historical rows must remain, so the correct choice must add new rows without replacing current data.
In Databricks SQL, INSERT INTO target SELECT ... is used to append query results to an existing table. That matches the exhibit exactly: main.sales.daily_orders already exists, and the requirement is to add rows from new_orders while leaving historical rows in place.
INSERT OVERWRITE does not append; it replaces the target table’s current contents with the query output. CREATE OR REPLACE TABLE ... AS SELECT also rebuilds the table from the select results, so it is a replace pattern rather than an append pattern. CREATE TABLE ... AS SELECT is meant for creating a new table, not writing additional rows into one that already exists.
For a simple add-rows requirement, the append form is the best match.
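As a quick contrast using the names from the exhibit, the behavioral difference between the append and overwrite forms is the whole decision here (a sketch, not new syntax):

```sql
-- Appends the rows from new_orders; existing historical rows stay in place
INSERT INTO main.sales.daily_orders
SELECT * FROM new_orders;

-- By contrast, this replaces the table's current contents with the query output:
-- INSERT OVERWRITE main.sales.daily_orders SELECT * FROM new_orders;
```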
Topic: Data Processing & Transformations
A data engineer loads daily customer changes into a Delta table. The target table main.crm.customers already stores one current row per customer_id. The daily feed contains both brand-new customers and changed values for existing customers.
Exhibit:
INSERT INTO main.crm.customers
SELECT customer_id, email, status, updated_at
FROM staging.customer_updates
What is the best next step to meet the requirement without creating multiple current rows for the same customer_id?
Options:
A. Keep INSERT INTO and add DISTINCT to the query.
B. Keep INSERT INTO and run OPTIMIZE after each load.
C. Use INSERT OVERWRITE so the feed replaces the table.
D. Use MERGE INTO on customer_id with update and insert clauses.
Best answer: D
Explanation: This is an upsert requirement, not a pure append. Because the source includes both new and existing customer_id values, MERGE INTO is the right pattern to update existing rows and insert new ones without creating duplicate current records.
The core concept is choosing row-level upsert logic instead of append-only logic. Here, the target already has one current row per customer_id, and the incoming feed contains a mix of new customers and changes for existing ones. A plain INSERT INTO only appends rows, so existing customers would be duplicated rather than updated.
In Databricks SQL, MERGE INTO is the appropriate pattern for this case because it can match source and target rows on customer_id, update the existing rows that match, and insert the rows that do not, all in one statement. This preserves unchanged target rows while correctly applying daily changes. The closest distractor is INSERT OVERWRITE, which replaces table contents instead of performing an upsert. OPTIMIZE improves file layout and performance, not upsert logic.
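A minimal sketch of that MERGE pattern using the table and column names from the exhibit (the exact SET and INSERT column lists are assumptions about the full schema):

```sql
MERGE INTO main.crm.customers AS t
USING staging.customer_updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  -- Existing customer: apply the changed values to the current row
  UPDATE SET t.email = s.email, t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  -- Brand-new customer: insert a new current row
  INSERT (customer_id, email, status, updated_at)
  VALUES (s.customer_id, s.email, s.status, s.updated_at);
```

Each source row either updates its matching current row or is inserted as a new customer, so the target keeps one current row per customer_id.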
Topic: Data Processing & Transformations
A data engineering team asks for a SQL statement to process new JSON order files. The workflow must ingest new files every few minutes, clean and deduplicate records across bronze and silver layers, and refresh a gold reporting layer in Unity Catalog with managed dependencies. What is the best next action?
Options:
A. Query the landing files directly from a SQL warehouse
B. Create a Lakeflow Spark Declarative Pipeline with a bronze streaming table and downstream datasets
C. Schedule one SQL statement to rebuild the gold reporting table
D. Use separate workflow tasks and notebooks with manual dependencies
Best answer: B
Explanation: This scenario is about managed pipeline behavior, not a single SQL command. Incremental file ingestion, layered transformations, and automatic dependency handling are core Lakeflow Spark Declarative Pipelines use cases.
The key clue is that the requirements describe an ongoing pipeline: new files arrive continuously, data moves through multiple medallion layers, and downstream refreshes depend on upstream results. Lakeflow Spark Declarative Pipelines are designed for exactly this pattern. You declaratively define datasets such as a bronze streaming table and downstream silver and gold datasets, and Databricks manages dependency resolution and refresh behavior for the pipeline.
A single SQL statement can define one transformation, but it does not by itself address the broader pipeline lifecycle described here. Manual workflow orchestration can work, but it adds unnecessary operational overhead when the requirement is really for a managed declarative pipeline.
Topic: Data Processing & Transformations
A Databricks workflow runs a notebook task named daily_customer_metrics. The downstream task validate_metrics fails because daily_active_customers is higher than the total known customer count for several days.
The notebook uses:
daily = (orders
.groupBy("order_date", "order_id")
.agg(F.sum("amount").alias("sales"))
)
metrics = (daily
.groupBy("order_date")
.count()
.withColumnRenamed("count", "daily_active_customers")
)
The business metric should be the number of unique customers who placed at least one order each day. What is the best fix?
Options:
A. Repair only the failed validation task and keep the notebook unchanged.
B. Group by order_date and customer_id instead of order_id.
C. Repartition the input by order_date before the aggregation.
D. Cache the daily DataFrame before the second aggregation.
Best answer: B
Explanation: The grouping key defines the grain of the metric. Here, order_id makes the intermediate result order-level, so counting rows per day returns orders per day, not unique customers per day.
In PySpark, the groupBy columns determine what each output row represents. Because the notebook groups by order_date and order_id, each row in daily represents one order for one day. The later .count() therefore counts orders, which can easily exceed the number of customers when customers place multiple orders.
To match the business metric, the aggregation must use the customer key at the daily grain.
Treat customer_id as the business entity being counted: group by order_date and customer_id, or aggregate by order_date with countDistinct("customer_id"), and then compute daily_active_customers from that customer-level result. Repartitioning, caching, or rerunning tasks may affect execution, but they do not correct a metric built at the wrong grain.
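A minimal sketch of the corrected grain in Spark SQL, assuming the order-level data is available as a table or temporary view named orders (the PySpark equivalent groups by order_date and uses countDistinct("customer_id")):

```sql
-- Count each customer once per day, no matter how many orders they placed
SELECT order_date,
       count(DISTINCT customer_id) AS daily_active_customers
FROM orders
GROUP BY order_date;
```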
Use the Databricks Data Engineer Associate Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try Databricks Data Engineer Associate on Web · View the Databricks Data Engineer Associate Practice Test
Read the Databricks Data Engineer Associate Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.