Databricks Data Engineer Associate: Development and Ingestion

Try 10 focused Databricks Data Engineer Associate questions on Development and Ingestion, with explanations, then continue with IT Mastery.


Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

  • Try Databricks Data Engineer Associate on Web
  • View the full Databricks Data Engineer Associate practice page

Topic snapshot

Field | Detail
Exam route | Databricks Data Engineer Associate
Topic area | Development and Ingestion
Blueprint weight | 17%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Development and Ingestion for Databricks Data Engineer Associate. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 17% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Development and Ingestion

A data engineer is reviewing a notebook that currently runs once per day. New JSON files now land in the same cloud object storage path every 10 minutes, and the team wants to ingest both existing and newly arriving files with minimal custom logic.

df = (spark.read
        .format("json")
        .load("s3://company-landing/orders/"))

Which change is the best fit for this requirement?

Options:

  • A. Use repeated spark.read batch jobs on the path.

  • B. Use a one-time COPY INTO from the path.

  • C. Use Auto Loader with readStream and cloudFiles.

  • D. Use Lakehouse Federation to query the path.

Best answer: C

Explanation: The exhibit shows a batch file read from cloud object storage. When files keep arriving and must be ingested continuously or repeatedly with low operational overhead, Auto Loader is the Databricks feature built for that pattern.

The code in the exhibit uses spark.read, which performs a batch read of the files present when the job runs. That works for one-time loads, but it is not the best choice when new files keep landing in cloud object storage and the pipeline should keep picking them up automatically.

Auto Loader is the Databricks ingestion capability for this use case. In practice, you switch to a streaming read with the cloudFiles source and specify the underlying file format with cloudFiles.format. Auto Loader incrementally discovers files in the source location and processes both existing and newly arriving files with less custom orchestration than repeated batch scans. The closest distractor is COPY INTO, which can support incremental batch loading, but the stem asks for ongoing new-file ingestion from object storage.
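As a sketch, the exhibit's read can be switched to Auto Loader like this; the schema-location path is illustrative, since Auto Loader needs either a cloudFiles.schemaLocation or an explicit schema when inferring from JSON:

df = (spark.readStream
        .format("cloudFiles")                 # Auto Loader source
        .option("cloudFiles.format", "json")  # underlying file format
        .option("cloudFiles.schemaLocation",
                "s3://company-landing/_schemas/orders/")  # illustrative path for inferred-schema tracking
        .load("s3://company-landing/orders/"))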

  • Repeated batch scans still rely on rerunning one-time reads and usually need extra logic or orchestration for newly arrived files.
  • One-time COPY INTO does not satisfy an ongoing ingestion requirement by itself, even though it can load files idempotently when executed.
  • Lakehouse Federation is for querying external systems, not for continuously discovering new files in cloud object storage.
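For contrast, a sketch of the COPY INTO pattern behind the closest distractor, run here through spark.sql with an illustrative target table; each execution loads only files it has not already loaded, but nothing discovers new files between runs:

# Idempotent incremental batch load; runs only when something triggers it
# (assumes the target Delta table already exists)
spark.sql("""
    COPY INTO main.bronze.orders_raw
    FROM 's3://company-landing/orders/'
    FILEFORMAT = JSON
""")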

Question 2

Topic: Development and Ingestion

A team reviews the following note:

Current state
- Nightly pipeline is already scheduled in Databricks Workflows.
- Unity Catalog permissions on the target tables are already configured.
- A developer wants to run and debug PySpark code from VS Code while using Databricks compute.

Which Databricks capability best fits the remaining need?

Options:

  • A. Databricks Asset Bundles for deployment configuration

  • B. Unity Catalog grants for data access governance

  • C. Databricks Connect for local IDE development

  • D. Databricks Workflows for production scheduling

Best answer: C

Explanation: The missing requirement is local IDE development while still using Databricks compute. Databricks Connect is designed for that workflow. The exhibit already shows that production scheduling and data governance are covered by Databricks Workflows and Unity Catalog.

Databricks Connect is a development workflow feature. It allows a developer to write, run, and debug code from a local IDE, such as VS Code, while execution uses Databricks compute. In the exhibit, the team already has production orchestration through Databricks Workflows and governance through Unity Catalog permissions, so those are not the missing pieces.

Databricks Connect fits when the need is:

  • local editing and debugging
  • running code from an external IDE
  • keeping execution on Databricks compute
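A minimal sketch of that workflow, assuming the databricks-connect package is installed locally and workspace credentials are already configured (for example through a .databrickscfg profile):

from databricks.connect import DatabricksSession

# Spark session whose operations execute on remote Databricks compute;
# connection details come from the default configuration profile
spark = DatabricksSession.builder.getOrCreate()

spark.range(5).show()  # runs remotely, prints locally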

The closest distractor is deployment tooling: Databricks Asset Bundles help define and deploy resources, but they do not provide the local interactive execution path that Databricks Connect does.

  • Workflows mismatch applies to scheduling and orchestrating runs, which the exhibit says is already in place.
  • Governance mismatch applies to permissions and access control, not to bridging a local IDE session to Databricks compute.
  • Deployment mismatch applies to packaging and deploying Databricks resources, not to live local debugging.

Question 3

Topic: Development and Ingestion

Which statement best describes Databricks Connect?

Options:

  • A. A sharing capability for delivering read-only datasets to external recipients

  • B. A governance layer for permissions, lineage, and data object access

  • C. A service for scheduling, repairing, and monitoring production data jobs

  • D. A local IDE workflow for running and debugging code against Databricks compute

Best answer: D

Explanation: Databricks Connect is for developer workflow. It lets engineers use a local IDE and debugging tools while executing code on Databricks compute, rather than handling job orchestration, governance, or external sharing.

Databricks Connect bridges local development tools and Databricks compute. A developer can write, run, and debug code from an IDE while the execution uses the Databricks environment, making it a development and testing workflow feature. It is not the service that schedules or repairs production pipelines, and it does not provide governance controls or external data sharing.

Production orchestration is handled by Databricks Workflows and can be deployed with Databricks Asset Bundles. Governance features such as permissions and lineage belong to Unity Catalog. Sharing data with outside recipients is handled by Delta Sharing. The key distinction is simple: Databricks Connect helps you build code locally; it does not run production operations or enforce governance.

  • Job operations describes production orchestration, which is handled by Databricks Workflows rather than a local development bridge.
  • Governance layer refers to Unity Catalog, which manages permissions, lineage, and governed access to data assets.
  • External sharing refers to Delta Sharing, which is designed to share data with recipients outside the workspace or organization.

Question 4

Topic: Development and Ingestion

A team ingests CSV files from a cloud storage landing path with a scheduled notebook that lists files and writes processed filenames to a control table. After a workflow retry, the control table became inconsistent, so some newly arrived files were skipped and others were loaded twice. Some new files also include an extra column. The team wants continuous ingestion without manually tracking each file. What is the best next step?

Options:

  • A. Replace the custom logic with Auto Loader using cloudFiles, checkpointing, and schemaLocation.

  • B. Increase cluster size and keep the filename control table.

  • C. Create an external table on the landing folder and query it directly.

  • D. Schedule COPY INTO on the folder for each batch window.

Best answer: A

Explanation: Auto Loader is the Databricks feature designed for files that keep arriving over time. It removes fragile custom file-tracking logic and, with checkpoint/state plus schemaLocation, also addresses the added-column problem in the stem.

Auto Loader is the best fit when a pipeline must reliably process newly arriving files from cloud storage without maintaining its own list of processed filenames. In this scenario, the manual control table became inconsistent after a retry, which caused skipped and duplicated loads, and new columns introduced schema drift. Auto Loader solves both issues at an Associate level by using the cloudFiles source for incremental file discovery, checkpoint/state for resilient progress tracking, and schemaLocation for inferred schema management.

  • Read the landing path with cloudFiles.
  • Persist checkpoint/state so retries do not depend on a custom filename table.
  • Use schemaLocation to track schema changes for new columns.
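A sketch of those three pieces together; the schema, checkpoint, and table names are illustrative, and trigger(availableNow=True) keeps the existing schedule while still processing only new files:

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/landing/_schemas/orders/")  # illustrative; tracks inferred schema
        .load("/landing/orders/"))                                         # illustrative landing path

(df.writeStream
   .option("checkpointLocation", "/landing/_checkpoints/orders/")  # replaces the filename control table
   .trigger(availableNow=True)      # process everything new, then stop, so the job can stay scheduled
   .toTable("main.bronze.orders"))  # illustrative target table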

A repeated batch pattern can still work for some ingestion tasks, but it is less aligned with the continuous-arrival and schema-evolution requirements given here.

  • The scheduled COPY INTO idea can load new files incrementally, but it is less aligned with the stated continuous ingestion and schema-management needs.
  • Adding more compute may speed up listing, but it does not fix fragile manual tracking or schema drift.
  • Querying the landing folder directly does not create a managed incremental ingestion process for newly arriving files.

Question 5

Topic: Development and Ingestion

An engineer runs this Auto Loader code in a notebook and sees the output below. Which troubleshooting conclusion is best supported by the message?

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .load("/data/incoming/orders"))

Output:

AnalysisException: Auto Loader can infer schema, but a schema location
must be provided. Set cloudFiles.schemaLocation or provide an explicit schema.

Options:

  • A. The source files must be converted to Delta before Auto Loader can read them.

  • B. The target table is missing Unity Catalog permissions.

  • C. The code should use spark.read instead of spark.readStream.

  • D. The stream needs cloudFiles.schemaLocation or an explicit schema.

Best answer: D

Explanation: The notebook output already identifies the issue. Auto Loader is reading JSON files with schema inference, but no cloudFiles.schemaLocation was provided, so the engineer can troubleshoot this directly from the error message.

This error is a good example of concise Auto Loader feedback being enough to diagnose an ingestion problem. The message says Auto Loader is trying to infer schema and needs a place to persist schema information, which is provided with cloudFiles.schemaLocation, unless the schema is defined explicitly.
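A sketch of that fix, with an illustrative schema path:

(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/data/incoming/_schemas/orders")  # illustrative; persists the inferred schema
  .load("/data/incoming/orders"))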

  • cloudFiles.format tells Auto Loader the source file type.
  • For schema inference on files like JSON, Auto Loader needs schema metadata storage.
  • The message points to a missing ingestion option, not a permissions or table-format problem.

The key takeaway is that some Auto Loader failures can be resolved directly from notebook output without deeper investigation in run logs or Spark UI.

  • Delta confusion is incorrect because Auto Loader can ingest JSON files directly; conversion to Delta is not required first.
  • Batch vs. streaming is incorrect because changing to spark.read does not address the missing schema metadata required by the current ingestion setup.
  • Permissions confusion is incorrect because the error mentions schema inference configuration, not access denial or Unity Catalog authorization.

Question 6

Topic: Development and Ingestion

A Databricks workflow task named bronze_ingest fails after a team tries to use Lakehouse Federation for this source:

Source: /landing/orders/
Format: JSON files
Arrival pattern: new files every 10 minutes
Goal: keep a Delta bronze table in Databricks updated

The team chose to “query the source in place” instead of using a file ingestion pattern. What is the best next step?

Options:

  • A. Keep Lakehouse Federation and query the file path in place.

  • B. Create an external table on the path and skip ingestion.

  • C. Use Auto Loader to ingest the directory into a Delta table.

  • D. Use Delta Sharing to expose the landing files to Databricks.

Best answer: C

Explanation: The source is a stream of files, and the requirement is to load those files into a Delta bronze table in Databricks. Auto Loader is the Databricks feature designed for incremental file ingestion, while in-place querying features are for different source patterns.

Auto Loader is the correct choice when new files keep arriving in storage and Databricks must ingest them into a table. In this scenario, the source is a directory of JSON files with ongoing arrivals, so the pipeline needs file discovery and incremental loading into a Delta bronze table.

Lakehouse Federation is for querying external systems in place, typically database-style sources, rather than implementing a file-ingestion workflow. An external table over files can make existing files queryable, but it does not replace the intended bronze ingestion pattern for continuously arriving files. Delta Sharing is for secure data sharing, not for landing-zone ingestion.

When the requirement is “files arrive repeatedly and must be loaded into Databricks,” Auto Loader is the best fit.

  • Federation mismatch uses in-place access for external systems, not a recurring file-drop ingestion pattern.
  • Sharing mismatch focuses on distributing shared data, not discovering and loading new source files.
  • External table only: an external table can expose files for queries, but it does not provide the intended incremental ingestion workflow into a bronze Delta table.

Question 7

Topic: Development and Ingestion

A bronze ingestion job runs every 15 minutes using the code below. New JSON files land in the folder throughout the day. Last night the source system added a new optional column, and the next run failed when appending to the bronze table. The team wants to process only new files and avoid frequent manual schema updates.

df = spark.read.json("/Volumes/main/raw/orders/")
df.write.mode("append").saveAsTable("main.bronze.orders_raw")

What is the best next step?

Options:

  • A. Overwrite the bronze table after inferring schema each run.

  • B. Use Auto Loader with cloudFiles and a schemaLocation.

  • C. Set mergeSchema on the write and keep spark.read.

  • D. Replace the job with COPY INTO on a schedule.

Best answer: B

Explanation: Auto Loader is the best fit when new files arrive continuously and the source schema can change over time. It incrementally discovers only new files and stores schema information so compatible schema evolution can be managed more safely than repeated full-directory scans.

This scenario matches Auto Loader’s core use case: ongoing file ingestion plus evolving source schemas. A plain spark.read batch job scans the directory again each run and does not maintain incremental file-discovery state for newly arrived files. When the source adds a column, the pipeline also needs a better way to track and evolve schema over time.

With Auto Loader, you typically use cloudFiles and specify a schemaLocation so Databricks can persist inferred schema metadata and process new files incrementally. That reduces manual operational work and avoids redesigning the ingestion pattern each time the source adds a compatible field.
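A sketch under those assumptions; the schema path is illustrative, and cloudFiles.schemaEvolutionMode is spelled out with its default value to make the evolution behavior explicit:

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/orders/")  # illustrative path
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # default: compatible new columns update the tracked schema
        .load("/Volumes/main/raw/orders/"))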

The key takeaway is that schema evolution plus continuous file arrival is a strong signal to use Auto Loader.

  • Using COPY INTO on a schedule can be incremental, but it is not the best match when the main need is managed continuous file discovery with schema tracking.
  • Setting mergeSchema on the write can help Delta accept new columns, but it does not solve incremental discovery of only new files.
  • Overwriting the bronze table after rescanning everything is operationally heavier and ignores the need for efficient ongoing ingestion.

Question 8

Topic: Development and Ingestion

A data engineer is working directly in the Databricks workspace to explore a new CSV source. They want to run code in small steps, inspect intermediate DataFrames, and validate logic before turning it into a production process. Which Databricks option is the best fit?

Options:

  • A. A notebook attached to compute

  • B. Databricks Connect in a local IDE

  • C. A Databricks Asset Bundle

  • D. A Lakeflow Spark Declarative Pipeline

Best answer: A

Explanation: Databricks notebooks are designed for interactive development. When the goal is to test code incrementally, inspect results, and refine logic before packaging or scheduling anything, a notebook is the best starting point.

Databricks notebooks are intended for interactive authoring, exploration, and validation. An engineer can attach compute, run cells incrementally, inspect intermediate DataFrames, visualize output, and quickly adjust code while learning about a new data source. That makes notebooks the right choice when the immediate goal is to understand data and confirm transformation logic.

Tools for packaged deployment or managed pipeline execution are more appropriate after the logic is understood and ready to be operationalized. A local IDE workflow can also be useful, but it is not the best match when the scenario specifically emphasizes workspace-based, step-by-step exploration.

  • Packaged deployment: The Databricks Asset Bundles option is for defining and deploying Databricks resources, not for initial interactive exploration.
  • Local IDE workflow: Databricks Connect supports development from an external IDE, but the stem emphasizes workspace-based iterative inspection and validation.
  • Managed pipeline: Lakeflow Spark Declarative Pipelines are for declarative pipeline development, not ad hoc step-by-step exploration of a new source.

Question 9

Topic: Development and Ingestion

A data engineer is iterating on transformation logic directly in the Databricks workspace. Based on the exhibit, which Databricks capability is directly demonstrated?

Cell 1 (Python)
orders = spark.read.table("main.sales.orders")
orders.createOrReplaceTempView("orders_v")

Cell 2 (%sql)
SELECT order_status, COUNT(*) AS cnt
FROM orders_v
GROUP BY order_status

Options:

  • A. Databricks Workflows orchestration of a scheduled job

  • B. Interactive Databricks notebook development with mixed-language cells

  • C. Databricks Connect development from a local IDE

  • D. Databricks Asset Bundles deployment of resources as code

Best answer: B

Explanation: The exhibit shows cell-by-cell execution in the Databricks workspace, with PySpark creating a temporary view and %sql querying it in a later cell. That is a notebook capability used for interactive day-to-day development.

Databricks notebooks are designed for interactive development in the workspace. A user can run code one cell at a time, switch languages across cells by using language magics such as %sql, and reuse the same Spark session state, including temporary views created in earlier cells. That is exactly what the exhibit shows: Python creates orders_v, then SQL immediately queries it.

Databricks Connect is used when developing from a local IDE while executing against Databricks compute. Databricks Workflows is for orchestration, scheduling, and run management. Databricks Asset Bundles is for defining and deploying Databricks resources as code. The deciding clue is the interactive, mixed-language workflow inside the Databricks workspace.

  • Local IDE confusion: The Databricks Connect option does not fit because the exhibit shows workspace notebook cells, not code authored from a local IDE.
  • Orchestration confusion: The Workflows option does not fit because the exhibit shows no tasks, schedule, dependency graph, or run status.
  • Deployment confusion: The Asset Bundles option does not fit because the exhibit is executable analysis code, not bundle YAML or deployment configuration.

Question 10

Topic: Development and Ingestion

A data engineer writes PySpark in VS Code and runs it from a laptop. The same code works in a Databricks notebook, but locally this line fails:

df = spark.read.table("main.sales.orders")

with:

AnalysisException: [REQUIRES_SINGLE_PART_NAMESPACE]
spark_catalog requires a single-part namespace,
but got `main`.`sales`

The team wants to keep developing in the local IDE while executing against Databricks compute and Unity Catalog tables. What is the best next step?

Options:

  • A. Move the code into a Databricks notebook-only workflow.

  • B. Install local Spark and recreate the catalog structure on the laptop.

  • C. Deploy the project with Databricks Asset Bundles after each edit.

  • D. Configure Databricks Connect to use remote Databricks compute.

Best answer: D

Explanation: The failure shows the code is running against a local Spark session, not Databricks compute. Databricks Connect is the workflow that lets developers keep using local tools such as VS Code while Spark operations run remotely on Databricks.

Databricks Connect is used when a developer wants local editing, debugging, and testing in an IDE, but needs Spark code to execute on Databricks compute. In this scenario, the three-part name main.sales.orders works in Databricks because Unity Catalog is available there, but the laptop is using a local spark_catalog, which causes the namespace error. The best fix is to connect the local project to remote Databricks compute instead of trying to reproduce the Databricks environment locally.

  • Keep coding in the local IDE.
  • Configure Databricks Connect for the workspace and target compute.
  • Run the PySpark code so Spark execution happens remotely.
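A minimal sketch of that setup, with placeholder connection values (a configuration profile or environment variables work just as well):

from databricks.connect import DatabricksSession

# Placeholders: substitute the workspace URL, token, and cluster ID
spark = (DatabricksSession.builder
           .remote(host="https://<workspace-url>",
                   token="<personal-access-token>",
                   cluster_id="<cluster-id>")
           .getOrCreate())

# The three-part Unity Catalog name now resolves on the remote side
df = spark.read.table("main.sales.orders")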

Changing to notebooks or redeploying after every edit may run the code, but neither approach solves the local-development-to-remote-execution requirement.

  • Local Spark mismatch fails because recreating catalogs on a laptop does not provide the same Databricks compute and Unity Catalog behavior.
  • Notebook-only development can run the code, but it does not meet the requirement to keep using local development tools.
  • Bundle deployment is for packaging and deployment workflows, not interactive local IDE execution during development.

Continue with full practice

Use the Databricks Data Engineer Associate Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

  • Try Databricks Data Engineer Associate on Web
  • View the Databricks Data Engineer Associate Practice Test

Free review resource

Read the Databricks Data Engineer Associate Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026