Use this Quick Reference as independent review support for Microsoft Certified: Azure Databricks Data Engineer Associate (DP-750). The exam rewards practical decisions: how to ingest data, transform it with Delta Lake, govern it with Unity Catalog, run reliable pipelines, and troubleshoot Azure Databricks workloads.
DP-750 Mental Model
Azure Databricks data engineering questions usually combine four layers:
| Layer | What to decide | High-yield exam cues |
|---|
| Ingestion | How data enters the lakehouse | Auto Loader vs COPY INTO vs streaming source vs batch read |
| Storage and format | How data is stored and modeled | Delta tables, managed vs external, medallion architecture, schema evolution |
| Processing | How data is transformed and scheduled | Spark SQL, PySpark, jobs, Lakeflow Declarative Pipelines / Delta Live Tables |
| Governance and operations | How data is secured, monitored, and optimized | Unity Catalog, permissions, external locations, lineage, jobs UI, Spark UI, OPTIMIZE |
flowchart LR
A[Sources: files, databases, events, APIs] --> B[Landing storage in ADLS Gen2]
B --> C[Ingestion: Auto Loader, COPY INTO, Structured Streaming]
C --> D[Bronze Delta tables]
D --> E[Silver cleansing and conformance]
E --> F[Gold marts and aggregates]
F --> G[BI, ML, apps, sharing]
H[Unity Catalog] -. governs .-> D
H -. governs .-> E
H -. governs .-> F
I[Jobs / Workflows / Pipelines] -. orchestrate .-> C
I -. orchestrate .-> E
I -. orchestrate .-> F
Core Azure Databricks Object Map
| Object | What it is | Exam point |
|---|
| Workspace | Azure Databricks environment for users, compute, notebooks, jobs | Workspace is not the primary data governance boundary when Unity Catalog is used |
| Metastore | Unity Catalog governance container | Assigned to workspaces; contains catalogs |
| Catalog | Top-level namespace in Unity Catalog | Use for environment, business domain, or governance boundary |
| Schema | Namespace inside a catalog | Also called database in older Spark/Hive terminology |
| Table | Structured data object, usually Delta | Prefer Delta for reliability, transactions, and optimization |
| View | Saved query | Useful for abstraction, row filtering, column masking, and simplified access |
| Volume | Unity Catalog object for non-tabular files | Prefer over legacy mounts for governed file access |
| External location | Governed reference to cloud storage path | Grants file access without exposing storage credentials |
| Storage credential | Unity Catalog credential for cloud storage | In Azure, commonly backed by managed identity / access connector patterns |
| Cluster | Spark compute for notebooks and jobs | Choose access mode and policies carefully for governance |
| SQL warehouse | Compute for Databricks SQL | Best for SQL analytics, dashboards, BI, SQL queries |
| Job | Orchestrated workflow | Use task dependencies, parameters, retries, schedules, and job clusters |
| Pipeline | Declarative data pipeline | Use Lakeflow Declarative Pipelines / Delta Live Tables for managed dependencies and quality rules |
| Notebook | Interactive or scheduled code unit | Good for development; production needs parameters, source control, and job orchestration |
| Secret scope | Secure reference to secrets | Prefer managed identities and Unity Catalog storage credentials where possible |
Medallion Architecture Reference
| Layer | Typical contents | Common operations | Quality expectation |
|---|
| Bronze | Raw or lightly parsed data | Ingest, append, preserve source metadata | Minimal transformation; keep recoverability |
| Silver | Cleaned, deduplicated, conformed data | Type casting, validation, joins, CDC handling, deduplication | Business-ready entity tables |
| Gold | Aggregated or serving data | Star schemas, marts, KPIs, feature tables, BI extracts | Optimized for consumption |
High-yield distinctions:
| Decision | Choose this when | Avoid this trap |
|---|
| Bronze stores raw records | You need replay, audit, or schema recovery | Do not overwrite raw history without a retention strategy |
| Silver applies business rules | You need reusable clean entities | Do not bury cleansing logic only in gold reports |
| Gold serves consumers | You need fast BI or domain-specific outputs | Do not make every downstream team read raw bronze data |
| Delta for all layers | You need ACID, schema enforcement, time travel, MERGE, optimization | Do not use plain Parquet when transactional updates are required |
Feature Selection Matrix
| Requirement | Best fit | Why |
|---|
| Incrementally ingest new files from cloud storage | Auto Loader | Tracks discovered files, supports schema inference/evolution, works with streaming |
| Load known files idempotently into Delta | COPY INTO | Simple SQL pattern for batch file ingestion |
| Process continuously arriving events | Structured Streaming | Handles streaming sources, checkpoints, state, watermarks |
| Build declarative, managed ETL with quality checks | Lakeflow Declarative Pipelines / Delta Live Tables | Manages dependencies, expectations, event logs, pipeline execution |
| Upsert changed records into a Delta table | MERGE INTO | Supports inserts, updates, deletes in one atomic operation |
| Read only changed rows from a Delta table | Change Data Feed | Efficient downstream incremental processing |
| Govern data and file access centrally | Unity Catalog | Catalog/schema/table permissions, external locations, lineage |
| Secure access to ADLS Gen2 | Unity Catalog external locations and storage credentials | Avoid hard-coded keys and legacy mount-first designs |
| Run production Spark tasks | Jobs with job clusters | Repeatable, isolated, schedulable execution |
| Run BI SQL workloads | SQL warehouse | Optimized for SQL queries, dashboards, and BI tools |
| Explore interactively | All-purpose compute | Flexible for development, less ideal for production cost control |
| Standardize compute settings | Cluster policies | Enforce allowed node types, access modes, tags, and security settings |
| Reduce repeated cluster startup latency | Pools or serverless options where available | Use only when supported and appropriate for workload |
Ingestion Quick Reference
Batch and File Ingestion
| Pattern | Use when | Key details |
|---|
COPY INTO | Periodic batch loads from files into Delta | SQL-friendly; tracks previously loaded files for the target |
| Auto Loader | New files arrive continuously or unpredictably | Uses cloudFiles; requires checkpoint and schema location |
| Direct Spark read | One-time exploration or controlled batch | Not ideal for incremental file tracking |
| External table over files | Query data in place | Use when data lifecycle is managed outside Databricks |
| Managed Delta table | Databricks manages table storage | Dropping table removes managed data |
| External Delta table | Data remains in external location | Dropping table removes metadata, not underlying files |
COPY INTO Pattern
COPY INTO prod.bronze.orders_raw
FROM 'abfss://landing@storageacct.dfs.core.windows.net/orders/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
Exam cues:
| Cue | Interpretation |
|---|
| “Simple SQL ingestion from files” | Consider COPY INTO |
| “Need automatic incremental file discovery” | Consider Auto Loader |
| “Need streaming semantics and checkpoints” | Use Structured Streaming / Auto Loader |
| “Files may arrive late or in nested directories” | Auto Loader is usually stronger than manual file lists |
Auto Loader Pattern
raw_path = "abfss://landing@storageacct.dfs.core.windows.net/orders/"
schema_path = "abfss://checkpoint@storageacct.dfs.core.windows.net/schemas/orders/"
checkpoint_path = "abfss://checkpoint@storageacct.dfs.core.windows.net/checkpoints/orders/"
(
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", schema_path)
.load(raw_path)
.writeStream
.option("checkpointLocation", checkpoint_path)
.trigger(availableNow=True)
.toTable("prod.bronze.orders_raw")
)
Auto Loader exam traps:
| Trap | Correct approach |
|---|
| Reusing the same checkpoint for multiple streams | Use a separate checkpoint per streaming query |
| Deleting checkpoint state casually | Expect duplicates or reprocessing unless designed for it |
| Omitting schema location | Schema inference/evolution becomes harder to manage |
| Treating Auto Loader like a one-time file read | It is designed for incremental discovery and streaming-style processing |
| Writing to non-idempotent sinks | Use Delta tables and deterministic logic when possible |
Streaming Ingestion and Processing
| Concept | Meaning | Exam use |
|---|
| Checkpoint | Stores stream progress and state | Required for fault tolerance |
| Trigger | Controls when micro-batches run | Scheduled-like incremental processing or continuous-like workloads |
| Output mode | Append, update, or complete | Depends on aggregation/stateful logic |
| Watermark | Bound on how long to keep state for late data | Required for many deduplication and time-window scenarios |
| State store | Maintains streaming aggregation/join/dedup state | Watch for state growth and late data |
foreachBatch | Applies batch logic to each micro-batch | Useful for MERGE/upsert patterns |
from pyspark.sql.functions import col
updates = (
spark.readStream.table("prod.bronze.orders_raw")
.filter(col("order_id").isNotNull())
)
def upsert_orders(batch_df, batch_id):
batch_df.createOrReplaceTempView("orders_updates")
spark.sql("""
MERGE INTO prod.silver.orders AS t
USING orders_updates AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
(
updates.writeStream
.foreachBatch(upsert_orders)
.option("checkpointLocation", "abfss://checkpoint@storageacct.dfs.core.windows.net/checkpoints/orders_merge/")
.start()
)
Delta Lake Operations Reference
| Operation | Use when | Syntax cue |
|---|
| Create Delta table | Persist reliable lakehouse data | CREATE TABLE ... USING DELTA or DataFrame write |
| Append | Add new records | INSERT INTO, .mode("append") |
| Overwrite | Replace a dataset or partition carefully | .mode("overwrite"), INSERT OVERWRITE |
| MERGE | Upsert or delete based on keys | MERGE INTO target USING source |
| Time travel | Query previous table version | VERSION AS OF or TIMESTAMP AS OF |
| Restore | Roll table back to a prior version | RESTORE TABLE ... TO VERSION AS OF |
| History | Inspect operations and versions | DESCRIBE HISTORY |
| Change Data Feed | Read row-level changes | table_changes() or read options |
| OPTIMIZE | Compact small files | OPTIMIZE table |
| ZORDER | Improve data skipping for selected columns | OPTIMIZE table ZORDER BY (...) |
| VACUUM | Remove unreferenced old files | Be careful with time travel and lagging streams |
MERGE Pattern
MERGE INTO prod.silver.customers AS target
USING prod.bronze.customers_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
MERGE traps:
| Trap | Why it matters |
|---|
| Duplicate source rows match one target row | Can cause ambiguous updates; deduplicate source first |
| No stable business key | Upserts become unreliable |
| Using MERGE for pure append | Adds unnecessary complexity |
| Not handling deletes from CDC source | Target keeps records that should be removed or expired |
| Schema drift not planned | MERGE may fail or create inconsistent expectations |
Time Travel and History
DESCRIBE HISTORY prod.silver.orders;
SELECT *
FROM prod.silver.orders VERSION AS OF 12;
RESTORE TABLE prod.silver.orders TO VERSION AS OF 12;
| Feature | Use for | Watch out |
|---|
| Time travel | Auditing, debugging, reproducibility | Limited by retained Delta log/data files |
| Restore | Operational rollback | It creates a new table version |
| VACUUM | Storage cleanup | Can remove files needed for older time travel or delayed streaming readers |
Change Data Feed
ALTER TABLE prod.silver.orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
SELECT *
FROM table_changes('prod.silver.orders', 10);
| Use CDF when | Do not use CDF when |
|---|
| Downstream jobs need incremental changes | Full refresh is simpler and small enough |
| You need inserts, updates, and deletes from Delta | Source is not Delta or CDF is not enabled |
| You want efficient propagation to gold tables | Consumers cannot keep up with retention expectations |
Schema, Tables, and Storage Decisions
Managed vs External Tables
| Table type | Storage controlled by | Drop behavior | Best use |
|---|
| Managed table | Databricks / Unity Catalog managed storage | Dropping table removes managed data | Standard curated lakehouse tables |
| External table | External cloud storage location | Dropping table removes metadata only | Data shared with other systems or lifecycle managed externally |
Schema Enforcement and Evolution
| Requirement | Option | Exam note |
|---|
| Reject unexpected schema | Schema enforcement | Good for trusted silver/gold layers |
| Allow new columns during ingestion | Schema evolution | Common for bronze Auto Loader |
| Overwrite schema intentionally | overwriteSchema patterns | Use carefully; can break consumers |
| Merge with new columns | mergeSchema / controlled evolution | Validate before using in curated layers |
| Store rescued data | Rescue column patterns | Useful when raw records may contain unexpected fields |
Partitioning, Clustering, and File Layout
| Technique | Use when | Avoid |
|---|
| Partitioning | Large tables frequently filtered by low/moderate-cardinality columns | High-cardinality partitions that create many tiny files |
| OPTIMIZE | Table has many small files | Running constantly without need |
| ZORDER | Queries filter on selective columns | Too many columns or columns rarely used in filters |
| Liquid clustering | Supported environment and evolving query patterns | Combining blindly with older partition/ZORDER assumptions |
| Auto compaction / optimized writes | Need better file sizing with less manual work | Assuming they fix bad table design |
Lakeflow Declarative Pipelines / Delta Live Tables
Microsoft and Databricks materials may reference Delta Live Tables (DLT) and newer Lakeflow Declarative Pipelines terminology. For exam purposes, focus on the concepts: declarative tables, dependencies, streaming tables, data quality expectations, pipeline monitoring, and managed execution.
| Concept | Meaning | Exam use |
|---|
| Pipeline | Managed set of table definitions and flows | Use for reliable ETL with dependency management |
| Live table / materialized table | Declarative table created from query logic | Good for batch transformations |
| Streaming live table | Table fed by streaming input | Good for incremental file/event ingestion |
| Expectation | Data quality rule | Warn, drop invalid rows, or fail pipeline |
| Event log | Pipeline operational log | Troubleshooting and audit of pipeline runs |
| Development mode | Faster iteration | Not the same as production reliability settings |
| Triggered pipeline | Runs then stops | Batch-like scheduled processing |
| Continuous pipeline | Keeps processing | Streaming-style processing |
Example DLT-style SQL:
CREATE OR REFRESH STREAMING LIVE TABLE bronze_orders
AS
SELECT *
FROM cloud_files(
"abfss://landing@storageacct.dfs.core.windows.net/orders/",
"json"
);
CREATE OR REFRESH LIVE TABLE silver_orders
(
CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT valid_amount EXPECT (amount >= 0) ON VIOLATION DROP ROW
)
AS
SELECT
order_id,
customer_id,
CAST(amount AS DECIMAL(18,2)) AS amount,
CAST(order_timestamp AS TIMESTAMP) AS order_timestamp
FROM STREAM(LIVE.bronze_orders);
Expectation actions:
| Action | Result | Use when |
|---|
| Warn | Records are kept; metric is recorded | You need visibility without blocking |
| Drop | Invalid records are removed | Bad rows should not enter curated data |
| Fail | Pipeline stops | Data quality failure should block publication |
DLT / Lakeflow traps:
| Trap | Correct thinking |
|---|
| Treating pipeline tables as manually updated tables | They are managed by pipeline definitions |
| Hiding quality rules in notebook code only | Use expectations when pipeline-level metrics matter |
| Ignoring event logs | Event logs are key for troubleshooting |
| Using continuous mode for everything | Triggered mode is often enough for scheduled batch-style ingestion |
| Mixing dev and prod assumptions | Production pipelines need stable configuration, permissions, and monitoring |
Unity Catalog and Governance
Namespace and Securable Hierarchy
| Level | Example | Notes |
|---|
| Metastore | main metastore | Assigned to one or more workspaces |
| Catalog | prod | Top-level data namespace |
| Schema | prod.silver | Groups tables, views, functions, volumes |
| Object | prod.silver.orders | Table, view, volume, model, function, etc. |
Use three-level names for clarity:
SELECT *
FROM prod.silver.orders;
Common Unity Catalog Privileges
| Privilege | Grants ability to |
|---|
USE CATALOG | Access objects inside a catalog, subject to lower-level grants |
USE SCHEMA | Access objects inside a schema, subject to object grants |
SELECT | Read table or view data |
MODIFY | Insert, update, delete, merge, or otherwise change table data |
CREATE TABLE | Create tables in a schema |
CREATE VOLUME | Create volumes in a schema |
READ VOLUME | Read files in a volume |
WRITE VOLUME | Write files in a volume |
EXECUTE | Run functions or models where applicable |
MANAGE | Manage privileges or object settings, depending on object type |
Basic grant pattern:
GRANT USE CATALOG ON CATALOG prod TO `data-engineers`;
GRANT USE SCHEMA ON SCHEMA prod.silver TO `data-engineers`;
GRANT SELECT ON TABLE prod.silver.orders TO `analysts`;
GRANT SELECT, MODIFY ON TABLE prod.silver.orders TO `etl-service-principal`;
High-yield permission rule: object-level access is not enough if the principal lacks USE CATALOG and USE SCHEMA.
External Locations and Volumes
| Object | Purpose | Exam distinction |
|---|
| Storage credential | Authenticates to cloud storage | Do not expose raw keys in notebooks |
| External location | Governs access to a cloud path | Grants READ FILES / WRITE FILES |
| External table | Table metadata over data in external storage | For structured tabular data |
| Volume | Governed access to files | For non-tabular files, libraries, ML files, landing files |
| Legacy mount | Workspace-level storage mount | Avoid as a UC-first governance design |
Example external location pattern:
CREATE EXTERNAL LOCATION raw_orders_location
URL 'abfss://landing@storageacct.dfs.core.windows.net/orders/'
WITH (STORAGE CREDENTIAL adls_storage_credential);
GRANT READ FILES ON EXTERNAL LOCATION raw_orders_location TO `ingestion-engineers`;
Security Decision Table
| Requirement | Prefer | Avoid |
|---|
| Govern table access across workspaces | Unity Catalog grants | Workspace-local ACL-only design |
| Secure ADLS Gen2 access | Managed identity/access connector with UC storage credential | Hard-coded account keys in notebooks |
| Restrict rows by user/group | Row filters or dynamic views | Duplicating many physical tables |
| Restrict sensitive columns | Column masks or secure views | Giving broad table access then relying on consumers |
| Store passwords/API keys | Secret scopes or managed identities | Plain text in notebooks, jobs, or repos |
| Production service identity | Service principal or managed identity pattern | Personal user identity for scheduled jobs |
| Govern non-tabular files | Volumes and external locations | Ungoverned DBFS or ad hoc mounts |
Unity Catalog Traps
| Symptom | Likely cause |
|---|
| User can see table name but query fails | Missing SELECT, USE SCHEMA, or USE CATALOG |
| Job works for developer but fails in production | Job identity lacks UC or storage permissions |
| External table cannot read files | External location or storage credential issue |
| Notebook path works in one workspace only | Workspace-local mount or DBFS dependency |
| Data appears outside lineage/governance | Legacy metastore, unmanaged path, or direct cloud access bypassing UC |
| Drop table removed data unexpectedly | It was a managed table |
Compute Selection
| Compute type | Best for | Exam notes |
|---|
| All-purpose compute | Interactive notebooks, exploration, development | Flexible but not ideal as default production runtime |
| Job compute / job cluster | Scheduled production tasks | Ephemeral, repeatable, easier cost and dependency control |
| SQL warehouse | SQL queries, dashboards, BI, Databricks SQL | Not a general PySpark notebook cluster |
| Serverless compute where available | Reduced infrastructure management | Confirm workload and governance support in scenario |
| Cluster pool | Faster cluster startup | Useful when many similar clusters start frequently |
| Photon-enabled compute | SQL and Delta-heavy workloads | Often improves query performance for supported operations |
Access Modes
| Access mode terminology | Use case | Exam point |
|---|
| Standard / shared | Multiple users with governance controls | Common for UC-enabled collaborative workloads |
| Dedicated / single user | One user or assigned identity | Useful for isolation and certain workloads |
| No isolation shared / legacy | Older less-isolated mode | Avoid for modern governed UC workloads |
Cluster Configuration Cues
| Requirement | Setting or feature |
|---|
| Enforce approved settings | Cluster policy |
| Control library versions | Job cluster config, init scripts only when needed, pinned dependencies |
| Minimize idle cost | Auto-termination for interactive clusters |
| Handle variable load | Autoscaling |
| Separate dev/test/prod | Separate workspaces, catalogs, schemas, or policies as appropriate |
| Improve repeatability | Jobs, parameters, source-controlled code |
Jobs, Workflows, and Deployment
| Feature | Use for |
|---|
| Task dependencies | Build DAG-style workflows |
| Job parameters | Avoid hard-coded environment paths and dates |
| Retries | Handle transient failures |
| Run-if conditions | Control downstream behavior after success/failure |
| Job clusters | Isolated production compute per job or task |
| Shared job cluster | Reuse compute among tasks in same job when appropriate |
| Schedule trigger | Time-based orchestration |
| File arrival trigger | Start when new data arrives, where supported |
| Continuous trigger | Always-on processing pattern |
| Alerts/notifications | Operational awareness |
| Git integration / source control | Version notebooks, code, SQL, pipeline definitions |
| Databricks Asset Bundles or deployment tooling | Promote repeatable assets across environments |
Production readiness checklist:
- Use service principals or managed identities for scheduled workloads.
- Parameterize catalog, schema, storage path, and processing date.
- Separate development and production data namespaces.
- Store secrets outside code.
- Use job clusters or governed compute policies.
- Define retry behavior and failure notifications.
- Log row counts, rejected records, and important pipeline metrics.
- Avoid relying on an interactive user’s cluster, credentials, or notebook state.
Table and Query Tuning
| Symptom | Likely cause | Corrective action |
|---|
| Query scans too much data | Poor filters, no data skipping, bad layout | Partition appropriately, ZORDER/liquid clustering, collect stats where relevant |
| Many small files | Frequent small writes or streaming micro-batches | OPTIMIZE, optimized writes, compaction strategy |
| Slow joins | Shuffle-heavy join, skew, missing broadcast opportunity | Broadcast small dimension, repartition, handle skew, use AQE |
| OOM during transformation | Large shuffle or wide operation | Reduce data earlier, repartition carefully, avoid collecting to driver |
| Slow Python UDFs | Row-by-row Python execution | Prefer built-in Spark SQL functions or vectorized patterns |
| Streaming state grows | No watermark or broad aggregation key | Add watermark, reduce state, tune late-data assumptions |
| Dashboard slow | Gold table not serving-shaped | Pre-aggregate, use SQL warehouse, optimize table layout |
| Repeated full recomputation | No incremental design | Use CDF, Auto Loader, MERGE, or pipeline incremental patterns |
Spark Execution Concepts
| Concept | Why it matters |
|---|
| Lazy evaluation | Transformations run only when an action executes |
| Narrow transformation | No shuffle; generally cheaper |
| Wide transformation | Requires shuffle; often expensive |
| Shuffle | Data redistributed across partitions |
| Partition | Unit of parallel data processing |
| Driver | Coordinates Spark application |
| Executor | Runs tasks and stores/cache partitions |
| Cache/persist | Useful for reused intermediate data; can become stale or consume memory |
| Adaptive Query Execution | Runtime query optimization for supported workloads |
| Broadcast join | Sends small table to workers to avoid large shuffle |
Optimization Traps
| Trap | Better answer |
|---|
| “Partition by every filter column” | Partition only when cardinality and access patterns justify it |
| “Cache everything” | Cache reused data only; unpersist when done |
| “Use bigger clusters first” | Fix data layout, shuffles, and query logic before scaling blindly |
| “VACUUM aggressively” | Respect time travel, rollback, and streaming readers |
| “One huge notebook does everything” | Break into tasks/pipelines with observable boundaries |
| “Use UDFs for simple expressions” | Use native Spark functions for optimizer support |
Data Quality and Reliability
| Requirement | Feature or pattern |
|---|
| Reject invalid rows in pipeline | Expectations with drop/fail action |
| Track invalid rows for review | Quarantine table pattern |
| Enforce non-null or valid values | Constraints or expectations |
| Deduplicate events | Window functions, watermark-based streaming deduplication |
| Handle late data | Watermarks and business-defined lateness rules |
| Idempotent reruns | MERGE by key, deterministic outputs, checkpoint discipline |
| Audit load metadata | Add source file, ingestion timestamp, batch ID |
| Backfill historical data | Parameterized jobs, controlled overwrite/merge, separate checkpoint strategy |
Useful metadata columns:
| Column | Purpose |
|---|
ingestion_timestamp | When the platform ingested the row |
source_file | File lineage and troubleshooting |
batch_id | Rerun and reconciliation |
record_hash | Change detection or deduplication |
is_quarantined / error fields | Data quality review |
CDC and Slowly Changing Dimensions
| Pattern | Use when | Core logic |
|---|
| Type 1 SCD | Keep only latest value | MERGE update overwrites existing row |
| Type 2 SCD | Preserve history | Expire current row, insert new current row |
| Delete propagation | Source emits deletes | MERGE with WHEN MATCHED ... DELETE or expire record |
| CDF downstream | Delta source changes should feed another table | Read changes since last version |
| Pipeline CDC helpers | Declarative CDC in Lakeflow/DLT scenarios | Use when exam scenario emphasizes managed CDC pipeline |
SCD Type 2 cues:
| Field | Purpose |
|---|
| Business key | Identifies entity, such as customer_id |
| Surrogate key | Unique dimension row key |
| Effective start/end | Validity range |
| Current flag | Identifies active record |
| Hash diff | Detects attribute changes |
Monitoring and Troubleshooting
Where to Look
| Problem | First places to inspect |
|---|
| Job failed | Jobs run output, task logs, cluster logs |
| Spark query slow | Spark UI, SQL query profile, stages, shuffle metrics |
| SQL dashboard slow | Query history, warehouse size/state, table layout |
| Pipeline failed | Pipeline event log, expectation metrics, failed flow/table |
| Stream stalled | Streaming query progress, checkpoint path, source backlog |
| Permission denied | Unity Catalog grants, external location grants, cloud IAM |
| Data missing | Ingestion file discovery, filters, expectations, checkpoints |
| Duplicates | Checkpoint reset, non-idempotent writes, source replay, missing dedup key |
| Schema error | Auto Loader schema location, rescued data, evolution settings |
Common Error Patterns
| Error pattern | Likely cause | Fix direction |
|---|
Cannot access abfss://... | Storage credential, external location, or cloud permission missing | Validate UC external location and identity |
| Table not found | Wrong catalog/schema or current context | Use three-level names |
| Permission denied on table | Missing grants | Grant USE CATALOG, USE SCHEMA, and object privilege |
| Stream already active or checkpoint conflict | Reused checkpoint/query | Use distinct checkpoint and stop old query |
| Schema mismatch | Source changed | Apply controlled schema evolution or rescue/quarantine |
| Slow stage with high shuffle | Skew or wide transformation | Repartition, broadcast, filter earlier, handle skew |
| Driver out of memory | Collected too much data to driver | Avoid .collect() for large data; write distributed outputs |
| Unexpected old results | Cached data or stale table/view | Refresh/unpersist or rerun after invalidating cache |
Triage Order
- Identify whether the failure is permissions, code, data, compute, or orchestration.
- Check the job or pipeline run output.
- Inspect table names, catalog/schema context, and identity used by the run.
- Validate source paths, external locations, and grants.
- Review schema changes and data quality expectation failures.
- Use Spark UI or SQL profile for performance bottlenecks.
- Fix idempotency before rerunning failed writes.
Azure-Specific Integration Points
| Azure component | Azure Databricks role | Exam note |
|---|
| ADLS Gen2 | Primary lakehouse storage | Use abfss:// paths and governed access |
| Microsoft Entra ID | Users, groups, service principals | Prefer group-based access management |
| Managed identities / access connectors | Secure Azure resource access | Avoid embedded credentials |
| Azure Key Vault | Secret backing for secret scopes | Useful for external secrets, but not a substitute for UC governance |
| Azure Data Factory / Synapse pipelines | External orchestration | Can trigger Databricks jobs/notebooks |
| Event Hubs | Streaming source | Often used with Structured Streaming |
| Microsoft Purview | Broader data catalog/governance integration | Unity Catalog handles Databricks-native governance and lineage |
SQL and PySpark Syntax Mini-Reference
| Task | SQL / PySpark cue |
|---|
| Set catalog | USE CATALOG prod; |
| Set schema | USE SCHEMA silver; |
| Create table as select | CREATE TABLE prod.gold.sales AS SELECT ... |
| Insert data | INSERT INTO prod.silver.orders SELECT ... |
| Overwrite table | CREATE OR REPLACE TABLE ... AS SELECT ... |
| Merge updates | MERGE INTO ... USING ... ON ... |
| Inspect history | DESCRIBE HISTORY catalog.schema.table |
| Optimize table | OPTIMIZE catalog.schema.table |
| Read table in PySpark | spark.table("prod.silver.orders") |
| Write table in PySpark | df.write.mode("append").saveAsTable("prod.silver.orders") |
| Streaming read table | spark.readStream.table("prod.bronze.orders") |
| Streaming write table | df.writeStream.toTable("prod.silver.orders") |
High-Yield Exam Traps Checklist
- Do not choose legacy mounts when the scenario emphasizes Unity Catalog governance.
- Do not grant
SELECT only and forget USE CATALOG and USE SCHEMA. - Do not use an all-purpose interactive cluster as the default production answer.
- Do not reset or share streaming checkpoints without understanding replay and duplicates.
- Do not use direct file reads when the requirement is incremental file discovery.
- Do not use COPY INTO for continuous event streams.
- Do not use MERGE without a stable key and deduplicated source.
- Do not overpartition high-cardinality columns.
- Do not run VACUUM casually when rollback, time travel, or lagging streams matter.
- Do not store production data in DBFS root as a governance strategy.
- Do not assume a notebook user’s permissions are the same as the job’s service identity.
- Do not hide all data quality logic downstream in BI; validate in silver/pipeline layers.
- Do not choose Python UDFs for simple transformations that Spark SQL functions can handle.
- Do not ignore pipeline event logs, job task logs, Spark UI, and SQL query profiles during troubleshooting.
Last-Mile Practice Plan
Practice DP-750 scenarios by forcing yourself to choose: ingestion pattern, Delta table design, Unity Catalog permissions, compute type, orchestration method, and troubleshooting path. Then implement small end-to-end exercises: Auto Loader to bronze, MERGE to silver, aggregate to gold, secure with Unity Catalog grants, schedule as a job, and diagnose one intentional failure.