Databricks Certified Data Engineer Associate Quick Review
High-yield Databricks Certified Data Engineer Associate quick review for Delta Lake, pipelines, jobs, SQL, streaming, and governance before practice.
Quick Review for Databricks DEA Candidates
This Quick Review is for candidates preparing for the Databricks Certified Data Engineer Associate exam, official exam code Databricks DEA, from Databricks. Use it as a focused refresh before moving into IT Mastery practice, topic drills, mock exams, and detailed explanations.
The exam rewards practical understanding of how data engineering work is done on the Databricks Lakehouse Platform: ingesting data, transforming it reliably, using Delta Lake correctly, building production pipelines, scheduling jobs, and applying basic governance and performance practices.
High-Yield Exam Map
| Area | What to review | Common exam angle |
|---|---|---|
| Lakehouse concepts | Lakehouse architecture, data lake vs warehouse, medallion layers | Identify the best architecture or data layer for a scenario |
| Delta Lake | ACID transactions, transaction log, schema enforcement, time travel, MERGE | Choose the right Delta feature or SQL command |
| Data ingestion | COPY INTO, Auto Loader, file formats, incremental loading | Select ingestion method based on scale and arrival pattern |
| Transformations | Spark SQL, DataFrames, views, temp views, CTAS, filtering, joins | Predict results or choose efficient transformation logic |
| Streaming | Structured Streaming, checkpoints, triggers, watermarks, append/update modes | Distinguish streaming from batch and avoid duplicate processing |
| Pipelines | Delta Live Tables concepts, expectations, declarative pipelines | Understand managed pipeline behavior and data quality checks |
| Jobs and orchestration | Tasks, dependencies, retries, parameters, clusters | Build or troubleshoot scheduled workflows |
| Databricks SQL | Warehouses, queries, dashboards, alerts, SQL endpoints/warehouses | Know when SQL warehouse vs all-purpose/job compute fits |
| Governance | Unity Catalog concepts, catalogs/schemas/tables, permissions, lineage | Apply access control and object hierarchy correctly |
| Performance and reliability | Partitioning, OPTIMIZE, ZORDER, caching, file sizing | Pick practical tuning options without overengineering |
Lakehouse and Medallion Architecture
A Databricks data engineer should understand the lakehouse as a unified architecture that combines low-cost object storage with database-style reliability, governance, and analytics performance.
Core Lakehouse Ideas
| Concept | Quick meaning | Exam trap |
|---|---|---|
| Data lake | Stores raw data in open formats on object storage | Does not automatically provide ACID reliability by itself |
| Data warehouse | Optimized for structured analytics and BI | Often less flexible for raw/semi-structured data |
| Lakehouse | Combines open storage, Delta Lake reliability, and analytics/ML access | Not just “a data lake with dashboards” |
| Delta Lake | Storage layer providing reliability and performance features | It is not a separate database engine; it works on files plus a transaction log |
| Medallion architecture | Bronze, Silver, Gold data refinement pattern | The layers are logical design patterns, not mandatory product objects |
Medallion Layer Decision Rules
| Layer | Typical contents | Common operations | Candidate mistake |
|---|---|---|---|
| Bronze | Raw or lightly processed ingested data | Append, capture source metadata, preserve original records | Cleaning too aggressively and losing auditability |
| Silver | Cleaned, validated, conformed data | Deduplication, joins, type casting, data quality rules | Leaving source-specific inconsistencies unresolved |
| Gold | Business-ready aggregates or serving tables | Aggregations, dimensional models, BI-ready tables | Putting raw data directly into dashboards |
A good exam habit: when a scenario mentions raw source preservation, think Bronze. When it mentions cleaned reusable entity tables, think Silver. When it mentions business metrics, dashboards, or serving use cases, think Gold.
Delta Lake Essentials
Delta Lake is one of the most important topics for the Databricks Certified Data Engineer Associate exam. Focus on what Delta adds beyond ordinary Parquet files.
Delta Lake Features to Know
| Feature | Why it matters | Typical command or concept |
|---|---|---|
| ACID transactions | Reliable concurrent reads/writes | Delta transaction log |
| Schema enforcement | Prevents incompatible writes | Write fails unless schema is compatible |
| Schema evolution | Allows controlled schema changes | mergeSchema or ALTER TABLE patterns |
| Time travel | Query older table versions or timestamps | VERSION AS OF / TIMESTAMP AS OF |
| Upserts | Insert/update records from source into target | MERGE INTO |
| Deletes and updates | Modify existing table rows | DELETE, UPDATE |
| Compaction | Improve file sizes and query performance | OPTIMIZE |
| Data skipping | Avoid scanning irrelevant files | Statistics, ZORDER where appropriate |
| Audit history | Review table operations | DESCRIBE HISTORY |
Delta Table Types and Storage
| Object | What it means | Review point |
|---|---|---|
| Managed table | Databricks manages table metadata and data location | Dropping may remove underlying data depending on configuration |
| External table | Metadata points to data in an external location | Data lifecycle is managed outside the table definition |
| View | Saved query definition | Does not store data like a table |
| Temporary view | Session-scoped view | Not available outside the session |
| Global temporary view | Shared across sessions in a special global temp database | Still temporary, not a permanent table |
Delta Commands Worth Recognizing
| Task | SQL pattern |
|---|---|
| Create a Delta table from query | CREATE TABLE target AS SELECT … |
| Insert rows | INSERT INTO table SELECT … |
| Overwrite table data | INSERT OVERWRITE or write mode overwrite |
| Update matched records | MERGE INTO target USING source ON … WHEN MATCHED THEN UPDATE |
| Insert new records during merge | WHEN NOT MATCHED THEN INSERT |
| Delete rows | DELETE FROM table WHERE … |
| Query history | DESCRIBE HISTORY table |
| Query prior version | SELECT … FROM table VERSION AS OF n |
| Optimize files | OPTIMIZE table |
| Z-order selected columns | OPTIMIZE table ZORDER BY (col1, col2) |
Common Delta Lake Traps
- Parquet alone is not Delta Lake. Delta typically stores data as Parquet files plus a Delta transaction log.
- Schema enforcement and schema evolution are different. Enforcement blocks incompatible data; evolution allows approved changes.
- MERGE is for upserts. Do not choose a full overwrite when the scenario needs record-level updates and inserts.
- Time travel depends on retained history. Avoid assuming unlimited access to all previous versions.
- OPTIMIZE is not a fix for bad logic. It can improve file layout, but it does not correct incorrect joins, filters, or partition design.
- Partitioning is not always better. High-cardinality partition columns can create many small partitions and hurt performance.
Ingestion: Batch, Incremental, and Streaming
The exam often tests whether you can select the correct ingestion approach.
Ingestion Method Decision Table
| Scenario | Strong option | Why |
|---|---|---|
| Periodic batch load from a stable file location | COPY INTO | Simple, idempotent-style incremental file ingestion for batch use cases |
| Many files arriving continuously in cloud storage | Auto Loader | Scales file discovery and supports incremental processing |
| Low-latency continuously processed data | Structured Streaming | Processes new data as it arrives with checkpoints |
| One-time historical backfill | Batch read/write | Simpler than streaming when data is static |
| CDC-style source with inserts/updates/deletes | MERGE into Delta or CDC-aware pipeline | Handles changing records instead of append-only assumptions |
COPY INTO vs Auto Loader
| Feature | COPY INTO | Auto Loader |
|---|---|---|
| Best for | Simple incremental batch file loads | Scalable cloud file ingestion |
| Processing style | Batch command | Structured Streaming source |
| File discovery | Tracks loaded files for a target table | Designed for efficient incremental file discovery |
| Typical use | “Load new files from this directory into this Delta table” | “Continuously ingest new cloud files into Bronze” |
| Exam trap | Choosing streaming when scheduled batch is enough | Choosing manual directory listing at scale |
File Format Review
| Format | Key characteristics | Exam clue |
|---|---|---|
| CSV | Text, simple, needs schema/header handling | Watch delimiter, header, inferSchema issues |
| JSON | Semi-structured, nested data possible | May require parsing/exploding nested fields |
| Parquet | Columnar, efficient analytics format | Common underlying format for Delta data files |
| Delta | Transactional table format built on data files plus log | Needed for ACID, MERGE, time travel |
Spark SQL and Data Transformations
A data engineer on Databricks should be comfortable reading and reasoning about SQL transformations.
SQL Transformation Patterns
| Need | Common approach | Trap |
|---|---|---|
| Create a reusable transformed dataset | CREATE TABLE AS SELECT | Confusing a table with a view |
| Create a logical query layer | CREATE VIEW | Expecting a view to store transformed data |
| Remove duplicates | ROW_NUMBER with window function, dropDuplicates, distinct | Using distinct and accidentally losing meaningful columns |
| Keep latest record per key | Window function ordered by timestamp | Forgetting deterministic tie-breakers |
| Aggregate metrics | GROUP BY with aggregate functions | Selecting non-grouped, non-aggregated columns |
| Join reference data | INNER/LEFT joins | Choosing INNER join when unmatched records must be preserved |
| Parse nested data | explode, from_json, struct/array functions | Treating nested data like flat columns |
Join Decision Review
| Join type | Keeps rows from | Use when |
|---|---|---|
| INNER JOIN | Matching rows only | You only want records with matches on both sides |
| LEFT JOIN | All left rows plus matches from right | You must preserve the primary dataset |
| RIGHT JOIN | All right rows plus matches from left | Less common; usually can rewrite as LEFT JOIN |
| FULL OUTER JOIN | All rows from both sides | Reconciliation or comparison scenarios |
| CROSS JOIN | Every combination | Rare; often a mistake unless explicitly required |
| SEMI JOIN | Left rows that have a match | Filtering to existing keys |
| ANTI JOIN | Left rows that do not have a match | Finding missing or unmatched records |
Window Function Pattern
A common exam pattern is “select the latest record per business key.”
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC
) AS rn
FROM source_table
)
WHERE rn = 1;
Review the difference between:
ROW_NUMBER()— assigns a unique sequence, useful for one winner per group.RANK()— ties share rank and can skip numbers.DENSE_RANK()— ties share rank without gaps.
Structured Streaming Review
Structured Streaming treats streaming data as an unbounded table. The same DataFrame-style transformations often apply, but reliability depends on checkpointing, output mode, and trigger configuration.
Streaming Concepts
| Concept | Meaning | Exam trap |
|---|---|---|
| Checkpoint | Stores streaming progress and state | Without it, recovery and exactly-once-style behavior are at risk |
| Trigger | Controls when micro-batches are processed | Continuous arrival does not always mean continuous execution |
| Output mode | Defines what gets written | Append/update/complete depend on query type |
| Watermark | Bounds how long late data is considered | Not the same as filtering by event time |
| State | Maintained data for aggregations/deduplication | Can grow if not bounded |
| Sink | Destination for streaming output | Delta is common for reliable lakehouse pipelines |
Output Mode Quick Review
| Output mode | What it writes | Common use |
|---|---|---|
| Append | Only newly completed rows | Append-only streams, finalized aggregations with watermark |
| Update | Rows changed since last trigger | Updating aggregation results |
| Complete | Entire result table each trigger | Full aggregate outputs, usually smaller result sets |
Watermark Decision Rule
Use a watermark when:
- The stream uses event-time logic.
- Late-arriving data is expected.
- The engine needs a boundary for state cleanup.
- Some late data can be excluded after the threshold.
Do not treat a watermark as a guarantee that all late data is preserved. It is a practical tradeoff between correctness window and state size.
Delta Live Tables and Declarative Pipelines
If a scenario involves declarative pipelines, data quality checks, managed dependencies, or simplified batch/streaming pipeline operations, review Delta Live Tables concepts.
| Concept | Review point |
|---|---|
| Pipeline | Managed execution of one or more dataset definitions |
| Live table / streaming table | Dataset maintained by the pipeline |
| View | Intermediate logic that may not materialize as a final table |
| Expectations | Data quality rules applied to records |
| Pipeline dependencies | Derived from table/view definitions |
| Batch vs streaming pipeline logic | Depends on source and table type |
Expectations: What to Remember
| Expectation behavior | Meaning |
|---|---|
| Track invalid records | Records are monitored for quality but may still flow |
| Drop invalid records | Bad records are excluded |
| Fail on invalid records | Pipeline stops when invalid data violates the rule |
Candidate trap: expectations are not just documentation. Depending on configuration, they can track, drop, or fail records.
Jobs, Workflows, and Compute
Production data engineering requires orchestration. Review how jobs, tasks, dependencies, parameters, and compute choices work together.
Jobs and Workflow Concepts
| Concept | Meaning | Exam angle |
|---|---|---|
| Job | Scheduled or triggered unit of work | Used for production automation |
| Task | Individual step in a job | Can run notebooks, scripts, SQL, pipelines, etc. |
| Dependency | Controls task order | Downstream tasks wait for upstream success when configured |
| Retry | Reruns failed tasks according to policy | Helps with transient failures |
| Parameter | Runtime value passed into a task | Supports reusable jobs |
| Job cluster | Cluster created for a job run | Good for isolated automated workloads |
| All-purpose cluster | Interactive shared compute | Common for development, less ideal for scheduled production jobs |
| SQL warehouse | Compute for Databricks SQL | Used for SQL queries, BI, dashboards, and alerts |
Compute Choice Decision Table
| Scenario | Likely compute choice |
|---|---|
| Interactive notebook development | All-purpose cluster |
| Scheduled production ETL job | Job cluster or configured job compute |
| SQL dashboard for analysts | SQL warehouse |
| Databricks SQL query or alert | SQL warehouse |
| Managed declarative pipeline | Pipeline-managed compute |
| Cost-sensitive repeated production workload | Job-specific compute with right-sized resources |
Workflow Design Traps
- Do not run every production job manually from a notebook.
- Do not use a single large shared cluster as the default answer for all workloads.
- Use task dependencies when order matters.
- Use retries for transient failures, but fix deterministic data or code errors.
- Pass parameters instead of copying nearly identical notebooks for each environment or date.
Unity Catalog and Governance
Governance topics often test conceptual clarity: object hierarchy, permissions, data discovery, and lineage.
Unity Catalog Object Hierarchy
| Level | Example role in organization |
|---|---|
| Metastore | Top-level governance container for a workspace/account setup |
| Catalog | Broad domain or environment grouping |
| Schema | Database-like namespace within a catalog |
| Table/View/Function | Data and logic objects accessed by users |
A common three-level name pattern is:
catalog.schema.table
Governance Review Table
| Topic | What to know |
|---|---|
| Catalogs and schemas | Organize data assets and permissions |
| Grants | Control who can access or modify objects |
| External locations | Govern access to cloud storage paths |
| Lineage | Helps understand upstream/downstream data relationships |
| Data discovery | Users find governed assets through cataloging |
| Least privilege | Grant only the access needed for the job |
Common Governance Traps
- Workspace access is not the same as table access.
- Cloud storage access and table permissions are related but not identical concepts.
- A user may be able to run compute but still lack permission to query a table.
- Object names may need catalog and schema qualification in governed environments.
- Governance is not only security; it also supports discovery, lineage, and operational trust.
Databricks SQL Review
Databricks SQL supports analytics, dashboards, visualizations, and alerts using SQL warehouses.
| Feature | Quick review |
|---|---|
| SQL warehouse | Compute resource for SQL queries |
| Query | Saved SQL statement |
| Dashboard | Visual collection of query results |
| Alert | Condition-based notification from query results |
| Query history | Useful for reviewing executed SQL and performance |
| Permissions | Control who can run, edit, or view assets |
Databricks SQL Candidate Traps
- A SQL warehouse is not the same as a general-purpose interactive cluster.
- Dashboards show query results; they do not replace upstream data modeling.
- Alerts depend on query results and refresh behavior.
- Slow dashboard performance may require improving the underlying query, table layout, or warehouse sizing.
Performance and Optimization
The exam may test whether you can pick reasonable optimizations. Avoid extreme answers. Start with correct data layout, efficient transformations, and Delta features.
Optimization Decision Table
| Symptom | Possible action | Trap |
|---|---|---|
| Many small files | OPTIMIZE / compaction | Repartitioning randomly without understanding output |
| Queries filter often by specific columns | ZORDER on selected filter columns | ZORDERing every column |
| Slow query scanning too much data | Partition pruning, data skipping, filters | Partitioning on high-cardinality columns |
| Repeated expensive intermediate use | Cache selectively or materialize | Caching everything |
| Skewed join performance | Review join keys, salting/broadcast strategies where appropriate | Assuming more workers always fixes skew |
| Overly expensive scheduled job | Right-size compute, optimize logic, incremental processing | Full refresh when incremental processing is possible |
Partitioning Review
Partitioning can help when queries frequently filter by a low-to-moderate cardinality column such as date. It can hurt when the partition column has too many unique values, creating excessive small directories and files.
Good partition clues:
- Date-based filtering is common.
- Partition cardinality is controlled.
- Data volume per partition is meaningful.
- Queries can prune partitions.
Bad partition clues:
- User ID, transaction ID, UUID, or other high-cardinality values.
- Tiny files per partition.
- Queries rarely filter on the partition column.
Reliability and Data Quality
Data engineering exam scenarios often ask how to make pipelines reliable, repeatable, and testable.
| Need | Good practice |
|---|---|
| Recover failed streaming job | Use checkpoints |
| Avoid duplicate file ingestion | Use incremental ingestion features such as COPY INTO or Auto Loader |
| Preserve raw source data | Store in Bronze before destructive transformations |
| Enforce valid records | Use constraints, expectations, or validation logic |
| Handle late data | Use event-time processing and watermarks where appropriate |
| Apply updates to target | Use MERGE instead of append-only writes |
| Audit changes | Use Delta history and pipeline/job run history |
| Reduce manual error | Schedule jobs and parameterize tasks |
Scenario Decision Flow
flowchart TD
A[New data engineering scenario] --> B{Is the source continuously arriving?}
B -- No, periodic files --> C{Need simple incremental batch load?}
C -- Yes --> D[COPY INTO]
C -- No --> E[Batch read and write]
B -- Yes --> F{Cloud files at scale?}
F -- Yes --> G[Auto Loader with checkpointing]
F -- No --> H[Structured Streaming source]
D --> I[Write Bronze Delta]
E --> I
G --> I
H --> I
I --> J{Need cleaned reusable data?}
J -- Yes --> K[Transform to Silver]
J -- No --> L[Keep raw/audit layer]
K --> M{Need BI or business metrics?}
M -- Yes --> N[Gold tables or views]
M -- No --> O[Reusable Silver tables]
Use this flow as a quick mental model. The real exam may phrase the scenario differently, but the decision points are usually about arrival pattern, reliability, transformation need, and serving layer.
Frequently Tested Command Intent
| If the question says… | Think… |
|---|---|
| “Insert new rows and update existing rows” | MERGE INTO |
| “Load only new files from a directory” | COPY INTO or Auto Loader depending on scale/streaming |
| “Recover stream after failure” | Checkpoint location |
| “Handle late-arriving event-time records” | Watermark |
| “View previous table version” | Delta time travel |
| “Improve many-small-files performance” | OPTIMIZE |
| “Co-locate data for filter columns” | ZORDER |
| “Create business-level aggregates for dashboards” | Gold layer |
| “Keep raw source history” | Bronze layer |
| “Apply data quality rule in managed pipeline” | Expectations |
| “Run production notebook on a schedule” | Databricks job/workflow |
| “Analysts need dashboards and alerts” | Databricks SQL warehouse |
Common Candidate Mistakes
Conceptual Mistakes
- Treating Delta Lake as just another file format.
- Assuming all ingestion should be streaming.
- Confusing Bronze/Silver/Gold with security levels.
- Choosing overwrite when the scenario requires upsert.
- Ignoring checkpointing in streaming recovery scenarios.
- Assuming a view stores physical data.
- Using inner joins when unmatched source rows must be retained.
- Treating watermarks as late-data guarantees rather than state boundaries.
Practical Design Mistakes
- Full-refreshing a large table when incremental processing is appropriate.
- Partitioning by a high-cardinality column.
- Running BI dashboards directly on messy Bronze tables.
- Using development clusters for all production automation.
- Skipping data quality checks until the Gold layer.
- Not preserving raw data before cleaning.
- Granting broad access instead of using least privilege.
Fast Final Review Checklist
Before starting original practice questions, make sure you can answer these without notes:
- What does Delta Lake add beyond Parquet?
- When should you use MERGE instead of INSERT?
- What is the purpose of the Delta transaction log?
- How do Bronze, Silver, and Gold layers differ?
- When is COPY INTO a better fit than Auto Loader?
- Why do streaming jobs need checkpoints?
- What problem does a watermark solve?
- What is the difference between a managed table, external table, view, and temporary view?
- When should you use a SQL warehouse instead of an all-purpose cluster?
- What does OPTIMIZE do, and when might ZORDER help?
- Why can high-cardinality partitioning be harmful?
- How do job tasks and dependencies support production workflows?
- What is the Unity Catalog hierarchy?
- How do expectations support data quality in pipelines?
How to Use This Quick Review with Practice
Use this page as a rapid reset, then move into IT Mastery practice:
- Start with topic drills on Delta Lake, ingestion, streaming, jobs, and governance.
- For each missed question, identify whether the issue was concept confusion, command recognition, or scenario judgment.
- Read the detailed explanations, then rewrite the decision rule in your own words.
- Take mixed question bank sets after you can consistently handle single-topic drills.
- Use mock exams only after your weak areas are specific enough to review efficiently.
Practical next step: begin with a short set of original practice questions on Delta Lake and ingestion, then review every explanation before moving to mixed Databricks DEA practice.
Continue in IT Mastery
Use this Quick Review as a final concept map, then move into IT Mastery for focused topic drills, mixed practice sets, timed mock exams, and detailed explanations. The practice questions are original IT Mastery practice items; they are not official Databricks questions, copied live-exam content, or exam dumps.