Use this syllabus as your source of truth for DE‑ASSOC. Work through it topic by topic, and drill practice questions after each section.
What’s covered
Topic 1: Spark Fundamentals (SQL + DataFrames)
Practice this topic →
1.1 Spark execution model (transformations, actions, lazy evaluation)
- Differentiate between Spark transformations and actions and identify which operations trigger execution.
- Explain Spark’s lazy evaluation at a conceptual level and why plans are built before execution.
- Recognize the small-file problem and describe why it can hurt performance in distributed reads.
- Use caching/persist conceptually and explain when caching can improve iterative workloads.
- Differentiate narrow vs wide transformations and identify which operations commonly introduce shuffles.
- Given a scenario, choose the simplest safe approach to reduce data early before expensive shuffles.
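A minimal PySpark sketch of the ideas above, assuming `spark` is the notebook's active SparkSession; the tiny inline DataFrame is only for illustration:

```python
from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [(1, "A", 120.0), (2, "B", 80.0), (3, "A", 200.0)],
    ["order_id", "status", "amount"],
)

big = orders.filter(F.col("amount") > 100)   # narrow transformation: no shuffle, nothing runs yet
by_status = big.groupBy("status").count()    # wide transformation: introduces a shuffle

by_status.show()                             # action: the lazy plan above executes only now

big.cache()                                  # worth considering when `big` feeds several actions
big.count()                                  # the first action after cache() materializes it
```

Filtering before the groupBy is the "reduce data early before expensive shuffles" move the last objective describes.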
1.2 Spark SQL essentials (joins, aggregations, windows)
- Select the appropriate join type (inner/left/full/semi/anti) for a described requirement.
- Write or interpret aggregations and GROUP BY logic including distinct counts and conditional aggregation.
- Explain the purpose of window functions and identify correct partition/order usage for common analytics tasks.
- Diagnose common join pitfalls (duplicate amplification, wrong granularity, filtering that changes join semantics).
- Apply null handling correctly in filters and joins (especially with LEFT joins).
- Given a scenario, choose SQL logic that produces correct results with minimal complexity.
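A small sketch of the join, aggregation, and window ideas above, using illustrative data and the notebook's `spark` session:

```python
from pyspark.sql import functions as F, Window

customers = spark.createDataFrame([(1, "EU"), (2, "US")], ["cust_id", "region"])
orders = spark.createDataFrame(
    [(10, 1, 50.0), (11, 1, 70.0), (12, 3, 20.0)],
    ["order_id", "cust_id", "amount"],
)

# LEFT join keeps customers with no orders; their order columns come back NULL,
# so a filter like amount > 60 applied after the join silently drops those customers.
joined = customers.join(orders, "cust_id", "left")

summary = joined.groupBy("region").agg(
    F.countDistinct("order_id").alias("order_count"),                         # distinct count
    F.sum(F.when(F.col("amount") > 60, F.col("amount"))).alias("large_amt"),  # conditional aggregation
)

# Window function: rank each customer's orders by amount.
w = Window.partitionBy("cust_id").orderBy(F.col("amount").desc())
ranked = orders.withColumn("rank_in_cust", F.row_number().over(w))
```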
1.3 DataFrames and column expressions (PySpark awareness)
- Translate between SQL logic and DataFrame transformations for common operations (select, filter, join, groupBy).
- Use column expressions safely (avoid stringly-typed mistakes) and understand why column naming matters.
- Recognize when UDFs are a last resort and prefer built-in functions where possible (concept-level).
- Explain how schema and data types impact correctness (string vs numeric, timestamps, nullability).
- Choose an appropriate approach for deduplication (dropDuplicates vs window-based selection) based on requirements.
- Given a scenario, identify the simplest transformation chain that meets correctness requirements.
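A brief sketch contrasting the two deduplication approaches mentioned above, on hypothetical event data:

```python
from pyspark.sql import functions as F, Window

events = spark.createDataFrame(
    [(1, "click", "2024-01-01 10:00"), (1, "click", "2024-01-01 10:05"), (2, "view", "2024-01-01 09:00")],
    ["user_id", "event", "event_ts"],
)

# dropDuplicates keeps an arbitrary survivor per key -- fine when any row will do.
dedup_any = events.dropDuplicates(["user_id", "event"])

# A window keeps a deterministic survivor (here: the latest event per key).
w = Window.partitionBy("user_id", "event").orderBy(F.col("event_ts").desc())
dedup_latest = (
    events.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
)
```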
Topic 2: Data Ingestion & Batch ETL Patterns
Practice this topic →
2.1 Reading data (formats, schemas, ingestion options)
- Differentiate common lakehouse formats conceptually (CSV/JSON/Parquet/Delta) and pick the right one for a scenario.
- Explain why explicit schemas improve reliability in schema-on-read systems and why schema inference can be risky in production.
- Identify common ingestion options (header, delimiter, multiline JSON) and their impact on correctness.
- Describe the difference between managed tables and external locations conceptually and when each is used.
- Recognize how partitioned data is typically laid out in storage and how it affects reads.
- Given a scenario, choose a safe ingestion strategy that balances correctness and operational simplicity.
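A minimal ingestion sketch with an explicit schema; the landing path and column names are hypothetical:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Explicit schema instead of inference: predictable types, loud failures on drift.
schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("order_date", DateType(), True),
    StructField("amount", DoubleType(), True),
])

raw = (
    spark.read.format("csv")
         .option("header", "true")      # first line holds column names
         .option("delimiter", ",")      # make the separator explicit
         .schema(schema)
         .load("/Volumes/main/bronze/orders_csv/")   # hypothetical landing path
)
```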
2.2 Writing data (append, overwrite, incremental thinking)
- Choose append vs overwrite strategies based on batch ETL intent and data freshness requirements.
- Explain idempotency at a conceptual level and why repeatable runs matter for batch pipelines.
- Recognize the difference between overwriting a table vs overwriting partitions and the risk of clobbering data.
- Describe how to safely handle incremental loads using watermark columns or change tracking (concept-level).
- Identify when a MERGE/upsert is appropriate vs a full refresh.
- Given a scenario, choose a write pattern that is both correct and recoverable.
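A hedged sketch of append vs overwrite writes to a Delta table; the table name is hypothetical:

```python
new_batch = spark.createDataFrame([(3, 30.0)], ["order_id", "amount"])
full_snapshot = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["order_id", "amount"])

# Append: add today's batch to what is already there (typical incremental load).
new_batch.write.format("delta").mode("append").saveAsTable("main.silver.orders")

# Overwrite: replace the table contents entirely (full refresh).
full_snapshot.write.format("delta").mode("overwrite").saveAsTable("main.silver.orders")
```

Delta also supports scoped overwrites (for example a `replaceWhere` predicate), so a rerun can replace only the affected slice instead of clobbering the whole table.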
2.3 Basic data quality and pipeline hygiene
- Identify common data quality checks (null checks, range checks, uniqueness, referential integrity) for pipelines.
- Explain why “fail fast” is often safer than silently passing bad data downstream in curated layers.
- Recognize where to place validation in a multi-hop pipeline (bronze vs silver) based on data cleanliness needs.
- Describe deduplication strategies and how to choose a deterministic record when duplicates exist.
- Explain basic error handling patterns: quarantine bad records, log metrics, and produce audit outputs.
- Given a scenario, choose a quality approach that balances strictness with operational practicality.
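A minimal fail-fast check, assuming a DataFrame that is about to be promoted to a curated layer:

```python
from pyspark.sql import functions as F

silver = spark.createDataFrame([(1, "a"), (None, "b"), (1, "c")], ["id", "name"])

null_ids = silver.filter(F.col("id").isNull()).count()
dup_ids = silver.groupBy("id").count().filter(F.col("count") > 1).count()

# Fail fast and surface metrics rather than silently passing bad data downstream.
print(f"rows={silver.count()}, null_ids={null_ids}, dup_ids={dup_ids}")
if null_ids > 0 or dup_ids > 0:
    raise ValueError(f"Quality check failed: {null_ids} null ids, {dup_ids} duplicated ids")
```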
Topic 3: Delta Lake Fundamentals
Practice this topic →
3.1 Delta tables, ACID, and table operations
- Explain what Delta Lake adds to a data lake at a conceptual level (ACID, schema enforcement, time travel).
- Differentiate between Delta tables and raw Parquet files in terms of reliability and operations.
- Identify basic Delta table operations: create table, insert, append, overwrite, and table history awareness.
- Explain why transactions improve correctness in concurrent read/write workloads (concept-level).
- Recognize how deletes/updates are supported in Delta compared to plain files (concept-level).
- Given a scenario, choose Delta as the target format when ACID and evolution are needed.
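A brief sketch of basic Delta operations driven from Python; the three-level table name is hypothetical:

```python
# Create, append to, and inspect the history of a Delta table.
spark.sql("CREATE TABLE IF NOT EXISTS main.demo.events (id BIGINT, payload STRING) USING DELTA")

(spark.createDataFrame([(1, "hello")], ["id", "payload"])
      .write.format("delta").mode("append")
      .saveAsTable("main.demo.events"))

# Every committed transaction shows up as a version in the table history.
spark.sql("DESCRIBE HISTORY main.demo.events").show(truncate=False)
```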
3.2 Schema enforcement and evolution
- Differentiate schema enforcement from schema evolution and identify expected outcomes (fail vs evolve).
- Recognize common schema mismatch scenarios (new columns, type changes) and how they affect writes.
- Explain why explicit schemas and controlled evolution reduce pipeline breakage.
- Identify when it is safer to add columns than to rename/remove columns (contract stability).
- Given a scenario, choose whether schema evolution is appropriate or whether a pipeline should fail and alert.
- Explain how downstream consumers are impacted by schema changes and why compatibility matters.
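A sketch of enforcement vs controlled evolution, reusing the hypothetical table from the previous sketch:

```python
extra = spark.createDataFrame([(2, "bye", "web")], ["id", "payload", "channel"])

# Schema enforcement: this append would fail because `channel` is not in the table schema.
# extra.write.format("delta").mode("append").saveAsTable("main.demo.events")

# Controlled evolution: explicitly opt in to adding the new column.
(extra.write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .saveAsTable("main.demo.events"))
```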
3.3 Time travel and change operations (MERGE awareness)
- Explain Delta time travel at a conceptual level and identify when it is used (audit, debugging, rollback).
- Interpret table history and identify which operations changed a table.
- Describe MERGE at a high level (upsert/CDC) and when it is preferred over full refresh.
- Recognize the importance of unique merge keys to prevent unintended row multiplication.
- Explain why time travel supports investigation but does not replace good pipeline versioning practices.
- Given a scenario, choose a Delta operation (append, overwrite, merge) that matches the change pattern.
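A sketch of time travel and MERGE, assuming the same hypothetical table and a small set of updates:

```python
# Time travel: read the table as of an earlier version for auditing or debugging.
v0 = spark.sql("SELECT * FROM main.demo.events VERSION AS OF 0")

# MERGE (upsert): keys in the source must be unique, or matched target rows multiply.
updates = spark.createDataFrame([(1, "hello-v2"), (3, "new")], ["id", "payload"])
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO main.demo.events AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET payload = s.payload
    WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)
""")
```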
Topic 4: Lakehouse Architecture & Performance Basics
Practice this topic →
4.1 Medallion architecture and table/view choices
- Explain the Bronze/Silver/Gold (medallion) concept and why each layer exists.
- Differentiate tables vs views and choose when a view is appropriate (abstraction, security, reuse).
- Describe common modeling choices: star schema basics vs wide denormalized tables (concept-level).
- Identify how curated layers should stabilize business definitions for downstream consumers.
- Given a scenario, choose the right layer to apply cleaning, deduplication, and business rules.
- Explain why governance and ownership matter more as shared consumption grows.
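A small sketch of a Gold-layer view that stabilizes a business definition without copying data; all object names are hypothetical:

```python
# A view in the Gold layer: reusable, access-controllable, and easy to redefine.
spark.sql("""
    CREATE OR REPLACE VIEW main.gold.active_customers AS
    SELECT cust_id, region
    FROM main.silver.customers
    WHERE status = 'active'
""")
```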
4.2 Partitioning and file layout intuition
- Explain partition pruning at a conceptual level and why partition columns should match common filters.
- Identify when over-partitioning leads to too many small files and degraded performance.
- Choose partition columns with appropriate cardinality and stability (avoid high-cardinality partitions).
- Explain the difference between partitioning strategy and clustering/file compaction (concept-level).
- Recognize how skewed data can create hot partitions and uneven performance.
- Given a scenario, choose a partitioning approach that balances pruning benefits and file count.
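A sketch of writing partitioned data and a query that can prune; the table name and choice of a date partition column are illustrative:

```python
from pyspark.sql import functions as F

daily = spark.createDataFrame(
    [("2024-01-01", 1, 10.0), ("2024-01-02", 2, 20.0)],
    ["order_date", "order_id", "amount"],
)

# Partition by a stable, low-cardinality column that matches common filters.
(daily.write
      .format("delta")
      .mode("overwrite")
      .partitionBy("order_date")
      .saveAsTable("main.silver.orders_by_day"))   # hypothetical table

# A filter on the partition column lets the engine skip untouched partitions.
recent = spark.table("main.silver.orders_by_day").filter(F.col("order_date") == "2024-01-02")
```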
4.3 Performance basics (shuffles, scans, simple fixes)
- Recognize operations that commonly cause shuffles (joins, groupBy, distinct) and why they are expensive.
- Identify common causes of slow queries: scanning too much data, lack of filters, and small files.
- Explain why broadcasting small dimensions can improve join performance (concept-level awareness).
- Describe when caching is appropriate for repeated reads in interactive analysis.
- Given a scenario, choose the simplest fix: filter early, reduce columns, or correct join logic before scaling compute.
- Explain why observability (timings, row counts) helps catch regressions early.
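A sketch of the broadcast-join and filter-early ideas, on tiny illustrative tables:

```python
from pyspark.sql import functions as F

facts = spark.createDataFrame([(1, 100.0), (2, 50.0)], ["dim_id", "amount"])
dims = spark.createDataFrame([(1, "EU"), (2, "US")], ["dim_id", "region"])

# Broadcasting the small dimension avoids shuffling the (normally large) fact table.
joined = facts.join(F.broadcast(dims), "dim_id")

# Filter and project early so less data reaches any later shuffle.
slim = facts.filter(F.col("amount") > 60).select("dim_id", "amount")

print(f"joined rows: {joined.count()}")   # logging simple counts helps catch regressions early
```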
Topic 5: Platform Operations, Governance & Troubleshooting
Practice this topic →
5.1 Notebooks, jobs, and parameterization (awareness)
- Differentiate notebooks (interactive development) from jobs/workflows (scheduled execution).
- Explain why parameterized jobs improve reusability across environments (dev/test/prod).
- Recognize common job failure categories (data issues, permissions, cluster unavailable) and basic triage steps.
- Describe why idempotent jobs are easier to retry safely.
- Given a scenario, choose when to run code interactively vs in an automated job.
- Explain why logging row counts and key metrics supports operational visibility.
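A minimal parameterization sketch using notebook widgets; `dbutils` is the Databricks notebook utility object and the table name is hypothetical:

```python
from pyspark.sql import functions as F

# Widgets make the same notebook reusable interactively and as a scheduled job task.
dbutils.widgets.text("run_date", "2024-01-01")   # default for interactive runs
run_date = dbutils.widgets.get("run_date")       # jobs can override this per environment

daily = spark.table("main.silver.orders_by_day").where(F.col("order_date") == run_date)
print(f"run_date={run_date} rows={daily.count()}")   # log key metrics for operational visibility
```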
5.2 Catalogs, schemas, and basic permissions (concept-level)
- Explain the purpose of organizing data into catalogs/schemas and why naming conventions matter.
- Differentiate user permissions for reading vs writing tables and why least privilege reduces risk.
- Recognize that shared tables require governance to prevent accidental destructive changes.
- Describe how table ownership and documentation reduce confusion for analytics consumers.
- Given a scenario, choose a safer approach to sharing: views and controlled access rather than ad-hoc copies.
- Explain why separating environments reduces accidental production impact.
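A sketch of least-privilege sharing via SQL grants; the table name and group are hypothetical:

```python
# Read-only access for analysts on a curated table, instead of handing out ad-hoc copies.
spark.sql("GRANT SELECT ON TABLE main.gold.daily_sales TO `analysts`")

# Review who can see what on a shared object.
spark.sql("SHOW GRANTS ON TABLE main.gold.daily_sales").show(truncate=False)
```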
5.3 Basic troubleshooting and safety practices
- Apply a simple troubleshooting sequence: validate inputs → validate schema → validate write target → validate permissions.
- Recognize common causes of schema mismatch and how to respond safely (fail vs evolve).
- Identify how to recover from a bad write using table history/time travel when available (concept-level).
- Explain why destructive operations should be gated with approvals and backups in production environments.
- Given a scenario, choose the least risky remediation option that preserves data integrity.
- Describe why documenting pipeline assumptions prevents repeated operational mistakes.
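A recovery sketch after a bad write, assuming the target is a Delta table with history available; the table name and version are hypothetical:

```python
# 1. Inspect what changed and identify the last known-good version.
spark.sql("DESCRIBE HISTORY main.silver.orders").show(truncate=False)

# 2. Roll the table back to that version (RESTORE adds a new commit; history is preserved).
spark.sql("RESTORE TABLE main.silver.orders TO VERSION AS OF 1")
```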
Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.