Use this syllabus as your source of truth for DE‑ASSOC. Work through it topic by topic, and drill practice questions after each section.
What’s covered
Topic 1: Spark Fundamentals (SQL + DataFrames)
Practice this topic →
1.1 Spark execution model (transformations, actions, lazy evaluation)
- Differentiate between Spark transformations and actions and identify which operations trigger execution.
- Explain Spark’s lazy evaluation at a conceptual level and why plans are built before execution.
- Recognize the small-file problem and describe why it can hurt performance in distributed reads.
- Use caching/persist conceptually and explain when caching can improve iterative workloads.
- Differentiate narrow vs wide transformations and identify which operations commonly introduce shuffles.
- Given a scenario, choose the simplest safe approach to reduce data early before expensive shuffles.
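A minimal PySpark sketch of the ideas above, assuming `spark` is the notebook's active SparkSession; the tiny inline DataFrame is only for illustration:

```python
from pyspark.sql import functions as F

orders = spark.createDataFrame(
    [(1, "A", 120.0), (2, "B", 80.0), (3, "A", 200.0)],
    ["order_id", "status", "amount"],
)

big = orders.filter(F.col("amount") > 100)   # narrow transformation: no shuffle, nothing runs yet
by_status = big.groupBy("status").count()    # wide transformation: introduces a shuffle

by_status.show()                             # action: the lazy plan above executes only now

big.cache()                                  # worth considering when `big` feeds several actions
big.count()                                  # the first action after cache() materializes it
```

Filtering before the groupBy is the "reduce data early before expensive shuffles" move the last objective describes.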
1.2 Spark SQL essentials (joins, aggregations, windows)
- Select the appropriate join type (inner/left/full/semi/anti) for a described requirement.
- Write or interpret aggregations and GROUP BY logic including distinct counts and conditional aggregation.
- Explain the purpose of window functions and identify correct partition/order usage for common analytics tasks.
- Diagnose common join pitfalls (duplicate amplification, wrong granularity, filtering that changes join semantics).
- Apply null handling correctly in filters and joins (especially with LEFT joins).
- Given a scenario, choose SQL logic that produces correct results with minimal complexity.
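A small sketch of the join, aggregation, and window ideas above, using illustrative data and the notebook's `spark` session:

```python
from pyspark.sql import functions as F, Window

customers = spark.createDataFrame([(1, "EU"), (2, "US")], ["cust_id", "region"])
orders = spark.createDataFrame(
    [(10, 1, 50.0), (11, 1, 70.0), (12, 3, 20.0)],
    ["order_id", "cust_id", "amount"],
)

# LEFT join keeps customers with no orders; their order columns come back NULL,
# so a filter like amount > 60 applied after the join silently drops those customers.
joined = customers.join(orders, "cust_id", "left")

summary = joined.groupBy("region").agg(
    F.countDistinct("order_id").alias("order_count"),                         # distinct count
    F.sum(F.when(F.col("amount") > 60, F.col("amount"))).alias("large_amt"),  # conditional aggregation
)

# Window function: rank each customer's orders by amount.
w = Window.partitionBy("cust_id").orderBy(F.col("amount").desc())
ranked = orders.withColumn("rank_in_cust", F.row_number().over(w))
```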
1.3 DataFrames and column expressions (PySpark awareness)
- Translate between SQL logic and DataFrame transformations for common operations (select, filter, join, groupBy).
- Use column expressions safely (avoid stringly-typed mistakes) and understand why column naming matters.
- Recognize when UDFs are a last resort and prefer built-in functions where possible (concept-level).
- Explain how schema and data types impact correctness (string vs numeric, timestamps, nullability).
- Choose an appropriate approach for deduplication (dropDuplicates vs window-based selection) based on requirements.
- Given a scenario, identify the simplest transformation chain that meets correctness requirements.
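A brief sketch contrasting the two deduplication approaches mentioned above, on hypothetical event data:

```python
from pyspark.sql import functions as F, Window

events = spark.createDataFrame(
    [(1, "click", "2024-01-01 10:00"), (1, "click", "2024-01-01 10:05"), (2, "view", "2024-01-01 09:00")],
    ["user_id", "event", "event_ts"],
)

# dropDuplicates keeps an arbitrary survivor per key -- fine when any row will do.
dedup_any = events.dropDuplicates(["user_id", "event"])

# A window keeps a deterministic survivor (here: the latest event per key).
w = Window.partitionBy("user_id", "event").orderBy(F.col("event_ts").desc())
dedup_latest = (
    events.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
)
```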
Topic 2: Data Ingestion & Batch ETL Patterns
Practice this topic →
2.1 Reading data (formats, schemas, ingestion options)
- Differentiate common lakehouse formats conceptually (CSV/JSON/Parquet/Delta) and pick the right one for a scenario.
- Explain why explicit schemas improve reliability in schema-on-read systems and why schema inference can be risky in production.
- Identify common ingestion options (header, delimiter, multiline JSON) and their impact on correctness.
- Describe the difference between managed tables and external locations conceptually and when each is used.
- Recognize how partitioned data is typically laid out in storage and how it affects reads.
- Given a scenario, choose a safe ingestion strategy that balances correctness and operational simplicity.
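A minimal ingestion sketch with an explicit schema; the landing path and column names are hypothetical:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Explicit schema instead of inference: predictable types, loud failures on drift.
schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("order_date", DateType(), True),
    StructField("amount", DoubleType(), True),
])

raw = (
    spark.read.format("csv")
         .option("header", "true")      # first line holds column names
         .option("delimiter", ",")      # make the separator explicit
         .schema(schema)
         .load("/Volumes/main/bronze/orders_csv/")   # hypothetical landing path
)
```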
2.2 Writing data (append, overwrite, incremental thinking)
- Choose append vs overwrite strategies based on batch ETL intent and data freshness requirements.
- Explain idempotency at a conceptual level and why repeatable runs matter for batch pipelines.
- Recognize the difference between overwriting a table vs overwriting partitions and the risk of clobbering data.
- Describe how to safely handle incremental loads using watermark columns or change tracking (concept-level).
- Identify when a MERGE/upsert is appropriate vs a full refresh.
- Given a scenario, choose a write pattern that is both correct and recoverable.
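A hedged sketch of append vs overwrite writes to a Delta table; the table name is hypothetical:

```python
new_batch = spark.createDataFrame([(3, 30.0)], ["order_id", "amount"])
full_snapshot = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["order_id", "amount"])

# Append: add today's batch to what is already there (typical incremental load).
new_batch.write.format("delta").mode("append").saveAsTable("main.silver.orders")

# Overwrite: replace the table contents entirely (full refresh).
full_snapshot.write.format("delta").mode("overwrite").saveAsTable("main.silver.orders")
```

Delta also supports scoped overwrites (for example a `replaceWhere` predicate), so a rerun can replace only the affected slice instead of clobbering the whole table.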
2.3 Basic data quality and pipeline hygiene
- Identify common data quality checks (null checks, range checks, uniqueness, referential integrity) for pipelines.
- Explain why “fail fast” is often safer than silently passing bad data downstream in curated layers.
- Recognize where to place validation in a multi-hop pipeline (bronze vs silver) based on data cleanliness needs.
- Describe deduplication strategies and how to choose a deterministic record when duplicates exist.
- Explain basic error handling patterns: quarantine bad records, log metrics, and produce audit outputs.
- Given a scenario, choose a quality approach that balances strictness with operational practicality.
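A minimal fail-fast check, assuming a DataFrame that is about to be promoted to a curated layer:

```python
from pyspark.sql import functions as F

silver = spark.createDataFrame([(1, "a"), (None, "b"), (1, "c")], ["id", "name"])

null_ids = silver.filter(F.col("id").isNull()).count()
dup_ids = silver.groupBy("id").count().filter(F.col("count") > 1).count()

# Fail fast and surface metrics rather than silently passing bad data downstream.
print(f"rows={silver.count()}, null_ids={null_ids}, dup_ids={dup_ids}")
if null_ids > 0 or dup_ids > 0:
    raise ValueError(f"Quality check failed: {null_ids} null ids, {dup_ids} duplicated ids")
```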
Topic 3: Delta Lake Fundamentals
Practice this topic →
3.1 Delta tables, ACID, and table operations
- Explain what Delta Lake adds to a data lake at a conceptual level (ACID, schema enforcement, time travel).
- Differentiate between Delta tables and raw Parquet files in terms of reliability and operations.
- Identify basic Delta table operations: create table, insert, append, overwrite, and table history awareness.
- Explain why transactions improve correctness in concurrent read/write workloads (concept-level).
- Recognize how deletes/updates are supported in Delta compared to plain files (concept-level).
- Given a scenario, choose Delta as the target format when ACID and evolution are needed.
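A brief sketch of basic Delta operations driven from Python; the three-level table name is hypothetical:

```python
# Create, append to, and inspect the history of a Delta table.
spark.sql("CREATE TABLE IF NOT EXISTS main.demo.events (id BIGINT, payload STRING) USING DELTA")

(spark.createDataFrame([(1, "hello")], ["id", "payload"])
      .write.format("delta").mode("append")
      .saveAsTable("main.demo.events"))

# Every committed transaction shows up as a version in the table history.
spark.sql("DESCRIBE HISTORY main.demo.events").show(truncate=False)
```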
3.2 Schema enforcement and evolution
- Differentiate schema enforcement from schema evolution and identify expected outcomes (fail vs evolve).
- Recognize common schema mismatch scenarios (new columns, type changes) and how they affect writes.
- Explain why explicit schemas and controlled evolution reduce pipeline breakage.
- Identify when it is safer to add columns than to rename/remove columns (contract stability).
- Given a scenario, choose whether schema evolution is appropriate or whether a pipeline should fail and alert.
- Explain how downstream consumers are impacted by schema changes and why compatibility matters.
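A sketch of enforcement vs controlled evolution, reusing the hypothetical table from the previous sketch:

```python
extra = spark.createDataFrame([(2, "bye", "web")], ["id", "payload", "channel"])

# Schema enforcement: this append would fail because `channel` is not in the table schema.
# extra.write.format("delta").mode("append").saveAsTable("main.demo.events")

# Controlled evolution: explicitly opt in to adding the new column.
(extra.write
      .format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .saveAsTable("main.demo.events"))
```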
3.3 Time travel and change operations (MERGE awareness)
- Explain Delta time travel at a conceptual level and identify when it is used (audit, debugging, rollback).
- Interpret table history and identify which operations changed a table.
- Describe MERGE at a high level (upsert/CDC) and when it is preferred over full refresh.
- Recognize the importance of unique merge keys to prevent unintended row multiplication.
- Explain why time travel supports investigation but does not replace good pipeline versioning practices.
- Given a scenario, choose a Delta operation (append, overwrite, merge) that matches the change pattern.
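A sketch of time travel and MERGE, assuming the same hypothetical table and a small set of updates:

```python
# Time travel: read the table as of an earlier version for auditing or debugging.
v0 = spark.sql("SELECT * FROM main.demo.events VERSION AS OF 0")

# MERGE (upsert): keys in the source must be unique, or matched target rows multiply.
updates = spark.createDataFrame([(1, "hello-v2"), (3, "new")], ["id", "payload"])
updates.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO main.demo.events AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET payload = s.payload
    WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload)
""")
```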
Topic 4: Lakehouse Architecture & Performance Basics
Practice this topic →
4.1 Medallion architecture and table/view choices
- Explain the Bronze/Silver/Gold (medallion) concept and why each layer exists.
- Differentiate tables vs views and choose when a view is appropriate (abstraction, security, reuse).
- Describe common modeling choices: star schema basics vs wide denormalized tables (concept-level).
- Identify how curated layers should stabilize business definitions for downstream consumers.
- Given a scenario, choose the right layer to apply cleaning, deduplication, and business rules.
- Explain why governance and ownership matter more as shared consumption grows.
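A small sketch of a Gold-layer view that stabilizes a business definition without copying data; all object names are hypothetical:

```python
# A view in the Gold layer: reusable, access-controllable, and easy to redefine.
spark.sql("""
    CREATE OR REPLACE VIEW main.gold.active_customers AS
    SELECT cust_id, region
    FROM main.silver.customers
    WHERE status = 'active'
""")
```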
4.2 Partitioning and file layout intuition
- Explain partition pruning at a conceptual level and why partition columns should match common filters.
- Identify when over-partitioning leads to too many small files and degraded performance.
- Choose partition columns with appropriate cardinality and stability (avoid high-cardinality partitions).
- Explain the difference between partitioning strategy and clustering/file compaction (concept-level).
- Recognize how skewed data can create hot partitions and uneven performance.
- Given a scenario, choose a partitioning approach that balances pruning benefits and file count.
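A sketch of writing partitioned data and a query that can prune; the table name and choice of a date partition column are illustrative:

```python
from pyspark.sql import functions as F

daily = spark.createDataFrame(
    [("2024-01-01", 1, 10.0), ("2024-01-02", 2, 20.0)],
    ["order_date", "order_id", "amount"],
)

# Partition by a stable, low-cardinality column that matches common filters.
(daily.write
      .format("delta")
      .mode("overwrite")
      .partitionBy("order_date")
      .saveAsTable("main.silver.orders_by_day"))   # hypothetical table

# A filter on the partition column lets the engine skip untouched partitions.
recent = spark.table("main.silver.orders_by_day").filter(F.col("order_date") == "2024-01-02")
```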
4.3 Performance basics (shuffles, scans, simple fixes)
- Recognize operations that commonly cause shuffles (joins, groupBy, distinct) and why they are expensive.
- Identify common causes of slow queries: scanning too much data, lack of filters, and small files.
- Explain why broadcasting small dimensions can improve join performance (concept-level awareness).
- Describe when caching is appropriate for repeated reads in interactive analysis.
- Given a scenario, choose the simplest fix: filter early, reduce columns, or correct join logic before scaling compute.
- Explain why observability (timings, row counts) helps catch regressions early.
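A sketch of the broadcast-join and filter-early ideas, on tiny illustrative tables:

```python
from pyspark.sql import functions as F

facts = spark.createDataFrame([(1, 100.0), (2, 50.0)], ["dim_id", "amount"])
dims = spark.createDataFrame([(1, "EU"), (2, "US")], ["dim_id", "region"])

# Broadcasting the small dimension avoids shuffling the (normally large) fact table.
joined = facts.join(F.broadcast(dims), "dim_id")

# Filter and project early so less data reaches any later shuffle.
slim = facts.filter(F.col("amount") > 60).select("dim_id", "amount")

print(f"joined rows: {joined.count()}")   # logging simple counts helps catch regressions early
```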
Topic 5: Platform Operations, Governance & Troubleshooting
Practice this topic →
5.1 Notebooks, jobs, and parameterization (awareness)
- Differentiate notebooks (interactive development) from jobs/workflows (scheduled execution).
- Explain why parameterized jobs improve reusability across environments (dev/test/prod).
- Recognize common job failure categories (data issues, permissions, cluster unavailable) and basic triage steps.
- Describe why idempotent jobs are easier to retry safely.
- Given a scenario, choose when to run code interactively vs in an automated job.
- Explain why logging row counts and key metrics supports operational visibility.
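A minimal parameterization sketch using notebook widgets; `dbutils` is the Databricks notebook utility object and the table name is hypothetical:

```python
from pyspark.sql import functions as F

# Widgets make the same notebook reusable interactively and as a scheduled job task.
dbutils.widgets.text("run_date", "2024-01-01")   # default for interactive runs
run_date = dbutils.widgets.get("run_date")       # jobs can override this per environment

daily = spark.table("main.silver.orders_by_day").where(F.col("order_date") == run_date)
print(f"run_date={run_date} rows={daily.count()}")   # log key metrics for operational visibility
```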
5.2 Catalogs, schemas, and basic permissions (concept-level)
- Explain the purpose of organizing data into catalogs/schemas and why naming conventions matter.
- Differentiate user permissions for reading vs writing tables and why least privilege reduces risk.
- Recognize that shared tables require governance to prevent accidental destructive changes.
- Describe how table ownership and documentation reduce confusion for analytics consumers.
- Given a scenario, choose a safer approach to sharing: views and controlled access rather than ad-hoc copies.
- Explain why separating environments reduces accidental production impact.
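A sketch of least-privilege sharing via SQL grants; the table name and group are hypothetical:

```python
# Read-only access for analysts on a curated table, instead of handing out ad-hoc copies.
spark.sql("GRANT SELECT ON TABLE main.gold.daily_sales TO `analysts`")

# Review who can see what on a shared object.
spark.sql("SHOW GRANTS ON TABLE main.gold.daily_sales").show(truncate=False)
```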
5.3 Basic troubleshooting and safety practices
- Apply a simple troubleshooting sequence: validate inputs → validate schema → validate write target → validate permissions.
- Recognize common causes of schema mismatch and how to respond safely (fail vs evolve).
- Identify how to recover from a bad write using table history/time travel when available (concept-level).
- Explain why destructive operations should be gated with approvals and backups in production environments.
- Given a scenario, choose the least risky remediation option that preserves data integrity.
- Describe why documenting pipeline assumptions prevents repeated operational mistakes.
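A recovery sketch after a bad write, assuming the target is a Delta table with history available; the table name and version are hypothetical:

```python
# 1. Inspect what changed and identify the last known-good version.
spark.sql("DESCRIBE HISTORY main.silver.orders").show(truncate=False)

# 2. Roll the table back to that version (RESTORE adds a new commit; history is preserved).
spark.sql("RESTORE TABLE main.silver.orders TO VERSION AS OF 1")
```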
Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.