DE-ASSOC Cheatsheet — Spark + Delta Lake ETL (Tables, SQL, and Quick Rules)

Last-mile DE-ASSOC review: Spark SQL/DataFrames essentials, Delta Lake features (schema, time travel, merge), ETL patterns, and exam-style pickers. Includes code snippets, tables, and diagrams.

Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to validate speed/accuracy.


1) Spark fundamentals (what the exam is testing)

Transformations vs actions (execution trigger)

| Category | Examples | Notes |
| --- | --- | --- |
| Transformations | select, filter, withColumn, groupBy, join | Lazy: build a plan |
| Actions | count, collect, show, write, foreach | Trigger execution |

Rule: If it returns a DataFrame, it’s usually a transformation. If it returns a value or writes, it’s usually an action.

Narrow vs wide transformations (shuffle intuition)

| Type | What it implies | Examples |
| --- | --- | --- |
| Narrow | No shuffle | select, filter, map |
| Wide | Shuffle | groupBy, join, distinct, orderBy |
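One query usually mixes both kinds; a sketch of where the shuffle lands (table and column names are illustrative):

```sql
-- Narrow: the filter runs per-partition, no data movement.
-- Wide: GROUP BY forces a shuffle to co-locate rows by user_id.
SELECT user_id, SUM(amount) AS total
FROM transactions          -- hypothetical table
WHERE amount > 0           -- narrow: filter
GROUP BY user_id;          -- wide: shuffle
```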

2) Spark SQL high-yield patterns

Joins: choose the right type

| Join | Keeps rows from | Common use |
| --- | --- | --- |
| INNER | both sides match | dimensional enrichment |
| LEFT | left side always | "keep all facts" |
| FULL | both sides always | reconciliation |
| SEMI/ANTI | left rows with/without match | de-dup and "missing" detection |
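ANTI is the usual "missing rows" picker; a minimal sketch with hypothetical orders/customers tables:

```sql
-- Orders whose customer_id has no match in customers ("missing" detection)
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c
  ON o.customer_id = c.customer_id;
```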

Window functions (classic exam topic)

```sql
SELECT
  user_id,
  ts,
  amount,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
FROM transactions;
```
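The rn column is typically consumed by an outer filter to keep one row per key — the classic dedup / latest-record pattern:

```sql
SELECT user_id, ts, amount
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
  FROM transactions
)
WHERE rn = 1;  -- latest transaction per user
```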

3) Delta Lake essentials (must-know features)

Why Delta exists (one sentence)

Delta Lake adds ACID + schema enforcement + time travel to data lake storage.

Read/write basics

```sql
CREATE TABLE IF NOT EXISTS silver_orders
USING DELTA
AS SELECT * FROM bronze_orders;
```
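Once the table exists, incremental loads are often plain appends; a sketch using an assumed ingest_date column to scope the batch:

```sql
INSERT INTO silver_orders
SELECT * FROM bronze_orders
WHERE ingest_date = current_date();  -- ingest_date is illustrative
```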

MERGE (upsert / CDC pattern)

```sql
MERGE INTO silver_orders t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
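If the staging feed carries an operation flag (here a hypothetical s.op column), the same MERGE can propagate deletes too — Delta allows multiple conditional WHEN clauses:

```sql
MERGE INTO silver_orders t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND s.op != 'DELETE' THEN INSERT *;
```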

Schema enforcement vs schema evolution

| Concept | What it means | What you see |
| --- | --- | --- |
| Schema enforcement | rejects incompatible writes | write fails |
| Schema evolution | updates table schema intentionally | new columns appear |
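Intentional evolution is usually an explicit DDL change; a minimal example adding a column (coupon_code is illustrative):

```sql
ALTER TABLE silver_orders ADD COLUMNS (coupon_code STRING);
```

On the write side, evolution can also be opted into per write (e.g. the mergeSchema write option in the DataFrame API) rather than via DDL.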

4) Time travel + history (audit and rollback)

```sql
DESCRIBE HISTORY silver_orders;

SELECT * FROM silver_orders VERSION AS OF 42;
```
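Rollback goes one step beyond reading an old snapshot: RESTORE rewrites the table back to it. A sketch (the timestamp is illustrative):

```sql
-- Read-only audit: query an old snapshot by timestamp
SELECT * FROM silver_orders TIMESTAMP AS OF '2024-01-01';

-- Rollback: make version 42 the current state again
RESTORE TABLE silver_orders TO VERSION AS OF 42;
```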

Exam cue: time travel is for debugging/audit/rollback; it doesn’t replace good pipeline discipline.


5) Partitioning and file layout (concept-level pickers)

| You want… | Do this | Why |
| --- | --- | --- |
| Faster pruning on common filters | Partition on low-to-medium cardinality columns | Partition pruning |
| Avoid too many tiny files | Don't over-partition; use compaction/OPTIMIZE where appropriate | Small-file problem |
| Faster reads on frequent predicates | Use file compaction + data skipping (conceptually) | Fewer files + better locality |
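A concept-level sketch of the two knobs above, assuming event_date is the common low-cardinality filter column:

```sql
CREATE TABLE IF NOT EXISTS silver_events (
  event_id   STRING,
  event_date DATE,
  payload    STRING
)
USING DELTA
PARTITIONED BY (event_date);  -- enables pruning on WHERE event_date = ...

OPTIMIZE silver_events;       -- compacts small files into larger ones
```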

6) ETL pipeline mental model (Bronze → Silver → Gold)

```mermaid
flowchart LR
  B["Bronze (raw)"] --> S["Silver (cleaned)"]
  S --> G["Gold (business-ready)"]
```

Rules of thumb

  • Bronze is append-heavy and close to source.
  • Silver enforces schema/quality and supports joins.
  • Gold is curated for BI and stable consumption.
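A minimal Bronze → Silver step under these rules of thumb (columns and the quality predicate are illustrative):

```sql
CREATE OR REPLACE TABLE silver_orders
USING DELTA
AS SELECT
  CAST(order_id AS BIGINT)  AS order_id,
  CAST(ts AS TIMESTAMP)     AS ts,
  amount
FROM bronze_orders
WHERE order_id IS NOT NULL;  -- basic quality gate
```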

7) Quick troubleshooting pickers

  • Unexpected duplicates after upsert: MERGE join condition wrong or not unique.
  • Slow groupBy/join: shuffle-heavy; consider partition strategy and input skew (concept-level).
  • Write fails on schema mismatch: schema enforcement; decide whether evolution is intended.
  • Lots of small files: adjust write patterns and compact (concept-level).
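For the first picker, a quick sanity check that the upsert key really is unique in the target:

```sql
SELECT order_id, COUNT(*) AS n
FROM silver_orders
GROUP BY order_id
HAVING COUNT(*) > 1;  -- any rows returned mean the MERGE key isn't unique
```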