DE-ASSOC Cheatsheet — Spark + Delta Lake ETL (Tables, SQL, and Quick Rules)

Last-mile DE-ASSOC review: Spark SQL/DataFrames essentials, Delta Lake features (schema, time travel, merge), ETL patterns, and exam-style pickers. Includes code snippets, tables, and diagrams.

Use this for last-mile review. Pair it with the Syllabus for coverage and with Practice to validate speed and accuracy.


1) Spark fundamentals (what the exam is testing)

Transformations vs actions (execution trigger)

| Category | Examples | Notes |
|---|---|---|
| Transformations | select, filter, withColumn, groupBy, join | Lazy: build a plan |
| Actions | count, collect, show, write, foreach | Trigger execution |

Rule: If it returns a DataFrame, it’s usually a transformation. If it returns a value or writes, it’s usually an action.
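
A minimal PySpark sketch of this rule (the session setup and values are illustrative, not from the exam guide): nothing runs until the final action.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000)                           # source DataFrame with an `id` column
doubled = df.withColumn("x2", F.col("id") * 2)    # transformation: lazy, just extends the plan
filtered = doubled.filter(F.col("x2") > 100)      # transformation: still no job has run

filtered.count()                                  # action: triggers execution, returns a value
```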

Narrow vs wide transformations (shuffle intuition)

| Type | What it implies | Examples |
|---|---|---|
| Narrow | No shuffle | select, filter, map |
| Wide | Shuffle | groupBy, join, distinct, orderBy |
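
A hedged sketch of the intuition (bucket logic is made up): narrow ops stay within each partition, wide ops force a shuffle, visible as an Exchange node in the physical plan.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000)

narrow = df.filter(F.col("id") % 2 == 0)                        # narrow: no data movement
wide = df.groupBy((F.col("id") % 10).alias("bucket")).count()   # wide: requires a shuffle

wide.explain()  # look for an Exchange node where the shuffle happens
```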

2) Spark SQL high-yield patterns

Joins: choose the right type

| Join | Rows kept | Common use |
|---|---|---|
| INNER | only rows that match on both sides | dimensional enrichment |
| LEFT | all left rows, plus matching right rows | “keep all facts” |
| FULL | all rows from both sides | reconciliation |
| SEMI/ANTI | left rows with / without a match | de-dup and “missing row” detection |
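
A quick PySpark sketch of the less familiar join types (the tiny DataFrames are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

facts = spark.createDataFrame([(1, 10.0), (2, 5.0), (3, 7.5)], ["order_id", "amount"])
dims = spark.createDataFrame([(1, "web"), (2, "store")], ["order_id", "channel"])

enriched = facts.join(dims, "order_id", "inner")      # dimensional enrichment
all_facts = facts.join(dims, "order_id", "left")      # keep all facts
matched = facts.join(dims, "order_id", "left_semi")   # facts that have a dimension row
missing = facts.join(dims, "order_id", "left_anti")   # facts with no dimension row
```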

Window functions (classic exam topic)

```sql
SELECT
  user_id,
  ts,
  amount,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
FROM transactions;
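
The usual follow-up picker keeps only the latest row per key by filtering rn = 1. A sketch wrapping the same SQL in PySpark (assumes the transactions table above is registered):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

latest = spark.sql("""
    SELECT user_id, ts, amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM transactions
    ) ranked
    WHERE rn = 1
""")
latest.show()
```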

3) Delta Lake essentials (must-know features)

Why Delta exists (one sentence)

Delta Lake adds ACID + schema enforcement + time travel to data lake storage.

Read/write basics

```sql
CREATE TABLE IF NOT EXISTS silver_orders
USING DELTA
AS SELECT * FROM bronze_orders;
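
The same idea through the DataFrame writer, as a sketch (assumes bronze_orders exists and Delta is configured):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.table("bronze_orders")
(bronze.write
    .format("delta")
    .mode("overwrite")          # replaces the target's contents; default mode errors if it exists
    .saveAsTable("silver_orders"))
```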

MERGE (upsert / CDC pattern)

```sql
MERGE INTO silver_orders t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
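
The same upsert via the Python DeltaTable API, as a sketch (requires the delta-spark package; table names taken from the snippet above):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable   # from the delta-spark package

spark = SparkSession.builder.getOrCreate()
staging = spark.read.table("staging_orders")

(DeltaTable.forName(spark, "silver_orders").alias("t")
    .merge(staging.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()       # UPDATE SET *
    .whenNotMatchedInsertAll()    # INSERT *
    .execute())
```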

Schema enforcement vs schema evolution

| Concept | What it means | What you see |
|---|---|---|
| Schema enforcement | rejects incompatible writes | the write fails |
| Schema evolution | updates the table schema intentionally | new columns appear |
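
A sketch of the difference in practice (column names are made up): the same append either fails under enforcement or evolves the schema, depending on mergeSchema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# new_df carries an extra `campaign` column the table doesn't have yet
new_df = spark.createDataFrame([(4, 9.9, "promo")], ["order_id", "amount", "campaign"])

(new_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # opt in to evolution; omit this and enforcement fails the write
    .saveAsTable("silver_orders"))
```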

4) Time travel + history (audit and rollback)

```sql
DESCRIBE HISTORY silver_orders;

SELECT * FROM silver_orders VERSION AS OF 42;

Exam cue: time travel is for debugging/audit/rollback; it doesn’t replace good pipeline discipline.
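
The same ideas from Python, as a sketch (the table path is hypothetical; RESTORE is Delta SQL):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read an old snapshot for debugging/audit (path is hypothetical)
v42 = spark.read.format("delta").option("versionAsOf", 42).load("/delta/silver_orders")

# Roll the table back to that version
spark.sql("RESTORE TABLE silver_orders TO VERSION AS OF 42")
```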


5) Partitioning and file layout (concept-level pickers)

| You want… | Do this | Why |
|---|---|---|
| Faster pruning on common filters | Partition on low-to-medium cardinality columns | Partition pruning |
| Fewer tiny files | Don’t over-partition; use compaction/OPTIMIZE where appropriate | Small-file problem |
| Faster reads on frequent predicates | Use file compaction + data skipping (conceptually) | Fewer files + better locality |
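
A hedged sketch tying these together (order_date is a made-up partition column; OPTIMIZE availability depends on your Delta/Databricks version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
orders = spark.read.table("bronze_orders")

# Low-cardinality partition column -> common filters can prune whole directories
(orders.write
    .format("delta")
    .partitionBy("order_date")
    .mode("overwrite")
    .saveAsTable("silver_orders_by_date"))

# Compact small files produced by frequent writes
spark.sql("OPTIMIZE silver_orders_by_date")
```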

6) ETL pipeline mental model (Bronze → Silver → Gold)

```mermaid
flowchart LR
  B["Bronze (raw)"] --> S["Silver (cleaned)"]
  S --> G["Gold (business-ready)"]
```

Rules of thumb

  • Bronze is append-heavy and close to source.
  • Silver enforces schema/quality and supports joins.
  • Gold is curated for BI and stable consumption.
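
A minimal end-to-end sketch of the three layers (table names, columns, and quality rules are illustrative only):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw, append-heavy, close to source
bronze = spark.read.table("bronze_orders")

# Silver: enforce quality so downstream joins are safe
silver = bronze.dropDuplicates(["order_id"]).filter(F.col("amount") > 0)
silver.write.format("delta").mode("overwrite").saveAsTable("silver_orders")

# Gold: curated, stable aggregate for BI
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_revenue")
```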

7) Quick troubleshooting pickers

  • Unexpected duplicates after an upsert: the MERGE ON condition is wrong or doesn’t uniquely match rows (see the check sketched after this list).
  • Slow groupBy/join: shuffle-heavy; consider partition strategy and input skew (concept-level).
  • Write fails on schema mismatch: schema enforcement; decide whether evolution is intended.
  • Lots of small files: adjust write patterns and compact (concept-level).
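
For the first picker, a quick duplicate check (table and key names assumed from the earlier sections):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dupes = (spark.read.table("silver_orders")
    .groupBy("order_id")
    .count()
    .filter(F.col("count") > 1))

dupes.show()   # any rows here mean the MERGE key did not uniquely identify records
```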