Use this syllabus as your source of truth for ML‑ASSOC. Work through it topic by topic, and drill practice questions after each section.
What’s covered
Topic 1: Data Preparation & Feature Engineering on Databricks
Practice this topic →
1.1 Data access, cleaning, and feature creation
- Describe common data sources for ML workloads on Databricks (tables, files) and how they are accessed conceptually.
- Apply basic cleaning steps (null handling, outliers, type casting) and explain their impact on model training.
- Create features using Spark SQL/DataFrames and choose appropriate transformations for numeric vs categorical data (concept-level; a PySpark sketch follows this list).
- Explain why feature definitions should be consistent across training and inference pipelines.
- Recognize the risk of data leakage and identify feature patterns that can leak future information.
- Given a scenario, choose feature engineering steps that improve signal while avoiding leakage.
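A minimal PySpark sketch of these steps, assuming a Databricks notebook where `spark` is predefined and a hypothetical `orders` table with `amount`, `country`, and `order_ts` columns; the thresholds and transformations are illustrative, not a prescribed recipe.

```python
from pyspark.sql import functions as F

raw = spark.table("orders")  # table access; file sources would use spark.read instead

cleaned = (
    raw
    .withColumn("amount", F.col("amount").cast("double"))  # type casting
    .dropna(subset=["amount"])                              # null handling
    .filter(F.col("amount").between(0, 10_000))             # crude outlier filter
)

features = (
    cleaned
    .withColumn("log_amount", F.log1p("amount"))       # numeric transform for a skewed value
    .withColumn("country", F.upper("country"))          # normalize a categorical value
    .withColumn("order_dow", F.dayofweek("order_ts"))    # calendar feature; uses no future information
)
```

Packaging this feature code so inference reuses it unchanged is what keeps training and serving consistent.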
1.2 Splits, leakage, and evaluation hygiene
- Differentiate train/validation/test splits and describe why each exists.
- Recognize when time-based splits are required to prevent leakage in temporal datasets.
- Explain why preprocessing steps should be fit on training data only (to avoid contamination); the split sketch after this list shows the pattern.
- Identify common causes of unrealistically high metrics (target leakage, duplicate rows across splits).
- Given a scenario, choose a split strategy that matches the prediction use case.
- Describe basic reproducibility practices: fixed seeds, logged data versions, and deterministic preprocessing.
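A minimal scikit-learn sketch of split hygiene on synthetic data: the test set is held out first, and the scaler is fit only on the training portion because it lives inside the pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic feature matrix X and binary target y, with a fixed seed for reproducibility.
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1_000) > 0).astype(int)

# Hold out the test set first; any validation split comes out of the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fit on training data only; at evaluation time the pipeline applies
# the already-fitted transform, which avoids contaminating the test set.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```

For temporal prediction problems, swap the random split for a cutoff date (or scikit-learn's TimeSeriesSplit) so the model never trains on rows from the future.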
1.3 Feature pipelines (awareness) and data versions
- Explain why feature pipelines should be repeatable and version-aware in collaborative environments.
- Describe how schema changes can break training jobs and why schema governance matters (concept-level).
- Recognize the role of metadata (feature definitions, timestamps, owner) in preventing confusion and drift.
- Identify when to materialize features vs compute them on the fly (trade-off awareness; see the materialization sketch after this list).
- Given a scenario, choose an approach that balances simplicity, correctness, and repeatability.
- Explain why documenting feature meaning prevents incorrect model usage downstream.
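A short sketch of the materialize-vs-recompute trade-off, assuming the `features` DataFrame from the earlier PySpark sketch; the Delta table name is a placeholder.

```python
# Materialize the engineered features once so every consumer reads identical values.
(
    features.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")   # explicit choice: schema changes are deliberate
    .saveAsTable("ml_demo.order_features")
)

# The alternative is to recompute features on the fly from the source table at
# training and inference time: less storage and simpler to operate, but every
# consumer must run the same version of the feature code to stay consistent.
```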
Topic 2: Model Training & Evaluation Basics
Practice this topic →
2.1 Problem framing and baseline models
- Differentiate classification vs regression problems based on the target variable and business question.
- Select baseline approaches and explain why baselines are important for measuring improvement (a baseline sketch follows this list).
- Recognize class imbalance and identify why accuracy can be misleading in imbalanced datasets.
- Explain the purpose of regularization and how it relates to overfitting at a high level.
- Given a scenario, choose an appropriate evaluation approach for the problem type.
- Describe how to interpret common failure patterns: underfitting vs overfitting.
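A baseline sketch reusing the split from the earlier example: a majority-class `DummyClassifier` sets the floor any real model has to beat, and on imbalanced targets its accuracy can look respectable even though its F1 collapses.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Predict the most frequent class for every row, ignoring the features entirely.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)

print("baseline accuracy:", round(accuracy_score(y_test, pred), 3))
print("baseline F1:      ", round(f1_score(y_test, pred, zero_division=0), 3))
```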
2.2 Metrics selection and interpretation
- Choose appropriate metrics for classification (precision/recall/F1/AUC) and interpret what each emphasizes (see the metrics sketch after this list).
- Choose appropriate metrics for regression (RMSE/MAE/R²) and interpret sensitivity to outliers.
- Explain confusion matrices conceptually and relate them to false positives/false negatives.
- Recognize when threshold choice affects business outcomes and why calibration may matter (concept-level).
- Given a scenario, select the metric that aligns with the business cost of errors.
- Describe how to compare models fairly (same split, same preprocessing, controlled randomness).
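A metrics sketch on small hand-made vectors, chosen so the numbers are easy to verify by hand; the values are illustrative only.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, roc_auc_score,
    mean_absolute_error, mean_squared_error,
)

# Classification: the confusion matrix plus the metrics built on top of it.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.2, 0.6, 0.8, 0.7, 0.4, 0.1, 0.9, 0.3])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("precision:", precision_score(y_true, y_pred))  # of the rows flagged positive, how many really are
print("recall:   ", recall_score(y_true, y_pred))      # of the real positives, how many were caught
print("AUC:      ", roc_auc_score(y_true, y_prob))     # threshold-free ranking quality

# Regression: the single large error moves RMSE much more than MAE.
r_true = np.array([10.0, 12.0, 15.0, 40.0])
r_pred = np.array([11.0, 11.0, 14.0, 20.0])
print("MAE: ", mean_absolute_error(r_true, r_pred))
print("RMSE:", np.sqrt(mean_squared_error(r_true, r_pred)))
```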
2.3 Hyperparameter tuning and validation (awareness)
- Explain the purpose of hyperparameters and how they differ from learned parameters.
- Describe cross-validation at a high level and why it improves robustness in small datasets (a tuning sketch follows this list).
- Recognize the trade-off between exhaustive search and efficient search methods (concept-level).
- Identify over-tuning risk and why a final holdout test set is needed for honest evaluation.
- Given a scenario, choose a tuning approach that balances compute cost and expected benefit.
- Explain why logging tuning results supports reproducibility and collaboration.
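A tuning sketch reusing the earlier split: a small grid over the regularization strength, cross-validated on the training set only, with the held-out test set reserved for the final honest check. The grid and scoring choice are illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},  # hyperparameters, not learned weights
    cv=5,           # 5-fold cross-validation for a more robust estimate
    scoring="f1",
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("CV F1:      ", round(search.best_score_, 3))
print("holdout F1: ", round(search.score(X_test, y_test), 3))  # final check on untouched data
```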
Topic 3: MLflow Tracking & Experiment Management
Practice this topic →
3.1 Runs, parameters, metrics, and artifacts
- Explain what an MLflow run represents and why runs are the unit of comparison.
- Log parameters, metrics, and artifacts, and explain why each is required for reproducibility (see the logging sketch after this list).
- Differentiate parameters (config) from metrics (results) and identify common logging mistakes.
- Describe the purpose of storing model artifacts and evaluation plots as run artifacts.
- Given a scenario, choose what to log to enable audit and repeatability.
- Explain how to compare runs and select the best candidate based on objective criteria.
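A tracking sketch, assuming the fitted `model` pipeline and test split from the earlier examples; the experiment path and parameter values are placeholders.

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import f1_score

mlflow.set_experiment("/Shared/ml-assoc-demo")   # placeholder experiment path

with mlflow.start_run(run_name="logreg-baseline"):
    mlflow.log_param("C", 1.0)                   # parameters: configuration going in
    mlflow.log_param("scaler", "standard")
    test_f1 = f1_score(y_test, model.predict(X_test))
    mlflow.log_metric("test_f1", test_f1)        # metrics: measured results coming out
    mlflow.sklearn.log_model(model, artifact_path="model")  # model stored as a run artifact

# Compare candidates objectively by sorting runs on the target metric.
best = mlflow.search_runs(order_by=["metrics.test_f1 DESC"], max_results=5)
print(best[["run_id", "params.C", "metrics.test_f1"]])
```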
3.2 Experiment organization and collaboration
- Organize experiments logically (by project/model/problem) and explain why naming conventions matter.
- Describe how tags and metadata improve discoverability and collaboration (a tagging sketch follows this list).
- Explain why tracking data and code versions reduces “it worked on my machine” issues (concept-level).
- Recognize the purpose of reproducible environments and dependency capture (concept-level).
- Given a scenario, choose an experiment structure that supports multiple teammates and iterations.
- Describe why capturing baseline runs and ablations helps decision-making.
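A sketch of one possible organization scheme: a project-scoped experiment path plus consistent tags. The path, tag keys, and values are illustrative conventions, not requirements.

```python
import mlflow

mlflow.set_experiment("/Shared/churn-model/experiments")   # placeholder project path

with mlflow.start_run(run_name="baseline-2024-09"):
    mlflow.set_tags({
        "project": "churn",
        "data_version": "2024-09-01",
        "git_commit": "abc1234",        # placeholder; capture the real commit in practice
        "owner": "data-science-team",
    })
    mlflow.log_metric("val_auc", 0.74)  # placeholder value

# Teammates can filter by tag later instead of hunting through run names.
candidates = mlflow.search_runs(filter_string="tags.project = 'churn'",
                                order_by=["metrics.val_auc DESC"])
print(candidates[["run_id", "tags.owner", "metrics.val_auc"]])
```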
3.3 Common tracking pitfalls and fixes
- Identify why missing randomness control can cause irreproducible results and how to mitigate it (see the sketch after this list).
- Recognize when metric drift across runs indicates data drift or code changes rather than “model randomness.”
- Explain why large artifacts should be stored intentionally and referenced rather than duplicated excessively.
- Describe safe handling of sensitive data in artifacts (avoid logging PII).
- Given a scenario, choose a tracking remediation plan that restores reproducibility quickly.
- Explain why consistent feature preprocessing must be captured as an artifact or code dependency.
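A remediation sketch covering two of these pitfalls, reusing the `model` pipeline from earlier: pin and log the random seed, and log the full preprocessing-plus-model pipeline as one artifact so serving cannot drift from training-time feature handling.

```python
import random
import numpy as np
import mlflow
import mlflow.sklearn

SEED = 42                  # pin common sources of randomness
random.seed(SEED)
np.random.seed(SEED)

with mlflow.start_run(run_name="reproducible-run"):
    mlflow.log_param("seed", SEED)                     # replaying the run requires the seed
    mlflow.log_param("data_version", "2024-09-01")     # placeholder data snapshot label
    mlflow.sklearn.log_model(model, artifact_path="pipeline")  # preprocessing travels with the model
```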
Topic 4: Model Registry & Lifecycle
Practice this topic →
4.1 Registering models and versioning
- Explain why registering a model creates a stable, versioned artifact for downstream use (a registration sketch follows this list).
- Differentiate experiment runs from registry versions (experiments vs releases).
- Describe the purpose of model version metadata (who trained it, which data, which code) conceptually.
- Given a scenario, choose when to register a model vs keep it as an experimental run artifact.
- Identify why versioning supports rollback and audit requirements.
- Explain why model contracts (input/output schema) should be documented and validated.
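A registration sketch, assuming the tracking run from the earlier MLflow example is the most recent run in this session; the model name is a placeholder.

```python
import mlflow

run_id = mlflow.last_active_run().info.run_id   # run from the previous tracking sketch
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",           # the artifact path used when logging
    name="churn_classifier",
)
print("registered as version:", result.version)
```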
4.2 Stage-based promotion and approvals
- Describe stage-based promotion at a high level (development/staging/production) and why it reduces risk (a promotion sketch follows this list).
- Explain why approvals and review gates matter for regulated or high-impact models (concept-level).
- Recognize common promotion pitfalls: insufficient evaluation, missing monitoring, or inconsistent preprocessing.
- Given a scenario, choose a safe promotion strategy that includes validation and a rollback plan.
- Describe why access control matters for the registry and who should be allowed to promote models.
- Explain why documenting model intent prevents misuse in downstream applications.
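A promotion sketch using the classic workspace registry's stages; the model name and version are placeholders. Newer Unity Catalog registries use aliases (for example, `champion`) instead of fixed stages, but the gated-promotion idea is the same.

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier",
    version="1",
    stage="Staging",                  # move to Production only after review and validation
    archive_existing_versions=False,
)
```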
4.3 Deployment awareness (batch vs online)
- Differentiate batch inference from online inference and map each to latency/throughput requirements (a batch-scoring sketch follows this list).
- Explain why feature preprocessing must match training when serving models.
- Recognize the operational need for monitoring and rollback in production deployments.
- Given a scenario, choose the simplest deployment mode that meets requirements.
- Describe how to validate deployments with canary tests and representative inputs (concept-level).
- Explain why model performance should be monitored over time and not assumed stable.
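A batch-scoring sketch, assuming a registered model whose input schema matches the `features` DataFrame from the earlier PySpark example; table and model names are placeholders. Online serving would instead send individual requests to a model-serving endpoint, trading throughput for low latency.

```python
import mlflow.pyfunc

# Wrap the registered model as a Spark UDF and score a whole table on a schedule.
predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_classifier/Staging")

scored = features.withColumn("prediction", predict(*features.columns))
scored.write.format("delta").mode("overwrite").saveAsTable("ml_demo.order_scores")
```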
Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.