ML-ASSOC Cheatsheet — MLflow, Features, Training & Evaluation on Databricks

Last-mile ML-ASSOC review: feature engineering patterns, train/test discipline, MLflow tracking and registry concepts, and evaluation-metric pickers. Includes code snippets, tables, and diagrams.

Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to validate instincts.


1) The MLflow mental model (what goes where)

| MLflow concept | What it stores | Why it matters |
|---|---|---|
| Run | one training/eval attempt | compare experiments reproducibly |
| Params | hyperparameters/config | explain how a run was produced |
| Metrics | evaluation numbers | rank candidates |
| Artifacts | model files, plots, reports | reproduce and deploy |
| Registry | model versions + lifecycle stages | controlled promotion to production |

```mermaid
flowchart LR
  D["Data"] --> FE["Feature engineering"]
  FE --> TR["Train"]
  TR --> R["MLflow run (params/metrics/artifacts)"]
  R --> REG["Model Registry"]
  REG --> DEP["Deploy (batch/real-time)"]
```
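
To make the run/params/metrics split concrete, here is a minimal sketch of comparing logged runs with `mlflow.search_runs`; the experiment name, metric, and param column are hypothetical placeholders.

```python
import mlflow

# Assumes an experiment named "churn-model" already contains logged runs (hypothetical name).
runs = mlflow.search_runs(experiment_names=["churn-model"])

# Each row is one run; params and metrics come back as "params.*" / "metrics.*" columns.
best = runs.sort_values("metrics.auc", ascending=False).head(3)
print(best[["run_id", "params.max_depth", "metrics.auc"]])
```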

2) Feature engineering quick rules (avoid leakage)

| Risk | What it looks like | Safer approach |
|---|---|---|
| Leakage | features use future info | compute features using only info available at prediction time |
| Label leakage | feature derived from target | drop/shift feature; verify pipeline |
| Train/test contamination | stats computed on full dataset | fit transforms on train only |
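
As a sketch of the "fit transforms on train only" rule: the scaler below learns its statistics from the training split and only transforms the test split. The columns and data are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny synthetic example; real feature tables come from your feature pipeline.
df = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 48, 60],
    "monthly_charges": [70.0, 55.5, 80.2, 65.0, 90.1, 45.3],
    "churned": [1, 0, 0, 1, 0, 1],
})
X, y = df[["tenure", "monthly_charges"]], df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned on the train split only
X_test_scaled = scaler.transform(X_test)        # test data is transformed, never fitted
```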

3) Metrics pickers (high-yield)

| Task | Common metrics | Notes |
|---|---|---|
| Classification | accuracy, precision/recall, F1, AUC | beware class imbalance |
| Regression | RMSE, MAE, R² | choose based on error sensitivity |

Rule: If the prompt mentions imbalance or false positives/negatives, accuracy is rarely the right answer.
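
A minimal sketch of computing the classification metrics above with scikit-learn. The labels and predictions are illustrative: with this imbalance, accuracy looks high while recall exposes the weak minority-class performance.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Illustrative imbalanced labels: the model predicts the majority class almost every time.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.9, 0.4]  # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))   # high, but mostly reflects the majority class
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))     # only half of the positives are caught
print("f1       :", f1_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_score))
```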


4) MLflow tracking: minimal code pattern

```python
import mlflow
import mlflow.sklearn

# Assumes `model` is an already-fitted scikit-learn estimator and that
# "confusion_matrix.png" exists on the local filesystem.
with mlflow.start_run():
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("auc", 0.91)
    mlflow.log_artifact("confusion_matrix.png")
    mlflow.sklearn.log_model(model, "model")
```

Exam cue: if you need reproducibility, log params + metrics + model artifact in the run.
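
To close the reproducibility loop, the model logged above can be reloaded straight from its run; the run ID is a placeholder, and `X_test` stands in for whatever held-out features you evaluate on.

```python
import mlflow.pyfunc

# "runs:/<run_id>/model" points at the "model" artifact logged above; <run_id> is a placeholder.
loaded = mlflow.pyfunc.load_model("runs:/<run_id>/model")
preds = loaded.predict(X_test)  # X_test: your held-out feature set
```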


5) Registry basics (versioning + promotion)

| Step | What happens | Why it matters |
|---|---|---|
| Register model | creates a named model with versions | stable reference |
| New version | produced from a run/model artifact | traceability |
| Promote stage | e.g., staging → production | controlled rollout |
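
A minimal sketch of registering a run's model and promoting a version, assuming the workspace Model Registry; the model name and run ID are placeholders, and Unity Catalog registries use aliases rather than stages.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact from a finished run; <run_id> and "churn-model" are placeholders.
result = mlflow.register_model("runs:/<run_id>/model", "churn-model")

# Promote the new version (workspace registry stages; Unity Catalog uses aliases instead).
client = MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=result.version,
    stage="Production",
)
```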

6) Fast troubleshooting pickers

  • Can’t reproduce a result: missing params/artifacts, data version drift, or randomness not controlled (see the sketch after this list).
  • Metrics look too good: leakage, wrong split, target in features.
  • Model “works” offline but not in production: skewed features, missing preprocessing, inconsistent schema.
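
For the reproducibility bullet, a minimal sketch of pinning randomness and recording it in the run; the seed, data-version string, and params are illustrative.

```python
import random
import numpy as np
import mlflow

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

with mlflow.start_run():
    # Record everything needed to recreate the run: seed, data snapshot, and config.
    mlflow.log_params({
        "seed": SEED,
        "data_version": "2024-06-01",  # illustrative snapshot/Delta version identifier
        "max_depth": 8,
    })
```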