How to Use This Quick Reference
This independent Quick Reference is built for candidates preparing for the CompTIA DataAI (DY0-001) exam. Use it as a compact review of high-yield data, analytics, AI, governance, and operational decision points.
Focus on recognizing which concept fits the scenario:
- What kind of data is being used?
- What is the business question?
- Is the task descriptive analytics, prediction, classification, clustering, or generation?
- What risks apply: privacy, bias, leakage, drift, security, quality, or explainability?
- What should be done first, next, or instead?
Core DataAI Mental Model
| Area | Candidate should recognize | Common exam trap |
|---|---|---|
| Data lifecycle | Collection, storage, preparation, analysis, deployment, monitoring, retirement | Jumping to modeling before defining the problem or validating data quality |
| Data quality | Accuracy, completeness, consistency, timeliness, validity, uniqueness | Treating more data as automatically better data |
| Analytics | Descriptive, diagnostic, predictive, prescriptive | Confusing “why did it happen?” with “what will happen?” |
| AI/ML | Learning patterns from data to make predictions, classifications, recommendations, or generated outputs | Assuming AI is always appropriate when a rule-based or reporting solution is enough |
| Generative AI | Produces text, code, images, summaries, responses, or synthetic content | Treating generated output as verified truth |
| Governance | Policies for ownership, access, quality, privacy, retention, ethics, and compliance | Treating governance as only a security function |
| Security | Protect confidentiality, integrity, availability, and authorized access | Ignoring data exposure through model outputs, prompts, or logs |
| MLOps / AI operations | Deployment, versioning, monitoring, retraining, rollback | Thinking the project ends when the model is trained |
Data Lifecycle Reference
| Phase | Purpose | Key activities | Exam cues |
|---|---|---|---|
| Define problem | Convert business need into measurable objective | Stakeholder alignment, success criteria, constraints, risk review | “Before collecting data, what should be done?” |
| Collect / ingest | Bring data from sources into controlled environment | Batch loads, streaming, APIs, logs, surveys, sensors | “Data comes from multiple systems” |
| Store | Persist data for use and governance | Databases, warehouses, lakes, lakehouses, object storage | “Structured reporting” versus “raw diverse data” |
| Prepare | Make data usable | Cleaning, deduplication, normalization, feature engineering, labeling | “Missing values, inconsistent formats” |
| Analyze / model | Generate insights or predictions | Querying, statistics, visualization, training, validation | “Predict churn,” “segment customers,” “detect anomalies” |
| Deploy | Put outputs into production workflow | APIs, dashboards, batch scoring, embedded models | “Real-time decisioning” or “business dashboard” |
| Monitor | Detect degradation and risk | Drift checks, performance metrics, bias monitoring, incident response | “Model worked before but now performs poorly” |
| Retain / retire | Manage end-of-life data and models | Retention, archiving, deletion, decommissioning | “Data no longer needed” or “policy requires removal” |
Data Roles and Responsibilities
| Role | Primary responsibility | What to remember for DY0-001 scenarios |
|---|---|---|
| Data owner | Accountability for data use, access, and business meaning | Usually approves access and classification decisions |
| Data steward | Data quality, definitions, metadata, and governance execution | Maintains business glossary and data standards |
| Data custodian | Technical operation of data systems | Implements backups, access controls, storage, and availability |
| Data analyst | Reporting, querying, dashboards, descriptive and diagnostic analysis | Explains trends and business patterns |
| Data scientist | Statistical modeling, machine learning, experimentation | Builds and evaluates predictive or advanced models |
| Data engineer | Pipelines, integration, transformation, scalable data platforms | Ensures reliable ingestion and processing |
| ML engineer / AI engineer | Production deployment and operation of models | Focuses on serving, monitoring, scaling, and automation |
| Security / privacy team | Protects data and manages risk | Encryption, access control, privacy impact, incident response |
| Business stakeholder | Defines requirements and validates usefulness | Success criteria should map to business outcomes |
Data Types and Structures
| Category | Examples | Best suited for | Exam distinction |
|---|---|---|---|
| Structured | Relational tables, rows, columns, transactions | SQL queries, reporting, dashboards, warehouses | Schema is predefined |
| Semi-structured | JSON, XML, logs, events, email metadata | APIs, event analytics, flexible ingestion | Has tags or keys but not strict relational format |
| Unstructured | Documents, images, audio, video, free text | NLP, computer vision, generative AI, search | Needs extraction, embedding, labeling, or preprocessing |
| Time series | Sensor readings, stock prices, telemetry, usage over time | Forecasting, anomaly detection, trend monitoring | Order and intervals matter |
| Categorical | Region, product type, status, class label | Grouping, classification, one-hot encoding | Values are labels, not numeric magnitude |
| Numerical | Age, revenue, temperature, count | Statistics, regression, scaling | Can be continuous or discrete |
| Ordinal | Satisfaction rating, severity, priority | Ranking, ordered comparisons | Order matters; equal distance may not |
| Geospatial | Coordinates, addresses, regions | Mapping, route optimization, location analytics | Requires spatial context |
Storage and Processing Selection
| Need | Better fit | Why | Avoid when |
|---|---|---|---|
| Operational transactions | OLTP database | Fast inserts/updates, normalized records, current state | Large analytical scans are primary need |
| Business reporting | Data warehouse | Structured, curated, historical, optimized for analytics | Raw diverse data must be stored before modeling |
| Raw multi-format storage | Data lake | Stores structured, semi-structured, and unstructured data | Governance and metadata are absent |
| Warehouse plus lake flexibility | Lakehouse concept | Combines open storage with governance/query features | Organization needs only simple transactional storage |
| Near-real-time event handling | Streaming pipeline | Processes data as events arrive | Daily or monthly batch is sufficient |
| Scheduled large loads | Batch processing | Efficient for periodic transformation and reporting | Low-latency decisions are required |
| Search across documents | Search index / vector index | Retrieval by keyword, semantic similarity, or embeddings | Exact relational transactions are primary use case |
| Temporary analysis | Sandbox / workspace | Exploration without changing production | Sensitive data lacks masking or approval |
Analytics Type Decision Table
| Type | Question answered | Typical output | Example cue |
|---|---|---|---|
| Descriptive | What happened? | Reports, KPIs, counts, totals, dashboards | “Show last quarter revenue by region” |
| Diagnostic | Why did it happen? | Drill-downs, root cause, correlations | “Find why churn increased” |
| Predictive | What is likely to happen? | Forecasts, risk scores, classifications | “Predict which customers may leave” |
| Prescriptive | What should we do? | Recommendations, optimization, next-best action | “Recommend optimal inventory levels” |
| Cognitive / generative | What can the system create or infer from context? | Summaries, generated text, answers, code, images | “Summarize support tickets” |
Data Quality Dimensions
| Dimension | Meaning | Detection examples | Remediation examples |
|---|---|---|---|
| Accuracy | Data reflects reality | Compare to trusted source, validation rules | Correct source system, reconcile records |
| Completeness | Required values are present | Null checks, missing field reports | Collect missing data, impute cautiously |
| Consistency | Values agree across systems | Conflicting customer status, different date formats | Standardize definitions and formats |
| Validity | Values conform to allowed format/range | Invalid email, negative age, impossible dates | Enforce constraints and validation |
| Uniqueness | Records are not duplicated | Duplicate keys, fuzzy matching | Deduplicate, master data management |
| Timeliness | Data is current enough | Stale timestamp, delayed feed | Improve ingestion frequency, alert on latency |
| Integrity | Relationships remain correct | Orphan records, broken foreign keys | Referential constraints, reconciliation |
| Lineage | Origin and transformations are known | Missing metadata or undocumented changes | Data catalog, pipeline documentation |
Data Preparation Reference
| Task | Use when | Important caution |
|---|---|---|
| Deduplication | Same entity appears more than once | Define duplicate logic; exact matching may miss fuzzy duplicates |
| Standardization | Formats differ across systems | Normalize date, currency, units, casing, categories |
| Normalization / scaling | Numeric features have different ranges | Fit scaling on training data only to avoid leakage |
| Encoding categorical variables | ML model needs numeric input | Watch high-cardinality fields and unseen categories |
| Imputation | Missing values must be handled | Do not hide systemic missingness; missingness may be predictive |
| Outlier handling | Extreme values distort analysis | Determine whether outlier is error, rare valid event, or fraud signal |
| Tokenization | Text must be processed for NLP/LLMs | Token limits affect cost, context, and truncation |
| Labeling | Supervised model needs target values | Poor labels create poor models even with good algorithms |
| Feature engineering | Raw fields need predictive transformation | Avoid using future information unavailable at prediction time |
| Data splitting | Model must be evaluated fairly | Split before transformations that learn from the data |
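The scaling and splitting cautions above can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper names `fit_scaler` and `transform` and the sample values are hypothetical, not part of any exam material or library.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn mean and standard deviation from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero spread
    return mu, sigma

def transform(values, mu, sigma):
    """Apply previously learned parameters; never refit on new data."""
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values, already split before any fitting.
train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 40.0]

mu, sigma = fit_scaler(train)          # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)  # test reuses the train parameters
```

Fitting the scaler on `train + test` instead would let test-set statistics influence training, which is the leakage the table warns about.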
High-Yield Statistical Concepts
| Concept | Meaning | Exam use |
|---|---|---|
| Mean | Arithmetic average | Sensitive to outliers |
| Median | Middle value | Better for skewed distributions |
| Mode | Most frequent value | Useful for categorical values |
| Range | Max minus min | Simple spread, sensitive to outliers |
| Variance | Average squared deviation from mean | Measures dispersion |
| Standard deviation | Square root of variance; typical distance from mean | Same unit as data |
| Percentile | Value below which a percentage falls | Used for thresholds and distribution comparison |
| Correlation | Strength/direction of relationship | Does not prove causation |
| Covariance | Directional joint variability | Scale-dependent, less interpretable than correlation |
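| Confidence interval | Range of plausible values for a population parameter at a stated confidence level | Wider interval means more uncertainty |
| p-value | Probability of results at least as extreme as those observed, assuming the null hypothesis is true | Small value is evidence against the null; does not measure business importance |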
| Sampling bias | Sample does not represent population | Leads to misleading conclusions |
| Class imbalance | One class dominates target | Accuracy can be misleading |
\[
\text{Mean} = \frac{\text{sum of values}}{\text{number of values}}
\]

\[
\text{Range} = \text{maximum value} - \text{minimum value}
\]

\[
\text{Error} = \text{actual value} - \text{predicted value}
\]
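To see why the mean is described as sensitive to outliers while the median is not, here is a short sketch with Python's `statistics` module. The salary figures are made up for illustration.

```python
from statistics import mean, median, pstdev

# Hypothetical salaries (in thousands) with one extreme value.
salaries = [40, 42, 45, 47, 50, 300]

summary = {
    "mean": mean(salaries),                    # pulled upward by the 300
    "median": median(salaries),                # middle of the sorted values
    "range": max(salaries) - min(salaries),    # max minus min
    "std_dev": pstdev(salaries),               # population standard deviation
}

# Removing the outlier shows how far it moved the mean.
mean_without_outlier = mean(salaries[:-1])
```

Here the median stays at 46 while the mean rises above 87, so the median is the more representative summary for this skewed sample.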
SQL and Querying Patterns
Use SQL patterns to recognize joins, aggregation, filtering order, and quality checks.
Aggregation and Filtering
SELECT department, COUNT(*) AS employee_count, AVG(salary) AS avg_salary
FROM employees
WHERE active = true
GROUP BY department
HAVING COUNT(*) > 10
ORDER BY avg_salary DESC;
| Clause | Purpose | Trap |
|---|---|---|
| WHERE | Filters rows before grouping | Cannot filter aggregate results here |
| GROUP BY | Creates groups for aggregation | Non-aggregated selected columns must be grouped |
| HAVING | Filters groups after aggregation | Often confused with WHERE |
| ORDER BY | Sorts final result | Does not change calculation logic |
Join Selection
| Join type | Keeps | Use when | Common trap |
|---|---|---|---|
| INNER JOIN | Matching rows only | Need records present in both tables | Accidentally drops unmatched records |
| LEFT JOIN | All left rows plus matches | Need all primary records even without match | WHERE condition on right table can turn it into inner behavior |
| RIGHT JOIN | All right rows plus matches | Less common; similar to reversing LEFT JOIN | Harder to read in complex queries |
| FULL OUTER JOIN | All rows from both sides | Reconciliation and mismatch detection | Not all systems support it |
| CROSS JOIN | Every combination | Generate combinations | Can create huge result sets |
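The LEFT JOIN trap in the table is easier to remember once seen. The sketch below runs both variants against an in-memory SQLite database through Python's `sqlite3` module; the `customers` and `orders` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 'shipped'), (11, 2, 'cancelled');
""")

# Right-table filter in WHERE: unmatched rows have NULL status and are
# discarded, so the LEFT JOIN behaves like an INNER JOIN.
where_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.status = 'shipped'
    ORDER BY c.customer_id
""").fetchall()

# Same filter moved into ON: every customer is kept, with NULL where
# no shipped order exists.
on_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o
      ON o.customer_id = c.customer_id AND o.status = 'shipped'
    ORDER BY c.customer_id
""").fetchall()
conn.close()
```

`where_rows` contains only Ana, while `on_rows` keeps all three customers.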
Data Quality Check Example
SELECT customer_id, COUNT(*) AS duplicate_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
Window Function Example
SELECT
    customer_id,
    order_date,
    order_total,
    SUM(order_total) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS running_total
FROM orders;
Use window functions when you need row-level detail plus grouped context.
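For readers who think more easily in procedural terms, the running total computed by the window function can be mirrored in plain Python. This is only an illustrative equivalent with hypothetical rows, and it assumes each customer has at most one order per date (with ties, SQL's default window frame would treat same-date rows as peers).

```python
from collections import defaultdict

# Hypothetical rows: (customer_id, order_date, order_total)
orders = [
    ("c1", "2024-01-05", 50.0),
    ("c2", "2024-01-06", 20.0),
    ("c1", "2024-01-09", 30.0),
    ("c2", "2024-02-01", 5.0),
]

def with_running_total(rows):
    """Keep every row and append a per-customer cumulative total."""
    totals = defaultdict(float)
    out = []
    # Sorting by (customer, date) plays the role of PARTITION BY + ORDER BY.
    for customer_id, order_date, order_total in sorted(rows, key=lambda r: (r[0], r[1])):
        totals[customer_id] += order_total
        out.append((customer_id, order_date, order_total, totals[customer_id]))
    return out

result = with_running_total(orders)
```

Unlike GROUP BY, no rows are collapsed: the output has as many rows as the input, each carrying its group-level context.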
AI and Machine Learning Task Selection
| Task | Goal | Common algorithms / approaches | Evaluation focus |
|---|---|---|---|
| Regression | Predict numeric value | Linear regression, decision trees, random forest, gradient boosting, neural networks | MAE, MSE, RMSE, R-squared |
| Binary classification | Predict one of two classes | Logistic regression, decision trees, SVM, random forest, neural networks | Precision, recall, F1, ROC-AUC |
| Multiclass classification | Predict one of several classes | Softmax models, trees, boosting, neural networks | Accuracy, macro/micro F1, confusion matrix |
| Clustering | Group similar records without labels | K-means, hierarchical clustering, DBSCAN | Silhouette score, cluster interpretability |
| Anomaly detection | Find unusual behavior | Isolation forest, statistical thresholds, autoencoders | False positives, recall for rare events |
| Forecasting | Predict future time-based values | ARIMA-style methods, exponential smoothing, regression, recurrent/deep models | Backtesting, MAE/RMSE, seasonality handling |
| Recommendation | Suggest items or actions | Collaborative filtering, content-based filtering, hybrid systems | Ranking metrics, click-through, conversion |
| NLP classification | Categorize text | Bag-of-words, embeddings, transformers | F1, confusion matrix, label quality |
| Summarization / generation | Produce text or content | LLM prompting, RAG, fine-tuning | Factuality, relevance, safety, human review |
Learning Types
| Learning type | Uses labels? | Goal | Example |
|---|---|---|---|
| Supervised learning | Yes | Learn mapping from input to known target | Predict loan default from historical labeled loans |
| Unsupervised learning | No | Discover structure or patterns | Segment customers by behavior |
| Semi-supervised learning | Some labels | Use limited labeled data with larger unlabeled set | Classify documents with few labeled examples |
| Reinforcement learning | Feedback/rewards | Learn actions through trial and reward | Optimize game strategy or robotics behavior |
| Self-supervised learning | Labels derived from data | Pretrain models on inherent structure | Predict masked words in text |
| Transfer learning | Uses learned representation | Adapt existing model to new task | Fine-tune image or language model |
Model Selection Reference
| Scenario cue | Likely choice | Why |
|---|---|---|
| “Predict sales amount” | Regression | Target is numeric |
| “Will customer churn: yes/no?” | Binary classification | Target has two classes |
| “Classify ticket as billing, technical, account, or other” | Multiclass classification | Target has multiple categories |
| “Group customers without predefined labels” | Clustering | No target labels |
| “Find suspicious transactions” | Anomaly detection (few or no labels) or classification (labeled fraud history) | Fraud is rare and deviates from normal behavior |
| “Predict demand next month” | Time-series forecasting | Temporal order matters |
| “Recommend products to users” | Recommendation system | Personalized ranking |
| “Summarize long policy documents” | Generative AI / NLP summarization | Produces text output |
| “Answer questions using internal documents” | Retrieval-augmented generation | Needs grounded responses from enterprise knowledge |
| “Explain which features influenced prediction” | Interpretable model or explainability method | Transparency is required |
Training Workflow and Leakage Controls
flowchart LR
A[Define objective and success metric] --> B[Collect and profile data]
B --> C[Split data into train, validation, test]
C --> D[Fit preprocessing on training data only]
D --> E[Train model]
E --> F[Tune with validation data]
F --> G[Final evaluation on test data]
G --> H[Deploy with monitoring]
H --> I[Monitor drift, quality, bias, and performance]
I --> J[Retrain or rollback when needed]
| Step | Correct practice | Leakage trap |
|---|---|---|
| Split data | Separate train, validation, and test data | Cleaning, scaling, or feature selection before split using all data |
| Time-based data | Split chronologically when forecasting | Random split leaks future patterns into training |
| Feature engineering | Use only data available at prediction time | Including future outcomes, post-event fields, or labels assigned after the outcome was known |
| Hyperparameter tuning | Use validation set or cross-validation | Repeatedly tuning on the test set |
| Final evaluation | Test once for unbiased estimate | Reporting best validation result as final test result |
| Deployment | Reproduce same preprocessing pipeline | Training and serving logic differ |
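The time-based row above can be sketched as a chronological split. The function name `chronological_split`, the split sizes, and the monthly demand rows are all hypothetical; the point is only that every validation and test row is later than every training row.

```python
def chronological_split(rows, n_val, n_test, key=lambda r: r["date"]):
    """Sort by time, then cut: oldest rows train, newest rows test."""
    ordered = sorted(rows, key=key)
    n_train = len(ordered) - n_val - n_test
    if n_train <= 0:
        raise ValueError("not enough rows for the requested split sizes")
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

# Hypothetical monthly demand, deliberately supplied out of order.
rows = [{"date": f"2024-{m:02d}-01", "demand": 100 + m} for m in range(10, 0, -1)]

train, val, test = chronological_split(rows, n_val=2, n_test=2)
```

A random shuffle before splitting would mix later months into the training set, which is the leakage trap named in the table.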
Bias, Variance, and Fit
| Condition | Symptoms | Likely cause | Response |
|---|---|---|---|
| Underfitting | Poor training and test performance | Model too simple, weak features, insufficient training | Add features, increase complexity, train longer |
| Overfitting | Strong training performance, weak test performance | Model memorizes training noise | Regularization, more data, simpler model, cross-validation |
| High bias | Systematic error | Assumptions too restrictive | More expressive model or better features |
| High variance | Performance unstable across samples | Model too sensitive to data | More data, regularization, ensembling |
| Data drift | Input distribution changes | Real-world data changes after deployment | Monitor features and retrain |
| Concept drift | Relationship between input and target changes | Behavior, fraud, market, or policy changes | Update labels, retrain, revise objective |
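One common way to quantify data drift on a single numeric feature is the Population Stability Index (PSI), which compares how values fall into bins at training time versus in production. The sketch below is a simplified version with hand-picked bin edges and made-up ages; the `eps` floor for empty bins is an assumption of this example, not a standard value.

```python
from math import log

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index over bins defined by sorted edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # index of the bin holding v
            counts[idx] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical customer ages at training time and after deployment.
training_ages = [20, 25, 30, 35, 40, 45, 50, 55]
production_ages = [40, 45, 50, 55, 60, 65, 70, 75]
edges = [30, 50]

no_drift = psi(training_ages, training_ages, edges)
drift = psi(training_ages, production_ages, edges)
```

A PSI of zero means identical bin shares; larger values indicate a bigger shift. This detects data drift only; concept drift needs fresh labels to measure.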
Evaluation Metrics
Classification Metrics
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]

\[
\text{Precision} = \frac{TP}{TP + FP}
\]

\[
\text{Recall} = \frac{TP}{TP + FN}
\]

\[
\text{F1 score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\]
| Metric | Best when | Watch out |
|---|---|---|
| Accuracy | Classes are balanced and errors have similar cost | Misleading with class imbalance |
| Precision | False positives are costly | May miss true positives |
| Recall / sensitivity | False negatives are costly | May increase false positives |
| Specificity | True negative rate matters | Often paired with sensitivity |
| F1 score | Need balance between precision and recall | Hides separate precision/recall tradeoff |
| ROC-AUC | Compare ranking ability across thresholds | Can be less informative for highly imbalanced data |
| PR-AUC | Positive class is rare | More useful for fraud, defects, rare disease scenarios |
| Confusion matrix | Need to inspect error types | Requires class-specific interpretation |
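The formulas above can be checked against a small worked case. The sketch below computes the metrics from confusion-matrix counts; the counts describe a hypothetical fraud model and are chosen to show how accuracy can look strong while recall is poor.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }

# Hypothetical: 1,000 transactions, only 10 of them actual fraud.
m = classification_metrics(tp=2, fp=3, fn=8, tn=987)
```

With these counts accuracy is 0.989, yet recall is only 0.2: eight of ten fraud cases are missed, which is why accuracy alone misleads under class imbalance.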
Regression Metrics
\[
\text{MAE} = \frac{\text{sum of absolute errors}}{\text{number of predictions}}
\]

\[
\text{MSE} = \frac{\text{sum of squared errors}}{\text{number of predictions}}
\]

\[
\text{RMSE} = \sqrt{\text{MSE}}
\]
| Metric | Use when | Watch out |
|---|---|---|
| MAE | Need easy-to-understand average error | Treats all errors linearly |
| MSE | Large errors should be penalized more | Units are squared |
| RMSE | Penalize large errors but keep original unit | Sensitive to outliers |
| R-squared | Explain proportion of variance captured | Can look good without proving usefulness |
| MAPE | Percentage error is useful | Fails or misleads near zero actual values |
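A quick worked comparison shows how MAE and RMSE differ in their treatment of large errors. The two prediction sets below are invented so that both have the same MAE but different RMSE.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """MAE, MSE, and RMSE using error = actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"mae": mae, "mse": mse, "rmse": sqrt(mse)}

actual = [100, 102, 98, 101]
many_small = [101, 101, 99, 100]   # four errors of size 1
one_large = [100, 102, 98, 105]    # a single error of size 4

small_metrics = regression_metrics(actual, many_small)
large_metrics = regression_metrics(actual, one_large)
```

Both models have an MAE of 1, but the model with one large miss has an RMSE of 2 instead of 1, reflecting the squared penalty.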
Unsupervised and Generative Evaluation
| Area | Metric / method | What it checks |
|---|---|---|
| Clustering | Silhouette score | Separation and cohesion of clusters |
| Clustering | Business interpretability | Whether clusters are actionable |
| Anomaly detection | Precision and recall on labeled anomalies | Balance between alert noise and missed events |
| LLM output | Human evaluation | Relevance, correctness, tone, safety |
| LLM output | Groundedness / citation check | Whether answer is supported by retrieved sources |
| LLM output | Toxicity / safety checks | Harmful, biased, or policy-violating output |
| Retrieval | Recall at k / precision at k | Whether relevant documents appear in top results |
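The retrieval row can be made concrete with a small sketch of precision at k and recall at k. The document IDs and relevance judgments are hypothetical.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Score the top-k retrieved items against a set of relevant items."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

ranked = ["d3", "d7", "d1", "d9", "d4"]   # retriever output, best first
relevant = {"d1", "d3", "d8"}             # judged relevant for the query

p_at_3, r_at_3 = precision_recall_at_k(ranked, relevant, 3)
p_at_5, r_at_5 = precision_recall_at_k(ranked, relevant, 5)
```

Raising k from 3 to 5 lowers precision here without improving recall, because the missing document `d8` was never retrieved at all.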
Generative AI and LLM Reference
| Concept | Meaning | Exam relevance |
|---|---|---|
| Prompt | Input instructions and context sent to a model | Quality strongly affects output |
| System prompt | High-priority instruction defining behavior | Used to set role, constraints, and safety boundaries |
| Temperature | Controls randomness of output | Lower for deterministic factual tasks; higher for creative variation |
| Token | Unit of text processed by model | Context length, cost, and truncation depend on tokens |
| Embedding | Numeric representation of semantic meaning | Used for similarity search and retrieval |
| Vector database / index | Stores embeddings for similarity search | Common in RAG architectures |
| RAG | Retrieval-augmented generation; retrieves external context before generation | Helps ground answers in current or private data |
| Fine-tuning | Adjusting model behavior with additional training examples | Useful for style, task adaptation, or domain patterns |
| Hallucination | Plausible but false generated output | Requires grounding, validation, and human review |
| Guardrail | Control to reduce unsafe or invalid outputs | Includes filtering, policy checks, prompt constraints |
| Agent | Model-driven system that can plan and call tools | Needs permissions, logging, and action limits |
Prompting and GenAI Decision Table
| Need | Prefer | Why | Avoid if |
|---|---|---|---|
| Improve one-off answer quality | Better prompt design | Fast, low-cost, no model changes | Problem requires private knowledge not in prompt |
| Answer from internal documents | RAG | Grounds output in retrieved content | Source documents are low quality or access is not controlled |
| Enforce organization-specific style | Fine-tuning or prompt templates | Produces consistent format and tone | Need factual updates from changing documents |
| Reduce hallucinations | RAG, citations, validation, constrained output | Ties answer to sources and checks format | User expects creative brainstorming |
| Extract structured fields from text | Prompt with schema or NLP extraction model | Converts unstructured to structured | Output is not validated |
| Execute business actions | Agent with tool controls | Can call APIs or workflows | Permissions, audit, and rollback are absent |
| Protect sensitive data | Redaction, access control, approved model path | Reduces data exposure | Users can paste secrets into prompts freely |
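The structured-extraction row warns against unvalidated output. As a sketch of what validation might look like, the code below checks a model's JSON reply against a simple expected schema. The field names, allowed categories, and sample replies are hypothetical, and no real model API is called.

```python
import json

REQUIRED_FIELDS = {"customer_id": str, "category": str, "priority": int}
ALLOWED_CATEGORIES = {"billing", "technical", "account", "other"}

def validate_extraction(raw_output):
    """Return (record, errors); record is None when validation fails."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")

    category = record.get("category")
    if isinstance(category, str) and category not in ALLOWED_CATEGORIES:
        errors.append("category not in allowed set")
    return (record if not errors else None), errors

# Hypothetical model replies.
good, good_errors = validate_extraction(
    '{"customer_id": "C-17", "category": "billing", "priority": 2}')
bad, bad_errors = validate_extraction(
    '{"customer_id": "C-17", "category": "refund"}')
```

Rejected replies can be retried, routed to human review, or logged, rather than flowing silently into downstream systems.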
RAG Architecture Components
| Component | Purpose | Common failure mode |
|---|---|---|
| Source documents | Authoritative knowledge | Outdated, duplicated, or conflicting content |
| Chunking | Splits documents into retrievable pieces | Chunks too large, too small, or missing context |
| Embedding model | Converts chunks and queries to vectors | Poor semantic match for domain language |
| Vector index | Retrieves similar chunks | Irrelevant results if metadata and filters are weak |
| Retriever | Selects candidate context | Low recall misses needed evidence |
| Generator | Produces final answer | Hallucinates if context is weak or ignored |
| Citation / grounding check | Verifies support | References irrelevant or unavailable text |
| Access control | Ensures users retrieve only allowed data | Data leakage through shared index or cached context |
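To tie the components together, here is a toy retrieval step. It is a sketch only: word counts stand in for a real embedding model, a Python list stands in for a vector index, and the chunks, roles, and function names are hypothetical. It does show one design point from the table: access filtering happens before ranking, so restricted chunks never reach the generator.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical chunks with role-based access metadata.
chunks = [
    {"id": "hr-1", "text": "vacation policy allows twenty days per year", "roles": {"employee", "hr"}},
    {"id": "hr-2", "text": "salary bands for each job level", "roles": {"hr"}},
    {"id": "it-1", "text": "reset your password in the self service portal", "roles": {"employee", "hr"}},
]

def retrieve(query, user_role, k=2):
    """Filter by access first, then rank the remaining chunks by similarity."""
    q = embed(query)
    scored = [(cosine(q, embed(c["text"])), c["id"])
              for c in chunks if user_role in c["roles"]]
    scored = [s for s in scored if s[0] > 0]  # drop chunks with no overlap
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

employee_hits = retrieve("how many vacation days per year", "employee")
blocked_hits = retrieve("salary bands for each job level", "employee")
hr_hits = retrieve("salary bands by level", "hr", k=1)
```

The employee query about salary bands returns nothing, even though a matching chunk exists, because that chunk is visible only to the `hr` role.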
Security, Privacy, and Governance
| Control / concept | Purpose | Exam decision point |
|---|---|---|
| Data classification | Labels data by sensitivity and handling needs | First step before applying protection controls |
| Least privilege | Grants only needed access | Preferred access model for data and AI systems |
| Role-based access control | Access by job role | Easier administration for common roles |
| Attribute-based access control | Access by attributes, context, or conditions | Better for fine-grained and dynamic policies |
| Encryption at rest | Protects stored data | Does not control who can query decrypted data |
| Encryption in transit | Protects data moving over networks | Required for APIs, pipelines, and client connections |
| Tokenization | Replaces sensitive value with token | Useful when original value must be recoverable via secure mapping |
| Masking | Hides part or all of sensitive data | Useful for display or nonproduction access |
| Anonymization | Removes identifying linkage | Hard to reverse if done properly; utility may decrease |
| Pseudonymization | Replaces identifiers but can be re-linked with key | Still sensitive if re-identification is possible |
| Data loss prevention | Detects or blocks sensitive data movement | Useful for email, uploads, endpoints, and prompts |
| Audit logging | Records access and actions | Required for investigation and accountability |
| Retention policy | Defines how long data is kept | Reduces risk from unnecessary data |
| Data lineage | Tracks origin and transformations | Supports trust, troubleshooting, and compliance |
| Model card | Documents model purpose, data, metrics, limits, risks | Supports transparency and responsible use |
| Data catalog | Inventory of data assets and metadata | Helps discovery and governance |
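Masking, pseudonymization, and tokenization are easy to confuse, so a side-by-side sketch may help. The code uses only the standard library; the key, the email address, and the in-memory `TokenVault` class are illustrative assumptions, and a real deployment would rely on managed keys and a secured vault service.

```python
import hashlib
import hmac
import secrets

def mask_email(email):
    """Masking: hide most of the value for display; not reversible."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def pseudonymize(value, key):
    """Pseudonymization via keyed hash: the same input and key always give
    the same pseudonym, so whoever holds the key can re-link records."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

class TokenVault:
    """Tokenization: random token, original recoverable through the mapping."""
    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value):
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        return token

    def detokenize(self, token):
        return self._token_to_value[token]

key = b"demo-key-not-for-production"   # hypothetical key
email = "jane.doe@example.com"         # hypothetical value

masked = mask_email(email)
pseudonym = pseudonymize(email, key)
vault = TokenVault()
token = vault.tokenize(email)
```

The pseudonym remains sensitive as long as the key exists, which matches the table's note that pseudonymized data can still be re-identified.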
Responsible AI and Risk Controls
| Risk | Description | Mitigation |
|---|---|---|
| Bias | Model treats groups unfairly due to data or design | Representative data, fairness metrics, review by subgroup |
| Disparate impact | Outcomes disproportionately affect protected or sensitive groups | Fairness testing and policy review |
| Lack of explainability | Users cannot understand decisions | Use interpretable models or explainability tools |
| Hallucination | Generated content is false but convincing | RAG, validation, citations, human review |
| Privacy leakage | Sensitive information appears in outputs or logs | Redaction, access controls, prompt filtering, logging controls |
| Data poisoning | Training or retrieval data is maliciously altered | Source validation, integrity checks, monitoring |
| Prompt injection | User or document attempts to override model instructions | Input filtering, instruction hierarchy, tool restrictions |
| Model inversion | Attacker infers training data | Limit output detail, privacy-preserving training, access control |
| Model theft | Attacker extracts model behavior or parameters | Rate limits, monitoring, access control |
| Automation bias | Humans over-trust model output | Human-in-the-loop review and confidence indicators |
Data Visualization Selection
| Goal | Chart / visualization | Avoid |
|---|---|---|
| Compare categories | Bar chart | 3D effects and crowded labels |
| Show trend over time | Line chart | Pie charts for time series |
| Show part-to-whole | Stacked bar or pie for few categories | Too many slices |
| Show distribution | Histogram, box plot | Average-only summaries for skewed data |
| Show relationship | Scatter plot | Inferring causation from visual correlation |
| Show geographic pattern | Map | Using area size when color scale is clearer |
| Show ranking | Sorted bar chart | Unsorted tables for quick comparison |
| Show process flow | Flowchart or Sankey | Overly dense dashboard tiles |
| Show uncertainty | Error bars, confidence intervals | Hiding uncertainty in exact-looking numbers |
Dashboard and Reporting Checks
| Check | Why it matters |
|---|---|
| Audience is defined | Executives, analysts, operations, and engineers need different detail |
| KPI definitions are documented | Prevents conflicting interpretations |
| Filters are obvious | Users need to know what data is included |
| Time period is clear | Avoids misleading comparisons |
| Units are shown | Currency, count, percent, and rate are different |
| Refresh cadence is visible | Users need to know data freshness |
| Drill-down path exists | Supports diagnostic analysis |
| Accessibility is considered | Color-only signals may exclude some users |
| Action is clear | A dashboard should support decisions, not only display data |
MLOps and AI Operations
| Capability | Purpose | Exam cue |
|---|---|---|
| Version control | Tracks code, data schema, features, and model versions | “Need reproducibility” |
| Experiment tracking | Records parameters, metrics, artifacts | “Compare multiple model runs” |
| Model registry | Stores approved model versions and metadata | “Promote model to production” |
| CI/CD for ML | Automates testing and deployment | “Frequent controlled releases” |
| Feature store | Reuses governed features for training and serving | “Training-serving consistency” |
| Batch inference | Scores data on schedule | “Nightly risk scores” |
| Real-time inference | Scores request immediately | “Approve transaction at checkout” |
| Canary deployment | Releases to small subset first | “Reduce deployment risk” |
| Blue-green deployment | Switches traffic between environments | “Fast rollback” |
| A/B testing | Compares alternatives with users | “Which model performs better in production?” |
| Monitoring | Watches performance, drift, latency, errors | “Model degraded after launch” |
| Retraining pipeline | Updates model with new data | “Performance decline due to new patterns” |
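A canary deployment needs some rule for deciding which requests reach the new model. One possible approach, sketched below, is deterministic hash bucketing; the function name `route`, the user IDs, and the percentages are hypothetical, and production systems usually delegate this to a load balancer or serving platform.

```python
import hashlib

def route(user_id, canary_percent):
    """Return 'canary' or 'stable' for a user, deterministically."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"

users = [f"user-{i}" for i in range(1000)]

# Roughly 10% of users see the new model version.
canary_share = sum(route(u, 10) == "canary" for u in users) / len(users)

# Rolling back is a configuration change: set the percentage to 0.
after_rollback = {route(u, 0) for u in users}
```

Because the bucket depends only on the user ID, a user stays on the same version between requests, and raising the percentage only adds users to the canary group.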
Troubleshooting Decision Table
| Symptom | Likely cause | First checks | Likely response |
|---|---|---|---|
| Model is accurate in training but poor in production | Overfitting, leakage, drift, training-serving skew | Compare train/test/production distributions and features | Fix pipeline, retrain, simplify model |
| Dashboard totals differ from source system | Transformation issue, filter mismatch, refresh delay | Reconcile definitions, timestamps, joins | Correct ETL and KPI definitions |
| Sudden missing data | Pipeline failure or source schema change | Ingestion logs, schema validation, source availability | Repair pipeline and add alerts |
| Many false fraud alerts | Threshold too low or data drift | Confusion matrix, precision, recent data distribution | Adjust threshold, retrain, segment rules |
| Rare events are missed | Class imbalance or recall too low | Recall, PR-AUC, minority class representation | Resampling, class weights, threshold tuning |
| LLM gives unsupported answer | Retrieval failure or hallucination | Retrieved context, prompt, citations | Improve RAG, require source grounding |
| Sensitive data appears in output | Weak filtering or access control | Prompt logs, retrieval permissions, DLP findings | Redact, restrict, audit, update guardrails |
| Model latency too high | Large model, inefficient features, slow retrieval | Inference timing by component | Optimize, cache, batch, use smaller model |
| Users do not trust model | Lack of explainability or poor communication | Documentation, model card, decision rationale | Add explanations and human review |
| Metrics improved but business outcome did not | Wrong success metric | Link model metric to business KPI | Redefine objective and evaluation |
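Several rows above point to threshold tuning as a response. The sketch below shows the tradeoff on a handful of made-up model scores: a low threshold raises recall at the cost of more false alerts, while a high threshold does the opposite.

```python
def precision_recall_at_threshold(scores, labels, threshold):
    """Classify as positive when score >= threshold, then score the result."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical fraud scores and true labels (1 = fraud).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

low_threshold = precision_recall_at_threshold(scores, labels, 0.25)
high_threshold = precision_recall_at_threshold(scores, labels, 0.70)
```

At 0.25 the model reaches recall 0.75 with precision 0.5; at 0.70 precision rises to 1.0 while recall drops to 0.5. Which is preferable depends on the relative cost of false alerts and missed events.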
High-Yield Distinctions
| Do not confuse | Correct distinction |
|---|---|
| Correlation vs causation | Correlation is association; causation requires stronger evidence or experimental design |
| Validation set vs test set | Validation supports tuning; test estimates final generalization |
| Precision vs recall | Precision limits false positives; recall limits false negatives |
| Data drift vs concept drift | Data drift changes inputs; concept drift changes relationship between inputs and target |
| Masking vs encryption | Masking obscures values for display or nonproduction use; encryption makes data unreadable without the key |
| Anonymization vs pseudonymization | Anonymization removes identity linkage; pseudonymization can be re-linked |
| Data lake vs data warehouse | Lake stores raw diverse data; warehouse stores curated structured analytics data |
| OLTP vs OLAP | OLTP supports transactions; OLAP supports analysis |
| Batch vs streaming | Batch processes groups on schedule; streaming processes events continuously |
| Supervised vs unsupervised | Supervised uses labels; unsupervised discovers patterns without labels |
| Regression vs classification | Regression predicts numbers; classification predicts categories |
| RAG vs fine-tuning | RAG adds retrieved knowledge at query time; fine-tuning changes model behavior through training |
| Explainability vs accuracy | More accurate models are not always more interpretable |
| Data quality vs model quality | Poor data can make any model unreliable |
| Governance vs security | Governance defines accountability and policy; security enforces protection controls |
Scenario-Based Exam Cues
| If the question says… | Think… |
|---|---|
| “Need to know what happened last month” | Descriptive analytics |
| “Need to identify why sales dropped” | Diagnostic analytics |
| “Need to estimate future demand” | Predictive analytics or forecasting |
| “Need to recommend best action” | Prescriptive analytics |
| “No labeled outcomes are available” | Unsupervised learning |
| “Target variable is yes/no” | Binary classification |
| “False negatives are dangerous” | Optimize recall |
| “False positives are expensive” | Optimize precision |
| “Classes are highly imbalanced” | Avoid accuracy as sole metric |
| “Data changes over time” | Monitor drift and use time-aware validation |
| “Model uses information unavailable at prediction time” | Data leakage |
| “Need current internal knowledge in LLM answers” | RAG |
| “Need consistent response format” | Prompt template, schema, or fine-tuning |
| “Need auditability and ownership” | Governance, lineage, catalog, logging |
| “Users need only approved data” | Least privilege, RBAC/ABAC, data classification |
| “Data must be protected in a nonproduction environment” | Masking, tokenization, synthetic data, access control |
| “Model is deployed but performance declines” | Monitoring, drift detection, retraining |
| “Need safe release with rollback” | Canary or blue-green deployment |
| “Need to compare two live models” | A/B testing |
| “Need explainable decisions” | Interpretable model, explainability tools, model documentation |
Compact Exam-Day Checklist
- Identify the business objective before selecting a tool, model, or metric.
- Determine whether the data is structured, semi-structured, unstructured, time-series, categorical, or numerical.
- Match analytics type: descriptive, diagnostic, predictive, prescriptive, or generative.
- Validate data quality before trusting analysis or training results.
- Watch for data leakage, especially future information and preprocessing before splitting.
- Select metrics based on error cost: precision, recall, F1, MAE, RMSE, or ranking metrics.
- For LLM scenarios, consider prompting, RAG, fine-tuning, guardrails, and human review.
- For sensitive data, apply classification, least privilege, encryption, masking, retention, and audit logging.
- For production models, think versioning, monitoring, drift, rollback, and retraining.
- Prefer the answer that reduces risk while still meeting the business requirement.
Practical Next Step
After reviewing this Quick Reference, practice with scenario-based DY0-001 questions that force you to choose the best data, analytics, AI, governance, or operational response rather than simply recall definitions.