Free CompTIA DataAI DY0-001 Practice Questions: Machine Learning

Last revised: July 14, 2026

Practice 10 free CompTIA DataAI DY0-001 questions on Machine Learning, with answers, explanations, and the IT Mastery next step.

Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.

Topic snapshot

Field	Detail
Practice target	CompTIA DataAI DY0-001
Topic area	Machine Learning
Blueprint weight	24%
Page purpose	Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Machine Learning for CompTIA DataAI DY0-001. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass	What to do	What to record
First attempt	Answer without checking the explanation first.	The fact, rule, calculation, or judgment point that controlled your answer.
Review	Read the explanation even when you were correct.	Why the best answer is stronger than the closest distractor.
Repair	Repeat only missed or uncertain items after a short break.	The pattern behind misses, not the answer letter.
Transfer	Return to mixed practice once the topic feels stable.	Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 24% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These are original IT Mastery practice questions aligned to this topic area. They are not official CompTIA questions, copied live-exam content, or exam dumps. Use them to preview question style and explanation depth before continuing with topic drills, mixed sets, and timed mocks in IT Mastery.

Question 1

Topic: Machine Learning

A streaming service analytics team wants to identify viewing behaviors that tend to occur together during the same user session so it can design bundle recommendations. Which method is best supported by the exhibit?

Exhibit: Session event data profile

Field	Example	Notes
`session_id`	S-10492	Groups events in one visit
`event_type`	trailer_view, add_watchlist	Multiple events per session
`content_tag`	sci-fi, documentary	Multiple tags per session
`conversion_label`	not collected	No target outcome available

Options:

A. Association rule mining
B. Logistic regression
C. Linear regression
D. K-nearest neighbors classification

Best answer: A

Explanation: Association rule mining is used to discover items, events, or behaviors that frequently occur together within a shared context, such as a shopping basket or user session. The exhibit shows multiple event types and content tags grouped by session_id, and it explicitly lacks a target label. That supports finding rules such as “sessions with trailer views and sci-fi tags often also include watchlist adds,” typically evaluated with measures such as support, confidence, and lift.

Supervised methods require a labeled outcome to predict, while this use case is about uncovering co-occurrence patterns.

Logistic regression would need a binary target outcome, such as churn or conversion, which the exhibit says is not collected.
KNN classification also requires labeled examples to classify new sessions into known classes.
Linear regression predicts a continuous numeric outcome, not co-occurring session behaviors.
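The support, confidence, and lift measures named in the explanation can be computed directly from session-level data. The following is a minimal sketch in plain Python; the four sessions and the `rule_metrics` helper are invented for illustration and are not part of the exhibit.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(sessions)
    n_a = sum(1 for s in sessions if antecedent <= s)          # sessions containing the antecedent
    n_c = sum(1 for s in sessions if consequent <= s)          # sessions containing the consequent
    n_both = sum(1 for s in sessions if (antecedent | consequent) <= s)
    support = n_both / n                                       # share of sessions with both sides
    confidence = n_both / n_a if n_a else 0.0                  # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0              # confidence relative to base rate
    return support, confidence, lift


# Hypothetical sessions: each set holds the event types and content tags
# observed under one session_id.
sessions = [
    {"trailer_view", "sci-fi", "add_watchlist"},
    {"trailer_view", "sci-fi", "add_watchlist"},
    {"trailer_view", "documentary"},
    {"documentary", "add_watchlist"},
]

support, confidence, lift = rule_metrics(
    sessions, {"trailer_view", "sci-fi"}, {"add_watchlist"}
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the antecedent and consequent co-occur more often than their individual frequencies would predict. No label column is needed at any point, which is why the missing `conversion_label` does not block this method.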

Question 2

Topic: Machine Learning

A fraud analytics team needs to choose a first production classifier for manual investigator triage. A small performance gap is acceptable if each decision can be explained as a readable sequence of feature thresholds.

Exhibit: Validation summary

Candidate	AUC	Diagnostic finding
Regularized logistic regression	0.77	Misses threshold interactions
Shallow decision tree	0.83	Uses readable split paths
Random forest	0.85	Opaque aggregate voting
Gradient-boosted trees	0.86	Opaque additive ensemble

Which candidate best fits the requirement?

Options:

A. Use gradient-boosted trees
B. Use regularized logistic regression
C. Use a shallow decision tree
D. Use a random forest

Best answer: C

Explanation: A decision tree is the best fit when the scenario values both interpretability and non-linear split behavior. The exhibit shows that logistic regression is more transparent than the ensembles, but it misses the threshold interactions that matter for the fraud patterns. The random forest and gradient-boosted models have slightly higher AUC, but their ensemble mechanisms make individual decisions harder to explain as a simple readable path. A shallow tree gives investigators a sequence of feature thresholds that can be reviewed and challenged while still modeling non-linear decision boundaries. The key trade-off is accepting a small metric gap to satisfy the explanation requirement.

Regularized logistic regression is interpretable, but it does not capture the threshold interactions shown in the diagnostics.
The random forest’s accuracy is tempting, but aggregate voting does not provide the required simple decision path.
Gradient-boosted trees have the highest AUC, but the additive ensemble is less directly interpretable for manual triage explanations.

Question 3

Topic: Machine Learning

A regional lender needs a default-risk model for a regulated approval workflow. The training set has 2,800 historical loans, 32 mostly clean tabular predictors, and a 12% default rate. Business requirements include audit-friendly explanations, stable quarterly retraining on CPU-only infrastructure, and clear factor directionality for adverse-action review. A pilot multilayer neural network improved cross-validated AUC from 0.781 to 0.786 but showed higher fold-to-fold variance. Which method best maps to these requirements?

Options:

A. A convolutional network over reshaped feature vectors
B. Regularized logistic regression with calibrated probabilities
C. A deeper neural network with dropout and early stopping
D. An autoencoder to learn latent tabular features

Best answer: B

Explanation: For a small, structured tabular problem with strict interpretability and stable retraining requirements, a simpler statistical model is usually preferable unless a complex model delivers a meaningful, validated gain. Regularized logistic regression supports coefficient-based directionality, odds-ratio style explanations, calibration, and reproducible CPU-based retraining. The neural network’s tiny AUC lift is not compelling because it comes with higher validation variance and weaker auditability. Deep learning is most defensible when the data modality or scale justifies representation learning, such as images, text, speech, or very large heterogeneous datasets. Here, the business constraints make unnecessary complexity a liability rather than an advantage.

Dropout and early stopping reduce overfitting risk but do not solve the auditability problem or justify the marginal, unstable AUC gain.
Autoencoder features add representation-learning complexity and make factor directionality harder to explain for adverse-action review.
Reshaped convolution imposes an artificial spatial structure on ordinary tabular predictors and adds avoidable operational risk.
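The coefficient-based directionality and odds-ratio explanations mentioned above can be illustrated with a small sketch. The coefficient values, intercept, and feature names below are hypothetical and do not come from the scenario; a real model would supply its own fitted values.

```python
import math

# Hypothetical fitted coefficients (change in log-odds per one-unit increase).
coefficients = {
    "debt_to_income": 0.40,
    "months_on_book": -0.25,
    "prior_delinquencies": 0.90,
}
intercept = -2.0  # hypothetical


def default_probability(features):
    """Logistic model: sigmoid of the intercept plus the weighted feature sum."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def factor_summary():
    """Map each factor to (direction, odds ratio) for adverse-action review."""
    summary = {}
    for name, beta in coefficients.items():
        direction = "raises risk" if beta > 0 else "lowers risk"
        summary[name] = (direction, math.exp(beta))  # odds ratio = exp(coefficient)
    return summary


for name, (direction, odds_ratio) in factor_summary().items():
    print(f"{name}: {direction}, odds ratio {odds_ratio:.2f}")
```

The sign of each coefficient gives the factor direction, and `exp(coefficient)` gives the multiplicative change in odds for a one-unit increase with the other predictors held fixed. A multilayer network offers no equally direct reading.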

Question 4

Topic: Machine Learning

A hospital analytics team needs to prioritize follow-up calls for patients at discharge. The target is whether a patient is readmitted within 30 days, the positive class is 8%, predictors are mixed tabular EHR variables available at discharge, and stakeholders need calibrated risk scores that can be thresholded as staffing changes. Nonlinear interactions are expected, and nightly batch scoring is acceptable. Which supervised method is the best professional choice?

Options:

A. K-means clustering with risk labels assigned after training
B. Ordinary least squares regression
C. Calibrated gradient-boosted tree classifier
D. Cox proportional hazards survival model

Best answer: C

Explanation: This is a supervised binary classification problem: the label is readmission within 30 days. Because the business needs ranked, thresholdable risk scores rather than only class labels, probability calibration matters. A gradient-boosted tree classifier is a strong fit for mixed tabular healthcare data when nonlinear interactions are expected, and nightly batch scoring reduces latency concerns. Class imbalance should be handled during training and evaluation, but the method still needs to optimize for calibrated binary risk rather than a continuous outcome or unsupervised grouping. The key distinction is matching the binary target and calibrated decision workflow, not simply choosing the most interpretable or most statistically familiar model.

Ordinary least squares is an outcome mismatch: it predicts a continuous response and can produce poorly calibrated values outside probability bounds.
A Cox proportional hazards model is a framing mismatch: the stated business target is fixed-window readmission, not censored time-to-event estimation.
K-means clustering fails because unsupervised grouping does not learn the known readmission label or provide calibrated supervised risk.
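Two parts of this workflow are easy to sketch without any modeling library: checking whether predicted risks are calibrated, and turning risk scores into a call list that changes with staffing. The sketch below assumes the classifier’s predicted probabilities are already available; the patient IDs, probabilities, labels, and both helper functions are hypothetical.

```python
def calibration_table(probs, labels, n_bins=5):
    """Bin predictions into equal-width bins; compare mean predicted risk to observed rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the last bin
        bins[idx].append((p, y))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((idx, len(members), mean_pred, observed))
    return rows


def call_list(patient_ids, probs, capacity):
    """Return the `capacity` highest-risk patients for follow-up calls."""
    ranked = sorted(zip(patient_ids, probs), key=lambda pair: pair[1], reverse=True)
    return [pid for pid, _ in ranked[:capacity]]


# Hypothetical validation data: predicted 30-day readmission risk and actual outcome.
patient_ids = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.70, 0.75, 0.90]
labels = [0, 0, 0, 0, 1, 1, 0, 1]

for idx, count, mean_pred, observed in calibration_table(probs, labels):
    print(f"bin {idx}: n={count} mean predicted={mean_pred:.2f} observed={observed:.2f}")

print(call_list(patient_ids, probs, capacity=3))
```

When mean predicted risk and observed rate diverge within bins, a recalibration step (for example Platt scaling or isotonic regression) is a common remedy. Changing `capacity` is the code equivalent of moving the threshold as staffing changes.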

Question 5

Topic: Machine Learning

A data science team is selecting a model for a regulated loan-default early-warning system. The business goal is stable recall at a fixed false-positive budget, and the dataset has repeated observations per borrower over time. The current comparison tunes 200 gradient-boosting configurations on the final holdout set, tunes 10 logistic-regression configurations with random row-level cross-validation, and reports the best observed holdout score for each model.

Which action is the BEST professional decision before recommending a model?

Options:

A. Average all tuned models into an ensemble before evaluation
B. Increase logistic-regression trials until its holdout score improves
C. Select gradient boosting because it had the highest holdout recall
D. Use borrower-grouped nested cross-validation with a comparable tuning budget

Best answer: D

Explanation: A fair model selection process must separate hyperparameter tuning from final performance estimation and apply a comparable validation protocol across candidates. Here, the final holdout has been reused for tuning, so the reported best score is optimistically biased. Random row-level cross-validation can also leak borrower-specific information because repeated observations from the same borrower may appear in both train and validation folds. A grouped nested cross-validation design addresses both issues: inner folds tune hyperparameters, outer folds estimate generalization, and borrower grouping prevents identity leakage. Comparable search budgets do not require identical grids, but they should give each candidate a defensible opportunity without overfitting the selection process.

The key takeaway is that the model recommendation should be based on an unbiased, like-for-like evaluation, not the best score from uneven holdout probing.

Highest holdout score is unreliable because the holdout was used during tuning, creating selection bias.
Running more holdout trials worsens the same problem by further overfitting model selection to the holdout.
Premature ensembling adds complexity before establishing a fair baseline comparison across candidate models.
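The borrower-grouping idea can be sketched in a few lines of plain Python. This is an illustrative splitter under simple assumptions (round-robin assignment of whole borrowers to folds), not a replacement for a library implementation; the borrower IDs and the `grouped_folds` helper are hypothetical.

```python
from collections import defaultdict


def grouped_folds(group_ids, n_folds):
    """Yield (train_idx, val_idx) so that no group appears on both sides of a split."""
    rows_by_group = defaultdict(list)
    for row, group in enumerate(group_ids):
        rows_by_group[group].append(row)
    # Assign whole groups to folds round-robin, in sorted order for reproducibility.
    fold_rows = [[] for _ in range(n_folds)]
    for i, group in enumerate(sorted(rows_by_group)):
        fold_rows[i % n_folds].extend(rows_by_group[group])
    for k in range(n_folds):
        val = sorted(fold_rows[k])
        train = sorted(r for j, rows in enumerate(fold_rows) if j != k for r in rows)
        yield train, val


# Hypothetical borrower ID for each observation row (repeated observations per borrower).
borrowers = ["B1", "B1", "B2", "B3", "B3", "B3", "B4", "B5", "B5", "B6"]

for train_idx, val_idx in grouped_folds(borrowers, n_folds=3):
    train_groups = {borrowers[i] for i in train_idx}
    val_groups = {borrowers[i] for i in val_idx}
    assert train_groups.isdisjoint(val_groups)  # no borrower leaks across the split

    # Inner loop of a nested design: tune only on splits of the outer training rows.
    # Inner indices are positions within `inner_groups`, not within `borrowers`.
    inner_groups = [borrowers[i] for i in train_idx]
    inner_splits = list(grouped_folds(inner_groups, n_folds=2))
    print(sorted(val_groups), "inner splits:", len(inner_splits))
```

Hyperparameters would be chosen using only the inner splits, and the outer validation rows would be scored once per outer fold. Giving each candidate model a similar number of inner tuning trials keeps the comparison like-for-like.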

Question 6

Topic: Machine Learning

A claims analytics team must choose a production model for fraud triage. The business requires ROC-AUC of at least 0.84, per-claim reason codes for analyst review, weekly retraining by a small MLOps team, and batch scoring overnight.

Candidate	Validation ROC-AUC	Notes
Pruned decision tree	0.78	Easy to explain
Random forest	0.84	Many trees; slower explanations
Gradient-boosted trees	0.87	Supports monotonic constraints and SHAP values
Stacked ensemble	0.89	Combines trees and neural network

Which option best maps to these requirements?

Options:

A. Pruned decision tree with analyst-readable rules
B. Random forest without local explanation artifacts
C. Gradient-boosted trees with constraints and SHAP reason codes
D. Stacked ensemble optimized only for ROC-AUC

Best answer: C

Explanation: The key trade-off is ensemble performance versus interpretability and operational complexity. The pruned tree is easiest to explain, but it fails the required ROC-AUC threshold. The stacked ensemble has the best validation score, but its mixed architecture increases deployment, monitoring, and explanation burden for a small team. Gradient-boosted trees exceed the required ROC-AUC threshold and can support governance needs through monotonic constraints and local explanation methods such as SHAP. Because scoring is overnight batch rather than real-time, the extra explanation computation is more acceptable. The best fit is not the most accurate model in isolation; it is the model that satisfies accuracy, explainability, and operational requirements together.

The pruned decision tree fails because interpretability alone does not meet the minimum ROC-AUC requirement.
The stacked ensemble optimizes accuracy only and ignores the reason-code and small-team operational constraints.
The random forest reaches the ROC-AUC threshold, but the option as stated provides no per-claim explanation artifacts.

Question 7

Topic: Machine Learning

A retailer is training a model to forecast same-day order volume per fulfillment center for staffing. Operations wants a forecast that covers demand on most days, because understaffing causes missed delivery windows and is about four times more costly than overstaffing. The target is continuous, right-skewed, and the validation metric will include empirical coverage of the forecast. Which loss-function consideration is the BEST professional decision?

Options:

A. Train with symmetric mean absolute error
B. Train with mean squared error
C. Train with high-quantile pinball loss
D. Train with binary cross-entropy

Best answer: C

Explanation: The core issue is aligning the loss function with the prediction task and business cost. This is not a request for the conditional mean; operations needs an upper-demand forecast that reduces costly understaffing. A high-quantile pinball loss, such as a 0.8 or 0.9 quantile chosen from the stated service objective, directly trains the model to estimate a conditional quantile rather than an average. Validation should then check empirical coverage and operational cost, not only generic error metrics. Symmetric losses can be useful for typical-value forecasting, but they do not encode the stated asymmetry.

Mean squared error emphasizes large residuals but still targets the conditional mean, not a coverage-oriented upper forecast.
Binary cross-entropy is for classification probabilities, while the target is continuous order volume.
Symmetric absolute error reduces sensitivity to outliers but treats under- and overforecasting equally.
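A short sketch can make the asymmetry concrete. Under the simplifying assumption that costs grow linearly with the size of the miss, the cost-minimizing forecast is the quantile `cost_under / (cost_under + cost_over)`, which for a 4:1 ratio is 0.8. The order volumes below are hypothetical, and `pinball_loss` is an illustrative helper.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss: under-forecasts weigh q, over-forecasts weigh 1 - q."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # diff >= 0 means the forecast was too low (understaffing side).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


# Cost ratio from the scenario: understaffing is about four times as costly as overstaffing.
cost_under, cost_over = 4.0, 1.0
q = cost_under / (cost_under + cost_over)

# Hypothetical daily order volumes for one fulfillment center.
orders = [100, 120, 90, 300, 110]
under_forecast = [o - 10 for o in orders]  # always 10 orders too low
over_forecast = [o + 10 for o in orders]   # always 10 orders too high

loss_under = pinball_loss(orders, under_forecast, q)
loss_over = pinball_loss(orders, over_forecast, q)
print(f"q={q:.1f} under-forecast loss={loss_under:.2f} over-forecast loss={loss_over:.2f}")
```

With `q` at 0.8, missing low by 10 orders is penalized about four times as heavily as missing high by the same amount, which mirrors the stated staffing cost. A symmetric loss would score both forecasts identically.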

Question 8

Topic: Machine Learning

A financial services team is selecting a fraud triage model. Compliance requires investigator-facing reason codes, and the scoring service must keep p95 latency under 75 ms. The current baseline ROC-AUC is 0.78; the target is at least 0.82.

Exhibit: Validation and deployment summary

Model	ROC-AUC	p95 latency	Explanation support	Ops complexity
Single decision tree	0.79	9 ms	native path rules	low
Random forest	0.83	42 ms	local reason codes in batch	medium
Gradient boosting	0.85	118 ms	SHAP job required	high
Stacked ensemble	0.86	210 ms	inconsistent across layers	very high

Which conclusion is best supported by the exhibit?

Options:

A. Select the stacked ensemble because it performs best.
B. Select the single decision tree for maximum interpretability.
C. Select gradient boosting because it has higher ROC-AUC.
D. Select the random forest as the best trade-off.

Best answer: D

Explanation: Ensemble selection should consider performance, interpretability, and operational fit together. The random forest is the only model shown that clears the target ROC-AUC of 0.82, stays under the 75 ms p95 latency limit, and still supports investigator-facing reason codes. It is less transparent than a single decision tree, but the tree fails the performance target. The higher-scoring gradient boosting and stacked ensemble options introduce operational problems that violate latency and explanation requirements. In regulated or investigator-assisted workflows, a small AUC gain is not automatically worth added complexity when deployment constraints are missed.

The single decision tree fails because its native interpretability does not compensate for missing the required ROC-AUC target.
Gradient boosting fails because it exceeds the latency constraint and needs a heavier explanation workflow, despite its higher ROC-AUC.
The stacked ensemble fails because the extra AUC comes with very high latency and weak explanation consistency.
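The elimination logic can be written as a simple constraint filter over the exhibit rows. The sketch below checks only the two numeric requirements (ROC-AUC target and p95 latency limit); explanation support is kept as text because it needs human judgment. The `failed_constraints` helper is an illustrative name.

```python
# Rows copied from the exhibit: (model, ROC-AUC, p95 latency in ms, explanation support).
candidates = [
    ("Single decision tree", 0.79, 9, "native path rules"),
    ("Random forest", 0.83, 42, "local reason codes in batch"),
    ("Gradient boosting", 0.85, 118, "SHAP job required"),
    ("Stacked ensemble", 0.86, 210, "inconsistent across layers"),
]

MIN_AUC = 0.82     # target: at least 0.82
MAX_P95_MS = 75    # requirement: p95 latency under 75 ms


def failed_constraints(auc, latency_ms):
    """List the numeric requirements a candidate misses."""
    failures = []
    if auc < MIN_AUC:
        failures.append("ROC-AUC below target")
    if latency_ms >= MAX_P95_MS:
        failures.append("p95 latency over limit")
    return failures


for name, auc, latency_ms, explanation in candidates:
    failures = failed_constraints(auc, latency_ms)
    status = "eligible" if not failures else "; ".join(failures)
    print(f"{name}: {status} (explanations: {explanation})")

eligible = [name for name, auc, latency_ms, _ in candidates
            if not failed_constraints(auc, latency_ms)]
print(eligible)  # ['Random forest']
```

Filtering on hard constraints first, then comparing the survivors, avoids being anchored by the highest ROC-AUC in the table.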

Question 9

Topic: Machine Learning

A healthcare network is building a triage model for routing patients to outreach programs. Clinicians require a small set of human-readable rules they can review, and EDA shows that risk changes sharply at different thresholds for age, recent visits, and lab values. Which model family best maps to these requirements?

Options:

A. k-nearest neighbors
B. Decision tree
C. Gradient-boosted trees
D. Logistic regression

Best answer: B

Explanation: A decision tree is the best fit when the business requirement emphasizes transparent, reviewable decision rules and the data suggests non-linear split points. Trees recursively partition the feature space, so they can capture threshold effects such as different risk above a lab-value cutoff or within an age band without requiring manual interaction terms. The resulting path from root to leaf is also easier for clinicians or compliance reviewers to inspect than distance-based or ensemble behavior. A boosted tree model may improve predictive performance, but it sacrifices the simple single-tree rule structure requested in the scenario.

Linear interpretability makes logistic regression appealing, but it does not naturally capture sharp non-linear splits without additional feature engineering.
Local similarity lets k-nearest neighbors model non-linear boundaries, but it does not produce stable, human-readable rules.
Ensemble accuracy makes gradient-boosted trees powerful, but many sequential trees are less directly interpretable than a single decision tree.
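The way a tree finds a threshold can be sketched with a single split chosen by Gini impurity, which is one common splitting criterion. The lab values and risk flags below are hypothetical, and `best_threshold` is an illustrative helper that searches one feature only.

```python
def gini(labels):
    """Gini impurity for a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        left = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical lab values and high-risk flags with a sharp change near 6.
lab_values = [4.1, 4.8, 5.2, 5.9, 6.3, 7.0, 7.4, 8.1]
high_risk = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(lab_values, high_risk)
print(f"rule: lab_value > {threshold:.1f} -> high risk (weighted impurity {impurity:.2f})")
```

A full decision tree repeats this search across features and recursively within each branch, so the final model is a readable chain of rules of this form.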

Question 10

Topic: Machine Learning

A facilities analytics team needs a model to predict monthly maintenance cost for each production line and explain how cost changes with operating conditions. Which supervised learning method is best supported by the exhibit?

Exhibit: EDA and requirements summary

Item	Finding
Target	Monthly cost in USD
Predictors	Run hours, machine age, load, temperature
EDA pattern	Mostly linear trends with cost
Stakeholder need	Explain marginal effect of each predictor

Options:

A. Multiple linear regression
B. Logistic regression
C. K-means clustering
D. Random forest classifier

Best answer: A

Explanation: Multiple linear regression fits the scenario because the target is a continuous numeric outcome: monthly maintenance cost. The exhibit also states that predictor relationships are mostly linear and stakeholders need to explain the marginal effect of each predictor. A linear regression coefficient can be interpreted as the expected change in cost for a one-unit change in a predictor, holding other predictors constant. That directly supports both prediction and explainability. A more complex model could be considered later if residual analysis shows nonlinearity or interactions, but the provided evidence supports starting with an interpretable linear model.

Binary outcome mismatch makes logistic regression unsuitable because the target is not a class or probability.
Classifier mismatch makes a random forest classifier unsuitable because the task is numeric cost prediction, not class assignment.
Unsupervised mismatch makes K-means unsuitable because the team has a labeled target to predict.
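The marginal-effect reading of a linear regression coefficient can be checked numerically. The intercept, coefficients, and production-line values below are hypothetical stand-ins for a fitted model; only the predictor names follow the exhibit.

```python
# Hypothetical fitted multiple linear regression for monthly maintenance cost (USD).
INTERCEPT = 1200.0
COEFFICIENTS = {
    "run_hours": 3.5,           # USD per additional run hour
    "machine_age_years": 80.0,  # USD per additional year of age
    "load_pct": 12.0,           # USD per additional percentage point of load
    "temperature_c": 6.0,       # USD per additional degree Celsius
}


def predict_cost(line):
    """Predicted monthly cost: intercept plus the sum of coefficient * predictor value."""
    return INTERCEPT + sum(COEFFICIENTS[name] * line[name] for name in COEFFICIENTS)


# Hypothetical production line.
baseline = {"run_hours": 400, "machine_age_years": 5, "load_pct": 70, "temperature_c": 30}
one_more_hour = dict(baseline, run_hours=baseline["run_hours"] + 1)

print(predict_cost(baseline))                                # 4020.0
print(predict_cost(one_more_hour) - predict_cost(baseline))  # 3.5
```

Raising one predictor by one unit while holding the others fixed changes the prediction by exactly that predictor’s coefficient, which is the explanation stakeholders asked for.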

Continue in the web app

Use IT Mastery for interactive CompTIA DataAI DY0-001 practice with mixed sets, timed mocks, topic drills, explanations, and progress tracking.
