Free CompTIA DataAI DY0-001 Practice Questions: Mathematics and Statistics

Last revised: July 14, 2026

Practice 10 free CompTIA DataAI DY0-001 questions on Mathematics and Statistics, with answers, explanations, and the IT Mastery next step.

Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.

Try CompTIA DataAI DY0-001 on Web

Topic snapshot

Field	Detail
Practice target	CompTIA DataAI DY0-001
Topic area	Mathematics and Statistics
Blueprint weight	17%
Page purpose	Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Mathematics and Statistics for CompTIA DataAI DY0-001. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass	What to do	What to record
First attempt	Answer without checking the explanation first.	The fact, rule, calculation, or judgment point that controlled your answer.
Review	Read the explanation even when you were correct.	Why the best answer is stronger than the closest distractor.
Repair	Repeat only missed or uncertain items after a short break.	The pattern behind misses, not the answer letter.
Transfer	Return to mixed practice once the topic feels stable.	Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 17% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These are original IT Mastery practice questions aligned to this topic area. They are not official CompTIA questions, copied live-exam content, or exam dumps. Use them to preview question style and explanation depth before continuing with topic drills, mixed sets, and timed mocks in IT Mastery.

Question 1

Topic: Mathematics and Statistics

A subscription platform is building a penalized logistic regression model to predict account churn. The feature monthly_api_calls is available before prediction time and has this training profile: median 1,200; mean 18,700; 95th percentile 75,000; max 2,400,000; skewness +9.6. The highest values are verified enterprise customers, not data-entry errors. The team needs calibrated probabilities and wants simple monthly retraining. Which action is the BEST professional decision?

Options:

A. Standardize the raw feature and keep the original scale.
B. Replace logistic regression with a deep neural network.
C. Apply log1p in the training pipeline and validate calibration.
D. Remove accounts above the 95th percentile before training.

Best answer: C

Explanation: Positive skewness means the distribution has a long right tail: most accounts have much lower API usage than a small number of very high-usage accounts. Because those extreme values are valid, dropping them would discard important business signal. A monotonic transform such as log1p, which computes log(1 + x), compresses the right tail while preserving order, and it stays defined at zero, so accounts with no API calls remain usable. Applying it inside the training pipeline supports reproducible monthly retraining and avoids inconsistent preprocessing. The transformed feature should still be validated against calibration and business metrics, because transformation is a modeling hypothesis, not a guaranteed improvement. Standardization changes location and scale but does not address the asymmetric tail shape.

Deleting high values fails because the extreme accounts are verified enterprise customers, not errors.
Standardizing only fails because z-scaling does not reduce the long right tail or leverage from extreme values.
Using a neural network is overengineering when a simpler calibrated model can be improved with an appropriate feature transform.
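
The contrast between standardizing and log-transforming can be checked numerically. The sketch below is illustrative only: the usage counts are invented, and the skewness helper is a simple moment-based formula that may differ slightly from what a given statistics library reports.

```python
import math

def skewness(values):
    """Moment-based (population) skewness: the mean cubed z-score."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Hypothetical usage counts, invented for illustration (not the question's data).
raw_calls = [0, 150, 400, 900, 1200, 1500, 3000, 8000, 75000, 2400000]

raw_skew = skewness(raw_calls)
z_skew = skewness(standardize(raw_calls))                # same shape as the raw feature
log_calls = [math.log1p(v) for v in raw_calls]           # defined at zero, order preserved
log_skew = skewness(log_calls)                           # right tail compressed

print(f"raw: {raw_skew:.2f}  standardized: {z_skew:.2f}  log1p: {log_skew:.2f}")
```

On this toy sample, standardizing leaves the skewness unchanged while log1p brings it close to zero. Whether that helps calibration on real data still has to be validated.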

Question 2

Topic: Mathematics and Statistics

A team is evaluating an ordinary least squares model to predict insurance claim severity. The model documentation assumes approximately normal residuals with constant variance before using coefficient tests and prediction intervals.

Exhibit: Residual diagnostics

Diagnostic	Result
Residual mean	0.02
Residual skewness	2.9
Q-Q plot	Upper tail far above line
Residuals vs. fitted	Fan-shaped spread
Breusch-Pagan test	p = 0.003

Which modeling concern is best supported by the exhibit?

Options:

A. The main issue is multicollinearity among predictors.
B. The model primarily suffers from class imbalance.
C. The model is invalid because residuals are not centered at zero.
D. Coefficient tests and prediction intervals may be unreliable.

Best answer: D

Explanation: OLS point estimates can still be useful in some settings, but common coefficient tests, standard errors, and prediction intervals rely on residual assumptions such as approximately normal errors and constant variance. The exhibit shows a highly right-skewed residual distribution, a Q-Q tail departure, and a fan-shaped residual plot with a significant Breusch-Pagan test. Together, these indicate non-normality and heteroskedasticity, so inference based on the default OLS assumptions is questionable. A better next step could include transformations, robust standard errors, weighted regression, or a distribution more appropriate for claim severity.

Residual centering is not the issue because the residual mean (0.02) is close to zero.
Class imbalance applies to classification target distribution, not continuous claim severity residual diagnostics.
Multicollinearity would require predictor correlation or variance inflation evidence, which the exhibit does not provide.
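
A fan-shaped residual plot means the residual spread grows with the fitted value. The sketch below is a crude variance-ratio check on made-up numbers. It is not the Breusch-Pagan test itself, and the function name and data are assumptions for illustration.

```python
from statistics import pvariance

def spread_ratio(fitted, residuals):
    """Residual variance for the upper half of fitted values divided by the lower half."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    return pvariance(ordered[-half:]) / pvariance(ordered[:half])

# Hypothetical fitted values and residuals, constructed for illustration.
fitted = list(range(1, 21))
fan_residuals = [0.5 * f * (1 if f % 2 else -1) for f in fitted]   # spread grows with fit
flat_residuals = [2.0 * (1 if f % 2 else -1) for f in fitted]      # constant spread

fan_ratio = spread_ratio(fitted, fan_residuals)     # well above 1
flat_ratio = spread_ratio(fitted, flat_residuals)   # about 1

print(f"fan: {fan_ratio:.2f}  flat: {flat_ratio:.2f}")
```

A ratio far from 1 is only a hint of heteroskedasticity. A formal test or robust standard errors would be the defensible follow-up.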

Question 3

Topic: Mathematics and Statistics

A data science team is comparing two forecasting models for monthly claim severity. Model B will replace Model A only if the improvement is statistically defensible for an executive risk report. The team has 12 paired monthly holdout results and planned a paired t-test on monthly MAE differences, where positive values favor Model B.

Diagnostic	Result
Mean difference	1.8
Median difference	0.3
Skewness	2.1
Shapiro-Wilk p-value	0.02
Notable point	One month: +15.0

Which action is the BEST professional decision before claiming Model B is better?

Options:

A. Check the paired-difference distribution and use a robust paired comparison if needed.
B. Proceed with the paired t-test because the mean difference is positive.
C. Switch to an unpaired t-test using all prediction-level errors.
D. Remove the +15.0 month and rerun the paired t-test.

Best answer: A

Explanation: A paired t-test on model comparison results assumes the paired differences are approximately normally distributed, especially with a small number of pairs. Here, only 12 monthly differences are available, the median is far below the mean, skewness is high, the normality check is significant, and one month dominates the average. Before making an executive claim, the team should inspect the paired-difference distribution, understand the outlier, and use a defensible robust paired method such as a permutation test, Wilcoxon signed-rank test when appropriate, or bootstrap confidence interval. The key is not to reject Model B, but to avoid overstating evidence from a test whose distributional assumptions are doubtful.

Mean-only reasoning fails because a positive mean can be driven by a skewed outlier in a small paired sample.
Dropping the outlier is not justified unless there is a documented data or process error.
Unpaired testing discards the monthly pairing and can misstate uncertainty by treating dependent comparisons as independent.
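
One robust paired option mentioned above, a permutation test, can be sketched with exact sign flips. The twelve differences below are hypothetical values chosen only to match the stated mean (1.8), median (0.3), and the +15.0 month. They are not the team's actual results.

```python
from itertools import product
from statistics import mean, median

# Hypothetical monthly MAE differences (positive favors Model B).
diffs = [-0.9, -0.5, -0.2, 0.1, 0.2, 0.3, 0.3, 0.6, 1.0, 2.2, 3.5, 15.0]

def sign_flip_p_value(differences):
    """Exact one-sided paired permutation test on the mean difference.

    Under the null of no systematic difference, each paired difference is
    equally likely to be positive or negative, so all 2**n sign patterns
    are enumerated and compared with the observed total.
    """
    observed = sum(differences)
    magnitudes = [abs(d) for d in differences]
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(magnitudes)):
        total += 1
        flipped = sum(s * m for s, m in zip(signs, magnitudes))
        if flipped >= observed - 1e-9:   # tolerance for floating-point ties
            hits += 1
    return hits / total

p_value = sign_flip_p_value(diffs)
print(f"mean={mean(diffs):.2f}  median={median(diffs):.2f}  p={p_value:.4f}")
```

With 12 pairs there are only 4,096 sign patterns, so full enumeration is cheap. The statistic is still the mean, so a single large month can dominate it, which is why the distribution should be inspected alongside any p-value.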

Question 4

Topic: Mathematics and Statistics

A marketplace runs a large A/B test on a checkout change. The test shows an absolute conversion lift of 0.08 percentage points with p < 0.001 and a 95% confidence interval of 0.05 to 0.11 percentage points. Finance states that rollout is justified only if the true lift is at least 0.25 percentage points. Which interpretation best aligns the statistical result with the business requirement?

Options:

A. Approve rollout because the p-value is below 0.05
B. Convert the result to a standardized effect size and ignore absolute lift
C. Treat the lift as statistically significant but not practically sufficient
D. Extend the test until the confidence interval excludes zero

Best answer: C

Explanation: Statistical significance answers whether the observed effect is unlikely under a null hypothesis, but it does not prove the effect is large enough to matter operationally. Here, the confidence interval is entirely positive, so the checkout change likely improves conversion. However, the full interval from 0.05 to 0.11 percentage points is below the required 0.25 percentage-point lift. The decision should compare the estimated effect and its uncertainty with the minimum practical business impact, not just with zero. A very large sample can make a tiny effect statistically significant while still failing the financial threshold.

P-value only fails because it ignores whether the effect clears the rollout threshold.
More testing is unnecessary for excluding zero because the current interval already does that.
Standardized effect size may help compare studies, but the business requirement is defined as absolute conversion lift.
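
The decision rule can be written as a small comparison of the confidence interval against both zero and the business threshold. This is a minimal sketch. The function name and return labels are illustrative, not a standard API.

```python
def interpret_lift(ci_low, ci_high, min_practical_lift):
    """Compare a confidence interval for lift with zero and a business threshold."""
    significant = ci_low > 0 or ci_high < 0          # interval excludes zero
    if ci_low >= min_practical_lift:
        practical = "meets threshold"
    elif ci_high < min_practical_lift:
        practical = "below threshold"
    else:
        practical = "inconclusive"
    return significant, practical

# Values from the question, in percentage points.
result = interpret_lift(ci_low=0.05, ci_high=0.11, min_practical_lift=0.25)
print(result)  # (True, 'below threshold')
```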

Question 5

Topic: Mathematics and Statistics

A payments team is comparing binary fraud classifiers before selecting an operating threshold. The fraud-review capacity and false-positive cost will be finalized next month, so stakeholders need a threshold-independent ranking from the same untouched validation set.

Model	ROC AUC	Accuracy at 0.50	Recall at 0.50
Logistic regression	0.86	0.95	0.41
Random forest	0.91	0.94	0.58
Gradient boosting	0.89	0.96	0.52

Which decision is BEST?

Options:

A. Choose a threshold before comparing the models.
B. Rank logistic regression first because it is simpler.
C. Rank random forest first by validation ROC AUC.
D. Rank gradient boosting first by accuracy at 0.50.

Best answer: C

Explanation: ROC AUC is the appropriate ranking metric when the operating threshold is not yet fixed and the goal is to compare classifiers across possible decision thresholds. In this scenario, all models were evaluated on the same untouched validation set, and stakeholders have not finalized the review capacity or false-positive cost. The model with the highest ROC AUC has the strongest aggregate ability to rank fraud cases above non-fraud cases over threshold choices. After ranking, the team should still select an operating threshold using business costs, capacity, and possibly precision-recall behavior. A single default-threshold metric such as accuracy at 0.50 can be misleading when the threshold is arbitrary.

Default threshold trap fails because accuracy at 0.50 reflects one cutoff, not classifier ranking across thresholds.
Simplicity alone fails because model interpretability does not outrank stronger validation evidence for this stated comparison.
Premature cutoff fails because the business threshold inputs are not finalized, so threshold-independent comparison is needed first.
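
To see why a single-cutoff metric can hide ranking quality, the sketch below computes ROC AUC as the share of fraud/non-fraud pairs ranked correctly and compares it with accuracy at 0.50. The labels and scores are invented, and the helper functions are assumptions for illustration rather than any library's implementation.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy_at(labels, scores, threshold=0.50):
    """Share of correct predictions at one fixed cutoff."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical labels (1 = fraud) and scores from two made-up models.
labels  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
model_a = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.40]  # ranks well, scores low
model_b = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.60, 0.10, 0.70]  # one fraud ranked poorly

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, roc_auc(labels, scores), accuracy_at(labels, scores))
```

Both toy models have the same accuracy at 0.50, yet model A ranks every fraud case above every non-fraud case and model B does not. The pairwise loop is quadratic, so it suits small examples only.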

Question 6

Topic: Mathematics and Statistics

A credit-risk team is choosing the split criterion for an interpretable tree-based classifier used as a nightly challenger model. The business goal is stable recall at a fixed false-positive rate, and retraining must finish within a 30-minute batch window.

Validation summary: Gini index and entropy produce recall and AUC within the same confidence interval. Entropy trees are slightly deeper and take 35% longer to train. Which decision is BEST?

Options:

A. Use entropy because it directly optimizes recall.
B. Use Gini index with the existing pruning and validation protocol.
C. Use entropy and remove pruning to preserve information gain.
D. Use Gini index as the final business performance metric.

Best answer: B

Explanation: Gini index and entropy are both impurity measures used to select splits in decision trees. Gini index uses squared class probabilities and is often computationally simpler; entropy uses logarithms and information gain. Neither criterion directly optimizes recall, AUC, or a business threshold. When validation performance is statistically indistinguishable, the professional decision should favor the option that satisfies operational constraints and maintains the existing validation and pruning controls. Here, Gini meets the batch retraining requirement with no demonstrated loss in model quality. The key is not that Gini is universally better, but that the evidence and constraints make it the more defensible choice in this scenario.

Recall confusion fails because entropy is an impurity criterion, not a direct optimizer for recall at a fixed false-positive rate.
Pruning removal increases overfitting risk; information gain does not eliminate the need for complexity control.
Metric mix-up confuses the Gini splitting criterion with final model evaluation metrics such as recall, AUC, or business cost.
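
The two impurity measures can be written directly from their definitions. This is a minimal sketch for a single node's class probabilities. Tree libraries apply the same idea across candidate splits with additional weighting.

```python
import math

def gini(probabilities):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probabilities)

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

for node in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(node, round(gini(node), 3), round(entropy(node), 3))
```

Both measures are zero for a pure node and largest for an even class mix, which is why they often select similar splits. Neither one refers to recall or a false-positive rate.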

Question 7

Topic: Mathematics and Statistics

A data science team is building a nearest-neighbor search to group 2 million customer support tickets. Each ticket is encoded as a high-dimensional sparse TF-IDF vector. Ticket length varies widely, and stakeholders care about similar terminology patterns rather than the number of words in a ticket. Which distance metric consideration best maps to these requirements?

Options:

A. Use cosine distance on normalized sparse vectors
B. Use Hamming distance after binarizing all terms
C. Use Euclidean distance on raw TF-IDF weights
D. Use Mahalanobis distance with the full covariance matrix

Best answer: A

Explanation: Cosine distance is often the right choice for comparing observations represented as high-dimensional sparse text vectors. It compares the angle between vectors, so two tickets with similar term-weight patterns can be close even if one ticket is much longer. This matches the business requirement to group tickets by terminology pattern rather than magnitude. In contrast, magnitude-sensitive metrics can overemphasize document length or total term weight, and covariance-based metrics can be unstable or impractical in very high-dimensional sparse spaces. The key takeaway is to match the metric to what “similar” should mean in the feature space.

Raw Euclidean distance can be dominated by vector magnitude, which conflicts with the requirement to reduce the effect of ticket length.
Full Mahalanobis distance requires reliable covariance estimation, which is usually impractical in a high-dimensional sparse text space.
Binarized Hamming distance discards TF-IDF weight information that helps represent term importance and usage patterns.
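
The effect of vector magnitude can be illustrated with sparse vectors stored as dictionaries. The term weights below are hypothetical: the long ticket repeats the short ticket's pattern at three times the magnitude.

```python
import math

def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def cosine_distance(a, b):
    """One minus cosine similarity; depends on direction, not vector length."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance; grows with differences in magnitude."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

# Hypothetical term weights, invented for illustration.
short_ticket = {"refund": 0.8, "invoice": 0.6}
long_ticket = {"refund": 2.4, "invoice": 1.8}     # same pattern, 3x magnitude
other_ticket = {"password": 0.9, "login": 0.4}    # no shared terms

print(cosine_distance(short_ticket, long_ticket), euclidean_distance(short_ticket, long_ticket))
print(cosine_distance(short_ticket, other_ticket))
```

Cosine distance treats the short and long tickets as essentially identical, while Euclidean distance separates them by magnitude alone. Tickets with no shared terms sit at cosine distance 1.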

Question 8

Topic: Mathematics and Statistics

A data science team is preparing a survival model for time to customer churn. The training extract uses event_time_days and event_observed.

Exhibit: Outcome audit sample

Segment	Evidence in source	Extracted outcome
Cancelled	cancellation date present	time known, event = 1
Still subscribed at cutoff	active on day 365	time = 365, event = 0
Lost to follow-up	last billing signal on day 210	time = 210, event = 0
ETL failure	no status or dates loaded	time blank, event blank

Which interpretation is best supported by the exhibit?

Options:

A. Convert ETL failures to censored observations at day 365.
B. Drop every record with event_observed = 0 because the outcome is missing.
C. Treat active and lost records as right-censored, but investigate ETL failures as missing outcomes.
D. Impute churn dates for all blank event times before fitting the model.

Best answer: C

Explanation: In survival analysis, censoring is not the same as an ordinary missing outcome. A right-censored observation has a known period of observation and no event observed during that period, such as a customer still subscribed at the study cutoff or last observed before follow-up ended. The model can use the follow-up time and an event indicator of 0. By contrast, the ETL-failure records have neither a reliable event status nor a reliable time origin/end point, so they are data-quality missingness that must be corrected, excluded under a justified rule, or handled separately. The key distinction is whether a valid time-at-risk is known.

Imputing churn dates confuses unobserved events with known non-events during follow-up and can bias survival estimates.
Dropping censored records wastes valid survival information and overrepresents customers who churned.
Censoring ETL failures invents a follow-up time not supported by the source evidence.
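
The distinction between censoring and missing outcomes can be expressed as a small audit rule over the two extract fields. This is a sketch under assumptions: the function name and labels are illustrative, and the cancelled customer's 120 days is a made-up value, since the exhibit only says the time is known.

```python
def classify_outcome(event_time_days, event_observed):
    """Label a record for survival modeling based on what is actually known."""
    if event_time_days is None or event_observed is None:
        return "missing outcome"      # no valid time at risk: investigate upstream
    if event_observed == 1:
        return "event"                # churn observed at a known time
    return "right-censored"           # followed for a known time, no churn seen

# Rows mirror the exhibit; None stands for a blank field.
records = {
    "Cancelled": (120, 1),
    "Still subscribed at cutoff": (365, 0),
    "Lost to follow-up": (210, 0),
    "ETL failure": (None, None),
}
labels = {segment: classify_outcome(*row) for segment, row in records.items()}

for segment, label in labels.items():
    print(f"{segment}: {label}")
```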

Question 9

Topic: Mathematics and Statistics

A subscription company wants to know whether customer churn status is associated with the support channel used most often. The dataset contains one row per customer, and both variables are categorical: churned = yes/no and primary_channel = chat/email/phone/self-service. Expected counts in every contingency-table cell are above 5. Which method best maps to this requirement?

Options:

A. Pearson correlation test
B. One-way ANOVA
C. Two-sample t-test
D. Chi-squared test of independence

Best answer: D

Explanation: A chi-squared test of independence is appropriate when the goal is to evaluate whether two categorical variables are related in a population. Here, churn status and primary support channel are both categorical, and the data can be summarized as counts in a contingency table. The expected cell counts are large enough for the usual chi-squared approximation, so the method fits both the data type and the business question. The test does not estimate the size or direction of churn risk by channel; it evaluates whether the observed distribution differs from what would be expected if the variables were independent.

t-test mismatch fails because a t-test compares means of a continuous variable across two groups.
Correlation mismatch fails because Pearson correlation assumes numeric variables and a linear relationship.
ANOVA mismatch fails because ANOVA compares continuous outcomes across categorical groups, not association between two categorical variables.
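
The test statistic compares observed counts with the counts expected under independence. The sketch below uses an invented 2 x 4 contingency table and compares the statistic with the standard 5% critical value for 3 degrees of freedom. A statistics library would normally also report a p-value.

```python
# Hypothetical customer counts by primary_channel (chat, email, phone, self-service).
observed = {
    "churned=yes": [30, 45, 60, 15],
    "churned=no": [170, 155, 140, 185],
}

def chi_squared(table):
    """Pearson chi-squared statistic, degrees of freedom, and expected counts."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

statistic, dof, expected = chi_squared(observed)
CRITICAL_5PCT_DF3 = 7.815   # table value for alpha = 0.05 with 3 degrees of freedom
reject_independence = statistic > CRITICAL_5PCT_DF3

print(f"chi2={statistic:.3f}  dof={dof}  reject={reject_independence}")
```

Rejecting independence says only that churn and channel are associated in this toy table. It does not say which channel drives churn or by how much.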

Question 10

Topic: Mathematics and Statistics

A data scientist is reviewing a batch-scoring failure after a dimensionality-reduction step was added before a linear risk model. Which interpretation of the exhibit best explains the issue?

Exhibit: Scoring pipeline shapes

Object	Meaning	Shape
`X`	standardized feature matrix	`50,000 x 30`
`P`	projection matrix	`30 x 6`
`w`	model weight vector	`6 x 1`
Current scoring	`X * w`	fails

Options:

A. The weight vector should be transposed before scoring.
B. The raw feature matrix should be transposed before scoring.
C. The projection must be applied before multiplying by w.
D. The model needs 30 weights because X has 30 columns.

Best answer: C

Explanation: Matrix operations matter because models often represent data as a feature matrix and parameters as vectors or transformation matrices. Here, X has 30 original features, but the model weights w were learned after projection into a 6-dimensional space. The valid sequence is X * P, producing a 50,000 x 6 transformed design matrix, followed by multiplication by w to produce one score per row. Directly multiplying X * w fails because the inner dimensions, 30 and 6, do not match. The key takeaway is that feature transformations and model weights must be applied in the same matrix space and order used during training.

Transposing weights would make w a 1 x 6 row vector, which still does not align with X.
Transposing features would change rows into columns and would not represent the intended batch of observations.
Using 30 weights ignores that the trained model operates on 6 projected features, not the original 30 features.
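
The shape rule can be checked without any data by tracking dimensions alone. This is a minimal sketch using the exhibit's shapes. The helper name is illustrative, and a real pipeline would let the array library raise the error.

```python
def matmul_shape(left, right):
    """Return the result shape of left @ right, or raise if inner dimensions differ."""
    (rows, inner_left), (inner_right, cols) = left, right
    if inner_left != inner_right:
        raise ValueError(f"cannot multiply {left} by {right}: {inner_left} != {inner_right}")
    return (rows, cols)

X, P, w = (50_000, 30), (30, 6), (6, 1)   # shapes from the exhibit

projected = matmul_shape(X, P)            # (50000, 6)
scores = matmul_shape(projected, w)       # (50000, 1)

try:
    matmul_shape(X, w)                    # the failing call: 30 != 6
    direct_ok = True
except ValueError:
    direct_ok = False

print(projected, scores, direct_ok)
```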

Continue in the web app

Use IT Mastery for interactive CompTIA DataAI DY0-001 practice with mixed sets, timed mocks, topic drills, explanations, and progress tracking.

