Try 10 focused CompTIA DataAI DY0-001 questions on Mathematics and Statistics, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | CompTIA DataAI DY0-001 |
| Topic area | Mathematics and Statistics |
| Blueprint weight | 17% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Mathematics and Statistics for CompTIA DataAI DY0-001. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 17% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These original IT Mastery practice questions are aligned to this topic area. Use them for self-assessment, scope review, and deciding what to drill next.
Topic: Mathematics and Statistics
A subscription platform is building a penalized logistic regression model to predict account churn. The feature monthly_api_calls is available before prediction time and has this training profile: median 1,200; mean 18,700; 95th percentile 75,000; max 2,400,000; skewness +9.6. The highest values are verified enterprise customers, not data-entry errors. The team needs calibrated probabilities and wants simple monthly retraining. Which action is the BEST professional decision?
Options:
A. Standardize the raw feature and keep the original scale.
B. Replace logistic regression with a deep neural network.
C. Apply log1p in the training pipeline and validate calibration.
D. Remove accounts above the 95th percentile before training.
Best answer: C
Explanation: Positive skewness means the distribution has a long right tail: most accounts have much lower API usage than a small number of very high-usage accounts. Because those extreme values are valid, dropping them would discard important business signal. A monotonic transform such as log1p compresses the right tail while preserving order, and it stays defined at zero, so zero or low counts remain usable. Applying it inside the training pipeline supports reproducible monthly retraining and avoids inconsistent preprocessing. The transformed feature should still be validated against calibration and business metrics, because transformation is a modeling hypothesis, not a guaranteed improvement. Standardization changes location and scale but does not address the asymmetric tail shape.
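As a rough illustration of the tail compression described above, the sketch below compares skewness before and after log1p. The `api_calls` values are hypothetical stand-ins, not the training data from the question, and `skewness` is a small helper written for this example.

```python
import math

def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical monthly_api_calls values: mostly small accounts, a few very large ones.
api_calls = [0, 300, 800, 1_000, 1_200, 1_500, 2_500, 9_000, 75_000, 2_400_000]

# log1p(x) = log(1 + x), so an account with zero calls maps to 0.0 instead of failing.
logged = [math.log1p(v) for v in api_calls]

raw_skew = skewness(api_calls)
log_skew = skewness(logged)
print(f"raw skewness:   {raw_skew:.2f}")
print(f"log1p skewness: {log_skew:.2f}")

# The transform is monotonic: the input list is ascending, and so is the output.
order_preserved = logged == sorted(logged)
print("order preserved:", order_preserved)
```

In a real pipeline the same transform would be fitted and applied as one preprocessing step, so that training and monthly scoring cannot drift apart.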
Topic: Mathematics and Statistics
A team is evaluating an ordinary least squares model to predict insurance claim severity. The model documentation assumes approximately normal residuals with constant variance before using coefficient tests and prediction intervals.
Exhibit: Residual diagnostics
| Diagnostic | Result |
|---|---|
| Residual mean | 0.02 |
| Residual skewness | 2.9 |
| Q-Q plot | Upper tail far above line |
| Residuals vs. fitted | Fan-shaped spread |
| Breusch-Pagan test | p = 0.003 |
Which modeling concern is best supported by the exhibit?
Options:
A. The main issue is multicollinearity among predictors.
B. The model primarily suffers from class imbalance.
C. The model is invalid because residuals are not centered at zero.
D. Coefficient tests and prediction intervals may be unreliable.
Best answer: D
Explanation: OLS point estimates can still be useful in some settings, but common coefficient tests, standard errors, and prediction intervals rely on residual assumptions such as approximately normal errors and constant variance. The exhibit shows a highly right-skewed residual distribution, a Q-Q tail departure, and a fan-shaped residual plot with a significant Breusch-Pagan test. Together, these indicate non-normality and heteroskedasticity, so inference based on the default OLS assumptions is questionable. A better next step could include transformations, robust standard errors, weighted regression, or a distribution more appropriate for claim severity.
Topic: Mathematics and Statistics
A data science team is comparing two forecasting models for monthly claim severity. Model B will replace Model A only if the improvement is statistically defensible for an executive risk report. The team has 12 paired monthly holdout results and planned a paired t-test on monthly MAE differences, where positive values favor Model B.
| Diagnostic | Result |
|---|---|
| Mean difference | 1.8 |
| Median difference | 0.3 |
| Skewness | 2.1 |
| Shapiro-Wilk p-value | 0.02 |
| Notable point | One month: +15.0 |
Which action is the BEST professional decision before claiming Model B is better?
Options:
A. Check the paired-difference distribution and use a robust paired comparison if needed.
B. Proceed with the paired t-test because the mean difference is positive.
C. Switch to an unpaired t-test using all prediction-level errors.
D. Remove the +15.0 month and rerun the paired t-test.
Best answer: A
Explanation: A paired t-test assumes the paired differences are approximately normally distributed, and that assumption matters most when the number of pairs is small. Here, only 12 monthly differences are available, the median (0.3) is far below the mean (1.8), skewness is high, the Shapiro-Wilk test rejects normality (p = 0.02), and one month dominates the average. Before making an executive claim, the team should inspect the paired-difference distribution, understand the outlier, and use a defensible robust paired method such as a permutation test, a Wilcoxon signed-rank test when appropriate, or a bootstrap confidence interval. The key is not to reject Model B, but to avoid overstating evidence from a test whose distributional assumptions are doubtful.
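One of the robust options named above, a sign-flip permutation test, can be sketched with the standard library. The `diffs` values are hypothetical: they were chosen only to reproduce the stated mean of 1.8, median of 0.3, and the single +15.0 month, and they are not the team's actual holdout results.

```python
from itertools import product
from statistics import mean, median

# Hypothetical monthly MAE differences (positive favors Model B).
diffs = [-0.4, -0.2, 0.0, 0.1, 0.2, 0.3, 0.3, 0.5, 0.8, 1.0, 4.0, 15.0]

def sign_flip_p_value(differences):
    """Exact one-sided sign-flip permutation test for a positive mean difference.

    Under the null hypothesis that the differences are symmetric around zero,
    every pattern of signs is equally likely, so all 2**n patterns are enumerated.
    """
    magnitudes = [abs(d) for d in differences]
    observed = sum(differences)
    total = 0
    at_least_as_extreme = 0
    for signs in product((1, -1), repeat=len(magnitudes)):
        permuted = sum(s * m for s, m in zip(signs, magnitudes))
        total += 1
        # Small tolerance guards against floating-point noise at the boundary.
        if permuted >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total

observed_mean = mean(diffs)
observed_median = median(diffs)
p_value = sign_flip_p_value(diffs)
print(f"mean = {observed_mean:.2f}, median = {observed_median:.2f}")
print(f"one-sided sign-flip p-value = {p_value:.4f}")
```

With 12 pairs there are only 4,096 sign patterns, so exact enumeration is cheap. The test still uses the mean, so the +15.0 month should be understood rather than ignored.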
Topic: Mathematics and Statistics
A marketplace runs a large A/B test on a checkout change. The test shows an absolute conversion lift of 0.08 percentage points with p < 0.001 and a 95% confidence interval of 0.05 to 0.11 percentage points. Finance states that rollout is justified only if the true lift is at least 0.25 percentage points. Which interpretation best aligns the statistical result with the business requirement?
Options:
A. Approve rollout because the p-value is below 0.05
B. Convert the result to a standardized effect size and ignore absolute lift
C. Treat the lift as statistically significant but not practically sufficient
D. Extend the test until the confidence interval excludes zero
Best answer: C
Explanation: Statistical significance answers whether the observed effect is unlikely under a null hypothesis, but it does not prove the effect is large enough to matter operationally. Here, the confidence interval is entirely positive, so the checkout change likely improves conversion. However, the full interval from 0.05 to 0.11 percentage points is below the required 0.25 percentage-point lift. The decision should compare the estimated effect and its uncertainty with the minimum practical business impact, not just with zero. A very large sample can make a tiny effect statistically significant while still failing the financial threshold.
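The comparison described above, interval versus practical threshold rather than interval versus zero, can be written as a small decision helper. The function name `interpret_lift` and its return labels are illustrative assumptions, not part of any library.

```python
def interpret_lift(ci_low, ci_high, min_practical_lift):
    """Classify an A/B result from its confidence interval.

    All arguments use the same unit (here, percentage points).
    """
    # Statistical significance: the interval excludes zero.
    if ci_low <= 0 <= ci_high:
        return "not statistically significant"
    # Practical sufficiency: the whole interval clears the business threshold.
    if ci_low >= min_practical_lift:
        return "significant and practically sufficient"
    # The whole interval sits below the threshold (this includes negative lifts).
    if ci_high < min_practical_lift:
        return "significant but not practically sufficient"
    # The interval straddles the threshold, so more data may be needed.
    return "significant, practical value uncertain"

decision = interpret_lift(ci_low=0.05, ci_high=0.11, min_practical_lift=0.25)
print(decision)  # significant but not practically sufficient
```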
Topic: Mathematics and Statistics
A payments team is comparing binary fraud classifiers before selecting an operating threshold. The fraud-review capacity and false-positive cost will be finalized next month, so stakeholders need a threshold-independent ranking from the same untouched validation set.
| Model | ROC AUC | Accuracy at 0.50 | Recall at 0.50 |
|---|---|---|---|
| Logistic regression | 0.86 | 0.95 | 0.41 |
| Random forest | 0.91 | 0.94 | 0.58 |
| Gradient boosting | 0.89 | 0.96 | 0.52 |
Which decision is BEST?
Options:
A. Choose a threshold before comparing the models.
B. Rank logistic regression first because it is simpler.
C. Rank random forest first by validation ROC AUC.
D. Rank gradient boosting first by accuracy at 0.50.
Best answer: C
Explanation: ROC AUC is the appropriate ranking metric when the operating threshold is not yet fixed and the goal is to compare classifiers across possible decision thresholds. In this scenario, all models were evaluated on the same untouched validation set, and stakeholders have not finalized the review capacity or false-positive cost. The model with the highest ROC AUC has the strongest aggregate ability to rank fraud cases above non-fraud cases over threshold choices. After ranking, the team should still select an operating threshold using business costs, capacity, and possibly precision-recall behavior. A single default-threshold metric such as accuracy at 0.50 can be misleading when the threshold is arbitrary.
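To make the "ranks fraud above non-fraud" reading concrete, the sketch below computes ROC AUC as the fraction of positive and negative pairs that are ordered correctly. The labels and scores are a toy example, and the quadratic loop is for illustration only, not for production-sized validation sets.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc = roc_auc(labels, scores)
print(auc)  # 0.75

# Halving every score changes which cases cross a 0.50 threshold,
# but it does not change the ranking, so the AUC is the same.
rescaled_auc = roc_auc(labels, [s / 2 for s in scores])
print(rescaled_auc == auc)
```

No threshold appears anywhere in the calculation, which is why the metric suits a comparison made before capacity and cost are finalized.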
Topic: Mathematics and Statistics
A credit-risk team is choosing the split criterion for an interpretable tree-based classifier used as a nightly challenger model. The business goal is stable recall at a fixed false-positive rate, and retraining must finish within a 30-minute batch window.
Validation summary: Gini index and entropy produce recall and AUC within the same confidence interval. Entropy trees are slightly deeper and take 35% longer to train. Which decision is BEST?
Options:
A. Use entropy because it directly optimizes recall.
B. Use Gini index with the existing pruning and validation protocol.
C. Use entropy and remove pruning to preserve information gain.
D. Use Gini index as the final business performance metric.
Best answer: B
Explanation: Gini index and entropy are both impurity measures used to select splits in decision trees. Gini index uses squared class probabilities and is often computationally simpler; entropy uses logarithms and information gain. Neither criterion directly optimizes recall, AUC, or a business threshold. When validation performance is statistically indistinguishable, the professional decision should favor the option that satisfies operational constraints and maintains the existing validation and pruning controls. Here, Gini meets the batch retraining requirement with no demonstrated loss in model quality. The key is not that Gini is universally better, but that the evidence and constraints make it the more defensible choice in this scenario.
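The two impurity measures mentioned above are short enough to compute directly. This is a minimal sketch of the standard formulas for a single node; the class-probability lists are made-up examples.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Balanced, skewed, and pure nodes for a two-class problem.
for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, "gini =", round(gini(probs), 4), "entropy =", round(entropy(probs), 4))
```

Both measures are largest for the balanced node and zero for the pure node, which is why they usually select similar splits. Only the entropy version needs a logarithm per class.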
Topic: Mathematics and Statistics
A data science team is building a nearest-neighbor search to group 2 million customer support tickets. Each ticket is encoded as a high-dimensional sparse TF-IDF vector. Ticket length varies widely, and stakeholders care about similar terminology patterns rather than the number of words in a ticket. Which distance metric consideration best maps to these requirements?
Options:
A. Use cosine distance on normalized sparse vectors
B. Use Hamming distance after binarizing all terms
C. Use Euclidean distance on raw TF-IDF counts
D. Use Mahalanobis distance with the full covariance matrix
Best answer: A
Explanation: Cosine distance is often the right choice for comparing observations represented as high-dimensional sparse text vectors. It compares the angle between vectors, so two tickets with similar term-weight patterns can be close even if one ticket is much longer. This matches the business requirement to group tickets by terminology pattern rather than magnitude. In contrast, magnitude-sensitive metrics can overemphasize document length or total term weight, and covariance-based metrics can be unstable or impractical in very high-dimensional sparse spaces. The key takeaway is to match the metric to what “similar” should mean in the feature space.
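A small sketch of the length-invariance argument: sparse vectors are stored as term-to-weight dictionaries, and one ticket is a scaled copy of another. The ticket contents and weights are invented for illustration.

```python
import math

def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def cosine_distance(a, b):
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

short_ticket = {"password": 0.6, "reset": 0.8}
# Same terminology pattern, five times the total weight (a much longer ticket).
long_ticket = {term: 5 * w for term, w in short_ticket.items()}
other_ticket = {"invoice": 0.7, "refund": 0.7}

print("cosine, same pattern:     ", cosine_distance(short_ticket, long_ticket))
print("euclidean, same pattern:  ", euclidean_distance(short_ticket, long_ticket))
print("cosine, different pattern:", cosine_distance(short_ticket, other_ticket))
```

The scaled ticket is essentially at cosine distance zero from the original while being far away in Euclidean terms, and the ticket with no shared terms sits at cosine distance 1.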
Topic: Mathematics and Statistics
A data science team is preparing a survival model for time to customer churn. The training extract uses event_time_days and event_observed.
Exhibit: Outcome audit sample
| Segment | Evidence in source | Extracted outcome |
|---|---|---|
| Cancelled | cancellation date present | time known, event = 1 |
| Still subscribed at cutoff | active on day 365 | time = 365, event = 0 |
| Lost to follow-up | last billing signal on day 210 | time = 210, event = 0 |
| ETL failure | no status or dates loaded | time blank, event blank |
Which interpretation is best supported by the exhibit?
Options:
A. Convert ETL failures to censored observations at day 365.
B. Drop every record with event_observed = 0 because the outcome is missing.
C. Treat active and lost records as right-censored, but investigate ETL failures as missing outcomes.
D. Impute churn dates for all blank event times before fitting the model.
Best answer: C
Explanation: In survival analysis, censoring is not the same as an ordinary missing outcome. A right-censored observation has a known period of observation and no event observed during that period, such as a customer still subscribed at the study cutoff or last observed before follow-up ended. The model can use the follow-up time and an event indicator of 0. By contrast, the ETL-failure records have neither a reliable event status nor a reliable time origin/end point, so they are data-quality missingness that must be corrected, excluded under a justified rule, or handled separately. The key distinction is whether a valid time-at-risk is known.
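The distinction above can be expressed as an encoding rule. The sketch assumes hypothetical source fields `cancel_day` and `last_seen_day` and a 365-day cutoff; real extracts will differ, and the day-120 cancellation is an invented value.

```python
def encode_outcome(record, cutoff_day=365):
    """Return (event_time_days, event_observed), or None for a missing outcome.

    `record` is a dict with hypothetical keys 'cancel_day' and 'last_seen_day',
    each an integer day number or None.
    """
    cancel_day = record.get("cancel_day")
    last_seen_day = record.get("last_seen_day")
    if cancel_day is not None:
        return (cancel_day, 1)  # event observed
    if last_seen_day is not None:
        # Right-censored: time at risk is known, the event was not observed.
        return (min(last_seen_day, cutoff_day), 0)
    # No usable time at risk: a data-quality problem, not a censored row.
    return None

records = {
    "cancelled": {"cancel_day": 120, "last_seen_day": 120},
    "still_subscribed": {"cancel_day": None, "last_seen_day": 365},
    "lost_to_follow_up": {"cancel_day": None, "last_seen_day": 210},
    "etl_failure": {"cancel_day": None, "last_seen_day": None},
}

encoded = {name: encode_outcome(rec) for name, rec in records.items()}
for name, outcome in encoded.items():
    print(name, "->", outcome if outcome is not None else "missing outcome, investigate")
```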
Topic: Mathematics and Statistics
A subscription company wants to know whether customer churn status is associated with the support channel used most often. The dataset contains one row per customer, and both variables are categorical: churned = yes/no and primary_channel = chat/email/phone/self-service. Expected counts in every contingency-table cell are above 5. Which method best maps to this requirement?
Options:
A. Pearson correlation test
B. One-way ANOVA
C. Two-sample t-test
D. Chi-squared test of independence
Best answer: D
Explanation: A chi-squared test of independence is appropriate when the goal is to evaluate whether two categorical variables are related in a population. Here, churn status and primary support channel are both categorical, and the data can be summarized as counts in a contingency table. The expected cell counts are large enough for the usual chi-squared approximation, so the method fits both the data type and the business question. The test does not estimate the size or direction of churn risk by channel; it evaluates whether the observed distribution differs from what would be expected if the variables were independent.
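The "observed versus expected under independence" comparison can be worked through by hand. The counts below are hypothetical, and the sketch stops at the test statistic and degrees of freedom rather than computing a p-value.

```python
# Hypothetical counts. Rows: churned = yes, no.
# Columns: chat, email, phone, self-service.
observed = [
    [30, 45, 60, 25],
    [170, 155, 140, 175],
]

def chi_squared_independence(table):
    """Return (statistic, degrees_of_freedom, expected_counts) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

statistic, dof, expected = chi_squared_independence(observed)
print(f"chi-squared = {statistic:.4f}, degrees of freedom = {dof}")
print("smallest expected count:", min(min(row) for row in expected))
```

For 3 degrees of freedom the 5% critical value is about 7.815, so a statistic well above that would lead to rejecting independence for these made-up counts.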
Topic: Mathematics and Statistics
A data scientist is reviewing a batch-scoring failure after a dimensionality-reduction step was added before a linear risk model. Which interpretation of the exhibit best explains the issue?
Exhibit: Scoring pipeline shapes
| Object | Meaning | Shape |
|---|---|---|
| X | standardized feature matrix | 50,000 x 30 |
| P | projection matrix | 30 x 6 |
| w | model weight vector | 6 x 1 |
| Current scoring | X * w | fails |
Options:
A. The weight vector should be transposed before scoring.
B. The raw feature matrix should be transposed before scoring.
C. The projection must be applied before multiplying by w.
D. The model needs 30 weights because X has 30 columns.
Best answer: C
Explanation: Matrix operations matter because models often represent data as a feature matrix and parameters as vectors or transformation matrices. Here, X has 30 original features, but the model weights w were learned after projection into a 6-dimensional space. The valid sequence is X * P, producing a 50,000 x 6 transformed design matrix, followed by multiplication by w to produce one score per row. Directly multiplying X * w fails because the inner dimensions, 30 and 6, do not match. The key takeaway is that feature transformations and model weights must be applied in the same matrix space and order used during training.
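The shape rule can be checked with a tiny pure-Python matrix product. The sketch uses 4 rows instead of 50,000 to stay small; the column counts match the exhibit, and the numeric values are arbitrary placeholders.

```python
def shape(m):
    return (len(m), len(m[0]))

def matmul(a, b):
    """Multiply two matrices stored as lists of rows, checking inner dimensions."""
    (n, k), (k2, m) = shape(a), shape(b)
    if k != k2:
        raise ValueError(f"inner dimensions do not match: {n} x {k} times {k2} x {m}")
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)] for i in range(n)]

X = [[1.0] * 30 for _ in range(4)]   # 4 x 30 stand-in for the 50,000 x 30 matrix
P = [[0.1] * 6 for _ in range(30)]   # 30 x 6 projection
w = [[1.0] for _ in range(6)]        # 6 x 1 weights learned in the projected space

def try_multiply(a, b):
    """Return the error message if the product is undefined, else None."""
    try:
        matmul(a, b)
        return None
    except ValueError as exc:
        return str(exc)

direct_error = try_multiply(X, w)                 # 30 vs 6: fails
w_transposed = [list(col) for col in zip(*w)]     # 1 x 6
transposed_error = try_multiply(X, w_transposed)  # 30 vs 1: still fails

projected = matmul(X, P)        # 4 x 6
scores = matmul(projected, w)   # 4 x 1, one score per row
print(direct_error)
print(transposed_error)
print(shape(projected), shape(scores))
```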
Transposing w makes w a 1 x 6 row vector, which still does not align with X.
Use the CompTIA DataAI DY0-001 Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.