Free CompTIA DataAI DY0-001 Practice Exam: CompTIA DataAI

Last revised: July 14, 2026

Try 90 free CompTIA DataAI DY0-001 practice questions across the exam domains, with explanations, then continue with IT Mastery practice.

This free full-length CompTIA DataAI DY0-001 practice exam includes 90 original IT Mastery questions across the exam domains.

These are original IT Mastery practice questions. They are not official CompTIA questions, copied live-exam content, or exam dumps. Use them to preview question style and explanation depth before continuing with mixed sets, topic drills, and timed mocks in IT Mastery.

Count note: this page uses the full-length practice count maintained in the IT Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.

Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.


Exam snapshot

Practice target: CompTIA DataAI DY0-001
Practice-set question count: 90
Time limit: 165 minutes
Practice style: mixed-domain diagnostic run with answer explanations

Full-length exam mix

Domain	Weight
Mathematics and Statistics	17%
Modeling, Analysis, and Outcomes	24%
Machine Learning	24%
Operations and Processes	22%
Specialized Applications of Data Science	13%

Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and interactive practice.

Practice questions

Questions 1-25

Question 1

Topic: Operations and Processes

A data science team must support audit requests for a credit-risk model. For any reported metric or deployed model, auditors need to identify the exact training code, input data snapshot, model artifact, and evaluation results used at that time. Which version-control practice best meets this requirement?

Options:

A. Create a branch for each analyst’s notebook changes
B. Store only the final model file in a shared release folder
C. Keep the latest training dataset under the same filename
D. Commit an experiment manifest linking code, data, model, and result versions

Best answer: D

Explanation: Traceability requires more than saving code or artifacts separately. A robust practice records immutable identifiers across the full experiment lineage: source code commit, data snapshot or hash, model artifact version, configuration, and evaluation output. Checking that manifest into version control or attaching it to a versioned release makes the relationship reviewable and reproducible. This supports auditability because a metric or deployed model can be traced back to the exact inputs and process that produced it.

Saving only a model file or notebook history preserves part of the work, but it does not prove which data and results belonged to that run.

Model-only storage misses the training data, code version, configuration, and evaluation evidence needed for audit traceability.
Mutable dataset filename creates ambiguity because the same name can refer to different content over time.
Notebook branching tracks analyst edits but does not automatically link those edits to data snapshots, artifacts, or metrics.
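
As an illustration of the manifest idea, here is a minimal sketch that ties one training run to content hashes. The function names, field names, and placeholder values are hypothetical and not part of any specific tool; a real team would typically rely on its own experiment-tracking or data-versioning system.

```python
import hashlib
import json


def sha256_bytes(payload: bytes) -> str:
    """Return a hex SHA-256 digest that identifies content, not a filename."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(code_commit, data_bytes, model_bytes, config, metrics):
    """Link the code, data, model, config, and results of a single run."""
    return {
        "code_commit": code_commit,
        "data_sha256": sha256_bytes(data_bytes),
        "model_sha256": sha256_bytes(model_bytes),
        "config": config,
        "metrics": metrics,
    }


# All values below are placeholders for illustration only.
manifest = build_manifest(
    code_commit="0000000",                       # hypothetical commit id
    data_bytes=b"applicant_id,label\n1,0\n2,1\n",  # stands in for a data snapshot
    model_bytes=b"serialized-model-placeholder",   # stands in for a model artifact
    config={"max_depth": 4},
    metrics={"auc": 0.5},                          # placeholder, not a real result
)

# Sorted keys give a stable file that diffs cleanly under version control.
manifest_json = json.dumps(manifest, sort_keys=True, indent=2)
print(manifest_json)
```

Because the identifiers are content hashes, changing a single byte of the data snapshot produces a different manifest, which is what removes the ambiguity of a reused filename.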

Question 2

Topic: Operations and Processes

A data science team is preparing to deploy a credit-risk model into a regulated loan-origination workflow. The business wants a release this week, but the model must meet latency targets, preserve auditability, and avoid interrupting application decisions if performance degrades after release. Which deployment process is the BEST professional decision?

Options:

A. Run automated tests, canary deploy, monitor drift and KPIs, and assign an accountable owner
B. Require manual approval for each prediction until a new model is trained
C. Deploy after offline validation and document rollback after the first incident
D. Release to all users, monitor latency only, and retrain if complaints increase

Best answer: A

Explanation: A production model deployment process should verify the artifact before release, limit blast radius during rollout, define rollback criteria, monitor model and system health, and assign clear ownership. In this scenario, offline validation alone is insufficient because the model enters a regulated, real-time decision workflow where latency, drift, business KPIs, and auditability matter after release. A canary or phased deployment supports rollback before broad impact, while automated tests and monitoring provide evidence that the deployed model matches expected behavior. Named ownership ensures someone is responsible for incidents, threshold review, and stakeholder communication.

Delayed rollback planning is risky because rollback procedures should be defined before a regulated production release.
Latency-only monitoring misses model quality, drift, and business outcome degradation after deployment.
Manual prediction approval disrupts the workflow and does not provide a scalable deployment control.

Question 3

Topic: Machine Learning

A delivery platform is training a model to predict package transit time in minutes. Product managers will display the prediction as a conservative commitment time, not the average expected time. Which loss function consideration best aligns training with the task?

Exhibit: Training requirement

Item	Value
Target	Continuous transit minutes
Distribution	Right-skewed with late-delivery tail
Business goal	About 90% of actual deliveries should be at or below the prediction
Cost note	Under-predictions cause SLA penalties

Options:

A. Use quantile loss with $\tau=0.90$.
B. Use binary cross-entropy on late-delivery labels.
C. Use mean squared error to optimize the mean.
D. Use symmetric MAE to estimate median transit time.

Best answer: A

Explanation: The task is a continuous prediction problem, but the business requirement is not an average ETA. The exhibit asks for a conservative value that actual deliveries fall below about 90% of the time. Quantile, or pinball, loss directly aligns training with that target by estimating a chosen conditional quantile. With $\tau=0.90$, under-predictions are penalized more heavily than over-predictions, which matches the SLA risk described in the scenario. Symmetric losses such as MSE or MAE optimize central tendency and do not encode the desired coverage level.

Mean optimization fails because MSE targets the conditional mean and is especially sensitive to the late-delivery tail.
Classification framing fails because binary cross-entropy discards the continuous transit-time target needed for a commitment estimate.
Median training fails because symmetric MAE targets the 50th percentile, not the required 90th percentile.
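
To make the asymmetry concrete, the following is a small sketch of the pinball loss written from its standard definition. The function name and the toy transit times are illustrative assumptions, not values from the exhibit.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        error = actual - predicted
        # Under-prediction (error > 0) costs tau per unit of error;
        # over-prediction (error < 0) costs (1 - tau) per unit of error.
        total += tau * error if error >= 0 else (tau - 1) * error
    return total / len(y_true)


# Toy case: the actual transit time is 60 minutes.
under = pinball_loss([60.0], [50.0], tau=0.90)  # prediction 10 minutes too low
over = pinball_loss([60.0], [70.0], tau=0.90)   # prediction 10 minutes too high
print(under, over)
```

With $\tau=0.90$, the same 10-minute miss costs roughly nine times as much when it is an under-prediction, which pushes the fitted value toward the 90th percentile rather than the mean or median.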

Question 4

Topic: Operations and Processes

A data science team has validated a claims-triage model that uses a custom text preprocessing library and a specific Python runtime. The service must run in a cloud staging environment, an on-premises production environment, and a disaster-recovery site with minimal behavior differences. Security requires repeatable builds and auditable promotion between environments. Which deployment decision is BEST?

Options:

A. Install the required libraries manually on each host
B. Package the inference service and dependencies in a versioned container image
C. Run the model from the original training notebook
D. Deploy separate native services for each environment

Best answer: B

Explanation: Containerized deployment is the best fit when portability and environment consistency are central requirements. A container image can include the model artifact, preprocessing code, runtime, system dependencies, and startup configuration, then be promoted through staging, production, and disaster recovery with the same tested package. This reduces “works in development” failures caused by library, OS, or runtime differences and supports auditable CI/CD controls. Non-containerized approaches can work in stable, single-environment deployments, but they shift more responsibility to host configuration and increase reproducibility risk across heterogeneous targets. The key takeaway is to package the inference environment, not just the model file, when consistent behavior across environments matters.

Manual host setup increases configuration drift and weakens repeatability across cloud, on-premises, and disaster-recovery environments.
Training notebook serving is not an operationally ready deployment pattern for audited, repeatable inference.
Separate native services create duplicated deployment logic and make consistency harder to validate across environments.

Question 5

Topic: Modeling, Analysis, and Outcomes

A lender is validating a default-risk model before scoring applications from new applicants next quarter. The historical data has repeated records per applicant and a rare positive class.

Unit: application record
Group key: applicant_id, 1-5 records each
Time key: application_month, 18 months
Label: default within 90 days, 3% positive
Requirements: estimate future performance, prevent applicant leakage, avoid validation folds with no positives

Which validation approach best fits these requirements?

Options:

A. Leave-one-month-out validation using all other months for training
B. Random stratified k-fold validation by application record
C. Shuffled group k-fold validation by applicant_id
D. Forward-chaining blocked validation with whole applicant_id groups per split

Best answer: D

Explanation: The validation design must respect the constraint that matters for deployment: predicting future applications while preventing applicant-level leakage. A forward-chaining or blocked temporal split trains on earlier months and validates on later months, matching the next-quarter scoring scenario. Keeping each applicant_id entirely on one side of a split prevents repeated records from making validation look better than it will be for new applicants. Because defaults are rare, validation windows should be large enough, or combined, so that each fold contains positive cases and produces stable metrics. Random or shuffled approaches may improve class balance, but they break the temporal requirement or leak applicant behavior.

Random stratification preserves the 3% label rate but can place records from the same applicant and future months into training.
Shuffled group k-fold prevents applicant leakage but ignores the future-scoring requirement by mixing months.
Leave-one-month-out uses future months to train for earlier validation months, creating temporal leakage.
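
One way to sketch the split logic is shown below. It is a simplified illustration under stated assumptions: records are plain dictionaries with hypothetical keys, months are integers, and applicants who appear in the validation window are dropped from training so that each applicant_id sits on one side only.

```python
def forward_chaining_group_splits(records, cutoffs, window):
    """Yield (cutoff, train, valid) splits that train on earlier months only
    and keep every applicant_id on a single side of the split."""
    for cutoff in cutoffs:
        valid = [r for r in records if cutoff <= r["month"] < cutoff + window]
        valid_ids = {r["applicant_id"] for r in valid}
        train = [
            r for r in records
            if r["month"] < cutoff and r["applicant_id"] not in valid_ids
        ]
        yield cutoff, train, valid


def has_positive(rows):
    """Check that a validation window contains at least one positive label."""
    return any(r["label"] == 1 for r in rows)


# Toy records for illustration; applicant "a" applies in months 1 and 4.
records = [
    {"applicant_id": "a", "month": 1, "label": 0},
    {"applicant_id": "a", "month": 4, "label": 1},
    {"applicant_id": "b", "month": 2, "label": 0},
    {"applicant_id": "c", "month": 3, "label": 1},
    {"applicant_id": "d", "month": 5, "label": 0},
    {"applicant_id": "e", "month": 6, "label": 1},
]

splits = list(forward_chaining_group_splits(records, cutoffs=[4], window=3))
for cutoff, train, valid in splits:
    print(cutoff, len(train), len(valid), has_positive(valid))
```

A multi-month validation window is used here because, with a 3% positive rate, a single month may contain too few defaults for a stable metric.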

Question 6

Topic: Machine Learning

A subscription media company wants to identify naturally occurring customer segments for differentiated messaging. The data includes viewing frequency, content-category proportions, device mix, tenure, and support-contact counts, but there is no historical label for “segment” or “campaign response.” The business team wants an exploratory grouping approach before designing campaigns. Which method best maps to these requirements?

Options:

A. Run association rules on watched titles
B. Train a logistic regression response model
C. Fit a supervised churn classifier
D. Cluster customers using behavioral features

Best answer: D

Explanation: Clustering is an unsupervised learning approach used when the objective is to discover structure or natural groupings in data without a target label. In this scenario, the company has behavioral features but no known segment labels or response outcomes, so a supervised model would not have a valid target to learn. A clustering workflow could standardize numeric features, choose a similarity measure, evaluate cluster quality, and profile each segment for business usability.

The key distinction is that clustering finds groups; classification predicts predefined labels.

Response modeling requires historical campaign-response labels, which the scenario explicitly lacks.
Association rules can find co-occurring watched titles, but that is not the main requirement of segmenting customers across multiple behavioral features.
Churn classification requires known churn outcomes and would optimize prediction, not exploratory segment discovery.

Question 7

Topic: Specialized Applications of Data Science

A logistics company wants to improve how autonomous carts route and queue for charging during each shift. The data science team summarizes the problem before choosing a modeling approach.

Exhibit: Problem summary

Observation	Detail
Decision pattern	Cart chooses a next action every 30 seconds
Feedback	Reward increases for on-time deliveries and battery health
Outcome timing	Some rewards are delayed until later route segments
Training data	Simulator records states, actions, and rewards; no labeled best action

Which approach is best supported by the exhibit?

Options:

A. Graph centrality analysis
B. Supervised multiclass classification
C. Association rule mining
D. Reinforcement learning

Best answer: D

Explanation: Reinforcement learning fits problems where an agent repeatedly chooses actions in an environment and improves its policy based on rewards. The exhibit shows the key signals: state-action-reward records, sequential decisions, delayed feedback, and an objective based on cumulative outcomes rather than a fixed label for the correct action. A supervised classifier would need labeled examples of the best action at each decision point, which the summary explicitly lacks.

The key takeaway is that rewards over sequential actions point to reinforcement learning, especially when the best current action depends on future consequences.

Classification label trap fails because there is no labeled best action for each state.
Market basket trap fails because association rules find co-occurring items, not policies for sequential decisions.
Graph metric trap fails because centrality can describe a network but does not learn actions from reward feedback.

Question 8

Topic: Machine Learning

A data science team is building a gradient-boosted model to prioritize high-risk insurance claims. The team has 180,000 labeled claims, moderate class imbalance, and a regulatory requirement to report an unbiased estimate of production performance before deployment. Engineers have been using cross-validation to compare feature sets and tune tree depth, learning rate, and class weights. What is the best professional decision before presenting final performance to stakeholders?

Options:

A. Report the best cross-validation fold score from tuning
B. Train on all data and estimate performance from training metrics
C. Lock the tuning process, then evaluate once on a held-out test set
D. Retune hyperparameters on the held-out test set

Best answer: C

Explanation: Hyperparameter tuning must be separated from final model evaluation because repeated choices adapt to the validation signal. Cross-validation is appropriate for comparing hyperparameters, class weights, and feature sets, but its scores become part of the model-selection process. After those decisions are frozen, the team should run one final evaluation on an untouched holdout set or an equivalent nested outer test procedure. This gives stakeholders and regulators a defensible estimate of how the selected model is likely to perform on new claims. Using the test set during tuning would leak evaluation information into model selection and inflate reported performance.

Best fold score is optimistic because it selects the most favorable validation result observed during tuning.
Test-set retuning contaminates the final evaluation by making the test set part of model selection.
Training metrics do not estimate generalization and are especially misleading for flexible boosted models.

Question 9

Topic: Specialized Applications of Data Science

A manufacturer’s defect-detection model has 96% validation accuracy in the lab but misses defects during a pilot on a new production line. The release must avoid unnecessary model redesign unless the data evidence supports it.

Audit summary:

Finding	Evidence
Image quality	Pilot images have glare, blur, and partial occlusion not seen in training
Labels	Two reviewers disagree on hairline scratches in 22% of sampled images
Representativeness	Training data mostly uses one camera, day shift, and centered parts

Which action is the BEST professional decision?

Options:

A. Generate synthetic glare examples and retrain immediately
B. Run a targeted data audit and rebuild the validation set
C. Lower the confidence threshold for all defect classes
D. Replace the model with a larger neural network

Best answer: B

Explanation: The core issue is not yet model capacity; it is whether the training and validation data reflect the deployment environment. Glare, blur, occlusion, camera differences, shift differences, and inconsistent scratch labels can make lab validation misleading. A defensible next step is to audit the data, standardize labeling rules with adjudication, collect representative pilot images, and rebuild a stratified validation set that includes the new operating conditions. Only after that evidence is reliable should the team compare model changes or augmentation strategies.

Changing architecture or thresholds may hide the real failure mode and can worsen false positives or false negatives in production.

Larger model fails because poor labels and nonrepresentative validation data can make any architecture appear better or worse than it is.
Lower threshold fails because it changes operating behavior without addressing missed deployment conditions or label ambiguity.
Synthetic glare only fails because it addresses one quality issue while ignoring blur, occlusion, camera coverage, shift coverage, and reviewer disagreement.

Question 10

Topic: Mathematics and Statistics

A fraud team is training a supervised classifier on 2 million labeled transactions. Fraud cases are 0.4% of the data, false negatives are much more costly than false positives, and the first model has high overall accuracy but very low recall on fraud in stratified cross-validation. Which is the best professional decision before the next model iteration?

Options:

A. Keep the data unchanged and optimize only for accuracy
B. Downsample fraud cases to match legitimate transactions
C. Oversample fraud cases within each training fold only
D. Oversample the entire dataset before cross-validation

Best answer: C

Explanation: Oversampling can help when a supervised-learning problem has a severe class imbalance and the minority class is operationally important. In this case, fraud is rare, false negatives are costly, and validation already shows poor minority-class recall despite high accuracy. The defensible next step is to apply oversampling only to the training portion of each split or fold, then evaluate on the original validation distribution with recall, precision, PR-AUC, or cost-sensitive metrics. This gives the learner more minority examples without contaminating validation data. Oversampling before splitting would leak duplicated or synthetic minority patterns into validation and inflate performance estimates.

Pre-split oversampling fails because duplicated or synthetic examples can appear in both training and validation data.
Accuracy-only optimization fails because high accuracy can hide poor detection of the rare fraud class.
Downsampling fraud fails because it further reduces the already scarce minority class examples needed for learning.
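
The ordering of the two steps (split first, oversample second) can be sketched as follows. This is a toy illustration using random duplication and hand-rolled folds; the helper names and the 40-row dataset are assumptions made for the example, and the imbalance is far milder than the 0.4% in the question.

```python
import random


def stratified_folds(rows, k, rng):
    """Deal shuffled positives, then negatives, round-robin into k folds."""
    pos = [r for r in rows if r["label"] == 1]
    neg = [r for r in rows if r["label"] == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(k)]
    for i, row in enumerate(pos + neg):
        folds[i % k].append(row)
    return folds


def oversample_minority(rows, rng):
    """Randomly duplicate positive rows until both classes are the same size."""
    pos = [r for r in rows if r["label"] == 1]
    neg = [r for r in rows if r["label"] == 0]
    if not pos or len(pos) >= len(neg):
        return list(rows)
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return list(rows) + extra


rng = random.Random(0)
rows = [{"id": i, "label": 1 if i < 4 else 0} for i in range(40)]
folds = stratified_folds(rows, k=4, rng=rng)

splits = []
for i, valid in enumerate(folds):
    train = [r for j, fold in enumerate(folds) if j != i for r in fold]
    train_balanced = oversample_minority(train, rng)  # training portion only
    splits.append((train_balanced, valid))            # validation left untouched
```

Each validation fold keeps the original class ratio, and no duplicated fraud row can appear on both sides because duplication happens after the split.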

Question 11

Topic: Machine Learning

A lender is building a model to flag applications that are likely to become 90-day delinquent within 12 months. The compliance team requires a defensible probability-style risk score, clear feature-effect explanations for adverse action review, and low-latency scoring. EDA shows mostly monotonic relationships between engineered features and delinquency, with no strong nonlinear interaction signal. Which modeling decision is BEST?

Options:

A. Use a deep neural network optimized for AUC
B. Use k-means clustering with delinquency-rate profiling
C. Use regularized logistic regression with calibration validation
D. Use an uncalibrated hard-margin support vector machine

Best answer: C

Explanation: Logistic regression is well suited when the target is binary and stakeholders need interpretable probability-style outputs. In this scenario, the business must explain feature effects for compliance, score applications quickly, and does not have evidence that complex nonlinear interactions dominate. Regularization helps control overfitting, and calibration validation checks whether predicted probabilities are reliable enough for risk-based decisions. A more complex model might improve ranking metrics in some cases, but it would be harder to justify if interpretability and defensible probabilities are primary constraints.

Clustering misuse fails because k-means is unsupervised and does not directly model the labeled delinquency outcome.
SVM limitation fails because an uncalibrated hard-margin SVM emphasizes separation, not reliable probability-style risk scores.
Neural network overbuild fails because optimizing AUC alone does not satisfy the compliance need for clear feature-effect explanations.
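
Calibration validation can take several forms. One common summary is a binned expected calibration error, sketched below from its usual definition. The function name, the equal-width binning, and the toy scores are assumptions for illustration; the question does not specify which calibration metric the lender would use.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and the
    observed positive rate, computed over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_prob - positive_rate)
    return ece


# Toy scores: the first set matches the observed rate, the second overstates it.
well_calibrated = expected_calibration_error([0.2] * 5, [1, 0, 0, 0, 0])
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
print(well_calibrated, overconfident)
```

A low value on held-out data suggests that a score of, say, 0.2 corresponds to roughly a 20% observed delinquency rate, which is the property a probability-style risk score needs.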

Question 12

Topic: Modeling, Analysis, and Outcomes

A subscription company is replacing a rules-based churn intervention model. The team must recommend production use only when a candidate improves the business outcome without violating customer-contact guardrails.

Exhibit: Latest iteration results

Measure	Baseline rules	Candidate model	Requirement
AUC	0.69	0.78	Higher is better
Monthly net retention value	$410,000	$455,000	≥$440,000
Unnecessary contact rate	7.5%	11.8%	≤9.0%
Calibration error	0.04	0.09	≤0.05

Which iteration plan is most appropriate before recommending production use?

Options:

A. Calibrate the candidate, tune the intervention threshold, and compare against the baseline on value and guardrails.
B. Deploy the candidate because its AUC and net retention value exceed the baseline.
C. Retrain using a more complex model and select the highest AUC result.
D. Keep the baseline permanently because the candidate violates two guardrails.

Best answer: A

Explanation: A production recommendation should compare the baseline, candidate model, and business outcome against all required constraints. Here, the candidate has better AUC and meets the retention-value target, but it also exceeds the unnecessary-contact limit and has poor calibration. That means the evidence supports another iteration, not immediate deployment. A strong plan would calibrate probabilities, sweep or optimize the intervention threshold, and then compare the revised candidate with the baseline using both net retention value and guardrail metrics. The key takeaway is that a better predictive metric alone is insufficient when the business decision depends on operational and customer-impact constraints.

AUC-only deployment fails because the candidate violates customer-contact and calibration guardrails despite better ranking performance.
Permanent rejection is too strong because the candidate has business-value upside that may be recoverable through calibration and threshold tuning.
More complexity does not target the observed failure; it may improve AUC while leaving the business guardrails unresolved.
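
The threshold sweep in the recommended plan might look like the sketch below. It assumes, purely for illustration, that the unnecessary contact rate is the share of all customers who are contacted but would not have churned; the exhibit does not define the metric, and the scores and flags are made-up toy data.

```python
def sweep_thresholds(scores, is_churner, thresholds, max_unnecessary_rate):
    """Return (threshold, unnecessary_rate, churners_reached) for every
    threshold that satisfies the contact guardrail."""
    n = len(scores)
    feasible = []
    for threshold in thresholds:
        contacted = [c for s, c in zip(scores, is_churner) if s >= threshold]
        unnecessary_rate = sum(1 for c in contacted if not c) / n
        churners_reached = sum(1 for c in contacted if c)
        if unnecessary_rate <= max_unnecessary_rate:
            feasible.append((threshold, unnecessary_rate, churners_reached))
    return feasible


# Toy data: 20 customers, the last 10 with low scores and no churn.
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.60, 0.50, 0.40, 0.30] + [0.10] * 10
is_churner = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0] + [0] * 10

feasible = sweep_thresholds(
    scores, is_churner,
    thresholds=[0.45, 0.65, 0.75, 0.85],
    max_unnecessary_rate=0.09,  # the 9.0% limit from the exhibit
)
# Among thresholds that respect the guardrail, prefer the one reaching most churners.
best = max(feasible, key=lambda row: row[2])
print(best)
```

In practice the objective would be the net retention value rather than a simple count of churners reached, and the sweep would run on calibrated probabilities.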

Question 13

Topic: Modeling, Analysis, and Outcomes

A retail analytics team is building account segments for targeted campaigns. Each account is represented by 60,000 one-hot product and event features; 99.6% of cells are zero. A Euclidean KNN prototype produces different nearest neighbors across folds and near-zero silhouette scores, but stakeholders still need defensible segments within the current quarter. Which decision is BEST?

Options:

A. Increase KNN neighbors until silhouette improves
B. Deploy the current model and monitor campaign lift
C. Use sparsity-aware similarity after reducing or aggregating features
D. Mean-impute zeros and keep Euclidean KNN

Best answer: C

Explanation: Sparse, high-dimensional one-hot data often causes distance concentration: many observations appear similarly far apart, so Euclidean nearest neighbors become unstable and weakly meaningful. A professional response is to reconsider the feature representation and distance measure before deployment. Aggregating rare events, selecting informative features, applying dimensionality reduction, or using cosine/Jaccard-style similarity can make neighborhood structure more reliable for sparse vectors. The revised approach should be checked for segment stability and business usefulness before stakeholders act on it. Simply tuning KNN or monitoring after deployment does not fix the core method-fit problem.

More neighbors may smooth variance, but it does not address unreliable distances caused by sparse high-dimensional features.
Mean-imputing zeros changes the meaning of absence indicators and can make Euclidean distances even less interpretable.
Deploying first shifts validation risk to the business and ignores evidence that the segments are unstable.
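
For one-hot data, sparsity-aware similarity can be computed directly on the sets of active features, as in this minimal sketch. Representing each account as a set of feature indices is an assumption made for the example, and the toy accounts are hypothetical.

```python
import math


def jaccard_similarity(a, b):
    """Shared active features divided by all active features in either set."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine_similarity_binary(a, b):
    """Cosine similarity for binary vectors stored as sets of active indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Each account lists only its non-zero one-hot features.
account_a = {1, 2, 3}
account_b = {2, 3, 4}
account_c = {100}

print(jaccard_similarity(account_a, account_b))
print(cosine_similarity_binary(account_a, account_b))
print(jaccard_similarity(account_a, account_c))
```

Both measures depend only on features that are present in at least one account, so the tens of thousands of shared zeros do not dilute the comparison.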

Question 14

Topic: Operations and Processes

A retailer’s demand-forecasting model performed well in offline validation and is scheduled to drive automatic replenishment orders for high-volume stores. The current workflow uses an analyst notebook that merges daily POS data, supplier lead-time files, and manual stockout corrections from a shared folder. Forecast errors could cause missed sales or excess inventory, and finance wants auditability for order decisions. What is the BEST professional decision before enabling automation?

Options:

A. Operationalize a governed data pipeline with lineage, validation, versioning, and monitoring
B. Deploy the notebook unchanged because offline validation is already successful
C. Keep the model as an advisory dashboard without changing the data workflow
D. Replace the model with a larger neural network before deployment

Best answer: A

Explanation: Governed data pipelines are needed when model outputs directly affect business-critical operations, especially automated decisions with financial impact. In this scenario, replenishment orders depend on data from multiple sources, manual corrections, and files that may change outside controlled processes. Offline model validation is not enough because the operational risk is in repeatability, data quality, lineage, and accountability at the time decisions are made. A governed pipeline should validate schemas and data quality, track dataset and model versions, preserve lineage, control access, log outputs, and monitor drift or failures. The key takeaway is that reliable operations require governance around the data-to-decision path, not only a well-performing model.

Offline-only metrics fail because validation does not prove the notebook workflow is repeatable, auditable, or safe for automated purchasing.
More model complexity misses the main risk, which is uncontrolled operational data flow rather than inadequate predictive capacity.
Advisory-only use avoids automation risk but does not satisfy the stated goal of enabling replenishment automation with auditability.

Question 15

Topic: Mathematics and Statistics

A telecom data science team must decide whether customer contract type is associated with churn reason before recommending targeted retention offers. The dataset contains one row per customer, both fields are categorical, observations are independent, and an EDA check shows all expected contingency-table cell counts are at least 8. Which statistical decision is the BEST fit?

Options:

A. Use a paired t-test on encoded category labels
B. Use ANOVA with churn reason as the response
C. Use linear regression on one-hot churn categories
D. Use a chi-squared test of independence

Best answer: D

Explanation: A chi-squared test of independence evaluates whether two categorical variables are associated by comparing observed contingency-table counts with expected counts under independence. In this scenario, contract type and churn reason are categorical, each customer contributes one independent observation, and expected cell counts are large enough for the chi-squared approximation. That combination satisfies the method-fit requirements without turning nominal categories into artificial numeric values. The result would indicate whether the data provide statistical evidence of an association before the team designs targeted offers, though it would not by itself prove causation.

Encoded t-test fails because numeric codes assigned to categories do not create meaningful interval-scale measurements.
ANOVA response choice fails because churn reason is categorical, not a continuous response variable.
Linear regression framing overcomplicates the association test and does not directly test independence in a contingency table.
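
The comparison of observed and expected counts can be worked through in a few lines. The sketch below computes the test statistic and degrees of freedom from the standard formula; the 2x3 table of counts is invented for illustration, and the p-value step is left to a statistics library.

```python
def chi_squared_independence(observed):
    """Return (statistic, degrees_of_freedom, expected) for a contingency table
    given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts: rows are two contract types, columns are three churn reasons.
observed = [
    [30, 20, 10],
    [20, 20, 20],
]
statistic, dof, expected = chi_squared_independence(observed)
smallest_expected = min(min(row) for row in expected)
print(statistic, dof, smallest_expected)
```

Checking the smallest expected count, as the team's EDA did, is what justifies relying on the chi-squared approximation for the resulting statistic.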

Question 16

Topic: Specialized Applications of Data Science

A healthcare analytics team is building an ingestion pipeline for legacy patient intake forms. The downstream model needs structured fields such as patient name, visit date, diagnosis code, and handwritten notes from the source files.

Exhibit: Source data profile

Attribute	Observation
File type	Scanned PDFs and phone photos
Content	Typed and handwritten form text
Required output	Machine-readable text fields
Current blocker	Text is embedded in images

Which computer vision approach is most appropriate for the pipeline?

Options:

A. Object detection
B. Semantic segmentation
C. Optical character recognition
D. Image classification

Best answer: C

Explanation: Optical character recognition is the appropriate method when the task is to extract readable text from images, scanned PDFs, or photographed documents. The exhibit states that the required output is machine-readable text fields and that the current blocker is text embedded in images. OCR can be used alone or with preprocessing steps such as deskewing, denoising, layout detection, or handwriting recognition, but the core task is still text extraction rather than labeling the whole image or locating objects.

Image classification, object detection, and segmentation may support document workflows, but they do not directly convert visual text into structured text values.

Image classification would label the entire form or page type, not extract names, dates, codes, or notes.
Object detection could locate fields or boxes, but it would not itself read the text inside them.
Semantic segmentation could separate regions such as header, table, or signature area, but it is not the primary method for text extraction.

Question 17

Topic: Operations and Processes

A subscription company monitors a production churn model weekly. The model has not had code or feature-pipeline changes, and delayed ground-truth labels are now available for the monitored period.

Exhibit: Production monitoring summary

Metric	Baseline validation	Last 7 days
ROC AUC	0.84	0.69
Recall at current threshold	0.78	0.52
Calibration error	0.04	0.16
Feature PSI: usage_minutes	0.06	0.31

Which next life-cycle step is best supported by the exhibit?

Options:

A. Tune only the decision threshold in production
B. Archive monitoring results and continue observing
C. Start a retraining and validation iteration
D. Promote the current model to wider rollout

Best answer: C

Explanation: Production monitoring is a control point in the data science life cycle. Here, model quality has degraded materially: ROC AUC and recall dropped, calibration worsened, and the usage_minutes feature shows notable population shift. Because labels are available, the next step is not blind deployment or passive observation; it is to re-enter the model-development loop. The team should investigate drift, refresh or reweight training data as appropriate, retrain candidate models, and validate them against current labeled data before any production replacement. Threshold tuning may be part of later optimization, but it does not address the broader ranking and calibration degradation shown in the exhibit.

Wider rollout fails because the monitored model is performing worse than baseline and should not be expanded.
Threshold-only tuning is insufficient because ROC AUC and calibration degradation indicate more than a cutoff problem.
Passive observation fails because labeled evidence already supports an active model iteration.
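
The PSI row in the exhibit compares how a feature is distributed across bins in the baseline data and in recent data. The sketch below shows one common way to compute it. The bin shares are hypothetical and chosen only for illustration; they are not derived from the exhibit, and the function name is an assumption.

```python
import math

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two lists of bin proportions that each sum to about 1."""
    if len(expected) != len(actual):
        raise ValueError("both distributions need the same number of bins")
    psi = 0.0
    for e, a in zip(expected, actual):
        # Clamp to eps so an empty bin does not cause log(0) or division by zero.
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi

# Hypothetical usage_minutes bin shares, ordered from low to high usage.
baseline = [0.25, 0.25, 0.25, 0.25]
last_7_days = [0.45, 0.30, 0.15, 0.10]

psi = population_stability_index(baseline, last_7_days)
print(f"PSI = {psi:.3f}")
```

Identical distributions give a PSI of zero, and the value grows as recent traffic moves away from the baseline bins.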

Question 18

Topic: Modeling, Analysis, and Outcomes

A retailer is reviewing a proposed recommendation report for a subscription marketplace. The analyst recommends promoting the same top-selling add-ons to all customers because a linear model shows weak individual feature effects. However, the user-item matrix is 96% empty, niche add-ons have few ratings, and prior EDA shows strong interactions between customer segment, season, and bundle type. Which business risk is most likely if the recommendation is accepted as written?

Options:

A. Lost personalization revenue from missed long-tail and interaction effects
B. Higher storage costs from retaining too many historical ratings
C. Faster model inference at the expense of dashboard refresh speed
D. Regulatory noncompliance from using customer purchase history

Best answer: A

Explanation: Sparse data and non-linear patterns are especially important in recommendation systems because the most valuable signal may appear only in small user-item subgroups or in interactions among context, segment, and product combinations. Treating weak linear main effects as evidence that personalization has little value can bias the recommendation toward already-popular items. That creates a business risk: reduced conversion, lower customer retention, and underexposure of niche or high-margin products that would perform well for specific customers. The issue is not just model accuracy; it affects revenue allocation and stakeholder confidence in the recommendation strategy. A better analysis would explicitly account for sparsity and interactions before making a broad business recommendation.

Storage cost focus misses the decision risk; the stem is about recommendation quality, not data retention expense.
Compliance claim is unsupported because no prohibited data use or regulatory constraint is stated.
Inference trade-off confuses operational speed with the business impact of ignoring sparse and non-linear signals.

Question 19

Topic: Machine Learning

A manufacturing team wants to predict microscopic defects from 500 synchronized sensor signals. Individual variables have weak predictive power, but engineers expect failures to emerge from nonlinear interactions across temperature, vibration, pressure, and load patterns. The team has a large labeled history and wants the model to learn intermediate representations rather than manually specifying every interaction term. Which model family best maps to these requirements?

Options:

A. A linear model with main effects only
B. A multilayer artificial neural network
C. A k-means clustering model
D. A naive Bayes classifier

Best answer: B

Explanation: An artificial neural network is designed to learn complex relationships by passing inputs through connected layers of units. During training, backpropagation updates weights so that earlier layers can learn useful combinations of raw features and later layers can combine those representations into a prediction. With enough labeled data and nonlinear interactions among many sensor signals, a multilayer network can reduce the need to hand-code every interaction term. A purely linear main-effects model is usually too restrictive for this requirement because it does not learn hierarchical nonlinear feature combinations by itself.

Linear main effects miss the stated need to learn nonlinear interactions without manually adding interaction terms.
Naive Bayes assumes conditional independence patterns that conflict with the interaction-heavy defect mechanism.
K-means clustering is unsupervised and does not directly use the labeled defect outcomes for prediction.

Question 20

Topic: Machine Learning

A data science team is building a churn model for a subscription service. The current L2-regularized logistic regression model has training AUC 0.63 and validation AUC 0.62 across repeated temporal splits. Learning curves plateau early, and error analysis shows missed churn patterns tied to nonlinear tenure effects and support-ticket interactions. The model must remain low-latency and reasonably explainable for customer success leaders. Which action is the BEST professional decision?

Options:

A. Collect substantially more records before changing the model design
B. Replace the model with a large deep neural network immediately
C. Add targeted nonlinear and interaction features, then tune regularization with validation
D. Increase regularization and remove weakly correlated features

Best answer: C

Explanation: High bias is indicated when both training and validation performance are consistently poor and close together. The model is not fitting important structure in the training data, so the primary response is to increase useful model capacity rather than gather more of the same data or further constrain the model. In this scenario, the evidence points to specific missed nonlinear and interaction patterns, while the business requires low latency and explainability. Targeted feature engineering plus cross-validated regularization addresses the underfitting while keeping the solution operationally appropriate. A much larger model may improve capacity, but it ignores the explainability and latency constraints and adds unnecessary complexity before testing a simpler fix.

More data first fails because plateaued learning curves suggest the current hypothesis class is too simple, not merely data-limited.
Stronger regularization fails because it would further restrict an already underfitting model.
Deep network replacement fails because it overengineers the response and conflicts with latency and explainability constraints.
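
As a rough illustration of what "targeted nonlinear and interaction features" could look like, the sketch below adds a few hand-built terms to one customer record. The field names (`tenure_months`, `support_tickets_90d`) and the three-month cutoff are hypothetical; in practice they would come from the error analysis.

```python
import math

def add_churn_features(row):
    """Return a copy of one customer record with hand-built nonlinear terms."""
    tenure = row["tenure_months"]
    tickets = row["support_tickets_90d"]
    out = dict(row)
    # A log transform lets a linear model bend the tenure effect.
    out["log_tenure"] = math.log1p(tenure)
    # Indicator for a hypothetical early-life window (cutoff is an assumption).
    out["is_new_customer"] = 1 if tenure < 3 else 0
    # Interaction term: lets the model weight tickets differently for new customers.
    out["tickets_x_new"] = tickets * out["is_new_customer"]
    return out

example = add_churn_features({"tenure_months": 2, "support_tickets_90d": 4})
print(example)
```

Each added column remains a single, nameable term in the logistic regression, which is why this route keeps the model explainable and fast to score.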

Question 21

Topic: Mathematics and Statistics

A subscription platform tests a new churn-risk model on 2,000,000 customer-months. The holdout analysis shows a statistically significant lift in retained customers, but the deployment adds outreach costs and latency to the retention workflow.

Metric	Result
Retention lift	0.08 percentage points
95% CI	0.03 to 0.13 percentage points
p-value	<0.001
Minimum lift for positive ROI	0.30 percentage points

Which recommendation is the BEST professional decision?

Options:

A. Replace the model with a more complex algorithm to increase statistical significance.
B. Do not deploy broadly; report statistical significance but insufficient business impact.
C. Deploy immediately because the p-value proves the model improves retention.
D. Increase the test sample until the confidence interval excludes zero by a wider margin.

Best answer: B

Explanation: Statistical significance means that an effect at least as large as the one observed would be unlikely if there were truly no effect, under the test assumptions; practical significance asks whether the effect is large enough to matter operationally or financially. In this scenario, the confidence interval is entirely above zero, so there is evidence of a real retention lift. However, even the upper bound, 0.13 percentage points, is below the 0.30 percentage-point lift required for positive ROI. A professional recommendation should communicate both facts: the result is statistically significant, but it does not justify broad deployment under the stated cost constraint. The key takeaway is that a very large sample can make tiny effects statistically significant without making them valuable.

P-value overreach fails because a small p-value does not prove the effect is large enough to cover outreach costs.
More sampling fails because the current confidence interval already shows the lift is below the ROI threshold.
Complexity bias fails because changing algorithms does not address the practical-value gap shown by the validation evidence.
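
The comparison between the confidence interval and the ROI threshold can be written as a small decision rule. This is a sketch under the assumption that the interval bounds and threshold are all in percentage points; the function name and the wording of the returned messages are illustrative.

```python
def deployment_recommendation(ci_low, ci_high, min_lift):
    """Classify a lift estimate by statistical and practical significance.

    All arguments are in percentage points.
    """
    if ci_low <= 0:
        return "no reliable positive effect"
    if ci_low >= min_lift:
        return "deploy: lift clears the ROI threshold"
    if ci_high < min_lift:
        return "do not deploy broadly: real but too small for positive ROI"
    return "inconclusive: interval straddles the ROI threshold"

# Values from the question: 95% CI of 0.03 to 0.13, minimum lift of 0.30.
print(deployment_recommendation(0.03, 0.13, 0.30))
```

With the values from the table, the whole interval sits above zero but below 0.30, so the rule returns the "do not deploy broadly" branch.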

Question 22

Topic: Specialized Applications of Data Science

A payment processor wants to score transactions for manual review before settlement. It has three years of investigator-confirmed labels for fraud and legitimate; confirmed fraud is 0.4% of transactions. Missing a true fraud case is much more expensive than reviewing a legitimate transaction, and the team needs explanations tied to known fraud patterns.

Which approach best maps to these requirements?

Options:

A. Customer segmentation using unsupervised clustering
B. Supervised fraud detection with cost-sensitive evaluation
C. Unsupervised anomaly detection on transaction outliers
D. Ordinary classification optimized for overall accuracy

Best answer: B

Explanation: Fraud detection is appropriate when the target event is rare, labeled, and has asymmetric business impact. Here, confirmed fraud and legitimate labels support supervised learning, while the 0.4% event rate requires imbalance-aware training and evaluation. Because false negatives are more costly than false positives, the model should be judged using cost-sensitive metrics, precision-recall trade-offs, recall at review capacity, or expected loss rather than plain accuracy. Anomaly detection is better when labels are unavailable or the goal is to surface unusual behavior without a verified target. Ordinary classification may be technically possible, but optimizing for overall accuracy would hide performance on the rare fraud class.

Outlier-only detection misses the value of confirmed labels and may flag unusual but legitimate high-value behavior.
Accuracy optimization is misleading because predicting nearly all transactions as legitimate can score highly on a 0.4% fraud dataset.
Customer segmentation groups similar customers but does not directly target rare, costly fraudulent transactions.
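
To see why accuracy misleads at a 0.4% fraud rate, the sketch below compares a model that flags nothing with a hypothetical detector. The review and missed-fraud costs, and the detector's recall and false-positive rate, are assumptions made up for illustration.

```python
def confusion_counts(n, fraud_rate, recall, false_positive_rate):
    """Expected confusion-matrix counts (tp, fp, fn, tn) on n transactions."""
    fraud = n * fraud_rate
    legit = n - fraud
    tp = fraud * recall
    fn = fraud - tp
    fp = legit * false_positive_rate
    tn = legit - fp
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def expected_cost(fp, fn, cost_fp, cost_fn):
    return fp * cost_fp + fn * cost_fn

N = 1_000_000
FRAUD_RATE = 0.004          # 0.4% from the question
COST_FP, COST_FN = 5, 500   # hypothetical review cost vs. missed-fraud loss

always_legit = confusion_counts(N, FRAUD_RATE, recall=0.0, false_positive_rate=0.0)
detector = confusion_counts(N, FRAUD_RATE, recall=0.80, false_positive_rate=0.02)

for name, (tp, fp, fn, tn) in [("always legit", always_legit), ("detector", detector)]:
    print(name, round(accuracy(tp, fp, fn, tn), 4), expected_cost(fp, fn, COST_FP, COST_FN))
```

Under these assumed costs, the do-nothing model has the higher accuracy but the higher expected loss, which is the pattern cost-sensitive evaluation is meant to expose.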

Question 23

Topic: Operations and Processes

A subscription company is preparing a churn model with one training row per customer_id and calendar month. The team must merge these sources while preserving auditability and avoiding inflated training rows.

Source	Grain	Key issue
Churn labels	customer-month	`churn_next_month`
App events	event	email can change
Support tickets	ticket	0..many per customer-month
Billing	account-month	one account may contain multiple customers

Which decision is BEST before modeling?

Options:

A. Create a customer-month integration layer with key mapping, aggregation, and cardinality checks
B. Copy each monthly churn label onto every event and ticket row
C. Randomly select one ticket and one invoice per customer-month
D. Inner join all sources and drop duplicate rows after the merge

Best answer: A

Explanation: Merging should be controlled at the target modeling grain: one row per customer-month. The safest professional decision is to establish a canonical key mapping for mutable identifiers such as email and account relationships, aggregate many-row sources such as events and tickets to customer-month features, and verify expected join cardinality and row counts. This prevents duplicate training rows, accidental label replication, and mismatched account-level billing data. It also supports auditability because each transformation can be traced from raw source to feature table. The key takeaway is to align keys and granularity before joining, rather than trying to repair duplication after the merged table is created.

Dropping duplicates late can hide many-to-many join explosions and may remove valid customer-month facts.
Label replication inflates the sample size and can bias validation by treating correlated event rows as independent examples.
Random row selection discards signal and does not resolve the account-to-customer granularity mismatch.
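
The aggregate-then-join pattern with cardinality checks can be sketched as follows. The rows and the `ticket_id` and `ticket_count` names are hypothetical; only `customer_id` and `churn_next_month` appear in the question.

```python
from collections import Counter

# Hypothetical raw rows at two different grains.
labels = [
    {"customer_id": "c1", "month": "2026-01", "churn_next_month": 0},
    {"customer_id": "c2", "month": "2026-01", "churn_next_month": 1},
]
tickets = [
    {"customer_id": "c2", "month": "2026-01", "ticket_id": "t1"},
    {"customer_id": "c2", "month": "2026-01", "ticket_id": "t2"},
    {"customer_id": "c2", "month": "2026-01", "ticket_id": "t3"},
]

def aggregate_tickets(ticket_rows):
    """Collapse ticket-grain rows to one count per (customer_id, month)."""
    return Counter((t["customer_id"], t["month"]) for t in ticket_rows)

def build_training_table(label_rows, ticket_rows):
    ticket_counts = aggregate_tickets(ticket_rows)
    table = []
    for row in label_rows:
        key = (row["customer_id"], row["month"])
        # Left join from the label grain; a missing aggregate becomes 0.
        table.append({**row, "ticket_count": ticket_counts.get(key, 0)})
    # Cardinality checks: row count unchanged and customer-month key unique.
    keys = [(r["customer_id"], r["month"]) for r in table]
    assert len(table) == len(label_rows), "row count changed during merge"
    assert len(set(keys)) == len(keys), "duplicate customer-month keys"
    return table

training = build_training_table(labels, tickets)
print(training)
```

Joining the three raw ticket rows directly would have produced three copies of the `c2` label; aggregating first keeps one row per customer-month.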

Question 24

Topic: Machine Learning

A data science team is modeling equipment-failure risk from high-frequency sensor summaries. Which interpretation is best supported by the exhibit?

Exhibit: Validation and model-behavior summary

Evidence	Result
EDA finding	Failure rate rises mainly when vibration harmonics, temperature variance, and pressure spikes occur together
Linear model, raw features	ROC AUC = 0.63
Single linear layer, no activation	ROC AUC = 0.64
ReLU network, 3 hidden layers	ROC AUC = 0.87
Hidden-unit probe	Several units activate strongly for specific sensor combinations, not for any single sensor alone

Options:

A. Hidden layers learn nonlinear feature combinations through weighted connections.
B. The deeper network proves the training data was memorized.
C. The network is better because it ignores weak marginal features.
D. The single linear layer should match the deeper network after scaling.

Best answer: A

Explanation: An artificial neural network models complex relationships by learning weights on connections between neurons and applying nonlinear activation functions across layers. In the exhibit, each individual sensor is only weakly predictive, but combinations of vibration, temperature variance, and pressure spikes are highly informative. A single linear layer can only form one weighted additive combination, so it performs similarly to a linear model. The ReLU network can transform weighted inputs into hidden representations, then combine those learned representations in later layers. The hidden-unit probe supports this: units respond to feature combinations rather than isolated features. The key takeaway is that depth plus nonlinear activations allows the network to represent interactions that linear models cannot capture well.

Ignoring weak features fails because the exhibit shows combinations of weak marginal features are important, not that the features should be discarded.
Single linear layer fails because without nonlinear activation, it remains essentially a linear transformation.
Memorization claim is unsupported because the exhibit reports validation performance and hidden-unit behavior, not train-only accuracy or overfitting evidence.
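
The claim that stacked layers without an activation stay linear can be checked directly. The sketch below uses a tiny hand-weighted network, with weights that are hypothetical and picked so the ReLU version reproduces an XOR-style pattern, a standard stand-in for an interaction that no single weighted sum can express.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def relu(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights: 2 inputs, 2 hidden units, 1 output.
W1 = [[1.0, 1.0], [1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]

def linear_stack(x):
    # Two layers with no activation between them.
    h = [a + b for a, b in zip(matvec(W1, x), b1)]
    return matvec(W2, h)[0]

def relu_net(x):
    # Same weights, with ReLU applied to the hidden layer.
    h = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    return matvec(W2, h)[0]

# The activation-free stack collapses to one affine map: W2*W1 and W2*b1.
collapsed_W = matmul(W2, W1)
collapsed_b = matvec(W2, b1)[0]

inputs = ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
for x in inputs:
    single = matvec(collapsed_W, x)[0] + collapsed_b
    print(x, linear_stack(x), single, relu_net(x))
```

The first two output columns always agree, while the ReLU column responds to the combination of inputs rather than to either input alone.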

Question 25

Topic: Machine Learning

A fraud detection team trained a boosted-tree model for near-real-time scoring. The business requires reliable performance on new transactions before deployment, not just historical fit.

Validation summary

Metric	Training	Validation
AUC	0.99	0.71
Log loss	0.04	0.68

Which action best addresses the observed model behavior?

Options:

A. Train for more boosting rounds
B. Select features using validation labels
C. Deploy and monitor production drift
D. Increase regularization and limit tree depth

Best answer: D

Explanation: High variance occurs when a model fits training data very well but fails to generalize to validation data. Here, near-perfect training AUC and very low training log loss contrast with much weaker validation performance, which is a classic overfitting pattern. For boosted trees, appropriate remedies include increasing regularization, limiting tree depth, lowering the learning rate, using early stopping, or collecting more representative training data. The key requirement is reliable performance on new transactions before deployment, so the next action should reduce complexity and revalidate rather than reward additional fit to the training set.

The closest trap is training longer, which usually worsens high variance unless paired with controls such as early stopping.

More boosting rounds may improve training fit but can further overfit when validation performance is already poor.
Validation-label feature selection creates leakage from the evaluation process and can inflate performance estimates.
Deploy and monitor skips the requirement to demonstrate generalization before production use.
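
Early stopping, one of the controls mentioned above, can be sketched independently of any boosting library. The per-round validation losses below are hypothetical, and the `patience` parameter is an assumed convention for how many non-improving rounds to tolerate.

```python
def best_round_with_early_stopping(val_losses, patience=3):
    """Return (best_round, best_loss), stopping once validation loss has not
    improved for `patience` consecutive rounds. Rounds are 1-indexed."""
    best_round, best_loss = 0, float("inf")
    for round_number, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_number, loss
        elif round_number - best_round >= patience:
            break
    return best_round, best_loss

# Hypothetical validation log loss per boosting round; training loss would keep falling.
val_log_loss = [0.60, 0.52, 0.47, 0.45, 0.46, 0.49, 0.55, 0.68]
print(best_round_with_early_stopping(val_log_loss))
```

The point of the rule is that the stopping decision uses validation loss only, so extra rounds that merely improve training fit are never kept.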

Questions 26-50

Question 26

Topic: Operations and Processes

A retailer piloted a demand-forecasting model for regional distribution centers. Operations now wants next month’s predictions to automatically create purchase orders for high-volume SKUs. Review the deployment note.

Exhibit: Deployment status

Item	Current state
Output use	Auto-generates POs over $250,000 weekly
Inputs	POS, promotions, supplier lead times
Pipeline	Analyst notebook run manually
Controls	No lineage, validation gates, or approval log
Monitoring	Forecast error checked monthly in a spreadsheet

Which next action is best supported by the exhibit?

Options:

A. Move scoring into a governed production pipeline.
B. Schedule the analyst notebook to run weekly.
C. Publish forecasts in a read-only dashboard.
D. Increase model complexity before creating purchase orders.

Best answer: A

Explanation: A governed data pipeline is needed when model outputs directly affect business-critical operations, especially automated financial or supply-chain actions. The exhibit shows purchase orders over $250,000 are generated from forecasts, but the current process lacks reproducible lineage, validation gates, approval evidence, and operational monitoring. Moving scoring into a governed pipeline makes the model workflow auditable and reliable before it can trigger purchasing decisions.

The key issue is not whether the model can forecast; it is whether the operational process is controlled enough to safely use the forecast as an automated decision input.

Model complexity misses the governance risk; a more complex model can still produce uncontrolled business actions.
Read-only dashboard changes consumption but does not govern automated purchase-order generation.
Notebook scheduling adds automation without lineage, validation gates, approval records, or production monitoring.

Question 27

Topic: Mathematics and Statistics

A credit-risk team is comparing logistic regression models trained on the same dataset to estimate default probability. The model will be submitted for regulatory review, so the business requirement is to prefer a simpler explanation unless fit improvement is material; validation log-loss changes below 0.005 are not considered material.

Model	Predictors	AIC	BIC	Validation log loss
Base	12	48,120	48,250	0.362
Expanded	28	47,980	48,420	0.360
Interaction-heavy	75	48,010	48,970	0.359

Which approach best maps to these requirements?

Options:

A. Average all three models to avoid model selection bias
B. Select the Interaction-heavy model using validation log loss
C. Select the Expanded model using AIC
D. Select the Base model using BIC

Best answer: D

Explanation: AIC and BIC both balance model fit against complexity, but BIC penalizes additional parameters more strongly, especially with larger samples. In this scenario, the regulatory and business requirement favors parsimony unless the fit improvement is material. The Base model has the lowest BIC, and the larger models improve validation log loss by only 0.002 to 0.003, which is below the stated 0.005 materiality threshold. AIC would favor the Expanded model because it rewards improved fit with a lighter penalty, but that does not match the stated need for a simpler, more defensible model.

AIC preference misses that the business requirement prioritizes parsimony over a small fit gain.
Lowest log loss is tempting, but the improvement is below the stated materiality threshold.
Model averaging may improve prediction but reduces interpretability and does not use the AIC/BIC trade-off requested.
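
The selection logic can be expressed as a short rule over the exhibit values. This is one possible encoding of "prefer the simpler model unless the fit gain is material", not a standard procedure; the function name is illustrative.

```python
# Values copied from the exhibit.
models = [
    {"name": "Base", "predictors": 12, "aic": 48120, "bic": 48250, "val_log_loss": 0.362},
    {"name": "Expanded", "predictors": 28, "aic": 47980, "bic": 48420, "val_log_loss": 0.360},
    {"name": "Interaction-heavy", "predictors": 75, "aic": 48010, "bic": 48970, "val_log_loss": 0.359},
]
MATERIALITY = 0.005  # minimum log-loss improvement treated as material

def select_parsimonious(candidates, materiality):
    """Start from the lowest-BIC model; switch only for a material log-loss gain."""
    chosen = min(candidates, key=lambda m: m["bic"])
    for m in sorted(candidates, key=lambda m: m["val_log_loss"]):
        gain = chosen["val_log_loss"] - m["val_log_loss"]
        if gain >= materiality:
            chosen = m
            break
    return chosen

print("Lowest AIC:", min(models, key=lambda m: m["aic"])["name"])
print("Selected:", select_parsimonious(models, MATERIALITY)["name"])
```

AIC alone points to the Expanded model, while the BIC-first rule with the 0.005 materiality check stays with the Base model.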

Question 28

Topic: Modeling, Analysis, and Outcomes

A data scientist is modeling hourly building energy demand using ordinary least squares with outdoor temperature as a single numeric predictor. The model is stable but fails to meet the validation target. Which analysis response is best supported by the exhibit?

Exhibit: EDA and model check

Evidence	Result
Demand vs. temperature	Clear U-shaped curve
Residuals by temperature decile	Positive at cold/hot extremes, negative near 20°C
Train RMSE	18.7
Validation RMSE	19.0
Baseline RMSE	20.1

Options:

A. Use validation RMSE as the final production KPI.
B. Increase regularization to reduce coefficient variance.
C. Remove cold and hot records as outliers.
D. Add nonlinear temperature features and revalidate the model.

Best answer: D

Explanation: The exhibit indicates underfitting from non-linearity, not a variance or stability problem. A single linear temperature coefficient can only model a straight-line effect, but the demand pattern is U-shaped: energy use rises at both cold and hot extremes and is lower near mild temperatures. The residual pattern confirms the misspecification because errors are systematic across temperature deciles rather than randomly scattered. A defensible next step is to represent the curved relationship, such as with polynomial terms, splines, binning, interactions, a GAM, or a tree-based model, then compare performance on validation data. Removing extremes would discard real operating conditions, and regularization would not add the missing curvature.

Regularization trap fails because shrinking coefficients addresses variance, not a systematically curved residual pattern.
Outlier removal trap fails because cold and hot observations are valid parts of the U-shaped relationship.
KPI substitution trap fails because changing the reporting metric does not fix model misspecification.
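
The effect of adding a curved temperature feature can be shown on synthetic data. The sketch below uses noise-free, made-up demand values with a minimum at 20°C; the 20°C centre and the quadratic form are assumptions for illustration, and the errors are in-sample only, whereas a real comparison would use validation data.

```python
import math

def fit_simple_ols(x, y):
    """Closed-form OLS for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def rmse(x, y, a, b):
    return math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(y))

# Synthetic U-shaped demand (illustrative only).
temps = [float(t) for t in range(0, 41, 2)]
demand = [50.0 + 0.5 * (t - 20.0) ** 2 for t in temps]

# Model 1: raw temperature as the only predictor.
a1, b1 = fit_simple_ols(temps, demand)
rmse_linear = rmse(temps, demand, a1, b1)

# Model 2: squared distance from 20°C as the predictor.
curved = [(t - 20.0) ** 2 for t in temps]
a2, b2 = fit_simple_ols(curved, demand)
rmse_curved = rmse(curved, demand, a2, b2)

print(f"linear RMSE = {rmse_linear:.2f}, curved RMSE = {rmse_curved:.2f}")
```

On this symmetric data the straight-line fit has a slope of essentially zero, so it predicts the mean everywhere and leaves the same positive-at-extremes, negative-in-the-middle residual pattern described in the exhibit.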

Question 29

Topic: Machine Learning

A fraud detection team must select a model for a real-time scoring pilot. The business goal is to improve precision-recall performance on rare fraud cases while keeping p95 scoring latency under 80 ms. Validation used time-based splits to reflect production drift.

Candidate	Train log loss	CV log loss	Holdout PR-AUC	p95 latency
L2 logistic regression	0.41	0.45	0.31	8 ms
Deep tree ensemble	0.05	0.62	0.24	210 ms
Regularized shallow boosting	0.32	0.38	0.36	45 ms
Large stacked ensemble	0.22	0.35	0.37	160 ms

Which response is the BEST professional decision?

Options:

A. Pilot regularized shallow boosting with monitoring
B. Use the large stacked ensemble for highest PR-AUC
C. Deploy the deep tree ensemble for lowest training loss
D. Keep L2 logistic regression because it is simplest

Best answer: A

Explanation: The key bias-variance signal is the gap between training and validation performance, not training loss alone. The deep tree ensemble has very low training loss but much worse CV loss and holdout PR-AUC, indicating high variance and poor generalization. The regularized shallow boosting model has a smaller train-CV gap, the best latency-compliant holdout PR-AUC, and fits the business goal of improving rare-event detection. The stacked ensemble has slightly higher PR-AUC, but it violates the 80 ms latency requirement, so it is not operationally suitable. A simpler model is not automatically best when a more capable model generalizes better and meets deployment constraints.

Training-loss trap fails because low training loss with poor CV and holdout performance indicates overfitting.
Metric-only choice fails because the stacked ensemble exceeds the real-time latency constraint.
Simplicity bias fails because logistic regression is operationally easy but underperforms the latency-compliant boosted model.
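
The decision can be reproduced as a filter-then-rank step over the exhibit values: drop candidates that miss the latency limit, then take the best holdout PR-AUC among the rest. The dictionary keys and function name below are illustrative.

```python
# Values copied from the exhibit.
candidates = [
    {"name": "L2 logistic regression", "train_ll": 0.41, "cv_ll": 0.45, "pr_auc": 0.31, "p95_ms": 8},
    {"name": "Deep tree ensemble", "train_ll": 0.05, "cv_ll": 0.62, "pr_auc": 0.24, "p95_ms": 210},
    {"name": "Regularized shallow boosting", "train_ll": 0.32, "cv_ll": 0.38, "pr_auc": 0.36, "p95_ms": 45},
    {"name": "Large stacked ensemble", "train_ll": 0.22, "cv_ll": 0.35, "pr_auc": 0.37, "p95_ms": 160},
]
MAX_P95_MS = 80

def pick_model(rows, max_latency_ms):
    """Best holdout PR-AUC among models that meet the latency constraint."""
    eligible = [r for r in rows if r["p95_ms"] < max_latency_ms]
    return max(eligible, key=lambda r: r["pr_auc"])

for r in candidates:
    gap = r["cv_ll"] - r["train_ll"]
    print(f'{r["name"]}: train-CV gap = {gap:.2f}, latency ok = {r["p95_ms"] < MAX_P95_MS}')

print("Pilot candidate:", pick_model(candidates, MAX_P95_MS)["name"])
```

The printed gaps also make the overfitting signal explicit: the deep tree ensemble has by far the largest difference between training and cross-validated log loss.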

Question 30

Topic: Mathematics and Statistics

A retailer is choosing a validation design for a weekly demand forecast that will be used for staffing and inventory planning. The team initially reports a strong random-split result, but the business will rely most on the model during late-year peaks.

Exhibit: EDA and validation summary

Evidence	Observation
History	4 years of weekly sales
Demand pattern	Weeks 46-52 average 2.4x other weeks
ACF	Strong spike at lag 52
Random 80/20 split	MAE = 9.6 units
Last-26-weeks holdout	MAE = 24.8 units; largest errors in weeks 46-52

Options:

A. Remove weeks 46-52 as outliers before validation.
B. Aggregate weekly sales to one annual total per store.
C. Use time-ordered validation that preserves annual seasonal cycles.
D. Keep the random split because it has the lowest MAE.

Best answer: C

Explanation: Seasonality should influence forecasting validation when outcomes repeat in a time-linked pattern and the deployment period depends on that pattern. Here, late-year weeks are systematically higher, the autocorrelation spike at lag 52 indicates an annual cycle, and chronological holdout performance is much worse than the random split. A random split can leak seasonal information across train and test records, making the model look better than it will perform in future seasonal peaks. A better approach is rolling-origin or blocked time-series validation that preserves order and includes full seasonal cycles, with evaluation focused on the business-critical weeks. The key takeaway is that seasonal structure is not noise; it must be represented in both feature design and validation design.

Random split optimism fails because shuffled records can mix the same seasonal regime across training and test sets.
Outlier removal fails because weeks 46-52 are expected recurring demand peaks, not anomalous records.
Annual aggregation fails because it hides the weekly seasonal pattern needed for staffing and inventory decisions.
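
Rolling-origin validation can be sketched as a generator of index splits in which training data always precedes test data. The 104-week minimum training window and the 26-week step are assumptions; only the four-year history and the 26-week holdout length come from the exhibit.

```python
def rolling_origin_splits(n_weeks, initial_train, horizon, step):
    """Yield (train_indices, test_indices) with training strictly before testing."""
    start = initial_train
    while start + horizon <= n_weeks:
        yield list(range(0, start)), list(range(start, start + horizon))
        start += step

# 4 years of weekly data; train on at least 2 full years, test 26 weeks at a time.
splits = list(rolling_origin_splits(n_weeks=208, initial_train=104, horizon=26, step=26))

for train_idx, test_idx in splits:
    print(f"train weeks 0-{train_idx[-1]}, test weeks {test_idx[0]}-{test_idx[-1]}")
```

Because every training window spans at least two full years, each fold has seen complete annual cycles before it is asked to forecast, and the test windows can then be scored separately on weeks 46-52.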

Question 31

Topic: Operations and Processes

A health insurer is preparing to deploy a new claims-triage model as a containerized API. The application code has passed unit and integration tests, but the model was trained on last quarter’s claims, uses a feature pipeline updated weekly, and must meet a documented false-negative tolerance before routing cases automatically. Which deployment decision best addresses concerns that are specific to deploying the model rather than ordinary application code?

Options:

A. Deploy only after rewriting the model service in the same language as the core claims platform
B. Version the model, features, and training data; validate on current holdout data; monitor drift and false negatives after release
C. Promote the container because code tests passed and rollback the API if latency increases
D. Freeze all incoming claim fields so production data always matches the training dataset

Best answer: B

Explanation: Deploying a model is not just shipping deterministic application code. The released artifact depends on training data, feature definitions, validation data, thresholds, and real-world data distributions that can change after deployment. In this scenario, the decision must ensure the model still meets the false-negative tolerance on relevant current data, that feature and data lineage are reproducible, and that post-release monitoring catches drift or performance decay. Ordinary CI/CD checks such as unit tests, container health, and latency are necessary but insufficient because they do not prove the model remains clinically or operationally acceptable.

Code-only promotion misses that passing service tests does not validate model behavior, data lineage, or the false-negative requirement.
Language rewrite focuses on implementation consistency, not model-specific risks such as feature skew or drift.
Freezing fields is unrealistic and can block legitimate data evolution instead of managing schema and distribution changes.

Question 32

Topic: Mathematics and Statistics

A team is comparing two binary classifiers using mean negative log likelihood, where each case contributes $-\ln(p_{true})$. The exhibit shows the probability each model assigned to the actual class on the same validation cases.

Case	Model A $p_{true}$	Model B $p_{true}$
1	0.95	0.70
2	0.90	0.65
3	0.85	0.60
4	0.01	0.55
Mean NLL	1.23	0.47

Which interpretation is best supported by the exhibit?

Options:

A. Model A should be preferred because it is more confident on most cases.
B. The models are equivalent because log loss evaluates only class labels.
C. Model A’s near-zero true-class probability creates a large log-loss penalty.
D. Model B’s lower loss occurs because logarithms cap low-probability penalties.

Best answer: C

Explanation: Negative log likelihood uses logarithms to convert a product of assigned probabilities into additive penalties. A high probability for the true class produces a small penalty, but assigning a probability near zero to the true class produces a very large penalty. In the exhibit, Model A is strong on three cases but assigns only 0.01 to the actual class on case 4, so $-\ln(0.01)$ is large enough to drive up the mean NLL. Model B is less confident but consistently assigns moderate probability to the true class, producing lower average loss. The key takeaway is that log-based losses strongly punish confidently wrong probability estimates, not just incorrect class labels.

Most-case confidence fails because log loss is not a majority vote over rows; one near-zero true-class probability can dominate.
Capped penalty fails because $-\ln(p)$ increases without bound as $p$ approaches zero.
Class-label only fails because log loss evaluates the probabilities assigned to the true class, not just the predicted labels.
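
The per-case penalties can be recomputed from the probabilities in the exhibit using the stated formula $-\ln(p_{true})$ with the natural logarithm. The sketch below is a direct calculation; the variable and function names are illustrative.

```python
import math

def mean_nll(true_class_probs):
    """Mean negative log likelihood of the probabilities given to the true class."""
    return sum(-math.log(p) for p in true_class_probs) / len(true_class_probs)

# Probabilities assigned to the actual class, copied from the exhibit.
model_a = [0.95, 0.90, 0.85, 0.01]
model_b = [0.70, 0.65, 0.60, 0.55]

for name, probs in [("Model A", model_a), ("Model B", model_b)]:
    per_case = [round(-math.log(p), 3) for p in probs]
    print(name, per_case, round(mean_nll(probs), 2))
```

Case 4 alone contributes about 4.61 to Model A's total, far more than its other three cases combined, which is why its mean ends up higher than Model B's.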

Question 33

Topic: Modeling, Analysis, and Outcomes

A subscription platform uses a churn-risk model to target retention offers. Business requirements are recall $\ge 0.72$ on the minority churn class, precision $\ge 0.40$, p95 scoring latency under 80 ms, and no increase in offer budget. A data scientist proposes replacing the current model with a newer deep model.

Exhibit: Validation summary

Model	Recall	Precision	p95 latency	Notes
Current gradient boosting	0.74	0.43	35 ms	Stable across 5 folds
New deep model	0.75	0.39	140 ms	Higher training AUC
Logistic regression	0.66	0.45	12 ms	Below recall target

Which decision best aligns with the requirements?

Options:

A. Switch to logistic regression because it has the lowest latency.
B. Replace it with the deep model because training AUC is higher.
C. Deploy the deep model as the primary model to gather production evidence.
D. Keep the current model and iterate only if validation shows a KPI gap.

Best answer: D

Explanation: Model iteration should be justified by evidence that the current model fails a required outcome or that a challenger improves the required metrics without violating constraints. Here, the current gradient-boosted model meets recall, precision, latency, and stability requirements. The deep model is newer and has slightly higher recall, but it fails precision and latency and relies on training AUC, which is weaker evidence than validation performance. The logistic model is fast but misses the recall requirement. The appropriate decision is to retain the validated model and reserve iteration for a demonstrated KPI, calibration, drift, fairness, or operational gap.

Training AUC trap fails because training performance does not override validation precision and latency violations.
Production-first testing creates avoidable operational risk when validation already shows unmet constraints.
Latency-only selection misses the business requirement for minimum churn recall.

Question 34

Topic: Operations and Processes

A health insurer wants to train a readmission-risk model using synthetic patient records because real records contain protected health information. A small permitted audit against de-identified real records shows the synthetic generator matches age and diagnosis frequencies, but it underrepresents rare comorbidity combinations and breaks the relationship between medication changes and 30-day readmission. Which pipeline decision best maps to these requirements?

Options:

A. Train and test only on independent synthetic samples
B. Accept the data because marginal distributions match
C. Validate with real holdout edge cases before use
D. Increase the number of synthetic records generated

Best answer: C

Explanation: Synthetic data can reduce privacy exposure, but it is not automatically representative. In this scenario, the generator matches simple one-variable frequencies while failing on joint relationships and rare edge cases that directly affect readmission risk. The safest pipeline decision is to keep a permitted real-data validation step, focused on edge-case coverage, subgroup behavior, and downstream model performance. If the synthetic data fails those checks, the generator, sampling strategy, or training design must be revised before deployment. Matching marginals alone is not enough when the target depends on interactions among clinical variables.

More synthetic rows can reduce sampling noise but will not fix a generator that omits rare combinations or breaks causal-looking relationships.
Marginal matching misses multivariate structure, which is the stated failure mode in the audit.
Synthetic-only testing can hide the same generator bias in both training and evaluation data.
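
The gap between matching marginals and matching joint structure can be illustrated with a minimal sketch. The toy records and field meanings below are hypothetical, not taken from the scenario's audit; they only show how two datasets can agree on every one-variable frequency while disagreeing on the relationship between variables.

```python
from collections import Counter

# Hypothetical toy records: (medication_changed, readmitted_30d), both 0/1.
real = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
# Synthetic rows with identical marginals but no link between the two fields.
synthetic = [(1, 1)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25 + [(0, 0)] * 25

def marginal_rate(rows, index):
    """Share of rows where the field at `index` equals 1."""
    return sum(row[index] for row in rows) / len(rows)

def joint_rates(rows):
    """Relative frequency of each (medication_changed, readmitted_30d) pair."""
    counts = Counter(rows)
    return {pair: n / len(rows) for pair, n in counts.items()}

def readmit_rate_given_change(rows, changed):
    """Readmission rate among rows with the given medication_changed value."""
    subset = [row for row in rows if row[0] == changed]
    return sum(row[1] for row in subset) / len(subset)

for name, rows in (("real", real), ("synthetic", synthetic)):
    print(name,
          "P(change)=", marginal_rate(rows, 0),
          "P(readmit)=", marginal_rate(rows, 1),
          "P(readmit | change)=", readmit_rate_given_change(rows, 1))
```

Both datasets report the same one-variable rates, yet the conditional readmission rate differs, which is the kind of failure a marginal-only check would miss.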

Question 35

Topic: Specialized Applications of Data Science

A manufacturer is building an optimization model to choose weekly production quantities across three product lines. The business goal is to improve weekly profit, but the company pays penalties for missed contracted orders. Machine hours and raw material inventory are fixed for the week, and compliance requires a minimum safety-stock level for one regulated component. Which optimization framing is the BEST professional decision?

Options:

A. Maximize total units produced from available machine hours
B. Minimize demand forecast error before selecting quantities
C. Maximize expected net contribution after penalties with operational constraints
D. Minimize average unit production cost across all products

Best answer: C

Explanation: The objective function should match the business outcome, not a proxy metric. Here, the decision variable is weekly production quantity, and the business goal is profit improvement under real constraints. A suitable constrained optimization framing maximizes expected net contribution, including revenue, variable costs, and missed-contract penalties, while enforcing machine-hour, raw-material, and safety-stock constraints. This keeps the model from recommending infeasible or noncompliant production plans.

Cost minimization or throughput maximization can look efficient but may sacrifice high-margin products or ignore contract penalties. Forecast accuracy matters as an input quality issue, but it is not the production optimization objective by itself.

Unit cost focus fails because the cheapest production mix may reduce profit or miss contracted orders.
Throughput focus fails because producing the most units ignores margin differences and penalty exposure.
Forecast-error focus confuses predictive model quality with the downstream production decision objective.
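
One possible way to write the framing in option C is sketched below. Every symbol is an assumption introduced for illustration: $x_i$ is the weekly quantity of product $i$, $p_i$ and $c_i$ are its unit price and variable cost, $d_i$ is the contracted order volume, $s_i$ is the contracted shortfall, $\pi_i$ is the penalty per missed unit, $h_i$ and $m_i$ are machine hours and raw material per unit, $H$ and $M$ are the fixed weekly totals, $a_i$ is regulated-component usage per unit, $R$ is the starting component inventory, and $S_{\min}$ is the required safety stock.

$$
\begin{aligned}
\max_{x,\,s}\quad & \sum_{i=1}^{3} (p_i - c_i)\,x_i \;-\; \sum_{i=1}^{3} \pi_i\, s_i \\
\text{s.t.}\quad & \sum_{i=1}^{3} h_i x_i \le H, \\
& \sum_{i=1}^{3} m_i x_i \le M, \\
& R - \sum_{i=1}^{3} a_i x_i \ge S_{\min}, \\
& s_i \ge d_i - x_i,\quad s_i \ge 0,\quad x_i \ge 0 \qquad (i = 1, 2, 3).
\end{aligned}
$$

This sketch treats prices, costs, and contracted volumes as known; if any of them are uncertain, the corresponding terms would be replaced by their expected values.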

Question 36

Topic: Machine Learning

A media platform has a large user-item interaction matrix and wants to discover compact latent preference features before clustering users. Which next action is best supported by the exhibit?

Exhibit: Matrix profile

Attribute	Observation
Rows	2,000,000 users
Columns	80,000 titles
Values	Watch time, mostly zero
Goal	Low-rank latent factors
Constraint	Preserve strongest shared structure

Options:

A. Apply truncated SVD to factorize the interaction matrix
B. Use naive Bayes to classify users by genre
C. Run k-means directly on the raw interaction matrix
D. Expand title IDs with one-hot encoding

Best answer: A

Explanation: Singular value decomposition (SVD) is the appropriate technique when the evidence points to matrix factorization and latent structure. A large sparse user-item matrix can be approximated by lower-rank factors, often written as $X \approx U\Sigma V^T$, where the retained components capture the strongest shared patterns across users and items. Those compact factors can then be used for downstream tasks such as clustering, recommendation, or visualization. Running clustering directly on the full sparse matrix is usually less efficient and noisier, while expanding categorical IDs increases dimensionality instead of reducing it. The key clue is the requirement for low-rank latent factors from a matrix.

Raw clustering ignores the requested factorization step and keeps the high-dimensional sparse representation.
One-hot expansion increases sparsity and dimensionality rather than extracting latent structure.
Naive Bayes classification requires labels and a supervised target, which the scenario does not provide.
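
The low-rank approximation $X \approx U\Sigma V^T$ can be sketched on a tiny dense matrix with NumPy. The matrix, its size, and the choice of $k = 2$ are assumptions for illustration only; at the scale in the exhibit, a sparse solver such as SciPy's `svds` or scikit-learn's `TruncatedSVD` would be the more practical route.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy watch-time matrix: 6 users x 5 titles built from 2 latent tastes.
user_factors = rng.random((6, 2))
title_factors = rng.random((2, 5))
X = user_factors @ title_factors

def truncated_svd(matrix, k):
    """Return the rank-k factors (U_k, S_k, Vt_k) by truncating a full SVD."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

U_k, S_k, Vt_k = truncated_svd(X, k=2)

# Compact per-user latent features that could feed a clustering step.
user_embeddings = U_k * S_k
# Reconstruction from the retained components only.
X_approx = user_embeddings @ Vt_k

print(user_embeddings.shape)
print(np.allclose(X, X_approx))
```

Because the toy matrix has rank 2 by construction, two components reproduce it; real interaction data would be approximated rather than recovered exactly.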

Question 37

Topic: Machine Learning

A hospital imaging team is considering adopting a deep-learning segmentation architecture reported by another hospital. The published model performed well, and the team has permission to reuse the architecture and pretrained weights. Which next action is best supported by the exhibit before clinical workflow deployment?

Exhibit: Transfer review

Evidence	Finding
Source training data	Adult scans, scanner vendor A
Organization data	Adult/pediatric scans, vendors A/C
Label protocol	Local radiologists use narrower lesion boundaries
Internal labeled sample	300 recent cases available
Published Dice score	0.91 on source holdout

Options:

A. Reproduce validation only on the source holdout set
B. Discard the pretrained weights and train from scratch
C. Validate on local cases, then fine-tune if needed
D. Deploy unchanged because the source Dice score is high

Best answer: C

Explanation: Transfer learning can reduce training effort, but it does not prove the model will perform safely on a new organization’s data. The exhibit shows several domain and labeling differences: pediatric cases, different scanner vendors, and a narrower local boundary definition. These can change image distributions and the target labels, so the source holdout Dice score is not enough. Because the team has 300 labeled local cases, the defensible next step is local validation and possible fine-tuning or recalibration based on those results. The key takeaway is that a strong published deep-learning result is evidence to investigate, not evidence to deploy without organization-specific validation.

Source benchmark reliance fails because the source holdout does not represent local scanners, patients, or labeling rules.
Source-only reproduction checks implementation fidelity but does not measure performance on the organization’s data distribution.
Training from scratch is unnecessarily extreme because pretrained weights may still be useful after local validation or fine-tuning.
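
The effect of a narrower local label protocol on the Dice score can be seen in a small sketch. The masks below are hypothetical one-dimensional stand-ins for segmentation masks, not data from the exhibit.

```python
def dice_score(pred, truth):
    """Dice coefficient for two binary masks given as flat 0/1 sequences."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total

# Hypothetical masks for one lesion.
model_mask = [0, 1, 1, 1, 1, 0]    # boundary style the model learned at the source site
source_label = [0, 1, 1, 1, 1, 0]  # source-site annotation (wider boundary)
local_label = [0, 0, 1, 1, 0, 0]   # local annotation (narrower boundary)

print(dice_score(model_mask, source_label))
print(dice_score(model_mask, local_label))
```

The same prediction scores lower against the narrower local label, which is why the published source-holdout score cannot stand in for local validation.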

Question 38

Topic: Modeling, Analysis, and Outcomes

A subscription company is analyzing churn drivers before recommending retention offers. Multivariate EDA shows that monthly_fee has almost no overall association with churn, but partial-dependence-style plots and stratified summaries show opposite patterns by contract_type: higher fees increase churn for month-to-month customers and slightly decrease churn for annual customers. Data volume is adequate in both segments, and leadership needs an interpretable recommendation. What is the BEST next analytical decision?

Options:

A. Model and report the monthly_fee by contract_type interaction
B. Use only univariate churn rates by monthly_fee
C. Drop monthly_fee because its overall association is weak
D. Deploy a high-capacity ensemble without segment interpretation

Best answer: A

Explanation: Multivariate EDA can reveal that a predictor’s relationship with an outcome changes across levels of another variable. Here, the weak overall association for monthly_fee masks different churn behavior by contract_type, which is a classic sign of an interaction or segmented effect. Because both segments have adequate data and leadership needs an interpretable recommendation, the best decision is to explicitly test, model, and communicate the interaction rather than relying on a pooled effect. This could involve an interaction term in a regression-style model, stratified effect estimates, or interpretable model diagnostics that show segment-specific behavior. The key takeaway is that aggregation can hide meaningful patterns when subgroup relationships differ.

Averaging away segments fails because the pooled relationship hides opposite segment-level effects.
Univariate-only analysis misses the multivariate dependency that makes the fee effect conditional on contract type.
Black-box deployment ignores the stakeholder need for interpretable retention guidance before operationalizing a model.
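
A pooled effect hiding opposite segment effects can be shown with stratified churn rates. The counts below are hypothetical and chosen only to mimic the pattern in the stem; the fee bands are an assumed simplification of `monthly_fee`.

```python
# (contract_type, fee_band) -> (customers, churned); hypothetical counts.
counts = {
    ("month_to_month", "low"): (100, 20),
    ("month_to_month", "high"): (100, 40),
    ("annual", "low"): (300, 30),
    ("annual", "high"): (300, 15),
}

def churn_rate(contract=None, fee_band=None):
    """Churn rate for the selected cells; None means 'all values'."""
    customers = churned = 0
    for (c, f), (n, k) in counts.items():
        if (contract is None or c == contract) and (fee_band is None or f == fee_band):
            customers += n
            churned += k
    return churned / customers

def fee_effect(contract=None):
    """Difference in churn rate between high-fee and low-fee customers."""
    return churn_rate(contract, "high") - churn_rate(contract, "low")

print("pooled:", fee_effect())
print("month_to_month:", fee_effect("month_to_month"))
print("annual:", fee_effect("annual"))
```

The pooled difference is close to zero while the two contract types move in opposite directions, which is the signal that an interaction term or stratified estimate is needed.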

Question 39

Topic: Operations and Processes

A healthcare operations team wants to train a supervised model that predicts whether prior authorization requests should be clinically denied. The historical status field is inconsistent: some denials were overturned on appeal, some approvals were made for capacity reasons, and policy changes altered decisions midyear. The business requires defensible model outputs for audit review. Which pipeline decision best maps to these requirements?

Options:

A. Create adjudicated ground truth labels before training
B. Cluster requests and label each cluster by majority status
C. Impute missing status values from similar cases
D. Train on the latest quarter to avoid old policies

Best answer: A

Explanation: The core issue is target-label reliability. In supervised learning, the model learns the relationship between features and the target, so inconsistent or policy-contaminated labels can cause the model to reproduce operational noise rather than the intended clinical decision. For an auditable healthcare use case, the pipeline should first define labeling criteria and obtain adjudicated ground truth, often through clinical review, appeal outcomes, or a controlled labeling workflow. Only after the target is trustworthy should the team train, validate, and monitor the model. Narrowing the data window or imputing labels may reduce some noise, but it does not establish that the target represents the defensible decision the model is supposed to predict.

Status imputation fills missing values but can spread existing label errors into the training target.
Cluster labeling changes the task toward unsupervised grouping and can hide case-level clinical exceptions.
Recent-only training may reduce policy drift but still uses an unreliable status field as the target.

Question 40

Topic: Mathematics and Statistics

A utility company is building an alert-volume forecast for turbine maintenance staffing. After deduplicating repeated messages, the data are nonnegative integer counts per equal 1-hour exposure window; there is no fixed maximum count; events are approximately independent; and the sample mean and variance are both close to 2.7. Which distribution concept best explains this data-generating pattern?

Options:

A. Poisson distribution for event counts
B. Exponential distribution for waiting times
C. Binomial distribution for bounded trial outcomes
D. Normal distribution for symmetric measurements

Best answer: A

Explanation: The core concept is matching a distribution to the data-generating process, not just the observed shape. Independent counts over a fixed interval with no natural upper bound and roughly equal mean and variance are characteristic of a Poisson process. This supports estimating probabilities such as zero alerts or five or more alerts per hour without forcing a continuous or bounded-outcome assumption. If the variance were much larger than the mean, a negative binomial model might be considered, but the stated evidence supports Poisson as the first explanation.

Normal measurements fails because alert volume is discrete, nonnegative count data rather than continuous symmetric measurements.
Bounded trials fails because the stem provides no fixed number of independent trials or maximum count per hour.
Waiting times fails because exponential distributions model time between events, not the number of events in each fixed interval.
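
The probabilities mentioned in the explanation can be sketched directly from the Poisson mass function. The rate of 2.7 alerts per hour is taken from the stem's sample mean; treating it as the true rate is an assumption.

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson count with rate lam per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.7  # alerts per 1-hour window, assumed equal to the sample mean

p_zero = poisson_pmf(0, lam)
# P(N >= 5) is one minus the probability of 0 through 4 alerts.
p_five_or_more = 1 - sum(poisson_pmf(k, lam) for k in range(5))

print(round(p_zero, 4), round(p_five_or_more, 4))
```

If the observed variance were well above the mean, these figures would understate the tail, which is where a negative binomial model becomes worth considering.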

Question 41

Topic: Modeling, Analysis, and Outcomes

A subscription company is building a model at month-end to predict whether an active customer will churn in the next 30 days. The feature engineering review produced this lineage note:

Proposed feature	Source and timing
`avg_session_minutes_30d`	Product events from the 30 days before month-end
`failed_payment_count_90d`	Billing events from the 90 days before month-end
`discount_offer_accepted`	Retention CRM flag recorded after a churn-risk score is generated
`plan_tenure_days`	Account start date through month-end

Which proposed feature should be removed because it introduces leakage?

Options:

A. discount_offer_accepted
B. avg_session_minutes_30d
C. failed_payment_count_90d
D. plan_tenure_days

Best answer: A

Explanation: Feature leakage occurs when a predictor contains information that would not be available at scoring time or is derived from the target or downstream actions related to the target. In this scenario, scoring happens at month-end before the next 30-day churn outcome is known. Historical product, billing, and tenure fields are available by that cutoff. The CRM flag for accepting a discount offer is recorded only after a churn-risk score is generated, so it reflects a post-score intervention rather than pre-score customer behavior. Including it can make validation metrics look artificially strong and then cause the model to fail in deployment, because the field does not exist when the model must make its prediction. The key check is feature availability relative to the prediction timestamp.

Historical usage is acceptable because it uses only product events before the month-end scoring cutoff.
Prior payment failures are acceptable because the billing events occur before the prediction time.
Account tenure is acceptable because it can be computed from known account data at month-end.
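
The availability check can be sketched as a comparison between each feature's latest source timestamp and the scoring cutoff. The dates below are hypothetical; only the feature names come from the lineage note.

```python
from datetime import date

scoring_cutoff = date(2026, 6, 30)  # hypothetical month-end scoring date

# Latest date on which each feature's source data is recorded (hypothetical values).
feature_last_event = {
    "avg_session_minutes_30d": date(2026, 6, 30),
    "failed_payment_count_90d": date(2026, 6, 30),
    "discount_offer_accepted": date(2026, 7, 5),  # recorded after the score is generated
    "plan_tenure_days": date(2026, 6, 30),
}

def leaky_features(last_event, cutoff):
    """Names of features whose source data is recorded after the cutoff."""
    return sorted(name for name, when in last_event.items() if when > cutoff)

print(leaky_features(feature_last_event, scoring_cutoff))
```

A check like this belongs in the feature pipeline itself, so that a field recorded after the prediction timestamp is rejected before it reaches training.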

Question 42

Topic: Machine Learning

A risk analytics team is tuning a gradient-boosted model for loan default prediction. The business requires stable performance on future monthly cohorts, not just the highest training score. Recent results are:

Model state	Train AUC	Validation AUC	Monthly validation AUC range
Current model	0.94	0.71	0.64-0.76
Simpler baseline	0.78	0.74	0.72-0.76

Which response best addresses the bias-variance issue shown by the evidence?

Options:

A. Increase regularization and reduce tree complexity
B. Tune only on the most recent validation month
C. Replace AUC with training accuracy for selection
D. Add more boosting rounds to improve training AUC

Best answer: A

Explanation: The evidence indicates a high-variance model. The current gradient-boosted model fits the training data very well, but it generalizes worse than the simpler baseline and has unstable validation performance across monthly cohorts. A good response is to reduce effective complexity, such as using stronger regularization, shallower trees, fewer leaves, subsampling, or early stopping. This aligns with the business requirement for stable future performance. Adding more capacity would likely increase overfitting, and choosing a metric or validation slice that hides instability would create operational risk rather than solve the bias-variance problem.

More boosting rounds would usually increase model capacity and can worsen overfitting when validation performance is already weak.
Training accuracy selection ignores generalization evidence and may reward the same overfit behavior shown by the train-validation gap.
Most recent month only may overfit model selection to one cohort and miss the required stability across future monthly cohorts.
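
The two variance signals in the results table, the train-validation gap and the spread across monthly cohorts, can be computed side by side. The AUC values are copied from the table; the dictionary layout and function name are assumptions for this sketch.

```python
# AUC values copied from the results table.
models = {
    "current": {"train_auc": 0.94, "val_auc": 0.71, "monthly_range": (0.64, 0.76)},
    "baseline": {"train_auc": 0.78, "val_auc": 0.74, "monthly_range": (0.72, 0.76)},
}

def variance_signals(model):
    """Return (train-validation gap, monthly validation spread), rounded to 2 places."""
    gap = round(model["train_auc"] - model["val_auc"], 2)
    low, high = model["monthly_range"]
    spread = round(high - low, 2)
    return gap, spread

for name, model in models.items():
    print(name, variance_signals(model))
```

The current model shows both a larger gap and a wider monthly spread than the baseline, which is consistent with reducing complexity rather than adding capacity.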

Question 43

Topic: Modeling, Analysis, and Outcomes

A subscription company is building a regularized linear model to predict 30-day renewal spend. Stakeholders require low-latency scoring and coefficient-level explanations. EDA for sessions_last_90d shows a nonnegative, highly right-skewed count with many small values, a few extreme values, and a diminishing-return relationship with spend. Cross-validation must avoid using validation-fold distribution information in preprocessing. Which transformation is the BEST professional decision?

Options:

A. Add a fifth-degree polynomial of the raw count
B. Apply log1p(sessions_last_90d) inside the training pipeline
C. Cap the count at the 99th percentile using all data
D. Apply min-max scaling to the raw session count

Best answer: B

Explanation: A log-style transformation is a strong fit when a nonnegative feature is highly right-skewed and has a diminishing marginal relationship with the target. Using log1p(x) is appropriate for count features because it is defined at zero and compresses large values while preserving order. Placing the transformation inside the training pipeline keeps preprocessing consistent across cross-validation and production scoring. This supports the stated need for a simple, low-latency, interpretable linear model without adding unnecessary complexity. Scaling alone changes units but not shape, while high-degree polynomial terms can create unstable, hard-to-explain behavior.

Min-max scaling preserves the skew and does not address the diminishing-return relationship.
Global capping uses all data to set the threshold, which can leak validation-fold information.
High-degree polynomial terms add complexity and overfitting risk while reducing interpretability.
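
The properties claimed for `log1p` can be checked on a small example. The session counts below are hypothetical, and the skewness helper is a plain moment-based estimate written for this sketch.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed counts with many small values and a few extremes.
sessions_last_90d = [0, 1, 1, 2, 3, 5, 8, 40, 400]
log_sessions = [math.log1p(x) for x in sessions_last_90d]

print("raw skewness:", round(skewness(sessions_last_90d), 2))
print("log1p skewness:", round(skewness(log_sessions), 2))
```

Because `log1p` has no fitted parameters, applying it per row cannot carry validation-fold information into training. In a scikit-learn workflow, one option would be to wrap it with `FunctionTransformer(np.log1p)` as a pipeline step so that cross-validation and production scoring share the same code path.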

Question 44

Topic: Operations and Processes

A manufacturer is deploying a computer vision defect-detection model across plants. Cameras generate high-resolution images that must be scored within 80 ms to stop a production line, and plant policy prohibits raw images from leaving the local network. The data science team also needs centralized experiment tracking, model approval, and periodic retraining using anonymized feature summaries from all plants. Which deployment approach is the BEST professional decision?

Options:

A. Use a hybrid deployment with local inference and cloud-based MLOps services
B. Move all image ingestion and inference to a centralized cloud service
C. Keep all training, inference, and governance tools isolated at each plant
D. Deploy only a batch scoring pipeline after each production shift

Best answer: A

Explanation: Hybrid deployment is appropriate when different parts of the AI workflow have different operational constraints. In this scenario, inference must happen close to the cameras because the production line needs sub-80 ms decisions, and raw images cannot leave the plant network. At the same time, centralized experiment tracking, approval workflows, and retraining across plants create value from shared governance and aggregated learning. A strong design would run preprocessing and inference locally, send only approved anonymized summaries or model telemetry to the central environment, and distribute approved model versions back to plants. The key is not to choose cloud or on-premises exclusively when the constraints clearly require both.

Centralized cloud inference violates the raw-image restriction and is unlikely to meet the plant-level latency requirement.
Fully isolated plants preserve locality but lose centralized governance, cross-plant learning, and consistent model approval.
Batch scoring ignores the real-time stop-the-line requirement, even if it simplifies deployment.

Question 45

Topic: Operations and Processes

A risk analytics team publishes a monthly default-risk report generated from a batch feature pipeline and a scoring model. After the reported high-risk segment increases sharply, audit and business stakeholders ask which source tables, transformation steps, and model version produced the report.

Exhibit: Current pipeline evidence

Artifact	Current state
Source extracts	Stored by run date
Transform jobs	Logs show success/failure only
Feature table	Overwritten after each run
Model registry	Stores model version and metrics
Report output	Stores final PDF and timestamp

Which next action is most directly supported by the exhibit?

Options:

A. Implement end-to-end data lineage capture
B. Increase model retraining frequency
C. Replace batch processing with streaming
D. Add more report-level summary metrics

Best answer: A

Explanation: Data lineage is needed when users must trace how data moved from original sources through transformations, feature creation, model scoring, and final reporting. In this scenario, the available artifacts are incomplete for auditability: source extracts are retained, but transform logs only show job status, the feature table is overwritten, and the final report lacks a documented path back to inputs and intermediate outputs. Capturing lineage would connect source tables, transformation logic, feature versions, model version, run timestamp, and report output so stakeholders can explain the high-risk segment change defensibly.

Retraining, extra metrics, or streaming may be useful in other situations, but they do not solve traceability across the data-to-report chain.

Retraining frequency addresses model freshness, not evidence of which data and transformations produced a specific report.
More summary metrics may improve communication, but they do not create source-to-output traceability.
Streaming conversion changes processing latency, not the ability to reconstruct lineage for an existing batch report.
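
What a captured lineage record might contain can be sketched as a small immutable structure. The class, field names, and identifiers below are all hypothetical; they only illustrate linking a report back to its inputs, transformation version, feature snapshot, and model version.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageRecord:
    """One scoring run, linking a report back to everything that produced it."""
    run_date: str
    source_extracts: tuple   # extract identifiers, stored by run date
    transform_commit: str    # version of the transformation code that ran
    feature_snapshot: str    # immutable feature table version (not overwritten)
    model_version: str       # as stored in the model registry
    report_file: str         # final PDF output

def trace_report(report_file, records):
    """Return the lineage record for a report, or None if none was captured."""
    return next((r for r in records if r.report_file == report_file), None)

# Hypothetical records for two monthly runs.
records = [
    LineageRecord("2026-05-31", ("loans_2026-05-31",), "a1b2c3",
                  "features_v41", "risk-7", "report_2026-05.pdf"),
    LineageRecord("2026-06-30", ("loans_2026-06-30",), "d4e5f6",
                  "features_v42", "risk-7", "report_2026-06.pdf"),
]

june = trace_report("report_2026-06.pdf", records)
print(june)
```

With records like these, a change in the high-risk segment can be traced to a specific extract, transformation version, or feature snapshot rather than inferred from job status logs.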

Question 46

Topic: Mathematics and Statistics

A data science team is building a nearest-neighbor search to group 2 million customer support tickets. Each ticket is encoded as a high-dimensional sparse TF-IDF vector. Ticket length varies widely, and stakeholders care about similar terminology patterns rather than the number of words in a ticket. Which distance metric consideration best maps to these requirements?

Options:

A. Use Euclidean distance on raw TF-IDF vectors
B. Use Hamming distance after binarizing all terms
C. Use Mahalanobis distance with the full covariance matrix
D. Use cosine distance on normalized sparse vectors

Best answer: D

Explanation: Cosine distance is often the right choice for comparing observations represented as high-dimensional sparse text vectors. It compares the angle between vectors, so two tickets with similar term-weight patterns can be close even if one ticket is much longer. This matches the business requirement to group tickets by terminology pattern rather than magnitude. In contrast, magnitude-sensitive metrics can overemphasize document length or total term weight, and covariance-based metrics can be unstable or impractical in very high-dimensional sparse spaces. The key takeaway is to match the metric to what “similar” should mean in the feature space.

Raw Euclidean distance can be dominated by vector magnitude, which conflicts with the requirement to reduce the effect of ticket length.
Full Mahalanobis distance requires reliable covariance estimation, which is usually impractical for millions of sparse text dimensions.
Binarized Hamming distance discards TF-IDF weight information that helps represent term importance and usage patterns.
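
The difference between angle-based and magnitude-sensitive distance can be seen with three tiny vectors. The vectors are hypothetical stand-ins for TF-IDF rows: two share the same term pattern at different lengths, and a third has a different pattern.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance, which is sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_ticket = [1.0, 2.0, 0.0]    # hypothetical term weights
long_ticket = [10.0, 20.0, 0.0]   # same pattern, ten times the magnitude
other_ticket = [0.0, 1.0, 2.0]    # different terminology pattern

print("cosine:", cosine_distance(short_ticket, long_ticket),
      cosine_distance(short_ticket, other_ticket))
print("euclidean:", euclidean_distance(short_ticket, long_ticket),
      euclidean_distance(short_ticket, other_ticket))
```

Cosine distance places the short and long tickets together, while Euclidean distance ranks the ticket with different terminology as the nearer neighbor.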

Question 47

Topic: Mathematics and Statistics

A risk analytics team is building distribution summaries for two insurance variables: annual claim count per policyholder and repair cost per claim. The business needs the probability of exactly 2 claims in a year and the probability that a repair cost falls between $1,000 and $1,500. Which approach best maps to these requirements?

Options:

A. Use a PDF for both variables and compute point probabilities directly.
B. Use a PMF for both variables after rounding repair costs to dollars.
C. Use a PDF for claim counts and read the cost PDF at $1,500.
D. Use a PMF for claim counts and integrate a PDF over the cost interval.

Best answer: D

Explanation: A probability mass function (PMF) assigns probability to discrete outcomes, such as 0, 1, 2, or 3 annual claims. That makes it appropriate for the probability of exactly 2 claims. A probability density function (PDF) describes a continuous variable, such as repair cost, where probability is represented by area over an interval rather than by the density at a single point. For the cost requirement, the appropriate probability is the area under the PDF from $1,000 to $1,500. A PDF value at one dollar amount is a density, not the probability of that exact cost.

Reversed functions fail because claim counts are discrete and repair costs are continuous.
Rounding costs changes the measurement process and can introduce avoidable binning bias.
Point probability fails because continuous variables have interval probabilities, not meaningful exact-value probabilities.
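
The contrast between a point mass and an interval area can be sketched with the standard library. The distribution choices are assumptions made for simplicity: a Poisson claim count with rate 0.8 and a normal repair cost with mean 1200 and standard deviation 300. Real repair costs are often right-skewed, so the normal model is illustrative only.

```python
import math
from statistics import NormalDist

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson count with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Discrete variable: the PMF gives the probability of exactly 2 claims.
claims_rate = 0.8  # assumed annual rate per policyholder
p_exactly_two_claims = poisson_pmf(2, claims_rate)

# Continuous variable: probability is the area under the PDF over an interval.
cost = NormalDist(mu=1200, sigma=300)  # assumed cost model
p_cost_in_range = cost.cdf(1500) - cost.cdf(1000)

def integrate_pdf(dist, low, high, steps=1000):
    """Trapezoidal estimate of the area under dist.pdf between low and high."""
    width = (high - low) / steps
    heights = [dist.pdf(low + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

# The density at a single point is a height, not a probability.
density_at_1500 = cost.pdf(1500)

print(p_exactly_two_claims, p_cost_in_range, density_at_1500)
```

The numerical integral agrees with the CDF difference, while the density at $1,500 is a much smaller number with different units (probability per dollar).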

Question 48

Topic: Operations and Processes

A lender deployed a credit risk model after offline validation. Three weeks later, monitoring shows the automated approval service is still meeting latency targets, but production behavior no longer matches release evidence. Which MLOps response is most appropriate?

Exhibit: Monitoring summary

Evidence	Validation	Production week 3
Input contract	schema v12	schema v13
`debt_to_income` missing	1.2%	38.4%
`zip_risk_index` PSI	0.00	0.31 (alert >0.25)
ROC-AUC on matured labels	0.89	0.68
Calibration slope	0.98	0.61

Data-lineage note: schema v13 renamed the raw debt field; the current pipeline imputes unmatched values to the median.

Options:

A. Wait for more labels before taking action.
B. Retrain immediately on week 3 production records.
C. Lower the decision threshold to restore approval volume.
D. Route to a validated fallback, fix lineage, then revalidate.

Best answer: D

Explanation: This is an MLOps production-monitoring issue, not just a model-tuning issue. The validation evidence no longer applies because the production input contract changed, missingness spiked, drift alerts fired, and real-world discrimination and calibration degraded. The data-lineage note identifies a likely pipeline break: a renamed raw field is being treated as missing and imputed. The safest response is to move decisions to a validated fallback path, correct the lineage or schema mapping, and revalidate the model with the restored production pipeline before resuming automated use. Retraining can be considered later, but not while the input data is known to be corrupted.

Threshold adjustment masks degraded model behavior and does not address the broken feature pipeline.
Immediate retraining risks learning from corrupted production features instead of fixing the data contract first.
Waiting for labels is inappropriate because the exhibit already shows an operational data failure and degraded matured-label performance.
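
The PSI alert in the exhibit can be illustrated with a minimal calculation over binned proportions. The bin proportions below are hypothetical and do not reproduce the exhibit's 0.31; only the 0.25 alert threshold comes from the monitoring summary.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population stability index over matching bins of proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Clip empty bins so the logarithm stays defined.
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

validation_bins = [0.25, 0.25, 0.25, 0.25]  # hypothetical reference distribution
production_bins = [0.05, 0.20, 0.30, 0.45]  # hypothetical shifted distribution
ALERT_THRESHOLD = 0.25                      # alert level shown in the exhibit

value = psi(validation_bins, production_bins)
drifted = value > ALERT_THRESHOLD
print(round(value, 4), drifted)
```

A drift alert like this says the inputs changed, not why; in the scenario, the lineage note is what points to the renamed field as the likely cause.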

Question 49

Topic: Operations and Processes

A retail lender wants an AI service to flag loan applications for enhanced fraud review. Fraud labels are confirmed 30-90 days after funding, underwriters must decide within 2 minutes, and the review team can manually inspect only a limited queue each hour. Which requirement set is the best professional decision to approve before model development?

Options:

A. Executive dashboard layout, fraud trend KPIs, and quarterly model retraining
B. Highest possible AUC, all historical fields, and monthly fraud-label refreshes
C. Deep-learning architecture, GPU budget, and post-funding transaction history
D. Prediction cutoff, available features, risk tolerance, queue capacity, and audit requirements

Best answer: D

Explanation: Requirements for a production data science system should first make the decision context testable and operationally realistic. In this scenario, the team must know the prediction cutoff because underwriters decide within 2 minutes, and the model may use only data available before that cutoff. The requirements also need stakeholder-approved risk tolerance, such as the acceptable balance between missed fraud and unnecessary manual reviews. Because the review team has limited hourly capacity, the scoring threshold and queue design must fit operational constraints. Audit requirements matter because lending fraud review can affect regulated decisions. A strong requirement set prevents leakage, unrealistic evaluation, and a model that performs well offline but cannot be used in the workflow.

AUC-only targeting fails because it ignores decision timing, leakage risk from unavailable fields, and review capacity.
Architecture-first planning overengineers before confirming usable data, business risk tolerance, and workflow constraints.
Dashboard-focused requirements may support reporting but do not define the real-time decision, acceptable errors, or operational readiness.

Question 50

Topic: Machine Learning

A data science team is training a feed-forward neural network for customer churn prediction. The model must keep the same feature set and loss function, but the team wants to reduce hidden-unit co-adaptation during training.

Exhibit: Training log summary

Configuration	Train AUC	Validation AUC	Notes
Baseline MLP, epoch 20	0.98	0.72	Validation loss rising after epoch 8
Baseline MLP, epoch 30	0.99	0.69	Larger train-validation gap

Which next action is best supported by the exhibit?

Options:

A. Increase the number of training epochs
B. Add more hidden units to the network
C. Add dropout during neural network training
D. Evaluate only on the training set

Best answer: C

Explanation: The exhibit shows classic neural network overfitting: training AUC is very high while validation AUC declines as training continues. Dropout is a regularization technique that randomly disables a fraction of units during each training update, forcing the network to learn more robust distributed representations instead of relying on specific hidden-unit combinations. This directly addresses hidden-unit co-adaptation while keeping the feature set and loss function unchanged. Increasing model capacity or training longer would likely worsen the train-validation gap, and removing validation evidence would hide the problem rather than fix it.

More epochs fails because validation performance is already deteriorating as training continues.
More hidden units fails because added capacity can intensify overfitting when validation performance is weak.
Training-only evaluation fails because it removes the evidence needed to detect generalization problems.
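
The mechanism described above can be sketched in a few lines. This uses the common "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at inference; the function name and the plain-list representation are assumptions for this sketch.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # inference: all units pass through unchanged
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10  # hypothetical hidden-layer activations

train_out = dropout(hidden, rate=0.5, rng=rng)
eval_out = dropout(hidden, rate=0.5, rng=rng, training=False)
print(train_out)
print(eval_out)
```

Because a different random subset of units is silenced on each update, no hidden unit can rely on a specific partner always being present, which is the co-adaptation the team wants to reduce.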

Questions 51-75

Question 51

Topic: Operations and Processes

A subscription business asks the data science team to “build an AI model that predicts customer churn.” The available data includes account history, product usage, support tickets, and cancellation dates. Stakeholders have not specified what decision the score will drive, who will act on it, the intervention capacity, or how success will be measured. Which next step best maps to the requirement?

Options:

A. Train a gradient-boosted classifier using historical cancellations
B. Cluster customers by usage and support-ticket patterns
C. Define the operational decision and KPI before selecting a model
D. Optimize the model for the highest possible ROC AUC

Best answer: C

Explanation: A modeling request is under-specified when it names a prediction target but not the operational decision the prediction will support. In this case, “predict churn” could mean prioritizing retention calls, triggering discounts, routing accounts to customer success, forecasting revenue loss, or measuring product risk. Each use implies different labels, features, thresholds, validation data, costs, and KPIs. Before choosing a model family or metric, the team should clarify the action, decision cadence, capacity constraints, and business outcome. The key takeaway is that a technically valid churn model may still be unusable if it is not tied to a defined decision workflow.

Model-first approach misses that historical cancellations alone do not define the intervention, threshold, or business objective.
Unsupervised segmentation may help exploration, but it does not resolve the missing operational decision.
AUC optimization selects a technical metric before confirming whether ranking quality, calibration, lift, cost, or ROI matters most.

Question 52

Topic: Modeling, Analysis, and Outcomes

A streaming media company wants to reduce voluntary cancellations by sending a limited number of retention offers each week. The scoring job will run every Monday for currently active subscribers, and the marketing team needs a ranked list of customers likely to cancel within the next 30 days. The data warehouse includes app usage, support interactions, billing events, plan changes, and a cancellation_reason field that is populated only after cancellation. Which model-design decision is BEST?

Options:

A. Cluster subscribers into behavior segments and offer discounts to the largest cluster.
B. Train a regression model to predict next-month revenue for all subscribers.
C. Train a binary classifier for 30-day voluntary churn using only pre-scoring predictors.
D. Train a classifier using cancellation_reason to improve churn prediction accuracy.

Best answer: C

Explanation: The business objective is intervention: identify active subscribers who are likely to voluntarily cancel soon enough for a retention offer to matter. That calls for a supervised binary classification target such as “voluntary cancellation within 30 days,” scored only on currently active subscribers. Predictors must be available before the Monday scoring time, such as prior usage, support, billing, and plan-change history. Post-outcome fields like cancellation_reason would leak information because they are known only after cancellation. Evaluation should emphasize ranking and action value, such as lift, precision at the offer capacity, calibration, and offer ROI, not just generic accuracy.

Revenue target mismatch fails because next-month revenue does not directly identify customers who need a cancellation-prevention offer.
Unsupervised segmentation may support exploration, but it does not optimize the stated 30-day churn ranking objective.
Post-cancel leakage fails because cancellation_reason is unavailable when active subscribers are scored.
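
The ranking-focused evaluation mentioned in the explanation can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the validation scores, the labels, and the weekly_offer_capacity value are hypothetical, and precision_at_k is a helper name introduced here.

```python
from typing import List, Tuple

def precision_at_k(scored: List[Tuple[float, int]], k: int) -> float:
    """Share of true churners among the k highest-scored subscribers."""
    if k <= 0 or not scored:
        raise ValueError("need a positive k and at least one scored subscriber")
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(label for _, label in top) / len(top)

# Hypothetical validation rows: (predicted churn score, churned within 30 days)
validation = [
    (0.91, 1), (0.85, 0), (0.77, 1), (0.60, 1),
    (0.42, 0), (0.30, 0), (0.12, 0), (0.05, 1),
]
weekly_offer_capacity = 4  # assumed number of offers marketing can send

p_at_k = precision_at_k(validation, weekly_offer_capacity)
base_rate = sum(label for _, label in validation) / len(validation)
lift = p_at_k / base_rate  # how much better than contacting subscribers at random

print(p_at_k, base_rate, lift)
```

Setting k to the offer capacity ties the metric to what the marketing team can actually act on, which a global accuracy figure does not do.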

Question 53

Topic: Specialized Applications of Data Science

A legal analytics team is building a search feature for 80,000 internal policy documents. Stakeholders need interpretable keyword signals that emphasize terms that are distinctive to a small subset of documents, while reducing the influence of common boilerplate words that appear across nearly every document. The first release must be inexpensive, fast to retrain nightly, and easy to explain to compliance reviewers. Which representation is the best professional decision?

Options:

A. Fine-tune a large transformer embedding model
B. Use one-hot encoding for each document label
C. Use TF-IDF vectors for the document corpus
D. Use raw term-count vectors only

Best answer: C

Explanation: TF-IDF is well suited when the goal is interpretable term weighting across a document collection. It combines term frequency with inverse document frequency, so terms that occur often in a document but rarely across the corpus receive higher weight, while boilerplate terms common to many documents are reduced. This fits the search requirement, supports explainability for compliance reviewers, and can be retrained efficiently on a large text corpus. A transformer model might improve semantic matching later, but it adds cost and complexity when the stated need is distinctive keyword importance rather than deep contextual meaning.

Raw counts preserve frequency but do not reduce the importance of terms that appear across most documents.
One-hot labels represent categories, not term importance within and across documents.
Transformer embeddings may capture semantics, but they are less directly interpretable and overengineer the first-release requirement.
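
To make the weighting concrete, here is a small sketch of plain, unsmoothed TF-IDF (term frequency times ln(N / document frequency)). The three sample documents are invented, and production libraries typically apply smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights

# Hypothetical policy snippets sharing the same boilerplate opening.
docs = [
    "this policy covers data retention",
    "this policy covers travel expenses",
    "this policy covers vendor onboarding",
]
w = tfidf(docs)
print(w[0])
```

Boilerplate terms that appear in every document ("this", "policy", "covers") receive a weight of zero, while a term unique to one document ("retention") keeps a positive weight, which is the behavior the scenario asks for.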

Question 54

Topic: Mathematics and Statistics

A bank is comparing logistic regression churn models fit on the same training set of 150,000 accounts using the same likelihood. The pre-registered selection rule prioritizes BIC because the model must be explainable to risk governance; AIC is secondary for exploratory comparison.

Model	Parameters	AIC	BIC	Validation note
Base demographic	14	82,410	82,560	Calibrated
Behavioral	46	82,230	82,690	Calibrated
Interaction-heavy	160	82,190	83,820	Calibration drift
Segmented ensemble	310	82,260	85,410	Hard to explain

Which model is the BEST professional decision for production?

Options:

A. Deploy the interaction-heavy model
B. Deploy the base demographic model
C. Deploy the behavioral model
D. Deploy the segmented ensemble

Best answer: B

Explanation: AIC and BIC both reward better likelihood but penalize complexity; lower values are preferred. BIC applies a stronger complexity penalty, especially with large sample sizes, so it is often used when parsimony and model identification are more important than maximizing in-sample fit. In this scenario, the selection rule was set before fitting and prioritizes BIC for governance explainability. The base demographic model has the lowest BIC and is calibrated, so the modest AIC improvement from more complex models does not justify added parameters. The key takeaway is to follow the criterion that matches the business and governance objective, not simply the model with the lowest AIC.

Lower AIC trap fails because the behavioral model improves AIC over the base model but has a higher BIC, so it loses under the pre-registered BIC criterion.
Likelihood chasing fails because the interaction-heavy model, despite having the lowest AIC, shows calibration drift and a much worse BIC.
Overengineering fails because the segmented ensemble adds complexity and explainability risk without winning either AIC or BIC.
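
The selection logic can be checked directly against the table. The sketch below uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, which imply BIC - AIC = k (ln n - 2). The table values are rounded, so the gaps match that identity only approximately.

```python
import math

n = 150_000  # training accounts, from the scenario

# Values copied from the question's table.
candidates = {
    "Base demographic":   {"k": 14,  "aic": 82_410, "bic": 82_560},
    "Behavioral":         {"k": 46,  "aic": 82_230, "bic": 82_690},
    "Interaction-heavy":  {"k": 160, "aic": 82_190, "bic": 83_820},
    "Segmented ensemble": {"k": 310, "aic": 82_260, "bic": 85_410},
}

best_bic = min(candidates, key=lambda m: candidates[m]["bic"])
best_aic = min(candidates, key=lambda m: candidates[m]["aic"])

# Extra BIC penalty per parameter relative to AIC at this sample size.
per_param_gap = math.log(n) - 2

print("Lowest BIC:", best_bic)
print("Lowest AIC:", best_aic)
print("Extra BIC penalty per parameter:", round(per_param_gap, 2))
```

At n = 150,000 each additional parameter costs roughly 9.9 more BIC points than AIC points, which is why the 160-parameter model can win on AIC yet lose clearly on BIC.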

Question 55

Topic: Modeling, Analysis, and Outcomes

A data science team is exploring whether push notifications increase 30-day user retention. An initial chart shows users in the highest notification-count decile have much higher retention. Which EDA path is best supported by the exhibit before making an inference?

Exhibit: EDA profile

Check	Finding
Raw row grain	One row per notification event
Duplicate audit	18% duplicate delivery receipts from retries
Outcome grain	One row per user for 30-day retention
Sampling	Paid campaign users oversampled 6:1 vs. organic
Cohort note	Paid users receive more notifications

Options:

A. Deduplicate events, aggregate to user level, and stratify or reweight by acquisition source
B. Remove paid campaign users and analyze only organic users
C. Fit a retention model using the raw event table and notification count as-is
D. Compare mean notifications between retained and churned users without changing the dataset

Best answer: A

Explanation: The core EDA issue is separating a possible behavioral signal from artifacts introduced by row duplication, mismatched aggregation level, and sampling design. The notification feature is measured at event grain, while retention is measured at user grain, so raw event rows can overweight highly messaged users. Duplicate retry receipts inflate counts further. Because paid campaign users are oversampled and also receive more notifications, acquisition source can distort the apparent relationship. A defensible EDA path first removes duplicate event receipts, aggregates notification exposure to the user level, and then compares retention within acquisition strata or applies sampling weights. This checks whether the relationship persists after correcting the artifacts.

Raw event modeling fails because the model would learn from duplicated event rows and a feature grain that does not match the outcome.
Dropping paid users may reduce one bias but discards a major cohort and does not address duplicates or aggregation.
Mean comparison only leaves the original artifact structure intact, so the apparent difference may not represent a true signal.
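
The three corrective steps (deduplicate, aggregate to user grain, stratify by acquisition source) can be sketched with the standard library. All rows below are hypothetical, and the "high"/"low" exposure cutoff of two notifications is an arbitrary choice made for illustration.

```python
from collections import defaultdict

# Hypothetical event rows: (receipt_id, user_id); retries repeat a receipt_id.
events = [
    ("r1", "u1"), ("r1", "u1"), ("r2", "u1"),
    ("r3", "u2"),
    ("r4", "u3"), ("r4", "u3"), ("r5", "u3"), ("r6", "u3"),
]
# Hypothetical user table: user_id -> (acquisition source, retained at 30 days)
users = {
    "u1": ("paid", 1), "u2": ("organic", 0),
    "u3": ("paid", 1), "u4": ("organic", 1),
}

# Step 1: drop duplicate delivery receipts created by retries.
unique_events = set(events)

# Step 2: aggregate notification exposure to the user grain of the outcome.
notif_count = defaultdict(int)
for _, user_id in unique_events:
    notif_count[user_id] += 1

# Step 3: compare retention within acquisition-source strata.
strata = defaultdict(list)
for user_id, (source, retained) in users.items():
    band = "high" if notif_count[user_id] >= 2 else "low"
    strata[(source, band)].append(retained)
retention_by_stratum = {key: sum(v) / len(v) for key, v in strata.items()}

print(dict(notif_count))
print(retention_by_stratum)
```

Where the sampling ratio is known, reweighting by the inverse of the 6:1 oversampling would be an alternative to stratifying.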

Question 56

Topic: Machine Learning

A claims analytics team must choose a production model for fraud triage. The business requires ROC-AUC of at least 0.84, per-claim reason codes for analyst review, weekly retraining by a small MLOps team, and batch scoring overnight.

Candidate	Validation ROC-AUC	Notes
Pruned decision tree	0.78	Easy to explain
Random forest	0.84	Many trees; slower explanations
Gradient-boosted trees	0.87	Supports monotonic constraints and SHAP values
Stacked ensemble	0.89	Combines trees and neural network

Which option best maps to these requirements?

Options:

A. Gradient-boosted trees with constraints and SHAP reason codes
B. Stacked ensemble optimized only for ROC-AUC
C. Random forest without local explanation artifacts
D. Pruned decision tree with analyst-readable rules

Best answer: A

Explanation: The key trade-off is ensemble performance versus interpretability and operational complexity. The pruned tree is easiest to explain, but it fails the required ROC-AUC threshold. The stacked ensemble has the best validation score, but its mixed architecture increases deployment, monitoring, and explanation burden for a small team. Gradient-boosted trees clear the threshold (0.87 against a required 0.84) and can support governance needs through monotonic constraints and local explanation methods such as SHAP. Because scoring is overnight batch rather than real-time, the extra explanation computation is acceptable. The best fit is not the most accurate model in isolation; it is the model that satisfies accuracy, explainability, and operational requirements together.

Simple tree trap fails because interpretability alone does not meet the minimum ROC-AUC requirement.
Accuracy-only trap ignores the reason-code and small-team operational constraints.
Random forest gap fails because the model only just reaches the 0.84 threshold and does not provide the required per-claim explanation artifacts.

Question 57

Topic: Machine Learning

A data science team is tuning a gradient-boosted churn model for a regulated subscription business. The data spans 24 months, churn behavior has seasonal drift, and an audit plan designates the most recent 2 months as a locked final test set. The team must compare model families, tune hyperparameters, and provide one unbiased performance estimate before deployment. Which tuning strategy is the BEST professional decision?

Options:

A. Use random cross-validation across all 24 months for tuning and reporting.
B. Select the best model by validation score and skip test evaluation.
C. Tune repeatedly on the locked test set until metrics stabilize.
D. Use rolling validation within pre-test data, then evaluate once on the locked test set.

Best answer: D

Explanation: Hyperparameter tuning and model-family selection should use only training data and validation folds, not the final test set. Because the data is time-dependent and seasonal drift is plausible, rolling or forward-chaining validation on the first 22 months better reflects the deployment pattern than random folds. After selecting the model and hyperparameters, the team can train the final candidate on the available pre-test data and evaluate exactly once on the locked 2-month test set. That final result is the least biased estimate for audit and deployment readiness. Reusing the test set for tuning turns it into validation data and makes the reported performance overly optimistic.

Repeated test tuning leaks information from the final holdout into model selection and biases the reported estimate.
Random all-month folds mix future and past periods, undermining temporal validation when drift is expected.
Skipping final testing leaves no independent estimate for the audit and deployment decision.
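
One way to picture the forward-chaining scheme is to generate the month indices for each fold. The sketch below assumes an expanding training window with a 12-month minimum and 2-month validation blocks; those two settings and the rolling_splits helper are illustrative choices, not values given in the scenario.

```python
def rolling_splits(n_months, test_months, min_train, val_window):
    """Expanding-window folds over the pre-test months only (months are 1-indexed)."""
    pretest = n_months - test_months
    splits = []
    train_end = min_train
    while train_end + val_window <= pretest:
        train = list(range(1, train_end + 1))
        val = list(range(train_end + 1, train_end + val_window + 1))
        splits.append((train, val))
        train_end += val_window
    return splits

splits = rolling_splits(n_months=24, test_months=2, min_train=12, val_window=2)
locked_test = [23, 24]  # touched exactly once, after tuning is finished

for train, val in splits:
    print(f"train 1-{train[-1]}  validate {val[0]}-{val[-1]}")
```

Every validation block lies strictly after its training months, and no fold ever reaches months 23 and 24, so the locked test set stays independent of model selection.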

Question 58

Topic: Specialized Applications of Data Science

A telecom provider wants to reduce churn among enterprise accounts. Account managers believe churn spreads through reseller relationships and shared implementation partners: when a highly connected customer leaves, nearby customers in the relationship network often follow. The team must prioritize retention outreach using both account attributes and the structure of these inter-account relationships. Which approach best fits the requirements?

Options:

A. Univariate survival analysis on contract age
B. ARIMA forecasting of monthly churn counts
C. Graph analysis using centrality and community features
D. K-means clustering on account spend only

Best answer: C

Explanation: Graph analysis is appropriate when the relationships among entities are central to the business problem, not just individual records. In this scenario, the relevant signal includes who is connected through resellers and implementation partners, whether churn clusters in communities, and whether highly connected accounts influence nearby accounts. Centrality, community detection, link-based features, or graph-based models can represent those network effects and combine them with account-level predictors. A time-series forecast may estimate aggregate churn volume, and clustering may group similar accounts, but neither directly captures relationship structure between accounts.

Contract age only misses the network mechanism and reduces the problem to a single-account time-to-event view.
Spend-only clustering groups accounts by one attribute but ignores reseller and partner connections.
Monthly churn forecasting predicts aggregate counts, not which connected accounts should receive retention outreach.
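
As a rough illustration of how relationship structure becomes model features, the sketch below computes degree centrality and the share of already-churned neighbors from an edge list. The account names, edges, and churned set are hypothetical, and a real project would likely use a graph library with richer centrality and community measures.

```python
from collections import defaultdict

# Hypothetical links between accounts that share a reseller or implementation partner.
edges = [
    ("acct_A", "acct_B"), ("acct_A", "acct_C"),
    ("acct_A", "acct_D"), ("acct_D", "acct_E"),
]

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

n = len(neighbors)
# Degree centrality: fraction of the other accounts each account is linked to.
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Exposure feature: share of an account's neighbors that have already churned.
churned = {"acct_A"}  # assumed known churn event
churned_neighbor_share = {
    node: len(nbrs & churned) / len(nbrs) for node, nbrs in neighbors.items()
}

print(degree_centrality)
print(churned_neighbor_share)
```

Features like these can sit alongside ordinary account attributes in a supervised model, which is how the network signal and the account-level signal are combined.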

Question 59

Topic: Modeling, Analysis, and Outcomes

A data science team must present mean claim-processing time to business leaders. The audience needs to compare categorical groups: five claim types, split by two customer segments. There is no time sequence, and the goal is to make cross-segment differences easy to see. Which chart type best fits these requirements?

Options:

A. Pie chart
B. Line chart
C. Grouped bar chart
D. Scatter plot

Best answer: C

Explanation: A grouped bar chart is best when a business audience needs to compare numeric measures across categorical groups and a small number of subgroups. In this case, claim type is categorical, customer segment is a second categorical split, and mean processing time is the measured value. Placing segment bars side by side within each claim type makes the comparison direct and avoids implying an ordered time relationship. A horizontal orientation could improve readability if claim type names are long, but the key chart family is grouped bars.

Line chart suggests continuity or time order, which the scenario explicitly does not have.
Pie chart emphasizes parts of a whole and becomes weak for comparing two segments across five categories.
Scatter plot is better for relationships between two numeric variables, not categorical group comparisons.

Question 60

Topic: Operations and Processes

A data science team is starting a churn initiative for a subscription business. The project sponsor asks the team to “use AI to reduce churn,” but the kickoff notes show unresolved disagreement.

Exhibit: Kickoff notes

Stakeholder	Stated priority	Proposed success metric
Sales	Save the largest accounts	Retained revenue
Support	Reduce complaint-driven cancellations	Complaint volume reduction
Finance	Avoid expensive retention offers	Net margin impact
Legal	Limit use of sensitive attributes	Compliance exceptions

Which workflow step should the team take next?

Options:

A. Train a churn classifier using retained revenue as the target
B. Create a dashboard with all proposed stakeholder metrics
C. Collect additional customer interaction data before scoping
D. Facilitate objective alignment and define success criteria

Best answer: D

Explanation: In the data science life cycle, unresolved stakeholder goals should be addressed before model design, data acquisition expansion, or reporting. The exhibit shows multiple plausible but competing objectives: revenue retention, complaint reduction, margin protection, and compliance control. These imply different labels, features, optimization targets, evaluation metrics, and intervention policies. The best next step is requirements and problem framing: align the decision objective, define success criteria, document constraints, and confirm how trade-offs will be handled. Without this agreement, a technically strong model may optimize the wrong outcome or create governance risk.

The key takeaway is that workflow discipline starts with decision clarity, not model selection.

Revenue-only target fails because it prematurely chooses Sales’ objective without resolving Finance, Support, and Legal constraints.
More data first fails because additional data collection may be wasteful or inappropriate until the decision objective and constraints are known.
All-metrics dashboard fails because displaying competing metrics does not resolve which decision the model should support.

Question 61

Topic: Machine Learning

A retailer deploys a deep-learning vision model to identify damaged packages on conveyor video. Offline validation was strong, but production inputs now include new package designs, seasonal lighting changes, and camera replacements. True damage labels are available only after manual audit several days later. Which monitoring concern is the best professional priority?

Options:

A. GPU utilization during batch retraining
B. Immediate retraining on every low-confidence prediction
C. Training loss on the original dataset
D. Input distribution and concept drift

Best answer: D

Explanation: The core concern is drift: the deployed model is seeing real-world inputs that may no longer match the validation data. New package designs, lighting changes, and camera replacements can shift pixel distributions, learned embeddings, and feature relationships. Because labels arrive later, monitoring should combine early signals such as input statistics, embedding drift, confidence changes, and prediction mix with delayed outcome metrics once audits arrive. This helps distinguish normal variation from degraded model validity before business decisions become unreliable.

Training metrics from the original dataset do not show whether production data has changed. Automatic retraining on every uncertain prediction can amplify noise unless drift is confirmed and a governed retraining process exists.

Original training loss is stale because it measures fit to historical data, not current production validity.
GPU utilization matters operationally, but it does not reveal whether changing inputs are degrading model behavior.
Reactive retraining is risky because low confidence alone is not enough evidence for safe model updates.
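
An early, label-free drift signal can be as simple as comparing binned input statistics between validation and production. The sketch below uses the population stability index (PSI) as one possible measure; the brightness-bin counts are invented, and alert thresholds are a team convention rather than anything the scenario specifies.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the shares so an empty bin does not cause log(0) or division by zero.
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value

# Hypothetical image-brightness bins: validation baseline vs. two production weeks.
baseline = [100, 300, 400, 200]
stable_week = [50, 150, 200, 100]     # same proportions, smaller volume
shifted_week = [300, 400, 200, 100]   # e.g. after a camera replacement

print(round(psi(baseline, stable_week), 4))
print(round(psi(baseline, shifted_week), 4))
```

A signal like this only flags that inputs have moved. Confirming concept drift still depends on the delayed audit labels.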

Question 62

Topic: Modeling, Analysis, and Outcomes

A regulated insurer is selecting a model to prioritize claims for manual review. The business requires validation AUC within 0.02 of the best candidate, reason codes for adverse decisions, scoring under 50 ms per claim on existing CPU infrastructure, and monthly retraining by a small MLOps team.

Model	Validation AUC	p95 scoring latency	Interpretability
Elastic-net logistic regression	0.812	8 ms	Coefficients are directly explainable
Depth-limited gradient boosting	0.836	35 ms	SHAP-based reason codes available
Deep neural network ensemble	0.842	160 ms	Requires GPU; opaque explanations
Large random forest	0.839	90 ms	Feature importance only

Which model is the BEST professional decision?

Options:

A. Large random forest
B. Depth-limited gradient boosting
C. Elastic-net logistic regression
D. Deep neural network ensemble

Best answer: B

Explanation: Model selection should optimize the full operating objective, not only the top validation metric. The deep neural network has the best AUC, but it violates latency, infrastructure, and interpretability constraints. The depth-limited gradient boosting model is within 0.006 AUC of the best result, satisfies the under-50 ms CPU scoring requirement, and can provide auditable reason codes using SHAP-style explanations. It is more complex than logistic regression, but the added complexity is justified because logistic regression, at 0.812, sits 0.030 below the best AUC and so falls outside the 0.02 tolerance. The random forest is competitive on AUC but misses the latency target and offers weaker decision-level explanations. The key trade-off is sufficient performance with deployable, explainable, maintainable operation.

Simplest model trap fails because logistic regression is too far below the best validation AUC for the stated tolerance.
Highest AUC trap fails because the neural ensemble ignores latency, CPU-only deployment, and auditability constraints.
Competitive metric trap fails because the random forest exceeds the scoring latency requirement and gives weaker reason codes.
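
The decision can be read as a constraint filter over the table. In the sketch below, the reason_codes and cpu_only flags are this example's interpretation of the Interpretability column and the scenario notes (for instance, "Feature importance only" is treated as not providing per-decision reason codes).

```python
# Values copied from the question's table; boolean flags are interpretive assumptions.
candidates = [
    {"name": "Elastic-net logistic regression", "auc": 0.812, "p95_ms": 8,
     "reason_codes": True, "cpu_only": True},
    {"name": "Depth-limited gradient boosting", "auc": 0.836, "p95_ms": 35,
     "reason_codes": True, "cpu_only": True},
    {"name": "Deep neural network ensemble", "auc": 0.842, "p95_ms": 160,
     "reason_codes": False, "cpu_only": False},
    {"name": "Large random forest", "auc": 0.839, "p95_ms": 90,
     "reason_codes": False, "cpu_only": True},
]

best_auc = max(c["auc"] for c in candidates)

eligible = [
    c for c in candidates
    if best_auc - c["auc"] <= 0.02   # within tolerance of the best candidate
    and c["p95_ms"] < 50             # latency requirement
    and c["reason_codes"]            # per-decision explanations
    and c["cpu_only"]                # runs on existing CPU infrastructure
]

print([c["name"] for c in eligible])
```

Only one candidate survives all four checks, so no further tie-breaking is needed in this scenario.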

Question 63

Topic: Modeling, Analysis, and Outcomes

A customer-success team asks: “Do slower first-response times increase the probability of renewal downgrade for enterprise accounts after accounting for account size?” An analyst presents this graph description:

Graph: Monthly stacked bars of support tickets by issue category
Overlay: Line for total enterprise renewals per month
Filters: Enterprise accounts only
No fields shown: first-response time, downgrade outcome, account size

Which assessment best maps the graph to the stated analysis question?

Options:

A. It partially answers the question if issue categories are sorted by total ticket volume.
B. It answers the question because enterprise renewals are plotted over the same months as ticket categories.
C. It does not answer the question; use account-level downgrade versus response-time analysis with account-size adjustment.
D. It answers the question after adding a trend line to the renewal overlay.

Best answer: C

Explanation: A graph answers an analysis question only when its encoded variables and level of detail align with the question. Here, the business question is about an account-level relationship: first-response time as the predictor, renewal downgrade as the outcome, and account size as a control. The presented graph shows monthly ticket mix and total renewals, which may be operationally useful but does not display downgrade probability, response time, or account size. It also aggregates by month, which can hide account-level relationships and create misleading ecological interpretations.

The key takeaway is that visual relevance requires matching the variables, grain, and comparison needed by the stated question—not just showing related business activity.

Shared time axis is insufficient because plotting renewals by month does not evaluate response-time effects on downgrade probability.
Sorting categories changes readability but still does not add the missing predictor, outcome, or control.
Adding a trend line may summarize renewals over time, but it cannot infer the requested account-level relationship.

Question 64

Topic: Mathematics and Statistics

A team is evaluating an ordinary least squares model to predict insurance claim severity. The model documentation assumes approximately normal residuals with constant variance before using coefficient tests and prediction intervals.

Exhibit: Residual diagnostics

Diagnostic	Result
Residual mean	0.02
Residual skewness	2.9
Q-Q plot	Upper tail far above line
Residuals vs. fitted	Fan-shaped spread
Breusch-Pagan test	p = 0.003

Which modeling concern is best supported by the exhibit?

Options:

A. The model is invalid because residuals are not centered at zero.
B. Coefficient tests and prediction intervals may be unreliable.
C. The main issue is multicollinearity among predictors.
D. The model primarily suffers from class imbalance.

Best answer: B

Explanation: OLS point estimates can still be useful in some settings, but common coefficient tests, standard errors, and prediction intervals rely on residual assumptions such as approximately normal errors and constant variance. The exhibit shows a highly right-skewed residual distribution, a Q-Q tail departure, and a fan-shaped residual plot with a significant Breusch-Pagan test. Together, these indicate non-normality and heteroskedasticity, so inference based on the default OLS assumptions is questionable. A better next step could include transformations, robust standard errors, weighted regression, or a distribution more appropriate for claim severity.

Non-centered residuals are not the issue because the residual mean (0.02) is close to zero.
Class imbalance applies to classification target distribution, not continuous claim severity residual diagnostics.
Multicollinearity would require predictor correlation or variance inflation evidence, which the exhibit does not provide.

Question 65

Topic: Modeling, Analysis, and Outcomes

A subscription retailer is preparing features for a churn model. The customer-by-product-category matrix has 12,000 binary columns; a 0 means the customer did not buy from that category during the observation window. A separate household_income field is null for 18% of customers because the value was not provided. The pipeline must preserve behavioral signal and avoid introducing bias. Which data preparation action best fits these requirements?

Options:

A. Replace null income with 0 and store all features sparsely
B. Mean-impute all 0 category indicators and null income values
C. Drop category indicators with more than 90% zeros
D. Store category indicators sparsely; impute and flag null income

Best answer: D

Explanation: Sparse observations and missing values require different preparation. In the category matrix, 0 is an observed value meaning no purchase occurred, so those values should be preserved and represented efficiently with a sparse format or sparse-aware model. The null household_income values mean the data was not collected, so they should be handled with an appropriate imputation strategy and often a missingness indicator, fitted within the training pipeline to avoid leakage. Treating valid zeros as missing changes the behavioral signal; treating missing income as a real zero creates a false value.

Imputing category zeros fails because 0 already has a valid behavioral meaning in the observation window.
Dropping sparse columns can remove rare but predictive product signals solely because the matrix is sparse.
Using zero income fails because missing income is unknown, not a measured value of 0.

Question 66

Topic: Operations and Processes

A data science team is preparing features for a 30-day readmission risk model. The model must use only information available at discharge, and the team wants to preserve clinically meaningful data-quality signals.

Exhibit: EDA summary for last_creatinine_mg_dl

Finding	Value
Feature type	Continuous lab value, right-skewed
Missing rate	38%
Missingness pattern	Test often not ordered for low-acuity visits
Outcome rate if present	14% readmitted
Outcome rate if missing	7% readmitted

Which imputation approach is best supported by the exhibit?

Options:

A. Mean imputation without an indicator
B. KNN imputation fit before train-test splitting
C. Median imputation with a missingness indicator
D. Drop all rows with missing creatinine

Best answer: C

Explanation: This is informative missingness: the lab is missing partly because the test was not ordered, and the missing group has a different readmission rate. Because the feature is continuous and right-skewed, median imputation is more robust than mean imputation. Adding a missingness indicator lets the model learn that “not measured” carries predictive information distinct from the imputed numeric value. The imputer and indicator creation should be fit inside the training pipeline to avoid leakage into validation or test data. Dropping rows would remove a large, systematically different subgroup.

Mean-only imputation hides the fact that missingness itself is associated with the outcome.
Dropping rows would discard 38% of records and bias the sample toward patients who received the lab.
Pre-split KNN imputation can leak validation or test distribution information into training preprocessing.
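
The fit-on-training-only pattern for median imputation with a missingness indicator can be sketched as follows. The creatinine values are made up, and fit_median and transform are helper names introduced for this example.

```python
from statistics import median

def fit_median(train_values):
    """Learn the fill value from observed training values only."""
    observed = [v for v in train_values if v is not None]
    return median(observed)

def transform(values, fill):
    """Return (imputed_value, was_missing) pairs; the flag preserves the signal."""
    return [(fill if v is None else v, 1 if v is None else 0) for v in values]

# Hypothetical last_creatinine_mg_dl values; None means the lab was not ordered.
train = [0.8, None, 1.1, 4.9, None, 0.9]
test = [None, 1.3]

fill = fit_median(train)              # the right-skewed 4.9 barely moves the median
train_features = transform(train, fill)
test_features = transform(test, fill)  # reuses the training median, never refits

print(fill)
print(test_features)
```

Because the fill value comes from the training rows alone, nothing about the validation or test distribution leaks into preprocessing.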

Question 67

Topic: Operations and Processes

A team is moving a containerized real-time recommendation model into production. Traffic has short 8x spikes, the service must keep p95 inference latency below 200 ms, and each replica is stateless but requires a GPU slice. During load tests, CPU stays moderate while request queues grow. Which container orchestration consideration best maps to these requirements?

Options:

A. Autoscale replicas using queue/latency metrics with GPU-aware scheduling
B. Autoscale only on average CPU utilization across replicas
C. Use a nightly batch scoring job for all requests
D. Run one larger container to avoid replica coordination

Best answer: A

Explanation: For model-serving workloads, orchestration should match the bottleneck and placement constraints of inference. This service is stateless, so horizontal scaling is appropriate, but CPU is not the observed limiting signal. Queue depth, concurrent requests, or p95 latency are better autoscaling signals for bursty online inference. Because each replica needs GPU capacity, the orchestrator also needs resource requests or equivalent GPU-aware scheduling so it does not overpack nodes or create contention. The key takeaway is to scale on serving-specific telemetry and schedule against the scarce accelerator resource.

CPU-only scaling fails because CPU is not tracking the observed latency and queueing bottleneck.
Single larger container reduces horizontal elasticity and can create a larger failure domain during spikes.
Nightly batch scoring does not satisfy the real-time latency requirement for incoming requests.
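
A proportional scaling rule of the kind horizontal autoscalers commonly apply can be sketched as below. The per-replica queue target, the observed values, and the cap derived from available GPU slices are all hypothetical numbers chosen for illustration.

```python
import math

def desired_replicas(current_replicas, observed_metric, target_metric, max_replicas):
    """Scale replicas in proportion to how far the serving metric is from its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    wanted = math.ceil(current_replicas * observed_metric / target_metric)
    # Never drop below one replica or exceed what the GPU pool can host.
    return max(1, min(wanted, max_replicas))

# Hypothetical spike: 4 replicas, 40 queued requests per replica, target of 10.
print(desired_replicas(4, observed_metric=40, target_metric=10, max_replicas=32))
# Same spike, but only 12 GPU slices are available in the cluster.
print(desired_replicas(4, observed_metric=40, target_metric=10, max_replicas=12))
# Quiet period: queue is below target, so the service can scale in.
print(desired_replicas(4, observed_metric=5, target_metric=10, max_replicas=32))
```

Driving the rule with queue depth or latency instead of CPU keeps scaling aligned with the bottleneck observed in the load test, and the cap reflects that GPU capacity, not CPU, limits how many replicas can be placed.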

Question 68

Topic: Modeling, Analysis, and Outcomes

A marketplace data science team is profiling seller_daily_revenue before choosing transformations for a churn model. The feature is numeric, nonnegative, has many exact zeros, and appears to have a long right tail with possible data-entry extremes. Stakeholders need a clear view of the feature’s distribution without using the target label. Which univariate EDA technique is BEST?

Options:

A. Box plot only
B. Histogram with defensible bin widths
C. Pearson correlation heat map
D. Target-stratified violin plot

Best answer: B

Explanation: A histogram is the best first univariate EDA choice for examining the distribution of a single numeric feature. It shows how observations are distributed across value ranges, making it easier to see mass at zero, skewness, multimodality, gaps, and the influence of extreme values. For a heavy-tailed nonnegative feature, the analyst may compare reasonable bin widths or use a transformed view later, but the core technique remains a frequency-based distribution plot. This also avoids using the churn target, keeping the analysis focused on the predictor’s marginal distribution.

Target stratification adds label information and changes the task from univariate distribution profiling to outcome comparison.
Correlation heat maps require multiple variables and do not show the shape of one feature’s values.
Box plots summarize median, spread, and outliers, but they hide frequency shape and zero concentration.
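
To show why bin width matters for a zero-heavy, long-tailed feature, the sketch below bins a small invented revenue sample at two widths and reports the share of exact zeros separately. The values and the histogram helper are illustrative only.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per [lower, upper) bin of fixed width."""
    counts = Counter(int(v // bin_width) for v in values)
    return {(b * bin_width, (b + 1) * bin_width): counts[b] for b in sorted(counts)}

# Hypothetical seller_daily_revenue sample with zeros and one extreme value.
revenue = [0, 0, 0, 0, 12, 18, 25, 40, 55, 90, 300, 5000]

zero_share = sum(1 for v in revenue if v == 0) / len(revenue)
h50 = histogram(revenue, bin_width=50)
h1000 = histogram(revenue, bin_width=1000)

print("share of exact zeros:", round(zero_share, 3))
print("width 50:", h50)
print("width 1000:", h1000)
```

With a width of 1000, almost everything collapses into one bin. The narrower width exposes the mass near zero and the gap before the extreme value, which is why comparing a few defensible widths is worthwhile.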

Question 69

Topic: Modeling, Analysis, and Outcomes

A grocery delivery company is building a model to predict late deliveries at the moment an order is accepted. The model must use only information available at acceptance time, work for new drivers, and capture recurring congestion patterns by time and location without creating very sparse features. Which engineered feature is the BEST professional decision?

Options:

A. Rolling late-rate by pickup geohash and hour-of-week
B. Actual trip duration after delivery completion
C. Target-encoded driver ID using the full training set
D. One-hot encoded exact pickup and drop-off addresses

Best answer: A

Explanation: The core feature-engineering decision is to represent grouped behavior at a useful granularity while respecting prediction time. A rolling late-rate by pickup geohash and hour-of-week summarizes recent lateness for a location-time group, which directly matches the business pattern: congestion and fulfillment delays vary by area and recurring time window. Computing it as a rolling, time-aware aggregate prevents use of future outcomes, and geohash grouping is less sparse than exact addresses. It also avoids dependence on a specific driver, so the feature can generalize to new drivers. The key takeaway is to encode historical group behavior only from data that would have been known when the prediction is made.

Driver target encoding is tempting, but using the full training set leaks outcome information and may not generalize to new drivers.
Actual trip duration is unavailable at order acceptance, so it is post-event leakage.
Exact address one-hot encoding creates high-cardinality sparse features and is less robust than location aggregation.
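
A time-aware version of the rolling late-rate feature might look like the sketch below. It counts only orders whose outcome was already known at acceptance time, within an assumed 28-day window. The record layout, geohash strings, and the order helper are hypothetical.

```python
from datetime import datetime, timedelta

def hour_of_week(ts: datetime) -> int:
    """0-167 slot: Monday 00:00 is 0, Sunday 23:00 is 167."""
    return ts.weekday() * 24 + ts.hour

def rolling_late_rate(history, geohash, now, window_days=28):
    """Late rate for the same geohash and hour-of-week, using only resolved orders."""
    start = now - timedelta(days=window_days)
    slot = hour_of_week(now)
    outcomes = [
        row["late"] for row in history
        if row["geohash"] == geohash
        and hour_of_week(row["accepted_at"]) == slot
        and start <= row["resolved_at"] < now   # outcome known before scoring time
    ]
    return sum(outcomes) / len(outcomes) if outcomes else None

now = datetime(2026, 3, 16, 18, 5)  # assumed acceptance time of the new order

def order(days_ago, geohash, late):
    accepted = now - timedelta(days=days_ago)
    return {"accepted_at": accepted, "resolved_at": accepted + timedelta(minutes=45),
            "geohash": geohash, "late": late}

history = [
    order(7, "dr5ru", 1), order(14, "dr5ru", 0),
    order(35, "dr5ru", 1),   # outside the 28-day window
    order(7, "dr5rv", 1),    # different location
]
rate = rolling_late_rate(history, "dr5ru", now)
print(rate)
```

The function returns None when a location-time group has no resolved history, so a caller would need a fallback such as a citywide rate. The feature does not reference driver identity, so it applies equally to new drivers.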

Question 70

Topic: Machine Learning

A risk analytics team trained a single decision tree on a complex tabular underwriting dataset with nonlinear feature interactions. Across repeated cross-validation runs, AUC ranges from 0.61 to 0.79, and small changes in the training sample produce very different split rules. The business requirement is to improve generalization and reduce model instability while still using tree-based handling of mixed feature types. Which approach best fits these requirements?

Options:

A. Use a single unregularized logistic regression
B. Replace the model with k-means clustering
C. Train a random forest ensemble
D. Prune the single tree less aggressively

Best answer: C

Explanation: The core concept is variance reduction through bagging-based ensembles. A single decision tree can be highly unstable because small data changes may produce different early splits and different downstream rules. A random forest trains many trees on bootstrap samples and uses feature subsampling to decorrelate them, then aggregates their predictions. This keeps the nonparametric, interaction-friendly behavior of trees while making predictions less sensitive to any one sample or split. The key requirement is not just tree-based modeling; it is stable generalization under complex tabular patterns.

Looser pruning can increase tree complexity and often worsens the high-variance behavior already observed.
Clustering substitution changes the task to unsupervised grouping and does not directly solve supervised risk prediction.
Unregularized linear model may be stable in some cases, but it misses the stated nonlinear interactions and adds avoidable underfit risk.

Question 71

Topic: Modeling, Analysis, and Outcomes

A fraud analytics team has a deployed regularized logistic regression model. A stakeholder asks to replace it with a newer boosted-tree model before the next release. The release must reduce manual reviews without increasing fraud losses, keep p95 scoring latency under 75 ms, and provide auditable reason codes.

Exhibit: Temporal holdout results

Model	PR-AUC	Recall at 5% FPR	Expected net value	p95 latency
Current logistic model	0.412	0.63	$1.28M	18 ms
Boosted-tree challenger	0.418	0.64	$1.29M	92 ms

The 95% confidence interval for the challenger’s net-value lift is -$40,000 to +$60,000, and its post-hoc reason codes vary across retraining runs. What is the BEST professional decision?

Options:

A. Replace the current model because the challenger has higher PR-AUC
B. Build a neural network ensemble before the release decision
C. Tune deeper boosted trees until the lift is statistically significant
D. Retain the current model and define evidence-based iteration criteria

Best answer: D

Explanation: Model iteration is justified when validation evidence shows a meaningful improvement against the business objective and operational constraints. Here, the challenger has only a tiny metric gain, the net-value confidence interval includes zero, latency exceeds the 75 ms requirement, and reason codes are unstable. Those facts do not support replacement just because the method is newer. A defensible next step is to keep the current model, document the decision, and set explicit iteration criteria such as minimum net-value lift, acceptable latency, stable explanations, and segment-level error improvement. Newer methods can be explored, but promotion should depend on evidence that improves business value without violating deployment requirements.

Higher PR-AUC alone fails because the improvement is small and does not satisfy latency or auditability constraints.
Deeper tuning risks overfitting and does not address the current challenger’s operational failures.
Neural ensemble exploration adds complexity without evidence that it will improve the release outcome or meet constraints.
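
The iteration criteria mentioned above could be written down as an explicit promotion gate. This is only a sketch: the function name, argument names, and the choice of exactly three checks are assumptions for illustration, with the 75 ms limit and the challenger values taken from the scenario.

```python
def promotion_decision(lift_ci_low, p95_latency_ms, reason_codes_stable,
                       max_latency_ms=75.0):
    """Return whether a challenger may replace the current model, and why not."""
    failures = []
    if lift_ci_low <= 0:
        failures.append("net-value lift interval includes zero")
    if p95_latency_ms >= max_latency_ms:
        failures.append("p95 latency is not under the limit")
    if not reason_codes_stable:
        failures.append("reason codes vary across retraining runs")
    return {"promote": not failures, "failures": failures}


# Challenger from the exhibit: CI lower bound -$40,000, 92 ms, unstable reason codes.
decision = promotion_decision(-40_000, 92, reason_codes_stable=False)
```

Recording the failed checks alongside the decision gives the team a documented target for the next iteration.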

Question 72

Topic: Machine Learning

A fintech team is building a supervised model to approve small-business credit applications. The labeled training set has 18,000 records, 45 mostly tabular features, and EDA shows nonlinear threshold effects and feature interactions. Regulators require consistent adverse-action reason codes, and the scoring service must respond in under 80 ms. Which modeling choice is the BEST professional decision?

Options:

A. A k-nearest neighbors classifier using all standardized application features
B. Constrained gradient-boosted trees with probability calibration and feature-attribution reason codes
C. A simple unregularized logistic regression using only the raw input fields
D. A deep neural network with several hidden layers and automated feature learning

Best answer: B

Explanation: For medium-sized structured tabular data with nonlinear effects and interactions, gradient-boosted trees are often a strong supervised-learning choice. Constraints such as monotonicity, calibration, and documented feature attributions can make the model more defensible for regulated credit decisions while preserving strong predictive performance. Tree ensembles also score quickly when deployed properly, which supports the latency requirement. The key trade-off is not choosing the most complex model, but matching the method to data size, feature behavior, interpretability, and operational needs. A plain linear model is easier to explain but may underfit the observed nonlinearities; a neural network or KNN adds complexity without satisfying the audit and latency constraints as well.

Neural network complexity is hard to justify for 18,000 tabular records when auditability and reason-code consistency are required.
KNN scoring can be slow and difficult to explain because each prediction depends on stored training examples and distance behavior.
Plain logistic regression is interpretable, but using only raw fields is likely to miss the stated nonlinear thresholds and interactions.

Question 73

Topic: Specialized Applications of Data Science

A manufacturer wants to automate visual inspection of metal panels. The system must identify each defect, estimate its area in square millimeters, and send defect locations to a rework station.

Exhibit: Labeled image sample

Finding	Image-level label	Bounding box	Pixel mask	Instances per image
Scratch	yes	yes	yes	0-8
Dent	yes	yes	yes	0-4
Discoloration	yes	yes	yes	0-3

Which computer vision approach best fits this business problem?

Options:

A. Image classification
B. Instance segmentation
C. Image similarity search
D. OCR pipeline

Best answer: B

Explanation: The core requirement is not just to decide whether a panel is defective; the system must locate each individual defect and estimate its area. Because the labels include pixel masks and multiple defect instances can appear in one image, instance segmentation is the strongest fit. It produces object-specific masks, allowing downstream calculations such as defect area and coordinates for rework. Object detection would locate defects with boxes, but boxes are less precise for area measurement than masks. Image classification is too coarse because it produces image-level labels only.

Image-level classification fails because it cannot locate each defect or measure defect area.
OCR is for extracting text from images, not identifying physical surface defects.
Similarity search can retrieve visually similar panels but does not directly produce per-defect masks or locations.
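
To illustrate why per-instance masks make the area requirement straightforward, here is a small sketch that converts one predicted mask into an area and a centroid. The mask, the `mm_per_pixel` calibration value, and the function name are hypothetical; a real system would take the scale from camera calibration.

```python
def measure_defect(mask, mm_per_pixel):
    """Area (mm^2) and pixel centroid for one defect instance mask of 0/1 values."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        return None  # empty mask: nothing to measure
    n = len(pixels)
    return {
        "area_mm2": n * mm_per_pixel ** 2,
        "centroid_px": (sum(r for r, _ in pixels) / n,
                        sum(c for _, c in pixels) / n),
    }


# Hypothetical mask for a single scratch instance.
scratch_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
result = measure_defect(scratch_mask, mm_per_pixel=0.5)
```

A bounding box around the same defect would report the box area, which overstates the defect area whenever the defect is not rectangular.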

Question 74

Topic: Mathematics and Statistics

A fraud-operations team compares mean review time across four queue interfaces. The outcome is seconds per reviewed case, and lower is better. Samples are independent, distributions are approximately symmetric, and variance checks do not indicate a serious violation.

Exhibit: Experiment summary

Interface	n	Mean seconds	SD
A	45	83	12
B	45	76	11
C	45	81	13
D	45	91	14

One-way ANOVA: $p = 0.018$ at $\alpha = 0.05$
Tukey adjusted post hoc results: B vs. D $p = 0.011$; all other pairwise $p > 0.10$

Which interpretation is best supported by the exhibit?

Options:

A. Replace ANOVA with a chi-square test of independence.
B. Accept equal means because most pairwise comparisons are not significant.
C. Reject equal means; only B vs. D is supported as different.
D. Conclude all four interfaces have significantly different means.

Best answer: C

Explanation: A one-way ANOVA is appropriate when comparing a continuous outcome mean across more than two independent groups. Here, the ANOVA $p = 0.018$ is below $0.05$, so the data provide evidence that not all interface mean review times are equal. ANOVA does not, by itself, prove every group differs from every other group. The Tukey adjusted post hoc results show which pairwise differences remain significant while controlling for multiple comparisons. Only B versus D has an adjusted $p$-value below $0.05$, so the defensible interpretation is that the overall means differ, with specific support for B and D being different. The key is separating the omnibus ANOVA conclusion from post hoc pairwise claims.

Most pairs nonsignificant does not override a significant omnibus ANOVA when one adjusted pairwise difference is supported.
All means differ overstates the evidence because Tukey results support only one pairwise difference.
Chi-square test is for categorical association, not comparing a continuous mean across four groups.

Question 75

Topic: Specialized Applications of Data Science

An NLP team is building a classifier to detect whether customer chat messages express consent to renew a subscription. The proposed preprocessing lowercases text, removes punctuation, removes standard stop words, and lemmatizes tokens. A validation review samples records where preprocessing changed model behavior.

Exhibit:

Original message	Cleaned tokens	Ground truth
I do not want to renew	want renew	No consent
Do not cancel my renewal	cancel renewal	Consent
I never agreed to auto-renew	agree auto renew	No consent
Please renew, no changes needed	renew change need	Consent

Which next action is best supported by the exhibit?

Options:

A. Address the issue by balancing consent classes
B. Customize preprocessing to preserve negation cues
C. Keep preprocessing because renewal keywords remain
D. Replace lemmatization with stemming only

Best answer: B

Explanation: For this NLP task, preprocessing must preserve meaning that determines consent. The exhibit shows that standard stop-word removal deletes negation terms such as not, never, and no. Those words are not cosmetic; they reverse or qualify intent in messages like “do not want to renew” and “do not cancel my renewal.” A better pipeline would use a task-aware stop-word list, retain negation tokens, or encode negation scope before model training and validation. Keyword retention alone is not sufficient when the target label depends on sentence meaning rather than isolated topic words.

Keyword-only reasoning fails because tokens like renew and cancel can imply opposite labels depending on negation.
Stemming swap does not address the loss of intent caused by removing negation words.
Class balancing may help sampling or training, but it cannot recover meaning removed from the text.
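
A task-aware stop-word list can be sketched in a few lines. The stop-word set below is a toy list chosen to match the exhibit, not a standard library list, and lemmatization is left out to keep the example short.

```python
import re

# Toy stop-word list (an assumption for illustration).
STOP_WORDS = {"i", "do", "not", "to", "my", "no", "never", "please"}
NEGATIONS = {"not", "no", "never"}


def clean(text, keep_negation=True):
    """Lowercase, strip punctuation, and drop stop words, optionally keeping negations."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stop = STOP_WORDS - NEGATIONS if keep_negation else STOP_WORDS
    return [t for t in tokens if t not in stop]


standard = clean("I do not want to renew", keep_negation=False)
task_aware = clean("I do not want to renew", keep_negation=True)
```

With `keep_negation=False` the first two exhibit messages reduce to bare topic words, which is the failure the exhibit shows. Keeping the negation tokens leaves the classifier something to separate them on.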

Questions 76-90

Question 76

Topic: Specialized Applications of Data Science

A logistics company is analyzing overhead video from loading docks. The team must report how long each forklift remains in a restricted zone and whether the same forklift returns later in the clip.

Exhibit: Required output

Video evidence	Required result
Forklifts enter, overlap, and leave view	Stable ID per forklift
Positions change across frames	Per-forklift path over time
Restricted-zone crossings	Duration by forklift ID

Which computer vision approach is most appropriate?

Options:

A. Multi-object tracking
B. Image classification
C. Optical character recognition
D. Single-frame object detection

Best answer: A

Explanation: Computer vision tracking is used when objects or entities must be followed across sequential frames while preserving identity over time. In this scenario, detecting forklifts in a single frame is not enough because the business output depends on stable forklift IDs, trajectories, zone crossings, and dwell time. Multi-object tracking combines frame-level localization with temporal association so the same physical forklift can be linked from frame to frame, even as it moves, overlaps, or exits and re-enters view. The key signal in the exhibit is the requirement for per-entity paths and durations, not merely whether forklifts are present.

Single-frame detection can locate forklifts in each image but does not inherently preserve the same identity across frames.
Image classification would label a frame or clip category, not produce per-forklift trajectories or dwell time.
OCR extracts text from images and does not address moving-object identity across video frames.
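
Once a tracker has assigned stable IDs, dwell time and re-entry reduce to bookkeeping over per-frame observations. The sketch below assumes a hypothetical tracker output of `(frame, track_id, in_zone)` tuples with one observation per visible forklift per frame; the tracker itself is not shown.

```python
from collections import defaultdict


def zone_dwell(observations, fps):
    """Seconds in the restricted zone and number of separate visits per track ID."""
    frames_in_zone = defaultdict(int)
    visits = defaultdict(int)
    previously_in_zone = {}
    for frame, track_id, in_zone in sorted(observations):
        if in_zone:
            frames_in_zone[track_id] += 1
            # A new visit starts when the track was outside on its last observation.
            if not previously_in_zone.get(track_id, False):
                visits[track_id] += 1
        previously_in_zone[track_id] = in_zone
    return {
        tid: {"seconds": frames_in_zone[tid] / fps, "visits": visits[tid]}
        for tid in previously_in_zone
    }


# Hypothetical tracker output at 1 frame per second.
observations = [
    (0, "F1", False), (1, "F1", True), (2, "F1", True), (3, "F1", False), (4, "F1", True),
    (0, "F2", False), (1, "F2", False),
]
report = zone_dwell(observations, fps=1)
```

None of this is possible from single-frame detections alone, because the grouping key is the track ID that the tracker maintains across frames.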

Question 77

Topic: Modeling, Analysis, and Outcomes

A data science team is publishing a model performance visualization for a broad audience, including executives, clinicians, and community reviewers. The current chart compares false-negative rates across hospital units using only a red-to-green gradient, with exact values shown only on hover. The report must work in static PDF and web formats, support screen-reader users, and preserve the underlying metric and uncertainty intervals. Which improvement is the best professional decision?

Options:

A. Keep the chart interactive and add longer hover tooltips.
B. Use colorblind-safe redundant encoding with labels and alt text.
C. Increase red and green saturation to sharpen contrast.
D. Replace the intervals with a simplified risk ranking.

Best answer: B

Explanation: Accessible visualization for a broad audience should not rely on color alone, especially when the report must work in static and web formats. A colorblind-safe palette plus redundant encodings, such as labels, patterns, or symbols, helps users distinguish groups without depending on red-green perception. Direct labels make values available when hover is unavailable in a PDF, and concise alt text or a text summary supports screen-reader users. Preserving uncertainty intervals also avoids making the chart easier to read by removing important statistical context. The key is to improve perception and interpretation while keeping the analytical meaning intact.

More saturation does not solve red-green color-vision barriers and can make the chart less accessible.
Hover-only detail fails in static PDF output and does not reliably support screen-reader access.
Simplified ranking may communicate quickly, but it removes uncertainty information required by the report constraints.

Question 78

Topic: Specialized Applications of Data Science

A property insurer’s computer vision model estimates roof damage from drone images. Validation was strong, but performance dropped after expansion to a new region. The training images were mostly sunny suburban roofs from one drone vendor, while production images include mixed weather, rural properties, and two vendors. Annotators also disagree on the boundary between “minor” and “moderate” damage. Which approach should the team perform first?

Options:

A. Increase model capacity and tune thresholds on recent production predictions.
B. Apply broad weather augmentation to all training images before reviewing labels.
C. Run a stratified data audit covering image quality, label agreement, and production representativeness.
D. Pool training and production images, then run random cross-validation.

Best answer: C

Explanation: Computer vision performance often fails because the training data no longer matches production conditions, labels are inconsistent, or image acquisition quality changes. In this scenario, all three risks are visible: different weather and property types affect representativeness, different vendors may change resolution or perspective, and annotator disagreement threatens ground-truth reliability. A stratified data audit should compare train vs. production slices, inspect image-quality metrics, and measure label consistency through adjudication or inter-annotator agreement. This determines whether the issue is data quality, labeling policy, sampling coverage, or a model limitation. Retuning or augmenting before this audit can hide the root cause.

Model-first tuning misses the stated evidence gaps and may optimize around mislabeled or unrepresentative data.
Weather augmentation addresses one possible image shift but ignores vendor differences, property mix, and label inconsistency.
Random pooling can mask distribution shift and creates validation evidence that does not reflect production deployment.
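
One piece of such an audit, label agreement per data slice, can be sketched with Cohen's kappa for two annotators. The slice names and labels below are made up for illustration; a real audit would also cover image-quality metrics and train-versus-production coverage.

```python
from collections import Counter, defaultdict


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_by_slice(records):
    """records: iterable of (slice_name, annotator_a_label, annotator_b_label)."""
    grouped = defaultdict(lambda: ([], []))
    for slice_name, a, b in records:
        grouped[slice_name][0].append(a)
        grouped[slice_name][1].append(b)
    return {name: cohens_kappa(a, b) for name, (a, b) in grouped.items()}


# Hypothetical double-annotated sample.
records = [
    ("sunny_suburban", "minor", "minor"),
    ("sunny_suburban", "moderate", "moderate"),
    ("sunny_suburban", "minor", "minor"),
    ("sunny_suburban", "moderate", "moderate"),
    ("rainy_rural", "minor", "minor"),
    ("rainy_rural", "minor", "moderate"),
    ("rainy_rural", "moderate", "moderate"),
    ("rainy_rural", "moderate", "moderate"),
]
agreement = kappa_by_slice(records)
```

Lower agreement in the slices that are new to production would point to a labeling-policy problem rather than a model-capacity problem.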

Question 79

Topic: Specialized Applications of Data Science

A payments company wants to identify likely account takeover and card-testing activity. Analysts observe bursts of low-value transactions, unusual merchant-category sequences, new device fingerprints, and rapid changes in shipping addresses. The business requirement is to score each transaction or session for suspicious behavior before settlement. Which specialized data-science application best fits these requirements?

Options:

A. Optical character recognition
B. Topic modeling
C. Fraud detection
D. Survival analysis

Best answer: C

Explanation: Fraud detection is the best application when the goal is to identify suspicious activity in transactions, accounts, or user behavior. The scenario includes classic fraud signals: transaction velocity, unusual merchant patterns, new device fingerprints, and address changes. It also requires risk scoring before settlement, which aligns with operational fraud detection systems that combine behavioral features, transaction attributes, and historical labels or anomaly indicators.

Topic modeling is for discovering themes in text, OCR extracts text from images, and survival analysis estimates time-to-event outcomes. These methods may support other workflows, but they do not directly map to suspicious transaction behavior scoring.

Topic modeling fits unstructured text themes, not transaction-risk scoring from behavioral signals.
OCR extracts machine-readable text from images and does not address suspicious account activity.
Survival analysis models time until an event, but the requirement is immediate fraud-risk scoring.

Question 80

Topic: Machine Learning

A team is evaluating dimensionality reduction on 800 standardized sensor features for a defect classifier. The business goal is to reduce model artifact size without degrading holdout ROC-AUC; stakeholders also want to know whether a 2D view proves class separation.

Exhibit: Dimensionality-reduction results

Representation	Key result	Holdout ROC-AUC	Artifact size
Raw features	Baseline	0.812	100%
PCA, 2 components	38% variance retained; classes overlap	0.651	0.3%
PCA, 40 components	95% variance retained; lower CV variance	0.831	5%
UMAP, 2 dimensions	Clusters change across random seeds	0.668	0.3%

Which interpretation is best supported by the exhibit?

Options:

A. Use 2-component PCA because it maximizes compression.
B. Use 2D UMAP because it creates visual clusters.
C. Reject dimensionality reduction because PCA with 2 components underperforms.
D. Use 40-component PCA for compression and modeling.

Best answer: D

Explanation: Dimensionality reduction should be judged against the intended use. For compression and modeling, the 40-component PCA result is useful because it reduces the representation from 800 features to 40 components, shrinks the artifact to 5% of its baseline size, retains 95% of variance, and raises holdout ROC-AUC from 0.812 to 0.831 instead of degrading it. The 2D projections are not strong evidence for modeling or class separation: PCA with 2 components loses too much variance and performs poorly, while UMAP’s apparent clusters are unstable across random seeds. A low-dimensional visualization can help exploration, but it should not override validation metrics and stability evidence.

UMAP visual clusters fail because unstable clusters across seeds are weak evidence for reliable class separation.
Rejecting all reduction overgeneralizes from the poor 2-component PCA result and ignores the strong 40-component PCA result.
Maximum compression is not sufficient because 2-component PCA loses substantial variance and harms ROC-AUC.

Question 81

Topic: Mathematics and Statistics

A data science team is evaluating whether support outcome is related to customer tier. The team has one record per customer, no repeated customers, and wants a nonparametric test of association using the summarized counts.

Exhibit: EDA summary

Variable	Type	Levels
Customer tier	Categorical	Basic, Pro, Enterprise
Support outcome	Categorical	Resolved, Follow-up, Escalated
Sample size	Count	4,800 records
Minimum expected cell count	Count	42

Options:

A. Use Pearson correlation on encoded category labels
B. Use a chi-squared test of independence
C. Use a paired t-test by customer tier
D. Use one-way ANOVA on support outcome

Best answer: B

Explanation: A chi-squared test of independence is appropriate when the goal is to evaluate whether two categorical variables are associated using counts in a contingency table. The exhibit shows two categorical variables, independent observations, a large sample, and expected cell counts well above the usual minimum guideline. This supports testing whether the distribution of support outcomes differs by customer tier without assuming a numeric outcome or normal residuals. Encoding category labels as numbers would not create meaningful intervals, and tests for numeric means would not match the data type.

ANOVA mismatch fails because support outcome is categorical, not a continuous response whose group means are compared.
Encoded correlation fails because arbitrary numeric codes for tiers or outcomes do not represent meaningful continuous measurements.
Paired t-test mismatch fails because the records are independent customers, not paired before-and-after or matched numeric observations.
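
For reference, the test statistic can be computed directly from a contingency table. The 3 x 3 counts below are hypothetical, since the exhibit reports only the sample size and the minimum expected count; they were chosen to sum to 4,800 and to give a minimum expected count of 42.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic, degrees of freedom, and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical counts. Rows: Basic, Pro, Enterprise.
# Columns: Resolved, Follow-up, Escalated.
counts = [
    [1850, 420, 130],
    [1230, 250, 80],
    [700, 110, 30],
]
statistic, dof, expected = chi_squared_independence(counts)
min_expected = min(min(row) for row in expected)

# 9.488 is the 0.95 quantile of the chi-squared distribution with 4 degrees of freedom.
reject_independence = statistic > 9.488
```

With three tiers and three outcomes the test has $(3-1)(3-1) = 4$ degrees of freedom, whatever the actual counts are.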

Question 82

Topic: Mathematics and Statistics

A risk analytics team is moving a validated linear multi-output model from a notebook to a nightly scoring pipeline. Each application has the same 120 standardized features. The model produces five scores using learned coefficients and bias terms. The business wants reproducible, auditable scores, and the platform team wants optimized batch computation instead of custom per-score logic. Which pipeline decision best maps to these requirements?

Options:

A. Collapse the 120 features into one composite index before scoring
B. Train five separate univariate models and average their ranks
C. Represent records as $X$ and weights as $W$; compute $XW+b$
D. Use a feature correlation matrix as the coefficient matrix

Best answer: C

Explanation: Matrix operations matter because model inputs, weights, and transformations can be represented with compatible dimensions. Here, the feature matrix $X$ has one row per application and one column per standardized feature, while the weight matrix $W$ maps those 120 features to five output scores. The product $XW$, with bias terms added, applies the same learned linear transformation to every record and every score in a reproducible, vectorized way. This also supports auditability because each coefficient remains tied to a feature and output. Collapsing features, substituting correlations, or using univariate rank aggregation changes the learned model rather than operationalizing it.

Composite index loses feature-level coefficients and cannot reproduce the validated 120-feature model.
Correlation matrix describes feature relationships, not learned target coefficients for scoring.
Univariate rank averaging ignores the multivariate weight structure required for the five model outputs.
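
The dimensions work out as follows: with $n$ applications, $X$ is $n \times 120$, $W$ is $120 \times 5$, and $b$ has 5 entries, so $XW + b$ is $n \times 5$. A plain-Python sketch with toy dimensions (2 features, 3 outputs, made-up numbers) shows the same computation; in practice an optimized array library would perform it.

```python
def score_batch(X, W, b):
    """Compute XW + b for a batch: one row of output scores per input record."""
    n_features, n_outputs = len(W), len(b)
    if any(len(row) != n_features for row in X):
        raise ValueError("every record must have one value per feature")
    if any(len(w_row) != n_outputs for w_row in W):
        raise ValueError("every weight row must have one value per output")
    return [
        [sum(x * W[j][k] for j, x in enumerate(row)) + b[k] for k in range(n_outputs)]
        for row in X
    ]


# Toy example: 2 records, 2 features, 3 output scores (illustrative values).
X = [[1.0, 2.0],
     [3.0, 4.0]]
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [0.5, 0.0, -1.0]
scores = score_batch(X, W, b)
```

Each entry `W[j][k]` stays tied to feature `j` and output `k`, which is what keeps the scores auditable.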

Question 83

Topic: Machine Learning

A logistics company deployed a deep-learning image model to classify package damage from warehouse camera feeds. Since deployment, new camera models, lighting changes, and seasonal packaging materials have been introduced. The business requirement is to detect when real-world inputs no longer resemble the validated training distribution before classification quality degrades. Which monitoring approach best maps to this requirement?

Options:

A. Use the training set as production ground truth
B. Monitor only average inference latency
C. Track input data drift against the training baseline
D. Increase the number of training epochs weekly

Best answer: C

Explanation: The core concern is data drift: production inputs are changing because camera hardware, lighting, and packaging materials differ from the validated training distribution. For a deployed deep-learning model, monitoring feature or embedding distributions, image-quality statistics, and prediction patterns against a baseline helps detect when the model is seeing inputs it was not validated to handle. This is especially important when labels arrive late or are expensive, because input drift can provide an early warning before measured accuracy drops. Latency monitoring is useful operationally, but it does not show whether the model’s inputs remain representative.

More epochs retrains the model but does not monitor whether live inputs have shifted.
Latency only checks serving performance, not input distribution or model validity.
Training as truth creates false confidence because training labels do not validate current production conditions.
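
One common way to compare a production input statistic against its training baseline is the population stability index (PSI). The sketch below assumes a hypothetical per-image brightness statistic and hand-picked bin edges; the same function could be applied to embedding dimensions or other image-quality measures.

```python
import math


def psi(baseline, production, edges, eps=1e-6):
    """Population stability index between two samples over shared bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # index of the bin containing v
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(production)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical mean-brightness values per image (0 = dark, 1 = bright).
training_brightness = [0.42, 0.45, 0.50, 0.52, 0.55, 0.58, 0.60, 0.63]
production_brightness = [0.20, 0.25, 0.30, 0.33, 0.41, 0.44, 0.50, 0.52]
edges = [0.3, 0.4, 0.5, 0.6]

no_drift = psi(training_brightness, training_brightness, edges)
drift = psi(training_brightness, production_brightness, edges)
```

The alert threshold is a team decision and would normally be set from the variation seen between baseline samples. The useful property is that the check needs no production labels.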

Question 84

Topic: Modeling, Analysis, and Outcomes

A lender is designing a first-release default-risk model for a new small-business loan product. The model must produce auditable reason codes for adverse-action notices, serve online decisions with p95 latency under 40 ms, and use only the 8,000 labeled loans currently available.

Exhibit: Candidate model summary

Candidate	Holdout AUC	p95 latency	Design notes
Deep neural net	0.89	180 ms	Needs larger training set; opaque features
Gradient-boosted trees	0.88	70 ms	Post hoc explanations only
Regularized scorecard	0.84	12 ms	Monotonic bins; coefficient-based reason codes
KNN classifier	0.82	220 ms	Stores neighbor records for scoring

Options:

A. Use gradient-boosted trees with post hoc explanations.
B. Use KNN because it avoids parametric assumptions.
C. Use the deep neural net for the highest AUC.
D. Use the regularized scorecard design.

Best answer: D

Explanation: Model design should optimize for the full set of operational and governance constraints, not just the best predictive metric. In this scenario, auditable reason codes, p95 latency under 40 ms, and limited labeled data are hard requirements. The regularized scorecard has lower AUC, but it meets the latency target, can be trained on the available labeled data, and supports transparent coefficient-based reason codes through monotonic bins. The higher-AUC candidates violate at least one required constraint, so their apparent performance advantage is not deployable for this use case. The key takeaway is that model selection is a constrained design decision, not a leaderboard exercise.

Highest AUC trap fails because the neural net misses latency, interpretability, and data-availability requirements.
Post hoc explanation trap fails because explanations after the fact may not satisfy auditable reason-code needs, and the model’s 70 ms p95 latency exceeds the 40 ms limit.
Assumption-free trap fails because KNN has unacceptable scoring latency and creates operational concerns by storing neighbor records.
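
Treating selection as a constrained choice can be sketched as a filter followed by a ranking. The AUC and latency values come from the exhibit; the `auditable_reason_codes` flags are an interpretation of the design notes and are an assumption of this example.

```python
candidates = [
    {"name": "Deep neural net", "auc": 0.89, "p95_ms": 180, "auditable_reason_codes": False},
    {"name": "Gradient-boosted trees", "auc": 0.88, "p95_ms": 70, "auditable_reason_codes": False},
    {"name": "Regularized scorecard", "auc": 0.84, "p95_ms": 12, "auditable_reason_codes": True},
    {"name": "KNN classifier", "auc": 0.82, "p95_ms": 220, "auditable_reason_codes": False},
]


def select_model(candidates, max_p95_ms=40):
    """Apply hard constraints first, then rank the survivors by holdout AUC."""
    eligible = [
        c for c in candidates
        if c["p95_ms"] < max_p95_ms and c["auditable_reason_codes"]
    ]
    return max(eligible, key=lambda c: c["auc"]) if eligible else None


chosen = select_model(candidates)
```

AUC only breaks ties among candidates that already satisfy the hard requirements, so the ordering of the two steps matters.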

Question 85

Topic: Modeling, Analysis, and Outcomes

A team is preparing an experiment summary for a churn model. The locked holdout set was not viewed until after model family and hyperparameters were chosen.

Exhibit: Experiment log

Step	Data used	Action	Result
1	Development split	5-fold CV across 4 model families	Gradient boosting best, AUC 0.842
2	Development split	CV tuning for gradient boosting	Tuned AUC 0.856
3	Locked holdout	Evaluate final tuned model once	Holdout AUC 0.831

Which interpretation correctly distinguishes model-selection evidence from post-selection validation evidence?

Options:

A. Use CV results for selection and the locked holdout for validation.
B. Select the model family by the locked holdout AUC.
C. Use the tuned CV AUC as the final validation estimate.
D. Average the tuned CV AUC and holdout AUC as validation.

Best answer: A

Explanation: Model-selection evidence is used to compare alternatives or tune settings before the final model is chosen. In the exhibit, the development split and cross-validation results support choosing gradient boosting and its hyperparameters. Post-selection validation should estimate how the already-selected model performs on data that did not influence selection. Because the locked holdout was evaluated once only after the final tuned model was fixed, its AUC is the appropriate post-selection validation evidence. The key risk is optimistic bias: any dataset repeatedly used to compare or tune models becomes part of the selection process and should not be treated as an untouched final validation source.

Tuned CV as final fails because those folds influenced hyperparameter selection and can be optimistically biased.
Holdout for selection fails because using the locked holdout to choose the model would contaminate post-selection validation.
Averaging metrics fails because it mixes selection evidence with independent validation evidence and obscures the validation estimate.

Question 86

Topic: Operations and Processes

A data science team retrains a claims-risk model and must make each reported validation result reproducible during audit. Review the current release trace.

Artifact	Current trace
Training data	`claims_train_latest.parquet`
Feature code	Git commit `8f31c2a`
Feature pipeline config	`prod.yaml` overwritten each run
Model artifact	Registry version `risk_model:17`
Validation result	Metrics copied into a slide

Which version-control practice best preserves traceability between data, code, model, and result changes?

Options:

A. Store model binaries directly in the Git repository
B. Tag only the source repository for each release
C. Keep validation metrics in the presentation deck
D. Version a run manifest linking immutable artifact IDs

Best answer: D

Explanation: Traceability in data science requires more than source-code history. The audit gap is that several artifacts are mutable or detached from the run: the training data uses a latest name, the pipeline config is overwritten, and the validation result is copied into a slide. A strong version-control practice records an immutable run manifest, often alongside code or in an experiment-tracking system, that links the dataset snapshot or hash, code commit, configuration version, model artifact ID, and metric output. This creates a chain from input data through code and configuration to the trained model and reported result. Tagging code alone cannot explain which data or config produced a metric.

Code-only tagging misses the mutable dataset and overwritten pipeline configuration.
Binary-in-Git storage bloats the repository and still does not link metrics to data and config.
Slide-based metrics are reporting artifacts, not reproducible evidence tied to a specific run.
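
A run manifest of this kind can be as simple as a small JSON document of content hashes and version IDs. In the sketch below the commit `8f31c2a` and `risk_model:17` come from the exhibit, while the data bytes, config text, metric value, and field names are placeholders invented for illustration.

```python
import hashlib
import json


def build_manifest(data_bytes, code_commit, config_text, model_version, metrics):
    """Link one run's data, code, config, model, and metrics by immutable IDs."""
    manifest = {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "code_commit": code_commit,
        "config_sha256": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "metrics": metrics,
    }
    # Derive the manifest ID from its content so any change yields a new ID.
    body = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_id"] = hashlib.sha256(body).hexdigest()[:12]
    return manifest


# Placeholder inputs; in practice these would be the snapshot file contents
# and the exact config used for the run.
manifest = build_manifest(
    data_bytes=b"claims snapshot contents",
    code_commit="8f31c2a",
    config_text="feature_window_days: 90\n",
    model_version="risk_model:17",
    metrics={"validation_auc": 0.80},
)
```

Committing the manifest next to the code, or logging it in an experiment tracker, replaces the `latest` file name and the overwritten config with identifiers that cannot silently change.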

Question 87

Topic: Operations and Processes

A data science team is preparing a customer churn model for production. Review the workflow status and choose the next action most consistent with a standard data science life cycle.

Exhibit: Workflow status

Stage	Current status
Problem definition	`churn` means either cancellation or 90-day inactivity; KPI not approved
Data preparation	CRM, billing, and support data merged; leakage review incomplete
Modeling	Gradient boosting model trained; AUC = 0.86 on random split
Evaluation	No business threshold or cost-based review completed
Deployment	Container image scheduled for release next sprint

Options:

A. Add production monitoring and proceed with deployment
B. Deploy the container because the AUC exceeds 0.80
C. Finalize the problem definition and evaluation criteria before continuing
D. Tune hyperparameters to improve the AUC before release

Best answer: C

Explanation: A standard data science life cycle moves from problem definition through data preparation, modeling, and evaluation to deployment. In the exhibit, the team has already modeled and packaged the system, but the business definition of churn is still ambiguous (cancellation or 90-day inactivity), the KPI is not approved, and evaluation lacks an agreed decision threshold or cost review. The reported AUC therefore cannot justify deployment: the model may be optimizing the wrong target, and with the leakage review incomplete it may rely on data that would not be available at prediction time. The next step is to resolve the problem definition and evaluation criteria, then revisit preparation, modeling, and evaluation as needed before release.

AUC shortcut fails because a single technical metric cannot override an undefined target and missing business acceptance criteria.
Hyperparameter tuning is premature because improving AUC does not fix an ambiguous outcome definition or incomplete leakage review.
Monitoring-first deployment is insufficient because production monitoring cannot compensate for unresolved problem framing and validation gaps.

Question 88

Topic: Machine Learning

A support platform must route 30 million labeled text tickets into 40 categories. New tickets must be classified in under 20 ms, and low-confidence cases should be sent to human triage. The current feature store provides sparse token counts and a few categorical metadata indicators.

Candidate	Holdout macro-F1	p95 latency	Notes
Multinomial NB + calibration	0.82	5 ms	Produces class posteriors
Boosted trees on embeddings	0.84	65 ms	Batch scoring only
Transformer classifier	0.88	280 ms	Requires GPU serving

Which decision is BEST?

Options:

A. Use KNN over all historical ticket vectors
B. Deploy the transformer classifier for maximum F1
C. Deploy calibrated multinomial naive Bayes with monitoring
D. Deploy boosted trees and accept batch-only routing

Best answer: C

Explanation: Naive Bayes is often suitable for large-scale text classification when features are sparse token counts or indicators and the goal is fast probabilistic routing. Its conditional independence assumption is a simplification that rarely holds exactly for text, but the model can still work well enough in high-dimensional text settings, especially when validation shows acceptable performance. In this scenario, the calibrated multinomial naive Bayes model is the only candidate that meets the 20 ms latency target (5 ms p95), and it provides posterior scores that can be thresholded to send low-confidence tickets to human triage. The models with higher offline F1 fail operational constraints, so their extra accuracy does not translate into the best production decision.

The key takeaway is to select the model that satisfies both statistical fit and deployment requirements, not just the highest offline metric.

Maximum F1 trap fails because the transformer's 280 ms p95 latency far exceeds the 20 ms requirement, and it also requires GPU serving.
Batch-only scoring fails because ticket routing requires real-time classification, and the 65 ms p95 latency exceeds the 20 ms limit in any case.
KNN at scale fails because nearest-neighbor search over 30 million sparse vectors is operationally expensive at prediction time and is unlikely to meet the latency target.
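
To make the routing logic concrete, here is a small pure-Python sketch of multinomial naive Bayes with Laplace smoothing plus a confidence threshold for human triage. It is illustrative only: the class name, the `route` helper, the toy tickets, and the 0.80 threshold are assumptions, and the sketch thresholds raw posteriors rather than calibrated ones, so the calibration step named in the answer is omitted.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial naive Bayes with Laplace smoothing over token counts."""

    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha
        self.n_docs = len(labels)
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        return self

    def predict_proba(self, doc):
        log_scores = {}
        for label, n_label in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(n_label / self.n_docs)  # class prior
            for tok in doc:
                if tok in self.vocab:  # skip tokens never seen in training
                    count = self.token_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            log_scores[label] = score
        # Normalize in a numerically stable way (log-sum-exp trick).
        peak = max(log_scores.values())
        exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}


def route(model, doc, threshold=0.80):
    """Return the predicted category, or 'human_triage' when confidence is low."""
    probs = model.predict_proba(doc)
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else "human_triage"


# Toy training tickets (hypothetical).
docs = [
    ["refund", "charge", "invoice"],
    ["invoice", "payment", "refund"],
    ["password", "login", "reset"],
    ["login", "error", "password"],
]
labels = ["billing", "billing", "access", "access"]
model = TinyMultinomialNB().fit(docs, labels)

print(route(model, ["refund", "invoice"]))  # clearly billing-like tokens
print(route(model, ["refund", "login"]))    # mixed evidence, low confidence
```

Scoring a ticket only multiplies a handful of per-token probabilities per class, which is why this family of models tends to be cheap at prediction time. In practice the threshold would be chosen on calibrated probabilities from held-out data.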

Question 89

Topic: Mathematics and Statistics

A retail lender is building a model to predict loan delinquency for the next quarter. The data covers monthly applications from January 2021 through December 2024, and the business sponsor wants evidence that the model will generalize to future quarters. An analyst proposes a random 80/20 validation split because it preserves the delinquency rate in each split.

Which modeling concern most directly makes that validation plan misleading?

Options:

A. Right-censoring of incomplete event times
B. Temporal leakage from future periods into training
C. High variance from using too little training data
D. Multicollinearity among correlated borrower features

Best answer: B

Explanation: Temporal validation must respect the order in which observations become available. For a next-quarter prediction problem, a random split can mix future months into the training set and earlier months into validation. That can make the model look better than it will be in production, especially when borrower behavior, underwriting policy, macroeconomic conditions, or delinquency base rates change over time. A time-based holdout, rolling-origin validation, or forward-chaining cross-validation better matches the deployment setting because the model is always evaluated on periods after the training period.

The key issue is not just preserving class balance; it is preventing future information and time-dependent patterns from contaminating the validation estimate.

Right-censoring matters for survival analysis when event times are incomplete, but the stem focuses on next-quarter classification validation.
Training size may affect variance, but an 80/20 split is not the core flaw described.
Multicollinearity can affect coefficient stability, but it does not explain why random validation overstates future performance.
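
The forward-chaining idea can be sketched in a few lines. This is a minimal illustration under assumptions: the function name, the quarterly granularity, and the choice of 12 training quarters are hypothetical, and the quarter labels simply span the January 2021 through December 2024 range from the stem.

```python
def forward_chaining_splits(periods, min_train, horizon=1):
    """Yield (train, validation) period lists with validation strictly after training."""
    ordered = sorted(periods)
    for end in range(min_train, len(ordered) - horizon + 1):
        yield ordered[:end], ordered[end:end + horizon]


# Quarter labels covering 2021 through 2024; these sort correctly as strings.
quarters = [f"{year}Q{q}" for year in range(2021, 2025) for q in range(1, 5)]

for train, valid in forward_chaining_splits(quarters, min_train=12):
    print(f"train {train[0]}..{train[-1]} -> validate {valid[0]}")
```

Each fold trains only on quarters that precede the validation quarter, so no future period can leak into training, unlike a random 80/20 split.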

Question 90

Topic: Modeling, Analysis, and Outcomes

A data science team is preparing features for a 30-day readmission model. The business goal is to preserve rare clinical signals while preventing artifacts from a known source-system outage.

Feature group	Stored value	Data profile
Diagnosis code counts	`0`	Patient had coverage but no claim with that code
Procedure code counts	`0`	Patient had coverage but no claim with that code
Lab result value	`NULL`	12% missing, concentrated in two clinics during an outage

Which preparation decision is BEST?

Options:

A. Convert all zeros to NULL and impute them with medians
B. Replace lab NULLs with zero and leave code counts unchanged
C. Keep code-count zeros; impute lab NULLs separately with missingness indicators
D. Drop sparse diagnosis and procedure features before modeling

Best answer: C

Explanation: Sparse observations and missing values require different preparation. In the exhibit, a 0 diagnosis or procedure count means the patient had coverage but no recorded claim with that code, so it is an observed absence and may be predictive, especially for rare conditions. A lab NULL is different: the value is unknown, and the outage pattern may itself carry information or bias. Keeping the sparse count features as zeros, while imputing lab values and adding a missingness indicator, preserves valid absence signals and handles nonrandom missingness without inventing measurements. Treating all zeros as missing would corrupt the feature meaning; treating NULL labs as zero would create clinically false values.

Zero-as-missing fails because claim-count zeros are observed absences, not unknown measurements.
Null-as-zero fails because a missing lab result is not the same as a measured value of zero.
Drop sparse features fails because rare diagnosis and procedure codes can be high-value predictors despite sparsity.
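
A minimal sketch of that preparation step follows. The column names and values are hypothetical, `None` stands in for `NULL`, and median imputation is just one reasonable choice. In a real pipeline the fill value would be computed on the training split only.

```python
from statistics import median


def impute_with_indicator(rows, column):
    """Median-impute None values in `column` and add a `<column>_missing` flag."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    prepared = []
    for r in rows:
        out = dict(r)  # copy so the input rows are left untouched
        out[f"{column}_missing"] = int(r[column] is None)
        if r[column] is None:
            out[column] = fill
        prepared.append(out)
    return prepared


# Hypothetical rows: code-count zeros are observed absences and are kept as-is.
rows = [
    {"dx_code_count": 0, "lab_value": 1.1},
    {"dx_code_count": 2, "lab_value": None},
    {"dx_code_count": 0, "lab_value": 0.9},
    {"dx_code_count": 1, "lab_value": 1.4},
]
prepared = impute_with_indicator(rows, "lab_value")
for row in prepared:
    print(row)
```

The indicator column lets the model learn from the outage pattern itself, while the count features pass through unchanged.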

Focused topic pages

Specialized Applications of Data Science

Official Resources