
PMI-CPMAI: Manage AI Model Development and Evaluation

Try 10 focused PMI-CPMAI questions on Manage AI Model Development and Evaluation, with answers and explanations, then continue with PM Mastery.


Topic snapshot

Field | Detail
Exam route | PMI-CPMAI
Topic area | Manage AI Model Development and Evaluation
Blueprint weight | 16%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Manage AI Model Development and Evaluation for PMI-CPMAI. Work through the 10 questions first, then review the explanations and return to mixed practice in PM Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: this topic carries 16% of the practice outline. A focused topic score can overstate readiness because you already know which topic each question tests, so treat these drills as repair work before timed mixed sets.

Sample questions

These questions are original PM Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Manage AI Model Development and Evaluation

During model testing, the team trains the same classifier 10 times using identical data splits but different random seeds. The AUC ranges from 0.62 to 0.83, the confusion matrix varies noticeably run-to-run, and the top 5 features change order each time. A developer proposes selecting the single “best” run and moving forward.

Which AI model QA/QC principle best matches what the project manager should insist on next?

  • A. Conduct a bias/fairness assessment across protected groups
  • B. Perform stability/robustness checks and report metric variability across runs
  • C. Implement production drift monitoring and retraining triggers
  • D. Expand hyperparameter search to maximize peak AUC

Best answer: B

What this tests: Manage AI Model Development and Evaluation

Explanation: The wide spread in AUC and shifting results across repeated trainings are classic instability signals. QA/QC should focus on model stability and robustness by measuring performance variability and diagnosing sources of variance (e.g., sensitivity to initialization, sampling, or overfitting). Selecting the single best run masks risk and undermines confidence in expected performance.

The core concept is model stability/robustness during development testing: a model should produce consistent performance and behavior under small, non-meaningful changes (like random seeds) when data and evaluation setup are otherwise the same. Large run-to-run swings are an instability signal that the reported metric is not reliable enough for a release decision.

Practical QA/QC actions include:

  • Repeated training/cross-validation and summarizing results (mean, variance, ranges)
  • Investigating variance sources (data leakage, small sample size, high model variance/overfitting)
  • Using confidence intervals or similar uncertainty reporting for key metrics

The key takeaway is to characterize and reduce variability before treating any single score as representative.

Large run-to-run metric swings indicate instability, so QA/QC should quantify variability (not cherry-pick a best run) before a go/no-go decision.
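
The stability check described above can be scripted directly. Below is a minimal sketch, assuming scikit-learn and a synthetic dataset standing in for the team's real data: retrain with a fixed split and varying seeds, then report the spread rather than the single best run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

aucs = []
for seed in range(10):
    # Identical data split every run; only the training seed changes.
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

aucs = np.array(aucs)
print(f"AUC mean={aucs.mean():.3f}  std={aucs.std():.3f}  "
      f"range=[{aucs.min():.3f}, {aucs.max():.3f}]")
# QA/QC reports this spread; a wide range blocks any single-run "best" pick.
```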


Question 2

Topic: Manage AI Model Development and Evaluation

A team is building a churn prediction model and has only used a single 80/20 train/validation split. The selected gradient-boosted model shows 0.94 AUC on the training set but validation AUC ranges from 0.71 to 0.83 depending on the random split seed, raising concerns about overfitting and unstable model selection. The team has not created a final holdout test set yet.

As the AI project lead, what is the best next step to reduce overfitting risk while selecting the model?

  • A. Move the current model to a limited production pilot to measure real-world AUC
  • B. Create an untouched test set, use stratified k-fold cross-validation (with tuning inside CV) to compare candidates, then evaluate once on the test set
  • C. Use the test set during hyperparameter tuning to ensure the model generalizes before release
  • D. Repeat the train/validation split several times and keep the model with the highest validation AUC

Best answer: B

What this tests: Manage AI Model Development and Evaluation

Explanation: The problem is unstable validation performance, which signals overfitting and sensitivity to a single split. The next step is to use a robust cross-validation-based model selection approach (ideally with tuning inside CV) and reserve a final untouched test set for one-time confirmation. This reduces selection bias and provides a more reliable estimate of generalization.

When performance swings across random splits, a single holdout validation set is not a reliable basis for model selection and can lead to overfitting to that particular split. The next step is to put a disciplined evaluation structure in place: reserve a final holdout test set that stays untouched until the end, then use stratified k-fold cross-validation to compare candidate models and tune hyperparameters based on cross-validated results (e.g., mean and variance across folds). This reduces variance from any one split and makes it harder to “accidentally” pick a model that just got lucky on one validation partition. After selecting the best candidate using CV, run a single final evaluation on the holdout test set to confirm expected generalization before moving toward pilot deployment.

This uses cross-validation for stable model selection while keeping a final holdout test set to avoid leakage and overfitting to evaluation data.
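
A minimal sketch of that evaluation structure, assuming scikit-learn and synthetic data: carve out the holdout first, tune inside cross-validation, and touch the test set exactly once.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)

X, y = make_classification(n_samples=5000, weights=[0.8], random_state=0)

# 1. Reserve an untouched holdout test set before any model selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. Compare and tune candidates with stratified k-fold CV on dev data only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc", cv=cv)
search.fit(X_dev, y_dev)
print(f"CV AUC of selected candidate: {search.best_score_:.3f}")

# 3. One final evaluation on the holdout confirms expected generalization.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"Holdout AUC (evaluated once): {test_auc:.3f}")
```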


Question 3

Topic: Manage AI Model Development and Evaluation

An AI model performed well in validation, but after deployment its accuracy steadily declines because the relationship between key input features and the business outcome has changed (for example, customer behavior shifts after a pricing change). Which term best describes this generalization risk?

  • A. Concept drift
  • B. Data leakage
  • C. Overfitting
  • D. Data drift (covariate shift)

Best answer: A

What this tests: Manage AI Model Development and Evaluation

Explanation: This situation describes a changing mapping between predictors and the target after go-live, which is a core generalization risk. When the concept being predicted evolves, a once-valid model can become inaccurate even if the model code is unchanged. That is concept drift.

Concept drift occurs when the real-world relationship between inputs and the outcome changes over time, so the model’s learned patterns no longer generalize to current conditions. In go/no-go readiness, this is a key robustness concern because it drives the need for operational monitoring and clear retraining/rollback triggers.

Practically, you detect it by tracking outcome-based performance (when labels are available) and investigating whether changes in business processes, customer behavior, or external conditions have altered the meaning of features relative to the target. This differs from input-only distribution change, where feature distributions shift but the underlying input-to-target relationship may still hold.

The key takeaway is to plan for post-deployment monitoring that can surface relationship changes, not just input changes.

Concept drift is when the underlying input-to-target relationship changes, causing performance degradation in new operating conditions.
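
One hedged sketch of outcome-based monitoring that could surface concept drift once labels arrive; the window size and tolerance here are illustrative assumptions, not prescribed values.

```python
from collections import deque

class ConceptDriftMonitor:
    """Rolling accuracy on labeled outcomes; alerts on a sustained drop."""

    def __init__(self, window=500, baseline=0.85, tolerance=0.05):
        self.hits = deque(maxlen=window)  # 1 if prediction matched outcome
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, predicted, actual):
        self.hits.append(int(predicted == actual))

    def check(self):
        if len(self.hits) < self.hits.maxlen:
            return "collecting labels"    # not enough outcomes yet
        acc = sum(self.hits) / len(self.hits)
        if acc < self.baseline - self.tolerance:
            return f"ALERT: rolling accuracy {acc:.2f}; review for concept drift"
        return f"OK: rolling accuracy {acc:.2f}"

monitor = ConceptDriftMonitor(window=3)   # tiny window just to demonstrate
for pred, actual in [(1, 1), (1, 0), (0, 1), (1, 0)]:
    monitor.record(pred, actual)
print(monitor.check())
```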


Question 4

Topic: Manage AI Model Development and Evaluation

A fraud detection team must deliver an initial model in 3 weeks and expects ~12 model iterations. Full-data training takes ~8 hours on a GPU. Due to shared infrastructure, GPUs are only available from 8:00 p.m.–6:00 a.m. and must not impact daytime operations.

Which training schedule and resource allocation approach is best?

  • A. Iterate on sampled data; reserve nightly GPUs for full retrains
  • B. Train on full data only, sequentially, during business hours
  • C. Pause work until new dedicated GPUs are procured
  • D. Defer iteration; do one full training in the final week

Best answer: A

What this tests: Manage AI Model Development and Evaluation

Explanation: With a fixed, limited overnight GPU window, the team should separate rapid iteration from computationally expensive full retraining. Using a representative sample for fast daytime cycles lets the team test features and parameters quickly, while reserving overnight GPUs for the heavier full-data runs needed for validation and readiness decisions.

Training schedule and resource allocation should match iteration needs to compute constraints. In this scenario, the decisive factor is restricted GPU availability that cannot affect daytime operations, yet the team needs many iterations.

A practical plan is:

  • Use a representative sampled dataset (or lighter training setup) for rapid daytime experiments.
  • Reserve/schedule the overnight GPU window for full-data retraining and confirmation runs.
  • Promote only the most promising candidates to the overnight queue to avoid wasting scarce GPU time.

This approach increases learning velocity without violating operational constraints, while still ensuring that final decisions are based on performance measured on full-data training and appropriate evaluation splits.

This maximizes iteration speed while fitting full retraining into the limited off-peak GPU window.
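
As a sketch of the daytime half of that plan, assuming pandas and synthetic transaction data: a stratified sample keeps the rare fraud class represented while cutting training time for quick experiments.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
claims = pd.DataFrame({
    "amount": rng.lognormal(4, 1, 100_000),
    "is_fraud": rng.random(100_000) < 0.02,   # rare positive class
})

# 10% sample within each class keeps fraud represented for fast daytime
# iteration; full-data retrains go to the overnight GPU queue.
fast_iter = (claims.groupby("is_fraud", group_keys=False)
                   .sample(frac=0.10, random_state=0))
print(f"full fraud rate:   {claims['is_fraud'].mean():.4f}")
print(f"sample fraud rate: {fast_iter['is_fraud'].mean():.4f}")
```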


Question 5

Topic: Manage AI Model Development and Evaluation

A retail bank is building an AI model to support small-business loan approval decisions using 50,000 historical applications with tabular features (income, cash flow, credit history, collateral). Because denials require applicant-specific, regulator-auditable reasons, the risk committee has set a hard requirement for high transparency and stable explanations during model reviews.

Which model technique is the best fit for this use case given that requirement?

  • A. Generalized additive model (GAM) with constrained, human-interpretable features
  • B. Gradient-boosted trees with post-hoc feature attribution
  • C. Deep neural network with automated feature learning
  • D. Unsupervised clustering to segment applicants, then manual rules per segment

Best answer: A

What this tests: Manage AI Model Development and Evaluation

Explanation: The single most important discriminator is the requirement for regulator-auditable, stable, applicant-specific explanations. An inherently interpretable technique like a GAM supports transparent reasoning that can be reviewed, documented, and defended without relying on approximations. This aligns model selection to governance needs for a high-impact decision.

When a use case involves high-stakes decisions (like credit approval) and strict governance, the primary model-selection driver is often interpretability and auditability, not maximum predictive power. Inherently interpretable models (e.g., linear models or GAMs) make it easier to show how each input affects the outcome and to provide consistent, reviewable reason codes for decisions.

A GAM is a strong fit for tabular risk modeling because it can capture some non-linear effects while keeping the relationship between each feature and the prediction understandable. Constraining features (e.g., monotonic effects where appropriate) further improves defensibility and stability in model reviews. The key takeaway is to prefer intrinsic interpretability over post-hoc explanations when explanations are a hard requirement.

A GAM provides strong inherent interpretability and audit-friendly, stable feature-to-outcome relationships suitable for high-stakes decisions.
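
A minimal sketch of an inherently interpretable fit, assuming the pygam package; the feature names and synthetic data are illustrative stand-ins for the bank's loan attributes.

```python
import numpy as np
from pygam import LogisticGAM, s

rng = np.random.default_rng(0)
# Columns stand in for income, cash_flow, credit_score, collateral_value.
X = rng.normal(size=(1000, 4))
y = ((X[:, 2] + 0.5 * X[:, 0] + rng.normal(size=1000)) > 0).astype(int)

# One smooth, inspectable term per feature -- no post-hoc approximation needed.
gam = LogisticGAM(s(0) + s(1) + s(2) + s(3)).fit(X, y)

# Each partial-dependence curve shows exactly how one feature moves the score,
# which is the reviewable, per-feature explanation an auditor can inspect.
for i, name in enumerate(["income", "cash_flow", "credit_score", "collateral"]):
    grid = gam.generate_X_grid(term=i)
    effect = gam.partial_dependence(term=i, X=grid)
    print(f"{name}: effect range {effect.min():+.2f} to {effect.max():+.2f}")
```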


Question 6

Topic: Manage AI Model Development and Evaluation

A health insurer deployed an AI model to flag potentially fraudulent claims for review. In UAT, the model achieved high performance using a randomly shuffled train/test split of last year’s claims.

Two weeks after go-live, reviewers report a surge of false positives, mostly on telehealth-related claims introduced after a reimbursement policy change. A drift report shows a large increase in previously rare procedure codes and a shift in the distribution of billed amounts compared with training data.

What is the most likely underlying cause?

  • A. Reviewers were not provided enough explanation details, reducing trust and causing perceived poor performance.
  • B. Drift monitoring was configured to run too infrequently, delaying the identification of performance issues.
  • C. The decision threshold was set too low, which increased the number of positive flags.
  • D. The pre-deployment evaluation did not test out-of-time/shifted conditions, so the model failed to generalize after a policy-driven data shift.

Best answer: D

What this tests: Manage AI Model Development and Evaluation

Explanation: The clues point to a distribution shift introduced by a new policy and new/rare codes that were not adequately represented or tested in pre-production evaluation. Using a random split can hide time-based leakage and overestimate performance, so the model appears strong in UAT but degrades under new conditions. The underlying issue is insufficient robustness/generalization assessment for realistic operating shifts.

This is a generalization/robustness failure caused by a mismatch between evaluation conditions and real operating conditions. A randomly shuffled split can make the test set look like the training set (and sometimes leak time-dependent patterns), so it does not reveal how the model will behave when the world changes.

Given the policy change and the drift report (new procedure codes and shifted billed-amount distributions), the go/no-go gap is the absence of shift-aware validation, such as:

  • Time-based (out-of-time) validation and backtesting
  • Stress testing on rare/new codes and segment slices
  • Explicit shift/out-of-distribution (OOD) detection and fallback rules for novel inputs

Threshold changes or more frequent monitoring may help manage symptoms, but they do not address the core problem: the model was not verified to generalize to plausible new conditions before operationalization.

A random split can overstate performance and miss temporal/policy shifts, leading to poor generalization when inputs change in production.
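
A hedged sketch of out-of-time validation, assuming pandas and scikit-learn; the column names and data are synthetic stand-ins. The point is the date cutoff: the test period simulates "future" conditions instead of resembling the training data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
claims = pd.DataFrame({
    "service_date": pd.date_range("2025-01-01", periods=n, freq="h"),
    "billed_amount": rng.lognormal(5, 1, n),
    "rare_code_share": rng.random(n),
})
claims["is_fraud"] = ((claims["billed_amount"] > 200) &
                      (rng.random(n) < 0.5)).astype(int)

cutoff = claims["service_date"].quantile(0.8)   # train on older claims only
train = claims[claims["service_date"] <= cutoff]
test = claims[claims["service_date"] > cutoff]  # simulated "future" period

features = ["billed_amount", "rare_code_share"]
model = GradientBoostingClassifier().fit(train[features], train["is_fraud"])
auc = roc_auc_score(test["is_fraud"],
                    model.predict_proba(test[features])[:, 1])
print(f"Out-of-time AUC: {auc:.3f} (compare with the shuffled-split figure)")
```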


Question 7

Topic: Manage AI Model Development and Evaluation

A churn-prediction model was approved at the go/no-go gate and deployed. Three weeks later, AUC in production drops from 0.86 to 0.71, the service desk reports “no alerts were configured,” and the on-call engineer cannot identify which training dataset or feature pipeline version produced the running model. Separately, a privacy review flags that inference logs contain raw customer identifiers with no documented retention or access controls. What is the most likely underlying cause?

  • A. The model requires additional training data to improve accuracy
  • B. Production performance issues are caused by insufficient compute capacity
  • C. Low user adoption is primarily due to inadequate change management training
  • D. Operational documentation and governance procedures were insufficient at go/no-go

Best answer: D

What this tests: Manage AI Model Development and Evaluation

Explanation: The clues point to gaps in readiness artifacts for operational support and governance: no monitoring/alerting setup, no traceability to data and pipeline versions, and no documented privacy-safe logging and retention. These are documentation and operational procedure deficiencies that should be validated before approving operationalization. Addressing them is a go/no-go prerequisite to ensure the model can be supported, audited, and governed safely.

For a go/no-go decision, “model readiness” includes more than offline performance—it requires that operations and governance can reliably support the solution. In the scenario, the team cannot determine the deployed model’s provenance (dataset/feature pipeline/version), has no alerting configured to detect drift-related degradation, and is logging identifiers without documented controls.

These symptoms collectively indicate missing or incomplete operational procedures and documentation, such as:

  • Monitoring/alerting runbooks and ownership (RACI/on-call)
  • Reproducibility artifacts (model registry entry, data/feature lineage, build/deploy records)
  • Privacy-safe logging, retention, and access control procedures

The key takeaway is that inadequate support and governance procedures are an underlying readiness failure, not just a performance-tuning problem.

Missing runbooks, version/lineage records, and logging/privacy procedures explain undetected drift, unreproducible deployments, and the privacy issue.
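
As one illustration of the reproducibility artifact, a provenance record like the sketch below could be required at the gate; every field name is an assumption for illustration, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelReleaseRecord:
    """Hypothetical go/no-go traceability record -- not a standard schema."""
    model_name: str
    model_version: str
    training_dataset_uri: str      # exact data snapshot used for training
    feature_pipeline_version: str  # code revision that built the features
    alert_runbook_uri: str         # who gets paged and what they do
    log_retention_days: int        # documented, privacy-reviewed retention
    approved_by: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = ModelReleaseRecord(
    model_name="churn-predictor",
    model_version="1.4.0",
    training_dataset_uri="s3://datasets/churn/2026-04-01/",  # hypothetical
    feature_pipeline_version="git:a1b2c3d",                  # hypothetical
    alert_runbook_uri="https://wiki.example/runbooks/churn-model",
    log_retention_days=30,
    approved_by="governance-board")
print(record)
```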


Question 8

Topic: Manage AI Model Development and Evaluation

You are preparing a go/no-go decision for an ML model that auto-blocks suspicious online refunds.

Success criteria: stop ≥15% more fraudulent refunds than current rules while keeping wrongful blocks (false positives) ≤0.3% of legitimate refunds. Risk tolerance: no subgroup wrongful-block rate >1.25× overall.

Pilot results: AUC 0.95, but at the selected threshold the model wrongfully blocks 1.1% of legitimate refunds; one region is at 1.8× overall. Operations overrides 40% of blocks, and support agents started emailing case details to handle appeals (a privacy incident). Drift monitoring shows feature distributions stable versus training.

What is the most likely underlying cause?

  • A. Privacy incident indicates the model is inherently unsafe to use
  • B. Concept drift from changing customer refund behavior
  • C. Misaligned evaluation/threshold to success and risk criteria
  • D. User change resistance causing low adoption and overrides

Best answer: C

What this tests: Manage AI Model Development and Evaluation

Explanation: The pilot clearly violates the defined acceptance thresholds for wrongful blocks and subgroup impact, even though the aggregate AUC looks strong. That pattern most often occurs when the team optimizes or reports the wrong performance view (e.g., AUC/F1) or selects an operating threshold that is not constrained by the stated success criteria and risk tolerance. The overrides and workaround are downstream effects of that mismatch.

Go/no-go readiness requires evaluating the model exactly against the agreed success criteria and risk tolerance at the intended operating point (decision threshold and workflow). Here, the model fails the explicit constraints: false positives are 1.1% vs a ≤0.3% limit, and one region exceeds the subgroup tolerance (1.8× vs ≤1.25×). A high AUC can coexist with unacceptable harm rates if the chosen cutoff prioritizes catching fraud over limiting wrongful blocks.

To diagnose and fix, re-run evaluation using:

  • The agreed business KPI view (fraud prevented and customer harm)
  • Threshold tuning constrained by the ≤0.3% false-positive cap
  • Subgroup checks against the 1.25× tolerance before approving

Stable drift signals make “data shift” a less likely root cause than an evaluation/threshold misalignment.

The model was judged on aggregate metrics and a chosen cutoff, not on the specified false-positive and subgroup risk limits.
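
A sketch of constraint-first threshold selection, with synthetic pilot data standing in for the real scores: derive the cutoff from the ≤0.3% wrongful-block cap, then test subgroups against the 1.25× tolerance.

```python
import numpy as np

def threshold_for_fpr_cap(scores, y_true, fpr_cap=0.003):
    """Lowest cutoff whose wrongful-block (false-positive) rate meets the cap."""
    legit = np.sort(scores[y_true == 0])
    k = int(np.floor(fpr_cap * len(legit)))  # legit cases allowed above cutoff
    return legit[-(k + 1)] if k < len(legit) else legit[0]

rng = np.random.default_rng(0)               # synthetic pilot data
y = rng.integers(0, 2, 20_000)               # 1 = fraudulent refund
scores = np.clip(rng.normal(0.3 + 0.4 * y, 0.2), 0, 1)
region = rng.choice(["A", "B", "C"], 20_000)

thr = threshold_for_fpr_cap(scores, y)
blocked = scores > thr
overall_fpr = blocked[y == 0].mean()
print(f"threshold={thr:.3f}  wrongful-block rate={overall_fpr:.3%}")

for r in ["A", "B", "C"]:                    # subgroup check vs 1.25x tolerance
    legit_in_region = (region == r) & (y == 0)
    ratio = blocked[legit_in_region].mean() / overall_fpr
    flag = " BREACH" if ratio > 1.25 else ""
    print(f"region {r}: {ratio:.2f}x overall{flag}")
```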


Question 9

Topic: Manage AI Model Development and Evaluation

A project team has completed initial data profiling and must document data quality findings, key decisions (including any go/no-go constraints), and next steps so business stakeholders and the governance board can review dataset fitness for the AI use case. Which term best describes this dataset-focused documentation artifact?

  • A. Data dictionary
  • B. Model card
  • C. Datasheet for datasets (dataset card)
  • D. Data lineage

Best answer: C

What this tests: Manage AI Model Development and Evaluation

Explanation: A datasheet for datasets (often called a dataset card) is used to communicate dataset provenance and quality, how the data should and should not be used, known limitations, and planned remediation/monitoring. This directly supports transparent stakeholder communication and governance decisions on whether the data is fit to proceed into preparation and modeling.

The core concept is dataset documentation for transparency and governance. A datasheet for datasets (dataset card) captures what the data is, where it came from, how it was collected/processed, observed data quality issues, known limitations/bias risks, and decisions and next steps (e.g., remediation actions, acceptance criteria, owners, and refresh cadence). In a go/no-go context, it provides an auditable record of why the team believes the data is (or is not) ready to move into data preparation and model development.

A model card documents a trained model’s behavior and evaluation, while data lineage focuses on traceability of data movement and transformations rather than a stakeholder-friendly quality decision summary.

It documents a dataset’s provenance, quality, intended use, limitations, and maintenance actions for stakeholder/governance review.
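
A skeleton of such a datasheet, expressed as a plain structure; the fields follow common dataset-card practice but are illustrative, not an official standard.

```python
dataset_card = {
    "name": "customer_claims_2025",          # hypothetical dataset
    "provenance": {
        "source": "core operational system extract",
        "collection_window": "2020-01 to 2025-06",
        "processing": "deduplicated; amounts normalized to USD",
    },
    "quality_findings": [
        "4.2% missing values in a key numeric field before 2022",
        "one attribute is self-reported and unverified for ~15% of rows",
    ],
    "intended_use": "supervised modeling for the approved use case only",
    "known_limitations": ["younger applicants underrepresented"],
    "decision_and_next_steps": {
        "status": "conditional go",
        "conditions": ["remediate pre-2022 gaps before model training"],
        "owner": "data-governance-board",
        "review_date": "2026-06-01",
    },
}

for section, detail in dataset_card.items():
    print(f"{section}: {detail}")
```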


Question 10

Topic: Manage AI Model Development and Evaluation

A team is developing a customer-churn classifier. In testing, the AUC across 5 cross-validation folds ranges from 0.61 to 0.88, but the team reports only the mean AUC (0.75) and proceeds to user acceptance testing without investigating the variability.

What is the most likely near-term impact of this omission?

  • A. Immediate data privacy breach occurs during test execution
  • B. Production drift goes unnoticed for months after deployment
  • C. Go/no-go decision becomes unreliable; acceptance testing likely fails
  • D. Regulators issue penalties for inadequate model documentation

Best answer: C

What this tests: Manage AI Model Development and Evaluation

Explanation: Large performance swings across validation folds are a clear instability signal during QA/QC. Reporting only the average hides this risk and can lead to approving a model that behaves inconsistently on different samples. The most immediate consequence is unreliable acceptance results and rework before release.

During development and testing, you should monitor not only point estimates (e.g., mean AUC) but also stability indicators such as variance/range across folds, confidence intervals, and sensitivity to data splits or random seeds. A wide spread (0.61 to 0.88) suggests the model may be overfitting, the data are not representative, the split strategy is flawed (e.g., leakage, non-stratified sampling), or performance depends heavily on a small subset of cases. If the team advances based on the mean alone, near-term stakeholders will see inconsistent outcomes when the evaluation sample changes (as in UAT), making acceptance criteria and release decisions unreliable. The right response is to investigate and reduce variability before proceeding.

Key takeaway: instability in test metrics primarily threatens near-term validation and release decisions, not downstream operational monitoring or regulatory outcomes.

High fold-to-fold metric variability signals instability, so relying on the mean increases the chance the model performs inconsistently in acceptance testing.
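
A short sketch of the fuller report, assuming scikit-learn and synthetic data: show the fold scores, their spread, and a rough confidence interval alongside the mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1500, weights=[0.85], random_state=0)
scores = cross_val_score(
    RandomForestClassifier(random_state=0), X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc")

mean, std = scores.mean(), scores.std(ddof=1)
ci = 1.96 * std / np.sqrt(len(scores))       # rough normal-approximation CI
print(f"fold AUCs: {np.round(scores, 3)}")
print(f"mean={mean:.3f} +/- {ci:.3f} (95% CI), "
      f"range={scores.max() - scores.min():.3f}")
# A wide range or CI should trigger investigation before UAT, not a go.
```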

Continue with full practice

Use the PMI-CPMAI Practice Test page for the full PM Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Open the matching PM Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Free review resource

Read the PMI-CPMAI guide on PMExams.com, then return to PM Mastery for timed practice.

Revised on Thursday, May 14, 2026