Try 10 focused PMI-CPMAI questions on Identify Data Needs, with answers and explanations, then continue with PM Mastery.
| Field | Detail |
|---|---|
| Exam route | PMI-CPMAI |
| Topic area | Identify Data Needs |
| Blueprint weight | 26% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Identify Data Needs for PMI-CPMAI. Work through the 10 questions first, then review the explanations and return to mixed practice in PM Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 26% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original PM Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Identify Data Needs
Your team is building an AI model to prioritize loan applications for manual review. During data readiness checks, you find that the last 18 months of approved/denied outcomes are missing for one region, and the remaining history contains far fewer labeled outcomes for a minority subgroup. Early evaluation shows good overall performance but a much higher false-negative rate for that subgroup. The business sponsor asks to “launch a pilot anyway” to hit a deadline.
What is the best next step?
Best answer: C
What this tests: Identify Data Needs
Explanation: When data is incomplete and uneven across segments, you must explicitly communicate the likely impacts on model performance, fairness, and operational risk before moving forward. A concise impact brief supports transparent tradeoffs and enables a governance-backed go/no-go decision. This keeps the initiative aligned to acceptable risk and prevents premature deployment based on misleading aggregate metrics.
The core concept is that data limitations (missing outcomes, sparse labels, and non-representative samples) directly drive unpredictable performance, biased error rates, and operational risk when the model is used in production. The best next step is to communicate those consequences in a decision-oriented artifact so stakeholders can choose an appropriate path before additional modeling or any deployment.
A practical impact brief should cover:

- The specific gaps: 18 months of missing approved/denied outcomes in one region and sparse labels for the minority subgroup.
- The expected effects on model performance, fairness (the elevated subgroup false-negative rate), and operational risk.
- Candidate mitigations and a recommended, governance-backed go/no-go path.

Trying to "fix" outcomes only during modeling, or deferring the problem to post-launch monitoring, skips the prerequisite of making the risks and tradeoffs explicit and agreed.
It communicates how the data gaps affect performance, fairness, and operational risk and drives an explicit decision on mitigations before proceeding.
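To see how aggregate metrics can hide a subgroup problem like the one in the stem, here is a minimal Python sketch. It assumes a pandas DataFrame with hypothetical columns `y_true`, `y_pred`, and a subgroup column; none of these names come from the exam item.

```python
import pandas as pd

def false_negative_rate(df: pd.DataFrame) -> float:
    """FNR = actual positives the model missed / all actual positives."""
    positives = df[df["y_true"] == 1]
    if positives.empty:
        return float("nan")
    return float((positives["y_pred"] == 0).mean())

def fnr_by_group(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Per-group FNR, to surface disparities that aggregate metrics hide."""
    return df.groupby(group_col).apply(false_negative_rate)

# Overall FNR can look acceptable while one subgroup is far worse:
# print(false_negative_rate(df), fnr_by_group(df, "subgroup"))
```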
Topic: Identify Data Needs
When setting up an AI workspace, the team wants the compute environment to add resources automatically during peak training runs and release them when idle, to stay within budget and avoid over-provisioning. Which term best describes this capability?
Best answer: D
What this tests: Identify Data Needs
Explanation: Autoscaling is the infrastructure capability that dynamically adjusts compute capacity up or down in response to workload. It supports provisioning resources for data processing and model development while honoring constraints like cost caps and performance needs. This directly matches the goal of adding resources during peaks and releasing them when idle.
The core concept is dynamic compute provisioning. In AI initiatives, workloads (feature engineering, training, batch scoring) can be bursty, so provisioning fixed capacity often wastes money or creates bottlenecks. Autoscaling addresses this by adjusting compute capacity in response to demand signals (for example, queued jobs, CPU/GPU utilization, or scheduled windows) while enforcing guardrails such as maximum nodes, instance types, and budget limits. This helps teams right-size infrastructure to the scope of experimentation and delivery without manual intervention.
Key takeaway: choose autoscaling when the requirement is automatic scale-up/scale-down of compute resources.
Autoscaling automatically increases or decreases compute resources based on workload demand and defined limits.
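As an illustration of the guardrail idea, here is a minimal Python sketch of an autoscaling decision rule. The thresholds, node limits, and signal names are hypothetical examples, not any cloud provider's API.

```python
# Illustrative autoscaler decision rule with budget-driven guardrails.
MIN_NODES, MAX_NODES = 1, 20      # floor and cost-cap ceiling
SCALE_UP_QUEUE = 10               # queued training jobs that trigger scale-up
SCALE_DOWN_UTIL = 0.15            # average utilization that triggers scale-down

def desired_nodes(current: int, queued_jobs: int, avg_util: float) -> int:
    """Return the node count the cluster should converge toward."""
    if queued_jobs > SCALE_UP_QUEUE:
        target = current + 1      # add capacity during peak training runs
    elif avg_util < SCALE_DOWN_UTIL:
        target = current - 1      # release idle capacity to save cost
    else:
        target = current
    return max(MIN_NODES, min(MAX_NODES, target))  # enforce guardrails
```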
Topic: Identify Data Needs
A bank is building an AI model to prioritize home-loan applications for manual underwriting (high customer impact). The team has received an initial 12-month data extract, but quick profiling shows 18% missing income values, inconsistent date formats across sources, and duplicate applicant records. Stakeholders disagree on what “good enough” data quality means, and the team is scheduled to start model training next sprint.
What is the best next step?
Best answer: A
What this tests: Identify Data Needs
Explanation: Because this is a high-impact decision-support use case, the team needs explicit, measurable data quality standards and acceptance criteria before training. Setting risk-based thresholds (e.g., completeness, uniqueness, consistency, timeliness, label reliability) creates a clear go/no-go gate and reduces rework. Stakeholder sign-off ensures the standards reflect business and risk expectations, not just technical preference.
The core need is to define data quality standards and acceptance criteria that are proportional to the solution’s risk and impact. Here, missing income, duplicates, and inconsistent dates directly affect model reliability and could create unfair or inconsistent underwriting prioritization, so the team should establish measurable thresholds and checks before model development.
A practical next step is to:

- Define measurable thresholds for completeness (missing income), uniqueness (duplicate applicants), and consistency (date formats), proportional to the use case's customer impact.
- Add checks for timeliness and label reliability where they affect the decision.
- Obtain stakeholder sign-off so "good enough" is explicit rather than subjective.
- Treat the agreed criteria as a go/no-go gate before training begins.
Starting training or piloting without agreed acceptance criteria is premature because it hides risk and makes “ready” subjective.
Before training, the team must align on measurable, risk-appropriate data quality thresholds that determine a go/no-go for data use.
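A minimal sketch of such a go/no-go gate, assuming a pandas extract with hypothetical `income` and `applicant_id` columns and illustrative threshold values:

```python
import pandas as pd

# Hypothetical, risk-based thresholds agreed with stakeholders up front.
THRESHOLDS = {"income_completeness": 0.95, "duplicate_rate": 0.01}

def quality_gate(df: pd.DataFrame) -> dict:
    """Measure the agreed metrics and report pass/fail per check."""
    completeness = df["income"].notna().mean()
    duplicate_rate = df.duplicated(subset=["applicant_id"]).mean()
    return {
        "income_completeness": completeness >= THRESHOLDS["income_completeness"],
        "duplicate_rate": duplicate_rate <= THRESHOLDS["duplicate_rate"],
    }

# go = all(quality_gate(extract).values())  # proceed to training only on pass
```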
Topic: Identify Data Needs
A retail bank is scoping a near-real-time fraud detection model. Transaction patterns and merchant behaviors change weekly, so the use case requires new data to be incorporated continuously after launch. The team proposes a one-time data extract for model training and says they will “request another extract later if needed.”
Which AI project management concept best matches what the project should define next?
Best answer: D
What this tests: Identify Data Needs
Explanation: Because the fraud use case depends on fast-changing patterns, the project must plan for continuous acquisition and refresh of input data, not a one-time pull. That plan typically specifies ingestion cadence/latency, ownership, data quality checks, versioning/lineage, and how refreshed datasets are made available for training and scoring.
The core concept is defining an ongoing data collection and refresh process when the use case requires continuous updates. In the stem, relying on a one-time extract creates predictable failure modes: stale training data, broken assumptions about feature distributions, and an ad hoc access process that delays updates.
A fit-for-purpose refresh process typically clarifies:

- Ingestion cadence and latency targets matched to how fast patterns change.
- Ownership of the pipeline and each source feed.
- Data quality checks applied to every refresh.
- Versioning and lineage for refreshed datasets.
- How refreshed data is made available for retraining and scoring.
Drift monitoring and retraining decisions come later, but they depend on having this reliable refresh mechanism in place.
The use case needs a defined cadence, pipeline, ownership, and controls to continuously ingest and refresh data beyond a one-time extract.
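One way to make such a plan concrete is a machine-readable refresh specification, as in the Python sketch below. Every field name and value is a hypothetical example, not a standard.

```python
# Hypothetical refresh specification; all names and values are illustrative.
REFRESH_SPEC = {
    "dataset": "card_transactions",
    "cadence": "daily",                     # ingestion schedule
    "max_latency_hours": 24,                # freshness SLA for scoring features
    "owner": "fraud-data-engineering",      # accountable team
    "quality_checks": ["row_count_delta", "schema_match", "null_rates"],
    "versioning": "snapshot per load, with lineage metadata",
    "consumers": ["training_pipeline", "scoring_service"],
}
```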
Topic: Identify Data Needs
You are preparing a steering committee update for an AI initiative to score insurance claims for fraud at intake to reduce investigative cost and cycle time. The team completed a data readiness assessment.
Exhibit: Data readiness summary (last 12 months)

- Target decision: fraud score within 1 hour of claim submission
- Label (fraud confirmed) available within 60 days for 38% of claims
- Source-system latency for key features: 14 days
- Field `claim_type` missing: 22% of records
- Region B codes unmapped: 15% of values
- Investigator notes restricted (cannot be used)
Based on the exhibit, what is the best recommendation to present that links data readiness to a business decision?
Best answer: B
What this tests: Identify Data Needs
Explanation: The exhibit shows the solution cannot meet the intended operational decision: key features arrive 14 days late and most labels are not available in time to train and validate a true intake model. The leadership-relevant message is that expected business outcomes (faster intake decisions and cost reduction) are unattainable without re-scoping and targeted investment in data availability and quality.
Data readiness findings should be translated into what decisions the solution can and cannot support, and what that means for timeline, scope, and projected benefits. Here, the target is a fraud score within 1 hour, but the input features have 14-day latency, making real-time intake scoring infeasible regardless of modeling approach. In addition, only 38% of labels arrive within 60 days, limiting timely training/evaluation and weakening evidence for benefits. Missing and unmapped categorical values increase operational failure modes and rework.
A leadership-ready recommendation is to:

- State plainly that an intake (1-hour) score is infeasible while key features arrive with 14-day latency.
- Present re-scoping options, such as a later, post-intake decision point the current data can support.
- Size the investment needed to fix feature latency, label timeliness, and the missing or unmapped fields.
- Restate the timeline and projected benefits under each option.
The key is tying data constraints to an explicit decision on scope, schedule, and expected ROI.
The current label timeliness and 14-day feature latency prevent an intake (1-hour) decision, so leadership should decide on re-scoping and investing in data remediation before committing to the promised outcomes.
Topic: Identify Data Needs
A retailer is launching an AI initiative to predict customer churn within 8 weeks. Relevant data is believed to be spread across several internal systems (CRM, billing, call-center tickets, and the e-commerce app), but there is no enterprise data catalog. The privacy office requires that any data containing PII follow documented access approvals and least-privilege controls, and executives have low risk tolerance for compliance issues.
What is the BEST next action?
Best answer: A
What this tests: Identify Data Needs
Explanation: The next step is to inventory where relevant internal data resides and who owns it so the team can request appropriate access and confirm data suitability. With strict PII controls and low risk tolerance, documenting sources, data elements, and approval paths is necessary before moving data or designing features. This reduces rework and avoids unauthorized handling of sensitive data.
In Domain III, identifying data needs starts with a practical inventory of internal data sources and locations, especially when no catalog exists and privacy constraints are tight. For this churn use case, the team should quickly create a data source register that maps each candidate system to its data owner/steward, key data fields, sensitivity (e.g., PII), retention constraints, and the approved access method. This enables compliant access requests and helps validate whether the needed historical coverage, granularity, and linkage keys exist before investing in extraction or model design.
A lightweight approach is to:

- List the candidate systems (CRM, billing, call-center tickets, e-commerce app) with their data owners or stewards.
- Record key fields, sensitivity (PII), retention constraints, and the approved access method for each.
- Route access requests through the privacy office's documented approval and least-privilege controls.
- Confirm historical coverage, granularity, and linkage keys before investing in extraction.
The key takeaway is to confirm and document internal data locations and governance paths before building pipelines or selecting modeling details.
This establishes a governed inventory of internal systems, owners, and access paths before any extraction or modeling.
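A minimal sketch of one register entry, with hypothetical fields chosen to mirror the items above:

```python
from dataclasses import dataclass

@dataclass
class DataSourceEntry:
    """One row in a lightweight data source register (fields are illustrative)."""
    system: str              # e.g., "CRM"
    owner: str               # data owner/steward who approves access
    key_fields: list         # candidate features and linkage keys
    contains_pii: bool       # drives the documented approval path
    retention: str           # retention constraint from the privacy office
    access_method: str       # the approved, least-privilege access route

register = [
    DataSourceEntry("CRM", "sales-ops", ["customer_id", "tenure"],
                    True, "7 years", "read-only replica, approved request"),
]
```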
Topic: Identify Data Needs
A health insurer is building an ML model to predict high-cost claims to prioritize care management. The training dataset contains 4 years of claims and enrollment data, but provider networks and coding practices changed significantly in the last 9 months. The model must launch in 6 weeks, and analysts can only access de-identified data in a controlled environment (no free-form extracts). Which approach best optimizes risk reduction and model quality by assessing data freshness and relevance to expected operating conditions while meeting the constraints?
Best answer: D
What this tests: Identify Data Needs
Explanation: The most effective way to assess freshness and relevance is to evaluate whether the most recent data reflects current operating conditions and how it differs from the historical training set. Profiling and distribution comparisons on the last 9 months, followed by defined data freshness SLAs, reduces the risk of deploying a model trained on outdated patterns. This can be done within a de-identified, controlled-access environment and within the 6-week timeline.
Data freshness and relevance are about whether the data represents the conditions the model will see after launch (current networks, coding practices, and patient mix). When operations have changed recently, “more data” can increase volume but also increase mismatch, raising the risk of poor real-world performance.
A practical approach under tight timelines and privacy constraints is to:

- Profile the most recent 9 months of data inside the controlled, de-identified environment.
- Compare feature and outcome distributions between that window and the older history.
- Quantify shifts attributable to the network and coding-practice changes.
- Define data freshness SLAs for the data the model will consume after go-live.
This produces an evidence-based decision on whether to reweight, window, or exclude older data and sets a maintainable plan for operating conditions after go-live.
It directly tests whether recent data matches expected conditions and establishes an actionable freshness plan without violating access constraints.
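One common distribution-comparison statistic for this kind of check is the Population Stability Index (PSI). Below is a minimal NumPy sketch; the column name in the usage comment is hypothetical.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between historical and recent samples.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 major shift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)   # avoid log of zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# shift = psi(history["claim_cost"].to_numpy(), recent["claim_cost"].to_numpy())
```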
Topic: Identify Data Needs
A retailer is building a model to predict next-week product demand. The team proposes using five years of historical sales and promotion data. In the last six months, the company introduced same-day delivery and changed its promotion calendar, and leadership expects these shifts to persist.
Which approach SHOULD the team AVOID when evaluating whether the available data is fresh and relevant to expected operating conditions?
Best answer: D
What this tests: Identify Data Needs
Explanation: Data freshness and relevance mean the training and validation data should reflect the conditions the model will face in production. When operating conditions have changed, older data can be misleading unless you deliberately account for those changes. Treating all historical data as equally representative is the key anti-pattern in this scenario.
To assess freshness and relevance, focus on whether the data represents the current (and expected) business environment. Here, same-day delivery and a new promotion calendar likely change buying patterns and the relationship between features (promotions, inventory, lead time) and demand. Good evaluation practices include detecting temporal drift, validating that feature definitions and data capture didn’t change, and measuring model performance on a recent time slice that mirrors production conditions. The approach to avoid is blindly pooling all historical data without assessing or adjusting for these shifts, because it can overweight obsolete patterns and inflate offline metrics that won’t hold after deployment.
This assumes historical patterns remain relevant despite known business changes, risking training on stale, non-representative data.
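A minimal sketch of the recommended alternative, a time-sliced evaluation, assuming a pandas DataFrame with a hypothetical `week` column and an illustrative cutoff date:

```python
import pandas as pd

def time_sliced_split(df: pd.DataFrame, cutoff: str):
    """Hold out the post-change period as validation so offline metrics
    reflect the conditions the model will face in production."""
    df = df.sort_values("week")
    train = df[df["week"] < cutoff]    # history before the business change
    recent = df[df["week"] >= cutoff]  # e.g., the same-day-delivery era
    return train, recent

# train, recent = time_sliced_split(sales, cutoff="2024-06-01")  # date is illustrative
# Fit on `train`, score on `recent`; a large gap versus a random split
# signals that pooled historical data overweights obsolete patterns.
```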
Topic: Identify Data Needs
A healthcare analytics team wants to predict 30-day readmissions and believes housing stability is a key feature. During data readiness assessment, they find housing status is not captured in any internal systems, and compliance confirms they cannot collect it for this use case. What is the best strategy to address this data gap?
Best answer: D
What this tests: Identify Data Needs
Explanation: Because housing stability cannot be collected for this use case, the team should not plan the solution around that feature. The appropriate response is to adjust the use case or success criteria to what can be supported by available and permissible data, potentially as a phased roadmap.
A core outcome of determining whether data meets solution needs is identifying gaps that are not realistically closable. If a key feature or label cannot be acquired due to privacy, compliance, or operational constraints, “collect more data” is not an option, and fabricating it via augmentation would undermine validity and governance. In that situation, the best practice is to adjust scope: redefine the prediction target, population, or decision to be supported using data that is available, permissible, and sufficiently representative.
Practical scope adjustments include:

- Redefining the prediction target or supported decision around data that is available and permissible.
- Narrowing the population to segments the existing data represents well.
- Phasing the roadmap so the excluded signal can be revisited if constraints change.
The goal is to maintain a defensible, auditable solution rather than forcing a model to infer prohibited data.
When a critical feature cannot be legally or operationally obtained, the most viable response is to narrow or redefine the use case around data you can use.
Topic: Identify Data Needs
A team is onboarding three upstream systems to supply data for a churn prediction model. Past integrations have broken when field names changed, data types shifted (string vs. integer), or new category values appeared without notice. The AI project manager proposes documenting the expected columns, data types, allowed ranges/categories, null-handling rules, and a versioning process that data producers must follow before changes are released.
Which principle or governance approach best matches this practice?
Best answer: D
What this tests: Identify Data Needs
Explanation: This situation is best addressed by explicitly defining the required data schema and constraints so producers and consumers share the same expectations. A versioned data contract (or canonical schema specification) prevents breaking changes by standardizing field names, types, permissible values, and change control before release.
The core concept is defining required data formats and schema characteristics as an enforceable specification between data producers and the AI team. A versioned data contract (often implemented as schema definitions plus validation checks) reduces integration failures by making the expected structure and constraints explicit and by controlling how changes are introduced. In this scenario, the key needs are stable column names, data types, permissible values/ranges, null-handling, and a process for schema evolution.
Practical elements to include are:

- Expected column names, data types, and allowed ranges or category values.
- Null-handling rules and defaults.
- Automated validation checks run before producers release changes.
- A versioning and change-approval process for schema evolution.
Other practices like access control, labeling, and drift monitoring are important, but they do not define the schema requirements upstream teams must follow.
A data contract specifies required schema, constraints, and change/version rules to keep pipelines reliable for modeling.
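A minimal Python sketch of what such a contract plus a validation check might look like; the contract fields and column names are hypothetical examples, not a standard schema format:

```python
import pandas as pd

# Hypothetical versioned contract for one upstream feed.
CONTRACT = {
    "version": "1.2.0",
    "columns": {
        "customer_id":   {"dtype": "int64",   "nullable": False},
        "plan_type":     {"dtype": "object",  "nullable": False,
                          "allowed": {"basic", "plus", "premium"}},
        "monthly_spend": {"dtype": "float64", "nullable": True},
    },
}

def validate(df: pd.DataFrame, contract: dict) -> list:
    """Return a list of contract violations found in an incoming batch."""
    errors = []
    for col, spec in contract["columns"].items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != spec["dtype"]:
            errors.append(f"{col}: dtype {df[col].dtype}, expected {spec['dtype']}")
        if not spec["nullable"] and df[col].isna().any():
            errors.append(f"{col}: nulls not allowed")
        if "allowed" in spec:
            unexpected = set(df[col].dropna()) - spec["allowed"]
            if unexpected:
                errors.append(f"{col}: new category values {unexpected}")
    return errors
```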
Use the PMI-CPMAI Practice Test page for the full PM Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Read the PMI-CPMAI guide on PMExams.com, then return to PM Mastery for timed practice.