
PMI-CPMAI: Identify Data Needs

Try 10 focused PMI-CPMAI questions on Identify Data Needs, with answers and explanations, then continue with PM Mastery.


Open the matching PM Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Topic snapshot

  • Exam route: PMI-CPMAI
  • Topic area: Identify Data Needs
  • Blueprint weight: 26%
  • Page purpose: Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Identify Data Needs for PMI-CPMAI. Work through the 10 questions first, then review the explanations and return to mixed practice in PM Mastery.

  • First attempt: Answer without checking the explanation first. Record the fact, rule, calculation, or judgment point that controlled your answer.
  • Review: Read the explanation even when you were correct. Record why the best answer is stronger than the closest distractor.
  • Repair: Repeat only missed or uncertain items after a short break. Record the pattern behind misses, not the answer letter.
  • Transfer: Return to mixed practice once the topic feels stable. Record whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 26% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original PM Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Identify Data Needs

Your team is building an AI model to prioritize loan applications for manual review. During data readiness checks, you find that the last 18 months of approved/denied outcomes are missing for one region, and the remaining history contains far fewer labeled outcomes for a minority subgroup. Early evaluation shows good overall performance but a much higher false-negative rate for that subgroup. The business sponsor asks to “launch a pilot anyway” to hit a deadline.

What is the best next step?

  • A. Tune decision thresholds to reduce the subgroup false-negative rate
  • B. Remove demographic fields and retrain to avoid fairness concerns
  • C. Prepare a data-limitations impact brief and seek a go/no-go decision
  • D. Deploy a limited pilot with drift and bias monitoring enabled

Best answer: C

What this tests: Identify Data Needs

Explanation: When data is incomplete and uneven across segments, you must explicitly communicate the likely impacts on model performance, fairness, and operational risk before moving forward. A concise impact brief supports transparent tradeoffs and enables a governance-backed go/no-go decision. This keeps the initiative aligned to acceptable risk and prevents premature deployment based on misleading aggregate metrics.

The core concept is that data limitations (missing outcomes, sparse labels, and non-representative samples) directly drive unpredictable performance, biased error rates, and operational risk when the model is used in production. The best next step is to communicate those consequences in a decision-oriented artifact so stakeholders can choose an appropriate path before additional modeling or any deployment.
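To make the point about misleading aggregate metrics concrete, here is a minimal sketch with invented counts, showing how an acceptable overall false-negative rate can hide a much higher rate for an underrepresented subgroup:

```python
# Illustrative only: made-up confusion counts showing how an aggregate
# metric can hide a much worse false-negative rate for one subgroup.

def false_negative_rate(fn: int, tp: int) -> float:
    """FNR = FN / (FN + TP): the share of true positives the model misses."""
    return fn / (fn + tp)

# Hypothetical counts of applications that should have been flagged.
segments = {
    "majority": {"fn": 40, "tp": 960},  # FNR = 4.0%
    "minority": {"fn": 30, "tp": 70},   # FNR = 30.0%
}

total_fn = sum(s["fn"] for s in segments.values())
total_tp = sum(s["tp"] for s in segments.values())

print(f"overall FNR: {false_negative_rate(total_fn, total_tp):.1%}")  # ~6.4%
for name, s in segments.items():
    print(f"{name} FNR: {false_negative_rate(s['fn'], s['tp']):.1%}")
# The overall rate looks acceptable while the minority subgroup sits at 30%,
# which is exactly what an impact brief should surface for the sponsor.
```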

A practical impact brief should cover:

  • Which populations/time periods are underrepresented or missing
  • Expected effects on key metrics (overall vs segment-level)
  • Fairness risks (e.g., error-rate disparities and where they stem from)
  • Operational risks (misrouting workload, customer harm, rework, escalations)
  • Mitigation options and implications (collect/label more data, change scope, delay, or add guardrails)

Trying to “fix” outcomes through modeling alone, or deploying now and monitoring later, skips the prerequisite of making risks and tradeoffs explicit and agreed.

The impact brief communicates how the data gaps affect performance, fairness, and operational risk, and it drives an explicit decision on mitigations before proceeding.


Question 2

Topic: Identify Data Needs

When setting up an AI workspace, the team wants the compute environment to automatically add resources during peak training runs and release them when idle, to stay within budget and avoid over-provisioning. Which term best describes this capability?

  • A. Data lineage
  • B. Load balancing
  • C. Containerization
  • D. Autoscaling

Best answer: D

What this tests: Identify Data Needs

Explanation: Autoscaling is the infrastructure capability that dynamically adjusts compute capacity up or down in response to workload. It supports provisioning resources for data processing and model development while honoring constraints like cost caps and performance needs. This directly matches the goal of adding resources during peaks and releasing them when idle.

The core concept is dynamic compute provisioning. In AI initiatives, workloads (feature engineering, training, batch scoring) can be bursty, so provisioning fixed capacity often wastes money or creates bottlenecks. Autoscaling addresses this by adjusting compute capacity in response to demand signals (for example, queued jobs, CPU/GPU utilization, or scheduled windows) while enforcing guardrails such as maximum nodes, instance types, and budget limits. This helps teams right-size infrastructure to the scope of experimentation and delivery without manual intervention.
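As a rough illustration of the concept, the sketch below shows the kind of decision logic an autoscaler evaluates each cycle; the thresholds, node limits, and inputs are invented for the example rather than taken from any specific cloud platform:

```python
# Minimal autoscaling decision logic (illustrative; real platforms such as
# Kubernetes or cloud autoscaling groups implement this for you).

MIN_NODES = 1           # never scale to zero in this sketch
MAX_NODES = 20          # budget guardrail: hard cap on cluster size
SCALE_UP_UTIL = 0.80    # add capacity above 80% average utilization
SCALE_DOWN_UTIL = 0.30  # release capacity below 30% utilization

def desired_nodes(current_nodes: int, avg_utilization: float,
                  queued_jobs: int) -> int:
    """Return the node count for the next interval, within guardrails."""
    target = current_nodes
    if avg_utilization > SCALE_UP_UTIL or queued_jobs > 0:
        target = current_nodes + max(1, queued_jobs)  # scale up for peaks
    elif avg_utilization < SCALE_DOWN_UTIL:
        target = current_nodes - 1                    # release idle capacity
    return max(MIN_NODES, min(MAX_NODES, target))     # enforce cost/size limits

print(desired_nodes(current_nodes=4, avg_utilization=0.92, queued_jobs=3))  # 7
print(desired_nodes(current_nodes=4, avg_utilization=0.10, queued_jobs=0))  # 3
```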

Key takeaway: choose autoscaling when the requirement is automatic scale-up/scale-down of compute resources.

Autoscaling automatically increases or decreases compute resources based on workload demand and defined limits.


Question 3

Topic: Identify Data Needs

A bank is building an AI model to prioritize home-loan applications for manual underwriting (high customer impact). The team has received an initial 12-month data extract, but quick profiling shows 18% missing income values, inconsistent date formats across sources, and duplicate applicant records. Stakeholders disagree on what “good enough” data quality means, and the team is scheduled to start model training next sprint.

What is the best next step?

  • A. Define and document risk-based data quality standards and acceptance criteria, then obtain stakeholder sign-off
  • B. Request additional historical data from more regions before setting any quality thresholds
  • C. Proceed to deploy a pilot with drift monitoring and collect quality feedback from production
  • D. Begin model training and address data quality issues after evaluating baseline performance

Best answer: A

What this tests: Identify Data Needs

Explanation: Because this is a high-impact decision-support use case, the team needs explicit, measurable data quality standards and acceptance criteria before training. Setting risk-based thresholds (e.g., completeness, uniqueness, consistency, timeliness, label reliability) creates a clear go/no-go gate and reduces rework. Stakeholder sign-off ensures the standards reflect business and risk expectations, not just technical preference.

The core need is to define data quality standards and acceptance criteria that are proportional to the solution’s risk and impact. Here, missing income, duplicates, and inconsistent dates directly affect model reliability and could create unfair or inconsistent underwriting prioritization, so the team should establish measurable thresholds and checks before model development.

A practical next step is to:

  • Agree on critical data elements and their required quality levels (e.g., completeness for income, uniqueness for applicant ID, consistency for dates)
  • Specify acceptance criteria and a go/no-go rule for training data (including remediation expectations)
  • Document and obtain sign-off so the criteria become the basis for data cleansing, pipeline validation, and governance gates

Starting training or piloting without agreed acceptance criteria is premature because it hides risk and makes “ready” subjective.

Before training, the team must align on measurable, risk-appropriate data quality thresholds that determine a go/no-go for data use.
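As one illustration, agreed thresholds like those above can be encoded as an automated gate on the extract; the field names and limits in this sketch are assumptions that a real team would replace with its signed-off criteria:

```python
# Sketch of a risk-based data quality gate; thresholds and column names
# (income, applicant_id, application_date) are illustrative assumptions.
import pandas as pd

THRESHOLDS = {
    "income_completeness": 0.95,   # at most 5% missing income allowed
    "applicant_uniqueness": 0.99,  # at most 1% duplicate applicant IDs
    "date_parse_rate": 0.99,       # at least 99% of dates must parse cleanly
}

def quality_gate(df: pd.DataFrame) -> dict:
    """Evaluate go/no-go checks against the signed-off thresholds."""
    checks = {
        "income_completeness": df["income"].notna().mean(),
        "applicant_uniqueness": 1 - df["applicant_id"].duplicated().mean(),
        "date_parse_rate": pd.to_datetime(
            df["application_date"], errors="coerce"
        ).notna().mean(),
    }
    return {name: (value, value >= THRESHOLDS[name])
            for name, value in checks.items()}

# Usage: results = quality_gate(extract_df)  # extract_df is hypothetical here;
# any failing check is a "no-go" that triggers remediation before training.
```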


Question 4

Topic: Identify Data Needs

A retail bank is scoping a near-real-time fraud detection model. Transaction patterns and merchant behaviors change weekly, so the use case requires new data to be incorporated continuously after launch. The team proposes a one-time data extract for model training and says they will “request another extract later if needed.”

Which AI project management concept best matches what the project should define next?

  • A. A model drift monitoring and retraining trigger policy
  • B. A labeling guideline and annotator quality plan
  • C. A data retention and archival policy
  • D. An ongoing data collection and refresh process

Best answer: D

What this tests: Identify Data Needs

Explanation: Because the fraud use case depends on fast-changing patterns, the project must plan for continuous acquisition and refresh of input data, not a one-time pull. That plan typically specifies ingestion cadence/latency, ownership, data quality checks, versioning/lineage, and how refreshed datasets are made available for training and scoring.

The core concept is defining an ongoing data collection and refresh process when the use case requires continuous updates. In the stem, relying on a one-time extract creates predictable failure modes: stale training data, broken assumptions about feature distributions, and an ad hoc access process that delays updates.

A fit-for-purpose refresh process typically clarifies:

  • Data sources, access method, and refresh cadence/latency (batch vs. streaming)
  • Operational ownership (who runs it), SLAs, and exception handling
  • Data quality validation at ingest and dataset versioning/lineage for auditability
  • How refreshed data is promoted for model development and production use

Drift monitoring and retraining decisions come later, but they depend on having this reliable refresh mechanism in place.

The use case needs a defined cadence, pipeline, ownership, and controls to continuously ingest and refresh data beyond a one-time extract.
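A skeletal sketch of what one refresh cycle might look like, with validation at ingest and versioned outputs; the function bodies are stubs and the layout is an assumption, since real pipelines typically run under an orchestrator such as Airflow:

```python
# Skeleton of a recurring data refresh job (illustrative; an orchestrator
# or managed pipeline would run this on the agreed cadence).
from datetime import datetime, timezone

def extract(source: str) -> list[dict]:
    """Stub: pull the latest records from the upstream system."""
    return [{"txn_id": 1, "amount": 120.0}]  # placeholder payload

def validate_at_ingest(rows: list[dict]) -> None:
    """Stub quality check: fail fast instead of publishing bad data."""
    if not rows:
        raise ValueError("empty extract: refusing to publish a new version")

def refresh_dataset(source: str) -> str:
    rows = extract(source)
    validate_at_ingest(rows)                          # quality gate at ingest
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"datasets/{source}/v={version}/"          # versioned snapshot for lineage
    print(f"published {len(rows)} rows to {path}")    # stand-in for a real write
    return version                                    # downstream jobs pin to this version

refresh_dataset("card_transactions")
```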


Question 5

Topic: Identify Data Needs

You are preparing a steering committee update for an AI initiative to score insurance claims for fraud at intake to reduce investigative cost and cycle time. The team completed a data readiness assessment.

Exhibit: Data readiness summary (last 12 months)

  • Target decision: fraud score within 1 hour of claim submission
  • Label (fraud confirmed) available within 60 days for 38% of claims
  • Source system latency for key features: 14 days
  • Field `claim_type` missing: 22%
  • Region B codes unmapped: 15% of values
  • Investigator notes restricted (cannot be used)

Based on the exhibit, what is the best recommendation to present that links data readiness to a business decision?

  • A. Expand to all regions immediately to improve model generalization and adoption
  • B. Rebaseline the intake-scoring scope to what the data can support now and fund fixes
  • C. Keep the intake-scoring target date; use synthetic labels to compensate for slow labels
  • D. Proceed to model development; missing data will be handled during feature engineering

Best answer: B

What this tests: Identify Data Needs

Explanation: The exhibit shows the solution cannot meet the intended operational decision: key features arrive 14 days late and most labels are not available in time to train and validate a true intake model. The leadership-relevant message is that expected business outcomes (faster intake decisions and cost reduction) are unattainable without re-scoping and targeted investment in data availability and quality.

Data readiness findings should be translated into what decisions the solution can and cannot support, and what that means for timeline, scope, and projected benefits. Here, the target is a fraud score within 1 hour, but the input features have 14-day latency, making real-time intake scoring infeasible regardless of modeling approach. In addition, only 38% of labels arrive within 60 days, limiting timely training/evaluation and weakening evidence for benefits. Missing and unmapped categorical values increase operational failure modes and rework.

A leadership-ready recommendation is to:

  • Rebaseline the MVP to a batch/routing use case that matches current latency, or pause intake scoring
  • Fund specific data work (latency reduction, code mapping, completeness) with owners and dates
  • Update the business case (benefit timing, achievable KPIs) and reset the go/no-go gate criteria

The key is tying data constraints to an explicit decision on scope, schedule, and expected ROI.

The current label timeliness and 14-day feature latency prevent an intake (1-hour) decision, so leadership should decide on re-scoping and investing in data remediation before committing to the promised outcomes.
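The infeasibility case is simple arithmetic; this small sketch restates the exhibit's numbers as a readiness-versus-requirement check:

```python
# Readiness-versus-requirement check using the exhibit's numbers.
DECISION_SLA_HOURS = 1            # fraud score needed within 1 hour of intake
FEATURE_LATENCY_HOURS = 14 * 24   # key features arrive 14 days late
LABEL_COVERAGE_60_DAYS = 0.38     # share of claims labeled within 60 days

print("intake scoring feasible:", FEATURE_LATENCY_HOURS <= DECISION_SLA_HOURS)  # False
print(f"feature latency exceeds the SLA by {FEATURE_LATENCY_HOURS // DECISION_SLA_HOURS}x")  # 336x
print(f"claims with timely labels: {LABEL_COVERAGE_60_DAYS:.0%}")               # 38%
```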


Question 6

Topic: Identify Data Needs

A retailer is launching an AI initiative to predict customer churn within 8 weeks. Relevant data is believed to be spread across several internal systems (CRM, billing, call-center tickets, and the e-commerce app), but there is no enterprise data catalog. The privacy office requires that any data containing PII follow documented access approvals and least-privilege controls, and executives have low risk tolerance for compliance issues.

What is the BEST next action?

  • A. Facilitate data-source inventory with owners and document locations/access
  • B. Choose the model approach and draft features based on assumptions
  • C. Buy third-party demographic data to accelerate data availability
  • D. Extract CRM and billing data into a data science sandbox now

Best answer: A

What this tests: Identify Data Needs

Explanation: The next step is to inventory where relevant internal data resides and who owns it so the team can request appropriate access and confirm data suitability. With strict PII controls and low risk tolerance, documenting sources, data elements, and approval paths is necessary before moving data or designing features. This reduces rework and avoids unauthorized handling of sensitive data.

In Domain III, identifying data needs starts with a practical inventory of internal data sources and locations, especially when no catalog exists and privacy constraints are tight. For this churn use case, the team should quickly create a data source register that maps each candidate system to its data owner/steward, key data fields, sensitivity (e.g., PII), retention constraints, and the approved access method. This enables compliant access requests and helps validate whether the needed historical coverage, granularity, and linkage keys exist before investing in extraction or model design.

A lightweight approach is to:

  • Interview system owners/SMEs and review existing schemas/data dictionaries
  • Record source location, owner, sensitivity, and access requirements
  • Identify join keys and known data quality gaps
  • Prioritize sources that best support the churn KPI within 8 weeks

The key takeaway is to confirm and document internal data locations and governance paths before building pipelines or selecting modeling details.

This establishes a governed inventory of internal systems, owners, and access paths before any extraction or modeling.
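One lightweight way to hold the register is a structured record per system; the fields and example entries below are illustrative, not a prescribed template:

```python
# Sketch of a data source register; fields and values are illustrative.
from dataclasses import dataclass

@dataclass
class DataSource:
    system: str            # e.g., CRM, billing
    owner: str             # accountable data owner/steward
    key_fields: list[str]  # candidate features and join keys
    contains_pii: bool     # drives the access-approval path
    access_method: str     # approved, least-privilege route
    known_gaps: str = ""   # quality issues surfaced by the owner

register = [
    DataSource("CRM", "J. Rivera", ["customer_id", "tenure", "segment"],
               contains_pii=True, access_method="approved read-only view"),
    DataSource("billing", "A. Chen", ["customer_id", "mrr", "late_payments"],
               contains_pii=True, access_method="governed extract request",
               known_gaps="pre-2022 invoices archived"),
]
for src in register:
    print(src.system, "->", src.owner, "| PII:", src.contains_pii)
```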


Question 7

Topic: Identify Data Needs

A health insurer is building an ML model to predict high-cost claims to prioritize care management. The training dataset contains 4 years of claims and enrollment data, but provider networks and coding practices changed significantly in the last 9 months. The model must launch in 6 weeks, and analysts can only access de-identified data in a controlled environment (no free-form extracts). Which approach best optimizes risk reduction and model quality by assessing data freshness and relevance to expected operating conditions while meeting the constraints?

  • A. Train only on the latest 2 weeks of claims
  • B. Train on all 4 years to maximize sample size
  • C. Buy third-party claims benchmarks to replace internal data
  • D. Compare recent 9-month data to training and set refresh SLAs

Best answer: D

What this tests: Identify Data Needs

Explanation: The most effective way to assess freshness and relevance is to evaluate whether the most recent data reflects current operating conditions and how it differs from the historical training set. Profiling and distribution comparisons on the last 9 months, followed by defined data freshness SLAs, reduce the risk of deploying a model trained on outdated patterns. This can be done within a de-identified, controlled-access environment and within the 6-week timeline.

Data freshness and relevance are about whether the data represents the conditions the model will see after launch (current networks, coding practices, and patient mix). When operations have changed recently, “more data” can increase volume but also increase mismatch, raising the risk of poor real-world performance.

A practical approach under tight timelines and privacy constraints is to:

  • Profile key features/labels in the most recent 9 months and the historical set
  • Compare distributions, missingness, coding/value shifts, and target rates
  • Validate label definitions and any known process changes that affect meaning
  • Define a data freshness SLA (and owners) for ongoing retraining/monitoring

This produces an evidence-based decision on whether to reweight, window, or exclude older data and sets a maintainable plan for operating conditions after go-live.

It directly tests whether recent data matches expected conditions and establishes an actionable freshness plan without violating access constraints.
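One common way to quantify the recent-versus-historical comparison is a population stability index (PSI) per feature; this sketch assumes numeric features and uses the widely cited rules of thumb (below 0.1 roughly stable, above 0.25 a major shift):

```python
# Sketch: Population Stability Index (PSI) comparing a feature's recent
# distribution against the historical training window.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((a% - e%) * ln(a% / e%)) over shared bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid divide-by-zero / log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
historical = rng.normal(100, 15, 10_000)      # stand-in for the 4-year window
recent = rng.normal(110, 18, 2_000)           # stand-in for the last 9 months
print(f"PSI: {psi(historical, recent):.3f}")  # values above ~0.25 warrant
                                              # investigation before training
```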


Question 8

Topic: Identify Data Needs

A retailer is building a model to predict next-week product demand. The team proposes using five years of historical sales and promotion data. In the last six months, the company introduced same-day delivery and changed its promotion calendar, and leadership expects these shifts to persist.

Which approach SHOULD the team AVOID when evaluating whether the available data is fresh and relevant to expected operating conditions?

  • A. Confirming that key fields and definitions stayed consistent after process changes
  • B. Testing performance on a recent holdout period that reflects new operations
  • C. Checking for distribution shifts between recent months and older periods
  • D. Using the full five-year dataset unmodified because more data is always better

Best answer: D

What this tests: Identify Data Needs

Explanation: Data freshness and relevance mean the training and validation data should reflect the conditions the model will face in production. When operating conditions have changed, older data can be misleading unless you deliberately account for those changes. Treating all historical data as equally representative is the key anti-pattern in this scenario.

To assess freshness and relevance, focus on whether the data represents the current (and expected) business environment. Here, same-day delivery and a new promotion calendar likely change buying patterns and the relationship between features (promotions, inventory, lead time) and demand. Good evaluation practices include detecting temporal drift, validating that feature definitions and data capture didn’t change, and measuring model performance on a recent time slice that mirrors production conditions. The approach to avoid is blindly pooling all historical data without assessing or adjusting for these shifts, because it can overweight obsolete patterns and inflate offline metrics that won’t hold after deployment.

Using the full five-year dataset unmodified assumes historical patterns remain relevant despite known business changes, risking training on stale, non-representative data.
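A minimal sketch of the recommended recent-holdout check: hold out the weeks after the operational change and see whether a model trained on older history still performs. The column names and cutoff date are invented for the example:

```python
# Sketch of a time-based holdout: evaluate on a recent slice that reflects
# the new operating conditions (same-day delivery, new promo calendar).
import pandas as pd

def temporal_split(df: pd.DataFrame, cutoff: str):
    """Train on history before the change; validate on the recent regime."""
    dates = pd.to_datetime(df["week_start"])
    train = df[dates < pd.Timestamp(cutoff)]
    recent_holdout = df[dates >= pd.Timestamp(cutoff)]
    return train, recent_holdout

# Hypothetical frame: five years of weekly demand rows.
df = pd.DataFrame({
    "week_start": pd.date_range("2021-01-04", periods=260, freq="W-MON"),
    "units_sold": range(260),
})
train, holdout = temporal_split(df, cutoff="2025-06-01")  # post-change regime
print(len(train), "training weeks;", len(holdout), "recent holdout weeks")
# If offline metrics hold up on the recent slice, older data is still useful;
# if not, consider windowing or reweighting the history.
```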


Question 9

Topic: Identify Data Needs

A healthcare analytics team wants to predict 30-day readmissions and believes housing stability is a key feature. During data readiness assessment, they find housing status is not captured in any internal systems, and compliance confirms they cannot collect it for this use case. What is the best strategy to address this data gap?

  • A. Collect and label new housing-stability data for training
  • B. Augment the dataset by generating synthetic housing-status values
  • C. Proceed and rely on feature engineering to compensate
  • D. Adjust the use-case scope to fit available, allowable data

Best answer: D

What this tests: Identify Data Needs

Explanation: Because housing stability cannot be collected for this use case, the team should not plan the solution around that feature. The appropriate response is to adjust the use case or success criteria to what can be supported by available and permissible data, potentially as a phased roadmap.

A core outcome of determining whether data meets solution needs is identifying gaps that are not realistically closable. If a key feature or label cannot be acquired due to privacy, compliance, or operational constraints, “collect more data” is not an option, and fabricating it via augmentation would undermine validity and governance. In that situation, the best practice is to adjust scope: redefine the prediction target, population, or decision to be supported using data that is available, permissible, and sufficiently representative.

Practical scope adjustments include:

  • Change the target outcome or timeframe to align with available signals
  • Limit the initial release to cohorts with adequate data coverage
  • Treat the constrained data element as a future phase if policy changes

The goal is to maintain a defensible, auditable solution rather than forcing a model to infer prohibited data.

When a critical feature cannot be legally or operationally obtained, the most viable response is to narrow or redefine the use case around data you can use.


Question 10

Topic: Identify Data Needs

A team is onboarding three upstream systems to supply data for a churn prediction model. Past integrations have broken when field names changed, data types shifted (string vs. integer), or new category values appeared without notice. The AI project manager proposes documenting the expected columns, data types, allowed ranges/categories, null-handling rules, and a versioning process that data producers must follow before changes are released.

Which principle or governance approach best matches this practice?

  • A. Implement role-based access controls for restricted fields
  • B. Create labeling guidelines for ground-truth consistency
  • C. Establish drift monitoring to detect production data shifts
  • D. Define a versioned data contract with schema validation

Best answer: D

What this tests: Identify Data Needs

Explanation: This situation is best addressed by explicitly defining the required data schema and constraints so producers and consumers share the same expectations. A versioned data contract (or canonical schema specification) prevents breaking changes by standardizing field names, types, permissible values, and change control before release.

The core concept is defining required data formats and schema characteristics as an enforceable specification between data producers and the AI team. A versioned data contract (often implemented as schema definitions plus validation checks) reduces integration failures by making the expected structure and constraints explicit and by controlling how changes are introduced. In this scenario, the key needs are stable column names, data types, permissible values/ranges, null-handling, and a process for schema evolution.

Practical elements to include are:

  • Field definitions (name, type, units, semantics)
  • Constraints (required/optional, ranges, enums, uniqueness)
  • Encoding/time standards (timestamps, IDs)
  • Versioning and notification/change approval

Other practices like access control, labeling, and drift monitoring are important, but they do not define the schema requirements upstream teams must follow.

A data contract specifies required schema, constraints, and change/version rules to keep pipelines reliable for modeling.
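A minimal sketch of what a versioned contract with validation at ingest can look like in code; the schema, rules, and version string are illustrative assumptions rather than any specific contract tool's format:

```python
# Sketch of a versioned data contract with schema validation at ingest.
CONTRACT = {
    "version": "1.2.0",  # producers bump this and notify consumers
    "fields": {
        "customer_id": {"type": int, "required": True},
        "plan": {"type": str, "required": True,
                 "allowed": {"basic", "plus", "enterprise"}},
        "monthly_spend": {"type": float, "required": True, "min": 0.0},
        "cancel_date": {"type": str, "required": False},  # ISO 8601 or absent
    },
}

def validate(record: dict, contract: dict = CONTRACT) -> list[str]:
    """Return a list of contract violations (empty list = record passes)."""
    errors = []
    for name, rule in contract["fields"].items():
        if name not in record:
            if rule["required"]:
                errors.append(f"missing required field: {name}")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue  # skip value checks when the type is already wrong
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: value {value!r} not in allowed set")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
    return errors

print(validate({"customer_id": 42, "plan": "gold", "monthly_spend": 19.9}))
# ["plan: value 'gold' not in allowed set"]  -- a new category caught at ingest
```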

Continue with full practice

Use the PMI-CPMAI Practice Test page for the full PM Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.


Free review resource

Read the PMI-CPMAI guide on PMExams.com, then return to PM Mastery for timed practice.

Revised on Thursday, May 14, 2026