
Free CompTIA Data+ DA0-002 Full-Length Practice Exam: 90 Questions


This free full-length CompTIA Data+ DA0-002 practice exam includes 90 original IT Mastery questions across the exam domains.

Use these questions for self-assessment, scope review, and deciding what to drill next.

Count note: this page uses the full-length practice count maintained in the IT Mastery exam catalog. Certification vendors differ in how they publish total questions, scored questions, duration, and unscored or pretest-item rules; always confirm exam-day rules with the sponsor.

Need concept review first? Read the CompTIA Data+ DA0-002 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.


Exam snapshot

  • Exam route: CompTIA Data+ DA0-002
  • Practice-set question count: 90
  • Time limit: 90 minutes
  • Practice style: mixed-domain diagnostic run with answer explanations

Full-length exam mix

Domain | Weight
Data Concepts and Environments | 20%
Data Acquisition and Preparation | 22%
Data Analysis | 24%
Visualization and Reporting | 20%
Data Governance | 14%

Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.

Practice questions

Questions 1-25

Question 1

Topic: Data Analysis

A data analyst is preparing a slide for executives about a pilot retention program. The analysis uses one month of observational customer data, and customers were not randomly assigned to the program. The program group had a 6% lower churn rate, but the analyst found that program customers also had higher account tenure. Which wording is the best choice for the slide?

Options:

  • A. “The program should be expanded because it reduced churn.”

  • B. “The analysis is invalid because customers were not randomly assigned.”

  • C. “Customers in the program had 6% lower churn; tenure may also contribute.”

  • D. “The program caused churn to decrease by 6%.”

Best answer: C

Explanation: Communication should match what the analysis can support. Because the data is observational and customers were not randomly assigned, the result can support an association between program participation and lower churn, not proof that the program caused the decrease. The higher tenure in the program group is a potential confounding factor, so it should be acknowledged in executive wording. A strong summary can still be useful, but it must avoid causal language unless the study design or additional analysis supports causation.

The key takeaway is to state the finding clearly while describing limitations that affect interpretation.

  • Causal wording fails because the study design does not prove the program caused the churn difference.
  • Expansion recommendation overreaches because it turns an association into a confirmed business impact.
  • Invalid analysis goes too far because observational results can still be useful when limitations are disclosed.

Question 2

Topic: Data Acquisition and Preparation

A data analyst is preparing a monthly customer dataset for a dashboard that shows customer counts by segment and average order value. A profile shows that segment contains blanks and "Unknown", while order_value contains blanks and 0. Business users say a true zero order is valid, but missing segment values should not be merged into an existing segment. What is the best next step before building the dashboard?

Options:

  • A. Standardize missing segment indicators and review missing order_value separately

  • B. Convert missing segments to the largest existing segment

  • C. Drop every row with any blank field before reporting

  • D. Replace all blanks and zeros with NULL across the dataset

Best answer: A

Explanation: Missing values must be identified in the context of each field and business rule. In this dataset, blank and "Unknown" segment values can distort segment counts if they are ignored or merged incorrectly. For order_value, however, 0 is a valid numeric value, so treating it as missing would distort the average order value. The practical approach is to standardize only the null-like indicators for segment, separately profile blanks in order_value, and document the treatment before downstream transformations or dashboard calculations. This keeps the analysis accurate without deleting useful records or inventing segment assignments.

  • Blanket null conversion fails because valid zero-dollar orders would be misclassified as missing.
  • Row deletion fails because it can reduce counts and bias averages without first assessing the missingness pattern.
  • Largest segment assignment fails because it hides data quality issues and inflates an existing segment.
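
The field-by-field treatment described in the explanation can be sketched in Python. This is a minimal illustration only: the sample rows and the `NULL_LIKE_SEGMENTS` set are hypothetical, and only the field names `segment` and `order_value` come from the question.

```python
# Hypothetical sample rows; only the field names come from the scenario.
rows = [
    {"segment": "Retail", "order_value": "25.00"},
    {"segment": "", "order_value": "0"},
    {"segment": "Unknown", "order_value": ""},
    {"segment": "Wholesale", "order_value": "0"},
]

# Assumed list of null-like indicators for segment (compared in lowercase).
NULL_LIKE_SEGMENTS = {"", "unknown"}


def clean_row(row):
    """Standardize missing segment values; keep a true zero order_value."""
    segment = row["segment"].strip()
    if segment.lower() in NULL_LIKE_SEGMENTS:
        segment = None  # one consistent missing marker, never merged into a real segment

    raw_value = row["order_value"].strip()
    # A blank is missing; "0" is a valid zero-dollar order and stays numeric.
    order_value = float(raw_value) if raw_value else None
    return {"segment": segment, "order_value": order_value}


cleaned = [clean_row(r) for r in rows]

# Count missing segments separately from missing order values.
missing_segments = sum(1 for r in cleaned if r["segment"] is None)
missing_values = sum(1 for r in cleaned if r["order_value"] is None)

# Average over non-missing order values, with valid zeros included.
known_values = [r["order_value"] for r in cleaned if r["order_value"] is not None]
average_order_value = sum(known_values) / len(known_values)

print(missing_segments, missing_values, average_order_value)
```

Keeping the two fields on separate rules is the point of the sketch: zeros stay in the average, while blank and "Unknown" segments share one missing marker that can be reported as its own category.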

Question 3

Topic: Data Acquisition and Preparation

A retail analytics team receives daily sales files from several stores. The team must preserve each raw file in the target lakehouse for audit and lineage, then use the lakehouse compute engine to standardize dates, deduplicate records, and create reporting tables after the data arrives. Which approach best fits these requirements?

Options:

  • A. Use ETL to transform data before loading it into the lakehouse

  • B. Use a dashboard refresh to transform and store the raw files

  • C. Use ELT to load raw data first, then transform it in the lakehouse

  • D. Use web scraping to collect the store sales files directly

Best answer: C

Explanation: ETL and ELT differ mainly by when transformation occurs relative to loading the target environment. In ETL, data is extracted, transformed in a staging or processing layer, and then loaded into the target. In ELT, data is extracted and loaded first, often in raw form, and transformations happen inside the target platform. The stem requires preserving raw files in the lakehouse and performing standardization, deduplication, and reporting-table creation after arrival, so the timing maps to ELT. The key takeaway is to look for whether transformation happens before or after the target load.

  • Transform before load misses the requirement to preserve raw files in the target before standardization and deduplication.
  • Collection method confusion focuses on how data might be gathered, not when transformation occurs.
  • Dashboard refresh misuse is a reporting process, not an acquisition and preparation pattern for raw storage and transformation.

Question 4

Topic: Visualization and Reporting

A regional operations manager needs a weekly report showing where delivery delays are concentrated across service territories. The analyst has clean territory boundary data and order records with delivery ZIP code, delay status, and week. Individual customer addresses must not be shown. Which visualization is the BEST professional choice?

Options:

  • A. Line chart showing total delayed orders by week

  • B. Filled map by service territory using aggregated delay rate

  • C. Pivot table listing ZIP codes and delay counts

  • D. Infographic summarizing top delay causes

Best answer: B

Explanation: When the reporting question is mainly about where a metric is concentrated, a map is usually the best visual. In this scenario, the audience needs to compare delivery delays across service territories, and the analyst has territory boundaries plus ZIP-level order data. Aggregating delay rate to the service-territory level supports the geographic decision while avoiding display of individual customer addresses. A rate is also more comparable than raw counts when territories may have different order volumes.

The key takeaway is to match the visual to the decision: spatial distribution calls for a map, especially when regions or territories are central to the question.

  • Time trend only misses the location-focused requirement because it shows weekly change without showing where delays occur.
  • ZIP pivot detail may support analysis, but it is less effective for a manager needing a quick spatial pattern.
  • Cause infographic may explain reasons for delays, but it does not show regional concentration across territories.

Question 5

Topic: Data Concepts and Environments

A data analyst profiles API records before loading them into a relational reporting table. Which interpretation is best supported by the profile?

Exhibit: API record profile

{
  "order_id": "O-1042",
  "customer": {
    "id": "C-22",
    "address": { "city": "Denver", "state": "CO" }
  },
  "items": [
    { "sku": "A10", "qty": 2 },
    { "sku": "B07", "qty": 1 }
  ]
}

Options:

  • A. The record is unstructured because it has no fixed columns.

  • B. The record contains hierarchical and repeated nested fields.

  • C. The record has duplicate orders that should be removed.

  • D. The record is a flat delimited file with embedded separators.

Best answer: B

Explanation: Semi-structured data, such as JSON, can include fields inside other fields and arrays of repeated values. In this record, customer contains child fields such as id and address, and address contains city and state. The items field is an array with multiple item objects for the same order. That means a relational load may need parsing and possibly flattening or exploding the repeated array, depending on the reporting goal. The exhibit supports nested structure recognition, not duplicate detection or unstructured-data classification.

  • Unstructured data is too broad because the JSON has recognizable keys and organization.
  • Duplicate orders is not supported because the repeated entries are line items within one order.
  • Flat delimited file does not fit because the exhibit uses JSON objects and an array, not rows separated by delimiters.
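
One possible way to flatten the exhibit for a relational load is sketched below in Python. The output column names (`customer_id`, `city`, `state`) and the helper `explode_items` are assumptions for illustration; the actual target schema would depend on the reporting goal.

```python
import json

# The record from the exhibit, parsed from JSON text.
record = json.loads("""
{
  "order_id": "O-1042",
  "customer": {
    "id": "C-22",
    "address": { "city": "Denver", "state": "CO" }
  },
  "items": [
    { "sku": "A10", "qty": 2 },
    { "sku": "B07", "qty": 1 }
  ]
}
""")


def explode_items(order):
    """Return one flat row per line item, repeating the order-level fields."""
    customer = order["customer"]
    base = {
        "order_id": order["order_id"],
        "customer_id": customer["id"],          # hypothetical column name
        "city": customer["address"]["city"],
        "state": customer["address"]["state"],
    }
    # The repeated "items" array becomes one row per element.
    return [{**base, "sku": item["sku"], "qty": item["qty"]} for item in order["items"]]


line_items = explode_items(record)
for row in line_items:
    print(row)
```

The two output rows share the same `order_id`, which is why the repeated entries are line items of one order rather than duplicate orders.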

Question 6

Topic: Data Analysis

A data analyst is troubleshooting a scheduled sales dashboard refresh that runs the same steps each time but fails intermittently. Before changing the workflow, the manager needs evidence showing where failures occur and why some runs take longer.

Exhibit: Current run history

Run time | Status | Duration | Detail captured
08:00 | Success | 4 min | Completed
09:00 | Failed | 16 min | Job failed
10:00 | Success | 5 min | Completed
11:00 | Failed | 18 min | Job failed

Which next action best supports the troubleshooting need?

Options:

  • A. Rewrite the dashboard filters to reduce refresh time

  • B. Ask dashboard users to submit screenshots of failures

  • C. Profile the source table for null values and duplicates

  • D. Enable step-level logging with timestamps and error messages

Best answer: D

Explanation: Logging is the best troubleshooting method when a repeatable process needs evidence about where failures occur or how long each step takes. The exhibit shows only high-level status and duration, so it confirms a pattern but does not identify the failing step, error cause, or timing bottleneck. Step-level logs with timestamps, status codes, and error messages create an auditable trail across multiple runs without prematurely changing the workflow. This supports a data-based diagnosis before applying a fix.

The key distinction is that logging gathers operational evidence, while profiling or redesign changes the focus to data content or report design before the failure point is known.

  • Source profiling may reveal data-quality issues, but it does not directly capture process timing or runtime errors.
  • User screenshots are subjective and may miss backend failures that occur before the dashboard loads.
  • Filter rewrites are a possible fix only after evidence shows filters are causing the delay.
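
Step-level logging of this kind might look like the following Python sketch, which uses the standard `logging` module. The step names, the `run_step` helper, and the simulated timeout are hypothetical stand-ins; a real refresh tool would expose its own steps and errors.

```python
import logging
import time

# Timestamps come from %(asctime)s in the log format.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("dashboard_refresh")


def run_step(name, func):
    """Run one step and log its status, duration, and any error message."""
    start = time.perf_counter()
    try:
        func()
    except Exception as exc:
        elapsed = time.perf_counter() - start
        log.error("step=%s status=failed duration=%.3fs error=%s", name, elapsed, exc)
        return False
    elapsed = time.perf_counter() - start
    log.info("step=%s status=success duration=%.3fs", name, elapsed)
    return True


# Hypothetical steps standing in for the real refresh workflow.
def extract_sales():
    pass


def load_dashboard():
    raise TimeoutError("source query timed out")  # simulated failure


results = {
    name: run_step(name, func)
    for name, func in [("extract_sales", extract_sales), ("load_dashboard", load_dashboard)]
}
```

Compared with the exhibit's single "Job failed" line, each run now records which step failed, how long it took, and the error text, without altering the workflow itself.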

Question 7

Topic: Data Acquisition and Preparation

A data analyst is preparing transaction data for a monthly executive dashboard. The source system stores each customer’s exact age, but executives only need to compare purchasing patterns by broad life-stage groups. The dashboard should avoid exposing unnecessary detailed personal data and should remain easy to filter. Which transformation is the best choice?

Options:

  • A. Standardize ages using z-scores

  • B. Bin ages into defined age-range categories

  • C. Delete the age field before dashboard creation

  • D. Convert ages to text strings without changing values

Best answer: B

Explanation: Binning is the right transformation when continuous or highly detailed values need to be grouped into meaningful ranges or categories. In this scenario, exact ages are more detailed than executives need, and broad groups make the dashboard easier to filter and interpret. Binning also helps reduce unnecessary exposure of precise personal details while preserving the analytical value of age-based comparisons. The bins should be defined consistently, documented, and validated so each age maps to exactly one category.

  • Text conversion preserves the exact age values, so it does not meet the need for broad life-stage groups.
  • Field deletion protects detail but removes the age dimension needed for purchasing-pattern comparisons.
  • Z-score standardization supports numeric modeling or comparison, not executive-friendly categorical filtering.
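
A minimal binning sketch in Python follows. The bin edges and labels in `AGE_BINS` are assumptions chosen for illustration; the actual life-stage groups would be defined and documented with the business.

```python
# Assumed bin edges and labels (inclusive bounds, whole-number ages).
AGE_BINS = [
    (0, 17, "Under 18"),
    (18, 34, "18-34"),
    (35, 54, "35-54"),
    (55, 120, "55+"),
]


def age_group(age):
    """Map an exact age to exactly one range label, or None if out of range."""
    for low, high, label in AGE_BINS:
        if low <= age <= high:
            return label
    return None  # flag for review instead of guessing a category


# Boundary ages are checked so each value lands in exactly one bin.
ages = [17, 18, 34, 35, 54, 55, 90]
groups = [age_group(a) for a in ages]
print(groups)
```

Only the group label would be published to the dashboard, so the exact age stays in the source system.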

Question 8

Topic: Visualization and Reporting

A public health analyst must share quarterly vaccination outreach results with community partners who have varying levels of data expertise. The report should tell a concise story that can be read quickly in a meeting handout.

Exhibit: Communication requirements

Requirement | Detail
Audience | Broad, nontechnical partners
Purpose | Summarize key results and next steps
Format | One-page handout
Detail level | High-level trends and selected callouts

Which visualization/reporting format best fits these requirements?

Options:

  • A. Geospatial map

  • B. Pivot table

  • C. Infographic

  • D. Interactive dashboard

Best answer: C

Explanation: An infographic combines brief text, simple visuals, and key metrics to communicate a narrative quickly. In this scenario, the audience is broad and nontechnical, and the required deliverable is a one-page handout with high-level trends and selected callouts. That makes an infographic a better fit than tools designed for exploration, filtering, or detailed analysis. The key decision is the communication goal: tell a concise story, not provide a workspace for slicing data.

  • Pivot table is better for summarizing and drilling into structured data, not for a polished broad-audience narrative.
  • Interactive dashboard supports exploration and filtering, but it is more than needed for a one-page meeting handout.
  • Geospatial map is useful when location patterns are the main message, which is not the stated requirement.

Question 9

Topic: Data Analysis

A customer support manager wants to give front-line agents a report they can use during each shift. Agents need to see assigned tickets, aging status, priority, next action, and SLA risk so they can decide what to work on immediately. Which reporting approach best fits this audience and requirement?

Options:

  • A. Public infographic showing overall support volume

  • B. Detailed operational dashboard with actionable ticket-level views

  • C. Executive summary with monthly trend highlights

  • D. Board-level KPI scorecard with strategic targets

Best answer: B

Explanation: For individual contributors, the best reporting approach is usually operational and action-oriented. The report should expose enough detail for users to decide what to do next, such as assigned items, current status, priority, exceptions, and due dates. In this scenario, agents are not trying to understand long-term strategy or communicate broad performance trends; they need to manage tickets during a shift. A detailed operational dashboard or report with ticket-level views, filters, and clear SLA indicators supports that need. High-level summaries and scorecards are more appropriate for executives or managers who monitor trends and strategic KPIs rather than take immediate record-level action.

  • Monthly trends miss the shift-level actionability and ticket detail needed by agents.
  • Strategic scorecards fit leadership monitoring, not front-line work management.
  • Public infographics communicate broad patterns and would omit sensitive, actionable operational records.

Question 10

Topic: Visualization and Reporting

A regional sales team wants managers to explore approved revenue and quota reports independently without requesting custom exports each week. The analytics lead must keep access limited by role and prevent users from changing certified KPI definitions. Which delivery method best meets these requirements?

Options:

  • A. Static PDF summary

  • B. Ad hoc spreadsheet export

  • C. Public embedded dashboard

  • D. Self-service portal

Best answer: D

Explanation: A self-service portal is the best fit when users need independent access to approved data or reports within governance controls. It can expose certified dashboards, datasets, and KPIs while using permissions or role-based access to limit what each user can see. This supports exploration without requiring analysts to create repeated exports and without allowing users to redefine governed metrics. Static summaries are useful for fixed communication, but they do not support controlled exploration. Unmanaged exports and public dashboards weaken control over access, definitions, and reuse.

  • Static summary fails because it provides a fixed view rather than independent exploration.
  • Spreadsheet export fails because it can create uncontrolled copies and inconsistent KPI calculations.
  • Public dashboard fails because the scenario requires access limited by role, not open distribution.

Question 11

Topic: Data Concepts and Environments

A data analyst is reviewing how a weekly sales packet is created. Based on the exhibit, which concept is best illustrated?

Exhibit: Tool run log

06:00 Sign in to sales portal
06:01 Click Export CSV
06:03 Open KPI workbook
06:04 Refresh workbook data
06:06 Export PDF report
06:07 Email PDF to sales managers

Options:

  • A. Data lakehouse ingestion

  • B. Robotic process automation

  • C. Natural language processing

  • D. Manual ad hoc analysis

Best answer: B

Explanation: Robotic process automation (RPA) uses software to perform repetitive, rule-based steps that a person would otherwise do, such as signing in, clicking export, refreshing a workbook, creating a PDF, and sending an email. In this case, the reporting workflow is automated by executing the same sequence on a schedule. The result is automated reporting, but the exhibit specifically shows the RPA-style mechanism: mimicking user actions across applications.

  • Natural language processing involves analyzing or generating human language, which is not shown in the run log.
  • Data lakehouse ingestion would focus on storing or integrating data into a repository, not clicking through report creation steps.
  • Manual ad hoc analysis is not supported because the task runs as a repeatable automated sequence.

Question 12

Topic: Data Acquisition and Preparation

A data analyst is preparing a monthly sales report. The source system stores territory_code values such as NE-1, NE-2, and SE-1, but the business report must show standardized regions such as Northeast and Southeast. Finance requires that any reported region total can be traced back to the original source values used to create it. Which preparation approach best meets this requirement?

Options:

  • A. Create a mapped derived region field and retain the original code

  • B. Overwrite territory_code with the standardized region name

  • C. Remove source rows with nonstandard territory codes

  • D. Group the codes directly in the report visualization

Best answer: A

Explanation: Traceability is preserved by keeping the original source value and adding a derived field for reporting. In this scenario, territory_code should remain unchanged, while a documented mapping or lookup creates a standardized region field used in the monthly report. This allows Finance to see regional totals while still tracing each reported value back to the exact source codes that contributed to it. The mapping should be versioned or documented so future changes can be audited. Overwriting, hiding, or deleting source values weakens lineage and makes reconciliation harder.

  • Overwriting values fails because replacing territory_code removes the original value needed for audit and reconciliation.
  • Visualization-only grouping may help presentation, but it does not create a governed, reusable transformation with clear lineage.
  • Removing rows damages completeness and may exclude valid sales activity instead of resolving the traceability requirement.
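
The derived-field approach can be sketched in Python as follows. The territory codes and region names come from the scenario; the sales amounts, the `XX-9` code, and the "Unmapped" label are hypothetical additions for illustration.

```python
# Documented lookup from source code to reporting region.
REGION_BY_TERRITORY = {
    "NE-1": "Northeast",
    "NE-2": "Northeast",
    "SE-1": "Southeast",
}

# Hypothetical sales rows; XX-9 is an invented unmapped code.
sales = [
    {"territory_code": "NE-1", "amount": 100.0},
    {"territory_code": "NE-2", "amount": 50.0},
    {"territory_code": "SE-1", "amount": 75.0},
    {"territory_code": "XX-9", "amount": 10.0},
]

# Add a derived region field; territory_code is left unchanged for traceability.
prepared = [
    {**row, "region": REGION_BY_TERRITORY.get(row["territory_code"], "Unmapped")}
    for row in sales
]

# Region totals plus the source codes that contributed to each total.
totals = {}
sources = {}
for row in prepared:
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    sources.setdefault(row["region"], set()).add(row["territory_code"])

print(totals)
print(sources)
```

Because every prepared row still carries its original `territory_code`, each region total can be reconciled against the exact source values behind it, and unmapped codes surface instead of being dropped.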

Question 13

Topic: Visualization and Reporting

A data analyst is preparing a quarterly dashboard for regional sales directors. The main visual compares “retention rate” by region, but the CRM migration changed the definition from account-level retention to contract-level retention mid-quarter. One region also has two weeks of delayed updates. What is the BEST professional decision before publishing the visual?

Options:

  • A. Replace the chart with a more colorful regional map

  • B. Remove the delayed region from the dashboard without comment

  • C. Publish the chart because the source is the official CRM

  • D. Add a metric definition and note the delayed region data

Best answer: D

Explanation: A visual needs context when the audience could reasonably misread the numbers because of definitions, timing, completeness, or source changes. In this scenario, the retention metric changed during the quarter, and one region has delayed updates. A concise definition, note, tooltip, or footnote helps users interpret the comparison correctly without overcomplicating the dashboard. The goal is not to hide the issue or redesign the visual for style; it is to make the limitation visible at the point of use.

  • Official source trap fails because a trusted source can still contain definition changes or incomplete refreshes.
  • Silent removal fails because excluding a region without explanation can create a different misleading result.
  • Decorative redesign fails because color or map format does not explain the metric change or delayed data.

Question 14

Topic: Data Concepts and Environments

A team is deploying a relational database that will support an internal reporting application. Review the storage request and choose the best storage type.

Exhibit: Storage request

Requirement | Detail
Access pattern | Frequent random reads and writes
Attachment | Mounted by one database server
Behavior needed | Low-level volume, partition, and file system control
Data type | Structured database files

Which storage approach best fits these requirements?

Options:

  • A. File storage

  • B. Object storage

  • C. Block storage

  • D. Data lake storage

Best answer: C

Explanation: Block storage is the best fit when an application or database needs low-level, volume-style storage behavior. It presents storage as attachable volumes that the server can format with a file system and use for random read/write workloads. This matches databases that manage structured database files and need predictable access through an operating system volume. Object storage is better for files and other unstructured data kept as objects in buckets, file storage is better for shared directory access, and data lake storage is a repository pattern rather than a low-level database volume.

  • Object storage is attractive for large datasets, but it does not provide mounted volume behavior with partition and file system control.
  • File storage supports directories and shared files, but it is not the best match for a database needing a dedicated low-level volume.
  • Data lake storage can hold many data formats, but the exhibit asks for storage behavior for database files, not a repository architecture.

Question 15

Topic: Data Analysis

A retail analyst is comparing daily sales consistency for two products. The manager wants the product with the higher relative variability, not just the larger standard deviation.

Exhibit: 30-day sales summary

Product | Mean daily units | Standard deviation | Formula
Product A | 200 | 20 | CV = standard deviation ÷ mean × 100
Product B | 80 | 16 | CV = standard deviation ÷ mean × 100

Which conclusion is supported by the exhibit?

Options:

  • A. Product A has higher relative variability.

  • B. Product B has higher relative variability.

  • C. Product A is more variable because its standard deviation is larger.

  • D. Both products have the same relative variability.

Best answer: B

Explanation: The coefficient of variation (CV) compares dispersion relative to the mean, which is useful when the averages are different. Product A has CV = 20 ÷ 200 × 100 = 10%. Product B has CV = 16 ÷ 80 × 100 = 20%. Although Product A has the larger raw standard deviation, Product B varies more relative to its typical daily sales volume.

  • Raw deviation trap fails because a larger standard deviation alone does not account for different means.
  • Equal variability fails because the calculated CV values are 10% and 20%, not equal.
  • Product A higher reverses the relative comparison after applying the provided formula.
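
The arithmetic can be checked with a short Python sketch that applies the exhibit's formula to the exhibit's values. The function name `coefficient_of_variation` is only an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: standard deviation / mean * 100."""
    return std_dev / mean * 100


# Values from the 30-day sales summary.
cv_a = coefficient_of_variation(20, 200)  # Product A
cv_b = coefficient_of_variation(16, 80)   # Product B

# Compare relative variability, not the raw standard deviations.
more_variable = "Product B" if cv_b > cv_a else "Product A"
print(round(cv_a, 1), round(cv_b, 1), more_variable)
```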

Question 16

Topic: Data Governance

A data analyst discovers that a weekly customer churn export containing names, email addresses, and account IDs was accidentally shared with an external vendor that is not approved to receive customer data. The analyst still has the sent message, file name, recipient, and timestamp. Which response best aligns with incident reporting practices?

Options:

  • A. Wait until misuse of the data is confirmed

  • B. Report through the approved incident channel and preserve evidence

  • C. Notify affected customers directly from the analyst’s email

  • D. Delete the export and ask the vendor to delete it

Best answer: B

Explanation: Incident reporting focuses on timely escalation through the organization’s approved process when a breach or security event is suspected. The analyst should not decide alone whether the event is reportable or try to remediate it informally. Preserving evidence such as the file name, recipient, timestamp, and message helps the security, privacy, or compliance team assess scope, containment, notification obligations, and corrective actions. The key is to report the suspected unauthorized disclosure promptly and avoid actions that could destroy evidence or bypass assigned response roles.

  • Deleting evidence may reduce further exposure, but it can also remove facts needed for investigation and does not satisfy formal reporting.
  • Waiting for misuse is inappropriate because suspected unauthorized disclosure is enough to trigger escalation.
  • Direct customer notice bypasses incident response, legal, and privacy review processes that determine required notifications.

Question 17

Topic: Visualization and Reporting

A finance team publishes a monthly KPI dashboard for executives. After each board meeting, users must be able to reopen the dashboard exactly as it appeared on the meeting date, even if the underlying sales data is later corrected or refreshed. Which dashboard versioning approach best meets this requirement?

Options:

  • A. Increase the scheduled refresh frequency

  • B. Allow users to apply date filters manually

  • C. Create a snapshot version after each board meeting

  • D. Enable real-time refresh on the dashboard

Best answer: C

Explanation: Snapshot versioning is used when users need a fixed historical reference to a report as it existed at a specific time. In this scenario, the board needs to reopen the same KPI dashboard state later, regardless of corrections or refreshes in the source data. A snapshot captures the visible report state, supporting auditability and consistent discussion of prior decisions. Refresh settings help keep reports current, but they do not preserve how the report looked at a past meeting. Manual filters can approximate a period, but they may still reflect updated source values.

  • Real-time refresh satisfies current-data needs, not historical consistency.
  • More frequent refreshes make the dashboard update sooner, which can overwrite the prior meeting view.
  • Manual date filters depend on user action and may still query corrected or changed data.

Question 18

Topic: Data Analysis

An operations manager wants to prioritize product quality reviews using the highest return rate, not the highest count of returned units. Use this formula:

Return rate = returned units ÷ shipped units

Product | Shipped units | Returned units
Alpha | 10,000 | 400
Bravo | 4,000 | 240
Charlie | 2,500 | 125
Delta | 8,000 | 320

Which product should the analyst prioritize?

Options:

  • A. Delta

  • B. Bravo

  • C. Alpha

  • D. Charlie

Best answer: B

Explanation: A derived measure combines existing fields to create a more useful KPI. Here, the requirement is to compare products by return rate, so returned units must be divided by shipped units for each product. Alpha is 400 ÷ 10,000 = 4%, Bravo is 240 ÷ 4,000 = 6%, Charlie is 125 ÷ 2,500 = 5%, and Delta is 320 ÷ 8,000 = 4%. The highest count of returns is not the same as the highest rate when shipment volumes differ. The key takeaway is to use the formula that matches the business question, not just the largest raw count.

  • Raw counts are misleading because Alpha has the most returns but only a 4% return rate.
  • Middle rate does not satisfy the requirement because Charlie’s 5% is below the highest rate.
  • Equal low rates do not qualify because Alpha and Delta both calculate to 4%.
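
As a quick check of the derived measure, the following Python sketch computes the return rate for each product from the table values and ranks by rate rather than by raw count. The variable names are illustrative.

```python
# (shipped units, returned units) from the table.
shipments = {
    "Alpha": (10_000, 400),
    "Bravo": (4_000, 240),
    "Charlie": (2_500, 125),
    "Delta": (8_000, 320),
}

# Return rate = returned units / shipped units.
return_rates = {
    product: returned / shipped
    for product, (shipped, returned) in shipments.items()
}

highest_rate = max(return_rates, key=return_rates.get)
highest_count = max(shipments, key=lambda p: shipments[p][1])

for product, rate in return_rates.items():
    print(f"{product}: {rate:.1%}")
print("Highest rate:", highest_rate)
print("Highest raw count:", highest_count)
```

The two rankings disagree: the product with the most returned units is not the one with the highest rate, which is why the formula has to match the business question.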

Question 19

Topic: Data Concepts and Environments

A product team wants to analyze why mobile checkout sessions fail after a new release. An analyst reviews this source excerpt:

timestamp | session_id | component | event | status
2026-05-18 09:14:22 | S1027 | payment_api | auth_request | 200
2026-05-18 09:14:25 | S1027 | payment_api | token_retry | WARN
2026-05-18 09:14:27 | S1027 | checkout | order_submit | ERROR

Which interpretation is best supported by the exhibit?

Options:

  • A. It is log data for event and error analysis.

  • B. It is a transaction ledger for revenue reporting.

  • C. It is reference data for product categorization.

  • D. It is survey data for customer sentiment analysis.

Best answer: A

Explanation: Log data records what systems, applications, or devices do over time. The exhibit contains timestamped events tied to a session and component, with status values such as WARN and ERROR. That structure supports analysis of event sequences, usage behavior, failures, latency, and operational conditions. For this checkout issue, the analyst can trace what happened before the failure and identify the affected component or event pattern.

A transaction ledger might confirm whether an order was completed, but it usually would not show the detailed system events leading to the checkout error.

  • Survey data fails because the excerpt contains system-generated events, not customer ratings or free-text feedback.
  • Reference data fails because the rows describe activity over time, not stable lookup values such as product categories.
  • Revenue reporting fails because the excerpt has no amounts, posted transactions, or accounting measures.

Question 20

Topic: Data Analysis

A customer success manager is building a dashboard for a quarterly retention initiative. The manager needs one KPI that gives an early signal before the renewal results are known.

Exhibit:

KPI | Measured when | Typical use
Onboarding completion within 14 days | First 2 weeks | Identify accounts needing intervention
Quarterly renewal rate | End of quarter | Evaluate retention success
Tickets closed per support agent | Daily | Track support workload
Annual recurring revenue retained | End of year | Assess business impact

Which interpretation best aligns the KPI type with the manager’s need?

Options:

  • A. Use annual revenue retained as an operational KPI.

  • B. Use tickets closed as an outcome-oriented KPI.

  • C. Use quarterly renewal rate as a leading KPI.

  • D. Use onboarding completion as a leading KPI.

Best answer: D

Explanation: A leading KPI provides an early signal that can influence a later result. In this scenario, onboarding completion within 14 days happens before the quarterly renewal decision and can trigger intervention for at-risk accounts. Quarterly renewal rate and annual recurring revenue retained are lagging or outcome-oriented measures because they summarize results after the business period is complete. Tickets closed per support agent is operational because it tracks day-to-day activity or workload, not the strategic retention outcome directly.

The key distinction is timing and purpose: leading KPIs help predict or influence future outcomes, while lagging and outcome-oriented KPIs evaluate what already happened.

  • Renewal rate timing fails because quarterly renewal rate is known after the quarter, so it is lagging rather than leading.
  • Support workload fails because tickets closed measures operational activity, not the retention outcome itself.
  • Revenue retained fails because annual retained revenue reflects business impact after results occur, not daily operations.

Question 21

Topic: Data Governance

A sales operations team publishes a daily revenue dashboard from an ETL pipeline. A one-time profiling review was completed before launch. The team now wants to detect quality issues without waiting for users to report bad numbers.

Exhibit: Data quality notes

Check | Current approach | Recent finding
Null customer_id rate | Manual review at launch | Rose from 0.2% to 4.8% last week
Duplicate order_id count | Manual review at launch | Spiked on three daily loads
Dashboard refresh | Scheduled daily | Completed successfully

Which next action is best supported by the exhibit?

Options:

  • A. Implement automated quality monitoring with alerts

  • B. Ignore the issue because refreshes succeeded

  • C. Replace the dashboard with a static monthly report

  • D. Repeat the original one-time profiling review

Best answer: A

Explanation: A one-time quality review or profiling activity is useful before launch to understand a dataset’s structure, nulls, duplicates, ranges, and anomalies at a point in time. In an ongoing reporting workflow, quality can drift after launch even when the refresh job technically succeeds. The exhibit shows recurring changes in null rates and duplicate counts across daily loads, so the better governance control is automated data quality monitoring with thresholds, scheduled checks, and alerts. That approach detects emerging issues as the pipeline runs and supports timely investigation before users rely on incorrect KPIs.

  • Repeat profiling may find the current problem, but it does not create ongoing detection for future daily loads.
  • Successful refresh only confirms the pipeline ran; it does not prove the loaded data is accurate or complete.
  • Static reporting reduces refresh frequency but does not address the quality drift shown in the daily data.
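
The idea of scheduled checks with thresholds can be sketched as follows. This is an illustration only: the threshold values, the `check_daily_load` function, and the sample rows are assumptions, and a real pipeline would send the alerts to a notification channel instead of printing them.

```python
from collections import Counter

# Hypothetical thresholds; real values would come from the team's data quality policy.
MAX_NULL_CUSTOMER_RATE = 0.01
MAX_DUPLICATE_ORDER_IDS = 0


def check_daily_load(rows):
    """Return a list of alert messages for one daily load (a list of dicts)."""
    if not rows:
        return ["Load is empty"]
    alerts = []
    # Completeness check: share of rows with a missing customer_id.
    null_rate = sum(1 for r in rows if not r.get("customer_id")) / len(rows)
    if null_rate > MAX_NULL_CUSTOMER_RATE:
        alerts.append(f"Null customer_id rate {null_rate:.1%} exceeds threshold")
    # Uniqueness check: number of order_id values that appear more than once.
    counts = Counter(r["order_id"] for r in rows)
    duplicates = sum(1 for n in counts.values() if n > 1)
    if duplicates > MAX_DUPLICATE_ORDER_IDS:
        alerts.append(f"{duplicates} duplicated order_id value(s)")
    return alerts


sample_load = [
    {"order_id": "O1", "customer_id": "C1"},
    {"order_id": "O2", "customer_id": None},
    {"order_id": "O2", "customer_id": "C3"},
    {"order_id": "O3", "customer_id": "C4"},
]
alerts = check_daily_load(sample_load)
for message in alerts:
    print("ALERT:", message)
```

Running a check like this on every load, rather than once at launch, is what turns profiling into monitoring.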

Question 22

Topic: Visualization and Reporting

Branch managers need staffing KPIs by 8 a.m. each business day. Each manager should see only their own branch. The metrics include employee IDs and absence reasons. The warehouse refreshes nightly, and managers are comfortable using the company BI portal but not SQL or spreadsheet modeling. Which delivery method best meets these requirements?

Options:

  • A. Create a real-time dashboard using a shared public link

  • B. Grant managers read access to warehouse tables

  • C. Email a spreadsheet export to each manager every morning

  • D. Publish a role-secured BI dashboard with scheduled refresh

Best answer: D

Explanation: A delivery method should match how users can consume the report while protecting sensitive data and meeting the refresh need. Here, a BI portal dashboard fits the managers’ capability, and scheduled refresh aligns with the nightly warehouse update and 8 a.m. deadline. Role-based or row-level security limits each manager to only their branch, which is important because the report includes employee-level sensitive information. A more open or raw-data delivery method would increase privacy and misuse risk without improving the business outcome. The key is to combine appropriate access control with a delivery format the audience can actually use.

  • Spreadsheet email may be timely, but it creates distribution and version-control risks for sensitive employee data.
  • Public shared link fails the access and sensitivity requirements, even if it could be fast.
  • Warehouse table access does not fit the managers’ stated skills and exposes more raw data than needed.

Question 23

Topic: Data Concepts and Environments

A data analyst must build a monthly vendor spend report. Each regional office sends the analyst an approved extract from its procurement system, but the offices do not share a managed database, API, or warehouse connection. The report must be refreshed from the exchanged files and checked for missing required columns before analysis. Which source approach is the BEST professional decision?

Options:

  • A. Use standardized flat-file extracts as the source

  • B. Manually retype each office’s totals into the report

  • C. Scrape each procurement system’s web interface

  • D. Require all offices to migrate into one database first

Best answer: A

Explanation: Flat files, such as CSV or Excel extracts, are practical sources when business users exchange data outside a managed database workflow. In this scenario, the offices already produce approved extracts, and there is no shared database, API, or warehouse connection. The analyst should standardize the file format and validate required columns before analysis so the monthly refresh is repeatable and data quality issues are caught early. A full database migration may be useful later, but it is beyond the stated need and would overengineer the source selection.

  • Database migration adds unnecessary scope because the requirement is to use exchanged monthly extracts.
  • Web scraping is fragile and inappropriate when approved extracts are already available.
  • Manual retyping increases error risk and does not support repeatable validation.
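
A required-column check on each extract might look like the sketch below. The column names in `REQUIRED_COLUMNS` and the sample file contents are hypothetical; the scenario does not list the actual fields.

```python
import csv
import io

# Hypothetical required columns for the vendor spend report.
REQUIRED_COLUMNS = {"office", "vendor_id", "invoice_date", "amount"}


def missing_columns(csv_text):
    """Return the required columns that are absent from a CSV extract's header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip().lower() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


good_extract = "office,vendor_id,invoice_date,amount\nEast,V10,2026-01-15,1200.50\n"
bad_extract = "office,vendor_id,amount\nWest,V22,310.00\n"

print(missing_columns(good_extract))  # []
print(missing_columns(bad_extract))   # ['invoice_date']
```

Rejecting or flagging an extract before it is loaded keeps the monthly refresh repeatable and surfaces format drift early.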

Question 24

Topic: Visualization and Reporting

A sales operations analyst must support a weekly review where managers explore revenue by region, product category, and month. The validated dataset is stored in a governed warehouse table, and the analyst is not allowed to alter the source table or create new governed summary fields. Managers need to regroup and filter totals during the meeting. Which approach is the best professional decision?

Options:

  • A. Create a pivot table from the approved dataset

  • B. Export raw rows for managers to manually edit

  • C. Add summary columns to the warehouse table

  • D. Design a static infographic of monthly revenue

Best answer: A

Explanation: A pivot table is appropriate when users need exploratory aggregation over an approved dataset. It lets managers group, filter, and summarize measures such as revenue by dimensions such as region, product category, and month without modifying the governed source table. This fits the meeting need because the audience can change the view interactively while the underlying validated data remains intact. Static visuals are better for fixed communication, and changing warehouse schema or allowing manual edits would create governance and quality risks.

  • Warehouse changes are unnecessary because the requirement is exploratory analysis, not permanent governed data modeling.
  • Static infographic fails because managers need to regroup and filter totals during the meeting.
  • Manual editing introduces data quality and governance risk by moving analysis away from the validated source.

Question 25

Topic: Data Concepts and Environments

A retail analytics team maintains a star schema for sales reporting. The business wants quarterly sales to remain tied to the customer segment that was active when each sale occurred, even if a customer’s segment changes later. Which dimension-table approach best supports this requirement?

Options:

  • A. Overwrite the segment in the customer dimension

  • B. Use a Type 2 slowly changing dimension

  • C. Store only the previous segment in a new column

  • D. Remove customer segment from the sales model

Best answer: B

Explanation: A slowly changing dimension is used when descriptive attributes in a dimension can change over time. Because the report must show sales by the customer segment that was valid at the time of each sale, the model needs to preserve historical segment values instead of replacing them. A Type 2 SCD typically adds a new dimension row for each changed version of the customer, often with a surrogate key plus effective dates or a current-row flag. This allows facts to remain associated with the correct historical dimension version. Overwriting the dimension would make older sales appear under the customer’s current segment, reducing reporting accuracy.

  • Overwrite current value fails because Type 1 behavior loses the segment that was valid when prior sales occurred.
  • Previous-value column is limited because it does not support a full history of multiple changes over time.
  • Remove the attribute fails because the business specifically needs segment-based historical reporting.
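
To make the Type 2 behavior concrete, the sketch below keeps one dimension row per version of a customer and looks up the segment that was in effect on a sale date. The table contents, column names, and the `segment_at` helper are illustrative assumptions.

```python
from datetime import date

# Hypothetical Type 2 customer dimension: one row per version of the customer.
customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "segment": "Standard",
     "effective_from": date(2025, 1, 1), "effective_to": date(2025, 6, 30)},
    {"customer_key": 2, "customer_id": "C100", "segment": "Premium",
     "effective_from": date(2025, 7, 1), "effective_to": date(9999, 12, 31)},
]


def segment_at(customer_id, sale_date):
    """Return the segment that was in effect for the customer on sale_date."""
    for row in customer_dim:
        if (row["customer_id"] == customer_id
                and row["effective_from"] <= sale_date <= row["effective_to"]):
            return row["segment"]
    return None  # no dimension version covers this date


print(segment_at("C100", date(2025, 3, 15)))  # Standard
print(segment_at("C100", date(2025, 8, 1)))   # Premium
```

In a warehouse, the fact row would normally store the surrogate key (`customer_key` here) resolved at load time, so the historical segment is preserved without a date lookup at query time.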

Questions 26-50

Question 26

Topic: Data Acquisition and Preparation

A retail analyst is reviewing a customer satisfaction survey intended to represent a typical month of online checkout experiences. Which interpretation is best supported by the collection profile?

Exhibit: Survey collection profile

Field | Detail
Target population | Monthly online purchasers
Collection window | November 24-27
Business note | Black Friday promotion and checkout latency incident
Completed surveys | 1,850
Invitation method | Email to purchasers during the window

Options:

  • A. Conclude the invitation method caused duplicate survey responses.

  • B. Flag possible timing bias from the collection window.

  • C. Diagnose nonresponse bias from customers who ignored the email.

  • D. Treat the results as representative because the response count is high.

Best answer: B

Explanation: Collection timing can bias results when data is gathered during an unusual period that changes normal behavior. Here, the survey is supposed to represent typical monthly checkout experiences, but the collection window overlaps a Black Friday promotion and a checkout latency incident. Those conditions can affect satisfaction, traffic patterns, purchase urgency, and response behavior. A large number of responses does not remove this bias if all observations come from an atypical window. The appropriate interpretation is to flag timing-related collection bias and consider recollecting or comparing against normal-period data.

  • High response count fails because sample size does not fix an unrepresentative collection period.
  • Nonresponse bias is possible in many surveys, but the exhibit’s decisive issue is the unusual timing.
  • Duplicate responses are not supported because the exhibit does not show repeated submissions or identifier problems.

Question 27

Topic: Visualization and Reporting

A data analyst is preparing a quarterly executive dashboard using 24 months of aggregated sales data. Leaders want to see how total revenue changes over time and how each product category contributes to that total. The chart must be easy to scan and should not expose transaction-level detail. Which chart type is the best choice?

Options:

  • A. Histogram

  • B. Stacked area chart

  • C. Pie chart

  • D. Scatter plot

Best answer: B

Explanation: Chart selection should match the analytical question. This scenario combines a trend question with a composition question: executives need to see revenue over 24 months and understand how product categories make up the total. A stacked area chart is designed for this type of time-based composition because the overall shape shows the total trend, while the stacked bands show category contributions. Aggregated monthly or quarterly data also supports the privacy and summary-level reporting need. A simple line chart could show trend, but it would not show part-to-whole composition as clearly.

  • Scatter plot is used for relationships between two numeric variables, not category contribution over time.
  • Pie chart shows composition at one point in time, but it does not show a 24-month trend.
  • Histogram shows distribution of one numeric variable, not time-based totals by category.

Question 28

Topic: Data Analysis

A logistics analyst is comparing two fulfillment partners for a contract renewal. Leadership says the priority is predictable delivery within a 4-day SLA, not the lowest average. Timestamps were validated, and the 1% missing delivery times for both partners are within the approved reporting tolerance.

Partner | Mean days | Std. dev. | IQR | 95th percentile
A | 2.8 | 1.4 | 1.9 | 5.2
B | 3.0 | 0.4 | 0.5 | 3.7

Which recommendation is the BEST professional decision?

Options:

  • A. Delay the decision until all missing values are imputed.

  • B. Recommend Partner A because its wider spread shows more delivery flexibility.

  • C. Recommend Partner B for more consistent SLA performance.

  • D. Recommend Partner A because it has the lower mean delivery time.

Best answer: C

Explanation: When the business question emphasizes consistency or risk, dispersion measures can matter more than the mean. Partner A has a slightly lower average delivery time, but its standard deviation and IQR are much larger, and its 95th percentile exceeds the 4-day SLA. That means more deliveries are likely to be late even though the average looks favorable. Partner B has a slightly higher mean, but the much smaller spread and 95th percentile of 3.7 days better match the requirement for predictable SLA performance. The key is aligning the statistic with the decision objective, not selecting the lowest average by default.

  • Mean-only choice ignores that Partner A has greater variability and a higher late-delivery risk.
  • Wider spread is not beneficial here because leadership wants predictable delivery, not variability.
  • Imputing all missing values overreacts because the missing rate is equal and within the approved reporting tolerance.
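
The contrast between a lower mean and a wider spread can be reproduced on a small example. The delivery times below are invented for illustration, because the exhibit provides only summary statistics; only the 4-day SLA comes from the scenario.

```python
from statistics import mean, stdev

SLA_DAYS = 4  # from the scenario

# Hypothetical delivery times in days; not the data behind the exhibit.
partner_a = [1, 1.5, 2, 2, 2.5, 3, 3, 3.5, 4, 6]
partner_b = [2.5, 2.8, 3, 3, 3, 3, 3.1, 3.2, 3.3, 3.5]


def summarize(days):
    """Return mean, sample standard deviation, and share of deliveries over the SLA."""
    late = sum(1 for d in days if d > SLA_DAYS)
    return {"mean": mean(days), "stdev": stdev(days), "late_share": late / len(days)}


summary_a = summarize(partner_a)
summary_b = summarize(partner_b)
print("A:", summary_a)
print("B:", summary_b)
```

In this invented sample, Partner A has the lower mean but the larger standard deviation and the only deliveries beyond the SLA, which mirrors the pattern in the exhibit.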

Question 29

Topic: Data Governance

A data analyst is reconciling conflicting October revenue totals before publishing an executive KPI report. The report must support month-end revenue decisions and use one authoritative reference.

Exhibit: Dataset metadata

Dataset | October total | Owner | Approved use | Governance notes
CRM_Deals | $1,240,000 | Sales Ops | Pipeline tracking | Includes open deals
ERP_Invoices | $1,185,000 | Finance | Recognized revenue | Audited month-end close
BI_Revenue_View | $1,198,000 | Analytics | Dashboard staging | Derived from CRM and ERP
Analyst_Adjustments.xlsx | $1,210,000 | Analyst | Ad hoc analysis | Manual edits, no approval

Which dataset should be selected as the source of truth for the KPI report?

Options:

  • A. ERP_Invoices

  • B. BI_Revenue_View

  • C. Analyst_Adjustments.xlsx

  • D. CRM_Deals

Best answer: A

Explanation: A source of truth is the authoritative dataset used when multiple sources conflict. For month-end revenue decisions, the best source is the dataset with the right business definition, accountable owner, approval status, and governance controls. The exhibit identifies ERP_Invoices as Finance-owned, approved for recognized revenue, and audited at month-end. Those qualities make it the authoritative reference for the KPI report, even if another dataset has a newer-looking or higher total. Derived dashboard views and manual spreadsheets can support analysis, but they should not override the governed system of record.

  • Pipeline data fails because open CRM deals are not the same as recognized revenue.
  • Derived view fails because a staging view inherits conflicts and is not the authoritative owner-approved source.
  • Manual spreadsheet fails because ad hoc edits without approval weaken traceability and governance.

Question 30

Topic: Data Concepts and Environments

A data analyst must build a monthly compliance report for finance managers. The report must use governed customer and transaction tables, support the same query logic each month, and return consistent structured records for audit review. Which data source is the best professional choice?

Options:

  • A. The approved relational database

  • B. A raw document folder in a data lake

  • C. A web scrape of customer account pages

  • D. A shared spreadsheet exported by users

Best answer: A

Explanation: For a recurring compliance report that depends on governed tables and repeatable query logic, the best source is an approved relational database or similar governed database environment. Relational databases store structured records in tables, enforce schemas and constraints, and allow analysts to run consistent SQL queries over time. This supports auditability because the source, query, and data structure can be documented and repeated. User-managed files, scraped pages, and raw document stores may be useful for exploration or unstructured data, but they introduce more variability and governance risk for this requirement.

  • User exports can change format or content without control, which weakens repeatability and auditability.
  • Web scraping is fragile and may bypass governed source-of-truth processes.
  • Raw data lake files may store useful data, but a raw document folder does not guarantee structured, query-ready, governed tables.

Question 31

Topic: Data Concepts and Environments

A retail analyst is designing a star schema for an executive dashboard. The sales fact table is at the order-line grain, and the CRM source allows each customer to have multiple interest segments. A direct join to the segment list makes reported revenue higher than the finance total. Executives still need revenue by segment and reconciled overall totals. What is the BEST professional decision?

Options:

  • A. Create a customer-segment bridge table with allocation weights

  • B. Join sales directly to the customer segment list

  • C. Store all customer segments in one delimited field

  • D. Keep only the customer’s most recent segment

Best answer: A

Explanation: A bridge table is used when a dimension relationship is many-to-many, such as customers belonging to multiple segments. Directly joining an additive fact like revenue to multiple segment rows repeats the same order-line amount, inflating totals. A customer-segment bridge stores one row per valid customer-to-segment relationship and can include an allocation factor when revenue must be distributed across segments. This preserves segment analysis while allowing overall totals to reconcile to the sales fact and finance source. The key is to model the relationship explicitly instead of hiding or deleting valid segment memberships.

  • Direct join fails because each customer’s sales can be repeated once per segment, overstating revenue.
  • Delimited segments make filtering and relationship management harder and do not solve double counting.
  • Most recent segment discards valid history or memberships, reducing analytical usefulness for segment reporting.
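
The double-counting problem and the effect of allocation weights can be shown with a small sketch. The customers, segments, revenue amounts, and weights below are hypothetical, and `revenue_by_segment` is an illustrative helper rather than a feature of any BI tool.

```python
# Hypothetical order-line revenue per customer and a customer-segment bridge
# whose allocation weights sum to 1.0 for each customer.
sales = [("C1", 100.0), ("C2", 50.0)]  # (customer_id, revenue)
bridge = [
    ("C1", "Outdoor", 0.5),
    ("C1", "Fitness", 0.5),
    ("C2", "Outdoor", 1.0),
]


def revenue_by_segment(use_weights):
    """Join sales to the bridge; apply allocation weights only when requested."""
    totals = {}
    for customer_id, revenue in sales:
        for bridge_customer, segment, weight in bridge:
            if bridge_customer == customer_id:
                share = revenue * weight if use_weights else revenue
                totals[segment] = totals.get(segment, 0.0) + share
    return totals


naive = revenue_by_segment(use_weights=False)     # repeats C1's revenue per segment
allocated = revenue_by_segment(use_weights=True)  # splits C1's revenue across segments

print(sum(naive.values()))      # 250.0, higher than the true total of 150.0
print(sum(allocated.values()))  # 150.0, reconciles to the sales total
```

The weighted version still reports revenue for every segment, but the segment totals add back up to the fact-table total.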

Question 32

Topic: Data Acquisition and Preparation

A marketing analyst is preparing a monthly campaign dataset for a revenue dashboard. The source file contains order_amount_raw as text; most values are valid amounts, but some contain entries such as TBD, blank values, or currency symbols. The dashboard needs a numeric amount for aggregation, and the data steward wants the original quality issue to remain auditable. Which preparation approach best meets these requirements?

Options:

  • A. Overwrite order_amount_raw with cleaned numeric values

  • B. Remove every row with a nonnumeric amount

  • C. Create a numeric derived field and retain the raw field with an error flag

  • D. Replace all invalid amounts with 0 before loading

Best answer: C

Explanation: Analysis readiness improves when a transformation creates a usable field without destroying evidence of the source issue. In this case, the dashboard needs a numeric value for aggregation, but the steward also needs auditability. A derived numeric amount field supports calculations, while retaining order_amount_raw preserves lineage. Adding an error or validity flag makes invalid values visible for quality review instead of silently hiding them. This approach separates reporting usability from data-quality remediation.

  • Overwriting raw data removes the original values, making it harder to audit how quality issues were handled.
  • Using zero imputation can distort revenue totals and hide whether values were truly zero or invalid.
  • Dropping rows may bias the dashboard and removes evidence needed for follow-up with the source owner.
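
One way to sketch the derived-field approach is shown below. The output field names `order_amount` and `amount_error`, the `parse_amount` helper, and the cleaning rules (strip `$` and thousands separators) are assumptions for illustration; only `order_amount_raw` comes from the scenario.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(raw):
    """Return (amount, is_valid) for a raw text amount without altering the raw value."""
    cleaned = (raw or "").strip().replace("$", "").replace(",", "")
    if not cleaned:
        return None, False
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None, False
    if not amount.is_finite():  # reject text such as "NaN" or "Infinity"
        return None, False
    return amount, True


raw_values = ["$1,250.00", "TBD", "", "89.5"]
prepared = []
for raw in raw_values:
    amount, ok = parse_amount(raw)
    prepared.append({
        "order_amount_raw": raw,   # original text kept for audit
        "order_amount": amount,    # numeric field for aggregation, None when invalid
        "amount_error": not ok,    # flag for data quality review
    })

valid_total = sum(row["order_amount"] for row in prepared if not row["amount_error"])
print(valid_total)  # 1339.50
```

Invalid rows stay in the dataset with their raw text and a flag, so the steward can count and trace them while the dashboard aggregates only validated amounts.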

Question 33

Topic: Data Analysis

A data analyst is writing a dashboard note for executives about a promotional email pilot. The note must summarize only what the data supports. Which wording is most appropriate?

Exhibit: Pilot summary

Population: loyalty members who opted in to email
Comparison: same members, 30 days before vs. after email
Average order value: $42 before; $47 after
Control group: none
Known issue: holiday sale overlapped pilot

Options:

  • A. All customers will increase average order value by $5 if emailed.

  • B. The email caused a $5 increase in average order value.

  • C. The email had no effect because the holiday sale overlapped the pilot.

  • D. Opted-in loyalty members had higher average order value after the email; causation is not established.

Best answer: D

Explanation: Communication should match the strength of the evidence. The exhibit shows a before-and-after increase for the same opted-in loyalty members, but there is no control group and a holiday sale occurred at the same time. That supports wording about an observed association or change in this group, not a causal claim or a prediction for all customers. A confounder weakens causal interpretation, but it does not prove the email had no effect. The safest wording states the observed result and clearly limits the conclusion.

  • Causal claim fails because a before-and-after comparison without a control group cannot isolate the email’s effect.
  • Overgeneralization fails because the data covers opted-in loyalty members, not all customers.
  • No-effect claim fails because the holiday sale creates uncertainty; it does not prove the email was ineffective.

Question 34

Topic: Data Analysis

A data analyst is asked whether a 2-week email campaign increased repeat purchases. The dashboard shows a lift, but the analyst notices a recent data-quality alert.

Exhibit: Campaign evidence summary

Check | Result
Repeat purchase rate, targeted customers | 8.4%
Repeat purchase rate, holdout customers | 7.9%
Missing campaign flag | 18% of orders
Missingness pattern | Mostly from mobile checkout
Data alert | Campaign flag pipeline changed 3 days before launch

Which next action is best supported by the exhibit?

Options:

  • A. Exclude all mobile orders and publish the revised lift

  • B. Impute missing campaign flags as not targeted

  • C. Validate the campaign flag source and rerun the analysis

  • D. Report that the campaign increased repeat purchases

Best answer: C

Explanation: Evidence-based conclusions should account for data quality before interpreting a KPI difference. Here, the observed lift is only 0.5 percentage points, while 18% of orders are missing the campaign flag. The missingness is not random because it is concentrated in mobile checkout, and a pipeline change occurred just before the campaign. That makes the campaign/holdout comparison unreliable until the source issue is investigated and the analysis is rerun with validated data. The defensible next action is to verify lineage and source extraction for the campaign flag rather than state a business conclusion from compromised evidence.

  • Positive conclusion fails because the lift may be an artifact of missing or misclassified campaign flags.
  • Dropping mobile orders may introduce selection bias and does not fix the underlying source-quality issue.
  • Default imputation is unsupported because treating missing flags as not targeted could systematically misclassify orders.

Question 35

Topic: Data Governance

A retail analytics team receives daily sales extracts from eight stores. The executive dashboard has repeatedly shown issues caused by blank store_id values, negative quantity values, and duplicate transaction_id values. The business wants to catch these issues before the dashboard refreshes, track recurrence by store over time, and notify the data owner automatically.

Which approach best maps to these requirements?

Options:

  • A. Run quarterly user acceptance testing on dashboard filters.

  • B. Mask transaction identifiers before loading the dashboard.

  • C. Implement automated profiling rules, quality metrics, and alerts.

  • D. Manually inspect a sample after the dashboard publishes.

Best answer: C

Explanation: Automated data-quality monitoring is the best fit when recurring issues must be detected consistently before downstream reporting is affected. Profiling rules can test fields for completeness, valid ranges, and uniqueness, while quality metrics can trend failure rates by source store over time. Alerts connect the monitoring process to governance by notifying the accountable data owner when thresholds or rules fail. This approach supports repeatable detection and evidence-based remediation instead of relying on occasional review or after-the-fact correction.

  • Quarterly UAT checks report usability, but it is too infrequent and does not monitor incoming data quality before each refresh.
  • Masking identifiers protects sensitive values, but it does not detect blanks, invalid quantities, or duplicates.
  • Manual sampling may miss recurring defects and occurs too late if performed after publication.

Question 36

Topic: Data Acquisition and Preparation

A data analyst is preparing a monthly customer churn model for the retention team. The source CRM table has 8% missing values in household_income, which is a useful predictor, and rows with missing values are otherwise complete. The business owner wants the model update delivered this week and asks that the preparation steps be reproducible and defensible. Which action is the BEST professional decision?

Options:

  • A. Delete all rows with missing income values before modeling

  • B. Replace all missing income values with zero

  • C. Impute the missing income values using a documented, validated method

  • D. Remove the income field from the model dataset

Best answer: C

Explanation: Imputation is appropriate when missing values should be filled using a reasonable method instead of automatically deleting records. In this scenario, the missing field is useful for churn modeling, the affected rows are otherwise complete, and the analyst must produce a defensible, reproducible preparation process. A documented method, such as median imputation or segment-based imputation followed by validation, can preserve sample size and reduce avoidable bias from dropping records. The chosen method should be recorded so reviewers can understand how missing values were handled.

Automatic deletion is a weaker choice because it discards usable customer records without first assessing the impact of the missingness.

  • Row deletion can reduce sample size and introduce bias when the remaining fields are usable.
  • Zero replacement is not defensible for income unless zero is a valid, verified value.
  • Field removal discards a useful predictor instead of addressing a manageable missing-value issue.
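
Median imputation with an audit flag can be sketched as follows. The income values and the `impute_median` helper are hypothetical; a segment-based method would apply the same idea within each customer segment.

```python
from statistics import median


def impute_median(values):
    """Fill None entries with the median of observed values; return (filled, flags)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_imputed = [v is None for v in values]  # keep a record of which rows were filled
    return filled, was_imputed


# Hypothetical household_income values with two missing entries.
incomes = [42000, None, 58000, 61000, None, 75000]
filled, was_imputed = impute_median(incomes)
print(filled)       # [42000, 59500.0, 58000, 61000, 59500.0, 75000]
print(was_imputed)  # [False, True, False, False, True, False]
```

Keeping the imputation flag alongside the filled values lets reviewers validate the method, for example by comparing model results with and without the imputed rows.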

Question 37

Topic: Data Governance

A data analyst publishes a monthly revenue dashboard used by finance leadership. After a late adjustment from the source system, the March dataset and dashboard totals changed, but leaders still need to compare the current numbers with the originally published March report for audit discussion. What is the best professional decision?

Options:

  • A. Overwrite the March files with the corrected values

  • B. Delete the original dashboard to avoid conflicting totals

  • C. Maintain versioned snapshots of the dataset and dashboard output

  • D. Email finance a note explaining that totals changed

Best answer: C

Explanation: Data versioning is a governance control for preserving identifiable versions of datasets, reports, or analytical outputs as they change. In this scenario, finance needs both the originally published March report and the corrected version for audit comparison. Keeping versioned snapshots supports traceability, repeatability, and source-of-truth discipline without preventing valid corrections from being applied. Overwriting or deleting files removes evidence of what was previously published. A note alone may provide context, but it does not preserve the actual dataset and report state needed for comparison.

  • Overwrite only fails because it destroys the originally published March state needed for audit comparison.
  • Delete originals fails because removing prior outputs reduces traceability and can create governance risk.
  • Email explanation fails because communication does not substitute for controlled versions of the data and report output.

Question 38

Topic: Data Governance

A data analyst is preparing a monthly customer-service dashboard from support tickets that include customer names and email addresses. The company retention policy requires raw tickets to be retained for 180 days, disposed of after that period unless on legal hold, and aggregated KPI results to be retained for 3 years. Which action is the best professional decision?

Options:

  • A. Delete all raw tickets after dashboard publication

  • B. Retain raw tickets 180 days, then dispose unless held

  • C. Export raw tickets to a personal spreadsheet archive

  • D. Keep raw tickets indefinitely for trend analysis

Best answer: B

Explanation: Retention requirements define how long data must be kept and when it must be disposed of based on policy, regulation, or legal hold. In this scenario, raw tickets contain personal data, so the analyst should not keep them longer than allowed or delete them before the required retention period. The professional decision is to retain raw tickets for 180 days, dispose of them after that period unless a legal hold applies, and keep only the aggregated KPI results for the 3-year reporting requirement. This balances compliance, reporting continuity, and privacy risk.

  • Indefinite retention increases privacy and compliance risk because the policy gives a specific disposal point for raw tickets.
  • Early deletion violates the required 180-day retention period for raw ticket records.
  • Personal archives bypass governed storage and do not satisfy controlled retention or disposal requirements.
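
The raw-ticket part of the policy can be expressed as a small rule. This is a sketch under assumptions: the function name, the return labels, and the choice to treat day 180 itself as still within retention are illustrative and would need to match the actual policy wording.

```python
from datetime import date, timedelta

RAW_RETENTION = timedelta(days=180)  # raw-ticket retention period from the policy


def raw_ticket_action(created_on, today, legal_hold=False):
    """Decide what to do with a raw ticket under the stated retention policy."""
    if legal_hold:
        return "retain (legal hold)"
    if today - created_on > RAW_RETENTION:
        return "dispose"
    return "retain"


print(raw_ticket_action(date(2025, 1, 1), date(2025, 3, 1)))                   # retain
print(raw_ticket_action(date(2025, 1, 1), date(2025, 8, 1)))                   # dispose
print(raw_ticket_action(date(2025, 1, 1), date(2025, 8, 1), legal_hold=True))  # retain (legal hold)
```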

Question 39

Topic: Data Analysis

A data analyst is creating a calculated field named review_status for an orders dashboard. The field must classify each row using the rule in the exhibit. Which logical function pattern should be used?

Exhibit: Classification rule

Condition | Result
chargeback_flag = TRUE | Manual review
customer_status = "New" and order_amount >= 500 | Manual review
All other rows | Standard

Options:

  • A. Use CONCAT to combine chargeback_flag, customer_status, and order_amount

  • B. Use IF/CASE with chargeback_flag OR customer_status = "New" OR order_amount >= 500

  • C. Use IF/CASE with chargeback_flag OR (customer_status = "New" AND order_amount >= 500)

  • D. Use IF/CASE with chargeback_flag AND customer_status = "New" AND order_amount >= 500

Best answer: C

Explanation: Logical functions are used when a calculated field depends on conditions, flags, categories, or business rules. In this case, the output is category-based: each row becomes either Manual review or Standard. The exhibit defines two separate paths to Manual review: any chargeback, or a new customer with an order amount of at least 500. That means the overall test needs OR, while the new-customer high-value path needs AND grouped together. An IF or CASE expression can then return the correct label based on that combined Boolean result. The key is preserving the rule logic exactly, not merely checking whether any individual field has a value.

  • All AND logic is too restrictive because it would miss chargeback rows that are not new high-value orders.
  • All OR logic is too broad because it would flag every new customer or every high-value order independently.
  • Concatenation combines values as text but does not evaluate the business rule or return the required category.
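
The grouping in the rule can be checked with a short Python sketch. The function name and the sample rows are illustrative assumptions, not part of the exam exhibit; a BI tool would express the same test in its own IF or CASE syntax.

```python
def review_status(chargeback_flag: bool, customer_status: str, order_amount: float) -> str:
    """Return the review label for one order row (illustrative helper)."""
    # Either path alone is enough, so the two paths are joined with OR.
    # The new-customer path needs both parts, so they are grouped with AND.
    if chargeback_flag or (customer_status == "New" and order_amount >= 500):
        return "Manual review"
    return "Standard"


# Hypothetical rows: (chargeback_flag, customer_status, order_amount)
rows = [
    (True, "Existing", 40.0),    # chargeback alone triggers review
    (False, "New", 500.0),       # new customer at the 500 threshold
    (False, "New", 499.99),      # new customer just under the threshold
    (False, "Existing", 900.0),  # high value but not a new customer
]
labels = [review_status(*row) for row in rows]
print(labels)
```

The last two rows show why the all-OR pattern is too broad: each satisfies one half of the new-customer path, yet neither should be flagged.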

Question 40

Topic: Data Acquisition and Preparation

A retail analyst receives a customer export to build a monthly churn report by region for executives. The report requires one row per customer, a valid cancellation date when churned, and a standardized region value. Which exploration step should the analyst perform first to determine whether the dataset is fit for this purpose?

Options:

  • A. Create the final churn trend dashboard

  • B. Delete records with any blank fields

  • C. Aggregate churn counts by product category

  • D. Profile customer IDs, cancellation dates, and region values

Best answer: D

Explanation: Data exploration should focus on the fields that directly support the analytical purpose. For a churn report by month and region, the analyst needs to verify uniqueness at the customer level, usable cancellation dates for monthly grouping, and consistent region categories for segmentation. Profiling these fields can reveal whether the dataset has duplicates, missing values, invalid dates, or inconsistent labels that would make the report unreliable.

The key takeaway is to assess fitness for the intended analysis before transforming, deleting, aggregating, or visualizing the data.

  • Dashboard first skips the quality check needed before executives rely on the churn results.
  • Deleting blanks may remove valid records and loses traceability before the analyst understands which blanks matter.
  • Product aggregation supports a different analysis and does not test the required month-by-region churn fields.

Question 41

Topic: Data Acquisition and Preparation

A retail analyst is asked to summarize customer satisfaction for all store shoppers. Due to time constraints, the team collected responses only from shoppers who voluntarily completed a tablet survey near the checkout counter during one Saturday afternoon. Which sampling limitation should the analyst document?

Options:

  • A. Random sample

  • B. Convenience-like sample

  • C. Incomplete sample

  • D. Fully unbiased sample

Best answer: B

Explanation: A convenience-like sample uses participants who are easiest to access, which can limit how well the results represent the target population. In this scenario, shoppers chose whether to respond, and collection occurred only near checkout during one Saturday afternoon. That approach is convenient, but it may overrepresent certain shoppers and miss others who shop at different times, skip the tablet, or use other checkout methods. An incomplete sample would emphasize missing required records or fields from an intended collection, while a random sample would require each shopper to have a known chance of selection. The key takeaway is to document the collection limitation before generalizing the results to all shoppers.

  • Random selection fails because the team did not describe a process that gave all shoppers a known chance to be selected.
  • Incomplete sample is tempting because some shoppers are missing, but the main limitation is the convenience-based collection method.
  • Unbiased result fails because voluntary responses from one time and place can systematically skew the results.

Question 42

Topic: Visualization and Reporting

A data analyst is preparing an executive update on website conversion rates. The dataset contains weekly conversion rates for 12 weeks before and 12 weeks after a homepage redesign. A paid marketing campaign began the same week as the redesign, and web tracking was incomplete for two days. Which visualization approach is the best professional decision?

Options:

  • A. 3D pie chart of pre- and post-redesign conversions

  • B. Forecast chart projecting future redesign gains

  • C. Annotated weekly line chart with data-quality notes

  • D. Before-and-after bar chart labeled as redesign impact

Best answer: C

Explanation: The core issue is choosing a visualization that supports the message without implying stronger evidence than the data can support. A weekly line chart lets executives see the pattern before and after the redesign, while annotations can mark the redesign date, the marketing campaign, and the tracking gap. This communicates a possible association and preserves transparency about data quality and confounding factors. A simple before-and-after comparison would be easier to read, but it can overstate causation when another campaign started at the same time.

  • Causal label fails because the concurrent marketing campaign prevents attributing the change only to the redesign.
  • Pie chart fails because conversion rate over time is not a part-to-whole relationship.
  • Forecasting gains fails because the stem asks for an executive update, not unsupported prediction from confounded evidence.

Question 43

Topic: Data Concepts and Environments

An analytics team is comparing AI tools for automated reporting. Which interpretation is best supported by the exhibit?

Tool | Model note | Planned use
Assistant A | Pretrained on large text corpora; generates SQL explanations and summaries | Analyst Q&A
Platform B | Pretrained on text, image, and audio; adaptable for captioning, transcription, and document search | Multiple reporting automations

Options:

  • A. Both tools are only LLMs because both can process text.

  • B. Assistant A is a foundation model; Platform B is only RPA.

  • C. Assistant A is an LLM; Platform B is a broader foundation model.

  • D. Neither tool uses AI unless trained from scratch internally.

Best answer: C

Explanation: A large language model (LLM) is focused on language tasks such as generating, summarizing, or explaining text. A foundation model is the broader category: it is pretrained on large datasets and can be adapted to many tasks, which may include language, images, audio, or other modalities. In the exhibit, Assistant A is language-centered, so it fits the LLM description. Platform B supports multiple data types and reporting tasks, so it is best interpreted as a broader foundation model. The key distinction is scope: language-specific capability versus broad pretrained adaptability.

  • RPA confusion fails because Platform B describes pretrained AI capabilities, not rule-based task automation.
  • Text-only assumption fails because Platform B also includes image and audio capabilities.
  • Internal training myth fails because externally pretrained models can still be AI tools used by an organization.

Question 44

Topic: Data Governance

A marketing analytics team wants to send customer purchase data to an external vendor for a campaign-response analysis. The vendor only needs customer segment, region, purchase month, and purchase amount. The contract states that direct identifiers and unnecessary personal data must not leave the organization. Which sharing approach best meets these requirements?

Options:

  • A. Send the raw file and require the vendor to delete extra fields

  • B. Send the full customer table after encrypting the file

  • C. Share the internal dashboard and ask the vendor to export needed data

  • D. Share a minimized, de-identified extract through an approved secure channel

Best answer: D

Explanation: When data leaves a team, system, vendor, or organizational boundary, sharing controls should be applied before release. In this scenario, the vendor has a narrow business need and the contract prohibits direct identifiers and unnecessary personal data from leaving the organization. The best approach is to create a least-privilege extract containing only required fields, remove or de-identify identifiers, and use an approved secure transfer method. Encryption is important, but it does not make unnecessary data appropriate to share. The key takeaway is to combine data minimization with protection controls before external sharing occurs.

  • Encryption only protects the file during transfer but still exposes unnecessary fields if the full table is sent.
  • Dashboard export may bypass governed extract controls and can expose more data than the vendor needs.
  • Vendor cleanup shifts control outside the organization after exposure has already occurred.

Question 45

Topic: Data Acquisition and Preparation

A data analyst is preparing transaction data for a monthly revenue report. The business owner warns that unusually high or low amounts may be legitimate bulk orders, refunds, data-entry errors, payment-system faults, or one-time promotional events. Which preparation approach best meets this requirement before the report is published?

Options:

  • A. Remove all values outside the interquartile range

  • B. Flag and investigate outliers before assigning a treatment

  • C. Ignore outliers if the monthly total reconciles

  • D. Replace all unusual values with the median amount

Best answer: B

Explanation: Outlier handling should start with identification and investigation, not automatic deletion or replacement. In this scenario, the same unusual value pattern could mean several different things: a valid extreme transaction, a refund, a data-entry issue, a system fault, or a special business event. A good preparation approach flags suspected outliers, checks source records or logs, consults business rules, and documents the reason for the final treatment. This preserves valid business signals while preventing bad data from distorting the report.

The key takeaway is that outliers are not automatically errors; they require context before remediation.

  • Automatic removal fails because true bulk orders or special events could be valid business activity.
  • Median replacement fails because it changes values before confirming whether they are erroneous.
  • Ignoring outliers fails because reconciliation alone may hide system faults or transaction-level quality problems.
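
A minimal sketch of the flag-first approach, assuming the common 1.5 × IQR fence as the screening rule (the question does not prescribe one) and using made-up transaction amounts. The point is that values are marked for investigation and left unchanged.

```python
import statistics


def flag_outliers(amounts, k=1.5):
    """Return (amount, is_suspected_outlier) pairs using IQR fences.

    Nothing is removed or replaced; the flag only marks rows to investigate.
    """
    q1, _, q3 = statistics.quantiles(amounts, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(a, a < low or a > high) for a in amounts]


# Hypothetical transaction amounts, including a large order and a refund.
amounts = [120.0, 95.5, 110.0, 130.25, 105.0, 9800.0, -450.0, 115.0]
flagged = flag_outliers(amounts)
to_investigate = [a for a, is_outlier in flagged if is_outlier]
print(to_investigate)
```

Each flagged amount would then be checked against source records or business rules before a treatment is chosen and documented.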

Question 46

Topic: Data Governance

A data analyst supports a weekly customer-quality review for a data governance team. The customer table is loaded from three source systems, and recent reports show missing email addresses, duplicate customer IDs, invalid postal codes, and inconsistent state abbreviations. The team needs recurring metrics to track these issues over time before approving remediation work. What is the best professional decision?

Options:

  • A. Archive older customer records to reduce table size

  • B. Build a predictive churn model using the table

  • C. Create a one-time corrected extract for the review

  • D. Run scheduled data profiling on the customer table

Best answer: D

Explanation: Data profiling examines a dataset and produces quality metrics such as completeness, validity, uniqueness, and consistency. In this scenario, the governance team needs recurring measurements for specific data-quality issues across multiple source systems. Scheduled profiling supports trend tracking and helps prioritize remediation with evidence. It is more appropriate than fixing a single extract because the requirement is to understand and monitor quality patterns over time, not just make one report look correct.

  • One-time extract may temporarily hide issues, but it does not provide recurring quality metrics or trend evidence.
  • Predictive modeling uses data for forecasting, but it does not address the stated governance need to measure data quality.
  • Archiving records reduces volume, but it does not evaluate missing, duplicate, invalid, or inconsistent values.
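
As a rough illustration of what one profiling run might compute, the sketch below counts completeness, uniqueness, and validity issues for a handful of made-up customer rows. The field names and the five-digit postal rule are assumptions for the example; a scheduler would rerun this and store the metrics with a run date to build the trend.

```python
import re


def profile_customers(rows):
    """Return simple quality metrics for a list of customer dicts."""
    total = len(rows)
    ids = [r["customer_id"] for r in rows]
    missing_email = sum(1 for r in rows if not r.get("email"))
    # Assumption: postal codes must be exactly five digits.
    invalid_postal = sum(
        1 for r in rows if not re.fullmatch(r"\d{5}", r.get("postal_code") or "")
    )
    return {
        "row_count": total,
        "email_completeness": (total - missing_email) / total if total else 0.0,
        "duplicate_customer_ids": total - len(set(ids)),
        "invalid_postal_codes": invalid_postal,
    }


# Hypothetical rows standing in for the merged customer table.
rows = [
    {"customer_id": 1, "email": "a@example.com", "postal_code": "30301"},
    {"customer_id": 2, "email": "", "postal_code": "3030"},
    {"customer_id": 2, "email": "b@example.com", "postal_code": "94105"},
    {"customer_id": 3, "email": None, "postal_code": "ABCDE"},
]
metrics = profile_customers(rows)
print(metrics)
```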

Question 47

Topic: Data Analysis

A sales analyst is asked why the daily e-commerce dashboard does not match the finance month-end revenue report.

Source | Reliability | Refresh | Preparation applied
Checkout transactions | Operational source of truth | Hourly | Removes test orders, deduplicates transaction_id, converts time zone
Ad platform export | Campaign attribution estimate | Daily at 2:00 a.m. | No deduplication; includes pending returns
Finance ledger | Audited revenue source | Monthly after close | Applies accounting adjustments

The dashboard was viewed on March 15 at 10:00 a.m. The finance report is closed through February 29. Which conclusion is best supported?

Options:

  • A. The finance report is less reliable because it is refreshed monthly.

  • B. The mismatch is expected; reconcile the same closed period using aligned preparation rules.

  • C. The ad platform should replace both sources because it includes attribution.

  • D. The dashboard proves March revenue is higher because it refreshes hourly.

Best answer: B

Explanation: Evidence-based conclusions should consider source reliability, refresh timing, and preparation before explaining differences. Here, the checkout data is reliable for current operational monitoring, but the finance ledger is the audited revenue source and is only closed through February 29. A March 15 dashboard and a February month-end report are not measuring the same time window. Preparation also differs: checkout data is deduplicated and standardized, while the finance ledger applies accounting adjustments. The defensible conclusion is to treat the mismatch as expected and compare the same closed period using aligned business rules before stating whether revenue truly differs.

  • Freshness overreach fails because hourly refresh supports current monitoring, not an audited month-end revenue conclusion.
  • Monthly refresh misconception fails because slower refresh does not make the finance ledger unreliable for closed financial reporting.
  • Attribution source confusion fails because campaign-attributed estimates are not the revenue source of truth and still need preparation.

Question 48

Topic: Data Analysis

A city transportation analyst has survey responses from 1,200 randomly selected bus riders and wants to estimate whether the overall rider population supports extending weekend service. The survey sample includes age and route fields, but a small number of responses have missing demographic values. The planning director needs a defensible conclusion for all riders, not just a summary of the sample. Which analysis approach is the BEST professional decision?

Options:

  • A. Create descriptive charts of sample response counts

  • B. Use inferential analysis with confidence intervals

  • C. Delete all incomplete records before reporting totals

  • D. Build a predictive model for individual rider behavior

Best answer: B

Explanation: Inferential analysis is appropriate when a sample is used to draw a conclusion about a larger population. Here, the analyst has a random sample of bus riders and the director needs a defensible statement about all riders. A confidence interval or hypothesis test can quantify uncertainty around the estimated level of support. The missing demographic values should be reviewed and handled consistently, but they do not change the core method selection if the support response is usable. Descriptive analysis would summarize only the 1,200 respondents, while predictive analysis would answer a different question.

  • Descriptive summary fails because counts and charts describe the sample but do not support a population-level conclusion.
  • Predictive modeling fails because the goal is estimating overall support, not predicting an individual rider outcome.
  • Record deletion fails because removing incomplete responses without assessing impact can introduce bias and ignores the inferential goal.
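
To make the confidence-interval idea concrete, here is a small sketch using the normal-approximation interval for a proportion. The count of 780 supporters is invented for illustration; the question gives only the sample size of 1,200.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a sample proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


# Hypothetical result: 780 of 1,200 sampled riders support the extension.
low, high = proportion_ci(780, 1200)
print(f"Estimated support: {low:.3f} to {high:.3f}")
```

Reporting the interval rather than the single sample percentage is what turns a description of 1,200 respondents into a hedged statement about all riders.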

Question 49

Topic: Data Acquisition and Preparation

A data analyst is preparing support-ticket text for analysis. The ticket ID should be captured only when it follows the same repeatable pattern: two uppercase letters, a hyphen, four digits, a hyphen, and three digits.

Exhibit: Sample values

Raw text | Needed output
Refund request AB-2024-118 received | AB-2024-118
Follow-up for CD-2023-007 closed | CD-2023-007
No valid ticket reference | null

Which method is the best next action?

Options:

  • A. Impute missing ticket IDs with the mode

  • B. Apply a RegEx extraction rule

  • C. Convert the raw text column to a date type

  • D. Bin the raw text values by length

Best answer: B

Explanation: Regular expressions are used to find, extract, or validate text that follows a repeatable pattern. In this case, the needed ticket ID is embedded inside longer text and has a consistent structure: uppercase letters, hyphens, and fixed digit groups. A RegEx extraction rule can return the matching ID when present and leave rows without a valid match as null or flagged for review. This preserves the source text while creating a clean analysis field.

Type conversion, binning, and imputation do not identify the patterned substring inside each ticket note.

  • Date conversion fails because the full raw text is not a date field and contains non-date content.
  • Length binning groups records by character count but does not extract the ticket ID.
  • Mode imputation fills missing values but would invent repeated ticket IDs rather than locate valid ones.
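
One possible extraction rule for the pattern in the exhibit, sketched with Python's `re` module. The pattern and helper name are illustrative; the word boundaries are an added assumption so that longer tokens are not partially matched.

```python
import re

# Two uppercase letters, hyphen, four digits, hyphen, three digits.
TICKET_ID = re.compile(r"\b[A-Z]{2}-\d{4}-\d{3}\b")


def extract_ticket_id(raw_text):
    """Return the first matching ticket ID, or None when no valid ID exists."""
    match = TICKET_ID.search(raw_text)
    return match.group(0) if match else None


samples = [
    "Refund request AB-2024-118 received",
    "Follow-up for CD-2023-007 closed",
    "No valid ticket reference",
]
extracted = [extract_ticket_id(text) for text in samples]
print(extracted)
```

The raw text column is left untouched, and rows without a match come back as None so they can be kept as null or flagged for review.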

Question 50

Topic: Data Concepts and Environments

A retail analytics team needs one repository for raw clickstream logs, product images, and curated sales tables. Data scientists must access raw files for experimentation, while BI users need governed SQL reporting, documented lineage, and consistent access controls. Which repository concept best fits these requirements?

Options:

  • A. Flat file repository

  • B. Operational database

  • C. Data lakehouse

  • D. Data silo

Best answer: C

Explanation: A data lakehouse is designed for scenarios that need the flexibility of a data lake and the managed analytics features of a data warehouse. In this case, the team must store diverse raw data such as logs and images, while also supporting curated tables, SQL reporting, lineage, and governed access. That combination points to a lakehouse rather than a single-purpose operational store or disconnected file location. The key takeaway is that lakehouses bridge exploratory storage and governed analytical consumption.

  • Data silo fails because it isolates data and usually makes shared governance, lineage, and cross-team analytics harder.
  • Operational database fails because it is optimized for transaction processing, not mixed raw file storage and governed analytics.
  • Flat file repository fails because it can store files but does not inherently provide warehouse-style controls for BI reporting.

Questions 51-75

Question 51

Topic: Data Acquisition and Preparation

A data analyst is profiling a staging table before building a monthly inventory report. The report must include every approved store, product category, and calendar month, even when the quantity is zero. Required fields are store_id, category, month_start, and quantity_on_hand. Which preparation approach should the analyst use first to assess completeness?

Options:

  • A. Build an expected grid and left join the staging data.

  • B. Drop rows that contain any null required field.

  • C. Aggregate quantities by store and compare grand totals.

  • D. Standardize category names and date formats only.

Best answer: A

Explanation: Completeness means verifying that all required data is present, not just that existing rows look valid. In this scenario, the analyst needs to confirm that every required store, category, and month appears in the data. Creating the full expected set from approved reference lists and comparing it to the staging table identifies absent combinations, such as a missing month for a store-category pair. The same profiling step can also flag records where required fields are null. This is stronger than row cleanup or summary comparison because missing source records may not appear as visible errors in the staging table.

  • Dropping nulls removes evidence of incomplete data and does not detect combinations that are entirely absent.
  • Standardizing values improves consistency, but it does not prove that every required period or category exists.
  • Comparing totals can hide gaps because missing records may be offset by other values.
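
The expected-grid check can be sketched in plain Python. The store, category, and month lists below are hypothetical reference data, and the dictionary lookup plays the role of the left join: an expected key with no staging row is an absent combination.

```python
from itertools import product

# Hypothetical approved reference lists.
stores = ["S01", "S02"]
categories = ["Grocery", "Apparel"]
months = ["2024-01-01", "2024-02-01"]

# Hypothetical staging rows: (store_id, category, month_start, quantity_on_hand)
staging = [
    ("S01", "Grocery", "2024-01-01", 40),
    ("S01", "Grocery", "2024-02-01", 0),   # zero is valid, not missing
    ("S01", "Apparel", "2024-01-01", 12),
    ("S02", "Grocery", "2024-01-01", None),
    ("S02", "Grocery", "2024-02-01", 7),
    ("S02", "Apparel", "2024-02-01", 3),
]


def completeness_report(stores, categories, months, staging):
    """Compare the full expected grid with staging and list the gaps."""
    by_key = {(s, c, m): qty for s, c, m, qty in staging}
    missing, null_quantity = [], []
    for key in product(stores, categories, months):
        if key not in by_key:
            missing.append(key)          # combination entirely absent
        elif by_key[key] is None:
            null_quantity.append(key)    # row present, required field null
    return missing, null_quantity


missing, null_quantity = completeness_report(stores, categories, months, staging)
print(missing)
print(null_quantity)
```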

Question 52

Topic: Visualization and Reporting

A sales operations dashboard must show prior-day results by 8:00 a.m. for regional managers. Users report that the dashboard still shows two-day-old figures at 8:30 a.m. The source system finishes its nightly export at 7:45 a.m., but the dashboard dataset refresh is scheduled for 7:00 a.m. What is the most likely refresh-rate issue?

Options:

  • A. The refresh schedule runs before the source update completes.

  • B. The source data contains duplicate records.

  • C. The dashboard needs a different chart type.

  • D. The managers need row-level security added.

Best answer: A

Explanation: A slow or stale refresh issue often occurs when scheduled reporting processes are not aligned with source-system update timing. In this scenario, the business requirement is prior-day reporting by 8:00 a.m., but the dashboard refresh runs at 7:00 a.m. while the source export does not finish until 7:45 a.m. The report is refreshing successfully, but it is refreshing too early to capture the newest data. The practical fix would be to schedule the dashboard refresh after the source export completes, with enough buffer for processing before the 8:00 a.m. requirement.

  • Wrong visual focus fails because chart type affects presentation, not whether current data is loaded.
  • Data quality issue is not supported because duplicates would distort totals, not explain two-day-old figures.
  • Access control issue fails because row-level security affects who can see data, not when data refreshes.

Question 53

Topic: Data Governance

A data analyst maintains a monthly executive revenue dashboard that combines CRM opportunity data with billing exports. The workflow standardizes region codes, removes test accounts, and maps product names before publishing. Finance needs an audit-friendly document explaining why dashboard revenue differs from raw billing totals and how each reported field is produced. Which documentation artifact is the best choice?

Options:

  • A. Data lineage documentation for the dashboard dataset

  • B. A user access matrix for executive report viewers

  • C. A high-level project charter for the dashboard refresh

  • D. A visual style guide for dashboard colors and labels

Best answer: A

Explanation: Data lineage documentation is the governance artifact that shows where data comes from, how it changes, and where it is used. In this scenario, Finance needs traceability from CRM and billing sources through region standardization, test-account removal, product mapping, and final dashboard fields. That makes lineage more useful than general project, access, or design documentation because it connects source data to transformed report outputs in an audit-friendly way. The key takeaway is to match the artifact to the question being asked: source-to-report traceability calls for lineage documentation.

  • Project charter describes purpose, scope, and stakeholders, but it does not trace field-level transformations or report calculations.
  • Access matrix documents who can view or modify assets, not how revenue values are derived.
  • Style guide supports consistent presentation, but it does not explain source integration or transformation logic.

Question 54

Topic: Data Acquisition and Preparation

A data analyst receives order data from an API. Each record has one order_id, one customer_id, and a products field that contains a list of purchased SKUs with quantities. The merchandising team needs a row-level dataset to analyze sales by SKU while preserving the original order_id for traceability. Which preparation step is the BEST professional decision?

Options:

  • A. Explode the products list into separate rows

  • B. Scale the quantity values to a common range

  • C. Append customer demographic columns to each order

  • D. Impute missing SKU values using the most common SKU

Best answer: A

Explanation: Exploding is the appropriate transformation when a field contains nested or list-like values that must be analyzed as separate observations. In this case, each order can contain multiple products, but the reporting need is SKU-level analysis. Exploding the products list creates one row per SKU or SKU-quantity pair and keeps the order_id available for lineage and traceability. This avoids treating a multi-product list as one unusable text value and supports aggregation by SKU.

  • Imputing SKUs would invent product values and does not solve the nested-list structure.
  • Scaling quantities changes numeric value ranges but leaves multiple SKUs trapped inside one field.
  • Appending demographics is augmentation and may add context, but it does not create item-level rows for analysis.
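
A minimal sketch of the explode step, assuming each API record carries products as a list of SKU and quantity pairs. The record shape and the key names `sku` and `qty` are assumptions for the example.

```python
# Hypothetical API records with a nested products list.
orders = [
    {"order_id": "O-1", "customer_id": "C-9",
     "products": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"order_id": "O-2", "customer_id": "C-4",
     "products": [{"sku": "A1", "qty": 5}]},
]


def explode_products(orders):
    """Return one row per order and SKU, repeating order_id for traceability."""
    rows = []
    for order in orders:
        for item in order["products"]:
            rows.append({
                "order_id": order["order_id"],
                "customer_id": order["customer_id"],
                "sku": item["sku"],
                "qty": item["qty"],
            })
    return rows


sku_rows = explode_products(orders)

# With item-level rows, aggregation by SKU becomes straightforward.
qty_by_sku = {}
for row in sku_rows:
    qty_by_sku[row["sku"]] = qty_by_sku.get(row["sku"], 0) + row["qty"]
print(qty_by_sku)
```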

Question 55

Topic: Data Governance

A VP must decide at 12:15 PM whether to extend a flash sale. The decision requires order data current through noon and an auditable source.

Dataset | Governance note | Refresh interval | Last refresh
Certified sales mart | Source of truth for daily sales reporting | Nightly | 2:00 AM
Operational orders view | Approved for intraday monitoring | Every 15 minutes | 12:00 PM
Marketing tracker | Manually maintained | Hourly | 12:00 PM

Which recommendation is the best professional decision?

Options:

  • A. Use the marketing tracker because it refreshed at noon.

  • B. Use the operational orders view with an as-of timestamp.

  • C. Use the certified sales mart because it is the source of truth.

  • D. Postpone the decision until the nightly sales mart refresh.

Best answer: B

Explanation: Refresh interval is a governance fact because it defines how current a dataset can be for a decision. The certified sales mart is authoritative for daily reporting, but its nightly refresh means it does not include the flash-sale activity needed by noon. The operational orders view is approved for intraday monitoring and refreshed at 12:00 PM, so it best satisfies both currency and auditability. The analyst should include an as-of timestamp so the VP understands the data version used for the decision.

Being the source of truth for one reporting purpose does not automatically make a dataset current enough for every decision.

  • Certified but stale fails because the sales mart is authoritative for daily reporting but last refreshed at 2:00 AM.
  • Current but weak governance fails because the manual tracker is not the approved auditable source.
  • Unnecessary delay fails because an approved intraday dataset already satisfies the business need.

Question 56

Topic: Visualization and Reporting

A data analyst is redesigning a sales performance dashboard for regional managers. The dashboard must be readable on projectors, usable by viewers with common color-vision deficiencies, and make high-risk values stand out without changing the company logo. Which color-scheme approach best meets these requirements?

Options:

  • A. Use a colorblind-safe palette with high contrast and reserve an accent color for high-risk values

  • B. Use the company logo colors for every chart and KPI status

  • C. Use red and green status colors only for all performance categories

  • D. Use a rainbow gradient across all measures to add visual variety

Best answer: A

Explanation: Effective dashboard color schemes should improve interpretation, not decorate the report. In this scenario, the design must work on projectors, support users with common color-vision deficiencies, and highlight high-risk values. A high-contrast, colorblind-safe palette improves readability and accessibility, while reserving one accent color for risk helps the audience notice the most important exception. Branding can still appear in the logo or limited design elements, but it should not control every data encoding choice.

The key takeaway is to use color sparingly and consistently to communicate meaning.

  • Brand-first coloring fails because logo colors may not provide enough contrast or accessible data encoding.
  • Rainbow gradients add decoration and can make category or value comparisons harder to interpret.
  • Red-green status only is risky because many viewers with color-vision deficiencies cannot reliably distinguish those colors.

Question 57

Topic: Data Concepts and Environments

A data analyst must feed an operations dashboard with the most current order status from a fulfillment application. Which source should the analyst select based on the exhibit?

Exhibit: Available source notes

Source | Behavior
App endpoint | GET /v1/orders/{order_id}/status; returns JSON fields order_id, status, updated_at; reflects app changes within minutes
SFTP export | CSV file delivered nightly at 2:00 a.m.
Warehouse table | Loaded after the nightly CSV export completes
Admin page | Human-readable HTML page; layout changes without notice

Options:

  • A. SFTP export

  • B. Admin page

  • C. App endpoint

  • D. Warehouse table

Best answer: C

Explanation: An API source is the best fit when an application exposes current data through a documented request and response pattern. In the exhibit, the app endpoint specifies the request path, response fields, format, and update behavior, making it suitable for a recurring dashboard feed. The nightly CSV and warehouse table are stale for near-current operations, and the admin page is not a reliable data interface because its HTML layout can change without notice. The key distinction is using a defined machine-readable interface instead of exports or screen scraping.

  • Nightly export fails because it is delivered only once per day, so it cannot provide the most current status.
  • Warehouse table fails because it depends on the same nightly load and inherits that delay.
  • Admin page fails because changing HTML layout makes it fragile for repeatable data acquisition.

Question 58

Topic: Data Analysis

A marketing analyst is reviewing a draft report about a new recommendation feature. Which interpretation is best supported by the exhibit?

Exhibit: Draft report evidence

Source: Loyalty app purchases, beta users only
Period: 1 week after feature launch
Records: 1,240 customers who opted into beta
Result: Average basket size was 12% higher than the prior week
Comparison group: None
Known event: Weekend flash sale during the beta week
Draft claim: "The feature caused a 12% lift and will increase sales next month."

Options:

  • A. Infer causation because the lift is 12%

  • B. Forecast next month using only the beta-week average

  • C. Approve the causal and predictive claim as written

  • D. Treat the result as descriptive and collect stronger evidence first

Best answer: D

Explanation: The exhibit supports a descriptive observation: beta users spent more during the launch week than they did the prior week. It does not provide enough evidence for an inferential claim about all customers or a predictive claim about next month. The sample is self-selected, there is no control group, and a flash sale occurred during the same week. Any of these could explain the higher basket size. A stronger next step would be to collect representative data, compare against a control group, or validate a predictive model on appropriate holdout data. The key distinction is that an observed increase is not enough to prove cause or predict future performance.

  • Approving the claim fails because the evidence does not isolate the feature from other causes.
  • Using lift size alone fails because a large percentage change still may be biased or confounded.
  • Forecasting from one week fails because the data is too narrow and not validated for prediction.

Question 59

Topic: Data Acquisition and Preparation

A data analyst is preparing numeric fields for a customer segmentation model. The inputs include annual spend in dollars, visits per year, and satisfaction score from 1 to 5. The clustering method is distance-based, the profile shows no severe outliers, and the analytics team wants each value expressed relative to the typical customer and spread of that field. What is the BEST preparation step?

Options:

  • A. Convert each numeric field into high, medium, and low bins

  • B. Apply min-max scaling to force each field between 0 and 1

  • C. Leave the fields unchanged to preserve original business units

  • D. Apply z-score standardization to each numeric field

Best answer: D

Explanation: Standardization is the best fit when numeric variables have different units and a distance-based method should treat them comparably while interpreting values relative to the field’s average and spread. A z-score converts each value into the number of standard deviations above or below the mean. This prevents annual spend from dominating the clustering simply because it has larger raw numbers than satisfaction score. Scaling, such as min-max scaling, also makes ranges comparable, but it expresses values within a fixed range rather than relative to the distribution’s mean and standard deviation. The key distinction is fixed range versus distribution-relative units.

  • Min-max scaling is plausible for range alignment, but it does not express values as deviations from the typical customer.
  • Raw values would let large-unit fields dominate distance calculations.
  • Binning values reduces numeric detail and is unnecessary for making continuous fields comparable.
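
The difference between the two rescaling approaches can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the customer values and the helper names `z_scores` and `min_max` are invented for the example, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as standard deviations above or below the field mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values into the fixed range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical customers: annual spend in dollars, satisfaction score from 1 to 5.
annual_spend = [1200.0, 1500.0, 1800.0, 2100.0, 2400.0]
satisfaction = [1.0, 2.0, 3.0, 4.0, 5.0]

spend_z = z_scores(annual_spend)          # distribution-relative units
satisfaction_z = z_scores(satisfaction)   # same units, so distances are comparable
spend_scaled = min_max(annual_spend)      # fixed 0-1 range instead
```

After standardization both fields are centered on 0 with a spread of 1, so dollar amounts no longer dominate a distance calculation. Min-max scaling aligns the ranges but says nothing about how far a customer sits from the typical value.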

Question 60

Topic: Data Concepts and Environments

A marketing team uses an automated reporting tool to generate a weekly sales narrative and dashboard. Before distribution, the analyst reviews the run summary.

Exhibit: Automated report run summary

Report: Weekly regional sales
Orders extract: refreshed successfully
CRM region lookup: refresh failed; prior snapshot used
Join validation: 18% of orders unmatched to region
Duplicate invoice IDs: 73 found
Human review status: not completed

What should the analyst do next?

Options:

  • A. Hold the report for source validation and quality review

  • B. Rewrite the automated narrative to avoid mentioning regions

  • C. Distribute the report because the orders extract refreshed successfully

  • D. Archive the run summary and wait for next week’s report

Best answer: A

Explanation: Automated reporting can speed up recurring analysis, but it does not remove the need for source validation, review, and data-quality controls. In this run, one source failed to refresh, a prior snapshot was used, many records did not join to a region, duplicates were found, and human review is incomplete. These issues can change regional sales totals and mislead the audience. The appropriate next action is to stop distribution until the sources are validated, data-quality exceptions are investigated, and the report is reviewed. Automation supports reporting; it does not make unvalidated output trustworthy.

  • Successful partial refresh is not enough because the CRM lookup failed and affects regional reporting.
  • Narrative editing hides the symptom but does not fix stale source data, unmatched records, or duplicates.
  • Waiting for the next run leaves known quality issues unresolved and could allow bad reporting to continue.
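
A hold-or-release check like the one described above can be sketched as a small gate function. Everything here is illustrative: the dictionary keys, the `release_blockers` helper, and the 2% unmatched-row tolerance are assumptions, not settings from any particular reporting tool.

```python
def release_blockers(run):
    """Return the reasons a report run should be held; an empty list means release."""
    blockers = []
    if not all(run["sources_refreshed"].values()):
        blockers.append("source refresh failed")
    if run["unmatched_rate"] > run["max_unmatched_rate"]:
        blockers.append("too many unmatched join rows")
    if run["duplicate_ids"] > 0:
        blockers.append("duplicate invoice IDs found")
    if not run["human_review_done"]:
        blockers.append("human review not completed")
    return blockers


# Values taken from the exhibit, plus an assumed tolerance for unmatched rows.
run_summary = {
    "sources_refreshed": {"orders": True, "crm_region_lookup": False},
    "unmatched_rate": 0.18,
    "max_unmatched_rate": 0.02,  # hypothetical threshold
    "duplicate_ids": 73,
    "human_review_done": False,
}

blockers = release_blockers(run_summary)
hold_report = len(blockers) > 0
```

With the exhibit's values, all four checks fail, so the run is held rather than distributed.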

Question 61

Topic: Data Acquisition and Preparation

A retail analyst is preparing customer data for a dashboard showing customer counts, average monthly spend, and segments by region and signup channel. Source notes say blank strings, NULL, Unknown, and age value 999 may indicate data was not collected; 0 monthly spend can be valid for inactive customers. Which preparation approach best prevents missing values from distorting the dashboard?

Options:

  • A. Replace all missing numeric values with 0 before averaging

  • B. Group missing categories into the largest existing segment

  • C. Delete every row containing any blank or placeholder value

  • D. Profile fields, standardize missing markers, and flag them before calculations

Best answer: D

Explanation: Missing values can distort metrics differently depending on the field and calculation. For this dashboard, blanks, NULL, Unknown, and 999 should be profiled and standardized so they are treated consistently. Because 0 monthly spend is valid, it should not be overwritten or treated as missing. Adding a missing-value flag or explicit “missing/unknown” category where appropriate helps analysts see how much incomplete data affects counts, averages, segmentation, and later transformations. The key is to identify and document missingness before applying deletion or imputation.

  • Zero imputation fails because valid 0 spend differs from unknown spend and would bias averages.
  • Row deletion fails because it may remove useful records and distort customer counts or segment sizes.
  • Largest-segment grouping fails because it hides missing categories and creates misleading segmentation.
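
One way to standardize and flag missing markers before any calculation is sketched below. The rows, the field names, and the helpers `clean_text` and `clean_age` are hypothetical; the marker list follows the source notes in the question.

```python
MISSING_MARKERS = {"", "null", "unknown"}  # text markers named in the source notes
AGE_PLACEHOLDER = 999                      # placeholder meaning "not collected"


def clean_text(value):
    """Return None for any recognized missing marker, else the trimmed text."""
    if value is None or value.strip().lower() in MISSING_MARKERS:
        return None
    return value.strip()


def clean_age(value):
    """Return None for a missing or placeholder age."""
    return None if value is None or value == AGE_PLACEHOLDER else value


# Hypothetical rows: (region, age, monthly_spend)
rows = [
    ("East", 34, 120.0),
    ("Unknown", 999, 0.0),  # 0 spend is valid for an inactive customer
    ("", 41, None),         # spend was not collected
    ("West", 29, 80.0),
]

prepared = []
for region, age, spend in rows:
    prepared.append({
        "region": clean_text(region) or "Missing/Unknown",  # explicit category
        "age": clean_age(age),
        "monthly_spend": spend,            # valid zeros are left untouched
        "spend_missing": spend is None,    # flag instead of imputing
    })

# Average only over customers whose spend is actually known.
known_spend = [r["monthly_spend"] for r in prepared if not r["spend_missing"]]
avg_spend = sum(known_spend) / len(known_spend)
```

The valid zero stays in the average, the unknown spend stays out of it, and the flag makes the amount of missing data visible on the dashboard.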

Question 62

Topic: Data Concepts and Environments

A data analyst is helping deploy a departmental analytics database that will run on a virtual machine. The database engine requires a disk it can format and manage directly, with low-latency random reads and writes for indexes and transaction files. Which storage type is the BEST choice?

Options:

  • A. Shared file storage

  • B. Block storage volume

  • C. Data lake repository

  • D. Object storage bucket

Best answer: B

Explanation: Block storage is the best fit when an application, virtual machine, or database needs raw volume-style storage behavior. It presents storage as a disk-like volume that the operating system or database engine can format, mount, and manage directly. This supports low-latency random I/O patterns commonly needed for database files, indexes, and transaction logs.

Object storage is better for files or blobs accessed through object APIs, and shared file storage is better when multiple users or systems need a common file path. The key signal in the stem is that the database needs direct disk-style control, not just a place to store datasets.

  • Object storage is useful for large files and semi-structured data, but it does not provide a raw disk volume to the database engine.
  • Shared file storage supports file-level access, but it is not the best match for direct database volume management.
  • Data lake repository is a broader analytics storage pattern, not a low-level disk-style volume for database files.

Question 63

Topic: Data Analysis

A support manager reviews a weekly dashboard that reports an average ticket resolution time of 9.8 hours. The analyst validates the source data before presenting the KPI.

Exhibit: Ticket resolution profile

Ticket group | Tickets | Resolution time values
Resolved tickets | 48 | 44 tickets under 4 hours; 3 tickets at 6-8 hours; 1 ticket at 240 hours
Open tickets | 12 | resolved_at is null

Which conclusion is best supported by the exhibit?

Options:

  • A. The count is misleading because duplicate tickets are shown.

  • B. The median is misleading because it ignores all resolved tickets.

  • C. The rate is misleading because ticket groups are aggregated by week.

  • D. The average is misleading due to an outlier and missing open tickets.

Best answer: D

Explanation: An average can be misleading when a distribution is skewed by an extreme outlier or when important records are missing from the calculation. In this exhibit, most resolved tickets are under 4 hours, but one 240-hour ticket pulls the average upward. Also, 12 open tickets have null resolved_at values, so they are not included in the resolution-time average even though they matter operationally. A better presentation would pair the mean with the median, show the outlier separately, and report the open-ticket count or aging. The key issue is not that averages are unusable, but that this average alone does not represent the typical ticket experience.

  • Median misconception fails because the median would still use resolved-ticket values and is less affected by the 240-hour outlier.
  • Duplicate-count issue is unsupported because the exhibit shows no duplicated ticket evidence.
  • Grouping concern is unsupported because the exhibit does not show a rate calculation or a week-level aggregation problem.
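
The effect of a single outlier on the mean can be sketched with Python's `statistics` module. The exhibit gives only ranges, so the per-ticket times below are invented placeholders; the computed mean is illustrative and is not meant to reproduce the dashboard figure.

```python
from statistics import mean, median

# Assumed values: 44 tickets at 2 hours, 3 at 7 hours, 1 at 240 hours.
resolved_hours = [2.0] * 44 + [7.0] * 3 + [240.0]
open_ticket_count = 12  # null resolved_at, so excluded from any resolution-time average

avg_all = mean(resolved_hours)                     # pulled upward by the 240-hour ticket
typical = median(resolved_hours)                   # barely affected by the outlier
avg_without_outlier = mean([h for h in resolved_hours if h < 240.0])

print(f"mean={avg_all:.2f}h median={typical:.2f}h "
      f"mean_without_outlier={avg_without_outlier:.2f}h open={open_ticket_count}")
```

Reporting the median and the open-ticket count next to the mean shows both distortions at once: the outlier inflates the mean, and the open tickets are missing from it entirely.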

Question 64

Topic: Visualization and Reporting

A data analyst is designing a mobile dashboard for support managers. A line chart shows monthly ticket volume for three priority levels, each using a different color. The audience needs to compare the priority trends quickly, and there is limited screen space.

Which legend approach best fits the requirement?

Options:

  • A. Add a compact legend for the three priority colors.

  • B. Use a large legend with definitions of each priority policy.

  • C. Create a legend entry for every monthly data point.

  • D. Omit the legend because the colors are visibly different.

Best answer: A

Explanation: Legends should be used when they help the viewer interpret series, categories, or colors. In this case, the chart uses color to represent three priority levels, so a compact legend helps managers understand what each line means. Because the dashboard is mobile and space is limited, the legend should stay concise and close to the visual. The goal is quick interpretation, not a detailed explanation of business rules.

The key takeaway is to include a legend only when it adds clarity and to avoid turning it into visual clutter.

  • Omitting the legend fails because different colors do not explain which priority level each line represents.
  • A legend entry for every data point overwhelms the chart and confuses a legend with data labeling.
  • Policy definitions add too much detail for a compact trend visual and should be documented elsewhere.

Question 65

Topic: Visualization and Reporting

A sales operations dashboard imports 18 million transaction-level rows each refresh. Executives only need monthly revenue trends by region and the top 10 product categories, but the report takes 9 minutes to open and often times out on tablets. The source data is valid, but the report includes every line-item column for drill-through that executives rarely use. What is the BEST professional decision?

Options:

  • A. Add more visuals to summarize the transaction details

  • B. Increase the refresh frequency to keep the dashboard current

  • C. Export the full dataset as a spreadsheet instead

  • D. Aggregate to monthly region/category totals and apply top-10 filtering

Best answer: D

Explanation: Large data size is a common report-performance and consumption issue. When the audience needs summarized trends rather than row-level detail, the analyst should reduce the data loaded into the report by filtering, aggregating, or redesigning the model to match the decision need. In this scenario, monthly region/category totals and a top-10 category filter directly address slow loading and tablet timeouts without discarding valid source data needed elsewhere. The key is to change the report’s grain and scope, not simply add presentation layers or move the same oversized dataset to another format.

  • Refresh frequency does not reduce the amount of data loaded or solve slow report rendering.
  • More visuals can make performance worse if they still query transaction-level rows.
  • Spreadsheet export shifts the consumption problem to another tool and ignores the executive dashboard requirement.
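
Changing the report grain can be sketched as a pre-aggregation step that runs before the dashboard loads anything. The transaction rows and the helper names `monthly_totals` and `top_categories` are hypothetical; in practice this logic would more likely live in the warehouse or the BI model than in a script.

```python
from collections import defaultdict

# Hypothetical line items: (sale_date in ISO format, region, category, revenue)
transactions = [
    ("2025-01-03", "East", "Shoes", 100.0),
    ("2025-01-17", "East", "Shoes", 50.0),
    ("2025-01-20", "West", "Hats", 30.0),
    ("2025-02-02", "East", "Hats", 20.0),
    ("2025-02-09", "West", "Shoes", 70.0),
]


def monthly_totals(rows):
    """Collapse line items to one row per month, region, and category."""
    totals = defaultdict(float)
    for sale_date, region, category, revenue in rows:
        month = sale_date[:7]  # "YYYY-MM"
        totals[(month, region, category)] += revenue
    return dict(totals)


def top_categories(rows, n=10):
    """Return the n categories with the highest total revenue."""
    by_category = defaultdict(float)
    for _, _, category, revenue in rows:
        by_category[category] += revenue
    ranked = sorted(by_category.items(), key=lambda kv: kv[1], reverse=True)
    return [category for category, _ in ranked[:n]]


summary = monthly_totals(transactions)   # what the executive report loads
top = top_categories(transactions, n=10)
```

The report then reads the small summary table, while the transaction-level rows remain available in the source for drill-through elsewhere.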

Question 66

Topic: Visualization and Reporting

A data analyst must present quarterly customer satisfaction results to all employees at a company meeting. The audience includes nontechnical staff, executives want the message limited to three key takeaways, and the source survey data has already been validated and summarized. Which visualization deliverable is the BEST choice?

Options:

  • A. Geographic heat map by customer region

  • B. Infographic with key metrics and short annotations

  • C. Interactive pivot table with drill-down filters

  • D. Detailed statistical appendix with confidence intervals

Best answer: B

Explanation: An infographic is appropriate when the goal is to communicate a small number of validated findings as a clear story for a broad audience. It combines concise text, simple visuals, and highlighted metrics so nontechnical readers can understand the main message without exploring raw data or complex controls. In this scenario, the survey results are already validated and summarized, and executives requested only three key takeaways, so a narrative visual summary fits better than an analysis tool or detailed technical report. The key takeaway is to match the deliverable to the audience and communication goal, not to add interactivity or detail that is not needed.

  • Pivot table overreach fails because drill-down analysis is useful for exploration, not a concise company-wide narrative.
  • Statistical appendix mismatch fails because detailed intervals are too technical for the stated broad audience.
  • Heat map distraction fails because geography is not identified as the main story or decision need.

Question 67

Topic: Data Analysis

A data analyst publishes a monthly sales variance report for finance. The report shows $4.2 million in net sales, but the finance source system’s month-end control total shows $4.8 million for the same period. Finance states that the source system is the system of record. What should the analyst do first?

Options:

  • A. Validate the report extract against the source records and control total

  • B. Remove outlier transactions from the sales dataset

  • C. Add a note that the dashboard is an estimate

  • D. Adjust the report calculation to match the finance total

Best answer: A

Explanation: When analysis results conflict with a trusted upstream record or control total, the first troubleshooting step is source validation. The analyst should confirm that the extract, query filters, date range, joins, refresh time, and transformation logic align with the system of record. A control total provides a known benchmark for reconciliation, so it should be used to identify whether records were excluded, duplicated, transformed incorrectly, or pulled from the wrong version of the data. Changing the report to force agreement hides the root cause and weakens data integrity. The key takeaway is to validate against the authoritative source before modifying calculations or presentation.

  • Forced adjustment fails because matching the number manually does not prove the data is complete or accurate.
  • Outlier removal fails because the stem gives no evidence that unusual transactions are invalid.
  • Estimate labeling fails because finance needs a reconciled report, not a disclaimer for a source mismatch.
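
A reconciliation against a control total can be sketched as a comparison of keys and sums. The invoice IDs, amounts, and the `reconcile` helper are invented for illustration; only the idea of checking the extract against the system of record comes from the explanation above.

```python
# Hypothetical system-of-record rows and report extract, keyed by invoice ID.
source_records = {"INV-1": 1000, "INV-2": 2500, "INV-3": 1300}
report_extract = {"INV-1": 1000, "INV-3": 1300}
control_total = 4800  # assumed month-end control total from the source system


def reconcile(source, extract, control):
    """Compare an extract with the source records and the control total."""
    shared = set(source) & set(extract)
    return {
        "source_matches_control": sum(source.values()) == control,
        "extract_total": sum(extract.values()),
        "gap": control - sum(extract.values()),
        "missing_ids": sorted(set(source) - set(extract)),
        "unexpected_ids": sorted(set(extract) - set(source)),
        "changed_ids": sorted(k for k in shared if source[k] != extract[k]),
    }


result = reconcile(source_records, report_extract, control_total)
```

Here the gap is explained by one record that never reached the extract, which points the analyst at filters, date ranges, or joins rather than at the report calculation.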

Question 68

Topic: Data Analysis

A data analyst is troubleshooting a BI report that started failing during scheduled refresh after a tool update. The source data and report calculations were not changed.

Exhibit: Refresh log summary

Refresh: Failed at 02:05
Tool version: 2025.4.18
Connector: Web API OAuth connector v3.2
Error: OAuth token refresh failed after 60 minutes
Manual API test: 200 OK
Same dataset on gateway 2025.4.17: Success

What should the analyst do next?

Options:

  • A. Check vendor documentation and community posts for the exact version and error

  • B. Post the full refresh log and token details publicly

  • C. Rewrite the report calculations to reduce refresh complexity

  • D. Delete and rebuild the dataset from the source API

Best answer: A

Explanation: Tool-specific failures should be investigated using authoritative sources when the evidence points to product behavior. In this case, the same dataset works on the prior gateway version, the API responds successfully, and no report logic changed. That pattern suggests a connector or version-specific issue rather than a data, SQL, or calculation problem. The analyst should search vendor documentation, release notes, support articles, and relevant community threads for the exact error, connector version, and tool version, then apply a documented fix or workaround. Any public post should be sanitized to remove tokens, credentials, and sensitive data.

  • Calculation rewrite is not supported because the exhibit shows no calculation changes and a version-specific refresh difference.
  • Public token details are inappropriate because troubleshooting evidence must be sanitized before sharing online.
  • Dataset rebuild is premature because the source API works and the issue appears tied to the BI tool update.

Question 69

Topic: Data Concepts and Environments

A data analyst uses an automated reporting tool with generative AI summaries to create a weekly revenue report for executives. This week, the tool shows a 22% increase in renewals, but the billing source had a recent field-name change and the data profile shows an unusual spike in null customer IDs. What is the BEST professional decision before distributing the report?

Options:

  • A. Replace the reporting tool with a custom model

  • B. Validate sources and review data-quality exceptions

  • C. Rewrite the AI summary to sound less surprising

  • D. Distribute the report because it is automated

Best answer: B

Explanation: Automated reporting and AI-generated summaries can speed up recurring analysis, but they do not remove the analyst’s responsibility to validate source data and review quality signals. In this scenario, a field-name change could break mappings or calculations, and the spike in null customer IDs could affect renewal counts, joins, or segmentation. The professional decision is to pause distribution long enough to reconcile the report against trusted source records, inspect the failed or changed fields, and document any issue or correction before executives use the result. The key takeaway is that automation supports reporting; it does not replace source validation or data-quality controls.

  • Automation trust fails because automated output can still reflect broken mappings, missing identifiers, or stale assumptions.
  • Cosmetic editing fails because changing the wording does not confirm whether the renewal increase is valid.
  • Custom model replacement is overengineering because the immediate issue is validation of source and quality exceptions, not model selection.

Question 70

Topic: Data Acquisition and Preparation

A data analyst is building a repeatable monthly query that combines sales, returns, and customer tables. The logic requires several joins, filters out test accounts, and calculates return rates by region before loading a small reporting table. The source tables are large, and the analyst needs a clear way to validate the intermediate regional totals before the final load. What is the best professional decision?

Options:

  • A. Stage the cleaned regional results in a temporary table

  • B. Put all joins and calculations into one nested query

  • C. Create a permanent duplicate of each source table

  • D. Export each source table to separate spreadsheets

Best answer: A

Explanation: Temporary tables are useful intermediate structures when query work has multiple steps, large source tables, or validation checkpoints. In this scenario, the analyst can filter test accounts, join the needed data, aggregate return rates by region, and store those staged results temporarily. That makes the final load simpler and gives the analyst a concrete intermediate dataset to inspect for data quality issues before publishing the reporting table. Temporary tables are especially appropriate when the staged data is needed only during the current workflow or session, not as a long-term governed copy.

  • Single nested query may work technically, but it is harder to validate and maintain when the logic has several steps.
  • Spreadsheet exports add manual handling and reduce repeatability for a monthly extraction process.
  • Permanent duplicates create unnecessary storage and governance concerns when only intermediate staging is needed.
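
The staging pattern can be sketched with an in-memory SQLite database driven from Python. The table names, columns, and rows are hypothetical, and sales and returns are pre-aggregated per customer so the joins do not multiply rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, region TEXT, is_test INTEGER);
CREATE TABLE sales (customer_id INTEGER, units INTEGER);
CREATE TABLE returns (customer_id INTEGER, units INTEGER);
INSERT INTO customers VALUES (1, 'East', 0), (2, 'West', 0), (3, 'East', 1);
INSERT INTO sales VALUES (1, 60), (1, 40), (2, 50), (3, 999);
INSERT INTO returns VALUES (1, 10), (2, 5), (3, 500);

-- Stage: drop test accounts, join, and aggregate by region.
CREATE TEMP TABLE regional_stage AS
SELECT c.region,
       SUM(s.units_sold) AS units_sold,
       SUM(COALESCE(r.units_returned, 0)) AS units_returned
FROM customers c
JOIN (SELECT customer_id, SUM(units) AS units_sold
      FROM sales GROUP BY customer_id) s
  ON s.customer_id = c.customer_id
LEFT JOIN (SELECT customer_id, SUM(units) AS units_returned
           FROM returns GROUP BY customer_id) r
  ON r.customer_id = c.customer_id
WHERE c.is_test = 0
GROUP BY c.region;
""")

# Validation checkpoint: inspect the staged totals before the final load.
staged = conn.execute(
    "SELECT region, units_sold, units_returned FROM regional_stage ORDER BY region"
).fetchall()

# Final load into the small reporting table.
conn.execute("CREATE TABLE return_rate_report (region TEXT, return_rate REAL)")
conn.execute(
    "INSERT INTO return_rate_report "
    "SELECT region, 1.0 * units_returned / units_sold FROM regional_stage"
)
report = conn.execute(
    "SELECT region, return_rate FROM return_rate_report ORDER BY region"
).fetchall()
```

The temporary table exists only for this connection, so it serves as a checkpoint without becoming a long-term copy that needs governance.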

Question 71

Topic: Data Governance

A data analyst is preparing a customer-support dashboard. Support agents must verify customers using partial identifiers, but they should not see full sensitive values.

Exhibit: Privacy matrix

Field | Current value example | Agent requirement
SSN | 123-45-6789 | View last 4 digits only
Credit card | 4111-1111-1111-1111 | View last 4 digits only
Email | alex.chen@example.com | View domain only

Which next action best meets the requirement?

Options:

  • A. Delete the sensitive fields from the dashboard dataset

  • B. Anonymize the records by removing customer identity links

  • C. Apply role-based masking to the sensitive fields

  • D. Encrypt the fields at rest in the source database

Best answer: C

Explanation: Data masking is appropriate when sensitive values must be hidden or partially obscured while still allowing controlled business use. In this case, agents need partial SSNs, partial card numbers, and email domains for verification, not full values. Role-based masking can show only permitted portions to support agents while leaving the underlying governed data available for authorized processes. Anonymization and deletion reduce or remove usability, while encryption at rest protects stored data but does not determine what an authorized dashboard user can see.

  • Anonymization fails because removing identity links would prevent customer-specific support verification.
  • Deletion fails because agents still need partial identifiers to perform their workflow.
  • Encryption at rest fails because it protects storage, not the visible values shown to authorized dashboard users.
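
Role-based masking can be sketched as a view function that decides what each role sees. The role names and the helpers `mask_last4`, `mask_email`, and `view_for` are assumptions made for this example; real deployments usually enforce masking in the database or BI layer rather than in application code.

```python
def mask_last4(value):
    """Replace every digit except the last four with '*', keeping separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - 4 else "*")
        else:
            out.append(ch)
    return "".join(out)


def mask_email(value):
    """Hide the local part of an email address and keep the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


def view_for(role, record):
    """Return the version of a record that a given role may see."""
    if role == "support_agent":      # hypothetical role name
        return {
            "ssn": mask_last4(record["ssn"]),
            "credit_card": mask_last4(record["credit_card"]),
            "email": mask_email(record["email"]),
        }
    if role == "authorized_process":  # hypothetical role name
        return dict(record)
    raise PermissionError(f"role {role!r} may not view this record")


record = {
    "ssn": "123-45-6789",
    "credit_card": "4111-1111-1111-1111",
    "email": "alex.chen@example.com",
}
agent_view = view_for("support_agent", record)
```

The stored values are unchanged; only the presentation differs by role, which is what separates masking from deletion or anonymization.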

Question 72

Topic: Data Acquisition and Preparation

A data analyst is preparing a quarterly sales dashboard from point-of-sale extracts. The data dictionary requires OrderID, SaleDate, Region, and ProductCategory, and the dashboard must show every month and active category. A profile shows no nulls in required fields, but the Q2 extract has no May records for the Central region and no Accessories category, even though the source owner confirms both should have activity. What is the BEST professional decision before publishing?

Options:

  • A. Reconcile the extract against expected months, categories, and source counts

  • B. Publish the dashboard because all required fields are populated

  • C. Set the missing May Central and Accessories values to zero

  • D. Exclude Central and Accessories from the dashboard filters

Best answer: A

Explanation: Completeness checks verify that all expected data is present, including required fields, time periods, categories, and source records. In this case, the absence of nulls does not prove the dataset is complete because entire expected slices are missing. The analyst should compare the extract to a reference calendar, active category list, and source control totals by region or period, then request correction or clearly document the gap before publication. Treating absent records as zero would change the business meaning unless the source confirms zero activity.

  • Required fields only fails because populated columns do not prove expected rows, periods, or categories were included.
  • Zero filling is inappropriate because missing source records are not the same as confirmed zero sales.
  • Filter removal hides the completeness issue and could mislead dashboard users.
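
A completeness check against reference lists can be sketched with set differences. The year, the region and category lists, and the extract rows are invented for illustration; real reference data would come from a calendar table and the active category list.

```python
from itertools import product

# Assumed reference data for the quarter.
expected_months = ["2025-04", "2025-05", "2025-06"]
expected_regions = ["Central", "East"]
expected_categories = ["Accessories", "Apparel"]

# Hypothetical extract rows: (month, region, category). No field is null.
extract = [
    ("2025-04", "Central", "Apparel"),
    ("2025-06", "Central", "Apparel"),
    ("2025-04", "East", "Apparel"),
    ("2025-05", "East", "Apparel"),
    ("2025-06", "East", "Apparel"),
]

# Expected month and region combinations that never appear in the extract.
present_month_regions = {(month, region) for month, region, _ in extract}
missing_month_regions = sorted(
    set(product(expected_months, expected_regions)) - present_month_regions
)

# Expected categories that never appear in the extract.
present_categories = {category for _, _, category in extract}
missing_categories = sorted(set(expected_categories) - present_categories)
```

Every populated field passes a null check here, yet the comparison still surfaces the missing May slice for Central and the absent Accessories category.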

Question 73

Topic: Data Concepts and Environments

A data team is centralizing departmental CSV extracts, workbook files, and reference documents. Analysts must browse nested project folders from multiple workstations, preserve folder-level permissions, and let a legacy application read and write using standard file paths. Which storage approach best fits these requirements?

Options:

  • A. Object storage in a bucket with metadata tags

  • B. Shared file storage with a hierarchical namespace

  • C. Block storage attached to one application server

  • D. A relational data warehouse with curated tables

Best answer: B

Explanation: File storage is the best fit when the main requirement is shared, hierarchical file access. It presents data as folders and files, supports familiar paths, and commonly works with access controls at the file or folder level. That matches analysts browsing nested project directories and a legacy application using standard file paths. Object storage can be excellent for scalable unstructured data, but it typically organizes data as objects in buckets rather than as a shared file system. Block storage is usually attached as a low-level volume to a server, and a data warehouse is optimized for structured analytical queries rather than shared document-style access. The key signal is the need for shared folders and file paths.

  • Object storage misses the standard shared file path requirement even though it can store many files.
  • Block storage is better for server volumes, not direct multi-analyst folder browsing.
  • Data warehouse supports analytics on structured data, not hierarchical file sharing for workbooks and documents.

Question 74

Topic: Data Concepts and Environments

A reporting team must combine customer, billing, and support data into a monthly retention report. Auditors must be able to trace each reported metric back to its originating system, source key, and extraction date.

Exhibit: Source inventory

Source | Main key | Refresh | Notes
CRM | customer_id | Daily | Customer status
Billing | account_id | Daily | Paid subscriptions
Support | email | Weekly | Open tickets

Which source strategy best preserves traceability for the report?

Options:

  • A. Export all sources to one spreadsheet and overwrite it monthly

  • B. Connect the dashboard directly to each source without documenting joins

  • C. Use only the CRM as the source because it has customer status

  • D. Stage each source, retain source metadata, and build a curated data mart

Best answer: D

Explanation: When multiple sources feed a report, traceability depends on preserving lineage from ingestion through transformation. A good strategy stages each source separately, keeps identifiers such as source system, original key, extraction timestamp, and transformation rules, then publishes a curated reporting layer such as a data mart. This lets analysts combine data for usability while still showing where each metric came from and how it was produced. Direct blending or manual exports may produce a report, but they make audits and historical validation difficult.

  • Spreadsheet overwrite loses historical extraction context and makes month-to-month lineage hard to prove.
  • Undocumented live joins may refresh data, but they do not explain source mapping or transformation logic.
  • Single-source reporting ignores required billing and support inputs, so the combined KPI would be incomplete.
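
Retaining source metadata during staging can be sketched with a small record type. The `StagedRow` class, the helper functions, the sample rows, and the extraction date are all hypothetical; the point is only that each staged row carries its source system, original key, and extraction date into the curated layer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StagedRow:
    source_system: str
    source_key: str
    extracted_on: date
    payload: dict


def stage(source_system, key_field, extracted_on, rows):
    """Wrap raw rows with the lineage fields auditors need."""
    return [
        StagedRow(source_system, str(row[key_field]), extracted_on, row)
        for row in rows
    ]


def metric_with_lineage(name, value, staged_rows):
    """Publish a metric together with the staged rows that produced it."""
    return {
        "metric": name,
        "value": value,
        "lineage": [
            (r.source_system, r.source_key, r.extracted_on.isoformat())
            for r in staged_rows
        ],
    }


# Hypothetical CRM extract taken on an assumed date.
crm = stage(
    "CRM", "customer_id", date(2025, 1, 31),
    [{"customer_id": 101, "status": "active"},
     {"customer_id": 102, "status": "churned"}],
)

active = [r for r in crm if r.payload["status"] == "active"]
metric = metric_with_lineage("active_customers", len(active), active)
```

Billing and support rows would be staged the same way with their own keys (`account_id`, `email`), and the documented join rules would sit between staging and the data mart.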

Question 75

Topic: Data Governance

A merchandising team will use a replenishment dashboard at 9:00 AM. Governance metadata states that every dataset used for this decision must have a successful refresh after 7:30 AM on the same business day.

Dataset | Refresh interval | Last successful refresh
Inventory balances | 15 minutes | 8:45 AM
Supplier ETAs | 4 hours | 4:10 AM
Online orders | 1 hour | 8:05 AM
Returns | Daily | 11:00 PM prior day

Which conclusion is best supported by the metadata?

Options:

  • A. The dashboard is current enough for the decision.

  • B. Supplier ETAs and returns are not current enough.

  • C. Only the inventory balances are usable.

  • D. The refresh intervals prove all sources are current.

Best answer: B

Explanation: Refresh metadata is a governance fact because it defines whether data is timely enough for a specific business decision. In this case, the decision rule is not just whether each source has a scheduled refresh interval; it requires a successful refresh after 7:30 AM on the same business day. Inventory balances and online orders meet that cutoff. Supplier ETAs last refreshed at 4:10 AM, and returns last refreshed the prior day, so those inputs make the dashboard insufficient for a 9:00 AM replenishment decision unless they are refreshed or excluded with clear disclosure. The displayed dashboard time cannot override source-level currency metadata.

  • Dashboard timestamp fails because a report-level time does not prove every underlying dataset meets the cutoff.
  • Only inventory usable is too restrictive because online orders also refreshed after 7:30 AM.
  • Refresh interval alone fails because scheduled frequency does not replace the last successful refresh time.
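
The cutoff rule can be sketched as a timestamp comparison per dataset. The calendar date is an assumption, as is reading the 4:10 AM supplier refresh as occurring on the decision day; the times themselves come from the table in the question.

```python
from datetime import datetime

decision_day = datetime(2025, 1, 15)  # hypothetical business day
cutoff = decision_day.replace(hour=7, minute=30)

# Last successful refresh per dataset, using the times from the exhibit.
last_refresh = {
    "Inventory balances": datetime(2025, 1, 15, 8, 45),
    "Supplier ETAs": datetime(2025, 1, 15, 4, 10),
    "Online orders": datetime(2025, 1, 15, 8, 5),
    "Returns": datetime(2025, 1, 14, 23, 0),  # prior day
}


def is_current(refreshed_at):
    """True only for a refresh after the cutoff on the decision day."""
    return refreshed_at.date() == decision_day.date() and refreshed_at > cutoff


stale = sorted(name for name, ts in last_refresh.items() if not is_current(ts))
current = sorted(name for name, ts in last_refresh.items() if is_current(ts))
```

The check never looks at the scheduled interval, only at the last successful refresh, which is the distinction the question turns on.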

Questions 76-90

Question 76

Topic: Data Governance

A data analyst is updating a privacy review for a customer analytics dataset. The governance lead asks which reference should be cited for broad security and privacy practice guidance.

Exhibit: Review notes

Finding | Detail
Data involved | Customer IDs, email addresses, location data
Needed reference | Framework for controls and risk-based guidance
Constraint | Not limited to payment cards or healthcare records
Goal | Align masking, access control, and monitoring practices

Which next action is best supported by the exhibit?

Options:

  • A. Apply HIPAA because privacy controls are needed.

  • B. Use only the internal data dictionary.

  • C. Cite NIST guidance as the framework reference.

  • D. Use PCI DSS because customer identifiers are present.

Best answer: C

Explanation: NIST (the National Institute of Standards and Technology) publishes frameworks and guidance that organizations cite when they need a reference for security or privacy practices. The exhibit asks for broad, risk-based guidance to align masking, access control, and monitoring, and it explicitly says the case is not limited to payment cards or healthcare records. That points to NIST rather than a sector-specific or data-type-specific compliance program. A data dictionary can document fields and definitions, but it does not provide a security or privacy control framework.

  • PCI DSS scope fails because the exhibit does not indicate payment card data or card-processing requirements.
  • HIPAA scope fails because the exhibit does not indicate healthcare records or protected health information.
  • Data dictionary only fails because field documentation does not replace a recognized security or privacy framework.

Question 77

Topic: Data Concepts and Environments

A retail analytics team is choosing an environment for a new sales and inventory analytics workspace. Review the architecture notes.

Exhibit: Architecture notes

Requirement | Note
Storage | 15 TB now; expected to double within a year
Compute | Heavy month-end queries; light daily exploration
Access | Analysts need managed SQL and dashboard tooling
Operations | Small team; minimal server maintenance desired

Which environment is best supported by the exhibit?

Options:

  • A. Departmental spreadsheet repository

  • B. Local desktop database

  • C. Single on-premises file server

  • D. Cloud provider analytics environment

Best answer: D

Explanation: A cloud provider analytics environment fits when the workload needs storage that can scale, compute capacity that can expand or contract with demand, and managed access to analytics services such as SQL query engines or dashboard integrations. The exhibit shows fast data growth, variable compute demand, and a small operations team that does not want to maintain servers. Those clues point to cloud-hosted storage and managed analytics services rather than local or manually maintained infrastructure. The key takeaway is that cloud environments are often chosen for elasticity and managed services, not just remote access.

  • On-premises file server may store files, but it does not directly address elastic compute or managed analytics access.
  • Local desktop database is not appropriate for 15 TB of growing shared analytics data.
  • Spreadsheet repository creates governance and scalability issues and does not provide managed compute for heavy queries.

Question 78

Topic: Data Acquisition and Preparation

During data exploration, an analyst integrates CRM and billing extracts for a monthly customer-contact dashboard. The audience wants one contact email per customer, but compliance requires traceability to source systems. Profiling shows crm_email and billing_email usually match, but billing sometimes stores an accounts payable contact while CRM stores the end user. Exact repeated CRM rows also exist. What is the BEST professional decision?

Options:

  • A. Treat mismatched emails as duplicate customers and merge the records.

  • B. Keep all rows and show both email fields on the dashboard.

  • C. Preserve source fields, remove exact repeated rows, and derive a documented preferred email.

  • D. Drop billing_email whenever crm_email is populated.

Best answer: C

Explanation: Redundancy occurs when fields overlap in meaning but are not identical or fully interchangeable. Here, crm_email and billing_email may both describe customer contact information, but they can represent different business roles. That makes them redundant attributes, not duplicate data to delete blindly. Exact repeated CRM rows are a separate duplication issue and should be de-duplicated using appropriate keys or row matching. The best preparation step is to keep the original source fields for traceability, create a derived preferred contact email using documented business rules, and record lineage so the dashboard has one usable value without losing source context.

  • Dropping billing data fails because billing contacts can be legitimate and compliance requires source traceability.
  • Merging mismatches fails because different email values may represent different contact roles, not duplicate customers.
  • Showing both fields fails because it preserves data but ignores the audience requirement for one contact email.
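The preparation step described in this explanation can be sketched in Python. This is a minimal illustration, not a prescribed method: the sample rows, the `customer_id` key, and the rule "prefer `crm_email`, fall back to `billing_email`" are assumptions made for the example. The real preferred-email rule would come from documented business requirements.

```python
# Hypothetical sample rows; customer_id and all values are illustrative only.
rows = [
    {"customer_id": 1, "crm_email": "ana@example.com", "billing_email": "ap@example.com"},
    {"customer_id": 1, "crm_email": "ana@example.com", "billing_email": "ap@example.com"},  # exact repeat
    {"customer_id": 2, "crm_email": None, "billing_email": "bo@example.com"},
]


def prepare_contacts(records):
    """Drop exact repeated rows, keep source fields, and add a derived email."""
    seen = set()
    prepared = []
    for record in records:
        key = (record["customer_id"], record["crm_email"], record["billing_email"])
        if key in seen:
            continue  # exact repeat of a row that was already kept
        seen.add(key)
        # Assumed business rule: prefer the CRM end-user contact when present.
        if record["crm_email"]:
            preferred, source = record["crm_email"], "crm"
        else:
            preferred, source = record["billing_email"], "billing"
        # Source fields stay in the output so the derived value remains traceable.
        prepared.append({**record, "preferred_email": preferred, "preferred_email_source": source})
    return prepared


contacts = prepare_contacts(rows)
```

The mismatched emails for customer 1 are not merged or dropped: both source values survive, and the derived column records which system supplied the dashboard value.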

Question 79

Topic: Visualization and Reporting

A revenue operations analyst is revising a weekly report for account managers. The audience must identify which customers need follow-up and quote the exact variance during calls.

Exhibit: Report requirement

Field | Example | Requirement
Customer | Acme Co. | Compare by customer
Contract value | $48,250 | Show exact value
Actual revenue | $45,980 | Show exact value
Variance | -$2,270 | Show exact value
Renewal date | May 31, 2026 | Sort and filter

Which visualization choice best fits this requirement?

Options:

  • A. A pie chart of variance by customer

  • B. A stacked bar chart by customer

  • C. A line chart of revenue over time

  • D. A sortable table with numeric columns

Best answer: D

Explanation: Tables are clearer than charts when the main task is to look up precise values or compare several detailed fields at the row level. In this scenario, account managers need customer-specific contract value, actual revenue, variance, and renewal date so they can take action and quote exact numbers. A sortable table also supports the stated need to filter and prioritize follow-up. Charts are better for showing trends, proportions, or high-level patterns, but they often make exact values harder to read when many categories or multiple measures are involved. The key takeaway is to match the format to the decision task, not just to visual appeal.

  • Stacked bars can compare totals or composition, but they make exact customer-level variance and dates harder to read.
  • Line charts are useful for trends over time, but the exhibit is focused on detailed customer records.
  • Pie charts show part-to-whole relationships poorly when many customers or precise variances matter.
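As a rough sketch of what a sortable table of exact values involves, the Python below computes variance per customer and sorts the rows. The Acme Co. figures come from the exhibit; the second customer and the field names are made up for illustration, and variance is assumed to be actual revenue minus contract value, which is consistent with the exhibit's -$2,270.

```python
# Acme Co. comes from the exhibit; the second row is a made-up example.
accounts = [
    {"customer": "Example Ltd.", "contract_value": 30000, "actual_revenue": 31200},
    {"customer": "Acme Co.", "contract_value": 48250, "actual_revenue": 45980},
]

for account in accounts:
    # Assumed definition: variance = actual revenue - contract value.
    account["variance"] = account["actual_revenue"] - account["contract_value"]

# Most negative variance first, so accounts needing follow-up appear at the top.
by_variance = sorted(accounts, key=lambda account: account["variance"])

for account in by_variance:
    print(f'{account["customer"]:<14}{account["contract_value"]:>10,}'
          f'{account["actual_revenue"]:>10,}{account["variance"]:>+9,}')
```

Each row keeps the exact numbers an account manager would quote on a call, which a chart would only approximate.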

Question 80

Topic: Data Analysis

A data analyst is preparing a monthly report for executives. The dataset contains online survey responses from 420 customers who chose to respond after a support interaction. The sample was not randomly selected, and no test of statistical significance was performed. Which wording should the analyst use to communicate the finding without overstating what the data supports?

Options:

  • A. Long wait times caused lower customer satisfaction.

  • B. Long wait times significantly reduce satisfaction across the customer base.

  • C. Respondents reported lower satisfaction after long wait times.

  • D. All customers are less satisfied after long wait times.

Best answer: C

Explanation: Communication wording should match the strength and limitations of the evidence. In this case, the data came from a self-selected survey sample, so it can describe what respondents reported but cannot safely represent all customers. Also, no significance test or controlled analysis was performed, so the report should avoid terms that imply proof, causation, or population-wide inference. A careful statement uses qualifiers such as “respondents reported” or “in this survey” and avoids unsupported claims like “caused,” “all customers,” or “significantly reduce.” The key is to present the observed pattern while preserving uncertainty and scope.

  • Causal claim fails because the survey observation does not prove wait times caused satisfaction changes.
  • Population claim fails because a self-selected respondent group may not represent all customers.
  • Significance claim fails because the stem states that no statistical significance test was performed.

Question 81

Topic: Data Analysis

An analyst is updating a weekly marketing dashboard for marketing managers. The report must show total sales, month-over-month revenue change, and conversion rate by channel. The source includes order-level revenue, order counts, and web sessions by channel. Validation found some session values are null, and new channels can have 0 sessions. Which approach is the best professional decision?

Options:

  • A. Build a predictive model to estimate sales and conversion for each channel.

  • B. Sum revenue, subtract prior-month revenue, and divide order count by valid sessions.

  • C. Count orders, compare them to target, and replace null sessions with 0 before division.

  • D. Average revenue, subtract sessions from orders, and format revenue per order as a percentage.

Best answer: B

Explanation: The required measures call for basic mathematical functions on numeric data. Total sales should use a sum of revenue. Month-over-month change should use subtraction between comparable current and prior-month revenue values. Conversion rate is a ratio: order count divided by sessions, then formatted as a percentage. Because sessions can be null or 0, the calculation should check for valid denominators before dividing; otherwise, the dashboard may show misleading or undefined percentages. A simple calculated-field approach satisfies the reporting need without adding unnecessary modeling.

  • Averaging revenue fails because average order value is not the same as total sales.
  • Replacing nulls with 0 can create divide-by-zero errors or misleading conversion rates.
  • Predictive modeling is overengineered because the requirement is descriptive reporting, not forecasting.
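The three measures can be sketched as plain Python functions. This is only an illustration under assumptions: the function names and sample numbers are hypothetical, and returning `None` for an invalid denominator is one possible convention. A real dashboard tool might display a blank or "N/A" instead.

```python
def conversion_rate(orders, sessions):
    """Return orders / sessions as a percentage, or None when sessions is null or 0."""
    if sessions is None or sessions <= 0:
        return None  # no valid denominator; report as unavailable instead of dividing
    return orders / sessions * 100


def month_over_month_change(current_revenue, prior_revenue):
    """Return the revenue change between two comparable months."""
    return current_revenue - prior_revenue


# Hypothetical order-level revenue for one channel.
order_revenue = [120.0, 80.0, 50.0]
total_sales = sum(order_revenue)

rate_ok = conversion_rate(orders=3, sessions=150)
rate_new_channel = conversion_rate(orders=0, sessions=0)   # new channel with 0 sessions
rate_null = conversion_rate(orders=2, sessions=None)       # null sessions value
change = month_over_month_change(total_sales, 200.0)
```

Checking the denominator first is what separates this from replacing null sessions with 0, which would only move the problem into the division.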

Question 82

Topic: Data Analysis

A sales analyst is troubleshooting a March revenue variance before publishing a monthly dashboard. The finance general ledger (GL) is the approved control source for booked net sales.

Exhibit: Report validation finding

Evidence | March result
Dashboard net sales | $1,248,900
GL control total | $1,312,400
BI calculation review | SUM(net_amount), unchanged
ETL load status | Completed, no errors
Extract row count | 18,420, lower than prior run

Which next action is best supported by the exhibit?

Options:

  • A. Clear the dashboard cache and republish

  • B. Change the dashboard formula to match the GL total

  • C. Remove March outliers from the analysis

  • D. Validate the extract against upstream GL records

Best answer: D

Explanation: When analysis results disagree with a trusted upstream record or control total, source validation should come before report changes. The BI calculation was reviewed and unchanged, and the ETL job completed without errors, but the extract row count is lower than expected and the dashboard total does not reconcile to the approved GL total. The analyst should compare the extracted records, source transactions, and control totals to determine whether records were omitted, filtered incorrectly, or received from the wrong source version. Adjusting the dashboard to force a match would hide the root cause.

  • Formula change is premature because the calculation has not been shown to be wrong.
  • Cache refresh may address stale visuals, but the exhibit points to a reconciliation gap.
  • Outlier removal changes the analysis population without evidence that outliers caused the variance.
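One way to picture the reconciliation is the Python sketch below. The two totals are taken from the exhibit; the record lists, the `txn_id` and `net_amount` field names, and the helper function are hypothetical and exist only to show the idea of comparing extracted records against the upstream source.

```python
from decimal import Decimal

# Totals from the exhibit.
dashboard_net_sales = Decimal("1248900")
gl_control_total = Decimal("1312400")

gap = gl_control_total - dashboard_net_sales  # amount by which the dashboard falls short


def missing_from_extract(gl_records, extract_records):
    """Return GL records whose transaction ID is absent from the extract."""
    extracted_ids = {record["txn_id"] for record in extract_records}
    return [record for record in gl_records if record["txn_id"] not in extracted_ids]


# Hypothetical records; txn_id and net_amount are assumed field names.
gl_records = [
    {"txn_id": "T1", "net_amount": Decimal("100.00")},
    {"txn_id": "T2", "net_amount": Decimal("250.00")},
    {"txn_id": "T3", "net_amount": Decimal("75.50")},
]
extract_records = [gl_records[0], gl_records[2]]

omitted = missing_from_extract(gl_records, extract_records)
omitted_total = sum(record["net_amount"] for record in omitted)
```

With the exhibit's totals the gap is $63,500. If the omitted records sum to the same amount, the variance is explained by the extract rather than by the dashboard formula.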

Question 83

Topic: Data Analysis

A data analyst receives a customer feedback file with a single text field named comment. The reporting team needs a monthly dashboard that groups comments containing variations of “refund,” “refunded,” or “refunds” into a Refund category, regardless of capitalization or extra spaces. Which preparation approach best meets the requirement?

Options:

  • A. Aggregate comments by monthly count only

  • B. Delete comments that contain inconsistent capitalization

  • C. Normalize case and spaces, then use pattern matching

  • D. Convert the field to a numeric data type

Best answer: C

Explanation: String functions are the right fit when text needs cleanup, matching, extraction, or categorization. In this scenario, the analyst should standardize the text first, such as trimming extra spaces and converting to a consistent case, then use a matching function or pattern to identify refund-related terms. This preserves the original information while creating a reliable category for the dashboard.

The key takeaway is to clean and match text values rather than discard records or treat free-form comments as numeric data.

  • Numeric conversion fails because free-form comments are text and cannot support refund-term categorization as numbers.
  • Monthly aggregation only misses the requirement to identify which comments belong in the Refund category.
  • Deleting inconsistent text damages completeness when simple cleanup can standardize capitalization and spacing.
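A minimal Python sketch of "normalize, then match" might look like this. The function names, the regular expression, and the fallback label "Other" are assumptions for the example; the scenario only specifies the Refund category and the three word variants.

```python
import re

# Matches "refund", "refunds", or "refunded" as whole words (after lowercasing).
REFUND_PATTERN = re.compile(r"\brefund(?:s|ed)?\b")


def normalize(comment):
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return " ".join(comment.lower().split())


def categorize(comment):
    """Return "Refund" for refund-related comments, else the assumed label "Other"."""
    return "Refund" if REFUND_PATTERN.search(normalize(comment)) else "Other"


# Hypothetical comments showing mixed capitalization and extra spaces.
samples = ["  REFUNDED   my order ", "Two Refunds pending", "Great service"]
categories = [categorize(comment) for comment in samples]
```

The original `comment` value is never modified; the category is derived alongside it, which keeps the source text available for review.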

Question 84

Topic: Data Analysis

A data analyst is preparing a weekly dashboard for a marketing manager who wants to compare email campaign performance fairly across different audience sizes. Use the formula order rate = orders / emails sent * 100, and show one decimal place. Validation found one duplicate order in Campaign B that must be removed before reporting.

Campaign | Emails sent | Orders before validation
A | 10,000 | 480
B | 6,000 | 390

Which decision best satisfies the reporting need?

Options:

  • A. Report both campaigns using raw order counts only.

  • B. Report Campaign A as better because it has more total orders.

  • C. Report Campaign B at 6.5% without noting the adjustment.

  • D. Report Campaign B at 6.5% and document the duplicate removal.

Best answer: D

Explanation: The derived measure should match the business question and use validated data. Because the manager wants a fair comparison across different audience sizes, the order rate is more appropriate than raw orders. Campaign A is \(480 / 10{,}000 \times 100 = 4.8\%\). Campaign B must first remove the duplicate order, giving \(389 / 6{,}000 \times 100 \approx 6.483\%\), which rounds to 6.5%. The adjustment should be documented so the dashboard remains traceable and transparent.

  • Raw totals fail because larger audience size can inflate total orders without showing relative performance.
  • Undocumented adjustment fails because validated data changes should be traceable in reporting.
  • Raw counts only fail because they ignore the provided derived measure needed for fair comparison.
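The arithmetic can be checked with a few lines of Python. The function name is hypothetical; the formula and the figures are the ones given in the question.

```python
def order_rate(orders, emails_sent):
    """Order rate as a percentage, rounded to one decimal place."""
    return round(orders / emails_sent * 100, 1)


campaign_a = order_rate(480, 10_000)    # 4.8
# Campaign B: remove the one duplicate order found during validation.
campaign_b = order_rate(390 - 1, 6_000)  # 6.5
```

Campaign B has fewer total orders but the higher rate, which is why the rate, not the raw count, answers the manager's question.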

Question 85

Topic: Data Concepts and Environments

A retail analyst uses a scheduled job to collect competitor prices for a weekly pricing report. The report has become unreliable.

Exhibit: Collection log summary

Date | Result | Note
Week 1 | 98% captured | CSS selector div.price found
Week 2 | 41% captured | Price moved to span.currentPrice
Week 3 | 0% captured | Site requires interactive consent
Week 4 | Blocked | robots.txt disallows automated price paths

Which interpretation is best supported by the exhibit?

Options:

  • A. Web scraping is a fragile source method for this report.

  • B. The source should be converted from CSV to JSON.

  • C. The report needs a wider date filter.

  • D. The analyst should remove missing price rows.

Best answer: A

Explanation: Web scraping can be useful when no formal data feed exists, but it is fragile because it depends on website structure and allowed access. In the exhibit, the scraper first breaks when the price element changes from one CSS selector to another. Later, an interactive consent step and a robots.txt restriction prevent automated collection. These are source-method risks, not normal data-cleaning issues. A more reliable next step would be to look for an approved API, data-sharing agreement, licensed feed, or another permissioned source before relying on the scraped data for recurring reporting. The key takeaway is that operational source selection must consider both technical stability and permission to collect the data.

  • File format change is not supported because the exhibit shows website collection failures, not a CSV or JSON parsing issue.
  • Date filtering does not address the missing records caused by selector changes and blocked access.
  • Deleting missing rows may hide the issue but does not fix the unstable and potentially unauthorized source method.
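The permission side of this risk can be illustrated with Python's standard `urllib.robotparser` module. The robots.txt rules, the user-agent string, and the URLs below are hypothetical. The exhibit only says that automated price paths are disallowed, so this is a sketch of how such a rule would be checked, not a reproduction of the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real job would read the site's own file.
robots_lines = [
    "User-agent: *",
    "Disallow: /prices/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Hypothetical user agent and URLs.
price_allowed = parser.can_fetch("pricing-report-bot", "https://example.com/prices/item-1")
about_allowed = parser.can_fetch("pricing-report-bot", "https://example.com/about")
```

A check like this only reports what the site permits. It does nothing about selector changes or consent prompts, which is part of why a permissioned feed or API is the more stable source.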

Question 86

Topic: Data Acquisition and Preparation

A data analyst must collect employee feedback for a workforce planning report. HR needs results by department and tenure band, but the privacy policy prohibits collecting names, employee IDs, email addresses, or free-text comments that could reveal identities. Which collection approach best meets these requirements?

Options:

  • A. Interview managers about individual employee concerns

  • B. Export the HRIS employee table and remove names after analysis

  • C. Collect open-ended survey comments and redact sensitive words later

  • D. Use an anonymous survey with closed-ended questions and demographic bands

Best answer: D

Explanation: Sensitive data constraints should shape the collection design before data is captured. In this scenario, HR needs aggregate analysis by department and tenure band, but the policy explicitly rules out direct identifiers and free-text comments. An anonymous, closed-ended survey can collect only the minimum necessary fields for the report while reducing the chance of capturing personally identifiable information. Designing the form this way is safer than collecting sensitive details first and trying to remove them later. The key takeaway is to minimize sensitive collection at the source while still preserving the required analysis granularity.

  • Remove later fails because exporting identifiable HRIS records collects prohibited fields before analysis.
  • Manager interviews fail because they would likely capture individual-level sensitive details.
  • Redact later fails because free-text collection is specifically prohibited by the policy.

Question 87

Topic: Visualization and Reporting

A fulfillment center manager uses a dashboard to decide when to reassign pickers between zones during each shift. The current report is refreshed from the prior night’s batch load.

Exhibit: Dashboard validation note

Finding | Detail
Decision window | Staffing changes every 15 minutes
Current refresh | Daily at 2:00 a.m.
Impact | Late alerts cause missed service-level targets
Source system | Warehouse events available within 2 minutes

Which reporting approach best supports the manager’s operational decision?

Options:

  • A. Daily snapshot dashboard

  • B. Near-real-time operational dashboard

  • C. Static PDF report after each shift

  • D. Weekly executive summary

Best answer: B

Explanation: Operational decisions that must be made during a shift require data that is current enough for the decision window. The exhibit shows staffing changes occur every 15 minutes, while source events are available within 2 minutes. A near-real-time operational dashboard can refresh frequently enough to support timely reassignment decisions. A daily snapshot is useful for historical review, but it is too stale for in-shift action. The key is to match report refresh behavior to the business decision cadence, not just to the availability of a report format.

  • Daily snapshot fails because the report would still reflect overnight data after conditions change during the shift.
  • Weekly summary is suited for trend review, not immediate operational staffing decisions.
  • Static PDF may document results, but it does not provide current data for active zone reassignment.
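The cadence comparison can be expressed as a simple check. This Python sketch assumes a deliberately simplified rule (data must refresh at least as often as decisions are made); the function name is hypothetical, and the minute values are taken from the exhibit.

```python
def supports_decision(refresh_interval_minutes, decision_window_minutes):
    """True when data refreshes at least as often as decisions are made (assumed rule)."""
    return refresh_interval_minutes <= decision_window_minutes


DECISION_WINDOW = 15       # staffing changes every 15 minutes (exhibit)
DAILY_REFRESH = 24 * 60    # current refresh: once per day, in minutes
SOURCE_LATENCY = 2         # warehouse events available within 2 minutes (exhibit)

daily_ok = supports_decision(DAILY_REFRESH, DECISION_WINDOW)
near_real_time_ok = supports_decision(SOURCE_LATENCY, DECISION_WINDOW)
```

A daily refresh interval of 1,440 minutes fails the check against a 15-minute window, while a feed that lags by about 2 minutes passes.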

Question 88

Topic: Visualization and Reporting

A sales director and a regional manager report different values for the same KPI on what appears to be the same dashboard. The dashboard owner checks the report metadata.

Exhibit: Report validation finding

Viewer | Report title | Snapshot timestamp | Revenue KPI
Director | Q2 Sales Dashboard | July 1, 8:00 AM | $4,820,000
Manager | Q2 Sales Dashboard | June 30, 6:00 PM | $4,760,000

What is the most likely interpretation of this issue?

Options:

  • A. The revenue calculation formula is incorrect.

  • B. The dashboard filter logic is excluding records.

  • C. The source database has duplicate sales records.

  • D. The users are viewing different data snapshots.

Best answer: D

Explanation: Dashboard versioning issues can occur when users access the same report name but not the same refreshed dataset or snapshot. In the exhibit, the title is identical, but the snapshot timestamps differ. That means the users are not comparing the same point-in-time version of the data, so the KPI values can legitimately differ even if the metric definition is unchanged.

The next validation step would be to align both users to the same snapshot or refresh cycle before investigating formulas, filters, or source data defects.

  • Formula error is not supported because the exhibit shows different snapshot times, not different metric definitions.
  • Filter issue is not supported because no filter or segment difference is shown.
  • Duplicate records would require source data evidence, not just mismatched dashboard snapshots.
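As a rough sketch, a dashboard owner could flag this situation by grouping viewed reports by title and checking for more than one snapshot timestamp. The rows below come from the exhibit; the field names and the helper function are hypothetical.

```python
from collections import defaultdict

# Rows from the exhibit; timestamps are kept as the strings shown in the metadata.
views = [
    {"viewer": "Director", "title": "Q2 Sales Dashboard", "snapshot": "July 1, 8:00 AM", "revenue": 4_820_000},
    {"viewer": "Manager", "title": "Q2 Sales Dashboard", "snapshot": "June 30, 6:00 PM", "revenue": 4_760_000},
]


def snapshots_by_title(rows):
    """Map each report title to the set of snapshot timestamps being viewed."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["title"]].add(row["snapshot"])
    return grouped


# Titles viewed under more than one snapshot are candidates for a version mismatch.
mismatched = {title: stamps for title, stamps in snapshots_by_title(views).items() if len(stamps) > 1}
```

A title that appears under two snapshots should be aligned to one refresh before anyone investigates formulas, filters, or source data.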

Question 89

Topic: Data Acquisition and Preparation

A data analyst is preparing a weekly sales report from a transaction table. Leadership needs total revenue summarized by region and calendar week, duplicate transaction rows have already been removed, and the report must keep individual customer details out of the output. Which query approach is the best professional decision?

Options:

  • A. Join transactions to the customer profile table

  • B. Append weekly transaction files into one table

  • C. Sort records by region and transaction date

  • D. Group records by region and week, then sum revenue

Best answer: D

Explanation: Grouping is used when detailed records must be summarized by a category, time period, region, customer, or similar dimension. In this scenario, leadership wants total revenue by region and calendar week, not a row-by-row transaction listing. A grouped query with an aggregate function such as SUM(revenue) produces one summarized result per region-week combination and avoids exposing customer-level detail. Sorting may make records easier to read, but it does not summarize them. Joining or appending may be useful in other integration tasks, but they do not directly meet the reporting requirement after duplicates have already been handled.

  • Sorting only changes row order but leaves transaction-level detail and does not calculate weekly totals.
  • Joining profiles adds customer data, which conflicts with the requirement to keep customer details out.
  • Appending files combines rows from multiple sources, but it does not summarize revenue by region and week.
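The grouped aggregation can be sketched in Python as an equivalent of grouping by region and week and applying `SUM(revenue)`. The sample transactions and field names are hypothetical, and ISO week numbering is an assumption: the scenario says "calendar week" without defining it.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions; field names are assumed for illustration.
transactions = [
    {"region": "East", "txn_date": date(2026, 3, 2), "customer_id": 11, "revenue": 100.0},
    {"region": "East", "txn_date": date(2026, 3, 4), "customer_id": 12, "revenue": 50.0},
    {"region": "West", "txn_date": date(2026, 3, 3), "customer_id": 13, "revenue": 70.0},
    {"region": "East", "txn_date": date(2026, 3, 10), "customer_id": 11, "revenue": 30.0},
]


def revenue_by_region_week(rows):
    """Sum revenue per (region, ISO year, ISO week); customer fields are never read."""
    totals = defaultdict(float)
    for row in rows:
        iso_year, iso_week, _ = row["txn_date"].isocalendar()
        totals[(row["region"], iso_year, iso_week)] += row["revenue"]
    return dict(totals)


weekly = revenue_by_region_week(transactions)
```

The output has one row per region-week combination and contains no `customer_id`, so the summary meets the reporting need without exposing customer-level detail.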

Question 90

Topic: Visualization and Reporting

A data analyst is preparing a monthly revenue dashboard for executives. The dashboard uses a new calculated field for recurring revenue, excludes refunded transactions, and includes a line chart that shows a sudden increase after a product launch. The source refresh completed successfully, but the analyst wants validation before publishing because the assumptions and visual interpretation may affect leadership decisions. What is the best next step?

Options:

  • A. Request an independent peer review

  • B. Run a stress test on the dashboard server

  • C. Create a new data dictionary

  • D. Replace the dashboard with a static PDF

Best answer: A

Explanation: Peer review is a report validation technique used when an independent analyst should examine whether assumptions, code, calculations, or visual interpretations are reasonable. In this scenario, the data refresh succeeded, but the risk is that the new recurring revenue logic, refund exclusion, and executive-facing trend interpretation could be wrong or misleading. A peer reviewer can verify the calculated field, check that exclusions match the business rule, and challenge whether the line chart supports the stated conclusion before leaders use it for decisions.

The key is matching the validation method to the risk: this is not mainly a performance, metadata, or delivery-format problem.

  • Stress testing targets load or performance issues, but the stem says the refresh completed and the concern is analytical validity.
  • Data dictionary work helps document fields, but it does not independently validate the dashboard logic and interpretation.
  • Static PDF delivery changes the report format without checking whether the calculations or visual message are correct.

Continue with full practice

Use the CompTIA Data+ DA0-002 Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Free review resource

Read the CompTIA Data+ DA0-002 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 28, 2026