Try 90 free CompTIA Data+ DA0-002 questions across the exam domains, with explanations, then continue with full IT Mastery practice.
This free full-length CompTIA Data+ DA0-002 practice exam includes 90 original IT Mastery questions across the exam domains.
Use these questions for self-assessment, scope review, and deciding what to drill next.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Need concept review first? Read the CompTIA Data+ DA0-002 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Domain | Weight |
|---|---|
| Data Concepts and Environments | 20% |
| Data Acquisition and Preparation | 22% |
| Data Analysis | 24% |
| Visualization and Reporting | 20% |
| Data Governance | 14% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.
Topic: Data Analysis
A data analyst is preparing a slide for executives about a pilot retention program. The analysis uses one month of observational customer data, and customers were not randomly assigned to the program. The program group had a 6% lower churn rate, but the analyst found that program customers also had higher account tenure. Which wording is the best professional decision for the slide?
Options:
A. “The program should be expanded because it reduced churn.”
B. “The analysis is invalid because customers were not randomly assigned.”
C. “Customers in the program had 6% lower churn; tenure may also contribute.”
D. “The program caused churn to decrease by 6%.”
Best answer: C
Explanation: Communication should match what the analysis can support. Because the data is observational and customers were not randomly assigned, the result can support an association between program participation and lower churn, not proof that the program caused the decrease. The higher tenure in the program group is a potential confounding factor, so it should be acknowledged in executive wording. A strong summary can still be useful, but it must avoid causal language unless the study design or additional analysis supports causation.
The key takeaway is to state the finding clearly while describing limitations that affect interpretation.
Topic: Data Acquisition and Preparation
A data analyst is preparing a monthly customer dataset for a dashboard that shows customer counts by segment and average order value. A profile shows that segment contains blanks and "Unknown", while order_value contains blanks and 0. Business users say a true zero order is valid, but missing segment values should not be merged into an existing segment. What is the best next step before building the dashboard?
Options:
A. Standardize missing segment indicators and review missing order_value separately
B. Convert missing segments to the largest existing segment
C. Drop every row with any blank field before reporting
D. Replace all blanks and zeros with NULL across the dataset
Best answer: A
Explanation: Missing values must be identified in the context of each field and business rule. In this dataset, blank and "Unknown" segment values can distort segment counts if they are ignored or merged incorrectly. For order_value, however, 0 is a valid numeric value, so treating it as missing would distort the average order value. The practical approach is to standardize only the null-like indicators for segment, separately profile blanks in order_value, and document the treatment before downstream transformations or dashboard calculations. This keeps the analysis accurate without deleting useful records or inventing segment assignments.
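The field-by-field treatment described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the sample rows, the `NULL_LIKE_SEGMENTS` set, and the `"Missing"` label are assumptions made for the example.

```python
# Hypothetical sample rows; field names follow the question stem.
rows = [
    {"segment": "Retail", "order_value": "120.50"},
    {"segment": "", "order_value": "0"},
    {"segment": "Unknown", "order_value": ""},
    {"segment": "  unknown ", "order_value": "35"},
]

# Assumed null-like indicators for segment (compared after trimming and lowercasing).
NULL_LIKE_SEGMENTS = {"", "unknown"}
MISSING_SEGMENT_LABEL = "Missing"  # assumed label; kept separate from real segments


def clean_row(row):
    """Standardize missing segments; keep zero order values, flag blank ones."""
    segment = row["segment"].strip()
    if segment.lower() in NULL_LIKE_SEGMENTS:
        segment = MISSING_SEGMENT_LABEL

    raw_value = row["order_value"].strip()
    # A blank is missing (None); "0" is a valid order value and stays 0.0.
    order_value = float(raw_value) if raw_value != "" else None
    return {"segment": segment, "order_value": order_value}


cleaned = [clean_row(r) for r in rows]

# Average order value uses only known values, including true zeros.
known_values = [r["order_value"] for r in cleaned if r["order_value"] is not None]
average_order_value = sum(known_values) / len(known_values)

# Blank order values are counted separately so they can be reviewed.
missing_order_values = sum(1 for r in cleaned if r["order_value"] is None)
```

Keeping the missing segment as its own label means segment counts still add up to the total without inflating any existing segment.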
Topic: Data Acquisition and Preparation
A retail analytics team receives daily sales files from several stores. The team must preserve each raw file in the target lakehouse for audit and lineage, then use the lakehouse compute engine to standardize dates, deduplicate records, and create reporting tables after the data arrives. Which approach best fits these requirements?
Options:
A. Use ETL to transform data before loading it into the lakehouse
B. Use a dashboard refresh to transform and store the raw files
C. Use ELT to load raw data first, then transform it in the lakehouse
D. Use web scraping to collect the store sales files directly
Best answer: C
Explanation: ETL and ELT differ mainly by when transformation occurs relative to loading the target environment. In ETL, data is extracted, transformed in a staging or processing layer, and then loaded into the target. In ELT, data is extracted and loaded first, often in raw form, and transformations happen inside the target platform. The stem requires preserving raw files in the lakehouse and performing standardization, deduplication, and reporting-table creation after arrival, so the timing maps to ELT. The key takeaway is to look for whether transformation happens before or after the target load.
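To make the load-then-transform order concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for the lakehouse. The table names, columns, and sample rows are hypothetical, and a real lakehouse engine would differ in syntax and scale.

```python
import sqlite3

# In-memory SQLite stands in for the target platform (an assumption for the sketch).
conn = sqlite3.connect(":memory:")

# Step 1 (Load): store the raw rows unchanged so they remain available for audit.
conn.execute(
    "CREATE TABLE raw_sales (store TEXT, sale_date TEXT, order_id TEXT, amount REAL)"
)
raw_rows = [
    ("S1", "2026/05/01", "O-1", 10.0),
    ("S1", "2026/05/01", "O-1", 10.0),  # duplicate record from the source file
    ("S2", "2026/05/02", "O-2", 25.0),
]
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?, ?)", raw_rows)

# Step 2 (Transform): standardize dates and deduplicate inside the target,
# writing the result to a separate reporting table.
conn.execute(
    """
    CREATE TABLE sales_report AS
    SELECT DISTINCT store,
           REPLACE(sale_date, '/', '-') AS sale_date,
           order_id,
           amount
    FROM raw_sales
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
report_rows = conn.execute(
    "SELECT store, sale_date, order_id, amount FROM sales_report ORDER BY order_id"
).fetchall()
```

The raw table still holds all three loaded rows, while the reporting table holds the two cleaned rows.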
Topic: Visualization and Reporting
A regional operations manager needs a weekly report showing where delivery delays are concentrated across service territories. The analyst has clean territory boundary data and order records with delivery ZIP code, delay status, and week. Individual customer addresses must not be shown. Which visualization is the BEST professional choice?
Options:
A. Line chart showing total delayed orders by week
B. Filled map by service territory using aggregated delay rate
C. Pivot table listing ZIP codes and delay counts
D. Infographic summarizing top delay causes
Best answer: B
Explanation: When the reporting question is mainly about where a metric is concentrated, a map is usually the best visual. In this scenario, the audience needs to compare delivery delays across service territories, and the analyst has territory boundaries plus ZIP-level order data. Aggregating delay rate to the service-territory level supports the geographic decision while avoiding display of individual customer addresses. A rate is also more comparable than raw counts when territories may have different order volumes.
The key takeaway is to match the visual to the decision: spatial distribution calls for a map, especially when regions or territories are central to the question.
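The aggregation behind such a map might look like the sketch below. The ZIP-to-territory lookup and the order rows are invented for illustration; only the aggregated rate per territory would be passed to the map.

```python
from collections import defaultdict

# Hypothetical lookup from delivery ZIP code to service territory.
zip_to_territory = {"80202": "North", "80203": "North", "80301": "West"}

# Hypothetical order records; no customer addresses are carried along.
orders = [
    {"zip": "80202", "delayed": True},
    {"zip": "80203", "delayed": False},
    {"zip": "80203", "delayed": False},
    {"zip": "80202", "delayed": False},
    {"zip": "80301", "delayed": True},
    {"zip": "80301", "delayed": False},
]

total_orders = defaultdict(int)
delayed_orders = defaultdict(int)

for order in orders:
    territory = zip_to_territory[order["zip"]]
    total_orders[territory] += 1
    if order["delayed"]:
        delayed_orders[territory] += 1

# Delay rate per territory: delayed orders divided by all orders.
delay_rate = {t: delayed_orders[t] / total_orders[t] for t in total_orders}
```

In this sample, both territories have one delayed order, but the rates differ because the order volumes differ.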
Topic: Data Concepts and Environments
A data analyst profiles API records before loading them into a relational reporting table. Which interpretation is best supported by the profile?
Exhibit: API record profile
{
  "order_id": "O-1042",
  "customer": {
    "id": "C-22",
    "address": { "city": "Denver", "state": "CO" }
  },
  "items": [
    { "sku": "A10", "qty": 2 },
    { "sku": "B07", "qty": 1 }
  ]
}
Options:
A. The record is unstructured because it has no fixed columns.
B. The record contains hierarchical and repeated nested fields.
C. The record has duplicate orders that should be removed.
D. The record is a flat delimited file with embedded separators.
Best answer: B
Explanation: Semi-structured data, such as JSON, can include fields inside other fields and arrays of repeated values. In this record, customer contains child fields such as id and address, and address contains city and state. The items field is an array with multiple item objects for the same order. That means a relational load may need parsing and possibly flattening or exploding the repeated array, depending on the reporting goal. The exhibit supports nested structure recognition, not duplicate detection or unstructured-data classification.
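One possible way to flatten the exhibit record for a relational load is sketched below. The output column names (`customer_id`, `city`, `state`) are choices made for this example, and whether to produce one row per item depends on the reporting goal.

```python
import json

# The record from the exhibit, as JSON text.
record_text = """
{
  "order_id": "O-1042",
  "customer": {
    "id": "C-22",
    "address": { "city": "Denver", "state": "CO" }
  },
  "items": [
    { "sku": "A10", "qty": 2 },
    { "sku": "B07", "qty": 1 }
  ]
}
"""

record = json.loads(record_text)

# Explode the repeated "items" array into one flat row per item,
# copying the nested customer fields onto each row.
flat_rows = [
    {
        "order_id": record["order_id"],
        "customer_id": record["customer"]["id"],
        "city": record["customer"]["address"]["city"],
        "state": record["customer"]["address"]["state"],
        "sku": item["sku"],
        "qty": item["qty"],
    }
    for item in record["items"]
]
```

The single order becomes two rows, one per item, which is why order-level totals must be computed carefully after flattening.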
Topic: Data Analysis
A data analyst is troubleshooting a scheduled, repeatable sales dashboard refresh that sometimes fails. The manager needs evidence showing where failures occur and why some runs take longer before changing the workflow.
Exhibit: Current run history
| Run time | Status | Duration | Detail captured |
|---|---|---|---|
| 08:00 | Success | 4 min | Completed |
| 09:00 | Failed | 16 min | Job failed |
| 10:00 | Success | 5 min | Completed |
| 11:00 | Failed | 18 min | Job failed |
Which next action best supports the troubleshooting need?
Options:
A. Rewrite the dashboard filters to reduce refresh time
B. Ask dashboard users to submit screenshots of failures
C. Profile the source table for null values and duplicates
D. Enable step-level logging with timestamps and error messages
Best answer: D
Explanation: Logging is the best troubleshooting method when a repeatable process needs evidence about where failures occur or how long each step takes. The exhibit shows only high-level status and duration, so it confirms a pattern but does not identify the failing step, error cause, or timing bottleneck. Step-level logs with timestamps, status codes, and error messages create an auditable trail across multiple runs without prematurely changing the workflow. This supports a data-based diagnosis before applying a fix.
The key distinction is that logging gathers operational evidence, while profiling or redesign changes the focus to data content or report design before the failure point is known.
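A rough sketch of step-level logging with Python's standard `logging` module follows. The step names (`extract`, `transform`) and the simulated error are hypothetical; a real refresh job would wrap its own steps.

```python
import logging
import time

# Timestamps come from %(asctime)s in the log format.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO
)
log = logging.getLogger("dashboard_refresh")


def run_step(name, func):
    """Run one step and log its status, duration, and any error message."""
    start = time.perf_counter()
    try:
        func()
    except Exception as exc:
        elapsed = time.perf_counter() - start
        log.error("step=%s status=failed duration=%.3fs error=%s", name, elapsed, exc)
        return {"step": name, "status": "failed", "error": str(exc), "seconds": elapsed}
    elapsed = time.perf_counter() - start
    log.info("step=%s status=success duration=%.3fs", name, elapsed)
    return {"step": name, "status": "success", "error": None, "seconds": elapsed}


# Hypothetical steps: the second one simulates a failure.
def extract():
    pass


def transform():
    raise ValueError("unexpected null in order_date")


results = []
for step_name, step_func in [("extract", extract), ("transform", transform)]:
    result = run_step(step_name, step_func)
    results.append(result)
    if result["status"] == "failed":
        break  # stop the run, but the log now shows which step failed and why
```

Compared with the run history in the exhibit, each log line now identifies the step, its duration, and the error text.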
Topic: Data Acquisition and Preparation
A data analyst is preparing transaction data for a monthly executive dashboard. The source system stores each customer’s exact age, but executives only need to compare purchasing patterns by broad life-stage groups. The dashboard should avoid exposing unnecessary detailed personal data and should remain easy to filter. Which transformation is the best choice?
Options:
A. Standardize ages using z-scores
B. Bin ages into defined age-range categories
C. Delete the age field before dashboard creation
D. Convert ages to text strings without changing values
Best answer: B
Explanation: Binning is the right transformation when continuous or highly detailed values need to be grouped into meaningful ranges or categories. In this scenario, exact ages are more detailed than executives need, and broad groups make the dashboard easier to filter and interpret. Binning also helps reduce unnecessary exposure of precise personal details while preserving the analytical value of age-based comparisons. The bins should be defined consistently, documented, and validated so each age maps to exactly one category.
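As a small illustration, the sketch below maps exact ages to ranges. The bin boundaries and labels are assumptions for the example; the real groups would come from the business definition of life stages.

```python
# Assumed bins as (lowest age, highest age, label); ranges are inclusive
# and do not overlap, so each age maps to exactly one category.
AGE_BINS = [
    (0, 17, "Under 18"),
    (18, 34, "18-34"),
    (35, 54, "35-54"),
    (55, 120, "55+"),
]


def age_group(age):
    """Return the label of the bin that contains the given age."""
    for low, high, label in AGE_BINS:
        if low <= age <= high:
            return label
    return "Out of range"  # flags values that need review


ages = [17, 18, 34, 35, 72]
groups = [age_group(a) for a in ages]
```

Checking the boundary values (17, 18, 34, 35) is a quick way to validate that the bins neither overlap nor leave gaps.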
Topic: Visualization and Reporting
A public health analyst must share quarterly vaccination outreach results with community partners who have varying levels of data expertise. The report should tell a concise story that can be read quickly in a meeting handout.
Exhibit: Communication requirements
| Requirement | Detail |
|---|---|
| Audience | Broad, nontechnical partners |
| Purpose | Summarize key results and next steps |
| Format | One-page handout |
| Detail level | High-level trends and selected callouts |
Which visualization/reporting format best fits these requirements?
Options:
A. Geospatial map
B. Pivot table
C. Infographic
D. Interactive dashboard
Best answer: C
Explanation: An infographic combines brief text, simple visuals, and key metrics to communicate a narrative quickly. In this scenario, the audience is broad and nontechnical, and the required deliverable is a one-page handout with high-level trends and selected callouts. That makes an infographic a better fit than tools designed for exploration, filtering, or detailed analysis. The key decision is the communication goal: tell a concise story, not provide a workspace for slicing data.
Topic: Data Analysis
A customer support manager wants to give front-line agents a report they can use during each shift. Agents need to see assigned tickets, aging status, priority, next action, and SLA risk so they can decide what to work on immediately. Which reporting approach best fits this audience and requirement?
Options:
A. Public infographic showing overall support volume
B. Detailed operational dashboard with actionable ticket-level views
C. Executive summary with monthly trend highlights
D. Board-level KPI scorecard with strategic targets
Best answer: B
Explanation: For individual contributors, the best reporting approach is usually operational and action-oriented. The report should expose enough detail for users to decide what to do next, such as assigned items, current status, priority, exceptions, and due dates. In this scenario, agents are not trying to understand long-term strategy or communicate broad performance trends; they need to manage tickets during a shift. A detailed operational dashboard or report with ticket-level views, filters, and clear SLA indicators supports that need. High-level summaries and scorecards are more appropriate for executives or managers who monitor trends and strategic KPIs rather than take immediate record-level action.
Topic: Visualization and Reporting
A regional sales team wants managers to explore approved revenue and quota reports independently without requesting custom exports each week. The analytics lead must keep access limited by role and prevent users from changing certified KPI definitions. Which delivery method best meets these requirements?
Options:
A. Static PDF summary
B. Ad hoc spreadsheet export
C. Public embedded dashboard
D. Self-service portal
Best answer: D
Explanation: A self-service portal is the best fit when users need independent access to approved data or reports within governance controls. It can expose certified dashboards, datasets, and KPIs while using permissions or role-based access to limit what each user can see. This supports exploration without requiring analysts to create repeated exports and without allowing users to redefine governed metrics. Static summaries are useful for fixed communication, but they do not support controlled exploration. Unmanaged exports and public dashboards weaken control over access, definitions, and reuse.
Topic: Data Concepts and Environments
A data analyst is reviewing how a weekly sales packet is created. Based on the exhibit, which concept is best illustrated?
Exhibit: Tool run log
06:00 Sign in to sales portal
06:01 Click Export CSV
06:03 Open KPI workbook
06:04 Refresh workbook data
06:06 Export PDF report
06:07 Email PDF to sales managers
Options:
A. Data lakehouse ingestion
B. Robotic process automation
C. Natural language processing
D. Manual ad hoc analysis
Best answer: B
Explanation: Robotic process automation (RPA) uses software to perform repetitive, rule-based steps that a person would otherwise do, such as signing in, clicking export, refreshing a workbook, creating a PDF, and sending an email. In this case, the reporting workflow is automated by executing the same sequence on a schedule. The result is automated reporting, but the exhibit specifically shows the RPA-style mechanism: mimicking user actions across applications.
Topic: Data Acquisition and Preparation
A data analyst is preparing a monthly sales report. The source system stores territory_code values such as NE-1, NE-2, and SE-1, but the business report must show standardized regions such as Northeast and Southeast. Finance requires that any reported region total can be traced back to the original source values used to create it. Which preparation approach best meets this requirement?
Options:
A. Create a mapped derived region field and retain the original code
B. Overwrite territory_code with the standardized region name
C. Remove source rows with nonstandard territory codes
D. Group the codes directly in the report visualization
Best answer: A
Explanation: Traceability is preserved by keeping the original source value and adding a derived field for reporting. In this scenario, territory_code should remain unchanged, while a documented mapping or lookup creates a standardized region field used in the monthly report. This allows Finance to see regional totals while still tracing each reported value back to the exact source codes that contributed to it. The mapping should be versioned or documented so future changes can be audited. Overwriting, hiding, or deleting source values weakens lineage and makes reconciliation harder.
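The derived-field approach can be sketched as follows. The mapping covers only the codes named in the question, and the sales figures and the `"Unmapped"` fallback are assumptions for illustration.

```python
from collections import defaultdict

# Documented lookup from source code to reporting region (codes from the question).
TERRITORY_TO_REGION = {"NE-1": "Northeast", "NE-2": "Northeast", "SE-1": "Southeast"}

# Hypothetical source rows.
source_rows = [
    {"territory_code": "NE-1", "sales": 100},
    {"territory_code": "NE-2", "sales": 50},
    {"territory_code": "SE-1", "sales": 70},
]

# Add a derived region field; territory_code is left unchanged.
prepared = [
    {**row, "region": TERRITORY_TO_REGION.get(row["territory_code"], "Unmapped")}
    for row in source_rows
]

# Region totals for the report, plus the source codes behind each total.
region_totals = defaultdict(int)
region_sources = defaultdict(set)
for row in prepared:
    region_totals[row["region"]] += row["sales"]
    region_sources[row["region"]].add(row["territory_code"])
```

Because every prepared row carries both fields, a region total can be traced back to the exact codes that fed it.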
Overwriting territory_code removes the original value needed for audit and reconciliation.
Topic: Visualization and Reporting
A data analyst is preparing a quarterly dashboard for regional sales directors. The main visual compares “retention rate” by region, but the CRM migration changed the definition from account-level retention to contract-level retention mid-quarter. One region also has two weeks of delayed updates. What is the BEST professional decision before publishing the visual?
Options:
A. Replace the chart with a more colorful regional map
B. Remove the delayed region from the dashboard without comment
C. Publish the chart because the source is the official CRM
D. Add a metric definition and note the delayed region data
Best answer: D
Explanation: A visual needs context when the audience could reasonably misread the numbers because of definitions, timing, completeness, or source changes. In this scenario, the retention metric changed during the quarter, and one region has delayed updates. A concise definition, note, tooltip, or footnote helps users interpret the comparison correctly without overcomplicating the dashboard. The goal is not to hide the issue or redesign the visual for style; it is to make the limitation visible at the point of use.
Topic: Data Concepts and Environments
A team is deploying a relational database that will support an internal reporting application. Review the storage request and choose the best storage type.
Exhibit: Storage request
| Requirement | Detail |
|---|---|
| Access pattern | Frequent random reads and writes |
| Attachment | Mounted by one database server |
| Behavior needed | Low-level volume, partition, and file system control |
| Data type | Structured database files |
Which storage approach best fits these requirements?
Options:
A. File storage
B. Object storage
C. Block storage
D. Data lake storage
Best answer: C
Explanation: Block storage is the best fit when an application or database needs low-level, volume-style storage behavior. It presents storage as attachable volumes that the server can format with a file system and use for random read/write workloads. This matches databases that manage structured database files and need predictable access through an operating system volume. Object storage is better for objects such as files in buckets, file storage is better for shared directory access, and data lake storage is a repository pattern rather than a low-level database volume.
Topic: Data Analysis
A retail analyst is comparing daily sales consistency for two products. The manager wants the product with the higher relative variability, not just the larger standard deviation.
Exhibit: 30-day sales summary
| Product | Mean daily units | Standard deviation | Formula |
|---|---|---|---|
| Product A | 200 | 20 | CV = standard deviation ÷ mean × 100 |
| Product B | 80 | 16 | CV = standard deviation ÷ mean × 100 |
Which conclusion is supported by the exhibit?
Options:
A. Product A has higher relative variability.
B. Product B has higher relative variability.
C. Product A is more variable because its standard deviation is larger.
D. Both products have the same relative variability.
Best answer: B
Explanation: The coefficient of variation (CV) compares dispersion relative to the mean, which is useful when the averages are different. Product A has CV = 20 ÷ 200 × 100 = 10%. Product B has CV = 16 ÷ 80 × 100 = 20%. Although Product A has the larger raw standard deviation, Product B varies more relative to its typical daily sales volume.
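The same arithmetic can be checked with a short script that uses the values from the exhibit. The function name is chosen for the example.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: standard deviation divided by mean, times 100."""
    return std_dev / mean * 100


# Values from the 30-day sales summary.
cv_a = coefficient_of_variation(20, 200)  # Product A, about 10%
cv_b = coefficient_of_variation(16, 80)   # Product B, about 20%

higher_relative_variability = "Product B" if cv_b > cv_a else "Product A"
```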
Topic: Data Governance
A data analyst discovers that a weekly customer churn export containing names, email addresses, and account IDs was accidentally shared with an external vendor that is not approved to receive customer data. The analyst still has the sent message, file name, recipient, and timestamp. Which response best aligns with incident reporting practices?
Options:
A. Wait until misuse of the data is confirmed
B. Report through the approved incident channel and preserve evidence
C. Notify affected customers directly from the analyst’s email
D. Delete the export and ask the vendor to delete it
Best answer: B
Explanation: Incident reporting focuses on timely escalation through the organization’s approved process when a breach or security event is suspected. The analyst should not decide alone whether the event is reportable or try to remediate it informally. Preserving evidence such as the file name, recipient, timestamp, and message helps the security, privacy, or compliance team assess scope, containment, notification obligations, and corrective actions. The key is to report the suspected unauthorized disclosure promptly and avoid actions that could destroy evidence or bypass assigned response roles.
Topic: Visualization and Reporting
A finance team publishes a monthly KPI dashboard for executives. After each board meeting, users must be able to reopen the dashboard exactly as it appeared on the meeting date, even if the underlying sales data is later corrected or refreshed. Which dashboard versioning approach best meets this requirement?
Options:
A. Increase the scheduled refresh frequency
B. Allow users to apply date filters manually
C. Create a snapshot version after each board meeting
D. Enable real-time refresh on the dashboard
Best answer: C
Explanation: Snapshot versioning is used when users need a fixed historical reference to a report as it existed at a specific time. In this scenario, the board needs to reopen the same KPI dashboard state later, regardless of corrections or refreshes in the source data. A snapshot captures the visible report state, supporting auditability and consistent discussion of prior decisions. Refresh settings help keep reports current, but they do not preserve how the report looked at a past meeting. Manual filters can approximate a period, but they may still reflect updated source values.
Topic: Data Analysis
An operations manager wants to prioritize product quality reviews using the highest return rate, not the highest count of returned units. Use this formula:
Return rate = returned units ÷ shipped units
| Product | Shipped units | Returned units |
|---|---|---|
| Alpha | 10,000 | 400 |
| Bravo | 4,000 | 240 |
| Charlie | 2,500 | 125 |
| Delta | 8,000 | 320 |
Which product should the analyst prioritize?
Options:
A. Delta
B. Bravo
C. Alpha
D. Charlie
Best answer: B
Explanation: A derived measure combines existing fields to create a more useful KPI. Here, the requirement is to compare products by return rate, so returned units must be divided by shipped units for each product. Alpha is 400 ÷ 10,000 = 4%, Bravo is 240 ÷ 4,000 = 6%, Charlie is 125 ÷ 2,500 = 5%, and Delta is 320 ÷ 8,000 = 4%. The highest count of returns is not the same as the highest rate when shipment volumes differ. The key takeaway is to use the formula that matches the business question, not just the largest raw count.
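As a quick check of the calculation, the sketch below derives the return rate for each product from the table values and picks the highest one.

```python
# Shipped and returned units from the table.
products = {
    "Alpha": {"shipped": 10_000, "returned": 400},
    "Bravo": {"shipped": 4_000, "returned": 240},
    "Charlie": {"shipped": 2_500, "returned": 125},
    "Delta": {"shipped": 8_000, "returned": 320},
}

# Derived measure: return rate = returned units / shipped units.
return_rates = {
    name: units["returned"] / units["shipped"] for name, units in products.items()
}

# Product with the highest rate versus the product with the highest raw count.
highest_rate_product = max(return_rates, key=return_rates.get)
highest_count_product = max(products, key=lambda name: products[name]["returned"])
```

Ranking by raw count would point to a different product than ranking by rate, which is the trap the question is built around.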
Topic: Data Concepts and Environments
A product team wants to analyze why mobile checkout sessions fail after a new release. An analyst reviews this source excerpt:
| timestamp | session_id | component | event | status |
|---|---|---|---|---|
| 2026-05-18 09:14:22 | S1027 | payment_api | auth_request | 200 |
| 2026-05-18 09:14:25 | S1027 | payment_api | token_retry | WARN |
| 2026-05-18 09:14:27 | S1027 | checkout | order_submit | ERROR |
Which interpretation is best supported by the exhibit?
Options:
A. It is log data for event and error analysis.
B. It is a transaction ledger for revenue reporting.
C. It is reference data for product categorization.
D. It is survey data for customer sentiment analysis.
Best answer: A
Explanation: Log data records what systems, applications, or devices do over time. The exhibit contains timestamped events tied to a session and component, with status values such as WARN and ERROR. That structure supports analysis of event sequences, usage behavior, failures, latency, and operational conditions. For this checkout issue, the analyst can trace what happened before the failure and identify the affected component or event pattern.
A transaction ledger might confirm whether an order was completed, but it usually would not show the detailed system events leading to the checkout error.
Topic: Data Analysis
A customer success manager is building a dashboard for a quarterly retention initiative. The manager needs one KPI that gives an early signal before the renewal results are known.
Exhibit:
| KPI | Measured when | Typical use |
|---|---|---|
| Onboarding completion within 14 days | First 2 weeks | Identify accounts needing intervention |
| Quarterly renewal rate | End of quarter | Evaluate retention success |
| Tickets closed per support agent | Daily | Track support workload |
| Annual recurring revenue retained | End of year | Assess business impact |
Which interpretation best aligns the KPI type with the manager’s need?
Options:
A. Use annual revenue retained as an operational KPI.
B. Use tickets closed as an outcome-oriented KPI.
C. Use quarterly renewal rate as a leading KPI.
D. Use onboarding completion as a leading KPI.
Best answer: D
Explanation: A leading KPI provides an early signal that can influence a later result. In this scenario, onboarding completion within 14 days happens before the quarterly renewal decision and can trigger intervention for at-risk accounts. Quarterly renewal rate and annual recurring revenue retained are lagging or outcome-oriented measures because they summarize results after the business period is complete. Tickets closed per support agent is operational because it tracks day-to-day activity or workload, not the strategic retention outcome directly.
The key distinction is timing and purpose: leading KPIs help predict or influence future outcomes, while lagging and outcome-oriented KPIs evaluate what already happened.
Topic: Data Governance
A sales operations team publishes a daily revenue dashboard from an ETL pipeline. A one-time profiling review was completed before launch. The team now wants to detect quality issues without waiting for users to report bad numbers.
Exhibit: Data quality notes
| Check | Current approach | Recent finding |
|---|---|---|
| Null customer_id rate | Manual review at launch | Rose from 0.2% to 4.8% last week |
| Duplicate order_id count | Manual review at launch | Spiked on three daily loads |
| Dashboard refresh | Scheduled daily | Completed successfully |
Which next action is best supported by the exhibit?
Options:
A. Implement automated quality monitoring with alerts
B. Ignore the issue because refreshes succeeded
C. Replace the dashboard with a static monthly report
D. Repeat the original one-time profiling review
Best answer: A
Explanation: A one-time quality review or profiling activity is useful before launch to understand a dataset’s structure, nulls, duplicates, ranges, and anomalies at a point in time. In an ongoing reporting workflow, quality can drift after launch even when the refresh job technically succeeds. The exhibit shows recurring changes in null rates and duplicate counts across daily loads, so the better governance control is automated data quality monitoring with thresholds, scheduled checks, and alerts. That approach detects emerging issues as the pipeline runs and supports timely investigation before users rely on incorrect KPIs.
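A scheduled check of this kind could be sketched as below. The thresholds, the sample load, and the alert wording are assumptions; in practice the alerts would be routed to a monitoring or notification tool rather than kept in a list.

```python
from collections import Counter

# Assumed thresholds; real values would come from the team's quality rules.
MAX_NULL_RATE = 0.01
MAX_DUPLICATE_IDS = 0


def null_rate(rows, field):
    """Share of rows where the field is missing or blank."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)


def duplicate_count(rows, field):
    """Number of extra rows that repeat an already-seen value."""
    counts = Counter(row[field] for row in rows)
    return sum(n - 1 for n in counts.values() if n > 1)


def check_load(rows):
    """Return alert messages for one daily load; an empty list means no alerts."""
    alerts = []
    rate = null_rate(rows, "customer_id")
    if rate > MAX_NULL_RATE:
        alerts.append(f"customer_id null rate {rate:.1%} exceeds {MAX_NULL_RATE:.1%}")
    dupes = duplicate_count(rows, "order_id")
    if dupes > MAX_DUPLICATE_IDS:
        alerts.append(f"{dupes} duplicate order_id value(s) found")
    return alerts


# Hypothetical daily load with one null customer_id and one repeated order_id.
daily_load = [
    {"order_id": "O-1", "customer_id": "C-1"},
    {"order_id": "O-2", "customer_id": None},
    {"order_id": "O-2", "customer_id": "C-3"},
    {"order_id": "O-3", "customer_id": "C-4"},
]
alerts = check_load(daily_load)
```

Running `check_load` after every load, rather than once at launch, is what turns the profiling checks into monitoring.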
Topic: Visualization and Reporting
Branch managers need staffing KPIs by 8 a.m. each business day. Each manager should see only their own branch. The metrics include employee IDs and absence reasons. The warehouse refreshes nightly, and managers are comfortable using the company BI portal but not SQL or spreadsheet modeling. Which delivery method best meets these requirements?
Options:
A. Create a real-time dashboard using a shared public link
B. Grant managers read access to warehouse tables
C. Email a spreadsheet export to each manager every morning
D. Publish a role-secured BI dashboard with scheduled refresh
Best answer: D
Explanation: A delivery method should match how users can consume the report while protecting sensitive data and meeting the refresh need. Here, a BI portal dashboard fits the managers’ capability, and scheduled refresh aligns with the nightly warehouse update and 8 a.m. deadline. Role-based or row-level security limits each manager to only their branch, which is important because the report includes employee-level sensitive information. A more open or raw-data delivery method would increase privacy and misuse risk without improving the business outcome. The key is to combine appropriate access control with a delivery format the audience can actually use.
Topic: Data Concepts and Environments
A data analyst must build a monthly vendor spend report. Each regional office sends the analyst an approved extract from its procurement system, but the offices do not share a managed database, API, or warehouse connection. The report must be refreshed from the exchanged files and checked for missing required columns before analysis. Which source approach is the BEST professional decision?
Options:
A. Use standardized flat-file extracts as the source
B. Manually retype each office’s totals into the report
C. Scrape each procurement system’s web interface
D. Require all offices to migrate into one database first
Best answer: A
Explanation: Flat files, such as CSV or Excel extracts, are practical sources when business users exchange data outside a managed database workflow. In this scenario, the offices already produce approved extracts, and there is no shared database, API, or warehouse connection. The analyst should standardize the file format and validate required columns before analysis so the monthly refresh is repeatable and data quality issues are caught early. A full database migration may be useful later, but it is beyond the stated need and would overengineer the source selection.
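A required-column check on an incoming extract might look like this sketch. The column names and the sample file content are hypothetical, and `io.StringIO` stands in for an opened CSV file.

```python
import csv
import io

# Assumed required columns for the vendor spend report.
REQUIRED_COLUMNS = {"vendor_id", "invoice_date", "amount"}


def missing_columns(file_obj):
    """Return the required columns that are absent from the file header."""
    reader = csv.DictReader(file_obj)
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# Hypothetical extracts; in practice these would be files opened with open().
good_extract = io.StringIO("vendor_id,invoice_date,amount\nV-1,2026-05-01,100.00\n")
bad_extract = io.StringIO("vendor_id,amount\nV-2,55.00\n")

good_result = missing_columns(good_extract)
bad_result = missing_columns(bad_extract)
```

An extract with a non-empty result can be rejected or sent back to the office before any totals are calculated.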
Topic: Visualization and Reporting
A sales operations analyst must support a weekly review where managers explore revenue by region, product category, and month. The validated dataset is stored in a governed warehouse table, and the analyst is not allowed to alter the source table or create new governed summary fields. Managers need to regroup and filter totals during the meeting. Which approach is the best professional decision?
Options:
A. Create a pivot table from the approved dataset
B. Export raw rows for managers to manually edit
C. Add summary columns to the warehouse table
D. Design a static infographic of monthly revenue
Best answer: A
Explanation: A pivot table is appropriate when users need exploratory aggregation over an approved dataset. It lets managers group, filter, and summarize measures such as revenue by dimensions such as region, product category, and month without modifying the governed source table. This fits the meeting need because the audience can change the view interactively while the underlying validated data remains intact. Static visuals are better for fixed communication, and changing warehouse schema or allowing manual edits would create governance and quality risks.
Topic: Data Concepts and Environments
A retail analytics team maintains a star schema for sales reporting. The business wants quarterly sales to remain tied to the customer segment that was active when each sale occurred, even if a customer’s segment changes later. Which dimension-table approach best supports this requirement?
Options:
A. Overwrite the segment in the customer dimension
B. Use a Type 2 slowly changing dimension
C. Store only the previous segment in a new column
D. Remove customer segment from the sales model
Best answer: B
Explanation: A slowly changing dimension is used when descriptive attributes in a dimension can change over time. Because the report must show sales by the customer segment that was valid at the time of each sale, the model needs to preserve historical segment values instead of replacing them. A Type 2 SCD typically adds a new dimension row for each changed version of the customer, often with a surrogate key plus effective dates or a current-row flag. This allows facts to remain associated with the correct historical dimension version. Overwriting the dimension would make older sales appear under the customer’s current segment, reducing reporting accuracy.
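The row-versioning idea can be sketched with a small in-memory dimension. The column names (`customer_key`, `valid_from`, `valid_to`), the segment names, and the dates are assumptions for illustration; actual Type 2 designs vary in how they mark the current row.

```python
from datetime import date

# Hypothetical customer dimension: one row per version of the customer.
# valid_to is None for the current version.
customer_dim = [
    {
        "customer_key": 1,  # surrogate key
        "customer_id": "C-22",
        "segment": "Small Business",
        "valid_from": date(2025, 1, 1),
        "valid_to": date(2025, 6, 30),
    },
    {
        "customer_key": 2,
        "customer_id": "C-22",
        "segment": "Enterprise",
        "valid_from": date(2025, 7, 1),
        "valid_to": None,
    },
]


def segment_at(customer_id, sale_date):
    """Return the segment that was active for the customer on the sale date."""
    for row in customer_dim:
        if row["customer_id"] != customer_id:
            continue
        starts_before = row["valid_from"] <= sale_date
        ends_after = row["valid_to"] is None or sale_date <= row["valid_to"]
        if starts_before and ends_after:
            return row["segment"]
    return None  # no version covers that date


early_sale_segment = segment_at("C-22", date(2025, 3, 15))
later_sale_segment = segment_at("C-22", date(2025, 9, 1))
```

In a warehouse, the fact row would usually store the surrogate key of the matching version at load time, so the lookup is not repeated at query time.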
Topic: Data Acquisition and Preparation
A retail analyst is reviewing a customer satisfaction survey intended to represent a typical month of online checkout experiences. Which interpretation is best supported by the collection profile?
Exhibit: Survey collection profile
| Field | Detail |
|---|---|
| Target population | Monthly online purchasers |
| Collection window | November 24-27 |
| Business note | Black Friday promotion and checkout latency incident |
| Completed surveys | 1,850 |
| Invitation method | Email to purchasers during the window |
Options:
A. Conclude the invitation method caused duplicate survey responses.
B. Flag possible timing bias from the collection window.
C. Diagnose nonresponse bias from customers who ignored the email.
D. Treat the results as representative because the response count is high.
Best answer: B
Explanation: Collection timing can bias results when data is gathered during an unusual period that changes normal behavior. Here, the survey is supposed to represent typical monthly checkout experiences, but the collection window overlaps a Black Friday promotion and a checkout latency incident. Those conditions can affect satisfaction, traffic patterns, purchase urgency, and response behavior. A large number of responses does not remove this bias if all observations come from an atypical window. The appropriate interpretation is to flag timing-related collection bias and consider recollecting or comparing against normal-period data.
Topic: Visualization and Reporting
A data analyst is preparing a quarterly executive dashboard using 24 months of aggregated sales data. Leaders want to see how total revenue changes over time and how each product category contributes to that total. The chart must be easy to scan and should not expose transaction-level detail. Which chart type is the best choice?
Options:
A. Histogram
B. Stacked area chart
C. Pie chart
D. Scatter plot
Best answer: B
Explanation: Chart selection should match the analytical question. This scenario combines a trend question with a composition question: executives need to see revenue over 24 months and understand how product categories make up the total. A stacked area chart is designed for this type of time-based composition because the overall shape shows the total trend, while the stacked bands show category contributions. Aggregated monthly or quarterly data also supports the privacy and summary-level reporting need. A simple line chart could show trend, but it would not show part-to-whole composition as clearly.
Topic: Data Analysis
A logistics analyst is comparing two fulfillment partners for a contract renewal. Leadership says the priority is predictable delivery within a 4-day SLA, not the lowest average. Timestamps were validated, and the 1% missing delivery times for both partners are within the approved reporting tolerance.
| Partner | Mean days | Std. dev. | IQR | 95th percentile |
|---|---|---|---|---|
| A | 2.8 | 1.4 | 1.9 | 5.2 |
| B | 3.0 | 0.4 | 0.5 | 3.7 |
Which recommendation is the BEST professional decision?
Options:
A. Delay the decision until all missing values are imputed.
B. Recommend Partner A because its wider spread shows more delivery flexibility.
C. Recommend Partner B for more consistent SLA performance.
D. Recommend Partner A because it has the lower mean delivery time.
Best answer: C
Explanation: When the business question emphasizes consistency or risk, dispersion measures can matter more than the mean. Partner A has a slightly lower average delivery time, but its standard deviation and IQR are much larger, and its 95th percentile exceeds the 4-day SLA. That means more deliveries are likely to be late even though the average looks favorable. Partner B has a slightly higher mean, but the much smaller spread and 95th percentile of 3.7 days better match the requirement for predictable SLA performance. The key is aligning the statistic with the decision objective, not selecting the lowest average by default.
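The comparison can be expressed as a short sketch using the figures from the table above. The ranking rule (SLA fit at the 95th percentile first, then smallest standard deviation) is one assumed way to encode "predictable delivery", not an official scoring method.

```python
SLA_DAYS = 4.0

# Summary statistics copied from the partner table in the question.
PARTNERS = {
    "A": {"mean": 2.8, "std_dev": 1.4, "iqr": 1.9, "p95": 5.2},
    "B": {"mean": 3.0, "std_dev": 0.4, "iqr": 0.5, "p95": 3.7},
}

def rank_for_predictability(partners, sla_days):
    """Order partners by SLA fit first, then by smallest spread."""
    return sorted(
        partners,
        key=lambda name: (partners[name]["p95"] > sla_days,
                          partners[name]["std_dev"]),
    )

lowest_mean = min(PARTNERS, key=lambda name: PARTNERS[name]["mean"])
most_predictable = rank_for_predictability(PARTNERS, SLA_DAYS)[0]
```

Here `lowest_mean` and `most_predictable` point to different partners, which is the point of the question: the statistic has to match the stated objective.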
Topic: Data Governance
A data analyst is reconciling conflicting October revenue totals before publishing an executive KPI report. The report must support month-end revenue decisions and use one authoritative reference.
Exhibit: Dataset metadata
| Dataset | October total | Owner | Approved use | Governance notes |
|---|---|---|---|---|
| CRM_Deals | $1,240,000 | Sales Ops | Pipeline tracking | Includes open deals |
| ERP_Invoices | $1,185,000 | Finance | Recognized revenue | Audited month-end close |
| BI_Revenue_View | $1,198,000 | Analytics | Dashboard staging | Derived from CRM and ERP |
| Analyst_Adjustments.xlsx | $1,210,000 | Analyst | Ad hoc analysis | Manual edits, no approval |
Which dataset should be selected as the source of truth for the KPI report?
Options:
A. ERP_Invoices
B. BI_Revenue_View
C. Analyst_Adjustments.xlsx
D. CRM_Deals
Best answer: A
Explanation: A source of truth is the authoritative dataset used when multiple sources conflict. For month-end revenue decisions, the best source is the dataset with the right business definition, accountable owner, approval status, and governance controls. The exhibit identifies ERP_Invoices as Finance-owned, approved for recognized revenue, and audited at month-end. Those qualities make it the authoritative reference for the KPI report, even if another dataset has a newer-looking or higher total. Derived dashboard views and manual spreadsheets can support analysis, but they should not override the governed system of record.
Topic: Data Concepts and Environments
A data analyst must build a monthly compliance report for finance managers. The report must use governed customer and transaction tables, support the same query logic each month, and return consistent structured records for audit review. Which data source is the best professional choice?
Options:
A. The approved relational database
B. A raw document folder in a data lake
C. A web scrape of customer account pages
D. A shared spreadsheet exported by users
Best answer: A
Explanation: For a recurring compliance report that depends on governed tables and repeatable query logic, the best source is an approved relational database or similar governed database environment. Relational databases store structured records in tables, enforce schemas and constraints, and allow analysts to run consistent SQL queries over time. This supports auditability because the source, query, and data structure can be documented and repeated. User-managed files, scraped pages, and raw document stores may be useful for exploration or unstructured data, but they introduce more variability and governance risk for this requirement.
Topic: Data Concepts and Environments
A retail analyst is designing a star schema for an executive dashboard. The sales fact table is at the order-line grain, and the CRM source allows each customer to have multiple interest segments. A direct join to the segment list makes reported revenue higher than the finance total. Executives still need revenue by segment and reconciled overall totals. What is the BEST professional decision?
Options:
A. Create a customer-segment bridge table with allocation weights
B. Join sales directly to the customer segment list
C. Store all customer segments in one delimited field
D. Keep only the customer’s most recent segment
Best answer: A
Explanation: A bridge table is used when a dimension relationship is many-to-many, such as customers belonging to multiple segments. Directly joining an additive fact like revenue to multiple segment rows repeats the same order-line amount, inflating totals. A customer-segment bridge stores one row per valid customer-to-segment relationship and can include an allocation factor when revenue must be distributed across segments. This preserves segment analysis while allowing overall totals to reconcile to the sales fact and finance source. The key is to model the relationship explicitly instead of hiding or deleting valid segment memberships.
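The inflation and its fix can be illustrated with a small sketch. The customers, segments, and 50/50 allocation weights below are hypothetical; in practice the weights would come from an agreed business rule.

```python
from collections import defaultdict

# Hypothetical order-line facts and customer-segment bridge rows.
ORDER_LINES = [
    {"customer_id": "C1", "revenue": 100.0},
    {"customer_id": "C2", "revenue": 60.0},
]
BRIDGE = [
    {"customer_id": "C1", "segment": "Outdoor", "weight": 0.5},
    {"customer_id": "C1", "segment": "Fitness", "weight": 0.5},
    {"customer_id": "C2", "segment": "Fitness", "weight": 1.0},
]

def revenue_by_segment(lines, bridge, use_weights=True):
    """Join facts to the bridge; optionally apply allocation weights."""
    totals = defaultdict(float)
    for line in lines:
        for link in bridge:
            if link["customer_id"] == line["customer_id"]:
                factor = link["weight"] if use_weights else 1.0
                totals[link["segment"]] += line["revenue"] * factor
    return dict(totals)

naive = revenue_by_segment(ORDER_LINES, BRIDGE, use_weights=False)
allocated = revenue_by_segment(ORDER_LINES, BRIDGE, use_weights=True)
fact_total = sum(line["revenue"] for line in ORDER_LINES)
```

With weights that sum to 1 per customer, the segment totals add back up to `fact_total`; without them, the C1 order line is counted once per segment.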
Topic: Data Acquisition and Preparation
A marketing analyst is preparing a monthly campaign dataset for a revenue dashboard. The source file contains order_amount_raw as text; most values are valid amounts, but some contain entries such as TBD, blank values, or currency symbols. The dashboard needs a numeric amount for aggregation, and the data steward wants the original quality issue to remain auditable. Which preparation approach best meets these requirements?
Options:
A. Overwrite order_amount_raw with cleaned numeric values
B. Remove every row with a nonnumeric amount
C. Create a numeric derived field and retain the raw field with an error flag
D. Replace all invalid amounts with 0 before loading
Best answer: C
Explanation: Analysis readiness improves when a transformation creates a usable field without destroying evidence of the source issue. In this case, the dashboard needs a numeric value for aggregation, but the steward also needs auditability. A derived numeric amount field supports calculations, while retaining order_amount_raw preserves lineage. Adding an error or validity flag makes invalid values visible for quality review instead of silently hiding them. This approach separates reporting usability from data-quality remediation.
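One possible shape for this transformation is sketched below. The helper name `derive_amount` and the decision to strip `$` and thousands separators before parsing are assumptions made for illustration; the steward may prefer to flag those values as well.

```python
import re

def derive_amount(raw):
    """Return (numeric_amount, error_flag) for one order_amount_raw value.

    The raw value itself is never changed; callers keep it alongside
    the derived field and the flag.
    """
    text = "" if raw is None else raw.strip()
    cleaned = re.sub(r"[$,]", "", text)  # assumed: symbols are formatting only
    if re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return float(cleaned), False
    return None, True  # TBD, blank, or anything else nonnumeric

# Keep the raw field, add the derived field and the flag.
raw_values = ["1250.50", "$99", "TBD", ""]
prepared = [
    {"order_amount_raw": raw,
     "order_amount": derive_amount(raw)[0],
     "amount_error": derive_amount(raw)[1]}
    for raw in raw_values
]
```

Aggregations can then skip rows where `order_amount` is missing, while quality reviews filter on `amount_error`.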
Topic: Data Analysis
A data analyst is writing a dashboard note for executives about a promotional email pilot. The note must summarize only what the data supports. Which wording is most appropriate?
Exhibit: Pilot summary
Population: loyalty members who opted in to email
Comparison: same members, 30 days before vs. after email
Average order value: $42 before; $47 after
Control group: none
Known issue: holiday sale overlapped pilot
Options:
A. All customers will increase average order value by $5 if emailed.
B. The email caused a $5 increase in average order value.
C. The email had no effect because the holiday sale overlapped the pilot.
D. Opted-in loyalty members had higher average order value after the email; causation is not established.
Best answer: D
Explanation: Communication should match the strength of the evidence. The exhibit shows a before-and-after increase for the same opted-in loyalty members, but there is no control group and a holiday sale occurred at the same time. That supports wording about an observed association or change in this group, not a causal claim or a prediction for all customers. A confounder weakens causal interpretation, but it does not prove the email had no effect. The safest wording states the observed result and clearly limits the conclusion.
Topic: Data Analysis
A data analyst is asked whether a 2-week email campaign increased repeat purchases. The dashboard shows a lift, but the analyst notices a recent data-quality alert. Which next action is most defensible?
Exhibit: Campaign evidence summary
| Check | Result |
|---|---|
| Repeat purchase rate, targeted customers | 8.4% |
| Repeat purchase rate, holdout customers | 7.9% |
| Missing campaign flag | 18% of orders |
| Missingness pattern | Mostly from mobile checkout |
| Data alert | Campaign flag pipeline changed 3 days before launch |
Options:
A. Exclude all mobile orders and publish the revised lift
B. Impute missing campaign flags as not targeted
C. Validate the campaign flag source and rerun the analysis
D. Report that the campaign increased repeat purchases
Best answer: C
Explanation: Evidence-based conclusions should account for data quality before interpreting a KPI difference. Here, the observed lift is only 0.5 percentage points, while 18% of orders are missing the campaign flag. The missingness is not random because it is concentrated in mobile checkout, and a pipeline change occurred just before the campaign. That makes the campaign/holdout comparison unreliable until the source issue is investigated and the analysis is rerun with validated data. The defensible next action is to verify lineage and source extraction for the campaign flag rather than state a business conclusion from compromised evidence.
Topic: Data Governance
A retail analytics team receives daily sales extracts from eight stores. The executive dashboard has repeatedly shown issues caused by blank store_id values, negative quantity values, and duplicate transaction_id values. The business wants to catch these issues before the dashboard refreshes, track recurrence by store over time, and notify the data owner automatically.
Which approach best maps to these requirements?
Options:
A. Run quarterly user acceptance testing on dashboard filters.
B. Mask transaction identifiers before loading the dashboard.
C. Implement automated profiling rules, quality metrics, and alerts.
D. Manually inspect a sample after the dashboard publishes.
Best answer: C
Explanation: Automated data-quality monitoring is the best fit when recurring issues must be detected consistently before downstream reporting is affected. Profiling rules can test fields for completeness, valid ranges, and uniqueness, while quality metrics can trend failure rates by source store over time. Alerts connect the monitoring process to governance by notifying the accountable data owner when thresholds or rules fail. This approach supports repeatable detection and evidence-based remediation instead of relying on occasional review or after-the-fact correction.
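As a rough sketch, the three rules from the scenario could be checked before each refresh as follows. The function names and the zero-tolerance threshold are hypothetical, and the actual notification to the data owner is left out.

```python
from collections import Counter

def profile_extract(rows):
    """Count rule failures for one store's daily extract."""
    id_counts = Counter(r["transaction_id"] for r in rows)
    return {
        # Completeness: store_id must not be blank.
        "blank_store_id": sum(
            1 for r in rows if not (r.get("store_id") or "").strip()),
        # Validity: quantity must not be negative.
        "negative_quantity": sum(1 for r in rows if r["quantity"] < 0),
        # Uniqueness: each transaction_id should appear once.
        "duplicate_transaction_id": sum(
            n - 1 for n in id_counts.values() if n > 1),
    }

def should_alert(metrics, max_failures=0):
    """True when any rule exceeds the allowed number of failures."""
    return any(count > max_failures for count in metrics.values())

sample = [
    {"store_id": "S1", "transaction_id": "T1", "quantity": 2},
    {"store_id": "", "transaction_id": "T2", "quantity": 1},
    {"store_id": "S1", "transaction_id": "T3", "quantity": -4},
    {"store_id": "S1", "transaction_id": "T1", "quantity": 2},
]
metrics = profile_extract(sample)
```

Storing `metrics` per store and per day gives the trend over time; `should_alert` is the hook where an owner notification would be triggered.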
Topic: Data Acquisition and Preparation
A data analyst is preparing a monthly customer churn model for the retention team. The source CRM table has 8% missing values in household_income, which is a useful predictor, and rows with missing values are otherwise complete. The business owner wants the model update delivered this week and asks that the preparation steps be reproducible and defensible. Which action is the BEST professional decision?
Options:
A. Delete all rows with missing income values before modeling
B. Replace all missing income values with zero
C. Impute the missing income values using a documented, validated method
D. Remove the income field from the model dataset
Best answer: C
Explanation: Imputation is appropriate when missing values should be filled using a reasonable method instead of automatically deleting records. In this scenario, the missing field is useful for churn modeling, the affected rows are otherwise complete, and the analyst must produce a defensible, reproducible preparation process. A documented method, such as median imputation or segment-based imputation followed by validation, can preserve sample size and reduce avoidable bias from dropping records. The chosen method should be recorded so reviewers can understand how missing values were handled.
Automatic deletion is a weaker choice because it discards usable customer records without first assessing the impact of the missingness.
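A minimal sketch of median imputation with an audit flag is shown below. The income values are invented for illustration, and median imputation is only one of the documented methods the explanation mentions.

```python
from statistics import median

def impute_median(values):
    """Fill None entries with the median of the observed values.

    Returns the filled list, a parallel list of was-imputed flags,
    and the fill value so the step can be documented.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_imputed = [v is None for v in values]
    return imputed, was_imputed, fill

# Hypothetical household_income values with one missing entry.
incomes = [40000, None, 60000, 80000]
filled, flags, fill_value = impute_median(incomes)
```

Keeping the flags and the fill value makes the step reproducible and lets reviewers compare model results with and without the imputed rows.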
Topic: Data Governance
A data analyst publishes a monthly revenue dashboard used by finance leadership. After a late adjustment from the source system, the March dataset and dashboard totals changed, but leaders still need to compare the current numbers with the originally published March report for audit discussion. What is the best professional decision?
Options:
A. Overwrite the March files with the corrected values
B. Delete the original dashboard to avoid conflicting totals
C. Maintain versioned snapshots of the dataset and dashboard output
D. Email finance a note explaining that totals changed
Best answer: C
Explanation: Data versioning is a governance control for preserving identifiable versions of datasets, reports, or analytical outputs as they change. In this scenario, finance needs both the originally published March report and the corrected version for audit comparison. Keeping versioned snapshots supports traceability, repeatability, and source-of-truth discipline without preventing valid corrections from being applied. Overwriting or deleting files removes evidence of what was previously published. A note alone may provide context, but it does not preserve the actual dataset and report state needed for comparison.
Topic: Data Governance
A data analyst is preparing a monthly customer-service dashboard from support tickets that include customer names and email addresses. The company retention policy requires raw tickets to be retained for 180 days, disposed of after that period unless on legal hold, and aggregated KPI results to be retained for 3 years. Which action is the best professional decision?
Options:
A. Delete all raw tickets after dashboard publication
B. Retain raw tickets 180 days, then dispose unless held
C. Export raw tickets to a personal spreadsheet archive
D. Keep raw tickets indefinitely for trend analysis
Best answer: B
Explanation: Retention requirements define how long data must be kept and when it must be disposed of based on policy, regulation, or legal hold. In this scenario, raw tickets contain personal data, so the analyst should not keep them longer than allowed or delete them before the required retention period. The professional decision is to retain raw tickets for 180 days, dispose of them after that period unless a legal hold applies, and keep only the aggregated KPI results for the 3-year reporting requirement. This balances compliance, reporting continuity, and privacy risk.
Topic: Data Analysis
A data analyst is creating a calculated field named review_status for an orders dashboard. The field must classify each row using the rule in the exhibit. Which logical function pattern should be used?
Exhibit: Classification rule
| Condition | Result |
|---|---|
| chargeback_flag = TRUE | Manual review |
| customer_status = "New" and order_amount >= 500 | Manual review |
| All other rows | Standard |
Options:
A. Use CONCAT to combine chargeback_flag, customer_status, and order_amount
B. Use IF/CASE with chargeback_flag OR customer_status = "New" OR order_amount >= 500
C. Use IF/CASE with chargeback_flag OR (customer_status = "New" AND order_amount >= 500)
D. Use IF/CASE with chargeback_flag AND customer_status = "New" AND order_amount >= 500
Best answer: C
Explanation: Logical functions are used when a calculated field depends on conditions, flags, categories, or business rules. In this case, the output is category-based: each row becomes either Manual review or Standard. The exhibit defines two separate paths to Manual review: any chargeback, or a new customer with an order amount of at least 500. That means the overall test needs OR, while the new-customer high-value path needs AND grouped together. An IF or CASE expression can then return the correct label based on that combined Boolean result. The key is preserving the rule logic exactly, not merely checking whether any individual field has a value.
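The rule translates directly into a conditional. The sketch below uses Python rather than a specific dashboard formula language, so the function signature is an assumption; the Boolean grouping is what matters.

```python
def review_status(chargeback_flag, customer_status, order_amount):
    """Classify one order row according to the exhibit's rule."""
    # OR between the two paths; AND grouped inside the second path.
    if chargeback_flag or (customer_status == "New" and order_amount >= 500):
        return "Manual review"
    return "Standard"
```

A new customer with a 499 order and an existing customer with a 5,000 order both stay `Standard`, which is where the ungrouped OR and the all-AND variants go wrong.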
Topic: Data Acquisition and Preparation
A retail analyst receives a customer export to build a monthly churn report by region for executives. The report requires one row per customer, a valid cancellation date when churned, and a standardized region value. Which exploration step should the analyst perform first to determine whether the dataset is fit for this purpose?
Options:
A. Create the final churn trend dashboard
B. Delete records with any blank fields
C. Aggregate churn counts by product category
D. Profile customer IDs, cancellation dates, and region values
Best answer: D
Explanation: Data exploration should focus on the fields that directly support the analytical purpose. For a churn report by month and region, the analyst needs to verify uniqueness at the customer level, usable cancellation dates for monthly grouping, and consistent region categories for segmentation. Profiling these fields can reveal whether the dataset has duplicates, missing values, invalid dates, or inconsistent labels that would make the report unreliable.
The key takeaway is to assess fitness for the intended analysis before transforming, deleting, aggregating, or visualizing the data.
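A profiling pass over the three fields might look like the sketch below. The field names, the `YYYY-MM-DD` date format, and the approved region list are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

APPROVED_REGIONS = {"North", "South", "East", "West"}  # hypothetical list

def is_valid_date(text):
    """True when text parses as YYYY-MM-DD (assumed export format)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

def profile_customers(rows):
    """Summarize uniqueness, date validity, and region consistency."""
    id_counts = Counter(r["customer_id"] for r in rows)
    return {
        "duplicate_ids": sorted(c for c, n in id_counts.items() if n > 1),
        "churned_without_valid_date": sum(
            1 for r in rows
            if r["churned"] and not is_valid_date(r["cancel_date"])),
        "nonstandard_regions": sorted(
            {r["region"] for r in rows} - APPROVED_REGIONS),
    }

sample = [
    {"customer_id": "C1", "churned": True,
     "cancel_date": "2024-02-10", "region": "North"},
    {"customer_id": "C1", "churned": False,
     "cancel_date": None, "region": "north"},
    {"customer_id": "C2", "churned": True,
     "cancel_date": "10/02/2024", "region": "East"},
]
report = profile_customers(sample)
```

The report describes problems without changing any rows, so decisions about deduplication or relabeling can be made afterward with evidence.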
Topic: Data Acquisition and Preparation
A retail analyst is asked to summarize customer satisfaction for all store shoppers. Due to time constraints, the team collected responses only from shoppers who voluntarily completed a tablet survey near the checkout counter during one Saturday afternoon. Which sampling limitation should the analyst document?
Options:
A. Random sample
B. Convenience-like sample
C. Incomplete sample
D. Fully unbiased sample
Best answer: B
Explanation: A convenience-like sample uses participants who are easiest to access, which can limit how well the results represent the target population. In this scenario, shoppers chose whether to respond, and collection occurred only near checkout during one Saturday afternoon. That approach is convenient, but it may overrepresent certain shoppers and miss others who shop at different times, skip the tablet, or use other checkout methods. An incomplete sample would emphasize missing required records or fields from an intended collection, while a random sample would require each shopper to have a known chance of selection. The key takeaway is to document the collection limitation before generalizing the results to all shoppers.
Topic: Visualization and Reporting
A data analyst is preparing an executive update on website conversion rates. The dataset contains weekly conversion rates for 12 weeks before and 12 weeks after a homepage redesign. A paid marketing campaign began the same week as the redesign, and web tracking was incomplete for two days. Which visualization approach is the best professional decision?
Options:
A. 3D pie chart of pre- and post-redesign conversions
B. Forecast chart projecting future redesign gains
C. Annotated weekly line chart with data-quality notes
D. Before-and-after bar chart labeled as redesign impact
Best answer: C
Explanation: The core issue is choosing a visualization that supports the message without implying stronger evidence than the data can support. A weekly line chart lets executives see the pattern before and after the redesign, while annotations can mark the redesign date, the marketing campaign, and the tracking gap. This communicates a possible association and preserves transparency about data quality and confounding factors. A simple before-and-after comparison would be easier to read, but it can overstate causation when another campaign started at the same time.
Topic: Data Concepts and Environments
An analytics team is comparing AI tools for automated reporting. Which interpretation is best supported by the exhibit?
| Tool | Model note | Planned use |
|---|---|---|
| Assistant A | Pretrained on large text corpora; generates SQL explanations and summaries | Analyst Q&A |
| Platform B | Pretrained on text, image, and audio; adaptable for captioning, transcription, and document search | Multiple reporting automations |
Options:
A. Both tools are only LLMs because both can process text.
B. Assistant A is a foundation model; Platform B is only RPA.
C. Assistant A is an LLM; Platform B is a broader foundation model.
D. Neither tool uses AI unless trained from scratch internally.
Best answer: C
Explanation: A large language model (LLM) is focused on language tasks such as generating, summarizing, or explaining text. A foundation model is the broader category: it is pretrained on large datasets and can be adapted to many tasks, which may include language, images, audio, or other modalities. In the exhibit, Assistant A is language-centered, so it fits the LLM description. Platform B supports multiple data types and reporting tasks, so it is best interpreted as a broader foundation model. The key distinction is scope: language-specific capability versus broad pretrained adaptability.
Topic: Data Governance
A marketing analytics team wants to send customer purchase data to an external vendor for a campaign-response analysis. The vendor only needs customer segment, region, purchase month, and purchase amount. The contract states that direct identifiers and unnecessary personal data must not leave the organization. Which sharing approach best meets these requirements?
Options:
A. Send the raw file and require the vendor to delete extra fields
B. Send the full customer table after encrypting the file
C. Share the internal dashboard and ask the vendor to export needed data
D. Share a minimized, de-identified extract through an approved secure channel
Best answer: D
Explanation: When data leaves a team, system, vendor, or organizational boundary, sharing controls should be applied before release. In this scenario, the vendor has a narrow business need and the contract prohibits direct identifiers and unnecessary personal data from leaving the organization. The best approach is to create a least-privilege extract containing only required fields, remove or de-identify identifiers, and use an approved secure transfer method. Encryption is important, but it does not make unnecessary data appropriate to share. The key takeaway is to combine data minimization with protection controls before external sharing occurs.
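Data minimization can be sketched as an allow-list applied before anything leaves the organization. The field names below are hypothetical stand-ins for the four fields the vendor needs, and the secure transfer step itself is out of scope here.

```python
# Only the fields the vendor needs; everything else is dropped by default.
ALLOWED_FIELDS = ("segment", "region", "purchase_month", "purchase_amount")

def minimized_extract(rows):
    """Build a new list of rows containing only allow-listed fields."""
    return [{field: row[field] for field in ALLOWED_FIELDS} for row in rows]

# Hypothetical internal record with direct identifiers.
internal_rows = [
    {"customer_name": "Example Person", "email": "person@example.com",
     "segment": "Loyal", "region": "West",
     "purchase_month": "2024-03", "purchase_amount": 120.0},
]
vendor_rows = minimized_extract(internal_rows)
```

An allow-list is safer than a block-list because a newly added sensitive column is excluded automatically instead of leaking until someone notices.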
Topic: Data Acquisition and Preparation
A data analyst is preparing transaction data for a monthly revenue report. The business owner warns that unusually high or low amounts may be legitimate bulk orders, refunds, data-entry errors, payment-system faults, or one-time promotional events. Which preparation approach best meets this requirement before the report is published?
Options:
A. Remove all values outside the interquartile range
B. Flag and investigate outliers before assigning a treatment
C. Ignore outliers if the monthly total reconciles
D. Replace all unusual values with the median amount
Best answer: B
Explanation: Outlier handling should start with identification and investigation, not automatic deletion or replacement. In this scenario, the same unusual value pattern could mean several different things: a valid extreme transaction, a refund, a data-entry issue, a system fault, or a special business event. A good preparation approach flags suspected outliers, checks source records or logs, consults business rules, and documents the reason for the final treatment. This preserves valid business signals while preventing bad data from distorting the report.
The key takeaway is that outliers are not automatically errors; they require context before remediation.
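The flag-first approach can be sketched with the common 1.5 × IQR fences. The amounts are invented, and the fence multiplier is a conventional default rather than a rule from the scenario; the function only marks values for review and never drops or replaces them.

```python
from statistics import quantiles

def flag_outliers(amounts, k=1.5):
    """Return amounts outside the Q1 - k*IQR .. Q3 + k*IQR fences.

    Flagged values are candidates for investigation, not deletions.
    """
    q1, _, q3 = quantiles(amounts, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [a for a in amounts if a < low or a > high]

# Hypothetical transaction amounts; 500 might be a valid bulk order.
amounts = [20, 22, 25, 21, 23, 24, 500]
to_investigate = flag_outliers(amounts)
```

Each flagged amount would then be checked against source records or business rules before any treatment is assigned and documented.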
Topic: Data Governance
A data analyst supports a weekly customer-quality review for a data governance team. The customer table is loaded from three source systems, and recent reports show missing email addresses, duplicate customer IDs, invalid postal codes, and inconsistent state abbreviations. The team needs recurring metrics to track these issues over time before approving remediation work. What is the best professional decision?
Options:
A. Archive older customer records to reduce table size
B. Build a predictive churn model using the table
C. Create a one-time corrected extract for the review
D. Run scheduled data profiling on the customer table
Best answer: D
Explanation: Data profiling examines a dataset and produces quality metrics such as completeness, validity, uniqueness, and consistency. In this scenario, the governance team needs recurring measurements for specific data-quality issues across multiple source systems. Scheduled profiling supports trend tracking and helps prioritize remediation with evidence. It is more appropriate than fixing a single extract because the requirement is to understand and monitor quality patterns over time, not just make one report look correct.
Topic: Data Analysis
A sales analyst is asked why the daily e-commerce dashboard does not match the finance month-end revenue report.
| Source | Reliability | Refresh | Preparation applied |
|---|---|---|---|
| Checkout transactions | Operational source of truth | Hourly | Removes test orders, deduplicates transaction_id, converts time zone |
| Ad platform export | Campaign attribution estimate | Daily at 2:00 a.m. | No deduplication; includes pending returns |
| Finance ledger | Audited revenue source | Monthly after close | Applies accounting adjustments |
The dashboard was viewed on March 15 at 10:00 a.m. The finance report is closed through February 29. Which conclusion is best supported?
Options:
A. The finance report is less reliable because it is refreshed monthly.
B. The mismatch is expected; reconcile the same closed period using aligned preparation rules.
C. The ad platform should replace both sources because it includes attribution.
D. The dashboard proves March revenue is higher because it refreshes hourly.
Best answer: B
Explanation: Evidence-based conclusions should consider source reliability, refresh timing, and preparation before explaining differences. Here, the checkout data is reliable for current operational monitoring, but the finance ledger is the audited revenue source and is only closed through February 29. A March 15 dashboard and a February month-end report are not measuring the same time window. Preparation also differs: checkout data is deduplicated and standardized, while the finance ledger applies accounting adjustments. The defensible conclusion is to treat the mismatch as expected and compare the same closed period using aligned business rules before stating whether revenue truly differs.
Topic: Data Analysis
A city transportation analyst has survey responses from 1,200 randomly selected bus riders and wants to estimate whether the overall rider population supports extending weekend service. The survey sample includes age and route fields, but a small number of responses have missing demographic values. The planning director needs a defensible conclusion for all riders, not just a summary of the sample. Which analysis approach is the BEST professional decision?
Options:
A. Create descriptive charts of sample response counts
B. Use inferential analysis with confidence intervals
C. Delete all incomplete records before reporting totals
D. Build a predictive model for individual rider behavior
Best answer: B
Explanation: Inferential analysis is appropriate when a sample is used to draw a conclusion about a larger population. Here, the analyst has a random sample of bus riders and the director needs a defensible statement about all riders. A confidence interval or hypothesis test can quantify uncertainty around the estimated level of support. The missing demographic values should be reviewed and handled consistently, but they do not change the core method selection if the support response is usable. Descriptive analysis would summarize only the 1,200 respondents, while predictive analysis would answer a different question.
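A confidence interval for a proportion can be sketched with the normal approximation. The count of 720 supporters is a hypothetical figure used only to exercise the function; the scenario does not state the survey result.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical: 720 of the 1,200 sampled riders support the extension.
low, high = proportion_ci(720, 1200)
```

The interval is a statement about the whole rider population, which is what separates this from a descriptive count of the 1,200 respondents.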
Topic: Data Acquisition and Preparation
A data analyst is preparing support-ticket text for analysis. The ticket ID should be captured only when it follows the same repeatable pattern: two uppercase letters, a hyphen, four digits, a hyphen, and three digits.
Exhibit: Sample values
| Raw text | Needed output |
|---|---|
| Refund request AB-2024-118 received | AB-2024-118 |
| Follow-up for CD-2023-007 closed | CD-2023-007 |
| No valid ticket reference | null |
Which method is the best next action?
Options:
A. Impute missing ticket IDs with the mode
B. Apply a RegEx extraction rule
C. Convert the raw text column to a date type
D. Bin the raw text values by length
Best answer: B
Explanation: Regular expressions are used to find, extract, or validate text that follows a repeatable pattern. In this case, the needed ticket ID is embedded inside longer text and has a consistent structure: uppercase letters, hyphens, and fixed digit groups. A RegEx extraction rule can return the matching ID when present and leave rows without a valid match as null or flagged for review. This preserves the source text while creating a clean analysis field.
Type conversion, binning, and imputation do not identify the patterned substring inside each ticket note.
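The pattern in the question maps onto a short regular expression. The sketch below uses Python's `re` module; the helper name is hypothetical, and the word boundaries are an added assumption so that longer digit runs are not partially matched.

```python
import re

# Two uppercase letters, hyphen, four digits, hyphen, three digits.
TICKET_ID = re.compile(r"\b[A-Z]{2}-\d{4}-\d{3}\b")

def extract_ticket_id(text):
    """Return the first matching ticket ID, or None when there is none."""
    match = TICKET_ID.search(text)
    return match.group(0) if match else None
```

Applied to the sample values in the exhibit, the first two rows return their IDs and the third returns `None`, leaving the raw text column unchanged.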
Topic: Data Concepts and Environments
A retail analytics team needs one repository for raw clickstream logs, product images, and curated sales tables. Data scientists must access raw files for experimentation, while BI users need governed SQL reporting, documented lineage, and consistent access controls. Which repository concept best fits these requirements?
Options:
A. Flat file repository
B. Operational database
C. Data lakehouse
D. Data silo
Best answer: C
Explanation: A data lakehouse is designed for scenarios that need the flexibility of a data lake and the managed analytics features of a data warehouse. In this case, the team must store diverse raw data such as logs and images, while also supporting curated tables, SQL reporting, lineage, and governed access. That combination points to a lakehouse rather than a single-purpose operational store or disconnected file location. The key takeaway is that lakehouses bridge exploratory storage and governed analytical consumption.
Topic: Data Acquisition and Preparation
A data analyst is profiling a staging table before building a monthly inventory report. The report must include every approved store, product category, and calendar month, even when the quantity is zero. Required fields are store_id, category, month_start, and quantity_on_hand. Which preparation approach should the analyst use first to assess completeness?
Options:
A. Build an expected grid and left join the staging data.
B. Drop rows that contain any null required field.
C. Aggregate quantities by store and compare grand totals.
D. Standardize category names and date formats only.
Best answer: A
Explanation: Completeness means verifying that all required data is present, not just that existing rows look valid. In this scenario, the analyst needs to confirm that every required store, category, and month appears in the data. Creating the full expected set from approved reference lists and comparing it to the staging table identifies absent combinations, such as a missing month for a store-category pair. The same profiling step can also flag records where required fields are null. This is stronger than row cleanup or summary comparison because missing source records may not appear as visible errors in the staging table.
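The expected-grid check can be sketched in plain Python as follows. The store IDs, categories, months, and staging rows are hypothetical sample values; only the four field names come from the question.

```python
from itertools import product

# Hypothetical approved reference lists.
approved_stores = ["S01", "S02"]
categories = ["Grocery", "Apparel"]
months = ["2024-01-01", "2024-02-01"]

# Hypothetical staging rows; one combination is absent and one has a null quantity.
staging_rows = [
    {"store_id": "S01", "category": "Grocery", "month_start": "2024-01-01", "quantity_on_hand": 40},
    {"store_id": "S01", "category": "Grocery", "month_start": "2024-02-01", "quantity_on_hand": 35},
    {"store_id": "S01", "category": "Apparel", "month_start": "2024-01-01", "quantity_on_hand": 0},
    {"store_id": "S01", "category": "Apparel", "month_start": "2024-02-01", "quantity_on_hand": None},
    {"store_id": "S02", "category": "Grocery", "month_start": "2024-01-01", "quantity_on_hand": 12},
    {"store_id": "S02", "category": "Apparel", "month_start": "2024-01-01", "quantity_on_hand": 7},
    {"store_id": "S02", "category": "Apparel", "month_start": "2024-02-01", "quantity_on_hand": 9},
]

def find_completeness_gaps(stores, cats, month_starts, rows):
    """Left-join the expected grid to staging and report absent or null entries."""
    staged = {(r["store_id"], r["category"], r["month_start"]): r for r in rows}
    missing, null_quantity = [], []
    for key in product(stores, cats, month_starts):  # every expected combination
        row = staged.get(key)
        if row is None:
            missing.append(key)          # combination never arrived in staging
        elif row["quantity_on_hand"] is None:
            null_quantity.append(key)    # row exists but the required field is null
    return missing, null_quantity

missing, null_quantity = find_completeness_gaps(
    approved_stores, categories, months, staging_rows
)
print("Missing combinations:", missing)
print("Null quantities:", null_quantity)
```

Note that a legitimate quantity of 0 is kept as a valid value; only absent rows and nulls are flagged.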
Topic: Visualization and Reporting
A sales operations dashboard must show prior-day results by 8:00 a.m. for regional managers. Users report that the dashboard still shows two-day-old figures at 8:30 a.m. The source system finishes its nightly export at 7:45 a.m., but the dashboard dataset refresh is scheduled for 7:00 a.m. What is the most likely refresh-rate issue?
Options:
A. The refresh schedule runs before the source update completes.
B. The source data contains duplicate records.
C. The dashboard needs a different chart type.
D. The managers need row-level security added.
Best answer: A
Explanation: A slow or stale refresh issue often occurs when scheduled reporting processes are not aligned with source-system update timing. In this scenario, the business requirement is prior-day reporting by 8:00 a.m., but the dashboard refresh runs at 7:00 a.m. while the source export does not finish until 7:45 a.m. The report is refreshing successfully, but it is refreshing too early to capture the newest data. The practical fix would be to schedule the dashboard refresh after the source export completes, with enough buffer for processing before the 8:00 a.m. requirement.
Topic: Data Governance
A data analyst maintains a monthly executive revenue dashboard that combines CRM opportunity data with billing exports. The workflow standardizes region codes, removes test accounts, and maps product names before publishing. Finance needs an audit-friendly document explaining why dashboard revenue differs from raw billing totals and how each reported field is produced. Which documentation artifact is the best choice?
Options:
A. Data lineage documentation for the dashboard dataset
B. A user access matrix for executive report viewers
C. A high-level project charter for the dashboard refresh
D. A visual style guide for dashboard colors and labels
Best answer: A
Explanation: Data lineage documentation is the governance artifact that shows where data comes from, how it changes, and where it is used. In this scenario, Finance needs traceability from CRM and billing sources through region standardization, test-account removal, product mapping, and final dashboard fields. That makes lineage more useful than general project, access, or design documentation because it connects source data to transformed report outputs in an audit-friendly way. The key takeaway is to match the artifact to the question being asked: source-to-report traceability calls for lineage documentation.
Topic: Data Acquisition and Preparation
A data analyst receives order data from an API. Each record has one order_id, one customer_id, and a products field that contains a list of purchased SKUs with quantities. The merchandising team needs a row-level dataset to analyze sales by SKU while preserving the original order_id for traceability. Which preparation step is the BEST professional decision?
Options:
A. Explode the products list into separate rows
B. Scale the quantity values to a common range
C. Append customer demographic columns to each order
D. Impute missing SKU values using the most common SKU
Best answer: A
Explanation: Exploding is the appropriate transformation when a field contains nested or list-like values that must be analyzed as separate observations. In this case, each order can contain multiple products, but the reporting need is SKU-level analysis. Exploding the products list creates one row per SKU or SKU-quantity pair and keeps the order_id available for lineage and traceability. This avoids treating a multi-product list as one unusable text value and supports aggregation by SKU.
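A small sketch of the explode step in plain Python. The record layout (a `products` list of `sku` and `quantity` pairs) and the sample values are assumptions about what the API response might look like.

```python
from collections import Counter

# Hypothetical API records: one row per order, with a nested products list.
orders = [
    {"order_id": "O-1001", "customer_id": "C-17",
     "products": [{"sku": "SKU-A", "quantity": 2}, {"sku": "SKU-B", "quantity": 1}]},
    {"order_id": "O-1002", "customer_id": "C-42",
     "products": [{"sku": "SKU-A", "quantity": 3}]},
]

def explode_products(order_records):
    """Create one row per SKU while carrying order_id along for traceability."""
    rows = []
    for order in order_records:
        for item in order["products"]:
            rows.append({
                "order_id": order["order_id"],
                "customer_id": order["customer_id"],
                "sku": item["sku"],
                "quantity": item["quantity"],
            })
    return rows

sku_rows = explode_products(orders)

# SKU-level aggregation is now a simple group-and-sum.
units_by_sku = Counter()
for row in sku_rows:
    units_by_sku[row["sku"]] += row["quantity"]
print(dict(units_by_sku))
```

In a dataframe library the same idea is usually a built-in explode or unnest operation; the loop above just makes the row multiplication explicit.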
Topic: Data Governance
A VP must decide at 12:15 PM whether to extend a flash sale. The decision requires order data current through noon and an auditable source.
| Dataset | Governance note | Refresh interval | Last refresh |
|---|---|---|---|
| Certified sales mart | Source of truth for daily sales reporting | Nightly | 2:00 AM |
| Operational orders view | Approved for intraday monitoring | Every 15 minutes | 12:00 PM |
| Marketing tracker | Manually maintained | Hourly | 12:00 PM |
Which recommendation is the best professional decision?
Options:
A. Use the marketing tracker because it refreshed at noon.
B. Use the operational orders view with an as-of timestamp.
C. Use the certified sales mart because it is the source of truth.
D. Postpone the decision until the nightly sales mart refresh.
Best answer: B
Explanation: Refresh interval is a governance fact because it defines how current a dataset can be for a decision. The certified sales mart is authoritative for daily reporting, but its last refresh at 2:00 AM means it cannot include flash-sale orders placed between then and noon. The operational orders view is approved for intraday monitoring and was last refreshed at 12:00 PM, so it best satisfies both currency and auditability. The marketing tracker also refreshed at noon, but it is manually maintained, which makes it a weaker auditable source. The analyst should include an as-of timestamp so the VP understands the data version used for the decision.
Being the source of truth for one reporting purpose does not automatically make a dataset current enough for every decision.
Topic: Visualization and Reporting
A data analyst is redesigning a sales performance dashboard for regional managers. The dashboard must be readable on projectors, usable by viewers with common color-vision deficiencies, and make high-risk values stand out without changing the company logo. Which color-scheme approach best meets these requirements?
Options:
A. Use a colorblind-safe palette with high contrast and reserve an accent color for high-risk values
B. Use the company logo colors for every chart and KPI status
C. Use red and green status colors only for all performance categories
D. Use a rainbow gradient across all measures to add visual variety
Best answer: A
Explanation: Effective dashboard color schemes should improve interpretation, not decorate the report. In this scenario, the design must work on projectors, support users with common color-vision deficiencies, and highlight high-risk values. A high-contrast, colorblind-safe palette improves readability and accessibility, while reserving one accent color for risk helps the audience notice the most important exception. Branding can still appear in the logo or limited design elements, but it should not control every data encoding choice.
The key takeaway is to use color sparingly and consistently to communicate meaning.
Topic: Data Concepts and Environments
A data analyst must feed an operations dashboard with the most current order status from a fulfillment application. Which source should the analyst select based on the exhibit?
Exhibit: Available source notes
| Source | Behavior |
|---|---|
| App endpoint | GET /v1/orders/{order_id}/status; returns JSON fields order_id, status, updated_at; reflects app changes within minutes |
| SFTP export | CSV file delivered nightly at 2:00 a.m. |
| Warehouse table | Loaded after the nightly CSV export completes |
| Admin page | Human-readable HTML page; layout changes without notice |
Options:
A. SFTP export
B. Admin page
C. App endpoint
D. Warehouse table
Best answer: C
Explanation: An API source is the best fit when an application exposes current data through a documented request and response pattern. In the exhibit, the app endpoint specifies the request path, response fields, format, and update behavior, making it suitable for a recurring dashboard feed. The nightly CSV and warehouse table are stale for near-current operations, and the admin page is not a reliable data interface because its HTML layout can change without notice. The key distinction is using a defined machine-readable interface instead of exports or screen scraping.
Topic: Data Analysis
A marketing analyst is reviewing a draft report about a new recommendation feature. Which interpretation is best supported by the exhibit?
Exhibit: Draft report evidence
Source: Loyalty app purchases, beta users only
Period: 1 week after feature launch
Records: 1,240 customers who opted into beta
Result: Average basket size was 12% higher than the prior week
Comparison group: None
Known event: Weekend flash sale during the beta week
Draft claim: "The feature caused a 12% lift and will increase sales next month."
Options:
A. Infer causation because the lift is 12%
B. Forecast next month using only the beta-week average
C. Approve the causal and predictive claim as written
D. Treat the result as descriptive and collect stronger evidence first
Best answer: D
Explanation: The exhibit supports a descriptive observation: beta users spent more during the launch week than they did the prior week. It does not provide enough evidence for an inferential claim about all customers or a predictive claim about next month. The sample is self-selected, there is no control group, and a flash sale occurred during the same week. Any of these could explain the higher basket size. A stronger next step would be to collect representative data, compare against a control group, or validate a predictive model on appropriate holdout data. The key distinction is that an observed increase is not enough to prove cause or predict future performance.
Topic: Data Acquisition and Preparation
A data analyst is preparing numeric fields for a customer segmentation model. The inputs include annual spend in dollars, visits per year, and satisfaction score from 1 to 5. The clustering method is distance-based, the profile shows no severe outliers, and the analytics team wants each value expressed relative to the typical customer and spread of that field. What is the BEST preparation step?
Options:
A. Convert each numeric field into high, medium, and low bins
B. Apply min-max scaling to force each field between 0 and 1
C. Leave the fields unchanged to preserve original business units
D. Apply z-score standardization to each numeric field
Best answer: D
Explanation: Standardization is the best fit when numeric variables have different units and a distance-based method should treat them comparably while interpreting values relative to the field’s average and spread. A z-score converts each value into the number of standard deviations above or below the mean. This prevents annual spend from dominating the clustering simply because it has larger raw numbers than satisfaction score. Scaling, such as min-max scaling, also makes ranges comparable, but it expresses values within a fixed range rather than relative to the distribution’s mean and standard deviation. The key distinction is fixed range versus distribution-relative units.
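For reference, a z-score is computed as z = (x - mean) / standard deviation. The sketch below applies it with Python's `statistics` module; the customer values are hypothetical, and the population standard deviation (`pstdev`) is an assumed choice, since some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Express each value as standard deviations above or below the field mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant field carries no spread
    return [(v - mu) / sigma for v in values]

# Hypothetical customers; each field uses a very different unit and range.
annual_spend = [1200, 450, 3000, 800, 1550]
visits_per_year = [12, 4, 30, 9, 15]
satisfaction = [4, 2, 5, 3, 4]

standardized = {
    "annual_spend": z_scores(annual_spend),
    "visits_per_year": z_scores(visits_per_year),
    "satisfaction": z_scores(satisfaction),
}
for field, scores in standardized.items():
    print(field, [round(z, 2) for z in scores])
```

After standardization every field has mean 0 and standard deviation 1, so no single field dominates a distance calculation because of its raw scale.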
Topic: Data Concepts and Environments
A marketing team uses an automated reporting tool to generate a weekly sales narrative and dashboard. Before distribution, the analyst reviews the run summary.
Exhibit: Automated report run summary
Report: Weekly regional sales
Orders extract: refreshed successfully
CRM region lookup: refresh failed; prior snapshot used
Join validation: 18% of orders unmatched to region
Duplicate invoice IDs: 73 found
Human review status: not completed
What should the analyst do next?
Options:
A. Hold the report for source validation and quality review
B. Rewrite the automated narrative to avoid mentioning regions
C. Distribute the report because the orders extract refreshed successfully
D. Archive the run summary and wait for next week’s report
Best answer: A
Explanation: Automated reporting can speed up recurring analysis, but it does not remove the need for source validation, review, and data-quality controls. In this run, one source failed to refresh, a prior snapshot was used, many records did not join to a region, duplicates were found, and human review is incomplete. These issues can change regional sales totals and mislead the audience. The appropriate next action is to stop distribution until the sources are validated, data-quality exceptions are investigated, and the report is reviewed. Automation supports reporting; it does not make unvalidated output trustworthy.
Topic: Data Acquisition and Preparation
A retail analyst is preparing customer data for a dashboard showing customer counts, average monthly spend, and segments by region and signup channel. Source notes say blank strings, NULL, Unknown, and age value 999 may indicate data was not collected; 0 monthly spend can be valid for inactive customers. Which preparation approach best prevents missing values from distorting the dashboard?
Options:
A. Replace all missing numeric values with 0 before averaging
B. Group missing categories into the largest existing segment
C. Delete every row containing any blank or placeholder value
D. Profile fields, standardize missing markers, and flag them before calculations
Best answer: D
Explanation: Missing values can distort metrics differently depending on the field and calculation. For this dashboard, blanks, NULL, Unknown, and 999 should be profiled and standardized so they are treated consistently. Because 0 monthly spend is valid, it should not be overwritten or treated as missing. Adding a missing-value flag or explicit “missing/unknown” category where appropriate helps analysts see how much incomplete data affects counts, averages, segmentation, and later transformations. The key is to identify and document missingness before applying deletion or imputation.
0 spend differs from unknown spend and would bias averages.
Topic: Data Concepts and Environments
A data analyst is helping deploy a departmental analytics database that will run on a virtual machine. The database engine requires a disk it can format and manage directly, with low-latency random reads and writes for indexes and transaction files. Which storage type is the BEST choice?
Options:
A. Shared file storage
B. Block storage volume
C. Data lake repository
D. Object storage bucket
Best answer: B
Explanation: Block storage is the best fit when an application, virtual machine, or database needs raw volume-style storage behavior. It presents storage as a disk-like volume that the operating system or database engine can format, mount, and manage directly. This supports low-latency random I/O patterns commonly needed for database files, indexes, and transaction logs.
Object storage is better for files or blobs accessed through object APIs, and shared file storage is better when multiple users or systems need a common file path. The key signal in the stem is that the database needs direct disk-style control, not just a place to store datasets.
Topic: Data Analysis
A support manager reviews a weekly dashboard that reports an average ticket resolution time of 9.8 hours. The analyst validates the source data before presenting the KPI.
Exhibit: Ticket resolution profile
| Ticket group | Tickets | Resolution time values |
|---|---|---|
| Resolved tickets | 48 | 44 tickets under 4 hours; 3 tickets at 6-8 hours; 1 ticket at 240 hours |
| Open tickets | 12 | resolved_at is null |
Options:
A. The count is misleading because duplicate tickets are shown.
B. The median is misleading because it ignores all resolved tickets.
C. The rate is misleading because ticket groups are aggregated by week.
D. The average is misleading due to an outlier and missing open tickets.
Best answer: D
Explanation: An average can be misleading when a distribution is skewed by an extreme outlier or when important records are missing from the calculation. In this exhibit, most resolved tickets are under 4 hours, but one 240-hour ticket pulls the average upward. Also, 12 open tickets have null resolved_at values, so they are not included in the resolution-time average even though they matter operationally. A better presentation would pair the mean with the median, show the outlier separately, and report the open-ticket count or aging. The key issue is not that averages are unusable, but that this average alone does not represent the typical ticket experience.
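The effect of the outlier can be sketched with Python's `statistics` module. The exhibit gives only ranges, so the individual hours below (2.0 and 7.0) are hypothetical placeholders chosen to fall inside those ranges; the computed mean will therefore not reproduce the dashboard figure.

```python
from statistics import mean, median

# Hypothetical hours consistent with the exhibit's ranges:
# 44 tickets under 4 hours, 3 tickets at 6-8 hours, 1 ticket at 240 hours.
resolved_hours = [2.0] * 44 + [7.0] * 3 + [240.0]
open_ticket_count = 12  # resolved_at is null, so these have no resolution time

mean_hours = mean(resolved_hours)
median_hours = median(resolved_hours)
mean_without_outlier = mean(h for h in resolved_hours if h < 240.0)

print(f"Mean: {mean_hours:.1f} h, median: {median_hours:.1f} h")
print(f"Mean without the 240-hour ticket: {mean_without_outlier:.1f} h")
print(f"Open tickets excluded from the KPI: {open_ticket_count}")
```

With these placeholder values a single ticket moves the mean far above the median, which is why reporting both, plus the open-ticket count, gives a fairer picture.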
Topic: Visualization and Reporting
A data analyst is designing a mobile dashboard for support managers. A line chart shows monthly ticket volume for three priority levels, each using a different color. The audience needs to compare the priority trends quickly, and there is limited screen space.
Which legend approach best fits the requirement?
Options:
A. Add a compact legend for the three priority colors.
B. Use a large legend with definitions of each priority policy.
C. Create a legend entry for every monthly data point.
D. Omit the legend because the colors are visibly different.
Best answer: A
Explanation: Legends should be used when they help the viewer interpret series, categories, or colors. In this case, the chart uses color to represent three priority levels, so a compact legend helps managers understand what each line means. Because the dashboard is mobile and space is limited, the legend should stay concise and close to the visual. The goal is quick interpretation, not a detailed explanation of business rules.
The key takeaway is to include a legend only when it adds clarity and to avoid turning it into visual clutter.
Topic: Visualization and Reporting
A sales operations dashboard imports 18 million transaction-level rows each refresh. Executives only need monthly revenue trends by region and the top 10 product categories, but the report takes 9 minutes to open and often times out on tablets. The source data is valid, but the report includes every line-item column for drill-through that executives rarely use. What is the BEST professional decision?
Options:
A. Add more visuals to summarize the transaction details
B. Increase the refresh frequency to keep the dashboard current
C. Export the full dataset as a spreadsheet instead
D. Aggregate to monthly region/category totals and apply top-10 filtering
Best answer: D
Explanation: Large data size is a common report-performance and consumption issue. When the audience needs summarized trends rather than row-level detail, the analyst should reduce the data loaded into the report by filtering, aggregating, or redesigning the model to match the decision need. In this scenario, monthly region/category totals and a top-10 category filter directly address slow loading and tablet timeouts without discarding valid source data needed elsewhere. The key is to change the report’s grain and scope, not simply add presentation layers or move the same oversized dataset to another format.
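A minimal sketch of changing the report grain, assuming transaction rows with `date`, `region`, `category`, and `amount` fields. Those field names and the sample rows are hypothetical; in practice this aggregation would usually run in the source query or data model rather than in the report itself.

```python
from collections import defaultdict

# Hypothetical line-item rows standing in for the transaction-level data.
transactions = [
    {"date": "2024-01-15", "region": "East", "category": "Audio", "amount": 120.0},
    {"date": "2024-01-20", "region": "East", "category": "Audio", "amount": 80.0},
    {"date": "2024-01-22", "region": "West", "category": "Video", "amount": 300.0},
    {"date": "2024-02-03", "region": "East", "category": "Video", "amount": 50.0},
    {"date": "2024-02-10", "region": "West", "category": "Cables", "amount": 25.0},
]

def monthly_totals(rows):
    """Collapse line items to one total per month, region, and category."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM" from an ISO date string
        totals[(month, row["region"], row["category"])] += row["amount"]
    return dict(totals)

def top_categories(rows, n=10):
    """Rank categories by total revenue and keep the top n."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["category"]] += row["amount"]
    return sorted(revenue, key=revenue.get, reverse=True)[:n]

summary = monthly_totals(transactions)
leaders = top_categories(transactions, n=2)
print(summary)
print(leaders)
```

The report then loads only the summary rows for the leading categories, while the full transaction table stays available in the source for other uses.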
Topic: Visualization and Reporting
A data analyst must present quarterly customer satisfaction results to all employees at a company meeting. The audience includes nontechnical staff, executives want the message limited to three key takeaways, and the source survey data has already been validated and summarized. Which visualization deliverable is the BEST choice?
Options:
A. Geographic heat map by customer region
B. Infographic with key metrics and short annotations
C. Interactive pivot table with drill-down filters
D. Detailed statistical appendix with confidence intervals
Best answer: B
Explanation: An infographic is appropriate when the goal is to communicate a small number of validated findings as a clear story for a broad audience. It combines concise text, simple visuals, and highlighted metrics so nontechnical readers can understand the main message without exploring raw data or complex controls. In this scenario, the survey results are already validated and summarized, and executives requested only three key takeaways, so a narrative visual summary fits better than an analysis tool or detailed technical report. The key takeaway is to match the deliverable to the audience and communication goal, not to add interactivity or detail that is not needed.
Topic: Data Analysis
A data analyst publishes a monthly sales variance report for finance. The report shows $4.2 million in net sales, but the finance source system’s month-end control total shows $4.8 million for the same period. Finance states that the source system is the system of record. What should the analyst do first?
Options:
A. Validate the report extract against the source records and control total
B. Remove outlier transactions from the sales dataset
C. Add a note that the dashboard is an estimate
D. Adjust the report calculation to match the finance total
Best answer: A
Explanation: When analysis results conflict with a trusted upstream record or control total, the first troubleshooting step is source validation. The analyst should confirm that the extract, query filters, date range, joins, refresh time, and transformation logic align with the system of record. A control total provides a known benchmark for reconciliation, so it should be used to identify whether records were excluded, duplicated, transformed incorrectly, or pulled from the wrong version of the data. Changing the report to force agreement hides the root cause and weakens data integrity. The key takeaway is to validate against the authoritative source before modifying calculations or presentation.
Topic: Data Analysis
A data analyst is troubleshooting a BI report that started failing during scheduled refresh after a tool update. The source data and report calculations were not changed.
Exhibit: Refresh log summary
Refresh: Failed at 02:05
Tool version: 2025.4.18
Connector: Web API OAuth connector v3.2
Error: OAuth token refresh failed after 60 minutes
Manual API test: 200 OK
Same dataset on gateway 2025.4.17: Success
What should the analyst do next?
Options:
A. Check vendor documentation and community posts for the exact version and error
B. Post the full refresh log and token details publicly
C. Rewrite the report calculations to reduce refresh complexity
D. Delete and rebuild the dataset from the source API
Best answer: A
Explanation: Tool-specific failures should be investigated using authoritative sources when the evidence points to product behavior. In this case, the same dataset works on the prior gateway version, the API responds successfully, and no report logic changed. That pattern suggests a connector or version-specific issue rather than a data, SQL, or calculation problem. The analyst should search vendor documentation, release notes, support articles, and relevant community threads for the exact error, connector version, and tool version, then apply a documented fix or workaround. Any public post should be sanitized to remove tokens, credentials, and sensitive data.
Topic: Data Concepts and Environments
A data analyst uses an automated reporting tool with generative AI summaries to create a weekly revenue report for executives. This week, the tool shows a 22% increase in renewals, but the billing source had a recent field-name change and the data profile shows an unusual spike in null customer IDs. What is the BEST professional decision before distributing the report?
Options:
A. Replace the reporting tool with a custom model
B. Validate sources and review data-quality exceptions
C. Rewrite the AI summary to sound less surprising
D. Distribute the report because it is automated
Best answer: B
Explanation: Automated reporting and AI-generated summaries can speed up recurring analysis, but they do not remove the analyst’s responsibility to validate source data and review quality signals. In this scenario, a field-name change could break mappings or calculations, and the spike in null customer IDs could affect renewal counts, joins, or segmentation. The professional decision is to pause distribution long enough to reconcile the report against trusted source records, inspect the failed or changed fields, and document any issue or correction before executives use the result. The key takeaway is that automation supports reporting; it does not replace source validation or data-quality controls.
Topic: Data Acquisition and Preparation
A data analyst is building a repeatable monthly query that combines sales, returns, and customer tables. The logic requires several joins, filters out test accounts, and calculates return rates by region before loading a small reporting table. The source tables are large, and the analyst needs a clear way to validate the intermediate regional totals before the final load. What is the best professional decision?
Options:
A. Stage the cleaned regional results in a temporary table
B. Put all joins and calculations into one nested query
C. Create a permanent duplicate of each source table
D. Export each source table to separate spreadsheets
Best answer: A
Explanation: Temporary tables are useful intermediate structures when query work has multiple steps, large source tables, or validation checkpoints. In this scenario, the analyst can filter test accounts, join the needed data, aggregate return rates by region, and store those staged results temporarily. That makes the final load simpler and gives the analyst a concrete intermediate dataset to inspect for data quality issues before publishing the reporting table. Temporary tables are especially appropriate when the staged data is needed only during the current workflow or session, not as a long-term governed copy.
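The staging pattern can be sketched with Python's built-in `sqlite3` module. The table and column names, the sample rows, and the assumption of at most one return row per order are all hypothetical; other database engines use slightly different temporary-table syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, is_test INTEGER);
CREATE TABLE sales (order_id INTEGER, customer_id INTEGER, region TEXT, units INTEGER);
CREATE TABLE returns (order_id INTEGER, units_returned INTEGER);

INSERT INTO customers VALUES (1, 0), (2, 0), (3, 1);          -- customer 3 is a test account
INSERT INTO sales VALUES (10, 1, 'East', 10), (11, 2, 'West', 20),
                         (12, 3, 'East', 99), (13, 1, 'East', 10);
INSERT INTO returns VALUES (10, 2), (11, 5), (12, 50);

-- Stage the cleaned regional totals; the table lasts only for this connection.
CREATE TEMP TABLE regional_returns AS
SELECT s.region,
       SUM(s.units) AS units_sold,
       SUM(COALESCE(r.units_returned, 0)) AS units_returned
FROM sales AS s
JOIN customers AS c ON c.customer_id = s.customer_id
LEFT JOIN returns AS r ON r.order_id = s.order_id
WHERE c.is_test = 0
GROUP BY s.region;
""")

# Validation checkpoint: inspect the intermediate totals before the final load.
staged = conn.execute(
    "SELECT region, units_sold, units_returned FROM regional_returns ORDER BY region"
).fetchall()
print(staged)

# Final load into the small reporting table.
conn.execute("""
CREATE TABLE return_rate_report AS
SELECT region, 1.0 * units_returned / units_sold AS return_rate
FROM regional_returns
""")
report = conn.execute(
    "SELECT region, return_rate FROM return_rate_report ORDER BY region"
).fetchall()
print(report)
```

Splitting the work this way gives a concrete result set to check against expected regional totals, which a single nested query would hide.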
Topic: Data Governance
A data analyst is preparing a customer-support dashboard. Support agents must verify customers using partial identifiers, but they should not see full sensitive values.
Exhibit: Privacy matrix
| Field | Current value example | Agent requirement |
|---|---|---|
| SSN | 123-45-6789 | View last 4 digits only |
| Credit card | 4111-1111-1111-1111 | View last 4 digits only |
| Email | alex.chen@example.com | View domain only |
Which next action best meets the requirement?
Options:
A. Delete the sensitive fields from the dashboard dataset
B. Anonymize the records by removing customer identity links
C. Apply role-based masking to the sensitive fields
D. Encrypt the fields at rest in the source database
Best answer: C
Explanation: Data masking is appropriate when sensitive values must be hidden or partially obscured while still allowing controlled business use. In this case, agents need partial SSNs, partial card numbers, and email domains for verification, not full values. Role-based masking can show only permitted portions to support agents while leaving the underlying governed data available for authorized processes. Anonymization and deletion reduce or remove usability, while encryption at rest protects stored data but does not determine what an authorized dashboard user can see.
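A simplified sketch of role-based masking for the three fields in the exhibit. The role name `privacy_officer`, the field keys, and the mask formats are hypothetical; real deployments typically enforce masking in the database or BI platform rather than in application code.

```python
def mask_ssn(ssn):
    """Show only the last 4 digits of an SSN."""
    return "***-**-" + ssn[-4:]

def mask_card(card_number):
    """Show only the last 4 digits of a card number."""
    return "****-****-****-" + card_number[-4:]

def mask_email(email):
    """Show only the domain of an email address."""
    _, _, domain = email.partition("@")
    return "***@" + domain

MASKERS = {"ssn": mask_ssn, "credit_card": mask_card, "email": mask_email}
UNMASKED_ROLES = {"privacy_officer"}  # hypothetical role allowed to see full values

def view_record(record, role):
    """Return the record as the given role is allowed to see it (masked by default)."""
    if role in UNMASKED_ROLES:
        return dict(record)
    return {
        field: MASKERS[field](value) if field in MASKERS else value
        for field, value in record.items()
    }

record = {
    "ssn": "123-45-6789",
    "credit_card": "4111-1111-1111-1111",
    "email": "alex.chen@example.com",
}
agent_view = view_record(record, "support_agent")
print(agent_view)
```

Masking by default for any role not explicitly listed is a deliberate choice here, so an unrecognized role never sees full values.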
Topic: Data Acquisition and Preparation
A data analyst is preparing a quarterly sales dashboard from point-of-sale extracts. The data dictionary requires OrderID, SaleDate, Region, and ProductCategory, and the dashboard must show every month and active category. A profile shows no nulls in required fields, but the Q2 extract has no May records for the Central region and no Accessories category, even though the source owner confirms both should have activity. What is the BEST professional decision before publishing?
Options:
A. Reconcile the extract against expected months, categories, and source counts
B. Publish the dashboard because all required fields are populated
C. Set the missing May Central and Accessories values to zero
D. Exclude Central and Accessories from the dashboard filters
Best answer: A
Explanation: Completeness checks verify that all expected data is present, including required fields, time periods, categories, and source records. In this case, the absence of nulls does not prove the dataset is complete because entire expected slices are missing. The analyst should compare the extract to a reference calendar, active category list, and source control totals by region or period, then request correction or clearly document the gap before publication. Treating absent records as zero would change the business meaning unless the source confirms zero activity.
Topic: Data Concepts and Environments
A data team is centralizing departmental CSV extracts, workbook files, and reference documents. Analysts must browse nested project folders from multiple workstations, preserve folder-level permissions, and let a legacy application read and write using standard file paths. Which storage approach best fits these requirements?
Options:
A. Object storage in a bucket with metadata tags
B. Shared file storage with a hierarchical namespace
C. Block storage attached to one application server
D. A relational data warehouse with curated tables
Best answer: B
Explanation: File storage is the best fit when the main requirement is shared, hierarchical file access. It presents data as folders and files, supports familiar paths, and commonly works with access controls at the file or folder level. That matches analysts browsing nested project directories and a legacy application using standard file paths. Object storage can be excellent for scalable unstructured data, but it typically organizes data as objects in buckets rather than as a shared file system. Block storage is usually attached as a low-level volume to a server, and a data warehouse is optimized for structured analytical queries rather than shared document-style access. The key signal is the need for shared folders and file paths.
Topic: Data Concepts and Environments
A reporting team must combine customer, billing, and support data into a monthly retention report. Auditors must be able to trace each reported metric back to its originating system, source key, and extraction date.
Exhibit: Source inventory
| Source | Main key | Refresh | Notes |
|---|---|---|---|
| CRM | customer_id | Daily | Customer status |
| Billing | account_id | Daily | Paid subscriptions |
| Support | email | Weekly | Open tickets |
Which source strategy best preserves traceability for the report?
Options:
A. Export all sources to one spreadsheet and overwrite it monthly
B. Connect the dashboard directly to each source without documenting joins
C. Use only the CRM as the source because it has customer status
D. Stage each source, retain source metadata, and build a curated data mart
Best answer: D
Explanation: When multiple sources feed a report, traceability depends on preserving lineage from ingestion through transformation. A good strategy stages each source separately, keeps identifiers such as source system, original key, extraction timestamp, and transformation rules, then publishes a curated reporting layer such as a data mart. This lets analysts combine data for usability while still showing where each metric came from and how it was produced. Direct blending or manual exports may produce a report, but they make audits and historical validation difficult.
Topic: Data Governance
A merchandising team will use a replenishment dashboard at 9:00 AM. Governance metadata states that every dataset used for this decision must have a successful refresh after 7:30 AM on the same business day.
| Dataset | Refresh interval | Last successful refresh |
|---|---|---|
| Inventory balances | 15 minutes | 8:45 AM |
| Supplier ETAs | 4 hours | 4:10 AM |
| Online orders | 1 hour | 8:05 AM |
| Returns | Daily | 11:00 PM prior day |
Which conclusion is best supported by the metadata?
Options:
A. The dashboard is current enough for the decision.
B. Supplier ETAs and returns are not current enough.
C. Only the inventory balances are usable.
D. The refresh intervals prove all sources are current.
Best answer: B
Explanation: Refresh metadata is a governance fact because it defines whether data is timely enough for a specific business decision. In this case, the decision rule is not just whether each source has a scheduled refresh interval; it requires a successful refresh after 7:30 AM on the same business day. Inventory balances and online orders meet that cutoff. Supplier ETAs last refreshed at 4:10 AM, and returns last refreshed the prior day, so those inputs make the dashboard insufficient for a 9:00 AM replenishment decision unless they are refreshed or excluded with clear disclosure. The displayed dashboard time cannot override source-level currency metadata.
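The cutoff rule can also be checked mechanically. This is a minimal sketch that assumes an arbitrary business date, because the exhibit gives only times of day; the `is_current` helper is illustrative.

```python
from datetime import datetime

# Hypothetical business date; the exhibit gives only times of day.
CUTOFF = datetime(2026, 3, 2, 7, 30)

# Last successful refresh per dataset, taken from the exhibit.
last_refresh = {
    "Inventory balances": datetime(2026, 3, 2, 8, 45),
    "Supplier ETAs": datetime(2026, 3, 2, 4, 10),
    "Online orders": datetime(2026, 3, 2, 8, 5),
    "Returns": datetime(2026, 3, 1, 23, 0),  # 11:00 PM prior day
}

def is_current(refreshed_at: datetime, cutoff: datetime = CUTOFF) -> bool:
    """A dataset qualifies only if it refreshed after the cutoff on the same day."""
    return refreshed_at > cutoff and refreshed_at.date() == cutoff.date()

stale = sorted(name for name, ts in last_refresh.items() if not is_current(ts))
print(stale)  # ['Returns', 'Supplier ETAs']
```

Note that the refresh interval never enters the check; only the last successful refresh time matters for the rule.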
Topic: Data Governance
A data analyst is updating a privacy review for a customer analytics dataset. The governance lead asks which reference should be cited for broad security and privacy practice guidance.
Exhibit: Review notes
| Finding | Detail |
|---|---|
| Data involved | Customer IDs, email addresses, location data |
| Needed reference | Framework for controls and risk-based guidance |
| Constraint | Not limited to payment cards or healthcare records |
| Goal | Align masking, access control, and monitoring practices |
Which next action is best supported by the exhibit?
Options:
A. Apply HIPAA because privacy controls are needed.
B. Use only the internal data dictionary.
C. Cite NIST guidance as the framework reference.
D. Use PCI DSS because customer identifiers are present.
Best answer: C
Explanation: NIST (the National Institute of Standards and Technology) publishes frameworks and guidance that organizations cite when they need a reference for security or privacy practices. The exhibit asks for broad, risk-based guidance to align masking, access control, and monitoring, and it explicitly says the case is not limited to payment cards or healthcare records. That points to NIST rather than a sector-specific or data-type-specific compliance program. A data dictionary can document fields and definitions, but it does not provide a security or privacy control framework.
Topic: Data Concepts and Environments
A retail analytics team is choosing an environment for a new sales and inventory analytics workspace. Review the architecture notes.
Exhibit: Architecture notes
| Requirement | Note |
|---|---|
| Storage | 15 TB now; expected to double within a year |
| Compute | Heavy month-end queries; light daily exploration |
| Access | Analysts need managed SQL and dashboard tooling |
| Operations | Small team; minimal server maintenance desired |
Which environment is best supported by the exhibit?
Options:
A. Departmental spreadsheet repository
B. Local desktop database
C. Single on-premises file server
D. Cloud provider analytics environment
Best answer: D
Explanation: A cloud provider analytics environment fits when the workload needs storage that can scale, compute capacity that can expand or contract with demand, and managed access to analytics services such as SQL query engines or dashboard integrations. The exhibit shows fast data growth, variable compute demand, and a small operations team that does not want to maintain servers. Those clues point to cloud-hosted storage and managed analytics services rather than local or manually maintained infrastructure. The key takeaway is that cloud environments are often chosen for elasticity and managed services, not just remote access.
Topic: Data Acquisition and Preparation
During data exploration, an analyst integrates CRM and billing extracts for a monthly customer-contact dashboard. The audience wants one contact email per customer, but compliance requires traceability to source systems. Profiling shows crm_email and billing_email usually match, but billing sometimes stores an accounts payable contact while CRM stores the end user. Exact repeated CRM rows also exist. What is the BEST professional decision?
Options:
A. Treat mismatched emails as duplicate customers and merge the records.
B. Keep all rows and show both email fields on the dashboard.
C. Preserve source fields, remove exact repeated rows, and derive a documented preferred email.
D. Drop billing_email whenever crm_email is populated.
Best answer: C
Explanation: Redundancy occurs when fields overlap in meaning but are not identical or fully interchangeable. Here, crm_email and billing_email may both describe customer contact information, but they can represent different business roles. That makes them redundant attributes, not duplicate data to delete blindly. Exact repeated CRM rows are a separate duplication issue and should be de-duplicated using appropriate keys or row matching. The best preparation step is to keep the original source fields for traceability, create a derived preferred contact email using documented business rules, and record lineage so the dashboard has one usable value without losing source context.
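The preparation step described above might look like the following sketch. The sample rows, the `customer_id` field, and the rule "prefer the CRM email, fall back to billing" are assumptions for illustration; a real rule would come from documented business requirements.

```python
# Hypothetical integrated rows; only crm_email and billing_email come from the scenario.
rows = [
    {"customer_id": 1, "crm_email": "ana@example.com", "billing_email": "ana@example.com"},
    {"customer_id": 1, "crm_email": "ana@example.com", "billing_email": "ana@example.com"},  # exact repeat
    {"customer_id": 2, "crm_email": "raj@example.com", "billing_email": "ap@example.com"},
    {"customer_id": 3, "crm_email": None, "billing_email": "billing@example.com"},
]

def dedupe_exact(records):
    """Drop rows that repeat every field exactly, keeping the first occurrence."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def add_preferred_email(rec):
    """Assumed rule: use the CRM email when present, otherwise the billing email."""
    out = dict(rec)  # source fields are preserved alongside the derived ones
    if rec["crm_email"]:
        out["preferred_email"], out["preferred_email_source"] = rec["crm_email"], "CRM"
    else:
        out["preferred_email"], out["preferred_email_source"] = rec["billing_email"], "Billing"
    return out

curated = [add_preferred_email(r) for r in dedupe_exact(rows)]
for row in curated:
    print(row["customer_id"], row["preferred_email"], row["preferred_email_source"])
```

Mismatched emails are not merged or dropped here; both source fields stay on the row, and the derived field records which source supplied the value.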
Topic: Visualization and Reporting
A revenue operations analyst is revising a weekly report for account managers. The audience must identify which customers need follow-up and quote the exact variance during calls.
Exhibit: Report requirement
| Field | Example | Requirement |
|---|---|---|
| Customer | Acme Co. | Compare by customer |
| Contract value | $48,250 | Show exact value |
| Actual revenue | $45,980 | Show exact value |
| Variance | -$2,270 | Show exact value |
| Renewal date | May 31, 2026 | Sort and filter |
Which visualization choice best fits this requirement?
Options:
A. A pie chart of variance by customer
B. A stacked bar chart by customer
C. A line chart of revenue over time
D. A sortable table with numeric columns
Best answer: D
Explanation: Tables are clearer than charts when the main task is to look up precise values or compare several detailed fields at the row level. In this scenario, account managers need customer-specific contract value, actual revenue, variance, and renewal date so they can take action and quote exact numbers. A sortable table also supports the stated need to filter and prioritize follow-up. Charts are better for showing trends, proportions, or high-level patterns, but they often make exact values harder to read when many categories or multiple measures are involved. The key takeaway is to match the format to the decision task, not just to visual appeal.
Topic: Data Analysis
A data analyst is preparing a monthly report for executives. The dataset contains online survey responses from 420 customers who chose to respond after a support interaction. The sample was not randomly selected, and no test of statistical significance was performed. Which wording should the analyst use to communicate the finding without overstating what the data supports?
Options:
A. Long wait times caused lower customer satisfaction.
B. Long wait times significantly reduce satisfaction across the customer base.
C. Respondents reported lower satisfaction after long wait times.
D. All customers are less satisfied after long wait times.
Best answer: C
Explanation: Communication wording should match the strength and limitations of the evidence. In this case, the data came from a self-selected survey sample, so it can describe what respondents reported but cannot safely represent all customers. Also, no significance test or controlled analysis was performed, so the report should avoid terms that imply proof, causation, or population-wide inference. A careful statement uses qualifiers such as “respondents reported” or “in this survey” and avoids unsupported claims like “caused,” “all customers,” or “significantly reduces.” The key is to present the observed pattern while preserving uncertainty and scope.
Topic: Data Analysis
An analyst is updating a weekly marketing dashboard for marketing managers. The report must show total sales, month-over-month revenue change, and conversion rate by channel. The source includes order-level revenue, order counts, and web sessions by channel. Validation found some session values are null, and new channels can have 0 sessions. Which approach is the best professional decision?
Options:
A. Build a predictive model to estimate sales and conversion for each channel.
B. Sum revenue, subtract prior-month revenue, and divide order count by valid sessions.
C. Count orders, compare them to target, and replace null sessions with 0 before division.
D. Average revenue, subtract sessions from orders, and format revenue per order as a percentage.
Best answer: B
Explanation: The required measures call for basic mathematical functions on numeric data. Total sales should use a sum of revenue. Month-over-month change should use subtraction between comparable current and prior-month revenue values. Conversion rate is a ratio: order count divided by sessions, then formatted as a percentage. Because sessions can be null or 0, the calculation should check for valid denominators before dividing; otherwise, the dashboard may show misleading or undefined percentages. A simple calculated-field approach satisfies the reporting need without adding unnecessary modeling.
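A minimal sketch of these calculated fields follows, using made-up channel figures. The function names and the choice to return `None` for an undefined rate are illustrative assumptions, not a prescribed method.

```python
from typing import Optional

def conversion_rate(orders: int, sessions: Optional[int]) -> Optional[float]:
    """Return orders / sessions as a percentage, or None when sessions is null or 0."""
    if sessions is None or sessions <= 0:
        return None
    return orders / sessions * 100

def mom_change(current_revenue: float, prior_revenue: float) -> float:
    """Month-over-month change as a simple difference."""
    return current_revenue - prior_revenue

# Hypothetical per-channel inputs: (orders, sessions).
channels = {"Email": (25, 500), "Social": (10, None), "New partner": (0, 0)}

for name, (orders, sessions) in channels.items():
    rate = conversion_rate(orders, sessions)
    print(name, "n/a" if rate is None else f"{rate:.1f}%")

total_sales = sum([1200.0, 800.0, 500.0])  # SUM over order-level revenue
print(total_sales, mom_change(total_sales, 2100.0))
```

Returning an explicit "n/a" keeps null or zero-session channels visible without replacing nulls with 0, which would cause a division error or a misleading rate.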
Topic: Data Analysis
A sales analyst is troubleshooting a March revenue variance before publishing a monthly dashboard. The finance general ledger (GL) is the approved control source for booked net sales.
Exhibit: Report validation finding
| Evidence | March result |
|---|---|
| Dashboard net sales | $1,248,900 |
| GL control total | $1,312,400 |
| BI calculation review | SUM(net_amount), unchanged |
| ETL load status | Completed, no errors |
| Extract row count | 18,420, lower than prior run |
Which next action is best supported by the exhibit?
Options:
A. Clear the dashboard cache and republish
B. Change the dashboard formula to match the GL total
C. Remove March outliers from the analysis
D. Validate the extract against upstream GL records
Best answer: D
Explanation: When analysis results disagree with a trusted upstream record or control total, source validation should come before report changes. The BI calculation was reviewed and unchanged, and the ETL job completed without errors, but the extract row count is lower than expected and the dashboard total does not reconcile to the approved GL total. The analyst should compare the extracted records, source transactions, and control totals to determine whether records were omitted, filtered incorrectly, or received from the wrong source version. Adjusting the dashboard to force a match would hide the root cause.
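The reconciliation can be sketched as two checks: quantify the gap against the control total, then look for records present upstream but absent from the extract. The totals come from the exhibit; the transaction IDs are hypothetical placeholders.

```python
# Totals from the exhibit.
dashboard_total = 1_248_900
gl_control_total = 1_312_400

gap = gl_control_total - dashboard_total
print(f"Unreconciled amount: ${gap:,}")  # Unreconciled amount: $63,500

# Hypothetical transaction IDs; a real check would compare keys from the GL and the extract.
gl_ids = {"T1", "T2", "T3", "T4"}
extract_ids = {"T1", "T2", "T4"}

missing_from_extract = sorted(gl_ids - extract_ids)
print(missing_from_extract)  # ['T3']
```

A key-level comparison like this points to which records were omitted, which changing the dashboard formula or clearing a cache would not reveal.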
Topic: Data Analysis
A data analyst receives a customer feedback file with a single text field named comment. The reporting team needs a monthly dashboard that groups comments containing variations of “refund,” “refunded,” or “refunds” into a Refund category, regardless of capitalization or extra spaces. Which preparation approach best meets the requirement?
Options:
A. Aggregate comments by monthly count only
B. Delete comments that contain inconsistent capitalization
C. Normalize case and spaces, then use pattern matching
D. Convert the field to a numeric data type
Best answer: C
Explanation: String functions are the right fit when text needs cleanup, matching, extraction, or categorization. In this scenario, the analyst should standardize the text first, such as trimming extra spaces and converting to a consistent case, then use a matching function or pattern to identify refund-related terms. This preserves the original information while creating a reliable category for the dashboard.
The key takeaway is to clean and match text values rather than discard records or treat free-form comments as numeric data.
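One way to sketch this approach is shown below. The pattern covers only the three variants named in the question, and the function names and the "Other" fallback label are assumptions.

```python
import re

# Matches "refund", "refunds", or "refunded" as whole words in normalized text.
REFUND_PATTERN = re.compile(r"\brefund(s|ed)?\b")

def normalize(comment: str) -> str:
    """Lowercase, trim, and collapse repeated whitespace."""
    return " ".join(comment.lower().split())

def categorize(comment: str) -> str:
    """Assign the Refund category without altering the original comment."""
    return "Refund" if REFUND_PATTERN.search(normalize(comment)) else "Other"

comments = ["  I was REFUNDED   late ", "Need Refunds now", "Great service"]
print([categorize(c) for c in comments])  # ['Refund', 'Refund', 'Other']
```

The original `comment` value is left untouched; normalization is applied only for matching.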
Topic: Data Analysis
A data analyst is preparing a weekly dashboard for a marketing manager who wants to compare email campaign performance fairly across different audience sizes. Use the formula order rate = orders / emails sent * 100, and show one decimal place. Validation found one duplicate order in Campaign B that must be removed before reporting.
| Campaign | Emails sent | Orders before validation |
|---|---|---|
| A | 10,000 | 480 |
| B | 6,000 | 390 |
Which decision best satisfies the reporting need?
Options:
A. Report both campaigns using raw order counts only.
B. Report Campaign A as better because it has more total orders.
C. Report Campaign B at 6.5% without noting the adjustment.
D. Report Campaign B at 6.5% and document the duplicate removal.
Best answer: D
Explanation: The derived measure should match the business question and use validated data. Because the manager wants a fair comparison across different audience sizes, the order rate is more appropriate than raw orders. Campaign A is \(480 / 10{,}000 \times 100 = 4.8\%\). Campaign B must first remove the duplicate order, giving \(389 / 6{,}000 \times 100 = 6.483\%\), which rounds to 6.5%. The adjustment should be documented so the dashboard remains traceable and transparent.
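The arithmetic can be reproduced with a short sketch. The `order_rate` helper is illustrative, and the inputs are the table values with the one duplicate removed from Campaign B.

```python
def order_rate(orders: int, emails_sent: int) -> float:
    """Order rate as a percentage, rounded to one decimal place."""
    return round(orders / emails_sent * 100, 1)

rate_a = order_rate(480, 10_000)
rate_b = order_rate(390 - 1, 6_000)  # remove the duplicate order found in validation

print(rate_a, rate_b)  # 4.8 6.5
```

Because the unadjusted figure (390 / 6,000) also displays as 6.5%, the rounded rate alone does not show that a record was removed, which is why documenting the adjustment matters.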
Topic: Data Concepts and Environments
A retail analyst uses a scheduled job to collect competitor prices for a weekly pricing report. The report has become unreliable.
Exhibit: Collection log summary
| Date | Result | Note |
|---|---|---|
| Week 1 | 98% captured | CSS selector div.price found |
| Week 2 | 41% captured | Price moved to span.currentPrice |
| Week 3 | 0% captured | Site requires interactive consent |
| Week 4 | Blocked | robots.txt disallows automated price paths |
Which interpretation is best supported by the exhibit?
Options:
A. Web scraping is a fragile source method for this report.
B. The source should be converted from CSV to JSON.
C. The report needs a wider date filter.
D. The analyst should remove missing price rows.
Best answer: A
Explanation: Web scraping can be useful when no formal data feed exists, but it is fragile because it depends on website structure and allowed access. In the exhibit, the scraper first breaks when the price element changes from one CSS selector to another. Later, an interactive consent step and a robots.txt restriction prevent automated collection. These are source-method risks, not normal data-cleaning issues. A more reliable next step would be to look for an approved API, data-sharing agreement, licensed feed, or another permissioned source before relying on the scraped data for recurring reporting. The key takeaway is that operational source selection must consider both technical stability and permission to collect the data.
Topic: Data Acquisition and Preparation
A data analyst must collect employee feedback for a workforce planning report. HR needs results by department and tenure band, but the privacy policy prohibits collecting names, employee IDs, email addresses, or free-text comments that could reveal identities. Which collection approach best meets these requirements?
Options:
A. Interview managers about individual employee concerns
B. Export the HRIS employee table and remove names after analysis
C. Collect open-ended survey comments and redact sensitive words later
D. Use an anonymous survey with closed-ended questions and demographic bands
Best answer: D
Explanation: Sensitive data constraints should shape the collection design before data is captured. In this scenario, HR needs aggregate analysis by department and tenure band, but the policy explicitly rules out direct identifiers and free-text comments. An anonymous, closed-ended survey can collect only the minimum necessary fields for the report while reducing the chance of capturing personally identifiable information. Designing the form this way is safer than collecting sensitive details first and trying to remove them later. The key takeaway is to minimize sensitive collection at the source while still preserving the required analysis granularity.
Topic: Visualization and Reporting
A fulfillment center manager uses a dashboard to decide when to reassign pickers between zones during each shift. The current report is refreshed from the prior night’s batch load.
Exhibit: Dashboard validation note
| Finding | Detail |
|---|---|
| Decision window | Staffing changes every 15 minutes |
| Current refresh | Daily at 2:00 a.m. |
| Impact | Late alerts cause missed service-level targets |
| Source system | Warehouse events available within 2 minutes |
Which reporting approach best supports the manager’s operational decision?
Options:
A. Daily snapshot dashboard
B. Near-real-time operational dashboard
C. Static PDF report after each shift
D. Weekly executive summary
Best answer: B
Explanation: Operational decisions that must be made during a shift require data that is current enough for the decision window. The exhibit shows staffing changes occur every 15 minutes, while source events are available within 2 minutes. A near-real-time operational dashboard can refresh frequently enough to support timely reassignment decisions. A daily snapshot is useful for historical review, but it is too stale for in-shift action. The key is to match report refresh behavior to the business decision cadence, not just to the availability of a report format.
Topic: Visualization and Reporting
A sales director and a regional manager report different values for the same KPI on what appears to be the same dashboard. The dashboard owner checks the report metadata.
Exhibit: Report validation finding
| Viewer | Report title | Snapshot timestamp | Revenue KPI |
|---|---|---|---|
| Director | Q2 Sales Dashboard | July 1, 8:00 AM | $4,820,000 |
| Manager | Q2 Sales Dashboard | June 30, 6:00 PM | $4,760,000 |
What is the most likely interpretation of this issue?
Options:
A. The revenue calculation formula is incorrect.
B. The dashboard filter logic is excluding records.
C. The source database has duplicate sales records.
D. The users are viewing different data snapshots.
Best answer: D
Explanation: Dashboard versioning issues can occur when users access the same report name but not the same refreshed dataset or snapshot. In the exhibit, the title is identical, but the snapshot timestamps differ. That means the users are not comparing the same point-in-time version of the data, so the KPI values can legitimately differ even if the metric definition is unchanged.
The next validation step would be to align both users to the same snapshot or refresh cycle before investigating formulas, filters, or source data defects.
Topic: Data Acquisition and Preparation
A data analyst is preparing a weekly sales report from a transaction table. Leadership needs total revenue summarized by region and calendar week, duplicate transaction rows have already been removed, and the report must keep individual customer details out of the output. Which query approach is the best professional decision?
Options:
A. Join transactions to the customer profile table
B. Append weekly transaction files into one table
C. Sort records by region and transaction date
D. Group records by region and week, then sum revenue
Best answer: D
Explanation: Grouping is used when detailed records must be summarized by a category, time period, region, customer, or similar dimension. In this scenario, leadership wants total revenue by region and calendar week, not a row-by-row transaction listing. A grouped query with an aggregate function such as SUM(revenue) produces one summarized result per region-week combination and avoids exposing customer-level detail. Sorting may make records easier to read, but it does not summarize them. Joining or appending may be useful in other integration tasks, but they do not directly meet the reporting requirement after duplicates have already been handled.
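The grouping logic, equivalent in spirit to a `GROUP BY region, week` query with `SUM(revenue)`, can be sketched as follows. The sample transactions are made up, and treating "calendar week" as the ISO week number is an assumption.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction rows; the customer field is deliberately not carried into the output.
transactions = [
    {"region": "East", "date": date(2026, 1, 5), "customer": "C1", "revenue": 120.0},
    {"region": "East", "date": date(2026, 1, 7), "customer": "C2", "revenue": 80.0},
    {"region": "West", "date": date(2026, 1, 6), "customer": "C3", "revenue": 200.0},
    {"region": "East", "date": date(2026, 1, 13), "customer": "C1", "revenue": 50.0},
]

totals = defaultdict(float)
for t in transactions:
    iso_year, iso_week, _ = t["date"].isocalendar()
    totals[(t["region"], iso_year, iso_week)] += t["revenue"]

for (region, year, week), revenue in sorted(totals.items()):
    print(region, year, week, revenue)
```

Each output row is one region-week combination, so no customer-level detail appears in the result.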
Topic: Visualization and Reporting
A data analyst is preparing a monthly revenue dashboard for executives. The dashboard uses a new calculated field for recurring revenue, excludes refunded transactions, and includes a line chart that shows a sudden increase after a product launch. The source refresh completed successfully, but the analyst wants validation before publishing because the assumptions and visual interpretation may affect leadership decisions. What is the best next step?
Options:
A. Request an independent peer review
B. Run a stress test on the dashboard server
C. Create a new data dictionary
D. Replace the dashboard with a static PDF
Best answer: A
Explanation: Peer review is a report validation technique used when an independent analyst should examine whether assumptions, code, calculations, or visual interpretations are reasonable. In this scenario, the data refresh succeeded, but the risk is that the new recurring revenue logic, refund exclusion, and executive-facing trend interpretation could be wrong or misleading. A peer reviewer can verify the calculated field, check that exclusions match the business rule, and challenge whether the line chart supports the stated conclusion before leaders use it for decisions.
The key is matching the validation method to the risk: this is not mainly a performance, metadata, or delivery-format problem.
Use the CompTIA Data+ DA0-002 Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.