DA0-002 — CompTIA Data+ V2 Quick Reference

Last revised: June 18, 2026

Compact DA0-002 review reference for CompTIA Data+ V2 candidates: data concepts, SQL, statistics, quality, visualization, governance, and exam decision points.

Exam Identity and How to Use This Page

This independent Quick Reference supports preparation for CompTIA Data+ V2 (DA0-002) from CompTIA. Use it as a compact review of high-yield data analysis concepts, decision points, formulas, SQL patterns, visualization choices, governance terms, and common exam traps.

For best results:

Use the tables to test “when would I choose this?” rather than memorizing definitions only.
Practice reading scenario clues: business objective, data source, data type, quality problem, stakeholder, and reporting need.
Pair this page with timed DA0-002 practice questions to confirm that you can apply concepts under exam conditions.

High-Yield DA0-002 Decision Map

If the stem emphasizes…	Think first about…	Common correct direction	Common trap
Business question, KPI, audience	Requirements gathering	Define metric, grain, filters, stakeholder need	Building a chart before defining the question
Missing, invalid, duplicated data	Data quality	Profile, validate, clean, document assumptions	Deleting data without understanding impact
Combining data from multiple systems	Integration and joins	Keys, grain, schema, transformation rules	Many-to-many join causing inflated totals
Operational transactions	OLTP	Normalized, current, frequent writes	Using OLTP schema directly for heavy analytics
Historical reporting and dashboards	OLAP / warehouse	Star schema, facts, dimensions, aggregations	Over-normalizing analytic models
Raw, varied, high-volume files	Data lake	Store raw/semi-structured data, schema-on-read	Treating a lake as curated truth without governance
Trends over time	Time series	Date grain, seasonality, moving averages	Ignoring missing periods or calendar effects
Relationship between variables	Correlation/regression	Scatterplot, correlation, regression diagnostics	Claiming causation from correlation
Categories or proportions	Bar/stacked bar/pie with caution	Compare counts or percentages	Using pie charts with many categories
Sensitive personal data	Governance/security	Classify, minimize, mask, encrypt, restrict access	Sharing raw PII because the report is internal
Model performance	Metrics and validation	Choose metric based on error cost	Reporting accuracy only on imbalanced classes

Data Lifecycle Reference

Phase	Candidate should know	Exam-focused questions to ask
Plan	Objective, stakeholder, scope, KPI, success criteria	What business decision will this support?
Collect	Source systems, APIs, files, surveys, sensors, logs	Is the data relevant, permitted, and complete enough?
Ingest	Batch, streaming, CDC, manual upload	How often must data be refreshed?
Store	Database, warehouse, lake, mart, spreadsheet	Does the structure fit analytics, cost, and governance needs?
Prepare	Clean, transform, standardize, join, aggregate	What assumptions are being introduced?
Analyze	Descriptive, diagnostic, predictive, prescriptive	Which method matches the question and data type?
Visualize	Chart, dashboard, report, narrative	What is the simplest accurate way to communicate the insight?
Act	Recommendation, decision, automation	What action should the stakeholder take?
Govern	Metadata, lineage, quality, privacy, access	Can the result be trusted, reproduced, and audited?
Retire/archive	Retention, disposal, archival	Is the data still needed and allowed to be retained?

Data Types, Measurement, and Structure

Data Type Matrix

Type	Description	Examples	Analysis implications
Structured	Fixed schema, rows/columns	Relational tables, spreadsheets	SQL-friendly; constraints and joins matter
Semi-structured	Flexible tags/keys	JSON, XML, logs	Requires parsing; schema may vary by record
Unstructured	No predefined tabular model	Text, images, audio, PDFs	Needs extraction, NLP, classification, or metadata
Categorical	Labels or groups	Region, product, status	Counts, proportions, bar charts
Numerical	Measured or counted values	Revenue, age, quantity	Summary stats, distributions, trends
Discrete	Countable integers	Tickets, orders, defects	Counts, rates, histograms
Continuous	Measured on continuum	Temperature, duration, weight	Means, ranges, density, binning
Date/time	Time-based values	Timestamp, fiscal month	Trends, seasonality, intervals, time zones
Boolean	True/false	Active flag, subscribed	Filtering, binary classification
Geospatial	Location-based	Latitude/longitude, ZIP/postal area	Maps, clustering, regional aggregation

Measurement Scales

Scale	Ordered?	Equal intervals?	True zero?	Examples	Valid operations
Nominal	No	No	No	Color, country, department	Count, mode, percentage
Ordinal	Yes	Not guaranteed	No	Satisfaction rating, risk level	Median, rank, percentile
Interval	Yes	Yes	No	Celsius, calendar year	Difference, mean, standard deviation
Ratio	Yes	Yes	Yes	Revenue, age, distance	Ratios, growth rate, coefficient of variation

Exam trap: Do not average nominal labels. Be cautious averaging ordinal ratings; it is common in business reporting, but the scale distance may not be truly equal.

Data Storage and Architecture

Analytical Storage Selection

Option	Best for	Strengths	Limitations / traps
Spreadsheet	Small ad hoc analysis	Fast, familiar, flexible	Error-prone, weak version control, limited governance
Relational database	Structured operational data	ACID transactions, SQL, constraints	Not always optimized for large analytical scans
Data warehouse	Curated historical analytics	Consistent metrics, performance, governance	Requires modeling and ETL/ELT discipline
Data mart	Department-specific analytics	Focused, faster delivery	Can create inconsistent definitions if unmanaged
Data lake	Raw diverse data at scale	Stores structured/semi/unstructured data	Needs catalog, quality, security, and curation
Lakehouse	Lake storage with warehouse-like features	Supports broader analytics on open formats	Still requires strong governance and design
NoSQL document store	Flexible nested records	Handles changing JSON-like structures	Joins and complex analytics may be harder
Key-value store	Fast lookup by key	Low-latency retrieval	Poor for complex filtering or aggregation
Column-family store	Wide sparse high-volume data	Scalable reads/writes for certain patterns	Query patterns must be designed upfront
Graph database	Connected entities	Relationship traversal, networks	Not ideal for simple tabular reporting

OLTP vs OLAP

Feature	OLTP	OLAP
Primary purpose	Run business transactions	Analyze business performance
Data shape	Highly normalized	Star/snowflake, denormalized, aggregated
Workload	Many small reads/writes	Fewer large scans and aggregations
Data freshness	Current/near current	Historical snapshots or curated refreshes
Users	Applications, operations	Analysts, BI users, executives
Example	Order entry system	Sales performance dashboard
Exam clue	“Insert/update transactions”	“Trends, KPIs, historical reporting”

Data Modeling Essentials

Concept	Meaning	Exam note
Entity	Object being stored	Customer, order, product
Attribute	Field describing an entity	Customer name, order date
Primary key	Unique row identifier	Should be stable and unique
Foreign key	Links to primary key in another table	Supports referential integrity
Composite key	Key made from multiple columns	Common in bridge or fact tables
Surrogate key	Artificial system-generated key	Often used in warehouses
Natural key	Real-world identifier	May change or contain errors
Fact table	Measurements/events	Sales amount, units, clicks
Dimension table	Descriptive context	Date, product, customer, region
Grain	Level of detail in a table	“One row per order line” is different from “one row per order”
Star schema	Fact table connected to dimensions	Common BI model; simpler joins
Snowflake schema	Dimensions normalized into subdimensions	Less redundancy, more joins
Normalization	Reduces redundancy and update anomalies	Useful for OLTP
Denormalization	Adds redundancy for faster reads	Useful for analytics/performance

File and Data Exchange Formats

Format	Best use	Strengths	Watch for
CSV	Simple tabular exchange	Portable, human-readable	Delimiters, quoting, encoding, missing headers
TSV	Tabular data with tabs	Avoids comma conflicts	Still weak typing
JSON	APIs, nested semi-structured data	Flexible, widely used	Nested arrays, schema drift
XML	Tagged hierarchical exchange	Self-describing, supports schemas	Verbose, more complex parsing
Parquet	Columnar analytics	Efficient compression and queries	Not human-readable; schema matters
ORC	Columnar big data analytics	Efficient for large scans	Ecosystem-specific considerations
Avro	Row-oriented serialization	Good for streaming and schema evolution	Requires schema management
Excel workbook	Business user exchange	Multiple sheets, formulas, formatting	Hidden logic, manual edits, inconsistent types
PDF	Final-form documents	Preserves layout	Poor for structured extraction
Log files	Event/activity tracking	Rich operational detail	Timestamp parsing, volume, inconsistent formats

Data Integration and Transformation

Ingestion and Refresh Patterns

Pattern	Choose when…	Key benefit	Risk / exam trap
Full load	Dataset is small or baseline is needed	Simple and complete	Expensive for large data
Incremental load	Only changes need refresh	Efficient	Requires reliable change detection
Batch processing	Periodic reporting is acceptable	Efficient scheduling	Not real-time
Streaming	Low-latency event processing is required	Near real-time insight	More complex monitoring and ordering
Change data capture	Need database changes over time	Captures inserts/updates/deletes	Must handle late/out-of-order changes
API ingestion	Source exposes service endpoint	Controlled access and automation	Rate limits, pagination, authentication
Manual upload	Infrequent or early-stage process	Low setup effort	Error-prone and hard to govern

ETL vs ELT

Approach	Flow	Best fit	Exam clue
ETL	Extract → Transform → Load	Transform before warehouse load; strict target schema	“Clean and conform before loading”
ELT	Extract → Load → Transform	Cloud/lake/warehouse can transform after loading	“Load raw data first, transform in platform”

Common Transformation Tasks

Task	Purpose	Example
Filtering	Keep relevant records	Current fiscal year only
Projection	Keep relevant columns	Select customer_id, order_date, amount
Standardization	Make values consistent	Convert “USA,” “U.S.,” “United States”
Type conversion	Ensure correct data type	String date to date type
Parsing	Split/extract components	Extract domain from email
Deduplication	Remove duplicate records	Same customer loaded twice
Aggregation	Summarize to desired grain	Daily sales by region
Joining	Combine related tables	Orders with customer dimension
Pivot/unpivot	Reshape rows/columns	Months as rows instead of columns
Binning	Group numeric ranges	Age bands, revenue tiers
Imputation	Fill missing values	Median income by segment
Anonymization/masking	Reduce sensitive exposure	Hide full account number

SQL Quick Reference for Data+ Candidates

Query Order and Logical Processing

Clause	Purpose	Exam note
SELECT	Columns or expressions returned	Can include aliases and calculated fields
FROM	Source table/view	Start with correct grain
JOIN	Combine related data	Choose join type carefully
WHERE	Row-level filter before aggregation	Cannot filter aggregate results here
GROUP BY	Aggregate by category/grain	Every non-aggregated selected column must be grouped
HAVING	Filter groups after aggregation	Use for SUM/COUNT/AVG conditions
ORDER BY	Sort results	Usually last logical output step
LIMIT / TOP	Return subset	Syntax varies by platform

Core SQL Patterns

-- Aggregation with group filter
SELECT
    region,
    COUNT(*) AS order_count,
    SUM(order_amount) AS total_sales,
    AVG(order_amount) AS avg_order_value
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY region
HAVING SUM(order_amount) > 100000
ORDER BY total_sales DESC;

-- Left join to keep all customers, even those without orders
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;

-- Window function: rank rows without collapsing detail
SELECT
    customer_id,
    order_id,
    order_amount,
    RANK() OVER (
        PARTITION BY customer_id
        ORDER BY order_amount DESC
    ) AS order_rank
FROM orders;

-- CASE expression for business categories
SELECT
    customer_id,
    total_spend,
    CASE
        WHEN total_spend >= 10000 THEN 'High'
        WHEN total_spend >= 1000 THEN 'Medium'
        ELSE 'Low'
    END AS spend_segment
FROM customer_summary;

Join Types

Join	Returns	Use when…	Trap
INNER JOIN	Matching rows only	Need records present in both tables	Can unintentionally drop unmatched records
LEFT JOIN	All left rows plus matches	Need full base population	WHERE filter on right table can turn it into inner-like behavior
RIGHT JOIN	All right rows plus matches	Same concept as left join, reversed	Often less readable than rewriting as LEFT JOIN
FULL OUTER JOIN	All rows from both sides	Need unmatched records from either source	Not supported in every SQL dialect
CROSS JOIN	All combinations	Need Cartesian product intentionally	Can explode row count
Self join	Table joined to itself	Hierarchies, comparisons, previous relationships	Requires clear aliases

SQL Exam Traps

Trap	Why it matters	Safer approach
COUNT(*) vs COUNT(column)	COUNT(column) ignores NULLs	Choose intentionally
NULL comparison	NULL is unknown, not equal to anything	Use IS NULL / IS NOT NULL
Many-to-many join	Inflates counts and sums	Check grain and bridge tables
Filtering after left join	WHERE right_table.column = value may remove unmatched rows	Put condition in JOIN or allow NULL logic
Date filtering	Time components can exclude expected records	Use half-open date ranges where appropriate
Duplicate dimension rows	Can multiply facts	Validate uniqueness of join keys
Aggregating at wrong grain	Produces misleading KPIs	Define grain before joining or summarizing
Alias availability	Some dialects do not allow SELECT alias in WHERE	Use subquery/CTE if needed

Data Quality and Profiling

Data Quality Dimensions

Dimension	Meaning	Example issue	Detection methods
Accuracy	Correctly represents reality	Wrong customer address	Source comparison, validation sample
Completeness	Required data is present	Missing email or date	Null counts, required field checks
Consistency	Same value across systems	Different customer status in CRM and billing	Reconciliation, cross-system checks
Validity	Matches allowed format/range	Negative age, invalid ZIP/postal code	Rules, regex, constraints
Timeliness	Available when needed	Data refreshed after report deadline	Refresh timestamp, SLA monitoring
Uniqueness	No unintended duplicates	Duplicate customer records	Duplicate key checks, fuzzy matching
Integrity	Relationships are valid	Order with nonexistent customer	Referential integrity checks
Conformity	Follows standard representation	Mixed date formats	Pattern and type profiling

Data Profiling Checklist

Count rows and compare to expected volume.
Review data types and unexpected type coercion.
Count NULLs by column and by key business segment.
Identify duplicate keys or suspicious near-duplicates.
Check minimum, maximum, mean, median, and outliers for numeric fields.
Validate categorical values against allowed domains.
Verify date ranges, future dates, and impossible timestamps.
Confirm referential integrity across joined tables.
Compare aggregates to trusted control totals.
Document assumptions, exclusions, and known limitations.

Cleaning and Remediation Choices

Problem	Possible action	When appropriate	Risk
Missing values	Leave as NULL	Missingness is meaningful or unknown	Downstream tools may handle poorly
Missing values	Impute mean/median/mode	Small gaps; analysis requires complete data	Can bias variability and relationships
Missing values	Drop rows	Few affected rows and low business impact	Can introduce selection bias
Invalid format	Standardize/parse	Pattern is recoverable	Incorrect parsing
Duplicate records	Exact/fuzzy dedupe	Same entity appears multiple times	False merges
Outliers	Investigate, cap, transform, or keep	Depends on whether error or real extreme	Hiding important events
Inconsistent categories	Map to standard codes	Known synonym list exists	Misclassification
Wrong data type	Convert type	Source imported as text	Failed conversions or truncation
Inconsistent grain	Aggregate or disaggregate	Data sources differ in detail	Loss of detail or double counting

Descriptive Statistics and Core Formulas

Formula Reference

Use these formulas conceptually; exam questions often test interpretation more than calculation.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \]\[ s = \sqrt{s^2} \]\[ z = \frac{x - \mu}{\sigma} \]\[ \text{IQR} = Q_3 - Q_1 \]\[ r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} \]

Statistic Selection

Statistic	Plain formula / meaning	Use when…	Sensitive to outliers?
Count	Number of records	Volume/frequency matters	No, but duplicates matter
Sum	Total of values	Total revenue, units, cost	Yes
Mean	sum of values / count	Symmetric numeric data	Yes
Median	Middle ordered value	Skewed data or outliers	Less sensitive
Mode	Most frequent value	Categorical or common value	No
Range	max - min	Quick spread check	Yes
Variance	Average squared deviation	Variability calculation	Yes
Standard deviation	Square root of variance	Typical spread around mean	Yes
Percentile	Value below which a percentage falls	Distribution thresholds	Less sensitive than max
Quartiles	25%, 50%, 75% points	Boxplots, spread	Less sensitive
IQR	Q3 - Q1	Robust spread	Less sensitive
Z-score	Standard deviations from mean	Standardized outlier detection	Assumes meaningful mean/SD
Correlation	Strength/direction of linear relationship	Relationship between numeric variables	Can be distorted by outliers
Weighted mean	Sum(value × weight) / sum(weights)	Unequal importance or sample sizes	Depends on weights

Distribution and Shape

Shape / pattern	Interpretation	Good visual
Normal / bell-shaped	Symmetric around mean	Histogram, density plot
Right-skewed	Long tail to high values; mean often above median	Histogram, boxplot
Left-skewed	Long tail to low values; mean often below median	Histogram, boxplot
Uniform	Values evenly distributed	Histogram
Bimodal/multimodal	Multiple peaks; possible subgroups	Histogram split by segment
Outliers	Extreme values	Boxplot, scatterplot
Seasonality	Repeating time pattern	Line chart by time
Trend	Long-term increase/decrease	Line chart, moving average

Inferential Statistics and Hypothesis Testing

Key Terms

Term	Meaning	Exam note
Population	Entire group of interest	Often unavailable in full
Sample	Subset of population	Should represent population
Parameter	Population measure	Usually unknown
Statistic	Sample measure	Used to estimate parameter
Sampling error	Difference between sample statistic and population parameter	Reduced by better sampling design and larger samples
Confidence interval	Range of plausible values for a parameter	Wider intervals imply more uncertainty
Null hypothesis	Default/no-effect claim	Tested against alternative
Alternative hypothesis	Claim of effect/difference	May be one-tailed or two-tailed
p-value	Probability of results as extreme if null is true	Small p-value suggests evidence against null
Significance level	Threshold for rejecting null	Chosen before test
Type I error	Rejecting a true null	False positive
Type II error	Failing to reject a false null	False negative
Power	Probability of detecting a real effect	Higher power lowers Type II risk

Common Test Selection

Scenario	Candidate method	Data type
Compare mean to known value	One-sample t-test	Numeric
Compare means of two independent groups	Two-sample t-test	Numeric + two groups
Compare means of paired observations	Paired t-test	Numeric paired data
Compare means across more than two groups	ANOVA	Numeric + multiple groups
Test relationship between categorical variables	Chi-square test	Categorical
Estimate linear relationship	Linear regression	Numeric outcome
Compare proportions	Proportion test	Categorical/binary outcome

Exam trap: Statistical significance does not prove practical significance, business value, or causation.

Sampling, Bias, and Experimental Design

Sampling Methods

Method	How it works	Strength	Risk
Simple random	Every member has equal chance	Easy to understand	Requires full sampling frame
Stratified	Sample within important subgroups	Ensures subgroup representation	Requires correct strata
Cluster	Randomly select groups/clusters	Cost-effective for dispersed populations	Higher sampling error if clusters vary
Systematic	Select every kth record	Simple	Hidden periodic patterns can bias results
Convenience	Use easily available records	Fast	Often biased
Snowball	Participants recruit others	Useful for hard-to-reach groups	Network bias
Census	Include all records/population	No sampling error for included population	May be expensive or infeasible

Bias and Validity Traps

Bias / issue	Description	Mitigation
Selection bias	Sample differs from population	Random/stratified sampling, clear inclusion rules
Survivorship bias	Only successful/remaining cases are analyzed	Include failures and removed records
Confirmation bias	Analyst favors expected result	Predefine method; peer review
Response bias	Answers influenced by wording/social pressure	Neutral survey design
Nonresponse bias	Missing respondents differ from respondents	Follow-up, weighting, assess differences
Measurement bias	Instrument/process systematically mismeasures	Calibrate, validate, standardize collection
Data leakage	Predictive model uses information unavailable at prediction time	Separate training features by time and availability
Confounding	Third variable affects relationship	Control variables, experimental design
Simpson’s paradox	Aggregate trend reverses within subgroups	Analyze by relevant segments

Analytics Methods and Model Concepts

Analytics Categories

Category	Question answered	Examples
Descriptive	What happened?	Monthly sales, defect count, dashboard KPI
Diagnostic	Why did it happen?	Drill-down, variance analysis, root cause analysis
Predictive	What is likely to happen?	Forecasting demand, churn prediction
Prescriptive	What should we do?	Optimization, recommendations, next-best action

Method Selection

Task	Common method	Output	Watch for
Forecast future values	Time series / regression	Predicted value by time	Seasonality, missing periods, external events
Predict numeric value	Regression	Continuous estimate	Outliers, multicollinearity, nonlinearity
Predict category/class	Classification	Class label/probability	Imbalanced classes, threshold choice
Find natural groups	Clustering	Segments/clusters	Need interpretation and scaling
Find co-occurring items	Association rules	Item relationships	Correlation, not causation
Reduce variables	Dimensionality reduction	Fewer features/components	Loss of interpretability
Analyze free text	Text mining/NLP	Sentiment, topics, entities	Ambiguity, language, context
Detect unusual events	Anomaly detection	Outlier score/flag	Rare legitimate events vs errors

Model Evaluation Metrics

Metric	Plain formula / meaning	Best for	Trap
Accuracy	Correct predictions / all predictions	Balanced classification	Misleading with class imbalance
Precision	TP / (TP + FP)	False positives are costly	May ignore missed positives
Recall / sensitivity	TP / (TP + FN)	False negatives are costly	May increase false positives
Specificity	TN / (TN + FP)	Correctly identifying negatives	Not enough alone
F1 score	Harmonic mean of precision and recall	Balance precision and recall	Hides business cost differences
MAE	Average absolute error	Regression; interpretable units	Treats all errors linearly
MSE	Average squared error	Regression; penalizes large errors	Units are squared
RMSE	Square root of MSE	Regression; original units	Sensitive to outliers
R-squared	Variance explained by model	Regression fit summary	Higher is not always better; overfitting possible

Visualization and Reporting

Chart Selection Matrix

Need	Best chart types	Avoid / watch for
Compare categories	Bar, column, dot plot	3D bars, unsorted clutter
Show trend over time	Line, area, sparkline	Pie chart for time trends
Show part-to-whole	Stacked bar, 100% stacked bar, treemap, pie for few categories	Too many pie slices
Show distribution	Histogram, boxplot, density plot	Mean-only summary for skewed data
Show relationship	Scatterplot, bubble chart	Inferring causation automatically
Show ranking	Sorted bar, lollipop chart	Alphabetical order when rank matters
Show geography	Map, choropleth, proportional symbol map	Using raw counts without population normalization
Show process flow	Flowchart, Sankey	Overly decorative visuals
Show KPI status	Scorecard, bullet chart, gauge with caution	Gauge overload
Show correlation matrix	Heatmap	Using rainbow color scales without meaning

Visualization Design Principles

Principle	Practical guidance
Match chart to question	Choose the simplest chart that answers the stakeholder’s question
Use correct scale	Avoid misleading truncated axes unless clearly justified
Label clearly	Title, axes, units, timeframe, filters, source
Reduce clutter	Remove unnecessary gridlines, effects, and redundant labels
Use color intentionally	Highlight meaning; do not rely on color alone
Preserve context	Include benchmarks, targets, prior period, or sample size when needed
Show uncertainty	Use confidence intervals/error bars where appropriate
Support accessibility	Sufficient contrast, readable fonts, colorblind-friendly palettes
Keep grain consistent	Do not mix daily, monthly, and yearly values without explanation
Document assumptions	Filters, exclusions, definitions, and refresh date should be discoverable

Dashboard and Report Types

Type	Audience	Purpose	Design focus
Operational dashboard	Front-line teams	Monitor current activity	Timeliness, alerts, drill-through
Tactical dashboard	Managers	Track departmental performance	Trends, exceptions, targets
Strategic dashboard	Executives	Monitor high-level goals	KPIs, concise summaries, business outcomes
Analytical report	Analysts/managers	Explore causes and patterns	Filters, segmentation, detail
Static report	Broad distribution	Fixed snapshot	Clear narrative and definitions
Self-service BI	Business users	Flexible exploration	Governed datasets and consistent metrics

KPI and Metric Review Checklist

Is the metric tied to a business objective?
Is the numerator and denominator clearly defined?
Is the grain clear?
Are filters and exclusions documented?
Is the time period consistent?
Is there a target, benchmark, or baseline?
Could the metric be gamed?
Are leading and lagging indicators balanced?
Are related metrics needed to avoid misinterpretation?

Business Analysis and Communication

Requirements Questions

Area	Questions to ask
Objective	What decision or action will this analysis support?
Stakeholder	Who will use the result, and what is their data literacy level?
Scope	Which products, regions, periods, or populations are included?
Metric definition	How exactly is success measured?
Data availability	Which sources contain the needed fields?
Refresh	How often does the output need to update?
Security	Who is allowed to see raw data and summarized results?
Delivery	Dashboard, report, file extract, presentation, API, alert?
Acceptance	How will the stakeholder validate the result?

Communicating Findings

Element	Include
Executive summary	Key finding, impact, recommendation
Method	Data sources, timeframe, filters, transformations
Evidence	Relevant visuals, statistical support, sample size
Limitations	Missing data, assumptions, uncertainty, known bias
Recommendation	Clear action tied to business objective
Next steps	Further analysis, monitoring, or decision owner

Exam trap: A technically correct analysis can still fail if it does not answer the stakeholder’s actual question.

Governance, Privacy, and Security

Governance Roles

Role	Typical responsibility
Data owner	Accountable for data domain and access decisions
Data steward	Maintains definitions, quality rules, metadata, and usage guidance
Data custodian	Operates technical storage, backups, and security mechanisms
Data analyst	Prepares, analyzes, visualizes, and communicates data
Data engineer	Builds pipelines, ingestion, transformation, and data platforms
Database administrator	Manages database performance, availability, and access controls
Security/privacy teams	Define controls for sensitive data and risk management

Governance Artifacts

Artifact	Purpose
Data dictionary	Field names, definitions, types, allowed values
Business glossary	Business-friendly definition of terms and metrics
Data catalog	Searchable inventory of datasets and metadata
Lineage documentation	Shows where data came from and transformations applied
Quality rules	Defines expected validity, completeness, and consistency checks
Access policy	Defines who can access what and why
Retention policy	Defines how long data is kept and when it is disposed
Data classification	Labels sensitivity and handling requirements
Master data	Authoritative shared entities such as customer/product
Reference data	Standard codes and lookup values

Sensitive Data Handling

Technique	What it does	Use when…
Data minimization	Collect/use only needed data	Reducing risk and exposure
Masking	Hides part of a value	Users need partial visibility
Tokenization	Replaces sensitive value with token	Systems need reference without exposing original
Encryption	Protects data using cryptography	Data at rest or in transit must be protected
Hashing	One-way transformation	Need comparison without revealing original value
Anonymization	Removes ability to identify individuals	Analysis should not identify subjects
Pseudonymization	Replaces identifiers but may be reversible with separate key	Analysis needs linkage with reduced exposure
Aggregation	Reports grouped results	Individual-level detail is unnecessary
Redaction	Removes sensitive fields/content	Sharing documents or extracts
Access control	Limits who can view/use data	Least privilege and role separation

Access Control Distinctions

Control	Description	Exam clue
Least privilege	Users get only required access	Reduce unnecessary exposure
RBAC	Access based on role	“Analysts can read curated sales tables”
ABAC	Access based on attributes/context	Department, location, sensitivity, purpose
MFA	Additional authentication factor	Strengthen identity verification
Audit logging	Records access and actions	Investigation, accountability
Segregation of duties	Splits conflicting responsibilities	Prevent misuse or unchecked changes
Row-level security	Restricts rows by user/context	Regional managers see only their region
Column-level security	Restricts sensitive fields	Hide salary, SSN/national ID, or account number

Metadata, Lineage, and Documentation

Concept	Why it matters for DA0-002 scenarios
Metadata	Helps users understand source, owner, refresh, type, and meaning
Technical metadata	Data types, schema, table size, refresh job, constraints
Business metadata	Business definitions, KPI rules, owner, approved usage
Operational metadata	Load time, job status, error count, processing duration
Lineage	Supports trust, auditability, troubleshooting, and impact analysis
Versioning	Tracks changes to queries, reports, definitions, and datasets
Data provenance	Establishes origin and authenticity
Reproducibility	Allows another analyst to recreate the result

Exam trap: A dashboard without definitions, refresh timestamp, source, or owner may look polished but still be weakly governed.

Troubleshooting Data Problems

Symptom-to-Cause Reference

Symptom	Likely causes	First checks
Dashboard totals suddenly changed	Source refresh, filter change, join issue, duplicate load	Refresh logs, row counts, query version, source control totals
Counts too high	Duplicate rows, many-to-many join, wrong grain	Distinct counts, key uniqueness, join path
Counts too low	Inner join dropped records, filter too restrictive, missing source file	Unmatched records, filter logic, ingestion logs
NULLs increased	Source system change, parsing failure, new optional field behavior	Null profiling by load date/source
Date trend has gaps	Missing batch, time zone issue, holiday/weekend handling	Calendar table, refresh logs, timestamp conversion
Categories split unexpectedly	Inconsistent spelling/casing, new codes	Value frequency list, reference data mapping
Query is slow	Large scans, missing filters, inefficient joins, no aggregation	Explain plan if available, filter early, reduce columns
Model performance dropped	Data drift, changed population, feature pipeline issue	Compare training vs current distributions
Report users disagree with metric	Different definitions, filters, or source systems	Business glossary, requirements, reconciliation

Validation Before Publishing

Reconcile totals against trusted source reports.
Confirm joins do not change expected row counts unexpectedly.
Test filters, date ranges, and parameter defaults.
Review outliers and decide whether they are errors or real events.
Validate metric definitions with stakeholders.
Check access permissions and sensitive fields.
Include source, refresh date, owner, and definitions.
Save query/report version and document assumptions.

Common DA0-002 Exam Traps

Trap	Better thinking
Correlation equals causation	Correlation can suggest a relationship but does not prove cause
Highest accuracy is always best	Choose metrics based on business cost of false positives/false negatives
Mean always represents “typical”	Median is often better for skewed data
Remove all outliers	Investigate first; outliers may be valid and important
More data is always better	Relevant, high-quality, governed data is better than uncontrolled volume
Pie charts are always good for percentages	Use only for a few categories; bars are often clearer
Raw data lake equals trusted data	Raw storage requires cataloging, quality, lineage, and access controls
Dashboard first, requirements later	Define audience, decision, metric, and refresh needs first
Inner joins are harmless	They can drop unmatched records and bias results
Cleaning data is just formatting	Cleaning can change meaning; document assumptions
Statistical significance means business significance	Effect size, cost, and actionability still matter
Aggregates are always safe	Aggregation can hide subgroup patterns and bias

Final Review Checklist

Before sitting for CompTIA Data+ V2 (DA0-002), confirm you can:

Select the right data storage pattern for operational vs analytical scenarios.
Explain structured, semi-structured, and unstructured data implications.
Identify the correct chart for comparison, trend, distribution, relationship, or geography.
Read SQL joins, GROUP BY, HAVING, CASE, and window-function patterns.
Diagnose duplicate, missing, invalid, inconsistent, and untimely data.
Choose mean vs median, standard deviation vs IQR, and correlation vs regression.
Interpret p-values, confidence intervals, Type I/Type II errors, and sampling bias.
Distinguish descriptive, diagnostic, predictive, and prescriptive analytics.
Match model metrics to business error costs.
Apply privacy, masking, encryption, access control, lineage, and metadata concepts.
Communicate findings with assumptions, limitations, and actionable recommendations.

Practical Next Step

Use this Quick Reference as a checklist while answering timed DA0-002 practice questions. After each missed question, tag the miss by category: data quality, SQL, statistics, visualization, architecture, governance, or communication. Then revisit the matching section above until you can explain both the correct answer and the trap answer.

Scenario Guide

Data Concepts and Environments