Exam Identity and How to Use This Page
This independent Quick Reference supports preparation for CompTIA Data+ V2 (DA0-002) from CompTIA. Use it as a compact review of high-yield data analysis concepts, decision points, formulas, SQL patterns, visualization choices, governance terms, and common exam traps.
For best results:
- Use the tables to test “when would I choose this?” rather than memorizing definitions only.
- Practice reading scenario clues: business objective, data source, data type, quality problem, stakeholder, and reporting need.
- Pair this page with timed DA0-002 practice questions to confirm that you can apply concepts under exam conditions.
High-Yield DA0-002 Decision Map
| If the stem emphasizes… | Think first about… | Common correct direction | Common trap |
|---|
| Business question, KPI, audience | Requirements gathering | Define metric, grain, filters, stakeholder need | Building a chart before defining the question |
| Missing, invalid, duplicated data | Data quality | Profile, validate, clean, document assumptions | Deleting data without understanding impact |
| Combining data from multiple systems | Integration and joins | Keys, grain, schema, transformation rules | Many-to-many join causing inflated totals |
| Operational transactions | OLTP | Normalized, current, frequent writes | Using OLTP schema directly for heavy analytics |
| Historical reporting and dashboards | OLAP / warehouse | Star schema, facts, dimensions, aggregations | Over-normalizing analytic models |
| Raw, varied, high-volume files | Data lake | Store raw/semi-structured data, schema-on-read | Treating a lake as curated truth without governance |
| Trends over time | Time series | Date grain, seasonality, moving averages | Ignoring missing periods or calendar effects |
| Relationship between variables | Correlation/regression | Scatterplot, correlation, regression diagnostics | Claiming causation from correlation |
| Categories or proportions | Bar/stacked bar/pie with caution | Compare counts or percentages | Using pie charts with many categories |
| Sensitive personal data | Governance/security | Classify, minimize, mask, encrypt, restrict access | Sharing raw PII because the report is internal |
| Model performance | Metrics and validation | Choose metric based on error cost | Reporting accuracy only on imbalanced classes |
Data Lifecycle Reference
| Phase | Candidate should know | Exam-focused questions to ask |
|---|
| Plan | Objective, stakeholder, scope, KPI, success criteria | What business decision will this support? |
| Collect | Source systems, APIs, files, surveys, sensors, logs | Is the data relevant, permitted, and complete enough? |
| Ingest | Batch, streaming, CDC, manual upload | How often must data be refreshed? |
| Store | Database, warehouse, lake, mart, spreadsheet | Does the structure fit analytics, cost, and governance needs? |
| Prepare | Clean, transform, standardize, join, aggregate | What assumptions are being introduced? |
| Analyze | Descriptive, diagnostic, predictive, prescriptive | Which method matches the question and data type? |
| Visualize | Chart, dashboard, report, narrative | What is the simplest accurate way to communicate the insight? |
| Act | Recommendation, decision, automation | What action should the stakeholder take? |
| Govern | Metadata, lineage, quality, privacy, access | Can the result be trusted, reproduced, and audited? |
| Retire/archive | Retention, disposal, archival | Is the data still needed and allowed to be retained? |
Data Types, Measurement, and Structure
Data Type Matrix
| Type | Description | Examples | Analysis implications |
|---|
| Structured | Fixed schema, rows/columns | Relational tables, spreadsheets | SQL-friendly; constraints and joins matter |
| Semi-structured | Flexible tags/keys | JSON, XML, logs | Requires parsing; schema may vary by record |
| Unstructured | No predefined tabular model | Text, images, audio, PDFs | Needs extraction, NLP, classification, or metadata |
| Categorical | Labels or groups | Region, product, status | Counts, proportions, bar charts |
| Numerical | Measured or counted values | Revenue, age, quantity | Summary stats, distributions, trends |
| Discrete | Countable integers | Tickets, orders, defects | Counts, rates, histograms |
| Continuous | Measured on continuum | Temperature, duration, weight | Means, ranges, density, binning |
| Date/time | Time-based values | Timestamp, fiscal month | Trends, seasonality, intervals, time zones |
| Boolean | True/false | Active flag, subscribed | Filtering, binary classification |
| Geospatial | Location-based | Latitude/longitude, ZIP/postal area | Maps, clustering, regional aggregation |
Measurement Scales
| Scale | Ordered? | Equal intervals? | True zero? | Examples | Valid operations |
|---|
| Nominal | No | No | No | Color, country, department | Count, mode, percentage |
| Ordinal | Yes | Not guaranteed | No | Satisfaction rating, risk level | Median, rank, percentile |
| Interval | Yes | Yes | No | Celsius, calendar year | Difference, mean, standard deviation |
| Ratio | Yes | Yes | Yes | Revenue, age, distance | Ratios, growth rate, coefficient of variation |
Exam trap: Do not average nominal labels. Be cautious averaging ordinal ratings; it is common in business reporting, but the scale distance may not be truly equal.
Data Storage and Architecture
Analytical Storage Selection
| Option | Best for | Strengths | Limitations / traps |
|---|
| Spreadsheet | Small ad hoc analysis | Fast, familiar, flexible | Error-prone, weak version control, limited governance |
| Relational database | Structured operational data | ACID transactions, SQL, constraints | Not always optimized for large analytical scans |
| Data warehouse | Curated historical analytics | Consistent metrics, performance, governance | Requires modeling and ETL/ELT discipline |
| Data mart | Department-specific analytics | Focused, faster delivery | Can create inconsistent definitions if unmanaged |
| Data lake | Raw diverse data at scale | Stores structured/semi/unstructured data | Needs catalog, quality, security, and curation |
| Lakehouse | Lake storage with warehouse-like features | Supports broader analytics on open formats | Still requires strong governance and design |
| NoSQL document store | Flexible nested records | Handles changing JSON-like structures | Joins and complex analytics may be harder |
| Key-value store | Fast lookup by key | Low-latency retrieval | Poor for complex filtering or aggregation |
| Column-family store | Wide sparse high-volume data | Scalable reads/writes for certain patterns | Query patterns must be designed upfront |
| Graph database | Connected entities | Relationship traversal, networks | Not ideal for simple tabular reporting |
OLTP vs OLAP
| Feature | OLTP | OLAP |
|---|
| Primary purpose | Run business transactions | Analyze business performance |
| Data shape | Highly normalized | Star/snowflake, denormalized, aggregated |
| Workload | Many small reads/writes | Fewer large scans and aggregations |
| Data freshness | Current/near current | Historical snapshots or curated refreshes |
| Users | Applications, operations | Analysts, BI users, executives |
| Example | Order entry system | Sales performance dashboard |
| Exam clue | “Insert/update transactions” | “Trends, KPIs, historical reporting” |
Data Modeling Essentials
| Concept | Meaning | Exam note |
|---|
| Entity | Object being stored | Customer, order, product |
| Attribute | Field describing an entity | Customer name, order date |
| Primary key | Unique row identifier | Should be stable and unique |
| Foreign key | Links to primary key in another table | Supports referential integrity |
| Composite key | Key made from multiple columns | Common in bridge or fact tables |
| Surrogate key | Artificial system-generated key | Often used in warehouses |
| Natural key | Real-world identifier | May change or contain errors |
| Fact table | Measurements/events | Sales amount, units, clicks |
| Dimension table | Descriptive context | Date, product, customer, region |
| Grain | Level of detail in a table | “One row per order line” is different from “one row per order” |
| Star schema | Fact table connected to dimensions | Common BI model; simpler joins |
| Snowflake schema | Dimensions normalized into subdimensions | Less redundancy, more joins |
| Normalization | Reduces redundancy and update anomalies | Useful for OLTP |
| Denormalization | Adds redundancy for faster reads | Useful for analytics/performance |
| Format | Best use | Strengths | Watch for |
|---|
| CSV | Simple tabular exchange | Portable, human-readable | Delimiters, quoting, encoding, missing headers |
| TSV | Tabular data with tabs | Avoids comma conflicts | Still weak typing |
| JSON | APIs, nested semi-structured data | Flexible, widely used | Nested arrays, schema drift |
| XML | Tagged hierarchical exchange | Self-describing, supports schemas | Verbose, more complex parsing |
| Parquet | Columnar analytics | Efficient compression and queries | Not human-readable; schema matters |
| ORC | Columnar big data analytics | Efficient for large scans | Ecosystem-specific considerations |
| Avro | Row-oriented serialization | Good for streaming and schema evolution | Requires schema management |
| Excel workbook | Business user exchange | Multiple sheets, formulas, formatting | Hidden logic, manual edits, inconsistent types |
| PDF | Final-form documents | Preserves layout | Poor for structured extraction |
| Log files | Event/activity tracking | Rich operational detail | Timestamp parsing, volume, inconsistent formats |
Ingestion and Refresh Patterns
| Pattern | Choose when… | Key benefit | Risk / exam trap |
|---|
| Full load | Dataset is small or baseline is needed | Simple and complete | Expensive for large data |
| Incremental load | Only changes need refresh | Efficient | Requires reliable change detection |
| Batch processing | Periodic reporting is acceptable | Efficient scheduling | Not real-time |
| Streaming | Low-latency event processing is required | Near real-time insight | More complex monitoring and ordering |
| Change data capture | Need database changes over time | Captures inserts/updates/deletes | Must handle late/out-of-order changes |
| API ingestion | Source exposes service endpoint | Controlled access and automation | Rate limits, pagination, authentication |
| Manual upload | Infrequent or early-stage process | Low setup effort | Error-prone and hard to govern |
ETL vs ELT
| Approach | Flow | Best fit | Exam clue |
|---|
| ETL | Extract → Transform → Load | Transform before warehouse load; strict target schema | “Clean and conform before loading” |
| ELT | Extract → Load → Transform | Cloud/lake/warehouse can transform after loading | “Load raw data first, transform in platform” |
| Task | Purpose | Example |
|---|
| Filtering | Keep relevant records | Current fiscal year only |
| Projection | Keep relevant columns | Select customer_id, order_date, amount |
| Standardization | Make values consistent | Convert “USA,” “U.S.,” “United States” |
| Type conversion | Ensure correct data type | String date to date type |
| Parsing | Split/extract components | Extract domain from email |
| Deduplication | Remove duplicate records | Same customer loaded twice |
| Aggregation | Summarize to desired grain | Daily sales by region |
| Joining | Combine related tables | Orders with customer dimension |
| Pivot/unpivot | Reshape rows/columns | Months as rows instead of columns |
| Binning | Group numeric ranges | Age bands, revenue tiers |
| Imputation | Fill missing values | Median income by segment |
| Anonymization/masking | Reduce sensitive exposure | Hide full account number |
SQL Quick Reference for Data+ Candidates
Query Order and Logical Processing
| Clause | Purpose | Exam note |
|---|
| SELECT | Columns or expressions returned | Can include aliases and calculated fields |
| FROM | Source table/view | Start with correct grain |
| JOIN | Combine related data | Choose join type carefully |
| WHERE | Row-level filter before aggregation | Cannot filter aggregate results here |
| GROUP BY | Aggregate by category/grain | Every non-aggregated selected column must be grouped |
| HAVING | Filter groups after aggregation | Use for SUM/COUNT/AVG conditions |
| ORDER BY | Sort results | Usually last logical output step |
| LIMIT / TOP | Return subset | Syntax varies by platform |
Core SQL Patterns
-- Aggregation with group filter
SELECT
region,
COUNT(*) AS order_count,
SUM(order_amount) AS total_sales,
AVG(order_amount) AS avg_order_value
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY region
HAVING SUM(order_amount) > 100000
ORDER BY total_sales DESC;
-- Left join to keep all customers, even those without orders
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
-- Window function: rank rows without collapsing detail
SELECT
customer_id,
order_id,
order_amount,
RANK() OVER (
PARTITION BY customer_id
ORDER BY order_amount DESC
) AS order_rank
FROM orders;
-- CASE expression for business categories
SELECT
customer_id,
total_spend,
CASE
WHEN total_spend >= 10000 THEN 'High'
WHEN total_spend >= 1000 THEN 'Medium'
ELSE 'Low'
END AS spend_segment
FROM customer_summary;
Join Types
| Join | Returns | Use when… | Trap |
|---|
| INNER JOIN | Matching rows only | Need records present in both tables | Can unintentionally drop unmatched records |
| LEFT JOIN | All left rows plus matches | Need full base population | WHERE filter on right table can turn it into inner-like behavior |
| RIGHT JOIN | All right rows plus matches | Same concept as left join, reversed | Often less readable than rewriting as LEFT JOIN |
| FULL OUTER JOIN | All rows from both sides | Need unmatched records from either source | Not supported in every SQL dialect |
| CROSS JOIN | All combinations | Need Cartesian product intentionally | Can explode row count |
| Self join | Table joined to itself | Hierarchies, comparisons, previous relationships | Requires clear aliases |
SQL Exam Traps
| Trap | Why it matters | Safer approach |
|---|
| COUNT(*) vs COUNT(column) | COUNT(column) ignores NULLs | Choose intentionally |
| NULL comparison | NULL is unknown, not equal to anything | Use IS NULL / IS NOT NULL |
| Many-to-many join | Inflates counts and sums | Check grain and bridge tables |
| Filtering after left join | WHERE right_table.column = value may remove unmatched rows | Put condition in JOIN or allow NULL logic |
| Date filtering | Time components can exclude expected records | Use half-open date ranges where appropriate |
| Duplicate dimension rows | Can multiply facts | Validate uniqueness of join keys |
| Aggregating at wrong grain | Produces misleading KPIs | Define grain before joining or summarizing |
| Alias availability | Some dialects do not allow SELECT alias in WHERE | Use subquery/CTE if needed |
Data Quality and Profiling
Data Quality Dimensions
| Dimension | Meaning | Example issue | Detection methods |
|---|
| Accuracy | Correctly represents reality | Wrong customer address | Source comparison, validation sample |
| Completeness | Required data is present | Missing email or date | Null counts, required field checks |
| Consistency | Same value across systems | Different customer status in CRM and billing | Reconciliation, cross-system checks |
| Validity | Matches allowed format/range | Negative age, invalid ZIP/postal code | Rules, regex, constraints |
| Timeliness | Available when needed | Data refreshed after report deadline | Refresh timestamp, SLA monitoring |
| Uniqueness | No unintended duplicates | Duplicate customer records | Duplicate key checks, fuzzy matching |
| Integrity | Relationships are valid | Order with nonexistent customer | Referential integrity checks |
| Conformity | Follows standard representation | Mixed date formats | Pattern and type profiling |
Data Profiling Checklist
- Count rows and compare to expected volume.
- Review data types and unexpected type coercion.
- Count NULLs by column and by key business segment.
- Identify duplicate keys or suspicious near-duplicates.
- Check minimum, maximum, mean, median, and outliers for numeric fields.
- Validate categorical values against allowed domains.
- Verify date ranges, future dates, and impossible timestamps.
- Confirm referential integrity across joined tables.
- Compare aggregates to trusted control totals.
- Document assumptions, exclusions, and known limitations.
| Problem | Possible action | When appropriate | Risk |
|---|
| Missing values | Leave as NULL | Missingness is meaningful or unknown | Downstream tools may handle poorly |
| Missing values | Impute mean/median/mode | Small gaps; analysis requires complete data | Can bias variability and relationships |
| Missing values | Drop rows | Few affected rows and low business impact | Can introduce selection bias |
| Invalid format | Standardize/parse | Pattern is recoverable | Incorrect parsing |
| Duplicate records | Exact/fuzzy dedupe | Same entity appears multiple times | False merges |
| Outliers | Investigate, cap, transform, or keep | Depends on whether error or real extreme | Hiding important events |
| Inconsistent categories | Map to standard codes | Known synonym list exists | Misclassification |
| Wrong data type | Convert type | Source imported as text | Failed conversions or truncation |
| Inconsistent grain | Aggregate or disaggregate | Data sources differ in detail | Loss of detail or double counting |
Use these formulas conceptually; exam questions often test interpretation more than calculation.
\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
\]\[
s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}
\]\[
s = \sqrt{s^2}
\]\[
z = \frac{x - \mu}{\sigma}
\]\[
\text{IQR} = Q_3 - Q_1
\]\[
r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}
\]
Statistic Selection
| Statistic | Plain formula / meaning | Use when… | Sensitive to outliers? |
|---|
| Count | Number of records | Volume/frequency matters | No, but duplicates matter |
| Sum | Total of values | Total revenue, units, cost | Yes |
| Mean | sum of values / count | Symmetric numeric data | Yes |
| Median | Middle ordered value | Skewed data or outliers | Less sensitive |
| Mode | Most frequent value | Categorical or common value | No |
| Range | max - min | Quick spread check | Yes |
| Variance | Average squared deviation | Variability calculation | Yes |
| Standard deviation | Square root of variance | Typical spread around mean | Yes |
| Percentile | Value below which a percentage falls | Distribution thresholds | Less sensitive than max |
| Quartiles | 25%, 50%, 75% points | Boxplots, spread | Less sensitive |
| IQR | Q3 - Q1 | Robust spread | Less sensitive |
| Z-score | Standard deviations from mean | Standardized outlier detection | Assumes meaningful mean/SD |
| Correlation | Strength/direction of linear relationship | Relationship between numeric variables | Can be distorted by outliers |
| Weighted mean | Sum(value × weight) / sum(weights) | Unequal importance or sample sizes | Depends on weights |
Distribution and Shape
| Shape / pattern | Interpretation | Good visual |
|---|
| Normal / bell-shaped | Symmetric around mean | Histogram, density plot |
| Right-skewed | Long tail to high values; mean often above median | Histogram, boxplot |
| Left-skewed | Long tail to low values; mean often below median | Histogram, boxplot |
| Uniform | Values evenly distributed | Histogram |
| Bimodal/multimodal | Multiple peaks; possible subgroups | Histogram split by segment |
| Outliers | Extreme values | Boxplot, scatterplot |
| Seasonality | Repeating time pattern | Line chart by time |
| Trend | Long-term increase/decrease | Line chart, moving average |
Inferential Statistics and Hypothesis Testing
Key Terms
| Term | Meaning | Exam note |
|---|
| Population | Entire group of interest | Often unavailable in full |
| Sample | Subset of population | Should represent population |
| Parameter | Population measure | Usually unknown |
| Statistic | Sample measure | Used to estimate parameter |
| Sampling error | Difference between sample statistic and population parameter | Reduced by better sampling design and larger samples |
| Confidence interval | Range of plausible values for a parameter | Wider intervals imply more uncertainty |
| Null hypothesis | Default/no-effect claim | Tested against alternative |
| Alternative hypothesis | Claim of effect/difference | May be one-tailed or two-tailed |
| p-value | Probability of results as extreme if null is true | Small p-value suggests evidence against null |
| Significance level | Threshold for rejecting null | Chosen before test |
| Type I error | Rejecting a true null | False positive |
| Type II error | Failing to reject a false null | False negative |
| Power | Probability of detecting a real effect | Higher power lowers Type II risk |
Common Test Selection
| Scenario | Candidate method | Data type |
|---|
| Compare mean to known value | One-sample t-test | Numeric |
| Compare means of two independent groups | Two-sample t-test | Numeric + two groups |
| Compare means of paired observations | Paired t-test | Numeric paired data |
| Compare means across more than two groups | ANOVA | Numeric + multiple groups |
| Test relationship between categorical variables | Chi-square test | Categorical |
| Estimate linear relationship | Linear regression | Numeric outcome |
| Compare proportions | Proportion test | Categorical/binary outcome |
Exam trap: Statistical significance does not prove practical significance, business value, or causation.
Sampling, Bias, and Experimental Design
Sampling Methods
| Method | How it works | Strength | Risk |
|---|
| Simple random | Every member has equal chance | Easy to understand | Requires full sampling frame |
| Stratified | Sample within important subgroups | Ensures subgroup representation | Requires correct strata |
| Cluster | Randomly select groups/clusters | Cost-effective for dispersed populations | Higher sampling error if clusters vary |
| Systematic | Select every kth record | Simple | Hidden periodic patterns can bias results |
| Convenience | Use easily available records | Fast | Often biased |
| Snowball | Participants recruit others | Useful for hard-to-reach groups | Network bias |
| Census | Include all records/population | No sampling error for included population | May be expensive or infeasible |
Bias and Validity Traps
| Bias / issue | Description | Mitigation |
|---|
| Selection bias | Sample differs from population | Random/stratified sampling, clear inclusion rules |
| Survivorship bias | Only successful/remaining cases are analyzed | Include failures and removed records |
| Confirmation bias | Analyst favors expected result | Predefine method; peer review |
| Response bias | Answers influenced by wording/social pressure | Neutral survey design |
| Nonresponse bias | Missing respondents differ from respondents | Follow-up, weighting, assess differences |
| Measurement bias | Instrument/process systematically mismeasures | Calibrate, validate, standardize collection |
| Data leakage | Predictive model uses information unavailable at prediction time | Separate training features by time and availability |
| Confounding | Third variable affects relationship | Control variables, experimental design |
| Simpson’s paradox | Aggregate trend reverses within subgroups | Analyze by relevant segments |
Analytics Methods and Model Concepts
Analytics Categories
| Category | Question answered | Examples |
|---|
| Descriptive | What happened? | Monthly sales, defect count, dashboard KPI |
| Diagnostic | Why did it happen? | Drill-down, variance analysis, root cause analysis |
| Predictive | What is likely to happen? | Forecasting demand, churn prediction |
| Prescriptive | What should we do? | Optimization, recommendations, next-best action |
Method Selection
| Task | Common method | Output | Watch for |
|---|
| Forecast future values | Time series / regression | Predicted value by time | Seasonality, missing periods, external events |
| Predict numeric value | Regression | Continuous estimate | Outliers, multicollinearity, nonlinearity |
| Predict category/class | Classification | Class label/probability | Imbalanced classes, threshold choice |
| Find natural groups | Clustering | Segments/clusters | Need interpretation and scaling |
| Find co-occurring items | Association rules | Item relationships | Correlation, not causation |
| Reduce variables | Dimensionality reduction | Fewer features/components | Loss of interpretability |
| Analyze free text | Text mining/NLP | Sentiment, topics, entities | Ambiguity, language, context |
| Detect unusual events | Anomaly detection | Outlier score/flag | Rare legitimate events vs errors |
Model Evaluation Metrics
| Metric | Plain formula / meaning | Best for | Trap |
|---|
| Accuracy | Correct predictions / all predictions | Balanced classification | Misleading with class imbalance |
| Precision | TP / (TP + FP) | False positives are costly | May ignore missed positives |
| Recall / sensitivity | TP / (TP + FN) | False negatives are costly | May increase false positives |
| Specificity | TN / (TN + FP) | Correctly identifying negatives | Not enough alone |
| F1 score | Harmonic mean of precision and recall | Balance precision and recall | Hides business cost differences |
| MAE | Average absolute error | Regression; interpretable units | Treats all errors linearly |
| MSE | Average squared error | Regression; penalizes large errors | Units are squared |
| RMSE | Square root of MSE | Regression; original units | Sensitive to outliers |
| R-squared | Variance explained by model | Regression fit summary | Higher is not always better; overfitting possible |
Visualization and Reporting
Chart Selection Matrix
| Need | Best chart types | Avoid / watch for |
|---|
| Compare categories | Bar, column, dot plot | 3D bars, unsorted clutter |
| Show trend over time | Line, area, sparkline | Pie chart for time trends |
| Show part-to-whole | Stacked bar, 100% stacked bar, treemap, pie for few categories | Too many pie slices |
| Show distribution | Histogram, boxplot, density plot | Mean-only summary for skewed data |
| Show relationship | Scatterplot, bubble chart | Inferring causation automatically |
| Show ranking | Sorted bar, lollipop chart | Alphabetical order when rank matters |
| Show geography | Map, choropleth, proportional symbol map | Using raw counts without population normalization |
| Show process flow | Flowchart, Sankey | Overly decorative visuals |
| Show KPI status | Scorecard, bullet chart, gauge with caution | Gauge overload |
| Show correlation matrix | Heatmap | Using rainbow color scales without meaning |
Visualization Design Principles
| Principle | Practical guidance |
|---|
| Match chart to question | Choose the simplest chart that answers the stakeholder’s question |
| Use correct scale | Avoid misleading truncated axes unless clearly justified |
| Label clearly | Title, axes, units, timeframe, filters, source |
| Reduce clutter | Remove unnecessary gridlines, effects, and redundant labels |
| Use color intentionally | Highlight meaning; do not rely on color alone |
| Preserve context | Include benchmarks, targets, prior period, or sample size when needed |
| Show uncertainty | Use confidence intervals/error bars where appropriate |
| Support accessibility | Sufficient contrast, readable fonts, colorblind-friendly palettes |
| Keep grain consistent | Do not mix daily, monthly, and yearly values without explanation |
| Document assumptions | Filters, exclusions, definitions, and refresh date should be discoverable |
Dashboard and Report Types
| Type | Audience | Purpose | Design focus |
|---|
| Operational dashboard | Front-line teams | Monitor current activity | Timeliness, alerts, drill-through |
| Tactical dashboard | Managers | Track departmental performance | Trends, exceptions, targets |
| Strategic dashboard | Executives | Monitor high-level goals | KPIs, concise summaries, business outcomes |
| Analytical report | Analysts/managers | Explore causes and patterns | Filters, segmentation, detail |
| Static report | Broad distribution | Fixed snapshot | Clear narrative and definitions |
| Self-service BI | Business users | Flexible exploration | Governed datasets and consistent metrics |
KPI and Metric Review Checklist
- Is the metric tied to a business objective?
- Is the numerator and denominator clearly defined?
- Is the grain clear?
- Are filters and exclusions documented?
- Is the time period consistent?
- Is there a target, benchmark, or baseline?
- Could the metric be gamed?
- Are leading and lagging indicators balanced?
- Are related metrics needed to avoid misinterpretation?
Business Analysis and Communication
Requirements Questions
| Area | Questions to ask |
|---|
| Objective | What decision or action will this analysis support? |
| Stakeholder | Who will use the result, and what is their data literacy level? |
| Scope | Which products, regions, periods, or populations are included? |
| Metric definition | How exactly is success measured? |
| Data availability | Which sources contain the needed fields? |
| Refresh | How often does the output need to update? |
| Security | Who is allowed to see raw data and summarized results? |
| Delivery | Dashboard, report, file extract, presentation, API, alert? |
| Acceptance | How will the stakeholder validate the result? |
Communicating Findings
| Element | Include |
|---|
| Executive summary | Key finding, impact, recommendation |
| Method | Data sources, timeframe, filters, transformations |
| Evidence | Relevant visuals, statistical support, sample size |
| Limitations | Missing data, assumptions, uncertainty, known bias |
| Recommendation | Clear action tied to business objective |
| Next steps | Further analysis, monitoring, or decision owner |
Exam trap: A technically correct analysis can still fail if it does not answer the stakeholder’s actual question.
Governance, Privacy, and Security
Governance Roles
| Role | Typical responsibility |
|---|
| Data owner | Accountable for data domain and access decisions |
| Data steward | Maintains definitions, quality rules, metadata, and usage guidance |
| Data custodian | Operates technical storage, backups, and security mechanisms |
| Data analyst | Prepares, analyzes, visualizes, and communicates data |
| Data engineer | Builds pipelines, ingestion, transformation, and data platforms |
| Database administrator | Manages database performance, availability, and access controls |
| Security/privacy teams | Define controls for sensitive data and risk management |
Governance Artifacts
| Artifact | Purpose |
|---|
| Data dictionary | Field names, definitions, types, allowed values |
| Business glossary | Business-friendly definition of terms and metrics |
| Data catalog | Searchable inventory of datasets and metadata |
| Lineage documentation | Shows where data came from and transformations applied |
| Quality rules | Defines expected validity, completeness, and consistency checks |
| Access policy | Defines who can access what and why |
| Retention policy | Defines how long data is kept and when it is disposed |
| Data classification | Labels sensitivity and handling requirements |
| Master data | Authoritative shared entities such as customer/product |
| Reference data | Standard codes and lookup values |
Sensitive Data Handling
| Technique | What it does | Use when… |
|---|
| Data minimization | Collect/use only needed data | Reducing risk and exposure |
| Masking | Hides part of a value | Users need partial visibility |
| Tokenization | Replaces sensitive value with token | Systems need reference without exposing original |
| Encryption | Protects data using cryptography | Data at rest or in transit must be protected |
| Hashing | One-way transformation | Need comparison without revealing original value |
| Anonymization | Removes ability to identify individuals | Analysis should not identify subjects |
| Pseudonymization | Replaces identifiers but may be reversible with separate key | Analysis needs linkage with reduced exposure |
| Aggregation | Reports grouped results | Individual-level detail is unnecessary |
| Redaction | Removes sensitive fields/content | Sharing documents or extracts |
| Access control | Limits who can view/use data | Least privilege and role separation |
Access Control Distinctions
| Control | Description | Exam clue |
|---|
| Least privilege | Users get only required access | Reduce unnecessary exposure |
| RBAC | Access based on role | “Analysts can read curated sales tables” |
| ABAC | Access based on attributes/context | Department, location, sensitivity, purpose |
| MFA | Additional authentication factor | Strengthen identity verification |
| Audit logging | Records access and actions | Investigation, accountability |
| Segregation of duties | Splits conflicting responsibilities | Prevent misuse or unchecked changes |
| Row-level security | Restricts rows by user/context | Regional managers see only their region |
| Column-level security | Restricts sensitive fields | Hide salary, SSN/national ID, or account number |
| Concept | Why it matters for DA0-002 scenarios |
|---|
| Metadata | Helps users understand source, owner, refresh, type, and meaning |
| Technical metadata | Data types, schema, table size, refresh job, constraints |
| Business metadata | Business definitions, KPI rules, owner, approved usage |
| Operational metadata | Load time, job status, error count, processing duration |
| Lineage | Supports trust, auditability, troubleshooting, and impact analysis |
| Versioning | Tracks changes to queries, reports, definitions, and datasets |
| Data provenance | Establishes origin and authenticity |
| Reproducibility | Allows another analyst to recreate the result |
Exam trap: A dashboard without definitions, refresh timestamp, source, or owner may look polished but still be weakly governed.
Troubleshooting Data Problems
Symptom-to-Cause Reference
| Symptom | Likely causes | First checks |
|---|
| Dashboard totals suddenly changed | Source refresh, filter change, join issue, duplicate load | Refresh logs, row counts, query version, source control totals |
| Counts too high | Duplicate rows, many-to-many join, wrong grain | Distinct counts, key uniqueness, join path |
| Counts too low | Inner join dropped records, filter too restrictive, missing source file | Unmatched records, filter logic, ingestion logs |
| NULLs increased | Source system change, parsing failure, new optional field behavior | Null profiling by load date/source |
| Date trend has gaps | Missing batch, time zone issue, holiday/weekend handling | Calendar table, refresh logs, timestamp conversion |
| Categories split unexpectedly | Inconsistent spelling/casing, new codes | Value frequency list, reference data mapping |
| Query is slow | Large scans, missing filters, inefficient joins, no aggregation | Explain plan if available, filter early, reduce columns |
| Model performance dropped | Data drift, changed population, feature pipeline issue | Compare training vs current distributions |
| Report users disagree with metric | Different definitions, filters, or source systems | Business glossary, requirements, reconciliation |
Validation Before Publishing
- Reconcile totals against trusted source reports.
- Confirm joins do not change expected row counts unexpectedly.
- Test filters, date ranges, and parameter defaults.
- Review outliers and decide whether they are errors or real events.
- Validate metric definitions with stakeholders.
- Check access permissions and sensitive fields.
- Include source, refresh date, owner, and definitions.
- Save query/report version and document assumptions.
Common DA0-002 Exam Traps
| Trap | Better thinking |
|---|
| Correlation equals causation | Correlation can suggest a relationship but does not prove cause |
| Highest accuracy is always best | Choose metrics based on business cost of false positives/false negatives |
| Mean always represents “typical” | Median is often better for skewed data |
| Remove all outliers | Investigate first; outliers may be valid and important |
| More data is always better | Relevant, high-quality, governed data is better than uncontrolled volume |
| Pie charts are always good for percentages | Use only for a few categories; bars are often clearer |
| Raw data lake equals trusted data | Raw storage requires cataloging, quality, lineage, and access controls |
| Dashboard first, requirements later | Define audience, decision, metric, and refresh needs first |
| Inner joins are harmless | They can drop unmatched records and bias results |
| Cleaning data is just formatting | Cleaning can change meaning; document assumptions |
| Statistical significance means business significance | Effect size, cost, and actionability still matter |
| Aggregates are always safe | Aggregation can hide subgroup patterns and bias |
Final Review Checklist
Before sitting for CompTIA Data+ V2 (DA0-002), confirm you can:
- Select the right data storage pattern for operational vs analytical scenarios.
- Explain structured, semi-structured, and unstructured data implications.
- Identify the correct chart for comparison, trend, distribution, relationship, or geography.
- Read SQL joins, GROUP BY, HAVING, CASE, and window-function patterns.
- Diagnose duplicate, missing, invalid, inconsistent, and untimely data.
- Choose mean vs median, standard deviation vs IQR, and correlation vs regression.
- Interpret p-values, confidence intervals, Type I/Type II errors, and sampling bias.
- Distinguish descriptive, diagnostic, predictive, and prescriptive analytics.
- Match model metrics to business error costs.
- Apply privacy, masking, encryption, access control, lineage, and metadata concepts.
- Communicate findings with assumptions, limitations, and actionable recommendations.
Practical Next Step
Use this Quick Reference as a checklist while answering timed DA0-002 practice questions. After each missed question, tag the miss by category: data quality, SQL, statistics, visualization, architecture, governance, or communication. Then revisit the matching section above until you can explain both the correct answer and the trap answer.