DA0-002 — CompTIA Data+ V2 Quick Reference

Compact DA0-002 review reference for CompTIA Data+ V2 candidates: data concepts, SQL, statistics, quality, visualization, governance, and exam decision points.

Exam Identity and How to Use This Page

This independent Quick Reference supports preparation for the CompTIA Data+ V2 (DA0-002) exam. Use it as a compact review of high-yield data analysis concepts, decision points, formulas, SQL patterns, visualization choices, governance terms, and common exam traps.

For best results:

  • Use the tables to test “when would I choose this?” rather than only memorizing definitions.
  • Practice reading scenario clues: business objective, data source, data type, quality problem, stakeholder, and reporting need.
  • Pair this page with timed DA0-002 practice questions to confirm that you can apply concepts under exam conditions.

High-Yield DA0-002 Decision Map

If the stem emphasizes… | Think first about… | Common correct direction | Common trap
Business question, KPI, audience | Requirements gathering | Define metric, grain, filters, stakeholder need | Building a chart before defining the question
Missing, invalid, duplicated data | Data quality | Profile, validate, clean, document assumptions | Deleting data without understanding impact
Combining data from multiple systems | Integration and joins | Keys, grain, schema, transformation rules | Many-to-many join causing inflated totals
Operational transactions | OLTP | Normalized, current, frequent writes | Using OLTP schema directly for heavy analytics
Historical reporting and dashboards | OLAP / warehouse | Star schema, facts, dimensions, aggregations | Over-normalizing analytic models
Raw, varied, high-volume files | Data lake | Store raw/semi-structured data, schema-on-read | Treating a lake as curated truth without governance
Trends over time | Time series | Date grain, seasonality, moving averages | Ignoring missing periods or calendar effects
Relationship between variables | Correlation/regression | Scatterplot, correlation, regression diagnostics | Claiming causation from correlation
Categories or proportions | Bar/stacked bar/pie with caution | Compare counts or percentages | Using pie charts with many categories
Sensitive personal data | Governance/security | Classify, minimize, mask, encrypt, restrict access | Sharing raw PII because the report is internal
Model performance | Metrics and validation | Choose metric based on error cost | Reporting accuracy only on imbalanced classes

Data Lifecycle Reference

Phase | Candidate should know | Exam-focused questions to ask
Plan | Objective, stakeholder, scope, KPI, success criteria | What business decision will this support?
Collect | Source systems, APIs, files, surveys, sensors, logs | Is the data relevant, permitted, and complete enough?
Ingest | Batch, streaming, CDC, manual upload | How often must data be refreshed?
Store | Database, warehouse, lake, mart, spreadsheet | Does the structure fit analytics, cost, and governance needs?
Prepare | Clean, transform, standardize, join, aggregate | What assumptions are being introduced?
Analyze | Descriptive, diagnostic, predictive, prescriptive | Which method matches the question and data type?
Visualize | Chart, dashboard, report, narrative | What is the simplest accurate way to communicate the insight?
Act | Recommendation, decision, automation | What action should the stakeholder take?
Govern | Metadata, lineage, quality, privacy, access | Can the result be trusted, reproduced, and audited?
Retire/archive | Retention, disposal, archival | Is the data still needed and allowed to be retained?

Data Types, Measurement, and Structure

Data Type Matrix

Type | Description | Examples | Analysis implications
Structured | Fixed schema, rows/columns | Relational tables, spreadsheets | SQL-friendly; constraints and joins matter
Semi-structured | Flexible tags/keys | JSON, XML, logs | Requires parsing; schema may vary by record
Unstructured | No predefined tabular model | Text, images, audio, PDFs | Needs extraction, NLP, classification, or metadata
Categorical | Labels or groups | Region, product, status | Counts, proportions, bar charts
Numerical | Measured or counted values | Revenue, age, quantity | Summary stats, distributions, trends
Discrete | Countable integers | Tickets, orders, defects | Counts, rates, histograms
Continuous | Measured on continuum | Temperature, duration, weight | Means, ranges, density, binning
Date/time | Time-based values | Timestamp, fiscal month | Trends, seasonality, intervals, time zones
Boolean | True/false | Active flag, subscribed | Filtering, binary classification
Geospatial | Location-based | Latitude/longitude, ZIP/postal area | Maps, clustering, regional aggregation

Measurement Scales

Scale | Ordered? | Equal intervals? | True zero? | Examples | Valid operations
Nominal | No | No | No | Color, country, department | Count, mode, percentage
Ordinal | Yes | Not guaranteed | No | Satisfaction rating, risk level | Median, rank, percentile
Interval | Yes | Yes | No | Celsius, calendar year | Difference, mean, standard deviation
Ratio | Yes | Yes | Yes | Revenue, age, distance | Ratios, growth rate, coefficient of variation

Exam trap: Do not average nominal labels. Be cautious averaging ordinal ratings; it is common in business reporting, but the scale distance may not be truly equal.

Data Storage and Architecture

Analytical Storage Selection

Option | Best for | Strengths | Limitations / traps
Spreadsheet | Small ad hoc analysis | Fast, familiar, flexible | Error-prone, weak version control, limited governance
Relational database | Structured operational data | ACID transactions, SQL, constraints | Not always optimized for large analytical scans
Data warehouse | Curated historical analytics | Consistent metrics, performance, governance | Requires modeling and ETL/ELT discipline
Data mart | Department-specific analytics | Focused, faster delivery | Can create inconsistent definitions if unmanaged
Data lake | Raw diverse data at scale | Stores structured/semi/unstructured data | Needs catalog, quality, security, and curation
Lakehouse | Lake storage with warehouse-like features | Supports broader analytics on open formats | Still requires strong governance and design
NoSQL document store | Flexible nested records | Handles changing JSON-like structures | Joins and complex analytics may be harder
Key-value store | Fast lookup by key | Low-latency retrieval | Poor for complex filtering or aggregation
Column-family store | Wide sparse high-volume data | Scalable reads/writes for certain patterns | Query patterns must be designed upfront
Graph database | Connected entities | Relationship traversal, networks | Not ideal for simple tabular reporting

OLTP vs OLAP

Feature | OLTP | OLAP
Primary purpose | Run business transactions | Analyze business performance
Data shape | Highly normalized | Star/snowflake, denormalized, aggregated
Workload | Many small reads/writes | Fewer large scans and aggregations
Data freshness | Current/near current | Historical snapshots or curated refreshes
Users | Applications, operations | Analysts, BI users, executives
Example | Order entry system | Sales performance dashboard
Exam clue | “Insert/update transactions” | “Trends, KPIs, historical reporting”

Data Modeling Essentials

Concept | Meaning | Exam note
Entity | Object being stored | Customer, order, product
Attribute | Field describing an entity | Customer name, order date
Primary key | Unique row identifier | Should be stable and unique
Foreign key | Links to primary key in another table | Supports referential integrity
Composite key | Key made from multiple columns | Common in bridge or fact tables
Surrogate key | Artificial system-generated key | Often used in warehouses
Natural key | Real-world identifier | May change or contain errors
Fact table | Measurements/events | Sales amount, units, clicks
Dimension table | Descriptive context | Date, product, customer, region
Grain | Level of detail in a table | “One row per order line” is different from “one row per order”
Star schema | Fact table connected to dimensions | Common BI model; simpler joins
Snowflake schema | Dimensions normalized into subdimensions | Less redundancy, more joins
Normalization | Reduces redundancy and update anomalies | Useful for OLTP
Denormalization | Adds redundancy for faster reads | Useful for analytics/performance

File and Data Exchange Formats

Format | Best use | Strengths | Watch for
CSV | Simple tabular exchange | Portable, human-readable | Delimiters, quoting, encoding, missing headers
TSV | Tabular data with tabs | Avoids comma conflicts | Still weak typing
JSON | APIs, nested semi-structured data | Flexible, widely used | Nested arrays, schema drift
XML | Tagged hierarchical exchange | Self-describing, supports schemas | Verbose, more complex parsing
Parquet | Columnar analytics | Efficient compression and queries | Not human-readable; schema matters
ORC | Columnar big data analytics | Efficient for large scans | Ecosystem-specific considerations
Avro | Row-oriented serialization | Good for streaming and schema evolution | Requires schema management
Excel workbook | Business user exchange | Multiple sheets, formulas, formatting | Hidden logic, manual edits, inconsistent types
PDF | Final-form documents | Preserves layout | Poor for structured extraction
Log files | Event/activity tracking | Rich operational detail | Timestamp parsing, volume, inconsistent formats

Data Integration and Transformation

Ingestion and Refresh Patterns

Pattern | Choose when… | Key benefit | Risk / exam trap
Full load | Dataset is small or baseline is needed | Simple and complete | Expensive for large data
Incremental load | Only changes need refresh | Efficient | Requires reliable change detection
Batch processing | Periodic reporting is acceptable | Efficient scheduling | Not real-time
Streaming | Low-latency event processing is required | Near real-time insight | More complex monitoring and ordering
Change data capture | Need database changes over time | Captures inserts/updates/deletes | Must handle late/out-of-order changes
API ingestion | Source exposes service endpoint | Controlled access and automation | Rate limits, pagination, authentication
Manual upload | Infrequent or early-stage process | Low setup effort | Error-prone and hard to govern

ETL vs ELT

Approach | Flow | Best fit | Exam clue
ETL | Extract → Transform → Load | Transform before warehouse load; strict target schema | “Clean and conform before loading”
ELT | Extract → Load → Transform | Cloud/lake/warehouse can transform after loading | “Load raw data first, transform in platform”

Common Transformation Tasks

Task | Purpose | Example
Filtering | Keep relevant records | Current fiscal year only
Projection | Keep relevant columns | Select customer_id, order_date, amount
Standardization | Make values consistent | Convert “USA,” “U.S.,” and “United States” to one standard value
Type conversion | Ensure correct data type | String date to date type
Parsing | Split/extract components | Extract domain from email
Deduplication | Remove duplicate records | Same customer loaded twice
Aggregation | Summarize to desired grain | Daily sales by region
Joining | Combine related tables | Orders with customer dimension
Pivot/unpivot | Reshape rows/columns | Months as rows instead of columns
Binning | Group numeric ranges | Age bands, revenue tiers
Imputation | Fill missing values | Fill missing income with the segment median
Anonymization/masking | Reduce sensitive exposure | Hide full account number
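
A few of the tasks above (standardization, deduplication, binning) can be sketched in plain Python. The mapping, band edges, and sample records are hypothetical and exist only for illustration; a real pipeline would take them from reference data and business rules.

```python
# Hypothetical synonym map; real projects would load this from reference data.
COUNTRY_MAP = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
}


def standardize_country(raw):
    """Map known synonyms to one standard value; leave unknown values trimmed."""
    key = raw.strip().lower()
    return COUNTRY_MAP.get(key, raw.strip())


def age_band(age):
    """Bin a numeric age into illustrative bands (edges are an assumption)."""
    if age < 18:
        return "Under 18"
    if age < 35:
        return "18-34"
    if age < 65:
        return "35-64"
    return "65+"


# (customer_id, country, age); the first record was loaded twice.
records = [("c1", "USA ", 17), ("c1", "USA ", 17), ("c2", "U.S.", 40)]

# Exact deduplication that preserves the original order.
deduped = list(dict.fromkeys(records))

cleaned = [(cid, standardize_country(c), age_band(a)) for cid, c, a in deduped]
print(cleaned)
```

Each step changes the data's meaning slightly, which is why the mapping and band edges should be documented alongside the result.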

SQL Quick Reference for Data+ Candidates

Query Order and Logical Processing

Clause | Purpose | Exam note
SELECT | Columns or expressions returned | Can include aliases and calculated fields
FROM | Source table/view | Start with correct grain
JOIN | Combine related data | Choose join type carefully
WHERE | Row-level filter before aggregation | Cannot filter aggregate results here
GROUP BY | Aggregate by category/grain | Every non-aggregated selected column must be grouped
HAVING | Filter groups after aggregation | Use for SUM/COUNT/AVG conditions
ORDER BY | Sort results | Applied after SELECT; only LIMIT / TOP follows
LIMIT / TOP | Return subset | Syntax varies by platform

Core SQL Patterns

-- Aggregation with group filter
SELECT
    region,
    COUNT(*) AS order_count,
    SUM(order_amount) AS total_sales,
    AVG(order_amount) AS avg_order_value
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY region
HAVING SUM(order_amount) > 100000
ORDER BY total_sales DESC;

-- Left join to keep all customers, even those without orders
SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;

-- Window function: rank rows without collapsing detail
SELECT
    customer_id,
    order_id,
    order_amount,
    RANK() OVER (
        PARTITION BY customer_id
        ORDER BY order_amount DESC
    ) AS order_rank
FROM orders;

-- CASE expression for business categories
SELECT
    customer_id,
    total_spend,
    CASE
        WHEN total_spend >= 10000 THEN 'High'
        WHEN total_spend >= 1000 THEN 'Medium'
        ELSE 'Low'
    END AS spend_segment
FROM customer_summary;

Join Types

Join | Returns | Use when… | Trap
INNER JOIN | Matching rows only | Need records present in both tables | Can unintentionally drop unmatched records
LEFT JOIN | All left rows plus matches | Need full base population | WHERE filter on right table can turn it into inner-like behavior
RIGHT JOIN | All right rows plus matches | Same concept as left join, reversed | Often less readable than rewriting as LEFT JOIN
FULL OUTER JOIN | All rows from both sides | Need unmatched records from either source | Not supported in every SQL dialect
CROSS JOIN | All combinations | Need Cartesian product intentionally | Can explode row count
Self join | Table joined to itself | Hierarchies, comparisons, previous relationships | Requires clear aliases
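
The LEFT JOIN trap in the table can be seen in a minimal sketch that runs SQL through Python's built-in sqlite3 module. The tables, names, and rows are hypothetical, and other SQL dialects may differ in details.

```python
import sqlite3

# Hypothetical tables and rows, invented only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 'shipped'), (11, 1, 'cancelled'), (12, 2, 'cancelled');
""")

# Filter in WHERE: unmatched rows have o.status = NULL and are removed,
# so the LEFT JOIN behaves like an INNER JOIN.
where_filtered = conn.execute("""
    SELECT c.customer_name, COUNT(o.order_id)
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.status = 'shipped'
    GROUP BY c.customer_id, c.customer_name
    ORDER BY c.customer_id
""").fetchall()

# Filter in ON: every customer is kept; those without shipped orders count 0.
on_filtered = conn.execute("""
    SELECT c.customer_name, COUNT(o.order_id)
    FROM customers c
    LEFT JOIN orders o
        ON c.customer_id = o.customer_id AND o.status = 'shipped'
    GROUP BY c.customer_id, c.customer_name
    ORDER BY c.customer_id
""").fetchall()

print(where_filtered)  # [('Ada', 1)]
print(on_filtered)     # [('Ada', 1), ('Ben', 0), ('Cy', 0)]
```

Which version is right depends on the question: "customers with shipped orders" fits the first query, "all customers and their shipped-order counts" fits the second.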

SQL Exam Traps

Trap | Why it matters | Safer approach
COUNT(*) vs COUNT(column) | COUNT(column) ignores NULLs | Choose intentionally
NULL comparison | NULL is unknown, not equal to anything | Use IS NULL / IS NOT NULL
Many-to-many join | Inflates counts and sums | Check grain and bridge tables
Filtering after left join | WHERE right_table.column = value may remove unmatched rows | Put condition in JOIN or allow NULL logic
Date filtering | Time components can exclude expected records | Use half-open date ranges where appropriate
Duplicate dimension rows | Can multiply facts | Validate uniqueness of join keys
Aggregating at wrong grain | Produces misleading KPIs | Define grain before joining or summarizing
Alias availability | Some dialects do not allow SELECT alias in WHERE | Use subquery/CTE if needed
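
Three of these traps (COUNT and NULLs, NULL comparison, date filtering) are easy to reproduce. The sketch below uses sqlite3 with a hypothetical orders table whose timestamps are stored as ISO-formatted text, which is an assumption of this example; platforms with native timestamp types behave similarly but not identically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, coupon_code TEXT, order_ts TEXT);
    INSERT INTO orders VALUES
        (1, 'SAVE10', '2026-01-15 09:30:00'),
        (2, NULL,     '2026-01-31 18:45:00'),
        (3, NULL,     '2026-02-01 00:00:00');
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# COUNT(*) counts rows; COUNT(column) skips NULLs.
count_all = scalar("SELECT COUNT(*) FROM orders")
count_coupon = scalar("SELECT COUNT(coupon_code) FROM orders")

# '= NULL' never evaluates to true; IS NULL is the correct test.
equals_null = scalar("SELECT COUNT(*) FROM orders WHERE coupon_code = NULL")
is_null = scalar("SELECT COUNT(*) FROM orders WHERE coupon_code IS NULL")

# A closed range ending at '2026-01-31' drops the order placed later that day.
closed_range = scalar(
    "SELECT COUNT(*) FROM orders "
    "WHERE order_ts BETWEEN '2026-01-01' AND '2026-01-31'"
)

# Half-open range: start inclusive, end exclusive at the next period's start.
half_open = scalar(
    "SELECT COUNT(*) FROM orders "
    "WHERE order_ts >= '2026-01-01' AND order_ts < '2026-02-01'"
)

print(count_all, count_coupon, equals_null, is_null, closed_range, half_open)
```

The half-open form returns both January orders and excludes the one stamped at midnight on 1 February.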

Data Quality and Profiling

Data Quality Dimensions

Dimension | Meaning | Example issue | Detection methods
Accuracy | Correctly represents reality | Wrong customer address | Source comparison, validation sample
Completeness | Required data is present | Missing email or date | Null counts, required field checks
Consistency | Same value across systems | Different customer status in CRM and billing | Reconciliation, cross-system checks
Validity | Matches allowed format/range | Negative age, invalid ZIP/postal code | Rules, regex, constraints
Timeliness | Available when needed | Data refreshed after report deadline | Refresh timestamp, SLA monitoring
Uniqueness | No unintended duplicates | Duplicate customer records | Duplicate key checks, fuzzy matching
Integrity | Relationships are valid | Order with nonexistent customer | Referential integrity checks
Conformity | Follows standard representation | Mixed date formats | Pattern and type profiling

Data Profiling Checklist

  • Count rows and compare to expected volume.
  • Review data types and unexpected type coercion.
  • Count NULLs by column and by key business segment.
  • Identify duplicate keys or suspicious near-duplicates.
  • Check minimum, maximum, mean, median, and outliers for numeric fields.
  • Validate categorical values against allowed domains.
  • Verify date ranges, future dates, and impossible timestamps.
  • Confirm referential integrity across joined tables.
  • Compare aggregates to trusted control totals.
  • Document assumptions, exclusions, and known limitations.
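
Several checklist items can be sketched with the standard library alone. The extract, column names, and allowed-value list below are hypothetical; they only show the shape of the checks, not a complete profiling tool.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical extract: one dict per row.
rows = [
    {"customer_id": 1, "region": "East", "amount": 120.0},
    {"customer_id": 2, "region": "west", "amount": None},
    {"customer_id": 2, "region": "West", "amount": 95.0},
    {"customer_id": 4, "region": "North", "amount": 15000.0},
]
allowed_regions = {"East", "West", "North", "South"}

# Row count, to compare against the expected volume.
row_count = len(rows)

# NULL (None) count per column.
null_counts = {col: sum(1 for r in rows if r[col] is None) for col in rows[0]}

# Keys that appear more than once.
key_counts = Counter(r["customer_id"] for r in rows)
duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)

# Categorical values outside the allowed domain (note the casing issue).
invalid_regions = sorted({r["region"] for r in rows if r["region"] not in allowed_regions})

# Numeric summary; a mean far above the median hints at an outlier.
amounts = [r["amount"] for r in rows if r["amount"] is not None]
amount_summary = {
    "min": min(amounts),
    "max": max(amounts),
    "mean": mean(amounts),
    "median": median(amounts),
}

print(row_count, null_counts, duplicate_keys, invalid_regions, amount_summary)
```

Here the 15000.0 value pulls the mean well above the median of 120.0, so it deserves investigation before any cleaning decision.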

Cleaning and Remediation Choices

Problem | Possible action | When appropriate | Risk
Missing values | Leave as NULL | Missingness is meaningful or unknown | Downstream tools may handle poorly
Missing values | Impute mean/median/mode | Small gaps; analysis requires complete data | Can bias variability and relationships
Missing values | Drop rows | Few affected rows and low business impact | Can introduce selection bias
Invalid format | Standardize/parse | Pattern is recoverable | Incorrect parsing
Duplicate records | Exact/fuzzy dedupe | Same entity appears multiple times | False merges
Outliers | Investigate, cap, transform, or keep | Depends on whether error or real extreme | Hiding important events
Inconsistent categories | Map to standard codes | Known synonym list exists | Misclassification
Wrong data type | Convert type | Source imported as text | Failed conversions or truncation
Inconsistent grain | Aggregate or disaggregate | Data sources differ in detail | Loss of detail or double counting

Descriptive Statistics and Core Formulas

Formula Reference

Use these formulas conceptually; exam questions often test interpretation more than calculation.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \]
\[ s = \sqrt{s^2} \]
\[ z = \frac{x - \mu}{\sigma} \]
\[ \text{IQR} = Q_3 - Q_1 \]
\[ r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} \]
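
A small worked example can make the formulas concrete. The data set is hypothetical. Two assumptions are worth flagging: the z-score treats the list as a full population (so it uses the population standard deviation), and quartiles follow the default method of `statistics.quantiles`, while other tools may use slightly different quartile conventions.

```python
from statistics import mean, pstdev, quantiles, stdev, variance

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

x_bar = mean(x)          # sum of values / n
s_squared = variance(x)  # sample variance, divides by n - 1
s = stdev(x)             # square root of the sample variance

# z-score of the value 9, treating x as the whole population.
z_of_9 = (9 - x_bar) / pstdev(x)

# Quartiles and IQR (default 'exclusive' method of statistics.quantiles).
q1, q2, q3 = quantiles(x, n=4)
iqr = q3 - q1


def pearson_r(xs, ys):
    """Correlation as sample covariance divided by the product of sample SDs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


# A perfectly linear relationship gives r of (approximately) 1.
r_linear = pearson_r(x, [2 * v + 1 for v in x])

print(x_bar, s_squared, s, z_of_9, iqr, r_linear)
```

With these values the mean is 5, the sample variance is 32/7, the value 9 sits two population standard deviations above the mean, and the IQR is 2.5.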

Statistic Selection

Statistic | Plain formula / meaning | Use when… | Sensitive to outliers?
Count | Number of records | Volume/frequency matters | No, but duplicates matter
Sum | Total of values | Total revenue, units, cost | Yes
Mean | sum of values / count | Symmetric numeric data | Yes
Median | Middle ordered value | Skewed data or outliers | Less sensitive
Mode | Most frequent value | Categorical or common value | No
Range | max - min | Quick spread check | Yes
Variance | Average squared deviation | Variability calculation | Yes
Standard deviation | Square root of variance | Typical spread around mean | Yes
Percentile | Value below which a percentage falls | Distribution thresholds | Less sensitive than max
Quartiles | 25%, 50%, 75% points | Boxplots, spread | Less sensitive
IQR | Q3 - Q1 | Robust spread | Less sensitive
Z-score | Standard deviations from mean | Standardized outlier detection | Assumes meaningful mean/SD
Correlation | Strength/direction of linear relationship | Relationship between numeric variables | Can be distorted by outliers
Weighted mean | Sum(value × weight) / sum(weights) | Unequal importance or sample sizes | Depends on weights
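
Two rows of the table are worth seeing with numbers: how one outlier moves the mean but barely moves the median, and how a weighted mean differs from a simple mean of averages. All figures below are hypothetical.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; the last list adds one extreme value.
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [400]

print(mean(salaries), median(salaries))          # 44.8 45
print(mean(with_outlier), median(with_outlier))  # 104 46.0


def weighted_mean(values, weights):
    """Sum(value * weight) / sum(weights)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Two stores: average rating 4.0 from 10 reviews, 3.0 from 90 reviews.
ratings = [4.0, 3.0]
review_counts = [10, 90]

simple_average = mean(ratings)                             # treats stores equally
overall_rating = weighted_mean(ratings, review_counts)     # treats reviews equally

print(simple_average, overall_rating)
```

The simple average of the two store ratings is 3.5, while weighting by review count gives about 3.1, which better reflects the typical review.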

Distribution and Shape

Shape / pattern | Interpretation | Good visual
Normal / bell-shaped | Symmetric around mean | Histogram, density plot
Right-skewed | Long tail to high values; mean often above median | Histogram, boxplot
Left-skewed | Long tail to low values; mean often below median | Histogram, boxplot
Uniform | Values evenly distributed | Histogram
Bimodal/multimodal | Multiple peaks; possible subgroups | Histogram split by segment
Outliers | Extreme values | Boxplot, scatterplot
Seasonality | Repeating time pattern | Line chart by time
Trend | Long-term increase/decrease | Line chart, moving average
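
The moving average mentioned for trends can be sketched as a simple trailing window. The function name, window size, and monthly figures are hypothetical, and the sketch assumes there are no missing periods in the series.

```python
def moving_average(values, window):
    """Trailing moving average: one output per full window of consecutive values."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical monthly sales, already at a consistent monthly grain.
monthly_sales = [10, 12, 11, 15, 14, 18]
smoothed = moving_average(monthly_sales, window=3)

print(smoothed)
```

A 3-month window over 6 months yields 4 smoothed points. If a month were missing from the list, the window would silently span the wrong periods, which is the "missing periods" trap noted earlier in this reference.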

Inferential Statistics and Hypothesis Testing

Key Terms

Term | Meaning | Exam note
Population | Entire group of interest | Often unavailable in full
Sample | Subset of population | Should represent population
Parameter | Population measure | Usually unknown
Statistic | Sample measure | Used to estimate parameter
Sampling error | Difference between sample statistic and population parameter | Reduced by better sampling design and larger samples
Confidence interval | Range of plausible values for a parameter | Wider intervals imply more uncertainty
Null hypothesis | Default/no-effect claim | Tested against alternative
Alternative hypothesis | Claim of effect/difference | May be one-tailed or two-tailed
p-value | Probability of results at least as extreme as those observed, assuming the null is true | Small p-value suggests evidence against null
Significance level | Threshold for rejecting null | Chosen before test
Type I error | Rejecting a true null | False positive
Type II error | Failing to reject a false null | False negative
Power | Probability of detecting a real effect | Higher power lowers Type II risk

Common Test Selection

Scenario | Candidate method | Data type
Compare mean to known value | One-sample t-test | Numeric
Compare means of two independent groups | Two-sample t-test | Numeric + two groups
Compare means of paired observations | Paired t-test | Numeric paired data
Compare means across more than two groups | ANOVA | Numeric + multiple groups
Test relationship between categorical variables | Chi-square test | Categorical
Estimate linear relationship | Linear regression | Numeric outcome
Compare proportions | Proportion test | Categorical/binary outcome

Exam trap: Statistical significance does not prove practical significance, business value, or causation.
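
To connect confidence intervals with uncertainty, here is a minimal sketch of an interval for a mean. It uses a normal (z) critical value, which is a large-sample approximation and an assumption of this example; with small samples a t critical value would usually be more appropriate. The sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate CI for the mean: mean +/- z * (s / sqrt(n))."""
    n = len(sample)
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m - z * standard_error, m + z * standard_error


# Hypothetical sample of 49 measurements centered on 50.
sample = [50 + (i % 7) - 3 for i in range(49)]

low_95, high_95 = mean_confidence_interval(sample, 0.95)
low_99, high_99 = mean_confidence_interval(sample, 0.99)

print(low_95, high_95)
print(low_99, high_99)
```

Asking for more confidence (99% instead of 95%) widens the interval, and a larger sample would narrow it because the standard error shrinks with the square root of n.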

Sampling, Bias, and Experimental Design

Sampling Methods

Method | How it works | Strength | Risk
Simple random | Every member has equal chance | Easy to understand | Requires full sampling frame
Stratified | Sample within important subgroups | Ensures subgroup representation | Requires correct strata
Cluster | Randomly select groups/clusters | Cost-effective for dispersed populations | Higher sampling error if clusters vary
Systematic | Select every kth record | Simple | Hidden periodic patterns can bias results
Convenience | Use easily available records | Fast | Often biased
Snowball | Participants recruit others | Useful for hard-to-reach groups | Network bias
Census | Include all records/population | No sampling error for included population | May be expensive or infeasible
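
Three of the methods above (simple random, systematic, stratified) can be sketched with Python's random module. The population, the region attribute, the fixed seed, and the proportional allocation rule are all assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 100 records, 25 in East and 75 in West.
population = [
    {"id": i, "region": "East" if i % 4 == 0 else "West"} for i in range(100)
]
sample_size = 20

# Simple random: every record has the same chance of selection.
simple_random = rng.sample(population, sample_size)

# Systematic: random start, then every kth record.
k = len(population) // sample_size
start = rng.randrange(k)
systematic = population[start::k]

# Stratified (proportional): sample within each region by its share.
strata = {}
for record in population:
    strata.setdefault(record["region"], []).append(record)

stratified = []
for region, members in strata.items():
    share = sample_size * len(members) // len(population)
    stratified.extend(rng.sample(members, share))

region_counts = {
    region: sum(1 for r in stratified if r["region"] == region) for region in strata
}
print(len(simple_random), len(systematic), region_counts)
```

The stratified sample is guaranteed to contain 5 East and 15 West records, while the simple random sample only matches those proportions on average. Note that the systematic sample here may over- or under-represent East, because region follows a periodic pattern in the frame, which is the risk the table describes.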

Bias and Validity Traps

Bias / issue | Description | Mitigation
Selection bias | Sample differs from population | Random/stratified sampling, clear inclusion rules
Survivorship bias | Only successful/remaining cases are analyzed | Include failures and removed records
Confirmation bias | Analyst favors expected result | Predefine method; peer review
Response bias | Answers influenced by wording/social pressure | Neutral survey design
Nonresponse bias | Missing respondents differ from respondents | Follow-up, weighting, assess differences
Measurement bias | Instrument/process systematically mismeasures | Calibrate, validate, standardize collection
Data leakage | Predictive model uses information unavailable at prediction time | Separate training features by time and availability
Confounding | Third variable affects relationship | Control variables, experimental design
Simpson’s paradox | Aggregate trend reverses within subgroups | Analyze by relevant segments
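
Simpson's paradox is easier to recognize after seeing it once with numbers. The counts below are illustrative, chosen so that option A wins inside each segment while option B wins in the aggregate; the option and segment labels are hypothetical.

```python
# (successes, trials) per option and segment; illustrative counts only.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

# Success rate within each segment.
segment_rates = {
    option: {seg: s / n for seg, (s, n) in segments.items()}
    for option, segments in results.items()
}

# Success rate after pooling the segments together.
overall_rates = {
    option: sum(s for s, _ in segments.values()) / sum(n for _, n in segments.values())
    for option, segments in results.items()
}

for option in results:
    print(option, segment_rates[option], round(overall_rates[option], 3))
```

A has the higher rate in both the small and the large segment, yet B has the higher pooled rate, because A was applied mostly to the harder "large" cases. This is why the mitigation is to analyze by relevant segments before trusting the aggregate.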

Analytics Methods and Model Concepts

Analytics Categories

Category | Question answered | Examples
Descriptive | What happened? | Monthly sales, defect count, dashboard KPI
Diagnostic | Why did it happen? | Drill-down, variance analysis, root cause analysis
Predictive | What is likely to happen? | Forecasting demand, churn prediction
Prescriptive | What should we do? | Optimization, recommendations, next-best action

Method Selection

Task | Common method | Output | Watch for
Forecast future values | Time series / regression | Predicted value by time | Seasonality, missing periods, external events
Predict numeric value | Regression | Continuous estimate | Outliers, multicollinearity, nonlinearity
Predict category/class | Classification | Class label/probability | Imbalanced classes, threshold choice
Find natural groups | Clustering | Segments/clusters | Need interpretation and scaling
Find co-occurring items | Association rules | Item relationships | Correlation, not causation
Reduce variables | Dimensionality reduction | Fewer features/components | Loss of interpretability
Analyze free text | Text mining/NLP | Sentiment, topics, entities | Ambiguity, language, context
Detect unusual events | Anomaly detection | Outlier score/flag | Rare legitimate events vs errors

Model Evaluation Metrics

Metric | Plain formula / meaning | Best for | Trap
Accuracy | Correct predictions / all predictions | Balanced classification | Misleading with class imbalance
Precision | TP / (TP + FP) | False positives are costly | May ignore missed positives
Recall / sensitivity | TP / (TP + FN) | False negatives are costly | May increase false positives
Specificity | TN / (TN + FP) | Correctly identifying negatives | Not enough alone
F1 score | Harmonic mean of precision and recall | Balance precision and recall | Hides business cost differences
MAE | Average absolute error | Regression; interpretable units | Treats all errors linearly
MSE | Average squared error | Regression; penalizes large errors | Units are squared
RMSE | Square root of MSE | Regression; original units | Sensitive to outliers
R-squared | Variance explained by model | Regression fit summary | Higher is not always better; overfitting possible
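
The formulas in the table can be computed directly from confusion-matrix counts and error lists. The sketch below uses hypothetical counts for a model that predicts "negative" for every record in an imbalanced data set; the function names and the choice to return 0.0 for undefined ratios are assumptions of this example.

```python
from math import sqrt


def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, sqrt(mse)


# Hypothetical imbalanced case: 10 real positives in 1,000 records,
# and a model that always predicts "negative".
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative)

mae, mse, rmse = regression_errors([10, 20, 30], [12, 18, 36])
print(mae, mse, rmse)
```

The always-negative model scores 99% accuracy while its recall is 0, which is the class-imbalance trap. In the regression example, RMSE exceeds MAE because the single large error is squared before averaging.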

Visualization and Reporting

Chart Selection Matrix

Need | Best chart types | Avoid / watch for
Compare categories | Bar, column, dot plot | 3D bars, unsorted clutter
Show trend over time | Line, area, sparkline | Pie chart for time trends
Show part-to-whole | Stacked bar, 100% stacked bar, treemap, pie for few categories | Too many pie slices
Show distribution | Histogram, boxplot, density plot | Mean-only summary for skewed data
Show relationship | Scatterplot, bubble chart | Inferring causation automatically
Show ranking | Sorted bar, lollipop chart | Alphabetical order when rank matters
Show geography | Map, choropleth, proportional symbol map | Using raw counts without population normalization
Show process flow | Flowchart, Sankey | Overly decorative visuals
Show KPI status | Scorecard, bullet chart, gauge with caution | Gauge overload
Show correlation matrix | Heatmap | Using rainbow color scales without meaning

Visualization Design Principles

Principle | Practical guidance
Match chart to question | Choose the simplest chart that answers the stakeholder’s question
Use correct scale | Avoid misleading truncated axes unless clearly justified
Label clearly | Title, axes, units, timeframe, filters, source
Reduce clutter | Remove unnecessary gridlines, effects, and redundant labels
Use color intentionally | Highlight meaning; do not rely on color alone
Preserve context | Include benchmarks, targets, prior period, or sample size when needed
Show uncertainty | Use confidence intervals/error bars where appropriate
Support accessibility | Sufficient contrast, readable fonts, colorblind-friendly palettes
Keep grain consistent | Do not mix daily, monthly, and yearly values without explanation
Document assumptions | Filters, exclusions, definitions, and refresh date should be discoverable

Dashboard and Report Types

Type | Audience | Purpose | Design focus
Operational dashboard | Front-line teams | Monitor current activity | Timeliness, alerts, drill-through
Tactical dashboard | Managers | Track departmental performance | Trends, exceptions, targets
Strategic dashboard | Executives | Monitor high-level goals | KPIs, concise summaries, business outcomes
Analytical report | Analysts/managers | Explore causes and patterns | Filters, segmentation, detail
Static report | Broad distribution | Fixed snapshot | Clear narrative and definitions
Self-service BI | Business users | Flexible exploration | Governed datasets and consistent metrics

KPI and Metric Review Checklist

  • Is the metric tied to a business objective?
  • Are the numerator and denominator clearly defined?
  • Is the grain clear?
  • Are filters and exclusions documented?
  • Is the time period consistent?
  • Is there a target, benchmark, or baseline?
  • Could the metric be gamed?
  • Are leading and lagging indicators balanced?
  • Are related metrics needed to avoid misinterpretation?

Business Analysis and Communication

Requirements Questions

Area | Questions to ask
Objective | What decision or action will this analysis support?
Stakeholder | Who will use the result, and what is their data literacy level?
Scope | Which products, regions, periods, or populations are included?
Metric definition | How exactly is success measured?
Data availability | Which sources contain the needed fields?
Refresh | How often does the output need to update?
Security | Who is allowed to see raw data and summarized results?
Delivery | Dashboard, report, file extract, presentation, API, alert?
Acceptance | How will the stakeholder validate the result?

Communicating Findings

Element | Include
Executive summary | Key finding, impact, recommendation
Method | Data sources, timeframe, filters, transformations
Evidence | Relevant visuals, statistical support, sample size
Limitations | Missing data, assumptions, uncertainty, known bias
Recommendation | Clear action tied to business objective
Next steps | Further analysis, monitoring, or decision owner

Exam trap: A technically correct analysis can still fail if it does not answer the stakeholder’s actual question.

Governance, Privacy, and Security

Governance Roles

Role | Typical responsibility
Data owner | Accountable for data domain and access decisions
Data steward | Maintains definitions, quality rules, metadata, and usage guidance
Data custodian | Operates technical storage, backups, and security mechanisms
Data analyst | Prepares, analyzes, visualizes, and communicates data
Data engineer | Builds pipelines, ingestion, transformation, and data platforms
Database administrator | Manages database performance, availability, and access controls
Security/privacy teams | Define controls for sensitive data and risk management

Governance Artifacts

Artifact | Purpose
Data dictionary | Field names, definitions, types, allowed values
Business glossary | Business-friendly definition of terms and metrics
Data catalog | Searchable inventory of datasets and metadata
Lineage documentation | Shows where data came from and transformations applied
Quality rules | Defines expected validity, completeness, and consistency checks
Access policy | Defines who can access what and why
Retention policy | Defines how long data is kept and when it is disposed
Data classification | Labels sensitivity and handling requirements
Master data | Authoritative shared entities such as customer/product
Reference data | Standard codes and lookup values

Sensitive Data Handling

Technique | What it does | Use when…
Data minimization | Collect/use only needed data | Reducing risk and exposure
Masking | Hides part of a value | Users need partial visibility
Tokenization | Replaces sensitive value with token | Systems need reference without exposing original
Encryption | Protects data using cryptography | Data at rest or in transit must be protected
Hashing | One-way transformation | Need comparison without revealing original value
Anonymization | Removes ability to identify individuals | Analysis should not identify subjects
Pseudonymization | Replaces identifiers but may be reversible with separate key | Analysis needs linkage with reduced exposure
Aggregation | Reports grouped results | Individual-level detail is unnecessary
Redaction | Removes sensitive fields/content | Sharing documents or extracts
Access control | Limits who can view/use data | Least privilege and role separation
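
Masking and hashing are the two techniques here that are simplest to show in code. The function names, the salt value, and the account number are hypothetical. The hashing part is a sketch of the idea only: a plain salted SHA-256 of a short, guessable value can be brute-forced, so production designs typically rely on vetted approaches chosen with the security team.

```python
import hashlib


def mask_account(account_number, visible=4, mask_char="*"):
    """Hide all but the last `visible` characters of an identifier."""
    text = str(account_number)
    if len(text) <= visible:
        return mask_char * len(text)
    return mask_char * (len(text) - visible) + text[-visible:]


def hash_identifier(value, salt):
    """One-way SHA-256 digest of salt + value, returned as hex text."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical values for illustration only.
account = "1234567890123456"
salt = "example-salt"

masked = mask_account(account)
digest_a = hash_identifier("alice@example.com", salt)
digest_b = hash_identifier("alice@example.com", salt)
digest_c = hash_identifier("bob@example.com", salt)

print(masked)                 # ************3456
print(digest_a == digest_b)   # True: same input, same digest
print(digest_a == digest_c)   # False: different input
```

Masking keeps partial visibility for users, while the digest supports matching records across data sets without storing the original value.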

Access Control Distinctions

Control | Description | Exam clue
Least privilege | Users get only required access | Reduce unnecessary exposure
RBAC | Access based on role | “Analysts can read curated sales tables”
ABAC | Access based on attributes/context | Department, location, sensitivity, purpose
MFA | Additional authentication factor | Strengthen identity verification
Audit logging | Records access and actions | Investigation, accountability
Segregation of duties | Splits conflicting responsibilities | Prevent misuse or unchecked changes
Row-level security | Restricts rows by user/context | Regional managers see only their region
Column-level security | Restricts sensitive fields | Hide salary, SSN/national ID, or account number

Metadata, Lineage, and Documentation

Concept | Why it matters for DA0-002 scenarios
Metadata | Helps users understand source, owner, refresh, type, and meaning
Technical metadata | Data types, schema, table size, refresh job, constraints
Business metadata | Business definitions, KPI rules, owner, approved usage
Operational metadata | Load time, job status, error count, processing duration
Lineage | Supports trust, auditability, troubleshooting, and impact analysis
Versioning | Tracks changes to queries, reports, definitions, and datasets
Data provenance | Establishes origin and authenticity
Reproducibility | Allows another analyst to recreate the result

Exam trap: A dashboard without definitions, refresh timestamp, source, or owner may look polished but still be weakly governed.

Troubleshooting Data Problems

Symptom-to-Cause Reference

Symptom | Likely causes | First checks
Dashboard totals suddenly changed | Source refresh, filter change, join issue, duplicate load | Refresh logs, row counts, query version, source control totals
Counts too high | Duplicate rows, many-to-many join, wrong grain | Distinct counts, key uniqueness, join path
Counts too low | Inner join dropped records, filter too restrictive, missing source file | Unmatched records, filter logic, ingestion logs
NULLs increased | Source system change, parsing failure, new optional field behavior | Null profiling by load date/source
Date trend has gaps | Missing batch, time zone issue, holiday/weekend handling | Calendar table, refresh logs, timestamp conversion
Categories split unexpectedly | Inconsistent spelling/casing, new codes | Value frequency list, reference data mapping
Query is slow | Large scans, missing filters, inefficient joins, no aggregation | Explain plan if available, filter early, reduce columns
Model performance dropped | Data drift, changed population, feature pipeline issue | Compare training vs current distributions
Report users disagree with metric | Different definitions, filters, or source systems | Business glossary, requirements, reconciliation
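
The "counts too high" symptom can be reproduced and diagnosed in a few lines. This sketch uses sqlite3 with hypothetical tables in which one dimension row was loaded twice; the table and column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_amount REAL);
    CREATE TABLE customer_dim (customer_id INTEGER, segment TEXT);
    INSERT INTO orders VALUES (1, 100, 50.0), (2, 100, 70.0), (3, 200, 30.0);
    -- Customer 100 appears twice: a duplicate dimension row.
    INSERT INTO customer_dim VALUES (100, 'Retail'), (100, 'Retail'), (200, 'Wholesale');
""")

# Control total taken from the fact table alone.
control_total = conn.execute("SELECT SUM(order_amount) FROM orders").fetchone()[0]

# Total after joining to the dimension: orders for customer 100 are doubled.
joined_total = conn.execute("""
    SELECT SUM(o.order_amount)
    FROM orders o
    JOIN customer_dim d ON o.customer_id = d.customer_id
""").fetchone()[0]

# First check: is the join key unique in the dimension?
duplicate_keys = conn.execute("""
    SELECT customer_id, COUNT(*)
    FROM customer_dim
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

print(control_total, joined_total, duplicate_keys)
```

The joined total (270.0) no longer reconciles with the control total (150.0), and the key-uniqueness query points at customer 100 as the cause.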

Validation Before Publishing

  • Reconcile totals against trusted source reports.
  • Confirm joins do not change expected row counts unexpectedly.
  • Test filters, date ranges, and parameter defaults.
  • Review outliers and decide whether they are errors or real events.
  • Validate metric definitions with stakeholders.
  • Check access permissions and sensitive fields.
  • Include source, refresh date, owner, and definitions.
  • Save query/report version and document assumptions.

Common DA0-002 Exam Traps

Trap | Better thinking
Correlation equals causation | Correlation can suggest a relationship but does not prove cause
Highest accuracy is always best | Choose metrics based on business cost of false positives/false negatives
Mean always represents “typical” | Median is often better for skewed data
Remove all outliers | Investigate first; outliers may be valid and important
More data is always better | Relevant, high-quality, governed data is better than uncontrolled volume
Pie charts are always good for percentages | Use only for a few categories; bars are often clearer
Raw data lake equals trusted data | Raw storage requires cataloging, quality, lineage, and access controls
Dashboard first, requirements later | Define audience, decision, metric, and refresh needs first
Inner joins are harmless | They can drop unmatched records and bias results
Cleaning data is just formatting | Cleaning can change meaning; document assumptions
Statistical significance means business significance | Effect size, cost, and actionability still matter
Aggregates are always safe | Aggregation can hide subgroup patterns and bias

Final Review Checklist

Before sitting for CompTIA Data+ V2 (DA0-002), confirm you can:

  • Select the right data storage pattern for operational vs analytical scenarios.
  • Explain structured, semi-structured, and unstructured data implications.
  • Identify the correct chart for comparison, trend, distribution, relationship, or geography.
  • Read SQL joins, GROUP BY, HAVING, CASE, and window-function patterns.
  • Diagnose duplicate, missing, invalid, inconsistent, and untimely data.
  • Choose mean vs median, standard deviation vs IQR, and correlation vs regression.
  • Interpret p-values, confidence intervals, Type I/Type II errors, and sampling bias.
  • Distinguish descriptive, diagnostic, predictive, and prescriptive analytics.
  • Match model metrics to business error costs.
  • Apply privacy, masking, encryption, access control, lineage, and metadata concepts.
  • Communicate findings with assumptions, limitations, and actionable recommendations.

Practical Next Step

Use this Quick Reference as a checklist while answering timed DA0-002 practice questions. After each missed question, tag the miss by category: data quality, SQL, statistics, visualization, architecture, governance, or communication. Then revisit the matching section above until you can explain both the correct answer and the trap answer.
