AWS MLA-C01: ML Data Prep

Try 10 focused AWS MLA-C01 questions on ML Data Prep, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Topic snapshot

  • Exam route: AWS MLA-C01
  • Topic area: Data Preparation for Machine Learning (ML)
  • Blueprint weight: 28%
  • Page purpose: Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Data Preparation for Machine Learning (ML) for AWS MLA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

  • First attempt: Answer without checking the explanation first. Record the fact, rule, calculation, or judgment point that controlled your answer.
  • Review: Read the explanation even when you were correct. Record why the best answer is stronger than the closest distractor.
  • Repair: Repeat only missed or uncertain items after a short break. Record the pattern behind misses, not the answer letter.
  • Transfer: Return to mixed practice once the topic feels stable. Record whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 28% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Data Preparation for Machine Learning (ML)

A team stores ML training data in Amazon S3 and uses an Amazon SageMaker Processing job to read the data for feature engineering. The team requires encryption at rest using customer-managed keys and encryption in transit. They also want clear ownership of key permissions.

Which approach best meets these requirements?

Options:

  • A. Use S3 SSE-S3 and HTTPS; AWS manages all key permissions.

  • B. Client-side encrypt the data and upload it to S3 over HTTP.

  • C. Use S3 SSE-C and pass the encryption key to each job run.

  • D. Use S3 SSE-KMS with a customer managed KMS key and HTTPS; grant the job role kms:Decrypt.

Best answer: D

Explanation: S3 SSE-KMS with a customer managed KMS key satisfies encryption at rest with customer-controlled keys. Using HTTPS covers encryption in transit. Because the key is customer managed, the customer is responsible for granting the SageMaker job role permissions such as kms:Decrypt.

For ML data at rest in S3 with customer ownership of keys, use server-side encryption with AWS KMS (SSE-KMS) and a customer managed KMS key. Data in transit between SageMaker and S3 should use TLS (HTTPS). With SSE-KMS, AWS operates the KMS service, but the customer controls the KMS key policy (or grants) and must allow the SageMaker execution role to use the key (commonly kms:Decrypt and kms:GenerateDataKey) so the job can read encrypted objects. The key takeaway is that customer-managed encryption implies customer-managed access to that key.
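To make the pattern concrete, here is a minimal boto3 sketch; the bucket name, key alias, and account ID are hypothetical placeholders:

```python
# Hedged sketch: upload training data with SSE-KMS under a customer managed key.
import boto3

s3 = boto3.client("s3")  # boto3 uses HTTPS (TLS) endpoints by default

with open("train.parquet", "rb") as f:
    s3.put_object(
        Bucket="ml-training-data",        # hypothetical bucket
        Key="features/train.parquet",
        Body=f,
        ServerSideEncryption="aws:kms",   # SSE-KMS encryption at rest
        SSEKMSKeyId="alias/ml-data-key",  # customer managed KMS key (hypothetical alias)
    )

# The SageMaker job role then needs key permissions via an IAM or key-policy
# statement along these lines:
# {"Effect": "Allow",
#  "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
#  "Resource": "arn:aws:kms:us-east-1:111122223333:key/<key-id>"}
```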

  • SSE-S3 mismatch: SSE-S3 uses AWS-managed keys, not customer-managed keys.
  • SSE-C operational risk: SSE-C requires providing and handling the key on every request/job run.
  • No TLS in transit: uploading over HTTP fails the in-transit encryption requirement.

Question 2

Topic: Data Preparation for Machine Learning (ML)

A data science team is building a binary classifier in Amazon SageMaker. The training data in Amazon S3 contains (1) many records with missing labels and (2) some labels marked as uncertain because human reviewers disagreed. The team must choose a strategy that preserves a reliable estimate of model quality.

Which approach should the team NOT use?

Options:

  • A. Label with Ground Truth; keep a separate untouched test set

  • B. Down-weight uncertain labels and report sensitivity metrics

  • C. Drop unlabeled rows from training; score them after deployment

  • D. Auto-label train and test using a model trained on all data

Best answer: D

Explanation: Using a model to generate labels for the same data (especially the test set) breaks the separation between training and evaluation. This causes leakage and produces overly optimistic metrics that do not reflect real-world performance. Proper handling of missing or uncertain labels must preserve an untouched evaluation set to assess true model quality.

When labels are missing or uncertain, the core principle is to keep ground truth and evaluation independent so that offline metrics reflect real production behavior. Auto-labeling the dataset with a model trained on the same population (and then evaluating on those auto-labeled records) turns the model into its own grader, which contaminates labels and invalidates measured precision/recall/AUC.

Acceptable strategies include:

  • Acquire additional labels (for missing/uncertain cases) and keep a truly untouched test set.
  • Train only on confidently labeled examples while tracking coverage and selection bias.
  • Model label uncertainty explicitly (soft labels or sample weights) and measure robustness with sensitivity analyses.

Key takeaway: label generation must not “peek” at the evaluation split or be derived from the model being evaluated.
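For illustration, a minimal scikit-learn sketch of the weighting strategy; the toy data, column names, and the `uncertain` reviewer-disagreement flag are hypothetical:

```python
# Hedged sketch: down-weight uncertain labels and keep an untouched,
# confidently labeled test set for evaluation.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "x1":        [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6, 0.15, 0.85],
    "x2":        [1.0, 0.2, 0.8, 0.1, 0.9, 0.3, 0.7, 0.4, 0.95, 0.25],
    "label":     [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "uncertain": [False, False, True, False, True, False, False, True, False, False],
})

# Hold out a test set drawn only from confidently labeled rows, then never touch it.
confident = df[~df["uncertain"]]
train_conf, test = train_test_split(confident, test_size=0.3, random_state=0)
train = pd.concat([train_conf, df[df["uncertain"]]])   # uncertain rows go to training only

weights = train["uncertain"].map({True: 0.5, False: 1.0})  # soft-pedal noisy labels
model = LogisticRegression().fit(train[["x1", "x2"]], train["label"], sample_weight=weights)
print("holdout accuracy:", model.score(test[["x1", "x2"]], test["label"]))
```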

  • Human labeling with a holdout: maintains label integrity and preserves an unbiased test set.
  • Excluding unlabeled rows from supervised training: valid when you track representativeness and evaluate on labeled holdout data.
  • Weighting uncertain labels: a standard way to reduce the impact of noisy labels while assessing metric sensitivity.

Question 3

Topic: Data Preparation for Machine Learning (ML)

A company is building a SageMaker pipeline that extracts features from these sources: Amazon S3 (raw files), Amazon RDS for PostgreSQL (OLTP), Amazon DynamoDB (online features), and Amazon EFS (shared logs). The sources are in private subnets.

Requirements:

  • Do not disrupt production databases/tables with heavy read operations.
  • Do not use long-lived or hardcoded database credentials.

Which TWO approaches should you AVOID? (Select TWO.)

Options:

  • A. Use DynamoDB point-in-time export to S3 for training snapshots

  • B. Mount EFS in a SageMaker Processing job and read files

  • C. Use AWS DMS CDC from RDS to S3 for downstream processing

  • D. Use AWS Glue JDBC in VPC to query an RDS read replica

  • E. Run pg_dump on the primary RDS from EC2 with hardcoded credentials

  • F. Run hourly DynamoDB Scan from a notebook against the production table

Correct answers: E and F

Explanation: Avoid extraction methods that either create heavy read pressure on production stores or rely on long-lived credentials. Full scans and direct dumps against primary OLTP systems can cause throttling, increased latency, and operational risk. Prefer exports, CDC/replication, and VPC-connected managed jobs with secure secret handling.

For ML ingestion on AWS, choose mechanisms that minimize impact on operational data stores and use secure, managed authentication. For RDS, common patterns include querying a read replica with controlled queries, or using AWS DMS to replicate changes (CDC) to S3 for analytics/ML. For DynamoDB, prefer managed exports to S3 (including point-in-time export) or other patterns that avoid repeated full-table reads against a live table. For file systems, reading directly via EFS mounts from VPC-attached SageMaker jobs is appropriate when the data already resides on EFS.

The main anti-patterns in this scenario are (1) repeated full-table scans against production DynamoDB tables and (2) running ad-hoc dumps against the primary RDS instance with hardcoded credentials.
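To illustrate the credentials half of this, a minimal boto3 sketch of fetching database credentials at runtime instead of hardcoding them; the secret name, its JSON fields, and the read replica endpoint are hypothetical:

```python
# Hedged sketch: pull read-replica credentials from AWS Secrets Manager at runtime.
import json
import boto3
import psycopg2  # assumes the PostgreSQL driver is available in the job environment

secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="prod/rds/ml-reader")["SecretString"]  # hypothetical secret
)

# Connect to the READ REPLICA endpoint, never the primary, so extraction
# queries do not add load to the OLTP workload.
conn = psycopg2.connect(
    host=secret["host"],
    port=secret["port"],
    dbname=secret["dbname"],
    user=secret["username"],
    password=secret["password"],
)
```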

  • RDS read replica access is acceptable because it offloads read traffic from the primary and can run in a VPC.
  • DMS to S3 is acceptable because it externalizes data (and optionally CDC) without ad-hoc heavy queries.
  • DynamoDB export to S3 is acceptable because it avoids repeatedly reading the live table.
  • EFS mount in Processing is acceptable because it reads existing files without stressing OLTP stores.

Question 4

Topic: Data Preparation for Machine Learning (ML)

A healthcare company stores 15 TB of daily partitioned Parquet files in Amazon S3 for model training in Amazon SageMaker. Before any data scientists can access the dataset, the company must (1) identify PII/PHI columns automatically, (2) mask identifiers with deterministic pseudonyms so records remain joinable across days, (3) produce the masked dataset within 2 hours nightly, and (4) keep an auditable, least-privilege boundary so raw PII/PHI is not accessible to the ML account. Which solution BEST meets these requirements with AWS-native services?

Options:

  • A. Use Macie findings to drive a Glue ETL pseudonymization job

  • B. Use SageMaker Data Wrangler to drop PII columns and export

  • C. Use Amazon Comprehend Medical to redact PHI from the Parquet files

  • D. Encrypt the dataset with SSE-KMS and restrict S3 bucket access

Best answer: A

Explanation: Use Amazon Macie to automatically discover and classify PII/PHI in the S3 data lake, then run an AWS Glue ETL job to apply deterministic pseudonymization (for example, salted hashing or tokenization) and write a masked S3 dataset for SageMaker. This preserves join keys across days and scales to multi-terabyte processing on a nightly schedule. Access controls and logs provide auditability while preventing raw data access.

The core need is to protect sensitive attributes while keeping the dataset analytically useful. Amazon Macie is designed to discover and classify sensitive data in S3 (PII/PHI signals), which satisfies the automatic identification requirement. For masking that preserves joinability, the transformation must be deterministic (the same input value maps to the same pseudonym across partitions/days), which is a good fit for an AWS Glue Spark ETL job over large Parquet datasets.

A best-fit pattern is:

  • Run Macie to identify sensitive fields/locations in S3.
  • Use a scheduled Glue ETL job to deterministically tokenize/hash sensitive columns and write a masked dataset to a separate S3 location.
  • Enforce least privilege (separate buckets/accounts/roles) and audit via CloudTrail/S3 access logs, with SSE-KMS on the masked outputs.

Encryption and access control are necessary but do not replace masking/anonymization, and text-redaction services are not the right tool for structured Parquet at this scale.
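A minimal Glue/Spark sketch of the deterministic masking step, assuming Macie has already flagged the sensitive columns; the column names, S3 paths, and salt handling are hypothetical:

```python
# Hedged sketch: deterministic pseudonymization of flagged columns in Spark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, sha2

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://raw-phi-bucket/events/dt=2026-05-14/")  # hypothetical path

# In production, fetch a stable secret salt from Secrets Manager; a changing
# salt would break joinability across days.
SALT = "stable-secret-salt"

for pii_col in ["patient_id", "email"]:  # columns identified by Macie (hypothetical)
    # Same input + same salt -> same pseudonym, so records stay joinable across days.
    df = df.withColumn(pii_col, sha2(concat(lit(SALT), col(pii_col).cast("string")), 256))

df.write.mode("overwrite").parquet("s3://masked-ml-bucket/events/dt=2026-05-14/")
```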

  • Interactive-only tooling: dropping PII columns in Data Wrangler does not provide deterministic pseudonymization and is a poor fit for a 15 TB nightly SLA.
  • Encryption is not masking: SSE-KMS and bucket policies protect storage and access but do not remove or anonymize PII/PHI for downstream use.
  • Wrong modality/service: Comprehend Medical targets unstructured clinical text, not scalable column-level masking of Parquet datasets.

Question 5

Topic: Data Preparation for Machine Learning (ML)

A team trains a fraud model in Amazon SageMaker and serves it on a real-time endpoint in private subnets (no internet egress). Auditors require that the exact same feature definitions and transformation logic be reused between training and inference and be versionable.

Which TWO approaches should the team AVOID because they are unsafe or violate these requirements? (Select TWO.)

Options:

  • A. Have the endpoint pull the latest preprocessing code from GitHub

  • B. Record preprocessor and feature schema versions with each model package

  • C. Version preprocessing code as a pinned internal Python package

  • D. Use Feature Store offline/online stores with one FeatureGroup schema

  • E. Load the same serialized sklearn preprocessor in training and endpoint

  • F. Reimplement inference preprocessing in Lambda using a wiki spec

Correct answers: A and F

Explanation: Feature parity requires a single, reusable, versioned transformation definition that both training and inference use. Approaches that duplicate or dynamically change preprocessing logic introduce training-serving skew and fail auditability. The private-subnet constraint also forbids designs that depend on downloading code from the public internet at inference time.

The core risk is training-serving skew: if training and inference compute features differently (different encodings, missing-value rules, scaling, or category handling), the model will see a different feature distribution at inference and quality will degrade in hard-to-debug ways. To ensure parity, centralize transformations as an artifact or shared component and version it alongside the model.

Common safe patterns on AWS include:

  • Persisting a fitted preprocessor (for example, an sklearn pipeline) and loading the exact same artifact in the inference container.
  • Using consistent feature definitions via SageMaker Feature Store (same FeatureGroup schema for online/offline).
  • Pinning transformation code versions (package/container) and recording the preprocessor/schema version with the registered model.

Any approach that re-implements transforms separately or pulls mutable “latest” code undermines reproducibility and audit requirements.
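A minimal sketch of the shared-artifact pattern with scikit-learn and joblib; `train_df` and `request_df` are hypothetical DataFrames, and the paths assume SageMaker's /opt/ml/model convention:

```python
# Hedged sketch: fit the preprocessor once, persist it with the model artifact,
# and load the identical artifact in the inference container.
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["amount", "age"]),                        # hypothetical columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["merchant_type"]),
])
preprocessor.fit(train_df)                          # fit ONCE, on training data only
joblib.dump(preprocessor, "/opt/ml/model/preprocessor.joblib")

# --- inside the inference container's model-loading / input-handling code ---
preprocessor = joblib.load("/opt/ml/model/preprocessor.joblib")
features = preprocessor.transform(request_df)       # same fitted encoders/scalers at serving time
```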

  • Feature Store schema parity works because the same FeatureGroup definition backs offline training and online retrieval.
  • Shared serialized preprocessor preserves identical fitted parameters (encoders/scalers) across training and inference.
  • Pinned internal package supports reuse and repeatability by controlling the exact transformation code version.
  • Model package version metadata improves auditability by tying a model to specific feature/preprocessor versions.

Question 6

Topic: Data Preparation for Machine Learning (ML)

A team uses Amazon SageMaker Pipelines to train and tune a churn model and deploy it to a real-time endpoint.

Current workflow:

  • Data in Amazon S3 (mix of numeric features with very different ranges, plus one-hot encoded categories)
  • Train/tune a logistic regression model using an SGD-based optimizer
  • Deploy the best model to production and monitor latency and quality

Problem: Automatic Model Tuning runs are expensive and often unstable (slow convergence and occasional divergence). The team must keep a linear model for auditability and must not change the feature set.

Which change is the best way to improve training stability and reduce tuning cost without breaking these constraints?

Options:

  • A. Skip feature scaling and instead increase the learning rate and number of epochs for faster convergence

  • B. Apply Min-Max scaling using per-mini-batch statistics during training to reduce preprocessing overhead

  • C. Standardize continuous features using training-set mean and variance, and reuse the same transform at inference

  • D. Switch the model to a tree-based algorithm to avoid feature scaling and simplify the pipeline

Best answer: C

Explanation: Standardizing continuous features (zero mean, unit variance) directly addresses slow or unstable convergence for SGD-trained linear models. Fewer divergent runs and faster convergence typically reduce the number and duration of tuning jobs, lowering cost while preserving the required linear, auditable model. The tradeoff is added preprocessing plus the need to version and apply the exact same scaler at inference.

Feature scaling matters most for algorithm families that rely on distances or gradient-based optimization (for example, k-NN, SVMs, neural networks, and linear/logistic regression trained with SGD). When features have very different magnitudes, the optimizer can take inefficient steps, require more iterations, or become numerically unstable.

In this pipeline, adding a preprocessing step to fit a StandardScaler (mean/standard deviation) on the training split and applying it consistently to training/validation and the deployed endpoint improves stability and usually reduces tuning time/cost. Operationally, you must persist the scaler parameters with the model artifact (or as part of the inference container) so production uses the exact same transformation.
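As a sketch of that discipline (the array names are hypothetical):

```python
# Hedged sketch: statistics come from the training split only, then are reused everywhere.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)     # mean/variance estimated on training data only
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)     # reuse the SAME fitted statistics, never refit
# Persist scaler.mean_ and scaler.scale_ with the model artifact so the endpoint
# applies the identical transformation at inference time.
```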

By contrast, tree-based models are generally insensitive to monotonic feature scaling, but changing to them violates the linear-model constraint.

  • Per-mini-batch scaling makes the feature transformation inconsistent across batches and between training and inference, hurting model reliability.
  • Tuning learning rate/epochs only does not address ill-conditioned features and can increase divergence risk.
  • Switching to tree-based models may work without scaling, but it breaks the stated requirement to keep a linear model for auditability.

Question 7

Topic: Data Preparation for Machine Learning (ML)

A company is building a computer vision model in Amazon SageMaker to detect safety gear in warehouse photos. The team must label 900,000 images stored in Amazon S3. Requirements:

  • Only company employees can label data (no third parties) because images may contain customer PII.
  • Labeling actions must be auditable.
  • Label output must be encrypted at rest with a customer managed KMS key.
  • The project must finish labeling within 4 weeks without building a custom labeling application.

Which AWS approach is the BEST fit for these requirements?

Options:

  • A. SageMaker Ground Truth with a private workforce

  • B. Amazon Mechanical Turk as the labeling workforce

  • C. Custom labeling UI on Amazon EC2 for employees

  • D. Amazon Rekognition to auto-label all images

Best answer: A

Explanation: SageMaker Ground Truth can run a managed labeling job at large scale while restricting labeling to a private workforce made up of the company’s employees. It integrates with S3 for output storage, supports KMS encryption for the output, and provides traceability for labeling activity to meet audit requirements. This satisfies the governance and timeline constraints without building a custom tool.

The core requirement is governed, high-throughput human labeling without using external annotators or building a bespoke system. Amazon SageMaker Ground Truth is designed for creating labeled datasets at scale and includes built-in workflow management and quality controls (for example, annotation consolidation) while allowing you to choose the workforce type.

For this scenario, use Ground Truth with a private workforce so only employees can access the images, store labels back to S3 with SSE-KMS using a customer managed key, and rely on AWS logging (for example, CloudTrail) for auditability of job and data access activity. This delivers a managed path to completing 900,000 labels within the required timeframe, whereas alternatives either violate the workforce governance constraint or require custom application development.

Key takeaway: pick Ground Truth private workforce when you need scalable labeling with strong governance controls.
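For orientation, a sketch of the governance-relevant fields in the boto3 request; every ARN, name, and path is hypothetical, and several required HumanTaskConfig fields (UI template, pre-task and annotation-consolidation Lambdas, worker counts, timeouts) are deliberately elided:

```python
# Hedged, partial sketch of a Ground Truth labeling job request.
import boto3

labeling_job = {
    "LabelingJobName": "safety-gear-labels",
    "LabelAttributeName": "safety-gear",
    "InputConfig": {"DataSource": {"S3DataSource": {
        "ManifestS3Uri": "s3://labeling-input/manifest.json"}}},
    "OutputConfig": {
        "S3OutputPath": "s3://labeling-output/",
        # Encrypt label output at rest with the customer managed key:
        "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/<key-id>",
    },
    "RoleArn": "arn:aws:iam::111122223333:role/ground-truth-role",
    "HumanTaskConfig": {
        # Restrict labeling to the company's private workforce of employees:
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/employees",
        # ... UiConfig, PreHumanTaskLambdaArn, AnnotationConsolidationConfig,
        # NumberOfHumanWorkersPerDataObject, TaskTimeLimitInSeconds, etc. ...
    },
}
# After filling in the elided required fields:
# boto3.client("sagemaker").create_labeling_job(**labeling_job)
```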

  • Third-party annotators: fails because Mechanical Turk does not meet the “employees only” governance requirement.
  • No human labeling: fails because Rekognition auto-labeling is not a controlled human labeling workflow for producing ground-truth training labels.
  • Build it yourself: fails because a custom EC2 labeling UI violates the requirement to avoid building a custom labeling application and adds operational burden against the 4-week deadline.

Question 8

Topic: Data Preparation for Machine Learning (ML)

A team uses Amazon SageMaker Data Wrangler to engineer features from daily CSV files in Amazon S3 for a churn model. The team must (1) make the transformations repeatable for daily runs in an automated workflow and (2) validate that the transformations are producing expected feature distributions before training.

Which TWO actions meet these requirements? (Select TWO.)

Options:

  • A. Export the Data Wrangler flow to a SageMaker Pipelines processing step

  • B. Export to a notebook and rerun cells manually each day

  • C. Add a Data Wrangler data quality/insights analysis to validate features

  • D. Rebuild the transformations as an AWS Glue Studio job instead

  • E. Run SageMaker Clarify to validate feature transformations

  • F. Write to Feature Store without saving or versioning the Data Wrangler flow

Correct answers: A and C

Explanation: SageMaker Data Wrangler supports a repeatable feature engineering workflow by saving transformations in a versionable .flow file and exporting them to operational compute (for example, a SageMaker Pipelines processing step). It also provides built-in analyses (data quality/insights) to validate the engineered features before downstream training uses them.

The core mechanism is a Data Wrangler flow: you build transformations in the UI, validate the resulting feature set, and then export the flow so it can run repeatedly outside the interactive session.

A practical pattern is:

  • Use Data Wrangler analyses (data quality/insights) to check missing values, outliers, and distribution shifts on the engineered features.
  • Export the flow to SageMaker Pipelines (Data Wrangler processing step) so the same transformations run automatically each day against new S3 data.

This keeps feature engineering consistent across runs and adds an explicit validation point before training.
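For orientation, a heavily hedged SageMaker SDK sketch loosely mirroring what the Data Wrangler “export to Pipelines” notebook generates; the role ARN, paths, instance sizing, and the exact input/output wiring are hypothetical:

```python
# Hedged sketch: run a saved .flow as a scheduled processing step in a pipeline.
import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

sess = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/sagemaker-exec"   # hypothetical role

processor = Processor(
    image_uri=sagemaker.image_uris.retrieve("data-wrangler", sess.boto_region_name),
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",
)
step = ProcessingStep(
    name="DataWranglerFlow",
    processor=processor,
    inputs=[ProcessingInput(source="s3://bucket/churn.flow",       # the saved .flow artifact
                            destination="/opt/ml/processing/flow")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://bucket/features/")],
)
pipeline = Pipeline(name="daily-churn-features", steps=[step])
```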

  • Validate in Data Wrangler (OK): the data quality/insights analyses check the engineered features before training.
  • Export to SageMaker Pipelines (OK): gives repeatable, scheduled execution of the same transformations.
  • Manual notebook reruns (NO): not an automated, repeatable workflow.
  • Clarify for transformations (NO): Clarify targets bias and explainability rather than validating feature engineering outputs.
  • Rebuild as Glue Studio (NO): bypasses the requirement to keep the workflow in Data Wrangler.
  • Unsaved, unversioned flow (NO): repeatability depends on a persisted, reusable flow artifact.

Question 9

Topic: Data Preparation for Machine Learning (ML)

A team runs an end-to-end ML workflow on AWS (ingest raw data → preprocess → train/tune → deploy → monitor). Raw daily training data lands in an S3 bucket as thousands of partition files. Delivery time varies, and the current solution uses an hourly cron on an EC2 instance to start a preprocessing job, which sometimes runs before all files arrive and often runs when no new data exists.

The upstream system can write a single completion marker file named manifest.json to the same S3 prefix after all partitions for the day are uploaded. The team needs the preprocessing to start automatically within minutes of the marker, with minimal operational overhead and fewer unnecessary runs.

Which change best improves reliability and cost without breaking these constraints?

Options:

  • A. Invoke a Lambda function for every S3 object created

  • B. Run a nightly AWS Glue crawler to detect new partitions

  • C. Trigger on manifest.json creation using EventBridge to start workflow

  • D. Use an hourly EventBridge schedule to run preprocessing

Best answer: C

Explanation: Use an event-driven trigger that fires only when the dataset is known to be complete. An EventBridge rule that matches the S3 Object Created event for manifest.json starts preprocessing only when the upstream marker arrives, avoiding early/empty runs and reducing cost. The main tradeoff is designing the preprocessing step to be idempotent because S3/EventBridge delivery is at-least-once.

The core optimization is replacing time-based polling with an event trigger that precisely represents “data is ready.” Because the producer writes manifest.json only after all partitions upload, configuring EventBridge to match S3 object creation for that single key provides a reliable, low-ops trigger and avoids starting compute when no new data exists.

A practical setup is:

  • Configure S3 to send object events to EventBridge.
  • Create an EventBridge rule that filters on the bucket, prefix, and the exact key manifest.json.
  • Target your ingestion/preprocessing workflow (for example, Step Functions or a SageMaker Pipeline).

Key takeaway: trigger on a completion marker (not each partition) to improve both correctness and cost; ensure idempotency to handle duplicate event delivery.
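A minimal boto3 sketch of such a rule; the bucket name, rule name, and Step Functions target ARNs are hypothetical, and the bucket must have S3-to-EventBridge notifications enabled:

```python
# Hedged sketch: fire only when the completion marker object is created.
import json
import boto3

events = boto3.client("events")
events.put_rule(
    Name="start-preprocess-on-manifest",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": ["raw-training-data"]},          # hypothetical bucket
            "object": {"key": [{"suffix": "manifest.json"}]},   # only the completion marker
        },
    }),
)
events.put_targets(
    Rule="start-preprocess-on-manifest",
    Targets=[{
        "Id": "preprocess",
        "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:prep",  # hypothetical
        "RoleArn": "arn:aws:iam::111122223333:role/eventbridge-invoke",    # hypothetical
    }],
)
```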

  • Time-based scheduling can still run before data completion and wastes runs on empty hours.
  • Per-object Lambda triggers can fan out into thousands of invocations and accidentally start duplicate preprocessing runs.
  • Glue crawler scheduling detects schema/partitions but is a coarse, delayed mechanism for triggering preprocessing.

Question 10

Topic: Data Preparation for Machine Learning (ML)

A team builds a daily ML pipeline in us-east-1: raw data lands in Amazon S3, an AWS Glue Spark job creates a training dataset, SageMaker trains and deploys a model, and Model Monitor runs.

The Glue job left-joins 1 billion clickstream events (key: customer_id) with a CRM export (key: customer_id). The CRM export contains multiple rows per customer due to history, but only one row has is_current=true. After the join, the training dataset grows ~6x and contains duplicate event rows, increasing Glue and training costs and degrading model quality.

Which change best improves cost and reliability while preserving correct joins (one output row per event) and keeping all events even when CRM data is missing?

Options:

  • A. Filter the CRM dataset to one row per customer_id (for example, is_current=true or a Spark window to pick the latest) before the left join

  • B. Keep the join as-is, then run distinct() on the joined dataset to remove duplicates

  • C. Increase the Glue job’s DPUs to finish faster despite the larger joined dataset

  • D. Change the left join to an inner join to avoid duplicates from unmatched keys

Best answer: A

Explanation: The join is duplicating events because the CRM side is not unique on customer_id, creating a many-to-many join and multiplying rows. Reducing the CRM dataset to a single current record per customer restores a many-to-one join, preserves the required left-join semantics, and cuts compute and training cost by avoiding the 6x data expansion.

Join correctness depends on key cardinality. If you need one output row per event, the event table should be the “many” side and the CRM table must be “one row per customer_id” for the attributes you want. When the CRM export contains multiple rows per key, a left join on only customer_id multiplies event rows, inflating storage/compute and corrupting labels/features.

In Glue/Spark, fix this by constraining the CRM table to a single record per customer_id before joining (for example, filter is_current=true or use a window to select the latest record). This restores a many-to-one join, keeps events with missing CRM values (left join), and reduces Glue/shuffle and SageMaker training time.

Removing duplicates after the join is usually more expensive and can discard legitimately different rows.
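A minimal Glue/Spark sketch of the pre-join fix; `events_df`, `crm_df`, and the `updated_at` column are hypothetical:

```python
# Hedged sketch: collapse CRM to one row per customer_id BEFORE the left join.
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Option 1: the export reliably flags the current row.
crm_one = crm_df.filter(col("is_current"))

# Option 2: pick the latest row per customer with a window function,
# useful when the is_current flag cannot be trusted.
w = Window.partitionBy("customer_id").orderBy(col("updated_at").desc())
crm_one = (crm_df.withColumn("rn", row_number().over(w))
                 .filter(col("rn") == 1)
                 .drop("rn"))

# Many-to-one left join: one output row per event, and events are kept
# even when no CRM row matches.
joined = events_df.join(crm_one, on="customer_id", how="left")
```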

  • Post-join distinct() is a costly shuffle and can drop valid rows while still hiding a bad join key.
  • Inner join changes required semantics by dropping events that lack a CRM match.
  • More DPUs reduces wall-clock time but does not fix correctness and increases cost for the same inflated dataset.

Continue with full practice

Use the AWS MLA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Free review resource

Read the AWS MLA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026