Try 10 focused AWS MLA-C01 questions on ML Model Development, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | AWS MLA-C01 |
| Topic area | ML Model Development |
| Blueprint weight | 26% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate ML Model Development for AWS MLA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 26% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: ML Model Development
A team is using Amazon SageMaker Automatic Model Tuning with the built-in XGBoost algorithm to predict ad click-through rate (binary classification). The goal is to improve validation AUC, but the tuning job is not finding better models.
Current results (typical trial): train AUC = 0.63, validation AUC = 0.62 (values are consistently close across all 30 trials).
Current tuning ranges:
- max_depth: 1–2
- eta: 0.05–0.1
- min_child_weight: 50–100
- lambda: 50–100
- subsample: 0.5–0.7

The dataset and feature engineering steps are locked for this release. Which change is most likely to address the root cause with the least change?
Options:
A. Increase lambda and min_child_weight to reduce overfitting
B. Add polynomial feature interactions in the preprocessing pipeline
C. Enable early stopping to prevent the model from over-training
D. Widen the search to allow more complex trees and weaker regularization
Best answer: D
Explanation: Train and validation AUC are both low and nearly identical across trials, which is a classic underfitting symptom (high bias). The current tuning ranges heavily restrict model capacity (very shallow trees) and strongly regularize splits, so the model cannot fit meaningful patterns. Expanding the tuning search space to allow greater complexity and less regularization directly targets the root cause without changing the dataset or feature pipeline.
The key symptom is that training and validation performance are both poor and very close to each other across many trials. That pattern points to underfitting (high bias): the model is too constrained to learn signal from the features.
In this setup, the tuner’s ranges strongly bias XGBoost toward overly simple models:
- max_depth capped at 2 limits the interactions the trees can represent.
- min_child_weight and lambda at 50–100 discourage splits and shrink leaf weights.

A minimal corrective action (while keeping data/features fixed) is to widen the hyperparameter search space to permit higher-capacity models, such as increasing the allowable max_depth and lowering min_child_weight/lambda (and optionally exploring a broader subsample range). This targets underfitting directly; early stopping is mainly helpful when validation performance degrades while training continues to improve.
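As a minimal sketch, assuming an existing SageMaker XGBoost estimator (here called `xgb_estimator`) whose training jobs emit the built-in `validation:auc` metric, a widened search space could look like the following; the exact bounds are illustrative, not prescriptive:

```python
# Hypothetical sketch: widen the search space toward higher-capacity,
# less-regularized models. Assumes xgb_estimator is defined elsewhere.
from sagemaker.tuner import (
    HyperparameterTuner,
    IntegerParameter,
    ContinuousParameter,
)

hyperparameter_ranges = {
    # Allow deeper trees so the model can represent feature interactions.
    "max_depth": IntegerParameter(3, 10),
    # Permit larger step sizes than the original 0.05-0.1 band.
    "eta": ContinuousParameter(0.05, 0.3),
    # Let the tuner explore much weaker regularization.
    "min_child_weight": ContinuousParameter(1, 20),
    "lambda": ContinuousParameter(0.1, 10),
    "subsample": ContinuousParameter(0.5, 1.0),
}

tuner = HyperparameterTuner(
    estimator=xgb_estimator,  # assumed to be defined elsewhere
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=3,
)
# tuner.fit({"train": train_input, "validation": validation_input})
```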
Topic: ML Model Development
A team wants to detect fraudulent transactions. The historical dataset contains millions of transactions but has no reliable fraud labels (chargebacks are delayed and often disputed), so supervised labels cannot be trusted. Which modeling approach is most feasible given these data constraints?
Options:
A. Train a supervised binary classifier using the disputed chargeback field as the ground-truth label
B. Train an image segmentation model to localize fraud signals within each transaction
C. Train an unsupervised anomaly detection model (for example, SageMaker Random Cut Forest) on mostly normal transactions
D. Use a reinforcement learning agent to choose which transactions to approve or decline
Best answer: C
Explanation: With no reliable ground-truth labels, a standard supervised classification approach is not feasible because the model would learn from incorrect targets. Unsupervised anomaly detection is designed for this situation by modeling normal patterns and identifying outliers that may indicate fraud.
Feasibility depends heavily on whether you have labels that are timely, accurate, and representative of the outcome you want to predict. When labels are missing or unreliable (for example, delayed/disputed chargebacks), supervised learning is likely to produce a model that optimizes the wrong objective and performs poorly in production.
In this situation, a feasible approach is to use unsupervised (or semi-supervised) anomaly detection:
- Train on the large volume of mostly normal transactions so the model learns what typical behavior looks like.
- Score new transactions by how anomalous they are, and route high-scoring outliers for investigation as potential fraud.
- If a small set of trusted labels later becomes available, use it to validate alert thresholds or move toward semi-supervised approaches.
The key takeaway is that label quality can be the deciding constraint for whether a supervised model is viable.
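To illustrate the train-on-mostly-normal, flag-outliers pattern locally (not the SageMaker Random Cut Forest API itself), here is a small sketch using scikit-learn's IsolationForest on synthetic data:

```python
# Sketch of unsupervised anomaly detection on mostly-normal data.
# IsolationForest is a local stand-in for the same idea that
# SageMaker Random Cut Forest implements as a managed algorithm.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly "normal" transactions plus a few injected outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(10_000, 5))
outliers = rng.normal(loc=6.0, scale=1.0, size=(20, 5))
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = model.score_samples(X)  # lower scores = more anomalous

# Route the most anomalous transactions for manual review.
review_idx = np.argsort(scores)[:20]
print("flagged rows:", review_idx)
```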
Topic: ML Model Development
A team trains a churn model in Amazon SageMaker and sees the following metrics:
Training AUC: 0.99
Validation AUC: 0.73
Training logloss: 0.05
Validation logloss: 0.62
The team has a separate holdout test set that must be used only once for final evaluation. This sprint’s compute budget allows at most 50 training jobs on ml.m5.xlarge.
Which TWO actions should the team AVOID as next steps? (Select TWO.)
Options:
A. Collect more labeled data or add data augmentation
B. Use k-fold cross-validation on training data only
C. Tune and early-stop using the holdout test set
D. Enable early stopping based on validation performance
E. Run 2,000 tuning jobs on ml.p5.48xlarge
F. Increase regularization and simplify the model
Correct answers: C and E
Explanation: The large gap between training and validation performance indicates overfitting. Appropriate next steps focus on improving generalization (for example, regularization, simplifying the model, and validation-based early stopping) while preserving an untouched test set for a final, unbiased check. Actions that leak the test set into iteration or exceed the stated compute budget should be avoided.
Comparing training vs. validation metrics is a primary way to diagnose fit: very strong training results with much worse validation results indicate overfitting (poor generalization). To address overfitting, you typically constrain the model or make it less able to memorize noise, and you ensure evaluation hygiene by keeping the test set untouched until the end.
Practical corrective actions include:
- Increasing regularization or otherwise simplifying the model so it memorizes less noise.
- Enabling early stopping based on validation performance.
- Collecting more labeled data or applying data augmentation.
- Using k-fold cross-validation on the training data only to get more stable estimates.
Using the holdout test set during tuning creates optimistic estimates, and an oversized tuning run that exceeds the explicit job/instance budget violates the stated constraint.
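A minimal local sketch of two of these levers, using the open-source xgboost package on synthetic data (the parameter values are illustrative):

```python
# Sketch: stronger regularization plus validation-based early stopping.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 4,  # simpler trees than an overfit baseline
    "lambda": 5.0,   # stronger L2 regularization
    "eta": 0.1,
}

# Stop when validation AUC has not improved for 20 rounds;
# the untouched test set is never used during this loop.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=20,
    verbose_eval=False,
)
print("best iteration:", booster.best_iteration)
```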
Topic: ML Model Development
A team trained a binary classification model in Amazon SageMaker to flag fraudulent transactions. Fraud occurs in less than 1% of transactions, and the team wants a metric that balances precision and recall into a single value when selecting the best model.
Which evaluation metric should the team use?
Options:
A. RMSE
B. R-squared
C. Mean absolute percentage error (MAPE)
D. F1 score
Best answer: D
Explanation: Because the task is imbalanced binary classification, accuracy alone can be misleading and the team needs to account for both false positives and false negatives. F1 score is the standard single-number metric that combines precision and recall, making it appropriate for comparing classifiers when both error types matter.
Evaluation metrics should match the problem type and what errors matter. For imbalanced binary classification (like fraud detection), a model can achieve high apparent performance by predicting the majority class, so metrics that incorporate false positives and false negatives are preferred. Precision measures how many predicted frauds are truly fraud, and recall measures how many true frauds are found. F1 score combines these into one value via the harmonic mean, so it is commonly used to select a model when you need a single metric that balances precision and recall. Regression metrics such as RMSE, R-squared, and MAPE apply to continuous-valued targets, not classification labels.
Key takeaway: use classification metrics (precision/recall/F1/AUC) for classification and regression metrics (RMSE/MAE/\(R^2\)) for regression.
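A quick scikit-learn sketch showing how precision, recall, and F1 relate on a toy imbalanced label vector:

```python
# Sketch: F1 as the harmonic mean of precision and recall.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # rare positive class
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]  # one TP, one FP, one FN

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 0.5
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 0.5
f1 = f1_score(y_true, y_pred)        # 2pr / (p + r) = 0.5
print(p, r, f1)
```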
Topic: ML Model Development
A team is building a daily demand forecasting model in Amazon SageMaker to predict the next 30 days for each product (forecast horizon = 30 days). They need an offline evaluation approach that reflects production behavior and avoids temporal leakage.
Which approach is NOT appropriate?
Options:
A. Randomly shuffle days before splitting into train and test
B. Create lag features using only values available at prediction time
C. Use rolling-origin backtests with a fixed 30-day horizon
D. Hold out the most recent 60 days as the test set
Best answer: A
Explanation: Time series evaluation must preserve temporal order so the model is trained only on the past and tested on the future for the required forecast horizon. Randomly shuffling observations mixes future and past data, creating leakage and invalidating metrics. A proper split/backtest should mirror the 30-day-ahead production forecasting workflow.
For time series forecasting, the core principle is to respect time: training data must come strictly before evaluation data, and the evaluation setup must match the forecast horizon used in production (here, 30 days ahead). Acceptable approaches include chronological holdout splits (latest period held out) and rolling-origin backtesting (multiple train/test folds that each forecast the next 30 days).
Randomly shuffling timestamps before splitting is an anti-pattern because it mixes future observations into the training set relative to test points. This creates temporal leakage and yields unrealistically good error metrics that will not reproduce when the model is deployed to predict unseen future days.
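A small pandas sketch of rolling-origin backtesting with a 30-day horizon; the data and cutoff dates are synthetic, and the model fit/score step is left as a comment:

```python
# Sketch: rolling-origin backtest with a fixed 30-day horizon.
# Each fold trains strictly on the past and scores the next 30 days.
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=365, freq="D")
df = pd.DataFrame(
    {"demand": np.random.default_rng(0).poisson(20, 365)}, index=dates
)

horizon = 30
for cutoff in pd.date_range("2024-09-01", "2024-11-01", freq="MS"):
    train = df.loc[: cutoff - pd.Timedelta(days=1)]  # past only
    test = df.loc[cutoff : cutoff + pd.Timedelta(days=horizon - 1)]
    # fit a model on `train`, forecast `horizon` days, score against `test`
    print(cutoff.date(), len(train), "train days ->", len(test), "test days")
```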
Topic: ML Model Development
A company is building a daily demand forecast for 2,000 products using Amazon SageMaker. The business needs a 14-day-ahead forecast (\(t+14\)) and will retrain monthly.
Two constraints apply: promotional plans are known only 7 days in advance, and preprocessing must not use information that would be unavailable at prediction time.
Which TWO approaches should the ML engineer use to avoid data leakage and to evaluate the 14-day horizon correctly? (Select TWO.)
Options:
A. Fit all scalers/imputers on the full dataset before splitting
B. Add next-14-day promo flags even though only 7 days are known
C. Use rolling-origin backtesting with 14-day-ahead forecasts per fold
D. Shift labels by 14 days and use chronological train/val/test
E. Tune on the last 14 days and also report them as test
F. Randomly shuffle rows and run standard k-fold cross-validation
Correct answers: C and D
Explanation: Time series forecasting requires respecting time order and matching evaluation to the forecast horizon. Creating labels for a 14-day ahead target and using chronological splits avoids training on information from the future. Rolling-origin backtesting further validates performance across multiple historical cutoffs while scoring true 14-day-ahead predictions.
For time series, the core requirement is to ensure that training features and preprocessing are computed using only information available up to the prediction time, and that validation/testing occur strictly after the training period. For a 14-day horizon, labels must be aligned to that horizon (predicting \(y_{t+14}\) from data at or before \(t\)), and evaluation should score forecasts made 14 days ahead rather than one-step-ahead proxies.
Practical high-level approaches:
- Shift labels by 14 days so each row predicts \(y_{t+14}\), then split train/validation/test chronologically.
- Use rolling-origin backtesting, where each fold trains on data up to a cutoff and scores true 14-day-ahead forecasts.
- Fit scalers/imputers on the training window only, and include only features (such as promo flags) that are actually known at prediction time.
The key takeaway is that randomization, reuse of the same period for tuning and final testing, and features not available at inference all create optimistic, invalid results.
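A pandas sketch of the label-shifting idea for a single product; the column names and data are illustrative:

```python
# Sketch: align labels to the 14-day horizon, then split chronologically.
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=200, freq="D")
df = pd.DataFrame(
    {"demand": np.random.default_rng(1).poisson(50, 200)}, index=dates
)

# Target: demand 14 days in the future relative to each row's features.
df["y_t_plus_14"] = df["demand"].shift(-14)
df = df.dropna()  # the last 14 rows have no future label yet

# Chronological split: train on the earliest 80%, validate on the rest.
split = int(len(df) * 0.8)
train, val = df.iloc[:split], df.iloc[split:]
print(train.index.max(), "<", val.index.min())  # train ends before val begins
```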
Topic: ML Model Development
A team uses an Amazon SageMaker Pipelines evaluation step and sees 0.99 F1 offline but 0.62 F1 during a canary deployment. The dataset includes PII and time-ordered events. Requirements: the holdout test set must be reproducible for audits, preprocessing must not learn from validation/test data, and data must remain private (no public access) and encrypted with KMS.
Which TWO actions should you AVOID when fixing the evaluation pipeline?
Options:
A. Validate metric code uses correct labels/thresholding and log confusion matrix
B. Enable lineage tracking of dataset versions, split seed, and metric definition
C. Upload the evaluation dataset (with PII) to a public S3 bucket for review
D. Persist a deterministic split file in S3 with versioning and reuse it
E. Switch to a time-based split and prevent future-derived features
F. Fit scalers/imputers on the full dataset before splitting
Correct answers: C and F
Explanation: To troubleshoot evaluation issues, fixes must preserve an unbiased holdout set and comply with stated security controls. Any approach that lets preprocessing learn from validation/test data can cause data leakage and unrealistically high offline scores. Any approach that exposes PII outside required private, encrypted boundaries violates the explicit security requirements.
The core goal is to make offline evaluation trustworthy and repeatable while meeting security constraints. In SageMaker Pipelines, leakage often happens when a preprocessing transformer (imputer/scaler/encoder) is fit using rows that later end up in validation/test, or when feature engineering accidentally uses future/label-derived information. Auditability also requires that the exact same holdout set can be reconstructed (or reused) across pipeline runs.
Acceptable fixes typically include:
- Validating that the metric code uses the correct labels and thresholding, and logging a confusion matrix.
- Switching to a time-based split and preventing features derived from future data.
- Fitting scalers/imputers on the training split only.
- Persisting a deterministic split file in versioned, private, KMS-encrypted S3 and reusing it, with lineage tracking of dataset versions, split seed, and metric definitions.
The key takeaway is to prevent information from crossing from holdout data into training/preprocessing and to keep PII protected per requirements.
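A scikit-learn sketch of the fit-on-train-only rule: the imputer and scaler learn their statistics from the training split and are only applied to the holdout, and a fixed seed keeps the split reproducible:

```python
# Sketch: preprocessing fit on the training split only, never on holdout.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
# Fixed random_state makes the split deterministic and auditable.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)           # imputer/scaler statistics come from train only
print(pipe.score(X_te, y_te))  # holdout is transformed, never fit
```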
Topic: ML Model Development
A team is starting a new binary classification project in Amazon SageMaker and wants rapid iteration. Which approach best follows the recommended baseline model strategy before increasing model complexity?
Options:
A. Define success using training accuracy only, then add features
B. Train a simple baseline and set a target metric first
C. Skip baseline models and optimize feature engineering first
D. Start with a deep neural network and tune for best AUC
Best answer: B
Explanation: Start by training a simple, easy-to-reproduce baseline model and explicitly define what “good” looks like using an agreed evaluation metric (and target). This establishes a reference point for later experiments so added complexity can be justified by measurable improvement and faster iteration.
The baseline model strategy is to begin with the simplest model that can be trained and evaluated quickly, then define success criteria using an appropriate evaluation metric on held-out/validation data. This provides a stable reference for comparing subsequent approaches (more features, tuning, or more complex algorithms) and prevents premature complexity that is hard to debug and may not improve real-world performance. After a baseline and target are established, you can iterate by improving data quality, adding features, and performing hyperparameter tuning while measuring gains relative to the baseline. The key takeaway is that success criteria and a baseline must come before increasing complexity.
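A sketch of the baseline-first workflow on synthetic data: a trivial majority-class baseline and a simple linear model, both scored against a target metric agreed in advance:

```python
# Sketch: establish a baseline and a success metric before adding complexity.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

TARGET_AUC = 0.80  # agreed success criterion, defined before experimenting

for name, model in [
    ("majority-class", DummyClassifier(strategy="prior")),
    ("logistic", LogisticRegression(max_iter=1000)),
]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: AUC={auc:.3f} (target {TARGET_AUC})")
```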
Topic: ML Model Development
A company is fine-tuning a Hugging Face LLM in Amazon SageMaker to improve performance on internal IT support tickets. The model must retain strong general instruction-following and open-domain Q&A ability, and the training data must not contain customer PII. The team has (1) a labeled ticket dataset and (2) a de-identified general instruction dataset.
Which TWO actions should the ML engineer AVOID to reduce the risk of catastrophic forgetting and meet the requirements? (Select TWO.)
Options:
A. Use low learning rate and early stopping on dual eval sets
B. Full fine-tune all weights only on ticket data
C. Freeze most layers and train LoRA/adapters
D. Add raw production chats with names/emails to training mix
E. Interleave ticket and de-identified general data during training
F. Use instruction templates consistent with the base model format
Correct answers: B and D
Explanation: To prevent catastrophic forgetting, avoid fine-tuning strategies that aggressively overwrite the base model using only narrow, domain-specific data. Also, meeting the explicit data constraint is mandatory: adding any PII-containing logs to broaden coverage is prohibited, regardless of potential training benefits.
Catastrophic forgetting happens when fine-tuning shifts a model’s parameters to fit a narrow target distribution, degrading previously learned general behaviors. High-risk patterns include full-weight fine-tuning on only the new domain data (especially with aggressive hyperparameters), because the model has no “anchor” to preserve general instruction-following.
Safer high-level strategies are to:
- Freeze most base layers and train LoRA/adapter weights so the pretrained parameters stay largely intact.
- Interleave the labeled ticket data with the de-identified general instruction data so the model keeps rehearsing general behavior.
- Use a low learning rate with early stopping on dual evaluation sets (one domain-specific, one general), and keep instruction templates consistent with the base model's format.
Separately, data governance constraints override modeling convenience: prohibited data (PII) must not be used even if it could reduce forgetting.
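A hedged sketch of the freeze-plus-adapters option using the Hugging Face peft library; the model name and target_modules are placeholders that depend on the actual architecture:

```python
# Sketch: freeze the base model and train only small LoRA adapter matrices.
# The model name and target_modules below are illustrative placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("your-org/base-llm")  # placeholder

lora = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a tiny fraction of weights train
```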
Topic: ML Model Development
A company fine-tunes a pretrained text generation model in Amazon SageMaker on 200,000 internal customer-support tickets stored in Amazon S3. After fine-tuning, the model performs well on support tasks but noticeably degrades on general writing and instruction-following tasks that it previously handled well.
Which THREE actions are most likely to reduce catastrophic forgetting during fine-tuning? (Select THREE.)
Options:
A. Use parameter-efficient fine-tuning (for example, LoRA/adapters) or freeze most base layers
B. Fine-tune only on the support-ticket dataset to avoid “diluting” the domain signal
C. Reduce the learning rate and use early stopping based on a general-task evaluation set
D. Increase the learning rate to speed adaptation to the support-ticket domain
E. Reinitialize several top transformer blocks before fine-tuning to encourage domain specialization
F. Mix a representative set of general-domain examples into the fine-tuning dataset
Correct answers: A, C and F
Explanation: Catastrophic forgetting happens when fine-tuning updates overwrite capabilities learned during pretraining. The most effective high-level mitigations are to rehearse general behavior with mixed data, constrain how much the base weights can change (for example, PEFT/freezing), and use conservative optimization with evaluation-driven stopping to prevent over-adaptation.
To reduce catastrophic forgetting, the goal is to adapt to the new support-ticket distribution without letting gradient updates erase broadly useful behavior from pretraining. In practice this is done by (1) keeping some training signal that represents the original/general tasks, and (2) limiting or carefully controlling weight updates so the model does not “move too far” from the pretrained solution.
Common, high-level strategies include:
- Parameter-efficient fine-tuning (for example, LoRA/adapters) or freezing most base layers so fewer weights can drift from the pretrained solution.
- Mixing a representative set of general-domain examples into the fine-tuning dataset as a form of rehearsal.
- Reducing the learning rate and using early stopping driven by a general-task evaluation set.
By contrast, aggressive optimization or training only on the new domain typically increases forgetting.
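A sketch of the data-mixing (rehearsal) idea with the Hugging Face datasets library; ticket_ds and general_ds are assumed, already-prepared dataset objects, and the 70/30 mix is illustrative:

```python
# Sketch: rehearse general behavior by interleaving domain and general data.
# `ticket_ds` and `general_ds` are assumed, pre-tokenized Dataset objects.
from datasets import interleave_datasets

mixed = interleave_datasets(
    [ticket_ds, general_ds],
    probabilities=[0.7, 0.3],  # mostly domain data, with general rehearsal
    seed=0,
)
# Train on `mixed` with a conservative learning rate (e.g., 1e-5 rather
# than 1e-4) and early stopping keyed to a general-task evaluation set.
```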
Use the AWS MLA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try AWS MLA-C01 on Web
View AWS MLA-C01 Practice Test
Read the AWS MLA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.