Try 10 focused AWS MLA-C01 questions on ML Model Development, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | AWS MLA-C01 |
| Topic area | ML Model Development |
| Blueprint weight | 26% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate ML Model Development for AWS MLA-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 26% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: ML Model Development
A team is using Amazon SageMaker Automatic Model Tuning with the built-in XGBoost algorithm to predict ad click-through rate (binary classification). The goal is to improve validation AUC, but the tuning job is not finding better models.
Current results (typical trial): train AUC = 0.63, validation AUC = 0.62 (values are consistently close across all 30 trials).
Current tuning ranges:
- max_depth: 1–2
- eta: 0.05–0.1
- min_child_weight: 50–100
- lambda: 50–100
- subsample: 0.5–0.7

The dataset and feature engineering steps are locked for this release. Which change is most likely to address the root cause with the least change?
Options:
A. Increase lambda and min_child_weight to reduce overfitting
B. Add polynomial feature interactions in the preprocessing pipeline
C. Enable early stopping to prevent the model from over-training
D. Widen the search to allow more complex trees and weaker regularization
Best answer: D
Explanation: Train and validation AUC are both low and nearly identical across trials, which is a classic underfitting symptom (high bias). The current tuning ranges heavily restrict model capacity (very shallow trees) and strongly regularize splits, so the model cannot fit meaningful patterns. Expanding the tuning search space to allow greater complexity and less regularization directly targets the root cause without changing the dataset or feature pipeline.
The key symptom is that training and validation performance are both poor and very close to each other across many trials. That pattern points to underfitting (high bias): the model is too constrained to learn signal from the features.
In this setup, the tuner’s ranges strongly bias XGBoost toward overly simple models:
- max_depth capped at 2 limits the interactions the trees can represent.
- min_child_weight and lambda at 50–100 discourage splits and shrink leaf weights.

A minimal corrective action (while keeping data/features fixed) is to widen the hyperparameter search space to permit higher-capacity models, such as increasing the allowable max_depth and lowering min_child_weight/lambda (and optionally exploring a broader subsample range). This targets underfitting directly; early stopping is mainly helpful when validation performance degrades while training continues to improve.
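As a minimal sketch, assuming an existing SageMaker XGBoost estimator (here called `xgb_estimator`) whose training jobs emit the built-in `validation:auc` metric, a widened search space could look like the following; the exact bounds are illustrative, not prescriptive:

```python
# Hypothetical sketch: widen the search space toward higher-capacity,
# less-regularized models. Assumes xgb_estimator is defined elsewhere.
from sagemaker.tuner import (
    HyperparameterTuner,
    IntegerParameter,
    ContinuousParameter,
)

hyperparameter_ranges = {
    # Allow deeper trees so the model can represent feature interactions.
    "max_depth": IntegerParameter(3, 10),
    # Permit larger step sizes than the original 0.05-0.1 band.
    "eta": ContinuousParameter(0.05, 0.3),
    # Let the tuner explore much weaker regularization.
    "min_child_weight": ContinuousParameter(1, 20),
    "lambda": ContinuousParameter(0.1, 10),
    "subsample": ContinuousParameter(0.5, 1.0),
}

tuner = HyperparameterTuner(
    estimator=xgb_estimator,  # assumed to be defined elsewhere
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=3,
)
# tuner.fit({"train": train_input, "validation": validation_input})
```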
Topic: ML Model Development
A team wants to detect fraudulent transactions. The historical dataset contains millions of transactions but has no reliable fraud labels (chargebacks are delayed and often disputed), so supervised labels cannot be trusted. Which modeling approach is most feasible given these data constraints?
Options:
A. Train a supervised binary classifier using the disputed chargeback field as the ground-truth label
B. Train an image segmentation model to localize fraud signals within each transaction
C. Train an unsupervised anomaly detection model (for example, SageMaker Random Cut Forest) on mostly normal transactions
D. Use a reinforcement learning agent to choose which transactions to approve or decline
Best answer: C
Explanation: With no reliable ground-truth labels, a standard supervised classification approach is not feasible because the model would learn from incorrect targets. Unsupervised anomaly detection is designed for this situation by modeling normal patterns and identifying outliers that may indicate fraud.
Feasibility depends heavily on whether you have labels that are timely, accurate, and representative of the outcome you want to predict. When labels are missing or unreliable (for example, delayed/disputed chargebacks), supervised learning is likely to produce a model that optimizes the wrong objective and performs poorly in production.
In this situation, a feasible approach is to use unsupervised (or semi-supervised) anomaly detection:
- Train on the large volume of mostly normal transactions so the model learns what typical behavior looks like.
- Score new transactions by how anomalous they are, and route high-scoring outliers for investigation as potential fraud.
- If a small set of trusted labels later becomes available, use it to validate alert thresholds or move toward semi-supervised approaches.
The key takeaway is that label quality can be the deciding constraint for whether a supervised model is viable.
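To illustrate the train-on-mostly-normal, flag-outliers pattern locally (not the SageMaker Random Cut Forest API itself), here is a small sketch using scikit-learn's IsolationForest on synthetic data:

```python
# Sketch of unsupervised anomaly detection on mostly-normal data.
# IsolationForest is a local stand-in for the same idea that
# SageMaker Random Cut Forest implements as a managed algorithm.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly "normal" transactions plus a few injected outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(10_000, 5))
outliers = rng.normal(loc=6.0, scale=1.0, size=(20, 5))
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
scores = model.score_samples(X)  # lower scores = more anomalous

# Route the most anomalous transactions for manual review.
review_idx = np.argsort(scores)[:20]
print("flagged rows:", review_idx)
```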
Topic: ML Model Development
A team trains a churn model in Amazon SageMaker and sees the following metrics:
Training AUC: 0.99
Validation AUC: 0.73
Training logloss: 0.05
Validation logloss: 0.62
The team has a separate holdout test set that must be used only once for final evaluation. This sprint’s compute budget allows at most 50 training jobs on ml.m5.xlarge.
Which TWO actions should the team AVOID as next steps? (Select TWO.)
Options:
A. Collect more labeled data or add data augmentation
B. Use k-fold cross-validation on training data only
C. Tune and early-stop using the holdout test set
D. Enable early stopping based on validation performance
E. Run 2,000 tuning jobs on ml.p5.48xlarge
F. Increase regularization and simplify the model
Correct answers: C and E
Explanation: The large gap between training and validation performance indicates overfitting. Appropriate next steps focus on improving generalization (for example, regularization, simplifying the model, and validation-based early stopping) while preserving an untouched test set for a final, unbiased check. Actions that leak the test set into iteration or exceed the stated compute budget should be avoided.
Comparing training vs. validation metrics is a primary way to diagnose fit: very strong training results with much worse validation results indicate overfitting (poor generalization). To address overfitting, you typically constrain the model or make it less able to memorize noise, and you ensure evaluation hygiene by keeping the test set untouched until the end.
Practical corrective actions include:
- Increasing regularization or otherwise simplifying the model so it memorizes less noise.
- Enabling early stopping based on validation performance.
- Collecting more labeled data or applying data augmentation.
- Using k-fold cross-validation on the training data only to get more stable estimates.
Using the holdout test set during tuning creates optimistic estimates, and an oversized tuning run that exceeds the explicit job/instance budget violates the stated constraint.
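A minimal local sketch of two of these levers, using the open-source xgboost package on synthetic data (the parameter values are illustrative):

```python
# Sketch: stronger regularization plus validation-based early stopping.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 4,  # simpler trees than an overfit baseline
    "lambda": 5.0,   # stronger L2 regularization
    "eta": 0.1,
}

# Stop when validation AUC has not improved for 20 rounds;
# the untouched test set is never used during this loop.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=20,
    verbose_eval=False,
)
print("best iteration:", booster.best_iteration)
```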
Topic: ML Model Development
A team trained a binary classification model in Amazon SageMaker to flag fraudulent transactions. Fraud occurs in less than 1% of transactions, and the team wants a metric that balances precision and recall into a single value when selecting the best model.
Which evaluation metric should the team use?
Options:
A. RMSE
B. R-squared
C. Mean absolute percentage error (MAPE)
D. F1 score
Best answer: D
Explanation: Because the task is imbalanced binary classification, accuracy alone can be misleading and the team needs to account for both false positives and false negatives. F1 score is the standard single-number metric that combines precision and recall, making it appropriate for comparing classifiers when both error types matter.
Evaluation metrics should match the problem type and what errors matter. For imbalanced binary classification (like fraud detection), a model can achieve high apparent performance by predicting the majority class, so metrics that incorporate false positives and false negatives are preferred. Precision measures how many predicted frauds are truly fraud, and recall measures how many true frauds are found. F1 score combines these into one value via the harmonic mean, so it is commonly used to select a model when you need a single metric that balances precision and recall. Regression metrics such as RMSE, R-squared, and MAPE apply to continuous-valued targets, not classification labels.
Key takeaway: use classification metrics (precision/recall/F1/AUC) for classification and regression metrics (RMSE/MAE/\(R^2\)) for regression.
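A quick scikit-learn sketch showing how precision, recall, and F1 relate on a toy imbalanced label vector:

```python
# Sketch: F1 as the harmonic mean of precision and recall.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # rare positive class
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]  # one TP, one FP, one FN

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 0.5
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 0.5
f1 = f1_score(y_true, y_pred)        # 2pr / (p + r) = 0.5
print(p, r, f1)
```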
Topic: ML Model Development
A team is building a daily demand forecasting model in Amazon SageMaker to predict the next 30 days for each product (forecast horizon = 30 days). They need an offline evaluation approach that reflects production behavior and avoids temporal leakage.
Which approach is NOT appropriate?
Options:
A. Randomly shuffle days before splitting into train and test
B. Create lag features using only values available at prediction time
C. Use rolling-origin backtests with a fixed 30-day horizon
D. Hold out the most recent 60 days as the test set
Best answer: A
Explanation: Time series evaluation must preserve temporal order so the model is trained only on the past and tested on the future for the required forecast horizon. Randomly shuffling observations mixes future and past data, creating leakage and invalidating metrics. A proper split/backtest should mirror the 30-day-ahead production forecasting workflow.
For time series forecasting, the core principle is to respect time: training data must come strictly before evaluation data, and the evaluation setup must match the forecast horizon used in production (here, 30 days ahead). Acceptable approaches include chronological holdout splits (latest period held out) and rolling-origin backtesting (multiple train/test folds that each forecast the next 30 days).
Randomly shuffling timestamps before splitting is an anti-pattern because it mixes future observations into the training set relative to test points. This creates temporal leakage and yields unrealistically good error metrics that will not reproduce when the model is deployed to predict unseen future days.
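A small pandas sketch of rolling-origin backtesting with a 30-day horizon; the data and cutoff dates are synthetic, and the model fit/score step is left as a comment:

```python
# Sketch: rolling-origin backtest with a fixed 30-day horizon.
# Each fold trains strictly on the past and scores the next 30 days.
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=365, freq="D")
df = pd.DataFrame(
    {"demand": np.random.default_rng(0).poisson(20, 365)}, index=dates
)

horizon = 30
for cutoff in pd.date_range("2024-09-01", "2024-11-01", freq="MS"):
    train = df.loc[: cutoff - pd.Timedelta(days=1)]  # past only
    test = df.loc[cutoff : cutoff + pd.Timedelta(days=horizon - 1)]
    # fit a model on `train`, forecast `horizon` days, score against `test`
    print(cutoff.date(), len(train), "train days ->", len(test), "test days")
```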
Topic: ML Model Development
A company is building a daily demand forecast for 2,000 products using Amazon SageMaker. The business needs a 14-day-ahead forecast (\(t+14\)) and will retrain monthly.
Two constraints apply: promotional plans are known only 7 days in advance, and preprocessing must not use information that would be unavailable at prediction time.
Which TWO approaches should the ML engineer use to avoid data leakage and to evaluate the 14-day horizon correctly? (Select TWO.)
Options:
A. Fit all scalers/imputers on the full dataset before splitting
B. Add next-14-day promo flags even though only 7 days are known
C. Use rolling-origin backtesting with 14-day-ahead forecasts per fold
D. Shift labels by 14 days and use chronological train/val/test
E. Tune on the last 14 days and also report them as test
F. Randomly shuffle rows and run standard k-fold cross-validation
Correct answers: C and D
Explanation: Time series forecasting requires respecting time order and matching evaluation to the forecast horizon. Creating labels for a 14-day ahead target and using chronological splits avoids training on information from the future. Rolling-origin backtesting further validates performance across multiple historical cutoffs while scoring true 14-day-ahead predictions.
For time series, the core requirement is to ensure that training features and preprocessing are computed using only information available up to the prediction time, and that validation/testing occur strictly after the training period. For a 14-day horizon, labels must be aligned to that horizon (predicting \(y_{t+14}\) from data at or before \(t\)), and evaluation should score forecasts made 14 days ahead rather than one-step-ahead proxies.
Practical high-level approaches:
- Shift labels by 14 days so each row predicts \(y_{t+14}\), then split train/validation/test chronologically.
- Use rolling-origin backtesting, where each fold trains on data up to a cutoff and scores true 14-day-ahead forecasts.
- Fit scalers/imputers on the training window only, and include only features (such as promo flags) that are actually known at prediction time.
The key takeaway is that randomization, reuse of the same period for tuning and final testing, and features not available at inference all create optimistic, invalid results.
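A pandas sketch of the label-shifting idea for a single product; the column names and data are illustrative:

```python
# Sketch: align labels to the 14-day horizon, then split chronologically.
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=200, freq="D")
df = pd.DataFrame(
    {"demand": np.random.default_rng(1).poisson(50, 200)}, index=dates
)

# Target: demand 14 days in the future relative to each row's features.
df["y_t_plus_14"] = df["demand"].shift(-14)
df = df.dropna()  # the last 14 rows have no future label yet

# Chronological split: train on the earliest 80%, validate on the rest.
split = int(len(df) * 0.8)
train, val = df.iloc[:split], df.iloc[split:]
print(train.index.max(), "<", val.index.min())  # train ends before val begins
```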
Topic: ML Model Development
A team uses an Amazon SageMaker Pipelines evaluation step and sees 0.99 F1 offline but 0.62 F1 during a canary deployment. The dataset includes PII and time-ordered events. Requirements: the holdout test set must be reproducible for audits, preprocessing must not learn from validation/test data, and data must remain private (no public access) and encrypted with KMS.
Which TWO actions should you AVOID when fixing the evaluation pipeline?
Options:
A. Validate metric code uses correct labels/thresholding and log confusion matrix
B. Enable lineage tracking of dataset versions, split seed, and metric definition
C. Upload the evaluation dataset (with PII) to a public S3 bucket for review
D. Persist a deterministic split file in S3 with versioning and reuse it
E. Switch to a time-based split and prevent future-derived features
F. Fit scalers/imputers on the full dataset before splitting
Correct answers: C and F
Explanation: To troubleshoot evaluation issues, fixes must preserve an unbiased holdout set and comply with stated security controls. Any approach that lets preprocessing learn from validation/test data can cause data leakage and unrealistically high offline scores. Any approach that exposes PII outside required private, encrypted boundaries violates the explicit security requirements.
The core goal is to make offline evaluation trustworthy and repeatable while meeting security constraints. In SageMaker Pipelines, leakage often happens when a preprocessing transformer (imputer/scaler/encoder) is fit using rows that later end up in validation/test, or when feature engineering accidentally uses future/label-derived information. Auditability also requires that the exact same holdout set can be reconstructed (or reused) across pipeline runs.
Acceptable fixes typically include:
- Validating that the metric code uses the correct labels and thresholding, and logging a confusion matrix.
- Switching to a time-based split and preventing features derived from future data.
- Fitting scalers/imputers on the training split only.
- Persisting a deterministic split file in versioned, private, KMS-encrypted S3 and reusing it, with lineage tracking of dataset versions, split seed, and metric definitions.
The key takeaway is to prevent information from crossing from holdout data into training/preprocessing and to keep PII protected per requirements.
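A scikit-learn sketch of the fit-on-train-only rule: the imputer and scaler learn their statistics from the training split and are only applied to the holdout, and a fixed seed keeps the split reproducible:

```python
# Sketch: preprocessing fit on the training split only, never on holdout.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
# Fixed random_state makes the split deterministic and auditable.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)           # imputer/scaler statistics come from train only
print(pipe.score(X_te, y_te))  # holdout is transformed, never fit
```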
Topic: ML Model Development
A team is starting a new binary classification project in Amazon SageMaker and wants rapid iteration. Which approach best follows the recommended baseline model strategy before increasing model complexity?
Options:
A. Define success using training accuracy only, then add features
B. Train a simple baseline and set a target metric first
C. Skip baseline models and optimize feature engineering first
D. Start with a deep neural network and tune for best AUC
Best answer: B
Explanation: Start by training a simple, easy-to-reproduce baseline model and explicitly define what “good” looks like using an agreed evaluation metric (and target). This establishes a reference point for later experiments so added complexity can be justified by measurable improvement and faster iteration.
The baseline model strategy is to begin with the simplest model that can be trained and evaluated quickly, then define success criteria using an appropriate evaluation metric on held-out/validation data. This provides a stable reference for comparing subsequent approaches (more features, tuning, or more complex algorithms) and prevents premature complexity that is hard to debug and may not improve real-world performance. After a baseline and target are established, you can iterate by improving data quality, adding features, and performing hyperparameter tuning while measuring gains relative to the baseline. The key takeaway is that success criteria and a baseline must come before increasing complexity.
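A sketch of the baseline-first workflow on synthetic data: a trivial majority-class baseline and a simple linear model, both scored against a target metric agreed in advance:

```python
# Sketch: establish a baseline and a success metric before adding complexity.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

TARGET_AUC = 0.80  # agreed success criterion, defined before experimenting

for name, model in [
    ("majority-class", DummyClassifier(strategy="prior")),
    ("logistic", LogisticRegression(max_iter=1000)),
]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: AUC={auc:.3f} (target {TARGET_AUC})")
```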
Topic: ML Model Development
A company is fine-tuning a Hugging Face LLM in Amazon SageMaker to improve performance on internal IT support tickets. The model must retain strong general instruction-following and open-domain Q&A ability, and the training data must not contain customer PII. The team has (1) a labeled ticket dataset and (2) a de-identified general instruction dataset.
Which TWO actions should the ML engineer AVOID to reduce the risk of catastrophic forgetting and meet the requirements? (Select TWO.)
Options:
A. Use low learning rate and early stopping on dual eval sets
B. Full fine-tune all weights only on ticket data
C. Freeze most layers and train LoRA/adapters
D. Add raw production chats with names/emails to training mix
E. Interleave ticket and de-identified general data during training
F. Use instruction templates consistent with the base model format
Correct answers: B and D
Explanation: To prevent catastrophic forgetting, avoid fine-tuning strategies that aggressively overwrite the base model using only narrow, domain-specific data. Also, meeting the explicit data constraint is mandatory: adding any PII-containing logs to broaden coverage is prohibited, regardless of potential training benefits.
Catastrophic forgetting happens when fine-tuning shifts a model’s parameters to fit a narrow target distribution, degrading previously learned general behaviors. High-risk patterns include full-weight fine-tuning on only the new domain data (especially with aggressive hyperparameters), because the model has no “anchor” to preserve general instruction-following.
Safer high-level strategies are to:
- Freeze most base layers and train LoRA/adapter weights so the pretrained parameters stay largely intact.
- Interleave the labeled ticket data with the de-identified general instruction data so the model keeps rehearsing general behavior.
- Use a low learning rate with early stopping on dual evaluation sets (one domain-specific, one general), and keep instruction templates consistent with the base model's format.
Separately, data governance constraints override modeling convenience: prohibited data (PII) must not be used even if it could reduce forgetting.
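A hedged sketch of the freeze-plus-adapters option using the Hugging Face peft library; the model name and target_modules are placeholders that depend on the actual architecture:

```python
# Sketch: freeze the base model and train only small LoRA adapter matrices.
# The model name and target_modules below are illustrative placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("your-org/base-llm")  # placeholder

lora = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # depends on the architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a tiny fraction of weights train
```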
Topic: ML Model Development
A company fine-tunes a pretrained text generation model in Amazon SageMaker on 200,000 internal customer-support tickets stored in Amazon S3. After fine-tuning, the model performs well on support tasks but noticeably degrades on general writing and instruction-following tasks that it previously handled well.
Which THREE actions are most likely to reduce catastrophic forgetting during fine-tuning? (Select THREE.)
Options:
A. Use parameter-efficient fine-tuning (for example, LoRA/adapters) or freeze most base layers
B. Fine-tune only on the support-ticket dataset to avoid “diluting” the domain signal
C. Reduce the learning rate and use early stopping based on a general-task evaluation set
D. Increase the learning rate to speed adaptation to the support-ticket domain
E. Reinitialize several top transformer blocks before fine-tuning to encourage domain specialization
F. Mix a representative set of general-domain examples into the fine-tuning dataset
Correct answers: A, C and F
Explanation: Catastrophic forgetting happens when fine-tuning updates overwrite capabilities learned during pretraining. The most effective high-level mitigations are to rehearse general behavior with mixed data, constrain how much the base weights can change (for example, PEFT/freezing), and use conservative optimization with evaluation-driven stopping to prevent over-adaptation.
To reduce catastrophic forgetting, the goal is to adapt to the new support-ticket distribution without letting gradient updates erase broadly useful behavior from pretraining. In practice this is done by (1) keeping some training signal that represents the original/general tasks, and (2) limiting or carefully controlling weight updates so the model does not “move too far” from the pretrained solution.
Common, high-level strategies include:
- Parameter-efficient fine-tuning (for example, LoRA/adapters) or freezing most base layers so fewer weights can drift from the pretrained solution.
- Mixing a representative set of general-domain examples into the fine-tuning dataset as a form of rehearsal.
- Reducing the learning rate and using early stopping driven by a general-task evaluation set.
By contrast, aggressive optimization or training only on the new domain typically increases forgetting.
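A sketch of the data-mixing (rehearsal) idea with the Hugging Face datasets library; ticket_ds and general_ds are assumed, already-prepared dataset objects, and the 70/30 mix is illustrative:

```python
# Sketch: rehearse general behavior by interleaving domain and general data.
# `ticket_ds` and `general_ds` are assumed, pre-tokenized Dataset objects.
from datasets import interleave_datasets

mixed = interleave_datasets(
    [ticket_ds, general_ds],
    probabilities=[0.7, 0.3],  # mostly domain data, with general rehearsal
    seed=0,
)
# Train on `mixed` with a conservative learning rate (e.g., 1e-5 rather
# than 1e-4) and early stopping keyed to a general-task evaluation set.
```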
Use the AWS MLA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try AWS MLA-C01 on Web
View AWS MLA-C01 Practice Test
Read the AWS MLA-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.