MLA-C01 — AWS Certified Machine Learning Engineer – Associate Quick Review
Quick Review for AWS Certified Machine Learning Engineer – Associate (MLA-C01): high-yield ML engineering concepts, AWS service choices, deployment patterns, monitoring, security, and practice guidance.
Quick Review purpose
This Quick Review is for candidates preparing for the AWS Certified Machine Learning Engineer – Associate (MLA-C01) exam. Use it to refresh the main decision points before moving into topic drills, mock exams, and detailed explanations.
This page supports IT Mastery practice with original practice questions. It is not affiliated with AWS.
What to know before drilling questions
The MLA-C01 exam is scenario-driven. Many questions are not asking, “What does this service do?” They are asking, “Given these constraints, which AWS machine learning design is the best fit?”
Read each question for:
- Workflow stage: data preparation, training, deployment, orchestration, monitoring, governance, or security.
- Constraint: lowest latency, lowest cost, real-time inference, batch inference, private networking, explainability, drift detection, automation, or operational control.
- Managed-service preference: AWS exam scenarios often reward using managed capabilities when they directly satisfy the requirement.
- Failure mode: data leakage, incorrect metric, overfitting, missing permissions, no network path, no monitoring baseline, or manual steps where automation is required.
High-yield AWS ML engineering service map
| Need | High-yield AWS services or features | Watch for |
|---|---|---|
| Store raw and processed ML data | Amazon S3, S3 versioning, S3 lifecycle, S3 encryption | Bucket policies, KMS permissions, data partitioning |
| Catalog and transform data | AWS Glue, AWS Glue Data Catalog, Amazon Athena, Amazon EMR, Amazon SageMaker Data Wrangler | Glue for ETL/catalog, Athena for SQL on S3, EMR for big data frameworks |
| Stream data | Amazon Kinesis Data Streams, Amazon Data Firehose (formerly Kinesis Data Firehose), Amazon MSK | Real-time ingestion vs delivery to S3/OpenSearch/Redshift |
| Build and train models | Amazon SageMaker training jobs, notebooks, Studio, built-in algorithms, custom containers | IAM execution role, ECR image access, S3 input/output paths |
| Tune models | SageMaker automatic model tuning | Objective metric, search ranges, early stopping |
| Process data at scale | SageMaker Processing jobs | Repeatable preprocessing/evaluation outside notebooks |
| Track features | SageMaker Feature Store | Online store for low-latency lookup, offline store for training/history |
| Register and approve models | SageMaker Model Registry | Model package groups, approval status, lineage |
| Deploy inference | SageMaker real-time endpoints, serverless inference, asynchronous inference, batch transform | Match latency, traffic pattern, payload size, and cost |
| Orchestrate workflows | SageMaker Pipelines, AWS Step Functions, Amazon EventBridge | ML-native pipeline vs broader service orchestration |
| Monitor models | SageMaker Model Monitor, SageMaker Clarify, Amazon CloudWatch | Baselines, schedules, captured data, labels for model quality |
| Secure workloads | IAM, AWS KMS, VPC, security groups, VPC endpoints, AWS Secrets Manager, AWS CloudTrail | Least privilege, encryption, private connectivity, auditability |
| Build CI/CD | AWS CodePipeline, CodeBuild, CodeDeploy, SageMaker Projects | Reproducible promotion from dev to test to production |
The core ML lifecycle on AWS
flowchart LR
A[Collect data] --> B[Store in S3]
B --> C[Catalog and prepare]
C --> D[Train and tune]
D --> E[Evaluate]
E --> F{Meets criteria?}
F -- No --> C
F -- Yes --> G[Register model]
G --> H[Deploy]
H --> I[Monitor]
I --> J{Drift or degradation?}
J -- Yes --> C
J -- No --> I
For MLA-C01 review, focus on how each stage is automated, secured, monitored, and connected to the next stage.
Data preparation and feature engineering
Data storage and formats
| Decision point | Prefer this | Why |
|---|---|---|
| Large analytical datasets in S3 | Parquet or ORC | Columnar, compressed, efficient for Athena/Glue/Spark |
| Simple interchange or small datasets | CSV or JSON | Easy but often less efficient |
| Repeated ML training reads | Partitioned S3 data | Reduces scan and processing cost |
| Versioned reproducible training data | S3 versioning, manifest files, pipeline parameters | Helps reproduce a model |
| Shared POSIX file access during training | Amazon EFS or Amazon FSx for Lustre, depending on workload | S3 is object storage, not a mounted file system by default |
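The partitioning row above can be made concrete with a small sketch. It builds a Hive-style key (`year=/month=/day=`), a layout that lets engines such as Athena and Glue skip partitions a query does not filter on. The prefix, file name, and the `partitioned_key` helper are hypothetical names chosen for illustration.

```python
from datetime import date


def partitioned_key(prefix: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key (year=/month=/day=)."""
    return (
        f"{prefix.strip('/')}/"
        f"year={event_date.year:04d}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}/"
        f"{filename}"
    )


# Hypothetical prefix and file name, for illustration only.
key = partitioned_key("curated/transactions", date(2024, 5, 1), "part-0000.parquet")
print(key)  # curated/transactions/year=2024/month=05/day=01/part-0000.parquet
```

A training job that only needs one month of data can then read a single prefix instead of scanning the whole dataset.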
Common trap: choosing a training algorithm or deployment service before fixing the data issue. If the scenario says the model performs well in validation but poorly in production, suspect leakage, skew, drift, nonrepresentative validation data, or feature mismatch.
Data splitting and leakage
Know the difference between random splitting and time-aware splitting.
| Scenario | Better split strategy | Trap |
|---|---|---|
| Independent records with no time dependency | Random train/validation/test split | Accidentally duplicating near-identical rows across splits |
| Forecasting, clickstream, transactions over time | Time-based split | Training on future information |
| Users/customers appear multiple times | Group-based split | Same user in train and test |
| Rare positive class | Stratified split | Test set has too few positive cases |
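As a rough illustration of the time-based and group-based rows, the sketch below uses only the standard library. The record fields (`user`, `ts`) and both helper functions are assumptions made for the example, not part of any AWS API.

```python
import hashlib


def time_based_split(records, cutoff):
    """Train on records strictly before the cutoff; evaluate on the rest."""
    train = [r for r in records if r["ts"] < cutoff]
    test = [r for r in records if r["ts"] >= cutoff]
    return train, test


def group_based_split(records, group_key, test_fraction=0.2):
    """Send every record of a group to the same side using a stable hash."""
    train, test = [], []
    for r in records:
        digest = hashlib.sha256(str(r[group_key]).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
        (test if bucket < test_fraction * 100 else train).append(r)
    return train, test


records = [
    {"user": "u1", "ts": 1}, {"user": "u2", "ts": 2},
    {"user": "u1", "ts": 3}, {"user": "u3", "ts": 4},
]
t_train, t_test = time_based_split(records, cutoff=3)
g_train, g_test = group_based_split(records, "user")
```

Hashing the group identifier keeps the assignment stable across runs, so a user who lands in the test set stays there when new records arrive.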
Data leakage examples:
- Using a feature that is only known after the prediction time.
- Fitting scalers, imputers, encoders, or feature selectors on the full dataset before the split.
- Including target-derived columns.
- Using test data during hyperparameter tuning.
- Training on records that overlap with the evaluation set.
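The second leakage example (fitting a scaler on the full dataset) is easy to get wrong in practice. A minimal sketch, assuming a single numeric feature and hypothetical helper names, shows the safe order: learn the statistics on the training split, then apply them unchanged to the held-out data.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn scaling statistics from the training split only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero variance
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Apply previously learned statistics without refitting."""
    return [(v - mu) / sigma for v in values]


train_x = [10.0, 12.0, 14.0]
test_x = [100.0]  # extreme held-out value that must not influence mu or sigma

mu, sigma = fit_scaler(train_x)          # fit on train only
train_scaled = apply_scaler(train_x, mu, sigma)
test_scaled = apply_scaler(test_x, mu, sigma)
```

Had the scaler been fitted on `train_x + test_x`, the mean and standard deviation would already encode information about the test record.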
Feature engineering decision rules
| Requirement | Useful approach |
|---|---|
| Handle missing numeric values | Imputation, missingness indicators, domain-specific defaults |
| Handle high-cardinality categorical values | Target encoding with care, hashing, embeddings, or grouping rare categories |
| Handle skewed numeric values | Log transform, winsorization, robust scaling |
| Handle class imbalance | Class weights, resampling, threshold tuning, metric selection |
| Use features for both training and low-latency inference | SageMaker Feature Store online/offline stores |
| Avoid training-serving skew | Use the same transformation code or pipeline for training and inference |
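One way to read the last row is "one function, two callers". The sketch below is an assumption-laden illustration: the `amount` feature, the `params` dictionary, and the `transform` function are hypothetical, and it combines median imputation, a missingness indicator, and a log transform from the table above.

```python
import math


def transform(record, params):
    """Single transformation shared by the training job and the inference handler."""
    amount = record.get("amount")
    missing = amount is None
    if missing:
        amount = params["amount_median"]  # imputation value learned on training data
    return {
        "amount_log": math.log1p(max(amount, 0.0)),  # log transform for a right-skewed feature
        "amount_missing": 1 if missing else 0,       # missingness indicator
    }


params = {"amount_median": 20.0}  # hypothetical value learned from the training split
training_row = transform({"amount": 99.0}, params)
serving_row = transform({"amount": None}, params)
```

Packaging `transform` and `params` with the model artifact means the endpoint cannot silently drift away from the preprocessing used in training.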
Data preparation services: quick choices
| If the question says… | Think… |
|---|---|
| “Run SQL queries directly on S3 data” | Amazon Athena with AWS Glue Data Catalog |
| “Serverless ETL and data catalog” | AWS Glue |
| “Spark/Hadoop ecosystem and more cluster control” | Amazon EMR |
| “Visual feature preparation for SageMaker workflow” | SageMaker Data Wrangler |
| “Repeatable preprocessing step in ML pipeline” | SageMaker Processing |
| “Streaming records need real-time ingestion” | Kinesis Data Streams or Amazon MSK |
| “Deliver streaming data into S3 with minimal management” | Amazon Data Firehose (formerly Kinesis Data Firehose) |
Model development essentials
Algorithm and problem type recognition
| Problem type | Output | Common metrics |
|---|---|---|
| Binary classification | One of two classes or probability | Accuracy, precision, recall, F1, ROC-AUC, PR-AUC |
| Multiclass classification | One of several classes | Accuracy, macro/micro F1, confusion matrix |
| Regression | Numeric value | RMSE, MAE, R-squared |
| Forecasting | Future numeric values over time | RMSE, MAPE, backtesting metrics |
| Clustering | Group assignment without labels | Silhouette score, domain validation |
| Anomaly detection | Unusual event score or label | Precision/recall, false positive rate |
| Ranking/recommendation | Ordered list or item score | NDCG, MAP, click-through metrics |
Metric traps:
- Accuracy can be misleading with imbalanced data.
- Precision matters when false positives are expensive.
- Recall matters when false negatives are expensive.
- F1 balances precision and recall.
- ROC-AUC may look strong even when rare-positive performance is weak; PR-AUC may be more informative for severe imbalance.
- RMSE penalizes large errors more than MAE.
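The RMSE/MAE point can be checked with a small worked example. Both error lists below have the same total absolute error; only the way it is distributed differs. The numbers are made up for illustration.

```python
import math


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


even_errors = [2.0, 2.0, 2.0, 2.0]      # error spread evenly
one_large_error = [0.0, 0.0, 0.0, 8.0]  # same total error in a single prediction

print(mae(even_errors), rmse(even_errors))          # 2.0 2.0
print(mae(one_large_error), rmse(one_large_error))  # 2.0 4.0
```

MAE is identical in both cases, while RMSE doubles when the error is concentrated in one prediction.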
Classification metrics refresher
| Metric | Plain-language meaning | Use when |
|---|---|---|
| Precision | Of predicted positives, how many were actually positive | False positives are costly |
| Recall | Of actual positives, how many were found | False negatives are costly |
| F1 score | Harmonic mean of precision and recall | Need a single balance metric |
| Specificity | Of actual negatives, how many were correctly rejected | False alarms matter |
| Confusion matrix | Counts TP, FP, TN, FN | Diagnose error type |
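The metrics in this table all derive from the four confusion-matrix counts. The sketch below computes them and contrasts two hypothetical fraud models on an imbalanced dataset; the counts are invented for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# 1,000 transactions, 10 fraudulent. This model never flags anything.
always_negative = classification_metrics(tp=0, fp=0, tn=990, fn=10)
# This model catches 8 of the 10 frauds at the cost of 20 false alarms.
useful = classification_metrics(tp=8, fp=20, tn=970, fn=2)
```

The model that flags nothing has the higher accuracy (0.99 vs 0.978) but zero recall, which is why accuracy alone is a poor choice for imbalanced problems.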
Bias, variance, and overfitting
| Symptom | Likely issue | Response |
|---|---|---|
| Low training score and low validation score | High bias / underfitting | More expressive model, better features, train longer |
| High training score and low validation score | High variance / overfitting | Regularization, more data, early stopping, simpler model |
| Validation good, production poor | Drift, leakage, skew, bad split, changed data source | Monitor, compare distributions, retrain |
| Training unstable | Learning rate too high, poor scaling, noisy data | Tune learning rate, normalize, review data quality |
Hyperparameter tuning
SageMaker automatic model tuning is high-yield for scenarios where the model type is chosen but performance needs improvement.
Remember:
- Define an objective metric that matches the business goal and constraints stated in the scenario.
- Set realistic hyperparameter ranges.
- Use validation data, not test data, for tuning.
- Use early stopping when supported to reduce cost.
- Keep a final untouched test set for unbiased evaluation.
Common trap: optimizing the wrong metric. If the scenario emphasizes missed fraud, missed disease, or missed safety issues, recall-oriented metrics often matter more than accuracy.
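To make the checklist above tangible, here is a sketch of the configuration dictionary that the boto3 SageMaker `create_hyper_parameter_tuning_job` call accepts as `HyperParameterTuningJobConfig`. No AWS call is made. The metric name `validation:auc`, the hyperparameter names, and the ranges are assumptions that fit an XGBoost-style training container; substitute whatever your container actually emits.

```python
import json


def build_tuning_config(max_jobs=20, max_parallel=2):
    """Assemble a HyperParameterTuningJobConfig-style dictionary (no AWS call)."""
    return {
        "Strategy": "Bayesian",
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": "validation:auc",  # assumed metric, computed on validation data
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,      # caps total tuning cost
            "MaxParallelTrainingJobs": max_parallel,
        },
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "eta", "MinValue": "0.01", "MaxValue": "0.3",
                 "ScalingType": "Logarithmic"},
            ],
            "IntegerParameterRanges": [
                {"Name": "max_depth", "MinValue": "3", "MaxValue": "10",
                 "ScalingType": "Linear"},
            ],
        },
        "TrainingJobEarlyStoppingType": "Auto",  # stop unpromising jobs early
    }


tuning_config = build_tuning_config(max_jobs=10, max_parallel=2)
print(json.dumps(tuning_config["HyperParameterTuningJobObjective"]))
```

Note that range bounds are passed as strings, and that the objective metric should be the one the scenario cares about, not simply the default.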
SageMaker training jobs
Training job anatomy
A SageMaker training job usually needs:
- Training container image, either built-in or custom.
- Input data location, often S3.
- Output model artifact location, often S3.
- IAM execution role.
- Instance type and count.
- Hyperparameters.
- Optional VPC configuration.
- Optional checkpointing.
- Optional debugger/profiler/metrics.
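The required pieces listed above map closely onto the request passed to the boto3 SageMaker `create_training_job` call. The sketch below only builds that dictionary and makes no AWS call. The job name, image URI, role ARN, bucket paths, instance type, and hyperparameters are placeholders, not recommendations.

```python
def build_training_job_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Assemble a create_training_job-style request dictionary (no AWS call)."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,    # built-in or custom container image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,               # IAM execution role
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},  # model artifact location
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "HyperParameters": {"max_depth": "5", "eta": "0.2"},  # values must be strings
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# Placeholder identifiers, for illustration only.
request = build_training_job_request(
    job_name="example-training-job",
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerExecutionRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
```

With credentials configured, the dictionary could be submitted as `boto3.client("sagemaker").create_training_job(**request)`. Optional items from the list, such as `VpcConfig` and `CheckpointConfig`, are additional top-level keys.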
Built-in algorithms vs custom containers
| Choose | When |
|---|---|
| SageMaker built-in algorithm | Standard problem type, faster setup, less container maintenance |
| SageMaker framework estimator | TensorFlow, PyTorch, XGBoost, scikit-learn with managed training support |
| Custom container | Custom dependencies, custom runtime, unsupported framework, specialized training logic |
| Bring your own script | You need flexibility but can use managed framework containers |
Custom container traps:
- The image must be in a registry that SageMaker can pull from, typically Amazon ECR.
- SageMaker role needs permission to pull the image and read/write S3.
- Training code must read from expected input channels and write model artifacts correctly.
- Private VPC training needs network access to S3/ECR/CloudWatch, often through VPC endpoints or controlled egress.
Distributed training and acceleration
| Scenario clue | Consider |
|---|---|
| Large deep learning model, long training time | GPU instances, distributed training, managed distributed libraries |
| Large tabular or tree model | CPU or memory-optimized instances may be enough |
| Need lower training cost and can tolerate interruption | Managed Spot Training with checkpointing |
| Training job must resume after interruption | Checkpoints saved to S3 |
| Large dataset bottleneck | Data format, sharding, Pipe or FastFile input mode where applicable, FSx for Lustre or EFS as the data source |
Do not assume “bigger instance” is always the best answer. The exam may prefer the option that addresses the actual bottleneck: data loading, algorithm configuration, storage format, networking, or metric choice.
Deployment and inference
Pick the right inference pattern
| Requirement | Better fit | Key reason |
|---|---|---|
| Low-latency, always-on API | SageMaker real-time endpoint | Persistent endpoint for synchronous predictions |
| Intermittent traffic, simpler scaling | SageMaker serverless inference | No instance management for variable demand |
| Large payloads or long processing time | SageMaker asynchronous inference | Queues requests and processes asynchronously |
| Offline predictions for a dataset | SageMaker batch transform | No persistent endpoint needed |
| Many similar models with low traffic each | Multi-model endpoint | Reduces cost by sharing infrastructure |
| Test new model against production traffic | Shadow testing or production variants | Compare safely before full cutover |
| Gradual rollout | Canary or blue/green deployment | Reduce release risk |
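The first four rows of the table can be read as an ordered set of checks. The function below is a study aid, not an AWS API. Its thresholds (`sync_payload_limit_mb`, `sync_timeout_seconds`) are hypothetical placeholders rather than service quotas, so check the current SageMaker limits before relying on specific numbers.

```python
def choose_inference_option(
    *,
    needs_online_response,
    payload_mb=1.0,
    processing_seconds=1.0,
    traffic="steady",
    sync_payload_limit_mb=5.0,   # hypothetical threshold, not an AWS quota
    sync_timeout_seconds=60.0,   # hypothetical threshold, not an AWS quota
):
    """Map scenario constraints to a SageMaker inference pattern."""
    if not needs_online_response:
        return "batch transform"              # offline scoring, no persistent endpoint
    if payload_mb > sync_payload_limit_mb or processing_seconds > sync_timeout_seconds:
        return "asynchronous inference"       # large payloads or long processing
    if traffic == "intermittent":
        return "serverless inference"         # no instance management
    return "real-time endpoint"               # always-on synchronous API


print(choose_inference_option(needs_online_response=True, payload_mb=500))
# asynchronous inference
```

The order of the checks matters: payload size and processing time are evaluated before traffic shape, because an intermittent workload with very large requests still points to asynchronous inference.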
Deployment traps
| Trap | Correct thinking |
|---|---|
| Choosing batch transform for real-time low-latency use | Batch transform is for offline batch scoring |
| Keeping a real-time endpoint for infrequent jobs | Consider batch transform or serverless inference |
| Ignoring payload size and timeout | Async inference may be a better fit for large/long requests |
| Deploying without data capture | Model Monitor needs captured inference data for many monitoring workflows |
| Confusing endpoint variants with model registry versions | Variants split traffic; registry tracks model packages and approval status |
| Assuming auto scaling fixes model quality | Scaling fixes capacity, not drift or bad predictions |
Real-time endpoint concepts
For SageMaker real-time inference, know:
- Model: points to model artifacts and inference image.
- Endpoint configuration: defines production variants and instance choices.
- Endpoint: live HTTPS inference target.
- Production variant: model/instance group with traffic weight.
- Auto scaling: adjusts capacity based on metrics such as invocation load.
- Data capture: stores requests and responses for monitoring.
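These concepts meet in the endpoint configuration. The sketch below builds a dictionary shaped like the boto3 SageMaker `create_endpoint_config` request, with two production variants and data capture enabled, and then computes the traffic share implied by the variant weights. Model names, instance types, and the S3 URI are placeholders. No AWS call is made.

```python
def build_endpoint_config(name, variants, capture_s3_uri, sampling_percentage=100):
    """Assemble a create_endpoint_config-style request dictionary (no AWS call)."""
    return {
        "EndpointConfigName": name,
        "ProductionVariants": [
            {
                "VariantName": v["name"],
                "ModelName": v["model"],
                "InstanceType": v["instance_type"],
                "InitialInstanceCount": v["count"],
                "InitialVariantWeight": v["weight"],
            }
            for v in variants
        ],
        "DataCaptureConfig": {
            "EnableCapture": True,
            "InitialSamplingPercentage": sampling_percentage,
            "DestinationS3Uri": capture_s3_uri,
            "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
        },
    }


def traffic_shares(config):
    """Each variant receives its weight divided by the sum of all weights."""
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Placeholder names, for illustration only.
config = build_endpoint_config(
    "example-endpoint-config",
    [
        {"name": "current", "model": "model-v1", "instance_type": "ml.m5.large",
         "count": 2, "weight": 9},
        {"name": "candidate", "model": "model-v2", "instance_type": "ml.m5.large",
         "count": 1, "weight": 1},
    ],
    capture_s3_uri="s3://example-bucket/data-capture/",
)
print(traffic_shares(config))  # {'current': 0.9, 'candidate': 0.1}
```

Weights are relative, so 9 and 1 route roughly 90% and 10% of requests. Capturing both input and output is what later monitoring jobs read from S3.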
Orchestration, CI/CD, and MLOps
Workflow service selection
| Need | Prefer |
|---|---|
| ML-native pipeline with training, tuning, evaluation, model registration | SageMaker Pipelines |
| Coordinate AWS services beyond ML, with branching and retries | AWS Step Functions |
| Event-driven trigger after file upload or schedule | Amazon EventBridge |
| Source-to-build-to-deploy software pipeline | AWS CodePipeline with CodeBuild/CodeDeploy |
| Package and approve model versions | SageMaker Model Registry |
| Track experiments, parameters, metrics, and artifacts | SageMaker Experiments or equivalent tracking setup |
MLOps review checklist
A production-ready ML workflow should answer:
- Where did the training data come from?
- Which code version created the model?
- Which hyperparameters were used?
- Which metrics approved the model?
- Who or what approved deployment?
- How is the model deployed and rolled back?
- What monitoring detects drift or degradation?
- What triggers retraining?
- How are secrets, keys, and network paths secured?
- How are logs and audit events retained?
Model Registry decision points
Use SageMaker Model Registry when the scenario requires:
- Tracking model versions.
- Model package approval before deployment.
- Promotion from development to staging to production.
- Lineage and governance around model artifacts.
- CI/CD integration for model deployment.
Common trap: storing a model artifact in S3 is not the same as managing the model lifecycle. S3 can store artifacts, but Model Registry provides versioning, approval, and lifecycle metadata.
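An approval gate in a deployment script might look like the sketch below. The input is assumed to have the shape of `ModelPackageSummaryList` as returned by the boto3 SageMaker `list_model_packages` call; the ARNs and the `latest_approved` helper are hypothetical.

```python
def latest_approved(model_packages):
    """Return the highest-version package whose status is Approved, else None."""
    approved = [p for p in model_packages if p["ModelApprovalStatus"] == "Approved"]
    if not approved:
        return None
    return max(approved, key=lambda p: p["ModelPackageVersion"])


# Hypothetical summaries for one model package group.
packages = [
    {"ModelPackageArn": "arn:aws:sagemaker:us-east-1:111122223333:model-package/example-group/1",
     "ModelPackageVersion": 1, "ModelApprovalStatus": "Approved"},
    {"ModelPackageArn": "arn:aws:sagemaker:us-east-1:111122223333:model-package/example-group/2",
     "ModelPackageVersion": 2, "ModelApprovalStatus": "Approved"},
    {"ModelPackageArn": "arn:aws:sagemaker:us-east-1:111122223333:model-package/example-group/3",
     "ModelPackageVersion": 3, "ModelApprovalStatus": "PendingManualApproval"},
]

to_deploy = latest_approved(packages)
print(to_deploy["ModelPackageVersion"])  # 2
```

Version 3 is newer but still pending approval, so the gate deploys version 2. A plain S3 prefix of artifacts carries no such status to check.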
Monitoring, maintenance, and drift
Types of monitoring
| Monitoring type | What it detects | Needs |
|---|---|---|
| Infrastructure monitoring | CPU, memory, latency, errors, invocations | CloudWatch metrics/logs |
| Data quality monitoring | Feature distribution changes, missing values, schema issues | Baseline and captured inference data |
| Model quality monitoring | Prediction quality degradation | Captured predictions plus ground truth labels |
| Bias monitoring | Bias metric changes over time | SageMaker Clarify configuration and data |
| Explainability monitoring | Feature attribution changes | Clarify/explainability setup |
| Security/audit monitoring | API calls, access changes, unusual activity | CloudTrail, logs, IAM review |
Drift concepts
| Drift type | Meaning | Example |
|---|---|---|
| Data drift | Input feature distribution changes | New customer population behaves differently |
| Concept drift | Relationship between features and target changes | Fraud patterns change |
| Label drift | Target distribution changes | Positive class rate rises sharply |
| Training-serving skew | Training preprocessing differs from inference preprocessing | One-hot encoding differs between environments |
High-yield rule: if a question mentions production performance decline but infrastructure is healthy, look for drift, skew, missing monitoring baseline, or retraining workflow.
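To build intuition for data drift, the sketch below compares a baseline feature sample with a production sample using the Population Stability Index (PSI). PSI is a generic statistic swapped in for illustration; it is not what SageMaker Model Monitor computes, which relies on its own baseline statistics and constraints. The bin edges and sample values are invented.

```python
import math


def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # index of the bin containing v
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]  # eps avoids log(0)

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


edges = [10, 20, 30]                        # hypothetical bin boundaries
baseline = [5, 12, 15, 22, 25, 28, 35, 8]   # training-time feature values
shifted = [25, 28, 32, 35, 38, 40, 33, 29]  # production values skewed upward

print(psi(baseline, baseline, edges))  # 0.0
print(psi(baseline, shifted, edges) > 0)  # True
```

A value of zero means the binned distributions match; larger values indicate a bigger shift. The alerting threshold is a choice for the team, not something this sketch prescribes.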
Retraining triggers
Retraining may be triggered by:
- Scheduled interval.
- Data drift threshold.
- Model quality threshold.
- New labeled data availability.
- Business event or seasonal change.
- Manual approval after monitoring alert.
Do not retrain blindly if the problem is bad input data, broken preprocessing, missing features, or a deployment bug. Fix the cause first.
Security and governance
IAM fundamentals for MLA-C01
| Concept | Review point |
|---|---|
| IAM role | Preferred for AWS service permissions; avoid hard-coded credentials |
| SageMaker execution role | Grants training/processing/notebook jobs access to S3, ECR, CloudWatch, KMS, etc. |
| Least privilege | Grant only required actions and resources |
| Resource policy | S3 bucket policies, KMS key policies, ECR repository policies may also control access |
| Temporary credentials | Prefer roles and federation over long-term access keys |
| Cross-account access | Requires permissions on both caller and resource sides |
Common trap: giving an IAM role S3 permission but forgetting the KMS key policy or KMS permissions for encrypted data.
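A hedged sketch of an identity policy that avoids this trap is shown below. It grants read access to one S3 prefix and `kms:Decrypt` on the key that encrypts it. The bucket, prefix, and key ARN are placeholders, and the helper name is hypothetical. The KMS key policy must still allow the role, or delegate access to IAM, for the `kms:Decrypt` statement to take effect.

```python
import json


def s3_kms_read_policy(bucket, prefix, kms_key_arn):
    """Least-privilege read policy for SSE-KMS encrypted objects under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListTrainingPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadTrainingObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "DecryptTrainingObjects",
                "Effect": "Allow",
                "Action": "kms:Decrypt",  # the permission that is often forgotten
                "Resource": kms_key_arn,
            },
        ],
    }


# Placeholder identifiers, for illustration only.
policy = s3_kms_read_policy(
    "example-bucket",
    "train",
    "arn:aws:kms:us-east-1:111122223333:key/11111111-2222-3333-4444-555555555555",
)
print(json.dumps(policy, indent=2))
```

Writing encrypted output would additionally need `s3:PutObject` and `kms:GenerateDataKey`, which are left out here to keep the read path clear.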
Encryption and private networking
| Requirement | Consider |
|---|---|
| Encrypt data at rest in S3 | SSE-S3 or SSE-KMS, depending on control requirements |
| Encrypt training artifacts | S3 encryption and SageMaker volume/output encryption settings |
| Encrypt data in transit | HTTPS/TLS endpoints |
| Keep traffic off public internet | VPC configuration, private subnets, VPC endpoints |
| Access S3 privately from VPC | Gateway endpoint for S3 |
| Access AWS APIs privately | Interface VPC endpoints where applicable |
| Store database passwords/API tokens | AWS Secrets Manager or AWS Systems Manager Parameter Store |
| Audit API calls | AWS CloudTrail |
Private VPC trap: putting SageMaker training in a private subnet can break access to S3, ECR, and CloudWatch unless network paths are configured. The secure answer must still allow required service access.
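A quick way to reason about the trap is to list the endpoint service names a private training job would need and compare them with what the VPC already has. The set below is an assumed minimal list for illustration; real workloads may need more (for example, KMS or the SageMaker API). The helper names are hypothetical.

```python
# Assumed minimal set for a training job with no internet egress; adjust per workload.
REQUIRED_ENDPOINTS = {
    "s3": "Gateway",            # training data, model artifacts, ECR image layers
    "ecr.api": "Interface",     # ECR API calls
    "ecr.dkr": "Interface",     # container image pulls
    "logs": "Interface",        # CloudWatch Logs
    "monitoring": "Interface",  # CloudWatch metrics
}


def required_service_names(region):
    """Map full VPC endpoint service names to their endpoint type."""
    return {f"com.amazonaws.{region}.{suffix}": kind
            for suffix, kind in REQUIRED_ENDPOINTS.items()}


def missing_endpoints(region, existing_service_names):
    """Return required endpoint service names that the VPC does not have yet."""
    existing = set(existing_service_names)
    return sorted(name for name in required_service_names(region) if name not in existing)


existing = ["com.amazonaws.us-east-1.s3", "com.amazonaws.us-east-1.logs"]
print(missing_endpoints("us-east-1", existing))
# ['com.amazonaws.us-east-1.ecr.api', 'com.amazonaws.us-east-1.ecr.dkr', 'com.amazonaws.us-east-1.monitoring']
```

In this example the job could read S3 and write logs but would fail to pull its container image, which matches the "training container cannot be pulled" clue seen in exam scenarios.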
Data protection and responsible ML
Expect scenarios involving:
- Sensitive data in training datasets.
- Encryption requirements.
- Access control for notebooks, S3, model artifacts, and endpoints.
- Audit trails for model deployment.
- Bias or explainability checks with SageMaker Clarify.
- Minimizing exposure of secrets and credentials.
Do not choose an option that solves model accuracy while ignoring stated security constraints.
Cost and performance optimization
Training cost controls
| Requirement | Option |
|---|---|
| Reduce cost for interruption-tolerant training | Managed Spot Training |
| Resume interrupted training | Checkpointing to S3 |
| Avoid unnecessary data scans | Partitioned columnar data |
| Reduce repeated preprocessing cost | Persist processed features or use Feature Store/offline store |
| Reduce tuning cost | Narrow search ranges, early stopping, sensible max jobs |
| Avoid idle notebooks | Stop idle notebook instances, or automate shutdown with lifecycle configurations |
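The first two rows usually travel together. As a sketch, the helper below adds Managed Spot Training and checkpoint settings to a `create_training_job`-style request dictionary, and enforces the rule that the maximum wait time must be at least the maximum runtime. The helper name and S3 URI are hypothetical, and no AWS call is made.

```python
def enable_managed_spot(request, checkpoint_s3_uri, max_runtime_seconds, max_wait_seconds):
    """Return a copy of a training request with spot and checkpoint settings added."""
    if max_wait_seconds < max_runtime_seconds:
        raise ValueError("MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds")
    updated = dict(request)
    updated["EnableManagedSpotTraining"] = True
    updated["StoppingCondition"] = {
        "MaxRuntimeInSeconds": max_runtime_seconds,
        "MaxWaitTimeInSeconds": max_wait_seconds,  # includes time spent waiting for capacity
    }
    updated["CheckpointConfig"] = {
        "S3Uri": checkpoint_s3_uri,            # checkpoints persist here across interruptions
        "LocalPath": "/opt/ml/checkpoints",    # path the training code writes to
    }
    return updated


base_request = {"TrainingJobName": "example-training-job"}  # placeholder request
spot_request = enable_managed_spot(
    base_request,
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",
    max_runtime_seconds=3600,
    max_wait_seconds=7200,
)
```

Spot settings alone do not make a job resumable. The training script also has to save checkpoints and reload the latest one on start.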
Inference cost controls
| Traffic pattern | Cost-aware choice |
|---|---|
| Continuous predictable traffic | Right-sized real-time endpoint with auto scaling |
| Bursty or intermittent traffic | Serverless inference |
| Offline scoring | Batch transform |
| Many low-traffic models | Multi-model endpoint |
| Large/slow requests | Async inference rather than overprovisioned synchronous endpoint |
| Need lower latency at scale | Tune model, choose appropriate instance, autoscale, consider optimized runtimes |
Performance trap: adding instances may not help if the bottleneck is model size, serialization, preprocessing, cold starts, or downstream dependencies.
Common MLA-C01 scenario traps
| Candidate mistake | Better exam approach |
|---|---|
| Memorizing services without constraints | Identify latency, cost, governance, and automation requirements |
| Picking the newest ML service automatically | Choose the service that directly satisfies the scenario |
| Treating notebooks as production workflows | Use pipelines, jobs, registries, and CI/CD for repeatability |
| Ignoring train/test contamination | Check split strategy and preprocessing order |
| Using accuracy for imbalanced classification | Match metric to business cost |
| Deploying before approval/governance | Use Model Registry and approval gates when required |
| Monitoring only CPU and latency | Add data/model quality monitoring for ML risk |
| Forgetting ground truth labels | Model quality monitoring needs labels |
| Assuming IAM permission alone is enough | Check bucket policy, KMS key policy, VPC access, and ECR access |
| Choosing real-time endpoint for batch workload | Use batch transform for offline scoring |
| Choosing batch transform for API prediction | Use real-time, serverless, or async inference |
| Missing retraining automation | Use EventBridge, Pipelines, Step Functions, and monitoring triggers |
| Hard-coding credentials | Use IAM roles and Secrets Manager/Parameter Store |
Fast decision rules
Data and processing
- If the data is in S3 and the question says ad hoc SQL, think Athena.
- If the question says serverless ETL/catalog, think Glue.
- If the question says Spark with more control, think EMR.
- If the question says repeatable ML preprocessing job, think SageMaker Processing.
- If the question says same features for training and low-latency inference, think SageMaker Feature Store.
- If the question says streaming ingestion, compare Kinesis Data Streams, Amazon Data Firehose, and MSK.
Training
- If performance is poor on both train and validation, address underfitting.
- If training is strong and validation is weak, address overfitting.
- If validation is strong and production is weak, investigate drift, skew, leakage, or bad split.
- If training may be interrupted for cost savings, use Managed Spot Training with checkpoints.
- If custom dependencies are required, consider custom containers, but check ECR/IAM/networking.
Deployment
- Need real-time synchronous predictions: SageMaker real-time endpoint.
- Need intermittent traffic without managing instances: serverless inference.
- Need large payload or long-running inference: asynchronous inference.
- Need offline scoring: batch transform.
- Need gradual rollout: production variants, canary, blue/green.
- Need to compare a new model without affecting responses: shadow testing.
Monitoring and operations
- Need input distribution checks: data quality monitoring.
- Need prediction performance checks: model quality monitoring with ground truth.
- Need bias/explainability: SageMaker Clarify.
- Need API/infrastructure metrics: CloudWatch.
- Need audit of AWS API activity: CloudTrail.
- Need automatic retraining: monitoring trigger plus pipeline orchestration.
Mini review tables for question practice
Inference selection table
| Latency | Workload | Best starting answer |
|---|---|---|
| Milliseconds/low latency | Continuous API traffic | Real-time endpoint |
| Low latency, occasional cold starts acceptable | Spiky or intermittent API traffic | Serverless inference |
| Minutes acceptable | Large files or long processing | Async inference |
| Hours acceptable | Large offline dataset | Batch transform |
Monitoring selection table
| Question clue | Best monitoring angle |
|---|---|
| “Input features differ from training baseline” | Data drift/data quality |
| “Accuracy decreased after deployment” | Model quality, ground truth labels |
| “Bias must be measured before and after deployment” | SageMaker Clarify |
| “Endpoint latency increased” | CloudWatch endpoint metrics |
| “Who changed the endpoint configuration?” | CloudTrail |
| “Need captured requests and responses” | SageMaker endpoint data capture |
Security selection table
| Question clue | Likely answer component |
|---|---|
| “No public internet access” | VPC endpoints/private networking |
| “Encrypted S3 objects cannot be read” | KMS key permissions or key policy |
| “Notebook has access keys in code” | IAM role/temporary credentials |
| “Need audit record of API calls” | CloudTrail |
| “Need secure database password retrieval” | Secrets Manager |
| “Training container cannot be pulled” | ECR permissions/network path |
How to use this with the question bank
Use this page first, then move into IT Mastery practice:
- Do topic drills for data preparation, model development, deployment, monitoring, and security.
- Review every detailed explanation, including questions you answered correctly.
- Tag missed questions by decision error, not just by service name.
- Re-drill weak areas until you can explain why the wrong options are wrong.
- Use mock exams only after you can consistently handle scenario tradeoffs.
Good review notes after a missed question should look like:
- “I chose real-time endpoint, but payload was large and processing was long; async inference was better.”
- “I optimized accuracy, but class imbalance made recall/F1 more appropriate.”
- “I selected IAM permissions, but the real issue was KMS key access.”
- “I chose retraining, but the immediate issue was training-serving skew.”
Final quick checklist before practice
Before starting a mock exam for AWS Certified Machine Learning Engineer – Associate (MLA-C01), confirm you can quickly answer:
- Which AWS service or feature handles each ML step: data preparation, training, deployment, monitoring, and orchestration?
- Which inference option matches each latency and traffic pattern?
- Which metric matches each business risk?
- How do you detect data drift, model quality degradation, bias, and infrastructure issues?
- How do IAM, KMS, VPC endpoints, CloudWatch, and CloudTrail fit into ML workloads?
- How do SageMaker Pipelines, Model Registry, and CI/CD support repeatable MLOps?
- What are the common causes of production model failure beyond endpoint availability?
Next step: start with MLA-C01 topic drills in the question bank, then use the detailed explanations to turn each missed scenario into a clear AWS service-selection rule.