MLA-C01 — AWS Certified Machine Learning Engineer – Associate Quick Reference
Compact AWS MLA-C01 reference for machine learning engineering: data prep, SageMaker training, deployment, MLOps, monitoring, and security decisions.
Exam identity and quick-use map
This independent Quick Reference supports preparation for AWS Certified Machine Learning Engineer – Associate (MLA-C01). It focuses on the AWS service choices, ML engineering workflows, security controls, deployment patterns, and troubleshooting distinctions that commonly drive scenario questions.
Use this page to answer: What AWS service or pattern should I choose, and why?
```mermaid
flowchart LR
    A[Data sources] --> B[Ingest and store]
    B --> C[Clean, label, transform]
    C --> D[Feature engineering]
    D --> E[Train or tune model]
    E --> F[Evaluate]
    F --> G{Meets criteria?}
    G -- No --> C
    G -- Yes --> H[Register and approve]
    H --> I[Deploy: real-time, async, batch, serverless]
    I --> J[Monitor: data, quality, bias, latency, errors]
    J --> K{Drift or degradation?}
    K -- Yes --> L[Retrain pipeline]
    L --> E
    K -- No --> I
```
High-yield AWS service selection
| Need in scenario | Prefer | Why it fits | Common trap |
|---|---|---|---|
| Durable landing zone for training data, artifacts, model outputs | Amazon S3 | Native integration with SageMaker, Glue, Athena, EMR, Redshift Spectrum | Do not store large training datasets only on notebook instance storage |
| Data catalog for files in S3 | AWS Glue Data Catalog | Central schema/catalog for Athena, Glue, EMR, Redshift Spectrum | Athena queries data; Glue Data Catalog stores metadata |
| Serverless SQL over S3 | Amazon Athena | Ad hoc queries without managing clusters | Not ideal for heavy ETL pipelines that need complex transforms |
| Serverless ETL, crawlers, Spark jobs | AWS Glue | Managed ETL and schema discovery | Use EMR when cluster-level control/custom big data stack is required |
| Custom big data processing frameworks | Amazon EMR | Managed Hadoop/Spark/Hive ecosystem with more configuration control | More operational responsibility than Glue |
| Data warehouse analytics | Amazon Redshift | Columnar analytics, BI, warehouse workloads | S3 + Athena is often enough for ad hoc lake queries |
| Streaming ingestion with custom consumers | Amazon Kinesis Data Streams | Low-latency streams and multiple consuming apps | Not the same as Firehose delivery |
| Managed streaming delivery to S3/Redshift/OpenSearch | Amazon Data Firehose | Minimal administration for delivery and buffering | Less control than Kinesis Data Streams |
| Kafka-compatible streaming | Amazon MSK | Managed Apache Kafka compatibility | Choose only when Kafka ecosystem compatibility matters |
| Human data labeling | Amazon SageMaker Ground Truth | Managed labeling workflows and workforces | For sensitive data, prefer private workforce controls |
| Reusable online/offline ML features | Amazon SageMaker Feature Store | Helps reduce training-serving skew | Do not duplicate feature logic in separate train and inference code |
| No-code/low-code ML exploration | Amazon SageMaker Canvas | Business-user model building and predictions | Production-grade MLOps still needs controlled pipelines and deployment |
| Managed notebook and ML IDE | Amazon SageMaker Studio | Development, experiments, pipelines, model registry integration | Notebook success does not equal reproducible pipeline |
| Managed training jobs | Amazon SageMaker Training | Scalable, repeatable training with containers, S3 inputs, IAM roles | Avoid training on notebook instances for production workflows |
| Hyperparameter search | SageMaker automatic model tuning | Runs multiple training jobs against objective metric | Do not tune against final test set |
| ML workflow orchestration | Amazon SageMaker Pipelines | ML-native steps, lineage, parameters, model registry integration | Use Step Functions for broader cross-service workflow orchestration |
| General workflow orchestration across AWS services | AWS Step Functions | Serverless state machines, retries, approvals, integrations | Less ML-specific lineage than SageMaker Pipelines |
| Model package approval and versioning | SageMaker Model Registry | Tracks model versions, metadata, approval state | S3 artifact alone is not governed deployment |
| Real-time hosted inference | SageMaker real-time endpoint | Persistent low-latency API endpoint | Idle endpoints can create unnecessary cost |
| Offline scoring of large datasets | SageMaker Batch Transform | No persistent endpoint; reads/writes S3 | Not for interactive request/response inference |
| Large payloads or longer inference times | SageMaker Asynchronous Inference | Queued requests, S3 outputs, scales endpoint capacity | Not true synchronous low-latency API behavior |
| Intermittent inference traffic | SageMaker Serverless Inference | No instance management for spiky/idle workloads | Consider cold starts and workload suitability |
| Many similar models behind one endpoint | SageMaker multi-model endpoint | Consolidates model hosting | Model load/cache behavior can affect latency |
| Foundation model API without managing model infrastructure | Amazon Bedrock | Managed access to foundation models, agents, guardrails, knowledge bases | Do not choose custom SageMaker training when managed FM API is enough |
| Custom ML containers | Amazon ECR + SageMaker | Bring your own algorithm or inference container | Container must satisfy SageMaker training/inference contracts |
| Logs, metrics, alarms | Amazon CloudWatch | Operational visibility for endpoints, training, pipelines | Model quality drift requires ML-specific monitoring too |
| API auditing | AWS CloudTrail | Who called what AWS API and when | CloudWatch logs are not a substitute for API audit trails |
| Sensitive data discovery in S3 | Amazon Macie | Finds and reports sensitive data | Macie does not replace IAM, KMS, or data access design |
Data engineering and preparation reference
Storage, catalog, and query decisions
| Pattern | Best fit | Exam cues |
|---|---|---|
| Raw/bronze data lake | S3 buckets with prefixes, encryption, lifecycle policies | “Store raw source data durably and cheaply” |
| Curated training dataset | S3 curated prefix, Parquet/CSV/RecordIO as appropriate | “Reusable prepared dataset for training jobs” |
| Schema discovery | Glue crawler + Glue Data Catalog | “Infer schema from files in S3” |
| SQL exploration | Athena | “Run SQL directly on S3 data” |
| Repeatable ETL | Glue ETL job or SageMaker Processing | Glue for general ETL; SageMaker Processing when tightly coupled to ML workflow |
| Distributed feature engineering | Glue, EMR, or SageMaker Processing with Spark | Choose based on required control and integration |
| Warehouse-to-ML source | Redshift unload/query integration, Data Wrangler, or direct connector | “Training from warehouse data” |
| Streaming features/events | Kinesis Data Streams, MSK, Data Firehose | Distinguish custom stream processing from managed delivery |
Data split and leakage traps
| Situation | Split strategy | Watch for |
|---|---|---|
| Independent and identically distributed tabular data | Random train/validation/test split | Fit preprocessing only on training data |
| Imbalanced classes | Stratified split | Accuracy may be misleading |
| Time series forecasting | Chronological split | Random split leaks future information |
| Same user/device/account appears many times | Group-aware split | Avoid same entity in train and test |
| Small dataset | Cross-validation if feasible | Keep final holdout untouched |
| Hyperparameter tuning | Train/validation or cross-validation | Test set is for final estimate only |
| Feature engineering before split | Usually unsafe; split first, then fit transforms on the training set | Scaling, imputation, encoding, and feature selection can leak test statistics |
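The split rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not a SageMaker API: the row layout (`ts`, `x`), the helper names, and the 80/20 and 20 percent defaults are all assumptions made for the example. It shows a chronological cut, scaling statistics learned from the training portion only, and a hash-based assignment that keeps every record of one entity on the same side of the split.

```python
import hashlib
from statistics import mean, pstdev


def chronological_split(rows, train_frac=0.8):
    """Sort by timestamp and cut once, so every test row is later than every training row."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def fit_scaler(values):
    """Learn mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero variance
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Apply previously fitted statistics; never refit on validation or test data."""
    return [(v - mu) / sigma for v in values]


def group_bucket(entity_id, test_pct=20):
    """Deterministically place an entity (user, device, account) in train or test."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    return "test" if int(digest, 16) % 100 < test_pct else "train"


# Hypothetical rows: timestamps arrive out of order.
rows = [{"ts": t, "x": float(t * 10)} for t in [5, 1, 4, 2, 3, 8, 7, 6, 10, 9]]
train, test = chronological_split(rows)
mu, sigma = fit_scaler([r["x"] for r in train])
test_scaled = apply_scaler([r["x"] for r in test], mu, sigma)
```

Hashing the entity ID, rather than shuffling rows, means a user who appears in later data still lands in the same split.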
Data quality checklist
| Check | Why it matters | AWS-oriented action |
|---|---|---|
| Missing values | Many algorithms cannot use nulls directly | Impute, add missing indicator, or filter |
| Outliers | Can dominate loss and scaling | Cap, transform, robust scaling, investigate source |
| Class imbalance | Optimizer may favor majority class | Resampling, class weights, threshold tuning, PR AUC/F1 |
| Label noise | Limits achievable accuracy | Ground Truth review, consensus labeling, quality audits |
| Duplicate rows | Can leak across splits | Deduplicate before splitting or group split |
| Skewed distributions | Affects linear models and distance methods | Log transform, normalization, robust scaling |
| High cardinality categoricals | Sparse and overfit-prone | Hashing, target encoding with care, embeddings |
| PII/sensitive data | Security and governance risk | Macie, IAM least privilege, KMS, tokenization/redaction |
Feature Store concepts
| Concept | Meaning | Exam relevance |
|---|---|---|
| Feature group | Named collection of feature definitions and records | Organizes reusable features |
| Offline store | Historical features, typically in S3 | Training, batch analytics, backfills |
| Online store | Low-latency feature lookup | Real-time inference |
| Event time | Timestamp associated with feature record | Correct point-in-time training data |
| Training-serving skew | Different feature logic or freshness between training and production | Feature Store and shared transformation code reduce risk |
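The role of event time is easiest to see in a point-in-time lookup. The sketch below is a plain-Python illustration of the idea, not the Feature Store API; the `(event_time, value)` tuple layout and the function name are assumptions. For each training label it returns the newest feature value that was already known at the label's timestamp, so later values cannot leak into training.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest value whose event_time <= as_of, or None if nothing was known yet.

    history: list of (event_time, value) tuples sorted by event_time.
    """
    times = [t for t, _ in history]
    idx = bisect_right(times, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical feature history for one entity.
history = [(1, 10.0), (5, 12.0), (9, 20.0)]
value_at_6 = point_in_time_value(history, 6)   # uses the record written at time 5
value_at_0 = point_in_time_value(history, 0)   # nothing known yet
```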
SageMaker development and training
Development environment choices
| Need | Choose | Notes |
|---|---|---|
| Full ML IDE and managed notebooks | SageMaker Studio | Useful for experiments, pipelines, registry, monitoring |
| Notebook-only experimentation | SageMaker notebook instances or Studio notebooks | Stop idle resources; not a production pipeline by itself |
| Business-user model building | SageMaker Canvas | Low-code predictions and exploration |
| Visual data prep | SageMaker Data Wrangler where available in the workflow | Useful for profiling, transforms, export to jobs/pipelines |
| Scripted reproducible processing | SageMaker Processing | Run preprocessing/evaluation containers at scale |
| Production training | SageMaker Training job | Isolated, repeatable, containerized, logged |
Training job anatomy
Recognize these knobs in scenario and configuration questions:
```yaml
TrainingJob:
  AlgorithmSpecification:
    TrainingImage: <ECR image or built-in algorithm>
    TrainingInputMode: File | FastFile | Pipe
  RoleArn: <SageMaker execution role>
  InputDataConfig:
    - ChannelName: train
      DataSource: s3://bucket/prefix/train/
    - ChannelName: validation
      DataSource: s3://bucket/prefix/validation/
  OutputDataConfig:
    S3OutputPath: s3://bucket/prefix/model-artifacts/
    KmsKeyId: <optional KMS key>
  ResourceConfig:
    InstanceType: <training instance type>
    InstanceCount: <count>
    VolumeKmsKeyId: <optional KMS key>
  HyperParameters:
    objective: binary:logistic
  VpcConfig:
    Subnets: [private-subnet]
    SecurityGroupIds: [sg-id]
  StoppingCondition:
    MaxRuntimeInSeconds: <limit>
```
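The outline above is schematic. As a hedged sketch, the same knobs can be assembled into a request dictionary in the shape the SageMaker `CreateTrainingJob` API expects, where each channel nests its S3 location under `S3DataSource` and hyperparameter values are strings. The helper name, the default instance type, the 30 GB volume size, and the one-hour runtime limit are assumptions for illustration. The function only builds the dictionary and does not call AWS.

```python
def build_training_request(job_name, image_uri, role_arn, bucket, prefix,
                           hyperparameters=None, subnets=None,
                           security_group_ids=None, kms_key_id=None,
                           instance_type="ml.m5.xlarge", instance_count=1,
                           max_runtime_seconds=3600):
    """Assemble a CreateTrainingJob-style request dict without calling AWS."""

    def channel(name):
        # One input channel per dataset split, read from an S3 prefix.
        return {
            "ChannelName": name,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{prefix}/{name}/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }

    request = {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [channel("train"), channel("validation")],
        "OutputDataConfig": {
            "S3OutputPath": f"s3://{bucket}/{prefix}/model-artifacts/",
        },
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,  # assumed size for the example
        },
        # The API expects string values for hyperparameters.
        "HyperParameters": {k: str(v) for k, v in (hyperparameters or {}).items()},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }
    if kms_key_id:
        request["OutputDataConfig"]["KmsKeyId"] = kms_key_id
        request["ResourceConfig"]["VolumeKmsKeyId"] = kms_key_id
    if subnets and security_group_ids:
        request["VpcConfig"] = {
            "Subnets": subnets,
            "SecurityGroupIds": security_group_ids,
        }
    return request
```

With credentials configured, a dictionary like this could be passed as `boto3.client("sagemaker").create_training_job(**request)`. All ARNs, image URIs, and bucket names are supplied by the caller.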
Training input modes and data access
| Mode/source | Best fit | Trap |
|---|---|---|
| File mode | Common default; data copied from S3 to training volume | Startup can be slower for very large data |
| FastFile mode | S3 data exposed with file-like access where supported | Confirm algorithm/framework support |
| Pipe mode | Streams data to algorithm where supported | Container/algorithm must support streaming |
| Amazon FSx for Lustre | High-performance distributed file access | More setup than simple S3 inputs |
| Amazon EFS | Shared file system across instances | Consider throughput and access pattern |
| Checkpoints to S3 | Long or interruptible training jobs | Needed to resume rather than restart from scratch |
Container and algorithm choices
| Choice | Use when | Notes |
|---|---|---|
| Built-in SageMaker algorithm | Standard algorithm fits problem | Less container work, optimized integration |
| Framework estimator/script mode | TensorFlow, PyTorch, scikit-learn, XGBoost scripts | Bring training script; SageMaker manages job |
| Custom Docker container | Need custom runtime, dependencies, algorithm, or inference stack | Must follow SageMaker container conventions |
| Bring your own model artifact | Model already trained elsewhere | Package with compatible inference container |
| Amazon ECR image | Custom training/inference image | Execution role needs pull permissions |
Built-in algorithm selection cues
| Problem cue | Likely algorithm family | Notes |
|---|---|---|
| Tabular classification/regression with nonlinear patterns | XGBoost | Strong default for structured data |
| Large-scale linear classification/regression | Linear Learner | Works well for sparse/high-dimensional linear problems |
| Recommendation or sparse feature interactions | Factorization Machines | Common for user-item sparse matrices |
| Clustering without labels | K-Means | Unsupervised segmentation |
| Anomaly detection in numeric/time-series-like data | Random Cut Forest | Detects unusual observations |
| Text classification or word embeddings | BlazingText | Text-focused built-in option |
| Forecasting multiple related time series | DeepAR | Uses historical time series patterns |
| Image classification/detection | Image Classification, Object Detection, or framework model | Often use transfer learning or pretrained models |
| Custom deep learning architecture | PyTorch/TensorFlow on SageMaker | Use framework estimator or custom container |
Hyperparameter tuning
| Element | What to know |
|---|---|
| Objective metric | Metric to maximize or minimize; must be emitted by training job |
| Search space | Ranges or categorical values for hyperparameters |
| Early stopping | Stops weak jobs when supported/appropriate |
| Validation set | Used to compare tuning jobs |
| Final test set | Held out until final evaluation |
| Overfitting risk | More tuning can overfit validation data |
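The "must be emitted by training job" requirement usually comes down to a log line and a regular expression that captures the number from it. The sketch below imitates that extraction locally. The metric definition uses the `Name`/`Regex` shape of SageMaker metric definitions, but the log lines, the metric name, and the `extract_metrics` helper are hypothetical. If the regex does not match what the script prints, the tuner has no objective value to compare.

```python
import re

# Same Name/Regex shape as a SageMaker metric definition; the first capture group is the value.
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation-auc:([0-9\.]+)"},
]


def extract_metrics(log_lines, definitions):
    """Collect every value each regex captures from the training log lines."""
    found = {}
    for line in log_lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found.setdefault(definition["Name"], []).append(float(match.group(1)))
    return found


# Hypothetical training log output.
logs = [
    "epoch=1 train-auc:0.81 validation-auc:0.78",
    "epoch=2 train-auc:0.90 validation-auc:0.84",
    "training complete",
]
metrics = extract_metrics(logs, metric_definitions)
```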
Model evaluation metrics
Confusion matrix terms
| Term | Meaning |
|---|---|
| TP | Predicted positive and actually positive |
| FP | Predicted positive but actually negative |
| TN | Predicted negative and actually negative |
| FN | Predicted negative but actually positive |
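As a worked example, the four counts above turn into the usual metrics as follows. The counts are made up for illustration (a rare positive class, 90 positives in 1,000 rows), and the function name is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced problem.
scores = classification_metrics(tp=80, fp=20, tn=890, fn=10)
```

Here accuracy is 0.97 while precision is 0.80 and recall is about 0.89, which is why accuracy alone can look better than the model behaves on the positive class.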
Metric selection table
| Task or risk | Prefer | Avoid over-relying on |
|---|---|---|
| Balanced classification | Accuracy, ROC AUC, F1 | Accuracy alone if costs differ |
| Rare positive class | Precision, recall, F1, PR AUC | Accuracy and sometimes ROC AUC |
| False negatives are costly | Recall/sensitivity | Precision alone |
| False positives are costly | Precision | Recall alone |
| Probabilistic classification | Log loss, calibration | Only thresholded accuracy |
| Regression with large-error penalty | RMSE | MAE if large errors must be emphasized |
| Regression with robust typical error | MAE | RMSE if outliers dominate unfairly |
| Forecasting | MAE, RMSE, MAPE/sMAPE where valid | MAPE when actual values can be zero |
| Ranking/recommendation | NDCG, MAP, precision@k/recall@k | Generic classification accuracy |
| Clustering | Silhouette score, within-cluster sum of squares | Supervised metrics without labels |
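The regression rows can be checked with a small calculation. In this sketch (made-up values, assumed function names) one large miss moves RMSE much more than MAE, and MAPE refuses to run when an actual value is zero.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: the typical error size, robust to a few large misses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: squares errors first, so large misses dominate."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; undefined when any actual value is zero."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: three small errors and one large miss.
actual = [100, 100, 100, 100]
predicted = [90, 110, 100, 60]
results = {
    "mae": mae(actual, predicted),
    "rmse": rmse(actual, predicted),
    "mape": mape(actual, predicted),
}
```

For these values MAE is 15 and RMSE is about 21.2. The gap between them comes entirely from the single 40-unit error.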
Evaluation traps
| Trap | Correct reasoning |
|---|---|
| “High accuracy” on imbalanced data | Check confusion matrix, recall, precision, F1, PR AUC |
| Tuning threshold on test set | Tune threshold on validation set; reserve test set |
| Random split for time series | Use chronological split |
| Preprocessing entire dataset before split | Fit transforms on train only, apply to validation/test |
| Comparing models with different test data | Use the same holdout or controlled cross-validation |
| Better offline metric but worse production results | Investigate data drift, training-serving skew, latency/timeouts, feature freshness |
Deployment and inference patterns
Inference mode decision matrix
| Requirement | Choose | Why | Watch for |
|---|---|---|---|
| Low-latency request/response | SageMaker real-time endpoint | Persistent HTTPS endpoint | Scale and monitor latency/errors |
| Spiky or intermittent traffic | SageMaker Serverless Inference | No instance management | Cold start and workload suitability |
| Large payloads or long processing | SageMaker Asynchronous Inference | Queued async invocation, S3 output | Client does not wait synchronously |
| Offline batch scoring | SageMaker Batch Transform | Reads S3 input, writes S3 output | No always-on endpoint |
| Many tenant- or segment-specific models | Multi-model endpoint | Hosts multiple models behind one endpoint | Initial model load can add latency |
| Multiple containers in one endpoint | Multi-container endpoint | Direct or serial container invocation patterns | Not the same as multi-model hosting |
| Edge or disconnected inference | AWS IoT Greengrass or device runtime pattern | Local inference near data source | Model update and device security matter |
| Lightweight model behind app API | AWS Lambda plus API Gateway, if suitable | Simple serverless app integration | Not ideal for large models/heavy inference |
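The SageMaker rows of the matrix can be read as a short decision procedure. The sketch below is one possible ordering of the questions, not an official rule. The function name, the flag names, and the priority order are assumptions. It leaves out the multi-container, edge, and Lambda rows.

```python
def choose_inference_mode(online_requests,
                          large_payload_or_long_running=False,
                          many_models=False,
                          intermittent_traffic=False):
    """Map scenario cues to a SageMaker inference option (illustrative ordering only)."""
    if not online_requests:
        # Offline scoring of data already in S3: no persistent endpoint needed.
        return "Batch Transform"
    if large_payload_or_long_running:
        # Requests are queued and results land in S3; the client does not wait.
        return "Asynchronous Inference"
    if many_models:
        return "Multi-model endpoint"
    if intermittent_traffic:
        return "Serverless Inference"
    return "Real-time endpoint"


nightly_scoring = choose_inference_mode(online_requests=False)
spiky_api = choose_inference_mode(online_requests=True, intermittent_traffic=True)
```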
Deployment controls
| Control | Use for | Notes |
|---|---|---|
| Production variants | Traffic splitting across model variants | Supports A/B style testing |
| Shadow variant | Test new model on production traffic without serving its response | Useful before promotion |
| Canary/linear rollout pattern | Gradual production traffic shift | Pair with CloudWatch alarms and rollback |
| Auto scaling | Adjust endpoint capacity based on demand | Monitor latency, invocation volume, errors |
| Data capture | Store inference inputs/outputs in S3 | Required for many monitoring workflows |
| Model Registry approval | Gate promotion to staging/prod | Supports governance and reproducibility |
| Inference Recommender | Evaluate hosting instance/config options | Use when unsure about performance/cost tradeoff |
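Traffic splitting across production variants is driven by relative weights: each variant receives its weight divided by the sum of all weights. The sketch below computes those shares for a hypothetical 90/10 test. `VariantName` and `InitialVariantWeight` are the endpoint-configuration field names; the variant names and the helper are assumptions.

```python
def traffic_shares(variants):
    """Return each variant's fraction of traffic: weight / sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Hypothetical A/B setup: the current model keeps most of the traffic.
variants = [
    {"VariantName": "current-model", "InitialVariantWeight": 9.0},
    {"VariantName": "candidate-model", "InitialVariantWeight": 1.0},
]
shares = traffic_shares(variants)
```

Because the weights are relative, 9 and 1 give the same split as 0.9 and 0.1.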
SageMaker inference container contract
| Endpoint | Purpose |
|---|---|
| /ping | Health check |
| /invocations | Inference requests |
Common container issues: wrong content type, missing dependencies, slow model load, model artifact path mismatch, container not listening correctly, memory exhaustion, or IAM denial when pulling ECR image or reading S3 artifact.
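A minimal sketch of the contract, using only the Python standard library, is shown below. It assumes the container listens on port 8080, answers `GET /ping` with HTTP 200, and accepts `POST /invocations`. The JSON payload shape (`{"features": [...]}`) and the placeholder "model" that sums the features are hypothetical. Production containers normally use a proper model server rather than `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def route(method, path, content_type, body):
    """Pure routing logic, kept separate so it can be tested without a socket."""
    if method == "GET" and path == "/ping":
        return 200, b""  # healthy
    if method == "POST" and path == "/invocations":
        if content_type != "application/json":
            return 415, b'{"error": "unsupported content type"}'
        try:
            features = json.loads(body)["features"]
            prediction = sum(features)  # placeholder for a real model call
        except (ValueError, KeyError, TypeError):
            return 400, b'{"error": "bad payload"}'
        return 200, json.dumps({"prediction": prediction}).encode("utf-8")
    return 404, b""


class InferenceHandler(BaseHTTPRequestHandler):
    def _respond(self, method):
        length = int(self.headers.get("Content-Length") or 0)
        body = self.rfile.read(length) if length else b""
        status, out = route(method, self.path, self.headers.get("Content-Type"), body)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def do_GET(self):
        self._respond("GET")

    def do_POST(self):
        self._respond("POST")


def serve(port=8080):
    """Start the server; call this from the container entrypoint."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

The 415 and 400 branches correspond to the "wrong content type" and payload problems listed above. A 5xx would instead come from an unhandled exception inside the model call.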
Generative AI and foundation model choices
| Scenario | Prefer | Reasoning |
|---|---|---|
| Use managed foundation models through API | Amazon Bedrock | Avoids managing model infrastructure |
| Need guardrails for FM application behavior | Guardrails for Amazon Bedrock | Central control for safety and policy behavior |
| Need RAG over enterprise documents | Knowledge Bases for Amazon Bedrock or custom RAG stack | Retrieves current private context instead of retraining model for facts |
| Need agents that call tools/APIs | Agents for Amazon Bedrock | Orchestrates tasks with FM reasoning and actions |
| Need to deploy or tune an open or pretrained model in a SageMaker environment | SageMaker JumpStart or SageMaker hosting | More control over model/container/VPC/MLOps |
| Need custom model architecture/training loop | SageMaker custom training | Full control, more engineering responsibility |
| Need semantic search | Embeddings + vector store such as Amazon OpenSearch Service/OpenSearch Serverless, Aurora PostgreSQL with vector support, or managed Bedrock knowledge base | Match text by meaning, not exact keywords |
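"Match text by meaning" means comparing embedding vectors rather than keywords. The toy sketch below ranks documents by cosine similarity. The three-dimensional vectors and document IDs are hand-written placeholders, not output from any embedding model. A real system would get embeddings from a model and store them in a vector index such as the ones named above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=2):
    """Return the k document IDs whose vectors are most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical embeddings: the first two documents are about similar topics.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "vpn-setup": [0.0, 0.2, 0.9],
    "returns-faq": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
matches = top_k(query, index)
```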
Prompt, RAG, fine-tuning, or training?
| Need | Usually choose | Why |
|---|---|---|
| Change output format, tone, instructions | Prompt engineering | Fastest and lowest operational complexity |
| Use private or frequently changing facts | RAG | Keeps knowledge external and updateable |
| Improve behavior on repeated task pattern | Fine-tuning/customization where supported | Teaches task style or domain pattern |
| Add brand-new domain facts only | RAG first | Fine-tuning is not a reliable database |
| Build specialized model from scratch | Custom training | Highest cost/complexity; use only when necessary |
MLOps and automation
Pipeline stages to recognize
| Stage | SageMaker/AWS service fit | Key artifacts |
|---|---|---|
| Ingest | S3, Kinesis, Data Firehose, DMS | Raw data |
| Validate/profile | Glue Data Quality, SageMaker Processing, Data Wrangler | Data reports, constraints |
| Transform | Glue, EMR, SageMaker Processing | Curated dataset, features |
| Train | SageMaker Training | Model artifact, metrics |
| Tune | SageMaker automatic model tuning | Best training job, hyperparameters |
| Evaluate | SageMaker Processing or pipeline evaluation step | Evaluation report |
| Conditional gate | SageMaker Pipelines condition step | Pass/fail metric rule |
| Register | SageMaker Model Registry | Model package/version |
| Approve | Manual or automated approval workflow | Approval state |
| Deploy | SageMaker endpoint, Batch Transform, CI/CD pipeline | Endpoint or batch job |
| Monitor | Model Monitor, Clarify, CloudWatch | Drift reports, alarms |
| Retrain | EventBridge, Pipelines, Step Functions | New model version |
SageMaker Pipelines pattern
```text
Process raw data
  -> Train model
  -> Evaluate metrics
  -> If metric passes threshold:
       Register model package
       Optionally deploy to staging
     Else:
       Stop and record failure
```
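The control flow of this pattern can be sketched in plain Python. This is not the SageMaker Pipelines SDK, where the branch is expressed as a condition step. The step functions are passed in as callables, and the names, the `auc` metric, and the stub steps are hypothetical.

```python
def run_pipeline(process, train, evaluate, register, threshold, metric="auc"):
    """Run process -> train -> evaluate, then register only if the metric passes."""
    dataset = process()
    model = train(dataset)
    report = evaluate(model, dataset)
    if report[metric] >= threshold:
        return {"status": "registered", "model_package": register(model, report)}
    # Failing the gate stops the run and records why.
    return {"status": "failed", "reason": f"{metric}={report[metric]} below {threshold}"}


# Stub steps standing in for processing, training, evaluation, and registration jobs.
def _process():
    return [1, 2, 3]


def _train(dataset):
    return {"weights": sum(dataset)}


def _evaluate(model, dataset):
    return {"auc": 0.87}


def _register(model, report):
    return "model-package-v1"


passed = run_pipeline(_process, _train, _evaluate, _register, threshold=0.85)
failed = run_pipeline(_process, _train, _evaluate, _register, threshold=0.90)
```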
CI/CD and governance distinctions
| Need | Prefer | Notes |
|---|---|---|
| Version infrastructure | AWS CloudFormation or AWS CDK | Reproducible environments |
| Build/test custom containers | AWS CodeBuild + Amazon ECR | Scan and control images |
| Orchestrate release stages | AWS CodePipeline or equivalent CI/CD | Separate dev/test/prod |
| Trigger pipeline on data or approval event | Amazon EventBridge | Event-driven retraining/deployment |
| Human approval | CodePipeline approval, Step Functions, or registry approval process | Useful before production changes |
| Track experiments | SageMaker Experiments | Parameters, metrics, artifacts, lineage |
| Reproduce training | Pin code, image, dependencies, data version, hyperparameters, random seeds | Not just “rerun notebook” |
Security, privacy, and governance
IAM and access patterns
| Control | Exam-ready meaning |
|---|---|
| SageMaker execution role | Role assumed by SageMaker jobs/endpoints to access S3, ECR, CloudWatch, KMS, VPC resources |
| Least privilege | Restrict actions and resource ARNs, especially S3 prefixes and KMS keys |
| IAM user/role separation | Human identity starts jobs; execution role is used by managed service |
| Resource policies | S3 bucket policies, KMS key policies, ECR repository policies may also be required |
| Temporary credentials | Prefer IAM roles over long-lived static keys |
| Secrets Manager | Store database/API credentials; do not hardcode in notebooks or containers |
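Least privilege for an execution role typically means naming the exact S3 prefix and KMS key rather than using wildcards. The sketch below builds such a policy document as a Python dictionary. The function name and statement IDs are assumptions, and the policy is deliberately partial: a real execution role also needs permissions for writing artifacts, pulling the ECR image, and CloudWatch logging.

```python
import json


def execution_role_policy(bucket, prefix, kms_key_arn):
    """Build a read-only IAM policy scoped to one S3 prefix and one KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTrainingPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "ListTrainingPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "DecryptWithDataKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": kms_key_arn,
            },
        ],
    }


# Hypothetical bucket, prefix, and key ARN.
policy = execution_role_policy(
    "example-ml-bucket",
    "curated/train",
    "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
)
policy_json = json.dumps(policy, indent=2)
```

The KMS statement matters because reading an SSE-KMS object requires both the S3 permission and decrypt permission on the key. The key policy must also allow the role.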
Network and encryption controls
| Requirement | Use | Notes |
|---|---|---|
| Encrypt S3 training data/artifacts | S3 server-side encryption with AWS KMS where required | Execution role needs KMS permissions |
| Encrypt training/inference volumes | KMS key options where supported | Include key policy permissions |
| Private training/inference network path | VPC configuration with private subnets/security groups | Ensure access to S3/ECR/CloudWatch through endpoints or controlled egress |
| No internet access from training container | Network isolation where appropriate | Container cannot fetch packages from internet |
| Private AWS service access | VPC endpoints/AWS PrivateLink where supported | Avoid public internet routes |
| Audit API calls | CloudTrail | Who changed endpoint, role, pipeline, bucket, key |
| Monitor logs/metrics | CloudWatch | Operational visibility |
| Detect sensitive data in S3 | Macie | Complements, not replaces, access controls |
| Govern data lake permissions | AWS Lake Formation | Centralized lake permissions over cataloged data |
Security traps
| Trap | Correct answer direction |
|---|---|
| AccessDenied from training job despite user access | Check SageMaker execution role, bucket policy, KMS key policy |
| Private subnet job cannot pull image or read S3 | Add required VPC endpoints or controlled NAT path |
| KMS-encrypted S3 object unreadable | Execution role needs both S3 read and KMS decrypt permissions, and the key policy must allow the role |
| Secret passed as plain environment variable | Use Secrets Manager or secure parameter retrieval |
| Public notebook or endpoint exposure | Use IAM, VPC, security groups, private access, and least privilege |
| Sensitive labeling data | Use private workforce and secure data access controls |
Monitoring, observability, and troubleshooting
What to monitor
| Layer | Tool/service | Signals |
|---|---|---|
| Endpoint operations | CloudWatch metrics/logs | Invocations, latency, errors, resource utilization |
| Training jobs | CloudWatch logs, SageMaker job status | Script errors, metric output, resource failures |
| API activity | CloudTrail | Create/update/delete endpoint, IAM, S3, KMS API calls |
| Input/output drift | SageMaker Model Monitor data quality | Feature distribution changes |
| Model performance | SageMaker Model Monitor model quality | Requires ground truth labels |
| Bias drift | SageMaker Clarify / Model Monitor integration | Bias metric changes over time |
| Explainability drift | Clarify feature attribution monitoring | Feature importance changes |
| Data capture | SageMaker endpoint data capture to S3 | Inputs/outputs for monitoring and analysis |
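To make "feature distribution changes" concrete, the sketch below compares a baseline sample with captured production values using the population stability index (PSI). PSI is a generic drift statistic chosen here for illustration. It is not presented as the calculation Model Monitor performs, and the bin edges, sample data, and the 0.2 alert level (a common rule of thumb) are assumptions.

```python
import math


def psi(baseline, current, edges):
    """Population stability index between two samples over shared bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin v falls into
        # Floor at a tiny value so empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    base, cur = proportions(baseline), proportions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


# Hypothetical feature values: production data shifted upward by 50.
baseline = list(range(100))
captured = [v + 50 for v in baseline]
edges = [25, 50, 75]

no_drift = psi(baseline, baseline, edges)
drift = psi(baseline, captured, edges)
alert = drift > 0.2  # assumed rule-of-thumb threshold
```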
Troubleshooting decision table
| Symptom | Likely checks |
|---|---|
| Training job cannot access data | Execution role, S3 URI, bucket policy, KMS key policy, VPC endpoint |
| Training job starts but algorithm fails | Input format, content type, channel names, hyperparameters, script error |
| Metrics not visible for tuning | Training script must emit metric matching tuning regex/definition |
| Endpoint creation fails | Model artifact path, container image, IAM/ECR access, model load errors |
| Endpoint returns 4xx | Request format, content type, authentication, payload schema |
| Endpoint returns 5xx | Container logs, model exception, memory/timeout, dependency error |
| Latency increases | Instance sizing, concurrency, autoscaling, payload size, cold starts, model size |
| Production accuracy drops | Data drift, label drift, feature skew, upstream schema change, stale features |
| Model Monitor has no quality report | Ground truth labels may be missing or delayed |
| Costs unexpectedly high | Idle endpoints/notebooks, overprovisioned instances, unnecessary always-on hosting |
| Batch job slow | Input sharding, data format, instance choice, transform strategy |
| Pipeline did not trigger | EventBridge rule, permissions, source event pattern, pipeline parameters |
Cost-aware engineering choices
| Cost pressure | Practical pattern |
|---|---|
| Idle development environments | Stop notebooks/Studio apps when unused; use lifecycle controls where appropriate |
| Always-on endpoint with rare traffic | Consider Serverless Inference, Asynchronous Inference, or Batch Transform |
| Many small models | Consider multi-model endpoints |
| Large recurring batch scoring | Use Batch Transform and right-size compute |
| Long training jobs | Use checkpoints; consider managed spot training where suitable |
| Overtraining | Use early stopping and sensible tuning search spaces |
| Duplicate feature computation | Reuse Feature Store and shared processing jobs |
| Unused artifacts/logs | Apply S3 lifecycle policies and retention controls |
| Inefficient data format | Prefer columnar/compressed formats such as Parquet for analytics workloads |
Scenario shortcuts
| If the stem says… | Likely answer | Why |
|---|---|---|
| “Run SQL on files in S3 without managing servers” | Athena + Glue Data Catalog | Serverless query over data lake |
| “Infer schema from new S3 data” | Glue crawler | Populates catalog metadata |
| “Large-scale ETL with serverless Spark” | AWS Glue | Managed ETL |
| “Need full Spark cluster configuration control” | EMR | More control than Glue |
| “Label images with human reviewers” | SageMaker Ground Truth | Managed labeling |
| “Avoid different feature code in training and inference” | SageMaker Feature Store | Reduces training-serving skew |
| “Train model reproducibly at scale” | SageMaker Training job | Managed, containerized, repeatable |
| “Find best hyperparameters automatically” | SageMaker automatic model tuning | Searches parameter space |
| “Track parameters, metrics, and artifacts” | SageMaker Experiments | Experiment lineage |
| “Approve model before production” | SageMaker Model Registry | Model package governance |
| “Deploy for millisecond-style request/response” | Real-time endpoint | Persistent inference |
| “Score millions of records nightly” | Batch Transform | Offline batch predictions |
| “Requests can take longer and response can be stored in S3” | Asynchronous Inference | Queued async processing |
| “Traffic is unpredictable and often idle” | Serverless Inference | No instance management |
| “Compare new model silently on production traffic” | Shadow variant | Does not affect user response |
| “Detect input feature distribution drift” | Model Monitor data quality | Baseline vs captured data |
| “Detect accuracy degradation after labels arrive” | Model Monitor model quality | Needs ground truth |
| “Who changed the endpoint configuration?” | CloudTrail | API audit |
| “Endpoint has high 5xx errors” | CloudWatch logs + container diagnostics | Operational troubleshooting |
| “Use foundation model without hosting it” | Amazon Bedrock | Managed FM API |
| “Add current company documents to FM answers” | RAG / Knowledge Bases for Amazon Bedrock | Retrieves external knowledge |
| “Sensitive S3 training data may contain PII” | Macie + IAM/KMS controls | Discovery plus protection |
| “Private training with no internet” | VPC config, endpoints, network isolation | Controlled network path |
Final review checklist
- Map every scenario to the lifecycle step: data, features, training, evaluation, deployment, monitoring, or governance.
- Distinguish Athena vs Glue vs EMR, Pipelines vs Step Functions, and real-time vs async vs batch vs serverless inference.
- For security questions, check execution role, S3 policy, KMS policy, VPC path, and CloudTrail.
- For model quality questions, identify whether the issue is data quality, drift, bias, feature skew, evaluation metric choice, or deployment configuration.
- For MLOps questions, prefer repeatable jobs, tracked artifacts, model registry approval, automated deployment, and monitoring-triggered retraining over manual notebook workflows.
Next step: use this Quick Reference as a drill sheet, then practice scenario questions that force you to choose the correct AWS service, deployment mode, monitoring control, or security fix for MLA-C01.