Try 65 free AWS MLA-C01 questions across the exam domains, with explanations, then continue with full IT Mastery practice.
This free full-length AWS MLA-C01 practice exam includes 65 original IT Mastery questions across the exam domains.
These questions are for self-assessment. They are not official exam questions and do not imply affiliation with the exam sponsor.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Need concept review first? Read the AWS MLA-C01 Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks and full IT Mastery practice.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Domain | Weight |
|---|---|
| Data Preparation for Machine Learning (ML) | 28% |
| ML Model Development | 26% |
| Deployment and Orchestration of ML Workflows | 22% |
| ML Solution Monitoring, Maintenance, and Security | 24% |
Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.
Topic: Data Preparation for Machine Learning (ML)
In an Amazon SageMaker training workflow, you apply feature transformations such as normalization, log transforms, and binning. Which statement is INCORRECT (unsafe) about applying these transformations to improve model performance and stability?
Options:
A. Use a log transform to reduce extreme right-skew in a feature.
B. Fit the scaler on training data and reuse it for inference.
C. Bin a continuous feature to reduce outlier sensitivity.
D. Compute scaling statistics using all splits to ensure consistency.
Best answer: D
Explanation: Scaling parameters (for example, mean and standard deviation) must be learned only from the training set and then applied unchanged to validation/test and inference. Computing them using all splits introduces data leakage, which can make offline evaluation look better than it will in production and can destabilize retraining comparisons. The other statements describe standard, generally safe transformations and their typical benefits.
The core issue is data leakage: any transformation that learns parameters from data (normalization, standardization, target encoding, learned embeddings, etc.) must be fit on the training split only. If you compute scaling statistics using validation/test data, the model indirectly “sees” distribution information it would not have at training time, which can inflate evaluation results and make performance look less stable across retrains.
A safe SageMaker pattern is to fit transformation parameters in a preprocessing step on the training split only, persist the fitted artifacts (for example, to Amazon S3), and apply them unchanged to validation, test, and inference data. Log transforms often stabilize skewed features, and binning can reduce sensitivity to outliers (with a potential loss of granularity).
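A minimal scikit-learn sketch of this leakage-safe pattern (column names and file paths are illustrative):

```python
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")   # assumed training split
test = pd.read_csv("test.csv")     # assumed held-out split
numeric_cols = ["amount", "age"]   # illustrative feature names

# Fit scaling statistics on the training split ONLY
scaler = StandardScaler().fit(train[numeric_cols])

# Apply the same fitted statistics to every other split and at inference
train[numeric_cols] = scaler.transform(train[numeric_cols])
test[numeric_cols] = scaler.transform(test[numeric_cols])

# Persist the fitted artifact so the identical transform is reused at inference time
joblib.dump(scaler, "scaler.joblib")
```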
Topic: Deployment and Orchestration of ML Workflows
A company trained a PyTorch transformer for real-time text classification in Amazon SageMaker. The inference code requires a proprietary, compiled tokenizer library and OS packages that cannot be installed at startup.
Deployment requirements: a real-time endpoint running inside a VPC with no outbound internet access, a private and auditable container image, and autoscaling that meets the p95 latency and throughput targets.
Which deployment approach BEST meets these requirements?
Options:
A. Run the model on Amazon EKS using a Kubernetes Deployment
B. Build a custom container, push to ECR, deploy to SageMaker endpoint
C. Use SageMaker Batch Transform with the built-in PyTorch container
D. Use the SageMaker Hugging Face container with pip installs at startup
Best answer: B
Explanation: Because the endpoint has no outbound internet and needs proprietary compiled dependencies, the inference environment must be fully self-contained. Packaging the required OS libraries and tokenizer into a custom Docker image in Amazon ECR satisfies the private, auditable image requirement. That image can then be deployed to an autoscaling SageMaker real-time endpoint inside the VPC to meet the latency SLA.
The core decision is whether a SageMaker provided container can satisfy the runtime requirements without modification. When inference needs proprietary binaries, custom system packages, or nonstandard runtime components that cannot be safely installed at startup (especially with no outbound internet), you should use a bring-your-own container (BYOC).
In this scenario, a custom image built with the compiled tokenizer and required OS packages can be stored in private Amazon ECR for auditability and deployed as a SageMaker real-time endpoint with VPC configuration and autoscaling to meet the p95 latency and throughput requirements. SageMaker provided deep learning containers are best when your dependencies can be expressed through supported mechanisms (and network access/policy allows installation).
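A hedged boto3 sketch of deploying a custom ECR image to a VPC-attached real-time endpoint (names, ARNs, subnet/security-group IDs, and instance sizing are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_model(
    ModelName="text-classifier-byoc",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecRole",  # placeholder role
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/text-clf:1.0",  # custom image in private ECR
        "ModelDataUrl": "s3://example-bucket/model/model.tar.gz",               # placeholder artifact
    },
    VpcConfig={"Subnets": ["subnet-aaaa"], "SecurityGroupIds": ["sg-bbbb"]},    # private subnets, no internet route
)

sm.create_endpoint_config(
    EndpointConfigName="text-classifier-cfg",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "text-classifier-byoc",
        "InstanceType": "ml.g4dn.xlarge",
        "InitialInstanceCount": 2,
    }],
)

sm.create_endpoint(EndpointName="text-classifier", EndpointConfigName="text-classifier-cfg")
```

Autoscaling would then be registered on the endpoint variant to meet the latency SLA during traffic peaks.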
Topic: ML Model Development
A team runs a PyTorch training job on Amazon SageMaker using instance type ml.g4dn.xlarge (the instance type cannot change due to budget controls). The job fails with different errors across retries.
Exhibit: CloudWatch log excerpts
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB
pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields
WARNING: loss is NaN at step 430
Which THREE actions are the most appropriate mitigations? (Select THREE.)
Options:
A. Switch to an instance type with more GPU memory
B. Enable SageMaker Managed Spot Training
C. Add a preprocessing validation/cleaning step before training
D. Increase the training job EBS volume size
E. Reduce batch size or use gradient accumulation
F. Lower the learning rate and add gradient clipping
Correct answers: C, E and F
Explanation: The errors indicate three distinct failure modes: GPU out-of-memory, input parsing errors, and unstable optimization (NaN loss). The best mitigations are to reduce GPU memory pressure in the training loop, validate/clean input data before it reaches the training container, and adjust optimization settings to improve numerical stability and convergence.
Troubleshooting training failures in SageMaker starts by mapping the error to the resource or pipeline stage that causes it. CUDA OOM is almost always driven by model size and per-step memory (batch size/sequence length/activations), so you mitigate by reducing memory use in the training code rather than changing unrelated storage settings. Parser errors indicate malformed or inconsistent input (schema/field count/quoting), which is best handled by adding an explicit validation/cleaning step (for example, a SageMaker Processing job or AWS Glue job) that produces a known-good dataset for training. NaN loss typically indicates divergence or numeric instability, so mitigation is to adjust optimizer hyperparameters (lower learning rate) and add stabilizers such as gradient clipping.
Key takeaway: pick mitigations that directly address the failure signal (GPU memory, data validity, optimization stability) and fit stated constraints.
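A short PyTorch sketch of mitigations E and F (gradient accumulation to cut per-step GPU memory, plus a lower learning rate and gradient clipping); the tiny model and synthetic data are stand-ins so the pattern runs end to end:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data for illustration
model = nn.Linear(16, 2)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8)
criterion = nn.CrossEntropyLoss()

accum_steps = 4                                             # effective batch = 8 * 4 with 1/4 the activation memory
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # lower learning rate for stability

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets) / accum_steps  # scale so accumulated gradients average correctly
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip to curb NaN/divergence
        optimizer.step()
        optimizer.zero_grad()
```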
Topic: Data Preparation for Machine Learning (ML)
A company must ensure that training datasets containing PII are stored and processed only in eu-west-1. Amazon S3 is used for raw storage, and Amazon SageMaker processing jobs are used for feature engineering. Which action best reflects a core principle to keep data confined to the approved Region and approved storage locations?
Options:
A. Create CloudTrail alarms for S3 and SageMaker in other Regions.
B. Remove PII columns before storing data for feature engineering.
C. Encrypt S3 data with KMS keys created in eu-west-1.
D. Use SCP region-deny and allow-list approved S3 buckets.
Best answer: D
Explanation: The best fit is the principle of least privilege: explicitly allow only the approved Region and the specific S3 locations that may be used. An AWS Organizations SCP can prevent using services in disallowed Regions, and an allow-list policy limits where identities can read/write data, reducing the chance of accidental noncompliant storage or processing.
The core principle is least privilege: identities should be permitted to operate only within the minimum set of Regions and storage locations required for the workflow. To enforce data residency, use preventive guardrails that block actions rather than only detecting them after the fact.
Practically, this means applying an AWS Organizations SCP that denies requests when aws:RequestedRegion is not eu-west-1 (for Regional services such as SageMaker) and allow-listing only the approved S3 buckets and prefixes in identity policies. This combination reduces the permitted “blast radius” so data cannot be easily processed or written outside approved boundaries, which is stronger than monitoring-only or encryption-only approaches.
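A sketch of the two guardrails as policy documents, written here as Python dicts (the bucket name and the handling of global-service exceptions are placeholders you would tailor):

```python
# Organizations SCP: deny any action requested outside eu-west-1
region_deny_scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideEuWest1",
        "Effect": "Deny",
        "Action": "*",   # in practice, carve out required global services (for example via NotAction)
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": "eu-west-1"}},
    }],
}

# Identity policy fragment: allow-list only the approved S3 locations
s3_allowlist_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowApprovedBucketOnly",
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::approved-ml-data-eu-west-1/*",   # placeholder bucket
    }],
}
```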
Topic: Deployment and Orchestration of ML Workflows
A team deploys the same Amazon SageMaker inference container image to dev, test, and prod. They avoid baking environment-specific values into the image.
At deployment time, they pass non-sensitive settings (for example, log level and downstream endpoint URL) as environment variables and retrieve sensitive values (for example, an API key) from AWS Secrets Manager using the endpoint’s IAM role.
Which core principle does this action most directly reflect?
Options:
A. Least privilege
B. Reproducibility
C. Separation of duties
D. Defense in depth
Best answer: B
Explanation: This approach externalizes configuration from the built artifact so the same container image can be promoted across environments without modification. Environment-specific values are supplied at deploy time through environment variables and managed secrets, which reduces configuration drift between dev, test, and prod. That directly supports reproducible, consistent deployments.
The core concept is reproducibility: build once, deploy many. When you keep the container image identical across dev/test/prod and inject environment-specific settings at deployment time, you can reliably reproduce the same workload behavior in each environment and promote releases without rebuilding.
In AWS ML deployments, this is typically achieved by building one container image, promoting that same image across dev, test, and prod, passing non-sensitive settings (log level, downstream URLs) as environment variables at deployment time, and resolving sensitive values from AWS Secrets Manager through the endpoint's IAM role. This pattern also improves security, but the primary principle demonstrated is that deployments remain consistent and repeatable because configuration is separated from the artifact.
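A minimal sketch of how the inference code might read non-sensitive settings from environment variables and fetch the API key from Secrets Manager at startup (the variable and secret names are illustrative):

```python
import os
import boto3

# Non-sensitive, environment-specific settings injected at deploy time
log_level = os.environ.get("LOG_LEVEL", "INFO")
downstream_url = os.environ["DOWNSTREAM_ENDPOINT_URL"]

# Sensitive value retrieved at runtime via the endpoint's IAM role
secrets = boto3.client("secretsmanager")
api_key = secrets.get_secret_value(
    SecretId=os.environ["API_KEY_SECRET_ARN"]   # secret ARN/name supplied as an env var
)["SecretString"]
```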
Topic: Data Preparation for Machine Learning (ML)
Which statement is INCORRECT about using Amazon SageMaker Data Wrangler to create, validate, and export a repeatable feature engineering workflow?
Options:
A. After exporting to a Processing job, you can delete the .flow file because the exported job contains the full transformation graph.
B. Export the flow to a SageMaker Processing job or SageMaker Pipelines step and parameterize the input and output S3 URIs for repeatable runs.
C. Store the Data Wrangler .flow file in Amazon S3 with versioning to track transformation changes over time.
D. Use Data Wrangler’s data quality/insights to validate the dataset and save the report artifacts to Amazon S3 for auditability.
Best answer: A
Explanation: A repeatable Data Wrangler workflow depends on persisting the flow definition (the .flow file) as the source of truth for the transformation graph. Exports (Processing job/Pipelines) use that flow to apply the same steps on new data, and validation artifacts can be saved for traceability. Deleting the flow undermines reruns and auditability.
In SageMaker Data Wrangler, the .flow file is the durable, declarative definition of your feature engineering steps (imports, transforms, joins, encodings, etc.). When you export a Data Wrangler workflow (for example, to a SageMaker Processing job or a SageMaker Pipelines step), the runtime uses the flow definition to reproduce the same transformations on future datasets.
To make the workflow repeatable and auditable: export the flow to a SageMaker Processing job or Pipelines step with parameterized input and output S3 URIs, persist the .flow file (for example, in S3 with versioning), and save the data quality/insights report artifacts to S3. Keeping the flow artifact is a core requirement for reproducibility; export alone is not a substitute for the flow definition.
The .flow file captures the exact transform graph, so deleting it after export is unsafe: the export relies on the flow definition to reproduce the transforms on future data.
Topic: ML Solution Monitoring, Maintenance, and Security
A team uses AWS CodePipeline and CodeBuild to train a model and then promote it in Amazon SageMaker Model Registry before deploying to production. A security review captured this IAM policy fragment attached to the CodeBuild project role.
Exhibit: CodeBuild role policy (partial)
1 Effect: Allow
2 Action: secretsmanager:GetSecretValue
3 Resource: arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/db*
4 Action: sagemaker:UpdateModelPackage
5 Resource: *
6 Action: sts:AssumeRole
7 Resource: arn:aws:iam::111122223333:role/ProdDeployRole
Which action is the best next step to secure the ML CI/CD pipeline?
Options:
A. Move the production secret into S3 with SSE-KMS
B. Split build and promotion roles; least-privilege both
C. Encrypt the Secrets Manager secret with a new KMS key
D. Enable CloudTrail data events for Secrets Manager
Best answer: B
Explanation: The exhibit shows the CodeBuild role can fetch a production database secret and can directly promote a model package and assume a production deploy role. The best security improvement is to separate duties: keep the build role restricted to building artifacts and use a tightly scoped, gated promotion/deploy role for production changes.
Securing ML CI/CD pipelines relies on least privilege, separation of duties, and controlled promotion of artifacts. In the exhibit, the same CodeBuild role can retrieve a production secret (line 2 with a prod secret ARN on line 3) and can also promote artifacts and reach production permissions (it can call sagemaker:UpdateModelPackage on * in lines 4–5 and can sts:AssumeRole into ProdDeployRole in lines 6–7).
A safer approach is to: scope the build role to build-time needs only (no production secrets and no ability to promote model packages or assume the production deploy role), move promotion into a separate, tightly scoped role used by a gated pipeline stage, and restrict sagemaker:UpdateModelPackage to the specific model package group rather than *.
Key takeaway: the risky capability is explicitly visible in lines 2–7, so the fix is to restrict roles and promotion permissions, not just add encryption or logging.
Topic: ML Solution Monitoring, Maintenance, and Security
A company in a regulated industry must run Amazon SageMaker training jobs and a real-time endpoint in a VPC with no internet access (no direct or indirect egress). Training data is in Amazon S3, and custom containers are stored in Amazon ECR. Logs must go to Amazon CloudWatch.
Which approach is NOT appropriate to meet the network isolation requirement?
Options:
A. Run training and the endpoint in private subnets with no route to an internet gateway
B. Use VPC endpoints for S3, ECR, and CloudWatch so jobs can access required services privately
C. Place jobs in a public subnet and use a NAT gateway so the container can download packages from the internet at startup
D. Enable SageMaker network isolation and ensure all dependencies are packaged in the container or retrieved from S3 via VPC endpoints
Best answer: C
Explanation: When policies require no internet access, SageMaker training and inference must run in private subnets without internet routes and rely on private connectivity (VPC endpoints) for AWS service access. Allowing internet egress (even via NAT) breaks network isolation because the container can reach external networks.
The core principle is enforcing network isolation by preventing any internet egress path for training and inference while still allowing access to required AWS services over private networking. In SageMaker, this typically means placing jobs/endpoints in private subnets, removing routes to an internet gateway/NAT gateway, and using VPC endpoints to reach services such as S3 (Gateway endpoint), ECR (Interface endpoints), and CloudWatch Logs (Interface endpoint). If additional packages or artifacts are needed, they must be pre-packaged into the container image or stored in S3 and accessed through those endpoints.
Allowing a NAT gateway so containers can fetch dependencies at runtime is an anti-pattern because it reintroduces outbound internet connectivity.
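A hedged SageMaker Python SDK sketch of a training job pinned to private subnets with network isolation enabled (the image URI, role, subnets, security groups, and bucket are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/custom-train:1.0",  # pulled via ECR interface endpoints
    role="arn:aws:iam::123456789012:role/SageMakerExecRole",                    # placeholder role
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    subnets=["subnet-private-a"],                 # private subnets with no IGW/NAT route
    security_group_ids=["sg-training"],
    enable_network_isolation=True,                # container gets no outbound network access
    output_path="s3://approved-bucket/output/",   # reached through the S3 gateway endpoint
)
estimator.fit({"train": "s3://approved-bucket/train/"})
```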
Topic: ML Model Development
A data scientist trains two binary classifiers in Amazon SageMaker and evaluates them using ROC AUC on the same held-out test set. Model A has an AUC of 0.91, and Model B has an AUC of 0.87. Which statement is the most accurate interpretation of this result?
Options:
A. Model A has better overall class-separation across thresholds
B. Model A will have higher precision for the positive class
C. Model A will have higher accuracy at any chosen threshold
D. Model A has a lower false positive rate at every recall level
Best answer: A
Explanation: ROC AUC summarizes a model’s ability to discriminate between classes across all possible classification thresholds. Because Model A’s AUC is higher than Model B’s on the same test set, Model A is the better overall discriminator in a threshold-independent sense. This does not guarantee it will be best at a specific operating threshold or metric like precision.
ROC curves plot true positive rate against false positive rate as the decision threshold varies, and AUC compresses that curve into a single, threshold-independent number. Interpreting AUC at a high level: it reflects how well the model separates (ranks) positive examples above negative examples across all thresholds. Therefore, when two models are evaluated on the same dataset, the model with the higher AUC (0.91 vs. 0.87) has better overall discriminative performance.
AUC alone does not guarantee better performance at any particular threshold, nor does it directly determine threshold-dependent metrics such as accuracy, precision, or recall; those depend on the chosen operating point and class prevalence. The key takeaway is that higher AUC means better overall ranking/separation, not universal superiority for every threshold or metric.
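A small scikit-learn check of this threshold-free interpretation: the synthetic scores below give one model better ranking (higher AUC), while threshold-dependent metrics remain a separate operating-point decision:

```python
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                  # synthetic labels
scores_a = y_true * 0.6 + rng.random(1000) * 0.7        # model A: stronger ranking signal
scores_b = y_true * 0.4 + rng.random(1000) * 0.9        # model B: weaker ranking signal

print("AUC A:", roc_auc_score(y_true, scores_a))        # higher -> better overall class separation
print("AUC B:", roc_auc_score(y_true, scores_b))

# Precision at one fixed threshold depends on the chosen operating point, not AUC alone
print("Precision A @0.5:", precision_score(y_true, scores_a > 0.5))
print("Precision B @0.5:", precision_score(y_true, scores_b > 0.5))
```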
Topic: ML Model Development
A team needs to extract printed text from scanned loan applications (OCR). The team does not need custom fields beyond standard form and table extraction. Which action best applies the core principle of data minimization while meeting the requirement on AWS?
Options:
A. Use Amazon Textract and store only extracted text outputs
B. Use Amazon Rekognition Custom Labels for text extraction
C. Run open-source OCR on EC2 and archive all raw images
D. Train a custom OCR model in SageMaker and keep all scans
Best answer: A
Explanation: Amazon Textract is the AWS-managed service purpose-built for OCR on scanned documents, forms, and tables, so custom modeling is unnecessary here. Data minimization means collecting and retaining only what is needed for the stated purpose. Storing only the extracted text (and not the source scans) best matches that principle.
The key decision is matching the computer vision need to the appropriate managed service and then applying the stated principle. For OCR of scanned documents (including forms and tables) where specialized customization is not required, Amazon Textract is the high-level, fit-for-purpose choice.
Data minimization means limiting both the data you collect/process and the data you retain. In this scenario, the business output is the extracted text, so minimizing retained data favors persisting the Textract results while avoiding unnecessary long-term storage of the original scanned images.
By contrast, custom modeling (for example, training an OCR model in SageMaker) is warranted when managed OCR does not meet accuracy, language/layout, or domain-specific extraction requirements.
Topic: Data Preparation for Machine Learning (ML)
A team has existing producers and consumers that use the Apache Kafka protocol and client libraries. The team wants a fully managed AWS service to ingest a high-throughput stream with minimal application changes while retaining Kafka concepts such as topics and partitions. Which AWS service should the team use?
Options:
A. Amazon Managed Streaming for Apache Kafka (Amazon MSK)
B. Amazon Kinesis Data Streams
C. Amazon Kinesis Data Firehose
D. Amazon SageMaker Feature Store
Best answer: A
Explanation: Amazon MSK is AWS’s fully managed service for running Apache Kafka, allowing teams to keep Kafka APIs and client libraries while ingesting streaming data at scale. This directly matches the requirement to preserve Kafka-specific constructs like topics and partitions with minimal application changes.
Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) when you need a managed Kafka control plane and data plane but want to keep the Kafka protocol, client libraries, and operational model (topics, partitions, consumer groups). This is a common choice for streaming ingestion when teams already have Kafka-based applications and want to reduce operational burden (broker management, patching, scaling) without rewriting producers/consumers.
A close alternative is Amazon Kinesis Data Streams, which is also for streaming ingestion at scale, but it uses Kinesis APIs and shard-based scaling rather than Kafka APIs and partitions, so it typically requires application changes.
Topic: ML Model Development
A team runs multiple Amazon SageMaker training jobs while tuning hyperparameters and wants each run to be reproducible and comparable later. They need a managed way to record randomness-related settings, track parameters and metrics, and version the run’s inputs and produced artifacts (for example, S3 URIs for datasets, training code, and model artifacts).
Which SageMaker capability best meets this requirement?
Options:
A. Amazon CloudWatch Logs
B. Amazon SageMaker Model Registry
C. Amazon SageMaker Experiments
D. AWS CloudTrail
Best answer: C
Explanation: Amazon SageMaker Experiments is designed to organize and track ML runs so they can be reproduced and compared. It captures run metadata such as parameters (including seeds), metrics, and references to inputs and outputs (like dataset and model artifact locations). This makes it the most direct managed feature for reproducible experimentation in SageMaker.
Reproducible ML experiments require consistent capture of what was run and what data/artifacts were used: training parameters (including any random seeds), evaluation metrics, and the specific input and output artifacts for each run. Amazon SageMaker Experiments provides this experiment tracking by grouping runs into trials and recording each run’s parameters, metrics, and artifact references (for example, S3 locations for datasets, code bundles, and model artifacts). This lets you reliably compare results across runs and re-run the same configuration later using the recorded metadata.
Operational logs and API audit trails can support debugging and governance, but they are not purpose-built to structure experiment metadata for repeatable model development.
Topic: Data Preparation for Machine Learning (ML)
A team is preparing a 5 TB image dataset in Amazon S3 for distributed training on Amazon SageMaker. The team wants to maximize training throughput and minimize data loading bottlenecks.
Which statement is INCORRECT about preparing the training data so it can be loaded efficiently by the training infrastructure?
Options:
A. Compress files when the training job is not CPU-bound on decompression
B. Upload a single large file per epoch to reduce S3 GET overhead
C. Store many small objects to maximize S3 request parallelism
D. Shard the dataset into multiple medium-size files per channel
Best answer: B
Explanation: Efficient SageMaker training input typically comes from balanced sharding: enough files to allow parallel reads without creating pathological tiny-object overhead. A single very large file forces serial or limited parallel reads and makes retries and partial re-reads expensive, which can reduce overall training throughput.
The core concept is to shape S3 training data for high-throughput, parallel ingestion by the training cluster. For distributed training, you generally want multiple reasonably sized shards (often tens to hundreds of MB up to a few GB depending on the pipeline) so each worker can read different files concurrently; one monolithic file tends to cap parallelism and can become a hotspot.
Compression can improve effective I/O and reduce network transfer, but only helps when decompression CPU cost does not become the bottleneck. Conversely, extremely small files can overwhelm the input pipeline with per-object overhead (listing/opens/metadata and request management), even if S3 can scale requests.
Key takeaway: optimize for parallel reads with sensible shard sizes, not single giant objects.
Topic: ML Solution Monitoring, Maintenance, and Security
An ML team stores training datasets in one S3 bucket using prefixes raw/ (contains PII) and curated/ (PII removed). Data scientists must be able to list and read objects only under curated/ and must not access raw/. Which action best reflects the core security principle of least privilege?
Options:
A. Use separate IAM roles for ETL and model training jobs
B. Scope S3 permissions to only the curated/ prefix
C. Automatically delete datasets after every training run
D. Enable S3 Versioning and log access to CloudTrail
Best answer: B
Explanation: Least privilege means granting only the minimum permissions needed for a specific job. In S3, that commonly requires restricting access by prefix so principals can only list and read the specific dataset paths they are authorized to use. This directly prevents access to the raw/ PII data while still enabling work on curated/.
The core principle being tested is least privilege: each identity should receive only the permissions (actions) and scope (resources) required to perform its tasks. For S3-hosted ML datasets, the practical way to enforce least privilege is to limit s3:GetObject (and s3:ListBucket with an s3:prefix condition) to the exact prefixes that the persona needs, such as curated/*, and avoid granting broad bucket-wide access. This reduces the blast radius of accidental misuse or credential compromise and helps ensure PII-containing raw/* objects remain inaccessible to data scientists.
A common pitfall is using controls that improve auditing or role separation without actually constraining which objects can be accessed; those are valuable, but they do not satisfy the “only this prefix” requirement by themselves, and they do not block access to raw/.
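A sketch of the prefix-scoped permissions as a policy document in Python form (the bucket name is illustrative):

```python
curated_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # List only keys under curated/
            "Sid": "ListCuratedPrefixOnly",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::ml-datasets-example",
            "Condition": {"StringLike": {"s3:prefix": ["curated/*"]}},
        },
        {   # Read objects only under curated/ (no statement grants raw/)
            "Sid": "ReadCuratedObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::ml-datasets-example/curated/*",
        },
    ],
}
```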
Topic: ML Model Development
A company uses separate AWS accounts for each ML team. A team in account 210987654321 wants to reuse a standard training container that is built and versioned by a central platform team in account 123456789012.
Exhibit: SageMaker training job log excerpt
[INFO] TrainingImage: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-train:1.4.0
[INFO] SageMakerExecutionRole: arn:aws:iam::210987654321:role/sm-training-role
[ERROR] ClientError: AccessDeniedException
[ERROR] not authorized to perform: ecr:BatchGetImage
[ERROR] on resource: arn:aws:ecr:us-east-1:123456789012:repository/ml-train
Which action best enables sharing this reusable training container across accounts while keeping a single centrally managed image?
Options:
A. Add a cross-account ECR repository policy to allow image pull, and grant the training role ECR read permissions
B. Create VPC interface endpoints for Amazon ECR in the team account to resolve the image pull failure
C. Switch the training job to use a SageMaker built-in algorithm container instead of a custom container
D. Export the image to Amazon S3 and have each account load it into its own ECR repository before training
Best answer: A
Explanation: The log indicates an authorization failure when the team account tries to pull a Docker image from an ECR repository in a different account. The right sharing strategy is to keep the image centralized in one ECR repository and explicitly grant cross-account pull access through an ECR repository policy, along with the necessary ECR read actions on the SageMaker execution role.
The core issue is cross-account authorization to a centrally managed ECR image. The exhibit explicitly shows an AccessDeniedException and the denied action ecr:BatchGetImage against the repository ARN in account 123456789012 (see the lines stating not authorized to perform: ecr:BatchGetImage and the arn:aws:ecr:...:123456789012:repository/ml-train).
To share reusable training containers across accounts while keeping one source of truth: attach a repository policy to the central ml-train repository in account 123456789012 that allows principals in account 210987654321 to pull (for example, ecr:BatchGetImage and ecr:GetDownloadUrlForLayer), and ensure the SageMaker execution role in the team account has the matching ECR read permissions, including ecr:GetAuthorizationToken.
Network endpoints can help connectivity, but they do not fix an IAM/ECR authorization error; the exhibit shows an authorization failure (AccessDeniedException), not a connectivity failure.
Topic: ML Solution Monitoring, Maintenance, and Security
A company uses Amazon SageMaker to train models on sensitive customer PII stored in Amazon S3 and serves real-time inference through SageMaker endpoints. The security team must prevent unauthorized access to both training data and inference payloads while keeping deployments and operations scalable.
Which approach should the company AVOID?
Options:
A. Encrypt S3 training data with SSE-KMS and restrict the bucket policy to approved SageMaker roles and specific S3 access paths
B. Enable auditing with AWS CloudTrail for data access and SageMaker API activity and send logs to a protected central account
C. Use one shared SageMaker execution role with broad s3:* and kms:* permissions on * for all teams and environments
D. Place training jobs and endpoints in a VPC and use VPC endpoints to keep S3 and inference traffic on the AWS network
Best answer: C
Explanation: To protect sensitive training data and inference payloads, access should be limited to the minimum set of identities, resources, and network paths required. Overbroad IAM permissions create unnecessary blast radius and make it easier for unauthorized principals to read training data or misuse KMS keys. Using encryption, private connectivity, and auditing supports secure operations without blocking legitimate workflows.
The core control for preventing unauthorized access at scale is least-privilege authorization combined with encrypted storage/transport and auditable access. In SageMaker workflows, that typically means scoping IAM roles to specific S3 prefixes and required KMS keys, limiting network paths (for example, VPC endpoints) so traffic stays on the AWS network, and logging access for investigation and compliance.
A single shared execution role with s3:* and kms:* on * is an anti-pattern because it grants far more access than any single workload needs, expands the blast radius if credentials are misused or compromised, and makes it hard to attribute and audit access by team or environment.
The key takeaway is to minimize permissions and scope access, then layer encryption, private connectivity, and auditing around that foundation.
Topic: ML Model Development
A bank trained an Amazon SageMaker XGBoost model to predict loan approval. The team must validate that performance is consistent across subpopulations (for example, gender and age_band) and identify potential fairness concerns before registering the model.
Which THREE analyses should the ML engineer perform to meet this requirement?
Options:
A. Tune the global classification threshold for best overall F1
B. Run a SageMaker Clarify post-training bias report
C. Compare FPR/TPR and AUC across gender and age bands
D. Report only overall ROC-AUC on the full test set
E. Check calibration curves separately for each subgroup
F. Remove gender from the dataset and assume fairness
Correct answers: B, C and E
Explanation: To validate consistency across subpopulations, evaluate model quality and errors separately by group, not only in aggregate. A high-level fairness check also requires computing group-based bias metrics on predictions and verifying that scores are similarly calibrated across groups. These analyses can surface disparate error rates, selection rates, or miscalibration that may create unfair outcomes.
Fairness and consistency checks start by slicing evaluation results by protected or sensitive subpopulations (or their proxies) and comparing metrics that drive decisions. Aggregate metrics can hide large gaps when one group dominates the test set.
Useful high-level checks include: a SageMaker Clarify post-training bias report computed on model predictions, a comparison of error and ranking metrics (FPR/TPR and AUC) across gender and age bands, and calibration curves evaluated separately for each subgroup.
The key takeaway is to evaluate by subgroup using both predictive performance and bias metrics, rather than relying on overall scores or “fairness by omission.”
Topic: Data Preparation for Machine Learning (ML)
A retail company is building a binary classifier in Amazon SageMaker to predict whether a customer will churn in the next 30 days. The dataset contains 1 year of daily customer snapshots with features such as days_since_last_purchase, support_tickets_last_30d, and a unique customer_id. Churn occurs in ~3% of snapshots.
The team needs train/validation/test splits that avoid leakage and keep class balance comparable across splits.
Which approach should the team AVOID?
Options:
A. Randomly split snapshots across all dates, then stratify by churn label
B. Split by customer_id so each customer appears in only one split, and stratify by churn at the customer level where feasible
C. Split by time: oldest 80% train, next 10% validation, newest 10% test
D. Split by time, and within each time window use stratified sampling to maintain the churn rate in train and validation
Best answer: A
Explanation: For time-evolving customer snapshots, a random stratified split across all dates is a leakage anti-pattern because it can place later snapshots for the same customer in training while evaluating earlier snapshots in test (or vice versa). That breaks the intended “predict the future from the past” setup and inflates metrics. A leakage-safe split should respect time and/or keep entities isolated across splits while maintaining label distribution as appropriate.
The core principle is to design splits that match the real inference setting and prevent information from the evaluation set from “bleeding” into training. With customer snapshots over time, randomly splitting rows across the full year can put the same customer_id into multiple splits and mix future and past observations, which introduces temporal/entity leakage (the model indirectly learns customer-specific patterns that won’t generalize).
Leakage-safe approaches typically split by time (train on older snapshots, evaluate on newer ones) and/or split by entity (customer_id) so each entity appears in only one split, applying stratification within those boundaries where feasible.
The key takeaway is that stratification is useful for class imbalance, but it must not override boundaries that prevent leakage.
Topic: ML Model Development
A team uses Amazon SageMaker Pipelines to train and evaluate a binary classifier. After a recent pipeline change, the evaluation AUC jumped from 0.76 to 0.99, but the model performs poorly in production. The team suspects evaluation issues such as data leakage, metric misconfiguration, or inconsistent train/test splits across runs.
Which action should the team take EXCEPT?
Options:
A. Fit preprocessing on train and apply artifacts to test
B. Generate features on the full dataset before splitting
C. Validate the evaluation script’s metric mapping and thresholds
D. Use a deterministic split and persist a split manifest
Best answer: B
Explanation: To troubleshoot an evaluation pipeline, the team should eliminate leakage and ensure the evaluation procedure is repeatable and correctly measured. Actions that fit transformations only on training data, lock down split determinism, and verify metric configuration improve evaluation fidelity. Computing features across the full dataset before splitting is a classic leakage anti-pattern that can produce unrealistically high offline metrics.
The core principle is to keep the test set isolated so it represents unseen data. Any operation that learns from data (feature scaling, encoding, aggregations that depend on label/time, imputation statistics) must be fit using only the training split, then applied to validation/test using the saved artifacts. In SageMaker Pipelines, this typically means producing and passing preprocessing artifacts between steps and making the split deterministic (seeded random split or time-based split) so evaluation is consistent across runs. Separately, the evaluation step must compute the intended metric (and any thresholds) correctly; a miswired metric can also create misleading “great” results. The key takeaway: improve reproducibility and metric correctness without introducing train-test contamination.
Topic: ML Solution Monitoring, Maintenance, and Security
A team uses a standard workflow (data in S3 -> train/tune in SageMaker -> deploy to a SageMaker real-time endpoint -> monitor). The endpoint occasionally has elevated latency and intermittent request throttling, but the team currently checks the console manually after incidents.
The team must detect and alert on endpoint health issues (errors, latency, throttling) within 5 minutes using AWS-native monitoring, with minimal ongoing cost and no changes to the model container or client integration.
Exhibit: Available CloudWatch metrics for the endpoint
ModelLatency
Invocation4XXErrors
Invocation5XXErrors
InvocationThrottles
Which change best improves operability and reliability while meeting the constraints?
Options:
A. Convert the deployment to an asynchronous endpoint to reduce throttling
B. Use CloudTrail to alert on InvokeEndpoint API calls that return errors
C. Create CloudWatch alarms on latency, errors, and throttles; enable logs to CloudWatch Logs with alerts
D. Enable SageMaker Model Monitor with hourly drift and data quality reports to S3
Best answer: C
Explanation: CloudWatch metrics and logs are the primary tools to monitor inference endpoint health at runtime. Alarming on latency percentiles, 4XX/5XX errors, and throttling provides fast detection, and shipping container logs to CloudWatch Logs provides the context needed to troubleshoot. This meets the 5-minute alerting goal with low overhead and no model or client changes.
For SageMaker endpoints, health monitoring is best implemented with CloudWatch metrics for fast, quantitative signals (latency, error rates, throttling) and CloudWatch Logs for qualitative diagnostics (stack traces, timeouts, upstream dependency errors). In this scenario, creating CloudWatch alarms on ModelLatency, Invocation4XXErrors, Invocation5XXErrors, and InvocationThrottles (and routing notifications through SNS/on-call tooling) reduces time-to-detect while keeping costs low.
Enabling the endpoint/container logs in CloudWatch Logs lets the team correlate alarm spikes with specific error messages and request patterns. The key tradeoff is some additional log ingestion/storage cost, but it is typically modest compared with the operational benefit of rapid detection and troubleshooting.
Options focused on drift detection, audit logging, or changing the endpoint type do not directly meet the stated 5-minute endpoint health alerting requirement under the constraints.
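A boto3 sketch of one such alarm (the endpoint name, variant, SNS topic, and thresholds are placeholders; similar alarms would cover latency and throttles):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="fraud-endpoint-5xx-errors",
    Namespace="AWS/SageMaker",
    MetricName="Invocation5XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "fraud-endpoint"},   # placeholder endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Sum",
    Period=60,                      # 1-minute periods keep detection well under 5 minutes
    EvaluationPeriods=3,
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ml-oncall"],  # placeholder SNS topic
)
```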
Topic: ML Solution Monitoring, Maintenance, and Security
A team trains a fraud model in Amazon SageMaker, registers it, and deploys it to a real-time endpoint. CloudWatch alarms show p95 latency breaches only during a daily 10-minute traffic spike. The application requires real-time responses (<250 ms p95) and the model cannot be changed.
Exhibit: Endpoint metrics during spike
SageMakerVariantInvocationsPerInstance: ~260
Which change is the best way to reduce latency during spikes while minimizing cost outside the spike window?
Options:
A. Increase max capacity to 10 and lower the target invocations per instance (with faster scale-out)
B. Change the instance type to ml.m5.4xlarge and keep max capacity at 2
C. Replace the real-time endpoint with a Batch Transform job that runs every minute
D. Disable autoscaling and set a fixed desired capacity of 8 instances
Best answer: A
Explanation: The metrics indicate the endpoint is CPU-bound and constrained by a very low autoscaling max capacity, so it cannot add enough replicas during the spike. Increasing the maximum instance count and scaling out sooner reduces per-instance CPU pressure and request queueing, which lowers p95 latency and timeouts. Because capacity scales back down after the spike, cost outside the spike window stays low.
For SageMaker real-time endpoints, spike latency commonly comes from insufficient concurrency (too few instances) and autoscaling that reacts too late or is capped. Here, CPU is near saturation while memory is low, and the current max capacity of 2 prevents the endpoint from adding enough replicas to handle the spike.
A high-level fix is to tune endpoint autoscaling so the variant scales out earlier and is allowed to scale higher: raise maxCapacity to cover the spike, and lower the target value for SageMakerVariantInvocationsPerInstance (or a similar target-tracking metric) so scale-out starts before CPU saturates.
This typically improves p95 latency and reliability during spikes while keeping costs low the rest of the day because you only pay for extra instances when they are needed.
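A hedged boto3 sketch of that autoscaling change (the endpoint/variant names, capacities, target value, and cooldowns are placeholders to tune for the workload):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/fraud-endpoint/variant/AllTraffic"   # placeholder endpoint/variant

# Raise the ceiling so the variant can add replicas during the spike
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Scale out earlier by targeting fewer invocations per instance, with a short scale-out cooldown
autoscaling.put_scaling_policy(
    PolicyName="fraud-endpoint-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 150.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```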
Topic: ML Model Development
A team trains an XGBoost fraud model in Amazon SageMaker. Data is prepared in S3, then a SageMaker hyperparameter tuning job runs, the best model is deployed to a real-time endpoint, and CloudWatch alarms monitor latency and errors.
Each training job takes ~2 hours on ml.m5.4xlarge. The team has a fixed budget that allows at most 40 total training jobs for tuning, and only 4 can run in parallel. The search space is 6 hyperparameters that are mostly continuous ranges (for example, eta, min_child_weight, subsample) and model quality changes smoothly with small hyperparameter changes.
Which change to the tuning step is the best way to improve expected model performance without exceeding the budget?
Options:
A. Switch to exhaustive grid search
B. Keep random search and double max jobs
C. Tune only one hyperparameter at a time
D. Switch tuning to Bayesian optimization
Best answer: D
Explanation: With only 40 total trials and 2-hour trainings, the priority is finding a strong configuration with as few evaluations as possible. For a low-dimensional, mostly continuous, smooth search space, Bayesian optimization can use results from prior trials to choose more promising next trials than random sampling. The main tradeoff is less effective scaling to massive parallelism and potential sensitivity to noisy objectives.
This is a “limited budget, expensive trial” tuning problem: each evaluation costs hours and the team can only afford 40 total trials. When the hyperparameter space is relatively small in dimensionality and largely continuous, and the objective behaves reasonably smoothly, Bayesian optimization is usually preferred because it is sample-efficient (it learns from earlier trials to choose better next candidates).
Random search is often better when you can afford many trials, want easy parallelism at large scale, or have very high-dimensional spaces with many irrelevant parameters or lots of discrete/categorical choices. Here, parallelism is already limited (4 at a time), so Bayesian’s more sequential, guided exploration is not a major drawback. Key takeaway: choose Bayesian optimization to maximize performance under a tight trial budget; choose random search when you need broad, parallel exploration or the space is awkward for surrogate modeling.
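A SageMaker Python SDK sketch of this tuning configuration under the stated budget (the role, S3 paths, XGBoost container version, and ranges are illustrative):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

session = sagemaker.Session()
xgb_image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

xgb_estimator = Estimator(
    image_uri=xgb_image,
    role="arn:aws:iam::111122223333:role/SageMakerExecRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path="s3://example-bucket/models/",                # placeholder path
)

tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "min_child_weight": ContinuousParameter(1, 10),
        "subsample": ContinuousParameter(0.5, 1.0),
    },
    strategy="Bayesian",      # sample-efficient search for a tight trial budget
    max_jobs=40,              # total trial budget
    max_parallel_jobs=4,      # budget-imposed concurrency limit
)
tuner.fit({"train": "s3://example-bucket/train/", "validation": "s3://example-bucket/validation/"})
```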
Topic: ML Model Development
A retail company uses Amazon SageMaker to forecast daily demand per store for the next 7 days (forecast horizon = 7). A nightly SageMaker Pipeline builds lag features and trains an XGBoost regressor. Offline evaluation shows very low RMSE, but after deployment, Amazon SageMaker Model Monitor raises a constraint violation because prediction error is much higher than the training baseline.
Current training split (in the preprocessing step): train_test_split(data, test_size=0.2, shuffle=True)
Which change will most likely fix the root cause with the least disruption?
Options:
A. Enable more aggressive hyperparameter tuning for XGBoost
B. Randomly shuffle data within each store to reduce seasonality bias
C. Use a chronological split and leave a 7-day gap before validation
D. Increase the training instance size to avoid underfitting
Best answer: C
Explanation: The symptom (great offline RMSE but poor post-deployment accuracy) is typical of time-series leakage caused by random shuffling. For forecasting, validation must simulate the real use case: predicting future values from past data with the required horizon. Using a chronological holdout (often with a horizon-sized gap) produces a realistic estimate and trains without peeking into the future.
Time series forecasting requires splits that preserve time order and match the prediction horizon. With shuffle=True, examples from the future can end up in the training set while earlier timestamps land in validation, especially when labels are shifted to predict 7 days ahead. This “future peek” inflates offline RMSE while the deployed model performs poorly on truly future periods, triggering Model Monitor accuracy violations.
A minimal, high-impact fix is a temporal split that mirrors production: train on older dates, validate on the most recent dates, and leave a gap of at least the 7-day forecast horizon between the end of training and the start of validation so shifted labels cannot leak across the boundary.
This preserves the existing model approach while aligning evaluation and training with the 7-day horizon.
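A pandas sketch of a chronological split with a horizon-sized gap (the file, date column, and validation window are illustrative):

```python
import pandas as pd

df = pd.read_parquet("daily_store_snapshots.parquet")   # assumed feature table with a 'date' column
df = df.sort_values("date")

horizon = pd.Timedelta(days=7)
val_start = df["date"].max() - pd.Timedelta(days=30)     # hold out the most recent 30 days (illustrative)

train = df[df["date"] < val_start - horizon]             # gap of one forecast horizon before validation
valid = df[df["date"] >= val_start]                      # evaluate only on strictly later dates
```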
Topic: Data Preparation for Machine Learning (ML)
A financial services company trains a binary loan-approval model in Amazon SageMaker each night (S3 data -> train/tune -> register -> deploy). Auditors now require an automated, repeatable check for potential bias related to a sensitive attribute (for example, gender) and the model label, and the team must keep costs low by avoiding always-on infrastructure. The check must run as part of the workflow before a model can be approved for deployment.
Which change best meets these requirements?
Options:
A. Run fairness SQL in Athena instead of Clarify
B. Add a Clarify processing step in SageMaker Pipelines
C. Enable endpoint auto scaling to reduce bias risk
D. Use Data Wrangler to balance classes automatically
Best answer: B
Explanation: Use Amazon SageMaker Clarify as an automated processing step in the pipeline to detect bias in the training data and to compute high-level bias metrics before model approval. This improves operability and auditability because the report is produced consistently every run and stored centrally. It also reduces cost compared with running manual analyses on always-on notebook instances.
Amazon SageMaker Clarify is designed to identify potential bias and compute bias-related metrics at a high level, including pre-training (dataset) bias and post-training (model) bias. Embedding Clarify as a SageMaker Processing job inside SageMaker Pipelines makes the check repeatable and gateable (for example, fail or stop the pipeline if metrics exceed agreed thresholds) and produces artifacts (reports/metrics) that can be stored in S3 and attached to model approval workflows.
This is typically implemented as a SageMaker Clarify processing step in SageMaker Pipelines that computes bias metrics for the sensitive attribute and label, writes the report to S3, and feeds a condition or approval step that stops the pipeline when agreed thresholds are exceeded.
The main tradeoff is added pipeline runtime and per-run processing cost, but it removes manual steps and avoids always-on cost while improving governance.
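A hedged sketch using the SageMaker Python SDK's clarify module to produce a pre-training bias report as a processing run (the role, S3 URIs, headers, and facet values are placeholders; in the nightly workflow this would run as a pipeline processing step):

```python
from sagemaker import clarify

processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/SageMakerExecRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://example-bucket/train/loans.csv",  # placeholder dataset
    s3_output_path="s3://example-bucket/clarify-reports/",     # report stored here for audit
    label="approved",
    headers=["approved", "gender", "age_band", "income"],      # illustrative schema
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # positive (approved) outcome
    facet_name="gender",             # sensitive attribute under review
)

# Computes dataset (pre-training) bias metrics; the report can gate model approval
processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config, methods="all")
```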
Topic: ML Model Development
A team runs a nightly Amazon SageMaker training job (data in S3, training in a private VPC) to refresh a PyTorch model before 7:00 AM. The current job uses On-Demand ml.p3.2xlarge, costs are high, and the job typically finishes in 4.5 hours. The training script already writes periodic checkpoints.
Which change will reduce training cost while preserving reliability if capacity is interrupted?
Options:
A. Switch the training instance to a smaller CPU instance
B. Deploy the model to a multi-model endpoint to cut costs
C. Enable SageMaker managed Spot training with S3 checkpointing
D. Use EC2 Spot Instances for training without checkpointing
Best answer: C
Explanation: Use SageMaker managed Spot training and configure checkpointing to S3. This captures progress so the job can continue after Spot interruptions, preserving reliability while benefiting from Spot discounts. The main tradeoff is potential longer wall-clock time due to interruptions and checkpoint overhead.
The core concept is using discounted Spot capacity for training while maintaining reliability through checkpointing. With SageMaker managed Spot training, you enable Spot for the training job and configure an S3 checkpoint location so SageMaker can restore the latest checkpoint if the instance is reclaimed.
To keep the workflow reliable within the nightly window: enable managed Spot training on the job, set CheckpointConfig to an S3 path, and ensure the script saves checkpoints regularly so an interrupted job resumes from the latest checkpoint rather than restarting.
This improves cost without changing the model or pipeline stages; the main tradeoff is that interruptions can increase end-to-end training time compared to stable On-Demand capacity.
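A hedged SageMaker Python SDK sketch of the Spot-plus-checkpointing configuration (the entry point, framework version, role, time limits, and S3 paths are placeholders):

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                    # script already writes periodic checkpoints
    role="arn:aws:iam::111122223333:role/SageMakerExecRole",   # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.1",                                   # illustrative container version
    py_version="py310",
    use_spot_instances=True,                                   # request discounted Spot capacity
    max_run=6 * 3600,                                          # cap on active training time
    max_wait=8 * 3600,                                         # total time allowed, including interruptions
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",      # SageMaker syncs /opt/ml/checkpoints here
)
estimator.fit({"train": "s3://example-bucket/train/"})
```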
Topic: Data Preparation for Machine Learning (ML)
A company trains a fraud classifier in Amazon SageMaker using a pipeline (S3 data -> AWS Glue -> SageMaker training/AMT -> real-time endpoint -> Model Monitor). The target label is highly imbalanced: 0.4% fraud.
Constraints: keep training cost and job runtime roughly unchanged, and preserve auditability of the training data (no fabricated customer records).
Which change is the best optimization to address the class imbalance while meeting these constraints?
Options:
A. Generate synthetic fraud with SMOTE in Data Wrangler
B. Downsample non-fraud to a small subset
C. Oversample fraud examples during preprocessing
D. Use class weights (for example, XGBoost scale_pos_weight)
Best answer: D
Explanation: Applying class weights during training addresses imbalance by penalizing fraud misclassification more heavily, typically improving fraud recall with little to no increase in training cost. It also preserves auditability because it does not fabricate new customer records. The main tradeoff is a higher false-positive rate and the need to re-tune thresholds and calibration.
When the positive class is rare, many algorithms optimize overall loss by prioritizing the majority class. A cost-effective, auditable way to counter this is to use class weights (or per-example weights) so fraud errors contribute more to the training objective. In SageMaker (for example, the built-in XGBoost), setting scale_pos_weight (often near \(\frac{\#\text{negative}}{\#\text{positive}}\)) can improve fraud recall without increasing the number of training rows, so it typically stays within tight cost constraints and avoids synthetic-record governance issues.
The tradeoff is that emphasizing the minority class can increase false positives and affect probability calibration, so you usually need to re-evaluate precision/recall tradeoffs and choose an operating threshold that meets business requirements.
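A small sketch of computing and applying the class weight (file and column names are illustrative):

```python
import pandas as pd
from xgboost import XGBClassifier

df = pd.read_csv("transactions.csv")          # assumed labeled training data
y = df["is_fraud"]
X = df.drop(columns=["is_fraud"])

# Weight fraud errors more heavily: roughly (#negative / #positive)
spw = (y == 0).sum() / (y == 1).sum()

model = XGBClassifier(scale_pos_weight=spw, eval_metric="aucpr")
model.fit(X, y)
```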
Topic: ML Model Development
When evaluating whether a modeling approach will fit a strict budget for training and time-to-train in Amazon SageMaker, which term describes the process of running many separate training jobs to search for the best hyperparameter values?
Options:
A. Hyperparameter tuning
B. Model drift
C. Regularization
D. Data leakage
Best answer: A
Explanation: Hyperparameter tuning is a search procedure that evaluates many hyperparameter configurations by running multiple training jobs. Because it multiplies the number of training runs, it can significantly increase training cost and extend time-to-train compared with training a single model once. In SageMaker, this is commonly performed with Automatic Model Tuning.
Hyperparameter tuning is the process of selecting the best hyperparameter values (for example, learning rate, tree depth, or regularization strength) by training and evaluating multiple candidate models. In Amazon SageMaker, Automatic Model Tuning orchestrates this by creating a tuning job that runs many underlying training jobs, each with a different hyperparameter configuration. This is why tuning is a key cost-and-schedule consideration: more trials typically mean more compute time and higher training spend.
Regularization is often confused with tuning, but it is a modeling technique applied within a single training run to reduce overfitting by adding a penalty to the objective; it does not inherently require multiple training jobs. The key takeaway is that tuning scales training cost roughly with the number of trials you run.
Topic: Data Preparation for Machine Learning (ML)
A healthcare company is preparing an Amazon SageMaker training dataset in Amazon S3 that includes structured claims (patient name, SSN, date of birth, ZIP code, diagnosis codes) and free-text clinician notes containing PHI. Data scientists must be able to join records across tables, but they do not need to see direct identifiers. Which TWO actions best protect sensitive attributes in the ML dataset before training? (Select TWO.)
Options:
A. Detect and redact PHI in notes using Amazon Comprehend Medical
B. Enable Amazon Macie on the S3 buckets and review findings
C. Tokenize identifiers with deterministic keyed hashing; drop raw fields
D. Run training jobs in private subnets and use VPC endpoints for S3
E. Encrypt the S3 buckets and SageMaker volumes with SSE-KMS
F. Use Lake Formation to deny access to columns containing PII
Correct answers: A and C
Explanation: Use anonymization and masking that change the data content before it reaches training. Deterministic tokenization (with a secret key) removes direct identifiers while keeping a consistent surrogate key for joins. PHI redaction on clinician notes masks sensitive entities in unstructured text so the model does not learn or expose them.
To protect PII/PHI in ML datasets, apply data classification to locate sensitive fields and then apply anonymization/masking so sensitive attributes are not present in the training data. For structured identifiers, replace values like name/SSN/patient ID with a deterministic, keyed token (for example, HMAC) so the same person maps to the same surrogate key across tables, while the original identifier is removed. For unstructured notes, run an NLP de-identification step (such as Amazon Comprehend Medical PHI detection) and redact or replace detected entities before writing the curated training dataset to S3.
Controls like encryption, network isolation, and access permissions reduce exposure but do not remove sensitive attributes from the dataset itself.
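A minimal sketch of deterministic keyed tokenization with HMAC (in practice the key would come from a secrets store, not source code; file and column names are illustrative):

```python
import hashlib
import hmac

import pandas as pd

SECRET_KEY = b"replace-with-key-from-secrets-manager"   # placeholder; never hard-code in practice

def tokenize(value: str) -> str:
    """Deterministic surrogate: the same input and key always yield the same token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

claims = pd.read_csv("claims.csv")                       # assumed structured claims table
claims["patient_token"] = claims["ssn"].map(tokenize)    # consistent surrogate key for joins
claims = claims.drop(columns=["patient_name", "ssn", "date_of_birth"])  # drop raw identifiers
```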
Topic: Deployment and Orchestration of ML Workflows
A company deploys an NLP model for real-time inference on Amazon SageMaker. The model is retrained weekly and must be promoted from dev to staging to prod across separate AWS accounts.
Requirements: define the deployment as repeatable infrastructure as code across the accounts, shift traffic gradually (canary) with automatic rollback when p95 latency or 5xx error alarms fire, and keep model artifacts encrypted with an auditable change history.
Which solution best meets these requirements?
Options:
A. CDK + CodePipeline/CodeDeploy canary to SageMaker endpoint variants
B. Deploy the model container to ECS with a rolling update
C. Manually update a single SageMaker endpoint in the console
D. Run SageMaker Batch Transform weekly and store outputs in S3
Best answer: A
Explanation: Define the deployment as infrastructure as code and use a managed progressive delivery mechanism for SageMaker endpoints. Pair canary/linear traffic shifting with CloudWatch alarms to automatically roll back when latency or error rates regress. Use SageMaker model/version artifacts with encryption and AWS audit logs to satisfy governance requirements.
The core need is maintainable, reliable ML infrastructure: repeatable configuration, safe rollouts, and observability hooks that can trigger rollback. An AWS-native pattern is to use infrastructure as code (AWS CDK or CloudFormation) to define SageMaker models/endpoints and supporting IAM/KMS/VPC settings, then run CI/CD with CodePipeline. For safe rollout, use CodeDeploy’s blue/green (canary or linear) deployment for SageMaker by shifting traffic between endpoint production variants while watching CloudWatch alarms (for p95 latency and 5xx), and automatically rolling back on alarm. Governance is met by using encrypted artifacts (S3/KMS, ECR/KMS as applicable) and auditable change history (CodePipeline/CodeDeploy events and AWS CloudTrail).
The key takeaway is to combine IaC + progressive delivery + monitoring-driven rollback for SageMaker endpoints.
Topic: ML Model Development
A team is fine-tuning a pretrained model in Amazon SageMaker for a new classification task. They must reduce the risk of catastrophic forgetting of the model’s original capabilities.
Which statement is NOT correct (unsafe) as a high-level fine-tuning approach?
Options:
A. Freeze most layers or use LoRA/adapters
B. Validate on old-task and new-task evaluation sets
C. Use a large learning rate with only new-task data
D. Mix a replay set of base-domain data during fine-tuning
Best answer: C
Explanation: To prevent catastrophic forgetting, fine-tuning should preserve performance on the original distribution while adapting to the new task. Strategies such as including representative replay data, constraining weight updates (freezing layers or PEFT), and monitoring performance on both old and new evaluation sets help maintain prior capabilities. Fine-tuning only on new-task data with a large learning rate is unsafe because it accelerates overwriting previously learned representations.
Catastrophic forgetting happens when fine-tuning shifts the model parameters too far toward the new task, reducing performance on previously learned behaviors. High-level mitigation is to (1) keep the fine-tuning data representative of both old and new distributions and (2) limit how much the pretrained weights can change.
Common, practical approaches include: freezing most layers or using parameter-efficient methods such as LoRA/adapters, mixing a replay set of base-domain data into the fine-tuning dataset, keeping the learning rate conservative, and validating on both old-task and new-task evaluation sets throughout training.
In contrast, training only on new-task data with an aggressive learning rate tends to overwrite prior representations and increases forgetting risk.
Topic: Deployment and Orchestration of ML Workflows
A team has a fraud model in Amazon SageMaker. They must score 40 million transactions once per day from Amazon S3 and write results back to S3 within 3 hours. There is no per-transaction real-time requirement, and the team wants to minimize cost and operational overhead.
Which deployment approach should the team AVOID for this workload?
Options:
A. Run a SageMaker Batch Transform job on a daily schedule
B. Run an Amazon ECS task on AWS Fargate to perform batch scoring
C. Run a SageMaker Processing job that performs batch inference
D. Keep a SageMaker real-time endpoint running 24/7 for daily scoring
Best answer: D
Explanation: Because scoring happens only once per day and can run for hours, the team should use batch-oriented compute that is provisioned only for the job duration. Keeping a real-time endpoint running continuously adds unnecessary always-on instance cost without improving required latency. The violated principle is selecting always-on infrastructure when the workload is offline and periodic.
The core tradeoff is paying for reserved/continuous inference capacity versus provisioning compute only when you need it. For periodic, offline scoring from S3 to S3 with a multi-hour SLA, batch patterns (Batch Transform, Processing, or a containerized batch job on ECS/Fargate) are cost-efficient because they spin up resources for the run and then terminate.
A real-time SageMaker endpoint is designed for low-latency, on-demand requests and charges for provisioned instances while the endpoint is up. Using it for a once-per-day batch job typically wastes money on idle capacity between runs and adds endpoint operations you do not need. The key takeaway is to match the deployment pattern to the access pattern: batch for scheduled bulk inference, endpoints for interactive latency-sensitive inference.
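A sketch of the batch pattern with the SageMaker Python SDK; the model name, S3 paths, and instance settings are assumptions:

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="fraud-xgb-2026-02-01",               # already-registered SageMaker model
    instance_count=4,                                 # sized for the 3-hour window
    instance_type="ml.m5.xlarge",
    output_path="s3://ml-prod/scores/2026-02-25/",
    strategy="MultiRecord",
    assemble_with="Line",
    accept="text/csv",
)

# Compute exists only for the duration of this job, then terminates.
transformer.transform(
    data="s3://ml-prod/transactions/2026-02-25/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()
```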
Topic: Data Preparation for Machine Learning (ML)
A data engineer is building an Amazon SageMaker training dataset in an AWS Glue Spark job by joining a fact table (transactions) with a dimension table (customer_attributes) on customer_id. The output row count is unexpectedly higher than the input fact table.
Exhibit: Glue job log (row counts)
transactions: rows=10,000 distinct(customer_id)=9,500
customer_attributes: rows=12,300 distinct(customer_id)=9,500
joined_output (left join on customer_id): rows=12,940
Based on the exhibit, what is the best next step to preserve join correctness and avoid unintended duplication?
Options:
A. Change the left join to an inner join
B. Deduplicate customer_attributes to one row per customer_id before joining
C. Normalize customer_id strings (trim/lowercase) before joining
D. Increase Spark shuffle partitions for the join stage
Best answer: B
Explanation: The join is multiplying fact rows because the dimension side is not unique on the join key. The exhibit shows customer_attributes has more rows than distinct customer_id, and the joined output row count is greater than the fact table row count. Ensuring a single record per customer_id in the dimension (or otherwise enforcing the intended join cardinality) prevents unintended duplication.
Join correctness depends on the expected key cardinality. Here, transactions should not grow after a left join unless the right-side table has multiple matches per key.
The exhibit indicates this exact problem:
- customer_attributes: rows=12,300 distinct(customer_id)=9,500 shows multiple rows per customer_id.
- joined_output ... rows=12,940 being greater than transactions: rows=10,000 confirms row multiplication.

To preserve correctness, make the dimension deterministic at one row per customer_id before the join (for example, pick the latest effective record with a window function, or aggregate to a single row). This keeps the join one-to-one (or the intended one-to-many) instead of unintentionally many-to-many.
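A PySpark sketch of that deduplication, assuming the transactions and customer_attributes DataFrames are already loaded in the Glue job; the effective_date column used to pick the latest record is an assumption:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep exactly one row per customer_id: the most recent attribute record.
latest_first = Window.partitionBy("customer_id").orderBy(F.col("effective_date").desc())

customer_attributes_dedup = (
    customer_attributes
    .withColumn("rn", F.row_number().over(latest_first))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# The left join now preserves the fact-table row count (10,000 rows).
joined_output = transactions.join(customer_attributes_dedup, on="customer_id", how="left")
```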
Topic: Data Preparation for Machine Learning (ML)
You are troubleshooting scalability issues in an ML data ingestion and storage pipeline on AWS. Which statement is NOT correct about diagnosing throughput bottlenecks and choosing mitigations?
Options:
A. AWS Glue jobs can scale out by increasing workers and parallelism.
B. S3 request throughput can improve by spreading objects across prefixes.
C. Kinesis Data Streams shard hot spots can occur from skewed partition keys.
D. S3 throughput is unaffected by prefix design, so prefix spread never helps.
Best answer: D
Explanation: A common scalability bottleneck in S3-backed pipelines is uneven request distribution (a hot prefix), which can limit effective parallel throughput for reads/writes. Spreading objects and requests across multiple prefixes increases parallelism and reduces hot-spotting risk. The other statements correctly describe common throughput ceilings and mitigations for Kinesis and AWS Glue.
Scalability bottlenecks in ML ingestion/storage pipelines often come from constrained parallelism (too few partitions/shards) or skew (hot partitions). For S3-backed datasets, request patterns that concentrate activity on a small keyspace (for example, a single heavily used prefix) can reduce effective throughput; distributing objects across multiple prefixes and enabling readers to parallelize across those prefixes is a standard mitigation.
Similarly, Kinesis Data Streams ingest/read throughput scales with shard count, but a poor partition-key strategy can cause one shard to become a hot spot even when overall shard capacity looks sufficient. For transformation/ETL, AWS Glue can reduce runtime by scaling out workers/DPUs and increasing parallelism, assuming the job is not bottlenecked by a single hot partition or external system.
The key takeaway is to look for throughput ceilings and hot partitions, then increase parallelism and improve key/partition distribution.
Topic: Data Preparation for Machine Learning (ML)
A team trains an XGBoost model in Amazon SageMaker using tabular data in Amazon S3. The team uses multiple transformations (imputation, one-hot encoding, and numeric scaling) and must ensure feature parity between training and inference across future model versions.
Which approach should the team NOT use?
Options:
A. Use the same preprocessing script in both a SageMaker Processing step and the inference container
B. Store transformation artifacts (for example, encoders/scalers) from training and load them at inference
C. Recreate the transformations in the inference code by hand to match training
D. Register the feature definitions and transformations and reuse them via SageMaker Feature Store
Best answer: C
Explanation: To ensure feature parity, transformation logic and feature definitions should be reusable and applied consistently in both training and inference. Manually re-implementing preprocessing separately for inference creates two sources of truth, making it easy for the online path to diverge from the training pipeline as code and schemas evolve.
The core principle is to keep training and inference on a single, versioned source of truth for feature definitions and transformation logic. If the team hand-codes preprocessing separately for inference, even small differences (category ordering, missing-value handling, scaling parameters, new columns) can change the feature vector and degrade predictions or break the endpoint.
Acceptable patterns include:
- Packaging one preprocessing script or library that is used by both the SageMaker Processing step and the inference container.
- Persisting fitted transformation artifacts (encoders, scalers, imputers) from training and loading those same artifacts at inference.
- Registering feature definitions and transformations once and reusing them via SageMaker Feature Store.
The key takeaway is to avoid duplicated, independently maintained transformation implementations across training and inference.
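A minimal scikit-learn sketch of the "one fitted artifact, two consumers" pattern; the column names and file paths are illustrative assumptions:

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# --- Training / Processing step: fit once, on the training split only ---
numeric = ["tenure_months", "monthly_charges"]
categorical = ["plan_type"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

train_df = pd.read_csv("train.csv")
X_train = preprocessor.fit_transform(train_df)
joblib.dump(preprocessor, "preprocessor.joblib")   # ship alongside the model artifact

# --- Inference container: load the same fitted artifact, never re-fit ---
preprocessor = joblib.load("preprocessor.joblib")
X_request = preprocessor.transform(pd.read_csv("incoming_batch.csv"))
```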
Topic: Deployment and Orchestration of ML Workflows
A team uses an AWS ML workflow: data in Amazon S3 → train/tune in Amazon SageMaker → deploy → monitor. After training, they deployed a SageMaker real-time endpoint and now run a nightly scoring job by invoking the endpoint ~80,000 times in parallel from AWS Lambda, then writing predictions to S3.
The endpoint is idle 23 hours/day, costs are high, and the nightly run often throttles and misses its 2-hour completion SLA. There are no interactive users; results only need to be available in S3 by morning.
Which change is the best way to improve cost and reliability without breaking requirements?
Options:
A. Use a SageMaker multi-model endpoint for the trained model
B. Enable autoscaling and raise the endpoint max capacity
C. Switch the deployment to a SageMaker serverless endpoint
D. Run SageMaker Batch Transform nightly and remove the endpoint
Best answer: D
Explanation: This workload is pure batch scoring with an S3 output requirement and no need for low-latency online inference. SageMaker Batch Transform provisions compute only for the job duration and can use multiple instances for throughput, eliminating always-on endpoint cost while reducing throttling risk during the nightly peak.
The core issue is a deployment selection mismatch: a real-time endpoint is designed for interactive, low-latency requests and is billed continuously, but the team’s pattern is a scheduled, high-throughput batch job. Moving inference to SageMaker Batch Transform improves both cost and reliability because it uses ephemeral compute sized for the batch window and avoids per-request endpoint throttling.
A high-level corrective approach is:
- Schedule a nightly SageMaker Batch Transform job that reads the transaction data from S3 and writes predictions back to S3.
- Size the instance count and type so the job completes within the 2-hour window.
- Remove the always-on real-time endpoint (and the Lambda fan-out) once the batch path is validated.
The main tradeoff is that inference becomes job-based (not interactive), but that aligns with the stated requirements.
Topic: ML Model Development
A team is comparing two Amazon SageMaker training jobs and wants the results to be reproducible.
Exhibit: Training summary
Job: churn-xgb-2026-02-20-01
Input: s3://ml-data/churn/train.csv
HyperParams: {eta=0.1, max_depth=5}
Metric: validation:auc=0.912
Job: churn-xgb-2026-02-20-02
Input: s3://ml-data/churn/train.csv
HyperParams: {eta=0.1, max_depth=5}
Metric: validation:auc=0.935
Which action best enables reproducible model development for these runs?
Options:
A. Deploy both models to a multi-model endpoint to compare inference performance
B. Enable SageMaker Debugger to capture tensors and gradients during training
C. Version the dataset and log the exact dataset version, parameters, and metrics in SageMaker Experiments
D. Run SageMaker Automatic Model Tuning to explore additional hyperparameters
Best answer: C
Explanation: The exhibit shows identical input location and hyperparameters across runs but different validation AUC, which is a reproducibility red flag. To reproduce and audit results, each experiment run should record immutable dataset identifiers (for example, versioned S3 objects or timestamped prefixes) together with the run’s parameters and resulting metrics using an experiment tracker such as SageMaker Experiments.
Reproducible training requires being able to re-run the same code against the same dataset version with the same configuration and then compare resulting metrics. In the exhibit, the two jobs have the same hyperparameters and point to the same S3 object key (Input: s3://ml-data/churn/train.csv) but yield different AUC values, which commonly happens when the data at that key changes over time (or when run configuration isn’t fully captured).
A practical AWS approach is to:
- Version the dataset (for example, S3 versioning or immutable, timestamped prefixes) so each run references an exact data snapshot.
- Log the dataset version/URI, hyperparameters, and resulting metrics for every training job in SageMaker Experiments.
- Compare runs through the experiment tracker so a metric difference can be traced to a specific data or configuration change.
This creates an auditable lineage for each metric value back to the exact dataset and settings that produced it. Note that in the exhibit both runs already point at the same key (s3://.../train.csv) and the HyperParams are already the same, so only a versioned dataset identifier plus tracked run metadata can explain, and later reproduce, the differing AUC.
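A hedged sketch of run tracking with the SageMaker Experiments Run API; the experiment name, run name, and the dataset version value are assumptions:

```python
from sagemaker.experiments.run import Run

dataset_uri = "s3://ml-data/churn/train.csv"
dataset_version_id = "3sL4kqQJlcpXroDTDmJrmSpXd3dIbrE"  # S3 object VersionId (example value)

with Run(experiment_name="churn-xgb", run_name="2026-02-20-01") as run:
    # Record the exact inputs that produced this result.
    run.log_parameters({
        "dataset_uri": dataset_uri,
        "dataset_version_id": dataset_version_id,
        "eta": 0.1,
        "max_depth": 5,
    })
    # ... launch or attach the training job here ...
    run.log_metric(name="validation:auc", value=0.912)
```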
Topic: ML Solution Monitoring, Maintenance, and Security
A company runs Amazon SageMaker training pipelines and real-time endpoints across multiple AWS accounts in an AWS Organization. Compliance requires audit evidence for 7 years, centrally stored in a dedicated logging account, encrypted with KMS, and protected against tampering/deletion. Security operations also needs near real-time visibility into API activity and endpoint application logs for investigations.
Which TWO actions should the ML engineer AVOID because they do not meet these requirements? (Select TWO.)
Options:
A. Configure SageMaker endpoint container logs to stream to CloudWatch Logs and set the log group retention to 7 years
B. Create an organization trail for all Regions that delivers to an S3 bucket in the logging account with SSE-KMS and log file validation enabled
C. Rely on CloudTrail Event history and set CloudWatch Logs retention to 30 days
D. Create separate account-level trails that deliver to S3 buckets in each application account managed by engineers
E. Enable S3 Object Lock (compliance or governance mode) on the centralized CloudTrail S3 bucket and restrict delete actions to a break-glass role
F. Send CloudTrail events to CloudWatch Logs for alerting and investigation, in addition to storing the long-term copy in S3
Correct answers: C and D
Explanation: Meeting compliance and investigation needs requires durable, centralized, tamper-resistant audit logs and searchable operational logs. CloudTrail should provide organization-wide API auditing with long-term retention in a protected S3 bucket, while CloudWatch Logs supports near real-time visibility and retained endpoint logs. Any approach that relies on short-lived history or decentralized, engineer-managed storage fails the stated evidence requirements.
For security investigations and compliance evidence, treat audit and operational logs as immutable records with controlled access. Use CloudTrail (preferably an organization trail across Regions) to capture API activity and deliver logs to a centralized S3 bucket in a dedicated logging account with KMS encryption, integrity controls (log file validation), and strong anti-tamper protections (for example, S3 Object Lock and restricted delete permissions).
CloudWatch Logs complements this by providing near real-time log search, metric filters/alarms, and centralized application logging (such as SageMaker endpoint container logs) with retention set to match policy. The key failure patterns are depending on short-retention sources (like CloudTrail Event history) or storing logs in non-central, weakly protected locations where administrators can alter or delete evidence.
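A boto3 sketch of the anti-tamper control on the centralized log bucket; the bucket name and retention period are assumptions, and note that Object Lock can only be enabled when the bucket is created:

```python
import boto3

s3 = boto3.client("s3")

# Object Lock must be requested at bucket creation (this also enables versioning).
s3.create_bucket(
    Bucket="org-cloudtrail-archive",              # assumed name, in the logging account
    ObjectLockEnabledForBucket=True,
)

# Default WORM retention: objects cannot be overwritten or deleted for 7 years.
s3.put_object_lock_configuration(
    Bucket="org-cloudtrail-archive",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)
```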
Topic: Deployment and Orchestration of ML Workflows
A machine learning engineer is reviewing an Amazon SageMaker batch inference configuration.
Exhibit: Transform job request (partial)
TransformInput:
S3Uri: s3://ml-prod/inference-input/2026-02-25/
TransformOutput:
S3OutputPath: s3://ml-prod/inference-output/2026-02-25/
TransformResources:
InstanceType: ml.m5.xlarge
InstanceCount: 4
Based on the exhibit, which interpretation best describes the batch workflow (input location, output location, and parallelism)?
Options:
A. Creates a real-time endpoint with 4 instances for inference
B. Reads from S3OutputPath and writes results to the input prefix
C. Runs on one instance; parallelism comes only from S3 prefix sharding
D. Reads input S3Uri, writes to S3OutputPath, uses 4 instances
Best answer: D
Explanation: This is a SageMaker batch transform-style configuration where S3 is used for both the input data location and the output results location. The job’s compute parallelism is determined by how many instances are provisioned for the transform resources.
In a batch inference workflow on SageMaker, you typically provide an S3 prefix for where the input objects are read from, an S3 prefix for where inference outputs are written, and a resource configuration that determines how much compute is allocated.
Here, the exhibit explicitly identifies:
- Input location: S3Uri: s3://ml-prod/inference-input/2026-02-25/
- Output location: S3OutputPath: s3://ml-prod/inference-output/2026-02-25/
- Parallelism: InstanceCount: 4 under TransformResources

A real-time endpoint configuration would not be expressed with TransformInput/TransformOutput fields like these.
- The S3Uri (input) and S3OutputPath (output) lines in the exhibit identify where data is read from and where results are written.
- TransformInput and TransformOutput in the request mark this as a batch transform job rather than an endpoint.
- TransformResources showing InstanceCount: 4 indicates four instances share the batch workload.

Topic: Deployment and Orchestration of ML Workflows
A company is building a regulated ML CI/CD pipeline using AWS CodePipeline and Amazon SageMaker Pipelines. Auditors require that for any deployed model the team can prove, for at least 2 years, exactly which training data, code commit, and container image produced it.
Which TWO actions should the team AVOID because they make the pipeline non-auditable or non-reproducible?
Options:
A. Register models in SageMaker Model Registry with metadata
B. Overwrite S3 dataset objects with versioning disabled
C. Enable S3 versioning for data and model artifact buckets
D. Enable CloudTrail for CodePipeline, SageMaker, and S3 events
E. Pull ECR latest image tag without recording digest
F. Tag training jobs with git SHA and dataset version ID
Correct answers: B and E
Explanation: Auditability requires immutable, linkable evidence for data, code, and runtime artifacts. Using mutable references (like an ECR tag) or losing historical versions of datasets prevents the team from reconstructing what produced a specific model. The safer pattern is to keep versioned/immutable artifacts and record their identifiers as metadata through the pipeline.
To meet compliance traceability, an ML CI/CD pipeline should produce an end-to-end lineage record that ties each model to immutable identifiers for (1) the training data snapshot, (2) the exact code revision, and (3) the exact container image/build. Mutable pointers (for example, an ECR tag that can be moved) and non-versioned data locations break reproducibility because the same identifier can refer to different content over time.
A robust approach is:
- Enable S3 versioning for dataset and model artifact buckets and record the object VersionId used by each training run.
- Tag training jobs with the git commit SHA and the dataset version ID.
- Resolve and record the ECR image digest (not a mutable tag) that the training and inference containers actually used.
- Register the model in SageMaker Model Registry with this metadata, and keep CloudTrail enabled for CodePipeline, SageMaker, and S3 events.
The key takeaway is to record immutable artifact IDs, not mutable names: pulling an ECR latest tag without recording its digest prevents proving which image actually ran.
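A boto3 sketch of recording the immutable image identity at build time; the repository and tag names are assumptions:

```python
import boto3

ecr = boto3.client("ecr")

# Resolve the tag that was just pushed to its content digest.
detail = ecr.describe_images(
    repositoryName="fraud-training",
    imageIds=[{"imageTag": "v1.4.0"}],
)["imageDetails"][0]

image_digest = detail["imageDigest"]  # e.g. sha256:abc123...
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = ecr.meta.region_name

# Store this digest-pinned URI (not a mutable tag) in the model package metadata.
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/fraud-training@{image_digest}"
print(image_uri)
```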
Topic: ML Solution Monitoring, Maintenance, and Security
A team hosts a fraud model on a SageMaker real-time endpoint with Data Capture enabled. A SageMaker Model Monitor schedule runs hourly and publishes a CloudWatch alarm (Fraud-DataDrift-Alarm) when a constraint violation is detected. The team expects an AWS Step Functions state machine to start a SageMaker Pipeline retraining run whenever the alarm goes to ALARM, but no executions are starting.
Exhibit: Current EventBridge rule event pattern
{
"source": ["aws.sagemaker"],
"detail-type": ["SageMaker Model Monitor Schedule Execution"],
"detail": {
"monitoringScheduleName": ["fraud-mm"]
}
}
Which change will fix the issue with the least operational change while preserving the existing drift trigger?
Options:
A. Disable Data Capture and read inference logs from CloudWatch Logs to detect drift
B. Recreate the Model Monitor baseline with a larger reference dataset so the alarm stops flapping
C. Increase the Model Monitor schedule frequency from hourly to every 5 minutes
D. Update the EventBridge rule to match CloudWatch Alarm State Change for Fraud-DataDrift-Alarm and target the Step Functions StartExecution API
Best answer: D
Explanation: The monitoring signal already exists as a CloudWatch alarm, but the EventBridge rule is listening for a SageMaker event type that won’t fire for alarm transitions. Use the alarm’s state change event as the retraining trigger. Then EventBridge can invoke Step Functions to orchestrate the SageMaker Pipeline run.
Symptom: the drift alarm is entering ALARM, but no Step Functions executions start.
Root cause: the EventBridge rule is filtering on a SageMaker Model Monitor “schedule execution” event, while the team’s chosen retraining trigger is a CloudWatch alarm transition. EventBridge will never match the alarm event with the current pattern.
Fix: connect the monitoring output to orchestration by triggering on the CloudWatch alarm state change event and using it to start the workflow:
- source = aws.cloudwatch
- detail-type = CloudWatch Alarm State Change and the specific alarm name
- Target: Step Functions StartExecution (optionally passing alarm details)

This keeps the existing drift threshold as the retraining trigger and only corrects the event wiring.
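A boto3 sketch of the corrected wiring; the rule name, state machine ARN, and role ARN are assumptions, and the pattern fields follow the CloudWatch Alarm State Change event format:

```python
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="fraud-drift-retrain",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": ["Fraud-DataDrift-Alarm"],
            "state": {"value": ["ALARM"]},
        },
    }),
)

events.put_targets(
    Rule="fraud-drift-retrain",
    Targets=[{
        "Id": "start-retraining",
        "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:fraud-retrain",
        "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeInvokeStepFunctions",
    }],
)
```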
Topic: Deployment and Orchestration of ML Workflows
In Amazon SageMaker, which deployment option is intended for inference requests with large payload sizes and long processing times, where the client should not keep a synchronous connection open while the result is computed?
Options:
A. Serverless inference endpoint
B. Asynchronous inference endpoint
C. Batch transform job
D. Real-time inference endpoint
Best answer: B
Explanation: SageMaker asynchronous inference is built for workloads where requests can take longer to process and/or carry larger payloads than a synchronous request-response pattern is meant to handle. It decouples the client from the model’s processing time by queueing requests and delivering outputs asynchronously. This makes it a better fit than low-latency real-time patterns.
The key requirement is that inference processing time and payload size may exceed what is practical for a synchronous API call. A SageMaker asynchronous inference endpoint addresses this by accepting requests, placing them in a queue, and writing outputs to a destination (commonly Amazon S3) when processing completes. This design helps handle variable processing times and supports higher concurrency without forcing the caller to keep a connection open.
Batch transform is the closest alternative but is intended for offline, job-based processing of an entire dataset rather than request-driven endpoint inference. If you need interactive, request/response behavior with low latency, a real-time endpoint (including serverless real-time for spiky traffic) is typically more appropriate.
Topic: ML Model Development
A data science team trained a regression model in Amazon SageMaker to predict home sale price (USD). The team captured the following evaluation summary on the validation set.
Exhibit: Validation metrics and residual summary
Train RMSE (USD): 18,900
Validation RMSE (USD): 19,300
Residual mean (val): +$120
Residual std by true price bucket (val):
<$200k: $8,100
$200k–$500k: $17,600
>$500k: $42,300
Based on the exhibit, which next step is the BEST way to address the issue revealed by the residual pattern?
Options:
A. Model log(price) (or similar) and re-evaluate on original scale
B. Increase model complexity to fix underfitting
C. Switch to classification metrics such as AUC to assess performance
D. Add early stopping to reduce overfitting
Best answer: A
Explanation: The model’s overall RMSE is similar for train and validation, but the residual standard deviation grows substantially for higher-priced homes. This indicates heteroscedastic errors (non-constant variance) rather than classic overfitting/underfitting. A common next step is to transform the target (for example, predict log(price)) or otherwise weight the loss to better fit relative error across price ranges.
For regression, RMSE summarizes average error magnitude, but residual patterns can reveal systematic issues that a single metric can hide. In the exhibit, train and validation RMSE are close (18,900 vs 19,300), suggesting generalization is not the primary problem.
The key signal is non-constant residual variance across the target range: residual standard deviation rises from $8,100 for homes under $200k to $42,300 for homes over $500k (see the bucketed residual std lines). This is heteroscedasticity, which often improves by modeling the target on a transformed scale (for example, log) so errors become more proportional across ranges, then converting predictions back to USD.
This is a better fit than simply tuning training duration or model size when the dominant issue is variance that grows with the target magnitude.
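A minimal sketch of the target-transform idea; X_train, y_train, X_val, and y_val are assumed to exist, and the XGBoost hyperparameters are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Train on log(price) so errors are closer to proportional across price ranges.
model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, np.log1p(y_train))

# Convert predictions back to USD before computing business-facing metrics.
pred_usd = np.expm1(model.predict(X_val))
rmse_usd = mean_squared_error(y_val, pred_usd, squared=False)
print(f"Validation RMSE (USD): {rmse_usd:,.0f}")
```

Re-checking the bucketed residual spread on the original scale after this change shows whether the variance has become more uniform across price ranges.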
Topic: Data Preparation for Machine Learning (ML)
A healthcare company needs to create human-labeled data for an NLP model. Use the exhibit to choose the most appropriate AWS labeling approach.
Exhibit: Labeling request (excerpt)
1) Dataset: 250,000 call transcripts
2) Contains PHI
3) Data must not leave the AWS account
4) Annotators must be internal employees (HIPAA BAA)
5) Quality: 2 annotators per item + adjudication
Which labeling approach best satisfies these requirements?
Options:
A. Use SageMaker Ground Truth with the public workforce and rely on NDAs
B. Use Amazon Mechanical Turk to run public HITs with worker qualifications
C. Export transcripts to an external labeling vendor and import labels back to S3
D. Use SageMaker Ground Truth with a private workforce and annotation consolidation
Best answer: D
Explanation: The exhibit’s governance constraints require that PHI stays within the AWS account and that only internal employees label the data. SageMaker Ground Truth with a private workforce satisfies these constraints while also supporting the required two-annotator workflow with adjudication through annotation consolidation.
The core decision is governance and quality control for labeling. The exhibit states the data contains PHI (line 2), must not leave the AWS account (line 3), and annotators must be internal employees under a HIPAA BAA (line 4). That rules out public crowdsourcing and external vendor workflows where data access expands beyond the company’s controlled workforce.
SageMaker Ground Truth with a private workforce lets you:
- Restrict labeling to internal employees, consistent with the HIPAA BAA requirement.
- Keep the transcripts within the AWS account rather than exposing them to a public crowd or an external vendor.
- Configure multiple workers per object with annotation consolidation, supporting the two-annotator-plus-adjudication quality requirement.
Key takeaway: when sensitive data and strict workforce governance are required, choose a private workforce labeling solution rather than public or external options.
Topic: Data Preparation for Machine Learning (ML)
A financial services company is preparing a training dataset in Amazon S3 for a loan approval model in Amazon SageMaker. The dataset includes a label column approved (1 = approved, 0 = denied), a sensitive attribute column gender, and PII columns (ssn, email).
The ML engineer must use SageMaker Clarify to assess potential bias in the training data and produce an auditable bias report. Company policy requires that PII is not exported to publicly accessible locations and that all Clarify outputs in S3 are encrypted with SSE-KMS.
Which TWO actions should the engineer AVOID? (Select TWO.)
Options:
A. Point Clarify at the raw S3 prefix containing ssn and write the report to an S3 bucket with public read access
B. Run Clarify without specifying a facet_name (sensitive attribute) and infer bias from overall label imbalance
C. Use Clarify post-training bias with a SageMaker model (or model package) and store outputs in an SSE-KMS encrypted S3 location
D. Add a Clarify processing step in SageMaker Pipelines to generate and store bias reports
E. Run separate Clarify bias jobs for gender (and other protected attributes) and compare the resulting metrics
F. Run a Clarify pre-training bias processing job on a curated dataset that excludes PII
Correct answers: A and B
Explanation: SageMaker Clarify bias metrics require defining the sensitive attribute (facet) so metrics can be computed across groups. The workflow must also meet data handling requirements by preventing public exposure of PII and ensuring the generated reports are stored securely with SSE-KMS encryption.
SageMaker Clarify evaluates bias by comparing outcomes across groups defined by a sensitive attribute (for example, gender) for either pre-training (dataset) or post-training (model predictions) analysis. To generate meaningful bias metrics, you must configure Clarify with the label and the facet so it can compute group-based statistics (such as disparate impact–style measures) rather than only global class distributions.
Because Clarify produces artifacts (reports, constraints, and intermediate outputs) in S3, the storage location must follow your organization’s security requirements. For regulated data, this typically means restricting bucket access (no public policies/ACLs) and enabling SSE-KMS so outputs are encrypted at rest. The key takeaway is that Clarify needs an explicitly defined sensitive facet for correct bias evaluation, and its output handling must meet security controls.
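A hedged sketch with the SageMaker Python SDK's Clarify processor; the S3 paths, role, KMS key, and the non-label header columns are assumptions:

```python
from sagemaker import clarify

processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/ClarifyRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_kms_key="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # SSE-KMS for outputs
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://ml-curated/loans/train-no-pii/",   # curated, PII removed
    s3_output_path="s3://ml-reports/clarify/loan-bias/",        # private, encrypted bucket
    label="approved",
    headers=["approved", "gender", "income", "loan_amount"],
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],      # favorable outcome: approved = 1
    facet_name="gender",                # sensitive attribute to analyze
)

processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)
```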
Topic: Data Preparation for Machine Learning (ML)
A team stores real-time fraud features in an Amazon SageMaker Feature Store feature group that was created with only the online store enabled. A new SageMaker training pipeline step tries to build a training dataset by running an Athena query against the feature group’s offline store.
Exhibit: Pipeline error
ValidationException: FeatureGroup FraudFeatures has no OfflineStoreConfig
The team must keep low-latency feature lookups for the existing real-time endpoint and make the smallest change that fixes the training failure. What should the ML engineer do?
Options:
A. Add IAM permissions for athena:StartQueryExecution to the role
B. Read training data from the online store using GetRecord calls
C. Create a new feature group with an offline store and backfill
D. Increase training instance size to avoid the Athena query failure
Best answer: C
Explanation: The training step fails because the feature group was created without an offline store configuration, so there is no S3-backed offline table for Athena to query. SageMaker Feature Store does not support enabling the offline store on an existing feature group after creation. Creating a new feature group with offline storage (and backfilling features) fixes training while preserving online lookups for low-latency inference.
SageMaker Feature Store uses two distinct storage paths: the online store for low-latency key-based inference lookups, and the offline store (S3 + Glue Data Catalog) for analytics and training-time batch access (for example, via Athena). The error indicates the feature group has no OfflineStoreConfig, which happens when it was created with only the online store enabled.
To fix this with minimal disruption:
- Create a new feature group that includes an OfflineStoreConfig (S3 location and Glue Data Catalog settings) while keeping the online store enabled.
- Backfill the new feature group with the existing feature records so historical values are available for training.
- Point the training pipeline's Athena query at the new offline table; the real-time endpoint continues to use low-latency online lookups.

The key takeaway is that offline storage must be planned at feature group creation time for training use cases.
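A hedged sketch of creating the replacement feature group with both stores; the group name, schema source DataFrame (fraud_features_df), S3 URI, and role are assumptions:

```python
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.session import Session

session = Session()
fg = FeatureGroup(name="FraudFeaturesV2", sagemaker_session=session)

# Infer feature definitions from a pandas DataFrame of existing records.
fg.load_feature_definitions(data_frame=fraud_features_df)

fg.create(
    s3_uri="s3://ml-feature-store/fraud-offline/",   # enables the offline store
    record_identifier_name="transaction_id",
    event_time_feature_name="event_time",
    role_arn="arn:aws:iam::111122223333:role/FeatureStoreRole",
    enable_online_store=True,                        # keep low-latency lookups
)

# Backfill: ingest the historical records into the new group.
fg.ingest(data_frame=fraud_features_df, max_workers=4, wait=True)
```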
Topic: ML Model Development
A company runs a SageMaker pipeline for a fraud model. The ML engineer wants to add an automated check that flags unusual shifts in incoming feature distributions (no labels available) so the team can investigate potential data drift before model performance degrades. Which Amazon SageMaker built-in algorithm best matches this problem type?
Options:
A. XGBoost
B. k-means
C. Linear Learner
D. Random Cut Forest
Best answer: D
Explanation: This is an unlabeled, unsupervised need to identify unusual incoming data patterns that may indicate drift. Random Cut Forest is a SageMaker built-in algorithm designed for anomaly detection on numeric features, making it appropriate for automated drift signals. This aligns with the core principle of monitoring for drift to maintain ML system performance over time.
The core principle here is monitoring for drift: you want an automated signal when new data looks meaningfully different from what the model was trained on. Because the stem specifies no labels and the goal is to flag unusual patterns in feature distributions, the matching built-in approach is unsupervised anomaly detection. Amazon SageMaker Random Cut Forest (RCF) is designed to detect outliers by assigning anomaly scores to observations, which can be used to trigger investigation workflows (for example, alarms and retraining evaluation) when feature behavior changes.
The key takeaway is to choose an algorithm that matches the learning setup (unlabeled anomaly detection) rather than a supervised predictor.
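A hedged sketch of training the built-in RCF algorithm with the SageMaker Python SDK; the role, bucket layout, hyperparameter values, and the train_features numpy array are assumptions:

```python
from sagemaker import RandomCutForest, Session

session = Session()

rcf = RandomCutForest(
    role="arn:aws:iam::111122223333:role/SageMakerTrainRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=512,
    num_trees=100,
    data_location=f"s3://{session.default_bucket()}/rcf/input",
    output_path=f"s3://{session.default_bucket()}/rcf/output",
)

# train_features: numpy array of numeric feature vectors from a known-good period.
rcf.fit(rcf.record_set(train_features))

# Scoring recent traffic with this model later yields anomaly scores; a spike in
# high-scoring records is the drift signal that triggers investigation.
```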
Topic: ML Model Development
A company uses an Amazon SageMaker pipeline to forecast daily demand for 10,000 SKUs. Data is stored in Amazon S3, a processing step builds lag features, SageMaker Automatic Model Tuning trains an XGBoost model, the model is deployed, and Amazon CloudWatch monitors MAPE.
The business requires a 30-day forecast horizon. Offline validation MAPE is 8%, but after deployment, MAPE is 18%. The current pipeline creates the training and validation sets by randomly splitting rows across the full 2-year history.
Which change is the best way to improve forecast performance and reliability without increasing cost significantly?
Options:
A. Enable SageMaker managed Spot Training for the tuning jobs
B. Use a time-ordered split that holds out the most recent 30 days per SKU for validation
C. Replace the split with k-fold cross-validation using shuffled folds
D. Increase real-time endpoint capacity and configure automatic scaling
Best answer: B
Explanation: Time series data must be split by time to avoid training on information from the future relative to the validation period. Holding out the most recent 30 days (the forecast horizon) creates an evaluation that matches how the model will be used in production. This usually improves model selection and reduces the offline-to-online performance gap with minimal added cost.
The core issue is temporal leakage and a mismatch between how the model is validated and how it is used. Random row splits let the model see patterns from dates that occur after (or interleaved with) the validation dates, which can make offline metrics unrealistically optimistic for forecasting.
A better approach is to validate using a chronological holdout window that matches the forecast horizon:
- Train on history up to a cutoff date and hold out the most recent 30 days per SKU for validation.
- Select hyperparameters and features against that holdout so offline MAPE reflects 30-day-ahead performance.
- Optionally repeat with rolling (backtesting) cutoffs to check stability across periods.
This change improves reliability and forecast performance assessment with little to no additional infrastructure cost; tuning and deployment choices become consistent with the 30-day-ahead requirement.
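A pandas sketch of the chronological holdout; the file name and column names are assumptions:

```python
import pandas as pd

df = pd.read_csv("daily_demand.csv", parse_dates=["date"])

# Hold out the most recent 30 days for every SKU as the validation window.
cutoff = df["date"].max() - pd.Timedelta(days=30)

train_df = df[df["date"] <= cutoff]
val_df = df[df["date"] > cutoff]

# Lag/rolling features for validation rows may only use information available
# on or before each row's date, so no future data leaks into training.
```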
Topic: ML Solution Monitoring, Maintenance, and Security
A SageMaker training job writes model artifacts to an S3 bucket that enforces SSE-KMS with a customer managed key. The job fails with:
AccessDeniedException: User: arn:aws:sts::111122223333:assumed-role/SageMakerTrainRole/... is not authorized to perform: kms:GenerateDataKey on key arn:aws:kms:us-east-1:111122223333:key/...
The role already has s3:PutObject to the bucket. Which action best applies the core principle of least privilege while fixing the failure?
Options:
A. Attach AmazonS3FullAccess to SageMakerTrainRole
B. Disable SSE-KMS on the S3 bucket for the model artifacts prefix
C. Update the KMS key policy to allow SageMakerTrainRole and grant only required KMS actions scoped to that key
D. Allow kms:* on * for SageMakerTrainRole
Best answer: C
Explanation: The failure is in the KMS authorization path (not S3), because the role cannot call kms:GenerateDataKey on the specific CMK required by the bucket’s encryption setting. The least-privilege fix is to allow that role to use only the necessary KMS actions on only that key, and ensure the key policy also permits the role.
This error indicates the failing control plane is AWS KMS authorization, not S3: when an S3 bucket enforces SSE-KMS, the caller must be allowed to use the specified CMK for data key generation and encryption. With least privilege, you grant only the KMS permissions required for the workload (commonly kms:GenerateDataKey, plus encryption/decryption as needed) and scope them to the specific CMK ARN. Because KMS evaluates both IAM policy and the key policy, the CMK’s key policy must also allow the SageMaker execution role (directly or via an allowed principal such as the account with appropriate conditions).
The key takeaway is to fix the specific missing permission on the specific resource, rather than broadening access.
Topic: Deployment and Orchestration of ML Workflows
A team deploys a real-time fraud model to an Amazon SageMaker endpoint using a CI/CD pipeline (AWS CodePipeline) that promotes an approved model package from SageMaker Model Registry.
Release requirements:
Which TWO rollout approaches should the team AVOID for this release?
(Select TWO.)
Options:
A. Shift 5% of traffic to a new production variant with CloudWatch alarms and automatic rollback to the previous variant
B. Keep the previous approved model package and endpoint configuration in Model Registry to enable one-click rollback in the pipeline
C. Run a short-lived blue/green cutover using identical VPC/KMS settings, then swap the endpoint configuration and keep the prior config for rollback
D. Use a linear rollout that increases traffic in small steps with alarms, then revert traffic weights if thresholds are breached
E. Temporarily deploy the candidate model to a public endpoint and store request/response payloads unencrypted for faster debugging
F. Replace the production model and route 100% of traffic immediately, planning to retrain and redeploy if issues are found
Correct answers: E and F
Explanation: Approaches to avoid are the ones that break explicit rollout constraints. Exposing inference publicly or writing unencrypted payloads violates security requirements, and cutting over 100% at once without an immediate rollback mechanism violates the 5-minute rollback requirement. Canary, linear, and controlled blue/green patterns can meet the stated rollback and cost-cap constraints when paired with alarms and preserved prior configurations.
High-level model release strategies (canary, linear, blue/green) are acceptable when they (1) shift traffic gradually or with a controlled cutover, (2) include health/KPI alarms, and (3) keep a ready rollback target (previous model package + endpoint config) that can be restored quickly.
In this scenario, the unsafe approaches are the ones that directly violate stated requirements:
- Deploying the candidate to a public endpoint and storing request/response payloads unencrypted breaks the security constraints on inference data.
- Routing 100% of traffic to the new model immediately, with retraining as the only remedy, leaves no preserved variant or endpoint configuration that can be restored within a 5-minute rollback window.
By contrast, canary and linear deployments use traffic weighting and monitoring to limit blast radius, and a time-boxed blue/green swap can satisfy the temporary capacity constraint if the overlap window is kept short and rollback is an endpoint config revert.
Topic: ML Solution Monitoring, Maintenance, and Security
A team runs a nightly Amazon SageMaker Processing job that reads the full raw clickstream dataset from Amazon S3 (including many columns that are never used by the model) and writes a large intermediate dataset back to S3 before training. To reduce costs without changing model behavior, the team updates the preprocessing to select only the required feature columns, aggregates records to the granularity needed for training, and deletes the intermediate artifacts after the training job finishes.
Which core principle does this action most directly reflect?
Options:
A. Data minimization
B. Least privilege
C. Defense in depth
D. Reproducibility
Best answer: A
Explanation: This change applies the data minimization principle: only collect, retain, and process what is necessary for the ML objective. By reducing unused columns and unnecessary intermediate outputs, the workflow lowers S3 storage and processing compute costs while preserving training inputs that matter to the model.
The core principle is data minimization: design ML pipelines to use (and keep) only the minimum data needed to meet the purpose. In the scenario, dropping unused columns and aggregating to the required granularity reduces I/O, storage footprint, and processing time, which directly lowers infrastructure costs without changing the model’s intended inputs. Deleting intermediate artifacts after use further reduces ongoing storage cost and reduces the amount of data that could be exposed in the event of an incident.
Key takeaway: cost optimization can be achieved by minimizing unnecessary data movement, storage, and processing—not just by changing instance sizes.
Topic: ML Solution Monitoring, Maintenance, and Security
A company has deployed a fraud detection model to a SageMaker real-time endpoint. The team wants to detect feature distribution drift in production by using SageMaker Model Monitor.
They have a representative training dataset in S3 and have enabled endpoint data capture to an S3 prefix. After a scheduled monitoring run, the team sees Model Monitor artifacts written to an S3 output prefix.
Which TWO statements are correct about configuring this monitoring and interpreting the high-level outputs of the monitoring job? (Select TWO.)
Options:
A. Model Monitor automatically derives the baseline from the first schedule run.
B. A non-empty constraint_violations.json means baseline constraints were breached.
C. Review statistics.json to identify which constraints failed.
D. Use SageMaker Debugger rules to detect inference data drift.
E. Create a baseline that outputs constraints.json and statistics.json to S3.
F. Model Monitor reads CloudWatch Logs as the monitoring input by default.
Correct answers: B and E
Explanation: To monitor data drift, SageMaker Model Monitor compares captured inference data to a baseline made from representative data and saved as baseline statistics and constraints in S3. The monitoring run then writes a violations artifact when the captured data no longer conforms to those baseline constraints.
SageMaker Model Monitor data quality monitoring works by establishing a baseline and then comparing production inference data to it on a schedule.
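A hedged sketch of the baselining step with the SageMaker Python SDK; the role and S3 paths are assumptions:

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::111122223333:role/ModelMonitorRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Writes statistics.json and constraints.json to the output S3 prefix.
monitor.suggest_baseline(
    baseline_dataset="s3://ml-prod/fraud/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://ml-prod/fraud/monitoring/baseline/",
    wait=True,
)
```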
Typical setup and outputs:
- A baseline job run against representative data writes statistics.json and constraints.json in S3.
- Each scheduled monitoring run compares captured inference data to that baseline and writes a constraint_violations.json file that lists any violated constraints (for example, missing values, type mismatches, or distribution constraints you baselined).

Key takeaway: baseline artifacts define "normal," and constraint_violations.json is the high-level signal that drift/quality constraints were violated.
- Create a baseline that outputs constraints.json and statistics.json to S3: these baseline artifacts are required for comparison.
- A non-empty constraint_violations.json means baseline constraints were breached: this is the primary drift/quality indicator from a run.
- Reviewing statistics.json to identify which constraints failed is not the right interpretation: the violations are surfaced in constraint_violations.json.

Topic: ML Solution Monitoring, Maintenance, and Security
A company runs a SageMaker real-time endpoint for credit decisions. The model’s features include PII. The security team requires that inference data remains in AWS, encrypted with a KMS key, with no internet egress. The application also has a strict p99 latency SLO of 200 ms.
The ML engineer must detect changes in inference data distributions over time and choose high-level analysis approaches (including SageMaker Clarify) to support drift investigation.
Which TWO actions should the engineer AVOID? (Select TWO.)
Options:
A. Enable endpoint data capture to KMS-encrypted S3 for analysis
B. Schedule Clarify bias drift runs on captured inference samples
C. Export full inference payloads to an external SaaS for drift reporting
D. Run Clarify SHAP explainability synchronously on every inference request
E. Compute PSI/KS drift metrics in a scheduled SageMaker processing job
F. Create a Model Monitor data quality baseline from training data
Correct answers: C and D
Explanation: Detecting distribution shifts typically requires capturing inference inputs, establishing a baseline from training (or a known-good period), and running offline or scheduled comparisons. SageMaker Model Monitor and SageMaker Clarify support these analyses using batch processing and stored reports. Approaches that move PII outside AWS or that add expensive analysis to the real-time inference path violate the stated security and latency requirements.
Data distribution drift detection is usually done by comparing summary statistics or distributions of recent inference data to a baseline (often from training data). In SageMaker, a common pattern is to enable endpoint data capture to encrypted S3, then run scheduled monitoring jobs: Model Monitor for data quality drift (feature statistics/constraints) and Clarify for bias drift analysis on captured samples. You can also compute custom drift metrics (for example PSI or KS tests) in a scheduled SageMaker processing job and publish results to CloudWatch for alarming.
Actions to avoid are those that violate explicit constraints: exporting PII outside AWS breaks the no-egress requirement, and running explainability synchronously per request risks latency and unnecessary cost. Keep heavy analysis asynchronous/offline and keep captured data protected with encryption and controlled access.
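A minimal numpy sketch of a PSI calculation that such a scheduled processing job might run; the bin count, thresholds, and input arrays are assumptions:

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare a recent feature sample against the baseline distribution.
    Rough rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    lo, hi = float(np.min(expected)), float(np.max(expected))
    edges = np.linspace(lo, hi, bins + 1)
    edges[0], edges[-1] = -np.inf, np.inf                      # cover the full range

    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid division by zero / log(0) for empty bins.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
```

The job can publish the resulting PSI values as CloudWatch custom metrics so alarms drive the investigation workflow.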
Topic: ML Solution Monitoring, Maintenance, and Security
A company runs an Amazon SageMaker real-time endpoint in multiple AWS accounts. They capture 1% of requests/responses for troubleshooting and need to retain all monitoring artifacts (logs, metrics, captured payloads, and Model Monitor reports) for 7 years for audits. Requirements:
Which approach BEST meets these requirements?
Options:
A. Keep payloads, logs, and metrics only in CloudWatch with 7-year retention
B. Centralize artifacts in SSE-KMS S3 Object Lock with lifecycle; export CloudWatch logs/metrics and write Data Capture/Model Monitor there
C. Store payloads and metrics in DynamoDB with a 7-year TTL
D. Write captured payloads to EFS on endpoint instances; snapshot weekly to S3
Best answer: B
Explanation: Use an immutable, low-cost archival store for audit retention and keep recent data quickly accessible. An SSE-KMS encrypted S3 bucket with Object Lock satisfies WORM and 7-year retention, while S3 lifecycle policies control cost over time. Exporting CloudWatch logs/metrics to S3 and writing SageMaker Data Capture and Model Monitor outputs to the same bucket preserves all required monitoring artifacts.
The core requirement is audit-ready retention of monitoring artifacts (including captured inference payloads and time-series telemetry) with immutability and encryption, without adding meaningful inference latency. Amazon S3 is the audit-friendly system of record here: S3 Object Lock (WORM) plus SSE-KMS meets governance requirements, and S3 lifecycle transitions meet the “hot for 30 days, cheap afterward” requirement.
To cover all artifact types end-to-end:
- Write SageMaker Data Capture payloads and Model Monitor reports directly to the centralized SSE-KMS bucket with Object Lock enabled.
- Export CloudWatch Logs (endpoint/container logs) and metrics to that same bucket for long-term retention.
- Apply S3 lifecycle rules to transition older artifacts to cheaper storage classes while retaining them for the full 7 years.
Keeping operational views in CloudWatch for recent troubleshooting is compatible, but S3 is the durable audit store.
Topic: ML Solution Monitoring, Maintenance, and Security
Which AWS service provides an immutable-style audit trail of AWS API activity (for example, SageMaker CreateTrainingJob, CreateEndpoint, and IAM PassRole) that can be stored in Amazon S3 for compliance evidence and security investigations?
Options:
A. AWS CloudTrail
B. Amazon SageMaker Model Monitor
C. Amazon CloudWatch Logs
D. AWS Config
Best answer: A
Explanation: AWS CloudTrail is the AWS service used for auditing because it records management and data events for API calls across AWS services, including Amazon SageMaker. Those event records can be centrally stored (commonly in Amazon S3) and used as compliance evidence and for security investigations. This is distinct from operational logging and ML-specific monitoring features.
For security investigations and compliance, you typically need a reliable record of “who did what, when, and from where” in an AWS account. AWS CloudTrail provides this by capturing API activity (management events and, when enabled, selected data events) across services such as IAM, Amazon S3, and Amazon SageMaker. CloudTrail logs can be delivered to an Amazon S3 bucket (often with encryption and retention controls) and optionally sent to CloudWatch Logs for near-real-time searching and alerting. In contrast, CloudWatch Logs is primarily for ingesting and querying application/service log streams, and SageMaker monitoring features focus on ML behavior (like drift) rather than account-level API auditing.
Topic: Deployment and Orchestration of ML Workflows
A team is implementing CI/CD for an Amazon SageMaker-based ML workflow using CodePipeline and SageMaker Pipelines. The pipeline includes stages for training, evaluation, model registration, deployment, and monitoring.
Which statement is NOT correct (false/unsafe) about applying CI/CD principles to this MLOps workflow on AWS?
Options:
A. Use a model registry to version artifacts and track lineage from data and code to deployed model.
B. Register only models that pass evaluation gates, and promote them through environments using approvals.
C. Monitor deployed models for drift and performance, and trigger retraining or rollback workflows when thresholds are breached.
D. Deploy the model automatically to production first, then evaluate and roll back if needed.
Best answer: D
Explanation: In MLOps CI/CD, automation promotes only validated model artifacts. Training outputs must be evaluated against defined metrics before the model is registered and deployed, so production receives a controlled, approved version. Monitoring then closes the loop by detecting drift or degradation and triggering an automated response.
CI/CD principles for ML add explicit quality controls around model artifacts, not just application code. A typical automated flow is: train a candidate model, evaluate it against agreed metrics/baselines, register/version the model and its metadata (including lineage), deploy a specific approved version to the target environment, and continuously monitor the deployed model to detect drift or performance regression.
Deploying to production before evaluation is unsafe because it bypasses the key CI/CD gate that prevents unvalidated models from being promoted. The correct pattern is “train → evaluate → register → (approve) → deploy → monitor,” with automation and approvals appropriate to risk.
Topic: ML Solution Monitoring, Maintenance, and Security
A team is optimizing costs for an Amazon SageMaker workload after reviewing CloudWatch metrics.
- A real-time endpoint serving a deep learning transformer model runs on ml.m5.2xlarge; p95 latency is 450 ms and CPU is ~85%.
- A nightly Model Monitor processing job on ml.m5.2xlarge reads ~120 GB from Amazon S3 to compute drift statistics; it fails with out-of-memory errors while CPU stays ~20%.

Which TWO instance family changes best fit these workloads? (Select TWO.)
Options:
A. Move the Model Monitor processing job to a memory-optimized instance
B. Move the real-time endpoint to an inference-optimized instance (Inferentia-based)
C. Move the real-time endpoint to a memory-optimized instance
D. Move the real-time endpoint to a GPU training-optimized instance
E. Move the Model Monitor processing job to a compute-optimized instance
F. Move both workloads to newer general-purpose instances
Correct answers: A and B
Explanation: The endpoint is CPU-bound and serving a deep learning transformer, so an inference-optimized family is the best match to improve latency per dollar for supported frameworks. The Model Monitor processing job is failing due to memory exhaustion with low CPU use, so a memory-optimized family is the best fit to prevent OOM and stabilize the nightly run.
Instance family selection should follow the dominant resource bottleneck you observe in monitoring data. For real-time inference of deep learning models, inference-optimized instances (such as Inferentia-based choices) are purpose-built to deliver better throughput/latency per cost than general-purpose CPU instances when the model/framework is supported. For batch-style monitoring jobs that crash with out-of-memory while CPU remains low, the limiting factor is RAM, so memory-optimized instances are the appropriate fit.
A practical approach is:
- Move the real-time endpoint to an inference-optimized (Inferentia-based) instance, after confirming the model and framework are supported, and re-measure p95 latency and cost.
- Move the Model Monitor processing job to a memory-optimized instance sized for the ~120 GB working set (or partition the data) so it stops failing with out-of-memory errors.
Compute-optimized helps when CPU is the constraint; it does not fix RAM-driven failures.
Topic: Data Preparation for Machine Learning (ML)
A team is preparing a customer churn dataset for training in Amazon SageMaker. The data is stored in a private S3 bucket encrypted with SSE-KMS and contains PII. Company policy requires that PII must not leave the AWS account, and the team must preserve the integrity of an existing train/validation/test split for unbiased evaluation.
Which TWO data cleaning steps should the team AVOID?
Options:
A. Deduplicate by customer_id keeping the latest event_time
B. Add missing-value indicator features for sparse columns
C. Fit imputers on the training split and apply to others
D. Cap outliers using IQR thresholds computed on training data
E. Export raw PII to a third-party tool for deduplication
F. Impute using statistics computed from all dataset splits
Correct answers: E and F
Explanation: The steps to avoid are the ones that break explicit constraints: protecting PII and preventing data leakage across the existing train/validation/test split. Computing imputation statistics across all splits contaminates the holdout sets and biases evaluation. Sending raw PII outside the AWS account violates the security requirement.
Good data cleaning improves training quality only if it preserves security boundaries and the validity of model evaluation. In this scenario, two explicit rules apply: PII must stay inside the AWS account, and the existing train/validation/test split must remain a true holdout.
Practical guardrails:
- Perform deduplication and other cleaning inside the account (for example, with AWS Glue or SageMaker Processing) rather than exporting raw PII to external tools.
- Fit imputers, outlier caps, and any scaling statistics on the training split only, then apply the fitted parameters unchanged to validation and test.
A common failure mode is “peeking” at validation/test when choosing cleaning parameters, which silently turns the evaluation into a training artifact.
Topic: Deployment and Orchestration of ML Workflows
A team deploys a fraud model to a production Amazon SageMaker real-time endpoint. The team wants to reduce release risk by enforcing model versioning, artifact immutability, and controlled promotion. Requirements:
Which TWO deployment practices should the team AVOID?
Options:
A. Allow training role to update production endpoint automatically
B. Use Model Registry approvals; deploy specific model package version
C. Overwrite the S3 object at a fixed model artifact key
D. Create new EndpointConfig; shift traffic with blue/green policy
E. Enable S3 versioning; deploy using specific object VersionId
F. Record ECR image digest and immutable tag in Model Package
Correct answers: A and C
Explanation: Avoid practices that mutate release artifacts in place or bypass gated promotion. Immutable, versioned artifacts (S3 version IDs and image digests) plus an approval-controlled pipeline enable repeatable deployments and exact rollback. Direct, automated promotion from training to production violates controlled promotion requirements.
Safe model releases on SageMaker rely on deploying immutable, uniquely versioned artifacts and promoting them through a controlled process. If you overwrite a model artifact at a fixed S3 key, the endpoint may still “point” to the same location while the bytes change, making audits and rollbacks ambiguous. Similarly, letting the training role update the production endpoint directly removes the required approval gate and mixes build and release responsibilities.
A typical low-risk pattern is:
- Register the candidate in SageMaker Model Registry and deploy only a specific, approved model package version.
- Reference immutable artifacts: a specific S3 object VersionId for the model data and an ECR image digest (with an immutable tag) recorded in the model package.
- Create a new EndpointConfig, then shift traffic (canary/blue-green) with fast rollback.

The key takeaway is that promotion should be gated, and references should be to immutable versions, not mutable "latest" locations.
Topic: Deployment and Orchestration of ML Workflows
A team uses Amazon SageMaker Model Registry to manage model versions for a production real-time endpoint. They want a manual approval gate so that a newly trained model is reviewed before any deployment occurs through their CI/CD pipeline.
Which statement is INCORRECT about implementing this workflow?
Options:
A. Marking a model package as Approved automatically updates any endpoints using the model package group to the new version.
B. Set new model packages to PendingManualApproval and require a reviewer to change them to Approved before promotion.
C. Restrict sagemaker:UpdateModelPackage to an approver role and rely on CloudTrail for auditability of approval actions.
D. Use an EventBridge rule for SageMaker model package state changes to trigger a deployment pipeline when a package becomes Approved.
Best answer: A
Explanation: In SageMaker Model Registry, approval status is a governance control that gates promotion, but it does not perform deployments by itself. You must wire approval-state changes to an explicit deployment action (for example, a pipeline stage) to update staging or production endpoints.
SageMaker Model Registry provides a centralized place to version models (model packages) and apply an approval gate using ModelApprovalStatus (for example, PendingManualApproval to Approved). The approval state is used to control when automation is allowed to promote a model, but it does not automatically modify running infrastructure.
A typical high-level promotion flow is:
- The pipeline registers each new model package with status PendingManualApproval.
- A reviewer evaluates the candidate and changes the status to Approved.
- An EventBridge rule on the model package state change triggers the deployment pipeline, which updates the staging or production endpoint.

Key takeaway: approval is a gate; promotion/deployment still requires an orchestrated action. Setting ModelApprovalStatus to PendingManualApproval/Approved is the standard way to block promotion until review.
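A boto3 sketch of the reviewer's approval action that downstream automation can react to; the model package ARN is an assumption:

```python
import boto3

sm = boto3.client("sagemaker")

# Reviewer flips the gate; an EventBridge rule on the model package state
# change can then start the deployment pipeline for this specific version.
sm.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:111122223333:model-package/fraud-models/12",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Passed offline evaluation and manual review",
)
```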
Topic: Data Preparation for Machine Learning (ML)
A dataset has a nominal categorical feature customer_id with tens of thousands of unique values. The feature must be used by a linear model, and the encoding should avoid creating an extremely wide one-hot matrix.
Which encoding technique is most appropriate?
Options:
A. Label encoding
B. No encoding (pass the string values)
C. Binary encoding
D. One-hot encoding
Best answer: C
Explanation: Binary encoding is designed for high-cardinality categorical features because it represents each category using multiple binary digits, producing far fewer columns than one-hot encoding. This avoids the extreme dimensionality and sparsity that one-hot encoding creates for tens of thousands of categories. It also avoids using a single integer code that can imply a misleading order.
For nominal categorical features, the main trade-off is between representational fidelity and feature dimensionality.
One-hot encoding is typically preferred for low-cardinality nominal features because it does not introduce any artificial ordering, but it scales poorly as cardinality grows (it creates one column per category). Label encoding maps categories to a single integer column, which can introduce an arbitrary numeric ordering that is not meaningful for nominal data and can mislead many model types.
Binary encoding is a practical compromise for high-cardinality features: it converts the category index into a binary representation spread across multiple columns, drastically reducing the number of generated features compared with one-hot while still representing distinct categories.
When cardinality is very large and the model is sensitive to input dimensionality, binary encoding is often a better fit than one-hot for an ID-like feature such as customer_id; a minimal sketch follows.
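For illustration, here is a small, framework-free sketch of binary encoding with pandas/NumPy. The customer_id values are synthetic, and in a real pipeline the category-to-code mapping would be fit on the training split only and reused at inference, to avoid the leakage issue discussed earlier.

```python
import numpy as np
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Encode a nominal feature as the binary digits of its integer category code."""
    codes, uniques = pd.factorize(series)                 # codes: 0..K-1
    n_bits = max(1, int(np.ceil(np.log2(len(uniques)))))  # columns needed
    bits = (codes[:, None] >> np.arange(n_bits)) & 1      # low bit first
    cols = [f"{prefix}_bit{i}" for i in range(n_bits)]
    return pd.DataFrame(bits, columns=cols, index=series.index)

# Synthetic example: 50,000 distinct customer IDs need only 16 binary columns
# instead of ~50,000 one-hot columns.
df = pd.DataFrame({"customer_id": [f"C{i:05d}" for i in range(50_000)]})
encoded = binary_encode(df["customer_id"], prefix="customer_id")
print(encoded.shape)  # (50000, 16)
```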
Topic: ML Solution Monitoring, Maintenance, and Security
A team runs Amazon SageMaker Pipelines executions that (1) read features from an Amazon RDS database and (2) call a third-party labeling API during a processing step. The team must manage these credentials securely and support periodic rotation.
Which statement is INCORRECT?
Options:
A. Use IAM (and the secret’s KMS key policy) to restrict who can read secrets, and rely on CloudTrail to audit secret access.
B. Configure Secrets Manager rotation (for example, with a Lambda rotation function) so applications retrieve the current secret value without code changes.
C. Embed the third-party API token in the training container image or source repository to keep deployments simple.
D. Store the RDS credentials in AWS Secrets Manager and allow the pipeline’s execution role to call secretsmanager:GetSecretValue at runtime.
Best answer: C
Explanation: The unsafe statement is the one that recommends embedding an API token in a container image or source repository. Secrets should be stored in a managed secrets service (such as AWS Secrets Manager) and retrieved at runtime using least-privilege IAM permissions. This supports rotation, reduces accidental exposure, and improves auditability.
For ML workflows, credentials for data sources (for example, RDS) and third-party integrations (for example, labeling APIs) should be kept out of code, images, and configuration files that are broadly readable. AWS Secrets Manager (or an equivalent managed store) is designed for storing secrets encrypted with AWS KMS, controlling access with IAM, auditing access with AWS CloudTrail, and supporting automated rotation.
A secure pattern is:
1. Store the RDS credentials and the third-party API token in AWS Secrets Manager, encrypted with a KMS key.
2. Grant the pipeline's execution role least-privilege secretsmanager:GetSecretValue access to only those secrets, and audit retrievals with CloudTrail.
3. Configure rotation (for example, with a Lambda rotation function) so steps always retrieve the current value at runtime without code changes.
The key takeaway is to avoid hardcoding credentials anywhere they could be copied, scanned, or baked into immutable artifacts.
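A minimal runtime-retrieval sketch with boto3 is shown below. The secret names and JSON key layout are hypothetical, and the calling role is assumed to hold secretsmanager:GetSecretValue (plus kms:Decrypt on the secret's key) for just these secrets.

```python
import json
import boto3

def get_secret(secret_id: str) -> dict:
    """Fetch the current secret value at runtime; rotation is transparent
    because callers always read the latest version."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

# Hypothetical secret names and fields.
rds_creds = get_secret("ml/feature-db/rds")
labeling_token = get_secret("ml/labeling-api/token")["api_token"]

dsn = (
    f"postgresql://{rds_creds['username']}:{rds_creds['password']}"
    f"@{rds_creds['host']}:{rds_creds['port']}/{rds_creds['dbname']}"
)
```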
Topic: ML Model Development
A binary classifier deployed on Amazon SageMaker returns a fraud probability score from 0 to 1. The business will automatically block orders when the score exceeds an operating threshold.
Blocking a legitimate order (false positive) costs USD 200 in lost margin. Allowing a fraudulent order (false negative) costs USD 20. Which threshold choice best fits these constraints?
Options:
A. Increase regularization strength to reduce overfitting
B. Increase the threshold to favor higher precision
C. Use SageMaker Model Monitor to detect drift and keep a 0.5 threshold
D. Decrease the threshold to favor higher recall
Best answer: B
Explanation: An operating (decision) threshold converts a model’s probability score into a class label. When false positives are much more expensive than false negatives, the business objective is to reduce false positives, which is achieved by raising the threshold. This increases precision while typically decreasing recall.
The operating (decision) threshold is the probability cutoff used to map a classifier’s score to a positive/negative prediction. Changing the threshold does not change the trained model; it changes the precision–recall tradeoff.
Given the stated costs, false positives are far more expensive than false negatives, so the deployment should be tuned to avoid predicting “fraud” unless the model is very confident.
Model-quality techniques (regularization, tuning) and operational tools (drift monitoring) are important, but they do not directly implement the business-driven precision vs. recall choice at inference time.
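To make the trade-off concrete, the sketch below sweeps candidate thresholds on synthetic validation scores and picks the one that minimizes the stated dollar costs. The data and the 0.05–0.95 grid are illustrative assumptions, not part of the question.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic validation scores and labels (1 = fraud), for illustration only.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=5000)
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.2, size=5000), 0, 1)

COST_FP = 200  # blocking a legitimate order
COST_FN = 20   # letting a fraudulent order through

def expected_cost(threshold: float) -> float:
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"Lowest-cost operating threshold: {best:.2f}")
# With false positives costing 10x more, the minimizing threshold sits
# above 0.5, trading recall for precision.
```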
Topic: Data Preparation for Machine Learning (ML)
A data science team is preparing a binary classification dataset in Amazon SageMaker. They need an AWS-native way to identify potential bias in the training data and to produce high-level bias metrics (for example, comparing outcomes across sensitive groups) before training a model.
Which SageMaker capability is designed for this purpose?
Options:
A. SageMaker Debugger
B. SageMaker Data Wrangler
C. SageMaker Clarify
D. SageMaker Model Monitor
Best answer: C
Explanation: Amazon SageMaker Clarify provides bias detection for ML datasets and models. It can compute bias-related metrics by comparing outcomes across sensitive attributes, helping teams assess potential bias in training data before model development.
SageMaker Clarify is the SageMaker feature specifically intended to help identify potential bias and explainability concerns. For bias, it analyzes a dataset (such as a training set) by using a chosen sensitive attribute and label/outcome to compute bias metrics that summarize how outcomes differ across groups. This makes it suitable for a pre-training data preparation step to evaluate bias at a high level and decide whether additional data collection, rebalancing, or feature changes are needed.
Other SageMaker features focus on data transformation, monitoring production drift/quality, or debugging training behavior rather than quantifying dataset bias metrics.
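As a sketch of how such a pre-training bias check might look with the SageMaker Python SDK, assuming hypothetical S3 paths, an execution role, and column names (age_group as the sensitive facet, approved as the binary label):

```python
from sagemaker import Session, clarify

session = Session()
ROLE = "arn:aws:iam::123456789012:role/sm-execution"  # hypothetical role

processor = clarify.SageMakerClarifyProcessor(
    role=ROLE,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://ml-data/train/train.csv",
    s3_output_path="s3://ml-data/clarify/bias-report/",
    label="approved",                               # outcome column
    headers=["age_group", "income", "tenure", "approved"],
    dataset_type="text/csv",
)

bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],                  # the favorable outcome
    facet_name="age_group",                         # the sensitive attribute
)

# Pre-training bias metrics are computed on the training data itself,
# before any model exists; the report lands at the S3 output path.
processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)
```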
Topic: Deployment and Orchestration of ML Workflows
A team must implement CI/CD for an Amazon SageMaker-based ML project. When code changes, the workflow must automatically build artifacts, run a repeatable training pipeline, register the resulting model, and deploy the approved model to an endpoint with minimal manual steps.
Which actions should the team take to meet these requirements? (Select THREE.)
Options:
A. Use AWS Glue interactive sessions to deploy models directly to endpoints
B. Use Amazon SageMaker Pipelines to orchestrate processing, training, and evaluation steps
C. Use SageMaker Model Registry approval to trigger deployment to a SageMaker endpoint
D. Send Amazon SNS notifications so engineers start each step manually
E. Run training from SageMaker Studio notebooks whenever code changes
F. Use AWS CodePipeline and CodeBuild to build/push images to Amazon ECR
Correct answers: B, C and F
Explanation: Use CI/CD orchestration to automatically build artifacts, execute a managed ML workflow, and promote models into deployment. CodePipeline/CodeBuild automate build and packaging, SageMaker Pipelines automates the train/evaluate workflow, and SageMaker Model Registry enables approval-based promotion and deployment. Together, these minimize manual steps while keeping deployments repeatable and auditable.
The goal is an automated, repeatable path from code commit to deployed model. In AWS, a common pattern is to use CI/CD tooling for build automation, SageMaker-native workflow orchestration for training, and a registry/approval mechanism for controlled promotion to deployment.
A high-level automated flow is:
1. A commit triggers AWS CodePipeline, and CodeBuild builds the training/inference images and pushes them to Amazon ECR.
2. The pipeline starts a SageMaker Pipelines execution that runs processing, training, and evaluation, then registers the candidate in SageMaker Model Registry.
3. Approving the registered model package triggers the deployment stage that updates the SageMaker endpoint.
This achieves end-to-end orchestration without relying on humans to kick off each stage.
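For example, the deployment stage might look up the most recently Approved package and create a SageMaker Model from it. This is a hedged boto3 sketch; the group name, role, and naming scheme are assumptions.

```python
import boto3

sm = boto3.client("sagemaker")
GROUP = "fraud-clf"  # hypothetical model package group

# Promote only the most recently Approved version from the registry.
latest = sm.list_model_packages(
    ModelPackageGroupName=GROUP,
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ModelPackageSummaryList"][0]

model_name = f"{GROUP}-v{latest['ModelPackageVersion']}"
sm.create_model(
    ModelName=model_name,
    Containers=[{"ModelPackageName": latest["ModelPackageArn"]}],
    ExecutionRoleArn="arn:aws:iam::123456789012:role/mlops-deploy",
)
# ...followed by create_endpoint_config / update_endpoint, as in the
# blue/green release sketch earlier on this page.
```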
Topic: Deployment and Orchestration of ML Workflows
A team uses Amazon SageMaker Pipelines to retrain a fraud model using new Parquet files (about 50 GB/day) in Amazon S3 and to register approved models in SageMaker Model Registry. Requirements:
- Run the retraining pipeline on a nightly schedule (1:00 AM UTC).
- Also start a run whenever new labeled data arrives under the designated S3 prefix.
- Avoid always-on compute, and keep triggers auditable with least-privilege IAM.
Which approach BEST meets these requirements?
Options:
A. Run a continuously running EC2 cron job that starts the pipeline
B. Use an S3 lifecycle rule to invoke the pipeline nightly
C. Use a CodeCommit webhook as the only trigger for pipeline runs
D. Use EventBridge schedule and S3 events to start the pipeline
Best answer: D
Explanation: Use Amazon EventBridge rules to trigger SageMaker Pipelines for both scheduled and event-driven runs. An EventBridge schedule can start a nightly execution, and an EventBridge S3 object-created event can start execution when new labeled data arrives. This is serverless, auditable, and works with least-privilege IAM roles.
For high-level pipeline automation on AWS, Amazon EventBridge is the standard service to define both time-based schedules and event-driven triggers. In this scenario, you can create two EventBridge rules: a cron rule for 1:00 AM UTC and an event pattern rule that matches S3 Object Created events for the labeled-data prefix. Each rule targets either SageMaker directly (where supported) or a small Lambda/Step Functions target that calls sagemaker:StartPipelineExecution using a least-privilege IAM role.
This approach is fully managed (no always-on compute), produces audit trails via CloudTrail/EventBridge/Lambda logs, and cleanly supports both trigger types for the same SageMaker Pipeline.
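A hedged boto3 sketch of the two rules follows. The pipeline ARN, bucket, prefix, and role are hypothetical, and the bucket is assumed to have EventBridge notifications enabled.

```python
import json
import boto3

events = boto3.client("events")

PIPELINE_ARN = "arn:aws:sagemaker:us-east-1:123456789012:pipeline/fraud-retrain"
TRIGGER_ROLE = "arn:aws:iam::123456789012:role/eventbridge-start-pipeline"  # least privilege

# Rule 1: nightly schedule at 1:00 AM UTC.
events.put_rule(Name="fraud-retrain-nightly", ScheduleExpression="cron(0 1 * * ? *)")

# Rule 2: new labeled objects under the prefix (bucket must send events to EventBridge).
events.put_rule(
    Name="fraud-retrain-new-labels",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": ["ml-data"]},
            "object": {"key": [{"prefix": "labeled/"}]},
        },
    }),
)

# Both rules target the pipeline directly; EventBridge assumes TRIGGER_ROLE,
# which allows only sagemaker:StartPipelineExecution on this pipeline.
target = {
    "Id": "start-pipeline",
    "Arn": PIPELINE_ARN,
    "RoleArn": TRIGGER_ROLE,
    "SageMakerPipelineParameters": {"PipelineParameterList": []},
}
for rule in ("fraud-retrain-nightly", "fraud-retrain-new-labels"):
    events.put_targets(Rule=rule, Targets=[target])
```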
Use the AWS MLA-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Try AWS MLA-C01 on Web View AWS MLA-C01 Practice Test
Read the AWS MLA-C01 Cheat Sheet on Tech Exam Lexicon for concept review before another timed run.