Free AWS SOA-C03 Full-Length Practice Exam: 65 Questions

Try 65 free AWS SOA-C03 questions across the exam domains, with explanations, then continue with full IT Mastery practice.

This free full-length AWS SOA-C03 practice exam includes 65 original IT Mastery questions across the exam domains.

These questions are for self-assessment. They are not official exam questions and do not imply affiliation with the exam sponsor.

Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.

Need concept review first? Read the AWS SOA-C03 Cheat Sheet on Tech Exam Lexicon, then return here for timed mocks and full IT Mastery practice.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Try AWS SOA-C03 on Web, or view the full AWS SOA-C03 practice page.

Exam snapshot

  • Exam route: AWS SOA-C03
  • Practice-set question count: 65
  • Time limit: 130 minutes
  • Practice style: mixed-domain diagnostic run with answer explanations

Full-length exam mix

  • Monitoring, Logging, Analysis, Remediation, and Performance Optimization: 22%
  • Reliability and Business Continuity: 22%
  • Deployment, Provisioning, and Automation: 22%
  • Security and Compliance: 16%
  • Networking and Content Delivery: 18%

Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and full practice.

Practice questions

Questions 1-25

Question 1

Topic: Deployment, Provisioning, and Automation

An operations team must deploy and keep updated a standardized “baseline” CloudFormation template (IAM role, SSM documents, and CloudWatch alarms) across 60 AWS accounts in an AWS Organizations organization and in three AWS Regions. The team wants a repeatable process with minimal manual effort and low risk of configuration drift.

Which action should the team NOT take?

Options:

  • A. Use a CloudFormation StackSet with service-managed permissions targeting an OU in AWS Organizations

  • B. Create and update separate CloudFormation stacks manually in each account and Region

  • C. Use StackSet operation preferences (concurrency and failure tolerance) to control rollout blast radius

  • D. Configure StackSet automatic deployment so new accounts in the OU receive the baseline automatically

Best answer: B

Explanation: CloudFormation StackSets are designed to centrally deploy and update the same template across multiple accounts and Regions in a consistent, auditable way. Using AWS Organizations integration, automatic deployment, and operation preferences supports standardized rollout at scale. Manually creating individual stacks reintroduces inconsistency and operational risk.

The core benefit of CloudFormation StackSets is centralized, standardized deployment of the same CloudFormation template to many target accounts and Regions, with controlled rollouts and consistent updates. In an AWS Organizations environment, service-managed permissions and OU targeting reduce manual coordination and help ensure new accounts receive required baseline resources automatically.

Manually creating separate stacks in each account/Region is an operations anti-pattern because it:

  • Increases the chance of inconsistent parameters and missed updates
  • Creates configuration drift that is hard to detect and remediate
  • Requires repetitive, high-risk manual change execution

Key takeaway: use StackSets features (Organizations targeting, automatic deployment, operation preferences) to keep multi-account/multi-Region baselines consistent and maintainable.

  • Organizations targeting is appropriate for deploying the same baseline to many accounts under centralized control.
  • Automatic deployment is appropriate to ensure newly created accounts in the OU are brought into compliance without manual action.
  • Controlled rollout using operation preferences is appropriate to limit blast radius and handle failures predictably.

Question 2

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An operations team wants to standardize automated remediation by routing CloudWatch alarm state changes through Amazon EventBridge to trigger a Lambda function.

Exhibit: EventBridge event received

1 {
2  "source": "aws.cloudwatch",
3  "detail-type": "CloudWatch Alarm State Change",
4  "detail": {
5    "alarmName": "HighCPU-WebASG",
6    "state": { "value": "ALARM" }
7  }
8 }

Which EventBridge rule configuration is the best next step to trigger the Lambda function only when this alarm enters the ALARM state?

Options:

  • A. Match aws.sns + SNS Topic Notification; filter topic and ALARM

  • B. Match aws.cloudwatch + CloudWatch Alarm State Change; filter alarmName and ALARM

  • C. Add an SNS action to the alarm and subscribe the Lambda function

  • D. Match CloudWatch Alarm Configuration Change; filter alarmName and ALARM

Best answer: B

Explanation: The exhibit shows a CloudWatch alarm state change event on EventBridge, including the event’s source, detail-type, and the alarm’s name and state. To route only this alarm’s ALARM transitions to automation, the rule must match source = aws.cloudwatch (line 2), detail-type = CloudWatch Alarm State Change (line 3), and filter alarmName (line 5) and state.value = ALARM (line 6).

CloudWatch alarm state changes are emitted to EventBridge as events, and EventBridge rules route those events to automation targets like Lambda or Systems Manager. From the exhibit, you build an EventBridge rule event pattern that matches the event envelope (source and detail-type) and then narrows to the specific alarm and state.

  • Use source from line 2: aws.cloudwatch
  • Use detail-type from line 3: CloudWatch Alarm State Change
  • Filter detail.alarmName from line 5: HighCPU-WebASG
  • Filter detail.state.value from line 6: ALARM

This ensures the Lambda target runs only on ALARM transitions for that specific alarm, not on OK or other alarms.
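
As a rough illustration only (not exam content), here is a minimal boto3 sketch of that rule; the rule name, target ID, and Lambda ARN are made-up placeholders, and the Lambda function would still need a resource-based permission allowing EventBridge to invoke it:

import json
import boto3

events = boto3.client("events")

# Event pattern assembled from the exhibit fields: source (line 2),
# detail-type (line 3), alarmName (line 5), and state.value (line 6).
pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["HighCPU-WebASG"],
        "state": {"value": ["ALARM"]},
    },
}

events.put_rule(
    Name="highcpu-webasg-alarm",          # placeholder rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)
events.put_targets(
    Rule="highcpu-webasg-alarm",
    Targets=[{
        "Id": "remediation-fn",           # placeholder target ID
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:remediate-highcpu",
    }],
)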

  • Wrong event source/type patterns like SNS won’t match because the exhibit source/detail-type are CloudWatch (lines 2–3).
  • Bypassing EventBridge with an alarm SNS action can work operationally, but it doesn’t follow the stated requirement to route through EventBridge.
  • Configuration-change events don’t match because the exhibit is a state change (CloudWatch Alarm State Change, line 3), not an alarm configuration update.

Question 3

Topic: Reliability and Business Continuity

A company hosts a public web application at app.example.com. The application runs behind an Application Load Balancer (ALB) in us-east-1 (primary) and an identical ALB in us-west-2 (standby). The requirement is DNS failover: Route 53 must return the standby endpoint only when the primary endpoint is unhealthy, and traffic must not be split across both endpoints during normal operation.

Which TWO actions should a CloudOps engineer AVOID when configuring Route 53 health checks and failover routing? (Select TWO.)

Options:

  • A. Use simple routing with two records and health checks.

  • B. Health check app.example.com and attach to the primary record.

  • C. Set the DNS TTL to 30 seconds.

  • D. Use a CloudWatch-alarm health check for the primary ALB.

  • E. Create failover alias records; enable Evaluate Target Health.

  • F. Health check the primary ALB DNS name /health.

Correct answers: A and B

Explanation: Failover routing requires Route 53 to evaluate the health of the primary endpoint and to only return the secondary record when the primary is unhealthy. A health check that targets the same DNS name being failed over can’t reliably distinguish primary from secondary. Using simple routing with multiple healthy records can also cause normal-operation traffic splitting instead of true active-passive failover.

The core concept is that Route 53 failover routing makes a DNS decision based on whether the primary record is considered healthy (via an attached Route 53 health check or alias target health evaluation). To meet an active-passive requirement, the health signal must unambiguously represent the primary endpoint only, and the routing policy must enforce primary-first behavior.

When configuring this:

  • Use a failover routing policy with distinct PRIMARY and SECONDARY records.
  • Point health evaluation at the primary endpoint (for example, the primary ALB DNS name and a specific path), or use an alarm-based health check that reflects primary availability.
  • Avoid checks that reference the same record name you are failing over, because the check can end up following DNS and testing the currently returned target rather than the intended primary.

The key takeaway is that both the routing policy and the health check target must align to “primary-only until unhealthy.”

  • Checking the same record name is unsafe because the health check can follow DNS and lose primary vs. secondary specificity.
  • Simple routing is acceptable for “return any healthy endpoints,” but it does not enforce active-passive behavior.
  • Endpoint-specific checks (such as checking the primary ALB DNS name/path) are appropriate because they directly measure primary health.
  • Low TTL and alarm-based checks are compatible with failover and help reduce caching and improve signal quality.

Question 4

Topic: Networking and Content Delivery

An ECS service runs in private subnets across two Availability Zones in one VPC. The tasks must call a third-party telemetry API that is offered either as a public HTTPS endpoint on the internet or as an AWS PrivateLink endpoint service in the same Region.

Security requires that the tasks remain in private subnets with no public IPs and that traffic to the telemetry API must not traverse the public internet.

The service sends 1,200 GB per month to the telemetry API. Assume 730 hours per month.

AWS costs (USD):

  • NAT gateway: USD 0.045 per hour per NAT gateway + USD 0.045 per GB processed
  • PrivateLink interface endpoint: USD 0.010 per hour per AZ endpoint + USD 0.010 per GB processed

Which option meets the security requirement and results in the lowest monthly AWS cost? Round to the nearest dollar.

Options:

  • A. Use two NAT gateways (one per AZ) to reach the vendor public HTTPS endpoint

  • B. Assign public IPs to the tasks and restrict access with security groups

  • C. Create a gateway VPC endpoint and route telemetry traffic through it

  • D. Create a PrivateLink interface VPC endpoint in each AZ to the vendor endpoint service

Best answer: D

Explanation: AWS PrivateLink uses interface VPC endpoints to reach a service privately without sending traffic over the public internet. With the provided rates, two interface endpoints plus 1,200 GB processed is significantly cheaper per month than running two NAT gateways to access a public endpoint. This meets the “no public internet” requirement and minimizes monthly connectivity cost.

To keep the ECS tasks private and avoid traversing the public internet, use AWS PrivateLink (an interface VPC endpoint) to connect to the vendor’s endpoint service.

Compute monthly AWS connectivity cost using the provided hourly and per-GB charges (730 hours/month, 1,200 GB/month):

  • NAT (2 AZs, 2 NAT gateways): hourly cost is 2 × USD 0.045 × 730 = USD 65.70, plus data cost of 1,200 × USD 0.045 = USD 54.00, for a total of USD 119.70 ≈ USD 120.
  • PrivateLink (2 AZ interface endpoints): hourly cost is 2 × USD 0.010 × 730 = USD 14.60, plus data cost of 1,200 × USD 0.010 = USD 12.00, for a total of USD 26.60 ≈ USD 27.

PrivateLink both satisfies the security constraint and is lower cost under these inputs.
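
As a quick sanity check on the arithmetic (illustrative only, using the rates given in the question):

hours, gb = 730, 1_200

nat = 2 * 0.045 * hours + 0.045 * gb   # two NAT gateways + per-GB processing
pl = 2 * 0.010 * hours + 0.010 * gb    # two interface endpoints + per-GB processing

print(round(nat, 2), round(nat))   # 119.7 120
print(round(pl, 2), round(pl))     # 26.6 27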

  • NAT to a public endpoint uses a public destination and is more expensive given the provided per-hour and per-GB rates.
  • A gateway endpoint applies only to supported AWS services (for example, Amazon S3 and DynamoDB), not to a third-party vendor API.
  • Assigning public IPs to tasks violates the requirement to keep workloads private and avoid public internet traversal.

Question 5

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A company runs a production application on an Amazon EKS cluster (managed node groups) across multiple Availability Zones. Operations must meet these requirements:

  • Collect container CPU/memory metrics and pod/node metrics in Amazon CloudWatch.
  • Collect each container’s stdout/stderr logs in CloudWatch Logs.
  • Do not modify application code or rebuild container images.
  • Use least-privilege access from pods to AWS APIs and keep ongoing operational overhead low.

Which action is the BEST way to configure monitoring and logging?

Options:

  • A. Install the CloudWatch agent inside each application container and push logs directly to CloudWatch Logs

  • B. Enable VPC Flow Logs for the cluster subnets to capture container metrics and application logs

  • C. Configure AWS CloudTrail data events for EKS to send pod stdout/stderr to CloudWatch Logs

  • D. Enable EKS Container Insights and deploy the CloudWatch agent and Fluent Bit as DaemonSets using IRSA for permissions

Best answer: D

Explanation: For EKS, the standard operational approach is to use Container Insights with the CloudWatch agent for metrics and Fluent Bit for log forwarding. Running them as DaemonSets captures node/pod/container telemetry cluster-wide without changing application images. Using IAM Roles for Service Accounts (IRSA) meets the least-privilege requirement with low ongoing operational overhead.

In EKS, container metrics and container stdout/stderr logs are typically collected by deploying cluster-level agents rather than modifying application workloads. Enabling Container Insights and running the CloudWatch agent (metrics) plus Fluent Bit (logs) as DaemonSets provides coverage for all nodes and pods across Availability Zones. To meet security/compliance requirements, attach only the required CloudWatch/CloudWatch Logs permissions to a Kubernetes service account and use IRSA so pods assume that role without node-wide credentials.

This approach satisfies the requirements: no code/image changes, centralized CloudWatch metrics and logs, least-privilege access, and low operational overhead compared with per-application instrumentation.

  • Per-container install violates the “no rebuild image” and low-ops requirements because every workload must be modified and maintained.
  • CloudTrail misuse violates the monitoring requirement because CloudTrail records AWS API activity, not container stdout/stderr or resource metrics.
  • Flow Logs mismatch violates the logs/metrics requirements because VPC Flow Logs capture network flow metadata, not pod/container metrics or application logs.

Question 6

Topic: Deployment, Provisioning, and Automation

In AWS CloudFormation, select TWO true statements that help remediate common deployment failures caused by parameter misconfiguration, resource dependency ordering, or Region constraints.

Options:

  • A. Changing a parameter value never requires a stack update.

  • B. Enabling DisableRollback prevents failures from dependency misordering.

  • C. Using an Availability Zone name like us-east-1a is portable everywhere.

  • D. Ref/GetAtt usually create implicit dependencies; use DependsOn only if needed.

  • E. Hard-coded AMI IDs can fail cross-Region; use a Region-resolved mapping/parameter.

  • F. CreationPolicy is the standard way to fix resource ordering issues.

Correct answers: D and E

Explanation: CloudFormation failures commonly occur when templates assume global values that are actually Region-specific (such as AMI IDs), or when resources must be created in an order CloudFormation cannot infer. CloudFormation automatically orders many resources when one references another, and you should only add explicit dependencies when there is no reference-driven dependency.

OK: CloudFormation automatically infers dependency ordering when a resource uses Ref or Fn::GetAtt to reference another resource, so deployment issues from ordering are often solved by adding the missing reference or (only when no reference is possible) adding DependsOn.

OK: Many identifiers are Region-scoped, especially AMI IDs. A template that hard-codes an AMI ID might work in one Region and fail in another; remediate by resolving the AMI ID per Region (for example, AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> or a Mappings lookup).

Key takeaway: fix ordering with inferred/explicit dependencies, and fix portability by removing Region-specific hard-coding.
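
For the Region-portability point, a minimal boto3 sketch (illustrative; the public SSM parameter path shown is the commonly documented Amazon Linux 2 path, so verify it for your OS and architecture):

import boto3

def latest_al2_ami(region: str) -> str:
    # Resolve the current Amazon Linux 2 AMI ID for the given Region.
    ssm = boto3.client("ssm", region_name=region)
    resp = ssm.get_parameter(
        Name="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2"
    )
    return resp["Parameter"]["Value"]

print(latest_al2_ami("us-east-1"))
print(latest_al2_ami("eu-west-1"))  # a different ID, because AMIs are Region-scoped

Inside a template, the equivalent is a parameter of type AWS::SSM::Parameter::Value<AWS::EC2::Image::Id> that resolves the AMI per Region, or a Mappings lookup keyed by Region.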

  • NO: “No stack update needed” is false; parameter value changes take effect only through a stack update (change set or direct update).
  • NO: “AZ names are portable” is false; AZ suffixes (like 1a) map differently across accounts, so hard-coding can break.
  • NO: “CreationPolicy fixes ordering” is false; CreationPolicy waits for success signals and doesn’t create dependencies by itself.
  • NO: “DisableRollback prevents ordering failures” is false; it only preserves failed resources for troubleshooting and doesn’t remediate the failure.

Question 7

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An Amazon EventBridge rule sends events to a Lambda function target. During brief downstream outages, some invocations fail and operations needs a reliable way to capture the failed events for later reprocessing while EventBridge automatically retries transient failures.

Which action best aligns with the Well-Architected reliability/operational excellence principle for improving operational reliability?

Options:

  • A. Increase the Lambda function memory to reduce execution timeouts

  • B. Enable AWS X-Ray tracing on the Lambda function for faster root cause analysis

  • C. Create a CloudWatch alarm on Lambda errors and notify an on-call SNS topic

  • D. Configure the EventBridge target with a retry policy and an SQS dead-letter queue

Best answer: D

Explanation: Configure the EventBridge target with both a retry policy and a dead-letter queue to design for failure and avoid silent event loss. Retries handle transient issues automatically, and the DLQ preserves events that still fail so operators can replay them. This applies the Well-Architected reliability and operational excellence principle of building resilient, recoverable operations.

Event-driven operations should assume transient failures (throttling, timeouts, downstream outages) and use built-in mechanisms to prevent data loss. For EventBridge targets, a retry policy provides automated retries for delivery failures, and a dead-letter queue (DLQ) stores events that could not be delivered successfully after retries.
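
A minimal boto3 sketch of that target configuration (illustrative; the rule name, function ARN, and queue ARN are placeholders, and the SQS queue policy must allow EventBridge to send messages to it):

import boto3

events = boto3.client("events")

events.put_targets(
    Rule="orders-to-lambda",                      # placeholder rule name
    Targets=[{
        "Id": "lambda-target",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:process-orders",
        "RetryPolicy": {
            "MaximumRetryAttempts": 8,            # automatic retries for transient failures
            "MaximumEventAgeInSeconds": 3600,     # stop retrying after one hour
        },
        "DeadLetterConfig": {
            "Arn": "arn:aws:sqs:us-east-1:111122223333:orders-dlq",
        },
    }],
)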

This supports reliability/operational excellence by:

  • Automating recovery for transient failures (retries)
  • Preserving failed events for audit, troubleshooting, and replay (DLQ)
  • Reducing manual intervention and avoiding “dropped” events

Monitoring and tracing help detect or diagnose issues, but they do not by themselves ensure failed events are retained and recoverable.

  • More compute may reduce some failures, but it does not provide guaranteed capture of undelivered events.
  • Alarm/notification only improves detection, not automatic retry or durable retention for later reprocessing.
  • Tracing only improves diagnostics, not the reliability mechanism for handling delivery failures.

Question 8

Topic: Reliability and Business Continuity

You are reviewing an Application Load Balancer target group configuration after an outage. The team wants health checks to detect unhealthy instances accurately and remove them from routing quickly without causing unnecessary failovers.

Which THREE statements about target group health checks are INCORRECT?

Options:

  • A. Configure the health check matcher to the expected success codes (for example 200–399).

  • B. Keep the health check endpoint lightweight to avoid failing due to noncritical downstream dependencies.

  • C. Targets do not need to allow inbound from the load balancer nodes for health checks; only client traffic must be allowed.

  • D. A health check should exercise the full user workflow and fail if any backend (DB/third-party API) is unavailable.

  • E. Set the unhealthy threshold to 1 and the interval to the minimum to ensure fastest failover in all cases.

  • F. Set the health check port to traffic-port so checks follow the target port.

Correct answers: C, D and E

Explanation: Load balancer health checks should accurately measure target readiness and remove unhealthy targets without inducing churn. Incorrect health checks commonly fail by being too aggressive (causing flapping), by not permitting health check traffic to reach targets, or by coupling checks to full dependency chains that create cascading failures. Correct configuration balances sensitivity with stability.

Target groups use health checks to decide when to stop routing traffic to a target and when to return it to service. To trigger reliable failover, the health check must (1) be able to reach the target, (2) validate the right signal of instance readiness, and (3) use thresholds/intervals that avoid rapid flapping.

Corrections to the incorrect statements:

  • Do not assume minimum interval and an unhealthy threshold of 1 is always best; tune values to reduce false positives during brief latency spikes, deploys, or GC pauses.
  • Ensure targets allow inbound health check traffic from the load balancer to the health check port/path; otherwise every target can appear unhealthy.
  • Avoid “full workflow” checks that fail due to transient or noncritical dependency issues; prefer a lightweight readiness endpoint that reflects whether the instance can serve requests safely.

The key takeaway is that health checks should be reachable, representative, and stable enough to avoid unnecessary failover.
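
A boto3 sketch of a health check configuration along those lines (illustrative; the target group ARN, path, and threshold values are placeholders to tune for your workload):

import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/0123456789abcdef",
    HealthCheckProtocol="HTTP",
    HealthCheckPort="traffic-port",      # follow the target's registered port
    HealthCheckPath="/healthz",          # lightweight readiness endpoint
    HealthCheckIntervalSeconds=15,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=3,           # more than 1, to avoid flapping on brief blips
    Matcher={"HttpCode": "200-399"},
)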

  • Using traffic-port helps prevent checking the wrong port after a listener or target port change.
  • Setting a matcher ensures the load balancer treats only the intended HTTP responses as healthy.
  • A lightweight readiness endpoint reduces the chance of removing healthy targets due to dependency noise.

Question 9

Topic: Networking and Content Delivery

A company uses AWS Global Accelerator with an Application Load Balancer (ALB) endpoint in us-east-1 and another ALB endpoint in us-west-2. The operations team wants Global Accelerator to stop sending traffic to a Region when its ALB becomes unhealthy.

Which statement correctly describes how Global Accelerator endpoints and health checks work?

Options:

  • A. Global Accelerator health checks run only from the client’s AWS Region.

  • B. Global Accelerator uses Route 53 health checks to detect unhealthy endpoints.

  • C. Health checks are set on an endpoint group and drive failover routing.

  • D. Global Accelerator health is determined only by ALB target group health.

Best answer: C

Explanation: Global Accelerator evaluates endpoint health by using health checks configured for each endpoint group. When an endpoint is unhealthy, Global Accelerator stops routing traffic to it and directs traffic to healthy endpoints (including in other Regions) to improve availability.

Global Accelerator is a global service that routes user traffic over the AWS global network to endpoints that you add in AWS Regions. Endpoints are organized into endpoint groups (typically one per Region), and the health check configuration is applied at the endpoint-group level (protocol/port/path for HTTP(S), or TCP).

Global Accelerator continuously checks endpoint health using that configuration and uses the results to decide where it can send traffic. If an endpoint becomes unhealthy, Global Accelerator removes it from routing and shifts traffic to other healthy endpoints (for example, the other Region’s ALB), improving availability without requiring DNS changes.

  • Route 53 coupling is incorrect because Global Accelerator has its own health checks and routing decisions.
  • Client-Region-only checks is incorrect because health is evaluated by the service, not by each client’s Region.
  • Inheriting target health is incorrect because Global Accelerator health checks are configured and evaluated independently of ALB target group health checks.

Question 10

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

Select THREE true statements about choosing Amazon EBS volume types based on performance and cost requirements.

Options:

  • A. io2 fits mission-critical, low-latency, high-IOPS workloads despite higher cost.

  • B. st1 is cost-effective for large, sequential throughput workloads.

  • C. st1 is recommended for boot volumes and small random I/O.

  • D. gp2 provides provisioned IOPS independently of volume size.

  • E. gp3 can provision IOPS and throughput independent of volume size.

  • F. sc1 is an SSD tier for low-latency transactional databases.

Correct answers: A, B and E

Explanation: EBS volume selection is primarily a tradeoff between SSD latency/IOPS and HDD throughput/cost. gp3 is the general-purpose SSD option with separately configurable performance, io2 targets the highest IOPS/lowest latency needs, and st1 is optimized for low-cost sequential throughput.

Pick the EBS volume type that matches the workload’s dominant I/O pattern and the budget. For most workloads needing balanced SSD performance at a predictable price, gp3 is a common default because it lets you set IOPS and throughput without increasing volume size. For latency-sensitive or IOPS-intensive, mission-critical workloads (often databases) where you can pay more for consistently high performance, io2 is the better fit. For large, sequential streaming I/O where throughput matters more than low latency—such as big data processing or log processing—st1 provides lower-cost HDD throughput, but it is not suited for small random I/O.

Key takeaway: SSD types (gp3/io2) are for low-latency random I/O; HDD types (st1/sc1) are for sequential throughput at lower cost.
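
As an illustration of the gp3 point (values are placeholders within gp3's documented limits):

import boto3

ec2 = boto3.client("ec2")

ec2.create_volume(
    AvailabilityZone="us-east-1a",
    VolumeType="gp3",
    Size=100,          # GiB
    Iops=6000,         # provisioned independently of volume size
    Throughput=500,    # MiB/s, also independent of volume size
)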

  • OK: The statement about gp3 provisioning IOPS/throughput independent of size is accurate.
  • OK: The statement describing io2 for mission-critical, low-latency, high-IOPS needs is accurate.
  • OK: The statement describing st1 as cost-effective for large sequential throughput is accurate.
  • NO: Claims that gp2 offers provisioned IOPS independent of size, that st1 is for boot/small random I/O, or that sc1 is an SSD transactional tier are incorrect for these volume types.

Question 11

Topic: Reliability and Business Continuity

An online ordering API runs as an Amazon ECS service on AWS Fargate behind an Application Load Balancer (ALB). The service is set to a fixed desired count of 4 tasks. During lunch traffic spikes, ALB TargetResponseTime and 5xx errors increase, and the team manually scales tasks.

The operations goal is to keep performance stable during spikes while reducing off-peak cost and minimizing ongoing operational effort. Which change is the best fit?

Options:

  • A. Increase the task CPU and memory reservations and keep desired count at 4

  • B. Configure ECS Service Auto Scaling with a target tracking policy and min/max task counts

  • C. Move the service to an EC2 capacity provider that uses Spot Instances

  • D. Enable ALB slow start and increase deregistration delay on the target group

Best answer: B

Explanation: ECS Service Auto Scaling with target tracking is purpose-built to maintain application performance during variable load by automatically scaling task count based on a metric. It also reduces cost by scaling in when demand drops and removes the need for manual interventions. Setting sensible minimum and maximum task counts preserves baseline availability and controls spend.

To maintain performance during traffic spikes without overprovisioning, the ECS service should use Application Auto Scaling. A target tracking scaling policy can scale the service out when load increases and scale it back in when load decreases, using a metric such as CPU utilization, memory utilization, or an ALB metric like RequestCountPerTarget.

Operationally, you would:

  • Register the ECS service as a scalable target.
  • Set minCapacity to maintain baseline availability and maxCapacity to cap cost.
  • Attach a target tracking policy to keep the chosen metric near a target value.

This improves reliability (fewer overload errors) and cost efficiency (no fixed peak-sized task count). The main tradeoff is that scaling is reactive and can lag sudden spikes, so picking the right metric and min capacity matters.
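
A minimal boto3 sketch of that setup (illustrative; the cluster/service names, capacity bounds, and target value are placeholders):

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "service/prod-cluster/orders-api"

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,     # baseline availability
    MaxCapacity=20,    # cost cap during spikes
)

aas.put_scaling_policy(
    PolicyName="orders-api-cpu-target",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 55.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)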

  • Bigger tasks only improve per-task capacity but usually increase off-peak cost and still require manual scaling when demand exceeds four tasks.
  • ALB slow start and deregistration-delay tuning can smooth deployments and reduce connection churn, but they do not add capacity when the service is overloaded.
  • Spot Instances can reduce cost, but they add interruption risk and operational complexity and do not directly address scaling behavior under bursty load.

Question 12

Topic: Deployment, Provisioning, and Automation

You are troubleshooting an IaC deployment that fails with AccessDenied during resource provisioning. Which THREE statements are true about diagnosing the IAM permission issue and remediating it with least privilege?

Options:

  • A. Fix EC2 provisioning by allowing ec2:* on * resources.

  • B. VPC Flow Logs pinpoint missing IAM actions causing AccessDenied errors.

  • C. An explicit Deny (including SCPs) overrides any Allow.

  • D. CloudTrail AccessDenied events identify the denied API action and principal.

  • E. Temporarily attaching AdministratorAccess is a least-privilege remediation.

  • F. Least-privilege iam:PassRole can be scoped with iam:PassedToService.

Correct answers: C, D and F

Explanation: To diagnose provisioning failures caused by IAM, identify the exact denied API call and the principal that made it, then adjust permissions narrowly for that action and resource. CloudTrail provides the authoritative record of the failing API call, and least-privilege fixes commonly include tightly scoped iam:PassRole when a service must use a role. Also, any explicit deny in the evaluation chain must be removed or narrowed before an allow can take effect.

The core workflow for IAM provisioning failures is: find the denied API call, confirm which identity made it, and then remediate by granting only the required actions on the required resources (or by removing/narrowing an explicit deny).

CloudTrail is the best starting point because it shows the exact API (eventName) that returned AccessDenied and the calling identity (user/role session). If the failure involves a service using a role (for example, passing an instance profile role or a service role), the caller typically needs iam:PassRole to that specific role ARN; least privilege is to also restrict passing via conditions like iam:PassedToService. Finally, remember policy evaluation: any matching explicit Deny (including from SCPs or permissions boundaries) blocks the request regardless of other Allow statements.
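
An illustrative least-privilege iam:PassRole statement, expressed as a Python dict (the role ARN and service principal are placeholders for whatever your deployment actually passes):

pass_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PassOnlyTheAppRoleToEc2",
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": "arn:aws:iam::111122223333:role/app-instance-role",
        "Condition": {
            "StringEquals": {"iam:PassedToService": "ec2.amazonaws.com"}
        },
    }],
}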

  • OK: CloudTrail AccessDenied records show the failing API and the calling principal for targeted fixes.
  • OK: Scoping iam:PassRole to the role ARN and adding iam:PassedToService limits what can be passed and to which service.
  • OK: An explicit deny in any applicable layer must be removed or narrowed; adding allows won’t override it.
  • NO: Administrator access or broad wildcards (like ec2:* on *) are not least privilege, and VPC Flow Logs are for network traffic, not IAM authorization.

Question 13

Topic: Networking and Content Delivery

An operations engineer sets up AWS Global Accelerator for a public web application that is deployed behind an Application Load Balancer in us-east-1 and another ALB in eu-west-1. The engineer creates one accelerator with two endpoint groups (one per Region) and configures health checks so Global Accelerator automatically removes an unhealthy endpoint group from routing and sends traffic only to healthy endpoints.

Which operations principle is this change most directly applying?

Options:

  • A. Automation and standardization

  • B. Shared responsibility

  • C. Blast-radius reduction

  • D. Least privilege

Best answer: C

Explanation: This change applies blast-radius reduction as a reliability practice. By using Global Accelerator endpoint groups with health checks, traffic is automatically shifted away from an unhealthy Regional endpoint to healthy endpoints, containing failures and reducing user impact. The configuration also improves availability without requiring manual DNS changes during an incident.

Blast-radius reduction is a reliability principle focused on limiting how much of your workload and customer traffic is affected when something fails. With AWS Global Accelerator, you place multiple endpoints (often in different Regions) behind a single set of static anycast IP addresses and define endpoint groups with health checks. If a Regional endpoint (such as an ALB) becomes unhealthy, Global Accelerator stops routing new traffic to it and routes to the remaining healthy endpoint groups, keeping the failure isolated.

Key operational outcomes here are automatic failover based on health checks and reducing the customer impact of a single-Region issue. The key takeaway is that health-checked, multi-endpoint routing primarily serves to contain failures, not to manage permissions or ownership boundaries.

  • Least privilege is about minimizing IAM permissions, not traffic failover behavior.
  • Shared responsibility describes who secures what in AWS; it is not the main goal of multi-Region health-based routing.
  • Automation and standardization can be a benefit, but the most direct principle demonstrated is limiting outage impact by shifting traffic away from unhealthy endpoints.

Question 14

Topic: Reliability and Business Continuity

A company runs a public REST API on two Application Load Balancers (ALBs): one in us-east-1 and one in eu-west-1. The company also runs an internal admin portal reachable only from a VPC (users connect by VPN). Both applications must remain available during a Regional failure.

Requirements:

  • For api.example.com, route users to the lowest-latency Region and automatically stop returning an unhealthy Region.
  • For admin.corp.example.com (private DNS), keep us-east-1 as the preferred Region and fail over to eu-west-1 only if us-east-1 is unhealthy. CloudWatch alarms already exist for each ALB target group’s HealthyHostCount.

Which TWO Route 53 configurations should a CloudOps engineer implement? (Select TWO.)

Options:

  • A. Enable CloudTrail data events for Route 53 to audit DNS queries and rely on that for failover

  • B. Create a single alias record for api.example.com to the us-east-1 ALB and set TTL to 5 seconds

  • C. Create weighted routing records for api.example.com with 50/50 weights and no health checks

  • D. Create multi-value answer records for api.example.com that return both ALB IP addresses

  • E. Create latency routing records for api.example.com to both ALBs and associate Route 53 health checks with each endpoint

  • F. Create a private hosted zone failover record set for admin.corp.example.com with primary us-east-1 and secondary eu-west-1, using CloudWatch-alarm-based Route 53 health checks

Correct answers: E and F

Explanation: Latency-based routing with Route 53 health checks meets the requirement to send users to the lowest-latency healthy API endpoint. For the internal admin name, failover routing enforces a preferred primary Region, and a CloudWatch-alarm-based health check can drive DNS failover even when the endpoint is only reachable inside the VPC.

Use Route 53 routing policies to control how DNS answers are returned when you have multiple endpoints.

For global users, latency-based routing chooses the endpoint with the lowest observed latency from Route 53’s network, and adding health checks ensures Route 53 stops returning records for an unhealthy endpoint (improving availability during a Regional impairment).

For an internal name that must stay in one Region unless it fails, failover routing provides active-passive behavior:

  • Create primary and secondary records for the same name.
  • Attach health checks so Route 53 can determine when to return the secondary.
  • For private/internal endpoints, use a Route 53 health check that references a CloudWatch alarm (for example, HealthyHostCount).

Weighted routing can shift traffic, but it does not inherently select the lowest-latency endpoint.
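
For the admin.corp.example.com portion, a boto3 sketch of the failover records (illustrative; the private hosted zone ID, ALB hosted zone IDs, ALB DNS names, and health check ID are placeholders):

import boto3

r53 = boto3.client("route53")

def failover_change(failover, alb_zone_id, alb_dns, health_check_id=None):
    record = {
        "Name": "admin.corp.example.com",
        "Type": "A",
        "SetIdentifier": f"admin-{failover.lower()}",
        "Failover": failover,                      # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,           # the ALB's canonical hosted zone ID
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id  # CloudWatch-alarm-based health check
    return {"Action": "UPSERT", "ResourceRecordSet": record}

r53.change_resource_record_sets(
    HostedZoneId="Z_PRIVATE_ZONE_ID",
    ChangeBatch={"Changes": [
        failover_change("PRIMARY", "Z_ALB_USE1", "internal-admin-use1.us-east-1.elb.amazonaws.com", "HC_ALARM_USE1"),
        failover_change("SECONDARY", "Z_ALB_EUW1", "internal-admin-euw1.eu-west-1.elb.amazonaws.com"),
    ]},
)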

  • OK: Latency routing with health checks satisfies “lowest latency” and removes unhealthy Regions from DNS answers.
  • OK: Failover routing in a private hosted zone enforces a preferred Region, and CloudWatch-alarm health checks can trigger failover for internal endpoints.
  • NO: Weighted 50/50 without health checks can still return an unhealthy endpoint.
  • NO: A single record (even with low TTL) cannot route around a Regional outage automatically.
  • NO: Multi-value answer is not a latency policy and ALB IPs are not stable targets to publish.
  • NO: CloudTrail/auditing does not provide DNS routing or automated failover behavior.

Question 15

Topic: Deployment, Provisioning, and Automation

An operations team runs an automation workflow whenever files in an S3 bucket change. Today, a scheduled Lambda function runs every minute, lists the prod-ingest bucket, and compares object keys to a DynamoDB table to detect new/updated objects and deletions. This causes high S3 ListObjectsV2 request costs and occasionally misses changes during traffic spikes.

The team needs near-real-time triggers for object create/overwrite and delete events, and the solution must handle bursts reliably with minimal operational effort. Which change best meets these requirements?

Options:

  • A. Configure S3 Event Notifications for ObjectCreated:* and ObjectRemoved:* to an SQS queue, and trigger the automation from a Lambda consumer

  • B. Enable CloudTrail S3 data events for the bucket and create an EventBridge rule to trigger the automation from CloudTrail events

  • C. Enable S3 Inventory reports and trigger the automation from the daily inventory manifest file

  • D. Reduce the schedule interval so the polling Lambda runs every 10 seconds instead of every minute

Best answer: A

Explanation: S3 Event Notifications provide near-real-time signals when objects are created/overwritten or removed, removing the need for continual bucket listings. Sending events to SQS adds a durable buffer so bursts don’t overwhelm the automation and events aren’t lost. The main tradeoff is at-least-once delivery, so the consumer should be idempotent.

The core optimization is to replace periodic polling (expensive and lossy during spikes) with event-driven automation. S3 Event Notifications can emit ObjectCreated:* for new objects and overwrites (an “update” in S3 is typically a new PUT of the same key) and ObjectRemoved:* for deletions. Delivering notifications to SQS improves reliability and operational stability by buffering bursts, enabling controlled concurrency and retries, and decoupling S3 from the processing runtime.

Tradeoff: S3 notifications are at-least-once and can be duplicated or arrive out of order, so the Lambda consumer should be idempotent (for example, key+version checks or conditional writes in DynamoDB). The key takeaway is that S3 event notifications plus a queue remove continuous LIST costs while improving timeliness and burst handling.
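
A boto3 sketch of the bucket notification configuration (illustrative; the queue ARN is a placeholder, and the SQS queue policy must first allow s3.amazonaws.com to send messages for this bucket):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="prod-ingest",
    NotificationConfiguration={
        "QueueConfigurations": [{
            "Id": "ingest-object-changes",
            "QueueArn": "arn:aws:sqs:us-east-1:111122223333:prod-ingest-events",
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
        }]
    },
)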

  • More frequent polling increases S3 LIST costs and still risks gaps during spikes.
  • S3 Inventory is not near-real-time and is meant for periodic reporting.
  • CloudTrail data events can work but typically add cost and complexity compared with native S3 event notifications for this use case.

Question 16

Topic: Security and Compliance

Which statement about using AWS KMS key policies and grants for least-privilege access to encrypted data is INCORRECT?

Options:

  • A. To reduce blast radius, a common practice is separating key administrators (who manage the key) from key users (who encrypt/decrypt) in the key policy.

  • B. For AWS service integrations (for example, EBS or S3), key policy conditions such as kms:ViaService and kms:CallerAccount can help restrict key usage.

  • C. An IAM policy that allows kms:Decrypt is sufficient even if the key policy does not allow that principal.

  • D. A KMS grant can be used to delegate a limited set of KMS operations to a principal without editing the key policy each time.

Best answer: C

Explanation: The incorrect statement is the one claiming IAM permission alone is enough to use a KMS key. For customer managed keys, KMS evaluates both the key policy and the caller’s IAM permissions (or an applicable grant). Least privilege is achieved by scoping key policy/grants to only the required principals, operations, and conditions.

The core concept is that AWS KMS uses a dual-authorization model for customer managed keys: the request must be allowed by the KMS key policy, and the principal must also have permission (typically via IAM), unless a key policy statement directly authorizes the principal without relying on IAM. Therefore, granting kms:Decrypt in IAM does not work by itself if the key policy does not allow that principal (or does not allow the principal to obtain/use a grant).

Operationally, you can use grants to provide narrowly scoped, auditable permissions (often temporary or programmatic) without repeatedly editing the key policy, and you can tighten service usage by adding conditions like kms:ViaService and kms:CallerAccount. A key policy can also separate key administrators from key users to maintain least privilege and reduce accidental over-permissioning.
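
An illustrative key policy statement for key users, expressed as a Python dict (the principal, account ID, and Region are placeholders; service integrations may also need narrowly scoped grant permissions):

key_user_statement = {
    "Sid": "AllowEbsUseFromThisAccountOnly",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/app-role"},
    "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
    "Resource": "*",
    "Condition": {
        "StringEquals": {
            "kms:ViaService": "ec2.us-east-1.amazonaws.com",   # EBS requests arrive via EC2
            "kms:CallerAccount": "111122223333",
        }
    },
}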

  • IAM-only access is unsafe because KMS still requires the key policy (or a grant) to allow the principal.
  • Using grants is a valid way to delegate limited KMS permissions without frequent key policy edits.
  • Service-constraint conditions like kms:ViaService/kms:CallerAccount are commonly used to restrict how a key can be used.
  • Admin/user separation in the key policy helps enforce least privilege and reduce blast radius.

Question 17

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A company uses AWS Organizations with 20 application accounts. Workloads run in us-east-1 and eu-west-1. Each application account has its own Amazon CloudWatch dashboards and alarms, so the on-call team must switch accounts and Regions during incidents.

Operations has a dedicated monitoring account and wants to create centralized CloudWatch dashboards that can view metrics and logs across all accounts and both Regions. The solution must minimize ongoing operational effort and avoid duplicating observability data into the monitoring account (to limit additional cost and meet a requirement to keep data in the source accounts).

Which change best meets these requirements?

Options:

  • A. Modify CloudWatch agents to publish all custom metrics only to us-east-1 and stop publishing metrics in eu-west-1 to simplify dashboards

  • B. Add CloudWatch Logs subscription filters in each account to send all log groups to a central CloudWatch Logs log group in the monitoring account for dashboarding

  • C. Use CloudWatch Observability Access Manager (OAM): create an OAM sink in the monitoring account with an org-wide sink policy, create OAM links from each source account/Region, and build dashboards in the monitoring account

  • D. Create CloudWatch metric streams in each account/Region to a Kinesis Data Firehose in the monitoring account and visualize in Amazon Managed Grafana

Best answer: C

Explanation: CloudWatch Observability Access Manager (OAM) is purpose-built for cross-account and cross-Region observability in CloudWatch. By linking source accounts/Regions to a sink in the monitoring account, operators can build centralized dashboards while keeping the underlying metrics and logs in the source accounts. This reduces operational toil without adding large duplication or transfer pipelines.

The core need is a single CloudWatch “pane of glass” across multiple accounts and Regions without building a parallel data lake or duplicating telemetry. CloudWatch Observability Access Manager (OAM) provides this by letting a monitoring account create an OAM sink and allowing source accounts (and each Region) to create OAM links to that sink.

With OAM in place, the monitoring account can query and visualize shared metrics and logs directly in CloudWatch dashboards across accounts/Regions, which reduces incident-time context switching. The main tradeoff is the one-time setup and ongoing permission governance of the OAM sink policy/links (often easiest via AWS Organizations), but it avoids the higher ingestion, transformation, and operational overhead of streaming/replicating data elsewhere.

  • Metric streaming to Grafana increases pipeline cost/ops and doesn’t create centralized CloudWatch dashboards.
  • Central log-group subscriptions duplicate logs into the monitoring account and only address logs, not metrics.
  • Publishing metrics to one Region doesn’t provide cross-Region observability and can reduce visibility during a Regional issue.

Question 18

Topic: Deployment, Provisioning, and Automation

You manage a golden Amazon Linux AMI that must be distributed to two AWS accounts and to us-west-2 for disaster recovery. You also need the ability to roll back an Auto Scaling group to a prior known-good image quickly.

Which statement is NOT correct?

Options:

  • A. Keeping prior AMI IDs in versioned launch templates enables fast rollback by updating the Auto Scaling group to an earlier template version.

  • B. To use the same AMI in another Region, you must copy the AMI into the target Region.

  • C. Sharing an AMI automatically shares its backing snapshots, so the recipient can copy the AMI to other Regions without additional permissions.

  • D. For cross-account AMI use, you typically share the AMI launch permission and the backing EBS snapshot permissions (and share the KMS key if encrypted).

Best answer: C

Explanation: Sharing an AMI only grants launch permission to the AMI metadata; it does not automatically grant access to the underlying snapshots (and KMS key for encrypted images). Without snapshot (and key) permissions, the recipient account cannot reliably use the image or copy it across Regions. Cross-Region distribution and rollback both rely on explicit copying/versioning practices.

AMI distribution has two distinct dimensions: cross-Region and cross-account. Cross-Region requires creating a separate AMI in the destination Region by copying the source AMI. Cross-account access requires granting the other account permissions to use the AMI and to read the AMI’s backing storage; for EBS-backed AMIs that means sharing the associated EBS snapshot(s), and if those snapshots are encrypted, the KMS key policy/grants must also allow the recipient to use the key.

For rollback, operational practice is to keep prior AMI IDs available and reference them through versioned mechanisms (for example, launch template versions) so an Auto Scaling group can be pointed back to a known-good version quickly. The key takeaway is that AMI “sharing” alone is not sufficient unless the backing snapshots (and encryption key) are also shared.

  • Cross-Region replication is accurate because copying creates a new AMI per Region.
  • Cross-account permissions is accurate because snapshot access (and KMS key access when encrypted) is required.
  • Rollback control is accurate because launch template versioning enables quick reversion to a prior AMI reference.

Question 19

Topic: Networking and Content Delivery

When troubleshooting VPC connectivity issues, select TWO statements that are true about security groups and network ACLs.

Options:

  • A. Security groups are stateful; return traffic is automatically allowed.

  • B. A single inbound NACL rule allows return traffic automatically.

  • C. A NAT gateway route makes a subnet public for internet access.

  • D. Network ACLs are stateless; allow both inbound and outbound.

  • E. Security groups use ordered rules and stop at first match.

  • F. An internet gateway route works without a public IPv4 address.

Correct answers: A and D

Explanation: Security groups are stateful, so they automatically allow return traffic for established connections. Network ACLs are stateless, so you must explicitly allow both inbound and outbound traffic (including ephemeral ports as needed) or connectivity will break.

A common cause of “it connects one way but not the other” in a VPC is confusing stateful and stateless filtering. Security groups are stateful and act as an allow list: if traffic is allowed in one direction for a connection, the return traffic is automatically allowed.

Network ACLs are stateless and are evaluated separately for inbound and outbound traffic at the subnet boundary, with an implicit deny at the end. If you allow inbound traffic but forget the corresponding outbound rule (or required ephemeral port range), the return path can be blocked and the connection will fail even if security groups look correct.

  • OK (stateful SGs): Return traffic is allowed automatically for established flows.
  • OK (stateless NACLs): You must allow both directions (and ports) explicitly.
  • NO (SG rule order): Security groups have no explicit deny and aren’t first-match.
  • NO (public access requirement): An IGW route still requires a public IPv4 (or IPv6) address.

Question 20

Topic: Networking and Content Delivery

A CloudOps engineer must immediately block all traffic between instances in a subnet and a known malicious IP address 203.0.113.10. The subnet’s network ACL currently has rule 100: ALLOW 0.0.0.0/0 for all protocols in both inbound and outbound directions. Which action best applies the ops principle of blast-radius reduction while correctly using the differences between security groups and network ACLs (statefulness and rule evaluation order)?

Options:

  • A. Add NACL inbound and outbound DENY rules for 203.0.113.10/32 with a rule number lower than 100

  • B. Add only an inbound NACL DENY rule because NACLs are stateful for return traffic

  • C. Add a security group DENY rule for 203.0.113.10/32

  • D. Add an inbound NACL DENY rule for 203.0.113.10/32 with a rule number higher than 100

Best answer: A

Explanation: Use the subnet network ACL to block the malicious IP for all resources in that subnet, limiting the incident’s impact scope (blast radius). Because NACLs are stateless, you typically block both inbound and outbound directions to stop traffic both ways. NACL rule order matters: a lower-numbered deny must be evaluated before the existing allow-all rule.

Blast-radius reduction means applying controls at the narrowest operational boundary that quickly contains impact for the affected scope. A network ACL is a subnet-level control that can explicitly ALLOW or DENY traffic and is evaluated in ascending rule number order; the first matching rule wins. Because NACLs are stateless, return traffic is not automatically allowed/blocked—you must consider rules in both inbound and outbound directions.

Security groups are instance/ENI-level controls that are stateful and effectively “allow lists” only: they evaluate all rules (no rule order) and cannot create explicit deny rules. Therefore, to immediately block a specific IP when an allow-all NACL exists, place a lower-numbered DENY rule for that IP (and typically mirror it outbound) so it matches before the allow-all rule.
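
A boto3 sketch of the containment step (illustrative; the network ACL ID is a placeholder):

import boto3

ec2 = boto3.client("ec2")

NACL_ID = "acl-0123456789abcdef0"
BAD_IP = "203.0.113.10/32"

# NACLs are stateless, so add a low-numbered DENY in both directions so it is
# evaluated before the existing allow-all rule 100.
for egress in (False, True):
    ec2.create_network_acl_entry(
        NetworkAclId=NACL_ID,
        RuleNumber=50,          # lower than 100, so it matches first
        Protocol="-1",          # all protocols
        RuleAction="deny",
        Egress=egress,
        CidrBlock=BAD_IP,
    )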

  • Security group deny is not possible because security groups do not support explicit DENY rules.
  • Wrong NACL rule order fails because a higher-numbered deny would never match if an allow-all rule is evaluated first.
  • Assuming NACL statefulness is incorrect; stateless filtering requires considering both directions.

Question 21

Topic: Security and Compliance

A security team wants to ensure that no Amazon EC2 security group allows inbound SSH (port 22) from 0.0.0.0/0. The team uses AWS Config to detect noncompliant security groups and automatically trigger remediation workflows with an audit trail.

Which action should the team NOT take?

Options:

  • A. Auto-remediate by adding 0.0.0.0/0 ingress on port 22

  • B. Use a Config remediation action to run SSM Automation

  • C. Deploy the restricted-ssh AWS Config managed rule

  • D. Use EventBridge to trigger an Automation runbook on noncompliance

Best answer: A

Explanation: AWS Config rules should detect noncompliant configurations and trigger remediation that reduces risk. A remediation workflow for unrestricted SSH must remove or restrict the offending rule, using an auditable mechanism such as Systems Manager Automation. Adding an internet-open SSH rule is the opposite of remediation and increases exposure.

The core pattern is: AWS Config evaluates resources against a rule, and a remediation workflow responds to noncompliance to bring the resource back into the desired (secure) state. For security groups, compliant remediation typically revokes overly permissive ingress or restricts it to approved CIDRs. AWS Config can trigger remediation directly (remediation actions) or indirectly by emitting compliance change events that start an SSM Automation runbook, both of which are auditable through AWS API activity logs.

A workflow that intentionally opens SSH to 0.0.0.0/0 is an operations anti-pattern because it expands the attack surface and undermines the control Config is meant to enforce.
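
For contrast, an illustrative remediation body that actually reduces exposure (for example, run from an SSM Automation step or a Lambda function); the security group ID would come from the Config evaluation, and the one shown is a placeholder:

import boto3

ec2 = boto3.client("ec2")

def revoke_open_ssh(group_id: str) -> None:
    # Remove the 0.0.0.0/0 ingress rule on port 22 flagged as noncompliant.
    ec2.revoke_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }],
    )

revoke_open_ssh("sg-0123456789abcdef0")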

  • Managed rule usage is an appropriate way to continuously detect unrestricted SSH.
  • SSM Automation remediation is a standard, auditable way to enforce compliance automatically.
  • Event-driven remediation is acceptable when you want routing/approval logic before running a runbook.
  • Internet-open SSH is unsafe because it increases exposure rather than remediating it.

Question 22

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

Which TWO statements are true about choosing an automated remediation approach in AWS based on incident type and blast radius? (Select TWO.)

Options:

  • A. Use Lambda to replace unhealthy instances because Auto Scaling cannot do this automatically.

  • B. Use Auto Scaling to orchestrate in-place OS patching across running instances without reboots.

  • C. Use an Auto Scaling group with load balancer health checks to replace an unhealthy EC2 instance automatically.

  • D. Use Systems Manager Automation runbooks for controlled, auditable, multi-step remediation that can target many resources with rate controls.

  • E. Use Lambda for 30-minute host repairs because Lambda has no maximum runtime.

  • F. Use Systems Manager Automation only for manual execution; it cannot be event-triggered.

Correct answers: C and D

Explanation: Auto Scaling is the right fit for automatically replacing unhealthy instances, which keeps remediation focused at the instance/fleet level with minimal operator action. Systems Manager Automation is best when remediation requires multi-step, auditable runbooks and controlled rollout to many resources, which helps manage blast radius safely.

Pick the automation tool that matches what failed and how widely your action will apply. Auto Scaling is purpose-built for keeping a fleet healthy by automatically replacing failed instances using EC2 and/or load balancer health checks. Systems Manager Automation is better for operational runbooks where you need a defined sequence of actions (and logging/approvals/permissions) and you may need to limit concurrency when targeting many instances or resources.

As a rule of thumb:

  • Use Auto Scaling for instance replacement and fleet self-healing.
  • Use SSM Automation for repeatable, auditable, multi-step remediation with controlled rollout.

Lambda is event-driven and useful for short, idempotent actions, but it is not suited to long-running repairs.

  • OK: Auto Scaling with health checks is standard self-healing for unhealthy instances.
  • OK: SSM Automation fits multi-step, auditable remediation with rollout controls.
  • NO: Lambda has a maximum runtime (not suitable for 30-minute repairs).
  • NO: Auto Scaling does not provide in-place patch orchestration (use Systems Manager Patch Manager/Run Command).
  • NO: SSM Automation can be triggered automatically (for example, by EventBridge).
  • NO: Auto Scaling can replace unhealthy instances automatically; Lambda is not required for that core behavior.

Question 23

Topic: Deployment, Provisioning, and Automation

A fleet of Amazon EC2 Linux instances runs a critical web service. When the service hangs, a per-instance CloudWatch alarm (dimension InstanceId) enters ALARM and the on-call engineer manually connects to the instance to run systemctl restart websvc. The company has disabled inbound SSH and requires remediation actions to be repeatable and auditable. All instances already appear as “managed instances” in AWS Systems Manager.

Which change provides the best operational optimization while meeting these constraints?

Options:

  • A. Create an EventBridge rule that invokes an AWS Lambda function to restart the service over SSH using a bastion host

  • B. Configure the Auto Scaling group to terminate and replace an instance whenever the alarm enters ALARM

  • C. Create an AWS Systems Manager Automation runbook that restarts websvc, verifies health, and is triggered by an EventBridge rule on the alarm state-change event

  • D. Install a cron job on each instance to periodically restart websvc when local checks fail

Best answer: C

Explanation: Using a Systems Manager Automation runbook turns an ad-hoc, manual restart into a standardized, repeatable workflow with execution history. Triggering it from EventBridge on the CloudWatch alarm state change removes the need for inbound SSH while reducing MTTR through automatic remediation. The main tradeoff is the added effort to build and maintain the runbook and its IAM permissions.

Systems Manager Automation is designed for operational runbooks that perform one or more controlled steps (remediation plus validation) with centralized logging and audit history. In this scenario, an Automation document can use Systems Manager to restart the service on the specific instance that triggered the alarm and then confirm recovery (for example, by checking service status or running a health command) before ending successfully.

A practical workflow is:

  • EventBridge rule matches the CloudWatch alarm state change to ALARM.
  • Automation execution starts and targets the instance (by InstanceId from the event or by tag).
  • Steps run systemctl restart websvc, wait briefly, then verify the service.

This improves operational effort and reliability without violating the “no inbound SSH” constraint; the tradeoff is managing the runbook content and the Automation assume role permissions.
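A minimal Automation document sketch for this workflow could look like the following; the parameter names and step layout are illustrative, and only the websvc service name comes from the scenario, so adapt the content to your environment.

# Hypothetical runbook content (schemaVersion 0.3 = Automation)
description: Restart websvc on the alarmed instance and verify recovery
schemaVersion: '0.3'
assumeRole: '{{ AutomationAssumeRole }}'
parameters:
  InstanceId:
    type: String
  AutomationAssumeRole:
    type: String
mainSteps:
  - name: RestartAndVerify
    action: aws:runCommand        # runs a command document on the managed instance
    inputs:
      DocumentName: AWS-RunShellScript
      InstanceIds:
        - '{{ InstanceId }}'
      Parameters:
        commands:
          - systemctl restart websvc
          - sleep 15
          - systemctl is-active --quiet websvc   # non-zero exit fails the step

You would register content like this with aws ssm create-document --document-type Automation and set it as the EventBridge rule target (with an appropriate role) so the InstanceId from the alarm event can be passed in as a parameter.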

  • SSH-based remediation conflicts with the “no inbound SSH” constraint and adds more moving parts (bastion connectivity/keys).
  • Replace the instance can work but is slower and more disruptive than restarting a single service, and may increase cost.
  • Per-instance cron logic reduces central control and auditability and makes behavior drift likely across the fleet.

Question 24

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A bursty API running on Amazon ECS connects to an Amazon Aurora MySQL cluster. Users report intermittent 500 errors during traffic spikes.

Exhibit: CloudWatch alarm history (Aurora-Prod-DBConnectionsHigh)

Metric: DatabaseConnections   Threshold: > 1,000 (1 datapoint)
12:01:10  ALARM  datapoint=1,245
12:04:10  OK     datapoint=220
12:16:10  ALARM  datapoint=1,310
12:19:10  OK     datapoint=240

Based on the exhibit, what is the best next step to improve database connection efficiency for this workload?

Options:

  • A. Increase the DB parameter group value for max_connections

  • B. Add an Aurora reader instance to offload read traffic

  • C. Scale the Aurora writer to a larger instance class

  • D. Create an RDS Proxy and route the application connections through it

Best answer: D

Explanation: The exhibit shows brief spikes in DatabaseConnections above the 1,000 threshold that quickly return to normal, indicating connection storms during bursts rather than sustained resource saturation. RDS Proxy improves connection efficiency by pooling and reusing database connections, smoothing bursts and reducing connection-related errors without requiring database scaling.

This pattern points to an application connection-management issue: connections surge above a fixed threshold and then drop back within minutes (for example, 12:16:10 ALARM datapoint=1,310 followed by 12:19:10 OK datapoint=240). For bursty workloads, opening many new connections concurrently can exhaust the database’s connection capacity and cause intermittent failures.

RDS Proxy addresses this by maintaining a pool of established connections to the database and multiplexing many application connections onto fewer database connections. Operationally, you create an RDS Proxy for the Aurora cluster, configure IAM/Secrets Manager auth as needed, and update the application to use the proxy endpoint.
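A hedged sketch of that setup is shown below; the proxy name, IAM role, Secrets Manager secret, subnets, and cluster identifier are all placeholders.

# Hypothetical sketch: create the proxy and register the Aurora cluster as its target
aws rds create-db-proxy \
  --db-proxy-name orders-api-proxy \
  --engine-family MYSQL \
  --role-arn arn:aws:iam::123456789012:role/rds-proxy-secrets-role \
  --auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:us-east-1:123456789012:secret:aurora-app-user","IAMAuth":"DISABLED"}]' \
  --vpc-subnet-ids subnet-0aaa1111 subnet-0bbb2222

aws rds register-db-proxy-targets \
  --db-proxy-name orders-api-proxy \
  --db-cluster-identifiers aurora-prod-cluster

# The application then connects to the proxy endpoint instead of the cluster endpoint.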

Raising max_connections or scaling the DB might mask symptoms but does not directly fix connection storms.

  • Add read capacity doesn’t address spikes in total connections shown by the DatabaseConnections alarm.
  • Scale up the writer targets compute/memory limits, but the exhibit highlights transient connection bursts.
  • Increase max_connections can defer failures, but it doesn’t improve connection reuse/pooling for bursty clients.

Question 25

Topic: Deployment, Provisioning, and Automation

An operations team uses AWS Systems Manager Automation runbooks to make changes to production EC2 instances. The team must add a guardrail so that each runbook execution requires a human approval step before the automation can start.

Which AWS capability provides this built-in approval workflow for operational changes?

Options:

  • A. AWS CloudTrail

  • B. Amazon EventBridge

  • C. AWS Systems Manager Change Manager

  • D. AWS Config

Best answer: C

Explanation: AWS Systems Manager Change Manager is designed to add governance to operational actions by requiring approvals before changes are executed. It integrates with Systems Manager operational capabilities (including Automation) so teams can enforce a human approval step as a guardrail. This directly reduces the risk of unintended automated changes in production.

The core guardrail in this scenario is a mandatory approval step before an operational automation runs. AWS Systems Manager Change Manager provides change management for AWS resources by using change templates and change requests that can require one or more approvers before execution. After approval, the change can be executed through Systems Manager operational actions such as Automation runbooks, providing auditability and controlled execution.

The key takeaway is that logging or configuration tracking services can record what happened, but Change Manager is the Systems Manager feature that enforces pre-execution approvals.

  • CloudTrail is audit, not control: it records API activity but does not require approvals before actions run.
  • Config is compliance tracking: it evaluates and records resource configuration changes; it does not gate an Automation execution.
  • EventBridge is event routing: it can trigger workflows but does not provide a native human approval control by itself.

Questions 26-50

Question 26

Topic: Reliability and Business Continuity

An application runs on an EC2 Auto Scaling group (ASG) behind an ALB. Users report intermittent latency during normal traffic fluctuations. The ASG currently uses two step scaling policies driven by CloudWatch CPU alarms.

Exhibit: ASG activity history (excerpt)

10:00Z  Launch     i-01  Cause: alarm CPUHigh-70 in ALARM, +1
10:05Z  Terminate  i-02  Cause: alarm CPULow-35 in ALARM,  -1
10:10Z  Launch     i-03  Cause: alarm CPUHigh-70 in ALARM, +1
10:15Z  Terminate  i-01  Cause: alarm CPULow-35 in ALARM,  -1

Based on the exhibit, what is the best next step to maintain performance under changing load?

Options:

  • A. Add a target tracking policy on average CPU and remove step scaling

  • B. Increase the ASG desired capacity and keep step scaling policies

  • C. Increase the CPU alarm evaluation periods to 30 minutes

  • D. Configure scheduled scaling to add capacity at fixed times

Best answer: A

Explanation: The exhibit shows rapid scale-out and scale-in events alternating every few minutes, driven directly by CPU alarms. This is a classic sign of step scaling “thrash” under normal load variability. A target tracking policy will continuously adjust capacity to keep a chosen metric near a target value, stabilizing performance without constant alarm tuning.

Target tracking scaling is designed to keep a workload stable by maintaining a metric near a set target value, adjusting capacity proportionally as load changes. In the exhibit, capacity changes are being triggered back-and-forth by CPU alarms: a launch at 10:00Z followed by a termination at 10:05Z, then another launch at 10:10Z and termination at 10:15Z. That alternating pattern indicates the current step scaling thresholds are too reactive for normal demand fluctuations.

A better operational approach is to replace the step scaling policies with a target tracking policy (for example, ASGAverageCPUUtilization at a chosen target such as 50–60%). Target tracking reduces oscillation by continuously steering capacity toward the target instead of reacting to discrete alarm states.
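A sketch of the replacement policy is below; the Auto Scaling group name, policy name, and target value are placeholders.

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 55.0
  }'
# Remove the existing CPUHigh-70/CPULow-35 step scaling policies afterward so the policies do not compete.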

The key takeaway is that the rapid alternating events in the activity history point to using target tracking to maintain steady performance.

  • Longer alarm evaluation may reduce flapping, but it still relies on discrete ALARM/OK transitions shown in the exhibit.
  • Scheduled scaling is time-based and does not address the alternating events at 10:00Z/10:05Z/10:10Z/10:15Z.
  • Fixed higher desired capacity can mask the issue but doesn’t adapt to changing load like target tracking.

Question 27

Topic: Reliability and Business Continuity

A team stores critical CSV reports in an Amazon S3 bucket with versioning enabled. An operator accidentally deletes reports/2026-02.csv. The application now receives 404 responses when reading the object. The team’s current runbook restores the file by downloading an older version and uploading it again, which is slow and error-prone.

Exhibit: list-object-versions output (partial)

Key: reports/2026-02.csv
  IsLatest: true
  VersionId: 3HL4kqC9...      (DeleteMarker)
Key: reports/2026-02.csv
  IsLatest: false
  VersionId: qA1b2c3D...      (Object version)

Which change to the runbook is the best optimization to restore the object quickly with the least operational effort?

Options:

  • A. Copy the previous version to the same key to create a new current version

  • B. Suspend versioning on the bucket to prevent delete markers from being created

  • C. Delete the delete marker version so the previous version becomes current

  • D. Run an S3 Glacier restore on the object and wait for retrieval to complete

Best answer: C

Explanation: In a versioned S3 bucket, a simple DELETE adds a delete marker that hides the latest data version. Deleting only the delete marker is the fastest recovery because it makes the most recent non-delete version visible again immediately. This avoids download/upload steps and minimizes the chance of restoring the wrong content.

The core concept is that S3 versioning turns most deletes into a new “delete marker” version; the data version still exists, but it is hidden because the delete marker is the latest version. Operationally, the quickest way to “undelete” is to remove the delete marker (by deleting the specific delete-marker VersionId), which causes S3 to return the next-latest object version for normal GET requests.

Tradeoff: this is a powerful action that must be done carefully—deleting the wrong version ID could permanently remove a data version—so the runbook should explicitly confirm IsLatest and that the target version is a delete marker before deletion.
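A hedged runbook sketch follows; the bucket name is a placeholder, and the version ID must be the delete marker's ID taken from the listing.

# Confirm the delete marker, then remove only that version
aws s3api list-object-versions \
  --bucket example-reports-bucket \
  --prefix reports/2026-02.csv

aws s3api delete-object \
  --bucket example-reports-bucket \
  --key reports/2026-02.csv \
  --version-id <delete-marker-version-id>
# After the delete marker is removed, a normal GET returns the previous object version.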

  • Glacier restore doesn’t address a delete marker; it’s for retrieving archived objects.
  • Suspend versioning reduces recoverability and doesn’t fix the already-hidden object.
  • Copy previous version can work, but it adds extra operations and creates an additional version when simply removing the delete marker is faster and simpler.

Question 28

Topic: Security and Compliance

Which THREE statements are true about enforcing compliance constraints on AWS Region and service usage with AWS Organizations governance controls (such as service control policies (SCPs))?

(Select THREE.)

Options:

  • A. SCPs can be attached directly to individual IAM roles to restrict what those roles can do.

  • B. AWS Control Tower detective guardrails prevent users from using disallowed AWS services.

  • C. You can restrict Regions by using SCP conditions on aws:RequestedRegion to deny requests outside an approved list.

  • D. AWS Config rules can prevent users from creating resources in disallowed Regions by blocking the API call.

  • E. An explicit Deny in an SCP overrides any Allow in IAM identity-based or resource-based policies.

  • F. SCPs set the maximum permissions an account can use; they do not grant permissions by themselves.

Correct answers: C, E and F

Explanation: SCPs in AWS Organizations are preventive guardrails that limit what permissions accounts can exercise, regardless of what IAM policies allow. They are commonly used to restrict access to specific AWS services and to constrain usage to approved Regions by evaluating request context keys such as aws:RequestedRegion. Detective controls (for example, Config rules) can identify violations but do not block API calls by themselves.

The core mechanism for enforcing (preventing) noncompliant Region and service usage across accounts is an AWS Organizations SCP. SCPs don’t grant access; instead, they define the outer boundary of what actions are even possible in affected accounts. During authorization, an explicit deny in an SCP takes precedence and cannot be overridden by an IAM allow.

For Region restrictions, an SCP can deny actions when the request context indicates a non-approved Region (commonly with conditions against aws:RequestedRegion). This is a preventive control. In contrast, governance tools like AWS Config and many Control Tower detective guardrails are primarily for detection and reporting/remediation workflows, not for blocking the initial API call.
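A commonly cited SCP sketch for this pattern is shown below. The approved Regions and the list of exempted global services are illustrative and must be tailored; many global services only accept requests in us-east-1, so denying them by Region would break them.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "cloudfront:*",
        "route53:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["eu-west-1", "eu-central-1"]
        }
      }
    }
  ]
}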

Key takeaway: use SCPs for preventive enforcement; use detective tools to find and respond to drift.

  • OK: SCPs are permission guardrails; they do not grant permissions on their own.
  • OK: An explicit deny in an SCP cannot be overridden by IAM allows.
  • OK: aws:RequestedRegion conditions in an SCP can enforce approved-Region usage when applicable to the request.
  • NO: AWS Config is primarily detective; it evaluates and reports compliance rather than blocking API calls.
  • NO: SCPs attach to the organization root/OU/account, not to individual IAM principals.
  • NO: Detective guardrails detect and alert; preventive guardrails/SCPs are used to block actions.

Question 29

Topic: Reliability and Business Continuity

A production application in eu-west-1 experienced data corruption across several AWS resources. The operations team will restore the environment using existing recovery points in AWS Backup.

Requirements:

  • All restored data must remain in eu-west-1 (data residency).
  • The corrupted resources must remain available for forensics; do not delete or overwrite them.

Which TWO actions should you AVOID during the restore?

Options:

  • A. Copy recovery point to us-east-1; restore resources there

  • B. Restore RDS to new instance; cut over using Route 53

  • C. Restore DynamoDB to new table name; repoint application

  • D. Restore EBS to new volume; swap after stopping instance

  • E. Restore EFS to new file system; remount via Systems Manager

  • F. Delete existing DynamoDB table; restore backup to original name

Correct answers: A and F

Explanation: To meet residency and forensic requirements, restores must stay in eu-west-1 and should be performed by creating new resources from AWS Backup recovery points. Cut over traffic or configuration to the restored resources only after validation. Actions that move data to another Region or require deleting/overwriting the corrupted resources violate the stated rules.

AWS Backup restores are safest operationally when you restore to new resources in the same Region, validate them, and then perform a controlled cutover (DNS change, configuration update, or volume swap) while keeping the corrupted resources intact for investigation.

In this scenario, two explicit constraints drive what to avoid:

  • Restoring in a different Region breaks data residency.
  • Deleting or overwriting the corrupted resource breaks the forensic preservation requirement.

Restoring EBS, RDS, DynamoDB, and EFS to new targets in eu-west-1 and then switching the application to those targets satisfies both requirements and keeps rollback/investigation options available.

  • Cross-Region restore breaks the requirement that restored data must remain in eu-west-1.
  • Overwrite by deletion fails because deleting the DynamoDB table destroys the corrupted evidence you must preserve.
  • Restore then cut over is acceptable because it restores to new resources and changes traffic/config only after validation.
  • EBS swap after stop is acceptable because it replaces the attachment while keeping the original volume for forensics.

Question 30

Topic: Security and Compliance

A security team wants to continuously detect noncompliant AWS resource configurations by using AWS Config rules and automatically start remediation workflows when noncompliance is found.

Which statement is INCORRECT?

Options:

  • A. Managed AWS Config rules can flag noncompliant resources automatically.

  • B. Compliance changes can be sent to EventBridge for automation.

  • C. Config remediation can run Systems Manager Automation with an IAM role.

  • D. Config rules can read instance log files to verify patch status.

Best answer: D

Explanation: AWS Config rules evaluate resource configuration items that AWS Config records (for example, security group settings or S3 bucket policies) and report compliance. Remediation is typically automated by triggering workflows such as AWS Systems Manager Automation directly from AWS Config or by reacting to compliance events in EventBridge.

AWS Config is designed to record and evaluate AWS resource configurations against rules (managed rules or custom rules). When a resource becomes NON_COMPLIANT, AWS Config can both report the compliance state and drive automation.

For remediation, AWS Config supports remediation actions that run AWS Systems Manager Automation documents, using a specified IAM service role to perform the fix (for example, attach encryption, restrict a security group, or enable a setting). You can also route compliance change notifications to Amazon EventBridge and trigger additional workflows (Lambda, SSM Automation, SNS notifications, ticketing integrations).

What AWS Config does not do is inspect the operating system inside an EC2 instance (such as reading log files or checking installed patches); use services like Systems Manager Patch Manager or Amazon Inspector for that type of assessment.

  • Config scope: AWS Config evaluates recorded AWS resource configurations, not in-guest files.
  • Event-driven workflows: Sending compliance change events to EventBridge is a common automation pattern.
  • Built-in remediation: Using SSM Automation with an IAM role is the standard AWS Config remediation mechanism.

Question 31

Topic: Reliability and Business Continuity

A company serves a global user base from an Application Load Balancer (ALB) origin in us-east-1. During peak hours, the ALB and web tier show high request rates and increased latency. The operations team is adding Amazon CloudFront to reduce origin load while keeping content reasonably fresh.

Which CloudFront configuration action should be AVOIDED because it prevents effective caching at edge locations?

Options:

  • A. Create separate cache behaviors so static paths use long TTLs while dynamic paths use low or disabled caching

  • B. Forward all headers, all cookies, and all query strings and set the default TTL to 0 seconds for all paths

  • C. Use an AWS managed cache policy optimized for caching static content and enable CloudFront compression

  • D. Enable Origin Shield in the AWS Region closest to the ALB origin to improve cache hit ratios

Best answer: B

Explanation: CloudFront reduces origin load only when many requests can be served from edge caches. Forwarding all request attributes and setting a 0-second TTL for all paths makes most requests uncacheable or uniquely cached, driving traffic back to the ALB. This violates the principle of minimizing cache-key variation and using appropriate TTLs to increase cache hit ratio.

The core operational goal is to increase CloudFront cache hits so fewer requests reach the origin. CloudFront caches objects based on the cache key (what you include from the request) and how long objects are allowed to stay in cache (TTL). If you forward many values (headers/cookies/query strings), you create many unique cache keys; if you set TTLs to 0, CloudFront must revalidate or fetch from the origin for every request.

To reduce origin load while keeping content fresh:

  • Cache static assets with longer TTLs (and version object names when you deploy changes)
  • Limit cache-key inputs to only what the application truly needs
  • Use features like compression and Origin Shield to improve efficiency

The key takeaway is to avoid configurations that make nearly every request a cache miss.
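For the static behavior, a cache policy along these lines keeps the cache key minimal and the TTL long; the name and TTL values are illustrative, and 2,592,000 seconds corresponds to 30 days.

# Hypothetical sketch: long TTL, no cookies/headers/query strings in the cache key
aws cloudfront create-cache-policy --cache-policy-config '{
  "Name": "static-assets-long-ttl",
  "MinTTL": 1,
  "DefaultTTL": 2592000,
  "MaxTTL": 2592000,
  "ParametersInCacheKeyAndForwardedToOrigin": {
    "EnableAcceptEncodingGzip": true,
    "EnableAcceptEncodingBrotli": true,
    "HeadersConfig": {"HeaderBehavior": "none"},
    "CookiesConfig": {"CookieBehavior": "none"},
    "QueryStringsConfig": {"QueryStringBehavior": "none"}
  }
}'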

  • Split behaviors is a common way to cache static content aggressively while treating dynamic routes differently.
  • Managed cache policy + compression increases cacheability and reduces bytes served without breaking correctness.
  • Origin Shield can reduce origin load by consolidating cache misses before they reach the origin.

Question 32

Topic: Deployment, Provisioning, and Automation

An operations engineer must use AWS Systems Manager Patch Manager to apply patches every Sunday. The engineer must also limit the blast radius by ensuring that only a small, controlled number of instances are patched at the same time.

Which Systems Manager feature provides both the schedule and the concurrency control for patching?

Options:

  • A. Create a Maintenance Window task that runs AWS-RunPatchBaseline and set MaxConcurrency (and MaxErrors)

  • B. Use an EC2 Auto Scaling group instance refresh to roll patches through instances

  • C. Configure patch baseline approval rules to stagger when updates are installed

  • D. Create a State Manager association and rely on instance tags to limit patching concurrency

Best answer: A

Explanation: Systems Manager Maintenance Windows are the built-in scheduling mechanism for Patch Manager operations. When you run AWS-RunPatchBaseline as a Maintenance Window task, you can explicitly limit parallel execution by configuring task concurrency/error thresholds, which controls patching blast radius.

The core operational control for scheduled, controlled patching in Systems Manager is a Maintenance Window. You register targets (instances) for the window and run patching by adding a Maintenance Window task that uses the AWS-RunPatchBaseline SSM document. Blast radius is controlled by the task’s execution controls, especially MaxConcurrency (how many targets run at once) and MaxErrors (when to stop after failures).
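A hedged sketch of registering the patch task follows; the window ID, target ID, and concurrency/error thresholds are placeholders.

aws ssm register-task-with-maintenance-window \
  --window-id mw-0123456789abcdef0 \
  --targets Key=WindowTargetIds,Values=<window-target-id> \
  --task-arn AWS-RunPatchBaseline \
  --task-type RUN_COMMAND \
  --task-invocation-parameters '{"RunCommand":{"Parameters":{"Operation":["Install"]}}}' \
  --max-concurrency "2" \
  --max-errors "1"
# MaxConcurrency limits how many instances patch at once; MaxErrors stops the task after too many failures.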

Patch baselines define what is approved to install, but they do not schedule patching or control how many instances patch simultaneously. A State Manager association can run on a schedule, but the explicit parallelism controls for managed rollout are provided in Maintenance Window task settings.

  • Patch baseline approval rules control what patches are eligible, not the scheduling/concurrency of execution.
  • Auto Scaling instance refresh can roll EC2 replacements, but it is not the Patch Manager mechanism for scheduled patching.
  • State Manager tags can target instances, but this does not provide the Maintenance Window task concurrency/error-stop controls for rollout blast radius.

Question 33

Topic: Networking and Content Delivery

A SysOps engineer is reviewing a plan to use AWS Transit Gateway (TGW) for hub-and-spoke routing between multiple VPCs in one Region. Which statement is NOT correct?

Options:

  • A. TGW automatically adds required routes to VPC route tables.

  • B. Each VPC must have a TGW VPC attachment.

  • C. VPC route tables need routes that point to the TGW.

  • D. TGW route tables can segment which attachments communicate.

Best answer: A

Explanation: Transit Gateway attachments and TGW route tables provide the hub-and-spoke connectivity, but VPC route tables still require explicit routes that target the TGW. Assuming TGW will automatically add those VPC routes is incorrect and can lead to unexpected loss of connectivity during cutover.

The core concept is that TGW controls forwarding between attachments with TGW route tables (via association/propagation), but each VPC still decides what traffic leaves the VPC by using its own VPC route tables. Creating a TGW VPC attachment connects a VPC to the TGW, but it does not automatically insert routes into the VPC subnets’ route tables.

Operationally, hub-and-spoke typically involves:

  • Creating a TGW and VPC attachments for the hub and each spoke VPC.
  • Associating/propagating attachments into the appropriate TGW route table(s) to allow or block paths.
  • Adding routes in each VPC route table for remote CIDRs with the TGW as the target.

Key takeaway: TGW route tables control inter-attachment paths, while VPC route tables must be updated to actually send traffic to the TGW.
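The VPC route is the step TGW does not do for you; a sketch is below, with the route table ID, remote CIDR, and TGW ID as placeholders.

# Add a route in the VPC (subnet) route table that sends a remote CIDR to the TGW
aws ec2 create-route \
  --route-table-id rtb-0abc1234def567890 \
  --destination-cidr-block 10.20.0.0/16 \
  --transit-gateway-id tgw-0fed9876cba543210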

  • Attachment requirement is accurate: a VPC must be attached to a TGW to use it.
  • VPC routing requirement is accurate: subnet route tables need TGW-target routes for remote CIDRs.
  • Segmentation with TGW route tables is accurate: separate TGW route tables (and association/propagation choices) control reachability.

Question 34

Topic: Security and Compliance

An operations engineer reviews AWS Trusted Advisor Security checks for the production account. The checks show:

  • Root account has active access keys
  • MFA on the root account is disabled
  • IAM access keys have not been rotated in the last 90 days (6 keys)
  • IAM password policy does not require strong passwords

Only one remediation can be completed immediately. Which action should be prioritized first to reduce security risk the most?

Options:

  • A. Rotate IAM access keys that are older than 90 days

  • B. Enable MFA on the root account

  • C. Delete the root account access keys

  • D. Update the IAM password policy to enforce stronger passwords

Best answer: C

Explanation: The deciding factor is the presence of active root access keys. Root credentials provide unrestricted access, and access keys can be used for API calls without MFA. Removing the root access keys immediately eliminates a high-impact credential risk, while the other findings are important hardening tasks but typically less urgent than eliminating root keys.

AWS Trusted Advisor flags “Root account has active access keys” because root is an all-powerful principal and root access keys are long-lived credentials that are difficult to justify operationally. If a root access key is ever exposed, an attacker can use it to call AWS APIs with full permissions, potentially without any MFA challenge.

Operationally, prioritize eliminating the root access keys and then use least-privilege IAM roles/users for administration. After the root keys are removed, follow up by enabling root MFA and addressing routine hygiene findings like key rotation and password policy.

The closest alternative is enabling root MFA, but MFA does not mitigate API access that uses existing root access keys.

  • Root MFA first improves interactive sign-in security but does not prevent API calls made with existing root access keys.
  • Rotate old IAM keys is good credential hygiene, but it doesn’t remove the highest-privilege credential risk.
  • Stronger password policy reduces password-guessing risk for IAM users, but it doesn’t address exposed root programmatic credentials.

Question 35

Topic: Reliability and Business Continuity

A CloudFront distribution fronts an ALB origin for an e-commerce site.

Requirements:

  • /assets/* files are versioned (filename changes on updates) and should be cached at edge for 30 days with the highest possible cache hit ratio.
  • /api/* returns user-specific JSON that varies by the Authorization header and must reflect changes within 30 seconds.

Which TWO CloudFront cache behavior configurations should you AVOID? (Select TWO.)

Options:

  • A. Set /api/* TTL to 1 hour and do not include Authorization in the cache key

  • B. Set /api/* TTL to 30 seconds and include Authorization and all query strings in the cache key

  • C. Set /assets/* TTL to 30 days and exclude cookies, headers, and query strings from the cache key

  • D. Use separate behaviors: /assets/* uses a cache policy with 30-day TTL; /api/* uses a cache policy with 30-second TTL

  • E. Set /api/* caching disabled (TTL 0) and forward Authorization to the origin

  • F. Set /assets/* TTL to 30 days with a cache key that includes all cookies, all headers, and all query strings

Correct answers: A and F

Explanation: Dynamic, user-specific API responses must either not be cached or must be cached with a cache key that varies by Authorization, and the TTL must meet the 30-second freshness requirement. Static versioned assets should use a long TTL and a minimal cache key to maximize cache hits and reduce origin load.

The core CloudFront operational task here is matching TTL and cache key design to content type. For /api/*, caching is only safe when the cache key includes every request value that changes the response (here, Authorization), and the TTL must be no more than the allowed staleness (30 seconds). If Authorization is omitted from the cache key while caching is enabled, CloudFront can serve the wrong user’s cached response.

For /assets/*, versioned filenames allow long TTLs, and the cache key should exclude cookies/headers/query strings that do not affect the asset; otherwise, the same object is stored under many cache keys, lowering hit ratio and increasing origin traffic.

Key takeaway: include only response-varying values in the cache key, and set TTLs to match freshness requirements.

  • Long API TTL without auth keying breaks both user isolation and the 30-second freshness requirement.
  • Overly broad static cache key reduces cache hits and undermines the requirement to maximize edge caching.
  • Short-lived API caching with Authorization in key is acceptable because cached objects are per-user and within the freshness window.
  • Disabling API caching is acceptable for correctness (no stale or cross-user responses), even if it reduces offload.

Question 36

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A CloudOps engineer supports a web application running on an EC2 Auto Scaling group (min 2, max 20) behind an Application Load Balancer. During sustained, unpredictable traffic spikes, CloudWatch shows ASGAverageCPUUtilization stays above 80%, ALBRequestCountPerTarget rises sharply, and users report slow responses. The engineer must configure scaling policies so the group automatically adds capacity during sustained load and maintains performance without manual intervention.

Which actions should the engineer take? (Select THREE.)

Options:

  • A. Add a step scaling policy driven by a CloudWatch alarm on TargetResponseTime with larger scale-out steps at higher latency thresholds

  • B. Enable ALB access logs to Amazon S3 to improve troubleshooting of slow requests

  • C. Add a target tracking policy using ASGAverageCPUUtilization to keep average CPU near a set value

  • D. Add a target tracking policy using ALBRequestCountPerTarget to keep requests per target near a set value

  • E. Manually increase the Auto Scaling group desired capacity when spikes occur

  • F. Create a scheduled scaling action to increase desired capacity at fixed times each day

Correct answers: A, C and D

Explanation: To remediate sustained load with minimal manual effort, configure Auto Scaling policies that continuously adjust capacity based on demand signals. Target tracking maintains a metric at a steady target (like CPU utilization or requests per target), while step scaling uses CloudWatch alarms and thresholds to add capacity in controlled increments when performance metrics degrade.

The core remediation is to configure Auto Scaling policies that respond automatically to sustained load signals. Target tracking is ideal when you want a metric to hover around a chosen target value; it continuously computes capacity needed to keep that metric near the target (for example, CPU utilization or ALB requests per target). Step scaling is appropriate when you want explicit thresholds and different scale-out amounts as conditions worsen (for example, larger scale-out when TargetResponseTime is much higher).

Using these together is valid operationally because multiple policies can be attached to the same Auto Scaling group, and the group applies the policy that results in the greatest required capacity during scale out. Actions like logging or manual/scheduled changes do not directly implement reactive scaling to maintain performance during unpredictable sustained spikes.

  • OK: Target tracking on requests per target scales capacity to match ALB demand.
  • OK: Target tracking on average CPU reduces sustained CPU saturation by adding instances.
  • OK: Step scaling on response time enables threshold-based, proportional scale-out during performance degradation.
  • NO: Manual desired-capacity changes are not a scaling policy and do not provide ongoing automatic remediation.
  • NO: Access logs help analysis but do not configure any scaling behavior.
  • NO: Scheduled scaling is time-based and won’t reliably match unpredictable sustained spikes.

Question 37

Topic: Deployment, Provisioning, and Automation

Select TWO statements that are true about building, tagging, and storing container images in Amazon ECR for operational use.

Options:

  • A. You must use AWS Certificate Manager (ACM) certificates to store private images in ECR.

  • B. Enable ECR tag immutability to prevent an existing tag from being overwritten.

  • C. ECR repositories must use S3 buckets that you create and manage.

  • D. Use aws ecr get-login-password to authenticate Docker to ECR.

  • E. docker push automatically creates the ECR repository if it does not exist.

  • F. ECR lifecycle policies cannot expire images that use the latest tag.

Correct answers: B and D

Explanation: To push or pull images with Amazon ECR, your container client must first authenticate to the ECR registry using an ECR login token. ECR tags are mutable by default, but you can enforce operational safety by enabling tag immutability so a tag cannot be reassigned to a different image digest.

The core operational steps for using Amazon ECR are: build an image locally, tag it with the ECR repository URI plus a tag, authenticate Docker to the ECR registry, and then push the image. Authentication is performed by obtaining a temporary registry token from ECR (commonly with aws ecr get-login-password) and using it with docker login.

Separately, tagging behavior affects deployment repeatability. By default, an ECR tag (such as latest or prod) can be moved to point to a different image digest when you push again with the same tag. If you need to prevent accidental overwrites and improve traceability, enable tag immutability at the repository level so an existing tag cannot be reassigned.
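A typical push sequence looks like the following; the account ID, Region, and repository/tag names are placeholders.

# Authenticate Docker to the ECR registry with a temporary token
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Tag with the full repository URI, then push
docker tag webapp:1.4.2 123456789012.dkr.ecr.us-east-1.amazonaws.com/webapp:1.4.2
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/webapp:1.4.2

# Optional hardening: prevent existing tags from being reassigned
aws ecr put-image-tag-mutability --repository-name webapp --image-tag-mutability IMMUTABLE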

  • OK: Authenticating Docker with an ECR login token is required before pushes/pulls.
  • OK: Tag immutability prevents reusing a tag for a different digest.
  • NO: A repository must exist before pushing; create it (or use automation) first.
  • NO: Lifecycle policies can target tags (including latest) based on your rules.

Question 38

Topic: Reliability and Business Continuity

A company runs a customer-facing application in us-east-1 with a warm-standby DR environment in us-west-2. The workload uses Amazon RDS and Amazon EBS volumes, and backups are managed with AWS Backup.

The disaster recovery runbook must meet an RPO of 15 minutes and an RTO of 60 minutes, and it must include documented failover and failback steps that are auditable.

Which THREE runbook steps should be AVOIDED? (Select THREE.)

Options:

  • A. For failback, pause writes, take a final backup, restore primary, then cut back

  • B. Start taking manual snapshots only after declaring a disaster

  • C. Use AWS Backup to copy encrypted recovery points to a DR backup vault

  • D. Periodically test restores by restoring into an isolated DR VPC

  • E. Export backup files to an S3 bucket that allows public access

  • F. Retain all cross-Region backup copies indefinitely with no lifecycle

Correct answers: B, E and F

Explanation: A DR runbook should depend on pre-created, automated recovery points to meet stated RPO/RTO targets, and it must protect backup data with strong access controls. It also needs a sustainable retention strategy so backups remain manageable and cost-effective. Steps that create data exposure, jeopardize RPO, or create runaway retention should be excluded.

A disaster recovery runbook is only dependable if it is based on backups that are created continuously or on a defined schedule before an outage, secured like production data, and retained with clear boundaries. To meet a 15-minute RPO, the runbook should reference AWS Backup-managed recovery points that exist prior to the incident (and ideally include cross-Region copies to a dedicated vault). Backups must be private and access-controlled; placing backup exports where public access is possible violates basic data protection requirements.

Retention is also an operational control: define how long to keep recovery points (and when to transition/expire them) so storage cost and vault size do not grow without limit. Finally, validate the runbook with periodic restore tests and include safe failback sequencing (quiesce writes, capture a final recovery point, restore to primary, then switch traffic).

  • Cross-Region encrypted copies help ensure recovery points exist in the DR Region and remain protected.
  • Restore testing is an operational best practice that proves backups are usable and the steps are accurate.
  • Orderly failback sequencing reduces data-loss risk by capturing a final recovery point after writes are paused.

Question 39

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An ECS service writes application logs to a CloudWatch Logs log group. The on-call team wants an alert when the log stream contains "code":"PAYMENT_TIMEOUT" more than 5 times in 5 minutes. They want to avoid changing the application and reduce the current manual log searching during incidents.

Which change is the best optimization to meet this requirement?

Options:

  • A. Create a CloudWatch Logs metric filter and alarm on matches

  • B. Add a CloudWatch dashboard with a Logs Insights query widget

  • C. Publish a custom CloudWatch metric from the application code

  • D. Enable anomaly detection alarms on ECS service CPU utilization

Best answer: A

Explanation: Use CloudWatch Logs for the pattern match, convert matches into a CloudWatch metric with a metric filter, and then use a CloudWatch alarm to notify. This directly targets the PAYMENT_TIMEOUT signal without changing the application and removes the need for manual log searching. The tradeoff is you pay for the custom metric and alarm evaluation, but you gain timely, automated alerting.

CloudWatch metrics are numeric time-series values that alarms can evaluate; CloudWatch logs are event records that are searched and analyzed but are not directly alarmed on. When you need an alert based on a specific log message pattern (and you can’t change the app), the operationally efficient approach is to use a CloudWatch Logs metric filter to count matching events and then attach a CloudWatch alarm to that generated metric (for example, Sum over 5 minutes > 5) to notify via SNS.

This optimizes operations by turning unstructured log signals into an actionable, low-latency alarm, at the cost of an additional custom metric and alarm charges. A dashboard/query can help investigation, but it doesn’t provide proactive alerting by itself.
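A sketch of the two pieces is shown below; the log group, namespace, metric, and SNS topic names are placeholders.

# Count matching log events as a custom metric
aws logs put-metric-filter \
  --log-group-name /ecs/payments-api \
  --filter-name payment-timeout-count \
  --filter-pattern '"PAYMENT_TIMEOUT"' \
  --metric-transformations metricName=PaymentTimeoutCount,metricNamespace=App/Payments,metricValue=1

# Alarm when the 5-minute sum exceeds 5
aws cloudwatch put-metric-alarm \
  --alarm-name payment-timeout-burst \
  --namespace App/Payments \
  --metric-name PaymentTimeoutCount \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-topic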

  • Dashboard-only visibility helps troubleshooting but does not create an automatic alert.
  • Custom metric from code would work, but it violates the no-application-change constraint and adds implementation effort.
  • CPU anomaly detection can reduce noisy performance alarms, but it won’t detect a specific application error code.

Question 40

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A central operations account (111111111111) is setting up Amazon CloudWatch cross-account observability to build a single dashboard that shows application metrics from a production account (222222222222) in both us-east-1 and us-west-2. Linking works for us-east-1, but the team cannot link resources for us-west-2.

Exhibit: CloudTrail event (production account 222222222222)

EventSource: oam.amazonaws.com
EventName: CreateLink
awsRegion: us-west-2
requestParameters:
  sinkIdentifier: arn:aws:oam:us-west-2:111111111111:sink/7a1b2c3d4e
errorCode: ResourceNotFoundException
errorMessage: Sink ... not found

Based on the exhibit, what is the best next step to enable the centralized dashboard to include us-west-2 data from the production account?

Options:

  • A. Edit the existing dashboard widgets to query us-east-1 instead of us-west-2

  • B. Create the dashboard in the production account in us-west-2 and share it with the operations account

  • C. Add oam:CreateLink permissions to the IAM role used in the production account

  • D. Create an OAM sink in us-west-2 in the operations account, then create an OAM link to that sink from the production account in us-west-2

Best answer: D

Explanation: The CloudTrail event shows the CreateLink call in us-west-2 failing because the referenced sink cannot be found in that Region. CloudWatch cross-account observability (OAM) is Region-scoped, so you must create the sink in the Region where you want to link and then create the link in that same Region.

This is an Amazon CloudWatch cross-account observability (OAM) configuration issue. OAM sinks are Region-scoped resources, and links are created in a specific Region to a sink ARN in that same Region. In the exhibit, the awsRegion: us-west-2 call to CreateLink fails with errorCode: ResourceNotFoundException and errorMessage: Sink ... not found, while the request references a sinkIdentifier in arn:aws:oam:us-west-2:111111111111:sink/.... That combination indicates the monitoring (operations) account does not have an OAM sink created in us-west-2 (or the ARN is for a sink that doesn’t exist there).

Create an OAM sink in us-west-2 in the operations account, allow the production account to link to it, and then create the OAM link from the production account in us-west-2 so the dashboard can pull us-west-2 telemetry.
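A hedged sketch of the Region-scoped setup follows; the sink name, new sink ARN, policy file, and label template are placeholders, and the sink policy must allow the production account to link.

# In the operations account (111111111111), Region us-west-2: create the sink and allow the producer account
aws oam create-sink --name central-observability --region us-west-2
aws oam put-sink-policy --sink-identifier <new-sink-arn> --policy file://sink-policy.json --region us-west-2

# In the production account (222222222222), Region us-west-2: link to that sink
aws oam create-link \
  --label-template '$AccountName' \
  --resource-types AWS::CloudWatch::Metric AWS::Logs::LogGroup \
  --sink-identifier <new-sink-arn> \
  --region us-west-2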

  • IAM permission change doesn’t fit because the error is ResourceNotFoundException, not an AccessDenied failure.
  • Sharing a dashboard doesn’t provide centralized cross-account metrics access; it just shares a dashboard artifact.
  • Pointing to us-east-1 would avoid the error but would not include the required us-west-2 data.

Question 41

Topic: Networking and Content Delivery

A company runs a public API behind an internet-facing ALB in us-east-1 (primary) and an identical ALB in us-west-2 (secondary). Amazon Route 53 uses a failover routing policy for api.example.com with CNAME records to each ALB and separate Route 53 HTTPS health checks to /health. The ALB target groups already have application health checks configured and are trusted.

The operations team wants to reduce ongoing cost and management overhead while keeping automatic regional failover behavior.

Which change is the best optimization?

Options:

  • A. Replace the CNAMEs with Route 53 alias A/AAAA failover records to the ALBs and enable Evaluate Target Health

  • B. Replace the HTTPS health checks with a CloudWatch alarm-based Route 53 health check

  • C. Switch from failover routing to weighted routing across both Regions

  • D. Increase the Route 53 health check request interval to reduce the number of checks

Best answer: A

Explanation: Using alias failover records with Evaluate Target Health lets Route 53 use the ALB’s health status to decide whether to answer with the primary or secondary record. This eliminates standalone Route 53 health checks (lower cost) and reduces configuration to maintain while preserving automated failover behavior.

For Route 53 failover routing, the decision to return the primary record is driven by the health evaluation of that record. When the record is an alias to an AWS resource like an ALB, enabling Evaluate Target Health makes Route 53 use the ALB’s service-reported health instead of running separate Route 53 health checkers.

In this setup, the ALB target groups already perform application health checks, so Route 53 can safely base failover on the ALB’s health and you can delete the standalone Route 53 health checks to reduce cost and operational effort. The main tradeoff is that you’re relying on the ALB’s health model (not multi-location probing from Route 53 health checkers).
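An upsert of the primary alias record could look like the sketch below; the hosted zone ID and ALB DNS name are placeholders, the alias HostedZoneId is the ALB's canonical hosted zone ID, and the secondary record has the same shape with Failover set to SECONDARY.

aws route53 change-resource-record-sets --hosted-zone-id Z111EXAMPLE --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "primary-us-east-1",
      "Failover": "PRIMARY",
      "AliasTarget": {
        "HostedZoneId": "<alb-canonical-hosted-zone-id>",
        "DNSName": "primary-alb-1234567890.us-east-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}'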

  • Slower check interval reduces health check cost slightly but still requires managing Route 53 health checks and can slow failover detection.
  • Weighted routing is for load distribution, not primary/secondary failover based on health.
  • CloudWatch alarm health check can work, but adds monitoring dependencies and doesn’t reduce cost/effort compared with using ALB health directly.

Question 42

Topic: Reliability and Business Continuity

A company runs a web application in us-east-1 on an Auto Scaling group with 6 EC2 instances. The company is implementing disaster recovery (DR) in us-west-2.

Requirements:

  • RTO: 20 minutes
  • RPO: 15 minutes
  • Budget for always-on DR resources: USD 350/month

Assumptions (use 730 hours/month; round to nearest dollar):

  • EC2 cost in the DR Region is USD 0.10 per instance-hour.
  • Cross-Region data replication and backup storage costs USD 45/month regardless of strategy.
  • Estimated recovery times and RPO by strategy:
  • Backup and restore: RTO 2 hours, RPO 60 minutes
  • Pilot light (1 instance always on, scale on failover): RTO 35 minutes, RPO 5 minutes
  • Warm standby (3 instances always on, scale on failover): RTO 15 minutes, RPO 5 minutes
  • Multi-site active-active (6 instances always on): RTO 5 minutes, RPO 5 minutes

Which DR strategy should the operations team choose to meet the requirements at the lowest monthly always-on cost?

Options:

  • A. Warm standby

  • B. Pilot light

  • C. Multi-site active-active

  • D. Backup and restore

Best answer: A

Explanation: Only warm standby and active-active meet the 20-minute RTO and 15-minute RPO. Calculating monthly always-on cost shows warm standby stays within the USD 350/month budget while active-active exceeds it. Therefore, warm standby is the lowest-cost strategy that satisfies both timing and budget requirements.

The decision hinges on meeting RTO/RPO first, then verifying the always-on monthly cost against the budget. Backup and restore and pilot light miss the 20-minute RTO (and backup and restore also misses the 15-minute RPO), leaving warm standby and active-active.

Warm standby monthly always-on cost (3 instances running):

  • Compute: 3 instances × USD 0.10/hour × 730 hours = USD 219.00
  • Add fixed replication/backup storage: USD 219.00 + USD 45 = USD 264/month (rounded)

Active-active would require 6 always-on instances, doubling the compute portion and pushing total cost above USD 350/month. The key takeaway is that warm standby is the least expensive option that still satisfies both RTO and RPO here.

  • Backup/restore is not acceptable because its RTO (2 hours) and RPO (60 minutes) exceed the requirements.
  • Pilot light is not acceptable because its RTO (35 minutes) exceeds the 20-minute requirement.
  • Active-active meets RTO/RPO but exceeds the USD 350/month always-on budget when run at full capacity.

Question 43

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An application runs on an Auto Scaling group (ASG) of EC2 instances behind an Application Load Balancer (ALB). During traffic surges, p95 latency increases and the ALB HTTPCode_Target_5XX_Count briefly spikes until scaling completes. Instances require about 3 minutes after launch to pass ALB health checks.

The ASG currently uses simple scaling: add 1 instance when average CPU >70% for 5 minutes, remove 1 instance when CPU <30% for 10 minutes (cooldown 300 seconds). CPU often stays near 50–60% even when request volume is high.

Which change is the best optimization to maintain performance during sustained load while avoiding unnecessary cost during quiet periods?

Options:

  • A. Add scheduled scaling to increase desired capacity before lunch hours

  • B. Use target tracking on ALB RequestCountPerTarget with 180s warmup

  • C. Switch to step scaling on CPU with 1-minute alarms

  • D. Increase the ASG minimum capacity to peak levels

Best answer: B

Explanation: A target tracking policy using RequestCountPerTarget aligns scaling with actual incoming load on the ALB, which is what drives latency and 5xx errors in this scenario. Setting an instance warmup close to the 3-minute bootstrap/health-check time prevents the policy from overestimating capacity while new instances are still starting. This improves performance during surges and still allows scale-in when demand drops.

The core issue is that the current CPU-based simple scaling does not reflect user load (CPU can stay moderate while request volume and queueing increase), and it scales too slowly for sustained surges. A target tracking policy on the ALB metric RequestCountPerTarget scales the ASG to keep an approximate steady request rate per healthy target, which is a better proxy for saturation and latency in ALB-fronted fleets.

Configure:

  • Target tracking on ALBRequestCountPerTarget to a reasonable per-instance target.
  • instanceWarmup around 180 seconds so newly launched instances are not counted as available capacity until they can pass health checks.

Tradeoff: you must choose and occasionally tune the target value, but it reduces manual threshold management and avoids keeping peak capacity running all day.
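A sketch of that policy follows; the ResourceLabel is derived from the ALB and target group ARNs, and all names, IDs, and the target value shown here are placeholders.

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name requests-per-target \
  --policy-type TargetTrackingScaling \
  --estimated-instance-warmup 180 \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/web-alb/1234567890abcdef/targetgroup/web-tg/fedcba0987654321"
    },
    "TargetValue": 800.0
  }'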

  • Fixed peak minimum keeps performance but wastes cost during off-peak periods.
  • Scheduled scaling only helps for predictable patterns and misses unexpected spikes (for example, campaigns).
  • CPU step scaling still uses a weak signal here and can lag or oscillate without improving the true bottleneck.

Question 44

Topic: Reliability and Business Continuity

In AWS Backup, which term refers to a centralized configuration that defines backup schedules, backup windows, lifecycle (retention) settings, and the backup vault to store recovery points for assigned resources (such as EBS volumes and RDS databases)?

Options:

  • A. Recovery point

  • B. EBS snapshot lifecycle policy

  • C. Backup vault

  • D. Backup plan

Best answer: D

Explanation: In AWS Backup, a backup plan is the object that automates backups by defining schedules, windows, retention (lifecycle), and the target backup vault. Resources are then assigned to that plan so AWS Backup can create and manage their recovery points. This is distinct from the storage container (vault) and from the backups that are created (recovery points).

The core AWS Backup automation construct is the backup plan. A backup plan contains one or more backup rules that specify when backups are taken (schedule), optional start/complete windows, where backups are stored (backup vault), and how long they are retained (lifecycle/retention). When you assign resources (or select them by tags) to the plan, AWS Backup executes those rules to create recovery points for supported resources such as EBS and RDS.

A backup vault is only the storage container for the resulting recovery points, while a recovery point is the actual backup created by a rule. An EBS snapshot lifecycle policy is a native Amazon Data Lifecycle Manager (DLM) concept for EBS snapshots, not the AWS Backup plan object.
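A minimal backup plan sketch is shown below; the plan, rule, vault, schedule, retention, role, and tag values are all placeholders.

# Create the plan: schedule, windows, retention, and target vault
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "prod-daily-plan",
  "Rules": [{
    "RuleName": "daily-0300-utc",
    "TargetBackupVaultName": "prod-vault",
    "ScheduleExpression": "cron(0 3 * * ? *)",
    "StartWindowMinutes": 60,
    "CompletionWindowMinutes": 240,
    "Lifecycle": {"DeleteAfterDays": 35}
  }]
}'

# Assign resources to the plan, here by tag
aws backup create-backup-selection --backup-plan-id <plan-id> --backup-selection '{
  "SelectionName": "tagged-resources",
  "IamRoleArn": "arn:aws:iam::123456789012:role/AWSBackupDefaultServiceRole",
  "ListOfTags": [{"ConditionType": "STRINGEQUALS", "ConditionKey": "backup", "ConditionValue": "daily"}]
}'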

  • Backup vault is the container that stores recovery points; it does not define schedules.
  • Recovery point is the created backup artifact, not the policy that creates it.
  • EBS snapshot lifecycle policy automates EBS snapshots via DLM, not via AWS Backup plans for multiple resource types.

Question 45

Topic: Networking and Content Delivery

You are troubleshooting why traffic between two AWS endpoints fails and you decide to use VPC Reachability Analyzer.

Which TWO statements about VPC Reachability Analyzer are false/unsafe? (Select TWO.)

Options:

  • A. It lets you specify the traffic details (such as protocol and port) to test reachability.

  • B. It can show the hop-by-hop network path and indicate where traffic is blocked (for example, a route, security group, or network ACL).

  • C. It can confirm instance operating system firewalls and application listener configuration are allowing the traffic.

  • D. It requires VPC Flow Logs to be enabled so it can evaluate whether traffic is allowed.

  • E. It simulates reachability from VPC configuration and does not send test packets.

  • F. It can be run via API/CLI (for example, by starting a Network Insights analysis) for repeatable troubleshooting.

Correct answers: C and D

Explanation: VPC Reachability Analyzer models the network path using AWS configuration (routes, security groups, NACLs, gateways) to determine whether traffic should be able to reach a destination. It does not depend on Flow Logs and it cannot validate what is happening inside the guest OS or application. Therefore, statements claiming Flow Logs are required or that host/application settings are verified are unsafe.

Core concept: VPC Reachability Analyzer is a control-plane analysis tool that computes whether a network path exists between two endpoints based on AWS networking configuration.

It is useful for “why is traffic blocked?” investigations because it evaluates and highlights path components such as:

  • Route tables, gateways, and attachments (for example, IGW, NATGW, TGW)
  • Security groups and network ACLs
  • The specific protocol/port you define for the test

It does not generate real traffic and does not require VPC Flow Logs to operate. Also, it cannot see inside an instance to validate OS firewall rules (iptables/Windows Firewall), local service binding, or application listener configuration; those require host-level troubleshooting and/or other telemetry. Key takeaway: use Reachability Analyzer to prove or disprove network-path reachability in AWS, not to validate runtime or application-layer behavior.
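A sketch of running it from the CLI follows; the instance IDs and path ID are placeholders, and the protocol/port values are the traffic details you want to test.

# Define the path (source, destination, protocol/port), then run and read the analysis
aws ec2 create-network-insights-path \
  --source i-0aaa1111bbb22222c \
  --destination i-0ddd3333eee44444f \
  --protocol tcp \
  --destination-port 443

aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-0123456789abcdef0

aws ec2 describe-network-insights-analyses \
  --network-insights-path-id nip-0123456789abcdef0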

  • Flow Logs dependency is incorrect because Reachability Analyzer works from configuration state, not traffic logs.
  • Host/app validation is incorrect because guest OS firewalls and application listeners are outside its scope.
  • Control-plane simulation is accurate; it computes reachability without sending packets.
  • Blocked-hop identification is accurate; it can point to the path component that prevents reachability.

Question 46

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An operations team runs a web application on an Auto Scaling group behind an Application Load Balancer. They want automated remediation when the ALB HTTPCode_Target_5XX_Count metric breaches a threshold. The remediation must be initiated by a CloudWatch alarm and must invoke a Lambda function that runs the corrective steps and logs the actions.

Which TWO approaches meet these requirements? (Select TWO.)

Options:

  • A. Send the alarm to an SNS topic with a Lambda subscription

  • B. Configure the alarm to run an Auto Scaling policy that executes Lambda

  • C. Use an EventBridge rule on alarm state change to invoke Lambda

  • D. Create a CloudWatch Logs subscription filter to invoke Lambda

  • E. Attach a Lambda function ARN directly as the alarm action

  • F. Use an AWS Config rule to invoke Lambda when the metric breaches

Correct answers: A and C

Explanation: CloudWatch alarms can drive remediation by emitting alarm state changes to services that can invoke Lambda. Two common operational patterns are publishing the alarm to SNS (with a Lambda subscription) or using EventBridge to route alarm state-change events directly to a Lambda target. Both keep the trigger tied to the alarm and provide auditable invocation paths.

The core requirement is “alarm-driven” remediation that results in a Lambda invocation. CloudWatch alarms don’t invoke Lambda directly; instead, they integrate with services that can deliver the alarm event to Lambda.

Two standard ways are:

  • Configure the alarm action to publish to an SNS topic, then subscribe the Lambda function (and allow invocation with Lambda permissions).
  • Create an EventBridge rule that matches CloudWatch Alarm State Change events and set the Lambda function as the rule target.

Both approaches start from the CloudWatch alarm signal and reliably invoke Lambda with the alarm context for automated remediation and logging.
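
As a minimal sketch of the EventBridge route (the alarm, rule, and function names are hypothetical):

import json
import boto3

events = boto3.client("events")
awslambda = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:111122223333:function:remediate-alb-5xx"  # hypothetical

# Match state changes into ALARM for the specific CloudWatch alarm.
rule = events.put_rule(
    Name="alb-5xx-remediation",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"alarmName": ["alb-target-5xx"], "state": {"value": ["ALARM"]}},
    }),
)

# Allow EventBridge to invoke the remediation function, then attach it as the rule target.
awslambda.add_permission(
    FunctionName="remediate-alb-5xx",
    StatementId="allow-eventbridge-alb-5xx",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(
    Rule="alb-5xx-remediation",
    Targets=[{"Id": "remediation-fn", "Arn": FUNCTION_ARN}],
)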

  • OK Publish alarm to SNS with a Lambda subscription: alarm action triggers SNS, SNS invokes Lambda.
  • OK EventBridge rule on alarm state change: EventBridge routes the alarm state-change event to Lambda.
  • NO CloudWatch Logs subscription filter: triggers from log ingestion/patterns, not from a CloudWatch alarm.
  • NO AWS Config rule: evaluates configuration compliance, not CloudWatch metric alarm state.
  • NO Lambda ARN as alarm action: CloudWatch alarms don’t support Lambda as a direct action target.
  • NO Auto Scaling policy executes Lambda: scaling policies adjust capacity; they don’t run Lambda code.

Question 47

Topic: Reliability and Business Continuity

You configured Amazon Route 53 failover routing with two DNS records for app.example.com: a primary record for a Regional ALB and a secondary record for a different Regional ALB. To validate high availability, you plan to simulate a failure of the primary endpoint.

Which condition causes Route 53 to start answering DNS queries with the secondary record?

Options:

  • A. The health check associated with the primary record reports the endpoint as unhealthy

  • B. A CloudWatch alarm for the primary ALB enters the ALARM state

  • C. The DNS resolver cache for clients expires and forces a new lookup

  • D. The health check associated with the secondary record reports the endpoint as healthy

Best answer: A

Explanation: Route 53 failover routing is driven by the health status of the primary record. When Route 53 evaluates the primary as unhealthy (by an associated health check or alias target health evaluation), Route 53 responds with the secondary record. This is the key behavior you validate when simulating an outage.

In Route 53 failover routing, DNS responses depend on whether the primary record is considered healthy. Route 53 will answer with the primary record while it is healthy, and will answer with the secondary record only after the primary is evaluated as unhealthy.

To validate HA behavior operationally:

  • Ensure the primary record is associated with a Route 53 health check (or an alias that uses target health evaluation).
  • Simulate failure of the primary endpoint.
  • Verify that queries begin returning the secondary record (noting that recursive resolvers may cache answers based on TTL).

CloudWatch alarms can help detect issues, but they do not directly control Route 53 failover decisions unless integrated into a separate automation.
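
A boto3 sketch of the primary failover record using alias target health evaluation (hosted zone IDs and the ALB DNS name are placeholders):

import boto3

route53 = boto3.client("route53")

# PRIMARY failover record for app.example.com, aliased to the primary Regional ALB.
# With EvaluateTargetHealth=True, Route 53 fails over when the alias target is unhealthy.
route53.change_resource_record_sets(
    HostedZoneId="Z0PUBLICZONEID",  # placeholder public hosted zone ID
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": "primary",
            "Failover": "PRIMARY",
            "AliasTarget": {
                "HostedZoneId": "Z0ALBZONEID",  # placeholder ALB hosted zone ID
                "DNSName": "primary-alb-123.us-east-1.elb.amazonaws.com",
                "EvaluateTargetHealth": True,
            },
        },
    }]},
)

A matching record with Failover set to SECONDARY points at the other Regional ALB.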

  • Secondary health isn’t the trigger because failover is initiated by the primary being unhealthy, not by the secondary being healthy.
  • CloudWatch alarms don’t drive Route 53 because Route 53 uses its own health check/target health evaluation for failover decisions.
  • TTL affects when clients see it because caching influences propagation timing, but it is not the condition that initiates failover.

Question 48

Topic: Security and Compliance

When using AWS Security Hub to aggregate and triage findings and route them into operational workflows, which THREE statements are INCORRECT? (Select THREE.)

Options:

  • A. Security Hub has no native cross-Region finding aggregation capability.

  • B. Findings can be routed via EventBridge to automation or ticketing.

  • C. An Organizations delegated admin can aggregate member-account findings.

  • D. Security Hub quarantines resources automatically when findings appear.

  • E. Only AWS services can send findings to Security Hub.

  • F. Insights help group and prioritize findings for triage.

Correct answers: A, D and E

Explanation: Security Hub is a finding aggregation and triage service, not an auto-remediation tool. It centralizes and prioritizes findings (including across accounts and, where supported, across Regions) and integrates with EventBridge so operations teams can route specific findings into tickets, notifications, or automated runbooks.

The core workflow is: Security Hub aggregates findings (AWS services, partners, and custom findings in ASFF), helps you triage them (for example, with Insights and severity/workflow status), and then you route the findings into operational systems.

What Security Hub does and does not do in this workflow:

  • Aggregation: can centralize findings across accounts using an AWS Organizations delegated administrator, and can also aggregate across Regions using Security Hub cross-Region finding aggregation (where supported).
  • Triage: use filtering and Insights to group findings (by severity, resource type, account, product, etc.) for operational prioritization.
  • Routing/remediation: publish findings to Amazon EventBridge and then invoke your workflow (SNS/ticketing/Lambda/Systems Manager Automation); remediation is performed by that downstream automation or humans, not by Security Hub itself.

Key takeaway: Security Hub orchestrates visibility and routing; remediation requires integrations.
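
A small boto3 sketch of that triage-and-route step (the filters and workflow transition are illustrative):

import boto3

securityhub = boto3.client("securityhub")

# Pull new, critical-severity findings for triage.
findings = securityhub.get_findings(
    Filters={
        "SeverityLabel": [{"Value": "CRITICAL", "Comparison": "EQUALS"}],
        "WorkflowStatus": [{"Value": "NEW", "Comparison": "EQUALS"}],
    },
    MaxResults=10,
)

# After routing a finding into the ticketing workflow, record that in Security Hub.
for finding in findings["Findings"]:
    securityhub.batch_update_findings(
        FindingIdentifiers=[{"Id": finding["Id"], "ProductArn": finding["ProductArn"]}],
        Workflow={"Status": "NOTIFIED"},
    )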

  • Auto-quarantine assumption is wrong because Security Hub does not change resources; it only records and forwards findings.
  • No cross-Region aggregation is wrong because Security Hub provides a built-in cross-Region aggregation feature (availability depends on Region support).
  • AWS-only ingestion is wrong because Security Hub accepts partner and custom findings in ASFF, not just AWS-native sources.

Question 49

Topic: Security and Compliance

Select THREE statements that are true about storing application secrets in AWS Secrets Manager and retrieving them securely from workloads using IAM roles.

Options:

  • A. A Secrets Manager resource policy can allow an IAM role in another AWS account to call GetSecretValue.

  • B. Store the secret in EC2 user data and read it from the instance metadata service for best security.

  • C. Reading a secret always requires the workload IAM role to have kms:Decrypt permission.

  • D. Grant the workload’s IAM role secretsmanager:GetSecretValue on the secret ARN to retrieve it via SDK/CLI.

  • E. If a secret uses the default AWS managed key aws/secretsmanager, GetSecretValue is typically sufficient without explicit KMS permissions.

  • F. To retrieve a secret, the workload must use long-term IAM user access keys.

Correct answers: A, D and E

Explanation: Secure retrieval from Secrets Manager is primarily done by granting least-privilege permissions to the workload’s IAM role and letting the AWS SDK use role credentials automatically. Cross-account access can be granted with a secret resource policy. KMS permissions depend on whether a customer managed KMS key is used versus the default AWS managed key.

The core pattern is: keep secrets in Secrets Manager, and let workloads fetch them at runtime using temporary credentials from an attached IAM role (instance profile, task role, or Lambda execution role). Scope access with IAM to specific secret ARNs, and use a resource policy when you need cross-account access.

Statement check (OK/NO):

  • OK Role has GetSecretValue on the secret ARN; standard least-privilege retrieval using role credentials.
  • OK Secret resource policy grants a role in another account; enables cross-account access to the secret.
  • OK Default aws/secretsmanager key needs no explicit KMS permissions; KMS permissions are not typically added for AWS managed keys.
  • NO Must use IAM user access keys; use IAM roles with temporary credentials instead.
  • NO Always need kms:Decrypt; needed only when the secret uses a customer managed KMS key and the key policy allows it.
  • NO Put the secret in user data/IMDS; user data is not a secure secret store, so retrieve from Secrets Manager at runtime.

Key takeaway: use IAM roles + scoped GetSecretValue, and add KMS permissions only when a customer managed key is involved.
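
A minimal retrieval sketch using the role credentials that the SDK resolves automatically (the secret name is hypothetical):

import boto3

# No access keys in code: the SDK picks up temporary credentials from the
# attached IAM role (instance profile, task role, or Lambda execution role).
secrets = boto3.client("secretsmanager")

response = secrets.get_secret_value(SecretId="prod/app/db-credentials")  # hypothetical secret name
secret_string = response["SecretString"]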

  • Long-term keys required is incorrect because workloads should use IAM roles with temporary credentials.
  • Always add kms:Decrypt is incorrect because it depends on the KMS key type used to encrypt the secret.
  • User data as a secret store is insecure and bypasses Secrets Manager access controls and auditing.
  • Cross-account via resource policy is valid when you need to share a secret with a role in another account.

Question 50

Topic: Networking and Content Delivery

When troubleshooting AWS Site-to-Site VPN hybrid connectivity issues at a high level (tunnel state, routes, DNS), which THREE statements are true?

Options:

  • A. If VPC instances cannot resolve on-prem hostnames, verify DNS forwarding/reachability (for example, Route 53 Resolver outbound endpoint and rules to on-prem DNS).

  • B. A tunnel showing UP does not guarantee reachability; the VPC/TGW route table must route the on-prem CIDR to the VPN attachment.

  • C. Security group rules determine whether the VPN tunnel can establish (UP/DOWN).

  • D. If either VPN tunnel is DOWN, no traffic can pass until both tunnels are UP.

  • E. A Site-to-Site VPN connection provides two IPsec tunnels for redundancy.

  • F. Creating a VPN connection automatically adds on-prem routes to every VPC route table in the VPC.

Correct answers: A, B and E

Explanation: VPN troubleshooting usually separates into three checks: tunnel health, routing, and DNS. A tunnel can be UP while traffic still fails due to missing/incorrect route table entries to the VGW/TGW attachment. DNS issues are often independent of the tunnel state and require correct DNS forwarding or reachable resolvers over the VPN.

For VPN-based hybrid issues, validate the control plane and the data plane separately: tunnel state indicates whether IPsec/IKE is established, but routing and DNS still must be correct for end-to-end connectivity.

  • OK: Two IPsec tunnels exist per VPN connection for redundancy; one tunnel can carry traffic if routing/failover is configured.
  • OK: If tunnels are UP but you cannot reach on-prem CIDRs, check the relevant VPC/TGW route table entries and route propagation/associations.
  • OK: If IP connectivity works but names fail, verify DNS query paths (Route 53 Resolver outbound endpoint/rules or direct reachability to on-prem DNS).
  • NO: VPN creation does not automatically insert on-prem routes into every VPC route table; routing must be configured per route table.
  • NO: Security groups do not control IPsec tunnel establishment; they apply to instance/ENI traffic after routing.
  • NO: A single DOWN tunnel does not inherently stop traffic if the other tunnel is UP and selected by routing.

Key takeaway: tunnel UP is necessary but not sufficient—routes and DNS commonly cause the remaining failures.
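
A boto3 sketch of the tunnel-state check in that flow (the VPN connection ID is hypothetical):

import boto3

ec2 = boto3.client("ec2")

# Each Site-to-Site VPN connection reports telemetry for both tunnels.
vpn = ec2.describe_vpn_connections(VpnConnectionIds=["vpn-0123456789abcdef0"])
for tunnel in vpn["VpnConnections"][0]["VgwTelemetry"]:
    print(tunnel["OutsideIpAddress"], tunnel["Status"], tunnel["StatusMessage"])

If both tunnels report UP but traffic still fails, move on to the route-table and DNS checks described above.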

  • Automatic routing everywhere is incorrect because route tables must be explicitly updated/associated (or use propagation where applicable).
  • Security groups affect tunnels is incorrect because tunnel establishment is between gateways, not instance ENIs.
  • Both tunnels required is incorrect because the second tunnel provides redundancy rather than a hard dependency.

Questions 51-65

Question 51

Topic: Deployment, Provisioning, and Automation

An operations team uses EC2 Image Builder to produce “golden” AMIs and track releases across environments. Select TWO statements that are true about creating/managing AMIs with Image Builder pipelines and versioning.

Options:

  • A. If you update an image recipe (or its components), previously created AMIs are automatically updated to match the latest recipe state.

  • B. Rebuilding the same image recipe version overwrites the existing AMI ID so downstream launch templates do not need updating.

  • C. EC2 Image Builder pipelines require AWS CodePipeline to schedule and run image builds.

  • D. Image Builder stores the created AMI only inside the Image Builder service; it does not appear as an EC2 AMI in the account.

  • E. An Image Builder pipeline can distribute the resulting AMI to additional AWS Regions and/or share it with other AWS accounts by using distribution settings.

  • F. Image Builder image recipes are versioned (for example, 1.2.3), and a new release is represented by creating a new recipe version and building a new AMI.

Correct answers: E and F

Explanation: EC2 Image Builder uses versioned image recipes to represent immutable releases, and each build produces a new AMI. To promote the same release to more places, you use Image Builder distribution settings to copy/share the resulting AMI across Regions and accounts.

The core ideas are (1) AMIs are immutable artifacts and (2) Image Builder tracks “release” intent through versioned resources. In practice, you create an image recipe with an explicit semantic version (for example, 1.2.3) and run a pipeline build; the output is a new AMI (new AMI ID). When you need a new release, you create a new recipe version and build again.

Option check (OK/NO):

  • OK recipe versions represent releases; new version → new AMI.
  • OK distribution settings can copy/share AMIs to other Regions/accounts.
  • NO existing AMIs are not modified when recipes/components change.
  • NO rebuilds don’t overwrite an AMI ID; they create another AMI.
  • NO the AMI is an EC2 AMI in your account after build.
  • NO Image Builder pipelines can run without CodePipeline.

Key takeaway: treat each pipeline build as producing a new, versioned AMI artifact and use distribution to replicate/share it.

  • Mutable AMI misconception fails because AMIs are immutable; changes require building a new AMI.
  • Overwrite assumption fails because each build results in a new AMI ID.
  • Service location confusion fails because the artifact is a normal EC2 AMI in the account.
  • Unnecessary dependency fails because Image Builder can schedule/run builds without CodePipeline.

Question 52

Topic: Deployment, Provisioning, and Automation

A networking account shared two private subnets from a shared VPC to an application account by using AWS Resource Access Manager (AWS RAM). An EC2 Auto Scaling CloudFormation stack in the application account fails.

Exhibit: CloudFormation event

Resource: LaunchTemplate
Status reason: InvalidSubnetID.NotFound: The subnet ID 'subnet-0a12bc34d5e6f7890' does not exist

In the networking account, the AWS RAM resource share shows the application account principal with status Pending acceptance. What should a CloudOps engineer do to resolve the issue with the least change?

Options:

  • A. Add an IAM policy in the application account to allow ec2:DescribeSubnets

  • B. Update the shared subnet network ACL to allow ephemeral ports from the application account

  • C. Add routes in the shared VPC route tables to the application VPC CIDR range

  • D. Accept the AWS RAM resource share invitation in the application account

Best answer: D

Explanation: The stack fails because the application account cannot resolve the referenced subnet ID, which happens when the shared subnets have not been accepted through AWS RAM. The networking account shows the share as Pending acceptance, confirming the share is not active for the recipient. Accepting the resource share makes the subnets available to use without changing networking configuration.

Symptom: CloudFormation reports InvalidSubnetID.NotFound for a subnet that exists in the networking account.

Root cause: The subnets were shared with AWS RAM, but the recipient (application) account has not accepted the resource share, so the subnets are not available in that account and API calls treat the subnet ID as nonexistent.

Fix: In the application account, accept the pending AWS RAM resource share (console or aws ram accept-resource-share-invitation). Once accepted, the shared subnets appear as usable “shared subnets,” and the Auto Scaling stack can reference them successfully.
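
A boto3 sketch of the same fix run in the application account (the invitation is discovered rather than hardcoded):

import boto3

ram = boto3.client("ram")

# Find the pending invitation in the recipient (application) account and accept it.
invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
for invitation in invitations:
    if invitation["status"] == "PENDING":
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation["resourceShareInvitationArn"]
        )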

Security groups, NACLs, and route tables affect traffic after resources are created; they do not cause a shared subnet to be “NotFound.”

  • Routing change doesn’t apply because the failure happens before any instance is created, at subnet lookup time.
  • NACL change affects data-plane connectivity, not whether the subnet ID exists/appears in the account.
  • Describe permission would produce AccessDenied, not a subnet NotFound for an unaccepted share.

Question 53

Topic: Security and Compliance

An operations team uses AWS Systems Manager Session Manager to access EC2 instances. You must update an IAM policy to enforce least privilege with these requirements:

  • Allow ssm:StartSession only to instances tagged Environment=prod.
  • Require MFA.
  • Allow access only from the corporate VPN egress ranges 203.0.113.64/27 and 203.0.113.96/27.
  • The policy must use one CIDR in aws:SourceIp.

For this question, use these IPv4 sizes: /27 = 32 addresses, /26 = 64 addresses, /25 = 128 addresses. Choose the policy condition set that meets the requirements while allowing the fewest source IP addresses.

Options:

  • A. SourceIp 203.0.113.64/26, MFA true, tag Environment=prod

  • B. SourceIp 203.0.113.64/26, tag Environment=prod only

  • C. SourceIp 203.0.113.64/27, MFA true, tag Environment=prod

  • D. SourceIp 203.0.113.64/25, MFA true, tag Environment=prod

Best answer: A

Explanation: To restrict access by source network using a single CIDR, you must choose the smallest block that still includes both provided VPN egress CIDRs. Each /27 contains 32 addresses, so the two contiguous /27 ranges require 64 total addresses, which corresponds to a /26. Adding aws:MultiFactorAuthPresent and the instance tag condition enforces least privilege.

This is an IAM least-privilege problem using policy conditions: resource tags to scope which instances can be targeted, aws:MultiFactorAuthPresent to require MFA, and aws:SourceIp to restrict where requests can originate.

Calculation (in IPv4 addresses):

  • Each VPN egress range is a /27, which is 32 addresses.
  • Total addresses needed: 32 + 32 = 64 addresses.
  • A single CIDR that allows exactly 64 addresses is a /26, and because the two /27 blocks are contiguous (.64/27 and .96/27), they summarize to 203.0.113.64/26.

Using a /25 would unnecessarily allow 128 addresses, and using only one /27 would block half of the VPN egress space.
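
A sketch of the resulting condition block as a policy document (the structure follows the standard tag, MFA, and source IP condition keys; account-specific values are placeholders):

import json

# One summarized CIDR, MFA required, prod-tagged instances only.
statement = {
    "Effect": "Allow",
    "Action": "ssm:StartSession",
    "Resource": "arn:aws:ec2:*:*:instance/*",
    "Condition": {
        "StringEquals": {"ssm:resourceTag/Environment": "prod"},
        "Bool": {"aws:MultiFactorAuthPresent": "true"},
        "IpAddress": {"aws:SourceIp": "203.0.113.64/26"},
    },
}
policy_document = json.dumps({"Version": "2012-10-17", "Statement": [statement]})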

  • Over-broad CIDR uses /25, which allows 128 addresses instead of the required 64.
  • Under-scoped CIDR uses only one /27, which excludes the other VPN egress range.
  • Missing MFA enforcement omits aws:MultiFactorAuthPresent, so access would not require MFA.

Question 54

Topic: Security and Compliance

A company runs a public web application in us-west-2 behind an Application Load Balancer (ALB). Static assets are served through an Amazon CloudFront distribution using the same domain (example.com). The company also exposes a Regional Amazon API Gateway endpoint for api.example.com in us-west-2.

You must enable encryption in transit using ACM-managed TLS certificates for these public endpoints. Which actions should you take? (Select THREE.)

Options:

  • A. Use an ACM Private CA certificate for the public CloudFront domain name

  • B. Upload a self-signed certificate to ACM for production internet-facing endpoints

  • C. Attach an ACM certificate from us-west-2 directly to the CloudFront distribution

  • D. Request a public ACM certificate in us-west-2 and attach it to an ALB HTTPS listener

  • E. Request a public ACM certificate in us-east-1 and attach it to the CloudFront distribution

  • F. Request a public ACM certificate in us-west-2 and create an API Gateway Regional custom domain name

Correct answers: D, E and F

Explanation: Use ACM public certificates to terminate TLS at each edge/front-door service. ALB and Regional API Gateway require the certificate to exist in the same Region as the endpoint. CloudFront is the exception: it requires the ACM certificate in us-east-1 for custom domain names.

The core concept is where ACM certificates must live to be attached to different AWS front-door endpoints.

  • OK Request a public ACM certificate in us-west-2 and attach it to an ALB HTTPS listener (ALB is regional).
  • OK Request a public ACM certificate in us-east-1 and attach it to the CloudFront distribution (CloudFront uses us-east-1 for ACM).
  • OK Request a public ACM certificate in us-west-2 and create an API Gateway Regional custom domain name (Regional API is regional).
  • NO Attach an ACM certificate from us-west-2 directly to the CloudFront distribution (wrong Region for CloudFront).
  • NO Use an ACM Private CA certificate for the public CloudFront domain name (browsers won’t trust a private CA by default).
  • NO Upload a self-signed certificate to ACM for production internet-facing endpoints (clients won’t trust it).

Key takeaway: pick ACM public certs and place them in the correct Region for the specific endpoint type.
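
A short boto3 sketch of requesting the certificates in the correct Regions (domains are from the scenario):

import boto3

# CloudFront custom domain: the ACM certificate must live in us-east-1.
acm_us_east_1 = boto3.client("acm", region_name="us-east-1")
acm_us_east_1.request_certificate(DomainName="example.com", ValidationMethod="DNS")

# ALB HTTPS listener and Regional API Gateway custom domain: same Region as the endpoint.
acm_us_west_2 = boto3.client("acm", region_name="us-west-2")
acm_us_west_2.request_certificate(DomainName="example.com", ValidationMethod="DNS")
acm_us_west_2.request_certificate(DomainName="api.example.com", ValidationMethod="DNS")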

  • CloudFront Region: CloudFront can only use ACM certificates in us-east-1 for custom domains.

  • Private CA trust: Private CA certificates are for private PKI, not public internet trust chains.

  • Self-signed certs: Self-signed certificates cause client trust warnings and don’t meet typical public TLS requirements.

  • Regional endpoints: ALB and Regional API Gateway require same-Region ACM certificates.

Question 55

Topic: Networking and Content Delivery

A company uses split-horizon DNS for corp.example.com (public hosted zone for internet clients, private hosted zone for VPC clients). Operations created two separate Route 53 private hosted zones named corp.example.com: one in Account A associated with VPC-A, and another in Account B associated with VPC-B. Teams manually duplicate records between the zones, and the records often drift during incidents.

The company wants to reduce operational effort and improve reliability while keeping split-horizon behavior. VPC-A and VPC-B must both resolve the same private records.

Which change is the best optimization?

Options:

  • A. Keep a single private hosted zone and associate it with both VPC-A and VPC-B (use cross-account VPC association authorization if needed)

  • B. Create Route 53 Resolver outbound endpoints and conditional forwarding rules between the VPCs

  • C. Replace the private hosted zones with a single public hosted zone and restrict access with security groups

  • D. Create additional private hosted zones per application and delegate subdomains to each VPC

Best answer: A

Explanation: Associating one Route 53 private hosted zone with multiple VPCs lets both VPCs resolve the same internal records, which removes manual record replication and reduces drift-related outages. This keeps split-horizon DNS intact because VPCs associated with the private hosted zone will prefer the private answers for that domain. The main tradeoff is that DNS changes are now centralized and affect all associated VPCs.

The core optimization is to use one Route 53 private hosted zone (PHZ) for the internal view of corp.example.com and associate that PHZ with every VPC that should use the private answers. With split-horizon, you can have both a public hosted zone and a private hosted zone with the same domain name; queries that originate from an associated VPC resolve using the PHZ, while internet clients use the public hosted zone.

Operationally, this avoids maintaining duplicate PHZs with identical records (a common source of drift and incidents). To do this safely:

  • Ensure each VPC has DNS resolution enabled (enableDnsSupport and typically enableDnsHostnames).
  • If VPCs are in different accounts, create VPC association authorization in the hosted-zone owner account, then associate the VPC from the other account.
  • After validation, remove the duplicate PHZ to eliminate conflicting sources of truth.

Using Resolver forwarding is for integrating with other DNS systems, not for sharing a single Route 53 PHZ across VPCs.
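
A boto3 sketch of the cross-account association steps (zone ID, VPC ID, and the Account B profile are placeholders):

import boto3

# Account A (hosted-zone owner): authorize the association for VPC-B.
route53_owner = boto3.client("route53")
route53_owner.create_vpc_association_authorization(
    HostedZoneId="Z0PRIVATEZONE",                             # placeholder PHZ ID
    VPC={"VPCRegion": "us-west-2", "VPCId": "vpc-0b1b2b3b"},  # placeholder VPC-B
)

# Account B: associate VPC-B with the shared private hosted zone.
route53_b = boto3.Session(profile_name="account-b").client("route53")  # hypothetical profile
route53_b.associate_vpc_with_hosted_zone(
    HostedZoneId="Z0PRIVATEZONE",
    VPC={"VPCRegion": "us-west-2", "VPCId": "vpc-0b1b2b3b"},
)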

  • More zones to manage increases operational overhead and still risks inconsistent records.
  • Public zone for internal names breaks split-horizon and security; security groups do not control public DNS visibility.
  • Resolver forwarding adds cost and components and doesn’t eliminate duplicate Route 53 zone management in this scenario.

Question 56

Topic: Security and Compliance

In Amazon GuardDuty, what does the finding type UnauthorizedAccess:EC2/SSHBruteForce most specifically indicate, and what is the most appropriate high-level initial response?

Options:

  • A. A large-scale network port scan is occurring against the VPC; enable VPC Flow Logs to stop it

  • B. An IAM user successfully logged in from an anomalous location; rotate access keys and enable MFA

  • C. Repeated SSH login attempts against an internet-exposed instance; restrict port 22 and investigate

  • D. The instance is confirmed compromised and communicating with a command-and-control server; isolate it

Best answer: C

Explanation: UnauthorizedAccess:EC2/SSHBruteForce means GuardDuty detected behavior consistent with repeated SSH authentication attempts against an EC2 instance (typically reachable from the internet). The appropriate initial response is to limit inbound SSH exposure (for example, security group narrowing) and then review logs and access attempts to assess impact.

GuardDuty finding types are named to describe the suspected threat and the AWS resource involved. UnauthorizedAccess:EC2/SSHBruteForce indicates GuardDuty observed SSH brute-force patterns (many failed login attempts or rapid repeated attempts) directed at an EC2 instance.

Operationally, the first actions are containment and validation:

  • Reduce attack surface by restricting inbound TCP/22 (security group or NACL) to trusted IPs, or temporarily remove exposure.
  • Review instance authentication logs and related telemetry (for example, /var/log/secure or /var/log/auth.log, VPC Flow Logs) to determine whether any access succeeded.

This differs from malware/C2-style findings, which are driven by suspicious outbound communications rather than inbound login attempts.

  • C2/malware confusion describes a different class of GuardDuty findings (trojan/backdoor) and assumes confirmed compromise.
  • IAM anomaly confusion applies to findings about unusual IAM activity, not EC2 SSH authentication attempts.
  • Flow Logs misconception can help investigate traffic, but enabling logs does not “stop” scanning and is not the primary containment step.

Question 57

Topic: Reliability and Business Continuity

Select THREE statements that are true about these disaster recovery (DR) strategies and their typical cost/RTO trade-offs: backup and restore, pilot light, and warm standby.

Options:

  • A. Warm standby has the lowest cost because resources stay stopped.

  • B. Warm standby runs a scaled-down full environment, enabling low RTO.

  • C. Backup and restore is usually lowest cost, with the highest RTO.

  • D. Pilot light generally achieves a lower RTO than warm standby.

  • E. Pilot light keeps minimal core components running and scales up in DR.

  • F. Backup and restore requires a fully running duplicate environment.

Correct answers: B, C and E

Explanation: Backup and restore minimizes ongoing spend by relying on backups and rebuild steps, but it usually has the longest recovery time. Pilot light reduces recovery time by keeping only critical components running and scaling out during an event. Warm standby keeps a smaller but complete environment running, trading higher cost for a typically lower RTO.

The core distinction among these DR strategies is how much of the workload is already running before a disaster, which drives both ongoing cost and achievable recovery time.

  • Backup and restore: store backups and rebuild the environment during DR; lowest steady-state cost, typically the longest RTO.
  • Pilot light: keep only essential components running (for example, databases, minimal services) and scale out the rest during DR; moderate cost and RTO.
  • Warm standby: keep a fully functional but smaller environment running and scale up when needed; higher steady-state cost with typically the lowest RTO of the three.

A common operational rule is: more pre-provisioned/running capacity costs more, but restores service faster.

  • OK Backup and restore is cheapest steady-state but typically has the longest RTO.
  • OK Pilot light runs only critical components and scales out during DR.
  • OK Warm standby runs a complete scaled-down environment for faster recovery.
  • NO Claims that warm standby is cheapest, that pilot light beats warm standby on RTO, or that backup/restore needs a running duplicate environment reverse the basic cost/RTO trade-offs.

Question 58

Topic: Networking and Content Delivery

An application EC2 instance (10.0.1.25) in subnet subnet-app (10.0.1.0/24) cannot connect to an Amazon RDS for MySQL DB instance (10.0.2.80) in subnet subnet-db (10.0.2.0/24). The connection to port 3306 times out.

You run VPC Reachability Analyzer from the EC2 instance ENI to the RDS instance ENI. The analysis result shows Not reachable and indicates the packet is blocked by an inbound rule on the network ACL associated with subnet-db.

Exhibit: subnet-db network ACL inbound rules (in order)

100  ALLOW  TCP 443   Source 10.0.1.0/24
110  DENY   TCP 3306  Source 10.0.1.0/24
*    ALLOW  ALL ALL   Source 0.0.0.0/0

Which action will fix the root cause with the least change?

Options:

  • A. Add a route in the subnet-app route table for 10.0.2.0/24 that targets a NAT gateway

  • B. Enable VPC Flow Logs on both subnets to identify the rejected traffic, then retry

  • C. Add an inbound rule on the RDS security group to allow TCP 3306 from 10.0.1.25/32

  • D. Remove the DENY rule and add an ALLOW for TCP 3306 from 10.0.1.0/24

Best answer: D

Explanation: The Reachability Analyzer result points to an inbound network ACL rule on the DB subnet as the point of failure. Because network ACLs are stateless and evaluated in rule order, an explicit DENY for TCP 3306 will block the MySQL connection even if security groups are correct. Removing the DENY and allowing TCP 3306 from the app subnet is the minimal fix.

Symptom: the app instance times out connecting to RDS on TCP 3306.

Root cause: VPC Reachability Analyzer shows the path is blocked at the subnet-db network ACL, and the NACL has an explicit inbound DENY TCP 3306 from 10.0.1.0/24 that is evaluated before the later catch-all allow.

Fix: change the subnet-db NACL to permit inbound TCP 3306 from the app subnet (and remove/override the denying rule so the allow is evaluated first).

The key takeaway is that Reachability Analyzer can pinpoint whether routing, security groups, or NACLs stop the flow; here it is a deterministic NACL rule-order issue.
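
A boto3 sketch of the minimal NACL change (the NACL ID is hypothetical):

import boto3

ec2 = boto3.client("ec2")

# Replace the blocking inbound rule 110 with an ALLOW for MySQL from the app subnet.
# Protocol "6" is TCP; because NACLs are stateless, outbound rules must still allow the return traffic on ephemeral ports.
ec2.replace_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # hypothetical subnet-db NACL ID
    RuleNumber=110,
    Protocol="6",
    RuleAction="allow",
    Egress=False,
    CidrBlock="10.0.1.0/24",
    PortRange={"From": 3306, "To": 3306},
)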

  • Security group change doesn’t help because the analysis already identifies the NACL as the blocking component.
  • NAT gateway route is incorrect because traffic between subnets in the same VPC uses the local route, not NAT.
  • Flow Logs first can help with evidence gathering, but it does not remediate the blocked path.

Question 59

Topic: Security and Compliance

A company uses AWS Organizations with separate application accounts in a “Prod” OU. Account administrators occasionally disable AWS CloudTrail or GuardDuty during incidents and forget to re-enable them. The security team currently runs daily scripts in each account to detect and remediate these changes, which creates operational overhead.

The security team must enforce an organization-wide guardrail that prevents disabling CloudTrail and GuardDuty in all Prod accounts, without granting any new permissions. A break-glass role in the security account must still be able to perform these actions when required.

Which change is the best optimization?

Options:

  • A. Deploy AWS Config conformance packs to detect and auto-remediate violations

  • B. Update IAM policies in the management account to restrict these API actions

  • C. Attach a deny SCP to the Prod OU with a break-glass exception

  • D. Create an IAM permission boundary policy and require all roles to use it

Best answer: C

Explanation: Using a Service Control Policy (SCP) to explicitly deny the CloudTrail and GuardDuty disable actions at the Prod OU enforces a preventive guardrail across all member accounts. The break-glass role can be excluded by condition so emergency operations still work. This reduces ongoing per-account scripting and remediation effort, and SCPs do not grant permissions.

SCPs in AWS Organizations are the right tool for preventive, multi-account guardrails because they set the maximum permissions that any principal in attached accounts can have. To stop administrators from disabling CloudTrail/GuardDuty, attach an SCP to the Prod OU that uses explicit Deny on the relevant actions (for example, cloudtrail:StopLogging, cloudtrail:DeleteTrail, guardduty:DeleteDetector, guardduty:UpdateDetector) and scope the deny with a condition so it does not apply to the break-glass role ARN.

This optimizes operations by eliminating ongoing per-account scripts and remediations and makes enforcement consistent. Tradeoff: an explicit deny can block legitimate admin workflows unless the exception is carefully defined and tested in a non-production OU first.
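
A sketch of the SCP statement with a break-glass exemption (the role name and action list are illustrative, not exhaustive):

import json

scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyDisablingAuditServices",
        "Effect": "Deny",
        "Action": [
            "cloudtrail:StopLogging",
            "cloudtrail:DeleteTrail",
            "guardduty:DeleteDetector",
            "guardduty:UpdateDetector",
        ],
        "Resource": "*",
        "Condition": {
            # Hypothetical break-glass role name used for emergency operations.
            "ArnNotLike": {"aws:PrincipalArn": "arn:aws:iam::*:role/security-break-glass"}
        },
    }],
}
policy_document = json.dumps(scp)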

  • Permission boundaries require changes to identities in every account and won’t cover existing principals unless enforced everywhere.
  • AWS Config remediation detects after the fact; it reduces drift but does not prevent the disable actions.
  • Management-account IAM policies do not apply to principals operating inside member accounts.

Question 60

Topic: Deployment, Provisioning, and Automation

A CloudFormation stack named web-prod in a production account must be updated to use a new AMI ID in an Auto Scaling group and to adjust an ALB security group rule. The team suspects some resources might have been changed manually in the console over time, and the change must be performed using safe, reviewable practices.

Which THREE actions should the CloudOps engineer take before applying the update?

Options:

  • A. Disable stack rollback so failed updates leave resources in place

  • B. Delete and recreate the stack to guarantee a clean deployment

  • C. Create a change set and review it before executing the update

  • D. Manually edit the Auto Scaling group and security group in the console first

  • E. Apply a stack policy to block updates to critical resources during the change

  • F. Run drift detection and resolve any drifted resources before updating

Correct answers: C, E and F

Explanation: Use CloudFormation features that make changes previewable and controlled. Reviewing a change set validates exactly what CloudFormation will modify, drift detection helps ensure the stack reflects reality before applying changes, and a stack policy can prevent accidental updates to sensitive resources during execution. Together these reduce unexpected impact in production.

Safe CloudFormation operations focus on (1) understanding the exact blast radius of a change and (2) ensuring the stack’s declared state matches the real environment. A change set provides a concrete, auditable preview of adds/modifies/deletes before you execute anything. Drift detection helps catch manual console changes or other out-of-band modifications that could cause the update to fail or produce unexpected replacements. For additional protection, a stack policy can be used to deny updates to specific critical resources while still allowing the intended changes to proceed.

Key takeaway: prefer preview + drift validation + guardrails over manual edits or risky failure-handling tweaks.
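
A boto3 sketch of the pre-update checks (the change set name and parameter are hypothetical, and template handling is simplified):

import boto3

cfn = boto3.client("cloudformation")

# 1. Detect drift and reconcile drifted resources before changing anything.
drift_id = cfn.detect_stack_drift(StackName="web-prod")["StackDriftDetectionId"]
# ...poll describe_stack_drift_detection_status(StackDriftDetectionId=drift_id) until complete...

# 2. Create a change set and review the planned changes before executing.
cfn.create_change_set(
    StackName="web-prod",
    ChangeSetName="ami-and-sg-update",  # hypothetical change set name
    UsePreviousTemplate=True,
    Parameters=[{"ParameterKey": "AmiId", "ParameterValue": "ami-0123456789abcdef0"}],  # hypothetical
)
changes = cfn.describe_change_set(ChangeSetName="ami-and-sg-update", StackName="web-prod")

# 3. A stack policy (set_stack_policy) can deny updates to critical resources during execution.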

  • OK — Change set review gives a precise preview of the update actions before execution.
  • OK — Drift detection surfaces manual/out-of-band changes that should be reconciled before updating.
  • NO — Disable rollback is not a safe default; it increases the chance of leaving broken intermediate state.
  • NO — Manual console edits create/compound drift and bypass the stack’s controlled workflow.
  • NO — Recreate the stack is disruptive and unnecessary for routine in-place updates.
  • OK — Stack policy adds a preventive control to protect critical resources from unintended updates.

Question 61

Topic: Networking and Content Delivery

A workload runs on EC2 instances in a private subnet and must initiate outbound internet connections for patching. An operator is troubleshooting NAT gateway and internet gateway (IGW) connectivity.

Which statement is INCORRECT?

Options:

  • A. Add 0.0.0.0/0 to the IGW in the private subnet route table.

  • B. Instances in a private subnet do not need a public IPv4 to use a NAT gateway.

  • C. Private subnet routes 0.0.0.0/0 to a NAT gateway.

  • D. A NAT gateway must be in a public subnet with a route to an IGW.

Best answer: A

Explanation: Private subnets get outbound internet access by sending the default route to a NAT gateway, which then sends traffic to the internet through an IGW from a public subnet. Pointing a private subnet’s default route at an IGW is not the correct fix for NAT-based egress and can unintentionally change the subnet’s exposure model. NAT-based egress does not require public IPv4 addresses on the instances.

The core troubleshooting concept is that private-subnet instances use a NAT gateway for outbound-only internet access, while an IGW provides direct internet routing for resources that are publicly addressable. For private egress, the private subnet route table should send 0.0.0.0/0 to the NAT gateway, and the NAT gateway must reside in a public subnet whose route table sends 0.0.0.0/0 to the IGW (the NAT gateway uses an Elastic IP).

If you route a private subnet directly to an IGW, the subnet becomes “public” from a routing perspective, and instances would typically also need a public IPv4/EIP (and appropriate security group/NACL rules) for internet connectivity. The key takeaway is: private subnet egress goes to NAT; only public subnets route directly to IGW.
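
A boto3 sketch of the correct private-subnet default route (the route table and NAT gateway IDs are hypothetical):

import boto3

ec2 = boto3.client("ec2")

# Private subnet route table: default route to the NAT gateway, not to the IGW.
ec2.create_route(
    RouteTableId="rtb-0private123456789",   # hypothetical private route table
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0123456789abcdef0",   # hypothetical NAT gateway in a public subnet
)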

  • IGW in private route table is incorrect because it changes the subnet to public routing and doesn’t provide NAT-style egress.
  • Private default route to NAT is correct for outbound internet from private subnets.
  • NAT in a public subnet is correct because the NAT gateway itself must reach the internet via an IGW.
  • No public IPv4 on instances is correct because the NAT gateway’s EIP is used for translation.

Question 62

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A CloudWatch alarm indicates an EC2 instance is unresponsive. The operations team must perform a restart in a repeatable and auditable way, and they are required to use an AWS-provided Systems Manager Automation runbook (not ad hoc commands or custom code). Which action meets this requirement?

Options:

  • A. Use Session Manager to connect to the instance and restart the OS

  • B. Invoke a Lambda function that calls the EC2 RebootInstances API

  • C. Use Run Command with AWS-RunShellScript to run sudo reboot

  • D. Execute the Automation document AWS-RestartEC2Instance with an AutomationAssumeRole

Best answer: D

Explanation: The deciding factor is using an AWS-provided Systems Manager Automation runbook for the restart. Running AWS-RestartEC2Instance as an Automation execution provides a standardized workflow and an execution record in Systems Manager. Using AutomationAssumeRole also keeps permissions scoped and auditable for operations use.

Systems Manager Automation is the AWS service feature designed to run operational procedures as runbooks (Automation documents), with an execution history, parameters, and controlled permissions through AutomationAssumeRole. When the requirement is to use an AWS-provided Automation runbook, you should start an Automation execution of the appropriate AWS-owned document (for example, AWS-RestartEC2Instance) instead of using interactive access or generic command execution.

A common safe operations pattern is:

  • Start an Automation execution of the AWS-owned runbook
  • Provide the target instance ID(s) as parameters
  • Provide an AutomationAssumeRole that has only the needed permissions

The key takeaway is to choose Automation (runbooks) rather than other mechanisms when you need standardized, auditable remediation.
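
A boto3 sketch of starting that execution (the instance ID and role ARN are placeholders):

import boto3

ssm = boto3.client("ssm")

# Run the AWS-owned Automation runbook; the execution is recorded in Systems Manager.
execution = ssm.start_automation_execution(
    DocumentName="AWS-RestartEC2Instance",
    Parameters={
        "InstanceId": ["i-0123456789abcdef0"],                                      # placeholder
        "AutomationAssumeRole": ["arn:aws:iam::111122223333:role/AutomationRole"],  # placeholder
    },
)
print(execution["AutomationExecutionId"])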

  • Run Command vs Automation: AWS-RunShellScript is a command document, not an Automation runbook.
  • Interactive access: Session Manager is an interactive session, not a runbook execution.
  • Custom code path: A Lambda function can reboot an instance but does not satisfy the requirement to use an AWS-provided Automation runbook.

Question 63

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An operations team configured a CloudWatch alarm to notify the on-call engineer by publishing to an Amazon SNS topic. No emails are being received.

Exhibit: CloudTrail event (partial)

EventName: Subscribe
RequestParameters:
  topicArn: arn:aws:sns:us-east-1:111122223333:prod-incidents
  protocol: email
  endpoint: oncall@example.com
ResponseElements:
  subscriptionArn: pending confirmation

Based on the exhibit, what is the best next step to restore the incident notification workflow?

Options:

  • A. Confirm the SNS email subscription for oncall@example.com

  • B. Create an AWS User Notifications email delivery channel

  • C. Add an SNS topic policy to allow CloudWatch to publish

  • D. Enable CloudWatch alarm actions on the alarm

Best answer: A

Explanation: The SNS email endpoint is subscribed but not yet confirmed. CloudTrail shows the subscription ARN as “pending confirmation,” which prevents email delivery to that address. Confirming the subscription completes the notification path so alarms published to the topic can reach the on-call engineer.

This is an SNS subscription state issue. In the exhibit, the CloudTrail Subscribe event response shows subscriptionArn: pending confirmation, which indicates the email endpoint has not confirmed the subscription. SNS will not deliver notifications to an email endpoint until the recipient confirms the subscription (typically by clicking the confirmation link in the email, or by confirming with the token).

Key takeaway: when notifications aren’t arriving, verify the SNS subscription is confirmed before changing alarms or introducing new notification services.
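
A quick boto3 check for unconfirmed subscriptions on the topic from the exhibit:

import boto3

sns = boto3.client("sns")

# Unconfirmed email subscriptions show "PendingConfirmation" instead of a subscription ARN.
subs = sns.list_subscriptions_by_topic(
    TopicArn="arn:aws:sns:us-east-1:111122223333:prod-incidents"
)
for sub in subs["Subscriptions"]:
    print(sub["Endpoint"], sub["Protocol"], sub["SubscriptionArn"])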

  • Enable alarm actions doesn’t help when the email subscription is still pending confirmation.
  • Add SNS publish permissions is not supported by the evidence; the exhibit points to subscription confirmation, not an authorization failure.
  • Use AWS User Notifications is optional here; the existing SNS workflow can work once the subscription is confirmed.

Question 64

Topic: Deployment, Provisioning, and Automation

An operator runs an AWS Systems Manager Automation document that patches a fleet of EC2 instances. Several executions fail. The operator uses the Automation execution history to identify the failed step, reviews the step output and associated CloudWatch Logs to confirm an AccessDenied error, then uses IAM diagnostics (for example, the IAM policy simulator) to determine the exact missing IAM actions. The operator updates the Automation assume role policy by adding only the required actions and resource scope for that step.

Which operational principle does this action best demonstrate?

Options:

  • A. Automation and standardization

  • B. Shared responsibility model

  • C. Blast-radius reduction

  • D. Least privilege

Best answer: D

Explanation: The operator uses Automation execution history, logs, and IAM diagnostics to pinpoint missing permissions and then grants only what is required for the failing step. Tightening IAM policies to the minimum needed permissions is the core practice of the least privilege principle, improving security while restoring the automation’s functionality.

This scenario is about troubleshooting an Automation failure and correcting an authorization issue without over-permissioning. Using Automation execution history and step output identifies exactly where the run failed, CloudWatch Logs corroborate the AccessDenied condition, and IAM diagnostics (such as the policy simulator) validate which actions/resources are missing. Updating the Automation assume role with only the proven-required permissions is an explicit application of least privilege, keeping the permissions narrowly scoped while resolving the failure. The key takeaway is that effective operations remediation should fix the immediate issue while minimizing unnecessary access.
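
A boto3 sketch of validating a suspected missing permission before granting it (the role ARN, action, and resource are illustrative):

import boto3

iam = boto3.client("iam")

# Simulate the Automation assume role against the action reported in the AccessDenied error.
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::111122223333:role/AutomationAssumeRole",  # placeholder
    ActionNames=["ssm:SendCommand"],                                        # illustrative action
    ResourceArns=["arn:aws:ec2:us-west-2:111122223333:instance/i-0123456789abcdef0"],
)
for evaluation in result["EvaluationResults"]:
    print(evaluation["EvalActionName"], evaluation["EvalDecision"])  # allowed / implicitDeny / explicitDeny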

  • Blast radius focuses on limiting the impact of failures (for example, smaller targets or phased rollouts), not refining IAM permissions.
  • Automation/standardization is about making tasks repeatable and consistent, but the described remediation is permission scoping based on diagnostics.
  • Shared responsibility clarifies which security controls belong to AWS vs the customer; it doesn’t describe how to right-size IAM permissions.

Question 65

Topic: Deployment, Provisioning, and Automation

An operations team needs all newly provisioned AWS resources to include a consistent set of tags (for example CostCenter, Application, Owner, Environment) to support cost allocation and governance. Which action best aligns with the ops principle of enforcing this requirement consistently across teams?

Options:

  • A. Grant developers broad permissions so they can add tags manually after resources are created

  • B. Provision resources only through standardized IaC templates that require tag inputs and apply tags automatically

  • C. Separate each team into its own AWS account and allow teams to choose their own tagging conventions

  • D. Rely on AWS to enforce tagging standards because tagging affects billing and cost reports

Best answer: B

Explanation: Standardizing provisioning through automation is the most reliable way to enforce consistent tags. By requiring tag parameters and applying tags as part of infrastructure provisioning, the organization reduces manual variation and improves auditability. This directly supports operational governance and cost allocation without depending on users to remember post-creation steps.

The core principle is automation/standardization (Operational Excellence): enforce requirements by embedding them into repeatable provisioning mechanisms rather than relying on manual behavior. If resources can only be created through standardized infrastructure-as-code (for example, CloudFormation or Service Catalog products) that require specific tag inputs and apply tags automatically, tagging becomes consistent, auditable, and scalable across teams.

This approach also supports day-2 operations because you can:

  • Require the same tag keys everywhere
  • Apply defaults where appropriate (for example, Environment)
  • Prevent untagged resources from being introduced during change windows

Manual tagging and “policy by email” typically lead to drift and gaps over time.

  • Manual tagging is error-prone and will lead to inconsistent or missing tags.
  • Account separation can help isolation and billing boundaries but does not enforce tag key/value consistency.
  • Shared responsibility misunderstanding is incorrect; AWS provides tagging features, but customers must implement and enforce their tagging standards.

Continue with full practice

Use the AWS SOA-C03 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Try AWS SOA-C03 on Web View AWS SOA-C03 Practice Test

Focused topic pages

Free review resource

Read the AWS SOA-C03 Cheat Sheet on Tech Exam Lexicon for concept review before another timed run.

Revised on Thursday, May 14, 2026