AWS SOA-C03: Monitoring and Optimization

Try 10 focused AWS SOA-C03 questions on Monitoring and Optimization, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Try AWS SOA-C03 on Web | View full AWS SOA-C03 practice page

Topic snapshot

Field | Detail
Exam route | AWS SOA-C03
Topic area | Monitoring, Logging, Analysis, Remediation, and Performance Optimization
Blueprint weight | 22%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Monitoring, Logging, Analysis, Remediation, and Performance Optimization for AWS SOA-C03. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 22% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An operations engineer must identify performance hotspots across EC2 instances and Lambda functions by using CloudWatch performance metrics and existing Application and Environment tags. Which statement is INCORRECT?

Options:

  • A. Filter CloudWatch EC2 metrics directly by instance tags.

  • B. Use the CloudWatch agent memory/disk metrics to spot EC2 bottlenecks.

  • C. Use the Resource Groups Tagging API to list resource IDs, then compare metrics.

  • D. Use Lambda Duration/Throttles metrics and tags to find compute hotspots.

Best answer: A

Explanation: CloudWatch metric queries and the Metrics console operate on metric namespaces and dimensions (such as EC2 InstanceId), not on tag keys/values. To use tags for hotspot analysis, first use tags to identify the relevant resources (IDs/ARNs), then review and compare the appropriate CloudWatch metrics for those resources.

The core workflow is: use tags to define “which resources to look at,” then use CloudWatch metrics to find the bottleneck signal for those specific resources. CloudWatch does not provide native tag-based filtering/grouping of EC2 metrics; EC2 metrics are emitted with dimensions like InstanceId, so you must first translate Application/Environment tags into the corresponding instance IDs (or other resource identifiers) and then compare metrics across that set.

Common operational approach:

  • Query tagged resources (for example, via Resource Groups Tagging API/Resource Groups).
  • Pull CloudWatch metrics for those resource IDs (CPU/network/EBS balance, or Lambda Duration/Throttles).
  • Add CloudWatch agent metrics (memory/disk) when default metrics don’t expose the bottleneck.

Key takeaway: tags help you scope and correlate; dimensions are what CloudWatch metrics filter on.

  • Tag filtering myth: CloudWatch Metrics doesn’t let you filter EC2 metrics by tag key/value; you filter by dimensions like InstanceId.
  • Tag-to-ID scoping: using the Resource Groups Tagging API to get IDs/ARNs is a valid way to build the metric comparison set.
  • Missing OS metrics: the CloudWatch agent is the standard way to publish memory and disk metrics needed for EC2 bottleneck analysis.
  • Lambda hotspot signals: Duration and Throttles are primary Lambda performance indicators, and tags help correlate functions to apps/environments.
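
To make the tag-to-ID workflow concrete, here is a minimal boto3 sketch: resolve tags to instance IDs, then compare CloudWatch metrics by dimension. The tag values, region, and time window are hypothetical; adjust them to your environment.

```python
import datetime
import boto3

tagging = boto3.client("resourcegroupstaggingapi")
cloudwatch = boto3.client("cloudwatch")

# Step 1: translate Application/Environment tags into EC2 instance IDs.
instance_ids = []
for page in tagging.get_paginator("get_resources").paginate(
    TagFilters=[
        {"Key": "Application", "Values": ["orders"]},  # hypothetical tag values
        {"Key": "Environment", "Values": ["prod"]},
    ],
    ResourceTypeFilters=["ec2:instance"],
):
    for mapping in page["ResourceTagMappingList"]:
        instance_ids.append(mapping["ResourceARN"].split("/")[-1])

# Step 2: compare CPU across that set via the InstanceId dimension,
# since CloudWatch filters on dimensions, not tags.
queries = [
    {
        "Id": f"cpu{i}",
        "Label": iid,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [{"Name": "InstanceId", "Value": iid}],
            },
            "Period": 300,
            "Stat": "Average",
        },
    }
    for i, iid in enumerate(instance_ids)
]

now = datetime.datetime.now(datetime.timezone.utc)
result = cloudwatch.get_metric_data(
    MetricDataQueries=queries,
    StartTime=now - datetime.timedelta(hours=3),
    EndTime=now,
)
for series in result["MetricDataResults"]:
    print(series["Label"], max(series["Values"], default=0))
```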

Question 2

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

Which TWO statements about Amazon EC2 placement groups are INCORRECT or unsafe to rely on? (Select TWO.)

Options:

  • A. Cluster placement groups can span multiple Availability Zones.

  • B. Partition placement groups isolate instances by partition for large fleets.

  • C. Spread placement groups place each instance on distinct hardware.

  • D. Cluster placement groups suit tightly coupled HPC in one AZ.

  • E. Spread placement groups are best for a small number of critical instances.

  • F. Partition placement groups guarantee partitions are in different AZs.

Correct answers: A and F

Explanation: Cluster placement groups are designed for low-latency, high-throughput networking within a single Availability Zone, not across AZs. Partition placement groups isolate failure domains between partitions, but you cannot assume partitions are placed in separate AZs. The unsafe statements are the ones that incorrectly claim multi-AZ behavior or guarantees.

The core concept is choosing the placement group type that matches the failure domain and performance goal, and not assuming guarantees the feature does not provide. Cluster placement groups focus on performance by placing instances close together within one Availability Zone, so treating them as multi-AZ is incorrect.

Partition placement groups focus on resilience at scale by isolating instances into partitions (separate underlying hardware sets), but AWS does not guarantee that each partition corresponds to a unique Availability Zone.

The following statements are safe/accurate to rely on:

  • Cluster placement groups are a good fit for tightly coupled HPC or network-intensive workloads within one AZ.
  • Spread placement groups place instances on distinct underlying hardware to reduce correlated failure.
  • Spread placement groups are typically used for a small number of critical instances.
  • Partition placement groups isolate instances by partition and are commonly used for large distributed systems.

Key takeaway: cluster = single-AZ performance, spread = instance-level separation, partition = partition-level isolation (not AZ guarantees).

  • Cluster across AZs is unsafe because cluster placement is limited to one Availability Zone.
  • Partition equals multi-AZ is unsafe because partition isolation does not guarantee AZ separation.
  • Spread for critical instances is reasonable because it reduces correlated failures by separating hardware.
  • Partition for large fleets is reasonable because partitions provide scalable failure-domain isolation.
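
For reference, the three strategies map directly onto the EC2 CreatePlacementGroup API. A minimal sketch; the group names and partition count are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2")

# Cluster: low-latency, high-throughput networking within a single AZ.
ec2.create_placement_group(GroupName="hpc-cluster", Strategy="cluster")

# Partition: failure-domain isolation for large fleets
# (no guarantee that partitions land in different AZs).
ec2.create_placement_group(GroupName="big-fleet", Strategy="partition", PartitionCount=7)

# Spread: each instance on distinct hardware, for a few critical instances.
ec2.create_placement_group(GroupName="critical-spread", Strategy="spread")
```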

Question 3

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An operations team wants a repeatable remediation runbook for a web application running in an Auto Scaling group behind an ALB. A CloudWatch alarm on the ALB metric UnHealthyHostCount triggers an EventBridge rule to start a custom AWS Systems Manager Automation document.

Which approach should the engineer NOT use when authoring the custom Automation document?

Options:

  • A. Use aws:executeAwsApi steps to deregister an unhealthy target and terminate the associated instance

  • B. Use an aws:runCommand step to restart the application service on the affected instance and then re-check target health

  • C. Use an Automation assumeRole and aws:executeAwsApi to start an instance refresh for the Auto Scaling group

  • D. Hard-code an IAM user access key and secret key in the document and use AWS CLI commands in a script step

Best answer: D

Explanation: Custom Systems Manager Automation documents should call AWS APIs and commands using an Automation-assumed IAM role and controlled parameters. Embedding long-term credentials in the document is an operational anti-pattern because it creates unmanaged secret exposure and bypasses least-privilege access control. Secure remediation should rely on IAM roles and approved secret stores.

The core practice when creating custom SSM Automation documents is to run actions under an IAM role (via assumeRole) and use native steps like aws:executeAwsApi and aws:runCommand to perform remediation. This keeps the runbook auditable (CloudTrail), repeatable, and aligned with least privilege.

Hard-coding access keys inside an Automation document is insecure and operationally risky because the document content can be read by principals with SSM document access, the credentials are hard to rotate, and the runbook no longer relies on centrally managed IAM permissions. If a script needs a secret (for example, an API token), pass it securely from AWS Secrets Manager or SSM Parameter Store (SecureString) rather than storing it in the document.

Key takeaway: use IAM roles and managed secrets; never embed long-term credentials in Automation content.

  • API-based remediation is valid because aws:executeAwsApi is designed for calling AWS APIs directly from Automation.
  • Command-based remediation is valid because aws:runCommand can safely execute controlled operational commands on instances.
  • Using assumeRole is best practice because Automation can perform actions with a scoped role instead of static credentials.
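
To make the role-based pattern concrete, here is a minimal sketch that registers a runbook whose steps run under an assumed role rather than embedded keys. The document name, service name, and ASG name are hypothetical.

```python
import json
import boto3

ssm = boto3.client("ssm")

# schemaVersion 0.3 documents support assumeRole, so every step runs
# under a scoped IAM role instead of hard-coded credentials.
runbook = {
    "schemaVersion": "0.3",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "AutomationAssumeRole": {"type": "String"},
        "InstanceId": {"type": "String"},
    },
    "mainSteps": [
        {
            "name": "RestartAppService",
            "action": "aws:runCommand",
            "inputs": {
                "DocumentName": "AWS-RunShellScript",
                "InstanceIds": ["{{ InstanceId }}"],
                # Hypothetical service name on the affected instance.
                "Parameters": {"commands": ["sudo systemctl restart webapp"]},
            },
        },
        {
            "name": "StartInstanceRefresh",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "autoscaling",
                "Api": "StartInstanceRefresh",
                "AutoScalingGroupName": "web-asg",  # hypothetical ASG name
            },
        },
    ],
}

ssm.create_document(
    Content=json.dumps(runbook),
    Name="WebAppRemediationRunbook",
    DocumentType="Automation",
    DocumentFormat="JSON",
)
```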

Question 4

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A workload in us-east-1 uses an Amazon EventBridge rule to invoke a Lambda function when new orders are created. Operations sees occasional FailedInvocations on the rule in CloudWatch metrics, and some orders are never processed. The team cannot change application code and must keep the Lambda function as the direct target (no new components in the path).

Which TWO changes will improve reliability by ensuring transient failures are retried and unrecoverable events are captured for later analysis and reprocessing? (Select TWO.)

Options:

  • A. Configure a target retry policy (max age and attempts)

  • B. Configure a Lambda asynchronous DLQ for the function

  • C. Increase the Lambda reserved concurrency to stop invocation errors

  • D. Add a CloudWatch alarm and run an SSM Automation runbook

  • E. Configure an SQS dead-letter queue on the rule target

  • F. Enable EventBridge event archive and plan manual replays

Correct answers: A and E

Explanation: EventBridge supports per-target retry policies and an SQS dead-letter queue (DLQ) to make event delivery more resilient. Retries address transient delivery failures, while a DLQ preserves the original event when delivery ultimately cannot succeed. Together, these settings prevent silent event loss and provide an operational path to investigate and reprocess failures.

The core mechanism is EventBridge target delivery controls: a retry policy and a dead-letter queue. When a rule invokes a target and the delivery fails (for example, transient service errors or throttling), EventBridge can retry delivery according to the target’s retry policy (maximum event age and maximum retry attempts). If the event still cannot be delivered before those limits are reached, EventBridge can send the event payload to an SQS DLQ configured on that same target.

Operationally, this gives you:

  • Automatic retries for transient failures without code changes.
  • A durable failure store (SQS DLQ) for audit, inspection, and later reprocessing.

The key takeaway is that EventBridge reliability for target delivery is improved by configuring retry and DLQ directly on the rule target, not by relying on downstream service features that never see the events EventBridge fails to deliver.

  • OK: Target retry policy enables EventBridge to retry transient target delivery failures.
  • OK: Target SQS DLQ captures events that still fail after retries for later analysis.
  • NO: Lambda async DLQ doesn’t capture these failures: FailedInvocations means EventBridge could not deliver the event to the function, so the event never reaches Lambda’s asynchronous queue or its DLQ.
  • NO: Archive/replay helps after the fact but doesn’t provide per-target retry/DLQ handling for delivery failures.
  • NO: Alarm + Automation improves detection/response but does not configure EventBridge delivery retries or DLQ capture.
  • NO: Reserved concurrency may reduce throttling risk but doesn’t provide capture of undelivered events when failures still occur.
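
Both controls attach to the rule target itself, so nothing new sits in the event path. A minimal sketch, with hypothetical rule, function, queue, and account values:

```python
import boto3

events = boto3.client("events")

# Retry policy and DLQ are per-target settings on the EventBridge rule.
events.put_targets(
    Rule="new-orders",  # hypothetical rule name
    Targets=[
        {
            "Id": "order-processor",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:ProcessOrder",
            "RetryPolicy": {
                "MaximumEventAgeInSeconds": 3600,  # stop retrying after 1 hour
                "MaximumRetryAttempts": 10,
            },
            "DeadLetterConfig": {
                # Undeliverable events land here for analysis and reprocessing.
                "Arn": "arn:aws:sqs:us-east-1:111122223333:orders-dlq"
            },
        }
    ],
)
```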

Question 5

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A team uses Amazon EFS to share build artifacts between 20 Amazon EC2 Linux instances in two Availability Zones. The EFS file system is only 200 GiB, and a nightly job needs consistent read throughput of 250 MiB/s for about 4 hours. During the job, CloudWatch shows BurstCreditBalance reaching 0 and users report slow file access.

Requirements:

  • Keep Multi-AZ availability (no data-loss acceptable).
  • Meet the nightly throughput requirement without application changes.
  • Reduce storage cost for files not accessed for 60 days.
  • Provide an operationally simple way to monitor whether throughput is being limited.

Which action best meets these requirements?

Options:

  • A. Move the file system to EFS One Zone and enable Provisioned Throughput set to 250 MiB/s; enable an EFS lifecycle policy to transition files to Infrequent Access after 60 days.

  • B. Switch to Max I/O performance mode and keep Bursting throughput; enable an EFS lifecycle policy to transition files to Infrequent Access after 60 days.

  • C. Use General Purpose performance mode with Provisioned Throughput set to 250 MiB/s, enable an EFS lifecycle policy to transition files to Infrequent Access after 60 days, and alarm on EFS PercentIOLimit (and related throughput metrics).

  • D. Replace EFS with EBS gp3 volumes attached to each instance, and use an OS-level sync job plus EBS snapshot lifecycle management for cost control.

Best answer: C

Explanation: The workload is exhausting EFS burst credits on a small file system, so Bursting throughput cannot reliably meet a sustained 250 MiB/s nightly demand. Provisioned Throughput guarantees the required throughput independent of file system size. EFS lifecycle management then reduces cost by moving cold files to Infrequent Access while CloudWatch alarms provide simple operational visibility into throughput limiting.

This is a classic EFS “burst credit exhaustion” symptom: with Bursting throughput, small file systems can run out of burst credits during sustained high-throughput periods, causing throttling and latency. To meet a known, sustained throughput target without changing the application, configure the file system with Provisioned Throughput at the required MiB/s so performance no longer depends on credits.

To optimize cost at the same time, enable an EFS lifecycle policy to transition files that have not been accessed for 60 days to the EFS Infrequent Access storage class. For operations, use CloudWatch EFS metrics such as PercentIOLimit and throughput-related metrics to alarm when the file system is approaching or hitting its throughput limit.

Changing performance mode alone does not address the underlying throughput-mode constraint.

  • Bursting still throttles: it keeps dependence on BurstCreditBalance, which the stem shows is already reaching 0 during the nightly job.
  • One Zone reduces availability: it violates the Multi-AZ availability requirement even if it improves cost/performance.
  • EBS is not a shared file system: it breaks the shared, concurrent POSIX file access requirement and adds operational complexity with syncing.
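
The three settings in the best answer are all API-level changes. A minimal sketch; the file system ID, alarm threshold, and SNS topic are hypothetical:

```python
import boto3

efs = boto3.client("efs")
cloudwatch = boto3.client("cloudwatch")

fs_id = "fs-0123456789abcdef0"  # hypothetical file system ID

# Guarantee 250 MiB/s independent of burst credits.
efs.update_file_system(
    FileSystemId=fs_id,
    ThroughputMode="provisioned",
    ProvisionedThroughputInMibps=250,
)

# Move files untouched for 60 days to Infrequent Access.
efs.put_lifecycle_configuration(
    FileSystemId=fs_id,
    LifecyclePolicies=[{"TransitionToIA": "AFTER_60_DAYS"}],
)

# Alarm when the file system approaches its I/O limit.
cloudwatch.put_metric_alarm(
    AlarmName="efs-io-limit",
    Namespace="AWS/EFS",
    MetricName="PercentIOLimit",
    Dimensions=[{"Name": "FileSystemId", "Value": fs_id}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=90.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # hypothetical topic
)
```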

Question 6

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

When troubleshooting an EBS-backed Amazon EC2 instance, which statement best defines an IO-bound workload based on CloudWatch and OS-level metrics?

Options:

  • A. High NetworkIn/NetworkOut with normal disk latency and normal OS iowait

  • B. High memory utilization with sustained swapping/page faults and normal disk latency

  • C. High EBS latency and high OS iowait, while CPU usage is not consistently maxed

  • D. Sustained high CPUUtilization with low OS iowait and low disk latency

Best answer: C

Explanation: An IO-bound workload is bottlenecked on storage performance rather than CPU or memory. In practice, you see elevated EBS volume latency (and often reduced IOPS headroom) alongside high OS iowait, while CPU is not consistently saturated. This points to threads waiting on disk reads/writes to complete.

IO-bound means the instance spends a significant portion of time waiting for storage operations, so overall throughput is limited by disk/volume performance. In AWS, this commonly presents as elevated per-operation EBS latency, derived from VolumeTotalReadTime/VolumeTotalWriteTime relative to VolumeReadOps/VolumeWriteOps (and often a rising VolumeQueueLength), and at the OS level as higher iowait time, because runnable work is blocked on I/O completion.

By contrast, CPU-bound workloads show sustained high CPUUtilization with low iowait (the CPU is doing work, not waiting), and memory-bound workloads show memory pressure signals like swapping/page faults and reclaim activity. The key differentiator for IO-bound is “waiting on storage” (iowait/latency), not “running out of CPU cycles.”

  • CPU saturation describes CPU-bound behavior (high CPU, low iowait), not an I/O bottleneck.
  • Memory pressure (swapping/page faults) indicates memory-bound behavior even if performance is poor.
  • High throughput networking points to network-bound constraints rather than storage I/O limits.
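
Because EBS publishes total I/O time rather than a ready-made latency metric, per-operation latency can be derived with CloudWatch metric math. A minimal sketch, assuming a hypothetical volume ID:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
vol_id = "vol-0123456789abcdef0"  # hypothetical volume ID

def ebs_metric(metric_name, query_id):
    """Build a Sum query for one AWS/EBS metric on the volume."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EBS",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "VolumeId", "Value": vol_id}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,  # inputs to the math expression only
    }

now = datetime.datetime.now(datetime.timezone.utc)
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        ebs_metric("VolumeTotalReadTime", "rt"),
        ebs_metric("VolumeReadOps", "ro"),
        # Average seconds per read op, converted to milliseconds.
        {"Id": "read_latency_ms", "Expression": "(rt / ro) * 1000",
         "Label": "avg read latency (ms)"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
)
print(resp["MetricDataResults"][0]["Values"])
```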

Question 7

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

An operations team uses an Amazon EventBridge rule to capture CloudWatch alarm state-change events and forward them to (1) an Amazon SNS topic for on-call notifications and (2) an EventBridge API destination for a vendor incident system.

Requirements:

  • The vendor system must receive only: alarmName, newState, account, region, and runbookUrl.
  • The team must add runbookUrl in EventBridge before delivery.
  • The team is not allowed to deploy or maintain custom code (for example, AWS Lambda).

Which TWO actions should you AVOID? (Select TWO.)

Options:

  • A. Send the full CloudWatch alarm event JSON to the vendor API destination without transformation

  • B. Use an EventBridge input transformer on the SNS target to format a concise on-call message with alarmName, state, account, region, and runbookUrl

  • C. Create multiple EventBridge rules that match alarm name patterns, each using an input transformer to inject the appropriate runbookUrl

  • D. Invoke an AWS Step Functions state machine (using AWS SDK integrations) to retrieve runbookUrl, then call the vendor API destination with the reduced payload

  • E. Use an EventBridge input transformer on the API destination target to map only the approved fields and add a static runbookUrl value

  • F. Invoke a Lambda function from the rule to look up runbookUrl and forward the enriched payload

Correct answers: A and F

Explanation: EventBridge input transformers can reshape events and inject static context before sending to targets, which satisfies strict payload requirements without code. When additional context must be fetched dynamically, an enrichment pattern must still comply with operational constraints like “no custom code.” Sending unfiltered events to external systems violates least-privilege data sharing requirements.

The core concept is using EventBridge input transformers (and, when necessary, enrichment patterns) to add or shape context before delivering an event to a target. Input transformers can select specific JSON fields from the incoming CloudWatch alarm event and build a new payload, including adding constant fields like a runbookUrl, ensuring the vendor receives only the approved attributes.

Two actions are unsafe here because they break explicit requirements:

  • Deploying Lambda introduces custom code the team is not permitted to operate.
  • Forwarding the entire event to the vendor violates the requirement to send only a limited set of fields.

If runbookUrl must be derived dynamically without Lambda, a managed enrichment workflow (for example, Step Functions with AWS SDK integrations) can retrieve context and then emit only the reduced payload.

  • Lambda enrichment is prohibited because the scenario explicitly disallows deploying or maintaining custom code.
  • No transformation to vendor is unacceptable because it sends more than the approved five fields.
  • Input transformer to vendor/SNS is acceptable because it can both reduce fields and inject static context before delivery.
  • Pattern-based multiple rules is acceptable because it adds runbookUrl without custom code and keeps control in EventBridge.
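
An input transformer pairs a path map (fields pulled from the event) with a template (the payload actually sent). A minimal sketch for the vendor target; the rule name, target ARNs, role, and runbook URL are hypothetical:

```python
import boto3

events = boto3.client("events")

events.put_targets(
    Rule="alarm-state-change",  # hypothetical rule matching alarm state-change events
    Targets=[
        {
            "Id": "vendor-incidents",
            "Arn": "arn:aws:events:us-east-1:111122223333:api-destination/vendor/abcd1234",
            "RoleArn": "arn:aws:iam::111122223333:role/eventbridge-invoke-api-destination",
            "InputTransformer": {
                # Pull only the approved fields from the alarm event...
                "InputPathsMap": {
                    "alarmName": "$.detail.alarmName",
                    "newState": "$.detail.state.value",
                    "account": "$.account",
                    "region": "$.region",
                },
                # ...and inject a static runbookUrl before delivery.
                "InputTemplate": (
                    '{"alarmName": <alarmName>, "newState": <newState>, '
                    '"account": <account>, "region": <region>, '
                    '"runbookUrl": "https://runbooks.example.com/web-app"}'
                ),
            },
        }
    ],
)
```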

Question 8

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A team uses Amazon EFS as a shared working directory for a fleet of Linux build servers. The workload performs many small, latency-sensitive file operations and has unpredictable throughput spikes during business hours. Build artifacts are rarely accessed after 45 days.

You need to optimize EFS cost and performance. Which action should you NOT take?

Options:

  • A. Enable a 45-day lifecycle policy to EFS Infrequent Access

  • B. Switch to Max I/O performance mode to minimize latency

  • C. Use Elastic throughput mode for unpredictable spikes

  • D. Use the General Purpose performance mode

Best answer: B

Explanation: For latency-sensitive workloads with many small file operations, EFS General Purpose is the appropriate performance mode. Max I/O is intended for highly parallel workloads that need higher aggregate throughput and can tolerate higher latencies. Lifecycle policies to EFS Infrequent Access and an adaptive throughput mode help reduce cost and handle variable demand without hurting steady-state operations.

The key decision is EFS performance mode selection for a workload dominated by small, latency-sensitive file operations. EFS General Purpose is designed to provide the lowest latency per operation, which fits interactive build-server file activity.

Max I/O performance mode is meant for use cases that need higher levels of parallelism and aggregate throughput (for example, large-scale shared workloads) and it typically trades off per-operation latency. Separately, throughput mode and lifecycle management address different goals: Elastic throughput automatically scales throughput for spiky demand, and lifecycle policies reduce storage cost by moving cold data to EFS Infrequent Access based on last access time. The takeaway is to avoid Max I/O when low latency is the primary requirement.

  • General Purpose for low latency aligns with many small, latency-sensitive file operations.
  • Elastic throughput is appropriate when throughput demand is unpredictable and spiky.
  • Lifecycle to IA is a standard cost optimization for artifacts rarely accessed after 45 days.
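
Note that performance mode is fixed when the file system is created, while throughput mode can be changed later. A minimal sketch of both calls; the creation token and file system ID are hypothetical:

```python
import boto3

efs = boto3.client("efs")

# Performance mode is chosen at creation and cannot be changed afterward:
# generalPurpose gives the lowest per-operation latency.
efs.create_file_system(
    CreationToken="build-servers-001",  # hypothetical idempotency token
    PerformanceMode="generalPurpose",
    ThroughputMode="elastic",  # scales automatically with spiky demand
)

# An existing file system can switch throughput modes in place.
efs.update_file_system(
    FileSystemId="fs-0123456789abcdef0",  # hypothetical existing file system
    ThroughputMode="elastic",
)
```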

Question 9

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer. During business hours, CloudWatch shows a sustained increase in traffic and rising p95 latency. The operations team wants the ASG to automatically add and remove instances to maintain steady performance without manual intervention.

Which action BEST aligns with the Well-Architected operational excellence/reliability principle of automating responses to operational events?

Options:

  • A. Manually increase the ASG desired capacity whenever a latency alarm enters ALARM state

  • B. Create a target tracking scaling policy using ALBRequestCountPerTarget to keep a steady requests-per-target value

  • C. Set the ASG desired capacity to the maximum value and disable scale-in to prevent future latency

  • D. Change to larger instance types and keep the ASG size fixed to avoid scaling complexity

Best answer: B

Explanation: A target tracking scaling policy is an automated, feedback-based control loop that continuously adjusts ASG capacity to match sustained load. Using an ALB load metric such as requests per target directly ties scaling to the workload, helping maintain consistent performance. This aligns with Well-Architected operational excellence and reliability by automating remediation instead of relying on manual actions.

The core ops principle here is automation: when load changes, the system should respond automatically and predictably to maintain service performance. ASG target tracking is designed for this outcome because it continuously evaluates a metric and adjusts capacity to keep that metric near a target, reducing human intervention and stabilizing performance during sustained demand.

In this scenario, ALBRequestCountPerTarget is a strong choice because it scales on per-target load rather than indirectly on instance resource usage.

  • Choose an appropriate requests-per-target target value.
  • Attach a target tracking policy to the ASG.
  • Let the ASG scale out and in automatically as traffic changes.

Manual scaling or permanently overprovisioning can work temporarily, but they do not implement automated, repeatable remediation.

  • Manual intervention increases operational risk and slows remediation during sustained load.
  • Permanent max capacity reduces elasticity and increases cost while still lacking event-driven automation.
  • Vertical scaling only can help performance but does not provide automated capacity adjustment with demand.
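
A minimal sketch of the target tracking policy, assuming hypothetical ASG and load balancer names; the ResourceLabel ties the predefined metric to a specific ALB target group:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",  # hypothetical ASG name
    PolicyName="keep-requests-per-target-steady",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>
            # (hypothetical identifiers below).
            "ResourceLabel": (
                "app/web-alb/1234567890abcdef/"
                "targetgroup/web-tg/0987654321fedcba"
            ),
        },
        # The ASG adds/removes instances to hold each target near this rate.
        "TargetValue": 800.0,
    },
)
```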

Question 10

Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

A web application runs on a single Amazon EC2 instance (Linux) with the CloudWatch agent installed. Users report intermittent slow responses.

Exhibit: CloudWatch metrics (5-minute average during a slowdown)

CPUUtilization: 17%
cpu_usage_iowait: 41%
mem_used_percent: 58%
swap_used_percent: 1%
EBSIOBalance%: 6%
DiskQueueLength: 11.8
VolumeReadOps: 4,900
VolumeWriteOps: 5,300

Based on the exhibit, what is the best interpretation of the bottleneck?

Options:

  • A. The workload is CPU-bound

  • B. The workload is network-bound

  • C. The workload is memory-bound

  • D. The workload is EBS I/O-bound

Best answer: D

Explanation: The instance is spending a large portion of CPU time waiting on I/O rather than executing work. The high I/O wait and disk queue, combined with depleted EBS I/O balance, point to EBS storage as the limiting resource during the slowdown rather than CPU or memory pressure.

This pattern indicates an I/O-bound workload: the CPU isn’t saturated, but it is blocked waiting for storage operations to complete. In the exhibit, CPUUtilization: 17% is low while cpu_usage_iowait: 41% is high, which commonly happens when disk latency/throughput is the constraint. The storage side corroborates this: DiskQueueLength: 11.8 shows requests piling up, and EBSIOBalance%: 6% indicates the instance’s EBS I/O burst bucket is nearly exhausted, consistent with the elevated VolumeReadOps/VolumeWriteOps. The key takeaway is that high iowait plus a growing disk queue points to EBS I/O contention rather than CPU or memory exhaustion.

  • CPU-bound would typically show sustained high CPUUtilization, not 17%.
  • Memory-bound is unlikely with mem_used_percent: 58% and swap_used_percent: 1%.
  • Network-bound isn’t supported because the exhibit’s strongest signals are cpu_usage_iowait: 41% and DiskQueueLength: 11.8 tied to EBS metrics.
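
To catch this condition before users report slowness, alarm on the burst-balance metric itself. A minimal sketch, assuming a Nitro-based instance; the instance ID and SNS topic are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# EBSIOBalance% is an instance-level metric on Nitro instances; a low value
# means the instance's EBS I/O burst bucket is nearly spent.
cloudwatch.put_metric_alarm(
    AlarmName="low-ebs-io-balance",
    Namespace="AWS/EC2",
    MetricName="EBSIOBalance%",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],
)
```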

Continue with full practice

Use the AWS SOA-C03 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Try AWS SOA-C03 on Web | View AWS SOA-C03 Practice Test

Free review resource

Read the AWS SOA-C03 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026