Try 10 focused AWS SOA-C03 questions on Monitoring and Optimization, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | AWS SOA-C03 |
| Topic area | Monitoring, Logging, Analysis, Remediation, and Performance Optimization |
| Blueprint weight | 22% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate Monitoring, Logging, Analysis, Remediation, and Performance Optimization for AWS SOA-C03. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 22% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
An operations engineer must identify performance hotspots across EC2 instances and Lambda functions by using CloudWatch performance metrics and existing Application and Environment tags. Which statement is INCORRECT?
Options:
A. Filter CloudWatch EC2 metrics directly by instance tags.
B. Use the CloudWatch agent memory/disk metrics to spot EC2 bottlenecks.
C. Use the Resource Groups Tagging API to list resource IDs, then compare metrics.
D. Use Lambda Duration/Throttles metrics and tags to find compute hotspots.
Best answer: A
Explanation: CloudWatch metric queries and the Metrics console operate on metric namespaces and dimensions (such as EC2 InstanceId), not on tag keys/values. To use tags for hotspot analysis, first use tags to identify the relevant resources (IDs/ARNs), then review and compare the appropriate CloudWatch metrics for those resources.
The core workflow is: use tags to define “which resources to look at,” then use CloudWatch metrics to find the bottleneck signal for those specific resources. CloudWatch does not provide native tag-based filtering/grouping of EC2 metrics; EC2 metrics are emitted with dimensions like InstanceId, so you must first translate Application/Environment tags into the corresponding instance IDs (or other resource identifiers) and then compare metrics across that set.
Common operational approach: use the Resource Groups Tagging API (or EC2 DescribeInstances tag filters) to resolve the Application and Environment tags into instance IDs and function names, then query the relevant CloudWatch metrics for exactly those resources.
Key takeaway: tags help you scope and correlate; dimensions are what CloudWatch metrics filter on.
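The tag-to-metrics workflow above can be sketched with plain parameter dicts. This is a minimal illustration, not a definitive implementation: the ARN, account ID, and the choice of CPUUtilization are assumptions, and the actual boto3 calls are left as comments because they require AWS credentials.

```python
def instance_id_from_arn(arn: str) -> str:
    """Extract the EC2 instance ID from an instance ARN."""
    # ARN shape: arn:aws:ec2:<region>:<account>:instance/<instance-id>
    return arn.rsplit("/", 1)[-1]

def metric_queries(instance_ids, metric_name="CPUUtilization"):
    """Build CloudWatch GetMetricData query entries, one per instance ID.

    Note the filtering happens on the InstanceId dimension, not on tags.
    """
    return [
        {
            "Id": f"m{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "InstanceId", "Value": iid}],
                },
                "Period": 300,
                "Stat": "Average",
            },
        }
        for i, iid in enumerate(instance_ids)
    ]

# ARNs as the Resource Groups Tagging API (get_resources with TagFilters
# on Application/Environment) might return -- hypothetical account and ID.
arns = ["arn:aws:ec2:us-east-1:111122223333:instance/i-0abc123def456"]
queries = metric_queries([instance_id_from_arn(a) for a in arns])
# cloudwatch.get_metric_data(MetricDataQueries=queries, ...)  # needs AWS creds
```

The two-step shape makes the exam point concrete: tags select the resource set, while the metric query itself filters only on dimensions.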
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
Which TWO statements about Amazon EC2 placement groups are INCORRECT or unsafe to rely on? (Select TWO.)
Options:
A. Cluster placement groups can span multiple Availability Zones.
B. Partition placement groups isolate instances by partition for large fleets.
C. Spread placement groups place each instance on distinct hardware.
D. Cluster placement groups suit tightly coupled HPC in one AZ.
E. Spread placement groups are best for a small number of critical instances.
F. Partition placement groups guarantee partitions are in different AZs.
Correct answers: A and F
Explanation: Cluster placement groups are designed for low-latency, high-throughput networking within a single Availability Zone, not across AZs. Partition placement groups isolate failure domains between partitions, but you cannot assume partitions are placed in separate AZs. The unsafe statements are the ones that incorrectly claim multi-AZ behavior or guarantees.
The core concept is choosing the placement group type that matches the failure domain and performance goal, and not assuming guarantees the feature does not provide. Cluster placement groups focus on performance by placing instances close together within one Availability Zone, so treating them as multi-AZ is incorrect.
Partition placement groups focus on resilience at scale by isolating instances into partitions (separate underlying hardware sets), but AWS does not guarantee that each partition corresponds to a unique Availability Zone.
The remaining statements are safe to rely on: partition placement groups isolate instances by partition for large fleets (B), spread placement groups place each instance on distinct underlying hardware (C), cluster placement groups suit tightly coupled HPC in one AZ (D), and spread placement groups fit a small number of critical instances (E).
Key takeaway: cluster = single-AZ performance, spread = instance-level separation, partition = partition-level isolation (not AZ guarantees).
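The takeaway maps cleanly to the Strategy parameter of the EC2 CreatePlacementGroup API. A minimal decision helper, assuming illustrative goal labels of my own (the strategy strings are the real API values):

```python
def placement_strategy(goal: str) -> str:
    """Map a placement goal to the EC2 CreatePlacementGroup strategy value.

    Goal labels are illustrative; the returned strings are the actual
    "Strategy" values accepted by the API.
    """
    table = {
        "low-latency-hpc-single-az": "cluster",        # performance in one AZ
        "few-critical-instances": "spread",            # distinct hardware each
        "large-fleet-failure-isolation": "partition",  # partition-level isolation
    }
    return table[goal]

# ec2.create_placement_group(GroupName="hpc-pg",
#                            Strategy=placement_strategy("low-latency-hpc-single-az"))
```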
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
An operations team wants a repeatable remediation runbook for a web application running in an Auto Scaling group behind an ALB. A CloudWatch alarm on the ALB metric UnHealthyHostCount triggers an EventBridge rule to start a custom AWS Systems Manager Automation document.
Which approach should the engineer NOT use when authoring the custom Automation document?
Options:
A. Use aws:executeAwsApi steps to deregister an unhealthy target and terminate the associated instance
B. Use an aws:runCommand step to restart the application service on the affected instance and then re-check target health
C. Use an Automation assumeRole and aws:executeAwsApi to start an instance refresh for the Auto Scaling group
D. Hard-code an IAM user access key and secret key in the document and use AWS CLI commands in a script step
Best answer: D
Explanation: Custom Systems Manager Automation documents should call AWS APIs and commands using an Automation-assumed IAM role and controlled parameters. Embedding long-term credentials in the document is an operational anti-pattern because it creates unmanaged secret exposure and bypasses least-privilege access control. Secure remediation should rely on IAM roles and approved secret stores.
The core practice when creating custom SSM Automation documents is to run actions under an IAM role (via assumeRole) and use native steps like aws:executeAwsApi and aws:runCommand to perform remediation. This keeps the runbook auditable (CloudTrail), repeatable, and aligned with least privilege.
Hard-coding access keys inside an Automation document is insecure and operationally risky because the document content can be read by principals with SSM document access, the credentials are hard to rotate, and the runbook no longer relies on centrally managed IAM permissions. If a script needs a secret (for example, an API token), pass it securely from AWS Secrets Manager or SSM Parameter Store (SecureString) rather than storing it in the document.
Key takeaway: use IAM roles and managed secrets; never embed long-term credentials in Automation content.
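A skeleton of such a custom Automation document, built as a Python dict before serialization for ssm.create_document. The role parameter, target group ARN, and service name are hypothetical; the point is the assumeRole field and the absence of any embedded credentials:

```python
# Illustrative runbook skeleton -- parameter names and the restart command
# are assumptions, not from the question. Secrets, if needed, would come
# from Secrets Manager or SSM Parameter Store, never from the document body.
runbook = {
    "schemaVersion": "0.3",
    "assumeRole": "{{ AutomationAssumeRole }}",   # role-based, no access keys
    "parameters": {
        "AutomationAssumeRole": {"type": "String"},
        "InstanceId": {"type": "String"},
        "TargetGroupArn": {"type": "String"},
    },
    "mainSteps": [
        {
            "name": "DeregisterTarget",
            "action": "aws:executeAwsApi",        # direct AWS API call
            "inputs": {
                "Service": "elbv2",
                "Api": "DeregisterTargets",
                "TargetGroupArn": "{{ TargetGroupArn }}",
                "Targets": [{"Id": "{{ InstanceId }}"}],
            },
        },
        {
            "name": "RestartService",
            "action": "aws:runCommand",           # controlled OS-level command
            "inputs": {
                "DocumentName": "AWS-RunShellScript",
                "InstanceIds": ["{{ InstanceId }}"],
                "Parameters": {"commands": ["sudo systemctl restart myapp"]},
            },
        },
    ],
}
# ssm.create_document(Content=json.dumps(runbook), Name="RemediateUnhealthyHost",
#                     DocumentType="Automation")  # requires AWS creds
```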
aws:executeAwsApi is designed for calling AWS APIs directly from Automation, and aws:runCommand can safely execute controlled operational commands on instances.
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
A workload in us-east-1 uses an Amazon EventBridge rule to invoke a Lambda function when new orders are created. Operations sees occasional FailedInvocations on the rule in CloudWatch metrics, and some orders are never processed. The team cannot change application code and must keep the Lambda function as the direct target (no new components in the path).
Which TWO changes will improve reliability by ensuring transient failures are retried and unrecoverable events are captured for later analysis and reprocessing? (Select TWO.)
Options:
A. Configure a target retry policy (max age and attempts)
B. Configure a Lambda asynchronous DLQ for the function
C. Increase the Lambda reserved concurrency to stop invocation errors
D. Add a CloudWatch alarm and run an SSM Automation runbook
E. Configure an SQS dead-letter queue on the rule target
F. Enable EventBridge event archive and plan manual replays
Correct answers: A and E
Explanation: EventBridge supports per-target retry policies and an SQS dead-letter queue (DLQ) to make event delivery more resilient. Retries address transient delivery failures, while a DLQ preserves the original event when delivery ultimately cannot succeed. Together, these settings prevent silent event loss and provide an operational path to investigate and reprocess failures.
The core mechanism is EventBridge target delivery controls: a retry policy and a dead-letter queue. When a rule invokes a target and the delivery fails (for example, transient service errors or throttling), EventBridge can retry delivery according to the target’s retry policy (maximum event age and maximum retry attempts). If the event still cannot be delivered before those limits are reached, EventBridge can send the event payload to an SQS DLQ configured on that same target.
Operationally, this gives you: automatic retries for transient delivery failures, durable capture of undeliverable events in the DLQ, and a straightforward path to investigate and reprocess those events later.
The key takeaway is that EventBridge reliability for target delivery is improved by configuring retry and DLQ directly on the rule target, not by relying on downstream service features that don’t apply to EventBridge’s synchronous target invocation.
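The target-level configuration looks like the following dict, as it would be passed to events.put_targets. The ARNs and rule name are hypothetical, and the retry limits are example values rather than recommendations:

```python
# One target entry for put_targets: retry policy plus SQS DLQ on the
# same target. All identifiers below are illustrative placeholders.
target = {
    "Id": "order-processor",
    "Arn": "arn:aws:lambda:us-east-1:111122223333:function:ProcessOrder",
    "RetryPolicy": {
        "MaximumEventAgeInSeconds": 3600,  # stop retrying after 1 hour
        "MaximumRetryAttempts": 10,
    },
    "DeadLetterConfig": {
        # Undeliverable events land here with failure metadata attached.
        "Arn": "arn:aws:sqs:us-east-1:111122223333:order-events-dlq"
    },
}
# events.put_targets(Rule="new-orders", Targets=[target])  # requires AWS creds
```

Note that both settings live on the rule target itself, which is exactly why no new components are needed in the event path.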
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
A team uses Amazon EFS to share build artifacts between 20 Amazon EC2 Linux instances in two Availability Zones. The EFS file system is only 200 GiB, and a nightly job needs consistent read throughput of 250 MiB/s for about 4 hours. During the job, CloudWatch shows BurstCreditBalance reaching 0 and users report slow file access.
Requirements: meet the sustained 250 MiB/s nightly read throughput without application changes, reduce storage cost for rarely accessed files, and keep simple operational visibility into throughput limits.
Which action best meets these requirements?
Options:
A. Move the file system to EFS One Zone and enable Provisioned Throughput set to 250 MiB/s; enable an EFS lifecycle policy to transition files to Infrequent Access after 60 days.
B. Switch to Max I/O performance mode and keep Bursting throughput; enable an EFS lifecycle policy to transition files to Infrequent Access after 60 days.
C. Use General Purpose performance mode with Provisioned Throughput set to 250 MiB/s, enable an EFS lifecycle policy to transition files to Infrequent Access after 60 days, and alarm on EFS PercentIOLimit (and related throughput metrics).
D. Replace EFS with EBS gp3 volumes attached to each instance, and use an OS-level sync job plus EBS snapshot lifecycle management for cost control.
Best answer: C
Explanation: The workload is exhausting EFS burst credits on a small file system, so Bursting throughput cannot reliably meet a sustained 250 MiB/s nightly demand. Provisioned Throughput guarantees the required throughput independent of file system size. EFS lifecycle management then reduces cost by moving cold files to Infrequent Access while CloudWatch alarms provide simple operational visibility into throughput limiting.
This is a classic EFS “burst credit exhaustion” symptom: with Bursting throughput, small file systems can run out of burst credits during sustained high-throughput periods, causing throttling and latency. To meet a known, sustained throughput target without changing the application, configure the file system with Provisioned Throughput at the required MiB/s so performance no longer depends on credits.
To optimize cost at the same time, enable an EFS lifecycle policy to transition files that have not been accessed for 60 days to the EFS Infrequent Access storage class. For operations, use CloudWatch EFS metrics such as PercentIOLimit and throughput-related metrics to alarm when the file system is approaching or hitting its throughput limit.
Changing performance mode alone does not address the underlying throughput-mode constraint.
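The remediation can be sketched as the parameter payloads for two boto3 calls (update_file_system and put_lifecycle_configuration). The file system ID is a hypothetical placeholder; the 250 MiB/s and 60-day values come from the scenario:

```python
FS_ID = "fs-0123456789abcdef0"  # hypothetical file system ID

# Switch from Bursting to Provisioned Throughput at the required rate,
# so performance no longer depends on BurstCreditBalance.
update_params = {
    "FileSystemId": FS_ID,
    "ThroughputMode": "provisioned",
    "ProvisionedThroughputInMibps": 250.0,  # sustained nightly requirement
}

# Move files untouched for 60 days to the Infrequent Access class.
lifecycle_params = {
    "FileSystemId": FS_ID,
    "LifecyclePolicies": [{"TransitionToIA": "AFTER_60_DAYS"}],
}

# efs.update_file_system(**update_params)                 # requires AWS creds
# efs.put_lifecycle_configuration(**lifecycle_params)     # requires AWS creds
```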
With Bursting throughput, performance still depends on BurstCreditBalance, which the stem shows is already reaching 0 during the nightly job.
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
When troubleshooting an EBS-backed Amazon EC2 instance, which statement best defines an IO-bound workload based on CloudWatch and OS-level metrics?
Options:
A. High NetworkIn/NetworkOut with normal disk latency and normal OS iowait
B. High memory utilization with sustained swapping/page faults and normal disk latency
C. High EBS latency and high OS iowait, while CPU usage is not consistently maxed
D. Sustained high CPUUtilization with low OS iowait and low disk latency
Best answer: C
Explanation: An IO-bound workload is bottlenecked on storage performance rather than CPU or memory. In practice, you see elevated EBS volume latency (and often reduced IOPS headroom) alongside high OS iowait, while CPU is not consistently saturated. This points to threads waiting on disk reads/writes to complete.
IO-bound means the instance spends a significant portion of time waiting for storage operations, so overall throughput is limited by disk/volume performance. In AWS, this commonly presents as increased per-operation EBS read/write time (VolumeTotalReadTime and VolumeTotalWriteTime relative to VolumeReadOps/VolumeWriteOps, often with a rising VolumeQueueLength), and at the OS level as higher iowait time, because runnable work is blocked on I/O completion.
By contrast, CPU-bound workloads show sustained high CPUUtilization with low iowait (the CPU is doing work, not waiting), and memory-bound workloads show memory pressure signals like swapping/page faults and reclaim activity. The key differentiator for IO-bound is “waiting on storage” (iowait/latency), not “running out of CPU cycles.”
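The triage logic in the two paragraphs above can be captured as a rough classifier. The thresholds below are illustrative assumptions for the sketch, not AWS-documented cutoffs:

```python
def classify_bottleneck(cpu_util, iowait, swap_used, volume_latency_ms):
    """Rough triage of the dominant bottleneck.

    Thresholds are illustrative; real triage should compare against the
    workload's own baseline.
    """
    if iowait > 20 and volume_latency_ms > 10 and cpu_util < 80:
        return "io-bound"      # waiting on storage, CPU not saturated
    if cpu_util > 85 and iowait < 10:
        return "cpu-bound"     # doing work, not waiting
    if swap_used > 20:
        return "memory-bound"  # swapping / reclaim pressure
    return "inconclusive"

# Low CPU, high iowait, high disk latency -> storage is the constraint.
print(classify_bottleneck(cpu_util=17, iowait=41, swap_used=1,
                          volume_latency_ms=25))  # -> io-bound
```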
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
An operations team uses an Amazon EventBridge rule to capture CloudWatch alarm state-change events and forward them to (1) an Amazon SNS topic for on-call notifications and (2) an EventBridge API destination for a vendor incident system.
Requirements:
- The vendor system may receive only the approved fields: alarmName, newState, account, region, and runbookUrl.
- runbookUrl must be added in EventBridge before delivery, without custom code in the delivery path.
Which TWO actions should you AVOID? (Select TWO.)
Options:
A. Send the full CloudWatch alarm event JSON to the vendor API destination without transformation
B. Use an EventBridge input transformer on the SNS target to format a concise on-call message with alarmName, state, account, region, and runbookUrl
C. Create multiple EventBridge rules that match alarm name patterns, each using an input transformer to inject the appropriate runbookUrl
D. Invoke an AWS Step Functions state machine (using AWS SDK integrations) to retrieve runbookUrl, then call the vendor API destination with the reduced payload
E. Use an EventBridge input transformer on the API destination target to map only the approved fields and add a static runbookUrl value
F. Invoke a Lambda function from the rule to look up runbookUrl and forward the enriched payload
Correct answers: A and F
Explanation: EventBridge input transformers can reshape events and inject static context before sending to targets, which satisfies strict payload requirements without code. When additional context must be fetched dynamically, an enrichment pattern must still comply with operational constraints like “no custom code.” Sending unfiltered events to external systems violates least-privilege data sharing requirements.
The core concept is using EventBridge input transformers (and, when necessary, enrichment patterns) to add or shape context before delivering an event to a target. Input transformers can select specific JSON fields from the incoming CloudWatch alarm event and build a new payload, including adding constant fields like a runbookUrl, ensuring the vendor receives only the approved attributes.
Two actions are unsafe here because they break explicit requirements: forwarding the full alarm event JSON sends unapproved fields to the vendor, and invoking a custom Lambda function adds custom code to the delivery path.
If runbookUrl must be derived dynamically without Lambda, a managed enrichment workflow (for example, Step Functions with AWS SDK integrations) can retrieve context and then emit only the reduced payload.
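An input transformer for the API destination target might look like the following dict, as passed inside a put_targets target entry. The JSON paths follow the CloudWatch alarm state-change event shape; the runbook URL is a hypothetical static value, matching the "add a static runbookUrl" approach:

```python
# InputPathsMap selects fields from the incoming event; InputTemplate
# rebuilds a payload containing only the approved attributes.
input_transformer = {
    "InputPathsMap": {
        "alarm": "$.detail.alarmName",
        "state": "$.detail.state.value",
        "acct": "$.account",
        "region": "$.region",
    },
    "InputTemplate": (
        '{"alarmName": "<alarm>", "newState": "<state>", '
        '"account": "<acct>", "region": "<region>", '
        '"runbookUrl": "https://runbooks.example.com/ec2"}'  # static, assumed URL
    ),
}
# events.put_targets(Rule="alarm-changes",
#                    Targets=[{"Id": "vendor-api", "Arn": "<api-destination-arn>",
#                              "InputTransformer": input_transformer}])
```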
Either approach adds runbookUrl without custom code and keeps control in EventBridge.
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
A team uses Amazon EFS as a shared working directory for a fleet of Linux build servers. The workload performs many small, latency-sensitive file operations and has unpredictable throughput spikes during business hours. Build artifacts are rarely accessed after 45 days.
You need to optimize EFS cost and performance. Which action should you NOT take?
Options:
A. Enable a 45-day lifecycle policy to EFS Infrequent Access
B. Switch to Max I/O performance mode to minimize latency
C. Use Elastic throughput mode for unpredictable spikes
D. Use the General Purpose performance mode
Best answer: B
Explanation: For latency-sensitive workloads with many small file operations, EFS General Purpose is the appropriate performance mode. Max I/O is intended for highly parallel workloads that need higher aggregate throughput and can tolerate higher latencies. Lifecycle policies to EFS Infrequent Access and an adaptive throughput mode help reduce cost and handle variable demand without hurting steady-state operations.
The key decision is EFS performance mode selection for a workload dominated by small, latency-sensitive file operations. EFS General Purpose is designed to provide the lowest latency per operation, which fits interactive build-server file activity.
Max I/O performance mode is meant for use cases that need higher levels of parallelism and aggregate throughput (for example, large-scale shared workloads) and it typically trades off per-operation latency. Separately, throughput mode and lifecycle management address different goals: Elastic throughput automatically scales throughput for spiky demand, and lifecycle policies reduce storage cost by moving cold data to EFS Infrequent Access based on last access time. The takeaway is to avoid Max I/O when low latency is the primary requirement.
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer. During business hours, CloudWatch shows a sustained increase in traffic and rising p95 latency. The operations team wants the ASG to automatically add and remove instances to maintain steady performance without manual intervention.
Which action BEST aligns with the Well-Architected operational excellence/reliability principle of automating responses to operational events?
Options:
A. Manually increase the ASG desired capacity whenever a latency alarm enters ALARM state
B. Create a target tracking scaling policy using ALBRequestCountPerTarget to keep a steady requests-per-target value
C. Set the ASG desired capacity to the maximum value and disable scale-in to prevent future latency
D. Change to larger instance types and keep the ASG size fixed to avoid scaling complexity
Best answer: B
Explanation: A target tracking scaling policy is an automated, feedback-based control loop that continuously adjusts ASG capacity to match sustained load. Using an ALB load metric such as requests per target directly ties scaling to the workload, helping maintain consistent performance. This aligns with Well-Architected operational excellence and reliability by automating remediation instead of relying on manual actions.
The core ops principle here is automation: when load changes, the system should respond automatically and predictably to maintain service performance. ASG target tracking is designed for this outcome because it continuously evaluates a metric and adjusts capacity to keep that metric near a target, reducing human intervention and stabilizing performance during sustained demand.
In this scenario, ALBRequestCountPerTarget is a strong choice because it scales on per-target load rather than indirectly on instance resource usage.
Manual scaling or permanently overprovisioning can work temporarily, but they do not implement automated, repeatable remediation.
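The automated policy can be sketched as the parameters for autoscaling.put_scaling_policy. The ASG name, resource label, and target value are illustrative assumptions:

```python
# Target tracking policy: keep ALB requests per target near a set value.
policy_params = {
    "AutoScalingGroupName": "web-asg",              # hypothetical ASG name
    "PolicyName": "keep-requests-per-target",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
            "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/my-tg/fedcba0987654321",
        },
        "TargetValue": 1000.0,  # requests per target to maintain (example)
    },
}
# autoscaling.put_scaling_policy(**policy_params)  # requires AWS creds
```

Because the policy is a feedback loop, the ASG scales out when per-target requests exceed the target and scales back in as load subsides, with no alarm wiring to maintain manually.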
Topic: Monitoring, Logging, Analysis, Remediation, and Performance Optimization
A web application runs on a single Amazon EC2 instance (Linux) with the CloudWatch agent installed. Users report intermittent slow responses.
Exhibit: CloudWatch metrics (5-minute average during a slowdown)
| Metric | Value |
|---|---|
| CPUUtilization | 17% |
| cpu_usage_iowait | 41% |
| mem_used_percent | 58% |
| swap_used_percent | 1% |
| EBSIOBalance% | 6% |
| DiskQueueLength | 11.8 |
| VolumeReadOps | 4,900 |
| VolumeWriteOps | 5,300 |
Based on the exhibit, what is the best interpretation of the bottleneck?
Options:
A. The workload is CPU-bound
B. The workload is network-bound
C. The workload is memory-bound
D. The workload is EBS I/O-bound
Best answer: D
Explanation: The instance is spending a large portion of CPU time waiting on I/O rather than executing work. The high I/O wait and disk queue, combined with depleted EBS I/O balance, point to EBS storage as the limiting resource during the slowdown rather than CPU or memory pressure.
This pattern indicates an I/O-bound workload: the CPU isn’t saturated, but it is blocked waiting for storage operations to complete. In the exhibit, CPUUtilization: 17% is low while cpu_usage_iowait: 41% is high, which commonly happens when disk latency/throughput is the constraint. The storage side corroborates this: DiskQueueLength: 11.8 shows requests piling up, and EBSIOBalance%: 6% indicates the volume is close to exhausting its burst capability (or otherwise constrained), consistent with elevated VolumeReadOps/VolumeWriteOps. The key takeaway is that high iowait plus a growing disk queue points to EBS I/O contention rather than CPU or memory exhaustion.
A CPU-bound workload would show sustained high CPUUtilization, not 17%. Memory pressure is ruled out by mem_used_percent: 58% and swap_used_percent: 1%. The decisive signals are cpu_usage_iowait: 41% and DiskQueueLength: 11.8, tied to the EBS metrics.
Use the AWS SOA-C03 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Read the AWS SOA-C03 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.