Keep this page open while drilling. SOA-C03 rewards structured operations thinking: signal -> diagnosis -> low-risk remediation -> verification.
Quick facts (SOA-C03)
| Item | Value |
| --- | --- |
| Questions | 65 total |
| Scoring | 50 scored + 15 unscored (unscored items are not identified) |
| Question types | Multiple choice, multiple response |
| Time | 130 minutes |
| Passing score | 720 (scaled 100-1000) |
| Cost | 150 USD |
| Domains | D1 22%, D2 22%, D3 22%, D4 16%, D5 18% |
Fast strategy
- Start from the constraint in the last sentence (availability, compliance, latency, cost, operational effort).
- Prefer the smallest safe operational change that addresses root cause.
- For noisy incidents, choose approaches that improve signal quality first (better alarms, filtering, dashboards).
- For repeated incidents, prefer automation (EventBridge + Lambda/SSM runbooks).
Final 20-minute recall (exam day)
Cue -> best answer (pattern map)
| If the question says… | Usually best answer |
| --- | --- |
| Alarm fatigue / noisy incidents | Composite alarms + tuned thresholds + actionable routing |
| Repeatable remediation needed | EventBridge -> SSM Automation/Lambda runbook |
| Patch governance at scale | Systems Manager Patch Manager |
| Configuration drift detection | AWS Config rules + automatic remediation |
| Need secure shell-less instance access | Systems Manager Session Manager |
| Stack update failed | CloudFormation events + change sets + rollback analysis |
| Backup policy across accounts | AWS Backup plans/policies |
| Access denial investigation | IAM policy + resource policy + KMS key policy evaluation |
| Network reachability issue | Route table -> SG -> NACL -> endpoint/NAT path validation |
| Incident postmortem prevention | Runbook updates + alarm improvements + automation |
Must-memorize SOA defaults
| Topic | Fast recall |
| --- | --- |
| Core observability stack | CloudWatch metrics/logs, CloudTrail, X-Ray (where relevant) |
| RTO/RPO | Recovery time and data loss objectives drive backup/DR choice |
| Safe remediation order | Detect -> triage -> fix low blast radius -> verify -> automate |
| Operational preference | Managed services + automation over manual repetitive operations |
Last-minute traps
- Acting on one symptom metric without correlation to logs/traces/deploy timeline.
- Running high-risk remediation before confirming blast radius.
- Treating backups as compliant without restore testing.
- Alerting on every metric spike instead of SLO-aligned sustained conditions.
1) CloudOps incident loop
```mermaid
flowchart LR
  A[Detect signal] --> B[Triage severity]
  B --> C[Identify probable root cause]
  C --> D[Apply low-risk remediation]
  D --> E[Validate recovery]
  E --> F[Document + automate prevention]
```
Use this loop in scenario questions. Wrong answers often skip validation or choose high-blast-radius changes.
2) Monitoring and logging defaults (Domain 1)
Choose the right telemetry
| Need | Best AWS signal |
| --- | --- |
| Resource/service health trends | CloudWatch metrics |
| Application/system event detail | CloudWatch Logs |
| API-level audit trail | CloudTrail |
| Network allow/deny and flow diagnosis | VPC Flow Logs |
Alarm design defaults
- Use alarm thresholds tied to SLO/error budgets where possible.
- Use composite alarms to reduce alert noise.
- Route alerts to SNS or EventBridge for automation paths.
- Include runbook links in alarm descriptions for faster response.
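The composite-alarm default above can be sketched in plain Python. This is only an illustrative model of the boolean logic, not the CloudWatch API: the alarm names and the `should_page` helper are hypothetical. In CloudWatch itself, a comparable condition would be written as an alarm rule expression such as `ALARM("api-5xx-high") AND ALARM("api-latency-high")`.

```python
# Illustrative model of composite alarm logic (not the CloudWatch API).
# All alarm names below are hypothetical.

def should_page(states: dict) -> bool:
    """Page only when both symptom alarms fire and no maintenance window is active."""
    errors_high = states.get("api-5xx-high") == "ALARM"
    latency_high = states.get("api-latency-high") == "ALARM"
    in_maintenance = states.get("maintenance-window") == "ALARM"
    return errors_high and latency_high and not in_maintenance


# A lone latency blip should not page; correlated symptoms should.
blip = {"api-5xx-high": "OK", "api-latency-high": "ALARM", "maintenance-window": "OK"}
outage = {"api-5xx-high": "ALARM", "api-latency-high": "ALARM", "maintenance-window": "OK"}

print(should_page(blip))    # False
print(should_page(outage))  # True
```

The point of the sketch is that one page is sent for a correlated condition instead of one page per child alarm.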
Common D1 pitfalls
- Alarming on raw spikes without sustained evaluation windows.
- Not distinguishing symptom metrics from cause metrics.
- Missing CloudWatch agent configuration on EC2/ECS/EKS.
- Triggering automation without guardrails or permission checks.
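To make the "sustained evaluation window" pitfall concrete, here is a minimal sketch of "M out of N datapoints" evaluation. It is a simplified model, not CloudWatch's implementation: the function name, parameters, and sample values are assumptions, and it ignores missing-data handling.

```python
# Simplified "M out of N" alarm evaluation (illustrative only).

def breaches_sustained(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Return True when at least M of the last N datapoints exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical CPU samples: one raw spike versus a sustained breach.
spike = [42.0, 95.0, 40.0, 41.0, 43.0]
sustained = [42.0, 91.0, 93.0, 88.0, 95.0]

print(breaches_sustained(spike, 80.0, 3, 5))      # False: only 1 of 5 breaches
print(breaches_sustained(sustained, 80.0, 3, 5))  # True: 4 of 5 breach
```

With a 3-of-5 rule, the single spike never alarms, while the sustained condition does.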
3) Reliability and business continuity (Domain 2)
HA and scaling picks
| Requirement | Typical choice |
| --- | --- |
| Multi-instance failover + balancing | ELB + Auto Scaling |
| Regional DNS failover patterns | Route 53 health checks + routing policy |
| Managed DB high availability | Multi-AZ for RDS/Aurora |
| Burst read/load reduction | CloudFront or ElastiCache |
Backup/restore language you must apply correctly
- RPO: acceptable data loss window.
- RTO: acceptable restoration time.
If a question emphasizes strict RPO/RTO targets, prioritize the backup frequency and restore method that explicitly satisfy them.
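As a rough worked example of applying the two definitions, the sketch below compares a backup plan against its targets. The function name and the numbers are hypothetical, and it assumes the restore duration comes from an actual restore test rather than an estimate.

```python
from datetime import timedelta

# Rough RPO/RTO check (illustrative; all values are hypothetical).

def meets_targets(backup_interval, restore_duration, rpo, rto):
    """Compare a backup plan against its recovery objectives."""
    # Worst case, the failure happens just before the next backup runs,
    # so the data lost is roughly one full backup interval.
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": restore_duration <= rto,
    }


# Daily snapshots with a 3-hour tested restore, against RPO 1 h / RTO 4 h.
result = meets_targets(
    backup_interval=timedelta(hours=24),
    restore_duration=timedelta(hours=3),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

Here the restore is fast enough, but the backup frequency is not, so the right answer would raise backup frequency (or use continuous backup) rather than speed up the restore.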
Reliability anti-patterns
- Single-AZ for critical stateful production workloads.
- Backups with no restore test evidence.
- Scaling policies with no cooldown/health alignment.
4) Deployment, provisioning, and automation (Domain 3)
Core service map
| Need | Typical AWS answer |
| --- | --- |
| Declarative infrastructure | CloudFormation (or CDK) |
| Fleet ops and runbooks | Systems Manager |
| Event-driven operational actions | EventBridge + Lambda/SSM |
| Multi-account/region deployment sharing | StackSets / AWS RAM |
When a stack operation fails:
- Validate IAM permissions for stack actions.
- Check for resource dependency or ordering failures.
- Confirm subnet CIDR sizing and limits.
- Review the stack events for the first failing resource (not only the terminal error).
Automation rule of thumb
Automate repetitive, deterministic operations first: patching, restart/remediation runbooks, compliance drift checks, and standard incident responses.
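One way to picture the event-driven part is an EventBridge event pattern that selects CloudWatch alarm transitions into `ALARM`. The pattern below follows the shape of CloudWatch alarm state change events; the `matches` helper is a deliberately simplified, hypothetical stand-in for EventBridge's matching (exact values only), and the alarm name is made up.

```python
import json

# Event pattern for alarms entering the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern leaf lists the allowed exact values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, trimmed-down event for local experimentation.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "api-5xx-high", "state": {"value": "ALARM"}},
}

print(json.dumps(ALARM_PATTERN, indent=2))
print(matches(ALARM_PATTERN, sample_event))  # True
```

A rule with this pattern would then target a Lambda function or an SSM Automation runbook, which keeps the remediation deterministic and repeatable.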
5) Security and compliance operations (Domain 4)
High-yield controls
| Control goal | Typical services |
| --- | --- |
| Identity and least privilege | IAM, IAM Access Analyzer |
| Auditability | CloudTrail, AWS Config |
| Secrets and key management | Secrets Manager, KMS |
| Findings aggregation | Security Hub (aggregating GuardDuty and Inspector findings) |
| Encryption in transit | ACM/TLS |
Common exam patterns
- Access denied: check identity policy, resource policy, and KMS key policy.
- Compliance drift: Config rule failure -> remediation workflow.
- Multi-account controls: Organizations/SCP boundaries and delegated operations.
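The access-denied pattern can be turned into a small triage checklist. The sketch below is not the real IAM evaluation engine: `first_blocker` is a hypothetical helper, and it assumes a cross-account style request in which every listed layer (identity policy, resource policy, KMS key policy) has to grant the action. Same-account evaluation is more permissive than this model.

```python
from typing import List, Optional, Tuple

# Simplified access-denied triage (illustrative; not the IAM evaluation engine).
# Each layer is (name, effect) with effect in {"allow", "deny", "none"}.


def first_blocker(layers: List[Tuple[str, str]]) -> Optional[str]:
    """Return the first thing to investigate, or None if nothing blocks."""
    # An explicit deny in any layer wins, so look for those first.
    for name, effect in layers:
        if effect == "deny":
            return f"explicit deny in {name}"
    # Otherwise report the first layer that does not grant the action.
    for name, effect in layers:
        if effect != "allow":
            return f"no allow in {name}"
    return None


# Hypothetical case: reading an encrypted S3 object from another account.
layers = [
    ("identity policy", "allow"),
    ("bucket policy", "allow"),
    ("KMS key policy", "none"),
]
print(first_blocker(layers))  # no allow in KMS key policy
```

This mirrors the exam pattern: S3 permissions look correct, but the request still fails because the key policy never granted `kms:Decrypt`.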
6) Networking and content delivery troubleshooting (Domain 5)
VPC troubleshooting order
- Route tables
- Security groups (stateful)
- NACLs (stateless)
- Gateway/path (IGW, NAT, TGW, endpoints)
- DNS resolution (Route 53 / Resolver)
Network/data path service picks
| Need | Typical AWS answer |
| --- | --- |
| Private access to AWS services | VPC endpoints / PrivateLink |
| CDN and edge caching | CloudFront |
| Global traffic acceleration | Global Accelerator |
| Hybrid/private connectivity | Site-to-Site VPN / Transit Gateway |
Frequent D5 anti-patterns
- Allowing traffic in the SG but blocking ephemeral return traffic in the NACL.
- Assuming NAT provides inbound access.
- Treating a CloudFront cache issue as an origin outage.
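The first anti-pattern is easier to remember with a toy model of stateless NACL evaluation. This sketch is a simplification under stated assumptions: it matches on port only (no protocol or CIDR), the rule numbers are hypothetical, and 1024-65535 is used as a broad ephemeral port range.

```python
from dataclasses import dataclass
from typing import List

# Toy model of stateless NACL evaluation (port-only; illustrative).


@dataclass
class NaclRule:
    number: int
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules: List[NaclRule], port: int) -> bool:
    """Evaluate rules in ascending rule-number order; the first match decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # the implicit final rule denies anything unmatched


# An instance calls an HTTPS API; the reply comes back to an ephemeral port.
inbound = [NaclRule(100, 443, 443, True)]
reply_port = 50000
print(nacl_allows(inbound, reply_port))  # False: return traffic is dropped

# Adding an inbound rule for the ephemeral range lets the reply through.
inbound_fixed = inbound + [NaclRule(110, 1024, 65535, True)]
print(nacl_allows(inbound_fixed, reply_port))  # True
```

A security group would not need the second rule because it tracks connection state; the NACL does not, so both directions need explicit rules.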
7) Troubleshooting playbooks you can reuse
5xx spike behind load balancer
- Check target health first.
- Correlate LB access logs + target logs + alarm timeline.
- Validate autoscaling events and recent config/deploy changes.
Alarm noise flood
- Replace independent symptom alarms with composite alarm logic.
- Tune thresholds/evaluation periods from observed baseline.
- Route only actionable alerts to incident channels.
Intermittent connectivity failure
- Validate the route and SG path in both directions.
- Inspect NACL rules for stateless return traffic blocks.
- Use Reachability Analyzer and VPC Flow Logs for confirmation.
8) Cost-aware operations quick wins
- Delete idle, unattached EBS volumes and expire stale snapshots under a retention policy.
- Use lifecycle policies for S3/EFS where access patterns allow.
- Reduce NAT egress by using VPC endpoints where applicable.
- Right-size compute using utilization and recommendation signals.
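The first quick win can be sketched as a filter over volume records. The dictionaries below are shaped like entries from the EC2 `DescribeVolumes` response (`VolumeId`, `State`, `Attachments`, `CreateTime`), but the data, the volume IDs, and the 30-day threshold are all hypothetical. Creation time is only a rough proxy for idleness, so treat the output as candidates to review, not a delete list.

```python
from datetime import datetime, timedelta, timezone

# Find unattached EBS volume candidates (illustrative; sample data is made up).


def idle_volumes(volumes, now, min_age_days=30):
    """Return IDs of volumes that are unattached and older than the cutoff."""
    cutoff = now - timedelta(days=min_age_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available"        # "available" means not attached
        and not v.get("Attachments")
        and v["CreateTime"] < cutoff
    ]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
volumes = [
    {"VolumeId": "vol-a", "State": "available", "Attachments": [],
     "CreateTime": datetime(2024, 11, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-b", "State": "in-use", "Attachments": [{"InstanceId": "i-1"}],
     "CreateTime": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-c", "State": "available", "Attachments": [],
     "CreateTime": datetime(2025, 1, 20, tzinfo=timezone.utc)},
]
print(idle_volumes(volumes, now))  # ['vol-a']
```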
Next steps