SOA-C03 Cheatsheet - CloudOps Signals, Runbooks, Reliability, Security & Networking

High-signal SOA-C03 reference: monitoring/logging/remediation patterns, reliability and DR decisions, CloudFormation/SSM automation, security/compliance operations, and network troubleshooting defaults.

Keep this page open while drilling. SOA-C03 rewards structured operations thinking: signal -> diagnosis -> low-risk remediation -> verification.

Quick facts (SOA-C03)

Item	Value
Questions	65 total
Scoring	50 scored + 15 unscored (unscored items are not identified)
Question types	Multiple choice, multiple response
Time	130 minutes
Passing score	720 (scaled 100-1000)
Cost	150 USD
Domains	D1 22% - D2 22% - D3 22% - D4 16% - D5 18%

Fast strategy

Start from the constraint stated in the last sentence of the question (availability, compliance, latency, cost, operational effort).
Prefer the smallest safe operational change that addresses root cause.
For noisy incidents, choose approaches that improve signal quality first (better alarms, filtering, dashboards).
For repeated incidents, prefer automation (EventBridge + Lambda/SSM runbooks).

Final 20-minute recall (exam day)

Cue -> best answer (pattern map)

If the question says…	Usually best answer
Alarm fatigue / noisy incidents	Composite alarms + tuned thresholds + actionable routing
Repeatable remediation needed	EventBridge -> SSM Automation/Lambda runbook
Patch governance at scale	Systems Manager Patch Manager
Configuration drift detection	AWS Config rules + automatic remediation
Need secure shell-less instance access	Systems Manager Session Manager
Stack update failed	CloudFormation events + change sets + rollback analysis
Backup policy across accounts	AWS Backup plans/policies
Access denial investigation	IAM policy + resource policy + KMS key policy evaluation
Network reachability issue	Route table -> SG -> NACL -> endpoint/NAT path validation
Incident postmortem prevention	Runbook updates + alarm improvements + automation

Must-memorize SOA defaults

Topic	Fast recall
Core observability stack	CloudWatch metrics/logs, CloudTrail, X-Ray (where relevant)
RTO/RPO	Recovery time and data loss objectives drive backup/DR choice
Safe remediation order	Detect -> triage -> fix low blast radius -> verify -> automate
Operational preference	Managed services + automation over manual repetitive operations

Last-minute traps

Acting on one symptom metric without correlation to logs/traces/deploy timeline.
Running high-risk remediation before confirming blast radius.
Treating backups as compliant without restore testing.
Alerting on every metric spike instead of SLO-aligned sustained conditions.

1) CloudOps incident loop

    flowchart LR
      A[Detect signal] --> B[Triage severity]
      B --> C[Identify probable root cause]
      C --> D[Apply low-risk remediation]
      D --> E[Validate recovery]
      E --> F[Document + automate prevention]

Use this loop in scenario questions. Wrong answers often skip validation or choose high-blast-radius changes.

2) Monitoring and logging defaults (Domain 1)

Choose the right telemetry

Need	Best AWS signal
Resource/service health trends	CloudWatch metrics
Application/system event detail	CloudWatch Logs
API-level audit trail	CloudTrail
Network allow/deny and flow diagnosis	VPC Flow Logs

Alarm design defaults

Use alarm thresholds tied to SLO/error budgets where possible.
Use composite alarms to reduce alert noise.
Route alerts to SNS or EventBridge for automation paths.
Include runbook links in alarm descriptions for faster response.
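
The alarm defaults above can be sketched as plain parameter dictionaries. This is a minimal illustration, not a definitive setup: the keys mirror the keyword arguments of boto3's `put_metric_alarm` and `put_composite_alarm`, but the alarm names, thresholds, runbook URL, and SNS topic ARN are all hypothetical, and no AWS call is made.

```python
def build_alarm_definitions(runbook_url, sns_topic_arn):
    """Return (metric_alarms, composite_alarm) as plain keyword-argument dicts.

    Assumption: names and thresholds are illustrative only. Dimensions
    (for example the LoadBalancer dimension) are omitted for brevity.
    """
    sustained = {
        "Namespace": "AWS/ApplicationELB",
        "Period": 60,             # seconds per datapoint
        "EvaluationPeriods": 5,   # evaluate the last 5 datapoints
        "DatapointsToAlarm": 3,   # 3 of 5 must breach (sustained, not a spike)
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
    errors = dict(
        sustained,
        AlarmName="demo-target-5xx",
        MetricName="HTTPCode_Target_5XX_Count",
        Statistic="Sum",
        Threshold=25,
    )
    latency = dict(
        sustained,
        AlarmName="demo-target-latency",
        MetricName="TargetResponseTime",
        Statistic="Average",
        Threshold=1.5,
    )
    # Only the composite alarm notifies, which keeps paging tied to user impact.
    composite = {
        "AlarmName": "demo-user-impact",
        "AlarmRule": 'ALARM("demo-target-5xx") AND ALARM("demo-target-latency")',
        "AlarmDescription": f"Errors and latency both elevated. Runbook: {runbook_url}",
        "AlarmActions": [sns_topic_arn],
    }
    return [errors, latency], composite
```

In a real account each dict would be passed as `**kwargs` to the matching CloudWatch client call. The child alarms carry no actions, so only the composite condition reaches the on-call channel.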

Common D1 pitfalls

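Alarming on raw spikes without sustained evaluation windows.
No distinction between symptom metrics and cause metrics.
Missing CloudWatch agent config on EC2/ECS/EKS (for example, EC2 memory and disk metrics are only published once the agent is installed and configured).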
Automation triggers without guardrails/permissions checks.
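
To see why a sustained evaluation window matters, here is a simplified "M out of N" evaluation in plain Python. It is a sketch of the idea only: the function name `alarm_state` is hypothetical, and it ignores missing data and the INSUFFICIENT_DATA state that real CloudWatch alarms handle.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return "ALARM" if enough of the most recent datapoints breach the threshold.

    Simplified model: looks only at the last `evaluation_periods` values and
    counts how many are strictly greater than `threshold`.
    """
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# One isolated CPU spike versus a sustained breach (values are made up).
single_spike = [10, 10, 95, 10, 10]
sustained = [10, 85, 90, 95, 10]

spike_result = alarm_state(single_spike, threshold=80, evaluation_periods=5, datapoints_to_alarm=3)
sustained_result = alarm_state(sustained, threshold=80, evaluation_periods=5, datapoints_to_alarm=3)
```

With a 3-of-5 requirement the isolated spike stays quiet while the sustained breach alarms. A 1-of-1 configuration would page on both.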

3) Reliability and business continuity (Domain 2)

HA and scaling picks

Requirement	Typical choice
Multi-instance failover + balancing	ELB + Auto Scaling
Regional DNS failover patterns	Route 53 health checks + routing policy
Managed DB high availability	Multi-AZ for RDS/Aurora
Burst read/load reduction	CloudFront or ElastiCache

Backup/restore language you must apply correctly

RPO: acceptable data loss window.
RTO: acceptable restoration time.

If a question emphasizes strict RPO/RTO, prioritize the restore method and backup frequency that explicitly satisfy those targets.
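
A rough way to apply these two definitions is to compare each candidate plan against the targets. The sketch below assumes that worst-case data loss is approximately the interval between recovery points and that restore time comes from an actual restore test. The plan names and durations are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    name: str
    backup_interval: timedelta        # time between recovery points
    measured_restore_time: timedelta  # taken from a restore test, not a guess


def meets_targets(plan, rpo, rto):
    """A plan qualifies only if it satisfies both objectives."""
    return plan.backup_interval <= rpo and plan.measured_restore_time <= rto


plans = [
    RecoveryPlan("nightly-snapshot", timedelta(hours=24), timedelta(hours=2)),
    RecoveryPlan("hourly-snapshot", timedelta(hours=1), timedelta(hours=2)),
    RecoveryPlan("continuous-replica", timedelta(minutes=5), timedelta(minutes=10)),
]

rpo, rto = timedelta(hours=1), timedelta(hours=4)
qualifying = [p.name for p in plans if meets_targets(p, rpo, rto)]
```

The nightly plan fails on RPO even though its restore time is fine, which is the usual shape of a wrong answer choice.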

Reliability anti-patterns

Single-AZ for critical stateful production workloads.
Backups with no restore test evidence.
Scaling policies with no cooldown/health alignment.

4) Deployment, provisioning, and automation (Domain 3)

Core service map

Need	Typical AWS answer
Declarative infrastructure	CloudFormation (or CDK)
Fleet ops and runbooks	Systems Manager
Event-driven operational actions	EventBridge + Lambda/SSM
Multi-account/region deployment sharing	StackSets / AWS RAM

CloudFormation troubleshooting checklist

Validate IAM permissions for stack actions.
Check resource dependency/order failures.
Confirm subnet CIDR sizing and service quotas (limits).
Review event log for first failing resource (not only terminal error).
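
The last checklist item can be sketched as a small filter over stack events. The event dictionaries below use field names that appear in CloudFormation `DescribeStackEvents` output (`Timestamp`, `LogicalResourceId`, `ResourceStatus`, `ResourceStatusReason`), but the resources and reasons are invented sample data and the helper name `first_failure` is hypothetical.

```python
from datetime import datetime, timezone


def first_failure(events):
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in events if e["ResourceStatus"].endswith("_FAILED")]
    if not failed:
        return None
    return min(failed, key=lambda e: e["Timestamp"])


def ts(minute, second):
    return datetime(2024, 1, 1, 12, minute, second, tzinfo=timezone.utc)


# Sample events, newest first, as the API typically returns them.
sample_events = [
    {"Timestamp": ts(0, 9), "LogicalResourceId": "demo-stack",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS",
     "ResourceStatusReason": "The following resource(s) failed to create"},
    {"Timestamp": ts(0, 8), "LogicalResourceId": "AppInstance",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": ts(0, 7), "LogicalResourceId": "AppSubnet",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Illustrative failure reason for the real root cause"},
    {"Timestamp": ts(0, 1), "LogicalResourceId": "AppVpc",
     "ResourceStatus": "CREATE_COMPLETE", "ResourceStatusReason": ""},
]

root = first_failure(sample_events)
```

Here `AppInstance` also reports `CREATE_FAILED`, but only because its creation was cancelled after `AppSubnet` failed. Sorting by timestamp points at the resource worth investigating.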

Automation rule of thumb

Automate repetitive, deterministic operations first: patching, restart/remediation runbooks, compliance drift checks, and standard incident responses.
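
As an illustration of the event-driven side, the sketch below defines an EventBridge event pattern for CloudWatch alarm state changes and a tiny local matcher for testing it. The matcher is a hypothetical helper that supports only exact-value lists and nested objects, a small subset of real EventBridge pattern semantics. The alarm name in the sample event is made up.

```python
import json

# Pattern for alarms entering the ALARM state. Serialized with json.dumps,
# this is the kind of string an EventBridge rule takes as its event pattern.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern, event):
    """Minimal matcher: every pattern key must exist and match in the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "demo-user-impact", "state": {"value": "ALARM"}},
}
recovered_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "demo-user-impact", "state": {"value": "OK"}},
}

pattern_json = json.dumps(ALARM_PATTERN)
```

Filtering on the `ALARM` state keeps the remediation target (a Lambda function or SSM Automation runbook) from firing again when the alarm recovers.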

5) Security and compliance operations (Domain 4)

High-yield controls

Control goal	Typical services
Identity and least privilege	IAM, IAM Access Analyzer
Auditability	CloudTrail, AWS Config
Secrets and key management	Secrets Manager, KMS
Findings aggregation	Security Hub, GuardDuty, Inspector
Encryption in transit	ACM/TLS

Common exam patterns

Access denied: check identity policy, resource policy, and KMS key policy.
Compliance drift: Config rule failure -> remediation workflow.
Multi-account controls: Organizations/SCP boundaries and delegated operations.
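
For access-denied questions, it can help to reason through the layers in a fixed order. The function below is a deliberately simplified, same-account model: an explicit Deny anywhere wins, the SCP must allow, a KMS key policy (when a key is involved) must allow, and then either the identity or the resource policy must allow. It ignores permissions boundaries, session policies, and cross-account rules, and it treats a key-policy "Allow" as covering both direct grants and delegation to IAM. The function name and return strings are hypothetical.

```python
def evaluate_access(scp, identity, resource, kms_key=None):
    """Each argument is a list of effects ("Allow"/"Deny") from matching statements.

    Pass kms_key=None when no KMS key is involved in the request.
    """
    layers = {"SCP": scp, "identity policy": identity, "resource policy": resource}
    if kms_key is not None:
        layers["KMS key policy"] = kms_key

    # 1. An explicit Deny in any layer overrides every Allow.
    for name, effects in layers.items():
        if "Deny" in effects:
            return f"explicit deny in {name}"

    # 2. Guardrail layers must each contain an Allow.
    if "Allow" not in scp:
        return "implicit deny: SCP does not allow the action"
    if kms_key is not None and "Allow" not in kms_key:
        return "implicit deny: KMS key policy does not allow the principal"

    # 3. Same-account simplification: identity OR resource policy may grant.
    if "Allow" not in identity and "Allow" not in resource:
        return "implicit deny: no identity or resource policy allows the action"
    return "allow"
```

A common scenario this models: the IAM policy grants `s3:GetObject`, yet the request fails because the object is encrypted with a KMS key whose key policy does not cover the caller, that is, `evaluate_access(["Allow"], ["Allow"], [], kms_key=[])`.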

6) Networking and content delivery troubleshooting (Domain 5)

VPC troubleshooting order

Route tables
Security groups (stateful)
NACLs (stateless)
Gateway/path (IGW, NAT, TGW, endpoints)
DNS resolution (Route 53 / Resolver)
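
The first step in this order, the route table check, comes down to which route matches the destination. VPC route tables pick the most specific matching route (longest prefix), which the sketch below models with the standard `ipaddress` module. The route targets are hypothetical labels, not real resource IDs.

```python
import ipaddress


def pick_route(routes, destination_ip):
    """Return the target of the most specific route covering destination_ip.

    `routes` is a list of (cidr, target) pairs. Returns None when no route matches.
    """
    ip = ipaddress.ip_address(destination_ip)
    candidates = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if ip in network:
            candidates.append((network.prefixlen, target))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


private_subnet_routes = [
    ("10.0.0.0/16", "local"),
    ("10.1.0.0/16", "transit-gateway"),
    ("0.0.0.0/0", "nat-gateway"),
]
isolated_subnet_routes = [("10.0.0.0/16", "local")]
```

If `pick_route(isolated_subnet_routes, "8.8.8.8")` returns `None`, the traffic has no path out, and no security group or NACL change will fix it.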

Network/data path service picks

Need	Typical AWS answer
Private access to AWS services	VPC endpoints / PrivateLink
CDN and edge caching	CloudFront
Global traffic acceleration	Global Accelerator
Hybrid/private connectivity	Site-to-Site VPN / Direct Connect / Transit Gateway

Frequent D5 anti-patterns

Allowing traffic in the SG but blocking ephemeral-port return traffic in the NACL.
Assuming NAT provides inbound access.
CloudFront cache issue treated as origin outage.
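
The first anti-pattern is easier to spot with a small model of NACL evaluation. NACL rules are evaluated in ascending rule-number order, the first match decides, and anything unmatched is denied. The sketch below matches on port only (protocol and CIDR are ignored for brevity), and the rule sets are hypothetical.

```python
def nacl_allows(rules, port):
    """Evaluate rules lowest number first. The first matching rule decides."""
    for rule in sorted(rules, key=lambda r: r["number"]):
        if rule["from_port"] <= port <= rule["to_port"]:
            return rule["action"] == "allow"
    return False  # implicit deny, like the final "*" rule


# A web server subnet: clients connect to 443, replies go to ephemeral ports.
inbound = [{"number": 100, "from_port": 443, "to_port": 443, "action": "allow"}]

# Broken: outbound mirrors the inbound rule, so replies to ephemeral ports are dropped.
outbound_broken = [{"number": 100, "from_port": 443, "to_port": 443, "action": "allow"}]

# Fixed: outbound allows the ephemeral range used for return traffic.
outbound_fixed = [{"number": 100, "from_port": 1024, "to_port": 65535, "action": "allow"}]

client_ephemeral_port = 50000
request_allowed = nacl_allows(inbound, 443)
reply_allowed_broken = nacl_allows(outbound_broken, client_ephemeral_port)
reply_allowed_fixed = nacl_allows(outbound_fixed, client_ephemeral_port)
```

Security groups are stateful and would allow the reply automatically. NACLs are stateless, so each direction needs its own rule.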

7) Troubleshooting playbooks you can reuse

5xx spike behind load balancer

Check target health first.
Correlate LB access logs + target logs + alarm timeline.
Validate autoscaling events and recent config/deploy changes.

Alarm noise flood

Replace independent symptom alarms with composite alarm logic.
Tune thresholds/evaluation periods from observed baseline.
Route only actionable alerts to incident channels.

Intermittent connectivity failure

Validate the route and SG path in both directions.
Inspect NACL rules for stateless return traffic blocks.
Use Reachability Analyzer and VPC Flow Logs for confirmation.

8) Cost-aware operations quick wins

Delete idle, unattached EBS volumes, and expire stale snapshots under a retention policy.
Use lifecycle policies for S3/EFS where access patterns allow.
Reduce NAT egress by using VPC endpoints where applicable.
Right-size compute using utilization and recommendation signals.

Next steps

Follow the Syllabus task-by-task.
Drill weak tasks in Practice.
Use the Study Plan if you want a timeline.