SOA-C03 Cheatsheet - CloudOps Signals, Runbooks, Reliability, Security & Networking

High-signal SOA-C03 reference: monitoring/logging/remediation patterns, reliability and DR decisions, CloudFormation/SSM automation, security/compliance operations, and network troubleshooting defaults.

Keep this page open while drilling. SOA-C03 rewards structured operations thinking: signal -> diagnosis -> low-risk remediation -> verification.


Quick facts (SOA-C03)

Item | Value
Questions | 65 total
Scoring | 50 scored + 15 unscored (unscored items are not identified)
Question types | Multiple choice, multiple response
Time | 130 minutes
Passing score | 720 (scaled 100-1000)
Cost | 150 USD
Domains | D1 22% - D2 22% - D3 22% - D4 16% - D5 18%

Fast strategy

  • Start from the constraint in the last sentence (availability, compliance, latency, cost, operational effort).
  • Prefer the smallest safe operational change that addresses root cause.
  • For noisy incidents, choose approaches that improve signal quality first (better alarms, filtering, dashboards).
  • For repeated incidents, prefer automation (EventBridge + Lambda/SSM runbooks).

Final 20-minute recall (exam day)

Cue -> best answer (pattern map)

If the question says… | Usually best answer
Alarm fatigue / noisy incidents | Composite alarms + tuned thresholds + actionable routing
Repeatable remediation needed | EventBridge -> SSM Automation/Lambda runbook
Patch governance at scale | Systems Manager Patch Manager
Configuration drift detection | AWS Config rules + automatic remediation
Need secure shell-less instance access | Systems Manager Session Manager
Stack update failed | CloudFormation events + change sets + rollback analysis
Backup policy across accounts | AWS Backup plans/policies
Access denial investigation | IAM policy + resource policy + KMS key policy evaluation
Network reachability issue | Route table -> SG -> NACL -> endpoint/NAT path validation
Incident postmortem prevention | Runbook updates + alarm improvements + automation

Must-memorize SOA defaults

Topic | Fast recall
Core observability stack | CloudWatch metrics/logs, CloudTrail, X-Ray (where relevant)
RTO/RPO | Recovery time and data loss objectives drive backup/DR choice
Safe remediation order | Detect -> triage -> fix low blast radius -> verify -> automate
Operational preference | Managed services + automation over manual repetitive operations

Last-minute traps

  • Acting on one symptom metric without correlation to logs/traces/deploy timeline.
  • Running high-risk remediation before confirming blast radius.
  • Treating backups as compliant without restore testing.
  • Alerting on every metric spike instead of SLO-aligned sustained conditions.

1) CloudOps incident loop

    flowchart LR
      A[Detect signal] --> B[Triage severity]
      B --> C[Identify probable root cause]
      C --> D[Apply low-risk remediation]
      D --> E[Validate recovery]
      E --> F[Document + automate prevention]

Use this loop in scenario questions. Wrong answers often skip validation or choose high-blast-radius changes.


2) Monitoring and logging defaults (Domain 1)

Choose the right telemetry

Need | Best AWS signal
Resource/service health trends | CloudWatch metrics
Application/system event detail | CloudWatch Logs
API-level audit trail | CloudTrail
Network allow/deny and flow diagnosis | VPC Flow Logs

Alarm design defaults

  • Use alarm thresholds tied to SLO/error budgets where possible.
  • Use composite alarms to reduce alert noise.
  • Route alerts to SNS or EventBridge for automation paths.
  • Include runbook links in alarm descriptions for faster response.
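
A minimal sketch of how the composite-alarm and runbook-link defaults could be combined. It only builds the keyword arguments that boto3's `put_composite_alarm` accepts and makes no AWS call. The alarm names, SNS topic ARN, and runbook URL are hypothetical placeholders.

```python
def composite_alarm_params(name, child_alarms, sns_topic_arn, runbook_url):
    """Build kwargs for CloudWatch put_composite_alarm (no AWS call is made here)."""
    if not child_alarms:
        raise ValueError("at least one child alarm is required")
    # Fire only when ALL child alarms are in ALARM, which cuts single-symptom noise.
    rule = " AND ".join(f'ALARM("{child}")' for child in child_alarms)
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        "AlarmDescription": f"Runbook: {runbook_url}",  # runbook link for responders
        "AlarmActions": [sns_topic_arn],                # route to SNS for paging/automation
    }


params = composite_alarm_params(
    name="checkout-degraded",                                      # hypothetical
    child_alarms=["alb-5xx-high", "p99-latency-high"],             # hypothetical
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:ops-pager",  # placeholder
    runbook_url="https://wiki.example.com/runbooks/checkout",      # placeholder
)
print(params["AlarmRule"])  # ALARM("alb-5xx-high") AND ALARM("p99-latency-high")
```

With credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_composite_alarm(**params)`.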

Common D1 pitfalls

  • Alarm on raw spikes without sustained evaluation windows.
  • No distinction between symptom metrics and cause metrics.
  • Missing CloudWatch agent config on EC2/ECS/EKS.
  • Automation triggers without guardrails/permissions checks.
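
The first pitfall is easier to see with numbers. The sketch below imitates the "M out of N datapoints" idea behind CloudWatch's Datapoints to Alarm and Evaluation Periods settings. It is a simplified model: real CloudWatch evaluation also has missing-data handling, which this ignores, and the sample values are made up.

```python
def breaches_sustained(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Return True if at least M of the last N datapoints exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False  # not enough data yet; this sketch treats that as not alarming
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


cpu_spike = [40, 42, 95, 41, 43]      # one transient spike (illustrative values)
cpu_sustained = [40, 85, 91, 88, 93]  # sustained breach (illustrative values)

print(breaches_sustained(cpu_spike, 80, 3, 5))      # False
print(breaches_sustained(cpu_sustained, 80, 3, 5))  # True
```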

3) Reliability and business continuity (Domain 2)

HA and scaling picks

Requirement | Typical choice
Multi-instance failover + balancing | ELB + Auto Scaling
Regional DNS failover patterns | Route 53 health checks + routing policy
Managed DB high availability | Multi-AZ for RDS/Aurora
Burst read/load reduction | CloudFront or ElastiCache

Backup/restore language you must apply correctly

  • RPO: acceptable data loss window.
  • RTO: acceptable restoration time.

If a question emphasizes strict RPO/RTO, prioritize the restore method and backup frequency that explicitly satisfy those targets.
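
A small worked check of the RPO/RTO reasoning, assuming the worst-case data loss equals one full backup interval and recovery time equals the measured restore duration. The function name and the sample numbers are illustrative only.

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_duration, rpo, rto):
    """Compare a backup schedule and restore time against RPO/RTO targets."""
    return {
        "rpo_ok": backup_interval <= rpo,   # worst-case loss: one full interval
        "rto_ok": restore_duration <= rto,  # time needed to be back in service
    }


result = meets_objectives(
    backup_interval=timedelta(hours=24),  # daily backups
    restore_duration=timedelta(hours=2),  # measured in a restore test
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)  # {'rpo_ok': False, 'rto_ok': True}
```

Here daily backups miss a 1-hour RPO even though the restore fits the RTO, so the fix is more frequent backups or continuous replication, not a faster restore.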

Reliability anti-patterns

  • Single-AZ for critical stateful production workloads.
  • Backups with no restore test evidence.
  • Scaling policies with no cooldown/health alignment.

4) Deployment, provisioning, and automation (Domain 3)

Core service map

Need | Typical AWS answer
Declarative infrastructure | CloudFormation (or CDK)
Fleet ops and runbooks | Systems Manager
Event-driven operational actions | EventBridge + Lambda/SSM
Multi-account/region deployment sharing | StackSets / AWS RAM

CloudFormation troubleshooting checklist

  1. Validate IAM permissions for stack actions.
  2. Check resource dependency/order failures.
  3. Confirm service quotas and sizing limits (for example, available IP addresses in the subnet CIDR).
  4. Review event log for first failing resource (not only terminal error).
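
Step 4 can be sketched as a filter over stack events. The dictionaries below mimic the shape of entries returned by `describe_stack_events`, but the resource names, timestamps, and reason text are made-up illustrations, not real output.

```python
from datetime import datetime


def first_failure(stack_events):
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in stack_events if e["ResourceStatus"].endswith("_FAILED")]
    return min(failed, key=lambda e: e["Timestamp"]) if failed else None


# Hypothetical events, newest first.
events = [
    {"Timestamp": datetime(2024, 1, 1, 12, 0, 9), "LogicalResourceId": "AppStack",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS", "ResourceStatusReason": "(terminal summary)"},
    {"Timestamp": datetime(2024, 1, 1, 12, 0, 8), "LogicalResourceId": "WebServer",
     "ResourceStatus": "CREATE_FAILED", "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": datetime(2024, 1, 1, 12, 0, 7), "LogicalResourceId": "AppBucket",
     "ResourceStatus": "CREATE_FAILED", "ResourceStatusReason": "(illustrative root-cause text)"},
    {"Timestamp": datetime(2024, 1, 1, 12, 0, 1), "LogicalResourceId": "AppBucket",
     "ResourceStatus": "CREATE_IN_PROGRESS", "ResourceStatusReason": ""},
]

root = first_failure(events)
print(root["LogicalResourceId"])  # AppBucket
```

Later failures such as the cancelled `WebServer` are usually side effects, so the earliest `_FAILED` event is the one to read first.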

Automation rule of thumb

Automate repetitive, deterministic operations first: patching, restart/remediation runbooks, compliance drift checks, and standard incident responses.
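
As one illustration of an event-driven trigger, the sketch below defines an EventBridge-style event pattern for stopped EC2 instances and a deliberately simplified matcher. The matcher covers only nested keys and lists of exact values, not the full EventBridge pattern language, and the instance ID is a placeholder.

```python
import json


def matches(pattern, event):
    """Simplified subset of EventBridge matching: nested dicts plus lists of allowed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},  # placeholder ID
}

print(matches(pattern, event))        # True
print(json.dumps(pattern, indent=2))  # JSON form usable as a rule's event pattern
```

A rule with this pattern could target an SSM Automation runbook or a Lambda function. Keeping the pattern narrow is part of the guardrail.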


5) Security and compliance operations (Domain 4)

High-yield controls

Control goal | Typical services
Identity and least privilege | IAM, IAM Access Analyzer
Auditability | CloudTrail, AWS Config
Secrets and key management | Secrets Manager, KMS
Findings aggregation | Security Hub, GuardDuty, Inspector
Encryption in transit | ACM/TLS

Common exam patterns

  • Access denied: check identity policy, resource policy, and KMS key policy.
  • Compliance drift: Config rule failure -> remediation workflow.
  • Multi-account controls: Organizations/SCP boundaries and delegated operations.
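
The access-denied pattern can be drilled as an ordered triage. The function below is a simplified same-account model, not the full IAM evaluation logic: it ignores permissions boundaries, session policies, and cross-account rules, and its name and return strings are invented for illustration.

```python
def first_blocker(explicit_deny, scp_allows, identity_allows, resource_allows,
                  kms_key_allows=True):
    """Return the first layer to investigate, or None if this model finds no blocker."""
    if explicit_deny:
        return "explicit deny"       # an explicit Deny overrides any Allow
    if not scp_allows:
        return "scp"                 # Organizations SCP does not allow the action
    if not (identity_allows or resource_allows):
        return "implicit deny"       # no Allow in the identity or resource policy
    if not kms_key_allows:
        return "kms key policy"      # object access is fine, but the key cannot be used
    return None


# Example: an S3 read where IAM allows the action but the KMS key policy does not.
print(first_blocker(explicit_deny=False, scp_allows=True,
                    identity_allows=True, resource_allows=False,
                    kms_key_allows=False))  # kms key policy
```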

6) Networking and content delivery troubleshooting (Domain 5)

VPC troubleshooting order

  1. Route tables
  2. Security groups (stateful)
  3. NACLs (stateless)
  4. Gateway/path (IGW, NAT, TGW, endpoints)
  5. DNS resolution (Route 53 / Resolver)

Network/data path service picks

Need | Typical AWS answer
Private access to AWS services | VPC endpoints / PrivateLink
CDN and edge caching | CloudFront
Global traffic acceleration | Global Accelerator
Hybrid/private connectivity | Site-to-Site VPN / Transit Gateway

Frequent D5 anti-patterns

  • Allowing traffic in the SG but blocking ephemeral-port return traffic in the NACL.
  • Assuming NAT provides inbound access.
  • CloudFront cache issue treated as origin outage.

7) Troubleshooting playbooks you can reuse

5xx spike behind load balancer

  • Check target health first.
  • Correlate LB access logs + target logs + alarm timeline.
  • Validate autoscaling events and recent config/deploy changes.

Alarm noise flood

  • Replace independent symptom alarms with composite alarm logic.
  • Tune thresholds/evaluation periods from observed baseline.
  • Route only actionable alerts to incident channels.

Intermittent connectivity failure

  • Validate routes and SG rules in both directions.
  • Inspect NACL rules for stateless return traffic blocks.
  • Use Reachability Analyzer and VPC Flow Logs for confirmation.
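
For the Flow Logs confirmation step, a minimal parser for the default (version 2) record format is sketched below. It assumes space-separated records with the 14 default fields in default order. Custom log formats would need a different field list, and the two sample records are illustrative.

```python
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def parse_flow_record(line):
    """Split one default-format flow log record into a field dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected(lines):
    """Return only the records whose action is REJECT."""
    return [r for r in map(parse_flow_record, lines) if r["action"] == "REJECT"]


sample = [  # illustrative records in the default format
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]

for record in rejected(sample):
    print(record["srcaddr"], "->", record["dstaddr"], "port", record["dstport"])
```

A REJECT points at a security group or NACL rule. If the expected traffic never appears at all, the problem is more likely routing.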

8) Cost-aware operations quick wins

  • Delete unattached EBS volumes and expire stale snapshots under a retention policy.
  • Use lifecycle policies for S3/EFS where access patterns allow.
  • Reduce NAT egress by using VPC endpoints where applicable.
  • Right-size compute using utilization and recommendation signals.
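
The first quick win could be sketched as a filter over volume descriptions. The dictionaries mimic the shape of entries in a `describe_volumes` response, with made-up IDs and sizes. A volume in the `available` state is not attached to any instance.

```python
def unattached_volumes(volumes):
    """Return (volume IDs, total GiB) for volumes not attached to any instance."""
    idle = [v for v in volumes if v["State"] == "available"]
    return [v["VolumeId"] for v in idle], sum(v["Size"] for v in idle)


volumes = [  # hypothetical data
    {"VolumeId": "vol-0aaa1111bbbb2222c", "State": "in-use", "Size": 100},
    {"VolumeId": "vol-0ddd3333eeee4444f", "State": "available", "Size": 500},
    {"VolumeId": "vol-0abc5555def666677", "State": "available", "Size": 50},
]

ids, total_gib = unattached_volumes(volumes)
print(ids, total_gib)  # ['vol-0ddd3333eeee4444f', 'vol-0abc5555def666677'] 550
```

Review the list (and snapshot if needed) before deleting anything. Deletion is the high-blast-radius step, so it belongs behind an approval or retention rule.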

Next steps