Use this syllabus as your source of truth for SOA-C03. Work through each domain in order and drill focused sets after every task.
What’s covered
Content Domain 1: Monitoring, Logging, Analysis, Remediation, and Performance Optimization (22%)
Task 1.1 - Implement metrics, alarms, and filters by using AWS monitoring and logging services
- Differentiate CloudWatch metrics, logs, and alarms and choose the right signal type for a given operational requirement.
- Configure CloudWatch metric collection and CloudWatch Logs for common AWS services and workloads.
- Configure CloudTrail to record management events and deliver logs to Amazon S3 and/or CloudWatch Logs for auditing and investigations.
- Use CloudWatch Logs Insights to filter, aggregate, and analyze operational logs to answer specific troubleshooting questions.
- Integrate CloudWatch with Amazon Managed Service for Prometheus and Amazon Managed Grafana at a high level to monitor containerized workloads.
- Configure and manage the CloudWatch agent on EC2 to collect OS-level metrics (CPU, memory, disk) and application log files.
- Configure the CloudWatch agent for ECS/EKS environments at a high level to collect container logs and metrics.
- Troubleshoot missing CloudWatch agent metrics or logs by validating IAM permissions, network reachability, and agent configuration.
- Create CloudWatch metric alarms with appropriate thresholds, evaluation periods, and actions to detect unhealthy conditions.
- Configure composite alarms to reduce alert fatigue and represent dependent conditions across multiple metrics.
- Route alarm events through EventBridge to trigger automation targets (for example, Lambda or Systems Manager) at a high level.
- Troubleshoot alarm behavior (flapping, INSUFFICIENT_DATA, missing datapoints) by adjusting metric selection and alarm settings.
- Create CloudWatch dashboards that summarize key service metrics and alarm states for a workload or fleet.
- Configure cross-account and cross-Region observability dashboards for centralized operations monitoring.
- Configure Amazon SNS topics and subscriptions for alert notifications and connect CloudWatch alarms to SNS.
Task 1.2 - Identify and remediate issues by using monitoring and availability metrics
- Analyze CloudWatch metrics and logs to identify performance degradation and availability issues in AWS workloads.
- Correlate operational symptoms with recent configuration or deployment changes by using CloudTrail and resource history.
- Select an automated remediation approach (Auto Scaling, Lambda, Systems Manager Automation) based on the incident type and blast radius.
- Implement CloudWatch alarm-driven remediation by invoking Systems Manager Automation runbooks.
- Implement CloudWatch alarm-driven remediation by invoking Lambda functions at a high level.
- Configure scaling policies (target tracking or step scaling) to remediate sustained load and maintain service performance.
- Configure notification workflows for incidents using AWS User Notifications and/or SNS at a high level.
- Configure EventBridge rules to route service events (for example, EC2 state changes) to operational targets.
- Use EventBridge input transformers or enrichment patterns at a high level to add context before delivering events to targets.
- Configure EventBridge targets with retry and dead-letter queue settings to improve operational reliability.
- Troubleshoot EventBridge rules that do not trigger by validating event patterns, permissions, and bus/Region selection.
- Run prebuilt Systems Manager Automation runbooks to perform common operational tasks (restart, recover, remediate) safely.
- Create custom Systems Manager Automation documents that call AWS APIs or scripts to automate repeatable operations at a high level.
- Troubleshoot Automation execution failures by validating assume-role permissions, parameters, and resource preconditions.
Task 1.3 - Implement performance optimization strategies for compute, storage, and database resources
- Use performance metrics and resource tags to identify hotspots and bottlenecks across compute resources.
- Interpret AWS Compute Optimizer recommendations and choose right-sizing actions while managing risk and change windows.
- Tune Auto Scaling behavior and instance health checks to meet latency and throughput targets during scaling events.
- Diagnose whether a workload is CPU-bound, memory-bound, or IO-bound using CloudWatch and OS metrics.
- Interpret key EBS performance metrics (queue length, throughput, IOPS, burst balance) to diagnose volume constraints.
- Select an appropriate EBS volume type (for example, gp3, io2, st1) based on performance and cost requirements.
- Modify EBS volume size, IOPS, and throughput safely to improve performance without unnecessary downtime.
- Troubleshoot EBS performance issues caused by instance limits, mis-sized volumes, or suboptimal configuration.
- Optimize S3 upload and download performance using multipart uploads, concurrency, and request patterns.
- Choose between S3 Transfer Acceleration, DataSync, and standard transfers based on data location and throughput requirements.
- Apply S3 lifecycle policies and storage class choices to align performance, retrieval patterns, and cost objectives.
- Select shared storage solutions (EFS vs FSx variants) based on protocol needs (NFS/SMB) and performance characteristics.
- Configure EFS performance and throughput modes and lifecycle policies to optimize cost and performance for a workload.
- Monitor RDS performance using Performance Insights and CloudWatch alarms and identify likely bottleneck categories.
- Select remediation actions for RDS performance issues (instance class, storage, parameter group, connection management).
- Use RDS Proxy at a high level to improve database connection efficiency for bursty application workloads.
- Use EC2 placement groups appropriately (cluster, spread, partition) to improve performance or resilience for a given workload.
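The alarm bullets in this domain (thresholds, evaluation periods, INSUFFICIENT_DATA, missing datapoints) can be practiced with a small model. The sketch below is a simplified, hypothetical approximation of "M out of N" alarm evaluation over a single window. It is not CloudWatch's actual algorithm, which uses a wider evaluation range and also supports an "ignore" option that is not modeled here. The function name and parameters are illustrative only.

```python
from typing import Optional, Sequence


def alarm_state(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    datapoints_to_alarm: int,
    treat_missing: str = "missing",
) -> str:
    """Evaluate one window of N datapoints; None marks a missing datapoint.

    Simplified model: a datapoint breaches when it is greater than the threshold.
    """
    if treat_missing not in ("missing", "breaching", "notBreaching"):
        raise ValueError("this sketch models only missing, breaching, notBreaching")

    breaching = 0
    counted = 0  # datapoints that count as evidence either way
    for value in datapoints:
        if value is None:
            if treat_missing == "breaching":
                breaching += 1
                counted += 1
            elif treat_missing == "notBreaching":
                counted += 1
            # "missing": the gap contributes no evidence
            continue
        counted += 1
        if value > threshold:
            breaching += 1

    if breaching >= datapoints_to_alarm:
        return "ALARM"
    if counted == 0:
        return "INSUFFICIENT_DATA"
    return "OK"
```

Running the same window with different `treat_missing` values shows why an alarm on a sparse metric can sit in INSUFFICIENT_DATA until missing data is treated as notBreaching.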
Content Domain 2: Reliability and Business Continuity (22%)
Task 2.1 - Implement scalability and elasticity
- Differentiate scalability from elasticity and apply the concepts to real operations scenarios on AWS.
- Configure Auto Scaling groups with target tracking policies to maintain performance under changing load.
- Configure Auto Scaling group health checks, instance replacement behavior, and capacity settings to maintain availability.
- Use scheduled scaling to prepare for predictable traffic patterns and operational events.
- Configure ECS service auto scaling or EKS scaling at a high level to maintain application performance.
- Tune scaling cooldowns and stabilization windows to prevent thrashing and reduce operational noise.
- Use CloudFront caching to reduce origin load and improve global scalability for web applications.
- Configure CloudFront cache behaviors (TTL, cache key) appropriate for dynamic and static content patterns.
- Use ElastiCache to offload frequent reads and reduce database load for scalable applications.
- Choose between Redis and Memcached for caching based on persistence, features, and operational requirements.
- Configure DynamoDB capacity mode and auto scaling to support variable traffic patterns reliably.
- Apply caching strategies such as DynamoDB Accelerator (DAX) or application caching to improve read scalability.
- Configure RDS scaling strategies at a high level (read replicas, instance sizing) to meet workload demand.
- Interpret scalability metrics (request rate, latency, CPU, connections) to validate that scaling meets SLOs.
- Troubleshoot scaling failures by validating scaling policies, metrics, permissions, and capacity constraints.
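To reason about the target tracking bullets above, it can help to work the arithmetic by hand. The sketch below uses a common proportional approximation (capacity scales with the ratio of the observed metric to the target). AWS does not publish the exact algorithm, so treat this as a study aid under that assumption. The function name is hypothetical.

```python
import math


def proportional_desired_capacity(
    current_capacity: int,
    metric_value: float,
    target_value: float,
    min_size: int,
    max_size: int,
) -> int:
    """Estimate the capacity needed to bring the metric back to the target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_capacity * metric_value / target_value)
    # The group's min and max sizes always bound the result.
    return max(min_size, min(max_size, raw))
```

For example, 4 instances averaging 75% CPU against a 50% target suggests 6 instances. If the group's maximum is lower than the estimate, the metric stays above target, which is one of the capacity constraints to check when scaling appears to fail.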
Task 2.2 - Implement highly available and resilient environments
- Choose the appropriate Elastic Load Balancing option (ALB, NLB, GWLB) for a highly available workload.
- Configure target groups and health checks to detect unhealthy instances accurately and trigger failover.
- Configure cross-zone load balancing and understand its impact on distribution behavior and cost.
- Configure listeners and routing at a high level to support resilient application traffic patterns.
- Troubleshoot unhealthy targets and 5xx errors using ELB metrics and access logs.
- Configure Route 53 health checks and failover routing to route traffic away from unhealthy endpoints.
- Apply Route 53 routing policies (failover, weighted, latency) to improve availability and resilience.
- Implement multi-AZ compute patterns using Auto Scaling groups across multiple subnets and Availability Zones.
- Configure RDS Multi-AZ deployments and understand failover behavior and operational considerations.
- Configure Aurora replicas and failover behavior at a high level to support resilience requirements.
- Identify AZ-scoped single points of failure (for example, NAT gateways, instance-local state) and remediate them.
- Use regional services (for example, S3, DynamoDB) where appropriate to reduce the impact of AZ failures.
- Validate high availability by simulating failures and verifying health check and failover behavior.
- Reduce blast radius by using isolation boundaries (AZs, subnets, partitions) and controlled rollout practices.
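The health check bullets above depend on consecutive-result thresholds. The following is a minimal sketch of that state machine, assuming a target flips to unhealthy after a run of consecutive failures and back to healthy after a run of consecutive successes. The function name and the string states are illustrative, not an AWS API.

```python
from typing import Iterable, List


def track_health(
    results: Iterable[bool],
    healthy_threshold: int,
    unhealthy_threshold: int,
    initial: str = "healthy",
) -> List[str]:
    """Return the target state after each health check result (True = passed)."""
    state = initial
    ok_streak = 0
    fail_streak = 0
    history = []
    for passed in results:
        if passed:
            ok_streak += 1
            fail_streak = 0
            if state == "unhealthy" and ok_streak >= healthy_threshold:
                state = "healthy"
        else:
            fail_streak += 1
            ok_streak = 0
            if state == "healthy" and fail_streak >= unhealthy_threshold:
                state = "unhealthy"
        history.append(state)
    return history
```

With an unhealthy threshold of 2 and a healthy threshold of 3, a target that fails twice is removed quickly but needs three clean checks to return, which is the trade-off to tune when targets flap.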
Task 2.3 - Implement backup and restore strategies
- Create AWS Backup plans with schedules, lifecycle policies, and vault configuration to meet retention requirements.
- Select and assign backup resources using tags and resource assignments to standardize coverage.
- Configure cross-account and cross-Region backup copy policies to meet business continuity requirements.
- Restore common resources (EBS volumes, EC2 instances, RDS databases, DynamoDB tables, EFS) using AWS Backup.
- Explain RPO and RTO and map them to backup frequency, restore approach, and operational runbooks.
- Perform RDS snapshot restore and point-in-time recovery to meet stated RTO/RPO and cost constraints.
- Perform DynamoDB point-in-time recovery at a high level and validate restore outcomes.
- Validate backup integrity by conducting restore drills and documenting operational results.
- Enable and manage S3 versioning to protect against accidental deletion and overwrite scenarios.
- Recover data by managing S3 delete markers and restoring previous object versions operationally.
- Use FSx backup or snapshot capabilities at a high level to support recovery objectives.
- Create and follow a disaster recovery runbook that includes failover and failback steps for a workload.
- Choose an appropriate DR strategy (backup and restore, pilot light, warm standby, multi-site active/active) based on requirements and cost.
- Automate snapshots and backups for EC2/EBS/RDS resources using AWS Backup or native mechanisms.
- Troubleshoot failed backup jobs by validating permissions, vault policies, configuration, and service prerequisites.
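The S3 versioning and delete marker bullets above are easier to remember with a toy model. The sketch below imitates a single key in a versioning-enabled bucket: a delete without a version ID adds a delete marker, and removing that marker makes the previous version current again. The class and the `v1`, `v2` style version IDs are hypothetical and do not resemble real S3 version IDs.

```python
import itertools
from typing import List, Optional, Tuple


class VersionedKey:
    """Toy model of one object key in a versioning-enabled bucket."""

    def __init__(self) -> None:
        # Each entry is (version_id, body); a body of None is a delete marker.
        self._versions: List[Tuple[str, Optional[bytes]]] = []
        self._ids = itertools.count(1)

    def put(self, body: bytes) -> str:
        version_id = f"v{next(self._ids)}"
        self._versions.append((version_id, body))
        return version_id

    def delete(self, version_id: Optional[str] = None) -> str:
        if version_id is None:
            # Simple delete: nothing is removed, a delete marker becomes current.
            marker_id = f"v{next(self._ids)}"
            self._versions.append((marker_id, None))
            return marker_id
        # Versioned delete: permanently removes that version or marker.
        self._versions = [v for v in self._versions if v[0] != version_id]
        return version_id

    def get(self) -> Optional[bytes]:
        """Return the current body, or None if the key is absent or hidden by a marker."""
        if not self._versions:
            return None
        return self._versions[-1][1]
```

A short usage sequence: put two versions, delete without a version ID (the key now looks missing), then delete the marker by its ID to bring the latest version back.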
Content Domain 3: Deployment, Provisioning, and Automation (22%)
Task 3.1 - Provision and maintain cloud resources
- Create and manage AMIs by using EC2 Image Builder pipelines and manage versioning across releases.
- Apply image hardening and patching practices during AMI builds to reduce operational risk.
- Distribute AMIs across Regions and accounts and manage rollback to a prior image version when needed.
- Build, tag, and manage container images and store them in Amazon ECR for operational use.
- Create and update CloudFormation stacks using safe change practices such as change sets and drift detection.
- Troubleshoot CloudFormation stack failures by using stack events, error messages, and rollback states.
- Use AWS CDK at a high level to synthesize and deploy CloudFormation stacks as infrastructure as code.
- Diagnose subnet sizing and IP exhaustion issues that prevent deployments and remediate with CIDR planning.
- Diagnose IAM permission issues that prevent resource provisioning and remediate with least-privilege policies.
- Deploy standardized resources across multiple accounts and Regions by using CloudFormation StackSets.
- Share resources across accounts by using AWS RAM (for example, subnets or Transit Gateway attachments) at a high level.
- Implement deployment strategies (rolling, blue/green, canary) to minimize downtime and reduce risk during changes.
- Configure deployment services at a high level (for example, CodeDeploy or ECS deployment options) to support automatic rollback.
- Apply consistent tagging standards to provisioned resources to support operations, cost allocation, and governance.
- Use Terraform at a high level to provision AWS resources while managing state safely and predictably.
- Use Git workflows at a high level (branching, pull requests, reviews) to manage infrastructure as code changes.
- Remediate common deployment issues such as parameter misconfiguration, dependency ordering, and Region constraints.
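For the subnet sizing and IP exhaustion bullet above, the arithmetic is worth having at hand. AWS reserves five addresses in every IPv4 subnet, so usable capacity is the block size minus five. The sketch below uses only the standard library. The helper names are hypothetical.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast


def usable_ips(cidr: str) -> int:
    """Return the number of assignable IPv4 addresses in an AWS subnet."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def fits(cidr: str, required_addresses: int) -> bool:
    """Check whether a planned deployment fits in the subnet."""
    return usable_ips(cidr) >= required_addresses
```

A /24 yields 251 usable addresses and a /28 yields 11, so a fleet that needs 12 addresses at peak cannot launch into a /28 even though the block nominally has 16.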
Task 3.2 - Automate the management of existing resources
- Use Systems Manager Run Command to execute operational actions on managed instances at scale.
- Configure Patch Manager to apply OS and application patches on a schedule with controlled blast radius.
- Use State Manager to enforce configuration baselines and remediate configuration drift.
- Store and retrieve configuration values securely using Parameter Store and related Systems Manager capabilities.
- Use Session Manager for secure administrative access without opening inbound ports or managing SSH keys.
- Create automation workflows in Systems Manager Automation to restart services, remediate issues, and standardize operations.
- Configure maintenance windows and associations to control timing and scope of automated operational tasks.
- Automate operational tasks based on events by using Lambda and EventBridge at a high level.
- Configure S3 event notifications to trigger automation when objects are created (including overwrites) or removed.
- Implement guardrails for automation (approvals, rate limits, scoped IAM roles) to reduce operational risk.
- Monitor automation executions and troubleshoot failures by using execution history, logs, and IAM diagnostics.
- Combine CloudWatch alarms with automation targets to close the loop on detection and remediation.
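When an event-driven automation does not fire, the first check is usually whether the event pattern matches the event at all. The sketch below is a deliberately reduced matcher that handles only exact-value patterns (nested objects whose leaves are lists of accepted values). Real EventBridge patterns also support prefix, numeric, anything-but, and other operators that are not modeled here. The sample instance ID is made up.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every field in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # Leaf: a list of accepted values; any one of them may match.
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}

SAMPLE_EVENT = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "region": "us-east-1",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
```

Fields that appear in the event but not in the pattern (such as `region` here) are ignored, while a field named in the pattern but missing from the event causes a mismatch.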
Content Domain 4: Security and Compliance (16%)
Task 4.1 - Implement and manage security and compliance tools and policies
- Configure IAM password policies and multi-factor authentication (MFA) requirements for human users.
- Design and implement IAM roles and trust policies for secure service-to-service and cross-account access.
- Apply IAM policy conditions (for example, tags, source IP, MFA present) to enforce least privilege.
- Configure and use federated identity and IAM Identity Center at a high level for centralized access management.
- Implement resource-based policies for services such as S3 or KMS and reason about policy evaluation at a high level.
- Troubleshoot AccessDenied errors by using CloudTrail and identifying the calling principal and API action.
- Use the IAM policy simulator to validate effective permissions before deploying access changes.
- Use IAM Access Analyzer findings to detect unintended external access and remediate exposure.
- Implement secure multi-account strategies using AWS Organizations and Control Tower concepts at a high level.
- Use service control policies (SCPs) to enforce guardrails across accounts, remembering that SCPs cap the maximum available permissions and never grant permissions themselves.
- Interpret AWS Trusted Advisor security checks and prioritize remediation actions.
- Operationalize remediation for common Trusted Advisor findings (for example, public S3 access, open security groups).
- Enforce compliance constraints on Region and service usage by using SCPs and governance controls.
- Use AWS Config at a high level to assess compliance posture and record configuration changes for auditing.
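The policy evaluation and SCP bullets above come down to three outcomes: explicit deny, allow, and implicit deny. The sketch below is a heavily simplified model that looks only at action names. It ignores resources, conditions, permissions boundaries, session policies, and resource-based policies, so it is a memory aid rather than a faithful evaluator. The statement shape is a hypothetical simplification of real policy JSON.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

Statement = Dict[str, object]  # {"effect": "Allow" | "Deny", "actions": [patterns]}


def _applies(statement: Statement, action: str) -> bool:
    # IAM action matching is case-insensitive, so compare in lowercase.
    patterns = statement["actions"]
    return any(fnmatchcase(action.lower(), str(p).lower()) for p in patterns)


def evaluate(action: str, identity: List[Statement], scps: List[Statement]) -> str:
    """Return 'explicit deny', 'allow', or 'implicit deny' for one action."""
    for statement in identity + scps:
        if statement["effect"] == "Deny" and _applies(statement, action):
            return "explicit deny"
    scp_allows = any(s["effect"] == "Allow" and _applies(s, action) for s in scps)
    identity_allows = any(s["effect"] == "Allow" and _applies(s, action) for s in identity)
    if scp_allows and identity_allows:
        return "allow"
    return "implicit deny"
```

Note that an SCP allowing `*` with no identity policy still results in an implicit deny, which is the "guardrails without granting permissions" point in code form.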
Task 4.2 - Implement strategies to protect data and infrastructure
- Define and implement a data classification scheme and apply classification through tagging and access controls.
- Use Amazon Macie at a high level to discover and classify sensitive data stored in Amazon S3.
- Enforce encryption at rest for common AWS services (EBS, S3, RDS, DynamoDB) by using AWS KMS.
- Manage KMS key policies and grants to enable least-privilege access to encrypted data.
- Troubleshoot KMS-related access failures by validating key policy, IAM policy, key state, and Region.
- Configure encryption in transit by using ACM certificates for endpoints such as ALB, CloudFront, or Amazon API Gateway.
- Troubleshoot TLS and certificate issues (expired certificates, wrong domain/SNI, chain problems) at a high level.
- Store application secrets in AWS Secrets Manager and retrieve them securely from workloads using IAM roles.
- Configure secret rotation and monitor rotation outcomes to ensure credentials remain valid.
- Use AWS Config rules to detect noncompliant resource configurations and trigger remediation workflows.
- Interpret GuardDuty findings and select appropriate incident response and remediation actions at a high level.
- Interpret Inspector findings for vulnerabilities and prioritize remediation actions for affected resources.
- Aggregate and triage security findings in Security Hub and route them into operational workflows.
- Implement incident response actions to protect infrastructure (isolation, credential rotation, evidence preservation).
Content Domain 5: Networking and Content Delivery (18%)
Task 5.1 - Implement and optimize networking features and connectivity
- Create and configure VPCs, subnets, and route tables to support public and private tiers for workloads.
- Configure security groups and network ACLs and explain their differences in statefulness and evaluation order.
- Configure NAT gateways and internet gateways to enable outbound and inbound connectivity appropriately.
- Use egress-only internet gateways to enable IPv6 outbound-only internet access for private subnets.
- Configure VPC endpoints (gateway and interface) to access AWS services privately without traversing the public internet.
- Implement VPC peering connectivity and understand limitations such as non-transitive routing.
- Implement Transit Gateway connectivity at a high level for hub-and-spoke routing across multiple VPCs.
- Implement AWS PrivateLink at a high level to access services privately across VPCs and accounts.
- Configure AWS Client VPN at a high level to provide secure user connectivity to VPC resources.
- Configure Site-to-Site VPN at a high level to provide hybrid connectivity to AWS networks.
- Audit network protection services (DNS Firewall, WAF, Shield, Network Firewall) and validate that controls are applied correctly.
- Enable and review logs and metrics for network protection services to validate effectiveness and support investigations.
- Optimize network architecture cost by reducing NAT gateway data processing and minimizing cross-AZ data transfer.
- Choose private connectivity options (VPC endpoints, PrivateLink) versus public endpoints based on security and cost requirements.
- Identify common connectivity-breaking misconfigurations (routes, DNS, SG/NACL) and select appropriate remediations.
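The security group versus network ACL bullet above hinges on evaluation order: NACL rules are processed in ascending rule number and the first match wins, with an implicit deny if nothing matches. The sketch below models that for inbound traffic only and ignores protocol. The rule type and function name are hypothetical, and the addresses come from documentation ranges.

```python
import ipaddress
from typing import Iterable, NamedTuple


class NaclRule(NamedTuple):
    number: int
    cidr: str
    port_from: int
    port_to: int
    action: str  # "allow" or "deny"


def evaluate_nacl(rules: Iterable[NaclRule], source_ip: str, port: int) -> str:
    """Return the action of the lowest-numbered matching rule, else 'deny'."""
    address = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = address in ipaddress.ip_network(rule.cidr)
        if in_cidr and rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"  # the catch-all rule at the end of every NACL


RULES = [
    NaclRule(200, "0.0.0.0/0", 443, 443, "allow"),
    NaclRule(100, "203.0.113.0/24", 0, 65535, "deny"),
]
```

Because rule 100 is evaluated before rule 200, HTTPS from 203.0.113.0/24 is denied even though a broader allow exists. Security groups behave differently: they contain only allow rules, all rules are considered together, and return traffic is permitted automatically because they are stateful.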
Task 5.2 - Configure domains, DNS services, and content delivery
- Configure Route 53 hosted zones and record sets for public and private DNS use cases.
- Configure Route 53 Resolver inbound and outbound endpoints at a high level for hybrid DNS resolution.
- Configure private hosted zones and associate them with VPCs correctly to support split-horizon DNS.
- Implement Route 53 routing policies (simple, weighted, latency, failover) to meet availability and performance goals.
- Implement Route 53 health checks and understand how they influence failover decisions and routing.
- Enable Route 53 query logging and interpret logs to troubleshoot DNS resolution problems.
- Troubleshoot DNS issues such as resolver misconfiguration, split-horizon conflicts, and record set mistakes.
- Configure CloudFront distributions with origins, behaviors, and cache policies for content delivery.
- Configure CloudFront origin access control (OAC) at a high level and restrict direct access to the origin.
- Tune CloudFront caching behavior (TTL, cache key) to balance performance and correctness for a workload.
- Use CloudFront and AWS WAF together at a high level to protect edge-delivered applications.
- Configure Global Accelerator endpoints and health checks at a high level for improved performance and availability.
- Choose between CloudFront and Global Accelerator based on protocol, caching needs, and performance requirements.
- Troubleshoot content delivery issues by using CloudFront logs, metrics, and cache invalidations.
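For the weighted routing and health check bullets above, the expected traffic split is each record's weight divided by the sum of weights of the records still in rotation. The sketch below computes that share under the assumption that unhealthy and zero-weight records are simply excluded. It does not model Route 53's fallback behavior when every record is unhealthy. The record names are hypothetical.

```python
from typing import Dict, Tuple


def weighted_shares(records: Dict[str, Tuple[int, bool]]) -> Dict[str, float]:
    """Map record name -> expected share of responses.

    `records` maps a name to (weight, healthy).
    """
    eligible = {
        name: weight
        for name, (weight, healthy) in records.items()
        if healthy and weight > 0
    }
    total = sum(eligible.values())
    if total == 0:
        return {}
    return {name: weight / total for name, weight in eligible.items()}
```

With weights of 90 and 10, a canary record receives about a tenth of DNS responses. If the heavier record fails its health check, the remaining record takes the full share.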
Task 5.3 - Troubleshoot network connectivity issues
- Use VPC Reachability Analyzer to determine why traffic between two endpoints fails.
- Troubleshoot routing issues caused by route tables, subnet associations, and missing or incorrect routes.
- Troubleshoot security issues caused by security groups and network ACL rules for inbound and outbound traffic.
- Troubleshoot NAT gateway and internet gateway connectivity issues for workloads in private subnets.
- Troubleshoot Transit Gateway attachments and route propagation at a high level when connectivity is broken.
- Collect and interpret VPC Flow Logs to identify accepted versus rejected traffic and the likely rejecting layer.
- Enable and interpret ELB access logs to diagnose client errors, backend errors, and routing issues.
- Enable and interpret AWS WAF web ACL logs to diagnose blocked requests and reduce false positives.
- Use CloudFront logs and metrics to diagnose edge errors, origin timeouts, and caching behavior.
- Identify and remediate CloudFront caching issues by adjusting cache policies and using invalidations appropriately.
- Troubleshoot hybrid connectivity issues for VPN-based connections at a high level (tunnel state, routes, DNS).
- Troubleshoot private connectivity issues involving VPC endpoints, PrivateLink, and DNS resolution at a high level.
- Configure and analyze CloudWatch network monitoring metrics and features to detect and investigate connectivity degradation.
- Apply a systematic troubleshooting workflow: symptom, scope, signals (logs/metrics), root cause, and remediation.
- Validate remediation by re-running reachability checks and monitoring post-change metrics and logs.
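For the VPC Flow Logs bullet above, a quick way to practice is to parse records in the default (version 2) format and count rejected flows by destination port. The sketch below assumes the default field order and space-separated records. The sample lines use made-up account and interface IDs, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Field order of the default flow log format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def parse_flow_log_line(line: str) -> Dict[str, str]:
    """Split one default-format record into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_by_port(lines: Iterable[str]) -> Counter:
    """Count REJECT records per destination port."""
    counts: Counter = Counter()
    for line in lines:
        record = parse_flow_log_line(line)
        if record["action"] == "REJECT":
            counts[int(record["dstport"])] += 1
    return counts


SAMPLE = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]
```

A REJECT record shows that a security group or network ACL blocked the flow but not which one. Comparing inbound and outbound records for the same flow helps narrow that down, since stateless NACLs can reject return traffic that a stateful security group would have allowed.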
Tip: for SOA-C03, convert misses into short runbook rules (signal -> root cause -> first safe remediation).