Free Google Cloud Professional Cloud Architect Practice Questions: Operations Excellence

Practice 10 free Google Cloud Professional Cloud Architect questions on Operations Excellence, with answers, explanations, and the IT Mastery next step.

Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.

Try Google Cloud Professional Cloud Architect on Web

Topic snapshot

Field | Detail
Exam route | Google Cloud Professional Cloud Architect
Topic area | Ensuring Solution and Operations Excellence
Blueprint weight | 12%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Ensuring Solution and Operations Excellence for Google Cloud Professional Cloud Architect. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 12% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These are original IT Mastery practice questions aligned to this topic area. They are not official exam questions, copied live-exam content, or exam dumps. Use them for self-assessment, scope review, and deciding what to drill next.

Question 1

Topic: Ensuring Solution and Operations Excellence

A team is preparing to release a new version of a customer-facing checkout service.

Exhibit: Release facts

Fact | Detail
Platform | Cloud Run behind an HTTPS load balancer
Current state | One stable revision receives 100% of traffic
Change risk | New payment validation logic
Requirement | Start with up to 5% real traffic, monitor SLOs, and roll back without redeploying

Which deployment strategy best meets these requirements?

Options:

  • A. Create a second service URL for users to try

  • B. Use a canary release with Cloud Run traffic splitting

  • C. Use a rolling replacement of all service instances

  • D. Use blue-green deployment with immediate full cutover

Best answer: B

Explanation: The key requirement is controlled production exposure with fast rollback. For Cloud Run, a new revision can receive a small percentage of traffic while the stable revision continues serving most users. This is a canary deployment pattern. The team can watch error rate, latency, and business metrics before increasing traffic, and can quickly send traffic back to the known-good revision if the new payment logic causes issues. This minimizes user impact because only a small slice of production traffic is exposed initially.

Blue-green can support rollback, but an immediate full cutover creates a larger blast radius than the stated 5% start.

  • Rolling replacement gradually replaces capacity, but it does not directly satisfy the requirement to cap real user exposure at 5%.
  • Second URL shifts validation burden to users and does not test normal production routing behavior.
  • Immediate blue-green cutover is reversible, but it exposes all users at once after smoke testing.
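
The canary logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step sequence (5, 25, 50, 100), the `RevisionStats` type, and the `max_error_rate` parameter are hypothetical, and the function does not call any Google Cloud API. It only decides what percentage the canary revision should receive next.

```python
from dataclasses import dataclass


@dataclass
class RevisionStats:
    """Request counts observed for the canary revision in one window (hypothetical type)."""
    requests: int
    errors: int


def next_canary_percent(current_percent, canary, max_error_rate, steps=(5, 25, 50, 100)):
    """Return the next traffic percentage for the canary; 0 means roll back."""
    if canary.requests == 0:
        return current_percent  # no data yet, so hold the current split
    if canary.errors / canary.requests > max_error_rate:
        return 0  # send all traffic back to the stable revision
    for step in steps:
        if step > current_percent:
            return step  # advance to the next planned exposure level
    return 100  # already fully promoted
```

Because the stable revision keeps running throughout, a result of 0 corresponds to shifting traffic back rather than redeploying, which is the rollback property the requirement asks for.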

Question 2

Topic: Ensuring Solution and Operations Excellence

A retailer has frequent customer-impacting incidents caused by rushed production releases to services running on Google Cloud. The landing zone, networks, and runtime platforms are already provisioned with Terraform. Leadership wants fewer failed releases, auditable approvals for regulated changes, and minimal disruption to product teams. A major application redesign is not funded this quarter.

Which recommendation best balances these constraints?

Options:

  • A. Redesign the services into an event-driven microservices architecture

  • B. Require all releases to be performed manually by a central operations team

  • C. Implement release gates, progressive rollout, monitoring checks, and rollback automation

  • D. Provision duplicate production environments in more regions

Best answer: C

Explanation: Release management owns how changes move safely into production: approvals, test gates, deployment strategies, rollback plans, release evidence, and production-readiness checks. In this scenario, the platform is already provisioned and a major application redesign is out of scope, so the best recommendation is to improve the release path rather than rebuild the application or add infrastructure. Progressive rollout with automated health checks and rollback reduces availability risk, while approval gates and audit trails address regulated-change requirements. Keeping product teams in the workflow also reduces operational bottlenecks and supports future maturity.

  • Application redesign may improve long-term architecture, but it violates the near-term funding constraint and does not directly fix release governance.
  • More regions can improve resilience, but it adds cost and operational effort without addressing rushed deployments or approval evidence.
  • Central manual releases emphasize control, but they create bottlenecks and reduce team autonomy instead of improving automated release discipline.

Question 3

Topic: Ensuring Solution and Operations Excellence

A retail company runs an order service on Cloud Run, Pub/Sub, and Cloud SQL. Recent incidents were caused by deployment regressions, and teams often learned about failures from customers before their dashboards showed a problem. Requirements are to improve production support within the next quarter, alert on customer impact instead of every low-level metric, preserve release velocity, and avoid a platform migration. Which architecture decision is best?

Options:

  • A. Build a warm standby region and rely on manual failover during incidents.

  • B. Add more Cloud Run min instances and page on CPU, memory, and request-count thresholds.

  • C. Adopt SLO-based operations with Cloud Monitoring alerts, error budgets, runbooks, and blameless postmortems.

  • D. Migrate the services to GKE Autopilot and require approval for every release.

Best answer: C

Explanation: The operational excellence pillar emphasizes measurable reliability, observable systems, controlled change, and continuous improvement. In this scenario, the main failures are late detection and unsafe releases, not a missing compute platform or capacity setting. Defining service-level objectives in Cloud Monitoring, alerting on symptoms that affect users, using error budgets to guide release risk, and creating runbooks and blameless postmortems addresses the stated production-support gaps while preserving release velocity. The key takeaway is to improve how the system is operated before changing where it runs.

  • Platform migration overbuilds the response; GKE Autopilot does not by itself create SLOs, observability, or better incident learning.
  • Resource scaling may reduce some saturation risks, but paging on raw metrics worsens alert fatigue and misses customer-impact goals.
  • Standby region targets disaster recovery, not the stated release-regression and detection problems in current production support.
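
To make "using error budgets to guide release risk" concrete, here is a minimal Python sketch. The function names and the policy thresholds (`freeze_below`, `caution_below`) are assumptions chosen for illustration; a real team would set its own values and read the request counts from its monitoring system.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left; a negative value means it is overspent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def release_policy(budget_remaining, freeze_below=0.0, caution_below=0.25):
    """Map remaining budget to a release posture (thresholds are hypothetical)."""
    if budget_remaining <= freeze_below:
        return "freeze"
    if budget_remaining < caution_below:
        return "canary-only"
    return "normal"
```

With a 99.9% SLO and 1,000,000 requests, about 1,000 failures are allowed, so 500 failures leaves roughly half the budget and releases continue normally.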

Question 4

Topic: Ensuring Solution and Operations Excellence

A retail team runs a checkout service on Cloud Run that calls Spanner and a third-party payment API. Leaders want alerts that show customer impact quickly, but the small SRE team cannot support noisy dashboards or high-cardinality custom metrics. Which monitoring recommendation best balances visibility with operational effort?

Options:

  • A. Monitor only Cloud Run CPU and memory utilization.

  • B. Use SLO burn-rate alerts for success/latency, plus dependency and saturation signals.

  • C. Use synthetic checkout probes and omit backend dependency metrics.

  • D. Alert on all error logs and request labels for maximum detail.

Best answer: B

Explanation: Operational health monitoring should start with user-visible SLOs, such as checkout success rate and latency, because they show real customer impact. Add dependency signals, such as Spanner and payment API error rate and latency, to identify likely failure domains. Add a small set of saturation signals, such as Cloud Run concurrency, instance count, CPU/memory, and database utilization, to catch capacity pressure. This balances availability and latency visibility with operational effort because it avoids paging on every log line or creating high-cardinality custom metrics. Detailed logs and traces can support investigation, but paging signals should be few, actionable, and tied to service health.

  • Alerting on all error logs creates high alert volume and cost, and high-cardinality labels can make the signal less actionable.
  • Monitoring only CPU and memory can show infrastructure pressure but misses user impact and dependency failures.
  • Relying only on synthetic probes shows some end-to-end failures but does not reveal backend dependency health or saturation causes.
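
A burn-rate alert compares the observed error rate with the rate the SLO allows. The sketch below assumes a two-window check and a threshold of 14.4, a commonly cited value that corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 hours). The function names and tuple layout are hypothetical, and the code does not use the Cloud Monitoring API.

```python
def burn_rate(slo_target, total_requests, bad_requests):
    """How many times faster than allowed the error budget is being spent."""
    if total_requests == 0:
        return 0.0
    error_rate = bad_requests / total_requests
    return error_rate / (1 - slo_target)


def should_page(slo_target, long_window, short_window, threshold=14.4):
    """Page only if both windows burn fast; each window is a (total, bad) tuple."""
    return (
        burn_rate(slo_target, *long_window) >= threshold
        and burn_rate(slo_target, *short_window) >= threshold
    )
```

Requiring the short window to agree with the long one keeps the page from firing after the problem has already stopped, which helps a small SRE team avoid noisy alerts.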

Question 5

Topic: Ensuring Solution and Operations Excellence

A retail team is reviewing production support for a Cloud Run checkout service after two customer-impacting incidents were detected first by support tickets. What is the best next action?

Exhibit: Operations review

Service: checkout-api
Business target: p95 checkout latency < 500 ms
Business target: failed checkouts < 0.1%
Current alerts: CPU and memory only
Rollback: manual, not documented
Incident review: no customer-facing SLI or error budget

Options:

  • A. Adopt SLO-based monitoring with runbooks.

  • B. Increase CPU and memory alert sensitivity.

  • C. Move the service from Cloud Run to GKE.

  • D. Require manual approval for every release.

Best answer: A

Explanation: The operational excellence pillar emphasizes measurable operations, actionable observability, incident readiness, and continuous improvement. The exhibit shows that the team has business targets but monitors only infrastructure symptoms, so incidents can affect customers before responders are alerted. Defining SLIs for latency and failed checkouts, setting SLO/error-budget alerts in Cloud Monitoring, and documenting response and rollback runbooks aligns production support with user impact. This creates a feedback loop for incidents and releases without assuming an unproven platform problem.

  • Utilization-only alerts can miss user-visible failures, especially when CPU and memory are normal.
  • Platform migration is not justified because the exhibit does not show a Cloud Run limitation.
  • Manual approvals may slow releases but do not create customer-facing detection or response guidance.
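
The two business targets in the exhibit translate directly into SLIs. As a minimal sketch, assuming latency samples in milliseconds and simple checkout counters (both hypothetical inputs), the checks could look like this:

```python
def p95_ms(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_checkout_slis(latencies_ms, total_checkouts, failed_checkouts):
    """Compare measurements with the exhibit targets: p95 < 500 ms, failures < 0.1%."""
    failure_ratio = failed_checkouts / total_checkouts
    return {
        "latency_ok": p95_ms(latencies_ms) < 500,
        "failures_ok": failure_ratio < 0.001,
    }
```

Neither check depends on CPU or memory, which is why the current utilization-only alerts can stay silent while customers are affected.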

Question 6

Topic: Ensuring Solution and Operations Excellence

A retail company runs a customer API on Cloud Run. Twice a month, incidents involve elevated 5xx rates and latency, but post-incident notes do not identify repeat causes or show MTTR trends. The team has limited SRE capacity and must avoid major rearchitecture this quarter. Which support improvement best balances these constraints and provides measurable evidence of progress?

Options:

  • A. Increase Cloud Run minimum instances across all services to reduce latency risk.

  • B. Migrate the API to GKE to give engineers more debugging control.

  • C. Create more dashboards and page the team on every 5xx log entry.

  • D. Create SLO-based runbooks with burn-rate alerts, triage steps, owners, and MTTR/error-budget tracking.

Best answer: D

Explanation: Recurring production issues should be addressed with an operational feedback loop, not only more capacity or more visibility. SLO-based runbooks connect symptoms to clear triage actions, escalation owners, and measurable outcomes such as MTTR, incident frequency, alert precision, and error-budget consumption. Burn-rate alerts help page on customer-impacting risk rather than every raw error, which reduces noise for a team with limited SRE capacity. Post-incident tracking then shows whether the runbook is improving support outcomes over time.

The key trade-off is improving reliability with evidence while avoiding a costly platform change or broad overprovisioning.

  • Overprovisioning may reduce some latency, but it increases cost and does not address unclear incident causes or MTTR measurement.
  • Platform migration adds operational effort and risk, violating the requirement to avoid major rearchitecture this quarter.
  • More paging increases alert fatigue because every 5xx log entry is not necessarily customer-impacting or actionable.
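
The post-incident tracking mentioned above needs only a small amount of structured data. The following sketch assumes each incident is recorded with a detection time, a resolution time, and a cause tag; the record layout, the cause names, and the sample dates are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident records for illustration only.
SAMPLE_INCIDENTS = [
    {"detected": datetime(2024, 1, 5, 10, 0), "resolved": datetime(2024, 1, 5, 10, 30),
     "cause": "db-connection-exhaustion"},
    {"detected": datetime(2024, 1, 12, 14, 0), "resolved": datetime(2024, 1, 12, 15, 0),
     "cause": "bad-release"},
    {"detected": datetime(2024, 1, 19, 9, 0), "resolved": datetime(2024, 1, 19, 9, 30),
     "cause": "db-connection-exhaustion"},
]


def mttr_minutes(incidents):
    """Mean minutes from detection to resolution for a non-empty list."""
    durations = [(i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents]
    return sum(durations) / len(durations)


def repeat_causes(incidents):
    """Return causes that appear more than once, with their counts."""
    counts = Counter(i["cause"] for i in incidents)
    return {cause: n for cause, n in counts.items() if n > 1}
```

Computing these two values per month gives the MTTR trend and the repeat-cause evidence that the current post-incident notes lack.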

Question 7

Topic: Ensuring Solution and Operations Excellence

An order API runs on Cloud Run behind an external Application Load Balancer and Cloud Armor. It meets a p95 latency target of 150 ms and a tight monthly budget in one region; the database already uses regional HA. Three recent Sev1 incidents were caused by new revisions that passed tests but increased 5xx errors after full rollout. Security requires provenance checks and vulnerability scanning before production. Which operations recommendation best improves reliability while preserving these constraints?

Options:

  • A. Increase minimum instances and CPU for every Cloud Run revision.

  • B. Use SLO-based canary rollouts with automated rollback and existing security gates.

  • C. Run active-active Cloud Run and database stacks in two distant regions.

  • D. Disable provenance and vulnerability gates to speed emergency rollbacks.

Best answer: B

Explanation: Operational excellence favors controls tied to the observed failure mode. The outages are release-induced, so progressive delivery is the best-balanced recommendation: send a small percentage of traffic to the new revision, monitor SLO and error-rate signals, and automatically roll back when reliability degrades. Keeping provenance checks and vulnerability scanning preserves security and compliance. Avoiding a full duplicate regional stack preserves cost and reduces unnecessary data movement and operational complexity. Cloud Run traffic splitting also limits customer exposure without changing the latency profile for most users.

  • Multi-region active-active can improve some outage scenarios, but it adds cost and complexity without addressing faulty releases.
  • Bypassing security gates may speed deployments, but it violates the stated production security requirement.
  • Raising minimum instances targets cold starts or capacity, not defective revisions, and increases baseline cost.
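
One way to picture the recommended flow is a single decision function that keeps the security gates in front of the canary evaluation. This is a sketch under assumptions: the gate names, the minimum sample size, and the tolerance are hypothetical, and the inputs would come from the team's own pipeline and monitoring rather than from this code.

```python
def canary_decision(security_gates, stable, canary, min_requests=500, tolerance=0.005):
    """Decide the next step for a canary revision.

    security_gates maps a gate name to True/False; stable and canary are
    (requests, errors_5xx) tuples for the same observation window.
    """
    failed_gates = sorted(name for name, passed in security_gates.items() if not passed)
    if failed_gates:
        return "blocked: " + ", ".join(failed_gates)  # never bypass security gates
    canary_requests, canary_errors = canary
    if canary_requests < min_requests:
        return "hold"  # not enough canary traffic to judge
    stable_requests, stable_errors = stable
    stable_rate = stable_errors / stable_requests if stable_requests else 0.0
    canary_rate = canary_errors / canary_requests
    if canary_rate > stable_rate + tolerance:
        return "rollback"
    return "promote"
```

Comparing the canary with the stable revision over the same window, rather than against a fixed number, reduces false rollbacks when both revisions are affected by a shared dependency.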

Question 8

Topic: Ensuring Solution and Operations Excellence

An online retailer is moving its checkout API to Cloud Run behind a global external Application Load Balancer with Cloud SQL as the database. Before a holiday launch, leadership’s main unresolved risk is whether the system can sustain 6 times normal traffic while keeping p95 latency under 300 ms. Testing must avoid customer impact and produce capacity and bottleneck evidence. Which validation activity should the architect recommend?

Options:

  • A. Run a tabletop disaster recovery exercise

  • B. Run a penetration test against the checkout API

  • C. Run a load test in a production-like environment

  • D. Run chaos experiments that fail database dependencies

Best answer: C

Explanation: The unresolved risk is performance and capacity under a defined traffic increase, so the right validation is load testing. A production-like environment lets the team generate realistic synthetic traffic, observe Cloud Run scaling, Cloud SQL connection pressure, latency, errors, and bottlenecks, and tune the architecture before the launch. Penetration testing is for validating security exposure, such as exploitable vulnerabilities or weak controls. Chaos engineering is for validating resilience to injected failures, such as instance loss, dependency failure, or degraded network behavior. Here, the decisive requirement is sustaining 6 times normal load within the p95 latency target without affecting customers.

  • Penetration testing targets security risk, not throughput, latency, or capacity bottlenecks.
  • Chaos experiments validate failure tolerance and recovery behavior, not whether normal components can handle peak demand.
  • Tabletop recovery reviews process readiness but does not produce measured performance evidence.
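
Before running the load test, it can help to write down a rough capacity hypothesis for the test to confirm or refute. The sketch below applies Little's law (in-flight requests = arrival rate × time in system). Every input is an assumption to be replaced with measured values, including the normal request rate, the per-instance concurrency, and the connection pool size.

```python
def estimate_instances(normal_rps, multiplier, latency_ms, concurrency_per_instance):
    """Rough instance count needed at peak, using integer arithmetic."""
    peak_rps = normal_rps * multiplier
    # Little's law: in-flight requests = arrival rate * time in system.
    in_flight_x1000 = peak_rps * latency_ms  # scaled by 1000 because latency is in ms
    capacity_x1000 = concurrency_per_instance * 1000
    return (in_flight_x1000 + capacity_x1000 - 1) // capacity_x1000  # ceiling division


def peak_db_connections(instances, pool_size_per_instance):
    """Upper bound on database connections if every instance fills its pool."""
    return instances * pool_size_per_instance
```

For example, a hypothetical 200 requests per second at 6 times load, a 300 ms latency target, and 80 concurrent requests per instance suggests about 5 instances. The load test then shows whether real scaling and database connection pressure match that estimate.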

Question 9

Topic: Ensuring Solution and Operations Excellence

A retailer plans to launch a customer-facing order API on Cloud Run next week. During the production readiness review, you receive this support model excerpt.

Area | Current plan
Monitoring | Dashboard exists; no alert policies
Escalation | Incidents start in a team chat; no severity levels
Rollback | Redeploy previous image “if needed”
Responsibility | App/platform ownership for shared failures is TBD

What is the best next action before approving go-live?

Options:

  • A. Define and test the production support runbook

  • B. Proceed because Cloud Run manages the infrastructure

  • C. Move all incident ownership to the platform team

  • D. Add dashboards for each service component

Best answer: A

Explanation: A production support model must make operational ownership explicit before launch. The exhibit shows several readiness gaps: monitoring is passive, escalation has no severity-based path, rollback has no trigger or accountable owner, and responsibility is unclear for failures that cross application and platform boundaries. The next action is to create and validate a runbook or RACI-style support plan that defines alert policies, escalation contacts, severity levels, rollback decision criteria, and ownership for common incident types. Managed services reduce infrastructure operations, but they do not remove the need for application support, incident response, or release recovery planning.

  • Managed service assumption fails because Cloud Run does not define application incident escalation or rollback decisions.
  • Single-team ownership fails because assigning every incident to the platform team hides application responsibilities and shared-failure handoffs.
  • Dashboard-only monitoring fails because dashboards are passive and do not provide alerting, escalation, or rollback accountability.
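
The readiness review itself can be expressed as a simple checklist. In this sketch the item names and the encoding of the exhibit in `CURRENT_PLAN` are assumptions made for illustration; the point is only that go-live stays blocked while any required item is empty.

```python
# Hypothetical checklist items derived from the explanation above.
REQUIRED_ITEMS = (
    "alert_policies",
    "severity_levels",
    "escalation_contacts",
    "rollback_criteria",
    "rollback_owner",
    "shared_failure_owner",
)

# One possible encoding of the exhibit: nothing on the checklist is defined yet.
CURRENT_PLAN = {
    "alert_policies": [],        # dashboard exists, but no alert policies
    "severity_levels": [],       # no severity levels
    "escalation_contacts": [],   # only a team chat is named
    "rollback_criteria": None,   # "if needed" is not a decision criterion
    "rollback_owner": None,
    "shared_failure_owner": None,  # ownership is TBD
}


def readiness_gaps(support_plan):
    """Return the checklist items that are missing or empty."""
    return [item for item in REQUIRED_ITEMS if not support_plan.get(item)]
```

An empty result from `readiness_gaps` would be one input to approving go-live; the runbook still needs to be exercised, since a filled-in document is not the same as a tested one.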

Question 10

Topic: Ensuring Solution and Operations Excellence

An online retailer runs a customer-facing order API on Cloud Run with Cloud SQL. For the last month, support tickets show recurring 500 errors and high p95 latency during promotions. The engineering manager wants a support improvement that reduces mean time to resolution and provides evidence that the recurring issue rate is improving. Major application refactoring is not approved this quarter. What should the cloud architect recommend?

Options:

  • A. Increase Cloud Run minimum instances and close tickets after each successful rollback.

  • B. Create an SLO-based incident runbook with Cloud Monitoring alerts, log-based metrics, and post-incident trend reviews.

  • C. Migrate the API to GKE and create a new on-call rotation for the service.

  • D. Schedule engineers to manually watch dashboards during promotions and escalate issues in chat.

Best answer: B

Explanation: The best improvement is a production-support runbook tied to observable evidence. For recurring incidents, the runbook should define symptoms, alert thresholds, ownership, triage steps, rollback or mitigation steps, and escalation paths. Cloud Monitoring alerts, log-based metrics, and ticket data can measure whether support is improving through indicators such as MTTR, incident count, error rate, and p95 latency during promotions. This fits the constraint that major refactoring is not approved while still creating a feedback loop from incidents to permanent fixes. A capacity or platform change might later be justified by the data, but the immediate architecture decision should make recurring issues diagnosable, repeatable to handle, and measurable.

  • Manual dashboard watching depends on human vigilance and does not create a repeatable support process with trend evidence.
  • Platform migration is overbuilt for the stated goal and conflicts with the no-refactoring constraint.
  • Raising minimum instances alone might reduce one latency cause, but it does not address 500 errors, runbook quality, or recurrence measurement.
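
A log-based metric is essentially a counter over log entries that match a filter. The sketch below imitates that idea locally so the trend can be reasoned about; it does not call Cloud Logging or Cloud Monitoring, and the entry fields (`timestamp` as an ISO 8601 string, `status` as an integer) are assumptions.

```python
from collections import Counter


def server_errors_per_hour(log_entries):
    """Count entries with HTTP status >= 500, keyed by 'YYYY-MM-DDTHH'."""
    counts = Counter()
    for entry in log_entries:
        if entry["status"] >= 500:
            counts[entry["timestamp"][:13]] += 1  # truncate the timestamp to the hour
    return dict(counts)


def is_improving(previous_period_errors, current_period_errors):
    """True if the current promotion period had fewer server errors than the previous one."""
    return current_period_errors < previous_period_errors
```

Comparing the hourly counts across successive promotions gives the recurrence evidence the engineering manager asked for, alongside MTTR from ticket data.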
