AWS AIP-C01: GenAI Operations

Try 10 focused AWS AIP-C01 questions on GenAI Operations, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Try AWS AIP-C01 on Web
View full AWS AIP-C01 practice page

Topic snapshot

| Field | Detail |
| --- | --- |
| Exam route | AWS AIP-C01 |
| Topic area | Operational Efficiency and Optimization for GenAI Applications |
| Blueprint weight | 12% |
| Page purpose | Focused sample questions before returning to mixed practice |

How to use this topic drill

Use this page to isolate Operational Efficiency and Optimization for GenAI Applications for AWS AIP-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

| Pass | What to do | What to record |
| --- | --- | --- |
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |

Blueprint context: 12% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Operational Efficiency and Optimization for GenAI Applications

A company runs a customer support assistant on Amazon Bedrock and sends Bedrock invocation logs to CloudWatch. Cost Anomaly Detection alerted shortly after a prompt template release at 11:00. What is the best interpretation and next step?

| Window | Invocations | Avg input tokens | Grounding failures |
| --- | --- | --- | --- |
| 09:00-10:00 | 10,200 | 1,150 | 3.1% |
| 10:00-11:00 | 10,350 | 1,180 | 3.0% |
| 11:00-12:00 | 10,410 | 4,950 | 3.2% |

Options:

  • A. Request a Bedrock quota increase for higher throughput.

  • B. Scale the application fleet to handle a traffic spike.

  • C. Tighten grounding controls because hallucinations increased.

  • D. Correlate logs to the prompt release and reduce input context.

Best answer: D

Explanation: The cost anomaly is most consistent with prompt or context expansion, not increased demand. The decisive signal is the jump in average input tokens from about 1,180 to 4,950 after the prompt release while invocations and grounding failures remained stable.

Operational monitoring for GenAI applications should correlate cost anomalies with token metrics, prompt versions, and quality signals. In this exhibit, request volume is almost unchanged, so higher spend is unlikely to be caused by traffic growth. Grounding failures also remain near 3%, so the issue is not an obvious hallucination-rate regression. The major change is average input tokens increasing more than fourfold immediately after the 11:00 prompt release. The next step is to use Bedrock invocation logs and CloudWatch metrics to confirm which prompt version or retrieved context caused the token increase, then reduce unnecessary instructions, history, or retrieved chunks. This preserves quality while addressing the likely cost driver.
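
One way to confirm this kind of token jump is to query the invocation log group with CloudWatch Logs Insights. A minimal boto3 sketch, assuming invocation logging is already enabled, that the log group name below matches your configuration, and that the `input.inputTokenCount` field from the invocation log schema is present for your model:

```python
import time
import boto3

logs = boto3.client("logs")

# Assumed log group; use whichever group your Bedrock invocation logging targets.
LOG_GROUP = "/aws/bedrock/modelinvocations"

# Average input tokens and invocation counts per hour around the release window.
QUERY = """
fields @timestamp, input.inputTokenCount
| filter ispresent(input.inputTokenCount)
| stats avg(input.inputTokenCount) as avgInputTokens,
        count() as invocations by bin(1h)
"""

start = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(time.time()) - 6 * 3600,  # last 6 hours
    endTime=int(time.time()),
    queryString=QUERY,
)

# Poll until the query finishes, then inspect the hourly averages.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```

If the application also logs a prompt-version field, grouping by it ties the token jump directly to the 11:00 release.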

  • Traffic spike fails because invocations remain nearly flat across all three windows.
  • Hallucination increase fails because grounding failures stay around 3% before and after the release.
  • Quota increase fails because the exhibit shows a cost and token anomaly, not throttling or capacity saturation.

Question 2

Topic: Operational Efficiency and Optimization for GenAI Applications

A SaaS company is building a customer-support RAG API with Amazon Bedrock for generation, Lambda behind API Gateway, and Amazon OpenSearch Serverless for the document index. Retrieval often returns semantically similar chunks from the wrong product, and exact error-code articles are missed. The API must enforce tenant isolation before retrieval and keep prompt context small for a p95 latency target under 2 seconds. Which implementation best optimizes retrieval?

Options:

  • A. Increase vector topK and send all retrieved chunks.

  • B. Put tenant and product rules in the system prompt.

  • C. Use larger chunks and disable keyword matching.

  • D. Use hybrid search with metadata prefilters and candidate reranking.

Best answer: D

Explanation: The best implementation combines query preprocessing, metadata filtering, hybrid retrieval, and reranking. Tenant and product filters should be applied before scoring, hybrid search improves exact error-code recall, and reranking keeps only the strongest candidates for the prompt.

Retrieval optimization should happen before the model receives context. A Lambda retriever can extract structured fields such as tenant, product, and error code from the request, apply tenant and product metadata filters in OpenSearch, run hybrid lexical plus vector search, then rerank a bounded candidate set before sending the top passages to Bedrock. This improves recall for exact identifiers without losing semantic matching and prevents unrelated tenant or product content from entering the candidate pool. Keeping only a small reranked set also protects latency and token cost. Prompt instructions are not a substitute for retrieval-time authorization or filtering.
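
A minimal sketch of such a retrieval query, assuming a hypothetical OpenSearch index with `tenant_id` and `product` metadata fields, a `content` text field, and an `embedding` kNN vector field:

```python
def build_hybrid_query(tenant_id: str, product: str,
                       query_text: str, query_vector: list[float]) -> dict:
    """Hybrid lexical + vector query body with tenant/product prefilters."""
    scope_filter = [
        {"term": {"tenant_id": tenant_id}},  # enforce isolation before scoring
        {"term": {"product": product}},
    ]
    return {
        "size": 20,  # bounded candidate set handed to the reranker
        "query": {
            "bool": {
                "filter": scope_filter,
                "should": [
                    # Lexical match keeps exact error codes retrievable.
                    {"match": {"content": query_text}},
                    # Vector match covers natural-language phrasing.
                    {"knn": {"embedding": {"vector": query_vector, "k": 20}}},
                ],
            }
        },
    }
```

Hybrid scoring behavior varies by OpenSearch version; a search pipeline with score normalization, followed by a reranking step over the bounded candidate set, is the more robust production pattern.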

  • Vector-only topK increases candidate volume but raises latency and prompt cost while still missing exact-match signals.
  • Prompt-only rules apply after retrieval, so wrong-tenant content can already be exposed to the model.
  • Larger chunks can dilute relevance, increase token usage, and remove the keyword signal needed for error codes.

Question 3

Topic: Operational Efficiency and Optimization for GenAI Applications

A company runs a nightly summarization job for 800,000 call transcripts with Amazon Bedrock. The job reads JSONL files from Amazon S3, fans out synchronous InvokeModel calls from Lambda at reserved concurrency 1,000, and stores results in DynamoDB. During the job, CloudWatch shows ThrottlingException, SDK retry storms, and higher latency for the interactive chat application that uses the same model. The summaries are not needed until the next morning. Which change fixes the root cause with the smallest safe change?

Options:

  • A. Increase Lambda reserved concurrency and retry attempts.

  • B. Use Step Functions Distributed Map with higher concurrency.

  • C. Enable response streaming for transcript summaries.

  • D. Run the transcript job as a Bedrock batch inference job.

Best answer: D

Explanation: The symptom is throttling, retry storms, and collateral latency on the interactive application. The root cause is an offline workload using excessive synchronous fan-out against shared model token-processing capacity. Bedrock batch inference is the smallest safe fit because the inputs are already in S3 and the results are not needed immediately.

Symptom: the nightly job causes ThrottlingException and retries while interactive traffic slows down. Root cause: Lambda concurrency is not the limiting resource; the fan-out pattern is overwhelming model invocation and token-processing capacity shared with the chat workload. Fix: move the non-real-time S3-based summarization workload to Amazon Bedrock batch inference, then load completed outputs to the downstream store. This aligns the processing mode with the workload and avoids increasing pressure on the synchronous path.
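
A minimal sketch of starting such a job with boto3; the model ID, role ARN, and bucket URIs below are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock")

# Start an asynchronous batch inference job over the JSONL transcripts in S3.
response = bedrock.create_model_invocation_job(
    jobName="nightly-transcript-summaries",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",  # placeholder role
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://example-transcripts/input/"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://example-transcripts/output/"}
    },
)
print(response["jobArn"])
```

The interactive chat path keeps its on-demand capacity, and a morning job can load the completed outputs into DynamoDB.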

  • More concurrency worsens the bottleneck because it creates more simultaneous model requests and retry traffic.
  • Response streaming improves perceived first-token latency, not offline batch throughput or throttling.
  • Distributed Map fan-out changes orchestration but higher concurrency still exceeds token-processing capacity.

Question 4

Topic: Operational Efficiency and Optimization for GenAI Applications

A SaaS company runs a RAG support assistant on Amazon Bedrock in us-east-1 and eu-west-1. The current semantic-only vector search often misses exact error codes and retrieves passages from the wrong tenant or product version. The new design must support acronym/error-code queries and natural-language queries, keep p95 retrieval latency under 800 ms, enforce tenant, Region, and product-version isolation before generation, and reduce prompt token cost without custom model training.

Which architecture best meets these requirements?

Options:

  • A. Fine-tune a custom embedding model and scan S3 documents at query time.

  • B. Replicate all documents globally and apply tenant filters in the prompt.

  • C. Use hybrid OpenSearch retrieval, metadata prefilters, query rewriting, and reranking.

  • D. Increase vector top_k and let the FM discard irrelevant passages.

Best answer: C

Explanation: The best fit is an AWS-native RAG retrieval layer that combines lexical and vector matching, applies metadata filters before candidate generation, rewrites short or ambiguous queries, and reranks a small candidate set. This directly addresses exact codes, semantic queries, isolation, latency, and token cost without training a custom model.

For this workload, retrieval quality and efficiency depend on selecting and ranking the right candidates before the FM sees context. Use Amazon Bedrock Knowledge Bases or an application retrieval layer backed by Amazon OpenSearch Serverless/OpenSearch Service with hybrid search, such as BM25 plus vector similarity. Apply metadata filters for tenant, Region, and product version as prefilters so unauthorized or wrong-scope passages are not retrieved. Add query preprocessing to normalize product names, expand acronyms, and preserve error codes. Then rerank only a bounded top candidate set and pass the highest-value passages to the model. This improves precision and reduces prompt tokens while staying within latency targets. Relying on the FM to filter retrieved content is both slower and weaker for governance.
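
A minimal sketch of a scoped Knowledge Bases retrieval call with boto3; the knowledge base ID, metadata keys, and filter values are placeholders:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Retrieve tenant-, Region-, and version-scoped passages before generation.
response = agent_runtime.retrieve(
    knowledgeBaseId="KBEXAMPLE01",  # placeholder knowledge base ID
    retrievalQuery={"text": "ERR-4031 after SSO login"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 8,  # bounded candidate set before reranking
            "filter": {
                "andAll": [
                    {"equals": {"key": "tenant", "value": "acme"}},
                    {"equals": {"key": "region", "value": "eu-west-1"}},
                    {"equals": {"key": "product_version", "value": "4.2"}},
                ]
            },
        }
    },
)
for result in response["retrievalResults"]:
    print(result["score"], result["content"]["text"][:80])
```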

  • Larger top_k increases latency and token cost, and it does not enforce retrieval-time isolation.
  • Custom training overbuilds the requirement and scanning S3 at query time cannot meet low-latency retrieval.
  • Prompt-only filtering violates the need to enforce tenant, Region, and product-version isolation before generation.

Question 5

Topic: Operational Efficiency and Optimization for GenAI Applications

An application team already publishes Amazon Bedrock CloudWatch metrics for invocation latency and token counts and uses AWS Cost Anomaly Detection for unusual spend. The team now needs an AWS-native telemetry source that can capture model inputs and outputs so developers can sample prompts, compare responses with ground truth, and calculate hallucination and quality indicators. Which capability provides this role?

Options:

  • A. AWS Cost Explorer usage reports

  • B. AWS CloudTrail management event logging

  • C. Amazon Bedrock model invocation logging

  • D. AWS X-Ray distributed tracing

Best answer: C

Explanation: Amazon Bedrock model invocation logging is the AWS-native source for capturing model inputs and outputs for later analysis. CloudWatch metrics and cost tools help with operational and spend signals, but invocation logs provide the prompt and response artifacts needed to assess prompt effectiveness and hallucination indicators.

For GenAI observability, different telemetry sources answer different questions. CloudWatch metrics can track invocation counts, latency, errors, and token-related usage signals. Cost tools can detect unusual spend patterns. When the team needs prompt and response artifacts for quality review, grounding checks, or hallucination measurement, Amazon Bedrock model invocation logging is the appropriate capability because it can deliver invocation records to CloudWatch Logs or Amazon S3 under the team’s retention, IAM, and encryption controls. The key distinction is that quality analysis needs payload-level evidence, not only audit events, traces, or billing summaries.
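
A minimal sketch of enabling this capability with boto3, assuming delivery to a placeholder S3 audit bucket:

```python
import boto3

bedrock = boto3.client("bedrock")

# Deliver prompt/response records to an audit bucket (names are placeholders).
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "s3Config": {
            "bucketName": "example-bedrock-audit-logs",
            "keyPrefix": "invocations/",
        },
        "textDataDeliveryEnabled": True,       # capture text prompts and outputs
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)
```

Retention, IAM, and encryption controls on the destination bucket then govern who can sample prompts for quality review.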

  • CloudTrail audit scope is useful for who-called-what evidence, but it does not provide full model prompt and response content for quality scoring.
  • Cost reports help identify spend trends and anomalies, but they do not explain prompt effectiveness or response grounding.
  • X-Ray tracing helps diagnose application latency and service paths, but it is not the primary source for Bedrock prompt and output evaluation data.

Question 6

Topic: Operational Efficiency and Optimization for GenAI Applications

A company is deploying a customer-support RAG application: API Gateway invokes Lambda, and Lambda calls an Amazon Bedrock Prompt Flow that uses Prompt Management and a Bedrock Knowledge Base backed by OpenSearch Service. SREs need per-request latency, throttling, and error diagnostics. The AI team needs FM interaction records with model ID, prompt version, retrieved document IDs, token usage, and guardrail outcome. Product owners need a weekly case-deflection dashboard. Raw prompts and outputs can be retained only in an encrypted audit bucket with restricted IAM, not in general logs or business dashboards. Which implementation best meets these requirements?

Options:

  • A. Use default Lambda metrics and API Gateway access logs, then build product dashboards from DynamoDB cases.

  • B. Enable Bedrock invocation logging to CloudWatch Logs and point product dashboards at those logs.

  • C. Store full prompts, outputs, and trace data in DynamoDB for all dashboards.

  • D. Use X-Ray correlation, sanitized EMF logs, encrypted Bedrock audit logging, and curated business events.

Best answer: D

Explanation: The best implementation separates observability concerns while preserving privacy. X-Ray and correlation IDs support request diagnostics, CloudWatch EMF provides operational and FM interaction metrics, restricted Bedrock invocation logging covers audit needs, and curated events feed business dashboards without exposing raw prompts.

Holistic GenAI observability should connect application traces, model interaction metadata, operational metrics, and business outcomes using a shared correlation ID. For this workload, instrument API Gateway and Lambda with X-Ray, create subsegments around Bedrock and Knowledge Base calls, and emit structured CloudWatch Embedded Metric Format events that include sanitized fields such as model ID, prompt version, retrieval IDs, token counts, latency, errors, and guardrail actions. Raw prompts and outputs should not be placed in general logs or product analytics; if retained, Bedrock model invocation logging should write to a KMS-encrypted S3 audit bucket with tightly scoped IAM. Curated business events can then be analyzed with Athena or QuickSight for case-deflection reporting. The key is separating raw audit evidence from sanitized operational and business observability.
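
A minimal sketch of the sanitized EMF piece, emitting a metric-bearing JSON record from Lambda with metadata fields only; the namespace and field names are hypothetical:

```python
import json
import time

def emit_fm_interaction_metric(model_id: str, prompt_version: str,
                               input_tokens: int, output_tokens: int,
                               latency_ms: float, guardrail_action: str) -> None:
    """Emit a sanitized CloudWatch EMF record; no raw prompt or output text."""
    event = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "SupportAssistant",  # hypothetical namespace
                "Dimensions": [["ModelId", "PromptVersion"]],
                "Metrics": [
                    {"Name": "InputTokens", "Unit": "Count"},
                    {"Name": "OutputTokens", "Unit": "Count"},
                    {"Name": "LatencyMs", "Unit": "Milliseconds"},
                ],
            }],
        },
        "ModelId": model_id,
        "PromptVersion": prompt_version,
        "InputTokens": input_tokens,
        "OutputTokens": output_tokens,
        "LatencyMs": latency_ms,
        "GuardrailAction": guardrail_action,  # outcome metadata, not payload
    }
    print(json.dumps(event))  # Lambda stdout -> CloudWatch Logs -> metrics
```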

  • Default metrics only miss FM interaction details, retrieval context, token usage, guardrail outcomes, and end-to-end trace correlation.
  • CloudWatch raw prompts violates the constraint that raw prompts and outputs must not enter general logs or business dashboards.
  • DynamoDB for everything centralizes sensitive raw content and does not provide native distributed tracing or metric aggregation.

Question 7

Topic: Operational Efficiency and Optimization for GenAI Applications

A production GenAI API uses Amazon Bedrock for a selected foundation model. Traffic analysis shows predictable, sustained business-hour demand, and on-demand invocations are being throttled during those periods. Which AWS capability is designed to improve throughput predictability for this model workload?

Options:

  • A. AWS Lambda reserved concurrency

  • B. InvokeModelWithResponseStream

  • C. Amazon Bedrock Provisioned Throughput

  • D. Amazon Bedrock Prompt Management

Best answer: C

Explanation: Amazon Bedrock Provisioned Throughput is used when capacity planning shows predictable, sustained model invocation demand. It provides dedicated model capacity for more consistent throughput than on-demand usage during known peaks.

Provisioned Throughput in Amazon Bedrock addresses model inference capacity, not prompt versioning or response presentation. For a workload with recurring high request and token volume, the team should estimate expected input and output tokens, concurrency, and peak periods, then provision the appropriate model capacity. Streaming can improve perceived latency for users, but it does not reserve model throughput. Lambda reserved concurrency can protect or cap the caller tier, but it does not increase Bedrock model capacity.
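
A minimal sketch of provisioning capacity with boto3; the name, model ID, and unit count are placeholders that would come from capacity planning:

```python
import boto3

bedrock = boto3.client("bedrock")

# Reserve dedicated model capacity for the predictable business-hour peak.
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="support-assistant-peak",        # placeholder name
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",    # placeholder model ID
    modelUnits=2,                   # sized from expected token and concurrency load
    commitmentDuration="OneMonth",  # omit for no-commitment provisioning
)
print(response["provisionedModelArn"])
```

The returned ARN is then used as the model identifier when invoking the provisioned capacity.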

  • Prompt versioning helps manage prompts but does not reserve inference capacity.
  • Streaming responses can improve user experience but does not remove model-side throttling.
  • Caller concurrency controls Lambda execution capacity, not the Bedrock model throughput limit.

Question 8

Topic: Operational Efficiency and Optimization for GenAI Applications

An online legal research company hosts a customized FM on a SageMaker real-time endpoint for a synchronous document Q&A API. Application Auto Scaling uses CPUUtilization target tracking with minimum 2 and maximum 6 instances. During launch testing, p95 latency rises from 11 seconds to 35 seconds, with no API Gateway or Lambda throttles.

  • Peak load: 160 concurrent users; each sends 1 prompt every 40 seconds
  • Average model call: 2,000 output tokens and 10 seconds
  • Benchmark: 1 endpoint instance meets the SLO for 5 concurrent generations
  • During the test: CPUUtilization is 45%; DesiredInstanceCount remains 2

Which change fixes the root cause with the smallest safe change?

Options:

  • A. Scale endpoint on concurrent requests with maximum capacity 8.

  • B. Reduce max_new_tokens to 1,000.

  • C. Raise Lambda reserved concurrency to 160.

  • D. Raise SageMaker maximum capacity to 8 only.

Best answer: A

Explanation: The symptom is high endpoint latency while upstream services and CPU are not saturated. The root cause is model-serving concurrency, not Lambda or CPU. With 160 users sending one request every 40 seconds and 10-second calls, the endpoint needs about 40 active generation slots, or 8 instances at 5 each.

Symptom: p95 latency spikes while DesiredInstanceCount stays at 2 and CPU remains below the scaling target. Root cause: the endpoint is capacity-bound by in-flight token generation, so CPU target tracking does not represent demand. The load is 4 requests per second, and 10-second model calls create about 40 concurrent generations. Since one instance meets the SLO at 5 concurrent generations, Application Auto Scaling should use a concurrent-requests or generation-slots metric and be allowed to scale to 8 instances. Changing Lambda capacity or buffering upstream does not add model-serving capacity.
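
The sizing is Little's law applied to the numbers in the stem, concurrency equals arrival rate times service time:

\[
\begin{aligned}
\text{Arrival rate} &= 160 / 40 = 4 \text{ requests/sec}\\
\text{Concurrent generations} &= 4 \times 10 = 40\\
\text{Instances required} &= \lceil 40 / 5 \rceil = 8
\end{aligned}
\]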

  • Lambda concurrency is not the bottleneck because there are no API Gateway or Lambda throttles.
  • Maximum only fails because the CPU-based policy still has no signal to scale out.
  • Token reduction may reduce latency, but it changes the required answer length instead of fixing capacity planning.

Question 9

Topic: Operational Efficiency and Optimization for GenAI Applications

An enterprise is deploying a fine-tuned open-source text FM for an internal support assistant. The endpoint must stay in us-east-1, be invoked privately from the company VPC, stream tokens, and keep p95 response time under 40 seconds. Peak forecast is 600 concurrent chat requests; each request averages 800 total tokens and 20 seconds of model time. Load testing shows one GPU instance sustains 40 concurrent streams or 1,200 tokens/sec before latency degrades. Overnight traffic drops to 10%, and finance wants GPU capacity to scale in. Which architecture best meets these requirements?

Options:

  • A. Run a SageMaker real-time endpoint with 15 peak instances and autoscale only on invocations per instance.

  • B. Queue all requests in SQS and process them with SageMaker Batch Transform on GPU.

  • C. Use Bedrock cross-Region on-demand inference profiles and let AWS manage all model capacity.

  • D. Run a SageMaker real-time endpoint with 20 peak instances, VPC endpoints, streaming, and scheduled plus token/concurrency autoscaling.

Best answer: D

Explanation: FM capacity must be sized by the stricter of concurrent request capacity and token throughput. In this case, token throughput requires 20 GPU instances, not the 15 that concurrency alone suggests. SageMaker real-time inference with private networking, streaming, and Application Auto Scaling fits the operational constraints.

Capacity planning for FM endpoints must use the tighter of concurrent-request capacity and token-throughput capacity, then pair that with an autoscaling signal that reflects real load. Here concurrency requires 15 instances, but token throughput requires 20 instances:

\[
\begin{aligned}
\text{Token rate} &= 600 \times 800 / 20 = 24{,}000 \text{ tokens/sec}\\
\text{Instances by concurrency} &= \lceil 600 / 40 \rceil = 15\\
\text{Instances by token rate} &= \lceil 24{,}000 / 1{,}200 \rceil = 20
\end{aligned}
\]

A SageMaker real-time endpoint invoked privately through VPC endpoints can stream responses and be managed by Application Auto Scaling. Scheduled scaling can pre-warm peak capacity, while custom CloudWatch metrics for active streams and tokens/sec allow scale-in overnight.
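
A minimal sketch of the autoscaling piece with boto3; the endpoint, variant, namespace, and metric names are hypothetical, and the serving container is assumed to publish a per-instance tokens/sec metric to CloudWatch:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

RESOURCE_ID = "endpoint/support-fm-endpoint/variant/AllTraffic"  # placeholder names

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,    # overnight floor at roughly 10% of peak traffic
    MaxCapacity=20,   # sized by token throughput, not concurrency alone
)

# Track the custom tokens/sec metric so scale-in happens when load drops.
autoscaling.put_scaling_policy(
    PolicyName="tokens-per-second-target",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # headroom under the 1,200 tokens/sec degradation point
        "CustomizedMetricSpecification": {
            "MetricName": "TokensPerSecondPerInstance",  # hypothetical custom metric
            "Namespace": "SupportAssistant",
            "Statistic": "Average",
        },
    },
)
```

A put_scheduled_action call on the same scalable target can pre-warm the 20-instance peak before business hours begin.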

  • Concurrency-only sizing fails because 15 instances handle 600 streams but not the required 24,000 tokens/sec.
  • Cross-Region inference conflicts with the stated us-east-1 locality requirement and does not fit the specified custom deployment.
  • Batch processing is for offline jobs and would break interactive token streaming and the p95 latency goal.

Question 10

Topic: Operational Efficiency and Optimization for GenAI Applications

An enterprise support assistant uses API Gateway and Lambda to call Amazon Bedrock Converse after retrieving tenant-scoped documents. AWS Cost Anomaly Detection reports a 42% Bedrock spend increase after a mobile app release. CloudWatch Logs show many duplicate successful invocations within 60 seconds:

tenant=acme model=claude promptTemplate=v14 guardrail=v3
normalizedQuestion="reset vpn token"
retrievedDocSetHash=1b9a
bedrockInvoke=true repeatCount=18

The team must reduce repeated FM calls without returning answers across tenants or stale retrieval contexts. Which smallest safe change addresses the root cause?

Options:

  • A. Enable prompt caching for only the static system prompt.

  • B. Enable CloudFront caching for chat API responses.

  • C. Cache responses by tenant/model/template/question/context/guardrail hash with TTL.

  • D. Use semantic caching keyed only by normalized question.

Best answer: C

Explanation: The cost symptom is repeated identical logical requests reaching Bedrock. A deterministic cache or idempotency key built from all answer-affecting inputs prevents duplicate invocations while preserving tenant and retrieval boundaries. A TTL limits freshness risk.

Symptom: Bedrock spend spikes and logs show the same tenant, model, prompt template, guardrail, question, and retrieved document set invoked many times. Root cause: duplicate client submissions or retries are not deduplicated before the FM call. Fix: create a canonical request representation after retrieval and policy selection, hash the tenant/security scope, model, prompt template version, guardrail version, normalized question, and retrievedDocSetHash, then store successful responses with a short TTL, such as in DynamoDB. This avoids cross-tenant reuse and avoids serving an answer when the retrieval context changes. The key takeaway is to cache the complete deterministic request, not just a convenient subset of the prompt.
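
A minimal sketch of the deterministic cache with boto3 and DynamoDB; the table name, attribute names, and TTL value are placeholders:

```python
import hashlib
import json
import time

import boto3

table = boto3.resource("dynamodb").Table("fm-response-cache")  # placeholder table
CACHE_TTL_SECONDS = 300  # short TTL limits freshness risk

def cache_key(tenant: str, model_id: str, template_version: str,
              guardrail_version: str, normalized_question: str,
              retrieved_doc_set_hash: str) -> str:
    """Hash every answer-affecting input so cache hits are exact logical matches."""
    canonical = json.dumps({
        "tenant": tenant,
        "model": model_id,
        "template": template_version,
        "guardrail": guardrail_version,
        "question": normalized_question,
        "docs": retrieved_doc_set_hash,
    }, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def get_or_invoke(key: str, invoke_fn):
    """Return a cached answer if fresh; otherwise invoke the model and cache it."""
    item = table.get_item(Key={"cacheKey": key}).get("Item")
    if item and item["expiresAt"] > int(time.time()):
        return item["response"]
    response = invoke_fn()  # the actual Bedrock Converse call
    table.put_item(Item={
        "cacheKey": key,
        "response": response,
        "expiresAt": int(time.time()) + CACHE_TTL_SECONDS,  # DynamoDB TTL attribute
    })
    return response
```

Because the tenant scope and retrievedDocSetHash are part of the key, a cached answer can never cross tenants or outlive its retrieval context by more than the TTL.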

  • Edge cache mismatch fails because authenticated chat responses are tenant-specific and dynamic, so generic CloudFront caching is unsafe.
  • Question-only semantics risks reusing answers across tenants or outdated document sets because retrieval context is not part of the key.
  • Prefix-only prompt caching may reduce some input-token processing, but it still allows duplicate full requests and output generation.

Continue with full practice

Use the AWS AIP-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Try AWS AIP-C01 on Web
View AWS AIP-C01 Practice Test

Free review resource

Read the AWS AIP-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026