AWS AIF-C01: Applications of Foundation Models

Try 10 focused AWS AIF-C01 questions on Applications of Foundation Models, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Topic snapshot

Field | Detail
Exam route | AWS AIF-C01
Topic area | Applications of Foundation Models
Blueprint weight | 28%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Applications of Foundation Models for AWS AIF-C01. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 28% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Applications of Foundation Models

In generative AI prompt engineering, which term refers to a curated set of representative test prompts (often with expected answers or a scoring rubric) that is run repeatedly to measure response quality and track regressions when changing a model or prompt?

Options:

  • A. Golden prompt set (evaluation prompt suite)

  • B. Embedding index

  • C. Bedrock Guardrails policy

  • D. Few-shot prompt

Best answer: A

Explanation: A golden prompt set is an evaluation artifact: a stable, representative suite of prompts paired with expected outputs and/or a scoring rubric. Running it consistently lets you quantify quality and detect regressions when you modify prompts, models, or safety settings.

A key best practice for LLM evaluation is to build a repeatable test suite. A “golden prompt set” (also called an evaluation prompt suite) is a curated collection of prompts that represent real user tasks, paired with expected outputs and/or clear scoring criteria (for example, accuracy, completeness, tone, groundedness). You run the same suite across prompt/model iterations to compare results and catch regressions.

The closest confusion is with few-shot prompting: few-shot examples are included inside a prompt to guide model behavior at runtime, whereas a golden prompt set is used outside runtime as a benchmark for measuring quality.

  • Few-shot prompting guides generation by embedding examples in the prompt; it is not a measurement artifact.
  • Embedding index supports semantic retrieval (RAG) and similarity search, not end-to-end response quality evaluation.
  • Guardrails policy helps enforce safety and formatting constraints; it does not define a quality benchmark suite.
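
To make the idea concrete, the sketch below runs a small golden prompt set against a model and scores each response with a simple keyword rubric. This is a minimal illustration, not a Bedrock-specific workflow: call_model stands in for whatever inference call your application uses, and the prompts, expected facts, and scoring rule are invented.

```python
# Minimal sketch of running a golden prompt set (evaluation prompt suite).
# call_model is a placeholder for your inference call; the rubric below is a
# deliberately simple keyword check standing in for a real scoring method.

golden_set = [
    {"prompt": "Summarize our refund policy in one sentence.",
     "must_include": ["30 days", "original payment method"]},
    {"prompt": "What is the SLA for priority-1 tickets?",
     "must_include": ["1 hour"]},
]

def score(response: str, must_include: list[str]) -> float:
    """Fraction of required facts that appear in the response."""
    hits = sum(1 for fact in must_include if fact.lower() in response.lower())
    return hits / len(must_include)

def evaluate(call_model) -> float:
    """Run every golden prompt and return the average rubric score."""
    scores = [score(call_model(item["prompt"]), item["must_include"])
              for item in golden_set]
    return sum(scores) / len(scores)

# Re-run evaluate() after every prompt or model change and compare the
# averages over time to detect regressions.
```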

Question 2

Topic: Applications of Foundation Models

A company is adapting a generative AI assistant for customer support. The company’s key requirement is that the model’s behavior aligns with internal policies for tone and refusal handling. The company can have reviewers compare two candidate responses to the same prompt and choose which response is better.

Which approach best fits this requirement, and why?

Options:

  • A. Use RLHF to optimize the model to human preferences

  • B. Supervised fine-tune on a labeled dataset of ideal answers

  • C. Use RAG so answers are grounded in company documents

  • D. Add prompt instructions and enforce rules with guardrails

Best answer: A

Explanation: The discriminating factor is using human preference comparisons as the training signal. RLHF turns those comparisons into a reward model (or reward signal) and then uses reinforcement learning to adjust the model’s behavior toward what humans prefer. This is commonly used to improve helpfulness and safety/alignment characteristics.

Reinforcement Learning from Human Feedback (RLHF) is used when you want a model’s responses to better match human expectations (for example, helpfulness, tone, and safe refusal behavior) using human judgments rather than only “correct labels.” In the scenario, reviewers can compare two responses and pick the better one, which is the classic input to RLHF.

At a high level, RLHF works like this:

  • Collect human preference data (rankings/comparisons of responses)
  • Train a reward model (or reward function) to predict those preferences
  • Use reinforcement learning to update the model to maximize the reward
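
For intuition on the reward-model step, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used when training a reward model from human comparisons: it is small when the preferred response scores higher than the rejected one. The reward values are plain numbers here; a real reward model would compute them from the prompt and response text.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss used in many RLHF reward-model setups:
    -log(sigmoid(r_chosen - r_rejected)), minimized when the chosen
    response is scored above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# One human comparison: reviewers preferred response A over response B.
print(pairwise_preference_loss(reward_chosen=2.1, reward_rejected=0.4))  # small loss
print(pairwise_preference_loss(reward_chosen=0.4, reward_rejected=2.1))  # large loss
```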

Compared with supervised fine-tuning, RLHF directly optimizes toward a learned preference-based reward, which is why it’s often chosen for alignment goals.

  • Supervised-only tuning uses target outputs, not preference-based reward optimization.
  • RAG improves grounding with external data but does not train the model’s behavior to match preferences.
  • Guardrails/prompting can constrain outputs at runtime but does not perform preference-driven training updates.

Question 3

Topic: Applications of Foundation Models

Which statement best defines a task success measure for evaluating whether a foundation model meets business objectives?

Options:

  • A. A measure of similarity between generated text and a reference using n-grams

  • B. An intrinsic metric that estimates how well the model predicts the next token

  • C. A task-specific pass/fail or score based on meeting predefined outcome criteria

  • D. A count of tokens processed per request to estimate prompt size and cost

Best answer: C

Explanation: Task success measures are task-specific metrics tied directly to business outcomes (for example, correct classification rate, resolution rate, or compliance rate). They use predefined success criteria to determine whether the model’s outputs are acceptable for the real workload. This helps decide if the model meets objectives beyond generic language quality indicators.

The core idea is aligning evaluation to what “success” means for the business task. A task success measure checks model outputs against predefined criteria (often using labeled data, automated checks, or human review) and summarizes performance as a rate or score, such as accuracy for routing, percent of answers that cite approved sources, or percent of support chats resolved without escalation. This differs from intrinsic or proxy metrics (like next-token prediction quality) that can be useful for model research but do not directly confirm the model achieves the desired task outcome under your acceptance standards.

  • Next-token prediction quality describes intrinsic metrics like perplexity, which don’t directly measure business task outcomes.
  • Reference similarity describes overlap metrics (for example, ROUGE/BLEU) that may not reflect whether the output satisfies real acceptance criteria.
  • Token counting relates to cost/limits and throughput, not whether responses are successful for the task.
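
As a minimal illustration, a task success measure can be computed as simply as in the sketch below: compare each output against a predefined acceptance check and report a pass rate. The records and labels are hypothetical.

```python
# Hypothetical evaluation records: each case was judged against predefined
# acceptance criteria by a reviewer or an automated check.
results = [
    {"case": "route-billing-question", "passed": True},
    {"case": "route-password-reset",   "passed": True},
    {"case": "route-refund-dispute",   "passed": False},
    {"case": "route-plan-upgrade",     "passed": True},
]

task_success_rate = sum(r["passed"] for r in results) / len(results)
print(f"Task success rate: {task_success_rate:.0%}")  # 75% against the acceptance criteria
```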

Question 4

Topic: Applications of Foundation Models

A company is building a customer support assistant using an Amazon Bedrock text model to label each incoming chat into one of five predefined issue categories and return a short justification. In testing, a single instruction prompt produces inconsistent category names and output structure across similar chats. Constraints: the solution must not require model training, must keep latency low, and must not store chat transcripts for later use.

Which approach is the BEST way to improve output consistency?

Options:

  • A. Fine-tune a custom text classifier in Amazon SageMaker AI using stored chat logs

  • B. Use single-shot prompting and only expand the instruction details

  • C. Use few-shot prompting with several labeled input/output examples and a fixed response format

  • D. Use Knowledge Bases for Amazon Bedrock to retrieve similar past chats with citations

Best answer: C

Explanation: Few-shot prompting improves consistency by showing the model a small set of representative examples that match the required labels and output schema. This approach works with a foundation model as-is, adds minimal overhead compared to building retrieval or training pipelines, and avoids persisting chat transcripts.

Single-shot prompting provides only instructions, so the model may still vary labels (synonyms, capitalization) and formatting. Few-shot prompting adds a handful of example chat inputs paired with the correct category and the exact output structure (for example, a specific JSON shape). Those examples act as in-context guidance, which typically makes the model follow the fixed label set and formatting more reliably while still using Amazon Bedrock without training.

In this scenario, few-shot prompting best meets the constraints because it:

  • avoids model training
  • keeps latency low (no retrieval step)
  • does not require storing chat transcripts for later retrieval or training

The key takeaway is to use few-shot prompting when you need the model to mimic a specific pattern or schema consistently.

  • Expanding instructions alone may still produce label/format drift because there are no concrete patterns to imitate.
  • RAG for similar chats adds a retrieval dependency and typically requires storing/indexing chats, which conflicts with the constraints and isn’t necessary to enforce a schema.
  • Fine-tuning a classifier violates the “no training” constraint and would require retaining labeled chat data.
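
A rough sketch of such a few-shot prompt is shown below. The categories, example chats, and JSON shape are invented for illustration; the point is that the examples demonstrate both the allowed label set and the exact output structure the model should imitate.

```python
# Few-shot classification prompt: the examples pin down the fixed label set
# and the exact JSON output shape. Labels and examples are illustrative only.
FEW_SHOT_PROMPT = """Classify the chat into one of: Billing, Login, Shipping, Returns, Other.
Respond with JSON only, in the shape {"category": "<label>", "justification": "<one sentence>"}.

Chat: "I was charged twice for my subscription this month."
Answer: {"category": "Billing", "justification": "The customer reports a duplicate charge."}

Chat: "My package tracking has not updated in five days."
Answer: {"category": "Shipping", "justification": "The customer is asking about delivery status."}

Chat: "{chat_text}"
Answer:"""

def build_prompt(chat_text: str) -> str:
    # Use replace (not str.format) because the template contains literal braces.
    return FEW_SHOT_PROMPT.replace("{chat_text}", chat_text)
```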

Question 5

Topic: Applications of Foundation Models

When using a foundation model (FM) in production, which TWO statements best explain why ongoing monitoring and feedback are needed after deployment?

Options:

  • A. Monitoring is only needed for model latency and cost, not response quality

  • B. If offline test results are strong, production monitoring is unnecessary

  • C. Encryption with AWS KMS prevents model quality drift

  • D. AWS CloudTrail logs are sufficient to measure response accuracy and relevance

  • E. User feedback and periodic re-evaluation help detect new failure modes and needed updates

  • F. Production inputs can shift, causing output quality to drift over time

Correct answers: E and F

Explanation: FM behavior in production can change because real user prompts, retrieved context, and business needs evolve over time. Monitoring helps detect quality drift (accuracy, relevance, safety) that offline tests may not predict. Feedback provides evidence to update prompts, guardrails, or knowledge sources and to re-evaluate the solution against current requirements.

The core idea is that offline evaluation is a snapshot, but production is a moving target. After deployment, the distribution of prompts, languages, and user intent can shift, and any retrieval sources (for RAG) can change as new documents are added or older content becomes outdated. These changes can create data drift or concept drift that degrades quality, increases hallucinations, or introduces new unsafe behaviors.

A practical post-deployment loop is:

  • Monitor quality signals (for example, groundedness, relevance, safety).
  • Collect user or reviewer feedback on failures.
  • Re-evaluate on updated test sets and real production slices.
  • Apply updates (prompt/guardrails/knowledge base/model choice) and repeat.

The key takeaway is that monitoring and feedback are necessary to detect and correct real-world drift, not just to track operational metrics.

  • Quality drift risk: production inputs and context can change, so quality can degrade even if the model is unchanged.
  • Feedback loop value: feedback and re-evaluation reveal new edge cases and guide updates to prompts, guardrails, and knowledge.
  • Offline tests do not guarantee stability: offline results cannot account for future shifts in data and user behavior.
  • Security and audit controls are not quality controls: KMS and CloudTrail support security and governance, but they do not measure or prevent accuracy or relevance drift.
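
As a rough illustration of the re-evaluation step, the sketch below re-scores the same evaluation set each week and flags drift when the drop from the launch baseline exceeds a threshold. The scores and threshold are invented.

```python
# Illustrative drift check: re-run the same evaluation set each week and
# compare against the score recorded at launch. Values are made up.
BASELINE_GROUNDEDNESS = 0.91
ALERT_THRESHOLD = 0.05  # flag drops larger than 5 percentage points

weekly_scores = {"2025-W01": 0.90, "2025-W02": 0.88, "2025-W03": 0.84}

for week, score in weekly_scores.items():
    if BASELINE_GROUNDEDNESS - score > ALERT_THRESHOLD:
        print(f"{week}: groundedness {score:.2f} drifted below baseline; review prompts and knowledge sources")
    else:
        print(f"{week}: groundedness {score:.2f} within tolerance")
```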

Question 6

Topic: Applications of Foundation Models

A company is building an internal HR assistant using Amazon Bedrock. HR policies are stored in Amazon S3 and updated weekly. The company wants the assistant to answer questions using the latest approved policy text and reduce hallucinations by grounding responses in external knowledge.

Which TWO actions best implement Retrieval Augmented Generation (RAG) for this use case? (Select TWO.)

Options:

  • A. Select a larger model to improve factual accuracy without retrieval

  • B. Add retrieved passages to the prompt and require source citations

  • C. Enable AWS CloudTrail logging for Bedrock API calls

  • D. Use Knowledge Bases for Amazon Bedrock over the S3 policy docs

  • E. Fine-tune the foundation model on the HR policy documents

  • F. Enable Bedrock Guardrails to block unsafe or off-topic responses

  • G. Encrypt the S3 bucket and vector store with AWS KMS keys

Correct answers: B and D

Explanation: RAG combines retrieval and generation: it first retrieves relevant, authoritative content from an external knowledge source (such as HR policies in S3) and then uses that retrieved context to generate the answer. The goal is to ground responses in up-to-date, approved information, which reduces hallucinations compared with relying on the model’s parametric knowledge alone.

Retrieval Augmented Generation (RAG) is a pattern where an application retrieves relevant information from an external knowledge source at inference time and then provides that information to a foundation model as context for generating the response. In this scenario, the external knowledge is the frequently updated HR policy content in S3, so retrieving the latest passages and conditioning the model on them helps ensure answers reflect current, approved policies.

RAG at a high level:

  • Retrieve the most relevant policy passages for the user’s question (often via embeddings and a vector index).
  • Augment the model input with those passages.
  • Generate an answer that stays anchored to the retrieved content (often with citations or an “I don’t know” behavior when nothing relevant is found).

Controls like guardrails, encryption, and auditing are valuable, but they do not themselves implement the retrieve-then-generate grounding mechanism.

  • ✔ Use Knowledge Bases for Amazon Bedrock over the S3 policy docs — performs retrieval from an external knowledge source at runtime.
  • ✔ Add retrieved passages to the prompt and require source citations — grounds the generated answer in retrieved context.
  • ✖ Fine-tune the foundation model on the HR policy documents — updates model weights, but does not retrieve current sources per query.
  • ✖ Enable Bedrock Guardrails to block unsafe or off-topic responses — improves safety, not grounding via retrieval.
  • ✖ Encrypt the S3 bucket and vector store with AWS KMS keys — strengthens security, but doesn’t add retrieval-augmented context.
  • ✖ Select a larger model to improve factual accuracy without retrieval — can still hallucinate and won’t reflect weekly updates.
  • ✖ Enable AWS CloudTrail logging for Bedrock API calls — supports auditing/governance, not RAG grounding.
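
A sketch of the retrieve-then-generate flow is shown below, using the Bedrock Agent Runtime Retrieve API and a model call through the Bedrock Runtime Converse API. Treat the field names as approximate and check the current SDK documentation; the knowledge base and model IDs are placeholders.

```python
import boto3

# Sketch of retrieve-then-generate with a Bedrock knowledge base.
# Field names follow the boto3 APIs as commonly documented; verify against
# the current SDK reference. IDs below are placeholders.
agent_rt = boto3.client("bedrock-agent-runtime")
bedrock_rt = boto3.client("bedrock-runtime")

def answer_hr_question(question: str, kb_id: str = "KB_ID_PLACEHOLDER") -> str:
    # 1) Retrieve the most relevant policy passages for the question.
    retrieved = agent_rt.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
    )
    passages = [r["content"]["text"] for r in retrieved["retrievalResults"]]

    # 2) Augment the prompt with the retrieved context and require citations.
    prompt = (
        "Answer using only the policy excerpts below and cite the excerpt you used. "
        "If the excerpts do not answer the question, say you do not know.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}"
    )

    # 3) Generate the grounded answer.
    response = bedrock_rt.converse(
        modelId="MODEL_ID_PLACEHOLDER",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```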

Question 7

Topic: Applications of Foundation Models

Which TWO statements correctly describe prompt engineering risks related to prompt poisoning and data contamination when using foundation models? (Select TWO)

Options:

  • A. Contamination is harmless because foundation models don’t memorize.

  • B. Malicious text in RAG sources can override intended instructions.

  • C. Prompt poisoning happens only when you fine-tune a model.

  • D. Setting temperature to 0 eliminates prompt poisoning risk.

  • E. Mixing test questions into training data can inflate evaluation metrics.

  • F. Data contamination is fixed by encrypting prompts with KMS.

Correct answers: B and E

Explanation: Prompt poisoning is an inference-time risk where untrusted input (including retrieved documents) contains instructions that manipulate model behavior. Data contamination is a data hygiene risk where training or tuning data overlaps with evaluation data, or includes sensitive content, leading to misleading results or unintended disclosure.

Prompt poisoning occurs when an attacker (or untrusted source) embeds instructions in user input or retrieved context (common in RAG) that attempt to override the application’s intended instructions, such as asking the model to ignore prior guidance or reveal hidden data.

Data contamination is about what data is included in training, fine-tuning, or evaluation: if evaluation items leak into training data, performance metrics are artificially high; if sensitive data is included, the model may reproduce or reveal it later. Strong prompt design and guardrails help with poisoning, while strict dataset governance and separation help prevent contamination. The key distinction is untrusted instructions at inference time vs. problematic overlap/sensitive content in datasets.

  • Poisoned retrieved context is a real risk when RAG sources are untrusted.
  • Train/test leakage is contamination that can inflate reported accuracy.
  • Fine-tuning only is wrong; poisoning can happen at inference via inputs.
  • Temperature control changes randomness, not susceptibility to malicious instructions.
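
As a simple illustration of contamination checking, the sketch below flags evaluation items that also appear (after light normalization) in the tuning dataset, one common source of inflated metrics. Real pipelines typically add fuzzy or embedding-based matching; the examples are invented.

```python
# Naive train/test contamination check: exact-match overlap after light
# normalization. Real checks often add fuzzy or embedding-based matching.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def find_contamination(train_examples: list[str], eval_examples: list[str]) -> list[str]:
    train_set = {normalize(t) for t in train_examples}
    return [e for e in eval_examples if normalize(e) in train_set]

leaked = find_contamination(
    train_examples=["How do I reset my password?", "What is the refund window?"],
    eval_examples=["what is the refund window?", "How do I close my account?"],
)
print(leaked)  # ['what is the refund window?'] -> remove or replace before reporting metrics
```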

Question 8

Topic: Applications of Foundation Models

Which statement about using ROUGE to evaluate foundation model outputs is INCORRECT?

Options:

  • A. A high ROUGE score does not guarantee the summary is factual or free of hallucinations.

  • B. ROUGE primarily scores lexical overlap (for example, n-grams) between a generated summary and a reference summary.

  • C. ROUGE captures semantic similarity using embeddings, so it is best for evaluating open-ended question answering without references.

  • D. ROUGE is commonly used for automated evaluation of text summarization quality when reference summaries exist.

Best answer: C

Explanation: The incorrect statement is the one claiming ROUGE measures embedding-based semantic similarity and is best for open-ended QA without references. ROUGE is a reference-based, overlap-oriented family of metrics designed mainly for summarization. It can be useful for regression testing but it does not verify factuality.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is most appropriate when you can compare a model-generated summary to one or more human-written reference summaries. At a high level, it measures surface-level text overlap such as matching unigrams/bigrams (ROUGE-1/2) or longest common subsequences (ROUGE-L), which makes it a common automated metric for summarization benchmarking.

Because ROUGE is overlap-based, it does not directly measure meaning via embeddings, and it is not a good primary metric for open-ended question answering where there may be many valid phrasings or no reference answer. Also, a high ROUGE score can still occur even if a summary contains incorrect facts, so separate checks (for example, human review or factuality evaluation) are needed.

  • Overlap-based metric is accurate: ROUGE scores lexical overlap (n-grams/LCS) against references.
  • Good fit for summarization is accurate: it’s widely used when reference summaries are available.
  • Embedding semantic QA claim is misleading: ROUGE is not an embedding similarity metric and typically relies on reference text.
  • Not a factuality guarantee is accurate: overlap can be high even when facts are wrong.
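
To make the overlap idea concrete, here is a hand-rolled ROUGE-1 computation on unigrams (libraries such as rouge-score implement the official variants). Note how a factually wrong candidate can still score highly because most words match the reference.

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """Unigram ROUGE-1: word-count overlap between candidate and reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

reference = "the outage lasted two hours and affected eu customers"
candidate = "the outage lasted ten hours and affected eu customers"  # wrong fact, high overlap
print(rouge_1(reference, candidate))  # high scores despite the factual error
```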

Question 9

Topic: Applications of Foundation Models

A company is building a customer support chat assistant using Amazon Bedrock. The business objective is to reduce the number of support tickets handed to human agents by 20% while maintaining customer satisfaction.

Which evaluation approach is the BEST way to determine whether a foundation model meets this business objective?

Options:

  • A. Compare model perplexity on a held-out set of past chat transcripts

  • B. Compare average response latency and throughput under expected load

  • C. Compare ROUGE scores against reference answers for common questions

  • D. Run an A/B test and measure ticket deflection and post-chat CSAT

Best answer: D

Explanation: The deciding attribute is using task success measures aligned to the business objective. Ticket deflection (or escalation rate) and post-chat CSAT measure whether the assistant resolves issues without harming satisfaction. This directly answers whether the model meets the company’s goals, rather than only measuring model behavior or system performance.

To determine whether a foundation model meets business objectives, evaluation should focus on task success measures that reflect real outcomes for the intended workflow. In this scenario, success is defined by fewer tickets reaching human agents while maintaining customer satisfaction, so the evaluation should measure escalation/deflection and CSAT in a realistic pilot (often an A/B test) using representative users and traffic. Model-intrinsic metrics (like perplexity) and text-overlap metrics (like ROUGE) can be useful for research-style comparisons, but they do not reliably predict whether customers will resolve issues without escalation. Likewise, latency and throughput are important non-functional requirements, but they do not indicate task completion or satisfaction.

Key takeaway: choose metrics that map directly to the business KPI, not just model quality proxies.

  • Perplexity proxy measures next-token prediction, not whether users successfully resolve issues.
  • Performance-only check validates responsiveness, but not ticket reduction or satisfaction outcomes.
  • ROUGE is a text-similarity metric and may not reflect helpfulness or resolution in a chat workflow.
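
A minimal sketch of the measurement side (not the statistical testing) is below: compute deflection rate and average CSAT per arm from logged pilot sessions. The session records are invented, and a real analysis would add significance testing and guardrail metrics.

```python
# Per-session records from an A/B pilot (invented data).
# escalated=False means the assistant resolved the chat without a human agent.
sessions = [
    {"arm": "control",   "escalated": True,  "csat": 4},
    {"arm": "control",   "escalated": False, "csat": 5},
    {"arm": "assistant", "escalated": False, "csat": 4},
    {"arm": "assistant", "escalated": False, "csat": 5},
    {"arm": "assistant", "escalated": True,  "csat": 3},
]

def arm_metrics(arm: str) -> dict:
    rows = [s for s in sessions if s["arm"] == arm]
    deflection = sum(not s["escalated"] for s in rows) / len(rows)
    avg_csat = sum(s["csat"] for s in rows) / len(rows)
    return {"deflection_rate": deflection, "avg_csat": avg_csat}

for arm in ("control", "assistant"):
    print(arm, arm_metrics(arm))
```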

Question 10

Topic: Applications of Foundation Models

A legal team uses Amazon Bedrock to generate redlines for vendor contracts. Each request must include the entire contract plus instructions, totaling up to 150,000 input tokens, and the model must be able to return up to 3,000 output tokens. Each contract is unique, and the team is not allowed to pre-chunk documents or persist contract text in a vector store.

The current solution uses Model A but has high cost and latency.

Exhibit: Candidate model limits (Bedrock)

Model | Max input (context) | Max output | Relative cost | Relative latency
Model A | 200,000 tokens | 8,000 tokens | High | High
Model B | 160,000 tokens | 4,000 tokens | Medium | Low
Model C | 120,000 tokens | 4,000 tokens | Low | Low
Model D | 160,000 tokens | 2,000 tokens | Low | Low

Which change is the best way to reduce cost and latency while meeting all requirements?

Options:

  • A. Switch the workload to Model B

  • B. Use RAG with Model D to reduce prompt size

  • C. Keep Model A and enable prompt caching

  • D. Switch the workload to Model C

Best answer: A

Explanation: The key constraint is token limits: the model must support at least 150,000 input tokens and 3,000 output tokens in a single request. Model B meets both limits and improves both cost and latency compared with the current Model A. Options that require chunking/RAG or that miss either limit do not satisfy the stated requirements.

Selecting a foundation model for a GenAI application starts with verifying that the model’s maximum input context window and maximum output length meet the application’s worst-case prompt and response sizes. Here, the application requires a single-call interaction with up to 150,000 input tokens and up to 3,000 output tokens.

A practical way to decide is:

  • Filter out models with context <150,000 tokens or output <3,000 tokens.
  • Among the remaining models, choose the one that best optimizes the goal (lower cost/latency).

Model B is the only option that keeps the single-request workflow, satisfies both token limits, and reduces cost and latency versus the current model; switching models is a simpler optimization than adding new retrieval components.

  • Too-small context fails because a 120,000-token context window cannot fit the 150,000-token prompt.
  • Caching mismatch is ineffective because unique contracts provide little or no cache reuse.
  • RAG violates constraints because it requires chunking/persisting text in a vector store, which is disallowed.
  • Too-small output fails because a 2,000-token output limit cannot meet the 3,000-token requirement.
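
The selection logic can be written down directly, as in the sketch below: filter to models whose limits cover the worst-case request, then rank the remainder by the optimization goal. The figures mirror the exhibit; the cost and latency ranks are ordinal stand-ins.

```python
# Candidate limits from the exhibit; cost/latency are ordinal ranks (1 = lowest).
models = [
    {"name": "Model A", "max_input": 200_000, "max_output": 8_000, "cost": 3, "latency": 3},
    {"name": "Model B", "max_input": 160_000, "max_output": 4_000, "cost": 2, "latency": 1},
    {"name": "Model C", "max_input": 120_000, "max_output": 4_000, "cost": 1, "latency": 1},
    {"name": "Model D", "max_input": 160_000, "max_output": 2_000, "cost": 1, "latency": 1},
]

REQUIRED_INPUT, REQUIRED_OUTPUT = 150_000, 3_000

# Step 1: drop models that cannot handle the worst-case request in one call.
eligible = [m for m in models
            if m["max_input"] >= REQUIRED_INPUT and m["max_output"] >= REQUIRED_OUTPUT]

# Step 2: among the eligible models, pick the lowest cost, then lowest latency.
best = min(eligible, key=lambda m: (m["cost"], m["latency"]))
print([m["name"] for m in eligible], "->", best["name"])  # ['Model A', 'Model B'] -> Model B
```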

Continue with full practice

Use the AWS AIF-C01 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Free review resource

Read the AWS AIF-C01 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026