GARP RAI: AI Tools and Techniques

Try 10 focused GARP RAI questions on AI Tools and Techniques, with answers and explanations, then continue with Finance Prep.

Open the matching Finance Prep practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Topic snapshot

| Field | Detail |
| --- | --- |
| Exam route | GARP RAI |
| Issuer | GARP |
| Topic area | AI Tools and Techniques |
| Blueprint weight | 20% |
| Page purpose | Focused sample questions before returning to mixed practice |

How to use this topic drill

Use this page to isolate AI Tools and Techniques for GARP RAI. Work through the 10 questions first, then review the explanations and return to mixed practice in Finance Prep.

| Pass | What to do | What to record |
| --- | --- | --- |
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |

Blueprint context: 20% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original Finance Prep practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: AI Tools and Techniques

A bank can deploy a complex ensemble model that improves default prediction by a small amount, or a logistic regression model that is slightly less accurate but easier to explain, validate, monitor, and govern for credit decision reviews. The team chooses the logistic regression. Which concept best matches this decision?

  • A. Automated feature discovery
  • B. Interpretability-performance trade-off
  • C. Dimensionality reduction
  • D. Concept drift monitoring

Best answer: B

What this tests: AI Tools and Techniques

Explanation: In model development, the most accurate model is not always the best model for a regulated or high-impact use case. A simpler model may be preferred when stakeholders need to understand key drivers, validate assumptions, explain decisions, monitor behavior, and assign control ownership. In this scenario, the ensemble model offers only a small performance gain, while the logistic regression provides stronger interpretability and easier governance. That is a classic interpretability-performance trade-off: the selected model better fits the business, risk, and oversight requirements even if its predictive metric is slightly lower.

  • Automated feature discovery concerns finding useful inputs, not choosing a simpler model for governance reasons.
  • Concept drift monitoring addresses changes in model behavior after deployment, not the initial simplicity-versus-performance choice.
  • Dimensionality reduction simplifies input structure, but it does not by itself capture the governance rationale for preferring an interpretable model.

The team is accepting a marginal reduction in predictive performance to gain explainability, governance, and control benefits.
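
A minimal sketch of why the logistic regression in this scenario is easier to explain: each input's contribution to the log-odds is its coefficient times its value, so a reviewer can see what drove a score. The feature names, coefficients, and intercept below are hypothetical and are not taken from any real scorecard.

```python
import math

# Hypothetical fitted coefficients; assumptions made for illustration only.
INTERCEPT = -3.0
COEFFICIENTS = {
    "debt_to_income": 2.0,
    "recent_delinquencies": 0.8,
    "years_in_business": -0.3,
}


def explain_score(applicant):
    """Return the default probability and each feature's log-odds contribution."""
    contributions = {
        name: coefficient * applicant[name]
        for name, coefficient in COEFFICIENTS.items()
    }
    log_odds = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return probability, contributions


EXAMPLE_APPLICANT = {
    "debt_to_income": 0.5,
    "recent_delinquencies": 1,
    "years_in_business": 4,
}

probability, contributions = explain_score(EXAMPLE_APPLICANT)
for name, value in contributions.items():
    print(f"{name}: {value:+.2f} log-odds")
print(f"predicted default probability: {probability:.3f}")
```

An ensemble model has no comparably direct per-feature breakdown, which is one reason a small accuracy gain may not justify the extra validation and governance effort.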


Question 2

Topic: AI Tools and Techniques

A bank is developing a machine learning model to flag small-business loan applications for analyst review. The raw data include daily account balances, transaction timestamps, and merchant category codes. A pilot model using the raw records is difficult to interpret and shows weak validation performance. Which action is the best example of feature engineering to improve the model’s usefulness?

  • A. Lower the analyst-review threshold so the model flags more applications.
  • B. Create inputs such as balance volatility, cash-inflow ratio, recent overdraft count, and merchant-category concentration for each applicant.
  • C. Remove all unusual transactions from the dataset before training the model.
  • D. Switch to a more complex algorithm while keeping the same raw transaction records as inputs.

Best answer: B

What this tests: AI Tools and Techniques

Explanation: Feature engineering is the process of transforming raw data into structured inputs that a model can use more effectively. In this case, raw transaction-level data may be too granular or noisy for the model to learn useful patterns. Aggregating or deriving variables such as volatility, ratios, counts, or concentration measures can make relevant borrower behavior easier for the model to use and easier for reviewers to understand. Feature engineering does not guarantee better performance, but it is a targeted data-preparation step intended to improve model usefulness before or alongside model selection.

  • Switching algorithms addresses model choice, not the transformation of raw data into useful inputs.
  • Lowering the review threshold changes the operating decision rule, not the data used by the model.
  • Removing all unusual transactions is a broad cleaning step and may discard important risk signals rather than engineer useful features.

Feature engineering transforms raw data into model-ready inputs that may capture more useful predictive signals.
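
The derived inputs named in option B can be sketched with the standard library alone. This is an illustrative sketch, not a production pipeline: the record layout (a list of daily balances and a list of `(amount, merchant_category)` tuples), the function name, and the exact formulas are assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def engineer_features(daily_balances, transactions):
    """Derive applicant-level features from raw records.

    daily_balances: one balance per day over the chosen recent window.
    transactions: (amount, merchant_category) tuples; positive amounts
    are inflows and negative amounts are outflows.
    """
    avg_balance = mean(daily_balances)
    # Balance volatility: standard deviation scaled by the average level.
    volatility = pstdev(daily_balances) / abs(avg_balance) if avg_balance else 0.0

    inflows = sum(amount for amount, _ in transactions if amount > 0)
    outflows = sum(-amount for amount, _ in transactions if amount < 0)
    total_flow = inflows + outflows
    # Cash-inflow ratio: inflows as a share of all money movement.
    inflow_ratio = inflows / total_flow if total_flow else 0.0

    # Overdraft count: days in the window with a negative balance.
    overdraft_days = sum(1 for balance in daily_balances if balance < 0)

    # Merchant-category concentration: Herfindahl index of spend shares.
    spend = Counter()
    for amount, category in transactions:
        if amount < 0:
            spend[category] += -amount
    total_spend = sum(spend.values())
    concentration = (
        sum((value / total_spend) ** 2 for value in spend.values())
        if total_spend
        else 0.0
    )

    return {
        "balance_volatility": volatility,
        "cash_inflow_ratio": inflow_ratio,
        "overdraft_days": overdraft_days,
        "merchant_concentration": concentration,
    }


# Made-up records for one applicant.
balances = [1000.0, 1200.0, -50.0, 800.0, 1050.0]
txns = [(2000.0, None), (-200.0, "5411"), (-300.0, "5812"), (-500.0, "5411")]
features = engineer_features(balances, txns)
print(features)
```

Each output is a single number per applicant, which is easier for both a model and a reviewer to work with than thousands of raw transaction rows.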


Question 3

Topic: AI Tools and Techniques

A bank’s operations team uses an approved large language model to draft complaint summaries. The model often omits urgency, so an analyst changes only the user instruction from “summarize this complaint” to “summarize using four fields: product, customer issue, potential harm, and urgency; use only the call text.” A small pilot shows more consistent summaries, and no model weights, training data, or retrieval sources changed. What is the best action before adopting the change?

  • A. Bypass further review because prompt changes do not affect AI system risk.
  • B. Initiate model retraining because the improved summaries show that the model’s behavior has materially changed.
  • C. Treat it as a prompt-template improvement, document the new version, and test it on a representative sample under normal change control.
  • D. Add the complaint calls to the model’s training data so the model permanently learns the urgency field.

Best answer: C

What this tests: AI Tools and Techniques

Explanation: Changing the wording, structure, or constraints in a prompt affects how an existing model responds at inference time; it does not update the model’s learned parameters. In this scenario, the model, training data, and retrieval sources are unchanged, so the observed improvement is evidence of better prompting and context specification, not retraining. The best risk-managed action is to version and document the prompt, test it on representative complaints, confirm that it does not introduce new issues such as unsupported urgency labels, and implement it through the relevant prompt or application change-control process.

  • Retraining is not supported because no evidence shows that the model’s learned parameters or underlying capability need to change.
  • Adding calls to training data is unnecessary and could introduce privacy or governance issues without a demonstrated training need.
  • Bypassing review is inappropriate because prompt templates can materially affect outputs, controls, and user reliance.

Only the instruction changed, so the appropriate action is governed prompt improvement and evaluation rather than model retraining.
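
One way to picture "version, document, and test" for a prompt change is a small versioned template with a field-coverage check. This is a sketch under stated assumptions: the `PromptTemplate` class, the version label, and the sample summaries are hypothetical, and no model is called; the summaries stand in for outputs collected during a pilot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """A versioned prompt; the model weights are untouched by any change here."""
    version: str
    text: str
    required_fields: tuple

    def render(self, call_text):
        return f"{self.text}\n\nCall text:\n{call_text}"


V2 = PromptTemplate(
    version="2.0",  # hypothetical version label
    text=(
        "Summarize using four fields: product, customer issue, "
        "potential harm, and urgency; use only the call text."
    ),
    required_fields=("product", "customer issue", "potential harm", "urgency"),
)


def missing_fields(template, summary):
    """List required fields that do not appear as 'field:' labels."""
    lowered = summary.lower()
    return [name for name in template.required_fields if f"{name}:" not in lowered]


def pass_rate(template, summaries):
    """Share of summaries that contain every required field."""
    complete = sum(1 for s in summaries if not missing_fields(template, s))
    return complete / len(summaries)


# Hypothetical pilot outputs, not real model responses.
sample_summaries = [
    "Product: debit card\nCustomer issue: duplicate charge\n"
    "Potential harm: overdraft fee\nUrgency: high",
    "Product: debit card\nCustomer issue: duplicate charge",
]
print(V2.version, pass_rate(V2, sample_summaries))
```

A field-coverage check only confirms structure. Whether an urgency label is actually supported by the call text still needs review on a representative sample.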


Question 4

Topic: AI Tools and Techniques

A bank is testing a retrieval-augmented LLM assistant for complaint handling. The retrieved context says: “Customer reported a duplicate debit-card charge; one charge was reversed; merchant inquiry is still open.” The test prompt asks: “Confirm that the merchant committed fraud and recommend whether the customer’s account should be closed.” Which action is BEST?

  • A. Expand the prompt to ask the model to use general banking knowledge to fill in the missing facts.
  • B. Allow the model to infer fraud because duplicate charges are commonly associated with merchant misconduct.
  • C. Flag the prompt as requesting conclusions not supported by the retrieved context and revise it to answer only what the context establishes.
  • D. Have the model answer with a confidence score so reviewers can decide whether the inference is acceptable.

Best answer: C

What this tests: AI Tools and Techniques

Explanation: A prompt asks for unsupported inference when it requires the model to state or decide facts that are not present in the provided context. Here, the retrieved context establishes only that a duplicate charge was reported, one charge was reversed, and the merchant inquiry remains open. It does not establish merchant fraud, customer fault, or whether account closure is appropriate. The best action is to flag and revise the prompt so the model either limits its response to known facts or asks for additional evidence. This reduces hallucination risk and keeps the assistant grounded in the supplied context.

  • Treating duplicate charges as proof of fraud overgeneralizes from a possible pattern rather than the provided facts.
  • Adding a confidence score does not cure the absence of evidence; it can make an unsupported conclusion appear more reliable.
  • Using general banking knowledge to fill gaps defeats the purpose of context-grounded prompting and increases hallucination risk.

The context contains a duplicate-charge complaint and open inquiry, but no evidence of fraud or basis for account-closure advice.


Question 5

Topic: AI Tools and Techniques

A bank is benchmarking a generative AI assistant that drafts responses for customer service agents. The evaluation rubric asks reviewers to check whether each answer is factually correct, remains consistent across equivalent prompts, avoids harmful or noncompliant advice, and cites or aligns with approved source documents. Which evaluation concept does this description best match?

  • A. Hyperparameter tuning to reduce training loss
  • B. Unsupervised clustering validation using silhouette scores
  • C. Generative AI output evaluation for reliability, safety, and grounding
  • D. Supervised classification evaluation using a confusion matrix

Best answer: C

What this tests: AI Tools and Techniques

Explanation: Generative AI evaluation often differs from traditional model evaluation because the output is open-ended text rather than a fixed class label or numeric prediction. A generated answer may sound fluent while being factually wrong, inconsistent across similar prompts, unsafe, or unsupported by the underlying sources. For a financial-services assistant, these dimensions matter because users may rely on the text in customer communications or decisions. Therefore, evaluation should test factuality, consistency, harmful output, and source support or grounding, not only general language quality or a single accuracy metric.

  • A confusion matrix is useful for fixed-label supervised classification, but it does not capture grounding, harmfulness, or open-ended factual quality.
  • Clustering validation assesses grouping quality in unsupervised learning, not the safety or source support of generated text.
  • Reducing training loss may improve model fit, but it is not a sufficient evaluation of whether deployed generative outputs are truthful, consistent, and safe.

Open-ended generative outputs require assessment of truthfulness, stability, harmfulness, and support from trusted sources.
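
The rubric in the stem can be tallied per criterion rather than collapsed into one accuracy number. The sketch below assumes each reviewed answer is recorded as four pass/fail flags; the criterion names and review data are hypothetical.

```python
# Criterion names mirror the rubric in the question stem.
CRITERIA = ("factual", "consistent", "safe", "grounded")


def summarize_reviews(reviews):
    """Return per-criterion pass rates and the share passing every criterion."""
    total = len(reviews)
    per_criterion = {
        name: sum(1 for review in reviews if review[name]) / total
        for name in CRITERIA
    }
    all_pass = sum(
        1 for review in reviews if all(review[name] for name in CRITERIA)
    ) / total
    return per_criterion, all_pass


# Hypothetical reviewer results for four generated answers.
reviews = [
    {"factual": True, "consistent": True, "safe": True, "grounded": True},
    {"factual": False, "consistent": True, "safe": True, "grounded": True},
    {"factual": True, "consistent": True, "safe": True, "grounded": True},
    {"factual": True, "consistent": True, "safe": True, "grounded": False},
]
per_criterion, all_pass = summarize_reviews(reviews)
print(per_criterion)
print("share passing every criterion:", all_pass)
```

In this made-up data every criterion passes at least 75% of the time, yet only half of the answers pass all four checks, which is the kind of gap a single headline metric can hide.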


Question 6

Topic: AI Tools and Techniques

A bank deploys an internal generative AI assistant for operations staff. For each question, the assistant retrieves approved policy excerpts, uses them as context for the response, and cites the excerpts, while the risk team still requires answer testing and human review for high-impact outputs. Which concept does this description best illustrate?

  • A. Grounding through retrieval-augmented generation
  • B. Unsupervised clustering of policy documents
  • C. Fine-tuning on historical user conversations
  • D. Prompt injection by an external user

Best answer: A

What this tests: AI Tools and Techniques

Explanation: Grounding connects a generative AI response to specific, trusted information such as approved documents, databases, or retrieved excerpts. In this scenario, the assistant uses policy excerpts and citations to make responses less likely to be unsupported or fabricated. However, grounding is not a guarantee of correctness: retrieved sources may be incomplete, stale, misread by the model, or applied incorrectly. Therefore, validation, answer testing, source-quality checks, and human review remain necessary, especially for high-impact decisions or regulated processes.

  • Fine-tuning changes model behavior through additional training, but the stem emphasizes retrieving current approved context at response time.
  • Unsupervised clustering groups documents or records by similarity; it does not describe citing trusted sources to support a generated answer.
  • Prompt injection is an attack or manipulation risk, not the control pattern described in the stem.

Grounding uses trusted retrieved context to reduce unsupported outputs, but it does not replace validation, testing, or human review.
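
The retrieve-then-cite pattern described in the stem can be illustrated with a toy keyword retriever. This is a sketch only: real systems typically use embedding search rather than word overlap, and the policy IDs, excerpt text, and function names here are hypothetical.

```python
import re

# Hypothetical approved policy excerpts keyed by a citation ID.
POLICY_EXCERPTS = {
    "POL-1": "Wire transfers above the approval limit require dual authorization.",
    "POL-2": "Customer complaints must be logged within one business day.",
    "POL-3": "Password resets require identity verification before completion.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, excerpts, k=2):
    """Return up to k excerpt IDs ranked by word overlap with the question."""
    question_words = tokenize(question)
    scored = [
        (len(question_words & tokenize(text)), doc_id)
        for doc_id, text in excerpts.items()
    ]
    ranked = sorted((item for item in scored if item[0] > 0), reverse=True)
    return [doc_id for _, doc_id in ranked[:k]]


def build_grounded_prompt(question, excerpts, k=2):
    """Assemble a prompt that restricts the model to cited excerpts."""
    ids = retrieve(question, excerpts, k)
    context = "\n".join(f"[{doc_id}] {excerpts[doc_id]}" for doc_id in ids)
    return (
        "Answer using only the excerpts below and cite their IDs. "
        "If the excerpts do not answer the question, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


question = "How quickly must customer complaints be logged?"
print(build_grounded_prompt(question, POLICY_EXCERPTS))
```

The prompt constrains what the model sees, but it cannot guarantee the excerpt is current or that the model applies it correctly, so answer testing and human review still apply.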


Question 7

Topic: AI Tools and Techniques

A bank’s credit-risk team finds that a gradient-boosted model materially outperforms a simple scorecard, but business owners and compliance reviewers struggle to understand and explain individual decisions. Which model-development trade-off is most directly illustrated?

  • A. Bias versus variance
  • B. Precision versus recall
  • C. Privacy versus utility
  • D. Predictive performance versus interpretability

Best answer: D

What this tests: AI Tools and Techniques

Explanation: Model development often involves a trade-off between predictive performance and interpretability. More complex methods, such as ensembles or deep learning models, may capture nonlinear relationships and improve accuracy, but their decision logic can be harder for stakeholders to understand, validate, challenge, or explain. In regulated financial services, this matters because business users, compliance teams, model validators, and customers may need clear reasons for decisions. The issue in the stem is not simply an error-rate balance or a data-use constraint; it is that higher performance comes with reduced explainability.

  • Bias versus variance concerns underfitting and overfitting, not stakeholder explainability.
  • Precision versus recall concerns the balance between false positives and false negatives.
  • Privacy versus utility concerns extracting value from data while protecting sensitive information.

The scenario describes a more accurate but less explainable model, which is the performance-interpretability trade-off.


Question 8

Topic: AI Tools and Techniques

A bank uses a machine-learning model to prioritize small-business loan reviews. The holdout test set shows stable aggregate precision, a challenge set of newly incorporated firms shows many false negatives, loan officers report misleading denial explanations, and production monitoring shows declining precision after a new marketing campaign. What is the best interpretation for the model risk manager?

  • A. Rely on the holdout test set because aggregate precision is the most objective benchmark for approval decisions.
  • B. Treat the challenge-set failures as irrelevant unless they also appear in the random holdout test set.
  • C. Use the evidence sources together because each reveals a different weakness: representative historical performance, targeted edge-case behavior, user-facing output problems, and live drift.
  • D. Use production monitoring only, because pre-deployment testing cannot provide useful evidence once the model is live.

Best answer: C

What this tests: AI Tools and Techniques

Explanation: Different evaluation sources are designed to reveal different weaknesses. A holdout test set estimates performance on data intended to resemble the historical target population, so it may miss rare or emerging cases. A challenge set deliberately stresses known edge cases or high-risk segments, such as newly incorporated firms. User feedback can reveal problems with explanations, workflow fit, or unintended impacts that may not appear in numerical benchmarks. Production monitoring detects changes in live data, behavior, or performance after deployment, such as drift following a new marketing campaign. The best interpretation is not that one source overrides the others, but that they provide complementary evidence for model evaluation and remediation.

  • Aggregate holdout precision can hide segment-specific weaknesses and post-deployment changes.
  • Challenge-set failures are relevant because they intentionally test important edge cases that may be underrepresented in random samples.
  • Production monitoring is essential after deployment, but it does not replace pre-deployment tests or user feedback.

The best answer correctly recognizes that each evaluation source samples a different condition and can therefore expose different model limitations.
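
A small sketch of how an aggregate metric can hide a segment-level weakness. The records below are made-up `(segment, actual, predicted)` triples, and the segment names are assumptions for illustration.

```python
def precision_recall(records):
    """Compute precision and recall from (segment, actual, predicted) triples."""
    tp = sum(1 for _, actual, pred in records if actual == 1 and pred == 1)
    fp = sum(1 for _, actual, pred in records if actual == 0 and pred == 1)
    fn = sum(1 for _, actual, pred in records if actual == 1 and pred == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical evaluation records: 1 = should be flagged for review.
records = (
    [("established", 1, 1)] * 8
    + [("established", 0, 1)] * 2
    + [("established", 1, 0)] * 1
    + [("established", 0, 0)] * 20
    + [("new_firm", 1, 1)] * 1
    + [("new_firm", 1, 0)] * 4
    + [("new_firm", 0, 0)] * 5
)

overall = precision_recall(records)
new_firms = precision_recall([r for r in records if r[0] == "new_firm"])
print("overall precision, recall:", overall)
print("new-firm precision, recall:", new_firms)
```

In this made-up data, overall precision looks healthy while most risky new firms are missed. Precision alone cannot reveal this, because false negatives do not enter its formula.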


Question 9

Topic: AI Tools and Techniques

A bank is piloting an internal LLM assistant for operational-policy questions. The model was fine-tuned on historical policy documents, and during testing it confidently states a retention rule that is not in the current policy. The product owner argues that the answer must be a stored fact because the documents were included in training. What is the best risk-management response?

  • A. Increase the model temperature so the assistant is less likely to repeat outdated wording from training data.
  • B. Rely on user feedback after launch to identify any policy statements that are no longer current.
  • C. Approve the assistant because fine-tuning on policy documents makes its responses equivalent to database lookup results.
  • D. Treat the response as generated text and require grounding in an approved current policy repository with source references before authoritative use.

Best answer: D

What this tests: AI Tools and Techniques

Explanation: Large language models do not normally retrieve and return stored facts in a deterministic way simply because facts appeared in training or fine-tuning data. They generate outputs by predicting likely token sequences based on learned patterns, which can produce fluent but unsupported or outdated statements. In this scenario, the assistant’s confident but incorrect policy statement is evidence that it should not be treated as an authoritative lookup tool. For current policy questions, a stronger control is to ground responses in an approved, current source such as a retrieval system or policy database, with references that users or reviewers can verify.

  • Fine-tuning may adapt the model’s style or domain knowledge, but it does not make every answer a deterministic retrieval from source documents.
  • Adjusting temperature changes output variability; it does not ensure factual correctness or currency.
  • Post-launch user feedback is useful monitoring, but it is not sufficient before using the tool for authoritative policy answers.

An LLM generates likely text from learned patterns, so factual policy answers should be grounded by deterministic retrieval or verified sources.
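
To see why adjusting temperature changes variability rather than correctness, consider a sketch of temperature-scaled softmax over next-token scores. The candidate tokens and logit values are hypothetical; only the scaling formula is standard.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical candidate continuations for a retention-period question.
candidates = ["7 years", "5 years", "10 years"]
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 2.0)
for name, p_low, p_high in zip(candidates, low, high):
    print(f"{name}: T=0.5 -> {p_low:.3f}, T=2.0 -> {p_high:.3f}")
```

Raising the temperature flattens the distribution and lowering it sharpens it, but the ranking of candidates is unchanged. If the model's most likely continuation is outdated, no temperature setting makes it current.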


Question 10

Topic: AI Tools and Techniques

A bank’s analytics team selects a third-party LLM because it scored highest on a public general reasoning benchmark. The planned production use is to draft summaries of internal credit memos for relationship managers; the memos contain institution-specific abbreviations and confidential client details. The benchmark used no internal documents and measured multiple-choice accuracy, not factual summarization or data-handling errors. What is the best action before approving the model for production?

  • A. Approve deployment because the public benchmark ranking shows superior general model quality.
  • B. Run a use-case-specific evaluation with representative credit memos and predefined production metrics before deployment.
  • C. Tune prompts on a few sample memos and address remaining issues through post-deployment monitoring.
  • D. Rely on the vendor’s benchmark report if it includes the model architecture and training-data summary.

Best answer: B

What this tests: AI Tools and Techniques

Explanation: A strong benchmark result is useful comparative evidence, but it is not the same as production suitability. Here, the benchmark task, data, and metric do not match the bank’s intended use: summarizing internal credit memos with specialized language and confidentiality concerns. Before approval, the bank should test the model on representative examples under expected workflow conditions and measure outcomes that matter in production, such as factual accuracy, omitted material, hallucinations, handling of confidential information, and human-review effectiveness. Public benchmark performance may inform model selection, but it cannot replace use-case-specific validation when the benchmark does not reflect the target environment.

  • Public benchmark ranking is insufficient because the benchmark did not test the bank’s documents, summarization task, or data-handling risks.
  • Vendor documentation may support due diligence, but it does not prove fit for this specific production use.
  • Prompt tuning on a few examples is weak evidence and should not substitute for representative pre-deployment evaluation.

Benchmark strength must be supplemented with evidence that the model performs safely and accurately on the organization’s actual task, data, and controls.
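
Predefined production metrics can be expressed as a simple approval gate that is fixed before the evaluation is run. The metric names, thresholds, and results below are hypothetical; a real bank would set its own limits.

```python
# Hypothetical thresholds agreed before testing: ("min", x) means the
# observed value must be at least x; ("max", x) means at most x.
THRESHOLDS = {
    "factual_accuracy": ("min", 0.95),
    "material_omission_rate": ("max", 0.05),
    "hallucination_rate": ("max", 0.02),
    "confidential_data_leak_rate": ("max", 0.0),
}


def failed_metrics(results, thresholds=THRESHOLDS):
    """Return the metrics that breach a threshold or were never measured."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = results.get(name)
        if value is None:
            failures.append(name)  # missing evidence counts as a failure
        elif kind == "min" and value < limit:
            failures.append(name)
        elif kind == "max" and value > limit:
            failures.append(name)
    return failures


# Hypothetical results from testing on representative credit memos.
pilot_results = {
    "factual_accuracy": 0.97,
    "material_omission_rate": 0.08,
    "hallucination_rate": 0.01,
}
print("blocking issues:", failed_metrics(pilot_results))
```

Treating an unmeasured metric as a failure reflects the point of the question: a high public benchmark score says nothing about data handling if data handling was never tested.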

Continue with full practice

Use the GARP RAI Practice Test page for the full Finance Prep practice bank, mixed-topic practice, timed mock exams, and explanations.

Free review resource

Use the Finance Prep practice page above for the latest review links.

Revised on Monday, May 25, 2026