Free Databricks GenAI Engineer Practice Questions: Evaluation and Monitoring

Practice 10 free Databricks Certified Generative AI Engineer Associate questions on Evaluation and Monitoring, with answers, explanations, and the IT Mastery next step.

Try the IT Mastery web app for a richer interactive practice experience with mixed sets, timed mocks, topic drills, explanations, and progress tracking.


Topic snapshot

Field | Detail
Practice target | Databricks Generative AI Engineer Associate
Topic area | Evaluation and Monitoring
Blueprint weight | 12%
Page purpose | Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Evaluation and Monitoring for Databricks Generative AI Engineer Associate. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

Pass | What to do | What to record
First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer.
Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor.
Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter.
Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 12% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These are original IT Mastery practice questions aligned to this topic area. They are not official Databricks questions, copied live-exam content, or exam dumps. Use them to preview question style and explanation depth before continuing with topic drills, mixed sets, and timed mocks in IT Mastery.

Question 1

Topic: Evaluation and Monitoring

A team is preparing an offline evaluation set for a Databricks RAG agent. They want to run the listed evaluation judges using the artifact as-is.

Evaluation artifact

Field | Present? | Notes
request | Yes | User question
response | Yes | Agent answer
retrieved_context | Yes | Chunks returned by retriever
expected_response | No | No SME reference answer

Which judge requires adding ground truth before it can produce valid scoring results?

Options:

  • A. Correctness

  • B. Relevance to query

  • C. Safety

  • D. Groundedness

Best answer: A

Explanation: In Databricks agent evaluation, a correctness-style judge compares the generated answer with a reference answer, so it requires ground truth such as expected_response. The artifact has the user request, model response, and retrieved context, but no SME-approved answer. Judges such as groundedness, safety, and relevance can evaluate different properties from the request, response, and context without a reference answer. The key distinction is whether the judge must know what the answer should have been, not just whether the answer appears safe, relevant, or supported by retrieved evidence.

  • Groundedness is assessable from the response and retrieved context because it checks support in the provided evidence.
  • Safety does not require a reference answer because it evaluates harmful or policy-violating content.
  • Relevance to query can compare the response to the user request without an SME-labeled expected response.
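
The distinction can be sketched as a simple field check. This is a minimal illustration, assuming a hypothetical mapping from each judge in this question to the inputs it needs. The mapping and the function name are assumptions made for the example, not a Databricks API.

```python
# Hypothetical mapping of each judge to the input fields it needs.
# Field names follow the evaluation artifact in the question; this is
# an illustration, not a Databricks API.
REQUIRED_FIELDS = {
    "correctness": {"request", "response", "expected_response"},
    "relevance_to_query": {"request", "response"},
    "safety": {"response"},
    "groundedness": {"request", "response", "retrieved_context"},
}


def runnable_judges(available_fields):
    """Split judges into those that can run and those missing inputs."""
    available = set(available_fields)
    ready, blocked = [], {}
    for judge, needed in REQUIRED_FIELDS.items():
        missing = needed - available
        if missing:
            blocked[judge] = sorted(missing)
        else:
            ready.append(judge)
    return ready, blocked


# The artifact as-is: no expected_response column.
artifact_fields = ["request", "response", "retrieved_context"]
ready, blocked = runnable_judges(artifact_fields)
print("ready:", ready)
print("blocked:", blocked)
```

Under these assumptions, only the correctness judge is blocked, and the missing field it reports is expected_response.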

Question 2

Topic: Evaluation and Monitoring

A Databricks team runs a deployed RAG assistant through AI Gateway. In the last week, total spend increased and product owners need to know whether adoption grew broadly, a few users or teams concentrated traffic, or a specific model is driving token cost. They need an evidence-based answer from existing telemetry without redeploying the app. Which engineering decision is best?

Options:

  • A. Inspect Vector Search query results and tune the chunking strategy.

  • B. Review MLflow traces for retrieval spans and prompt templates.

  • C. Query AI Gateway usage tables by user, model, endpoint, time, tokens, and cost.

  • D. Add custom application logs and redeploy before analyzing traffic.

Best answer: C

Explanation: AI Gateway usage tables are the right monitoring source when the question is who is using a deployed LLM application, how much traffic they generate, and what usage is driving cost. Aggregating request counts, token usage, and cost fields by dimensions such as user, team, endpoint, model, and time bucket can separate broad adoption growth from concentrated usage or model-specific expense. This uses existing gateway telemetry, so it avoids redeployment and gives product owners an evidence-based view of live usage patterns. MLflow traces and retrieval diagnostics can help debug quality or latency, but they are not the primary source for adoption and cost attribution across deployed traffic.

  • MLflow traces help inspect execution details, but they are not the best source for aggregate adoption or cost concentration.
  • Vector Search tuning addresses retrieval relevance, not who is using the app or which model is driving spend.
  • New custom logs overbuild the investigation because the required telemetry already exists in AI Gateway usage tables.
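
The aggregation described in the explanation might look like the following. It is a sketch over hypothetical in-memory rows: the column names (user, model, tokens, cost) and the values are assumptions for illustration, and the real AI Gateway usage tables have their own schema and would normally be queried in place.

```python
from collections import defaultdict

# Hypothetical usage rows; column names and values are illustrative only.
usage_rows = [
    {"user": "alice", "model": "model-large", "tokens": 12000, "cost": 1.20},
    {"user": "alice", "model": "model-large", "tokens": 9000, "cost": 0.90},
    {"user": "bob", "model": "model-small", "tokens": 3000, "cost": 0.06},
    {"user": "carol", "model": "model-small", "tokens": 2000, "cost": 0.04},
]


def cost_share_by(rows, dimension):
    """Return each value's share of total cost for one dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["cost"]
    grand_total = sum(totals.values())
    return {key: value / grand_total for key, value in totals.items()}


by_user = cost_share_by(usage_rows, "user")
by_model = cost_share_by(usage_rows, "model")
print("cost share by user:", by_user)
print("cost share by model:", by_model)
```

A share dominated by one user or one model suggests concentrated usage, while evenly spread shares with rising totals suggest broad adoption growth.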

Question 3

Topic: Evaluation and Monitoring

A deployed Databricks HR assistant uses an Agent Framework app with a Mosaic AI Vector Search tool over Unity Catalog policy tables. Monitoring shows incorrect answers for contractor leave questions, but latency and endpoint health are within SLO, and the team must keep the current foundation model endpoint. An MLflow trace for a failed turn is shown.

User: Can California contractors take parental leave?

Step 1 - tool call: search_policy_docs
Index: uc.hr.policy_chunks_vs
Query: "California contractors parental leave"
Filters: {"department": "travel", "doc_status": "approved"}
Results:
  travel_expense_ca.pdf, score 0.84
  contractor_trip_rules.pdf, score 0.81

Step 2 - model call: current FM endpoint
Prompt: answer only from retrieved passages
Finish: stop

Step 3 - response
Answer cites travel_expense_ca.pdf and discusses reimbursements.

Which engineering decision is best?

Options:

  • A. Tighten the final answer formatting prompt

  • B. Increase the Vector Search top_k value

  • C. Switch to a larger foundation model endpoint

  • D. Correct the retrieval tool’s department filter mapping

Best answer: D

Explanation: Read an agent trace in execution order and identify the earliest stage that makes the later output inevitable. In this trace, the user query is about contractor leave, but the retrieval tool passes department = travel, so Vector Search returns only approved travel documents. The model then follows the prompt by answering only from those retrieved passages, and the response cites the wrong source type. Since endpoint health and latency are acceptable and the current foundation model must remain, replacing the model is not justified. The engineering fix should target the tool argument or filter mapping, then validate with traces and evaluation examples.

  • Larger model ignores the trace evidence: the model received only travel passages, so it was not the earliest failing component.
  • Formatting prompt would change presentation, not the source documents used for grounding.
  • Higher top_k still applies the wrong travel filter, so it is unlikely to retrieve the needed leave policy chunks.
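
Reading a trace in execution order can be sketched as a walk over its steps. The structure below is a simplified, hypothetical rendering of the trace in the question, and the expected department value "leave" is an assumed label that a reviewer or intent classifier would supply. Neither comes from an MLflow API.

```python
# Hypothetical, simplified view of the trace in the question as ordered steps.
trace_steps = [
    {
        "name": "search_policy_docs",
        "type": "tool",
        "inputs": {
            "query": "California contractors parental leave",
            "filters": {"department": "travel", "doc_status": "approved"},
        },
    },
    {"name": "current FM endpoint", "type": "model", "finish": "stop"},
    {"name": "response", "type": "output"},
]

# Assumed label for the user's intent; not part of the original trace.
expected_department = "leave"


def first_failing_step(steps, expected_department):
    """Return the first step whose department filter disagrees with the
    expected department, as (step number, step name, department used)."""
    for index, step in enumerate(steps, start=1):
        filters = step.get("inputs", {}).get("filters", {})
        department = filters.get("department")
        if department is not None and department != expected_department:
            return index, step["name"], department
    return None


failing = first_failing_step(trace_steps, expected_department)
print(failing)
```

Here the check flags step 1, the tool call, before the model is ever invoked, which is why the fix belongs in the filter mapping rather than in the model or prompt.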

Question 4

Topic: Evaluation and Monitoring

Agent Monitoring for a deployed claims-support agent shows repeated low-quality answers when a customer omits the purchase date. SMEs reviewed the captured traces and wrote this rule: “If the purchase date is missing, ask a clarifying question before citing warranty eligibility.” The team needs to improve the next agent version and prevent regressions. Which action is best?

Options:

  • A. Drive agent iteration with SME-labeled MLflow evaluations and a Custom Scorer.

  • B. Rebuild Vector Search with smaller chunks.

  • C. Increase AI Gateway rate limits for the endpoint.

  • D. Tighten Unity Catalog permissions on the registered model.

Best answer: A

Explanation: SME feedback should be converted into durable evaluation assets, not left as one-off comments. In Databricks, the reviewed traces can become labeled evaluation examples or guidelines in MLflow/Agent Evaluation, and the SME rule can be encoded as a Custom Scorer. The team can then update the agent prompt or tool policy, run the evaluation, and compare versions before deployment. This turns the gap found in monitoring into a feedback loop and gives the team a regression check for the missing-purchase-date behavior. Traffic controls, retrieval tuning, and access governance solve different layers of the application.

  • Gateway limits control usage and traffic, but they do not encode SME guidance or test response behavior.
  • Chunking changes may help retrieval quality, but the observed gap is a decision rule about missing information.
  • Catalog permissions improve governance, but they do not change or evaluate the agent’s response policy.
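
The SME rule could be encoded as a scoring function along these lines. This is a rough heuristic sketch in plain Python: the date pattern, the hint phrases, and the function name are assumptions for illustration, and wiring the function into MLflow as a Custom Scorer is not shown.

```python
import re

# Assumed heuristics: a simple date pattern for the request and a few
# phrases that signal a clarifying question. Both are illustrative only.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")
CLARIFYING_HINTS = ("purchase date", "when did you", "date of purchase")


def missing_date_rule(request, response):
    """Return True when the response follows the SME rule, else False.

    Rule: if the request has no purchase date, the response must ask a
    clarifying question about it before citing warranty eligibility.
    """
    if DATE_PATTERN.search(request):
        return True  # The rule does not apply when a date is present.
    lowered = response.lower()
    asks_question = "?" in response
    mentions_date = any(hint in lowered for hint in CLARIFYING_HINTS)
    return asks_question and mentions_date


examples = [
    ("Is my blender covered by warranty?",
     "Your blender is covered for two years."),
    ("Is my blender covered by warranty?",
     "Could you share the purchase date so I can check?"),
    ("I bought it on 2024-03-01. Is it covered?",
     "Yes, it is within the warranty period."),
]
scores = [missing_date_rule(req, resp) for req, resp in examples]
print(scores)
```

A keyword heuristic like this can miss paraphrased clarifying questions, so a judge-based scorer built from the same SME rule may be a better fit in practice. The point is that the rule becomes a repeatable check that each new agent version must pass.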

Question 5

Topic: Evaluation and Monitoring

A team is evaluating two served models for a Databricks RAG assistant that answers internal policy questions from governed Unity Catalog Delta tables. The release goal is to improve grounded answer quality with citations while keeping p95 latency under 2 seconds and controlling serving cost. The same retriever, prompt, and MLflow evaluation set were used.

Model | Grounded answer pass rate | Citation pass rate | p95 latency | Cost/query
Smaller model | 91.0% | 89.0% | 1.4 sec | $0.02
Larger model | 91.2% | 89.1% | 3.6 sec | $0.11

Which engineering decision is best?

Options:

  • A. Keep the smaller model for this release

  • B. Fine-tune the larger model before release

  • C. Route citation-heavy questions to the larger model

  • D. Promote the larger model for all traffic

Best answer: A

Explanation: Model choice should be driven by the measured target outcome, not model size. In this evaluation, the larger model produces essentially the same grounded answer and citation pass rates as the smaller model, while p95 latency rises to 3.6 seconds, beyond the 2-second requirement, and cost per query rises from $0.02 to $0.11. Because the target quality outcome does not improve, promoting the larger model would add operational cost and violate the latency requirement without evidence of user benefit. The best decision is to serve the smaller model and use the evaluation results as justification, then investigate retrieval, prompt, or data improvements only if the quality target still needs work.

  • All-traffic promotion overweights model size and ignores that the measured quality outcome is flat while latency fails the requirement.
  • Selective routing adds complexity without evidence that citation-heavy questions benefit from the larger model.
  • Fine-tuning first overbuilds the solution because the current evaluation does not show that model capacity is the bottleneck.

Question 6

Topic: Evaluation and Monitoring

After running an MLflow evaluation and estimating Model Serving usage, an engineering team is choosing a Databricks RAG configuration. The release gate requires helpfulness ≥0.85, groundedness ≥0.85, unsafe response rate ≤0.5%, p95 latency ≤3.0 seconds, and cost ≤$0.010 per request. Which configuration should be promoted?

Configuration | Helpfulness | Groundedness | Unsafe rate | p95 latency | Cost/request
Large LLM, top-k 3 | 0.90 | 0.91 | 0.2% | 4.8 s | $0.014
Medium LLM, top-k 5 + reranker | 0.87 | 0.89 | 0.3% | 2.6 s | $0.007
Small LLM, top-k 3 | 0.80 | 0.86 | 0.2% | 1.1 s | $0.002
Medium LLM, top-k 10 | 0.88 | 0.83 | 0.8% | 2.9 s | $0.006

Options:

  • A. Large LLM, top-k 3

  • B. Small LLM, top-k 3

  • C. Medium LLM, top-k 10

  • D. Medium LLM, top-k 5 + reranker

Best answer: D

Explanation: Model choice should follow the stated metric gates, not a single best-looking metric. Higher helpfulness and groundedness are better, while lower unsafe rate, latency, and cost are better. The medium LLM with top-k 5 plus a reranker clears all requirements: helpfulness 0.87, groundedness 0.89, unsafe rate 0.3%, p95 latency 2.6 seconds, and cost $0.007 per request. The large model has higher quality metrics, but it exceeds both the latency limit (4.8 s against 3.0 s) and the cost limit ($0.014 against $0.010). The small model is efficient but misses the helpfulness gate (0.80 against 0.85). The top-k 10 variant fails the groundedness gate (0.83 against 0.85) and the unsafe-rate gate (0.8% against 0.5%).

  • Highest quality trap fails because the large LLM exceeds the latency and cost limits.
  • Lowest cost trap fails because the small LLM does not meet the helpfulness requirement.
  • More retrieval trap fails because top-k 10 lowers groundedness and increases unsafe responses beyond the gates.
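
The gate check can be written out mechanically. The sketch below uses the thresholds and metric values from the question. The dictionary keys and the function name are illustrative choices, not part of any Databricks or MLflow API.

```python
# Release gates from the question: (comparison, threshold).
GATES = {
    "helpfulness": (">=", 0.85),
    "groundedness": (">=", 0.85),
    "unsafe_rate_pct": ("<=", 0.5),
    "p95_latency_s": ("<=", 3.0),
    "cost_per_request": ("<=", 0.010),
}

# Metric values copied from the comparison table in the question.
CONFIGS = {
    "Large LLM, top-k 3": {
        "helpfulness": 0.90, "groundedness": 0.91, "unsafe_rate_pct": 0.2,
        "p95_latency_s": 4.8, "cost_per_request": 0.014,
    },
    "Medium LLM, top-k 5 + reranker": {
        "helpfulness": 0.87, "groundedness": 0.89, "unsafe_rate_pct": 0.3,
        "p95_latency_s": 2.6, "cost_per_request": 0.007,
    },
    "Small LLM, top-k 3": {
        "helpfulness": 0.80, "groundedness": 0.86, "unsafe_rate_pct": 0.2,
        "p95_latency_s": 1.1, "cost_per_request": 0.002,
    },
    "Medium LLM, top-k 10": {
        "helpfulness": 0.88, "groundedness": 0.83, "unsafe_rate_pct": 0.8,
        "p95_latency_s": 2.9, "cost_per_request": 0.006,
    },
}


def failed_gates(metrics):
    """Return the names of the gates that a configuration does not meet."""
    failures = []
    for name, (comparison, threshold) in GATES.items():
        value = metrics[name]
        passed = value >= threshold if comparison == ">=" else value <= threshold
        if not passed:
            failures.append(name)
    return failures


results = {name: failed_gates(metrics) for name, metrics in CONFIGS.items()}
passing = [name for name, failures in results.items() if not failures]
for name, failures in results.items():
    print(name, "->", "PASS" if not failures else f"FAIL {failures}")
```

Only the medium LLM with top-k 5 and a reranker comes back with no failed gates, matching the answer above.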

Question 7

Topic: Evaluation and Monitoring

A deployed Databricks RAG agent answers HR policy questions through Model Serving. Pre-release MLflow evaluation passed, but live complaints and AI Gateway inference logs now point to stale benefit answers, and auditors require documented SME review before prompt or retrieval changes. The team wants a weekly improvement loop based on production evidence without adding model fine-tuning. Which loop is the BEST engineering decision?

Options:

  • A. Rerun only the pre-release evaluation set before each release

  • B. Score live traces, route failures to SMEs, then update and redeploy

  • C. Raise AI Gateway rate limits and monitor usage growth

  • D. Fine-tune the foundation model on all logged conversations

Best answer: B

Explanation: A feedback-driven monitoring loop starts from live production evidence, not only from the pre-release evaluation set. AI Gateway inference logs or inference tables can provide the production requests, responses, and traces that show where the deployed RAG agent is failing. A Custom Scorer can flag stale, unsupported, or policy-mismatched responses. Those low-scoring examples should be sampled for SME review, converted into documented feedback or evaluation cases, and used to make targeted prompt, retrieval, or data updates before redeploying through the normal MLflow and serving workflow. The key distinction is that monitoring observes deployed behavior, while pre-release evaluation validates changes before they go live.

  • Pre-release only fails because it ignores the live complaint evidence that triggered the improvement loop.
  • Fine-tuning on logs overbuilds the solution and skips the required scorer-driven triage and SME review.
  • Rate-limit changes address traffic control, not response quality, retrieval freshness, or human feedback.

Question 8

Topic: Evaluation and Monitoring

A team is choosing an LLM for a Databricks RAG assistant that answers HR policy questions. Users need cited answers from Unity Catalog-governed Delta tables, and incorrect benefit guidance creates compliance risk. The current comparison only includes these results:

Model | Generic score | Avg latency | Cost
Model A | 0.91 | 1.8 s | Low
Model B | 0.88 | 2.4 s | Medium

Which engineering decision is best before production deployment?

Options:

  • A. Select Model A based on the current metrics

  • B. Run task-specific evaluation with SME-reviewed HR examples

  • C. Deploy both models and compare live click-through rates

  • D. Choose Model B because higher cost implies better quality

Best answer: B

Explanation: Generic quantitative metrics can support screening, but they are insufficient when the application has task-specific correctness, citation, and compliance requirements. For a governed HR RAG assistant, the team needs evaluation evidence that reflects real HR policy questions, expected answer behavior, citation accuracy, and unsafe or misleading responses. In Databricks, this can be captured with MLflow evaluation artifacts, traces, custom scorers, judge-based checks, and SME feedback on a representative evaluation set. Latency and cost still matter, but they should not decide the model choice until task-specific quality and safety evidence exists.

The key takeaway is to validate the model against the actual business task before relying on generic scores.

  • Generic score trap fails because a high aggregate score does not prove correct, cited HR-policy answers.
  • Live traffic first fails because compliance-sensitive behavior should be evaluated before exposing users to risky answers.
  • Cost-quality assumption fails because price is not evidence of task-specific answer quality or safety.

Question 9

Topic: Evaluation and Monitoring

An assistant is live on a Databricks Model Serving endpoint. AI Gateway usage tables show one partner integration is making unbounded request bursts, driving up tokens and affecting other users. The business wants to keep the integration enabled but enforce a hard cap at the serving access layer. What should the engineer configure?

Options:

  • A. MLflow offline evaluation

  • B. Agent Monitoring quality metrics

  • C. Unity Catalog table permissions

  • D. AI Gateway rate limiting

Best answer: D

Explanation: AI Gateway controls are used for operational governance of live LLM and agent traffic. When the observed problem is excessive or uncontrolled usage from a client, rate limiting is the control that caps request volume or usage at the serving access layer while keeping the endpoint available. Usage tables and inference logs help reveal the problem, but they do not by themselves stop bursts. Offline evaluation and quality monitoring help improve responses, not enforce live consumption limits.

  • Offline evaluation helps compare application quality before or between releases, but it does not control live traffic bursts.
  • Quality monitoring detects response or agent behavior issues, not excessive request volume from a client.
  • Table permissions protect governed source data, but the stated risk is uncontrolled endpoint usage, not unauthorized data access.
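
To make the idea of a hard cap concrete, here is a generic fixed-window rate limiter. It only illustrates the concept: it is not how AI Gateway implements rate limiting, which is configured on the endpoint and not hand-coded. The class name, parameters, and client names are hypothetical.

```python
class FixedWindowLimiter:
    """Allow at most `limit` calls per client in each `window_s`-second window."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self._counts = {}  # client -> (window index, calls in that window)

    def allow(self, client, now_s):
        """Return True if the call is within the cap, False if rejected."""
        window = int(now_s // self.window_s)
        current_window, count = self._counts.get(client, (window, 0))
        if current_window != window:
            count = 0  # A new window started, so reset the counter.
        if count >= self.limit:
            self._counts[client] = (window, count)
            return False
        self._counts[client] = (window, count + 1)
        return True


# Cap each client at 2 calls per 60-second window.
limiter = FixedWindowLimiter(limit=2, window_s=60)
burst = [limiter.allow("partner", t) for t in (0, 1, 2)]
other_client = limiter.allow("internal-user", 2)
next_window = limiter.allow("partner", 61)
print(burst, other_client, next_window)
```

The bursting client is cut off within its window while other clients are unaffected, and the integration keeps working once the next window opens. That is the behavior the scenario asks for.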

Question 10

Topic: Evaluation and Monitoring

A team deploys a Databricks RAG assistant through Model Serving for support agents. It uses Vector Search over a Unity Catalog knowledge base. After product docs are reorganized, agents report fluent answers that are sometimes not supported by the retrieved passages. The endpoint is still within its latency SLO and token budget, and the app logs requests, responses, retrieved chunk IDs, and MLflow traces. Which production monitoring metric should they add first?

Options:

  • A. p95 end-to-end response latency

  • B. Average output tokens per request

  • C. Daily request count by user

  • D. Groundedness failure rate on production traces

Best answer: D

Explanation: The reported risk is answer quality, specifically ungrounded or hallucinated responses in a deployed RAG application. Because the app already logs responses, retrieved chunk IDs, and MLflow traces, a groundedness or faithfulness scorer can evaluate whether each answer is supported by the retrieved passages. Monitoring the failure rate over live traffic reveals whether the documentation reorganization is causing unsupported answers to increase. Latency, token usage, and request volume are useful deployment metrics, but they do not measure the quality issue described in the scenario.

  • Latency metric fails because the endpoint already meets its latency SLO and speed does not reveal unsupported answers.
  • Token metric fails because output length helps monitor cost, not whether the response is grounded in retrieved content.
  • Usage metric fails because request count shows adoption or load, not response quality.
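
Once each production trace has a groundedness verdict, the monitoring metric is a simple ratio over time. The sketch below assumes hypothetical scored traces with a day and a boolean grounded field. In practice the verdict would come from a groundedness or faithfulness scorer, and the dates and values here are invented for illustration.

```python
from collections import defaultdict

# Hypothetical scored traces; "grounded" stands in for a scorer's verdict.
scored_traces = [
    {"day": "2024-05-01", "grounded": True},
    {"day": "2024-05-01", "grounded": True},
    {"day": "2024-05-01", "grounded": False},
    {"day": "2024-05-01", "grounded": True},
    {"day": "2024-05-02", "grounded": False},
    {"day": "2024-05-02", "grounded": False},
    {"day": "2024-05-02", "grounded": True},
    {"day": "2024-05-02", "grounded": False},
]


def failure_rate_by_day(traces):
    """Return the fraction of ungrounded responses for each day."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for trace in traces:
        totals[trace["day"]] += 1
        if not trace["grounded"]:
            failures[trace["day"]] += 1
    return {day: failures[day] / totals[day] for day in sorted(totals)}


rates = failure_rate_by_day(scored_traces)
print(rates)
```

A jump in the daily rate after the documentation reorganization would point to a retrieval or indexing problem, even though latency and token metrics stay flat.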
