Try 10 focused Microsoft AI-300 questions on GenAI system performance, model optimization, latency, cost, and quality tradeoffs, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | Microsoft AI-300 |
| Topic area | GenAI Performance Optimization |
| Blueprint weight | 14% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate GenAI Performance Optimization for Microsoft AI-300. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 14% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.
Topic: Optimize Generative AI Systems and Model Performance
A team is tuning a RAG flow in Microsoft Foundry for a support-policy assistant. Failed evaluation traces show that the expected policy document is usually retrieved, but the answer often cites the wrong clause. You must improve retrieval grounding without changing the foundation model.
Exhibit: Failed-query retrieval evidence
| Evidence | Observation |
|---|---|
| Relevant rank | Top 3 for 82% of failures |
| Chunk content | 6-10 policy clauses per chunk |
| Retrieval relevance | High |
| Groundedness | Low; cites neighboring clauses |
Options:
A. Fine-tune the foundation model on policy answers
B. Increase topK to return more chunks
C. Lower the similarity threshold for retrieval
D. Re-chunk documents into smaller overlapping passages
Best answer: D
Explanation: The core issue is chunk granularity, not initial document discovery. The failed traces show that the relevant document is already appearing near the top of the retrieval set, but each chunk contains many policy clauses. That gives the generator too much neighboring context and increases the chance it grounds the response in the wrong clause. Re-chunking into smaller passages, with sensible overlap to preserve context across boundaries, makes retrieved evidence more focused while preserving the current model constraint.
Lowering thresholds or increasing topK adds more context, which can worsen confusion when the existing chunks are already broad. Fine-tuning changes the model rather than fixing the retrieval evidence problem.
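As a rough illustration of re-chunking into smaller overlapping passages, the sketch below groups consecutive clauses into short windows that share one clause with their neighbor. The function name `chunk_clauses`, the `size` and `overlap` parameters, and the sample clause labels are hypothetical, not part of any Foundry API; real pipelines usually chunk by tokens or characters rather than whole clauses.

```python
def chunk_clauses(clauses, size=2, overlap=1):
    """Group consecutive clauses into passages of `size` items.

    Neighboring passages share `overlap` clauses so that context
    spanning a boundary is not lost. All names are illustrative.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(clauses), step):
        chunks.append(clauses[start:start + size])
        if start + size >= len(clauses):
            break  # the last window already reaches the end
    return chunks


# Hypothetical input: one broad chunk holding six policy clauses.
clauses = [f"Clause {i}" for i in range(1, 7)]
for passage in chunk_clauses(clauses, size=2, overlap=1):
    print(passage)
```

Each passage now carries two clauses instead of six to ten, so a retrieved passage points the generator at a much narrower piece of evidence.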
Increasing topK fails because more broad chunks can add noise when relevant chunks are already in the top results.
Topic: Optimize Generative AI Systems and Model Performance
A team deployed a fine-tuned support model in Microsoft Foundry as a 15% canary. Promotion policy requires groundedness of at least 0.85, safety incident rate of at most 1%, and no material latency or token regression. Monitoring shows no data collection failures.
| Metric | Current model | Canary model |
|---|---|---|
| Groundedness | 0.89 | 0.78 |
| Safety incident rate | 0.3% | 3.4% |
| p95 response time | 2.1 s | 2.0 s |
| Avg tokens/response | 790 | 775 |
Which action best follows this evidence?
Options:
A. Promote the canary because performance costs improved.
B. Keep the canary live until latency regresses.
C. Retrain the canary immediately on all production logs.
D. Roll back the canary to the current model.
Best answer: D
Explanation: Fine-tuned model monitoring should gate production lifecycle decisions on quality and safety metrics, not only latency or token cost. The canary is below the groundedness threshold and above the safety incident threshold, while latency and token usage are stable. Because the model is already serving production traffic and violates explicit promotion criteria, the immediate action is rollback to the known-good current model. After rollback, the team can inspect traces, segment failures, and evaluation data to decide whether targeted retraining or additional evaluation is needed.
Stable operational metrics do not offset quality and safety regressions.
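The promotion policy in this question can be expressed as a small gate check. The sketch below is only one way to encode it: the `Metrics` class, the `gate_failures` function, and the 10% `regression_tolerance` are assumptions, since the scenario does not define what counts as a "material" latency or token regression.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    groundedness: float
    safety_incident_rate: float  # fraction: 0.034 means 3.4%
    p95_latency_s: float
    avg_tokens: float


def gate_failures(current, canary, min_groundedness=0.85,
                  max_safety_rate=0.01, regression_tolerance=0.10):
    """Return the names of the promotion gates the canary fails.

    regression_tolerance is an assumed definition of "material":
    the canary may exceed the current model by at most 10%.
    """
    failures = []
    if canary.groundedness < min_groundedness:
        failures.append("groundedness")
    if canary.safety_incident_rate > max_safety_rate:
        failures.append("safety")
    if canary.p95_latency_s > current.p95_latency_s * (1 + regression_tolerance):
        failures.append("latency")
    if canary.avg_tokens > current.avg_tokens * (1 + regression_tolerance):
        failures.append("tokens")
    return failures


# Values taken from the table in the question.
current = Metrics(0.89, 0.003, 2.1, 790)
canary = Metrics(0.78, 0.034, 2.0, 775)

failed = gate_failures(current, canary)
print("roll back" if failed else "promote", failed)
```

With the table's values, the canary fails the groundedness and safety gates while passing latency and tokens, which is why rollback is the indicated action.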
Topic: Optimize Generative AI Systems and Model Performance
A Microsoft Foundry team runs an offline evaluation for a RAG assistant after updating the retrieval index. Users report answers that sound plausible but cite unrelated source passages.
| Metric | Result | Status |
|---|---|---|
| Answer relevance | 0.86 | Pass |
| Context relevance | 0.42 | Fail |
| Groundedness | 0.38 | Fail |
| Response latency | Normal | Pass |
What is the best root cause indicated by the evaluation evidence?
Options:
A. Retrieved passages do not support the generated answer.
B. The response-generation prompt is too short.
C. The evaluation dataset has no relevant questions.
D. The deployed foundation model is underprovisioned.
Best answer: A
Explanation: For a RAG system, relevance evaluation must separate whether the answer sounds useful from whether the retrieved context actually supports it. Here, answer relevance passes, so the generated response appears responsive to the user question. However, context relevance and groundedness both fail, and users see unrelated citations. That pattern points to a retrieval/support problem: the system is generating plausible answers while the selected chunks do not contain the evidence needed to justify them.
The next investigation would focus on failed queries, top-k retrieved chunks, citation mapping, similarity thresholds, chunking, or retrieval strategy. Latency and model capacity are not the primary signal in this evidence.
Topic: Optimize Generative AI Systems and Model Performance
A GenAIOps team is fine-tuning a foundation model in Microsoft Foundry to summarize customer support tickets. The team generated synthetic training examples from product documentation and wants to prevent the model from learning unrelated marketing-style responses. Before registering the fine-tuned model for deployment, which implementation best validates that the synthetic data supports the target task?
Options:
A. Add more synthetic examples from broader product content.
B. Run a held-out ticket-summary evaluation with task-specific quality gates.
C. Deploy the model and monitor only latency and token usage.
D. Approve the model when fine-tuning training loss decreases steadily.
Best answer: B
Explanation: For synthetic-data or fine-tuning validation, the key is to test the model against the target task, not just the training process. In this scenario, the team should use a held-out evaluation dataset of representative support tickets with expected summaries, then apply task-specific metrics or rubrics such as relevance, groundedness, coherence, and off-task behavior checks. This can be automated as a release gate before model registration or deployment in Microsoft Foundry. Training loss can show optimization progress, but it does not prove the model learned the right behavior. Operational metrics such as latency and token usage are useful later, but they do not validate task alignment.
Topic: Optimize Generative AI Systems and Model Performance
A team operates a Microsoft Foundry RAG chat app for internal support. They want to determine whether a higher retrieval similarity threshold improves answer relevance without increasing hallucinations. The production model, index, and prompt must stay unchanged except for the retrieval threshold, and users should have a consistent experience during the test. What should you implement?
Options:
A. Deploy the higher threshold to all users and compare this week to last week
B. Run an A/B test with sticky user assignment by retrieval threshold
C. Test a new prompt and new model with the higher threshold
D. Run only an offline evaluation dataset and skip production telemetry
Best answer: B
Explanation: A RAG A/B test should isolate the change being measured and compare variants under comparable conditions. In this scenario, the only intended variable is the retrieval similarity threshold, so the production model, index, and prompt should remain the same across both variants. Sticky user assignment avoids a user receiving different behavior across turns or sessions, which can distort the experience and the telemetry. The test should collect relevance metrics and hallucination-related signals such as groundedness, along with operational metrics like latency and token use if they affect rollout decisions.
Changing all traffic at once creates a before-and-after comparison, not a controlled A/B test. Changing the prompt or model at the same time prevents attribution of any improvement to the retrieval threshold.
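Sticky assignment is commonly implemented by hashing a stable user identifier, so the same user always lands in the same variant without storing any state. The sketch below assumes a hypothetical `assign_variant` function, experiment name, and variant labels; it does not represent a built-in Foundry feature.

```python
import hashlib


def assign_variant(user_id, experiment="retrieval-threshold-test",
                   treatment_share=0.5):
    """Deterministically map a user to one variant.

    Hashing the experiment name together with the user ID keeps the
    assignment stable across turns and sessions, and independent of
    any other experiment that hashes the same user IDs.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    if bucket < treatment_share * 10_000:
        return "higher-threshold"
    return "current-threshold"


# The same user gets the same variant on every call.
print(assign_variant("user-42"), assign_variant("user-42"))
```

Only the retrieval threshold should differ between the two variants; the model, index, and prompt stay the same so that any metric difference can be attributed to the threshold.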
Topic: Optimize Generative AI Systems and Model Performance
A team optimized a RAG flow in Microsoft Foundry by changing chunk size, similarity threshold, and hybrid search weighting. The release gate requires proof that answer-quality gains came from retrieval optimization, not an unsupported foundation-model change.
| Run | Model deployment | Prompt version | Groundedness | Relevance |
|---|---|---|---|---|
| Baseline | chat-prod | faq-v12 | 3.1 | 3.4 |
| Tuned | chat-prod | faq-v12 | 4.2 | 4.1 |
What is the best next diagnostic step?
Options:
A. Replay the fixed evaluation with retrieval traces enabled
B. Increase maximum output tokens and compare user ratings
C. Fine-tune the foundation model on the evaluation dataset
D. Upgrade the foundation-model deployment and rerun evaluations
Best answer: A
Explanation: To validate RAG retrieval optimization, isolate the retrieval layer while holding unsupported variables constant. The exhibit already shows the same model deployment and prompt version, so the next diagnostic step is to replay the same evaluation dataset and inspect retrieval traces: returned chunks, similarity scores, ranking, citations, and whether answers are grounded in the retrieved context. This confirms whether changes to threshold, chunking, or hybrid weighting plausibly caused the groundedness and relevance gains.
Changing the model, fine-tuning, or altering generation settings would introduce new variables and weaken the claim that retrieval optimization improved answer quality.
Topic: Optimize Generative AI Systems and Model Performance
A team uses Microsoft Foundry to release a customer-support assistant. The next release must use a model fine-tuned from an approved foundation model, and production promotion must preserve lineage to the fine-tuning run and evaluation results. The team also needs versioned rollback if the tuned model regresses. Which configuration should the team use?
Options:
A. Redeploy the foundation model and update only the system prompt
B. Overwrite the existing foundation model deployment in place
C. Register the fine-tuned model as a versioned deployment candidate
D. Store the tuned model files only in the prompt Git repository
Best answer: C
Explanation: Fine-tuned model lifecycle management treats the customized model as a release artifact, not just as a configuration change to a foundation model deployment. In this scenario, the release path needs traceability to the base model, fine-tuning run, training data, and evaluation results, plus a way to promote or roll back a specific tuned version. That points to managing the fine-tuned model as a versioned deployment candidate in the Foundry production lifecycle, with evaluation and monitoring tied to that model version. Ordinary foundation model deployment mainly selects and hosts an existing model version; it does not by itself capture the customization lineage and rollback requirements for a tuned artifact. The key distinction is that fine-tuning creates a new managed model lifecycle object for release control.
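To make the idea of a versioned release artifact concrete, here is a minimal in-memory sketch. It is not the Foundry model registry; the `ModelVersion` and `Registry` classes, their fields, and the sample run and report names are all hypothetical, chosen only to show lineage fields travelling with each version and rollback selecting a previously promoted version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    base_model: str      # lineage: approved foundation model
    fine_tune_run: str   # lineage: fine-tuning run identifier
    eval_report: str     # lineage: evaluation results reference


class Registry:
    def __init__(self):
        self._versions = {}
        self._history = []  # promoted versions, newest last

    def register(self, mv):
        key = (mv.name, mv.version)
        if key in self._versions:
            raise ValueError("versions are immutable; register a new one")
        self._versions[key] = mv

    def promote(self, name, version):
        self._history.append(self._versions[(name, version)])

    def rollback(self):
        if len(self._history) < 2:
            raise RuntimeError("no earlier promoted version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live(self):
        return self._history[-1] if self._history else None


registry = Registry()
registry.register(ModelVersion("support-ft", 1, "base-model", "run-001", "eval-001"))
registry.register(ModelVersion("support-ft", 2, "base-model", "run-002", "eval-002"))
registry.promote("support-ft", 1)
registry.promote("support-ft", 2)
print(registry.rollback())  # version 1, with its lineage intact
```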
Topic: Optimize Generative AI Systems and Model Performance
A team operates a Microsoft Foundry RAG assistant for internal policy questions. Recent evaluations show low groundedness. The production foundation model deployment and prompt are locked for this release; only retrieval settings can change. The team must prove that any optimization improves answer quality before rollout. Which implementation should you use?
Options:
A. Switch to a larger foundation model deployment for evaluation.
B. Fine-tune the foundation model on the policy documents.
C. Compare retrieval variants with a fixed model and mapped evaluation dataset.
D. Promote the lowest similarity threshold based on higher retrieval counts.
Best answer: C
Explanation: RAG retrieval optimization should be validated by isolating retrieval changes from model and prompt changes. In Microsoft Foundry, use the same deployed model and prompt, create candidate retrieval configurations such as different chunk sizes, similarity thresholds, or hybrid search settings, and run the same mapped evaluation dataset against each variant. Compare answer-quality metrics such as groundedness and relevance, and optionally check operational metrics such as latency and token consumption. This proves whether retrieval changes improved answer quality rather than masking the result with a different model. Retrieval volume alone is not enough because more retrieved chunks can add noise and reduce groundedness.
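One way to structure such a comparison is to score every retrieval variant on the same question IDs and report paired differences. In the sketch below, `compare_variants` and the per-question scores are made up for illustration; in practice the scores would come from evaluation runs that share one model deployment and prompt version.

```python
from statistics import mean


def compare_variants(baseline, candidate):
    """Compare per-question scores from two retrieval configurations.

    Both arguments map question_id -> score. The function refuses to
    compare runs that were scored on different evaluation questions.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("variants must use the same evaluation questions")
    deltas = [candidate[q] - baseline[q] for q in baseline]
    return {
        "mean_delta": mean(deltas),
        "improved": sum(d > 0 for d in deltas),
        "regressed": sum(d < 0 for d in deltas),
    }


# Illustrative groundedness scores only, not real results.
baseline = {"q1": 3, "q2": 3, "q3": 4, "q4": 2}
smaller_chunks = {"q1": 4, "q2": 4, "q3": 4, "q4": 4}
print(compare_variants(baseline, smaller_chunks))
```

Reporting how many questions improved and regressed, not only the mean, helps show whether a variant trades gains on some queries for losses on others.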
Topic: Optimize Generative AI Systems and Model Performance
A team operates a RAG chatbot in Microsoft Foundry. Evaluation shows acceptable relevance, but groundedness failures increased after retrieval changed from top_k=3 to top_k=8. Traces show each response includes 2–3 high-score passages from the right policy and 4–5 low-score passages from unrelated policies. Chunk previews are single-topic and readable. The fix must ship this release without rebuilding the index or changing the prompt. Which implementation should the engineer use?
Options:
A. Switch from vector search to hybrid search
B. Lower the similarity threshold to improve recall
C. Reduce chunk size and re-index the corpus
D. Increase the minimum similarity score threshold
Best answer: D
Explanation: Threshold tuning is the right operational fix when traces show relevant high-score chunks are already retrieved, but low-score unrelated chunks are also being passed into the generation context. Raising the minimum similarity threshold improves precision by excluding weak matches while preserving the existing index, chunking scheme, and retrieval strategy. Chunk-size tuning is better when chunks are too broad, too narrow, or split important context. Retrieval-strategy changes, such as hybrid search, are better when the current strategy misses relevant content because of lexical, semantic, or identifier-matching gaps. Here, the visible evidence points to noisy low-score context, not chunk shape or retrieval coverage.
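The effect of a minimum similarity threshold can be sketched as a simple filter applied to the retrieved passages before they reach the prompt. The function name `filter_by_score`, the passage dictionaries, the scores, and the 0.75 cutoff are hypothetical; they only mirror the trace pattern described in the question (a few high-score passages plus several low-score ones at top_k=8).

```python
def filter_by_score(passages, min_score):
    """Keep only passages whose similarity score meets the threshold."""
    return [p for p in passages if p["score"] >= min_score]


# Hypothetical trace for one response: 3 strong matches, 5 weak ones.
retrieved = [
    {"policy": "travel", "score": 0.91},
    {"policy": "travel", "score": 0.88},
    {"policy": "travel", "score": 0.84},
    {"policy": "payroll", "score": 0.52},
    {"policy": "leave", "score": 0.49},
    {"policy": "security", "score": 0.47},
    {"policy": "payroll", "score": 0.45},
    {"policy": "leave", "score": 0.41},
]

kept = filter_by_score(retrieved, min_score=0.75)
print([p["policy"] for p in kept])
```

Because the filter runs at query time, it needs no index rebuild and no prompt change, which matches the release constraint. The right cutoff depends on the embedding model and score scale, so it should be chosen from traces rather than copied from this example.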
Topic: Optimize Generative AI Systems and Model Performance
A team fine-tuned a Microsoft Foundry model for customer support and approved support-ft:4 for production after evaluation. After release, quality alerts continue to match the previous version.
Evidence:
| Source | Evidence |
|---|---|
| Dev evaluation | support-ft:4, groundedness 0.86, avg tokens 740 |
| Release tag | expected model support-ft:4, prompt ticket-summary:12 |
| Production endpoint | traffic 100% to deployment using support-ft:3 |
| Production traces | model support-ft:3, groundedness 0.62, avg tokens 1,250 |
What is the best root cause?
Options:
A. Production is still serving the previous fine-tuned model version.
B. The evaluation dataset is too small for approval.
C. The prompt version was not promoted with the model.
D. The endpoint needs more provisioned throughput units.
Best answer: A
Explanation: Versioning evidence should connect the approved fine-tuned model, release artifact, deployed endpoint, and production traces. Here, development evaluation approved support-ft:4, and the release tag also expected support-ft:4. However, the production endpoint routes all traffic to a deployment using support-ft:3, and production traces confirm that requests are being served by support-ft:3. The monitoring symptoms are therefore tied to the old model, not to the evaluated production candidate.
The next operational fix would be to update or roll out the production deployment so that traffic is routed to the approved fine-tuned model version, then continue monitoring quality and token metrics for that version.
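A check like the following could catch this kind of drift automatically by comparing the release tag against what production traces report. The function `find_version_mismatches` and the trace dictionary shape are assumptions for illustration. The prompt value in the sample trace is also assumed, because the evidence table does not list the production prompt version.

```python
def find_version_mismatches(expected, traces):
    """Return {field: set of observed values} for fields that differ
    from the release tag in at least one production trace."""
    mismatches = {}
    for trace in traces:
        for field, want in expected.items():
            got = trace.get(field)
            if got != want:
                mismatches.setdefault(field, set()).add(got)
    return mismatches


# Release tag values come from the evidence table.
release_tag = {"model": "support-ft:4", "prompt": "ticket-summary:12"}

# Model value comes from the production traces row; the prompt value
# is an assumption, since no prompt mismatch is shown.
production_traces = [
    {"model": "support-ft:3", "prompt": "ticket-summary:12"},
]

print(find_version_mismatches(release_tag, production_traces))
```

An empty result would mean production matches the approved release; here the check reports the model field, pointing at the deployment rather than the prompt or capacity.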
The release tag expects prompt ticket-summary:12, and no production prompt mismatch is shown.
Use the Microsoft AI-300 Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.
Read the Microsoft AI-300 Cheat Sheet for compact concept review before returning to timed practice.