Browse Certification Practice Tests by Exam Family

AI-103: Implement Computer Vision Solutions

Try 10 focused AI-103 questions on Implement Computer Vision Solutions, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Try AI-103 on Web · View full AI-103 practice page

Topic snapshot

Exam route: AI-103
Topic area: Implement Computer Vision Solutions
Blueprint weight: 13%
Page purpose: Focused sample questions before returning to mixed practice

How to use this topic drill

Use this page to isolate Implement Computer Vision Solutions for AI-103. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

First attempt: Answer without checking the explanation first. Record the fact, rule, calculation, or judgment point that controlled your answer.
Review: Read the explanation even when you were correct. Record why the best answer is stronger than the closest distractor.
Repair: Repeat only missed or uncertain items after a short break. Record the pattern behind misses, not the answer letter.
Transfer: Return to mixed practice once the topic feels stable. Record whether the same skill holds up when the topic is no longer obvious.

Blueprint context: 13% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These questions are original IT Mastery practice items aligned to this topic area. They are designed for self-assessment and are not official exam questions.

Question 1

Topic: Implement Computer Vision Solutions

An auto-insurance company uses a Foundry project with a vision-capable model to create damage summaries from uploaded claim photos. The responsible AI policy says to block explicit sexual content or harm instructions, route graphic injury images to a human reviewer, transform personal data by redacting it before model processing, and preserve moderation evidence with the summary.

A submitted photo has these policy findings: license plate: high confidence; bystander face: high confidence; graphic injury: not detected; sexual content: not detected.

What should the app do?

Options:

  • A. Block the photo because personal data is present.

  • B. Redact the identifiers, store findings, then summarize.

  • C. Route the photo for review because a face is present.

  • D. Summarize unchanged and add a safety disclaimer.

Best answer: B

Explanation: The policy requires action based on the specific safety findings, not a single action for all risks. Because only personal identifiers were detected, the app should transform the image by redacting them, keep the evidence, and continue the legitimate claims workflow.

For multimodal responsible AI controls, the handling action should match the configured policy category and severity. In this case, the detected license plate and face are personal-data findings, and the policy explicitly says to redact personal data before model processing. There are no findings that trigger blocking or human review. Preserving the moderation findings with the generated summary supports traceability and later audit of why the image was allowed after transformation.

The key takeaway is to minimize unnecessary blocking while still enforcing the required safety and privacy guardrails.

  • Blocking personal data over-enforces the policy because the stated action for identifiers is transformation, not rejection.
  • Routing faces to review adds unnecessary human oversight because the policy reserves review for graphic injury findings.
  • Summarizing unchanged misses the required redaction step and would process personal identifiers contrary to policy.
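
The per-category routing this explanation describes can be sketched in plain Python. The finding names and action labels below are illustrative assumptions, not a real moderation API:

```python
def decide_action(findings: dict) -> str:
    """Map moderation findings to the policy's handling action, one
    category at a time rather than one blanket action for all risks."""
    if findings.get("sexual_content") or findings.get("harm_instructions"):
        return "block"                  # policy: block explicit or harmful content
    if findings.get("graphic_injury"):
        return "human_review"           # policy: route graphic injury to a reviewer
    if findings.get("license_plate") or findings.get("bystander_face"):
        return "redact_then_summarize"  # policy: transform personal data first
    return "summarize"                  # no findings: continue the claims workflow

# The Question 1 photo: personal identifiers only.
photo = {"license_plate": True, "bystander_face": True,
         "graphic_injury": False, "sexual_content": False}
```

With these findings, `decide_action(photo)` lands on redaction followed by summarization, matching answer B.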

Question 2

Topic: Implement Computer Vision Solutions

You are building a Microsoft Foundry agent for workplace accessibility inspections. Users upload photos of entrances and ask whether a wheelchair user can enter safely. The agent retrieves the company accessibility checklist, but pilot traces show a text-only workflow misses ramp obstructions, curb cuts, and door-handle placement that are visible only in the images. Which workflow should you implement?

Options:

  • A. Extract OCR text and answer by using checklist retrieval only.

  • B. Invoke multimodal visual QA on the uploaded image and user question.

  • C. Ask users to describe the image before the agent answers.

  • D. Generate captions once and discard the uploaded images.

Best answer: B

Explanation: The scenario requires reasoning over visual context, such as obstructions and placement. A Foundry agent should route the image and the user’s question to a multimodal visual understanding or visual QA capability, then combine those findings with retrieved checklist guidance.

For image-dependent accessibility decisions, OCR and text retrieval are not enough because the critical evidence may be spatial, visual, or object-based. The agent workflow should preserve the uploaded image in the conversation context and use a multimodal model or visual QA tool to inspect the scene. Retrieval from the checklist can still supply policy wording, but the visual finding must come from analysis of the image itself.

The key takeaway is to avoid reducing the image to text too early when the answer depends on visual context.

  • OCR-only analysis fails because ramp blockage and handle placement may not appear as readable text.
  • Caption-only storage loses image details that may be needed for follow-up questions or evidence review.
  • Relying on user-described images shifts visual reasoning to the user and weakens consistency, traceability, and accessibility.
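
One way to keep the image in the conversation is an OpenAI-style multimodal message that carries both the photo and the question as content parts. This is a sketch under that assumption; the exact payload shape depends on the deployed model's API:

```python
import base64

def build_vqa_message(image_bytes: bytes, question: str) -> dict:
    """Build one multimodal user message so the model reasons over the
    actual image instead of a text-only reduction (OCR or a caption)."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

msg = build_vqa_message(b"\xff\xd8fake-jpeg",
                        "Can a wheelchair user enter this entrance safely?")
```

Checklist retrieval can still supply policy wording in a separate text part; the visual finding comes from the image part.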

Question 3

Topic: Implement Computer Vision Solutions

A retail company is adding a Foundry-based agent that reviews campaign images from an image-generation deployment and from designer uploads. Before an asset can be published, the app must verify that the corporate watermark is present, approved brand marks are used correctly, prohibited symbols are absent, and inappropriate visual content is blocked. The solution must use managed identity over private endpoints and keep auditable policy decisions with human approval for borderline results. Which architecture best fits these requirements?

Options:

  • A. Index brand guidelines in Azure AI Search and let the multimodal model self-certify each image without separate image moderation.

  • B. Train a new computer-vision model from scratch in Azure Machine Learning and expose it as the only policy service.

  • C. Use only the image-generation prompt to require watermarks and avoid prohibited content, then publish outputs automatically.

  • D. Use a Foundry publish gate with Azure AI Content Safety image moderation plus a custom Azure Content Understanding visual analyzer; connect by managed identity/private endpoints, log decisions, and route low-confidence results for approval.

Best answer: D

Explanation: The best design uses an enforced publish gate, not just model instructions. Azure AI Content Safety handles inappropriate visual content, while a custom Content Understanding visual analyzer can return structured policy signals for watermark, brand, and prohibited-symbol checks before release.

For visual policy enforcement, combine native safety moderation with custom visual understanding and make the result a blocking workflow step. Azure AI Content Safety is the right Azure-native component for inappropriate image detection. A custom Azure Content Understanding analyzer can inspect uploaded or generated images for organization-specific requirements, such as required watermarks, approved brand usage, and prohibited symbols. The Foundry agent should call these services through managed identity and private endpoints, persist the policy outcome for auditability, and route uncertain or failed checks to a human approval workflow. This enforces the policy at the application boundary instead of relying on the generation model to follow instructions.

  • Prompt-only control fails because it cannot reliably inspect designer uploads or create auditable safety decisions before publishing.
  • Policy RAG alone can explain brand rules but does not enforce visual detection for symbols, watermarks, or harmful imagery.
  • Training from scratch overbuilds the solution and omits the built-in Azure-native moderation control for inappropriate content.
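
The enforced publish gate can be reduced to a small decision function: safety moderation and the custom brand checks both block, and low confidence routes to a human. All names and the threshold are illustrative assumptions:

```python
def publish_gate(moderation: dict, brand: dict, confidence: float,
                 threshold: float = 0.8) -> str:
    """Combine Content Safety-style moderation output with custom
    brand-policy findings into one blocking workflow decision."""
    if moderation.get("inappropriate"):
        return "block"                  # native safety moderation fails
    checks_pass = (brand.get("watermark_present")
                   and brand.get("brand_marks_ok")
                   and not brand.get("prohibited_symbols"))
    if not checks_pass:
        return "block"                  # organization-specific visual policy fails
    if confidence < threshold:
        return "human_approval"         # borderline result: route to a reviewer
    return "publish"

clean = {"watermark_present": True, "brand_marks_ok": True,
         "prohibited_symbols": False}
```

The point is that the decision happens at the application boundary and is loggable, instead of trusting the generation model to follow prompt instructions.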

Question 4

Topic: Implement Computer Vision Solutions

You are building a Foundry-based accessibility assistant. A user uploads a photo of a room and asks arbitrary questions about what is visible, such as whether there is a step before an exit. The app must answer only from that image, say when evidence is not visible, and include a short visual-evidence rationale. You do not need to train a custom model. Which implementation should you use?

Options:

  • A. Fine-tune an image classifier for each question type.

  • B. Index generated captions in Azure AI Search for text-only QA.

  • C. Use a Foundry multimodal model with image and question inputs.

  • D. Run OCR and answer only from extracted text.

Best answer: C

Explanation: Visual question answering requires a model that can process both the image and the user’s natural-language question. A Foundry multimodal model deployment can ground the response in the uploaded image and be instructed to say when evidence is not visible.

The core implementation pattern is multimodal visual QA: send the image and the user’s question to a multimodal model deployment in Foundry, and use instructions that constrain the answer to visible evidence. The app should ask the model to provide a short rationale tied to observed visual content and to respond that the evidence is not visible when the image does not support an answer. This avoids inventing facts and supports arbitrary questions without training a custom model.

Caption-only or OCR-only pipelines can be useful for retrieval or accessibility metadata, but they can omit visual details needed for open-ended image questions.

  • Caption indexing can miss important visual details because the text-only model answers from generated captions rather than the image itself.
  • OCR only works for visible text but cannot answer questions about non-text objects, layout, obstacles, or scene context.
  • Custom classification is unnecessary for ad hoc visual QA and would not support arbitrary user questions without many labels.

Question 5

Topic: Implement Computer Vision Solutions

A retail team uses a Foundry project to build an image editing workflow that replaces product-photo backgrounds from text prompts while preserving the product itself. The team wants an evaluation dashboard for release gates. An existing Azure AI Vision dashboard tracks category classifier precision and recall for product labels. Which evaluation should you use for the editing workflow?

Options:

  • A. Azure AI Search grounding relevance score

  • B. Product-category classifier precision and recall

  • C. Multimodal prompt-adherence and edit-fidelity scoring

  • D. OCR word error rate on generated images

Best answer: C

Explanation: Image generation and editing workflows need output-quality evaluation, not ordinary image classification metrics. The dashboard should assess whether the generated image follows the prompt and preserves the intended visual content from the source image.

The core concept is evaluating a generative visual workflow by the quality of the created or edited media. For product-photo editing, useful release-gate signals include prompt adherence, edit fidelity, preservation of the product, and safety-policy outcomes for the generated image. Classifier precision and recall can confirm that a label model recognizes a category, but they do not prove that the edit was correct or visually faithful to the input. The evaluation should compare the prompt, original image, and edited image as one multimodal task.

  • Classification metrics fail because category accuracy does not measure whether the requested background edit was performed correctly.
  • OCR error rate is relevant only when text recognition quality is the goal, not general photo editing quality.
  • Search relevance applies to retrieval and grounding workflows, not validation of generated image edits.
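
A release gate over these generative-output signals can be as simple as requiring every tracked metric to clear its threshold. The metric names and minimums below are illustrative, not a built-in dashboard schema:

```python
def release_gate(scores: dict, thresholds: dict) -> bool:
    """Gate a background-edit release on output-quality metrics rather
    than classifier precision/recall: every signal must clear its bar."""
    return all(scores.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())

thresholds = {"prompt_adherence": 0.85, "edit_fidelity": 0.80,
              "product_preservation": 0.90, "safety": 1.0}
```

A missing metric defaults to 0.0, so an edit cannot pass the gate just because a score was never computed.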

Question 6

Topic: Implement Computer Vision Solutions

You are building a Foundry agent workflow for retail inspections. A capture agent receives store aisle photos, and a policy agent decides whether to open a safety case. Before the policy agent runs, the system must extract reusable visual facts such as blocked exits, damaged signs, empty facings, and affected regions, and include evidence in trace logs for later review. Which implementation best fits this requirement?

Options:

  • A. Have the policy agent inspect each raw image only during final reasoning.

  • B. Call a Content Understanding analyzer to output region-grounded visual facts.

  • C. Run OCR layout extraction and pass only detected text to the policy agent.

  • D. Index the photos in Azure AI Search and retrieve visually similar cases.

Best answer: B

Explanation: Content Understanding is appropriate when an app needs structured, grounded representations from visual content before later reasoning. In this workflow, the policy agent needs reusable visual characteristics, affected regions, and auditable evidence rather than only a direct image prompt.

The core concept is using Content Understanding as a visual extraction stage in an agent workflow. It can analyze images to produce structured outputs that describe visible characteristics and their regions, which can then be passed to another agent for policy reasoning, case creation, or review. This separates evidence extraction from downstream decision logic and supports traceability because the workflow can log the analyzer output and provenance. Direct multimodal prompting may answer a one-off question, but it is less suitable when the requirement is a reusable, auditable set of extracted visual facts.

  • OCR-only extraction misses non-text visual conditions such as empty facings or blocked exits.
  • Similarity retrieval can find related cases, but it does not extract the current photo’s visual characteristics.
  • Raw image prompting may support visual reasoning, but it does not best satisfy reusable, region-grounded extraction for downstream agents.
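
A reusable, region-grounded visual fact can be modeled as a small structured record that the capture stage emits and the policy agent consumes. The field names and sample values are hypothetical:

```python
from dataclasses import dataclass, asdict

@dataclass
class VisualFact:
    """One region-grounded finding the policy agent can reuse and audit."""
    label: str           # e.g. "blocked_exit", "damaged_sign", "empty_facing"
    region: tuple        # (x, y, width, height) in image pixels
    confidence: float
    source_image: str    # provenance for the trace log

facts = [
    VisualFact("blocked_exit", (120, 40, 300, 220), 0.91, "aisle_07.jpg"),
    VisualFact("damaged_sign", (10, 10, 80, 60), 0.78, "aisle_07.jpg"),
]
trace_entry = [asdict(f) for f in facts]  # log analyzer output for later review
```

Because extraction is separated from policy reasoning, the same `trace_entry` can feed case creation, review, and audit without re-prompting over the raw image.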

Question 7

Topic: Implement Computer Vision Solutions

A Microsoft Foundry app uses Azure Content Understanding to analyze workplace inspection videos. A downstream agent creates a structured safety report. During evaluation trace review, the team must verify that each reported violation is grounded in the exact object location and time interval in the source video, not only in a scene summary. Which analyzer configuration best supports this observability goal?

Options:

  • A. Return object regions, video segments, and structured fields.

  • B. Run OCR and layout extraction on sampled frames only.

  • C. Return global visual characteristics and captions only.

  • D. Log agent token counts and latency breakdowns only.

Best answer: A

Explanation: The evaluation goal requires traceable visual grounding for both where and when a violation appears. Content Understanding analyzer output should include object regions and video segments, then map that evidence into structured fields used by the agent.

For video-based safety reporting, the core concept is evidence-rich analyzer output. If reviewers must validate exact object location and time interval, the analyzer should emit object regions, relevant video segments, and structured fields that the downstream agent can reference in its report. This lets trace logs connect each structured finding to spatial and temporal evidence from the source media. A summary-only configuration may describe the scene, but it does not provide enough inspectable grounding for this evaluation requirement.

  • Summary-only output fails because captions and visual characteristics do not prove the exact object region or time range.
  • OCR-only extraction is useful for text in frames but does not target object localization or video segments.
  • Operational telemetry only helps monitor cost and performance, not whether visual findings are grounded.
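
The trace-review check itself is mechanical: a reported violation passes only if it cites both a spatial region and a valid time segment. A minimal validator sketch, with assumed field names:

```python
def is_grounded(violation: dict) -> bool:
    """A violation must cite where (object region) and when (video
    segment) it appears in the source video, not just a scene summary."""
    region = violation.get("region")
    segment = violation.get("segment")
    return (bool(region)
            and segment is not None
            and segment.get("start_s", -1) >= 0
            and segment.get("end_s", 0) > segment.get("start_s", 0))

grounded = {"finding": "missing_guardrail",
            "region": [410, 80, 150, 200],
            "segment": {"start_s": 12.0, "end_s": 17.5}}
summary_only = {"finding": "missing_guardrail"}  # caption-level claim, no evidence
```

Running such a check over every structured finding in the trace is what makes the summary-only configuration fail this observability goal.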

Question 8

Topic: Implement Computer Vision Solutions

You are building a store-inspection agent in a Microsoft Foundry project. Users upload shelf photos, and the agent must return structured data for each relevant product: object type, visible condition, dominant color, and the image region where it appears. The output must be based on the uploaded image, not on prewritten captions.

Which two actions should you take?

Options:

  • A. Use a document layout analyzer to extract tables and paragraphs.

  • B. Create a Content Understanding image analyzer with fields for the required characteristics and regions.

  • C. Enable image moderation only and return the safety category labels.

  • D. Index generated image captions in Azure AI Search for vector retrieval.

  • E. Use an image generation model to create alternate shelf views.

  • F. Attach the configured Content Understanding analyzer as a Foundry Tool used by the agent.

Correct answers: B and F

Explanation: Azure Content Understanding is the appropriate Foundry capability for extracting structured information from visual inputs. The analyzer defines what visual characteristics and regions to extract, and the agent must be configured to call that analyzer as a tool on uploaded images.

For this scenario, the core concept is configuring Azure Content Understanding in Foundry Tools for visual understanding. The analyzer should be designed for image input and should define a structured output that includes the required visual characteristics, such as object type, condition, color, and the relevant image region. After the analyzer is created, the agent needs access to it as a Foundry Tool so uploaded photos are analyzed before the agent formats or explains the result.

Retrieval, generation, layout-only document extraction, and safety moderation can be useful in other workloads, but they do not directly extract the requested visual characteristics and regions from the uploaded shelf photo.

  • Caption retrieval fails because it searches text representations instead of extracting regions and attributes from the current uploaded image.
  • Image generation fails because it creates or edits images rather than analyzing the submitted photo.
  • Document layout fails because tables and paragraphs do not satisfy object-level visual characteristic extraction.
  • Safety labels only fails because moderation categories are not the requested product attributes and regions.
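
The analyzer's structured output for this scenario could be described with a field schema like the sketch below. The schema shape is purely illustrative; the real analyzer definition format is set by the Content Understanding service:

```python
# Hypothetical field schema for the shelf-photo image analyzer:
# one entry per required visual characteristic, plus the region.
shelf_analyzer_fields = {
    "object_type":    {"type": "string",
                       "description": "product category visible on the shelf"},
    "condition":      {"type": "string",
                       "description": "visible condition, e.g. intact or damaged"},
    "dominant_color": {"type": "string"},
    "region":         {"type": "object",
                       "description": "bounding box where the product appears"},
}
```

Attaching the configured analyzer as a Foundry Tool is what lets the agent run this extraction on each uploaded photo before formatting the result.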

Question 9

Topic: Implement Computer Vision Solutions

Your team uses a Microsoft Foundry project with an Azure Content Understanding analyzer to extract PPE compliance fields from inspection uploads. The analyzer was copied from a document-review workflow. The responsible AI policy requires safety filtering, traceable visual evidence for each extracted field, and human approval for risky or low-evidence outputs; normal inspection videos must still be accepted.

Trace excerpt:

Input: mp4 inspection clip
Pipeline mode: document
Segmentation: page
Warnings:
- unsupported input for current mode
- hard_hat_present: no visual evidence region
- safety: medium risk frame detected

Which remediation should you implement?

Options:

  • A. Reject all uploads that trigger any safety warning.

  • B. Keep document mode and disable safety filters for MP4 uploads.

  • C. Use OCR-only frame extraction and ignore visual evidence regions.

  • D. Use visual/video mode with segmentation, provenance, and targeted approval.

Best answer: D

Explanation: The trace points to a pipeline/input mismatch and missing visual evidence, not a need to weaken safety. A visual/video Content Understanding pipeline with frame or region segmentation can produce the required provenance, while targeted approval preserves legitimate inspection workflows.

Content Understanding extraction from images and video depends on using an analyzer mode that supports visual inputs and produces segment-level evidence such as regions, objects, or frame timestamps. The log shows the analyzer is in document/page mode for an MP4 clip, so it cannot reliably segment visual content; the field fails because no supporting visual evidence is available. The responsible fix is to use a visual/video analyzer, configure appropriate frame or region segmentation, retain provenance metadata for each extracted field, and route medium-risk or low-evidence results to human approval. This keeps safety filters active without applying a blanket block to valid inspection videos.

  • Disabling safety filters conflicts with the responsible AI policy and does not fix unsupported video processing.
  • Rejecting all warnings over-blocks legitimate inspections because the policy calls for approval, not blanket denial.
  • OCR-only extraction may read text but still misses object or region evidence needed for PPE fields.
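
The remediation order from the trace can be expressed as routing logic: fix the mode/input mismatch first, then keep safety filters on and send risky or low-evidence results to approval rather than rejecting everything. Mode and risk labels are illustrative:

```python
def route_result(mode: str, media: str, risk: str,
                 evidence_regions: bool) -> str:
    """Remediation routing for the trace above: reconfigure before
    anything else, then targeted approval instead of blanket denial."""
    visual_modes = {"image", "video"}
    if media in {"mp4", "jpg", "png"} and mode not in visual_modes:
        return "reconfigure_analyzer"  # document/page mode cannot segment video
    if risk in {"medium", "high"} or not evidence_regions:
        return "human_approval"        # policy calls for approval, not denial
    return "accept"                    # normal inspection video proceeds
```

The Question 9 trace (document mode, mp4 input) routes to reconfiguration; after the fix, the medium-risk frame routes to a human while clean clips are accepted.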

Question 10

Topic: Implement Computer Vision Solutions

A publishing team uses a Microsoft Foundry agent to prepare product pages for accessibility review. The agent receives each image plus the page title and nearby caption. The team requires concise alt text for informative images, an extended description for charts and diagrams, empty alt text for decorative images, auditability, and editor approval before publishing low-confidence descriptions. How should you configure the agent workflow?

Options:

  • A. Use a multimodal description tool with a structured schema for image purpose, alt text, extended description, confidence, and approval status.

  • B. Let the agent publish generated descriptions automatically and rely on user feedback to correct errors.

  • C. Generate one detailed caption for every image and store it as both alt text and extended description.

  • D. Use OCR only and create alt text from any text detected inside the image.

Best answer: A

Explanation: Accessible image handling requires different outputs based on image purpose and complexity. A Foundry agent should use multimodal understanding plus structured fields so it can produce concise alt text, require extended descriptions for complex visuals, mark decorative images correctly, and route uncertain outputs for approval.

The core concept is aligning multimodal agent output with accessibility intent. Alt text should be concise and useful for informative images, while complex charts, diagrams, and infographics often need a longer extended description. Decorative images should not receive descriptive alt text because that creates noise for screen reader users. A structured tool schema helps enforce these distinctions and makes the workflow auditable by capturing confidence, purpose classification, and approval status. Human approval is appropriate when confidence is low or the description may affect published accessibility quality.

Using one generic caption for every image ignores the difference between alt text and extended descriptions. OCR alone can miss visual meaning, and automatic publishing removes the required safeguard.

  • One caption for all images fails because concise alt text, empty decorative alt text, and extended descriptions have different accessibility purposes.
  • OCR-only extraction fails because detected text does not capture charts, diagrams, image purpose, or visual relationships.
  • Automatic publishing fails because the scenario requires editor approval for low-confidence descriptions before publication.
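
The structured schema and routing rules described above can be sketched as one function: purpose decides the output shape, and confidence decides the approval status. Field names and the threshold are assumptions for illustration:

```python
def describe(image: dict, threshold: float = 0.75) -> dict:
    """Apply the accessibility rules: empty alt text for decorative
    images, an extended description for charts and diagrams, concise
    alt text otherwise, and editor approval when confidence is low."""
    purpose = image["purpose"]          # "decorative" | "chart" | "informative" ...
    conf = image["confidence"]
    out = {"alt_text": "", "extended_description": None,
           "approval_status": "approved" if conf >= threshold else "needs_editor"}
    if purpose == "decorative":
        return out                      # empty alt: no noise for screen readers
    out["alt_text"] = image["short_caption"]
    if purpose in {"chart", "diagram", "infographic"}:
        out["extended_description"] = image["long_description"]
    return out
```

Capturing purpose, confidence, and approval status in one record is what makes the workflow auditable before anything is published.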

Continue with full practice

Use the AI-103 Practice Test page for the full IT Mastery route, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Try AI-103 on Web · View AI-103 Practice Test

Free review resource

Read the AI-103 Cheat Sheet on Tech Exam Lexicon, then return to IT Mastery for timed practice.

Revised on Thursday, May 14, 2026