Free Cisco 300-640 DCAI Practice Exam: Cisco Implementing Data Center AI Infrastructure

Last revised: July 14, 2026

This free full-length practice exam offers 60 original IT Mastery questions for Cisco Implementing Data Center AI Infrastructure (Cisco 300-640 DCAI), spread across the exam domains and accompanied by explanations. Work through them here, then continue with IT Mastery practice.

These are original IT Mastery practice questions. They are not official Cisco questions, copied live-exam content, or exam dumps. Use them to preview question style and explanation depth before continuing with mixed sets, topic drills, and timed mocks in IT Mastery.

Count note: this page uses the full-length practice count maintained in the IT Mastery exam catalog. Certification vendors differ in how they publish total questions, scored questions, duration, and unscored or pretest-item rules; always confirm exam-day rules with the sponsor.


Exam snapshot

Practice target: Cisco 300-640 DCAI
Practice-set question count: 60
Time limit: 90 minutes
Practice style: mixed-domain diagnostic run with answer explanations

Full-length exam mix

Domain	Weight
AI Fundamentals and Applications	20%
AI Infrastructure Components and Architecture	30%
AI Infrastructure Deployment and Data Management	30%
AI Infrastructure Operations and Troubleshooting	20%

Use this as one diagnostic run. IT Mastery gives you timed mocks, topic drills, analytics, code-reading practice where relevant, and interactive practice.

Practice questions

Questions 1-25

Question 1

Topic: AI Fundamentals and Applications

A team reports that a RAG chatbot has a high time-to-first-token, but normal token generation speed after the answer starts. Recent telemetry shows:

Observation	Value
Vector DB query p95	Increased from 400 ms to 4.8 s
Storage read latency on vector index volume	High
LLM GPU utilization during generation	Normal
RoCE congestion drops	None
Orchestrator pod restarts	None

Which stage and infrastructure dependency should be investigated first?

Options:

A. Retrieval: vector-index storage and query latency
B. Generation: GPU memory bandwidth for token decoding
C. Augmentation: prompt-template CPU formatting latency
D. Orchestration: pod restart recovery time

Best answer: A

Explanation: In a RAG workflow, retrieval locates relevant context, augmentation prepares that context with the user prompt, and generation uses the LLM to produce tokens. The symptom is high time-to-first-token with normal token speed after output begins. That points to work before generation, and the telemetry specifically shows degraded vector database query latency plus high storage read latency on the vector index volume. Those are retrieval-stage dependencies. GPU utilization and token generation speed do not support a generation bottleneck, and the absence of pod restarts makes orchestration recovery unlikely. The key takeaway is to map the performance symptom to the RAG stage that owns the stressed infrastructure path.

Prompt formatting is less supported because the strongest evidence is slow vector database access, not CPU-bound context assembly.
GPU decoding does not fit because token generation speed and GPU utilization are normal after context is available.
Pod recovery is not supported because the orchestrator shows no restarts or workload rescheduling symptom.
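
The stage mapping in this explanation can be sketched as a small latency breakdown. This is a minimal illustration, not a monitoring tool: only the 4.8 s retrieval figure comes from the telemetry table (treating the p95 as a representative slow request), while the augmentation and prefill timings and the function name are assumptions made for the example.

```python
def dominant_ttft_stage(stage_seconds):
    """Return (stage, share) for the stage that contributes most to time-to-first-token."""
    total = sum(stage_seconds.values())
    stage = max(stage_seconds, key=stage_seconds.get)
    return stage, stage_seconds[stage] / total


# Only the retrieval value is taken from the question; the others are hypothetical.
timings = {
    "retrieval": 4.8,      # vector DB query p95 from the telemetry table
    "augmentation": 0.05,  # assumed prompt assembly time
    "prefill": 0.35,       # assumed LLM time before the first token
}

stage, share = dominant_ttft_stage(timings)
print(f"{stage} dominates time-to-first-token ({share:.0%} of the total)")
```

Under these assumed numbers the retrieval stage dominates, which matches the reasoning for investigating the vector-index storage path first.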

Question 2

Topic: AI Infrastructure Deployment and Data Management

A team is deploying a multi-node generative AI training cluster on an Ethernet leaf-spine fabric. The workload uses frequent GPU-to-GPU gradient exchanges and must reduce communication latency and CPU overhead without replacing Ethernet. Which design best maps to these requirements?

Options:

A. Enable RoCEv2 with QoS, PFC, and ECN for AI traffic
B. Move model checkpoints to Fibre Channel storage
C. Increase CPU core count on each training server
D. Use standard TCP forwarding with best-effort QoS

Best answer: A

Explanation: RDMA over Converged Ethernet matters because distributed AI training often depends on fast node-to-node communication, not only local GPU speed. RoCEv2 carries RDMA traffic across an IP-routed Ethernet fabric so GPU servers can exchange large tensors with lower latency and less CPU overhead than traditional TCP-based communication. For AI fabrics, RoCEv2 is typically paired with traffic engineering and congestion controls such as QoS classification, Priority Flow Control, and ECN so the RDMA traffic is protected from loss and congestion collapse. The key design point is preserving Ethernet while making it suitable for high-performance, low-latency AI communication.

More CPU does not directly reduce network protocol overhead or fabric latency for GPU-to-GPU exchanges.
Fibre Channel storage may improve checkpoint I/O, but it does not address inter-server training communication.
Best-effort TCP keeps Ethernet simple but fails the low-latency and low-overhead communication requirement.

Question 3

Topic: AI Infrastructure Components and Architecture

A team is validating an on-premises GPU cluster for distributed model training before adding more nodes. Training logs show repeated data-loader waits and GPU utilization dropping below 35% during batch reads. RoCE fabric telemetry shows no congestion drops, and the servers have CPU and memory headroom. Which storage evidence best supports prioritizing storage remediation before scale-out?

Options:

A. Dataset repository capacity at 68% with monthly growth
B. Backup network throughput lower than the training fabric
C. Dual-controller storage array operating in active/active mode
D. High p95 read latency on the dataset volume during batch reads

Best answer: D

Explanation: For an AI training bottleneck, the strongest storage evidence is a metric that matches the workload symptom and timing. Data-loader waits during batch reads point to the storage path that feeds training data to the GPUs. If p95 read latency spikes on that dataset volume at the same time GPUs go idle, storage performance is the likely limiter, especially when network congestion and server CPU or memory pressure have been ruled out. Capacity, redundancy, and unrelated backup-path throughput may matter operationally, but they do not explain the observed training stalls. The key is correlating storage performance evidence with the AI workload phase that is slowing down.

Capacity growth is a planning signal, but 68% used does not explain immediate batch-read stalls.
Active/active controllers support availability and throughput, but their presence is not evidence of a current bottleneck.
Backup throughput is on a different path unless backups are shown to affect the training data path.
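
The correlation described above can be checked with a short script. The sketch below assumes per-interval samples of dataset-volume read latency and GPU utilization are already available; the sample values, the 20 ms latency limit, and the function names are hypothetical, and only the 35% utilization floor comes from the question.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def stalled_intervals(read_latency_ms, gpu_util_pct, latency_limit_ms, util_floor_pct=35):
    """Return the indexes of intervals where slow reads coincide with idle GPUs."""
    return [
        i
        for i, (latency, util) in enumerate(zip(read_latency_ms, gpu_util_pct))
        if latency > latency_limit_ms and util < util_floor_pct
    ]


# Hypothetical samples, one per monitoring interval.
latency_ms = [4, 5, 48, 52, 6, 5, 50, 4]
gpu_util = [92, 90, 30, 28, 91, 93, 31, 94]

print("p95 read latency (ms):", p95(latency_ms))
print("intervals with slow reads and idle GPUs:", stalled_intervals(latency_ms, gpu_util, 20))
```

If the flagged intervals line up with the data-loader waits in the training logs, that supports treating storage as the limiter before adding nodes.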

Question 4

Topic: AI Infrastructure Operations and Troubleshooting

A data center team remediated degraded all-reduce performance for an AI training cluster by correcting the QoS policy applied to RoCEv2 traffic. Validation shows that packet drops are gone and the benchmark is back within target. However, telemetry now shows GPU-to-GPU traffic growing 18% month over month and the fabric will exceed the planned utilization threshold next quarter. The operations requirement is to close only validated incidents, preserve troubleshooting knowledge, avoid unnecessary escalation, and prevent recurrence. Which follow-up design best maps to these results?

Options:

A. Escalate to Cisco TAC and defer documentation until root cause is confirmed
B. Document the fix, update monitoring baselines, and open capacity planning
C. Increase GPU reservations in the orchestrator without changing operations records
D. Close the incident and rely on existing alerts for future detection

Best answer: B

Explanation: Remediation follow-up should match the validation outcome. Because the benchmark recovered and drops are gone, the incident can move toward closure without escalation for an unresolved fault. The team should document the corrected QoS mapping and validation evidence so the runbook and incident history are useful later. Because telemetry shows sustained growth and a projected threshold breach, monitoring baselines or alert thresholds should be reviewed, and a capacity-planning action should be created. This separates a fixed incident from a future scalability risk.

Unneeded escalation fails because the issue is already validated as fixed, and no unresolved vendor defect is stated.
Relying only on existing alerts fails because the post-fix telemetry changes the monitoring and capacity assumptions.
GPU reservations target compute scheduling, not the documented network remediation or projected fabric utilization risk.
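
The capacity-planning step can be made concrete with a compound-growth projection. This is a rough sketch: the 18% monthly growth rate comes from the question, but the current utilization (60%) and the planned threshold (80%) are assumed values chosen only to illustrate the arithmetic.

```python
def months_until_threshold(current_util, threshold, monthly_growth):
    """Count whole months of compound growth until utilization reaches the threshold."""
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    months = 0
    util = current_util
    while util < threshold:
        util *= 1 + monthly_growth
        months += 1
    return months


# 0.18 is from the scenario; 0.60 and 0.80 are assumptions for illustration.
print(months_until_threshold(current_util=0.60, threshold=0.80, monthly_growth=0.18))
```

With these assumed inputs the threshold is crossed within two months, which is the kind of result that justifies opening a capacity-planning action alongside the incident closure.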

Question 5

Topic: AI Fundamentals and Applications

A data science team is deploying an inference workload that scores 40 million archived support tickets each night to update search metadata. The job must finish before 6:00 a.m., can start after business hours, and has no user-facing response-time requirement. Daytime GPU capacity must remain available for a customer-facing chatbot. Which infrastructure decision is BEST?

Options:

A. Run queued batch jobs on off-peak shared GPUs
B. Prioritize synchronous API scaling over job scheduling
C. Deploy edge inference nodes near each user region
D. Reserve dedicated low-latency GPUs for every request

Best answer: A

Explanation: Batch inference is designed for high-throughput processing where individual request latency is not critical. In this scenario, the ticket-scoring job has a completion deadline but no real-time user interaction, so it should be scheduled as a queued workload during off-peak hours. GPU allocation can be shared, quota-controlled, or lower priority so that daytime interactive inference for the chatbot keeps its capacity. Interactive inference, by contrast, typically needs synchronous serving, low p95/p99 latency, and reserved or rapidly scalable resources.

The key distinction is completion-window scheduling versus per-request response-time guarantees.

Dedicated low-latency GPUs overbuild capacity for a workload that has no user-facing response-time requirement.
Edge inference nodes address user proximity and latency, which are not constraints for nightly archive scoring.
Synchronous API scaling focuses on interactive serving and ignores the stated scheduling window.
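
Completion-window scheduling can be reasoned about as a throughput target rather than a per-request latency target. The sketch below uses the 40 million tickets and the 6:00 a.m. deadline from the question; the 8:00 p.m. start (a 10-hour window) and the per-GPU scoring rate are assumptions for illustration only.

```python
import math


def required_throughput(items, window_hours):
    """Items per second needed to finish within the completion window."""
    return items / (window_hours * 3600)


def gpus_needed(items, window_hours, items_per_gpu_per_second):
    """Smallest whole number of GPUs that meets the window at the given per-GPU rate."""
    return math.ceil(required_throughput(items, window_hours) / items_per_gpu_per_second)


TICKETS = 40_000_000        # from the scenario
WINDOW_HOURS = 10           # assumed: job starts at 8:00 p.m. and must end by 6:00 a.m.
PER_GPU_RATE = 150          # assumed tickets scored per GPU per second

print(f"required rate: {required_throughput(TICKETS, WINDOW_HOURS):.0f} tickets/s")
print("GPUs needed off-peak:", gpus_needed(TICKETS, WINDOW_HOURS, PER_GPU_RATE))
```

The point of the calculation is that the job only needs enough shared off-peak GPUs to hit the aggregate rate; nothing in it requires reserved low-latency capacity.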

Question 6

Topic: AI Infrastructure Operations and Troubleshooting

A team runs distributed generative AI training on GPU servers using RoCEv2 across a leaf-spine fabric. During all-reduce phases, training time increases and GPU utilization drops, while storage latency remains within baseline. Which telemetry signal best supports the suspected network issue?

Options:

A. Rising ECN marks and PFC pause counters on the RoCE traffic class
B. Frequent pod image pull failures in the orchestration cluster
C. Increasing GPU temperature and power throttling events on the servers
D. Higher read latency on the shared training dataset volume

Best answer: A

Explanation: For distributed AI training, all-reduce phases are highly sensitive to network latency, congestion, and packet-loss behavior. In a RoCEv2 fabric, telemetry such as ECN marking and PFC pause counters on the correct QoS traffic class is strong evidence that congestion management is being exercised during GPU-to-GPU synchronization. The stem already rules against storage by saying storage latency is normal, and falling GPU utilization can be an effect of waiting on network communication rather than a compute fault. The key takeaway is to correlate the signal with the workload phase and infrastructure layer being suspected.

Compute symptom fails because GPU thermal throttling would support a server compute issue, not fabric congestion.
Storage latency fails because the stem says dataset storage latency remains within baseline.
Orchestration failures fail because image pull problems affect deployment readiness, not all-reduce performance during active training.

Question 7

Topic: AI Infrastructure Deployment and Data Management

A team is deploying Cisco UCS GPU servers for distributed AI training. The workload must boot from a shared FC SAN LUN and use local NVMe devices as high-throughput scratch space for checkpoints. The server profile template currently uses a storage policy built for stateless inference nodes with no local disk claim and no FC boot path.

Which design choice best maps to the requirement?

Options:

A. Increase the RoCE traffic class bandwidth for training traffic.
B. Add more GPU memory to each training node profile.
C. Move checkpoints to the orchestration control-plane datastore.
D. Assign a storage policy that defines FC boot and local NVMe scratch use.

Best answer: D

Explanation: A UCS storage policy mismatch can prevent an AI node from being provisioned correctly or can place I/O on the wrong data path. In this scenario, the requirement is explicit: FC SAN boot plus local NVMe scratch for checkpoint-heavy training. A profile built for stateless inference does not claim local disks and does not provide the FC boot path, so it cannot satisfy the deployment requirements. The design should align the UCS storage policy and related profile settings with the intended storage data paths before workload placement. Network QoS or GPU changes may help other bottlenecks, but they do not correct a missing storage path.

RoCE tuning addresses network transport behavior, not the missing FC boot path or local NVMe claim.
More GPU memory changes compute capacity but does not make required storage devices available to the profile.
Control-plane datastore is not an appropriate high-throughput checkpoint target for GPU training nodes.

Question 8

Topic: AI Infrastructure Components and Architecture

A team is deploying a production generative AI inference service on Cisco UCS GPU servers. The service must continue serving requests during one server failure or planned host maintenance, and the model endpoint must remain available without manual rebuild. Which compute design best maps to these requirements?

Options:

A. Deploy an N+1 GPU server pool across failure domains with automated workload rescheduling.
B. Deploy active/passive GPU servers with manual model reload after failure.
C. Deploy one larger GPU server with redundant power supplies and NICs.
D. Deploy redundant storage for model weights on a single GPU server.

Best answer: A

Explanation: Production AI inference needs compute redundancy at the service layer, not only component redundancy inside a single host. An N+1 GPU server pool provides enough spare GPU capacity for the workload to keep running when one server fails or is placed into maintenance. Placing servers across failure domains reduces the chance that one chassis, power, or fabric issue removes all serving capacity. Automated orchestration or workload rescheduling keeps the endpoint available without a manual rebuild. Redundant power, NICs, or storage are useful, but they do not replace the need for redundant GPU compute nodes.

Bigger single host improves capacity but still leaves the service dependent on one compute node.
Manual passive failover does not meet the requirement for continuity without manual rebuild.
Redundant storage only protects model data but does not provide alternate GPU execution capacity.
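
The N+1 sizing logic can be written out as a quick check. This is a simplified sketch that treats capacity as a GPU count only; the peak demand of 20 GPUs and 8 GPUs per server are hypothetical values, and real sizing would also consider failure domains and model placement.

```python
import math


def n_plus_one_servers(peak_gpus_needed, gpus_per_server, spares=1):
    """Servers required to carry peak demand plus the given number of spare servers."""
    return math.ceil(peak_gpus_needed / gpus_per_server) + spares


def survives_one_failure(servers, gpus_per_server, peak_gpus_needed):
    """True if the pool still covers peak demand with one server out of service."""
    return (servers - 1) * gpus_per_server >= peak_gpus_needed


# Hypothetical demand: 20 GPUs at peak, 8 GPUs per server.
pool = n_plus_one_servers(peak_gpus_needed=20, gpus_per_server=8)
print("servers in pool:", pool)
print("survives one failure:", survives_one_failure(pool, 8, 20))
```

A pool sized without the spare fails the same check, which is why component redundancy inside one host does not substitute for spare GPU nodes.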

Question 9

Topic: AI Infrastructure Components and Architecture

An AI team is designing compute placement for two workloads. Large-model fine-tuning runs as 8-GPU jobs with heavy GPU-to-GPU collective traffic and must minimize intra-node latency. Several inference services are smaller, independently scaled containers that can share accelerators but need tenant isolation. Which compute architecture best maps to these requirements?

Options:

A. PCIe-only single-GPU nodes for all workloads with scheduler spreading
B. CPU-only training nodes with GPU inference nodes
C. Shared vGPU VMs for training and bare-metal nodes for inference
D. Dedicated NVLink-connected GPU nodes for training; containerized shared-GPU pool for inference

Best answer: D

Explanation: Large-model fine-tuning with 8-GPU jobs and heavy collective operations benefits from GPUs placed in the same server or pod with high-bandwidth, low-latency GPU-to-GPU connectivity such as NVLink or NVSwitch. Keeping those jobs on dedicated, tightly connected GPU nodes avoids fragmentation and reduces intra-node communication delay. Smaller inference services usually scale independently and can often run as containers with GPU sharing or partitioning, provided isolation is enforced by the platform. The key is matching placement to traffic pattern: training needs tightly coupled accelerators, while inference needs elastic, isolated accelerator access.

Spreading training fails because heavy collective traffic becomes more sensitive to inter-node latency and lower GPU-to-GPU bandwidth.
Shared vGPU training fails because sharing accelerators can undermine predictable performance for tightly coupled 8-GPU fine-tuning jobs.
CPU-only training fails because the stated workload requires accelerator performance for large-model fine-tuning.

Question 10

Topic: AI Infrastructure Components and Architecture

A team reports that a distributed GPU training job performs normally when all workers run in one rack, but step time more than doubles when workers span two racks. GPU utilization drops during gradient synchronization.

Telemetry summary

Signal	Observation
Storage latency	Normal during dataset reads
CPU utilization	Below 50% on all workers
Leaf-spine links	Sustained above 90% during all-reduce
RoCE traffic class	Queue buildup and pause activity

Which network architecture concern is most likely affecting performance?

Options:

A. Oversubscribed east-west fabric between GPU racks
B. GPU memory capacity mismatch across workers
C. Insufficient north-south Internet bandwidth
D. Slow storage reads during dataset ingestion

Best answer: A

Explanation: Distributed GPU training is highly sensitive to east-west bandwidth and latency because workers frequently exchange gradients or parameters during synchronization. The key evidence is that performance is normal within one rack but degrades when the job crosses racks, while storage and CPU signals are not stressed. Sustained high utilization on leaf-spine links plus RoCE queue buildup and pause activity indicate congestion on the inter-rack data path. For AI fabrics, the architecture must provide enough nonblocking or low-oversubscription east-west capacity for GPU-to-GPU communication. A north-south or storage-focused explanation does not match the scope and timing of the symptom.

North-south bandwidth is less likely because the symptom occurs during inter-rack synchronization, not external data transfer.
Storage reads are not supported because storage latency is normal during dataset access.
GPU memory mismatch would not specifically align with congested leaf-spine links and RoCE queue buildup.
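
Oversubscription on a leaf can be estimated by comparing server-facing bandwidth with spine-facing bandwidth. The port counts and speeds below are hypothetical; they are not taken from the question and only illustrate the ratio.

```python
from fractions import Fraction


def oversubscription_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of server-facing to spine-facing bandwidth on one leaf (1 means nonblocking)."""
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


# Hypothetical leaf: 32 x 400G toward GPU servers, 8 x 400G toward the spines.
ratio = oversubscription_ratio(32, 400, 8, 400)
print(f"oversubscription: {ratio.numerator}:{ratio.denominator}")

# The same leaf with 32 uplinks would be nonblocking for east-west traffic.
print(oversubscription_ratio(32, 400, 32, 400) == 1)
```

A ratio well above 1:1 is consistent with the symptom in this question: traffic that stays inside a rack is unaffected, while all-reduce traffic crossing the leaf-spine links saturates them.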

Question 11

Topic: AI Infrastructure Deployment and Data Management

A data center team is troubleshooting slower completion times for a distributed AI training job over a RoCEv2 leaf-spine fabric. The RoCEv2 class is configured as lossless and has an ETS minimum bandwidth share.

Telemetry summary:

Signal	Observation
RoCE packet drops	None detected
PFC pause frames	Frequent spikes on the RoCE class
ECN-marked packets	Near zero
ETS utilization	RoCE class receives its configured share

Which remediation is best supported by these facts?

Options:

A. Enable PFC on every traffic class
B. Disable PFC for the RoCEv2 class
C. Increase the ETS minimum bandwidth share
D. Tune ECN marking for the RoCEv2 class

Best answer: D

Explanation: PFC, ECN, and ETS solve different problems in an AI fabric. PFC protects a selected lossless class from drops by pausing traffic, but excessive pause activity can create latency and head-of-line blocking. ECN marks packets during congestion so RoCEv2 endpoints can react before buffers fill enough to trigger PFC. ETS allocates bandwidth among classes; it does not signal congestion or prevent pause storms. Because drops are absent, PFC is already protecting the lossless class. Because ECN marks are near zero while PFC pauses spike, the supported fix is to tune ECN marking for the RoCEv2 class.

Bandwidth share trap fails because ETS affects allocation, not congestion notification.
PFC everywhere fails because extending lossless behavior to all classes can worsen head-of-line blocking.
Removing PFC fails because RoCEv2 lossless transport still needs protection from drops.
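
The way these three signals point to different remediations can be summarized as a simple decision function. This is an illustrative sketch, not a Cisco tool or API: the function name, argument names, and the first and third branches are assumptions added for completeness, while the ECN-tuning branch mirrors the reasoning in the explanation.

```python
def roce_triage(drops, pfc_pause_spikes, ecn_marks_near_zero, ets_share_met):
    """Suggest a first remediation from coarse RoCEv2 class telemetry (simplified)."""
    if drops:
        # Assumed branch: drops in a lossless class suggest PFC is not protecting it.
        return "verify PFC/lossless configuration for the RoCEv2 class"
    if pfc_pause_spikes and ecn_marks_near_zero:
        # From the explanation: PFC is absorbing congestion that ECN should signal earlier.
        return "tune ECN marking for the RoCEv2 class"
    if not ets_share_met:
        # Assumed branch: the class is not receiving its configured bandwidth share.
        return "review ETS bandwidth allocation"
    return "no change indicated by these signals"


# Telemetry from the question: no drops, pause spikes, near-zero ECN marks, ETS share met.
print(roce_triage(drops=False, pfc_pause_spikes=True, ecn_marks_near_zero=True, ets_share_met=True))
```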

Question 12

Topic: AI Infrastructure Components and Architecture

A data science team runs several AI inference services on shared Cisco UCS GPU servers by installing framework and CUDA dependencies directly on the host OS. After each model update, operations observes:

One service upgrade breaks another service that requires a different framework version.
Rollback restores the application code but leaves mixed library versions on the host.
GPU use is uneven because processes are manually assigned to servers.
Network and storage telemetry show no congestion.

Which remediation best addresses the likely cause?

Options:

A. Add NVLink to improve GPU-to-GPU communication
B. Increase RoCE bandwidth for the inference VLAN
C. Deploy each service as a container with GPU resource requests
D. Create one shared host image for all services

Best answer: C

Explanation: The symptoms point to lifecycle and dependency management problems, not a network, storage, or GPU interconnect bottleneck. Containerization is a better fit when AI workloads need portable runtime environments, dependency isolation, repeatable upgrades and rollbacks, and scheduler-aware GPU placement. Packaging each inference service as an immutable container image avoids mixing framework versions on the host. Using an orchestrator with GPU resource requests also improves placement control because workloads can be scheduled based on available GPU resources rather than manual server assignment. The key takeaway is that containers solve deployment consistency and workload scheduling issues; faster fabric or GPU interconnects do not fix host-level dependency drift.

RoCE bandwidth is not supported because telemetry shows no network congestion.
NVLink expansion targets GPU-to-GPU communication, not dependency isolation or rollout consistency.
Shared host image can standardize a baseline, but it does not isolate conflicting framework versions between services.

Question 13

Topic: AI Infrastructure Components and Architecture

A manufacturer is designing AI infrastructure for computer-vision quality inspection across several plants. Raw camera feeds must remain on-site, inference responses must stay under 20 ms, and model retraining can use cloud GPU capacity only with sanitized datasets. Which infrastructure decision best meets these requirements?

Options:

A. Store video on-site but stream frames to cloud inference
B. Run all inference and training on cloud GPUs
C. Build a fully isolated on-premises AI environment
D. Use on-site GPU inference with secure cloud integration

Best answer: D

Explanation: A purely cloud design is unsuitable when the workload has strict real-time latency and data-location constraints. In this scenario, raw video cannot leave each plant and inference must complete in under 20 ms, so inspection inference should run close to the cameras on on-site or edge GPU infrastructure. Cloud still has a useful role for elastic retraining, but only after data is sanitized and transferred through secure hybrid connectivity. This is a hybrid AI pattern: keep latency-sensitive and regulated data paths local, while using cloud capacity where it does not violate placement or response-time requirements.

All-cloud processing misses both the raw-data location rule and the tight inference latency target.
Cloud inference with local storage still requires sending live frames off-site, which violates the data-location and latency constraints.
Fully isolated on-premises satisfies locality but ignores the stated need to use cloud GPU capacity for sanitized retraining.

Question 14

Topic: AI Infrastructure Deployment and Data Management

A team is deploying Cisco UCS GPU servers for AI training. The requirements are: OS images must be portable when a server profile moves to replacement hardware; each server needs a fast temporary cache that survives a single local drive failure; shared datasets and checkpoints must remain on the existing redundant Fibre Channel SAN. Which storage design best maps to these requirements?

Options:

A. SAN boot, no local disks, and cache on the FC SAN
B. Local M.2 boot, RAID0 cache, and FC workload LUNs
C. SAN boot, mirrored local cache, and redundant FC workload LUNs
D. Local boot, mirrored cache, and replicated local checkpoints

Best answer: C

Explanation: For UCS-based AI infrastructure, storage policy choices should align each data path with the requirement it serves. Portable OS images are typically handled with a boot policy that points to SAN boot LUNs, allowing the server profile to move to replacement hardware without depending on local boot media. A fast temporary cache belongs on local storage when the workload needs low-latency scratch space, and the single-drive-failure requirement calls for a mirrored or parity-protected local disk policy rather than RAID0. Shared datasets and checkpoints should use the existing redundant FC SAN because they are persistent workload data that must be available beyond one server.

Local M.2 boot fails the portability requirement because the OS depends on server-local media.
RAID0 cache may be fast, but it does not survive a single local drive failure.
Cache on SAN preserves centralization but misses the stated need for fast server-local temporary cache.
Local checkpoints do not meet the requirement for shared persistent data on the redundant FC SAN.

Question 15

Topic: AI Infrastructure Operations and Troubleshooting

A team reports that distributed GPU training on Cisco UCS servers slowed after a second training job was launched. Intersight shows server health as Good, normal GPU power and temperature, and no GPU link faults. Kubernetes shows all training pods as Running. Storage latency is within baseline. Nexus Dashboard shows rising PFC pause frames and ECN marks on the RoCEv2 traffic class between the GPU nodes. Which action is most supported by these monitoring views?

Options:

A. Replace the affected GPUs in the UCS servers
B. Validate fabric congestion and QoS for the RoCEv2 class
C. Increase storage IOPS for the training dataset
D. Restart the Kubernetes scheduler for the cluster

Best answer: B

Explanation: The monitoring views isolate the problem to the network fabric. Intersight server and GPU health are normal, so the evidence does not support a compute hardware fault. Kubernetes shows the workload is scheduled and running, so the orchestration state is not the main issue. Storage latency is normal, which makes storage performance an unlikely bottleneck. Nexus Dashboard is the relevant view for fabric health, congestion, and RoCEv2 behavior; rising PFC pause frames and ECN marks indicate congestion in the lossless traffic class used for GPU-to-GPU communication. The next step is to validate and remediate RoCEv2 QoS, congestion management, or load distribution in the fabric.

GPU replacement is not supported because Intersight reports healthy servers, normal GPU conditions, and no GPU link faults.
Storage expansion does not match the evidence because storage latency remains within baseline.
Scheduler restart is not supported because Kubernetes already shows the training pods in the Running state.

Question 16

Topic: AI Infrastructure Operations and Troubleshooting

A new distributed training job on Cisco UCS GPU nodes shows intermittent all-reduce latency spikes. Intersight shows the servers, GPUs, adapters, firmware compliance, and power/cooling health as normal. The storage system is not reporting elevated latency. The team suspects congestion on the leaf-spine fabric carrying RoCEv2 traffic. What is the best next validation step?

Options:

A. Review Intersight firmware compliance for the GPU servers
B. Use Intersight inventory to confirm GPU model counts
C. Check Nexus Dashboard fabric telemetry for congestion and drops
D. Restart the orchestration pods on the affected nodes

Best answer: C

Explanation: Nexus Dashboard and Intersight provide different operational views. Intersight is the right tool for infrastructure inventory, server health, firmware compliance, and lifecycle visibility across UCS-managed resources. In this scenario, those server-side indicators are already healthy, and the suspected issue is congestion in the Nexus leaf-spine fabric carrying RoCEv2 traffic. Nexus Dashboard is better suited to validate fabric behavior such as interface utilization, drops, congestion events, flow visibility, and correlated fabric telemetry. The key distinction is that healthy compute inventory does not rule out a network-fabric performance issue.

Firmware compliance is server lifecycle visibility, but the stem already says compliance and health are normal.
GPU model counts confirm inventory, not whether the fabric is congested during all-reduce traffic.
Pod restart treats orchestration as the issue without evidence from the monitoring facts.

Question 17

Topic: AI Fundamentals and Applications

A data science team is moving large-model training into an on-premises data center. Each training job uses 64 GPUs for multi-day runs, repeatedly reads a 20 TB dataset each epoch, and performs frequent gradient synchronization that is sensitive to latency variation. The team wants predictable job completion times. Which design best maps to these workload behaviors?

Options:

A. GPU-dense UCS servers, high-throughput shared storage, and a non-oversubscribed congestion-managed fabric
B. CPU-dense virtualization hosts, capacity-optimized object storage, and oversubscribed access switching
C. GPU servers, archive-tier storage, and best-effort fabric QoS
D. Small edge GPU nodes, local SSD caching, and periodic dataset synchronization

Best answer: A

Explanation: AI training workloads tend to run for long periods with many GPUs active at the same time. To keep those GPUs productive, the storage path must feed large datasets at high throughput, and the network fabric must provide predictable performance for synchronization traffic such as gradient exchange. A non-oversubscribed or carefully engineered fabric with congestion management helps reduce latency variation, while high-throughput shared storage avoids starving GPUs during repeated epoch reads. Designs optimized mainly for CPU density, edge placement, archive capacity, or best-effort connectivity do not address the combined training requirements.

CPU-focused design misses the primary constraint because training is GPU-bound and needs predictable fabric performance.
Edge placement fits distributed inference better than centralized multi-GPU training with repeated large dataset reads.
Archive storage emphasizes capacity and retention, not the throughput needed to keep GPUs fed during training.

Question 18

Topic: AI Fundamentals and Applications

A manufacturer is deploying computer-vision inference for quality inspection at 40 factories. Each line must reject defects within 20 ms, continue operating during WAN outages, and keep raw camera feeds inside the factory. Central IT still wants standardized rollout and health monitoring across all sites. Which deployment model best fits these requirements?

Options:

A. Public-cloud GPU autoscaling for all inference
B. Edge AI at each factory with centralized operations
C. Cloud-hosted inference with streamed camera feeds
D. Central on-premises inference over the WAN

Best answer: B

Explanation: Edge AI is the best fit when inference must happen close to the data source with very low latency and local survivability. In this scenario, defect rejection depends on a 20 ms response and must continue during WAN outages, so sending camera streams to a remote cloud or central data center introduces avoidable latency and dependency on connectivity. Keeping raw camera feeds inside each factory also aligns with edge placement because only summaries, model updates, or telemetry need to traverse the WAN. Centralized operations can still manage fleet consistency through orchestration, image/version control, and health monitoring across sites. The key distinction is that compute for time-critical inference stays local, while management can remain centralized.

Cloud streaming fails because it moves raw feeds offsite and depends on WAN latency and availability.
Central data center inference keeps compute owned by the company but still misses the local latency and outage requirements.
Public-cloud autoscaling may help elastic training or batch workloads, but it ignores the stated data-location and real-time inference constraints.

Question 19

Topic: AI Infrastructure Operations and Troubleshooting

An AI operations team is investigating a 40% drop in distributed training throughput on a Cisco data center fabric. Nexus Dashboard shows elevated ECN marking on one leaf, Intersight shows intermittent GPU underutilization, and storage latency is within baseline. An engineer concludes that PFC should be enabled globally before the next job run. The change window is limited and the team must avoid unnecessary disruption. Which operational design best maps to the requirement?

Options:

A. Correlate telemetry across fabric, GPU, storage, and job timelines before changing controls
B. Increase storage queue depth because training throughput is below baseline
C. Move the workload to different GPU nodes to bypass the underutilization symptom
D. Enable PFC globally to eliminate possible RoCEv2 loss immediately

Best answer: A

Explanation: The core concept is evidence-driven troubleshooting. A single symptom, such as ECN marking on one leaf, is not enough to justify a broad fabric change like globally enabling PFC. The best operational design is to correlate Nexus Dashboard fabric telemetry, RoCE counters, Intersight GPU utilization, storage latency, and job timing to confirm whether the bottleneck is network congestion, compute scheduling, storage, or orchestration. This approach reduces change risk and avoids masking the real fault. In AI clusters, symptoms often cascade: network congestion can idle GPUs, but GPU underutilization can also result from storage stalls or job placement issues. The key takeaway is to validate the conclusion before remediation.

Global PFC change skips root-cause validation and could introduce unnecessary operational risk across traffic classes.
Moving GPU nodes treats GPU underutilization as the root cause without proving a compute or placement issue.
Increasing storage queues conflicts with the stated storage latency baseline and does not address the observed fabric signal.

Question 20

Topic: AI Infrastructure Deployment and Data Management

A new UCS-based GPU pool was added for distributed AI training. The jobs start successfully, but training throughput is about 40% lower than the validated baseline.

Operational observations:

Signal	Observation
GPU telemetry	High utilization, low clocks, power-limit throttling active
Network fabric	No RoCE drops or ECN congestion spikes
Storage	Normal read latency and IOPS
Recent change	Domain profile uses a capped power policy

What is the most likely remediation?

Options:

A. Move the dataset to lower-latency block storage
B. Enable PFC on the RoCE traffic class
C. Increase Kubernetes pod CPU requests
D. Raise or remove the power cap after validating rack power and cooling

Best answer: D

Explanation: UCS power policy selection can affect GPU workload performance when it limits the power budget available to a server, chassis, or GPU-dense node. In this scenario, the strongest evidence is not the application scheduler, storage path, or network fabric; it is GPU telemetry showing power-limit throttling while a capped power policy is applied. For AI training, GPUs often need sustained power headroom to maintain boost clocks during long-running compute phases. The appropriate action is to validate facility and rack power and cooling capacity, then adjust the UCS power policy so the GPU nodes can draw the power required for the workload. A power cap may be useful for protection or capacity management, but it can reduce readiness or performance for dense GPU deployments.

RoCE tuning is not supported because the fabric shows no loss or congestion symptoms.
Storage latency is not indicated because IOPS and read latency are normal.
CPU requests do not address GPU clocks being reduced by an active power limit.

Question 21

Topic: AI Fundamentals and Applications

A team is building an on-premises AI platform for containerized training and RAG inference on Cisco UCS GPU nodes. Jobs must request GPUs, receive the correct pod network connectivity, mount persistent datasets from shared storage, and recover or reschedule when nodes fail. Which orchestration design best maps to these requirements?

Options:

A. Use standalone Slurm with manually provisioned network and storage
B. Use Terraform-managed UCS profiles with static VLANs and LUN mappings
C. Use Nexus Dashboard alerts with Intersight firmware compliance policies
D. Use Kubernetes with GPU device plugins, CNI, CSI, and operators

Best answer: D

Explanation: AI workload orchestration must coordinate runtime placement and dependencies across compute, network, and storage. In this scenario, Kubernetes is the best fit because the control plane schedules containerized jobs, GPU device plugins advertise GPU resources to the scheduler, CNI provides pod network connectivity, CSI connects workloads to persistent storage, and operators/controllers maintain desired state and recover workloads after failures. Infrastructure provisioning tools and monitoring platforms are useful, but they do not by themselves perform runtime scheduling and dependency coordination for containerized AI workloads.

Static provisioning can prepare servers, VLANs, and storage mappings, but it does not dynamically schedule containers or recover workloads.
Standalone job scheduling may place compute jobs, but manually provisioned network and storage skip key orchestration dependencies.
Monitoring and compliance help operate the environment, but alerts and firmware policy do not coordinate workload placement.
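
The pieces named in the explanation can be seen together in a single pod specification. The sketch below builds one as a plain Python dictionary and prints it as JSON. It assumes the NVIDIA device plugin is installed (it advertises the `nvidia.com/gpu` resource); the pod name, image, and claim name are hypothetical.

```python
import json

# Hypothetical training pod: requests GPUs and mounts a shared dataset volume.
training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-worker-0"},  # hypothetical name
    "spec": {
        "restartPolicy": "OnFailure",
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/trainer:latest",  # hypothetical image
            # GPU request; the device plugin exposes this resource to the scheduler.
            "resources": {"limits": {"nvidia.com/gpu": 4}},
            "volumeMounts": [{"name": "dataset", "mountPath": "/data"}],
        }],
        # Persistent dataset; a CSI driver would back this claim.
        "volumes": [{
            "name": "dataset",
            "persistentVolumeClaim": {"claimName": "shared-dataset"},  # hypothetical PVC
        }],
    },
}

print(json.dumps(training_pod, indent=2))
```

Pod networking does not appear in the spec because the CNI plugin attaches it when the pod is created. A bare Pod is not rescheduled when its node fails, so in practice a Job or a training operator would wrap a template like this one.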

Question 22

Topic: AI Infrastructure Deployment and Data Management

A data center team has completed deployment of an on-premises AI cluster for RAG indexing and model fine-tuning. Cisco UCS GPU servers report healthy GPUs, the Nexus fabric is up, NVMe-backed file storage is mounted, and Kubernetes nodes show Ready. The production requirement is to ingest large document sets, schedule distributed GPU jobs, sustain low-latency east-west traffic, and export operational telemetry. Which decision BEST determines that the environment is ready?

Options:

A. Add additional GPU nodes before running workload validation.
B. Validate a representative end-to-end AI workflow across storage, fabric, GPU scheduling, and telemetry.
C. Begin production ingestion and monitor only user-facing errors.
D. Declare readiness because all deployed components show healthy status.

Best answer: B

Explanation: Successful component deployment does not equal end-to-end AI infrastructure readiness. In this scenario, individual layers appear healthy, but the workload depends on the combined path: data ingestion from storage, network behavior during east-west GPU communication, container orchestration and GPU scheduling, and telemetry visibility. A representative validation run confirms that these layers work together under conditions similar to the intended RAG indexing and fine-tuning workflow. It can reveal bottlenecks or gaps that isolated health checks miss, such as storage throughput limits, scheduling misconfiguration, or missing telemetry coverage. The key distinction is between components being installed and the integrated AI pipeline being operationally ready.

Relying on component health alone misses integration risks between storage, network, compute, orchestration, and monitoring.
Adding GPUs first overbuilds before proving that the current design fails a workload requirement.
Production-first monitoring delays validation until users or jobs experience failures, which is not a readiness decision.

Question 23

Topic: AI Infrastructure Deployment and Data Management

A team deployed a leaf-spine fabric for a multi-node GPU training cluster. ECMP is configured across two spine paths, and the acceptance requirement is to prove that training traffic can use the available fabric capacity rather than concentrating on one path. Which validation step best maps to this requirement?

Options:

A. Verify that PFC and ECN are enabled for the RDMA class.
B. Run a representative training workload and verify balanced path utilization in telemetry.
C. Send ICMP pings between GPU nodes and verify low latency.
D. Confirm that the routing table lists equal-cost next hops.

Best answer: B

Explanation: Load-distribution validation must prove forwarding behavior under realistic traffic, not just configuration intent. For an AI training cluster, the strongest validation is to generate representative multi-flow training traffic and use fabric telemetry, such as interface counters or Nexus Dashboard observations, to confirm that utilization is spread across the expected ECMP links or spine paths without congestion drops. This demonstrates that hashing, path selection, and capacity use are working together for the workload. Route entries, ping tests, and QoS feature checks can support readiness, but they do not prove that production-like traffic is using all available fabric capacity.

Route presence shows possible equal-cost paths, but it does not prove traffic is actually distributed across them.
Ping testing verifies basic reachability and latency, but it is too small and single-flow oriented to validate fabric capacity use.
PFC and ECN checks validate congestion-control readiness, but they do not confirm load distribution across paths.
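
One way to turn "balanced path utilization" into a pass/fail check is to compare each uplink's share of transmitted bytes with an even split. This is a minimal sketch; the counter values, uplink names, and the 10-percentage-point tolerance are assumptions, and real counters would come from fabric telemetry collected during the training run.

```python
from typing import Dict


def path_shares(tx_bytes_by_uplink: Dict[str, int]) -> Dict[str, float]:
    """Return each uplink's fraction of the total transmitted bytes."""
    total = sum(tx_bytes_by_uplink.values())
    if total == 0:
        raise ValueError("no traffic observed on any uplink")
    return {name: count / total for name, count in tx_bytes_by_uplink.items()}


def is_balanced(tx_bytes_by_uplink: Dict[str, int], tolerance: float = 0.10) -> bool:
    """True when every uplink is within `tolerance` of an even split."""
    shares = path_shares(tx_bytes_by_uplink)
    ideal = 1 / len(shares)
    return all(abs(share - ideal) <= tolerance for share in shares.values())


# Hypothetical byte deltas sampled over the test window.
print(is_balanced({"spine1": 520, "spine2": 480}))  # True
print(is_balanced({"spine1": 900, "spine2": 100}))  # False
```

The second case is the failure this question targets: both next hops exist in the routing table, yet hashing concentrates traffic on one spine.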

Question 24

Topic: AI Infrastructure Operations and Troubleshooting

A distributed GPU training job suddenly slows down during the all-reduce phase. Nexus Dashboard and Intersight show these correlated events:

Time	Observation
10:14	High ECN marking and PFC pause frames on the RoCEv2 traffic class between two leaf switches
10:15	GPU utilization drops from 92% to 38% across several servers
10:16	Kubernetes reports delayed readiness probes for training pods
10:17	Storage latency remains within baseline

Which alert should be treated as the primary high-priority signal?

Options:

A. Delayed Kubernetes readiness probes
B. GPU utilization drop across the servers
C. RoCEv2 congestion on the lossless traffic class
D. Normal storage latency during the incident

Best answer: C

Explanation: In AI infrastructure incidents, the highest-priority alert is the earliest signal that explains the widest set of dependent symptoms. Here, the slowdown occurs during all-reduce, which depends heavily on low-latency GPU-to-GPU network communication. High ECN marking and PFC pause frames on the RoCEv2 traffic class indicate congestion or lossless fabric pressure on the path used by the training job. The later GPU utilization drop is consistent with GPUs waiting on network collectives, and delayed pod readiness can follow when the application becomes unresponsive. Normal storage latency also reduces the likelihood that storage is the root issue. The key is to correlate time, dependency, and blast radius rather than ranking alerts by how visible they are.

GPU utilization is a major symptom, but it occurs after the RoCEv2 congestion and can result from stalled collective communication.
Readiness probes indicate application impact, not the infrastructure layer most likely causing the all-reduce slowdown.
Storage latency being normal makes storage an unlikely primary cause under the stated facts.
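
The time-ordering part of this reasoning can be sketched as follows. The sketch is a deliberate simplification: it picks the earliest abnormal event only, whereas the explanation also weighs dependency and blast radius. The event tuples restate the table in this question; the layer labels are assumptions.

```python
from datetime import time

# (timestamp, layer, observation) taken from the incident timeline.
events = [
    (time(10, 15), "gpu", "GPU utilization drops from 92% to 38%"),
    (time(10, 14), "fabric", "High ECN marking and PFC pause frames on RoCEv2 class"),
    (time(10, 16), "orchestration", "Delayed readiness probes for training pods"),
    (time(10, 17), "storage", "Storage latency within baseline"),
]

# Storage is excluded because its reading is normal, not a symptom.
ABNORMAL_LAYERS = {"gpu", "fabric", "orchestration"}


def primary_signal(events, abnormal_layers):
    """Return the earliest event whose layer reported abnormal behavior."""
    abnormal = [event for event in events if event[1] in abnormal_layers]
    return min(abnormal, key=lambda event: event[0])


print(primary_signal(events, ABNORMAL_LAYERS)[1])  # fabric
```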

Question 25

Topic: AI Infrastructure Components and Architecture

A multi-GPU training job that normally finishes an epoch in 22 minutes now takes 41 minutes after an orchestration change. Telemetry shows GPU utilization oscillating between 35% and 90%, high all-reduce wait time, normal storage latency, and no congestion drops on the AI fabric. Kubernetes reports MemoryPressure=False. The current placement is two workers on two 4-GPU nodes; the baseline placed all eight GPUs in one server with NVLink. What is the most likely cause or remediation?

Options:

A. Increase pod memory requests to avoid eviction
B. Move the dataset to lower-latency block storage
C. Enable ECN on the storage traffic class
D. Use topology-aware placement on one 8-GPU node

Best answer: D

Explanation: For tightly coupled multi-GPU training, compute-node placement and GPU interconnect topology can dominate performance. The facts point away from memory, storage, and fabric congestion: MemoryPressure=False, storage latency is normal, and the fabric has no congestion drops. The key change is that an eight-GPU job that previously stayed within one NVLink-connected server is now split across two 4-GPU nodes. That can increase synchronization latency and reduce effective GPU utilization during all-reduce operations. A topology-aware scheduler, node affinity, or job policy that keeps the workload on an appropriate 8-GPU node is the supported remediation. The key takeaway is to match GPU placement to the workload’s communication pattern, not just the GPU count.

Memory pressure is not supported because the node condition is false and the symptom is synchronization wait, not eviction or paging.
Storage tuning is not supported because storage latency is reported as normal during the slowdown.
Fabric congestion is not supported because telemetry shows no congestion drops, and ECN would not restore lost NVLink locality.
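
The placement rule behind this answer can be sketched as a small selection function. This is an illustration only, not how a Kubernetes scheduler is implemented; in a real cluster the same intent would be expressed through node affinity or a topology-aware scheduling policy. Node names and GPU counts are hypothetical.

```python
from typing import Dict, Optional


def pick_single_node(free_gpus_by_node: Dict[str, int], gpus_needed: int) -> Optional[str]:
    """Return a node that can host the whole job, or None if the job would be split."""
    candidates = [node for node, free in free_gpus_by_node.items() if free >= gpus_needed]
    if not candidates:
        return None
    # Best fit: the smallest node that still keeps all GPUs on one interconnect.
    return min(candidates, key=lambda node: free_gpus_by_node[node])


cluster = {"node-a": 4, "node-b": 4, "node-c": 8}  # hypothetical free GPUs
print(pick_single_node(cluster, 8))  # node-c
```

Returning `None` instead of splitting the job makes the trade-off explicit: the job waits for an 8-GPU node rather than silently losing NVLink locality.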

Questions 26-50

Question 26

Topic: AI Infrastructure Operations and Troubleshooting

An AI platform team validates a RAG inference cluster before adding tenants. The production requirement is 40 GB/s aggregate vector-index read throughput with P95 retrieval latency under 25 ms. Nexus Dashboard and Intersight show no active production alerts.

Benchmark result: 62 GB/s aggregate reads, 18 ms P95 latency, 55% average GPU utilization, and normal storage queue depth.

Which operational decision best maps to these results?

Options:

A. Open a network congestion incident and throttle RoCE traffic during inference.
B. Declare a storage bottleneck and expand the NVMe storage tier before onboarding.
C. Treat the result as capacity headroom and phase in tenants with telemetry monitoring.
D. Discard the benchmark because only production failures can inform capacity planning.

Best answer: C

Explanation: A benchmark result indicates capacity headroom when it exceeds the defined workload requirement and does not coincide with production symptoms such as alerts, SLO violations, saturation, or abnormal queueing. In this case, throughput is above 40 GB/s, P95 latency is below 25 ms, GPU utilization is moderate, and storage queue depth is normal. That evidence supports a controlled onboarding plan with continued Nexus Dashboard and Intersight monitoring, not an incident response. Benchmarks help estimate capacity, but they become troubleshooting evidence only when correlated with production degradation or resource saturation.

Storage expansion fails because storage queue depth is normal and the benchmark exceeded the required throughput.
Network incident fails because there is no telemetry showing congestion or latency violation.
Discarding benchmarks fails because benchmarks are useful for capacity validation when interpreted alongside production telemetry.
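
The headroom in this scenario is simple arithmetic on the stated figures. The sketch below uses only the numbers from the question; the function names are illustrative.

```python
def headroom_pct(measured: float, required: float) -> float:
    """Percentage by which a measured rate exceeds the required rate."""
    return (measured - required) * 100 / required


def meets_requirement(read_gbps: float, p95_ms: float) -> bool:
    """Requirement from the question: at least 40 GB/s and P95 under 25 ms."""
    return read_gbps >= 40 and p95_ms < 25


print(round(headroom_pct(62, 40), 1))  # 55.0
print(25 - 18)                         # 7 ms of latency margin
print(meets_requirement(62, 18))       # True
```

A benchmark that clears both targets with about 55% throughput headroom supports phased onboarding rather than an incident or a storage expansion.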

Question 27

Topic: AI Infrastructure Deployment and Data Management

An AI training cluster attached to an ACI fabric reports intermittent NCCL timeouts during all-reduce operations. GPU health and storage latency are normal.

Exhibit:

Nexus Dashboard: ECN marks and output drops on leaf-to-spine links
Nexus Dashboard: high-volume RoCEv2 flows between GPU node EPGs
APIC: AI tenant EPGs use the default QoS class
APIC: no RoCEv2 classification or lossless policy is applied

Which remediation is best supported by the observations?

Options:

A. Move the training dataset to local SSD storage
B. Increase Kubernetes GPU replica counts
C. Deploy APIC QoS/PFC/ECN policy for RoCEv2 traffic
D. Disable Nexus Dashboard telemetry collection

Best answer: C

Explanation: Nexus Dashboard provides the fabric visibility needed to identify congestion patterns, while APIC is the policy point for deploying fabric behavior. In this case, the evidence narrows the issue to RoCEv2 traffic between GPU node EPGs: GPUs and storage are healthy, but Nexus Dashboard reports ECN marks and output drops, and APIC shows the traffic remains in the default QoS class. The appropriate remediation is to classify the AI/RoCEv2 traffic and apply the required congestion-management and lossless handling policy through APIC, then validate improvement with Nexus Dashboard telemetry. Scaling compute or changing storage does not address the observed fabric policy gap.

Adding replicas can increase distributed traffic and does not fix congestion handling in the fabric.
Local SSDs target storage latency, but the storage telemetry is normal.
Disabling telemetry removes visibility and does not remediate drops or QoS misclassification.

Question 28

Topic: AI Infrastructure Components and Architecture

A data center team is sizing compute for a new RAG-based support chatbot. The application must answer user prompts with low latency, use an existing vector index for retrieval, and serve many concurrent sessions. The foundation model is already trained and fits within one accelerator’s memory. Which compute plan is the best fit?

Options:

A. Multi-node GPU training cluster with high-speed GPU-to-GPU interconnect
B. CPU-only servers with large memory for the vector database
C. Maximum-density GPU servers sized primarily for checkpoint storage
D. Inference GPU nodes plus CPU/RAM capacity for retrieval services

Best answer: D

Explanation: RAG serving combines retrieval with generative inference. Because the model is already trained and fits on one accelerator, the compute plan should prioritize low-latency inference GPUs and enough CPU and memory to run retrieval, ranking, and vector-index services close to the serving path. Horizontal scaling can add more inference replicas for concurrency. A training cluster with extensive GPU-to-GPU connectivity is mainly justified when model training or large distributed fine-tuning requires synchronized accelerator communication. CPU-only capacity may help retrieval, but it misses the GPU requirement for generative model serving.

Training fabric overbuild misses that the model is already trained and does not require distributed GPU synchronization.
CPU-only sizing supports vector retrieval but ignores the accelerator need for low-latency model inference.
Checkpoint-focused sizing targets training storage behavior rather than the active compute path for RAG serving.

Question 29

Topic: AI Infrastructure Components and Architecture

A manufacturer is designing infrastructure for a computer-vision inference workload that stops robotic equipment when defects are detected. Inference must respond in under 10 ms, and raw production images must remain inside the plant network. The cloud link has a 35 ms round-trip latency. The team also wants to use cloud services for model registry and periodic approved model updates. Which design best maps to these requirements?

Options:

A. Keep all AI components disconnected from the cloud.
B. Run inference in the cloud and stream camera images over the WAN.
C. Store raw images in cloud object storage and cache only recent frames locally.
D. Run inference and store raw images on-premises; use secure hybrid connectivity to cloud services for approved model lifecycle functions.

Best answer: D

Explanation: A purely cloud design is unsuitable when a workload has hard local latency requirements or data-location restrictions that the WAN and cloud placement cannot satisfy. In this scenario, a 35 ms cloud round trip already exceeds the under-10 ms response target, and raw production images are not allowed to leave the plant network. The appropriate architecture is hybrid: keep the time-sensitive inference path and protected data on-premises or at the edge, then use secured cloud integration only for functions that can tolerate latency and comply with policy, such as model registry, governance, or approved model updates. The key is separating the real-time protected data path from non-real-time lifecycle services.

Cloud inference fails because the WAN latency alone exceeds the stated response target.
Cloud raw storage fails because raw production images must remain inside the plant network.
Fully disconnected AI misses the stated requirement to use cloud services for model registry and approved updates.
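
The latency argument can be checked with a short budget calculation. The 35 ms round trip and the 10 ms target come from the question; the 1 ms local network time and 5 ms model execution time are assumed values for illustration.

```python
def fits_latency_budget(network_rtt_ms: float, inference_ms: float, budget_ms: float) -> bool:
    """True when network round trip plus model execution stays under the budget."""
    return network_rtt_ms + inference_ms < budget_ms


# Cloud placement: the WAN round trip alone exceeds the budget,
# even with an (unrealistic) zero-cost model.
print(fits_latency_budget(35, 0, 10))  # False

# On-premises placement with assumed 1 ms LAN and 5 ms inference.
print(fits_latency_budget(1, 5, 10))   # True
```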

Question 30

Topic: AI Infrastructure Deployment and Data Management

A Cisco UCS domain will host GPU nodes for an AI inference platform. The upstream fabric already trunks VLAN 110 for management, VLAN 220 for dataset storage, and VLAN 330 for tenant inference. The design must keep these traffic types separated and reachable from the hosts. Which UCS LAN connectivity design best maps to this requirement?

Options:

A. Increase the vNIC QoS MTU for all traffic classes
B. Assign the required VLANs to the appropriate vNIC templates
C. Use a larger MAC address pool for the service profiles
D. Apply an NTP policy to the UCS domain profile

Best answer: B

Explanation: In Cisco UCS, LAN connectivity policies define the host-facing vNICs, and vNIC templates commonly define properties such as fabric placement, MAC pool use, QoS policy, and VLAN membership. For workload reachability and traffic separation, the decisive setting is whether the required VLANs are assigned to the correct vNICs or vNIC templates. If VLAN 330 is not allowed on the tenant inference vNIC, the upstream trunk can be correct and the host still cannot send or receive that traffic. QoS, MAC pools, and NTP can be important, but they do not by themselves place host traffic into the required Layer 2 segments.

MTU tuning can affect performance or fragmentation behavior, but it does not add missing VLAN reachability.
MAC pool sizing supports server identity assignment, but it does not separate management, storage, and inference traffic.
NTP policy supports time synchronization, but it is unrelated to host VLAN membership.
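
The failure mode described in the explanation, a correct upstream trunk with a VLAN missing on the host-facing vNIC, can be sketched as a set comparison. This is plain Python data and does not use any UCS or Intersight API; the vNIC names and the misconfigured state are hypothetical.

```python
# VLANs each vNIC must carry, from the question.
required_vlans = {"mgmt": {110}, "storage": {220}, "inference": {330}}

# Hypothetical current state: VLAN 330 was never added to the inference vNIC template.
assigned_vlans = {"mgmt": {110}, "storage": {220}, "inference": set()}


def missing_vlans(required, assigned):
    """Return the VLANs each vNIC still needs, omitting vNICs that are complete."""
    gaps = {}
    for vnic, vlans in required.items():
        missing = vlans - assigned.get(vnic, set())
        if missing:
            gaps[vnic] = sorted(missing)
    return gaps


print(missing_vlans(required_vlans, assigned_vlans))  # {'inference': [330]}
```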

Question 31

Topic: AI Infrastructure Components and Architecture

A team is designing a new AI training pod for large generative model fine-tuning. The workload is batch-oriented, can use nightly data replication, and requires the same GPU count, RoCE fabric, and storage throughput in any location.

Requirements:

Requirement	Detail
Data control	No public cloud storage for training data
Sustainability	Prefer renewable power and lower PUE when performance is unchanged
Facility limit	Current data center has limited cooling headroom
Operations	Central monitoring must remain available

Which design best maps to these requirements?

Options:

A. Place the pod in a renewable-powered colocation site with low PUE and private connectivity
B. Keep the pod in the current data center to avoid replication changes
C. Distribute smaller GPU nodes across branch edge sites
D. Move the training data and GPUs to a public cloud AI service

Best answer: A

Explanation: Renewable energy and sustainability requirements can change the recommended AI infrastructure location when the workload can tolerate the placement change without losing performance or violating data-control rules. In this scenario, the training workload is batch-oriented and supports nightly replication, so it does not require the pod to stay in the current facility. A renewable-powered colocation site with lower PUE and sufficient cooling headroom better satisfies the sustainability and facility requirements while still allowing the same GPU, RoCE, storage, and monitoring design. The key is that sustainability changes the site recommendation only because the performance and data-control constraints can still be met.

Current site preference fails because limited cooling headroom and weaker sustainability metrics are stated constraints.
Public cloud move fails because the requirement prohibits public cloud storage for training data.
Edge distribution fails because it changes the training architecture instead of preserving the required GPU and fabric design.

Question 32

Topic: AI Fundamentals and Applications

An AI platform team must design serving for two inference workloads in a Cisco data center. Workload A scores 20 million records after business close and must finish before 6:00 AM. Workload B serves a customer-facing assistant with small requests, bursty traffic, and a 300 ms p95 response-time target. Which design best maps to these requirements?

Options:

A. Run Workload A as scheduled batch jobs; run Workload B on reserved autoscaled inference endpoints.
B. Run both workloads as scheduled batch jobs on opportunistic GPU capacity.
C. Run Workload A on reserved endpoints; queue Workload B for batch execution.
D. Run both workloads on reserved low-latency inference endpoints.

Best answer: A

Explanation: Batch inference is designed for latency-tolerant work that can be queued, scheduled, and processed in large groups within a completion window. Workload A fits this pattern because it runs after business close and only needs to finish by 6:00 AM, so GPUs can be allocated in scheduled pools or during lower-demand periods. Interactive inference is request/response serving for users or applications with tight latency targets. Workload B fits this pattern because customer requests are bursty and must meet a 300 ms p95 target, requiring ready capacity, autoscaling, and low-latency request routing. The key distinction is not model type; it is the serving behavior and latency tolerance.

Treating both workloads as batch ignores the assistant’s real-time response requirement.
Treating both workloads as interactive over-reserves resources for a workload that can be scheduled.
Reversing the designs fails both requirements: the batch job gets costly always-on capacity, and the assistant misses latency targets.

Question 33

Topic: AI Fundamentals and Applications

A manufacturer is deploying computer-vision inference for robotic safety cells at several plants. Each site must make stop/no-stop decisions within 20 ms, continue operating during WAN outages, and keep raw video local while sending events and model metrics to a central environment for retraining. Which design best maps to these requirements?

Options:

A. Deploy CPU-only edge gateways and forward exceptions to the cloud for GPU analysis.
B. Deploy GPU-enabled edge nodes at each plant with local storage and centralized model/telemetry synchronization.
C. Stream raw video to a central cloud GPU cluster for all inference decisions.
D. Use a regional data center for inference and cache only final decisions at each plant.

Best answer: B

Explanation: Edge AI infrastructure places compute close to the data source when workloads require very low latency, local autonomy, or data locality. In this scenario, robotic safety decisions cannot depend on WAN transport or a centralized inference service because the decision must occur within 20 ms and continue during outages. GPU-enabled edge nodes support local computer-vision inference, while local storage keeps raw video on site for privacy and resiliency. A central environment can still support the broader AI lifecycle by receiving events and metrics and distributing validated model updates.

The key takeaway is that edge AI is not isolated from cloud or data center resources; it uses local processing for time-critical decisions and controlled synchronization for management, monitoring, and retraining.

Cloud inference fails because WAN latency and outages would directly affect safety decisions and raw video would leave the site.
Regional inference still depends on network transport for time-critical decisions and does not satisfy local autonomy.
CPU-only gateways may not meet computer-vision performance needs and still offload important analysis to the cloud.

Question 34

Topic: AI Infrastructure Components and Architecture

A distributed training job was moved to shared storage so checkpoints can survive node failure. Every checkpoint interval, GPU utilization drops sharply for several minutes, but the job does not fail.

Signal	Observation during checkpoint
Storage capacity	48% used
Storage throughput	19.5 GB/s of 20 GB/s sustained
Storage write latency	p99 rises to 170 ms
Network fabric	No drops; links below 45% utilization
GPU health	No NVLink or XID errors
Orchestration	Pods remain Running

Which remediation is best supported by these observations?

Options:

A. Reschedule workers to reduce GPU-to-GPU hops
B. Tune RoCE congestion controls on the fabric
C. Scale out the checkpoint storage performance tier
D. Add raw capacity to the existing storage pool

Best answer: C

Explanation: The checkpoint path is constrained by storage performance, not available capacity. Capacity is only 48% used, but throughput is essentially at the storage limit and p99 write latency spikes during checkpoints. For AI training, synchronized checkpoint writes can create bursty high-throughput, high-IOPS demand; the storage design must scale performance as well as provide redundancy and availability. A suitable remediation is to scale out the performance tier, such as using a higher-throughput parallel file or NVMe-backed storage design for checkpoints while preserving resilient access.

Capacity-only expansion fails because free space remains available; the bottleneck is write throughput and latency.
Fabric tuning is not supported because there are no drops and link utilization is well below saturation.
GPU placement changes do not address the observed storage saturation during checkpoint writes.
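
The distinction between a capacity limit and a performance limit can be sketched with the figures from the table. The 90% saturation and 85% capacity thresholds are assumptions chosen for illustration, not vendor guidance.

```python
def storage_bottleneck(
    capacity_used: float,
    throughput_gbps: float,
    throughput_limit_gbps: float,
    saturation: float = 0.90,     # assumed threshold
    capacity_full: float = 0.85,  # assumed threshold
) -> str:
    """Classify the storage constraint as 'performance', 'capacity', or 'none'."""
    throughput_util = throughput_gbps / throughput_limit_gbps
    if throughput_util >= saturation:
        return "performance"
    if capacity_used >= capacity_full:
        return "capacity"
    return "none"


# Table values: 48% capacity used, 19.5 of 20 GB/s during checkpoints.
print(storage_bottleneck(0.48, 19.5, 20))  # performance
```

At 19.5 of 20 GB/s the tier is running at 97.5% of its sustained throughput, so adding raw capacity would leave the checkpoint stall unchanged.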

Question 35

Topic: AI Infrastructure Deployment and Data Management

An AI team is moving from nightly data staging to streaming a 300 TB training set and shared checkpoints over RoCEv2/NVMe-oF to Cisco UCS GPU nodes. Requirements: no local dataset copies, checkpoint writes that do not starve training reads, and stable GPU utilization as pods scale from 8 to 32. During validation, Nexus Dashboard shows congestion and drops on leaf links carrying storage/RDMA traffic, while storage latency and server CPU remain normal. Which infrastructure decision is best?

Options:

A. Increase PVC capacity on the storage class
B. Move checkpoints to node-local NVMe drives
C. Tune fabric QoS for storage/RDMA traffic
D. Add GPU nodes and increase pod replicas

Best answer: C

Explanation: The decisive data-management requirement is continuous shared data access over the fabric, not more compute or storage capacity. Because Nexus Dashboard reports congestion and drops on the links carrying RoCEv2/NVMe-oF traffic, while the storage array and servers are not saturated, the required change is at the fabric policy layer. A suitable decision is to apply end-to-end QoS/lossless treatment for the storage/RDMA class, using mechanisms such as PFC, ECN, and ETS so checkpoint writes and training reads can coexist without causing drops that reduce GPU utilization. The key is to change the layer where the bottleneck is observed.

Adding more GPUs worsens the contention because the current bottleneck is the fabric path to shared data.
More PVC capacity addresses space, not congestion or drops on the storage/RDMA links.
Moving checkpoints to local drives violates the shared data-management requirement and weakens restart consistency across training pods.

Question 36

Topic: AI Infrastructure Deployment and Data Management

In a Cisco UCS domain for an AI training pod, the QoS policy must be corrected before deploying GPU nodes. RoCEv2 gradient-synchronization traffic is mapped by a vNIC QoS policy to Gold; NFS checkpoint traffic is mapped to Silver. Requirements: RoCEv2 needs lossless behavior and 9,000-byte MTU. NFS needs 9,000-byte MTU but must remain drop-eligible. Management stays Best Effort.

Current system classes:

Class	Packet handling	MTU
Gold	Drop	1500
Silver	No drop	9000
Best Effort	Drop	1500

Which correction best matches the workload requirements?

Options:

A. Set Gold to drop and MTU 9000; keep Silver no drop and MTU 9000.
B. Set Gold and Silver to no drop and MTU 9000.
C. Set Gold to no drop and MTU 9000; set Silver to drop and MTU 9000.
D. Keep the system classes unchanged and increase Gold bandwidth weight.

Best answer: C

Explanation: Cisco UCS system-class settings define shared QoS behavior such as packet drop/no-drop treatment and MTU for traffic mapped into that class. In this scenario, the vNIC QoS mappings already identify Gold for RoCEv2 and Silver for NFS. The mismatch is in the system-class attributes: RoCEv2 requires lossless behavior and jumbo MTU, so Gold must be no-drop with MTU 9000. NFS checkpoint traffic also needs jumbo MTU, but the requirement says it must remain drop-eligible, so Silver should not be no-drop. Bandwidth weights do not fix packet-loss behavior or MTU mismatches.

Lossless on storage fails because NFS is explicitly required to remain drop-eligible, and it leaves RoCEv2 on a drop class.
Setting both classes to no drop overextends lossless treatment and violates the requirement that NFS remain drop-eligible.
Bandwidth-only change skips the decisive system-class settings for packet handling and MTU.
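
The comparison behind this answer can be sketched as a small check. This is an illustrative Python sketch only, not a Cisco UCS API: the `CURRENT` and `REQUIRED` tables and the `find_mismatches` helper are hypothetical names, the values are copied from the question, and leaving Best Effort at MTU 1500 is an assumption because the stem only says management stays Best Effort.

```python
# Hypothetical data model; values come from the question stem, not from a UCS API.
CURRENT = {
    "Gold": {"no_drop": False, "mtu": 1500},
    "Silver": {"no_drop": True, "mtu": 9000},
    "Best Effort": {"no_drop": False, "mtu": 1500},
}

REQUIRED = {
    "Gold": {"no_drop": True, "mtu": 9000},          # RoCEv2: lossless, jumbo MTU
    "Silver": {"no_drop": False, "mtu": 9000},       # NFS: jumbo MTU, drop-eligible
    "Best Effort": {"no_drop": False, "mtu": 1500},  # assumed unchanged
}


def find_mismatches(current, required):
    """Return (class, attribute, current value, required value) for each gap."""
    gaps = []
    for name, wanted in required.items():
        for attr, value in wanted.items():
            if current[name][attr] != value:
                gaps.append((name, attr, current[name][attr], value))
    return gaps


for gap in find_mismatches(CURRENT, REQUIRED):
    print(gap)
```

With these inputs the check reports gaps only for Gold (drop behavior and MTU) and Silver (drop behavior), which is the set of changes option C makes.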

Question 37

Topic: AI Fundamentals and Applications

A generative AI chat service runs on GPU-enabled Cisco UCS servers in a Kubernetes cluster. During business-hour spikes, users report slow responses. Intersight and application telemetry show the following:

Signal	Observation
p95 time-to-first-token	Exceeds target during spikes
GPU utilization	92% to 97% on all inference pods
Request queue depth	Rises until traffic drops
Network/storage health	No congestion or latency alerts
Replica count	Fixed at 4 pods

Which infrastructure control best addresses the observed issue?

Options:

A. Increase storage IOPS for model weights
B. Enable PFC and ECN for RoCE traffic
C. Autoscale inference replicas from queue and GPU metrics
D. Raise batch size to maximize GPU throughput

Best answer: C

Explanation: Generative AI serving must handle variable request arrival rates while keeping response latency predictable. Here, all inference pods are near GPU saturation, queue depth grows, and network and storage telemetry are clean. That points to insufficient serving capacity during spikes, not a fabric or storage bottleneck. An infrastructure control such as Kubernetes autoscaling, driven by request queue depth, GPU utilization, or latency-oriented serving metrics, can add inference replicas and spread demand before queues create unacceptable time-to-first-token. Throughput tuning alone may not meet strict response expectations if it increases waiting time.

RoCE controls are useful for lossless high-performance traffic, but the stem reports no network congestion or latency alerts.
Storage IOPS would matter if loading model weights or retrieval data were slow, but storage health is normal.
Larger batches can improve throughput, but they may increase per-request waiting time under strict latency targets.
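
The scaling logic behind this answer can be sketched with the proportional rule that the Kubernetes Horizontal Pod Autoscaler documentation describes: desired replicas = ceil(current replicas * current metric / target metric). The sketch below is a simplified illustration, not the autoscaler itself. It ignores tolerances and stabilization windows, and the function name, the queue-depth target, and the replica limits are hypothetical values chosen for the example.

```python
import math


def desired_replicas(current, metric_value, metric_target, min_replicas, max_replicas):
    """Scale replicas in proportion to how far the metric is from its target."""
    wanted = math.ceil(current * metric_value / metric_target)
    # Clamp the result to the configured replica range.
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical numbers: 4 pods, 30 queued requests per pod, target of 10 per pod.
print(desired_replicas(4, 30, 10, min_replicas=4, max_replicas=16))  # prints 12
```

A fixed replica count of 4, as in the question, is the case where this calculation is never applied, so the queue keeps growing until traffic drops.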

Question 38

Topic: AI Infrastructure Components and Architecture

An AI team is expanding an on-premises training pod from 64 to 128 GPUs for large-model fine-tuning. Pilot telemetry shows RoCE fabric links above 90% utilization during gradient all-reduce, GPUs waiting on network transfers, stable one-way latency within target, and cached storage reads after the first epoch. Which fabric decision is best?

Options:

A. Increase bisection bandwidth with a low-oversubscription leaf-spine fabric.
B. Add local NVMe caching on each GPU server.
C. Collapse the fabric to fewer hops using the existing uplink capacity.
D. Insert inline inspection for all east-west training flows.

Best answer: A

Explanation: This scenario is bandwidth-driven, not latency-driven. Large-model training commonly generates heavy east-west GPU-to-GPU traffic during synchronization operations such as all-reduce. The decisive evidence is high RoCE link utilization and GPUs waiting on network transfers while latency remains within target. A low-oversubscription leaf-spine design with higher bisection bandwidth lets more training traffic move concurrently as the pod scales. A latency-driven design would be favored when small transactions or tail latency dominate, such as real-time inference request paths.

Fewer hops targets latency, but the stated latency is already acceptable and uplink capacity remains the bottleneck.
Local NVMe caching targets storage reads, but storage is already cached after the first epoch.
Inline inspection may support security goals, but it can add a bottleneck in the high-volume east-west training path.
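
As a rough illustration of what "low oversubscription" means, the ratio for one leaf switch can be computed as server-facing bandwidth divided by spine-facing bandwidth. The port counts and speeds below are hypothetical and do not come from the question; the function name is also an assumption made for this sketch.

```python
def oversubscription_ratio(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Ratio of server-facing to spine-facing bandwidth on one leaf switch."""
    return (downlink_count * downlink_gbps) / (uplink_count * uplink_gbps)


# Hypothetical leaf with 32 x 400G server ports.
print(oversubscription_ratio(32, 400, 8, 400))   # 4.0, a 4:1 design
print(oversubscription_ratio(32, 400, 32, 400))  # 1.0, a 1:1 design
```

A ratio near 1.0 leaves enough uplink capacity for all servers to send across the fabric at once, which is the pattern all-reduce traffic tends to produce.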

Question 39

Topic: AI Infrastructure Components and Architecture

A team added four GPU servers to an existing AI training rack. Full-cluster jobs now run slower than before the expansion, but storage latency is normal and the fabric shows no sustained drops or ECN marking.

Telemetry excerpt:

Signal	Observation
Rack PDU load	94% during training
Server events	Power cap asserted
GPU telemetry	Clocks reduced under load
Inlet temperature	Within target

Which action best addresses the likely cause?

Options:

A. Redistribute nodes or add rack power capacity
B. Restart the orchestration scheduler
C. Enable ECN marking for the RoCE class
D. Increase storage queue depth for training data

Best answer: A

Explanation: Dense AI scaling is limited by more than node count. In this case, the expansion increased rack power draw until the PDU approached its usable capacity, and the servers asserted power caps. Power capping reduces available GPU power, which lowers GPU clocks and training throughput even when network and storage telemetry look healthy. The inlet temperature being within target makes a cooling problem less likely, but it does not remove the power-density constraint.

The right remediation is to restore power headroom by spreading systems across racks or adding appropriate rack power capacity, then continue monitoring PDU load, server power-cap events, GPU clocks, and thermal telemetry as the cluster scales.

RoCE congestion is not supported because the fabric shows no sustained drops or ECN marking during the slowdown.
Storage tuning is not supported because storage latency remains normal during the workload.
Scheduler restart does not address the hardware telemetry showing power caps and reduced GPU clocks.
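
The power-headroom reasoning can be sketched as simple arithmetic. Only the 94% load figure comes from the question. The PDU capacity, the per-node draw, the 80% planning target, and the function name are hypothetical assumptions used to make the example concrete.

```python
import math


def nodes_to_relocate(pdu_capacity_kw, load_kw, node_kw, target_fraction=0.8):
    """Servers to move off the rack so load falls to target_fraction of capacity."""
    excess_kw = load_kw - target_fraction * pdu_capacity_kw
    # No relocation is needed when the rack is already under the target.
    return max(0, math.ceil(excess_kw / node_kw))


# Hypothetical rack: 40 kW PDU at 94% load (37.6 kW), about 5 kW per GPU server.
print(nodes_to_relocate(40.0, 37.6, 5.0))  # prints 2
```

Adding rack power capacity is the other lever: raising `pdu_capacity_kw` in the same calculation reduces the number of servers that must move.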

Question 40

Topic: AI Infrastructure Operations and Troubleshooting

An AI training cluster had unstable step times during distributed jobs. The operations team corrected a mismatched QoS policy on the RoCEv2 fabric. The closure criteria require proof that training throughput returned to baseline and that fabric events no longer align with GPU node stalls. Which operational design best satisfies the closure requirements?

Options:

A. Monitor only GPU utilization for 24 hours before closing the incident
B. Run a storage benchmark and verify that NVMe latency is unchanged
C. Close the incident after confirming the QoS policy now matches the template
D. Rerun the baseline training benchmark while correlating fabric telemetry and GPU node logs

Best answer: D

Explanation: Remediation closure for AI infrastructure should verify the original symptom, not just the configuration change. Because the issue affected distributed training step times and was tied to the RoCEv2 fabric, the team needs a comparable post-remediation benchmark plus telemetry and log correlation from the same validation window. This proves that training throughput returned to the expected baseline and that fabric signals such as congestion, pause behavior, or drops are no longer synchronized with GPU node stalls. A template match is useful evidence that the fix was applied, but it is not enough to prove service recovery.

Template-only closure skips workload validation, so it cannot prove training performance recovered.
Storage benchmark targets the wrong infrastructure layer for a QoS-related RoCEv2 symptom.
GPU-only monitoring may show utilization trends, but it does not correlate fabric events with node behavior.
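
One way to picture the closure check is as two tests run over the same validation window: throughput against baseline, and time alignment between fabric events and GPU node stalls. The sketch below is a minimal illustration under stated assumptions. The function names, the 5% tolerance, the 2-second correlation window, and the sample timestamps are all hypothetical, and real telemetry would come from fabric and node log exports rather than hand-written lists.

```python
def stalls_near_fabric_events(fabric_event_times, stall_times, window_s=2.0):
    """Return stall timestamps that fall within window_s of any fabric event."""
    return [
        t for t in stall_times
        if any(abs(t - e) <= window_s for e in fabric_event_times)
    ]


def can_close(baseline_rate, measured_rate, fabric_event_times, stall_times,
              tolerance=0.05):
    """True only if throughput recovered and no stall aligns with a fabric event."""
    throughput_ok = measured_rate >= baseline_rate * (1 - tolerance)
    aligned = stalls_near_fabric_events(fabric_event_times, stall_times)
    return throughput_ok and not aligned


# Hypothetical post-fix run: throughput near baseline, one unrelated stall.
print(can_close(1000, 990, fabric_event_times=[15.0], stall_times=[300.0]))
# Hypothetical failing run: a stall one second after a fabric event.
print(can_close(1000, 990, fabric_event_times=[15.0], stall_times=[16.0]))
```

The first call passes both tests and the second fails the alignment test, even though throughput alone looks acceptable in both.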

Question 41

Topic: AI Infrastructure Components and Architecture

A team redesigned storage for an AI training cluster to maximize usable capacity per rack unit. Since the change, training jobs run 40% longer even though GPU and AI fabric telemetry show no compute faults or RoCE congestion.

Exhibit: Storage observation

Signal	Observation
Workload I/O	many small random reads; bursty checkpoint writes
Current target	capacity-optimized HDD file share
Protection	erasure-coded pool optimized for usable capacity
Symptom	high metadata and read latency during training

Which remediation best corrects the storage design?

Options:

A. Add more usable HDD capacity to the existing file share
B. Convert all datasets to block LUNs per training node
C. Move hot training data to NVMe-backed shared file storage with redundancy
D. Tune RoCE congestion controls on the AI fabric

Best answer: C

Explanation: AI training storage must match the workload access pattern, not only the required capacity. This workload has many small random reads and bursty checkpoint writes, so metadata latency, read latency, write performance, and redundancy are decisive. A capacity-optimized HDD file share with erasure coding can provide efficient usable capacity, but it is often a poor fit for hot training data when low-latency shared access is required. A better design separates tiers: keep cold or archival data on the capacity tier, and place the active dataset and checkpoints on a high-performance NVMe-backed shared file platform with appropriate redundancy and availability. The fabric and GPU telemetry reduce the likelihood that compute or RoCE congestion is the primary cause.

More HDD capacity fails because the problem is latency and access pattern, not insufficient usable terabytes.
RoCE tuning is not supported because telemetry shows no AI fabric congestion.
Per-node block LUNs can create data management and sharing issues for distributed training rather than solving shared hot-data access.

Question 42

Topic: AI Infrastructure Operations and Troubleshooting

A data center team is deploying an AI training pod with a Nexus-based leaf-spine fabric and Cisco UCS GPU servers. Operations must see fabric-level congestion and path behavior for east-west RoCEv2 traffic, and also maintain server inventory, hardware health, firmware compliance, and lifecycle actions. Which monitoring design best maps to these requirements?

Options:

A. Use Intersight for RoCEv2 path visibility and server firmware compliance
B. Use Nexus Dashboard for fabric visibility and Intersight for UCS infrastructure visibility
C. Use APIC only for fabric and UCS lifecycle monitoring
D. Use Nexus Dashboard for server lifecycle and fabric congestion visibility

Best answer: B

Explanation: Nexus Dashboard and Intersight provide complementary visibility for AI infrastructure operations. Nexus Dashboard is the better fit for network fabric visibility, including fabric health, telemetry, assurance, endpoint or flow context, and congestion indicators that affect east-west AI traffic such as RoCEv2. Cisco Intersight is the better fit for infrastructure inventory and lifecycle visibility, especially Cisco UCS domains, server health, firmware compliance, advisories, and operational actions across compute infrastructure. In this scenario, the requirements span both the network fabric and UCS server lifecycle, so the design should use each platform for its intended visibility domain rather than forcing one tool to cover both completely.

Server lifecycle in Nexus Dashboard fails because Nexus Dashboard is fabric-centered, not the primary UCS inventory and lifecycle platform.
RoCE pathing in Intersight fails because Intersight is not the primary tool for Nexus fabric assurance and traffic-path visibility.
APIC only fails because APIC manages ACI policy but does not replace Intersight for UCS lifecycle visibility across infrastructure.

Question 43

Topic: AI Infrastructure Deployment and Data Management

A Cisco UCS domain is being prepared for GPU nodes running distributed AI training. The RoCEv2 vNIC QoS policy marks RDMA traffic with CoS 4, but benchmark telemetry shows RDMA retransmits and unstable step times. The system-class summary is shown:

Traffic class	CoS	Drop behavior	MTU
Best effort	0	Drop	1,500
RDMA	4	Drop	1,500
Management	6	Drop	1,500

Which policy correction is the best decision?

Options:

A. Move RDMA traffic to the management system class.
B. Set the RDMA system class to no-drop with jumbo MTU.
C. Leave RDMA as drop traffic and raise best-effort bandwidth.
D. Increase the NTP policy polling frequency.

Best answer: B

Explanation: Cisco UCS system classes define how marked traffic is handled across the fabric, including CoS, MTU, bandwidth treatment, and drop behavior. In this scenario, the vNIC QoS policy already marks RoCEv2 RDMA traffic with CoS 4, but the matching RDMA system class is still configured as drop traffic with a 1,500-byte MTU. That does not meet the workload requirement for low-latency, lossless GPU training traffic. The right correction is to align the system class with the RDMA marking by making the RDMA class no-drop and using a jumbo MTU appropriate for high-throughput AI flows. The key is correcting the UCS QoS/system-class layer, not changing unrelated management or timing policies.

Management class misuse fails because RDMA data traffic should not be moved into a management class to solve lossless fabric behavior.
NTP tuning fails because time synchronization does not correct drop behavior or MTU mismatch for RoCEv2.
Best-effort bandwidth fails because more best-effort bandwidth does not make the RDMA class lossless or fix the MTU setting.

Question 44

Topic: AI Infrastructure Operations and Troubleshooting

A distributed training job across eight Cisco UCS GPU nodes slows after a new rack is added. Single-node GPU benchmarks remain normal, storage latency is unchanged, and Kubernetes pods are Running with no restarts.

Telemetry summary

Source	Observation
Intersight	GPU utilization oscillates between 25% and 45%
Nexus Dashboard	High PFC pause frames on the RoCEv2 class
Nexus Dashboard	ECN/CNP spikes on two new leaf-spine links

Which troubleshooting conclusion and remediation path are most supported?

Options:

A. Storage bottleneck; move the training dataset to faster block storage
B. Network congestion; verify RoCEv2 QoS, PFC, ECN, ETS, and link distribution
C. Orchestration fault; recreate the pods with stricter node affinity
D. GPU interconnect fault; replace the affected GPUs or NVLink components

Best answer: B

Explanation: The strongest evidence points to the network layer, specifically congestion or inconsistent lossless transport behavior for RoCEv2 traffic after the new rack was added. Distributed training depends heavily on east-west GPU-to-GPU communication, so low and oscillating GPU utilization can result when all-reduce traffic is delayed. Normal single-node benchmarks reduce the likelihood of a local GPU problem, unchanged storage latency reduces the likelihood of a storage bottleneck, and healthy pod state reduces the likelihood of an orchestration failure. High PFC pause frames plus ECN/CNP spikes on the new links support validating QoS class mapping, PFC, ECN, ETS, and load distribution across the new fabric paths.

Storage bottleneck is not supported because storage latency is unchanged while the new symptom appears on RoCEv2 network telemetry.
GPU interconnect fault is less likely because single-node GPU benchmarks are normal and the issue appears during multi-node communication.
Orchestration fault is not supported because the pods are running without restarts and the telemetry points to fabric congestion.

Question 45

Topic: AI Infrastructure Deployment and Data Management

A team is deploying a Cisco UCS-based GPU cluster for distributed model training. Jobs are sensitive to GPU clock throttling, the facility provides two independent power feeds to each chassis, and operations requires the cluster to remain online if either feed fails. Cooling capacity is already validated. Which UCS power policy decision is best?

Options:

A. Use N+1 redundancy and tune storage QoS
B. Use grid redundancy with an aggressive power cap
C. Use grid redundancy and no power cap
D. Use non-redundant power and no power cap

Best answer: C

Explanation: Cisco UCS power policy behavior should match both the AI workload and the facility power design. For dense GPU training, power capping can reduce available server power and cause GPU frequency throttling, which directly affects training performance. Because the chassis has two independent feeds and must survive the loss of either feed, grid redundancy is the appropriate UCS power redundancy model. Cooling has already been validated, so there is no stated reason to trade performance for a restrictive cap. The key is to preserve feed-level resiliency without constraining GPU power unnecessarily.

Non-redundant power may maximize available power, but it fails the requirement to remain online after a feed failure.
Aggressive capping protects power budgets, but it risks throttling GPUs when the scenario explicitly prioritizes training performance.
Storage QoS tuning addresses a different infrastructure layer and does not configure UCS power behavior.

Question 46

Topic: AI Fundamentals and Applications

A financial services company is designing infrastructure for a generative AI assistant. Customer records must remain in the data center, inference must respond with low latency for internal users, and the team needs temporary extra GPU capacity for periodic model fine-tuning. Which infrastructure decision best matches these requirements?

Options:

A. Deploy edge GPU nodes in branch offices and synchronize all records to them
B. Move customer records, inference, and fine-tuning entirely to a public cloud GPU service
C. Build a larger on-premises GPU cluster sized for peak fine-tuning demand
D. Keep sensitive data and inference on-premises, and burst fine-tuning jobs to cloud GPUs over secure connectivity

Best answer: D

Explanation: Hybrid AI infrastructure combines on-premises and cloud resources so each workload component runs where it best fits. In this scenario, customer records and low-latency inference should stay on-premises because of data control and response-time requirements. Periodic fine-tuning can use cloud GPU capacity because it is bursty and does not require permanent peak-sized local infrastructure. Secure connectivity and controlled data or artifact synchronization are key hybrid characteristics, especially when sensitive datasets cannot freely move to the cloud.

The key takeaway is to place data-sensitive, latency-sensitive services on-premises while using cloud resources for elastic scale when the workload allows it.

All-cloud placement fails because it ignores the requirement that customer records remain in the data center.
Peak on-prem buildout meets control requirements but overbuilds for temporary fine-tuning demand.
Branch edge GPUs target distributed low-latency edge use cases, not centralized data residency plus cloud burst capacity.

Question 47

Topic: AI Infrastructure Components and Architecture

A team is building an on-premises GPU cluster for model training. Eight GPU servers must concurrently read the same curated image dataset and write checkpoints to a common path. The storage design must provide high aggregate throughput, a shared namespace, and continued access if one storage node fails. Which storage access model best fits these requirements?

Options:

A. Single-controller NFS export on one storage node
B. Dedicated Fibre Channel block LUN per server
C. Local NVMe SSDs in each GPU server
D. Scale-out file storage with a shared namespace

Best answer: D

Explanation: The core requirement is shared, high-throughput access for multiple GPU servers. Training jobs commonly need many workers to read the same dataset and write checkpoints to locations visible to the whole job. A scale-out file storage design, such as a parallel or clustered file service, provides a common namespace and can aggregate throughput across storage nodes while maintaining availability when a node fails. Block storage can deliver strong performance, but a basic per-server LUN does not provide a shared file namespace by itself. Local NVMe is very fast for one host but does not solve shared access or resiliency across servers. The key takeaway is to match multi-node AI training data access to a shared file model, not isolated block or local storage.

Local NVMe speed fails because host-local drives do not provide a shared namespace or storage-node failover.
Block LUN performance fails because per-server block devices do not inherently support concurrent shared file access.
Single NFS target fails because it creates a single storage-node dependency and does not meet the stated availability requirement.

Question 48

Topic: AI Fundamentals and Applications

A team trained a model on an on-premises Cisco UCS GPU cluster and then deployed it as a RAG service across two data centers. After document updates, users at one site receive stale answers. Intersight shows inference GPU utilization below 35%, Nexus telemetry shows no congestion, and logs show the vector index at site B is 18 hours behind site A. Which lifecycle transition was most likely missed?

Options:

A. Single-GPU to distributed-training transition
B. Development-to-production serving transition
C. Production-inference to experimentation transition
D. Data-ingestion to model-training transition

Best answer: B

Explanation: Moving an AI workload from development or training into production serving changes infrastructure requirements. A RAG service depends on current retrieval data, synchronized indexes, protected storage, and monitoring coverage for the serving path, not just GPU availability. In this case, low GPU utilization and clean network telemetry make GPU capacity and fabric congestion unlikely. The stale answers align with an operational production-serving gap: the vector index at one site is not staying synchronized after document updates.

The key takeaway is that lifecycle transitions often shift the bottleneck from model training resources to data freshness, protection, and observability for the deployed service.

Distributed training would emphasize GPU-to-GPU connectivity and fabric behavior, but the evidence shows low GPU use and no congestion.
Model training would focus on preparing data and consuming GPU capacity, not serving stale retrieval results across sites.
Experimentation would reduce production pressure rather than explain stale answers in a live multi-site RAG service.
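
The monitoring gap in this scenario can be illustrated with a simple freshness check on each site's vector index. This is a hypothetical sketch: the site names, the reference time, the one-hour freshness target, and the `stale_sites` helper are assumptions, and only the 18-hour lag comes from the question.

```python
from datetime import datetime, timedelta


def stale_sites(last_index_update, now, max_lag):
    """Return {site: lag} for every site whose index is older than max_lag."""
    return {
        site: now - updated
        for site, updated in last_index_update.items()
        if now - updated > max_lag
    }


now = datetime(2026, 1, 1, 18, 0)  # hypothetical reference time
updates = {"site-a": now, "site-b": now - timedelta(hours=18)}
print(stale_sites(updates, now, max_lag=timedelta(hours=1)))
```

A check like this belongs to the serving path, which is why GPU utilization and fabric telemetry could look healthy while one site still returned stale answers.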

Question 49

Topic: AI Infrastructure Components and Architecture

A team is deploying a large generative AI training workload in an AI pod. The job uses model parallelism, exchanges tensors frequently between GPUs, and must run as containers. The data center also wants to avoid adding extra idle GPU nodes for capacity. Which compute architecture is the BEST fit?

Options:

A. NVLink-connected GPU-dense nodes with topology-aware container placement
B. PCIe-only single-GPU nodes scaled out across Ethernet
C. Shared vGPU hosts for all training containers
D. Existing CPU cluster with higher-performance storage

Best answer: A

Explanation: Model-parallel generative AI training is sensitive to GPU-to-GPU latency and bandwidth because GPUs exchange intermediate tensors during each training step. A GPU-dense compute node with NVLink or NVSwitch keeps those exchanges on a high-bandwidth local GPU fabric, while container orchestration can place pods according to GPU topology so a job consumes GPUs within the same connectivity domain before scaling out. This matches the workload without adding unnecessary nodes. Scaling many PCIe-only or single-GPU servers shifts the bottleneck to the network, and storage upgrades do not solve the primary compute communication requirement.

PCIe-only scale-out misses the stated need for frequent low-latency GPU-to-GPU communication.
Shared vGPU hosts can improve utilization but are not the best fit for tightly coupled model-parallel training.
Storage upgrade targets data access, not the GPU interconnect bottleneck in the scenario.

Question 50

Topic: AI Infrastructure Deployment and Data Management

An AI platform team is validating distributed training on Cisco UCS GPU servers. The data-management requirements are: one authoritative 40 TB image dataset, concurrent reads by all training pods, snapshot-based rollback, and no manual per-node data copies. The current Kubernetes cluster uses only local NVMe on each worker, and fabric telemetry shows no congestion during tests. Which design change best maps to these requirements?

Options:

A. Enable ECN and PFC for the RoCE traffic class.
B. Integrate shared high-throughput file storage with RWX persistent volumes.
C. Add more GPU and CPU memory to each UCS server.
D. Pin training pods to nodes with the largest local NVMe.

Best answer: B

Explanation: These requirements are primarily data-management and storage-integration requirements, not a fabric or compute scaling problem. A single authoritative dataset with concurrent pod access and snapshot rollback needs shared storage that supports high-throughput reads, persistence, and data protection. In Kubernetes, that storage should be exposed through persistent volumes that allow the required access mode, such as read-write-many for multiple training pods. Local NVMe can be useful for scratch space or caching, but it creates copies, version drift, and operational rollback challenges when used as the primary dataset location. The fabric may still need validation for throughput, but the stated telemetry does not point to congestion as the blocking issue.

RoCE tuning improves lossless high-performance transport, but it does not create a shared authoritative dataset or snapshots.
Compute expansion may help model execution, but the bottleneck described is dataset access and lifecycle control.
Pod pinning keeps workloads near local disks, but it preserves the manual-copy and version-drift problem.
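
For illustration, a claim for shared storage with the read-write-many access mode might look like the manifest below, built here as a Python dictionary and printed as JSON. The claim name and storage class name are hypothetical, and whether a given class actually supports `ReadWriteMany` and snapshots depends on the storage platform and its CSI driver.

```python
import json

# Minimal PersistentVolumeClaim sketch; names marked hypothetical are assumptions.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "training-dataset"},    # hypothetical claim name
    "spec": {
        "accessModes": ["ReadWriteMany"],        # many pods mount the same volume
        "storageClassName": "shared-file-fast",  # hypothetical storage class
        "resources": {"requests": {"storage": "40Ti"}},
    },
}

print(json.dumps(pvc, indent=2))
```

Every training pod would then reference the same claim, so there is one authoritative copy of the dataset rather than a copy per worker.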

Questions 51-60

Question 51

Topic: AI Infrastructure Deployment and Data Management

A Cisco UCS GPU cluster will run distributed fine-tuning jobs. The OS must boot from local M.2 RAID-1, but training datasets and checkpoints must use an existing dual-fabric Fibre Channel array with host multipathing. The current UCS policy set defines only the local disk storage policy and LAN vNICs, so the hosts cannot see the training LUNs. Which design correction best supports the stated AI workload data path?

Options:

A. Convert the local M.2 RAID-1 devices into shared training storage
B. Keep local boot and add dual-fabric FC vHBAs for data LUNs
C. Mount the array through the management network using NFS
D. Add only a LAN QoS policy for RoCE-enabled storage traffic

Best answer: B

Explanation: The workload has two distinct storage paths: local M.2 RAID-1 for operating system boot and dual-fabric Fibre Channel SAN for the AI data path. Correcting the UCS design means retaining the local disk policy for boot while adding the SAN connectivity needed by the training LUNs, such as vHBAs mapped to both FC fabrics, proper VSAN association, and downstream zoning and LUN masking. That supports multipathing and keeps the high-volume dataset and checkpoint traffic on the required FC storage path. A LAN-only or local-disk-only change does not make the SAN LUNs visible to the hosts.

Local-only storage fails because M.2 boot devices do not provide the required shared SAN path for datasets and checkpoints.
Management NFS violates the stated data path and would move training traffic onto the wrong network.
LAN QoS only may help Ethernet storage designs, but it does not create Fibre Channel host access to SAN LUNs.

Question 52

Topic: AI Infrastructure Deployment and Data Management

A team is deploying a Cisco UCS GPU cluster for distributed model training. The workload must sustain peak GPU utilization during nightly jobs, preserve PSU redundancy, and avoid power-cap or thermal throttling in a rack where cooling capacity cannot be increased. Which evidence BEST validates that the selected UCS power policy supports the workload requirement?

Options:

A. A fabric utilization chart showing east-west training traffic stays below link capacity
B. A facility PUE report showing the data hall is more efficient than last quarter
C. A UCS inventory report showing all servers have identical GPU models and firmware levels
D. Intersight telemetry from a representative training run showing power headroom, healthy PSU redundancy, and no throttling

Best answer: D

Explanation: Power policy validation for an AI training cluster should be based on observed behavior under representative load, not only on static configuration. The decisive evidence is telemetry that shows the servers can draw the power needed for sustained GPU operation while preserving the required PSU redundancy state and avoiding power-cap or thermal throttling. In Cisco UCS/Intersight context, useful signals include server and chassis power draw versus policy limits, PSU redundancy health, thermal status, and throttling indicators during a benchmark or production-like training job. Inventory, network utilization, and facility efficiency can be relevant to other checks, but they do not prove the power policy supports the workload.

Inventory-only evidence confirms hardware consistency but does not show whether the power policy permits sustained load.
Network utilization addresses training fabric capacity, not UCS power or thermal behavior.
Facility PUE reflects overall energy efficiency and cannot validate server-level power headroom or throttling.

Question 53

Topic: AI Infrastructure Deployment and Data Management

A data center team is preparing to release an APIC-managed AI-ready fabric for a new GPU training cluster. Nexus Dashboard shows the following validation summary:

Check	Visible evidence
Fabric reachability	All spines and leaves reachable
Fabric health	No critical device faults
Policy consistency	QoS/RoCE policy inconsistent on leaf-103
APIC deployment state	Tenant and interface policies deployed, except leaf-103 has a pending policy fault

What is the best readiness decision?

Options:

A. Remediate leaf-103 policy inconsistency before release
B. Release the fabric because all switches are reachable
C. Run a GPU benchmark before checking policy consistency
D. Release the fabric because no critical device faults exist

Best answer: A

Explanation: AI-ready fabric readiness requires more than basic device reachability. For GPU training traffic, fabric status, health, and policy consistency must all support the intended behavior. The evidence shows that devices are reachable and there are no critical device faults, but the QoS/RoCE policy is inconsistent on leaf-103 and APIC still reports a pending policy fault. That means one part of the fabric may not enforce the same traffic class, congestion, or lossless behavior as the rest of the deployment. The appropriate decision is to hold release, remediate the policy inconsistency, and revalidate health and consistency before onboarding the workload.

Reachability only is insufficient because switches can be reachable while required AI traffic policies are not applied consistently.
Health only is insufficient because no critical hardware fault does not clear a visible policy deployment fault.
Benchmark first risks testing on a known inconsistent fabric instead of validating the control-plane policy state first.
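
The readiness decision above amounts to a gate in which every check must pass, not just reachability or health. A minimal Python sketch follows; the check names and the `readiness_decision` helper are hypothetical and simply mirror the validation summary in the question.

```python
# Validation summary from the question, reduced to pass/fail per check
checks = {
    "fabric_reachability": True,
    "fabric_health": True,
    "policy_consistency": False,  # QoS/RoCE policy inconsistent on leaf-103
    "apic_deployment": False,     # pending policy fault on leaf-103
}


def readiness_decision(checks: dict[str, bool]) -> tuple[str, list[str]]:
    """Release only when every check passes; otherwise hold and list what failed."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    return ("release" if not failed else "hold", failed)


decision, failed = readiness_decision(checks)
print(decision, failed)  # hold ['apic_deployment', 'policy_consistency']
```

Listing the failed checks alongside the decision makes the remediation target explicit before the fabric is revalidated.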

Question 54

Topic: AI Infrastructure Components and Architecture

An enterprise runs a RAG pipeline on a Cisco UCS GPU cluster and wants to burst indexing jobs to a cloud GPU environment. Sensitive source documents must remain encrypted in transit, the cloud jobs need data updates within 15 minutes, and the same workload must be able to move back on premises without application changes. The current hybrid design relies on manual file copies over the public Internet. Which design best corrects it?

Options:

A. Use private encrypted connectivity, continuous data replication, and portable container orchestration across both sites.
B. Create nightly VM snapshots and restore them in the cloud when bursting is needed.
C. Keep all documents on premises and send public API calls to cloud inference endpoints.
D. Add larger cloud GPU instances and rebuild the dataset in the cloud for each burst.

Best answer: A

Explanation: A hybrid AI design must align data locality, synchronization, secure connectivity, and workload portability. In this scenario, the key gaps are not GPU capacity alone; the existing design cannot move up-to-date data securely or run the same workload consistently across locations. Private encrypted connectivity protects sensitive data in transit, continuous or policy-based replication meets the 15-minute freshness requirement, and container-based orchestration with compatible runtime policies supports moving the RAG indexing workload between the Cisco UCS GPU cluster and cloud GPU capacity. The closest traps either solve compute capacity only or move workloads without meeting the data freshness and security constraints.

Compute-only scaling fails because larger cloud GPUs do not solve secure transfer or timely data synchronization.
API-only access fails because it avoids workload mobility and does not burst the indexing workload itself.
Nightly snapshots fail because they miss the 15-minute data freshness requirement and are not a clean portability strategy.

Question 55

Topic: AI Fundamentals and Applications

A healthcare company is deploying a RAG-based clinical assistant. Patient records must remain on-premises for compliance, but the team wants to burst GPU-intensive embedding refresh jobs to the cloud during monthly updates. The design must use encrypted connectivity, keep indexes synchronized, and allow workloads to move without changing the application packaging.

Which hybrid AI infrastructure design best maps to these requirements?

Options:

A. Cloud-only storage and compute with regional backup replication
B. Edge inference nodes with periodic USB data transfer
C. On-prem data lake with cloud GPU bursting over secure connectivity
D. On-prem-only GPU cluster with local file storage

Best answer: C

Explanation: A hybrid AI infrastructure combines on-premises and cloud resources while preserving placement, security, and portability requirements. In this scenario, regulated patient records stay on-premises, while burstable GPU capacity in the cloud handles temporary embedding refresh demand. The design should include encrypted private or VPN connectivity, data/index synchronization controls, and consistent orchestration or container packaging so the workload can run in either environment without application redesign. Cloud-only fails the data-residency constraint, and on-prem-only fails the burst-capacity requirement. The key characteristic is coordinated use of both environments, not simply adding remote access or backups.

Cloud-only placement fails because patient records must remain on-premises for compliance.
On-prem-only compute fails because it does not provide cloud GPU bursting for monthly refresh jobs.
Edge transfer model fails because manual data movement does not meet secure synchronization or workload mobility requirements.

Question 56

Topic: AI Infrastructure Deployment and Data Management

A converged AI cluster uses Ethernet for management, application/API traffic, and NVMe-oF over RoCEv2 storage. During training checkpoints, jobs stall and storage latency spikes, while management access and application endpoints remain healthy.

Telemetry summary:

Signal	Observation
Storage vNIC marking	CoS 0, best effort
Fabric no-drop class	CoS 4, PFC/ECN enabled
Drops	CoS 0 output drops during checkpoints
Management traffic	Stable, CoS 0

Which remediation is most directly supported by the evidence?

Options:

A. Move management traffic into the no-drop system class
B. Prioritize application/API traffic above storage traffic
C. Disable PFC and ECN on the fabric no-drop class
D. Map the storage/RoCE vNIC to the no-drop system class

Best answer: D

Explanation: In a converged AI environment, different traffic types need different QoS treatment. Management traffic is typically best effort and should remain reliable but not lossless. Application/API traffic may need priority or bandwidth controls, but it usually does not require no-drop behavior. RoCEv2 storage traffic, however, is sensitive to packet loss and should be mapped consistently to a lossless system class with PFC and ECN enabled across the host vNIC policy and fabric. The evidence shows storage marked CoS 0 and experiencing drops, while the fabric’s lossless treatment exists on CoS 4. The supported fix is to align the storage/RoCE marking with the no-drop class.

Management no-drop is unnecessary because management traffic is stable and does not explain checkpoint-related storage drops.
Disabling congestion controls would remove the lossless behavior needed by RoCEv2 storage traffic.
Application priority does not address the observed CoS 0 drops on checkpoint storage traffic.
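
The mismatch in the telemetry can be expressed as a simple check: any traffic type that needs lossless delivery must be marked into the fabric's no-drop class. The Python sketch below uses hypothetical names (`vnic_marking`, `misclassified`) and assumes a CoS 0 marking for application/API traffic, which the telemetry table does not state.

```python
NO_DROP_COS = 4  # fabric lossless class from the telemetry table (PFC/ECN enabled)

# Traffic type -> (CoS marking on the vNIC, whether the traffic needs lossless delivery)
vnic_marking = {
    "management": (0, False),
    "application_api": (0, False),  # assumed marking; not given in the table
    "storage_roce": (0, True),      # NVMe-oF over RoCEv2, currently best effort
}


def misclassified(markings: dict, no_drop_cos: int) -> list[str]:
    """Return traffic types that need lossless delivery but are not in the no-drop class."""
    return [
        name
        for name, (cos, needs_lossless) in markings.items()
        if needs_lossless and cos != no_drop_cos
    ]


print(misclassified(vnic_marking, NO_DROP_COS))  # ['storage_roce']

# After remediation: storage/RoCE vNIC mapped to the no-drop class
remediated = dict(vnic_marking, storage_roce=(NO_DROP_COS, True))
print(misclassified(remediated, NO_DROP_COS))  # []
```

Management and application traffic are left in best effort, which matches the explanation: only the RoCEv2 storage class needs the no-drop treatment.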

Question 57

Topic: AI Infrastructure Components and Architecture

A data center team plans to add four GPU racks for an AI training pod. Each rack requires 30 kW of IT power and rejects about 30 kW of heat. Site policy requires N+1 UPS and cooling, with projected steady-state load at or below 80% of N+1 usable capacity.

Item	Current state
Current IT/heat load	590 kW
UPS capacity	4 × 300 kW modules
Cooling capacity	5 × 200 kW units

Which design best supports the expansion without reducing reliability?

Options:

A. Install the GPU racks using the existing UPS and cooling capacity.
B. Add one 300-kW UPS module before installing the GPU racks.
C. Use renewable energy credits to offset the new rack power.
D. Add one 200-kW cooling unit before installing the GPU racks.

Best answer: D

Explanation: For AI rack expansion, reliability must be checked against usable N+1 capacity, not total installed capacity. The expansion adds 4 × 30 kW = 120 kW, so the projected IT and heat load is 590 kW + 120 kW = 710 kW. UPS N+1 usable capacity is 3 × 300 kW = 900 kW, and 80% of that is 720 kW, so power remains within policy. Cooling N+1 usable capacity is 4 × 200 kW = 800 kW, and 80% of that is only 640 kW, so the planned heat load would violate the cooling reliability margin. Adding a 200-kW cooling unit increases N+1 usable cooling to 5 × 200 kW = 1,000 kW, making the 80% policy limit 800 kW. The key is that the limiting subsystem is cooling, not power.

Total capacity trap fails because total installed capacity ignores the N+1 and 80% reliability requirements.
UPS-only expansion fixes a subsystem that is not the current constraint.
Renewable offset may help sustainability goals but does not add UPS or cooling capacity.
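
The capacity arithmetic can be reproduced with a few lines of Python. The `n_plus_1_limit` helper is a hypothetical name introduced for this sketch; the figures come from the question.

```python
def n_plus_1_limit(units: int, unit_kw: int, policy_percent: int = 80) -> float:
    """Policy load limit: capacity with one unit held in reserve, times the policy percentage."""
    return (units - 1) * unit_kw * policy_percent / 100


projected_kw = 590 + 4 * 30             # 710 kW after adding four 30 kW racks

ups_limit = n_plus_1_limit(4, 300)      # 3 * 300 kW * 80% = 720 kW
cooling_limit = n_plus_1_limit(5, 200)  # 4 * 200 kW * 80% = 640 kW
cooling_after = n_plus_1_limit(6, 200)  # 5 * 200 kW * 80% = 800 kW

print("UPS within policy:", projected_kw <= ups_limit)
print("Cooling within policy (existing units):", projected_kw <= cooling_limit)
print("Cooling within policy (one unit added):", projected_kw <= cooling_after)
```

Running the same function against both subsystems shows why adding a UPS module does not help: the UPS limit already exceeds the projected load, while the cooling limit does not.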

Question 58

Topic: AI Infrastructure Operations and Troubleshooting

A distributed training job now takes twice as long to complete. The operations team correlates Kubernetes state, Intersight health, and Nexus Dashboard alerts for the same 10-minute window.

Time	Observation
10:02	Pods remain `Running`; no restarts or reschedules
10:04	GPUs healthy; utilization drops from 95% to 45% periodically
10:05	Nexus Dashboard reports rising ECN marks and PFC pause frames on the RoCE lossless class
10:06	Storage latency and IOPS remain at baseline

Which likely cause is best supported by these correlated alerts and logs?

Options:

A. RoCE traffic-class congestion is stalling GPU communication
B. GPU thermal throttling is reducing compute performance
C. Storage latency is delaying training data reads
D. Kubernetes rescheduling is interrupting the workload

Best answer: A

Explanation: The strongest evidence points to network congestion in the lossless RoCE traffic class. Distributed training depends on frequent GPU-to-GPU communication; when ECN marking and PFC pause frames rise on that class, the fabric is signaling congestion and backpressure. That can make GPUs wait for data exchange, which appears as periodic utilization drops even when the GPU hardware is healthy. The Kubernetes and storage observations help exclude common alternatives: the pods are stable, and storage performance is normal. The key troubleshooting move is correlating alerts across layers rather than treating the GPU utilization drop as a compute-only symptom.

Thermal throttling is not supported because Intersight reports the GPUs as healthy with no temperature-related evidence.
Storage bottleneck is unlikely because storage latency and IOPS remain at baseline during the slowdown.
Rescheduling interruption does not fit because the pods remain `Running` with no restarts or reschedules.
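
The cross-layer elimination used in this explanation can be sketched as a small rule chain in Python. This is a deliberately simplified illustration: the observation keys and the `likely_cause` helper are hypothetical, and real triage would weigh more signals than these.

```python
# Correlated observations from the 10-minute window (hypothetical field names)
observations = {
    "pod_restarts": 0,            # Kubernetes: pods Running, no restarts or reschedules
    "gpu_healthy": True,          # Intersight: GPUs healthy
    "gpu_util_drops": True,       # utilization periodically falls from 95% to 45%
    "ecn_pfc_rising": True,       # Nexus Dashboard: ECN marks and PFC pauses rising
    "storage_at_baseline": True,  # storage latency and IOPS unchanged
}


def likely_cause(obs: dict) -> str:
    """Rule out layers with clean evidence, then attribute the slowdown to the layer that alarms."""
    if obs["pod_restarts"] > 0:
        return "kubernetes rescheduling"
    if not obs["gpu_healthy"]:
        return "gpu hardware or thermal issue"
    if not obs["storage_at_baseline"]:
        return "storage latency"
    if obs["ecn_pfc_rising"] and obs["gpu_util_drops"]:
        return "roce traffic-class congestion"
    return "undetermined"


print(likely_cause(observations))  # roce traffic-class congestion
```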

Question 59

Topic: AI Infrastructure Operations and Troubleshooting

During a staged expansion of an on-premises AI training cluster, jobs that span the original and newly added racks show lower all-reduce throughput and intermittent step-time spikes. Intersight reports GPU and server health as normal, and storage latency is unchanged.

Telemetry summary:

Observation	Original racks	New racks
RoCE PFC counters	Stable	Increasing rapidly
ECN marking on AI traffic class	Enabled	Not enabled
Link utilization	Balanced	Bursty on uplinks

Which operations action best supports continued scaling while preserving reliability and performance?

Options:

A. Move the training dataset to a higher-capacity storage tier
B. Align RoCE QoS on the new racks and rerun a staged scale test
C. Disable PFC on the original racks to match the new racks
D. Add more GPU nodes to reduce per-node workload pressure

Best answer: B

Explanation: For AI training workloads, scaling across racks depends on predictable east-west network behavior, especially for RoCE-based GPU communication. The evidence does not indicate a GPU health or storage bottleneck; it shows increasing PFC counters, missing ECN marking, and bursty uplink behavior on the newly added racks. The operationally safe action is to align the congestion-control and QoS policy for the AI traffic class across the fabric, then validate with a staged scale test before admitting larger jobs. This preserves reliability by avoiding pause storms or unfair congestion behavior and preserves performance by keeping collective communication stable as the cluster grows. Adding capacity without fixing the fabric inconsistency can make the scaling problem worse.

More GPUs fails because the current evidence shows healthy compute and a network congestion-control mismatch, not insufficient GPU count.
Storage tier change fails because storage latency is unchanged and the symptom appears during cross-rack collective communication.
Disable PFC fails because removing lossless behavior from working racks can increase RoCE packet loss instead of fixing the inconsistent new-rack policy.
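
Aligning QoS across racks starts with finding where the new racks drift from the working baseline. The Python sketch below uses hypothetical setting names; `pfc_enabled` on the new racks is an assumption inferred from the rising PFC counters, since the telemetry table states only the ECN difference explicitly.

```python
# Congestion-control settings per rack group (hypothetical representation)
original_racks = {"pfc_enabled": True, "ecn_enabled": True}
new_racks = {"pfc_enabled": True, "ecn_enabled": False}  # pfc_enabled is assumed


def qos_drift(baseline: dict, candidate: dict) -> dict:
    """Return settings where the candidate differs, as {setting: (baseline, candidate)}."""
    return {
        key: (value, candidate.get(key))
        for key, value in baseline.items()
        if candidate.get(key) != value
    }


print(qos_drift(original_racks, new_racks))  # {'ecn_enabled': (True, False)}
```

Treating the original racks as the baseline reflects the explanation: the fix is to bring the new racks up to the working policy, then rerun the staged scale test, not to weaken the racks that already behave correctly.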

Question 60

Topic: AI Fundamentals and Applications

A data center team is preparing an AI infrastructure proposal for mixed RAG inference and model fine-tuning. Before selecting a validated pod or deploying a new fabric, the team must visualize how compute, GPU, network, storage, and operations requirements fit together and create a shared planning view for application and infrastructure stakeholders. Which Cisco solution is the best fit for this phase?

Options:

A. Use Cisco AI Canvas for planning and visibility
B. Deploy Hyperfabric AI as the first step
C. Select an AI POD without further mapping
D. Tune GPU scheduling in the orchestrator

Best answer: A

Explanation: Cisco AI Canvas fits the planning and visibility stage of an AI infrastructure lifecycle. In this scenario, the team is not yet ready to deploy the fabric, commit to a specific validated AI POD, or tune runtime scheduling. The stated need is to create a shared view of workload requirements across compute, GPU, network, storage, and operations so stakeholders can make an informed infrastructure decision. AI Canvas is positioned for that kind of planning and operational visibility context, while AI PODs and Hyperfabric AI are more directly tied to validated infrastructure deployment and fabric implementation. The key is matching the Cisco AI solution to the current phase: plan and visualize first, then implement the selected architecture.

Fabric first misses the requirement to plan and align stakeholders before deployment.
Pod selection only skips the workload-to-infrastructure mapping requested in the scenario.
GPU scheduling addresses runtime orchestration, not cross-domain infrastructure planning or visibility.

Continue in the web app

Use IT Mastery for interactive Cisco 300-640 DCAI practice with mixed sets, timed mocks, topic drills, explanations, and progress tracking.

Try Cisco 300-640 DCAI on Web

Focused topic pages

AI Infrastructure Operations and Troubleshooting

Official Resources
