Try 10 focused Cisco 300-640 DCAI questions on AI Infrastructure Operations and Troubleshooting, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | Cisco 300-640 DCAI |
| Topic area | AI Infrastructure Operations and Troubleshooting |
| Blueprint weight | 20% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate AI Infrastructure Operations and Troubleshooting for Cisco 300-640 DCAI. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 20% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.
These original IT Mastery practice questions are aligned to this topic area. Use them for self-assessment, scope review, and deciding what to drill next.
Topic: AI Infrastructure Operations and Troubleshooting
A team benchmarks a new on-premises generative AI training cluster before production. The workload uses synchronous distributed training across 8 GPU servers over a RoCEv2 leaf-spine fabric. Storage and compute were sized to meet the documented training requirement.
Benchmark summary
| Signal | Observation |
|---|---|
| GPU utilization | 42–55%, drops during gradient exchange |
| CPU utilization | Under 35% on all servers |
| Storage read throughput | Meets target with low latency |
| NCCL all-reduce test | 38% of expected fabric throughput |
| Fabric telemetry | High ECN marks and PFC pauses on RoCE class |
Which design change should the team prioritize?
Options:
A. Reschedule pods to reduce CPU contention
B. Move training data to faster NVMe storage
C. Tune RoCE QoS and load distribution
D. Add more GPU memory per server
Best answer: C
Explanation: The limiting layer is the high-performance network path used by distributed training collectives. GPU utilization drops during gradient exchange, the NCCL all-reduce benchmark is far below expected throughput, and telemetry shows ECN marks plus PFC pauses on the RoCE traffic class. Together, those signals indicate that fabric congestion, or the lossless Ethernet QoS behavior reacting to it, is limiting GPU-to-GPU communication. The appropriate design focus is to validate and tune RoCEv2 QoS, including PFC, ECN, ETS/class bandwidth, and traffic distribution across available fabric paths. Storage is meeting target and CPU is not saturated, so increasing storage or compute capacity would not address the observed bottleneck.
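The elimination reasoning above can be sketched as a small triage function. This is only an illustration: the `BenchmarkSignals` class, the `likely_bottleneck` function, and both thresholds are hypothetical and do not come from any Cisco tool or from the exam blueprint. The input values are taken from the benchmark summary in the question.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkSignals:
    cpu_util_pct: float               # peak CPU utilization across servers
    storage_meets_target: bool        # storage read throughput result
    allreduce_pct_of_expected: float  # NCCL all-reduce result vs. expected
    roce_congestion_signals: bool     # high ECN marks / PFC pauses on RoCE class


def likely_bottleneck(
    s: BenchmarkSignals,
    allreduce_floor_pct: float = 80.0,  # assumed threshold, not from the source
    cpu_ceiling_pct: float = 85.0,      # assumed threshold, not from the source
) -> str:
    """Return the layer the signals point to, or 'undetermined'."""
    if s.allreduce_pct_of_expected < allreduce_floor_pct and s.roce_congestion_signals:
        return "fabric"
    if not s.storage_meets_target:
        return "storage"
    if s.cpu_util_pct >= cpu_ceiling_pct:
        return "cpu"
    return "undetermined"


# Values from the question; "under 35%" CPU is entered as its upper bound.
scenario = BenchmarkSignals(
    cpu_util_pct=35.0,
    storage_meets_target=True,
    allreduce_pct_of_expected=38.0,
    roce_congestion_signals=True,
)
print(likely_bottleneck(scenario))
```

With the question's values, only the fabric branch matches, which lines up with answer C.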
Topic: AI Infrastructure Operations and Troubleshooting
A team runs multi-node generative AI training over a RoCEv2 fabric. During peak jobs, throughput varies even though no links are down. Operations needs a monitoring decision that supports reliability, scalability, and performance before adding more GPU nodes.
| Signal | Observation |
|---|---|
| GPU utilization | Drops during all-reduce phases |
| Fabric telemetry | ECN marks and PFC pauses spike on two leaf switches |
| Storage health | Latency within baseline |
| Server health | Power and thermals normal |
Which decision is best?
Options:
A. Correlate fabric congestion telemetry with GPU and server health baselines
B. Prioritize storage-array tuning because training throughput varies
C. Add GPU nodes because average utilization is below target
D. Monitor only switch link status with periodic SNMP polling
Best answer: A
Explanation: For AI infrastructure operations, the best monitoring decision is to correlate workload-facing symptoms with infrastructure telemetry and system health. Here, GPU utilization drops during all-reduce phases, while ECN marks and PFC pauses spike on specific leaf switches. Storage, power, and thermals are normal, so the strongest signal is network congestion affecting GPU-to-GPU communication over RoCEv2. A correlated view across fabric telemetry and server/GPU health helps operations confirm performance risk, set baselines, alert on congestion, and decide whether scaling requires fabric tuning rather than simply adding nodes.
The key takeaway is to monitor the infrastructure layer that matches the workload symptom and validate it against health indicators from adjacent layers.
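One way to picture "correlating" fabric telemetry with GPU behavior is a plain Pearson correlation between two time-aligned series. The sketch below is illustrative only: the sample values are invented for the example and are not measurements from the scenario, and `pearson` is a hand-written helper rather than a Nexus Dashboard or Intersight feature.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Made-up, per-minute samples for illustration only.
pfc_pause_ms = [0, 5, 40, 60, 10, 0]    # hypothetical PFC pause time on one leaf
gpu_util_pct = [90, 88, 50, 40, 80, 91]  # hypothetical GPU utilization, same minutes

r = pearson(pfc_pause_ms, gpu_util_pct)
print(f"correlation between PFC pauses and GPU utilization: {r:.2f}")
```

A strongly negative coefficient would support the reading that GPU utilization falls when congestion rises. Correlation alone does not prove cause, which is why the explanation also checks storage, power, and thermal baselines.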
Topic: AI Infrastructure Operations and Troubleshooting
A data center team runs distributed generative AI training on Cisco UCS GPU nodes connected through a Nexus data center fabric. Their current dashboard watches only fabric interface errors and utilization, but jobs still show intermittent step-time spikes. The team must keep using Nexus Dashboard and Intersight, preserve tenant isolation, and determine whether delays come from network congestion, GPU/host health, storage I/O, or Kubernetes scheduling. What is the best monitoring decision?
Options:
A. Add more GPU nodes before changing monitoring
B. Increase polling on Nexus interface counters only
C. Correlate fabric, UCS/GPU, storage, and orchestration telemetry by workload
D. Monitor only GPU utilization in Intersight
Best answer: C
Explanation: AI workload monitoring should follow the workload path, not a single infrastructure layer. For distributed training, step-time spikes can be caused by fabric congestion, host or GPU health, storage stalls, or orchestration placement. Nexus Dashboard can provide fabric health and congestion visibility, while Intersight can add UCS server, GPU, firmware, and policy context. Correlating those signals with storage and Kubernetes/job timing creates an end-to-end view that preserves operational boundaries while identifying which layer aligns with the workload symptom. The key improvement is correlation across layers and time, not simply collecting more data from the same layer.
Topic: AI Infrastructure Operations and Troubleshooting
An AI training pod had poor scaling during all-reduce operations. The network team changed the QoS policy by raising the RoCEv2 ETS bandwidth share from 50% to 90% and leaving ECN enabled. After the change, the benchmark and telemetry show:
| Metric | Before | After |
|---|---|---|
| NCCL all-reduce throughput | 135 Gb/s | 315 Gb/s |
| Target all-reduce throughput | 300 Gb/s | 300 Gb/s |
| PFC pause duration | 3,800 ms/min | 120 ms/min |
| RDMA retransmits | 0 | 0 |
| Storage read p95 latency | 3.2 ms | 3.1 ms |
| Kubernetes API p99 latency | 180 ms | 1,400 ms |
The Kubernetes API latency SLO is 500 ms. Which assessment is supported by the facts?
Options:
A. Rebalance ETS to protect control traffic.
B. Roll back ECN because ECN marks indicate failure.
C. Move the dataset to faster storage.
D. Close the incident because all-reduce met target.
Best answer: A
Explanation: The benchmark shows the original measured bottleneck improved: all-reduce throughput now exceeds the target, PFC pause duration dropped sharply, and RDMA retransmits remain at zero. However, the same change raised the RoCEv2 ETS share to 90%, and Kubernetes API p99 latency increased to 1,400 ms, far above the 500 ms SLO. That indicates the remediation may be starving non-RDMA or control-plane traffic during congestion. The right assessment is not a full success or a storage problem; it is a partially successful network remediation that needs QoS rebalancing to preserve AI data-plane performance without risking orchestration health.
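The "partially successful" assessment amounts to checking every post-change metric against its own target instead of only the metric that prompted the change. A minimal sketch, using the "After" values and targets from the question; the dictionary keys and the `violations` helper are hypothetical names chosen for this example.

```python
# Post-change measurements from the table in the question.
after = {
    "allreduce_gbps": 315,
    "k8s_api_p99_ms": 1400,
}

# Each target is (kind, limit): "min" means at least, "max" means at most.
targets = {
    "allreduce_gbps": ("min", 300),  # target all-reduce throughput
    "k8s_api_p99_ms": ("max", 500),  # Kubernetes API latency SLO
}


def violations(measured, limits):
    """Return the names of metrics that miss their target."""
    failed = []
    for name, (kind, limit) in limits.items():
        value = measured[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failed.append(name)
    return failed


print(violations(after, targets))
```

The all-reduce target passes while the Kubernetes API SLO fails, so the incident cannot be closed on the throughput result alone.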
Topic: AI Infrastructure Operations and Troubleshooting
A distributed GPU training job across eight Cisco UCS servers suddenly slows down. The workload depends on low-latency RoCEv2 east-west traffic, and storage latency remains within baseline. Nexus Dashboard and Intersight show these correlated events:
| Time | Alert or message |
|---|---|
| 10:02 | ECN marks and PFC pause frames spike on the RoCE traffic class |
| 10:04 | GPU utilization drops from 92% to 38% on multiple servers |
| 10:05 | Kubernetes reports training pod retry timeouts |
What is the BEST infrastructure decision?
Options:
A. Increase storage array throughput
B. Restart the training pods
C. Replace the GPUs with low utilization
D. Prioritize the RoCE fabric congestion alert
Best answer: D
Explanation: The high-priority alert is the one closest to the shared infrastructure dependency that can explain the downstream symptoms. In this incident, the RoCEv2 fabric carries east-west training traffic between GPU servers. ECN marking and PFC pause spikes on that traffic class appear before the GPU utilization drop and pod retries, so the fabric congestion alert should be investigated first. The GPU and Kubernetes messages are important, but they are likely secondary effects of stalled communication between training workers. Storage is explicitly normal, so it should not drive the first response. The key takeaway is to correlate timing and dependency path before treating the loudest application or host symptom as root cause.
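The timing argument can be made explicit by sorting the correlated events and looking at which one came first. In this sketch the timestamps and messages come from the question's table, while the layer labels and the `first_event` helper are assumptions added for illustration.

```python
from datetime import datetime

# (time, layer, message); the layer labels are assumed for this example.
events = [
    ("10:05", "orchestration", "Kubernetes reports training pod retry timeouts"),
    ("10:02", "fabric", "ECN marks and PFC pause frames spike on the RoCE traffic class"),
    ("10:04", "gpu", "GPU utilization drops from 92% to 38% on multiple servers"),
]


def first_event(evts):
    """Return the earliest event by its HH:MM timestamp."""
    return min(evts, key=lambda e: datetime.strptime(e[0], "%H:%M"))


time, layer, message = first_event(events)
print(f"{time} [{layer}] {message}")
```

Being earliest does not make an alert the root cause by itself. Here it also sits on the shared dependency (the RoCEv2 fabric) that the later GPU and pod symptoms rely on, which is why it should be investigated first.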
Topic: AI Infrastructure Operations and Troubleshooting
A new distributed training job on Cisco UCS GPU nodes shows intermittent all-reduce latency spikes. Intersight shows the servers, GPUs, adapters, firmware compliance, and power/cooling health as normal. The storage system is not reporting elevated latency. The team suspects congestion on the leaf-spine fabric carrying RoCEv2 traffic. What is the best next validation step?
Options:
A. Check Nexus Dashboard fabric telemetry for congestion and drops
B. Use Intersight inventory to confirm GPU model counts
C. Review Intersight firmware compliance for the GPU servers
D. Restart the orchestration pods on the affected nodes
Best answer: A
Explanation: Nexus Dashboard and Intersight provide different operational views. Intersight is the right tool for infrastructure inventory, server health, firmware compliance, and lifecycle visibility across UCS-managed resources. In this scenario, those server-side indicators are already healthy, and the suspected issue is congestion in the Nexus leaf-spine fabric carrying RoCEv2 traffic. Nexus Dashboard is better suited to validate fabric behavior such as interface utilization, drops, congestion events, flow visibility, and correlated fabric telemetry. The key distinction is that healthy compute inventory does not rule out a network-fabric performance issue.
Topic: AI Infrastructure Operations and Troubleshooting
A team reports that distributed AI training jobs on a Cisco UCS GPU cluster take 35% longer after a fabric maintenance window. Nexus Dashboard shows brief ECN marking but no drops on the AI VLANs. Intersight shows all servers healthy, but GPU utilization and per-pod GPU metrics are missing for 6 of 8 nodes because the exporter is down. Storage telemetry shows no latency alarms. What is the best next step?
Options:
A. Replace the affected GPU nodes
B. Restore GPU telemetry and rerun correlation
C. Tune PFC thresholds on the AI VLANs
D. Migrate the dataset to faster storage
Best answer: B
Explanation: Operational telemetry must cover the layers needed to support the troubleshooting conclusion. The available evidence shows only brief ECN marking without drops, healthy server status, and no storage latency alarms. However, the missing GPU utilization and per-pod GPU metrics remove the ability to distinguish GPU saturation, poor GPU placement, pod-level contention, or a fabric issue during the slow training run. The supported action is to restore the missing telemetry and correlate compute, GPU, network, storage, and orchestration signals during the same workload window.
A health summary is not the same as complete performance telemetry; gaps can make a root-cause conclusion unreliable.
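The size of the telemetry gap is easy to quantify: with the exporter down on 6 of 8 nodes, GPU metrics cover only a quarter of the cluster. The sketch below works that out; the `telemetry_coverage` helper and the 90% minimum are assumptions for illustration, not a Cisco or exam-defined threshold.

```python
def telemetry_coverage(total_nodes: int, nodes_missing: int) -> float:
    """Fraction of nodes that are still reporting GPU telemetry."""
    if total_nodes <= 0 or not 0 <= nodes_missing <= total_nodes:
        raise ValueError("node counts are inconsistent")
    return (total_nodes - nodes_missing) / total_nodes


MIN_COVERAGE = 0.9  # assumed minimum before drawing a root-cause conclusion

coverage = telemetry_coverage(total_nodes=8, nodes_missing=6)
can_conclude = coverage >= MIN_COVERAGE

print(f"GPU telemetry coverage: {coverage:.0%}")
print("enough data to conclude" if can_conclude else "restore telemetry first")
```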
Topic: AI Infrastructure Operations and Troubleshooting
A 32-node generative AI training cluster has gradually increased average iteration time from 820 ms to 1,450 ms during peak runs. The operations team reviews the monitoring data.
| Signal | Observation |
|---|---|
| GPU telemetry | Utilization oscillates 35% to 92%; temperatures normal |
| Storage telemetry | p99 read/write latency remains near 2 ms |
| Orchestration state | Pods are stable; no restart spike |
| Nexus Dashboard | AI RoCE class on two leaf uplinks averages 94% utilization; ECN marks and PFC pause frames are rising; no CRC errors |
What is the most likely cause supported by the evidence?
Options:
A. A storage array latency regression
B. Congestion in the lossless RoCE fabric path
C. GPU thermal throttling on training nodes
D. Container restart churn from orchestration instability
Best answer: B
Explanation: The evidence points to a network performance issue in the AI data path, not a compute, storage, or orchestration failure. In RoCE-based AI fabrics, sustained high utilization combined with increasing ECN marks and PFC pause frames indicates congestion management activity in the lossless traffic class. That backpressure can make GPU utilization oscillate because accelerators wait for distributed training communication to complete. Normal storage latency, stable pods, normal temperatures, and no CRC errors narrow the issue away from storage, orchestration, thermal, or physical link faults.
The key operational takeaway is to correlate workload symptoms with fabric telemetry over time, especially congestion signals on the AI traffic class.
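To put the slowdown in perspective, the iteration-time figures from the question can be turned into a percentage increase and a relative training rate. This is plain arithmetic on the two stated values; the variable names are chosen for this example.

```python
baseline_ms = 820   # average iteration time before the degradation
current_ms = 1450   # average iteration time during peak runs

increase_ms = current_ms - baseline_ms
increase_pct = increase_ms / baseline_ms * 100
relative_rate = baseline_ms / current_ms  # iterations per second vs. baseline

print(f"{increase_ms} ms slower per iteration ({increase_pct:.1f}% increase)")
print(f"training rate is {relative_rate:.0%} of baseline")
```

Each iteration takes 630 ms longer, an increase of roughly 77%, so the cluster completes a little over half as many iterations per second as before.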
Topic: AI Infrastructure Operations and Troubleshooting
A GPU training cluster has stalled three times this week during all-reduce phases. Nexus Dashboard shows brief congestion alerts on two leaf switches, Intersight shows no GPU or server faults, and the Kubernetes events only show pods waiting after the stalls. The network, compute, and platform teams each report that their tools do not identify a clear owner. What should the operations engineer do next?
Options:
A. Disable PFC on the affected leaf switches
B. Escalate with correlated timestamps and tool outputs
C. Increase the Kubernetes pod restart limit
D. Replace the GPUs in the impacted servers
Best answer: B
Explanation: Escalation is appropriate when symptoms repeat, affect an AI workload, and cannot be isolated to a clear owner with available management-tool evidence. In this case, the stalls involve possible network congestion, compute/GPU health, and orchestration state, but none of the tools confirms a single failing layer. The best next step is to preserve and correlate timestamps, alerts, workload phases, and system-health outputs, then escalate to the appropriate cross-domain support path or vendor support. Making configuration or hardware changes before ownership is validated risks masking the issue or introducing new faults.
Topic: AI Infrastructure Operations and Troubleshooting
A team wants to move a generative AI training cluster from pilot to production. The production design uses 8 GPU nodes, RoCEv2 for east-west GPU communication, shared NVMe-backed storage, and dual fabric paths for resiliency.
Benchmark evidence:
| Test performed | Result |
|---|---|
| Single-node GPU stress test | 92% GPU utilization |
| Single-node local-disk read test | Meets target |
| Multi-node RoCEv2 collective test | Not run |
| Fabric path-failure test | Not run |
Which infrastructure decision is best?
Options:
A. Require additional multi-node and failover benchmarks
B. Approve deployment because GPU utilization is high
C. Tune the training framework batch size
D. Add more GPUs before deployment
Best answer: A
Explanation: Benchmark evidence is sufficient only when it validates the production workload path and stated operational constraints. In this scenario, the passed tests show that one node can drive its GPUs and local disk, but production depends on distributed GPU communication over RoCEv2, shared storage behavior, and dual-path resiliency. Because the multi-node collective test and path-failure test were not run, the evidence cannot confirm latency, congestion handling, or availability under the intended architecture. The best decision is to require targeted additional benchmarks before approving deployment.
High single-node utilization is useful, but it does not prove the AI fabric or failover design is ready for production.
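The readiness decision can be framed as a simple gate: every test that exercises the production path must have been run and passed. In the sketch below, the test names, the status strings, and the `missing_evidence` helper are hypothetical; the statuses mirror the benchmark table, treating the two single-node tests as passed, as the explanation does.

```python
# Status of each benchmark, mirroring the table in the question.
benchmarks = {
    "single_node_gpu_stress": "passed",
    "single_node_local_disk_read": "passed",
    "multi_node_rocev2_collective": "not_run",
    "fabric_path_failure": "not_run",
}

# Tests assumed to be required because they exercise the production path.
REQUIRED = ("multi_node_rocev2_collective", "fabric_path_failure")


def missing_evidence(results, required):
    """Return required tests that have not passed (including ones never run)."""
    return [name for name in required if results.get(name) != "passed"]


gaps = missing_evidence(benchmarks, REQUIRED)
print("approve" if not gaps else f"run before approval: {', '.join(gaps)}")
```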
Use the Cisco 300-640 DCAI Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.