Cisco 300-640 DCAI: AI Infrastructure Operations and Troubleshooting

Try 10 focused Cisco 300-640 DCAI questions on AI Infrastructure Operations and Troubleshooting, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Topic snapshot

| Field | Detail |
| --- | --- |
| Exam route | Cisco 300-640 DCAI |
| Topic area | AI Infrastructure Operations and Troubleshooting |
| Blueprint weight | 20% |
| Page purpose | Focused sample questions before returning to mixed practice |

How to use this topic drill

Use this page to isolate AI Infrastructure Operations and Troubleshooting for Cisco 300-640 DCAI. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

| Pass | What to do | What to record |
| --- | --- | --- |
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |

Blueprint context: 20% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These original IT Mastery practice questions are aligned to this topic area. Use them for self-assessment, scope review, and deciding what to drill next.

Question 1

Topic: AI Infrastructure Operations and Troubleshooting

A team benchmarks a new on-premises generative AI training cluster before production. The workload uses synchronous distributed training across 8 GPU servers over a RoCEv2 leaf-spine fabric. Storage and compute were sized to meet the documented training requirement.

Benchmark summary

| Signal | Observation |
| --- | --- |
| GPU utilization | 42–55%, drops during gradient exchange |
| CPU utilization | Under 35% on all servers |
| Storage read throughput | Meets target with low latency |
| NCCL all-reduce test | 38% of expected fabric throughput |
| Fabric telemetry | High ECN marks and PFC pauses on RoCE class |

Which design change should the team prioritize?

Options:

  • A. Reschedule pods to reduce CPU contention

  • B. Move training data to faster NVMe storage

  • C. Tune RoCE QoS and load distribution

  • D. Add more GPU memory per server

Best answer: C

Explanation: The limiting layer is the high-performance network path used by distributed training collectives. GPU utilization drops during gradient exchange, the NCCL all-reduce benchmark is far below expected throughput, and telemetry shows ECN marks plus PFC pauses on the RoCE traffic class. Those signals indicate congestion or lossless Ethernet QoS behavior limiting GPU-to-GPU communication. The appropriate design focus is to validate and tune RoCEv2 QoS, including PFC, ECN, ETS/class bandwidth, and traffic distribution across available fabric paths. Storage is meeting target and CPU is not saturated, so increasing storage or compute capacity would not address the observed bottleneck.

  • GPU capacity trap fails because memory capacity is not shown as exhausted, and utilization is dropping during network exchange.
  • Storage upgrade trap fails because storage throughput and latency already meet the benchmark target.
  • CPU scheduling trap fails because CPU utilization is low across all servers, so pod CPU contention is not the limiting signal.
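The elimination reasoning above can be sketched as a small triage function. This is a minimal illustration, not a Cisco or NCCL tool: the `BenchmarkSummary` fields, the `likely_bottleneck` helper, and both thresholds (80% CPU, 80% of expected all-reduce throughput) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkSummary:
    """Hypothetical container for the benchmark signals in Question 1."""
    cpu_util_pct: float                # highest CPU utilization seen on any server
    storage_meets_target: bool         # storage read throughput and latency on target
    allreduce_fraction: float          # measured / expected all-reduce throughput
    roce_congestion_signals: bool      # high ECN marks or PFC pauses on the RoCE class


def likely_bottleneck(summary: BenchmarkSummary,
                      cpu_busy_pct: float = 80.0,
                      allreduce_ok: float = 0.8) -> str:
    """Return the layer to investigate first (thresholds are assumed, not standard)."""
    if summary.allreduce_fraction < allreduce_ok and summary.roce_congestion_signals:
        return "fabric"
    if not summary.storage_meets_target:
        return "storage"
    if summary.cpu_util_pct >= cpu_busy_pct:
        return "cpu"
    return "undetermined"


# Values taken from the benchmark summary in the question.
summary = BenchmarkSummary(
    cpu_util_pct=35.0,
    storage_meets_target=True,
    allreduce_fraction=0.38,
    roce_congestion_signals=True,
)
print(likely_bottleneck(summary))
```

With the question's values, the fabric branch fires first, which matches answer C. The storage and CPU branches are never reached because those layers already meet their targets.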

Question 2

Topic: AI Infrastructure Operations and Troubleshooting

A team runs multi-node generative AI training over a RoCEv2 fabric. During peak jobs, throughput varies even though no links are down. Operations needs a monitoring decision that supports reliability, scalability, and performance before adding more GPU nodes.

| Signal | Observation |
| --- | --- |
| GPU utilization | Drops during all-reduce phases |
| Fabric telemetry | ECN marks and PFC pauses spike on two leaf switches |
| Storage health | Latency within baseline |
| Server health | Power and thermals normal |

Which decision is best?

Options:

  • A. Correlate fabric congestion telemetry with GPU and server health baselines

  • B. Prioritize storage-array tuning because training throughput varies

  • C. Add GPU nodes because average utilization is below target

  • D. Monitor only switch link status with periodic SNMP polling

Best answer: A

Explanation: For AI infrastructure operations, the best monitoring decision is to correlate workload-facing symptoms with infrastructure telemetry and system health. Here, GPU utilization drops during all-reduce phases, while ECN marks and PFC pauses spike on specific leaf switches. Storage, power, and thermals are normal, so the strongest signal is network congestion affecting GPU-to-GPU communication over RoCEv2. A correlated view across fabric telemetry and server/GPU health helps operations confirm performance risk, set baselines, alert on congestion, and decide whether scaling requires fabric tuning rather than simply adding nodes.

The key takeaway is to monitor the infrastructure layer that matches the workload symptom and validate it against health indicators from adjacent layers.

  • Adding GPUs overbuilds before confirming whether the fabric can carry more distributed-training traffic.
  • Storage tuning targets a layer whose health signal is already within baseline.
  • Link-only polling misses queue congestion, ECN marking, and PFC pause behavior that can hurt RoCEv2 performance.

Question 3

Topic: AI Infrastructure Operations and Troubleshooting

A data center team runs distributed generative AI training on Cisco UCS GPU nodes connected through a Nexus data center fabric. Their current dashboard watches only fabric interface errors and utilization, but jobs still show intermittent step-time spikes. The team must keep using Nexus Dashboard and Intersight, preserve tenant isolation, and determine whether delays come from network congestion, GPU/host health, storage I/O, or Kubernetes scheduling. What is the best monitoring decision?

Options:

  • A. Add more GPU nodes before changing monitoring

  • B. Increase polling on Nexus interface counters only

  • C. Correlate fabric, UCS/GPU, storage, and orchestration telemetry by workload

  • D. Monitor only GPU utilization in Intersight

Best answer: C

Explanation: AI workload monitoring should follow the workload path, not a single infrastructure layer. For distributed training, step-time spikes can be caused by fabric congestion, host or GPU health, storage stalls, or orchestration placement. Nexus Dashboard can provide fabric health and congestion visibility, while Intersight can add UCS server, GPU, firmware, and policy context. Correlating those signals with storage and Kubernetes/job timing creates an end-to-end view that preserves operational boundaries while identifying which layer aligns with the workload symptom. The key improvement is correlation across layers and time, not simply collecting more data from the same layer.

  • Interface-only polling may show link symptoms but still misses GPU, storage, and scheduling causes.
  • GPU-only monitoring moves the blind spot from the network to compute and ignores fabric and storage behavior.
  • Adding GPU nodes overbuilds before evidence identifies the bottleneck and may worsen scheduling or fabric pressure.

Question 4

Topic: AI Infrastructure Operations and Troubleshooting

An AI training pod had poor scaling during all-reduce operations. The network team changed the QoS policy by raising the RoCEv2 ETS bandwidth share from 50% to 90% and leaving ECN enabled. After the change, the benchmark and telemetry show:

| Metric | Before | After |
| --- | --- | --- |
| NCCL all-reduce throughput | 135 Gb/s | 315 Gb/s |
| Target all-reduce throughput | 300 Gb/s | 300 Gb/s |
| PFC pause duration | 3,800 ms/min | 120 ms/min |
| RDMA retransmits | 0 | 0 |
| Storage read p95 latency | 3.2 ms | 3.1 ms |
| Kubernetes API p99 latency | 180 ms | 1,400 ms |

The Kubernetes API latency SLO is 500 ms. Which assessment is supported by the facts?

Options:

  • A. Rebalance ETS to protect control traffic.

  • B. Roll back ECN because ECN marks indicate failure.

  • C. Move the dataset to faster storage.

  • D. Close the incident because all-reduce met target.

Best answer: A

Explanation: The benchmark shows the original measured bottleneck improved: all-reduce throughput now exceeds the target, PFC pause duration dropped sharply, and RDMA retransmits remain at zero. However, the same change raised the RoCEv2 ETS share to 90%, and Kubernetes API p99 latency increased to 1,400 ms, far above the 500 ms SLO. That indicates the remediation may be starving non-RDMA or control-plane traffic during congestion. The right assessment is not a full success or a storage problem; it is a partially successful network remediation that needs QoS rebalancing to preserve AI data-plane performance without risking orchestration health.

  • Closing too early ignores the new Kubernetes API latency violation after the QoS change.
  • Blaming ECN is unsupported because retransmits stayed at zero and the throughput target was met.
  • Blaming storage is unsupported because storage read p95 latency was essentially unchanged (3.2 ms before, 3.1 ms after).
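The before/after comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration of that reasoning: the dictionary keys and the `assess_change` helper are hypothetical names, while the numbers come from the table in the question.

```python
ALLREDUCE_TARGET_GBPS = 300   # target from the question
K8S_API_SLO_MS = 500          # SLO from the question

before = {"allreduce_gbps": 135, "pfc_pause_ms_per_min": 3800, "k8s_api_p99_ms": 180}
after = {"allreduce_gbps": 315, "pfc_pause_ms_per_min": 120, "k8s_api_p99_ms": 1400}


def assess_change(metrics: dict) -> list[str]:
    """List the findings supported by the post-change metrics."""
    findings = []
    if metrics["allreduce_gbps"] >= ALLREDUCE_TARGET_GBPS:
        findings.append("all-reduce target met")
    else:
        findings.append("all-reduce below target")
    if metrics["k8s_api_p99_ms"] > K8S_API_SLO_MS:
        findings.append("Kubernetes API SLO violated")
    return findings


# Relative reduction in PFC pause time after the QoS change.
pfc_reduction_pct = 100 * (
    before["pfc_pause_ms_per_min"] - after["pfc_pause_ms_per_min"]
) / before["pfc_pause_ms_per_min"]

# How far the control-plane latency overshoots its SLO.
slo_overshoot = after["k8s_api_p99_ms"] / K8S_API_SLO_MS

print(assess_change(after))
print(f"PFC pause reduction: {pfc_reduction_pct:.1f}%")
print(f"API p99 latency is {slo_overshoot:.1f}x the SLO")
```

Both findings appear together: the data-plane goal is met, but the API latency is 2.8 times its SLO, which is why the incident cannot be closed and ETS needs rebalancing.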

Question 5

Topic: AI Infrastructure Operations and Troubleshooting

A distributed GPU training job across eight Cisco UCS servers suddenly slows down. The workload depends on low-latency RoCEv2 east-west traffic, and storage latency remains within baseline. Nexus Dashboard and Intersight show these correlated events:

| Time | Alert or message |
| --- | --- |
| 10:02 | ECN marks and PFC pause frames spike on the RoCE traffic class |
| 10:04 | GPU utilization drops from 92% to 38% on multiple servers |
| 10:05 | Kubernetes reports training pod retry timeouts |

What is the BEST infrastructure decision?

Options:

  • A. Increase storage array throughput

  • B. Restart the training pods

  • C. Replace the GPUs with low utilization

  • D. Prioritize the RoCE fabric congestion alert

Best answer: D

Explanation: The high-priority alert is the one closest to the shared infrastructure dependency that can explain the downstream symptoms. In this incident, the RoCEv2 fabric carries east-west training traffic between GPU servers. ECN marking and PFC pause spikes on that traffic class appear before the GPU utilization drop and pod retries, so the fabric congestion alert should be investigated first. The GPU and Kubernetes messages are important, but they are likely secondary effects of stalled communication between training workers. Storage is explicitly normal, so it should not drive the first response. The key takeaway is to correlate timing and dependency path before treating the loudest application or host symptom as root cause.

  • Pod restart trap fails because retries are a symptom after the network congestion begins.
  • GPU replacement trap fails because simultaneous low utilization across servers points away from individual GPU hardware.
  • Storage expansion trap fails because storage latency is stated to be within baseline.
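Ordering alerts by time before assigning a root cause can be sketched as below. This is a minimal example under stated assumptions: the layer labels (`fabric`, `gpu`, `kubernetes`) and the helper names are hypothetical, and the events are deliberately listed out of order to show the sort.

```python
from datetime import time

# (timestamp, layer label, message); messages are from the question's table.
events = [
    (time(10, 4), "gpu", "GPU utilization drops from 92% to 38% on multiple servers"),
    (time(10, 2), "fabric", "ECN marks and PFC pause frames spike on the RoCE traffic class"),
    (time(10, 5), "kubernetes", "Training pod retry timeouts"),
]


def order_by_time(alerts):
    """Return alerts sorted from earliest to latest."""
    return sorted(alerts, key=lambda alert: alert[0])


def first_layer_to_investigate(alerts):
    """Pick the layer of the earliest alert as the first place to look."""
    return order_by_time(alerts)[0][1]


for stamp, layer, message in order_by_time(events):
    print(stamp.strftime("%H:%M"), layer, message)

print("Investigate first:", first_layer_to_investigate(events))
```

Earliest-first ordering is only a starting heuristic. It works here because the earliest alert also sits on the shared dependency (the RoCE fabric) that the later GPU and pod symptoms rely on.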

Question 6

Topic: AI Infrastructure Operations and Troubleshooting

A new distributed training job on Cisco UCS GPU nodes shows intermittent all-reduce latency spikes. Intersight shows the servers, GPUs, adapters, firmware compliance, and power/cooling health as normal. The storage system is not reporting elevated latency. The team suspects congestion on the leaf-spine fabric carrying RoCEv2 traffic. What is the best next validation step?

Options:

  • A. Check Nexus Dashboard fabric telemetry for congestion and drops

  • B. Use Intersight inventory to confirm GPU model counts

  • C. Review Intersight firmware compliance for the GPU servers

  • D. Restart the orchestration pods on the affected nodes

Best answer: A

Explanation: Nexus Dashboard and Intersight provide different operational views. Intersight is the right tool for infrastructure inventory, server health, firmware compliance, and lifecycle visibility across UCS-managed resources. In this scenario, those server-side indicators are already healthy, and the suspected issue is congestion in the Nexus leaf-spine fabric carrying RoCEv2 traffic. Nexus Dashboard is better suited to validate fabric behavior such as interface utilization, drops, congestion events, flow visibility, and correlated fabric telemetry. The key distinction is that healthy compute inventory does not rule out a network-fabric performance issue.

  • Firmware compliance is server lifecycle visibility, but the stem already says compliance and health are normal.
  • GPU model counts confirm inventory, not whether the fabric is congested during all-reduce traffic.
  • Pod restart treats orchestration as the issue without evidence from the monitoring facts.

Question 7

Topic: AI Infrastructure Operations and Troubleshooting

A team reports that distributed AI training jobs on a Cisco UCS GPU cluster take 35% longer after a fabric maintenance window. Nexus Dashboard shows brief ECN marking but no drops on the AI VLANs. Intersight shows all servers healthy, but GPU utilization and per-pod GPU metrics are missing for 6 of 8 nodes because the exporter is down. Storage telemetry shows no latency alarms. What is the best next step?

Options:

  • A. Replace the affected GPU nodes

  • B. Restore GPU telemetry and rerun correlation

  • C. Tune PFC thresholds on the AI VLANs

  • D. Migrate the dataset to faster storage

Best answer: B

Explanation: Operational telemetry must cover the layers needed to support the troubleshooting conclusion. The available evidence shows only brief ECN marking without drops, healthy server status, and no storage latency alarms. However, the missing GPU utilization and per-pod GPU metrics remove the ability to distinguish GPU saturation, poor GPU placement, pod-level contention, or a fabric issue during the slow training run. The supported action is to restore the missing telemetry and correlate compute, GPU, network, storage, and orchestration signals during the same workload window.

A health summary is not the same as complete performance telemetry; gaps can make a root-cause conclusion unreliable.

  • PFC tuning is premature because ECN marks without drops do not prove lossless fabric misconfiguration.
  • Storage migration is unsupported because the visible storage telemetry does not show latency pressure.
  • GPU replacement overreaches because missing utilization data is not evidence of failed hardware.
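The telemetry gap in this question can be expressed as a coverage check that gates any root-cause conclusion. The sketch below is illustrative only: the 90% minimum coverage and the function names are assumptions, not values from Cisco tooling or the exam blueprint.

```python
MIN_GPU_COVERAGE = 0.9  # assumed threshold for trusting a correlation


def gpu_metric_coverage(total_nodes: int, nodes_missing_metrics: int) -> float:
    """Fraction of nodes that are still reporting GPU metrics."""
    return (total_nodes - nodes_missing_metrics) / total_nodes


def next_step(coverage: float) -> str:
    """Decide whether the data supports correlation yet."""
    if coverage < MIN_GPU_COVERAGE:
        return "restore GPU telemetry and rerun correlation"
    return "correlate existing telemetry"


# From the question: exporter down on 6 of 8 nodes.
coverage = gpu_metric_coverage(total_nodes=8, nodes_missing_metrics=6)
print(f"GPU metric coverage: {coverage:.0%}")
print(next_step(coverage))
```

With only 2 of 8 nodes reporting, coverage is 25%, so any conclusion about GPU saturation or placement would rest on a quarter of the cluster.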

Question 8

Topic: AI Infrastructure Operations and Troubleshooting

On a 32-node generative AI training cluster, average iteration time has gradually increased from 820 ms to 1,450 ms during peak runs. The operations team reviews the monitoring data.

| Signal | Observation |
| --- | --- |
| GPU telemetry | Utilization oscillates 35% to 92%; temperatures normal |
| Storage telemetry | p99 read/write latency remains near 2 ms |
| Orchestration state | Pods are stable; no restart spike |
| Nexus Dashboard | AI RoCE class on two leaf uplinks averages 94% utilization; ECN marks and PFC pause frames are rising; no CRC errors |

What is the most likely cause supported by the evidence?

Options:

  • A. A storage array latency regression

  • B. Congestion in the lossless RoCE fabric path

  • C. GPU thermal throttling on training nodes

  • D. Container restart churn from orchestration instability

Best answer: B

Explanation: The evidence points to a network performance issue in the AI data path, not a compute, storage, or orchestration failure. In RoCE-based AI fabrics, sustained high utilization combined with increasing ECN marks and PFC pause frames indicates congestion management activity in the lossless traffic class. That backpressure can make GPU utilization oscillate because accelerators wait for distributed training communication to complete. Normal storage latency, stable pods, normal temperatures, and no CRC errors narrow the issue away from storage, orchestration, thermal, or physical link faults.

The key operational takeaway is to correlate workload symptoms with fabric telemetry over time, especially congestion signals on the AI traffic class.

  • Storage latency is not supported because the p99 storage latency remains low and stable during the slowdown.
  • Thermal throttling is unlikely because GPU temperatures are normal while utilization swings with communication delay symptoms.
  • Orchestration instability is not indicated because pods are stable and there is no restart spike.
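To put the slowdown in the question stem into proportion, the iteration-time change can be converted into a percentage increase and a relative training rate. This is a small worked calculation; the function and variable names are made up for the example.

```python
def percent_increase(before: float, after: float) -> float:
    """Percentage growth from `before` to `after`."""
    return 100 * (after - before) / before


baseline_ms = 820   # average iteration time before the slowdown
peak_ms = 1450      # average iteration time during peak runs

increase_pct = percent_increase(baseline_ms, peak_ms)

# Iterations per second scale with 1 / iteration time, so the
# relative training rate is the ratio of the two times.
relative_rate = baseline_ms / peak_ms

print(f"Iteration time increase: {increase_pct:.1f}%")
print(f"Relative training rate: {relative_rate:.1%}")
```

Iteration time grows by about 76.8%, which means the cluster completes roughly 56.6% as many iterations per second as at baseline. A loss that large, with storage and pods stable, is consistent with accelerators waiting on congested collectives.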

Question 9

Topic: AI Infrastructure Operations and Troubleshooting

A GPU training cluster has stalled three times this week during all-reduce phases. Nexus Dashboard shows brief congestion alerts on two leaf switches, Intersight shows no GPU or server faults, and the Kubernetes events only show pods waiting after the stalls. The network, compute, and platform teams each report that their tools do not confirm ownership of the fault. What should the operations engineer do next?

Options:

  • A. Disable PFC on the affected leaf switches

  • B. Escalate with correlated timestamps and tool outputs

  • C. Increase the Kubernetes pod restart limit

  • D. Replace the GPUs in the impacted servers

Best answer: B

Explanation: Escalation is appropriate when symptoms repeat, affect an AI workload, and cannot be isolated to a clear owner with available management-tool evidence. In this case, the stalls involve possible network congestion, compute/GPU health, and orchestration state, but none of the tools confirms a single failing layer. The best next step is to preserve and correlate timestamps, alerts, workload phases, and system-health outputs, then escalate to the appropriate cross-domain support path or vendor support. Making configuration or hardware changes before ownership is validated risks masking the issue or introducing new faults.

  • PFC change is premature because a congestion alert alone does not prove lossless Ethernet configuration is the root cause.
  • GPU replacement is unsupported because Intersight reports no server or GPU fault evidence.
  • Restart tuning treats the orchestration symptom but does not explain repeated stalls during the all-reduce phase.

Question 10

Topic: AI Infrastructure Operations and Troubleshooting

A team wants to move a generative AI training cluster from pilot to production. The production design uses 8 GPU nodes, RoCEv2 for east-west GPU communication, shared NVMe-backed storage, and dual fabric paths for resiliency.

Benchmark evidence:

| Test performed | Result |
| --- | --- |
| Single-node GPU stress test | 92% GPU utilization |
| Single-node local-disk read test | Meets target |
| Multi-node RoCEv2 collective test | Not run |
| Fabric path-failure test | Not run |

Which infrastructure decision is best?

Options:

  • A. Require additional multi-node and failover benchmarks

  • B. Approve deployment because GPU utilization is high

  • C. Tune the training framework batch size

  • D. Add more GPUs before deployment

Best answer: A

Explanation: Benchmark evidence is sufficient only when it validates the production workload path and stated operational constraints. In this scenario, the passed tests show that one node can drive its GPUs and local disk, but production depends on distributed GPU communication over RoCEv2, shared storage behavior, and dual-path resiliency. Because the multi-node collective test and path-failure test were not run, the evidence cannot confirm latency, congestion handling, or availability under the intended architecture. The best decision is to require targeted additional benchmarks before approving deployment.

High single-node utilization is useful, but it does not prove the AI fabric or failover design is ready for production.

  • GPU-only evidence is incomplete because distributed training depends on network collectives, not just local accelerator utilization.
  • Adding GPUs overbuilds the compute layer without proving the current bottleneck or resiliency behavior.
  • Batch tuning changes the workload configuration but does not validate the infrastructure evidence required for approval.
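The approval logic can be sketched as a readiness gate that blocks deployment while any required benchmark has not passed. This is a hedged illustration: the test identifiers, the `REQUIRED_TESTS` set, and the pass/not-run status strings are assumptions made for the example, mapped from the benchmark table in the question.

```python
# Hypothetical identifiers for the tests the production design depends on.
REQUIRED_TESTS = {
    "single_node_gpu_stress",
    "single_node_local_disk_read",
    "multi_node_roce_collective",
    "fabric_path_failure",
}

# Status of each test, taken from the benchmark evidence table.
results = {
    "single_node_gpu_stress": "pass",
    "single_node_local_disk_read": "pass",
    "multi_node_roce_collective": "not run",
    "fabric_path_failure": "not run",
}


def missing_evidence(required: set[str], observed: dict[str, str]) -> list[str]:
    """Return required tests that have not passed, in alphabetical order."""
    return sorted(name for name in required if observed.get(name) != "pass")


def ready_for_production(required: set[str], observed: dict[str, str]) -> bool:
    """Approve only when every required test has passed."""
    return not missing_evidence(required, observed)


print("Missing:", missing_evidence(REQUIRED_TESTS, results))
print("Ready:", ready_for_production(REQUIRED_TESTS, results))
```

The gate reports the multi-node collective and path-failure tests as missing, so approval is withheld regardless of how strong the single-node results look.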

Continue with full practice

Use the Cisco 300-640 DCAI Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Revised on Thursday, May 28, 2026