This free full-length Cisco 300-640 DCAI practice exam includes 60 original IT Mastery questions across the exam domains.
Use these questions for self-assessment, scope review, and deciding what to drill next.
Count note: this page uses the full-length practice count maintained in the Mastery exam catalog. Some certification vendors publish total questions, scored questions, duration, or unscored/pretest-item rules differently; always confirm exam-day rules with the sponsor.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Domain | Weight |
|---|---|
| AI Fundamentals and Applications | 20% |
| AI Infrastructure Components and Architecture | 30% |
| AI Infrastructure Deployment and Data Management | 30% |
| AI Infrastructure Operations and Troubleshooting | 20% |
Use this as one diagnostic run.
Topic: AI Fundamentals and Applications
A team reports that a RAG chatbot has a high time-to-first-token, but normal token generation speed after the answer starts. Recent telemetry shows:
| Observation | Value |
|---|---|
| Vector DB query p95 | Increased from 400 ms to 4.8 s |
| Storage read latency on vector index volume | High |
| LLM GPU utilization during generation | Normal |
| RoCE congestion drops | None |
| Orchestrator pod restarts | None |
Which stage and infrastructure dependency should be investigated first?
Options:
A. Retrieval: vector-index storage and query latency
B. Generation: GPU memory bandwidth for token decoding
C. Augmentation: prompt-template CPU formatting latency
D. Orchestration: pod restart recovery time
Best answer: A
Explanation: In a RAG workflow, retrieval locates relevant context, augmentation prepares that context with the user prompt, and generation uses the LLM to produce tokens. The symptom is high time-to-first-token with normal token speed after output begins. That points to work before generation, and the telemetry specifically shows degraded vector database query latency plus high storage read latency on the vector index volume. Those are retrieval-stage dependencies. GPU utilization and token generation speed do not support a generation bottleneck, and the absence of pod restarts makes orchestration recovery unlikely. The key takeaway is to map the performance symptom to the RAG stage that owns the stressed infrastructure path.
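The stage mapping above can be sketched as a small latency breakdown. This is only an illustration: it assumes time-to-first-token is roughly the sum of the work done before the first output token, and the names `PreTokenTiming`, `dominant_stage`, and the `prefill_ms` field are hypothetical. Only the 4.8 s retrieval figure comes from the telemetry table; the other values are invented.

```python
from dataclasses import dataclass


@dataclass
class PreTokenTiming:
    """Per-request latency, in milliseconds, of work done before the first token."""
    retrieval_ms: float     # vector DB query plus index reads
    augmentation_ms: float  # prompt assembly
    prefill_ms: float       # LLM prompt processing before the first output token


def time_to_first_token_ms(t: PreTokenTiming) -> float:
    """Assumed model: TTFT is the sum of the pre-token stages."""
    return t.retrieval_ms + t.augmentation_ms + t.prefill_ms


def dominant_stage(t: PreTokenTiming) -> str:
    """Return the stage that contributes the most pre-token latency."""
    stages = {
        "retrieval": t.retrieval_ms,
        "augmentation": t.augmentation_ms,
        "prefill": t.prefill_ms,
    }
    return max(stages, key=stages.get)


# 4800 ms is the p95 vector DB query time from the table; the other values are made up.
sample = PreTokenTiming(retrieval_ms=4800, augmentation_ms=15, prefill_ms=300)
print(dominant_stage(sample), time_to_first_token_ms(sample))
```

With these inputs, retrieval dominates the pre-token time, which matches the reasoning that token speed after the first token is unaffected.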
Topic: AI Infrastructure Deployment and Data Management
A team is deploying a multi-node generative AI training cluster on an Ethernet leaf-spine fabric. The workload uses frequent GPU-to-GPU gradient exchanges and must reduce communication latency and CPU overhead without replacing Ethernet. Which design best maps to these requirements?
Options:
A. Enable RoCEv2 with QoS, PFC, and ECN for AI traffic
B. Move model checkpoints to Fibre Channel storage
C. Increase CPU core count on each training server
D. Use standard TCP forwarding with best-effort QoS
Best answer: A
Explanation: RDMA over Converged Ethernet matters because distributed AI training often depends on fast node-to-node communication, not only local GPU speed. RoCEv2 carries RDMA traffic across an IP-routed Ethernet fabric so GPU servers can exchange large tensors with lower latency and less CPU overhead than traditional TCP-based communication. For AI fabrics, RoCEv2 is typically paired with traffic engineering and congestion controls such as QoS classification, Priority Flow Control, and ECN so the RDMA traffic is protected from loss and congestion collapse. The key design point is preserving Ethernet while making it suitable for high-performance, low-latency AI communication.
Topic: AI Infrastructure Components and Architecture
A team is validating an on-premises GPU cluster for distributed model training before adding more nodes. Training logs show repeated data-loader waits and GPU utilization dropping below 35% during batch reads. RoCE fabric telemetry shows no congestion drops, and the servers have CPU and memory headroom. Which storage evidence best supports prioritizing storage remediation before scale-out?
Options:
A. Dataset repository capacity at 68% with monthly growth
B. Backup network throughput lower than the training fabric
C. Dual-controller storage array operating in active/active mode
D. High p95 read latency on the dataset volume during batch reads
Best answer: D
Explanation: For an AI training bottleneck, the strongest storage evidence is a metric that matches the workload symptom and timing. Data-loader waits during batch reads point to the storage path that feeds training data to the GPUs. If p95 read latency spikes on that dataset volume at the same time GPUs go idle, storage performance is the likely limiter, especially when network congestion and server CPU or memory pressure have been ruled out. Capacity, redundancy, and unrelated backup-path throughput may matter operationally, but they do not explain the observed training stalls. The key is correlating storage performance evidence with the AI workload phase that is slowing down.
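One way to check that correlation is to compute p95 read latency per time window and flag the windows where GPUs are also idle. The sketch below assumes a nearest-rank p95, a hypothetical 20 ms latency limit, and invented sample windows; only the 35% GPU utilization floor comes from the scenario.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def stalled_windows(windows, latency_limit_ms=20.0, gpu_floor_pct=35.0):
    """Names of windows where high p95 read latency coincides with idle GPUs."""
    return [
        w["name"]
        for w in windows
        if p95(w["read_ms"]) > latency_limit_ms and w["gpu_util_pct"] < gpu_floor_pct
    ]


# Hypothetical telemetry: per-window read latencies (ms) and mean GPU utilization.
windows = [
    {"name": "compute", "read_ms": [2, 3, 2, 3, 2, 3, 2, 3, 2, 3], "gpu_util_pct": 91},
    {"name": "batch-read", "read_ms": [2, 3, 40, 55, 60, 48, 52, 3, 45, 58], "gpu_util_pct": 31},
    {"name": "checkpoint", "read_ms": [2, 2, 3, 3, 2, 2, 3, 3, 2, 25], "gpu_util_pct": 88},
]
print(stalled_windows(windows))
```

A window counts as storage-stalled only when both conditions hold, so a latency spike that does not idle the GPUs is not treated as evidence.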
Topic: AI Infrastructure Operations and Troubleshooting
A data center team remediated degraded all-reduce performance for an AI training cluster by correcting the QoS policy applied to RoCEv2 traffic. Validation shows that packet drops are gone and the benchmark is back within target. However, telemetry now shows GPU-to-GPU traffic growing 18% month over month, and the fabric is projected to exceed the planned utilization threshold next quarter. The operations requirement is to close only validated incidents, preserve troubleshooting knowledge, avoid unnecessary escalation, and prevent recurrence. Which follow-up design best maps to these results?
Options:
A. Escalate to Cisco TAC and defer documentation until root cause is confirmed
B. Document the fix, update monitoring baselines, and open capacity planning
C. Increase GPU reservations in the orchestrator without changing operations records
D. Close the incident and rely on existing alerts for future detection
Best answer: B
Explanation: Remediation follow-up should match the validation outcome. Because the benchmark recovered and drops are gone, the incident can move toward closure without escalation for an unresolved fault. The team should document the corrected QoS mapping and validation evidence so the runbook and incident history are useful later. Because telemetry shows sustained growth and a projected threshold breach, monitoring baselines or alert thresholds should be reviewed, and a capacity-planning action should be created. This separates a fixed incident from a future scalability risk.
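The capacity-planning part is compound-growth arithmetic. The sketch below uses the 18% monthly growth from the scenario; the current utilization (50%) and the planned threshold (70%) are assumptions chosen for illustration, and `months_until_breach` is a hypothetical helper.

```python
def months_until_breach(current_pct, threshold_pct, monthly_growth):
    """Whole months of compound growth until utilization reaches the threshold.

    Returns 0 if the threshold is already met and None if traffic is not growing.
    """
    if current_pct >= threshold_pct:
        return 0
    if monthly_growth <= 0:
        return None
    months = 0
    level = current_pct
    while level < threshold_pct:
        level *= 1 + monthly_growth
        months += 1
    return months


# 18% month-over-month is from the scenario; 50% and 70% are assumed values.
print(months_until_breach(current_pct=50, threshold_pct=70, monthly_growth=0.18))
```

Under these assumed numbers, utilization goes 59%, then about 69.6%, then about 82.2%, so the threshold is crossed in the third month.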
Topic: AI Fundamentals and Applications
A data science team is deploying an inference workload that scores 40 million archived support tickets each night to update search metadata. The job must finish before 6:00 a.m., can start after business hours, and has no user-facing response-time requirement. Daytime GPU capacity must remain available for a customer-facing chatbot. Which infrastructure decision is BEST?
Options:
A. Run queued batch jobs on off-peak shared GPUs
B. Prioritize synchronous API scaling over job scheduling
C. Deploy edge inference nodes near each user region
D. Reserve dedicated low-latency GPUs for every request
Best answer: A
Explanation: Batch inference is designed for high-throughput processing where individual request latency is not critical. In this scenario, the ticket-scoring job has a completion deadline but no real-time user interaction, so it should be scheduled as a queued workload during off-peak hours. GPU allocation can be shared, quota-controlled, or lower priority so that daytime interactive inference for the chatbot keeps its capacity. Interactive inference, by contrast, typically needs synchronous serving, low p95/p99 latency, and reserved or rapidly scalable resources. The key distinction is completion-window scheduling versus per-request response-time guarantees.
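A completion window turns into a sustained-throughput target. The sketch below uses the 40 million tickets from the scenario. The 10-hour window (for example 8:00 p.m. to 6:00 a.m.) and the 150 tickets per second per GPU are assumptions, not figures from the question.

```python
import math


def required_rate_per_s(items, window_hours):
    """Sustained items per second needed to finish inside the window."""
    return items / (window_hours * 3600)


def gpus_needed(items, window_hours, per_gpu_rate_per_s):
    """Smallest whole number of GPUs that meets the required rate."""
    return math.ceil(required_rate_per_s(items, window_hours) / per_gpu_rate_per_s)


# 40 million tickets is from the scenario; window length and per-GPU rate are assumed.
print(round(required_rate_per_s(40_000_000, 10), 1))
print(gpus_needed(40_000_000, 10, per_gpu_rate_per_s=150))
```

The job needs a steady aggregate rate across the window, not low latency per ticket, which is why queued off-peak scheduling fits.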
Topic: AI Infrastructure Operations and Troubleshooting
A team runs distributed generative AI training on GPU servers using RoCEv2 across a leaf-spine fabric. During all-reduce phases, training time increases and GPU utilization drops, while storage latency remains within baseline. Which telemetry signal best supports the suspected network issue?
Options:
A. Rising ECN marks and PFC pause counters on the RoCE traffic class
B. Frequent pod image pull failures in the orchestration cluster
C. Increasing GPU temperature and power throttling events on the servers
D. Higher read latency on the shared training dataset volume
Best answer: A
Explanation: For distributed AI training, all-reduce phases are highly sensitive to network latency, congestion, and packet loss. In a RoCEv2 fabric, rising ECN marks and PFC pause counters on the RoCE traffic class are strong evidence that congestion management is being exercised during GPU-to-GPU synchronization. The stem already rules out storage, because storage latency remains within baseline, and falling GPU utilization can be an effect of waiting on network communication rather than a compute fault. The key takeaway is to correlate the signal with the workload phase and the infrastructure layer under suspicion.
Topic: AI Infrastructure Deployment and Data Management
A team is deploying Cisco UCS GPU servers for distributed AI training. The workload must boot from a shared FC SAN LUN and use local NVMe devices as high-throughput scratch space for checkpoints. The server profile template currently uses a storage policy built for stateless inference nodes with no local disk claim and no FC boot path.
Which design choice best maps to the requirement?
Options:
A. Increase the RoCE traffic class bandwidth for training traffic.
B. Add more GPU memory to each training node profile.
C. Move checkpoints to the orchestration control-plane datastore.
D. Assign a storage policy that defines FC boot and local NVMe scratch use.
Best answer: D
Explanation: A UCS storage policy mismatch can prevent an AI node from being provisioned correctly or can place I/O on the wrong data path. In this scenario, the requirement is explicit: FC SAN boot plus local NVMe scratch for checkpoint-heavy training. A profile built for stateless inference does not claim local disks and does not provide the FC boot path, so it cannot satisfy the deployment requirements. The design should align the UCS storage policy and related profile settings with the intended storage data paths before workload placement. Network QoS or GPU changes may help other bottlenecks, but they do not correct a missing storage path.
Topic: AI Infrastructure Components and Architecture
A team is deploying a production generative AI inference service on Cisco UCS GPU servers. The service must continue serving requests during one server failure or planned host maintenance, and the model endpoint must remain available without manual rebuild. Which compute design best maps to these requirements?
Options:
A. Deploy an N+1 GPU server pool across failure domains with automated workload rescheduling.
B. Deploy active/passive GPU servers with manual model reload after failure.
C. Deploy one larger GPU server with redundant power supplies and NICs.
D. Deploy redundant storage for model weights on a single GPU server.
Best answer: A
Explanation: Production AI inference needs compute redundancy at the service layer, not only component redundancy inside a single host. An N+1 GPU server pool provides enough spare GPU capacity for the workload to keep running when one server fails or is placed into maintenance. Placing servers across failure domains reduces the chance that one chassis, power, or fabric issue removes all serving capacity. Automated orchestration or workload rescheduling keeps the endpoint available without a manual rebuild. Redundant power, NICs, or storage are useful, but they do not replace the need for redundant GPU compute nodes.
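N+1 sizing can be checked with simple arithmetic: size the pool for peak load, add one server, then confirm the pool still covers peak with one server removed. The load and per-server capacity below are hypothetical, as are the helper names.

```python
import math


def n_plus_one_pool(peak_load, per_server_capacity):
    """Servers needed to carry peak load, plus one spare."""
    return math.ceil(peak_load / per_server_capacity) + 1


def survives_one_failure(servers, per_server_capacity, peak_load):
    """True if the remaining servers still cover peak load after losing one."""
    return (servers - 1) * per_server_capacity >= peak_load


# Assumed figures: 900 requests/s at peak, 250 requests/s per GPU server.
pool = n_plus_one_pool(peak_load=900, per_server_capacity=250)
print(pool, survives_one_failure(pool, 250, 900))
```

The same check fails for a pool sized exactly to peak, which is the gap that component-level redundancy inside a single host does not close.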
Topic: AI Infrastructure Components and Architecture
An AI team is designing compute placement for two workloads. Large-model fine-tuning runs as 8-GPU jobs with heavy GPU-to-GPU collective traffic and must minimize intra-node latency. Several inference services are smaller, independently scaled containers that can share accelerators but need tenant isolation. Which compute architecture best maps to these requirements?
Options:
A. PCIe-only single-GPU nodes for all workloads with scheduler spreading
B. CPU-only training nodes with GPU inference nodes
C. Shared vGPU VMs for training and bare-metal nodes for inference
D. Dedicated NVLink-connected GPU nodes for training; containerized shared-GPU pool for inference
Best answer: D
Explanation: Large-model fine-tuning with 8-GPU jobs and heavy collective operations benefits from GPUs placed in the same server or pod with high-bandwidth, low-latency GPU-to-GPU connectivity such as NVLink or NVSwitch. Keeping those jobs on dedicated, tightly connected GPU nodes avoids fragmentation and reduces intra-node communication delay. Smaller inference services usually scale independently and can often run as containers with GPU sharing or partitioning, provided isolation is enforced by the platform. The key is matching placement to traffic pattern: training needs tightly coupled accelerators, while inference needs elastic, isolated accelerator access.
Topic: AI Infrastructure Components and Architecture
A team reports that a distributed GPU training job performs normally when all workers run in one rack, but step time more than doubles when workers span two racks. GPU utilization drops during gradient synchronization.
Telemetry summary:
| Signal | Observation |
|---|---|
| Storage latency | Normal during dataset reads |
| CPU utilization | Below 50% on all workers |
| Leaf-spine links | Sustained above 90% during all-reduce |
| RoCE traffic class | Queue buildup and pause activity |
Which network architecture concern is most likely affecting performance?
Options:
A. Oversubscribed east-west fabric between GPU racks
B. GPU memory capacity mismatch across workers
C. Insufficient north-south Internet bandwidth
D. Slow storage reads during dataset ingestion
Best answer: A
Explanation: Distributed GPU training is highly sensitive to east-west bandwidth and latency because workers frequently exchange gradients or parameters during synchronization. The key evidence is that performance is normal within one rack but degrades when the job crosses racks, while storage and CPU signals are not stressed. Sustained high utilization on leaf-spine links plus RoCE queue buildup and pause activity indicate congestion on the inter-rack data path. For AI fabrics, the architecture must provide enough nonblocking or low-oversubscription east-west capacity for GPU-to-GPU communication. A north-south or storage-focused explanation does not match the scope and timing of the symptom.
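Oversubscription on a leaf is commonly expressed as server-facing bandwidth divided by spine-facing bandwidth. The port counts and speeds below are hypothetical; the scenario does not give them.

```python
def oversubscription_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Server-facing bandwidth divided by spine-facing bandwidth on one leaf."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# Assumed leaf: 32 x 400G toward GPU servers, 8 x 400G toward the spines.
print(oversubscription_ratio(32, 400, 8, 400))   # 4.0, i.e. 4:1 oversubscribed
# Assumed non-blocking leaf: equal downlink and uplink capacity.
print(oversubscription_ratio(32, 400, 32, 400))  # 1.0, i.e. 1:1
```

A ratio above 1.0 means that traffic crossing racks can demand more bandwidth than the uplinks provide, which is consistent with a job that is healthy inside one rack and slow across two.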
Topic: AI Infrastructure Deployment and Data Management
A data center team is troubleshooting slower completion times for a distributed AI training job over a RoCEv2 leaf-spine fabric. The RoCEv2 class is configured as lossless and has an ETS minimum bandwidth share.
Telemetry summary:
| Signal | Observation |
|---|---|
| RoCE packet drops | None detected |
| PFC pause frames | Frequent spikes on the RoCE class |
| ECN-marked packets | Near zero |
| ETS utilization | RoCE class receives its configured share |
Which remediation is best supported by these facts?
Options:
A. Enable PFC on every traffic class
B. Disable PFC for the RoCEv2 class
C. Increase the ETS minimum bandwidth share
D. Tune ECN marking for the RoCEv2 class
Best answer: D
Explanation: PFC, ECN, and ETS solve different problems in an AI fabric. PFC protects a selected lossless class from drops by pausing traffic, but excessive pause activity can create latency and head-of-line blocking. ECN marks packets during congestion so RoCEv2 endpoints can react before buffers fill enough to trigger PFC. ETS allocates bandwidth among classes; it does not signal congestion or prevent pause storms. Because drops are absent, PFC is already protecting the lossless class. Because ECN marks are near zero while PFC pauses spike, the supported fix is to tune ECN marking for the RoCEv2 class.
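The reasoning above can be written as a small decision rule. This is a sketch of the logic only, not a vendor tool: the function name, the rule order, and the numeric thresholds (`pause_limit`, `ecn_floor`) are assumptions.

```python
def triage_roce_class(drops, pfc_pauses_per_s, ecn_marked_ratio, ets_share_met,
                      pause_limit=100.0, ecn_floor=0.001):
    """Map RoCEv2 class counters to the remediation they support (assumed thresholds)."""
    if drops > 0:
        # Loss in a lossless class points at PFC or queue configuration.
        return "verify PFC and lossless queue configuration for the RoCEv2 class"
    if pfc_pauses_per_s > pause_limit and ecn_marked_ratio < ecn_floor:
        # Pauses fire, but endpoints never receive an early congestion signal.
        return "tune ECN marking for the RoCEv2 class"
    if not ets_share_met:
        return "review the ETS bandwidth share"
    return "no change supported by these counters"


# Counters shaped like the telemetry table: no drops, pause spikes, ECN near zero.
print(triage_roce_class(drops=0, pfc_pauses_per_s=2500.0,
                        ecn_marked_ratio=0.0, ets_share_met=True))
```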
Topic: AI Infrastructure Components and Architecture
A data science team runs several AI inference services on shared Cisco UCS GPU servers by installing framework and CUDA dependencies directly on the host OS. After each model update, operations observes:
Which remediation best addresses the likely cause?
Options:
A. Add NVLink to improve GPU-to-GPU communication
B. Increase RoCE bandwidth for the inference VLAN
C. Deploy each service as a container with GPU resource requests
D. Create one shared host image for all services
Best answer: C
Explanation: The symptoms point to lifecycle and dependency management problems, not a network, storage, or GPU interconnect bottleneck. Containerization is a better fit when AI workloads need portable runtime environments, dependency isolation, repeatable upgrades and rollbacks, and scheduler-aware GPU placement. Packaging each inference service as an immutable container image avoids mixing framework versions on the host. Using an orchestrator with GPU resource requests also improves placement control because workloads can be scheduled based on available GPU resources rather than manual server assignment. The key takeaway is that containers solve deployment consistency and workload scheduling issues; faster fabric or GPU interconnects do not fix host-level dependency drift.
Topic: AI Infrastructure Components and Architecture
A manufacturer is designing AI infrastructure for computer-vision quality inspection across several plants. Raw camera feeds must remain on-site, inference responses must stay under 20 ms, and model retraining can use cloud GPU capacity only with sanitized datasets. Which infrastructure decision best meets these requirements?
Options:
A. Store video on-site but stream frames to cloud inference
B. Run all inference and training on cloud GPUs
C. Build a fully isolated on-premises AI environment
D. Use on-site GPU inference with secure cloud integration
Best answer: D
Explanation: A purely cloud design is unsuitable when the workload has strict real-time latency and data-location constraints. In this scenario, raw video cannot leave each plant and inference must complete in under 20 ms, so inspection inference should run close to the cameras on on-site or edge GPU infrastructure. Cloud still has a useful role for elastic retraining, but only after data is sanitized and transferred through secure hybrid connectivity. This is a hybrid AI pattern: keep latency-sensitive and regulated data paths local, while using cloud capacity where it does not violate placement or response-time requirements.
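A latency budget makes the placement argument concrete. The 20 ms limit is from the scenario. Every per-stage figure below, including the 40 ms WAN round trip, is an invented example value.

```python
def meets_budget(stage_ms, budget_ms=20.0):
    """True if the summed stage latencies stay under the response budget."""
    return sum(stage_ms.values()) < budget_ms


# Hypothetical on-site path: camera capture, local GPU inference, reject actuation.
edge_path = {"capture": 3.0, "inference": 8.0, "actuation": 2.0}
# Same path with an assumed WAN round trip to a cloud region added.
cloud_path = {**edge_path, "wan_round_trip": 40.0}

print(meets_budget(edge_path), meets_budget(cloud_path))
```

Under these assumptions the network round trip alone exceeds the budget, regardless of how fast the cloud GPUs are.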
Topic: AI Infrastructure Deployment and Data Management
A team is deploying Cisco UCS GPU servers for AI training. The requirements are: OS images must be portable when a server profile moves to replacement hardware; each server needs a fast temporary cache that survives a single local drive failure; shared datasets and checkpoints must remain on the existing redundant Fibre Channel SAN. Which storage design best maps to these requirements?
Options:
A. SAN boot, no local disks, and cache on the FC SAN
B. Local M.2 boot, RAID0 cache, and FC workload LUNs
C. SAN boot, mirrored local cache, and redundant FC workload LUNs
D. Local boot, mirrored cache, and replicated local checkpoints
Best answer: C
Explanation: For UCS-based AI infrastructure, storage policy choices should align each data path with the requirement it serves. Portable OS images are typically handled with a boot policy that points to SAN boot LUNs, allowing the server profile to move to replacement hardware without depending on local boot media. A fast temporary cache belongs on local storage when the workload needs low-latency scratch space, and the single-drive-failure requirement calls for a mirrored or parity-protected local disk policy rather than RAID0. Shared datasets and checkpoints should use the existing redundant FC SAN because they are persistent workload data that must be available beyond one server.
Topic: AI Infrastructure Operations and Troubleshooting
A team reports that distributed GPU training on Cisco UCS servers slowed after a second training job was launched. Intersight shows server health as Good, normal GPU power and temperature, and no GPU link faults. Kubernetes shows all training pods as Running. Storage latency is within baseline. Nexus Dashboard shows rising PFC pause frames and ECN marks on the RoCEv2 traffic class between the GPU nodes. Which action is most supported by these monitoring views?
Options:
A. Replace the affected GPUs in the UCS servers
B. Validate fabric congestion and QoS for the RoCEv2 class
C. Increase storage IOPS for the training dataset
D. Restart the Kubernetes scheduler for the cluster
Best answer: B
Explanation: The monitoring views isolate the problem to the network fabric. Intersight server and GPU health are normal, so the evidence does not support a compute hardware fault. Kubernetes shows the workload is scheduled and running, so the orchestration state is not the main issue. Storage latency is normal, which makes storage performance an unlikely bottleneck. Nexus Dashboard is the relevant view for fabric health, congestion, and RoCEv2 behavior; rising PFC pause frames and ECN marks indicate congestion in the lossless traffic class used for GPU-to-GPU communication. The next step is to validate and remediate RoCEv2 QoS, congestion management, or load distribution in the fabric.
Topic: AI Infrastructure Operations and Troubleshooting
A new distributed training job on Cisco UCS GPU nodes shows intermittent all-reduce latency spikes. Intersight shows the servers, GPUs, adapters, firmware compliance, and power/cooling health as normal. The storage system is not reporting elevated latency. The team suspects congestion on the leaf-spine fabric carrying RoCEv2 traffic. What is the best next validation step?
Options:
A. Review Intersight firmware compliance for the GPU servers
B. Use Intersight inventory to confirm GPU model counts
C. Check Nexus Dashboard fabric telemetry for congestion and drops
D. Restart the orchestration pods on the affected nodes
Best answer: C
Explanation: Nexus Dashboard and Intersight provide different operational views. Intersight is the right tool for infrastructure inventory, server health, firmware compliance, and lifecycle visibility across UCS-managed resources. In this scenario, those server-side indicators are already healthy, and the suspected issue is congestion in the Nexus leaf-spine fabric carrying RoCEv2 traffic. Nexus Dashboard is better suited to validate fabric behavior such as interface utilization, drops, congestion events, flow visibility, and correlated fabric telemetry. The key distinction is that healthy compute inventory does not rule out a network-fabric performance issue.
Topic: AI Fundamentals and Applications
A data science team is moving large-model training into an on-premises data center. Each training job uses 64 GPUs for multi-day runs, repeatedly reads a 20 TB dataset each epoch, and performs frequent gradient synchronization that is sensitive to latency variation. The team wants predictable job completion times. Which design best maps to these workload behaviors?
Options:
A. GPU-dense UCS servers, high-throughput shared storage, and a non-oversubscribed congestion-managed fabric
B. CPU-dense virtualization hosts, capacity-optimized object storage, and oversubscribed access switching
C. GPU servers, archive-tier storage, and best-effort fabric QoS
D. Small edge GPU nodes, local SSD caching, and periodic dataset synchronization
Best answer: A
Explanation: AI training workloads tend to run for long periods with many GPUs active at the same time. To keep those GPUs productive, the storage path must feed large datasets at high throughput, and the network fabric must provide predictable performance for synchronization traffic such as gradient exchange. A non-oversubscribed or carefully engineered fabric with congestion management helps reduce latency variation, while high-throughput shared storage avoids starving GPUs during repeated epoch reads. Designs optimized mainly for CPU density, edge placement, archive capacity, or best-effort connectivity do not address the combined training requirements.
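The storage requirement can be estimated from dataset size and epoch duration. The 20 TB figure is from the scenario. The 2-hour epoch is an assumption, and the sketch also assumes decimal terabytes and that every byte is read once per epoch with no caching.

```python
def required_read_gb_per_s(dataset_tb, epoch_hours):
    """Aggregate read throughput (GB/s) needed to stream the dataset once per epoch."""
    total_bytes = dataset_tb * 1e12
    return total_bytes / (epoch_hours * 3600) / 1e9


# 20 TB per epoch is from the scenario; the 2-hour epoch time is assumed.
print(round(required_read_gb_per_s(20, 2), 2))
```

This is the sustained aggregate rate the shared storage must deliver across all 64 GPUs, before any headroom for checkpoint writes.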
Topic: AI Fundamentals and Applications
A manufacturer is deploying computer-vision inference for quality inspection at 40 factories. Each line must reject defects within 20 ms, continue operating during WAN outages, and keep raw camera feeds inside the factory. Central IT still wants standardized rollout and health monitoring across all sites. Which deployment model best fits these requirements?
Options:
A. Public-cloud GPU autoscaling for all inference
B. Edge AI at each factory with centralized operations
C. Cloud-hosted inference with streamed camera feeds
D. Central on-premises inference over the WAN
Best answer: B
Explanation: Edge AI is the best fit when inference must happen close to the data source with very low latency and local survivability. In this scenario, defect rejection depends on a 20 ms response and must continue during WAN outages, so sending camera streams to a remote cloud or central data center introduces avoidable latency and dependency on connectivity. Keeping raw camera feeds inside each factory also aligns with edge placement because only summaries, model updates, or telemetry need to traverse the WAN. Centralized operations can still manage fleet consistency through orchestration, image/version control, and health monitoring across sites. The key distinction is that compute for time-critical inference stays local, while management can remain centralized.
Topic: AI Infrastructure Operations and Troubleshooting
An AI operations team is investigating a 40% drop in distributed training throughput on a Cisco data center fabric. Nexus Dashboard shows elevated ECN marking on one leaf, Intersight shows intermittent GPU underutilization, and storage latency is within baseline. An engineer concludes that PFC should be enabled globally before the next job run. The change window is limited and the team must avoid unnecessary disruption. Which operational design best maps to the requirement?
Options:
A. Correlate telemetry across fabric, GPU, storage, and job timelines before changing controls
B. Increase storage queue depth because training throughput is below baseline
C. Move the workload to different GPU nodes to bypass the underutilization symptom
D. Enable PFC globally to eliminate possible RoCEv2 loss immediately
Best answer: A
Explanation: The core concept is evidence-driven troubleshooting. A single symptom, such as ECN marking on one leaf, is not enough to justify a broad fabric change like globally enabling PFC. The best operational design is to correlate Nexus Dashboard fabric telemetry, RoCE counters, Intersight GPU utilization, storage latency, and job timing to confirm whether the bottleneck is network congestion, compute scheduling, storage, or orchestration. This approach reduces change risk and avoids masking the real fault. In AI clusters, symptoms often cascade: network congestion can idle GPUs, but GPU underutilization can also result from storage stalls or job placement issues. The key takeaway is to validate the conclusion before remediation.
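Correlating timelines can start with a simple question: what fraction of GPU utilization dips happen close in time to an ECN spike? The sketch below is one hypothetical way to measure that. The timestamps and the 60-second matching window are invented for illustration.

```python
def fraction_explained(dip_times_s, spike_times_s, window_s=60):
    """Share of GPU dips that fall within window_s seconds of any ECN spike."""
    if not dip_times_s:
        return 0.0
    hits = sum(
        1 for dip in dip_times_s
        if any(abs(dip - spike) <= window_s for spike in spike_times_s)
    )
    return hits / len(dip_times_s)


# Hypothetical event times in seconds since the job started.
gpu_dips = [100, 400, 900, 1500]
ecn_spikes = [90, 880]
print(fraction_explained(gpu_dips, ecn_spikes))
```

In this example only half of the dips line up with ECN spikes, so the remaining dips would still need a storage, scheduling, or placement explanation before a fabric-wide change is justified.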
Topic: AI Infrastructure Deployment and Data Management
A new UCS-based GPU pool was added for distributed AI training. The jobs start successfully, but training throughput is about 40% lower than the validated baseline.
Operational observations:
| Signal | Observation |
|---|---|
| GPU telemetry | High utilization, low clocks, power-limit throttling active |
| Network fabric | No RoCE drops or ECN congestion spikes |
| Storage | Normal read latency and IOPS |
| Recent change | Domain profile uses a capped power policy |
What is the most likely remediation?
Options:
A. Move the dataset to lower-latency block storage
B. Enable PFC on the RoCE traffic class
C. Increase Kubernetes pod CPU requests
D. Raise or remove the power cap after validating rack power and cooling
Best answer: D
Explanation: UCS power policy selection can affect GPU workload performance when it limits the power budget available to a server, chassis, or GPU-dense node. In this scenario, the strongest evidence is not the application scheduler, storage path, or network fabric; it is GPU telemetry showing power-limit throttling while a capped power policy is applied. For AI training, GPUs often need sustained power headroom to maintain boost clocks during long-running compute phases. The appropriate action is to validate facility and rack power and cooling capacity, then adjust the UCS power policy so the GPU nodes can draw the power required for the workload. A power cap may be useful for protection or capacity management, but it can reduce readiness or performance for dense GPU deployments.
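A rough power-budget check shows why a cap can throttle a dense node. All wattages below are hypothetical; real values would come from the GPU and server specifications and from the validated rack power and cooling capacity.

```python
def node_power_demand_w(gpu_count, gpu_max_w, platform_w):
    """Estimated peak draw: all GPUs at maximum plus the rest of the server."""
    return gpu_count * gpu_max_w + platform_w


def power_shortfall_w(demand_w, cap_w):
    """Watts by which the cap falls short of demand (0 if the cap is sufficient)."""
    return max(0, demand_w - cap_w)


# Assumed node: 8 GPUs at 700 W each plus 2000 W for CPUs, memory, NICs, and fans.
demand = node_power_demand_w(gpu_count=8, gpu_max_w=700, platform_w=2000)
print(demand, power_shortfall_w(demand, cap_w=6000))
```

A positive shortfall means the GPUs cannot all run at full power at once, which would show up as power-limit throttling and low clocks despite high utilization.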
Topic: AI Fundamentals and Applications
A team is building an on-premises AI platform for containerized training and RAG inference on Cisco UCS GPU nodes. Jobs must request GPUs, receive the correct pod network connectivity, mount persistent datasets from shared storage, and recover or reschedule when nodes fail. Which orchestration design best maps to these requirements?
Options:
A. Use standalone Slurm with manually provisioned network and storage
B. Use Terraform-managed UCS profiles with static VLANs and LUN mappings
C. Use Nexus Dashboard alerts with Intersight firmware compliance policies
D. Use Kubernetes with GPU device plugins, CNI, CSI, and operators
Best answer: D
Explanation: AI workload orchestration must coordinate runtime placement and dependencies across compute, network, and storage. In this scenario, Kubernetes is the best fit because the control plane schedules containerized jobs, GPU device plugins advertise GPU resources to the scheduler, CNI provides pod network connectivity, CSI connects workloads to persistent storage, and operators/controllers maintain desired state and recover workloads after failures. Infrastructure provisioning tools and monitoring platforms are useful, but they do not by themselves perform runtime scheduling and dependency coordination for containerized AI workloads.
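The pieces named above show up directly in a workload manifest. The sketch below builds a Pod manifest as a Python dictionary. It assumes NVIDIA GPUs exposed through a device plugin under the resource name `nvidia.com/gpu`; the pod name, image, and claim name are hypothetical. Pod networking from the CNI plugin needs no field here because it is attached by the cluster.

```python
import json


def training_pod_manifest(name, image, gpus, dataset_claim):
    """Build a Pod manifest that requests GPUs and mounts a persistent dataset."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": "trainer",
                "image": image,
                # Extended resource advertised by the GPU device plugin (assumed NVIDIA).
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
                "volumeMounts": [{"name": "dataset", "mountPath": "/data"}],
            }],
            # The claim is bound to shared storage through a CSI driver.
            "volumes": [{
                "name": "dataset",
                "persistentVolumeClaim": {"claimName": dataset_claim},
            }],
        },
    }


# Hypothetical names for illustration only.
manifest = training_pod_manifest("finetune-0", "registry.example/trainer:1.0", 8, "dataset-pvc")
print(json.dumps(manifest, indent=2))
```

A bare Pod is not recreated on another node after a node failure, so in practice this spec would sit inside a Job or an operator-managed resource to get the rescheduling behavior the scenario requires.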
Topic: AI Infrastructure Deployment and Data Management
A data center team has completed deployment of an on-premises AI cluster for RAG indexing and model fine-tuning. Cisco UCS GPU servers report healthy GPUs, the Nexus fabric is up, NVMe-backed file storage is mounted, and Kubernetes nodes show Ready. The production requirement is to ingest large document sets, schedule distributed GPU jobs, sustain low-latency east-west traffic, and export operational telemetry. Which decision BEST determines that the environment is ready?
Options:
A. Add additional GPU nodes before running workload validation.
B. Validate a representative end-to-end AI workflow across storage, fabric, GPU scheduling, and telemetry.
C. Begin production ingestion and monitor only user-facing errors.
D. Declare readiness because all deployed components show healthy status.
Best answer: B
Explanation: Successful component deployment does not equal end-to-end AI infrastructure readiness. In this scenario, individual layers appear healthy, but the workload depends on the combined path: data ingestion from storage, network behavior during east-west GPU communication, container orchestration and GPU scheduling, and telemetry visibility. A representative validation run confirms that these layers work together under conditions similar to the intended RAG indexing and fine-tuning workflow. It can reveal bottlenecks or gaps that isolated health checks miss, such as storage throughput limits, scheduling misconfiguration, or missing telemetry coverage. The key distinction is between components being installed and the integrated AI pipeline being operationally ready.
Topic: AI Infrastructure Deployment and Data Management
A team deployed a leaf-spine fabric for a multi-node GPU training cluster. ECMP is configured across two spine paths, and the acceptance requirement is to prove that training traffic can use the available fabric capacity rather than concentrating on one path. Which validation step best maps to this requirement?
Options:
A. Verify that PFC and ECN are enabled for the RDMA class.
B. Run a representative training workload and verify balanced path utilization in telemetry.
C. Send ICMP pings between GPU nodes and verify low latency.
D. Confirm that the routing table lists equal-cost next hops.
Best answer: B
Explanation: Load-distribution validation must prove forwarding behavior under realistic traffic, not just configuration intent. For an AI training cluster, the strongest validation is to generate representative multi-flow training traffic and use fabric telemetry, such as interface counters or Nexus Dashboard observations, to confirm that utilization is spread across the expected ECMP links or spine paths without congestion drops. This demonstrates that hashing, path selection, and capacity use are working together for the workload. Route entries, ping tests, and QoS feature checks can support readiness, but they do not prove that production-like traffic is using all available fabric capacity.
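One way to turn "balanced path utilization" into a pass/fail check is to compare each uplink's share of transmitted bytes against an even split. The sketch below is illustrative only: the interface names, byte counts, and the 20% tolerance are assumptions, not values from the scenario or output from any Cisco tool.

```python
def ecmp_imbalance(tx_bytes_by_link):
    """Return the largest relative deviation of any link from an even share."""
    total = sum(tx_bytes_by_link.values())
    if total == 0:
        raise ValueError("no traffic observed; run a representative workload first")
    even_share = total / len(tx_bytes_by_link)
    return max(abs(b - even_share) / even_share for b in tx_bytes_by_link.values())


def is_balanced(tx_bytes_by_link, tolerance=0.20):
    """True when every link is within `tolerance` of an even split (assumed threshold)."""
    return ecmp_imbalance(tx_bytes_by_link) <= tolerance


# Hypothetical counter deltas collected while the training workload runs.
balanced = {"uplink-to-spine-1": 520_000_000_000, "uplink-to-spine-2": 480_000_000_000}
polarized = {"uplink-to-spine-1": 950_000_000_000, "uplink-to-spine-2": 50_000_000_000}

print(is_balanced(balanced))   # True: each link is within 4% of an even split
print(is_balanced(polarized))  # False: one link carries 95% of the traffic
```

A check like this only means something when the counters come from production-like, multi-flow traffic; a single large flow will hash to one path regardless of how ECMP is configured.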
Topic: AI Infrastructure Operations and Troubleshooting
A distributed GPU training job suddenly slows down during the all-reduce phase. Nexus Dashboard and Intersight show these correlated events:
| Time | Observation |
|---|---|
| 10:14 | High ECN marking and PFC pause frames on the RoCEv2 traffic class between two leaf switches |
| 10:15 | GPU utilization drops from 92% to 38% across several servers |
| 10:16 | Kubernetes reports delayed readiness probes for training pods |
| 10:17 | Storage latency remains within baseline |
Which alert should be treated as the primary high-priority signal?
Options:
A. Delayed Kubernetes readiness probes
B. GPU utilization drop across the servers
C. RoCEv2 congestion on the lossless traffic class
D. Normal storage latency during the incident
Best answer: C
Explanation: In AI infrastructure incidents, the highest-priority alert is the earliest signal that explains the widest set of dependent symptoms. Here, the slowdown occurs during all-reduce, which depends heavily on low-latency GPU-to-GPU network communication. High ECN marking and PFC pause frames on the RoCEv2 traffic class indicate congestion or lossless fabric pressure on the path used by the training job. The later GPU utilization drop is consistent with GPUs waiting on network collectives, and delayed pod readiness can follow when the application becomes unresponsive. Normal storage latency also reduces the likelihood that storage is the root issue. The key is to correlate time, dependency, and blast radius rather than ranking alerts by how visible they are.
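The time-ordering part of that correlation can be sketched in a few lines. The event list mirrors the table in the question; the `abnormal` flag and the earliest-first rule are simplifying assumptions, and a real triage would still confirm the dependency chain before acting.

```python
from datetime import time

# Events from the scenario; "abnormal" marks signals that deviate from baseline.
events = [
    {"at": time(10, 14), "signal": "RoCEv2 ECN marking and PFC pause frames", "abnormal": True},
    {"at": time(10, 15), "signal": "GPU utilization drop", "abnormal": True},
    {"at": time(10, 16), "signal": "Delayed Kubernetes readiness probes", "abnormal": True},
    {"at": time(10, 17), "signal": "Storage latency within baseline", "abnormal": False},
]


def primary_candidate(events):
    """Return the earliest abnormal event, or None if everything is at baseline."""
    abnormal = [e for e in events if e["abnormal"]]
    return min(abnormal, key=lambda e: e["at"]) if abnormal else None


print(primary_candidate(events)["signal"])
# RoCEv2 ECN marking and PFC pause frames
```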
Topic: AI Infrastructure Components and Architecture
A multi-GPU training job that normally finishes an epoch in 22 minutes now takes 41 minutes after an orchestration change. Telemetry shows GPU utilization oscillating between 35% and 90%, high all-reduce wait time, normal storage latency, and no congestion drops on the AI fabric. Kubernetes reports MemoryPressure=False. The baseline placed all eight GPUs in one NVLink-connected server; the current placement splits the job into two workers on two 4-GPU nodes. Which remediation is best supported by these observations?
Options:
A. Increase pod memory requests to avoid eviction
B. Move the dataset to lower-latency block storage
C. Enable ECN on the storage traffic class
D. Use topology-aware placement on one 8-GPU node
Best answer: D
Explanation: For tightly coupled multi-GPU training, compute-node placement and GPU interconnect topology can dominate performance. The facts point away from memory, storage, and fabric congestion: MemoryPressure=False, storage latency is normal, and the fabric has no congestion drops. The key change is that an eight-GPU job that previously stayed within one NVLink-connected server is now split across two 4-GPU nodes. That can increase synchronization latency and reduce effective GPU utilization during all-reduce operations. A topology-aware scheduler, node affinity, or job policy that keeps the workload on an appropriate 8-GPU node is the supported remediation. The key takeaway is to match GPU placement to the workload’s communication pattern, not just the GPU count.
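The placement rule can be illustrated with a small selection function. This is a conceptual sketch, not a Kubernetes API: the node names, the `free_gpus` and `nvlink` fields, and the tightest-fit preference are all hypothetical.

```python
def pick_single_node(nodes, gpus_needed):
    """Return the name of one NVLink node that can host the whole job, or None."""
    candidates = [n for n in nodes if n["nvlink"] and n["free_gpus"] >= gpus_needed]
    if not candidates:
        return None
    # Prefer the tightest fit so larger nodes stay free for larger jobs.
    return min(candidates, key=lambda n: n["free_gpus"])["name"]


# Hypothetical inventory: two 4-GPU nodes and one 8-GPU node.
nodes = [
    {"name": "gpu-a", "free_gpus": 4, "nvlink": True},
    {"name": "gpu-b", "free_gpus": 4, "nvlink": True},
    {"name": "gpu-c", "free_gpus": 8, "nvlink": True},
]

print(pick_single_node(nodes, 8))   # gpu-c
print(pick_single_node(nodes, 16))  # None: the job must span nodes
```

A scheduler that only counts GPUs would accept gpu-a plus gpu-b for the 8-GPU job, which is the split placement described in the scenario.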
Topic: AI Infrastructure Operations and Troubleshooting
An AI platform team validates a RAG inference cluster before adding tenants. The production requirement is 40 GB/s aggregate vector-index read throughput with P95 retrieval latency under 25 ms. Nexus Dashboard and Intersight show no active production alerts.
Benchmark result: 62 GB/s aggregate reads, 18 ms P95 latency, 55% average GPU utilization, and normal storage queue depth.
Which operational decision best maps to these results?
Options:
A. Open a network congestion incident and throttle RoCE traffic during inference.
B. Declare a storage bottleneck and expand the NVMe storage tier before onboarding.
C. Treat the result as capacity headroom and phase in tenants with telemetry monitoring.
D. Discard the benchmark because only production failures can inform capacity planning.
Best answer: C
Explanation: A benchmark result indicates capacity headroom when it exceeds the defined workload requirement and does not coincide with production symptoms such as alerts, SLO violations, saturation, or abnormal queueing. In this case, throughput is above 40 GB/s, P95 latency is below 25 ms, GPU utilization is moderate, and storage queue depth is normal. That evidence supports a controlled onboarding plan with continued Nexus Dashboard and Intersight monitoring, not an incident response. Benchmarks help estimate capacity, but they become troubleshooting evidence only when correlated with production degradation or resource saturation.
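The headroom in this scenario can be worked out directly from the stated numbers. The function below is a minimal sketch; its name and return fields are illustrative choices, not part of any monitoring product.

```python
def headroom(measured_gbps, required_gbps, measured_p95_ms, max_p95_ms):
    """Compare a benchmark result against a throughput floor and a latency ceiling."""
    return {
        # Fraction by which measured throughput exceeds the requirement.
        "throughput_headroom": round(measured_gbps / required_gbps - 1, 2),
        # Fraction of the latency budget left unused.
        "latency_margin": round(1 - measured_p95_ms / max_p95_ms, 2),
        "meets_requirement": measured_gbps >= required_gbps and measured_p95_ms < max_p95_ms,
    }


result = headroom(measured_gbps=62, required_gbps=40, measured_p95_ms=18, max_p95_ms=25)
print(result)
# {'throughput_headroom': 0.55, 'latency_margin': 0.28, 'meets_requirement': True}
```

With 55% throughput headroom and 28% of the latency budget unused, the result supports phased onboarding rather than an incident.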
Topic: AI Infrastructure Deployment and Data Management
An AI training cluster attached to an ACI fabric reports intermittent NCCL timeouts during all-reduce operations. GPU health and storage latency are normal.
Exhibit:
Nexus Dashboard: ECN marks and output drops on leaf-to-spine links
Nexus Dashboard: high-volume RoCEv2 flows between GPU node EPGs
APIC: AI tenant EPGs use the default QoS class
APIC: no RoCEv2 classification or lossless policy is applied
Which remediation is best supported by the observations?
Options:
A. Move the training dataset to local SSD storage
B. Increase Kubernetes GPU replica counts
C. Deploy APIC QoS/PFC/ECN policy for RoCEv2 traffic
D. Disable Nexus Dashboard telemetry collection
Best answer: C
Explanation: Nexus Dashboard provides the fabric visibility needed to identify congestion patterns, while APIC is the policy point for deploying fabric behavior. In this case, the evidence narrows the issue to RoCEv2 traffic between GPU node EPGs: GPUs and storage are healthy, but Nexus Dashboard reports ECN marks and output drops, and APIC shows the traffic remains in the default QoS class. The appropriate remediation is to classify the AI/RoCEv2 traffic and apply the required congestion-management and lossless handling policy through APIC, then validate improvement with Nexus Dashboard telemetry. Scaling compute or changing storage does not address the observed fabric policy gap.
Topic: AI Infrastructure Components and Architecture
A data center team is sizing compute for a new RAG-based support chatbot. The application must answer user prompts with low latency, use an existing vector index for retrieval, and serve many concurrent sessions. The foundation model is already trained and fits within one accelerator’s memory. Which compute plan is the best fit?
Options:
A. Multi-node GPU training cluster with high-speed GPU-to-GPU interconnect
B. CPU-only servers with large memory for the vector database
C. Maximum-density GPU servers sized primarily for checkpoint storage
D. Inference GPU nodes plus CPU/RAM capacity for retrieval services
Best answer: D
Explanation: RAG serving combines retrieval with generative inference. Because the model is already trained and fits on one accelerator, the compute plan should prioritize low-latency inference GPUs and enough CPU and memory to run retrieval, ranking, and vector-index services close to the serving path. Horizontal scaling can add more inference replicas for concurrency. A training cluster with extensive GPU-to-GPU connectivity is mainly justified when model training or large distributed fine-tuning requires synchronized accelerator communication. CPU-only capacity may help retrieval, but it misses the GPU requirement for generative model serving.
Topic: AI Infrastructure Components and Architecture
A manufacturer is designing infrastructure for a computer-vision inference workload that stops robotic equipment when defects are detected. Inference must respond in under 10 ms, raw production images must remain inside the plant network, and the cloud link has a 35 ms round-trip latency. The team also wants to use cloud services for model registry and periodic approved model updates. Which design best maps to these requirements?
Options:
A. Keep all AI components disconnected from the cloud.
B. Run inference in the cloud and stream camera images over the WAN.
C. Store raw images in cloud object storage and cache only recent frames locally.
D. Run inference and store raw images on-premises; use secure hybrid connectivity to cloud services for approved model lifecycle functions.
Best answer: D
Explanation: A purely cloud design is unsuitable when a workload has hard local latency requirements or data-location restrictions that the WAN and cloud placement cannot satisfy. In this scenario, a 35 ms cloud round trip already exceeds the under-10 ms response target, and raw production images are not allowed to leave the plant network. The appropriate architecture is hybrid: keep the time-sensitive inference path and protected data on-premises or at the edge, then use secured cloud integration only for functions that can tolerate latency and comply with policy, such as model registry, governance, or approved model updates. The key is separating the real-time protected data path from non-real-time lifecycle services.
Topic: AI Infrastructure Deployment and Data Management
A Cisco UCS domain will host GPU nodes for an AI inference platform. The upstream fabric already trunks VLAN 110 for management, VLAN 220 for dataset storage, and VLAN 330 for tenant inference. The design must keep these traffic types separated and reachable from the hosts. Which UCS LAN connectivity design best maps to this requirement?
Options:
A. Increase the vNIC QoS MTU for all traffic classes
B. Assign the required VLANs to the appropriate vNIC templates
C. Use a larger MAC address pool for the service profiles
D. Apply an NTP policy to the UCS domain profile
Best answer: B
Explanation: In Cisco UCS, LAN connectivity policies define the host-facing vNICs, and vNIC templates commonly define properties such as fabric placement, MAC pool use, QoS policy, and VLAN membership. For workload reachability and traffic separation, the decisive setting is whether the required VLANs are assigned to the correct vNICs or vNIC templates. If VLAN 330 is not allowed on the tenant inference vNIC, the upstream trunk can be correct and the host still cannot send or receive that traffic. QoS, MAC pools, and NTP can be important, but they do not by themselves place host traffic into the required Layer 2 segments.
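The VLAN-membership check described above amounts to a set difference. In this sketch the template names and the missing VLAN 330 are hypothetical; only the VLAN IDs and their purposes come from the scenario.

```python
# VLANs each host-facing vNIC template must carry, per the scenario.
required = {"mgmt": {110}, "storage": {220}, "inference": {330}}

# Hypothetical current state: VLAN 330 was never added to the inference template.
configured = {"mgmt": {110}, "storage": {220}, "inference": set()}


def missing_vlans(required, configured):
    """Return {template: [missing VLAN IDs]} for templates that lack a required VLAN."""
    gaps = {}
    for template, vlans in required.items():
        missing = vlans - configured.get(template, set())
        if missing:
            gaps[template] = sorted(missing)
    return gaps


print(missing_vlans(required, configured))  # {'inference': [330]}
```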
Topic: AI Infrastructure Components and Architecture
A team is designing a new AI training pod for large generative model fine-tuning. The workload is batch-oriented, can use nightly data replication, and requires the same GPU count, RoCE fabric, and storage throughput in any location.
Requirements:
| Requirement | Detail |
|---|---|
| Data control | No public cloud storage for training data |
| Sustainability | Prefer renewable power and lower PUE when performance is unchanged |
| Facility limit | Current data center has limited cooling headroom |
| Operations | Central monitoring must remain available |
Which design best maps to these requirements?
Options:
A. Place the pod in a renewable-powered colocation site with low PUE and private connectivity
B. Keep the pod in the current data center to avoid replication changes
C. Distribute smaller GPU nodes across branch edge sites
D. Move the training data and GPUs to a public cloud AI service
Best answer: A
Explanation: Renewable energy and sustainability requirements can change the recommended AI infrastructure location when the workload can tolerate the placement change without losing performance or violating data-control rules. In this scenario, the training workload is batch-oriented and supports nightly replication, so it does not require the pod to stay in the current facility. A renewable-powered colocation site with lower PUE and sufficient cooling headroom better satisfies the sustainability and facility requirements while still allowing the same GPU, RoCE, storage, and monitoring design. The key is that sustainability changes the site recommendation only because the performance and data-control constraints can still be met.
Topic: AI Fundamentals and Applications
An AI platform team must design serving for two inference workloads in a Cisco data center. Workload A scores 20 million records after business close and must finish before 6:00 AM. Workload B serves a customer-facing assistant with small requests, bursty traffic, and a 300 ms p95 response-time target. Which design best maps to these requirements?
Options:
A. Run Workload A as scheduled batch jobs; run Workload B on reserved autoscaled inference endpoints.
B. Run both workloads as scheduled batch jobs on opportunistic GPU capacity.
C. Run Workload A on reserved endpoints; queue Workload B for batch execution.
D. Run both workloads on reserved low-latency inference endpoints.
Best answer: A
Explanation: Batch inference is designed for latency-tolerant work that can be queued, scheduled, and processed in large groups within a completion window. Workload A fits this pattern because it runs after business close and only needs to finish by 6:00 AM, so GPUs can be allocated in scheduled pools or during lower-demand periods. Interactive inference is request/response serving for users or applications with tight latency targets. Workload B fits this pattern because customer requests are bursty and must meet a 300 ms p95 target, requiring ready capacity, autoscaling, and low-latency request routing. The key distinction is not model type; it is the serving behavior and latency tolerance.
Topic: AI Fundamentals and Applications
A manufacturer is deploying computer-vision inference for robotic safety cells at several plants. Each site must make stop/no-stop decisions within 20 ms, continue operating during WAN outages, and keep raw video local while sending events and model metrics to a central environment for retraining. Which design best maps to these requirements?
Options:
A. Deploy CPU-only edge gateways and forward exceptions to the cloud for GPU analysis.
B. Deploy GPU-enabled edge nodes at each plant with local storage and centralized model/telemetry synchronization.
C. Stream raw video to a central cloud GPU cluster for all inference decisions.
D. Use a regional data center for inference and cache only final decisions at each plant.
Best answer: B
Explanation: Edge AI infrastructure places compute close to the data source when workloads require very low latency, local autonomy, or data locality. In this scenario, robotic safety decisions cannot depend on WAN transport or a centralized inference service because the decision must occur within 20 ms and continue during outages. GPU-enabled edge nodes support local computer-vision inference, while local storage keeps raw video on site for privacy and resiliency. A central environment can still support the broader AI lifecycle by receiving events and metrics and distributing validated model updates.
The key takeaway is that edge AI is not isolated from cloud or data center resources; it uses local processing for time-critical decisions and controlled synchronization for management, monitoring, and retraining.
Topic: AI Infrastructure Components and Architecture
A distributed training job was moved to shared storage so checkpoints can survive node failure. Every checkpoint interval, GPU utilization drops sharply for several minutes, but the job does not fail.
| Signal | Observation during checkpoint |
|---|---|
| Storage capacity | 48% used |
| Storage throughput | 19.5 GB/s of 20 GB/s sustained |
| Storage write latency | p99 rises to 170 ms |
| Network fabric | No drops; links below 45% utilization |
| GPU health | No NVLink or XID errors |
| Orchestration | Pods remain Running |
Which remediation is best supported by these observations?
Options:
A. Reschedule workers to reduce GPU-to-GPU hops
B. Tune RoCE congestion controls on the fabric
C. Scale out the checkpoint storage performance tier
D. Add raw capacity to the existing storage pool
Best answer: C
Explanation: The checkpoint path is constrained by storage performance, not available capacity. Capacity is only 48% used, but throughput is essentially at the storage limit and p99 write latency spikes during checkpoints. For AI training, synchronized checkpoint writes can create bursty high-throughput, high-IOPS demand; the storage design must scale performance as well as provide redundancy and availability. A suitable remediation is to scale out the performance tier, such as using a higher-throughput parallel file or NVMe-backed storage design for checkpoints while preserving resilient access.
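The distinction between a capacity limit and a performance limit can be expressed as a simple classification. The 90% throughput and 85% capacity thresholds below are assumptions chosen for illustration; the input values come from the telemetry table.

```python
def storage_constraint(capacity_used_pct, throughput_gbps, throughput_limit_gbps,
                       perf_threshold=0.90, capacity_threshold_pct=85.0):
    """Classify which storage dimension, if any, is close to its limit."""
    perf_utilization = throughput_gbps / throughput_limit_gbps
    if perf_utilization >= perf_threshold:
        return "performance"
    if capacity_used_pct >= capacity_threshold_pct:
        return "capacity"
    return "none"


# 19.5 of 20 GB/s is 97.5% of sustained throughput, while capacity is 48% used.
print(storage_constraint(capacity_used_pct=48, throughput_gbps=19.5, throughput_limit_gbps=20))
# performance
```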
Topic: AI Infrastructure Deployment and Data Management
An AI team is moving from nightly data staging to streaming a 300 TB training set and shared checkpoints over RoCEv2/NVMe-oF to Cisco UCS GPU nodes. The requirements are no local dataset copies, checkpoint writes that do not starve training reads, and stable GPU utilization as pods scale from 8 to 32. During validation, Nexus Dashboard shows congestion and drops on leaf links carrying storage/RDMA traffic, while storage latency and server CPU remain normal. Which infrastructure decision is best?
Options:
A. Increase PVC capacity on the storage class
B. Move checkpoints to node-local NVMe drives
C. Tune fabric QoS for storage/RDMA traffic
D. Add GPU nodes and increase pod replicas
Best answer: C
Explanation: The decisive data-management requirement is continuous shared data access over the fabric, not more compute or storage capacity. Because Nexus Dashboard reports congestion and drops on the links carrying RoCEv2/NVMe-oF traffic, while the storage array and servers are not saturated, the required change is at the fabric policy layer. A suitable decision is to apply end-to-end QoS/lossless treatment for the storage/RDMA class, using mechanisms such as PFC, ECN, and ETS so checkpoint writes and training reads can coexist without causing drops that reduce GPU utilization. The key is to change the layer where the bottleneck is observed.
Topic: AI Infrastructure Deployment and Data Management
In a Cisco UCS domain for an AI training pod, the QoS policy must be corrected before deploying GPU nodes. RoCEv2 gradient-synchronization traffic is mapped by a vNIC QoS policy to Gold; NFS checkpoint traffic is mapped to Silver. Requirements: RoCEv2 needs lossless behavior and a 9000-byte MTU. NFS needs a 9000-byte MTU but must remain drop-eligible. Management stays Best Effort.
Current system classes:
| Class | Packet handling | MTU |
|---|---|---|
| Gold | Drop | 1500 |
| Silver | No drop | 9000 |
| Best Effort | Drop | 1500 |
Which correction best matches the workload requirements?
Options:
A. Set Gold to drop and MTU 9000; keep Silver no drop and MTU 9000.
B. Set Gold and Silver to no drop and MTU 9000.
C. Set Gold to no drop and MTU 9000; set Silver to drop and MTU 9000.
D. Keep the system classes unchanged and increase Gold bandwidth weight.
Best answer: C
Explanation: Cisco UCS system-class settings define shared QoS behavior such as packet drop/no-drop treatment and MTU for traffic mapped into that class. In this scenario, the vNIC QoS mappings already identify Gold for RoCEv2 and Silver for NFS. The mismatch is in the system-class attributes: RoCEv2 requires lossless behavior and jumbo MTU, so Gold must be no-drop with MTU 9000. NFS checkpoint traffic also needs jumbo MTU, but the requirement says it must remain drop-eligible, so Silver should not be no-drop. Bandwidth weights do not fix packet-loss behavior or MTU mismatches.
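The correction can be derived mechanically by diffing the required class attributes against the current ones. The dictionary layout and the `no_drop` key are illustrative and are not a UCS Manager data model; the values come from the scenario table and requirements. Best Effort is omitted because the scenario leaves it unchanged.

```python
# Required behavior per system class, from the stated workload requirements.
required = {
    "Gold": {"no_drop": True, "mtu": 9000},     # RoCEv2: lossless, jumbo MTU
    "Silver": {"no_drop": False, "mtu": 9000},  # NFS: drop-eligible, jumbo MTU
}

# Current system-class settings, from the table in the question.
current = {
    "Gold": {"no_drop": False, "mtu": 1500},
    "Silver": {"no_drop": True, "mtu": 9000},
}


def corrections(required, current):
    """Return only the attributes that must change, grouped by class."""
    changes = {}
    for cls, wanted in required.items():
        diff = {key: value for key, value in wanted.items() if current[cls].get(key) != value}
        if diff:
            changes[cls] = diff
    return changes


print(corrections(required, current))
# {'Gold': {'no_drop': True, 'mtu': 9000}, 'Silver': {'no_drop': False}}
```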
Topic: AI Fundamentals and Applications
A generative AI chat service runs on GPU-enabled Cisco UCS servers in a Kubernetes cluster. During business-hour spikes, users report slow responses. Intersight and application telemetry show the following:
| Signal | Observation |
|---|---|
| p95 time-to-first-token | Exceeds target during spikes |
| GPU utilization | 92% to 97% on all inference pods |
| Request queue depth | Rises until traffic drops |
| Network/storage health | No congestion or latency alerts |
| Replica count | Fixed at 4 pods |
Which infrastructure control best addresses the observed issue?
Options:
A. Increase storage IOPS for model weights
B. Enable PFC and ECN for RoCE traffic
C. Autoscale inference replicas from queue and GPU metrics
D. Raise batch size to maximize GPU throughput
Best answer: C
Explanation: Generative AI serving must handle variable request arrival rates while keeping response latency predictable. Here, all inference pods are near GPU saturation, queue depth grows, and network and storage telemetry are clean. That points to insufficient serving capacity during spikes, not a fabric or storage bottleneck. An infrastructure control such as Kubernetes autoscaling, driven by request queue depth, GPU utilization, or latency-oriented serving metrics, can add inference replicas and spread demand before queues create unacceptable time-to-first-token. Throughput tuning alone may not meet strict response expectations if it increases waiting time.
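The Kubernetes Horizontal Pod Autoscaler documents its core scaling rule as desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule to the fixed 4-replica deployment. The queue-depth and GPU-utilization values, the targets, and the replica bounds are assumptions, and the real controller also applies a tolerance and stabilization behavior that this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=16):
    """Apply the HPA ratio rule, clamped to the allowed replica range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical spike: 30 queued requests per pod against a target of 10.
print(desired_replicas(4, current_metric=30, target_metric=10))  # 12

# Hypothetical GPU-based signal: 95% average utilization against a 70% target.
print(desired_replicas(4, current_metric=95, target_metric=70))  # 6
```

Queue depth tends to react faster than GPU utilization here, because utilization is already near its ceiling and cannot show how far demand exceeds capacity.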
Topic: AI Infrastructure Components and Architecture
An AI team is expanding an on-premises training pod from 64 to 128 GPUs for large-model fine-tuning. Pilot telemetry shows RoCE fabric links above 90% utilization during gradient all-reduce, GPUs waiting on network transfers, stable one-way latency within target, and cached storage reads after the first epoch. Which fabric decision is best?
Options:
A. Increase bisection bandwidth with a low-oversubscription leaf-spine fabric.
B. Add local NVMe caching on each GPU server.
C. Collapse the fabric to fewer hops using the existing uplink capacity.
D. Insert inline inspection for all east-west training flows.
Best answer: A
Explanation: This scenario is bandwidth-driven, not latency-driven. Large-model training commonly generates heavy east-west GPU-to-GPU traffic during synchronization operations such as all-reduce. The decisive evidence is high RoCE link utilization and GPUs waiting on network transfers while latency remains within target. A low-oversubscription leaf-spine design with higher bisection bandwidth lets more training traffic move concurrently as the pod scales. A latency-driven design would be favored when small transactions or tail latency dominate, such as real-time inference request paths.
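Oversubscription on a leaf is commonly expressed as total host-facing bandwidth divided by total uplink bandwidth. The port counts and speeds below are hypothetical, since the scenario does not give them; the sketch only shows how the ratio changes when uplink capacity is added.

```python
def oversubscription_ratio(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Host-facing bandwidth divided by spine-facing bandwidth for one leaf."""
    return (downlink_count * downlink_gbps) / (uplink_count * uplink_gbps)


# Hypothetical leaf: 32 x 400G GPU-facing ports.
print(oversubscription_ratio(32, 400, 8, 400))   # 4.0 (a 4:1 design)
print(oversubscription_ratio(32, 400, 32, 400))  # 1.0 (non-blocking)
```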
Topic: AI Infrastructure Components and Architecture
A team added four GPU servers to an existing AI training rack. Full-cluster jobs now run slower than before the expansion, but storage latency is normal and the fabric shows no sustained drops or ECN marking.
Telemetry excerpt:
| Signal | Observation |
|---|---|
| Rack PDU load | 94% during training |
| Server events | Power cap asserted |
| GPU telemetry | Clocks reduced under load |
| Inlet temperature | Within target |
Which action best addresses the likely cause?
Options:
A. Redistribute nodes or add rack power capacity
B. Restart the orchestration scheduler
C. Enable ECN marking for the RoCE class
D. Increase storage queue depth for training data
Best answer: A
Explanation: Dense AI scaling is limited by more than node count. In this case, the expansion increased rack power draw until the PDU approached its usable capacity, and the servers asserted power caps. Power capping reduces available GPU power, which lowers GPU clocks and training throughput even when network and storage telemetry look healthy. The inlet temperature being within target makes a thermal hot-aisle issue less likely, but it does not remove the power-density constraint.
The right remediation is to restore power headroom by spreading systems across racks or adding appropriate rack power capacity, then continue monitoring PDU load, server power-cap events, GPU clocks, and thermal telemetry as the cluster scales.
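A rough power-headroom calculation shows how redistribution could be sized. Only the 94% PDU load comes from the scenario; the 40 kW rack capacity, 4.7 kW per-server draw, 8-server count, and 80% utilization target are assumptions chosen so the numbers reproduce that 94% figure.

```python
import math


def rack_utilization(pdu_capacity_kw, server_draw_kw, server_count):
    """Fraction of PDU capacity consumed when all servers draw their peak load."""
    return server_draw_kw * server_count / pdu_capacity_kw


def servers_to_move(pdu_capacity_kw, server_draw_kw, server_count, target_utilization=0.80):
    """How many servers must leave the rack to get under the target utilization."""
    budget_kw = pdu_capacity_kw * target_utilization
    max_servers = math.floor(budget_kw / server_draw_kw)
    return max(0, server_count - max_servers)


# Hypothetical rack: 40 kW usable, 8 GPU servers drawing 4.7 kW each under load.
print(round(rack_utilization(40, 4.7, 8), 2))  # 0.94
print(servers_to_move(40, 4.7, 8))             # 2
```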
Topic: AI Infrastructure Operations and Troubleshooting
An AI training cluster had unstable step times during distributed jobs. The operations team corrected a mismatched QoS policy on the RoCEv2 fabric. The closure criteria require proof that training throughput returned to baseline and that fabric events no longer align with GPU node stalls. Which operational design best satisfies the closure requirements?
Options:
A. Monitor only GPU utilization for 24 hours before closing the incident
B. Run a storage benchmark and verify that NVMe latency is unchanged
C. Close the incident after confirming the QoS policy now matches the template
D. Rerun the baseline training benchmark while correlating fabric telemetry and GPU node logs
Best answer: D
Explanation: Remediation closure for AI infrastructure should verify the original symptom, not just the configuration change. Because the issue affected distributed training step times and was tied to the RoCEv2 fabric, the team needs a comparable post-remediation benchmark plus telemetry and log correlation from the same validation window. This proves that training throughput returned to the expected baseline and that fabric signals such as congestion, pause behavior, or drops are no longer synchronized with GPU node stalls. A template match is useful evidence that the fix was applied, but it is not enough to prove service recovery.
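Both closure criteria can be checked from the same validation window. In this sketch the step times, the 5% tolerance, and the timestamps (seconds since the start of the run) are hypothetical; the two checks correspond to the two criteria in the question.

```python
from statistics import median


def throughput_recovered(baseline_step_s, post_fix_step_s, tolerance=0.05):
    """True when the median post-fix step time is within `tolerance` of baseline."""
    return median(post_fix_step_s) <= median(baseline_step_s) * (1 + tolerance)


def overlapping_events(stall_windows, fabric_event_times):
    """Return fabric event timestamps that fall inside any GPU stall window."""
    return [t for t in fabric_event_times
            if any(start <= t <= end for start, end in stall_windows)]


baseline = [1.20, 1.22, 1.19, 1.21]   # seconds per step before the incident
post_fix = [1.21, 1.20, 1.23, 1.22]   # seconds per step after the QoS correction
stalls = [(100, 104)]                 # GPU stall windows taken from node logs
fabric_events = [50, 300]             # congestion or pause events from fabric telemetry

can_close = throughput_recovered(baseline, post_fix) and not overlapping_events(stalls, fabric_events)
print(can_close)  # True
```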
Topic: AI Infrastructure Components and Architecture
A team redesigned storage for an AI training cluster to maximize usable capacity per rack unit. Since the change, training jobs run 40% longer even though GPU and AI fabric telemetry show no compute faults or RoCE congestion.
Exhibit: Storage observation
| Signal | Observation |
|---|---|
| Workload I/O | many small random reads; bursty checkpoint writes |
| Current target | capacity-optimized HDD file share |
| Protection | erasure-coded pool optimized for usable capacity |
| Symptom | high metadata and read latency during training |
Which remediation best corrects the storage design?
Options:
A. Add more usable HDD capacity to the existing file share
B. Convert all datasets to block LUNs per training node
C. Move hot training data to NVMe-backed shared file storage with redundancy
D. Tune RoCE congestion controls on the AI fabric
Best answer: C
Explanation: AI training storage must match the workload access pattern, not only the required capacity. This workload has many small random reads and bursty checkpoint writes, so metadata latency, read latency, write performance, and redundancy are decisive. A capacity-optimized HDD file share with erasure coding can provide efficient usable capacity, but it is often a poor fit for hot training data when low-latency shared access is required. A better design separates tiers: keep cold or archival data on the capacity tier, and place the active dataset and checkpoints on a high-performance NVMe-backed shared file platform with appropriate redundancy and availability. The fabric and GPU telemetry reduce the likelihood that compute or RoCE congestion is the primary cause.
Topic: AI Infrastructure Operations and Troubleshooting
A data center team is deploying an AI training pod with a Nexus-based leaf-spine fabric and Cisco UCS GPU servers. Operations must see fabric-level congestion and path behavior for east-west RoCEv2 traffic, and also maintain server inventory, hardware health, firmware compliance, and lifecycle actions. Which monitoring design best maps to these requirements?
Options:
A. Use Intersight for RoCEv2 path visibility and server firmware compliance
B. Use Nexus Dashboard for fabric visibility and Intersight for UCS infrastructure visibility
C. Use APIC only for fabric and UCS lifecycle monitoring
D. Use Nexus Dashboard for server lifecycle and fabric congestion visibility
Best answer: B
Explanation: Nexus Dashboard and Intersight provide complementary visibility for AI infrastructure operations. Nexus Dashboard is the better fit for network fabric visibility, including fabric health, telemetry, assurance, endpoint or flow context, and congestion indicators that affect east-west AI traffic such as RoCEv2. Cisco Intersight is the better fit for infrastructure inventory and lifecycle visibility, especially Cisco UCS domains, server health, firmware compliance, advisories, and operational actions across compute infrastructure. In this scenario, the requirements span both the network fabric and UCS server lifecycle, so the design should use each platform for its intended visibility domain rather than forcing one tool to cover both completely.
Topic: AI Infrastructure Deployment and Data Management
A Cisco UCS domain is being prepared for GPU nodes running distributed AI training. The RoCEv2 vNIC QoS policy marks RDMA traffic with CoS 4, but benchmark telemetry shows RDMA retransmits and unstable step times. The system-class summary is shown:
| Traffic class | CoS | Drop behavior | MTU |
|---|---|---|---|
| Best effort | 0 | Drop | 1500 |
| RDMA | 4 | Drop | 1500 |
| Management | 6 | Drop | 1500 |
Which policy correction is the best decision?
Options:
A. Move RDMA traffic to the management system class.
B. Set the RDMA system class to no-drop with jumbo MTU.
C. Leave RDMA as drop traffic and raise best-effort bandwidth.
D. Increase the NTP policy polling frequency.
Best answer: B
Explanation: Cisco UCS system classes define how marked traffic is handled across the fabric, including CoS, MTU, bandwidth treatment, and drop behavior. In this scenario, the vNIC QoS policy already marks RoCEv2 RDMA traffic with CoS 4, but the matching RDMA system class is still configured as drop traffic with a 1,500-byte MTU. That does not meet the workload requirement for low-latency, lossless GPU training traffic. The right correction is to align the system class with the RDMA marking by making the RDMA class no-drop and using a jumbo MTU appropriate for high-throughput AI flows. The key is correcting the UCS QoS/system-class layer, not changing unrelated management or timing policies.
Topic: AI Infrastructure Operations and Troubleshooting
A distributed training job across eight Cisco UCS GPU nodes slows after a new rack is added. Single-node GPU benchmarks remain normal, storage latency is unchanged, and Kubernetes pods are Running with no restarts.
Telemetry summary:
| Source | Observation |
|---|---|
| Intersight | GPU utilization oscillates between 25% and 45% |
| Nexus Dashboard | High PFC pause frames on the RoCEv2 class |
| Nexus Dashboard | ECN/CNP spikes on two new leaf-spine links |
Which troubleshooting conclusion and remediation path are most supported?
Options:
A. Storage bottleneck; move the training dataset to faster block storage
B. Network congestion; verify RoCEv2 QoS, PFC, ECN, ETS, and link distribution
C. Orchestration fault; recreate the pods with stricter node affinity
D. GPU interconnect fault; replace the affected GPUs or NVLink components
Best answer: B
Explanation: The strongest evidence points to the network layer, specifically congestion or inconsistent lossless transport behavior for RoCEv2 traffic after the new rack was added. Distributed training depends heavily on east-west GPU-to-GPU communication, so low and oscillating GPU utilization can result when all-reduce traffic is delayed. Normal single-node benchmarks reduce the likelihood of a local GPU problem, unchanged storage latency reduces the likelihood of a data path issue, and healthy pod state reduces the likelihood of an orchestration failure. High PFC pause frames plus ECN/CNP spikes on the new links support validating QoS class mapping, PFC, ECN, ETS, and load distribution across the new fabric paths.
Topic: AI Infrastructure Deployment and Data Management
A team is deploying a Cisco UCS-based GPU cluster for distributed model training. Jobs are sensitive to GPU clock throttling, the facility provides two independent power feeds to each chassis, and operations requires the cluster to remain online if either feed fails. Cooling capacity is already validated. Which UCS power policy decision is best?
Options:
A. Use N+1 redundancy and tune storage QoS
B. Use grid redundancy with an aggressive power cap
C. Use grid redundancy and no power cap
D. Use non-redundant power and no power cap
Best answer: C
Explanation: Cisco UCS power policy behavior should match both the AI workload and the facility power design. For dense GPU training, power capping can reduce available server power and cause GPU frequency throttling, which directly affects training performance. Because the chassis has two independent feeds and must survive the loss of either feed, grid redundancy is the appropriate UCS power redundancy model. Cooling has already been validated, so there is no stated reason to trade performance for a restrictive cap. The key is to preserve feed-level resiliency without constraining GPU power unnecessarily.
Topic: AI Fundamentals and Applications
A financial services company is designing infrastructure for a generative AI assistant. Customer records must remain in the data center, inference must respond with low latency for internal users, and the team needs temporary extra GPU capacity for periodic model fine-tuning. Which infrastructure decision best matches these requirements?
Options:
A. Deploy edge GPU nodes in branch offices and synchronize all records to them
B. Move customer records, inference, and fine-tuning entirely to a public cloud GPU service
C. Build a larger on-premises GPU cluster sized for peak fine-tuning demand
D. Keep sensitive data and inference on-premises, and burst fine-tuning jobs to cloud GPUs over secure connectivity
Best answer: D
Explanation: Hybrid AI infrastructure combines on-premises and cloud resources so each workload component runs where it best fits. In this scenario, customer records and low-latency inference should stay on-premises because of data control and response-time requirements. Periodic fine-tuning can use cloud GPU capacity because it is bursty and does not require permanent peak-sized local infrastructure. Secure connectivity and controlled data or artifact synchronization are key hybrid characteristics, especially when sensitive datasets cannot freely move to the cloud.
The key takeaway is to place data-sensitive, latency-sensitive services on-premises while using cloud resources for elastic scale when the workload allows it.
Topic: AI Infrastructure Components and Architecture
A team is building an on-premises GPU cluster for model training. Eight GPU servers must concurrently read the same curated image dataset and write checkpoints to a common path. The storage design must provide high aggregate throughput, a shared namespace, and continued access if one storage node fails. Which storage access model best fits these requirements?
Options:
A. Single-controller NFS export on one storage node
B. Dedicated Fibre Channel block LUN per server
C. Local NVMe SSDs in each GPU server
D. Scale-out file storage with a shared namespace
Best answer: D
Explanation: The core requirement is shared, high-throughput access for multiple GPU servers. Training jobs commonly need many workers to read the same dataset and write checkpoints to locations visible to the whole job. A scale-out file storage design, such as a parallel or clustered file service, provides a common namespace and can aggregate throughput across storage nodes while maintaining availability when a node fails. Block storage can deliver strong performance, but a basic per-server LUN does not provide a shared file namespace by itself. Local NVMe is very fast for one host but does not solve shared access or resiliency across servers. The key takeaway is to match multi-node AI training data access to a shared file model, not isolated block or local storage.
Topic: AI Fundamentals and Applications
A team trained a model on an on-premises Cisco UCS GPU cluster and then deployed it as a RAG service across two data centers. After document updates, users at one site receive stale answers. Intersight shows inference GPU utilization below 35%, Nexus telemetry shows no congestion, and logs show the vector index at site B is 18 hours behind site A. Which lifecycle transition was most likely missed?
Options:
A. Single-GPU to distributed-training transition
B. Development-to-production serving transition
C. Production-inference to experimentation transition
D. Data-ingestion to model-training transition
Best answer: B
Explanation: Moving an AI workload from development or training into production serving changes infrastructure requirements. A RAG service depends on current retrieval data, synchronized indexes, protected storage, and monitoring coverage for the serving path, not just GPU availability. In this case, low GPU utilization and clean network telemetry make GPU capacity and fabric congestion unlikely. The stale answers align with an operational production-serving gap: the vector index at one site is not staying synchronized after document updates.
The key takeaway is that lifecycle transitions often shift the bottleneck from model training resources to data freshness, protection, and observability for the deployed service.
Topic: AI Infrastructure Components and Architecture
A team is deploying a large generative AI training workload in an AI pod. The job uses model parallelism, exchanges tensors frequently between GPUs, must run as containers, and the data center wants to avoid adding extra idle GPU nodes for capacity. Which compute architecture is the BEST fit?
Options:
A. NVLink-connected GPU-dense nodes with topology-aware container placement
B. PCIe-only single-GPU nodes scaled out across Ethernet
C. Shared vGPU hosts for all training containers
D. Existing CPU cluster with higher-performance storage
Best answer: A
Explanation: Model-parallel generative AI training is sensitive to GPU-to-GPU latency and bandwidth because GPUs exchange intermediate tensors during each training step. A GPU-dense compute node with NVLink or NVSwitch keeps those exchanges on a high-bandwidth local GPU fabric, while container orchestration can place pods according to GPU topology so a job consumes GPUs within the same connectivity domain before scaling out. This matches the workload without adding unnecessary nodes. Scaling many PCIe-only or single-GPU servers shifts the bottleneck to the network, and storage upgrades do not solve the primary compute communication requirement.
Topic: AI Infrastructure Deployment and Data Management
An AI platform team is validating distributed training on Cisco UCS GPU servers. The data-management requirements are: one authoritative 40 TB image dataset, concurrent reads by all training pods, snapshot-based rollback, and no manual per-node data copies. The current Kubernetes cluster uses only local NVMe on each worker, and fabric telemetry shows no congestion during tests. Which design change best maps to these requirements?
Options:
A. Enable ECN and PFC for the RoCE traffic class.
B. Integrate shared high-throughput file storage with RWX persistent volumes.
C. Add more GPU and CPU memory to each UCS server.
D. Pin training pods to nodes with the largest local NVMe.
Best answer: B
Explanation: These requirements are primarily data-management and storage-integration requirements, not a fabric or compute scaling problem. A single authoritative dataset with concurrent pod access and snapshot rollback needs shared storage that supports high-throughput reads, persistence, and data protection. In Kubernetes, that storage should be exposed through persistent volumes that allow the required access mode, such as read-write-many for multiple training pods. Local NVMe can be useful for scratch space or caching, but it creates copies, version drift, and operational rollback challenges when used as the primary dataset location. The fabric may still need validation for throughput, but the stated telemetry does not point to congestion as the blocking issue.
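As a rough illustration of the access-mode point, the sketch below builds a Kubernetes PersistentVolumeClaim manifest as a Python dictionary. The claim name `training-dataset` and the storage class `shared-file-storage` are hypothetical placeholders, not values from the scenario, and snapshot rollback would still depend on what the chosen storage backend supports.

```python
import json


def shared_dataset_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a PVC manifest that requests shared read-write-many access."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on many nodes mount the same volume.
            "accessModes": ["ReadWriteMany"],
            # Hypothetical class name; it must map to shared file storage.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# "40T" mirrors the 40 TB dataset size stated in the question.
pvc = shared_dataset_pvc("training-dataset", "shared-file-storage", "40T")
print(json.dumps(pvc, indent=2))
```

Every training pod would then reference the same claim, which avoids the per-node copies that local NVMe requires.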
Topic: AI Infrastructure Deployment and Data Management
A Cisco UCS GPU cluster will run distributed fine-tuning jobs. The OS must boot from local M.2 RAID-1, but training datasets and checkpoints must use an existing dual-fabric Fibre Channel array with host multipathing. The current UCS policy set defines only the local disk storage policy and LAN vNICs, so the hosts cannot see the training LUNs. Which design correction best supports the stated AI workload data path?
Options:
A. Convert the local M.2 RAID-1 devices into shared training storage
B. Keep local boot and add dual-fabric FC vHBAs for data LUNs
C. Mount the array through the management network using NFS
D. Add only a LAN QoS policy for RoCE-enabled storage traffic
Best answer: B
Explanation: The workload has two distinct storage paths: local M.2 RAID-1 for operating system boot and dual-fabric Fibre Channel SAN for the AI data path. Correcting the UCS design means retaining the local disk policy for boot while adding the SAN connectivity needed by the training LUNs, such as vHBAs mapped to both FC fabrics, proper VSAN association, and downstream zoning and LUN masking. That supports multipathing and keeps the high-volume dataset and checkpoint traffic on the required FC storage path. A LAN-only or local-disk-only change does not make the SAN LUNs visible to the hosts.
Topic: AI Infrastructure Deployment and Data Management
A team is deploying a Cisco UCS GPU cluster for distributed model training. The workload must sustain peak GPU utilization during nightly jobs, preserve PSU redundancy, and avoid power-cap or thermal throttling in a rack where cooling capacity cannot be increased. Which evidence BEST validates that the selected UCS power policy supports the workload requirement?
Options:
A. A fabric utilization chart showing east-west training traffic stays below link capacity
B. A facility PUE report showing the data hall is more efficient than last quarter
C. A UCS inventory report showing all servers have identical GPU models and firmware levels
D. Intersight telemetry from a representative training run showing power headroom, healthy PSU redundancy, and no throttling
Best answer: D
Explanation: Power policy validation for an AI training cluster should be based on observed behavior under representative load, not only on static configuration. The decisive evidence is telemetry that shows the servers can draw the power needed for sustained GPU operation while preserving the required PSU redundancy state and avoiding power-cap or thermal throttling. In a Cisco UCS/Intersight context, useful signals include server and chassis power draw versus policy limits, PSU redundancy health, thermal status, and throttling indicators during a benchmark or production-like training job. Inventory, network utilization, and facility efficiency can be relevant to other checks, but they do not prove the power policy supports the workload.
Topic: AI Infrastructure Deployment and Data Management
A data center team is preparing to release an APIC-managed AI-ready fabric for a new GPU training cluster. Nexus Dashboard shows the following validation summary:
| Check | Visible evidence |
|---|---|
| Fabric reachability | All spines and leaves reachable |
| Fabric health | No critical device faults |
| Policy consistency | QoS/RoCE policy inconsistent on leaf-103 |
| APIC deployment state | Tenant and interface policies deployed, except leaf-103 has a pending policy fault |
What is the best readiness decision?
Options:
A. Remediate leaf-103 policy inconsistency before release
B. Release the fabric because all switches are reachable
C. Run a GPU benchmark before checking policy consistency
D. Release the fabric because no critical device faults exist
Best answer: A
Explanation: AI-ready fabric readiness requires more than basic device reachability. For GPU training traffic, fabric status, health, and policy consistency must all support the intended behavior. The evidence shows that devices are reachable and there are no critical device faults, but the QoS/RoCE policy is inconsistent on leaf-103 and APIC still reports a pending policy fault. That means one part of the fabric may not enforce the same traffic class, congestion, or lossless behavior as the rest of the deployment. The appropriate decision is to hold release, remediate the policy inconsistency, and revalidate health and consistency before onboarding the workload.
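The readiness logic can be sketched as a simple gate in which every check must pass before release. The check names and the `release_decision` helper are illustrative assumptions that mirror the validation table; they are not Nexus Dashboard or APIC fields.

```python
def release_decision(checks: dict) -> tuple:
    """Return ('release', []) only if every check passed, else ('hold', failed checks)."""
    failed = sorted(name for name, passed in checks.items() if not passed)
    if failed:
        return ("hold", failed)
    return ("release", [])


# Values transcribed from the validation summary in the question.
validation = {
    "fabric_reachability": True,     # all spines and leaves reachable
    "fabric_health": True,           # no critical device faults
    "policy_consistency": False,     # QoS/RoCE policy inconsistent on leaf-103
    "apic_deployment_state": False,  # pending policy fault on leaf-103
}

decision, blockers = release_decision(validation)
print(decision, blockers)
```

Reachability and health passing do not offset the two failed checks, which is why the decision is to hold and remediate.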
Topic: AI Infrastructure Components and Architecture
An enterprise runs a RAG pipeline on a Cisco UCS GPU cluster and wants to burst indexing jobs to a cloud GPU environment. Sensitive source documents must remain encrypted in transit, the cloud jobs need data updates within 15 minutes, and the same workload must be able to move back on premises without application changes. Which design best corrects a hybrid design that currently uses manual file copies over the public Internet?
Options:
A. Use private encrypted connectivity, continuous data replication, and portable container orchestration across both sites.
B. Create nightly VM snapshots and restore them in the cloud when bursting is needed.
C. Keep all documents on premises and send public API calls to cloud inference endpoints.
D. Add larger cloud GPU instances and rebuild the dataset in the cloud for each burst.
Best answer: A
Explanation: A hybrid AI design must align data locality, synchronization, secure connectivity, and workload portability. In this scenario, the key gaps are not GPU capacity alone; the existing design cannot move up-to-date data securely or run the same workload consistently across locations. Private encrypted connectivity protects sensitive data in transit, continuous or policy-based replication meets the 15-minute freshness requirement, and container-based orchestration with compatible runtime policies supports moving the RAG indexing workload between the Cisco UCS GPU cluster and cloud GPU capacity. The closest traps either solve compute capacity only or move workloads without meeting the data freshness and security constraints.
Topic: AI Fundamentals and Applications
A healthcare company is deploying a RAG-based clinical assistant. Patient records must remain on-premises for compliance, but the team wants to burst GPU-intensive embedding refresh jobs to the cloud during monthly updates. The design must use encrypted connectivity, keep indexes synchronized, and allow workloads to move without changing the application packaging.
Which hybrid AI infrastructure design best maps to these requirements?
Options:
A. Cloud-only storage and compute with regional backup replication
B. Edge inference nodes with periodic USB data transfer
C. On-prem data lake with cloud GPU bursting over secure connectivity
D. On-prem-only GPU cluster with local file storage
Best answer: C
Explanation: A hybrid AI infrastructure combines on-premises and cloud resources while preserving placement, security, and portability requirements. In this scenario, regulated patient records stay on-premises, while burstable GPU capacity in the cloud handles temporary embedding refresh demand. The design should include encrypted private or VPN connectivity, data/index synchronization controls, and consistent orchestration or container packaging so the workload can run in either environment without application redesign. Cloud-only fails the data-residency constraint, and on-prem-only fails the burst-capacity requirement. The key characteristic is coordinated use of both environments, not simply adding remote access or backups.
Topic: AI Infrastructure Deployment and Data Management
A converged AI cluster uses Ethernet for management, application/API traffic, and NVMe-oF over RoCEv2 storage. During training checkpoints, jobs stall and storage latency spikes, while management access and application endpoints remain healthy.
Telemetry summary:
| Signal | Observation |
|---|---|
| Storage vNIC marking | CoS 0, best effort |
| Fabric no-drop class | CoS 4, PFC/ECN enabled |
| Drops | CoS 0 output drops during checkpoints |
| Management traffic | Stable, CoS 0 |
Which remediation is most directly supported by the evidence?
Options:
A. Move management traffic into the no-drop system class
B. Prioritize application/API traffic above storage traffic
C. Disable PFC and ECN on the fabric no-drop class
D. Map the storage/RoCE vNIC to the no-drop system class
Best answer: D
Explanation: In a converged AI environment, different traffic types need different QoS treatment. Management traffic is typically best effort and should remain reliable but not lossless. Application/API traffic may need priority or bandwidth controls, but it usually does not require no-drop behavior. RoCEv2 storage traffic, however, is sensitive to packet loss and should be mapped consistently to a lossless system class with PFC and ECN enabled across the host vNIC policy and fabric. The evidence shows storage marked CoS 0 and experiencing drops, while the fabric’s lossless treatment exists on CoS 4. The supported fix is to align the storage/RoCE marking with the no-drop class.
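The marking mismatch described above can be expressed as a small consistency check. This is a minimal sketch: the `TrafficClass` type, the class names, and the `find_mismatches` helper are hypothetical, and only the CoS values come from the telemetry table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficClass:
    name: str
    cos: int               # CoS value marked at the host vNIC
    needs_lossless: bool   # whether the traffic requires no-drop treatment


# From the telemetry: the fabric no-drop class is CoS 4 with PFC/ECN enabled.
NO_DROP_COS = {4}


def find_mismatches(classes, no_drop_cos):
    """Return names of classes whose marking disagrees with their loss requirement."""
    mismatched = []
    for tc in classes:
        in_no_drop = tc.cos in no_drop_cos
        if tc.needs_lossless != in_no_drop:
            mismatched.append(tc.name)
    return mismatched


observed = [
    TrafficClass("storage-roce", cos=0, needs_lossless=True),
    TrafficClass("management", cos=0, needs_lossless=False),
]
print(find_mismatches(observed, NO_DROP_COS))  # ['storage-roce']

# After remapping the storage vNIC to CoS 4, nothing is flagged.
remediated = [
    TrafficClass("storage-roce", cos=4, needs_lossless=True),
    TrafficClass("management", cos=0, needs_lossless=False),
]
print(find_mismatches(remediated, NO_DROP_COS))  # []
```

The same check would also flag management traffic if it were moved into the no-drop class, which is why that option is not a fix.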
Topic: AI Infrastructure Components and Architecture
A data center team plans to add four GPU racks for an AI training pod. Each rack requires 30 kW of IT power and rejects about 30 kW of heat. Site policy requires N+1 UPS and cooling, with projected steady-state load at or below 80% of N+1 usable capacity.
| Item | Current state |
|---|---|
| Current IT/heat load | 590 kW |
| UPS capacity | 4 × 300 kW modules |
| Cooling capacity | 5 × 200 kW units |
Which design best supports the expansion without reducing reliability?
Options:
A. Install the GPU racks using the existing UPS and cooling capacity.
B. Add one 300-kW UPS module before installing the GPU racks.
C. Use renewable energy credits to offset the new rack power.
D. Add one 200-kW cooling unit before installing the GPU racks.
Best answer: D
Explanation: For AI rack expansion, reliability must be checked against usable N+1 capacity, not total installed capacity. The expansion adds 120 kW, so the projected IT and heat load is 710 kW. UPS N+1 usable capacity is 900 kW, and 80% of that is 720 kW, so power remains within policy. Cooling N+1 usable capacity is 800 kW, and 80% of that is only 640 kW, so the planned heat load would violate the cooling reliability margin. Adding a 200-kW cooling unit increases N+1 usable cooling to 1,000 kW, making the 80% policy limit 800 kW. The key is that the limiting subsystem is cooling, not power.
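The arithmetic in this explanation can be reproduced with a short sketch. The helper name `n_plus_1_policy_limit` is an illustrative assumption; the unit counts, unit sizes, load figures, and the 80% policy come from the question.

```python
def n_plus_1_policy_limit(units: int, unit_kw: int, policy_percent: int = 80) -> float:
    """Policy limit in kW: capacity with one unit held in reserve, scaled by the policy."""
    usable_kw = (units - 1) * unit_kw
    return usable_kw * policy_percent / 100


current_load_kw = 590
added_load_kw = 4 * 30                                # four racks at 30 kW each
projected_kw = current_load_kw + added_load_kw        # 710 kW

ups_limit_kw = n_plus_1_policy_limit(4, 300)          # 3 x 300 = 900 usable, limit 720
cooling_limit_kw = n_plus_1_policy_limit(5, 200)      # 4 x 200 = 800 usable, limit 640
cooling_after_kw = n_plus_1_policy_limit(6, 200)      # 5 x 200 = 1000 usable, limit 800

print("UPS within policy:", projected_kw <= ups_limit_kw)                  # True
print("Cooling within policy:", projected_kw <= cooling_limit_kw)          # False
print("Cooling within policy after adding a unit:", projected_kw <= cooling_after_kw)  # True
```

Only the cooling check fails at 710 kW, so one additional 200-kW cooling unit is the change that restores the margin.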
Topic: AI Infrastructure Operations and Troubleshooting
A distributed training job now takes twice as long to complete. The operations team correlates Kubernetes state, Intersight health, and Nexus Dashboard alerts for the same 10-minute window.
| Time | Observation |
|---|---|
| 10:02 | Pods remain Running; no restarts or reschedules |
| 10:04 | GPUs healthy; utilization drops from 95% to 45% periodically |
| 10:05 | Nexus Dashboard reports rising ECN marks and PFC pause frames on the RoCE lossless class |
| 10:06 | Storage latency and IOPS remain at baseline |
Which likely cause is best supported by these correlated alerts and logs?
Options:
A. RoCE traffic-class congestion is stalling GPU communication
B. GPU thermal throttling is reducing compute performance
C. Storage latency is delaying training data reads
D. Kubernetes rescheduling is interrupting the workload
Best answer: A
Explanation: The strongest evidence points to network congestion in the lossless RoCE traffic class. Distributed training depends on frequent GPU-to-GPU communication; when ECN marking and PFC pause frames rise on that class, the fabric is signaling congestion and backpressure. That can make GPUs wait for data exchange, which appears as periodic utilization drops even when the GPU hardware is healthy. The Kubernetes and storage observations help exclude common alternatives: the pods are stable, and storage performance is normal. The key troubleshooting move is correlating alerts across layers rather than treating the GPU utilization drop as a compute-only symptom.
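The cross-layer elimination can be sketched as a rule-based triage function. This is a simplified illustration, not a real monitoring integration: the signal names and the order of the rules are assumptions, and the input values are transcribed from the 10-minute window in the question.

```python
def triage(signals: dict) -> str:
    """Rule out orchestration, storage, and GPU health before blaming the fabric."""
    if signals["pod_restarts"] > 0 or signals["pod_reschedules"] > 0:
        return "orchestration"
    if signals["storage_latency_above_baseline"]:
        return "storage"
    if not signals["gpu_healthy"]:
        return "gpu"
    if signals["ecn_marks_rising"] or signals["pfc_pause_rising"]:
        return "network-congestion"
    return "undetermined"


window = {
    "pod_restarts": 0,                        # 10:02 pods remain Running
    "pod_reschedules": 0,
    "gpu_healthy": True,                      # 10:04 GPUs healthy
    "ecn_marks_rising": True,                 # 10:05 rising ECN marks
    "pfc_pause_rising": True,                 # 10:05 rising PFC pause frames
    "storage_latency_above_baseline": False,  # 10:06 storage at baseline
}
print(triage(window))  # network-congestion
```

The periodic utilization drop is deliberately not an input to any rule, since it is the symptom being explained rather than evidence for one layer.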
Topic: AI Infrastructure Operations and Troubleshooting
During a staged expansion of an on-premises AI training cluster, jobs that span the original and newly added racks show lower all-reduce throughput and intermittent step-time spikes. Intersight reports GPU and server health as normal, and storage latency is unchanged.
Telemetry summary:
| Observation | Original racks | New racks |
|---|---|---|
| RoCE PFC counters | Stable | Increasing rapidly |
| ECN marking on AI traffic class | Enabled | Not enabled |
| Link utilization | Balanced | Bursty on uplinks |
Which operations action best supports continued scaling while preserving reliability and performance?
Options:
A. Move the training dataset to a higher-capacity storage tier
B. Align RoCE QoS on the new racks and rerun a staged scale test
C. Disable PFC on the original racks to match the new racks
D. Add more GPU nodes to reduce per-node workload pressure
Best answer: B
Explanation: For AI training workloads, scaling across racks depends on predictable east-west network behavior, especially for RoCE-based GPU communication. The evidence does not indicate a GPU health or storage bottleneck; it shows increasing PFC counters, missing ECN marking, and bursty uplink behavior on the newly added racks. The operationally safe action is to align the congestion-control and QoS policy for the AI traffic class across the fabric, then validate with a staged scale test before admitting larger jobs. This preserves reliability by avoiding pause storms or unfair congestion behavior and preserves performance by keeping collective communication stable as the cluster grows. Adding capacity without fixing the fabric inconsistency can make the scaling problem worse.
Topic: AI Fundamentals and Applications
A data center team is preparing an AI infrastructure proposal for mixed RAG inference and model fine-tuning. Before selecting a validated pod or deploying a new fabric, the team must visualize how compute, GPU, network, storage, and operations requirements fit together and create a shared planning view for application and infrastructure stakeholders. Which Cisco solution is the best fit for this phase?
Options:
A. Use Cisco AI Canvas for planning and visibility
B. Deploy Hyperfabric AI as the first step
C. Select an AI POD without further mapping
D. Tune GPU scheduling in the orchestrator
Best answer: A
Explanation: Cisco AI Canvas fits the planning and visibility stage of an AI infrastructure lifecycle. In this scenario, the team is not yet ready to deploy the fabric, commit to a specific validated AI POD, or tune runtime scheduling. The stated need is to create a shared view of workload requirements across compute, GPU, network, storage, and operations so stakeholders can make an informed infrastructure decision. AI Canvas is positioned for that kind of planning and operational visibility context, while AI PODs and Hyperfabric AI are more directly tied to validated infrastructure deployment and fabric implementation. The key is matching the Cisco AI solution to the current phase: plan and visualize first, then implement the selected architecture.
Use the Cisco 300-640 DCAI Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.