Try 10 focused Cisco 300-640 DCAI questions on AI Infrastructure Deployment and Data Management, with explanations, then continue with IT Mastery.
Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.
| Field | Detail |
|---|---|
| Exam route | Cisco 300-640 DCAI |
| Topic area | AI Infrastructure Deployment and Data Management |
| Blueprint weight | 30% |
| Page purpose | Focused sample questions before returning to mixed practice |
Use this page to isolate AI Infrastructure Deployment and Data Management for Cisco 300-640 DCAI. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.
| Pass | What to do | What to record |
|---|---|---|
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |
Blueprint context: 30% of the practice outline. A focused topic score can overstate readiness because you already know which topic every question tests, so treat this page as repair work before timed mixed sets.
These original IT Mastery practice questions are aligned to this topic area. Use them for self-assessment, scope review, and deciding what to drill next.
Topic: AI Infrastructure Deployment and Data Management
A team reports that a distributed GPU training job slows down only during the all-reduce phase. The RoCEv2 leaf-spine fabric uses ECMP, and Nexus Dashboard shows the following during the slowdown:
| Observation | Value |
|---|---|
| Leaf101 → Spine1 uplink | 96% utilization, rising ECN marks |
| Leaf101 → Spine2-4 uplinks | 18-25% utilization |
| Link status | No failures |
| Storage and GPU health | Normal |
Which issue should be investigated first?
Options:
A. Kubernetes GPU scheduling failure
B. ECMP hash polarization of long-lived GPU flows
C. Checkpoint storage saturation during training
D. Per-packet load balancing required for RoCEv2
Best answer: B
Explanation: Distributed training all-reduce creates large, long-lived east-west GPU-to-GPU flows. In an ECMP fabric, these flows are typically assigned to paths by a hash. If several elephant flows hash to the same uplink, one spine path can congest while other equal-cost paths remain underused. The telemetry shows exactly that pattern: one uplink is near saturation with ECN marks, while parallel uplinks are lightly used and endpoints are healthy. The first validation should focus on flow distribution, hashing entropy, and whether the fabric supports safer dynamic or flowlet-based load distribution for this traffic. Per-packet load balancing is not the right conclusion because packet reordering can hurt RDMA/RoCEv2 performance.
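The polarization pattern described in the explanation can be illustrated with a small simulation. This is a sketch under stated assumptions: the SHA-256 based `pick_uplink` function, the addresses, and the 50 Gbps flow sizes are hypothetical stand-ins, not the hash or traffic of any real switch. Only UDP destination port 4791 (the RoCEv2 port) is a real value.

```python
import hashlib

def pick_uplink(five_tuple, n_uplinks):
    """Pick an uplink from a flow's 5-tuple with a stable hash.

    Illustrative only: real switches use their own hardware hash functions.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_uplinks

def uplink_load(flows, n_uplinks):
    """Sum offered load (Gbps) per uplink for flows given as (five_tuple, gbps)."""
    load = [0.0] * n_uplinks
    for five_tuple, gbps in flows:
        load[pick_uplink(five_tuple, n_uplinks)] += gbps
    return load

def imbalance(load):
    """Ratio of the busiest uplink to the mean; 1.0 means a perfectly even spread."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0

# Hypothetical workload: 8 long-lived RoCEv2 flows (UDP dst port 4791), 50 Gbps each.
flows = [
    ((f"10.0.1.{i}", f"10.0.2.{i}", 17, 49152 + i, 4791), 50.0)
    for i in range(1, 9)
]
load = uplink_load(flows, n_uplinks=4)
print("Gbps per uplink:", load)
print("max/mean imbalance:", round(imbalance(load), 2))
```

Each flow stays pinned to one uplink for its lifetime, so a handful of elephant flows can easily land unevenly. That is why per-flow distribution is checked before compute or storage settings are changed.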
Topic: AI Infrastructure Deployment and Data Management
A team is deploying Cisco UCS GPU servers for distributed AI training. The workload must boot from a shared FC SAN LUN and use local NVMe devices as high-throughput scratch space for checkpoints. The server profile template currently uses a storage policy built for stateless inference nodes with no local disk claim and no FC boot path.
Which design choice best maps to the requirement?
Options:
A. Add more GPU memory to each training node profile.
B. Move checkpoints to the orchestration control-plane datastore.
C. Assign a storage policy that defines FC boot and local NVMe scratch use.
D. Increase the RoCE traffic class bandwidth for training traffic.
Best answer: C
Explanation: A UCS storage policy mismatch can prevent an AI node from being provisioned correctly or can place I/O on the wrong data path. In this scenario, the requirement is explicit: FC SAN boot plus local NVMe scratch for checkpoint-heavy training. A profile built for stateless inference does not claim local disks and does not provide the FC boot path, so it cannot satisfy the deployment requirements. The design should align the UCS storage policy and related profile settings with the intended storage data paths before workload placement. Network QoS or GPU changes may help other bottlenecks, but they do not correct a missing storage path.
Topic: AI Infrastructure Deployment and Data Management
A team adds a second rack of GPU servers to an AI training cluster. The original rack is healthy, but jobs that span both racks show high all-reduce latency. Hyperfabric shows the new switches and links as discovered, and the intended fabric design includes both racks, but the status is Pending deployment with compliance drift on the new rack. What is the best next action?
Options:
A. Manually configure matching VLANs on the new switches
B. Change the UCS power policy to maximum performance
C. Reschedule all training pods to the original rack
D. Deploy the pending Hyperfabric intent and verify compliance
Best answer: D
Explanation: Hyperfabric is used to simplify AI fabric deployment and lifecycle operations by applying the intended fabric design consistently across discovered infrastructure. In this case, the new rack is physically discovered and included in the design, but the fabric state is still pending deployment and showing compliance drift. That supports a management-plane remediation: deploy the pending intent and confirm the fabric returns to compliance before tuning unrelated compute or orchestration settings.
Manual switch changes can create more drift from the intended design. The key takeaway is to use Hyperfabric as the source of truth for fabric rollout and validation when simplified AI fabric lifecycle management is the requirement.
Topic: AI Infrastructure Deployment and Data Management
A Kubernetes-based AI training cluster reaches its NVMe storage for datasets and checkpoints over RoCEv2. A new data-management rule requires every training job to write a full checkpoint to the shared storage every 10 minutes. After the change, jobs pause during checkpoints and GPU utilization drops, but the storage controllers report spare IOPS and no media latency alerts.
Telemetry during checkpoint window
| Signal | Observation |
|---|---|
| Nexus Dashboard | ECN marks and PFC pause frames spike |
| Intersight | No GPU or CPU faults |
| Storage | Queue depth normal |
Which remediation is most supported by these facts?
Options:
A. Increase storage controller cache capacity
B. Add GPU workers to each training job
C. Tune fabric QoS for RoCE checkpoint traffic
D. Change the Kubernetes image pull policy
Best answer: C
Explanation: The data-management change added periodic large checkpoint writes, and the telemetry points to the fabric rather than compute or storage. ECN marks and PFC pause frames increasing during the checkpoint window indicate congestion on the RoCE path. Because storage queue depth and media latency are normal, the storage system is not the bottleneck. Because Intersight shows no GPU or CPU faults, adding compute does not address the observed pause. The appropriate infrastructure response is to validate and tune the fabric traffic classes that carry checkpoint traffic, such as QoS, ETS bandwidth allocation, ECN behavior, and lossless handling for RoCEv2.
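A rough sizing check shows why a periodic full checkpoint can congest the RoCE path even when storage is healthy. The sketch below is arithmetic only. The checkpoint size, write window, link speed, and ETS share are hypothetical values, not figures from the scenario.

```python
def checkpoint_gbps(checkpoint_gb, write_window_s):
    """Average rate in Gbps needed to write checkpoint_gb gigabytes in write_window_s seconds."""
    return checkpoint_gb * 8 / write_window_s  # 8 bits per byte

def fits_guaranteed_share(required_gbps, link_gbps, ets_share):
    """True if the required rate fits within the class's guaranteed share of the link."""
    return required_gbps <= link_gbps * ets_share

# Hypothetical numbers: 500 GB checkpoint, 60 s target window,
# 100 Gbps link, 50% of bandwidth guaranteed to the checkpoint class.
required = checkpoint_gbps(500, 60)
print(f"required: {required:.1f} Gbps")                     # required: 66.7 Gbps
print("fits:", fits_guaranteed_share(required, 100, 0.5))   # fits: False
```

When the required rate exceeds what the class is guaranteed under contention, the checkpoint burst competes with other RoCE traffic. That would be consistent with ECN marks and PFC pause frames rising only during the checkpoint window.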
Topic: AI Infrastructure Deployment and Data Management
A Cisco UCS GPU cluster will run distributed fine-tuning jobs. The OS must boot from local M.2 RAID-1, but training datasets and checkpoints must use an existing dual-fabric Fibre Channel array with host multipathing. The current UCS policy set defines only the local disk storage policy and LAN vNICs, so the hosts cannot see the training LUNs. Which design correction best supports the stated AI workload data path?
Options:
A. Mount the array through the management network using NFS
B. Keep local boot and add dual-fabric FC vHBAs for data LUNs
C. Add only a LAN QoS policy for RoCE-enabled storage traffic
D. Convert the local M.2 RAID-1 devices into shared training storage
Best answer: B
Explanation: The workload has two distinct storage paths: local M.2 RAID-1 for operating system boot and dual-fabric Fibre Channel SAN for the AI data path. Correcting the UCS design means retaining the local disk policy for boot while adding the SAN connectivity needed by the training LUNs, such as vHBAs mapped to both FC fabrics, proper VSAN association, and downstream zoning and LUN masking. That supports multipathing and keeps the high-volume dataset and checkpoint traffic on the required FC storage path. A LAN-only or local-disk-only change does not make the SAN LUNs visible to the hosts.
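The dual-fabric requirement in the explanation reduces to a simple check: every host profile needs at least one vHBA on each FC fabric. The sketch below models a profile as a plain list of dictionaries. The field names, vHBA names, and VSAN IDs are hypothetical and do not reflect any Cisco API.

```python
def missing_fabrics(vhbas, required=("A", "B")):
    """Return the required FC fabrics that have no vHBA in this profile."""
    present = {vhba["fabric"] for vhba in vhbas}
    return [fabric for fabric in required if fabric not in present]

# Current design: local disk policy and LAN vNICs only, so no vHBAs exist.
profile_before = []

# Corrected design: one vHBA per fabric (hypothetical names and VSAN IDs).
profile_after = [
    {"name": "vhba-a", "fabric": "A", "vsan": 100},
    {"name": "vhba-b", "fabric": "B", "vsan": 200},
]

print(missing_fabrics(profile_before))  # ['A', 'B']
print(missing_fabrics(profile_after))   # []
```

An empty result only shows that the host side is defined. Zoning and LUN masking on the SAN still decide whether the training LUNs become visible.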
Topic: AI Infrastructure Deployment and Data Management
A team is deploying Cisco UCS GPU servers for a new generative AI training cluster. The workload has short periods where all GPUs draw near peak power, and the operations team requires PSU redundancy. The current template uses a conservative capped power policy from a CPU-only cluster, and several GPU profiles remain unassociated during staging. Which design best maps to these requirements?
Options:
A. Use an uncapped performance power policy and verify redundant power capacity
B. Use a low-power policy to reduce thermal load
C. Keep the capped policy and increase storage IOPS
D. Leave power unchanged and tune RoCE QoS classes
Best answer: A
Explanation: Power policy selection directly affects GPU workload readiness when the policy caps power or cannot reserve enough of it for the server configuration. GPU training workloads often have bursty, high power demand, so a policy inherited from CPU-only servers may cause component throttling or prevent profile association if redundant power cannot be guaranteed. The design should allow the GPU nodes to draw the required peak power and confirm that chassis and rack power and cooling can support that draw under the required redundancy model.
The key takeaway is that power policy is not just an efficiency setting for dense GPU deployments; it is a deployment and performance prerequisite.
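The redundancy check described above can be worked through as a power budget. This is a minimal sketch with hypothetical values: the PSU count, PSU rating, GPU peak draw, and the redundancy labels used as strings are illustrative, not taken from a specific server model.

```python
def usable_watts(psu_count, psu_watts, redundancy):
    """Power available to the load while still honoring the redundancy model."""
    if redundancy == "n+1":    # survive the loss of any one supply
        return (psu_count - 1) * psu_watts
    if redundancy == "grid":   # survive the loss of one feed (half the supplies)
        return (psu_count // 2) * psu_watts
    if redundancy == "none":   # no supply held in reserve
        return psu_count * psu_watts
    raise ValueError(f"unknown redundancy model: {redundancy}")

def peak_fits(peak_watts, psu_count, psu_watts, redundancy):
    """True if the peak draw fits inside the redundant power budget."""
    return peak_watts <= usable_watts(psu_count, psu_watts, redundancy)

# Hypothetical server: 8 GPUs at 700 W peak plus 2000 W for CPUs, memory, fans, and NICs.
peak = 8 * 700 + 2000  # 7600 W

print(peak_fits(peak, psu_count=4, psu_watts=3000, redundancy="n+1"))   # True  (9000 W usable)
print(peak_fits(peak, psu_count=4, psu_watts=3000, redundancy="grid"))  # False (6000 W usable)
```

The same hardware passes under one redundancy model and fails under another, which is why redundant capacity is verified together with the power policy rather than after it.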
Topic: AI Infrastructure Deployment and Data Management
An operations team is adding a second Cisco UCS domain to an AI cluster that runs distributed GPU training. The scheduler can place jobs on either domain, but training requires consistent RoCEv2 QoS behavior and aligned timestamps for log correlation. Intersight shows:
| UCS domain | Domain profile state | Relevant evidence |
|---|---|---|
| Domain-A | AI-Domain-Profile compliant | RoCE QoS class, port/VLAN policy, NTP |
| Domain-B | No assigned profile | Default QoS, manual ports, local time |
Which infrastructure decision is BEST before allowing workload placement on Domain-B?
Options:
A. Increase storage IOPS for datasets used by Domain-B jobs
B. Assign the approved UCS domain profile to Domain-B and verify compliance
C. Add scheduler labels to Domain-B and place only smaller training jobs
D. Create server profiles with larger GPU reservations for Domain-B
Best answer: B
Explanation: UCS domain profiles are used to make fabric-level settings consistent across UCS domains, which is critical when AI workloads can be placed on multiple domains. In this scenario, the placement decision depends on whether Domain-B has the same operational baseline as Domain-A: RoCE QoS behavior for distributed training traffic, consistent port/VLAN configuration, and NTP for useful log correlation. Assigning the approved domain profile and checking compliance gives both enforcement and troubleshooting evidence. Scheduler labels alone do not fix fabric drift, and server-level or storage changes do not address the observed domain-wide inconsistencies.
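The compliance comparison in the explanation amounts to diffing each domain against an approved baseline. The sketch below uses plain dictionaries. The setting names and values are hypothetical labels for the evidence in the table, not Intersight fields.

```python
def find_drift(baseline, observed):
    """Return {setting: (expected, actual)} for every setting that differs from the baseline."""
    return {
        key: (expected, observed.get(key))
        for key, expected in baseline.items()
        if observed.get(key) != expected
    }

# Hypothetical baseline modeled on Domain-A's compliant profile.
baseline = {
    "qos_class": "roce-lossless",
    "port_vlan_policy": "ai-fabric",
    "time_source": "ntp",
}

# Hypothetical observed state for Domain-B, which has no assigned profile.
domain_b = {
    "qos_class": "default",
    "port_vlan_policy": "manual",
    "time_source": "local",
}

for setting, (expected, actual) in find_drift(baseline, domain_b).items():
    print(f"{setting}: expected {expected!r}, found {actual!r}")
```

A non-empty result is a reason to hold workload placement until the profile is assigned and the diff comes back empty.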
Topic: AI Infrastructure Deployment and Data Management
An AI training cluster attached to an ACI fabric reports intermittent NCCL timeouts during all-reduce operations. GPU health and storage latency are normal.
Exhibit:
| Source | Observation |
|---|---|
| Nexus Dashboard | ECN marks and output drops on leaf-to-spine links |
| Nexus Dashboard | High-volume RoCEv2 flows between GPU node EPGs |
| APIC | AI tenant EPGs use the default QoS class |
| APIC | No RoCEv2 classification or lossless policy is applied |
Which remediation is best supported by the observations?
Options:
A. Deploy APIC QoS/PFC/ECN policy for RoCEv2 traffic
B. Disable Nexus Dashboard telemetry collection
C. Increase Kubernetes GPU replica counts
D. Move the training dataset to local SSD storage
Best answer: A
Explanation: Nexus Dashboard provides the fabric visibility needed to identify congestion patterns, while APIC is the policy point for deploying fabric behavior. In this case, the evidence narrows the issue to RoCEv2 traffic between GPU node EPGs: GPUs and storage are healthy, but Nexus Dashboard reports ECN marks and output drops, and APIC shows the traffic remains in the default QoS class. The appropriate remediation is to classify the AI/RoCEv2 traffic and apply the required congestion-management and lossless handling policy through APIC, then validate improvement with Nexus Dashboard telemetry. Scaling compute or changing storage does not address the observed fabric policy gap.
Topic: AI Infrastructure Deployment and Data Management
A team is onboarding a distributed GPU training workload that will use RoCEv2 for east-west parameter exchange across a leaf-spine fabric. The requirement is to confirm that the fabric is ready for low-latency, loss-sensitive RDMA traffic before production jobs are scheduled. Which validation evidence best confirms RoCE readiness?
Options:
A. All switch interfaces are up at the expected speed with matching MTU values
B. The GPU servers pass a CUDA compute benchmark without fabric congestion alerts
C. RoCEv2 test traffic meets latency and throughput targets with no RDMA class drops and expected ECN/CNP behavior
D. The storage array reports sufficient capacity and successful file-system mounts
Best answer: C
Explanation: RoCE readiness is confirmed by validating the data path that RDMA traffic will actually use. For an AI training workload, the evidence should show that RoCEv2 flows meet the required latency and throughput while the loss-sensitive traffic class avoids drops. It should also confirm that congestion management is working, such as ECN marking and CNP response under controlled congestion. Link state, MTU, GPU health, and storage capacity are useful checks, but they do not prove that the end-to-end RoCE fabric can carry RDMA traffic reliably under workload-like conditions.
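The readiness criteria in the explanation can be written as an explicit pass/fail check over test results. This is a sketch, assuming the results were already collected during a controlled congestion test. The `RoceTestResult` fields and the thresholds are hypothetical and not tied to any particular test tool.

```python
from dataclasses import dataclass

@dataclass
class RoceTestResult:
    p99_latency_us: float     # 99th percentile latency of the test traffic
    throughput_gbps: float    # achieved RoCEv2 throughput
    rdma_class_drops: int     # drops counted in the RDMA traffic class
    ecn_marked_packets: int   # ECN marks seen under controlled congestion
    cnp_packets: int          # congestion notification packets sent in response

def readiness_failures(result, max_p99_us, min_gbps):
    """Return a list of failed criteria; an empty list means the fabric passed."""
    failures = []
    if result.p99_latency_us > max_p99_us:
        failures.append("latency above target")
    if result.throughput_gbps < min_gbps:
        failures.append("throughput below target")
    if result.rdma_class_drops > 0:
        failures.append("drops in RDMA class")
    if result.ecn_marked_packets == 0 or result.cnp_packets == 0:
        failures.append("no ECN/CNP response under congestion")
    return failures

# Hypothetical results and targets.
good = RoceTestResult(12.0, 380.0, 0, 1500, 1400)
bad = RoceTestResult(12.0, 380.0, 42, 0, 0)

print(readiness_failures(good, max_p99_us=20.0, min_gbps=350.0))  # []
print(readiness_failures(bad, max_p99_us=20.0, min_gbps=350.0))
```

Link state, MTU, GPU health, and storage capacity do not appear in the check. They are preconditions for running the test, not evidence that the RDMA path behaves correctly under load.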
Topic: AI Infrastructure Deployment and Data Management
A team onboarded eight GPU servers with Intersight. Server inventory, firmware compliance, and health are green. Kubernetes shows the training pods are running, but NCCL all-reduce performance drops sharply when jobs span racks. Switch telemetry shows congestion on the GPU east-west path, while storage latency remains normal. Which action is most supported by these observations?
Options:
A. Create a new Intersight firmware compliance policy
B. Use Hyperfabric to validate and remediate the AI fabric design
C. Refresh Intersight inventory for the GPU servers
D. Expand the storage volume used by the training dataset
Best answer: B
Explanation: Hyperfabric is used to deploy and optimize the AI network fabric that carries high-bandwidth, low-latency GPU-to-GPU traffic. In this scenario, server lifecycle state, firmware compliance, and orchestration state are healthy, but cross-rack collective communication degrades and switch telemetry shows congestion on the east-west path. That evidence supports validating the AI fabric design, connectivity, and congestion behavior rather than changing server lifecycle management settings. Intersight remains valuable for visibility, inventory, health, and lifecycle management, but those functions do not by themselves fix an underdesigned or congested AI fabric.
Use the Cisco 300-640 DCAI Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.