Cisco 300-640 DCAI: AI Infrastructure Deployment and Data Management

Try 10 focused Cisco 300-640 DCAI questions on AI Infrastructure Deployment and Data Management, with explanations, then continue with IT Mastery.

Open the matching IT Mastery practice page for timed mocks, topic drills, progress tracking, explanations, and full practice.

Topic snapshot

| Field | Detail |
| --- | --- |
| Exam route | Cisco 300-640 DCAI |
| Topic area | AI Infrastructure Deployment and Data Management |
| Blueprint weight | 30% |
| Page purpose | Focused sample questions before returning to mixed practice |

How to use this topic drill

Use this page to isolate AI Infrastructure Deployment and Data Management for Cisco 300-640 DCAI. Work through the 10 questions first, then review the explanations and return to mixed practice in IT Mastery.

| Pass | What to do | What to record |
| --- | --- | --- |
| First attempt | Answer without checking the explanation first. | The fact, rule, calculation, or judgment point that controlled your answer. |
| Review | Read the explanation even when you were correct. | Why the best answer is stronger than the closest distractor. |
| Repair | Repeat only missed or uncertain items after a short break. | The pattern behind misses, not the answer letter. |
| Transfer | Return to mixed practice once the topic feels stable. | Whether the same skill holds up when the topic is no longer obvious. |

Blueprint context: 30% of the practice outline. A focused topic score can overstate readiness if you recognize the pattern too quickly, so use it as repair work before timed mixed sets.

Sample questions

These original IT Mastery practice questions are aligned to this topic area. Use them for self-assessment, scope review, and deciding what to drill next.

Question 1

Topic: AI Infrastructure Deployment and Data Management

A team reports that a distributed GPU training job slows down only during the all-reduce phase. The RoCEv2 leaf-spine fabric uses ECMP, and Nexus Dashboard shows the following during the slowdown:

| Observation | Value |
| --- | --- |
| Leaf101 → Spine1 uplink | 96% utilization, rising ECN marks |
| Leaf101 → Spine2-4 uplinks | 18-25% utilization |
| Link status | No failures |
| Storage and GPU health | Normal |

Which issue should be investigated first?

Options:

  • A. Kubernetes GPU scheduling failure

  • B. ECMP hash polarization of long-lived GPU flows

  • C. Checkpoint storage saturation during training

  • D. Per-packet load balancing required for RoCEv2

Best answer: B

Explanation: Distributed training all-reduce creates large, long-lived east-west GPU-to-GPU flows. In an ECMP fabric, these flows are typically assigned to paths by a hash. If several elephant flows hash to the same uplink, one spine path can congest while other equal-cost paths remain underused. The telemetry shows exactly that pattern: one uplink is near saturation with ECN marks, while parallel uplinks are lightly used and endpoints are healthy. The first validation should focus on flow distribution, hashing entropy, and whether the fabric supports safer dynamic or flowlet-based load distribution for this traffic. Per-packet load balancing is not the right conclusion because packet reordering can hurt RDMA/RoCEv2 performance.

  • Storage saturation does not fit because storage health is normal and the slowdown occurs during all-reduce, not checkpoint I/O.
  • GPU scheduling failure does not fit because the job is running and the symptom appears during a network-heavy training phase.
  • Per-packet balancing is risky for RoCEv2 because reordering can degrade RDMA behavior even if it fills links.
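
The polarization pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not how any Cisco platform computes its ECMP hash: `pick_uplink`, `uplink_load`, and `looks_polarized` are hypothetical helpers, CRC32 stands in for the switch hash, the flow addresses and the 100 Gbps rate are made up, and the 90%/30% thresholds are assumptions chosen only to match the shape of the exhibit.

```python
import zlib

ROCE_V2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def pick_uplink(flow, n_uplinks):
    """Map a flow 5-tuple to an uplink index with a static hash (ECMP-style)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_uplinks


def uplink_load(flows, n_uplinks):
    """Sum the offered load (Gbps) that the hash places on each uplink."""
    load = [0.0] * n_uplinks
    for flow, gbps in flows:
        load[pick_uplink(flow, n_uplinks)] += gbps
    return load


def looks_polarized(util_pct, hot=90.0, cold=30.0):
    """True when at least one uplink is hot and every other uplink is cold."""
    hot_links = [u for u in util_pct if u >= hot]
    others = [u for u in util_pct if u < hot]
    return bool(hot_links) and bool(others) and all(u <= cold for u in others)


# Eight hypothetical long-lived GPU flows:
# (src_ip, dst_ip, protocol, src_port, dst_port), offered Gbps
flows = [
    ((f"10.0.1.{i}", f"10.0.2.{i}", "udp", 49152 + i, ROCE_V2_UDP_PORT), 100.0)
    for i in range(1, 9)
]
print(uplink_load(flows, 4))  # per-uplink Gbps; depends on how the hash lands

# Utilization shaped like the exhibit. The three low values are illustrative
# picks inside the stated 18-25% range.
print(looks_polarized([96.0, 18.0, 22.0, 25.0]))  # True
print(looks_polarized([60.0, 58.0, 62.0, 61.0]))  # False
```

With only a handful of large flows, a static per-flow hash has few chances to even out, which is why the explanation points at flow distribution and hashing entropy before anything else.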

Question 2

Topic: AI Infrastructure Deployment and Data Management

A team is deploying Cisco UCS GPU servers for distributed AI training. The workload must boot from a shared FC SAN LUN and use local NVMe devices as high-throughput scratch space for checkpoints. The server profile template currently uses a storage policy built for stateless inference nodes with no local disk claim and no FC boot path.

Which design choice best maps to the requirement?

Options:

  • A. Add more GPU memory to each training node profile.

  • B. Move checkpoints to the orchestration control-plane datastore.

  • C. Assign a storage policy that defines FC boot and local NVMe scratch use.

  • D. Increase the RoCE traffic class bandwidth for training traffic.

Best answer: C

Explanation: A UCS storage policy mismatch can prevent an AI node from being provisioned correctly or can place I/O on the wrong data path. In this scenario, the requirement is explicit: FC SAN boot plus local NVMe scratch for checkpoint-heavy training. A profile built for stateless inference does not claim local disks and does not provide the FC boot path, so it cannot satisfy the deployment requirements. The design should align the UCS storage policy and related profile settings with the intended storage data paths before workload placement. Network QoS or GPU changes may help other bottlenecks, but they do not correct a missing storage path.

  • RoCE tuning addresses network transport behavior, not the missing FC boot path or local NVMe claim.
  • More GPU memory changes compute capacity but does not make required storage devices available to the profile.
  • The control-plane datastore is not an appropriate high-throughput checkpoint target for GPU training nodes.

Question 3

Topic: AI Infrastructure Deployment and Data Management

A team adds a second rack of GPU servers to an AI training cluster. The original rack is healthy, but jobs that span both racks show high all-reduce latency. Hyperfabric shows the new switches and links as discovered, and the intended fabric design includes both racks, but the status is Pending deployment with compliance drift on the new rack. What is the best next action?

Options:

  • A. Manually configure matching VLANs on the new switches

  • B. Change the UCS power policy to maximum performance

  • C. Reschedule all training pods to the original rack

  • D. Deploy the pending Hyperfabric intent and verify compliance

Best answer: D

Explanation: Hyperfabric is used to simplify AI fabric deployment and lifecycle operations by applying the intended fabric design consistently across discovered infrastructure. In this case, the new rack is physically discovered and included in the design, but the fabric state is still pending deployment and showing compliance drift. That supports a management-plane remediation: deploy the pending intent and confirm the fabric returns to compliance before tuning unrelated compute or orchestration settings.

Manual switch changes can create more drift from the intended design. The key takeaway is to use Hyperfabric as the source of truth for fabric rollout and validation when simplified AI fabric lifecycle management is the requirement.

  • Manual switch changes may bypass the declared fabric intent and worsen compliance drift.
  • Power policy tuning does not address a pending network fabric deployment state.
  • Pod rescheduling avoids the affected rack but does not remediate the fabric lifecycle issue.

Question 4

Topic: AI Infrastructure Deployment and Data Management

A Kubernetes-based AI training cluster uses RoCEv2-attached NVMe storage for datasets and checkpoints. A new data-management rule requires every training job to write full checkpoints to the shared storage every 10 minutes. After the change, jobs pause during checkpoints and GPU utilization drops, but storage controllers report spare IOPS and no media latency alerts.

Telemetry during checkpoint window

| Signal | Observation |
| --- | --- |
| Nexus Dashboard | ECN marks and PFC pause frames spike |
| Intersight | No GPU or CPU faults |
| Storage | Queue depth normal |

Which remediation is most supported by these facts?

Options:

  • A. Increase storage controller cache capacity

  • B. Add GPU workers to each training job

  • C. Tune fabric QoS for RoCE checkpoint traffic

  • D. Change the Kubernetes image pull policy

Best answer: C

Explanation: The data-management change added periodic large checkpoint writes, and the telemetry points to the fabric rather than compute or storage. ECN marks and PFC pause frames increasing during the checkpoint window indicate congestion on the RoCE path. Because storage queue depth and media latency are normal, the storage system is not the bottleneck. Because Intersight shows no GPU or CPU faults, adding compute does not address the observed pause. The appropriate infrastructure response is to validate and tune the fabric traffic classes that carry checkpoint traffic, such as QoS, ETS bandwidth allocation, ECN behavior, and lossless handling for RoCEv2.

  • Adding more GPUs does not address the pause-frame and ECN evidence showing congestion in the network path.
  • A larger storage cache is not supported because the storage controllers show normal queue depth and no media latency alerts.
  • Image pulls affect container startup, not recurring checkpoint stalls during active training.
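
The elimination reasoning in this explanation can be written as a small triage function. This is only a sketch of the logic, assuming hypothetical field names (`CheckpointWindow`, `likely_bottleneck`) and a fixed check order. It is not a Nexus Dashboard or Intersight API.

```python
from dataclasses import dataclass


@dataclass
class CheckpointWindow:
    """Signals observed while a checkpoint is being written (hypothetical fields)."""
    ecn_marks_spiking: bool
    pfc_pauses_spiking: bool
    storage_queue_depth_normal: bool
    media_latency_alerts: bool
    compute_faults: bool


def likely_bottleneck(w):
    """Return the layer the evidence points to, ruling out layers in order."""
    if w.compute_faults:
        return "compute"
    if w.media_latency_alerts or not w.storage_queue_depth_normal:
        return "storage"
    if w.ecn_marks_spiking or w.pfc_pauses_spiking:
        return "fabric"
    return "inconclusive"


# Values taken from the scenario: fabric congestion signals, healthy endpoints.
scenario = CheckpointWindow(
    ecn_marks_spiking=True,
    pfc_pauses_spiking=True,
    storage_queue_depth_normal=True,
    media_latency_alerts=False,
    compute_faults=False,
)
print(likely_bottleneck(scenario))  # fabric
```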

Question 5

Topic: AI Infrastructure Deployment and Data Management

A Cisco UCS GPU cluster will run distributed fine-tuning jobs. The OS must boot from local M.2 RAID-1, but training datasets and checkpoints must use an existing dual-fabric Fibre Channel array with host multipathing. The current UCS policy set defines only the local disk storage policy and LAN vNICs, so the hosts cannot see the training LUNs. Which design correction best supports the stated AI workload data path?

Options:

  • A. Mount the array through the management network using NFS

  • B. Keep local boot and add dual-fabric FC vHBAs for data LUNs

  • C. Add only a LAN QoS policy for RoCE-enabled storage traffic

  • D. Convert the local M.2 RAID-1 devices into shared training storage

Best answer: B

Explanation: The workload has two distinct storage paths: local M.2 RAID-1 for operating system boot and dual-fabric Fibre Channel SAN for the AI data path. Correcting the UCS design means retaining the local disk policy for boot while adding the SAN connectivity needed by the training LUNs, such as vHBAs mapped to both FC fabrics, proper VSAN association, and downstream zoning and LUN masking. That supports multipathing and keeps the high-volume dataset and checkpoint traffic on the required FC storage path. A LAN-only or local-disk-only change does not make the SAN LUNs visible to the hosts.

  • Local-only storage fails because M.2 boot devices do not provide the required shared SAN path for datasets and checkpoints.
  • Management NFS violates the stated data path and would move training traffic onto the wrong network.
  • A LAN-only QoS change may help Ethernet storage designs, but it does not create Fibre Channel host access to SAN LUNs.
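
The correction can be pictured as a check that a host profile carries one vHBA per FC fabric while leaving local boot untouched. The sketch below uses a made-up data model (`Vhba`, `HostProfile`, `missing_fabrics`) and made-up VSAN IDs. It does not represent the Intersight or UCS Manager object model.

```python
from dataclasses import dataclass, field


@dataclass
class Vhba:
    name: str
    fabric: str   # "A" or "B"
    vsan_id: int  # hypothetical VSAN number


@dataclass
class HostProfile:
    boot_device: str
    vhbas: list = field(default_factory=list)


def missing_fabrics(profile, required=("A", "B")):
    """Return the FC fabrics that have no vHBA in this profile."""
    present = {v.fabric for v in profile.vhbas}
    return [f for f in required if f not in present]


# Current state from the scenario: local boot policy only, no SAN connectivity.
current = HostProfile(boot_device="local-m2-raid1")

# Corrected state: same local boot, plus one vHBA on each fabric.
corrected = HostProfile(
    boot_device="local-m2-raid1",
    vhbas=[Vhba("vhba0", "A", 101), Vhba("vhba1", "B", 102)],
)

print(missing_fabrics(current))    # ['A', 'B']
print(missing_fabrics(corrected))  # []
```

An empty result only shows that both fabrics have an initiator. Zoning and LUN masking on the SAN side still decide whether the training LUNs are actually visible.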

Question 6

Topic: AI Infrastructure Deployment and Data Management

A team is deploying Cisco UCS GPU servers for a new generative AI training cluster. The workload has short periods where all GPUs draw near peak power, and the operations team requires PSU redundancy. The current template uses a conservative capped power policy from a CPU-only cluster, and several GPU profiles remain unassociated during staging. Which design best maps to these requirements?

Options:

  • A. Use an uncapped performance power policy and verify redundant power capacity

  • B. Use a low-power policy to reduce thermal load

  • C. Keep the capped policy and increase storage IOPS

  • D. Leave power unchanged and tune RoCE QoS classes

Best answer: A

Explanation: Power policy selection can directly affect GPU workload readiness when the policy caps power below what the server configuration needs or cannot reserve enough power for it. GPU training workloads often have bursty, high power demand, so a policy inherited from CPU-only servers may cause component throttling or prevent profile association if redundant power cannot be guaranteed. The design should allow the GPU nodes to draw the required peak power and confirm that chassis/rack power and cooling can support that draw under the required redundancy model.

The key takeaway is that power policy is not just an efficiency setting for dense GPU deployments; it is a deployment and performance prerequisite.

  • Storage tuning does not resolve a power allocation or association problem caused by an undersized cap.
  • Low-power mode may worsen GPU throttling and conflicts with the near-peak training requirement.
  • RoCE QoS tuning helps lossless fabric behavior but does not make additional server power available.
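
The "verify redundant power capacity" step is simple arithmetic: the peak draw must fit within the capacity that remains after the redundant supplies are set aside. The sketch below assumes hypothetical numbers throughout (four 3000 W PSUs, eight GPUs at 700 W, 2400 W for the rest of the server). Real values come from the platform's power specifications.

```python
def usable_watts(psu_count, psu_watts, redundant_psus):
    """Capacity left if `redundant_psus` supplies are held in reserve."""
    return max(psu_count - redundant_psus, 0) * psu_watts


def fits_budget(peak_draw_watts, psu_count, psu_watts, redundant_psus=1):
    """True when the peak draw fits within the redundant capacity."""
    return peak_draw_watts <= usable_watts(psu_count, psu_watts, redundant_psus)


# Hypothetical GPU node: 8 GPUs at 700 W each plus 2400 W for CPUs, fans, NICs.
peak_draw = 8 * 700 + 2400  # 8000 W

# Four 3000 W supplies with one held in reserve (N+1): 9000 W usable.
print(fits_budget(peak_draw, psu_count=4, psu_watts=3000, redundant_psus=1))  # True

# Same supplies with half held in reserve (N+N): 6000 W usable.
print(fits_budget(peak_draw, psu_count=4, psu_watts=3000, redundant_psus=2))  # False
```

The second case shows how a stricter redundancy model can leave a GPU node without enough guaranteed power even though the installed supplies look sufficient on paper.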

Question 7

Topic: AI Infrastructure Deployment and Data Management

An operations team is adding a second Cisco UCS domain to an AI cluster that runs distributed GPU training. The scheduler can place jobs on either domain, but training requires consistent RoCEv2 QoS behavior and aligned timestamps for log correlation. Intersight shows:

| UCS domain | Domain profile state | Relevant evidence |
| --- | --- | --- |
| Domain-A | AI-Domain-Profile compliant | RoCE QoS class, port/VLAN policy, NTP |
| Domain-B | No assigned profile | Default QoS, manual ports, local time |

Which infrastructure decision is BEST before allowing workload placement on Domain-B?

Options:

  • A. Increase storage IOPS for datasets used by Domain-B jobs

  • B. Assign the approved UCS domain profile to Domain-B and verify compliance

  • C. Add scheduler labels to Domain-B and place only smaller training jobs

  • D. Create server profiles with larger GPU reservations for Domain-B

Best answer: B

Explanation: UCS domain profiles are used to make fabric-level settings consistent across UCS domains, which is critical when AI workloads can be placed on multiple domains. In this scenario, the placement decision depends on whether Domain-B has the same operational baseline as Domain-A: RoCE QoS behavior for distributed training traffic, consistent port/VLAN configuration, and NTP for useful log correlation. Assigning the approved domain profile and checking compliance gives both enforcement and troubleshooting evidence. Scheduler labels alone do not fix fabric drift, and server-level or storage changes do not address the observed domain-wide inconsistencies.

  • Scheduler-only placement fails because labels can steer jobs but cannot enforce QoS, port, VLAN, or NTP consistency.
  • Server profile changes miss the domain-wide fabric settings that affect RoCE behavior and operational evidence.
  • Storage tuning is unsupported by the evidence; the issue is domain profile drift, not dataset I/O.
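
Checking compliance amounts to comparing each domain's settings against the approved baseline. The sketch below models that comparison with plain dictionaries. The keys and values are hypothetical labels derived from the exhibit, and `drift` is an illustrative helper rather than an Intersight call.

```python
def drift(baseline, observed):
    """Return {setting: (expected, actual)} for every mismatch."""
    return {
        key: (expected, observed.get(key))
        for key, expected in baseline.items()
        if observed.get(key) != expected
    }


# Approved baseline, expressed with made-up setting names.
baseline = {
    "qos_class": "roce",
    "port_vlan_policy": "approved",
    "time_source": "ntp",
}

domain_a = dict(baseline)  # compliant with the assigned profile
domain_b = {
    "qos_class": "default",
    "port_vlan_policy": "manual",
    "time_source": "local",
}

print(drift(baseline, domain_a))  # {}
print(sorted(drift(baseline, domain_b)))
# ['port_vlan_policy', 'qos_class', 'time_source']
```

A scheduler label would leave all three mismatches in place, which is why the explanation treats labels alone as insufficient.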

Question 8

Topic: AI Infrastructure Deployment and Data Management

An AI training cluster attached to an ACI fabric reports intermittent NCCL timeouts during all-reduce operations. GPU health and storage latency are normal.

Exhibit:

Nexus Dashboard: ECN marks and output drops on leaf-to-spine links
Nexus Dashboard: high-volume RoCEv2 flows between GPU node EPGs
APIC: AI tenant EPGs use the default QoS class
APIC: no RoCEv2 classification or lossless policy is applied

Which remediation is best supported by the observations?

Options:

  • A. Deploy APIC QoS/PFC/ECN policy for RoCEv2 traffic

  • B. Disable Nexus Dashboard telemetry collection

  • C. Increase Kubernetes GPU replica counts

  • D. Move the training dataset to local SSD storage

Best answer: A

Explanation: Nexus Dashboard provides the fabric visibility needed to identify congestion patterns, while APIC is the policy point for deploying fabric behavior. In this case, the evidence narrows the issue to RoCEv2 traffic between GPU node EPGs: GPUs and storage are healthy, but Nexus Dashboard reports ECN marks and output drops, and APIC shows the traffic remains in the default QoS class. The appropriate remediation is to classify the AI/RoCEv2 traffic and apply the required congestion-management and lossless handling policy through APIC, then validate improvement with Nexus Dashboard telemetry. Scaling compute or changing storage does not address the observed fabric policy gap.

  • More replicas can increase distributed traffic and does not fix congestion handling in the fabric.
  • Local SSDs target storage latency, but the storage telemetry is normal.
  • Disabling telemetry removes visibility and does not remediate drops or QoS misclassification.

Question 9

Topic: AI Infrastructure Deployment and Data Management

A team is onboarding a distributed GPU training workload that will use RoCEv2 for east-west parameter exchange across a leaf-spine fabric. The requirement is to confirm that the fabric is ready for low-latency, loss-sensitive RDMA traffic before production jobs are scheduled. Which validation evidence best confirms RoCE readiness?

Options:

  • A. All switch interfaces are up at the expected speed with matching MTU values

  • B. The GPU servers pass a CUDA compute benchmark without fabric congestion alerts

  • C. RoCEv2 test traffic meets latency and throughput targets with no RDMA class drops and expected ECN/CNP behavior

  • D. The storage array reports sufficient capacity and successful file-system mounts

Best answer: C

Explanation: RoCE readiness is confirmed by validating the data path that RDMA traffic will actually use. For an AI training workload, the evidence should show that RoCEv2 flows meet the required latency and throughput while the loss-sensitive traffic class avoids drops. It should also confirm that congestion management is working, such as ECN marking and CNP response under controlled congestion. Link state, MTU, GPU health, and storage capacity are useful checks, but they do not prove that the end-to-end RoCE fabric can carry RDMA traffic reliably under workload-like conditions.

  • Link checks alone miss congestion-control behavior and do not prove lossless or low-loss RDMA forwarding.
  • GPU benchmarking validates compute capability, not the RoCEv2 network path used for parameter exchange.
  • Storage readiness confirms a different infrastructure layer and does not validate east-west RDMA traffic.
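
The readiness evidence can be expressed as a pass/fail check over test results. This is a sketch under stated assumptions: the result fields, the targets, and the sample numbers are all hypothetical, and the result is assumed to come from a test run with deliberately induced congestion, so ECN marks and CNPs are expected to appear.

```python
from dataclasses import dataclass


@dataclass
class RoceTestResult:
    """Measurements from a RoCEv2 test run under induced congestion."""
    p99_latency_us: float
    throughput_gbps: float
    rdma_class_drops: int
    ecn_marked_packets: int
    cnp_received: int


def roce_ready(result, max_p99_latency_us, min_throughput_gbps):
    """Return (ready, reasons); reasons lists every failed criterion."""
    reasons = []
    if result.p99_latency_us > max_p99_latency_us:
        reasons.append("latency above target")
    if result.throughput_gbps < min_throughput_gbps:
        reasons.append("throughput below target")
    if result.rdma_class_drops > 0:
        reasons.append("drops in RDMA traffic class")
    if result.ecn_marked_packets == 0:
        reasons.append("no ECN marks under induced congestion")
    elif result.cnp_received == 0:
        reasons.append("ECN marks seen but no CNPs returned")
    return (not reasons, reasons)


good = RoceTestResult(12.0, 380.0, 0, 1500, 1400)
bad = RoceTestResult(12.0, 380.0, 42, 1500, 0)

print(roce_ready(good, max_p99_latency_us=20.0, min_throughput_gbps=360.0))
# (True, [])
print(roce_ready(bad, max_p99_latency_us=20.0, min_throughput_gbps=360.0))
# (False, ['drops in RDMA traffic class', 'ECN marks seen but no CNPs returned'])
```

Link state, MTU, GPU benchmarks, and storage mounts do not appear in the check because, as the explanation notes, they do not exercise the RDMA data path.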

Question 10

Topic: AI Infrastructure Deployment and Data Management

A team onboarded eight GPU servers with Intersight. Server inventory, firmware compliance, and health are green. Kubernetes shows the training pods are running, but NCCL all-reduce performance drops sharply when jobs span racks. Switch telemetry shows congestion on the GPU east-west path, while storage latency remains normal. Which action is most supported by these observations?

Options:

  • A. Create a new Intersight firmware compliance policy

  • B. Use Hyperfabric to validate and remediate the AI fabric design

  • C. Refresh Intersight inventory for the GPU servers

  • D. Expand the storage volume used by the training dataset

Best answer: B

Explanation: Hyperfabric is used to deploy and optimize the AI network fabric that carries high-bandwidth, low-latency GPU-to-GPU traffic. In this scenario, server lifecycle state, firmware compliance, and orchestration state are healthy, but cross-rack collective communication degrades and switch telemetry shows congestion on the east-west path. That evidence supports validating the AI fabric design, connectivity, and congestion behavior rather than changing server lifecycle management settings. Intersight remains valuable for visibility, inventory, health, and lifecycle management, but those functions do not by themselves fix an underdesigned or congested AI fabric.

  • Inventory refresh may update visibility, but the current observations already identify a fabric-path performance problem.
  • A new firmware policy is not supported because server health and compliance are already green.
  • Storage expansion does not fit because storage latency is normal and the symptom appears during cross-rack all-reduce traffic.

Continue with full practice

Use the Cisco 300-640 DCAI Practice Test page for the full IT Mastery practice bank, mixed-topic practice, timed mock exams, explanations, and web/mobile app access.

Free review resource

Use the full IT Mastery practice page above for the latest review links and practice page.

Revised on Thursday, May 28, 2026