300-640 DCAI — Cisco Implementing Data Center AI Infrastructure Quick Reference

Last revised: July 1, 2026

Compact Cisco 300-640 DCAI reference for AI data center fabrics, RoCEv2, QoS, Nexus/UCS operations, and troubleshooting.

Quick Reference purpose

Use this as an independent compact review for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI). Focus on implementation decisions: AI fabric design, RoCEv2 behavior, lossless Ethernet, Cisco Nexus operations, Cisco UCS/Intersight compute lifecycle, storage paths, automation, observability, and troubleshooting.

AI data center architecture map

Layer / plane	Typical Cisco-related components	What to know for 300-640 DCAI	Common exam trap
GPU compute	Cisco UCS C-Series / X-Series GPU-capable servers, VIC/NIC/HCA, GPU drivers, firmware policies	GPU-to-NIC locality, firmware compatibility, BIOS settings, power/cooling, lifecycle control	Treating GPU performance as only a server issue; network and storage often limit utilization
Backend AI fabric	Cisco Nexus leaf-spine, routed Ethernet, RoCEv2, PFC, ECN, QoS	Low latency, low loss, high east-west bandwidth, consistent QoS and MTU	Enabling PFC broadly instead of only the lossless RDMA class
Frontend / service fabric	Nexus leaf-spine, VLAN/VRF, VXLAN EVPN, load balancers, firewalls	User/API access, tenant segmentation, app access to inference endpoints	Mixing frontend bursty traffic with backend GPU collective traffic without isolation
Storage / data fabric	Ethernet storage, NFS, object, parallel file systems, NVMe/TCP, FC where used	Data ingest, checkpointing, model/data access, throughput and metadata behavior	Optimizing only GPU fabric while dataset reads/checkpoints remain bottlenecked
Management / OOB	OOB management network, Cisco Intersight, Cisco Nexus Dashboard, NDFC, AAA, syslog, telemetry	Secure access, inventory, automation, fabric health, configuration consistency	Managing devices in-band only and losing control during data-plane incidents
Automation / intent	Cisco Nexus Dashboard Fabric Controller, Intersight, APIs, Ansible/Terraform, templates	Repeatable fabric and server deployment, drift detection, rollback planning	Manual per-device changes that create QoS or MTU inconsistency
Observability	Nexus telemetry, interface counters, queue counters, PFC/ECN counters, server/GPU metrics	Correlate job slowdown with drops, pause frames, ECN marks, congestion, host issues	Looking only at packet drops; congestion can show as ECN marks or PFC pause without drops

AI workload traffic patterns

Workload pattern	Primary bottleneck	Network behavior	Design priority
Distributed training	East-west GPU-to-GPU communication	Large synchronized bursts, all-reduce/all-to-all, sensitivity to tail latency	Nonblocking or low-oversubscription leaf-spine, consistent RDMA QoS
Inference	North-south request/response plus backend calls	Latency-sensitive, often smaller flows, may be horizontally scaled	Frontend resilience, load balancing, segmentation, predictable latency
Data ingest / preprocessing	Storage throughput and CPU pipeline	Read-heavy, metadata-heavy, sometimes bursty	Storage locality, caching, separate QoS class from RDMA
Checkpointing	Write bandwidth and storage consistency	Periodic large writes can congest links	Isolate or rate-manage checkpoint traffic; avoid starving RDMA
Model distribution	One-to-many reads or image pulls	Bursty pulls from registries/object stores	Local caching, registry placement, bandwidth planning
Multi-tenant AI cluster	Isolation and noisy-neighbor control	Competing traffic classes and jobs	VRF/VLAN/ACL segmentation, QoS, quotas at orchestration layer

Training vs inference distinctions

Dimension	Training	Inference
Main goal	Maximize GPU utilization and job completion speed	Minimize response latency and maintain availability
Dominant traffic	East-west GPU synchronization, dataset reads, checkpoints	Client/API traffic, model serving, feature/data lookups
Failure impact	Job restart, checkpoint recovery, wasted GPU time	User-facing outage or degraded service
Network concern	Lossless/low-loss RDMA, ECMP balance, congestion control	Load balancing, service segmentation, autoscaling, observability
Exam decision point	Prefer high-bandwidth backend fabric with strict QoS consistency	Prefer resilient frontend design with security and traffic isolation

Fabric design selection

Design choice	Choose when	Avoid / watch for
Leaf-spine Clos	Need predictable scale-out, ECMP, uniform hop count	Incorrect cabling or uneven uplinks causing hot spots
Rail-optimized fabric	GPU servers have multiple NICs/rails and traffic should stay balanced per rail	Misaligned rail cabling; one rail congests while others are idle
Separate backend and frontend fabrics	Need strict separation between RDMA training traffic and user/service traffic	More cabling and operational domains
Converged fabric with strict QoS	Cabling budget or architecture requires shared links	Requires disciplined classification, queuing, monitoring, and change control
L3 routed backend fabric	Need ECMP scale and simple failure domains for RoCEv2	Forgetting end-to-end MTU, DSCP/CoS, and ECN consistency
VXLAN EVPN fabric	Need tenant segmentation, L2 extension, VRFs, workload mobility	Assuming the overlay fixes physical congestion
vPC host attachment	Need L2 dual-homing to a pair of switches	vPC peer link is not a substitute for proper spine capacity
OOB management	Need reliable device access during fabric failure	Skipping AAA, logging, and route separation

Typical AI fabric planes

    flowchart LR
	    User[Users / APIs] --> FE[Frontend service fabric]
	    FE --> App[Inference / scheduler / apps]
	    App --> Storage[Dataset / model storage]
	    GPU[GPU servers] <--> BE[Backend AI fabric: RoCEv2 / RDMA]
	    GPU --> Storage
	    Mgmt[OOB management] --> GPU
	    Mgmt --> Nexus[Cisco Nexus switches]
	    Ops[Cisco Intersight / Nexus Dashboard / NDFC] --> Mgmt

Capacity and bottleneck math

Use oversubscription and bisection thinking rather than memorizing arbitrary numbers.

\[ \text{Oversubscription ratio}=\frac{\text{Total server-facing bandwidth}}{\text{Total fabric-facing uplink bandwidth}} \]

\[ \text{Effective throughput} \le \min(\text{GPU demand},\ \text{NIC capacity},\ \text{fabric path capacity},\ \text{storage throughput}) \]

Metric	Meaning	Exam use
Oversubscription	Downlink demand compared with uplink capacity	Lower oversubscription is preferred for synchronous distributed training
Bisection bandwidth	Available bandwidth between two halves of the cluster	Important for all-to-all and all-reduce traffic
Tail latency	High-percentile latency, not just average latency	A few slow flows can delay synchronized training
Queue depth	Amount of buffered traffic	Rising queues indicate congestion before drops appear
ECN marks	Congestion signal without dropping packets	Validate congestion management is active
PFC pause frames	Link-level pause for a priority	Useful when controlled; dangerous if persistent or spreading
GPU utilization	Time GPU is doing useful work	Low utilization can be network, storage, CPU, or orchestration related
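
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names, port counts, and speeds are hypothetical, and all bandwidths are assumed to be in the same unit (Gbps here).

```python
def oversubscription_ratio(downlink_gbps, uplink_gbps):
    """Total server-facing bandwidth divided by total fabric-facing uplink bandwidth."""
    total_down = sum(downlink_gbps)
    total_up = sum(uplink_gbps)
    if total_up <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return total_down / total_up


def effective_throughput(gpu_demand, nic_capacity, fabric_path_capacity, storage_throughput):
    """Upper bound on throughput: the slowest stage in the path sets the limit."""
    return min(gpu_demand, nic_capacity, fabric_path_capacity, storage_throughput)


# Hypothetical leaf: 32 x 400G server ports and 16 x 800G uplinks (12800 / 12800).
leaf_ratio = oversubscription_ratio([400] * 32, [800] * 16)   # 1.0, nonblocking
# Same leaf with only 8 x 800G uplinks (12800 / 6400).
thin_ratio = oversubscription_ratio([400] * 32, [800] * 8)    # 2.0, i.e. 2:1
# Hypothetical path where the fabric is the narrowest stage.
bound = effective_throughput(gpu_demand=400, nic_capacity=400,
                             fabric_path_capacity=200, storage_throughput=300)  # 200
print(leaf_ratio, thin_ratio, bound)
```

A ratio of 1.0 corresponds to a nonblocking leaf. In the last call the fabric path, not the GPU or NIC, caps the achievable rate.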

RoCEv2 and lossless Ethernet reference

Item	Role	Key implementation point
RDMA	Direct memory access between hosts without heavy CPU involvement	Improves throughput/latency for GPU clusters and storage-like workloads
RoCEv2	RDMA over Converged Ethernet version 2, carried in UDP/IP (destination port 4791)	Routable; depends on correct QoS, MTU, and congestion handling
PFC	Priority Flow Control; per-priority Layer 2 pause	Apply only to selected no-drop class, usually RDMA
ECN	Explicit Congestion Notification	Marks packets before drops; host/NIC congestion algorithm reacts
DCQCN	Data Center Quantized Congestion Notification	NIC-side congestion control commonly associated with RoCEv2 fabrics
DCB	Data Center Bridging feature set	Includes PFC (IEEE 802.1Qbb), ETS (IEEE 802.1Qaz), and DCBX
ETS	Enhanced Transmission Selection	Allocates bandwidth among traffic classes
DSCP	Layer 3 QoS marking	Useful across routed RoCEv2 fabrics
CoS / PCP	Layer 2 priority marking	Used for priority behavior on Ethernet links
MTU	Maximum transmission unit	Must be consistent end-to-end for intended jumbo behavior

PFC, ECN, and drops

Mechanism	Layer / scope	What happens	Why it matters	Trap
Tail drop	Queue overflow	Packet is dropped	TCP may recover; RDMA can suffer severe performance impact	Waiting for drops before investigating congestion
PFC	L2, link-local, per priority	Receiver pauses sender for a priority	Prevents loss in no-drop class	Overuse can create head-of-line blocking and pause storms
ECN	L3 marking, end-to-end signal	Switch marks packets instead of dropping	Sender reduces rate before severe congestion	ECN marking alone does nothing if hosts/NICs do not react
WRED/RED with ECN	Queue management	Marks or drops based on thresholds	Early congestion signaling	Bad thresholds can mark too late or too aggressively
DCQCN	Host/NIC algorithm	Adjusts RDMA transmit rate	Stabilizes RoCEv2 under congestion	Requires consistent network and NIC configuration
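
To make the "bad thresholds" trap in the WRED/RED row concrete, here is a minimal sketch of the classic RED-style marking curve: no marking below a minimum threshold, a linear ramp up to a maximum probability, and marking of every packet above the maximum threshold. It assumes instantaneous queue depth with no averaging, and the threshold values are hypothetical. Actual switch behavior and defaults depend on platform and release.

```python
def ecn_mark_probability(queue_kb, min_kb, max_kb, max_prob):
    """Linear RED/WRED-style marking curve (instantaneous depth, no averaging)."""
    if min_kb >= max_kb:
        raise ValueError("min threshold must be below max threshold")
    if not 0.0 <= max_prob <= 1.0:
        raise ValueError("max_prob must be between 0 and 1")
    if queue_kb <= min_kb:
        return 0.0          # queue is shallow: no congestion signal
    if queue_kb >= max_kb:
        return 1.0          # queue is deep: mark every ECN-capable packet
    # Between the thresholds the probability ramps linearly up to max_prob.
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)


# Hypothetical thresholds: 150 KB minimum, 3000 KB maximum, 10% maximum probability.
for depth in (100, 1575, 3000):
    print(depth, ecn_mark_probability(depth, min_kb=150, max_kb=3000, max_prob=0.1))
```

A minimum threshold set too high delays the first marks until the queue is already deep (marking too late). A very low minimum or a steep ramp throttles senders that were not really congested (marking too aggressively).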

RoCEv2 implementation checklist

Use a dedicated QoS class for RDMA traffic.
Map RDMA traffic consistently: application/NIC marking → DSCP/CoS → switch qos-group → queue.
Enable PFC only for the intended no-drop priority.
Configure ECN/WRED behavior for early congestion signaling where supported.
Keep MTU consistent across server NICs, switch interfaces, port channels, routed links, and overlays where applicable.
Verify the trust boundary: do switches trust server markings, rewrite them, or classify by ACL?
Avoid mixing RDMA with storage bursts, backup, checkpoint, or general TCP traffic in the same no-drop queue.
Monitor PFC pause counters and ECN marks continuously; “no drops” is not proof of a healthy fabric.
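
The checklist above comes down to "the same values everywhere". A minimal sketch of that check follows. The dictionary keys, device names, and numeric values (DSCP 26, CoS 3, MTU 9216) are illustrative placeholders, not recommended settings, and collecting the observed state from devices is left out.

```python
# Hypothetical intended RDMA settings; every value is a placeholder.
INTENT = {
    "rdma_dscp": 26,
    "rdma_qos_group": 3,
    "pfc_cos": 3,
    "mtu": 9216,
    "no_drop_cos": (3,),
}


def find_drift(intent, observed_by_device):
    """Return (device, setting, expected, actual) for every mismatch."""
    findings = []
    for device in sorted(observed_by_device):
        observed = observed_by_device[device]
        for key, expected in intent.items():
            actual = observed.get(key)
            if actual != expected:
                findings.append((device, key, expected, actual))
    return findings


# Hypothetical observed state: leaf-2 has a default MTU and PFC on an extra priority.
OBSERVED = {
    "leaf-1": dict(INTENT),
    "leaf-2": {**INTENT, "mtu": 1500, "no_drop_cos": (3, 4)},
}

for finding in find_drift(INTENT, OBSERVED):
    print(finding)
```

Running the same comparison against server NIC settings covers the host side of the mapping chain as well.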

Cisco Nexus QoS mental model

QoS stage	Question to ask	NX-OS concept	Validation focus
Classification	Which packets are RDMA, storage, control, or best effort?	`class-map type qos`, ACL/DSCP/CoS matching	Packets enter the intended class
Marking	What internal forwarding class is used?	`qos-group`, DSCP/CoS rewrite where used	Marking is consistent across hops
Network QoS	Which classes are no-drop and what MTU applies?	`policy-map type network-qos`	PFC priority and MTU are correct
Queuing	How is bandwidth/buffering allocated?	`policy-map type queuing`	Queue behavior under congestion
Interface binding	Where do policies apply?	System QoS and interface policy attachment	No missing links in the path
Monitoring	Are queues pausing, marking, or dropping?	Interface, queuing, PFC, policy counters	Correlate counters with job symptoms

Illustrative NX-OS QoS skeleton

Platform syntax and feature availability vary by Cisco Nexus model and NX-OS release. Treat this as a pattern, not a copy-paste answer.

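! Stage 1: classify RDMA traffic by its DSCP marking
class-map type qos match-any RDMA-QOS
  match dscp <rdma-dscp>

! Map the classified traffic to an internal qos-group
policy-map type qos AI-CLASSIFY
  class RDMA-QOS
    set qos-group <rdma-qos-group>

! Stage 2: define the no-drop class, its MTU, and its PFC priority
class-map type network-qos RDMA-NO-DROP
  match qos-group <rdma-qos-group>

policy-map type network-qos AI-NETWORK-QOS
  class type network-qos RDMA-NO-DROP
    mtu <jumbo-mtu>
    pause pfc-cos <rdma-cos>

! Stage 3: reserve bandwidth and enable ECN marking on the RDMA queue
policy-map type queuing AI-QUEUING
  class type queuing <rdma-queue-class>
    bandwidth percent <reserved-percent>
    random-detect ecn

! Attach the policies; the attachment point (system or interface) varies by platform
system qos
  service-policy type qos input AI-CLASSIFY
  service-policy type network-qos AI-NETWORK-QOS
  service-policy type queuing output AI-QUEUING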

Useful verification commands

show interface ethernet <slot/port>
show interface ethernet <slot/port> counters
show interface ethernet <slot/port> priority-flow-control
show queuing interface ethernet <slot/port>
show policy-map interface ethernet <slot/port>
show class-map type qos
show policy-map type network-qos
show policy-map type queuing
show lldp neighbors
show port-channel summary
show ip route
show bgp ipv4 unicast summary
show logging last <lines>
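
When the same checks must run on many ports, the per-interface commands above can be expanded from a template. This is only a string-building sketch: it uses the command forms listed in this section, the port names are hypothetical, and how the commands are sent to a switch is left open.

```python
# Per-interface commands from the list above, with the port as a placeholder.
PER_INTERFACE = [
    "show interface ethernet {port}",
    "show interface ethernet {port} counters",
    "show interface ethernet {port} priority-flow-control",
    "show queuing interface ethernet {port}",
    "show policy-map interface ethernet {port}",
]


def verification_commands(ports):
    """Expand every per-interface command for every port, grouped by port."""
    return [template.format(port=port) for port in ports for template in PER_INTERFACE]


# Hypothetical GPU-facing ports.
for command in verification_commands(["1/1", "1/2"]):
    print(command)
```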

Routing, ECMP, and overlay decisions

Topic	High-yield point	Troubleshooting clue
L3 underlay	Provides routed reachability and ECMP between leaves and spines	Missing route, failed adjacency, or asymmetric MTU causes traffic black holes
ECMP	Spreads flows across equal-cost paths	Large elephant flows can still hash unevenly
BGP underlay	Common for scalable leaf-spine designs	Check neighbor state, advertised prefixes, next hops
OSPF/IS-IS underlay	Also possible in routed fabrics	Check area/level, MTU, adjacency, passive interfaces
BFD	Speeds failure detection where implemented	False positives can flap paths if timers are too aggressive
VXLAN EVPN	Adds scalable L2/L3 overlay and tenant segmentation	Overlay reachability still depends on underlay health
VRF	Separates routing tables and tenants	Wrong VRF is a common cause of “reachable from one place only”
vPC	Dual-homed L2 access to a switch pair	Peer-link congestion or orphan-port behavior can affect flows
Multicast / BUM handling	Needed in some overlay designs	Misconfigured replication affects ARP/ND/flooding behavior
DCI	Interconnects sites/fabrics	Avoid assuming latency-sensitive training can span sites without specialized design
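
The ECMP row notes that large flows can still hash unevenly. The sketch below illustrates why: each flow's 5-tuple maps to exactly one path, so a handful of elephant flows can land on the same uplink. SHA-256 is a stand-in for the switch hash, which is platform-specific, and the addresses, ports, and rates are hypothetical.

```python
import hashlib
from collections import Counter

ROCEV2_UDP_PORT = 4791  # well-known UDP destination port for RoCEv2


def ecmp_path(flow, n_paths):
    """Pick a path index from a 5-tuple. SHA-256 stands in for the ASIC hash."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def load_per_path(flows, n_paths):
    """Sum offered load (Gbps) per path; every flow stays on a single path."""
    load = Counter()
    for flow, gbps in flows:
        load[ecmp_path(flow, n_paths)] += gbps
    return [load[i] for i in range(n_paths)]


# Hypothetical elephant flows: (src_ip, dst_ip, protocol, src_port, dst_port), Gbps.
FLOWS = [
    (("10.0.0.1", "10.0.1.1", 17, 49152, ROCEV2_UDP_PORT), 100),
    (("10.0.0.2", "10.0.1.2", 17, 49153, ROCEV2_UDP_PORT), 100),
    (("10.0.0.3", "10.0.1.3", 17, 49154, ROCEV2_UDP_PORT), 100),
    (("10.0.0.4", "10.0.1.4", 17, 49155, ROCEV2_UDP_PORT), 100),
]

print(load_per_path(FLOWS, n_paths=4))
```

With four flows and four paths, nothing in the hash guarantees one flow per path. Because the RoCEv2 destination port is fixed, flow entropy comes mainly from addresses and the UDP source port.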

Underlay vs overlay

Requirement	Prefer
Simple high-performance backend RDMA fabric	Routed L3 underlay with ECMP
Multi-tenant application networks	VXLAN EVPN with VRFs
Need L2 adjacency for specific workloads	EVPN/VXLAN or controlled L2 design
Strict isolation between AI backend and user traffic	Separate fabrics or separate VRFs/classes
Operational consistency at scale	NDFC templates and intent-based fabric management

Cisco UCS and GPU compute reference

Area	What to validate	Why it matters
Firmware	Server, BIOS, GPU, NIC/HCA, storage controller versions	Mismatched firmware can break RDMA, driver compatibility, or performance
Drivers	GPU driver, CUDA stack where relevant, NIC/RDMA drivers	Host stack must align with hardware and workload framework
PCIe topology	GPU-to-NIC locality, NUMA domain, slot placement	Poor locality adds latency and CPU/memory overhead
GPU interconnect	PCIe, NVLink/NVSwitch where present	Scale-up bandwidth differs from scale-out fabric bandwidth
BIOS settings	Performance profile, virtualization, SR-IOV, power settings as required	Default power-saving settings can reduce throughput
NIC features	RoCEv2, PFC/ECN support, MTU, offloads	Host NIC must participate in congestion control
Power/cooling	Rack power, airflow, thermal headroom	Throttling looks like performance degradation, not a link failure
Server identity	Service profiles / server profiles, MAC/WWN/IP policies	Enables repeatable deployment and replacement
Inventory	Cisco Intersight / UCS Manager visibility	Speeds lifecycle, compliance, and fault isolation

Cisco Intersight vs UCS Manager vs Nexus Dashboard

Tool	Primary scope	Use for
Cisco Intersight	Server and infrastructure lifecycle management	UCS inventory, firmware policies, profiles, health, automation
Cisco UCS Manager	UCS domain management	Fabric Interconnect-attached UCS configuration and policies
Cisco Nexus Dashboard	Data center operational platform	Hosting apps for fabric operations, insights, and automation
Cisco Nexus Dashboard Fabric Controller	Fabric automation and lifecycle	Nexus fabric design, deployment, templates, consistency
Nexus Dashboard Insights	Visibility and assurance	Telemetry, anomalies, change impact, troubleshooting context

Storage and data path reference

Storage pattern	Best fit	Network consideration	Trap
NFS / NAS	Shared datasets, simpler operations	Throughput, metadata performance, mount design	Single mount or filer path becomes hot spot
Object storage	Large datasets, model artifacts, data lake workflows	HTTP/API throughput, caching, locality	Many small objects can stress metadata/API path
Parallel file system	Large-scale training datasets	High aggregate bandwidth and metadata scaling	Misconfigured clients can limit performance
NVMe/TCP	High-performance block over Ethernet	Separate QoS and congestion planning	Sharing RDMA no-drop class without design
Fibre Channel / SAN	Enterprise block storage environments	Separate FC fabric or converged design where used	Assuming SAN bandwidth equals training data throughput
Local NVMe cache	Hot data, preprocessing, temporary shards	Data distribution and cache warm-up	Cache miss storms during job start
Container registry	Images, model-serving components	Pull storms during scale-out	No local mirror or pre-pull strategy

Storage vs backend RDMA

Question	If yes	Design response
Does checkpointing coincide with training synchronization?	Storage writes can congest fabric	Separate QoS class or schedule/checkpoint tuning
Is dataset read throughput below GPU demand?	GPUs wait idle	Improve storage parallelism, caching, or data placement
Are storage and RDMA sharing uplinks?	Congestion coupling is possible	Monitor queues by class; reserve bandwidth carefully
Are small-file reads dominant?	Metadata may bottleneck	Optimize dataset format, caching, or parallel filesystem metadata
Is storage traffic marked the same as RDMA?	No-drop queue can be polluted	Reclassify and isolate storage traffic

Security, segmentation, and governance

Control	Use for	Exam focus
AAA with TACACS+/RADIUS	Centralized admin authentication and authorization	Role separation and auditability
RBAC	Limit operator privileges	Least privilege for fabric/server operations
Management VRF / OOB	Isolate device management	Reachability during fabric incidents
SSH/HTTPS only	Secure administrative access	Disable insecure management protocols
SNMPv3 / secure telemetry	Authenticated monitoring	Avoid cleartext community strings
ACLs	Restrict management and tenant traffic	Apply in correct direction and VRF
VRFs	Routing isolation	Tenant/job/environment segmentation
CoPP	Protect switch control plane	Prevent data-plane events from overwhelming CPU
Image/firmware governance	Trusted software lifecycle	Consistent versions, controlled upgrades
Secrets handling	Protect API tokens, registry credentials, keys	Avoid embedding secrets in templates or scripts

AI-specific security considerations

Separate management, storage, frontend, and backend fabric access.
Restrict who can change QoS, PFC, ECN, and fabric templates; mistakes can affect the whole cluster.
Protect datasets, model artifacts, checkpoints, and container registries.
Use change control for firmware, driver, and CUDA-related stack changes.
Monitor for configuration drift between rails, leaves, and server NICs.
Keep automation credentials scoped and rotated.

Automation and operations decision table

Need	Prefer	Why
Build or modify Nexus fabrics consistently	Cisco Nexus Dashboard Fabric Controller	Intent/templates reduce per-device drift
Manage UCS firmware and server profiles	Cisco Intersight or UCS Manager	Central lifecycle and policy control
Query switch state programmatically	NX-API, NETCONF/RESTCONF, gNMI where supported	Enables validation and telemetry workflows
Repeat configuration tasks	Ansible/Terraform or vendor-supported automation	Idempotent changes and version control
Validate pre/post change state	Automated checks plus Nexus telemetry	Catch MTU, QoS, adjacency, and counter regressions
Correlate incidents	Nexus Dashboard Insights, syslog, telemetry, job metrics	AI performance issues cross device boundaries
Operate Kubernetes-based AI workloads	Kubernetes tools plus infrastructure telemetry	Cluster scheduler symptoms may originate in network/storage

Automation safeguards

Maintain a source of truth for fabric topology, addressing, VRFs, QoS classes, and cabling.
Validate generated config before deployment.
Use staged rollout for QoS, PFC, ECN, and MTU changes.
Capture pre-change counters and control-plane state.
Confirm rollback steps before changing fabric-wide policies.
Test one rail or failure domain when possible before broad rollout.
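
Capturing pre-change counters is only useful if they are compared afterwards. A minimal sketch of that comparison follows. The counter names and values are hypothetical, and gathering the snapshots (CLI, API, or telemetry) is out of scope here.

```python
def counter_deltas(before, after):
    """Per-counter change between two snapshots; missing 'before' values count as 0."""
    return {name: after[name] - before.get(name, 0) for name in after}


def regressions(before, after, watch=("rdma_drops", "pfc_rx_pause", "crc_errors")):
    """Watched counters that increased during the change window."""
    deltas = counter_deltas(before, after)
    return {name: delta for name, delta in deltas.items() if name in watch and delta > 0}


# Hypothetical snapshots for one interface, taken before and after a QoS change.
pre = {"rdma_drops": 0, "pfc_rx_pause": 120, "crc_errors": 3, "tx_bytes": 10_000}
post = {"rdma_drops": 0, "pfc_rx_pause": 950, "crc_errors": 3, "tx_bytes": 90_000}

print(regressions(pre, post))
```

Here only the pause counter grew, which would point the review at the no-drop class rather than at physical errors.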

Troubleshooting workflow

    flowchart TD
	    A[Symptom: AI job slow or failing] --> B{Reachability issue?}
	    B -->|Yes| C[Check L1/L2/L3: link, VLAN/VRF, route, MTU, ACL]
	    B -->|No| D{Drops, ECN, or PFC counters?}
	    D -->|Drops| E[Check queue policy, congestion, bad optics, CRC/FEC, oversubscription]
	    D -->|ECN marks| F[Validate ECN thresholds and host/NIC congestion response]
	    D -->|PFC pause| G[Find congested receiver, no-drop queue, HOL blocking, pause propagation]
	    D -->|None obvious| H{GPU utilization low?}
	    H -->|Yes| I[Check storage, CPU preprocessing, NUMA, drivers, scheduler]
	    H -->|No| J[Check application, batch size, framework, job placement]
	    C --> K[Retest with counters cleared or time-bounded telemetry]
	    E --> K
	    F --> K
	    G --> K
	    I --> K
	    J --> K
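
The flowchart above can also be expressed as a small decision function, which is handy for runbooks or automated triage. This is a sketch: the parameter names are hypothetical, and checking drops before ECN marks before PFC pause is an assumed priority order, since the flowchart treats them as parallel branches.

```python
def next_check(reachability_issue, drops=False, ecn_marks=False,
               pfc_pause=False, gpu_util_low=False):
    """Return the next troubleshooting step for a slow or failing AI job."""
    if reachability_issue:
        return "Check L1/L2/L3: link, VLAN/VRF, route, MTU, ACL"
    if drops:
        return "Check queue policy, congestion, bad optics, CRC/FEC, oversubscription"
    if ecn_marks:
        return "Validate ECN thresholds and host/NIC congestion response"
    if pfc_pause:
        return "Find congested receiver, no-drop queue, HOL blocking, pause propagation"
    if gpu_util_low:
        return "Check storage, CPU preprocessing, NUMA, drivers, scheduler"
    return "Check application, batch size, framework, job placement"


# Example: no reachability problem and clean counters, but GPUs are idle.
print(next_check(reachability_issue=False, gpu_util_low=True))
```

Whatever the branch, the flowchart ends the same way: retest with cleared counters or time-bounded telemetry.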

Symptom-to-cause matrix

Symptom	Likely causes	Verify	Corrective direction
RDMA connection fails	MTU mismatch, wrong DSCP/CoS, PFC disabled, ACL/VRF issue, NIC driver mismatch	Ping with large packet where appropriate, route/VRF, PFC/QoS counters, host RDMA tools	Align MTU, routing, QoS, NIC settings
RDMA works but slow	ECMP imbalance, congestion, ECN not reacting, PFC pause, storage bottleneck	Queue depth, ECN marks, PFC counters, link utilization, GPU utilization	Tune traffic placement, QoS, congestion control, storage path
PFC pause storm	Overloaded receiver, no-drop class too broad, buffer threshold issue, head-of-line blocking	PFC Rx/Tx by interface and priority, queue occupancy	Narrow no-drop traffic, relieve congestion, review thresholds
Drops in RDMA class	PFC not active, wrong priority mapping, queue/buffer pressure	Interface drops, policy counters, DSCP/CoS mapping	Fix classification and no-drop policy; reduce congestion
ECN marks but no performance improvement	Host/NIC not reacting, wrong traffic class, thresholds ineffective	NIC counters, switch ECN counters, DSCP mapping	Align NIC congestion control and switch ECN behavior
One rail congested	Cabling imbalance, hashing issue, failed link, uneven job placement	Per-rail utilization, LLDP, port-channel/ECMP state	Correct cabling, restore links, rebalance workloads
BGP adjacency down	IP mismatch, VRF error, ACL, MTU, authentication, interface down	Neighbor state, logs, interface status, route table	Fix underlay config and physical link
VXLAN tenant unreachable	VNI/VRF mismatch, EVPN route issue, NVE peer issue, underlay failure	EVPN routes, NVE peers, VRF routes	Correct overlay mapping and underlay reachability
GPU utilization low	Storage slow, CPU preprocessing slow, network congestion, scheduler placement	GPU metrics, storage metrics, queue counters, host CPU	Remove data path bottleneck; tune placement
High CRC/FEC errors	Optics/cable issue, dirty fiber, speed/FEC mismatch	Interface counters, transceiver detail, logs	Replace/clean optics/cables; align link settings
Intermittent job failures	Microbursts, thermal throttling, link flaps, driver issues	Time-correlated telemetry, environment sensors, logs	Correlate by timestamp and isolate failure domain

High-yield distinctions

Distinction	Know this
Lossless Ethernet vs no congestion	Lossless mechanisms reduce drops; they do not eliminate congestion or guarantee performance
PFC vs global pause	PFC pauses selected priorities; global pause stops all traffic on a link and is usually undesirable in data centers
ECN vs PFC	ECN signals congestion end-to-end; PFC pauses a local link priority
DSCP vs CoS	DSCP is L3 marking; CoS/PCP is L2 marking. Routed fabrics commonly need DSCP consistency
`qos-group` vs DSCP	`qos-group` is internal switch classification; DSCP is carried in the packet
`type qos` vs `type network-qos` vs `type queuing`	Classification/marking vs no-drop/MTU behavior vs scheduling/bandwidth behavior
Underlay vs overlay	Underlay provides IP reachability; overlay provides tenant/L2/L3 virtualization
ECMP vs port channel	ECMP balances routed next hops; port channels bundle links between the same logical neighbors
Scale-up vs scale-out	Scale-up uses local GPU interconnects inside a server/chassis; scale-out uses the network between servers
Training vs inference	Training is dominated by synchronized east-west and storage traffic; inference is dominated by service latency and availability
NDFC vs Intersight	NDFC manages Nexus fabric intent; Intersight manages server/infrastructure lifecycle
Control plane vs data plane	Control plane builds routes/state; data plane forwards traffic. Both must be healthy
Drops vs errors	Drops may be congestion/policy; errors often indicate physical or link-layer problems
Low average utilization vs microbursts	Average link use can look safe while short bursts fill queues
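
The last distinction, low average utilization versus microbursts, is easy to see with numbers. The sketch below uses hypothetical per-millisecond link utilization samples; real visibility depends on how finely the telemetry samples.

```python
def summarize(samples_pct):
    """Return (average, peak) link utilization in percent."""
    if not samples_pct:
        raise ValueError("need at least one sample")
    return sum(samples_pct) / len(samples_pct), max(samples_pct)


# Hypothetical 100 ms window: 95 quiet samples at 5% and a 5 ms burst at line rate.
samples = [5] * 95 + [100] * 5
average, peak = summarize(samples)
print(average, peak)  # 9.75 100
```

A poller that reports only the 9.75% average hides the 5 ms at line rate, which is exactly when queues fill and ECN marks or PFC pauses appear.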

Implementation review checklist

Fabric

Leaf-spine cabling matches the intended topology and rail design.
All uplinks/downlinks run expected speed and duplex with clean counters.
Routing adjacencies are stable.
ECMP paths are present and balanced enough for workload patterns.
VRFs/VLANs/VNIs match the design.
vPC peer links and keepalives are healthy where vPC is used.
OOB management remains reachable during data-plane changes.

RoCEv2 / QoS

RDMA traffic is classified consistently at ingress.
DSCP/CoS/qos-group mapping is consistent across the fabric.
PFC is enabled only on the intended priority.
ECN marking is configured for the intended queue where supported.
MTU is consistent across server NICs, switchports, port channels, routed links, and overlays.
Queue counters are monitored before and after changes.
Storage and checkpoint traffic do not pollute the RDMA no-drop queue.

Compute

UCS/server firmware, NIC firmware, GPU drivers, and OS drivers are compatible.
BIOS/performance settings match workload requirements.
GPU-to-NIC locality is understood.
Power and cooling are sufficient under sustained load.
Intersight/UCS profiles reflect desired identity and firmware policy.
Host NIC settings align with switch QoS, PFC, ECN, and MTU.

Storage

Dataset path throughput matches expected GPU consumption.
Checkpoint traffic is planned and monitored.
Storage traffic has its own class or policy when needed.
Metadata bottlenecks are considered for small-file workloads.
Registry/model artifact pulls are cached or staged for scale-out events.

Operations

NDFC/Intersight templates are version controlled or otherwise governed.
Telemetry covers interfaces, queues, PFC, ECN, routes, server health, and job metrics.
AAA/RBAC and management VRFs are in place.
Pre-change and post-change validations are defined.
Rollback steps are documented for QoS, MTU, routing, and firmware changes.

Exam preparation focus

For Cisco 300-640 DCAI, practice explaining not just what each component does, but why you would choose it in an AI data center:

When to separate backend RDMA and frontend traffic.
How RoCEv2 depends on PFC, ECN, MTU, and marking consistency.
How to troubleshoot slow training when there are no obvious packet drops.
How Cisco Nexus fabric operations differ from Cisco UCS/Intersight server lifecycle tasks.
How NDFC, Nexus Dashboard, telemetry, and automation reduce drift.
How storage, compute, and network bottlenecks interact.

Next step for practice

Work through timed scenario questions that force you to pick the best fabric design, QoS action, Cisco Nexus verification command, UCS/Intersight operation, or troubleshooting path from a realistic AI infrastructure symptom.
