300-640 DCAI — Cisco Implementing Data Center AI Infrastructure Quick Reference

Compact Cisco 300-640 DCAI reference for AI data center fabrics, RoCEv2, QoS, Nexus/UCS operations, and troubleshooting.

Quick Reference purpose

Use this as an independent compact review for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI). Focus on implementation decisions: AI fabric design, RoCEv2 behavior, lossless Ethernet, Cisco Nexus operations, Cisco UCS/Intersight compute lifecycle, storage paths, automation, observability, and troubleshooting.

AI data center architecture map

| Layer / plane | Typical Cisco-related components | What to know for 300-640 DCAI | Common exam trap |
| --- | --- | --- | --- |
| GPU compute | Cisco UCS C-Series / X-Series GPU-capable servers, VIC/NIC/HCA, GPU drivers, firmware policies | GPU-to-NIC locality, firmware compatibility, BIOS settings, power/cooling, lifecycle control | Treating GPU performance as only a server issue; network and storage often limit utilization |
| Backend AI fabric | Cisco Nexus leaf-spine, routed Ethernet, RoCEv2, PFC, ECN, QoS | Low latency, low loss, high east-west bandwidth, consistent QoS and MTU | Enabling PFC broadly instead of only the lossless RDMA class |
| Frontend / service fabric | Nexus leaf-spine, VLAN/VRF, VXLAN EVPN, load balancers, firewalls | User/API access, tenant segmentation, app access to inference endpoints | Mixing frontend bursty traffic with backend GPU collective traffic without isolation |
| Storage / data fabric | Ethernet storage, NFS, object, parallel file systems, NVMe/TCP, FC where used | Data ingest, checkpointing, model/data access, throughput and metadata behavior | Optimizing only GPU fabric while dataset reads/checkpoints remain bottlenecked |
| Management / OOB | OOB management network, Cisco Intersight, Cisco Nexus Dashboard, NDFC, AAA, syslog, telemetry | Secure access, inventory, automation, fabric health, configuration consistency | Managing devices in-band only and losing control during data-plane incidents |
| Automation / intent | Cisco Nexus Dashboard Fabric Controller, Intersight, APIs, Ansible/Terraform, templates | Repeatable fabric and server deployment, drift detection, rollback planning | Manual per-device changes that create QoS or MTU inconsistency |
| Observability | Nexus telemetry, interface counters, queue counters, PFC/ECN counters, server/GPU metrics | Correlate job slowdown with drops, pause frames, ECN marks, congestion, host issues | Looking only at packet drops; congestion can show as ECN marks or PFC pause without drops |

AI workload traffic patterns

| Workload pattern | Primary bottleneck | Network behavior | Design priority |
| --- | --- | --- | --- |
| Distributed training | East-west GPU-to-GPU communication | Large synchronized bursts, all-reduce/all-to-all, sensitivity to tail latency | Nonblocking or low-oversubscription leaf-spine, consistent RDMA QoS |
| Inference | North-south request/response plus backend calls | Latency-sensitive, often smaller flows, may be horizontally scaled | Frontend resilience, load balancing, segmentation, predictable latency |
| Data ingest / preprocessing | Storage throughput and CPU pipeline | Read-heavy, metadata-heavy, sometimes bursty | Storage locality, caching, separate QoS class from RDMA |
| Checkpointing | Write bandwidth and storage consistency | Periodic large writes can congest links | Isolate or rate-manage checkpoint traffic; avoid starving RDMA |
| Model distribution | One-to-many reads or image pulls | Bursty pulls from registries/object stores | Local caching, registry placement, bandwidth planning |
| Multi-tenant AI cluster | Isolation and noisy-neighbor control | Competing traffic classes and jobs | VRF/VLAN/ACL segmentation, QoS, quotas at orchestration layer |

Training vs inference distinctions

| Dimension | Training | Inference |
| --- | --- | --- |
| Main goal | Maximize GPU utilization and job completion speed | Minimize response latency and maintain availability |
| Dominant traffic | East-west GPU synchronization, dataset reads, checkpoints | Client/API traffic, model serving, feature/data lookups |
| Failure impact | Job restart, checkpoint recovery, wasted GPU time | User-facing outage or degraded service |
| Network concern | Lossless/low-loss RDMA, ECMP balance, congestion control | Load balancing, service segmentation, autoscaling, observability |
| Exam decision point | Prefer high-bandwidth backend fabric with strict QoS consistency | Prefer resilient frontend design with security and traffic isolation |

Fabric design selection

| Design choice | Choose when | Avoid / watch for |
| --- | --- | --- |
| Leaf-spine Clos | Need predictable scale-out, ECMP, uniform hop count | Incorrect cabling or uneven uplinks causing hot spots |
| Rail-optimized fabric | GPU servers have multiple NICs/rails and traffic should stay balanced per rail | Misaligned rail cabling; one rail congests while others are idle |
| Separate backend and frontend fabrics | Need strict separation between RDMA training traffic and user/service traffic | More cabling and operational domains |
| Converged fabric with strict QoS | Cabling budget or architecture requires shared links | Requires disciplined classification, queuing, monitoring, and change control |
| L3 routed backend fabric | Need ECMP scale and simple failure domains for RoCEv2 | Forgetting end-to-end MTU, DSCP/CoS, and ECN consistency |
| VXLAN EVPN fabric | Need tenant segmentation, L2 extension, VRFs, workload mobility | Do not assume overlay fixes physical congestion |
| vPC host attachment | Need L2 dual-homing to a pair of switches | vPC peer link is not a substitute for proper spine capacity |
| OOB management | Need reliable device access during fabric failure | Skipping AAA, logging, and route separation |

Typical AI fabric planes

    flowchart LR
        User[Users / APIs] --> FE[Frontend service fabric]
        FE --> App[Inference / scheduler / apps]
        App --> Storage[Dataset / model storage]
        GPU[GPU servers] <--> BE[Backend AI fabric: RoCEv2 / RDMA]
        GPU --> Storage
        Mgmt[OOB management] --> GPU
        Mgmt --> Nexus[Cisco Nexus switches]
        Ops[Cisco Intersight / Nexus Dashboard / NDFC] --> Mgmt

Capacity and bottleneck math

Use oversubscription and bisection thinking rather than memorizing arbitrary numbers.

\[ \text{Oversubscription ratio}=\frac{\text{Total server-facing bandwidth}}{\text{Total fabric-facing uplink bandwidth}} \]

\[ \text{Effective throughput} \le \min(\text{GPU demand},\ \text{NIC capacity},\ \text{fabric path capacity},\ \text{storage throughput}) \]

| Metric | Meaning | Exam use |
| --- | --- | --- |
| Oversubscription | Downlink demand compared with uplink capacity | Lower oversubscription is preferred for synchronous distributed training |
| Bisection bandwidth | Available bandwidth between two halves of the cluster | Important for all-to-all and all-reduce traffic |
| Tail latency | High-percentile latency, not just average latency | A few slow flows can delay synchronized training |
| Queue depth | Amount of buffered traffic | Rising queues indicate congestion before drops appear |
| ECN marks | Congestion signal without dropping packets | Validate congestion management is active |
| PFC pause frames | Link-level pause for a priority | Useful when controlled; dangerous if persistent or spreading |
| GPU utilization | Time GPU is doing useful work | Low utilization can be network, storage, CPU, or orchestration related |
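
The oversubscription and effective-throughput formulas can be worked through with a minimal Python sketch. The port counts, speeds, and per-node figures are hypothetical placeholders chosen only to make the arithmetic easy to follow; they are not sizing guidance.

```python
def oversubscription_ratio(server_facing_gbps: float, uplink_gbps: float) -> float:
    """Total server-facing bandwidth divided by total fabric-facing uplink bandwidth."""
    if uplink_gbps <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return server_facing_gbps / uplink_gbps


def effective_throughput(gpu_demand: float, nic_capacity: float,
                         fabric_path_capacity: float, storage_throughput: float) -> float:
    """Upper bound on throughput: the slowest stage in the path sets the limit."""
    return min(gpu_demand, nic_capacity, fabric_path_capacity, storage_throughput)


# Hypothetical leaf A: 32 x 400G server ports, 16 x 800G uplinks (assumed values).
ratio_a = oversubscription_ratio(32 * 400, 16 * 800)   # 12800 / 12800
# Hypothetical leaf B: 48 x 100G server ports, 8 x 400G uplinks (assumed values).
ratio_b = oversubscription_ratio(48 * 100, 8 * 400)    # 4800 / 3200

# Hypothetical per-node figures in Gbps; storage is the limiting stage here.
bound = effective_throughput(gpu_demand=400, nic_capacity=400,
                             fabric_path_capacity=400, storage_throughput=150)

print(f"leaf A oversubscription: {ratio_a:.2f}:1")
print(f"leaf B oversubscription: {ratio_b:.2f}:1")
print(f"effective throughput bound: {bound} Gbps")
```

In this sketch leaf A is nonblocking (1:1) and leaf B is 1.5:1 oversubscribed, while the throughput bound is set by storage rather than by the fabric, which is the kind of result the GPU utilization row warns about.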

RoCEv2 and lossless Ethernet reference

| Item | Role | Key implementation point |
| --- | --- | --- |
| RDMA | Direct memory access between hosts without heavy CPU involvement | Improves throughput/latency for GPU clusters and storage-like workloads |
| RoCEv2 | RDMA over UDP/IP Ethernet | Routable; depends on correct QoS, MTU, and congestion handling |
| PFC | Priority Flow Control; per-priority Layer 2 pause | Apply only to selected no-drop class, usually RDMA |
| ECN | Explicit Congestion Notification | Marks packets before drops; host/NIC congestion algorithm reacts |
| DCQCN | Data Center Quantized Congestion Notification | NIC-side congestion control commonly associated with RoCEv2 fabrics |
| DCB | Data Center Bridging feature set | Includes mechanisms such as PFC and ETS concepts |
| ETS | Enhanced Transmission Selection | Allocates bandwidth among traffic classes |
| DSCP | Layer 3 QoS marking | Useful across routed RoCEv2 fabrics |
| CoS / PCP | Layer 2 priority marking | Used for priority behavior on Ethernet links |
| MTU | Maximum transmission unit | Must be consistent end-to-end for intended jumbo behavior |

PFC, ECN, and drops

| Mechanism | Layer / scope | What happens | Why it matters | Trap |
| --- | --- | --- | --- | --- |
| Tail drop | Queue overflow | Packet is dropped | TCP may recover; RDMA can suffer severe performance impact | Waiting for drops before investigating congestion |
| PFC | L2, link-local, per priority | Receiver pauses sender for a priority | Prevents loss in no-drop class | Overuse can create head-of-line blocking and pause storms |
| ECN | L3 marking, end-to-end signal | Switch marks packets instead of dropping | Sender reduces rate before severe congestion | ECN marking alone does nothing if hosts/NICs do not react |
| WRED/RED with ECN | Queue management | Marks or drops based on thresholds | Early congestion signaling | Bad thresholds can mark too late or too aggressively |
| DCQCN | Host/NIC algorithm | Adjusts RDMA transmit rate | Stabilizes RoCEv2 under congestion | Requires consistent network and NIC configuration |
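
To make the threshold behavior in the WRED/RED row concrete, here is a minimal Python sketch of the textbook RED marking curve: no marking below a minimum threshold, a linear ramp up to a maximum probability, and marking of everything at or above the maximum threshold. It is a simplification under stated assumptions: it uses instantaneous rather than averaged queue depth, the function name and all threshold values are hypothetical, and actual Nexus behavior depends on platform and release.

```python
def ecn_mark_probability(queue_depth: float, min_threshold: float,
                         max_threshold: float, max_probability: float) -> float:
    """Simplified RED-style curve: probability that an ECN-capable packet is marked."""
    if not 0.0 <= max_probability <= 1.0:
        raise ValueError("max_probability must be between 0 and 1")
    if min_threshold >= max_threshold:
        raise ValueError("min_threshold must be below max_threshold")
    if queue_depth < min_threshold:
        return 0.0                      # queue is shallow: no congestion signal
    if queue_depth >= max_threshold:
        return 1.0                      # queue is deep: mark every packet
    ramp = (queue_depth - min_threshold) / (max_threshold - min_threshold)
    return max_probability * ramp       # linear ramp between the thresholds


# Hypothetical thresholds in KB (assumed values, not recommendations).
for depth_kb in (50, 200, 300):
    p = ecn_mark_probability(depth_kb, min_threshold=100, max_threshold=300,
                             max_probability=0.5)
    print(f"queue depth {depth_kb} KB -> mark probability {p}")
```

Raising the minimum threshold in this model delays the first marks (marking "too late"), while a low minimum threshold or a steep ramp marks sooner and more often (marking "too aggressively").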

RoCEv2 implementation checklist

  • Use a dedicated QoS class for RDMA traffic.
  • Map RDMA traffic consistently: application/NIC marking → DSCP/CoS → switch qos-group → queue.
  • Enable PFC only for the intended no-drop priority.
  • Configure ECN/WRED behavior for early congestion signaling where supported.
  • Keep MTU consistent across server NICs, switch interfaces, port channels, routed links, and overlays where applicable.
  • Verify the trust boundary: do switches trust server markings, rewrite them, or classify by ACL?
  • Avoid mixing RDMA with storage bursts, backup, checkpoint, or general TCP traffic in the same no-drop queue.
  • Monitor PFC pause counters and ECN marks continuously; “no drops” is not proof of a healthy fabric.
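
Several checklist items come down to "the same setting everywhere". A minimal Python sketch of that check follows; the device names, setting keys, and values are all hypothetical placeholders, and in practice the inventory would come from a source of truth or from collected device state rather than a literal dictionary.

```python
from collections import defaultdict


def find_inconsistencies(devices, keys):
    """Group devices by the value of each setting; report settings with more than one value."""
    report = {}
    for key in keys:
        by_value = defaultdict(list)
        for name, settings in devices.items():
            by_value[settings.get(key)].append(name)
        if len(by_value) > 1:           # more than one distinct value means drift
            report[key] = dict(by_value)
    return report


# Hypothetical switch inventory; every value below is an assumed placeholder.
switches = {
    "leaf-1":  {"mtu": 9216, "rdma_dscp": 26, "pfc_priority": 3, "ecn_enabled": True},
    "leaf-2":  {"mtu": 9216, "rdma_dscp": 26, "pfc_priority": 3, "ecn_enabled": True},
    "spine-1": {"mtu": 1500, "rdma_dscp": 26, "pfc_priority": 3, "ecn_enabled": True},
}

report = find_inconsistencies(
    switches, ("mtu", "rdma_dscp", "pfc_priority", "ecn_enabled"))
for setting, groups in report.items():
    print(f"{setting} differs across devices: {groups}")
```

Here only the MTU on the hypothetical spine-1 is flagged. Host NIC MTU is deliberately left out of the comparison, because host and switch MTU values are often configured with different numbers while still being compatible.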

Cisco Nexus QoS mental model

| QoS stage | Question to ask | NX-OS concept | Validation focus |
| --- | --- | --- | --- |
| Classification | Which packets are RDMA, storage, control, or best effort? | class-map type qos, ACL/DSCP/CoS matching | Packets enter the intended class |
| Marking | What internal forwarding class is used? | qos-group, DSCP/CoS rewrite where used | Marking is consistent across hops |
| Network QoS | Which classes are no-drop and what MTU applies? | policy-map type network-qos | PFC priority and MTU are correct |
| Queuing | How is bandwidth/buffering allocated? | policy-map type queuing | Queue behavior under congestion |
| Interface binding | Where do policies apply? | System QoS and interface policy attachment | No missing links in the path |
| Monitoring | Are queues pausing, marking, or dropping? | Interface, queuing, PFC, policy counters | Correlate counters with job symptoms |

Illustrative NX-OS QoS skeleton

Platform syntax and feature availability vary by Cisco Nexus model and NX-OS release. Treat this as a pattern, not a copy-paste answer.

! Classification: match RDMA packets by DSCP and assign an internal qos-group
class-map type qos match-any RDMA-QOS
  match dscp <rdma-dscp>

policy-map type qos AI-CLASSIFY
  class RDMA-QOS
    set qos-group <rdma-qos-group>

! Network QoS: make the RDMA qos-group a no-drop (PFC) class and set its MTU
class-map type network-qos RDMA-NO-DROP
  match qos-group <rdma-qos-group>

policy-map type network-qos AI-NETWORK-QOS
  class type network-qos RDMA-NO-DROP
    mtu <jumbo-mtu>
    pause pfc-cos <rdma-cos>

! Queuing: reserve bandwidth for the RDMA queue and mark with ECN under congestion
policy-map type queuing AI-QUEUING
  class type queuing <rdma-queue-class>
    bandwidth percent <reserved-percent>
    random-detect ecn

! Attachment: bind the three policy types system-wide
system qos
  service-policy type qos input AI-CLASSIFY
  service-policy type network-qos AI-NETWORK-QOS
  service-policy type queuing output AI-QUEUING

Useful verification commands

show interface ethernet <slot/port>
show interface ethernet <slot/port> counters
show interface ethernet <slot/port> priority-flow-control
show queuing interface ethernet <slot/port>
show policy-map interface ethernet <slot/port>
show class-map type qos
show policy-map type network-qos
show policy-map type queuing
show lldp neighbors
show port-channel summary
show ip route
show bgp ipv4 unicast summary
show logging last <lines>

Routing, ECMP, and overlay decisions

| Topic | High-yield point | Troubleshooting clue |
| --- | --- | --- |
| L3 underlay | Provides routed reachability and ECMP between leaves and spines | Missing route, failed adjacency, or asymmetric MTU causes traffic black holes |
| ECMP | Spreads flows across equal-cost paths | Large elephant flows can still hash unevenly |
| BGP underlay | Common for scalable leaf-spine designs | Check neighbor state, advertised prefixes, next hops |
| OSPF/IS-IS underlay | Also possible in routed fabrics | Check area/level, MTU, adjacency, passive interfaces |
| BFD | Speeds failure detection where implemented | False positives can flap paths if timers are too aggressive |
| VXLAN EVPN | Adds scalable L2/L3 overlay and tenant segmentation | Overlay reachability still depends on underlay health |
| VRF | Separates routing tables and tenants | Wrong VRF is a common cause of “reachable from one place only” |
| vPC | Dual-homed L2 access to a switch pair | Peer-link congestion or orphan-port behavior can affect flows |
| Multicast / BUM handling | Needed in some overlay designs | Misconfigured replication affects ARP/ND/flooding behavior |
| DCI | Interconnects sites/fabrics | Avoid assuming latency-sensitive training can span sites without specialized design |
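
The ECMP row notes that a few large flows can hash unevenly. A minimal Python sketch of per-flow hashing illustrates why: each flow is pinned to one path by a hash of its 5-tuple, so with only a handful of elephant flows nothing forces an even spread. The SHA-256 hash here is only a stand-in for a switch's hardware hash, and the addresses, source ports, and flow rates are hypothetical. UDP destination port 4791 is the registered RoCEv2 port.

```python
import hashlib

ROCEV2_UDP_PORT = 4791  # registered UDP destination port for RoCEv2


def ecmp_path(flow, path_count):
    """Map a flow 5-tuple to a path index with a stable hash (stand-in for an ASIC hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % path_count


def load_per_path(flows, path_count):
    """Sum the offered load of every flow onto the single path its hash selects."""
    load = [0.0] * path_count
    for flow, gbps in flows.items():
        load[ecmp_path(flow, path_count)] += gbps
    return load


# Hypothetical elephant flows: (src IP, dst IP, protocol, src port, dst port) -> Gbps.
flows = {
    ("10.0.0.1", "10.0.1.1", "udp", 49152, ROCEV2_UDP_PORT): 100.0,
    ("10.0.0.2", "10.0.1.2", "udp", 49153, ROCEV2_UDP_PORT): 100.0,
    ("10.0.0.3", "10.0.1.3", "udp", 49154, ROCEV2_UDP_PORT): 100.0,
    ("10.0.0.4", "10.0.1.4", "udp", 49155, ROCEV2_UDP_PORT): 100.0,
}

load = load_per_path(flows, path_count=4)
for index, gbps in enumerate(load):
    print(f"path {index}: {gbps} Gbps")
```

Whether the four flows land on four different paths depends entirely on the hash inputs. Changing a source port changes the path, which is why flow entropy and per-path utilization are worth checking when one uplink runs hot while others are idle.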

Underlay vs overlay

| Requirement | Prefer |
| --- | --- |
| Simple high-performance backend RDMA fabric | Routed L3 underlay with ECMP |
| Multi-tenant application networks | VXLAN EVPN with VRFs |
| Need L2 adjacency for specific workloads | EVPN/VXLAN or controlled L2 design |
| Strict isolation between AI backend and user traffic | Separate fabrics or separate VRFs/classes |
| Operational consistency at scale | NDFC templates and intent-based fabric management |

Cisco UCS and GPU compute reference

| Area | What to validate | Why it matters |
| --- | --- | --- |
| Firmware | Server, BIOS, GPU, NIC/HCA, storage controller versions | Mismatched firmware can break RDMA, driver compatibility, or performance |
| Drivers | GPU driver, CUDA stack where relevant, NIC/RDMA drivers | Host stack must align with hardware and workload framework |
| PCIe topology | GPU-to-NIC locality, NUMA domain, slot placement | Poor locality adds latency and CPU/memory overhead |
| GPU interconnect | PCIe, NVLink/NVSwitch where present | Scale-up bandwidth differs from scale-out fabric bandwidth |
| BIOS settings | Performance profile, virtualization, SR-IOV, power settings as required | Default power-saving settings can reduce throughput |
| NIC features | RoCEv2, PFC/ECN support, MTU, offloads | Host NIC must participate in congestion control |
| Power/cooling | Rack power, airflow, thermal headroom | Throttling looks like performance degradation, not a link failure |
| Server identity | Service profiles / server profiles, MAC/WWN/IP policies | Enables repeatable deployment and replacement |
| Inventory | Cisco Intersight / UCS Manager visibility | Speeds lifecycle, compliance, and fault isolation |

Cisco Intersight vs UCS Manager vs Nexus Dashboard

| Tool | Primary scope | Use for |
| --- | --- | --- |
| Cisco Intersight | Server and infrastructure lifecycle management | UCS inventory, firmware policies, profiles, health, automation |
| Cisco UCS Manager | UCS domain management | Fabric Interconnect-attached UCS configuration and policies |
| Cisco Nexus Dashboard | Data center operational platform | Hosting apps for fabric operations, insights, and automation |
| Cisco Nexus Dashboard Fabric Controller | Fabric automation and lifecycle | Nexus fabric design, deployment, templates, consistency |
| Nexus Dashboard Insights | Visibility and assurance | Telemetry, anomalies, change impact, troubleshooting context |

Storage and data path reference

| Storage pattern | Best fit | Network consideration | Trap |
| --- | --- | --- | --- |
| NFS / NAS | Shared datasets, simpler operations | Throughput, metadata performance, mount design | Single mount or filer path becomes hot spot |
| Object storage | Large datasets, model artifacts, data lake workflows | HTTP/API throughput, caching, locality | Many small objects can stress metadata/API path |
| Parallel file system | Large-scale training datasets | High aggregate bandwidth and metadata scaling | Misconfigured clients can limit performance |
| NVMe/TCP | High-performance block over Ethernet | Separate QoS and congestion planning | Sharing RDMA no-drop class without design |
| Fibre Channel / SAN | Enterprise block storage environments | Separate FC fabric or converged design where used | Assuming SAN bandwidth equals training data throughput |
| Local NVMe cache | Hot data, preprocessing, temporary shards | Data distribution and cache warm-up | Cache miss storms during job start |
| Container registry | Images, model-serving components | Pull storms during scale-out | No local mirror or pre-pull strategy |

Storage vs backend RDMA

| Question | If yes | Design response |
| --- | --- | --- |
| Does checkpointing coincide with training synchronization? | Storage writes can congest fabric | Separate QoS class or schedule/checkpoint tuning |
| Is dataset read throughput below GPU demand? | GPUs wait idle | Improve storage parallelism, caching, or data placement |
| Are storage and RDMA sharing uplinks? | Congestion coupling is possible | Monitor queues by class; reserve bandwidth carefully |
| Are small-file reads dominant? | Metadata may bottleneck | Optimize dataset format, caching, or parallel filesystem metadata |
| Is storage traffic marked the same as RDMA? | No-drop queue can be polluted | Reclassify and isolate storage traffic |

Security, segmentation, and governance

| Control | Use for | Exam focus |
| --- | --- | --- |
| AAA with TACACS+/RADIUS | Centralized admin authentication and authorization | Role separation and auditability |
| RBAC | Limit operator privileges | Least privilege for fabric/server operations |
| Management VRF / OOB | Isolate device management | Reachability during fabric incidents |
| SSH/HTTPS only | Secure administrative access | Disable insecure management protocols |
| SNMPv3 / secure telemetry | Authenticated monitoring | Avoid cleartext community strings |
| ACLs | Restrict management and tenant traffic | Apply in correct direction and VRF |
| VRFs | Routing isolation | Tenant/job/environment segmentation |
| CoPP | Protect switch control plane | Prevent data-plane events from overwhelming CPU |
| Image/firmware governance | Trusted software lifecycle | Consistent versions, controlled upgrades |
| Secrets handling | Protect API tokens, registry credentials, keys | Avoid embedding secrets in templates or scripts |

AI-specific security considerations

  • Separate management, storage, frontend, and backend fabric access.
  • Restrict who can change QoS, PFC, ECN, and fabric templates; mistakes can affect the whole cluster.
  • Protect datasets, model artifacts, checkpoints, and container registries.
  • Use change control for firmware, driver, and CUDA-related stack changes.
  • Monitor for configuration drift between rails, leaves, and server NICs.
  • Keep automation credentials scoped and rotated.

Automation and operations decision table

| Need | Prefer | Why |
| --- | --- | --- |
| Build or modify Nexus fabrics consistently | Cisco Nexus Dashboard Fabric Controller | Intent/templates reduce per-device drift |
| Manage UCS firmware and server profiles | Cisco Intersight or UCS Manager | Central lifecycle and policy control |
| Query switch state programmatically | NX-API, NETCONF/RESTCONF, gNMI where supported | Enables validation and telemetry workflows |
| Repeat configuration tasks | Ansible/Terraform or vendor-supported automation | Idempotent changes and version control |
| Validate pre/post change state | Automated checks plus Nexus telemetry | Catch MTU, QoS, adjacency, and counter regressions |
| Correlate incidents | Nexus Dashboard Insights, syslog, telemetry, job metrics | AI performance issues cross device boundaries |
| Operate Kubernetes-based AI workloads | Kubernetes tools plus infrastructure telemetry | Cluster scheduler symptoms may originate in network/storage |

Automation safeguards

  • Maintain a source of truth for fabric topology, addressing, VRFs, QoS classes, and cabling.
  • Validate generated config before deployment.
  • Use staged rollout for QoS, PFC, ECN, and MTU changes.
  • Capture pre-change counters and control-plane state.
  • Confirm rollback steps before changing fabric-wide policies.
  • Test one rail or failure domain when possible before broad rollout.
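
The "capture pre-change counters" safeguard is easiest to act on when the before and after snapshots are compared mechanically. The following is a minimal Python sketch of that comparison. The interface names, counter names, and values are hypothetical, and how the snapshots are collected (CLI scrape, NX-API, telemetry) is left out on purpose.

```python
def counter_deltas(before, after):
    """Return non-zero counter changes keyed by (interface, counter name)."""
    deltas = {}
    for interface, counters in after.items():
        baseline = before.get(interface, {})
        for name, value in counters.items():
            delta = value - baseline.get(name, 0)   # missing baseline counts as 0
            if delta != 0:
                deltas[(interface, name)] = delta
    return deltas


# Hypothetical snapshots taken before and after a QoS change (assumed values).
pre_change = {
    "Ethernet1/1": {"pfc_rx_pause": 120, "ecn_marked": 5000, "output_drops": 0},
    "Ethernet1/2": {"pfc_rx_pause": 0, "ecn_marked": 10, "output_drops": 0},
}
post_change = {
    "Ethernet1/1": {"pfc_rx_pause": 120, "ecn_marked": 5400, "output_drops": 0},
    "Ethernet1/2": {"pfc_rx_pause": 35, "ecn_marked": 10, "output_drops": 2},
}

changes = counter_deltas(pre_change, post_change)
for (interface, name), delta in sorted(changes.items()):
    print(f"{interface} {name}: +{delta}")
```

Comparing deltas over a bounded window, rather than reading absolute counter values, makes it clearer whether pause frames, ECN marks, or drops started after the change or were already accumulating.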

Troubleshooting workflow

    flowchart TD
        A[Symptom: AI job slow or failing] --> B{Reachability issue?}
        B -->|Yes| C[Check L1/L2/L3: link, VLAN/VRF, route, MTU, ACL]
        B -->|No| D{Drops, ECN, or PFC counters?}
        D -->|Drops| E[Check queue policy, congestion, bad optics, CRC/FEC, oversubscription]
        D -->|ECN marks| F[Validate ECN thresholds and host/NIC congestion response]
        D -->|PFC pause| G[Find congested receiver, no-drop queue, HOL blocking, pause propagation]
        D -->|None obvious| H{GPU utilization low?}
        H -->|Yes| I[Check storage, CPU preprocessing, NUMA, drivers, scheduler]
        H -->|No| J[Check application, batch size, framework, job placement]
        C --> K[Retest with counters cleared or time-bounded telemetry]
        E --> K
        F --> K
        G --> K
        I --> K
        J --> K
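
The same workflow can be written as a small decision function, which is a handy way to rehearse the order of checks. This is a minimal Python sketch: the function and parameter names are hypothetical, and evaluating drops before ECN marks before PFC pause is an assumption of the sketch, since the flowchart treats those three as parallel branches.

```python
def next_check(reachability_issue: bool, drops: bool, ecn_marks: bool,
               pfc_pause: bool, gpu_utilization_low: bool) -> str:
    """Return the next troubleshooting step for a slow or failing AI job."""
    if reachability_issue:
        return "Check L1/L2/L3: link, VLAN/VRF, route, MTU, ACL"
    # Assumed priority among counter findings: drops, then ECN marks, then PFC pause.
    if drops:
        return "Check queue policy, congestion, bad optics, CRC/FEC, oversubscription"
    if ecn_marks:
        return "Validate ECN thresholds and host/NIC congestion response"
    if pfc_pause:
        return "Find congested receiver, no-drop queue, HOL blocking, pause propagation"
    # No obvious network counters: look at the host and the workload.
    if gpu_utilization_low:
        return "Check storage, CPU preprocessing, NUMA, drivers, scheduler"
    return "Check application, batch size, framework, job placement"


# Example: reachable, no drops or ECN marks, but PFC pause counters are rising.
step = next_check(reachability_issue=False, drops=False, ecn_marks=False,
                  pfc_pause=True, gpu_utilization_low=True)
print(step)
```

Every branch ends at the same follow-up in the flowchart: retest with counters cleared or with time-bounded telemetry.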

Symptom-to-cause matrix

| Symptom | Likely causes | Verify | Corrective direction |
| --- | --- | --- | --- |
| RDMA connection fails | MTU mismatch, wrong DSCP/CoS, PFC disabled, ACL/VRF issue, NIC driver mismatch | Ping with large packet where appropriate, route/VRF, PFC/QoS counters, host RDMA tools | Align MTU, routing, QoS, NIC settings |
| RDMA works but slow | ECMP imbalance, congestion, ECN not reacting, PFC pause, storage bottleneck | Queue depth, ECN marks, PFC counters, link utilization, GPU utilization | Tune traffic placement, QoS, congestion control, storage path |
| PFC pause storm | Overloaded receiver, no-drop class too broad, buffer threshold issue, head-of-line blocking | PFC Rx/Tx by interface and priority, queue occupancy | Narrow no-drop traffic, relieve congestion, review thresholds |
| Drops in RDMA class | PFC not active, wrong priority mapping, queue/buffer pressure | Interface drops, policy counters, DSCP/CoS mapping | Fix classification and no-drop policy; reduce congestion |
| ECN marks but no performance improvement | Host/NIC not reacting, wrong traffic class, thresholds ineffective | NIC counters, switch ECN counters, DSCP mapping | Align NIC congestion control and switch ECN behavior |
| One rail congested | Cabling imbalance, hashing issue, failed link, uneven job placement | Per-rail utilization, LLDP, port-channel/ECMP state | Correct cabling, restore links, rebalance workloads |
| BGP adjacency down | IP mismatch, VRF error, ACL, MTU, authentication, interface down | Neighbor state, logs, interface status, route table | Fix underlay config and physical link |
| VXLAN tenant unreachable | VNI/VRF mismatch, EVPN route issue, NVE peer issue, underlay failure | EVPN routes, NVE peers, VRF routes | Correct overlay mapping and underlay reachability |
| GPU utilization low | Storage slow, CPU preprocessing slow, network congestion, scheduler placement | GPU metrics, storage metrics, queue counters, host CPU | Remove data path bottleneck; tune placement |
| High CRC/FEC errors | Optics/cable issue, dirty fiber, speed/FEC mismatch | Interface counters, transceiver detail, logs | Replace/clean optics/cables; align link settings |
| Intermittent job failures | Microbursts, thermal throttling, link flaps, driver issues | Time-correlated telemetry, environment sensors, logs | Correlate by timestamp and isolate failure domain |

High-yield distinctions

| Distinction | Know this |
| --- | --- |
| Lossless Ethernet vs no congestion | Lossless mechanisms reduce drops; they do not eliminate congestion or guarantee performance |
| PFC vs global pause | PFC pauses selected priorities; global pause stops all traffic on a link and is usually undesirable in data centers |
| ECN vs PFC | ECN signals congestion end-to-end; PFC pauses a local link priority |
| DSCP vs CoS | DSCP is L3 marking; CoS/PCP is L2 marking. Routed fabrics commonly need DSCP consistency |
| qos-group vs DSCP | qos-group is internal switch classification; DSCP is carried in the packet |
| type qos vs type network-qos vs type queuing | Classification/marking vs no-drop/MTU behavior vs scheduling/bandwidth behavior |
| Underlay vs overlay | Underlay provides IP reachability; overlay provides tenant/L2/L3 virtualization |
| ECMP vs port channel | ECMP balances routed next hops; port channels bundle links between the same logical neighbors |
| Scale-up vs scale-out | Scale-up uses local GPU interconnects inside a server/chassis; scale-out uses the network between servers |
| Training vs inference | Training is dominated by synchronized east-west and storage traffic; inference is dominated by service latency and availability |
| NDFC vs Intersight | NDFC manages Nexus fabric intent; Intersight manages server/infrastructure lifecycle |
| Control plane vs data plane | Control plane builds routes/state; data plane forwards traffic. Both must be healthy |
| Drops vs errors | Drops may be congestion/policy; errors often indicate physical or link-layer problems |
| Low average utilization vs microbursts | Average link use can look safe while short bursts fill queues |
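
The last distinction, low average utilization versus microbursts, is easier to remember with numbers. A minimal Python sketch follows, with hypothetical values: a single burst arrives at twice the line rate of an egress link for a short time inside an otherwise idle measurement window. The function names and all figures are assumptions for illustration only.

```python
def average_utilization(burst_gbps: float, burst_us: float,
                        window_us: float, link_gbps: float) -> float:
    """Fraction of link capacity used over the whole window by one burst."""
    return (burst_gbps * burst_us) / (link_gbps * window_us)


def queued_bytes(burst_gbps: float, burst_us: float, link_gbps: float) -> float:
    """Bytes that must be buffered while arrivals exceed the egress line rate."""
    excess_gbps = max(0.0, burst_gbps - link_gbps)
    # Gbps x microseconds = kilobits, so multiply by 1000 for bits and divide by 8 for bytes.
    return excess_gbps * burst_us * 1000 / 8


# Hypothetical case: 800 Gbps arriving for 100 us toward a 400 Gbps link, 1 ms window.
utilization = average_utilization(burst_gbps=800, burst_us=100,
                                  window_us=1000, link_gbps=400)
backlog = queued_bytes(burst_gbps=800, burst_us=100, link_gbps=400)

print(f"average utilization over the window: {utilization:.0%}")
print(f"queue build-up during the burst: {backlog / 1e6:.1f} MB")
```

In this made-up case the link averages 20% utilization over the window, yet the burst needs about 5 MB of buffering, which is why queue, ECN, and PFC counters say more about congestion than averaged link utilization does.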

Implementation review checklist

Fabric

  • Leaf-spine cabling matches the intended topology and rail design.
  • All uplinks/downlinks run expected speed and duplex with clean counters.
  • Routing adjacencies are stable.
  • ECMP paths are present and balanced enough for workload patterns.
  • VRFs/VLANs/VNIs match the design.
  • vPC peer links and keepalives are healthy where vPC is used.
  • OOB management remains reachable during data-plane changes.

RoCEv2 / QoS

  • RDMA traffic is classified consistently at ingress.
  • DSCP/CoS/qos-group mapping is consistent across the fabric.
  • PFC is enabled only on the intended priority.
  • ECN marking is configured for the intended queue where supported.
  • MTU is consistent across server NICs, switchports, port channels, routed links, and overlays.
  • Queue counters are monitored before and after changes.
  • Storage and checkpoint traffic do not pollute the RDMA no-drop queue.

Compute

  • UCS/server firmware, NIC firmware, GPU drivers, and OS drivers are compatible.
  • BIOS/performance settings match workload requirements.
  • GPU-to-NIC locality is understood.
  • Power and cooling are sufficient under sustained load.
  • Intersight/UCS profiles reflect desired identity and firmware policy.
  • Host NIC settings align with switch QoS, PFC, ECN, and MTU.

Storage

  • Dataset path throughput matches expected GPU consumption.
  • Checkpoint traffic is planned and monitored.
  • Storage traffic has its own class or policy when needed.
  • Metadata bottlenecks are considered for small-file workloads.
  • Registry/model artifact pulls are cached or staged for scale-out events.

Operations

  • NDFC/Intersight templates are version controlled or otherwise governed.
  • Telemetry covers interfaces, queues, PFC, ECN, routes, server health, and job metrics.
  • AAA/RBAC and management VRFs are in place.
  • Pre-change and post-change validations are defined.
  • Rollback steps are documented for QoS, MTU, routing, and firmware changes.

Exam preparation focus

For Cisco 300-640 DCAI, practice explaining not just what each component does, but why you would choose it in an AI data center:

  • When to separate backend RDMA and frontend traffic.
  • How RoCEv2 depends on PFC, ECN, MTU, and marking consistency.
  • How to troubleshoot slow training when there are no obvious packet drops.
  • How Cisco Nexus fabric operations differ from Cisco UCS/Intersight server lifecycle tasks.
  • How NDFC, Nexus Dashboard, telemetry, and automation reduce drift.
  • How storage, compute, and network bottlenecks interact.

Next step for practice

Work through timed scenario questions that force you to pick the best fabric design, QoS action, Cisco Nexus verification command, UCS/Intersight operation, or troubleshooting path from a realistic AI infrastructure symptom.