Quick Reference purpose
Use this as an independent compact review for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI). Focus on implementation decisions: AI fabric design, RoCEv2 behavior, lossless Ethernet, Cisco Nexus operations, Cisco UCS/Intersight compute lifecycle, storage paths, automation, observability, and troubleshooting.
AI data center architecture map
| Layer / plane | Typical Cisco-related components | What to know for 300-640 DCAI | Common exam trap |
|---|---|---|---|
| GPU compute | Cisco UCS C-Series / X-Series GPU-capable servers, VIC/NIC/HCA, GPU drivers, firmware policies | GPU-to-NIC locality, firmware compatibility, BIOS settings, power/cooling, lifecycle control | Treating GPU performance as only a server issue; network and storage often limit utilization |
| Backend AI fabric | Cisco Nexus leaf-spine, routed Ethernet, RoCEv2, PFC, ECN, QoS | Low latency, low loss, high east-west bandwidth, consistent QoS and MTU | Enabling PFC broadly instead of only the lossless RDMA class |
| Frontend / service fabric | Nexus leaf-spine, VLAN/VRF, VXLAN EVPN, load balancers, firewalls | User/API access, tenant segmentation, app access to inference endpoints | Mixing frontend bursty traffic with backend GPU collective traffic without isolation |
| Storage / data fabric | Ethernet storage, NFS, object, parallel file systems, NVMe/TCP, FC where used | Data ingest, checkpointing, model/data access, throughput and metadata behavior | Optimizing only GPU fabric while dataset reads/checkpoints remain bottlenecked |
| Management / OOB | OOB management network, Cisco Intersight, Cisco Nexus Dashboard, NDFC, AAA, syslog, telemetry | Secure access, inventory, automation, fabric health, configuration consistency | Managing devices in-band only and losing control during data-plane incidents |
| Automation / intent | Cisco Nexus Dashboard Fabric Controller, Intersight, APIs, Ansible/Terraform, templates | Repeatable fabric and server deployment, drift detection, rollback planning | Manual per-device changes that create QoS or MTU inconsistency |
| Observability | Nexus telemetry, interface counters, queue counters, PFC/ECN counters, server/GPU metrics | Correlate job slowdown with drops, pause frames, ECN marks, congestion, host issues | Looking only at packet drops; congestion can show as ECN marks or PFC pause without drops |
AI workload traffic patterns
| Workload pattern | Primary bottleneck | Network behavior | Design priority |
|---|---|---|---|
| Distributed training | East-west GPU-to-GPU communication | Large synchronized bursts, all-reduce/all-to-all, sensitivity to tail latency | Nonblocking or low-oversubscription leaf-spine, consistent RDMA QoS |
| Inference | North-south request/response plus backend calls | Latency-sensitive, often smaller flows, may be horizontally scaled | Frontend resilience, load balancing, segmentation, predictable latency |
| Data ingest / preprocessing | Storage throughput and CPU pipeline | Read-heavy, metadata-heavy, sometimes bursty | Storage locality, caching, separate QoS class from RDMA |
| Checkpointing | Write bandwidth and storage consistency | Periodic large writes can congest links | Isolate or rate-manage checkpoint traffic; avoid starving RDMA |
| Model distribution | One-to-many reads or image pulls | Bursty pulls from registries/object stores | Local caching, registry placement, bandwidth planning |
| Multi-tenant AI cluster | Isolation and noisy-neighbor control | Competing traffic classes and jobs | VRF/VLAN/ACL segmentation, QoS, quotas at orchestration layer |
Training vs inference distinctions
| Dimension | Training | Inference |
|---|---|---|
| Main goal | Maximize GPU utilization and job completion speed | Minimize response latency and maintain availability |
| Dominant traffic | East-west GPU synchronization, dataset reads, checkpoints | Client/API traffic, model serving, feature/data lookups |
| Failure impact | Job restart, checkpoint recovery, wasted GPU time | User-facing outage or degraded service |
| Network concern | Lossless/low-loss RDMA, ECMP balance, congestion control | Load balancing, service segmentation, autoscaling, observability |
| Exam decision point | Prefer high-bandwidth backend fabric with strict QoS consistency | Prefer resilient frontend design with security and traffic isolation |
Fabric design selection
| Design choice | Choose when | Avoid / watch for |
|---|---|---|
| Leaf-spine Clos | Need predictable scale-out, ECMP, uniform hop count | Incorrect cabling or uneven uplinks causing hot spots |
| Rail-optimized fabric | GPU servers have multiple NICs/rails and traffic should stay balanced per rail | Misaligned rail cabling; one rail congests while others are idle |
| Separate backend and frontend fabrics | Need strict separation between RDMA training traffic and user/service traffic | More cabling and operational domains |
| Converged fabric with strict QoS | Cabling budget or architecture requires shared links | Requires disciplined classification, queuing, monitoring, and change control |
| L3 routed backend fabric | Need ECMP scale and simple failure domains for RoCEv2 | Forgetting end-to-end MTU, DSCP/CoS, and ECN consistency |
| VXLAN EVPN fabric | Need tenant segmentation, L2 extension, VRFs, workload mobility | Do not assume overlay fixes physical congestion |
| vPC host attachment | Need L2 dual-homing to a pair of switches | vPC peer link is not a substitute for proper spine capacity |
| OOB management | Need reliable device access during fabric failure | Skipping AAA, logging, and route separation |
Typical AI fabric planes
flowchart LR
User[Users / APIs] --> FE[Frontend service fabric]
FE --> App[Inference / scheduler / apps]
App --> Storage[Dataset / model storage]
GPU[GPU servers] <--> BE[Backend AI fabric: RoCEv2 / RDMA]
GPU --> Storage
Mgmt[OOB management] --> GPU
Mgmt --> Nexus[Cisco Nexus switches]
Ops[Cisco Intersight / Nexus Dashboard / NDFC] --> Mgmt
Capacity and bottleneck math
Use oversubscription and bisection thinking rather than memorizing arbitrary numbers.
\[
\text{Oversubscription ratio}=\frac{\text{Total server-facing bandwidth}}{\text{Total fabric-facing uplink bandwidth}}
\]
\[
\text{Effective throughput} \le \min(\text{GPU demand},\ \text{NIC capacity},\ \text{fabric path capacity},\ \text{storage throughput})
\]
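
As a worked illustration of the two formulas above, the following minimal Python sketch plugs in made-up numbers. The port counts, speeds, and per-node rates are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def oversubscription_ratio(server_facing_gbps: float, uplink_gbps: float) -> float:
    """Total server-facing bandwidth divided by total fabric-facing uplink bandwidth."""
    if uplink_gbps <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return server_facing_gbps / uplink_gbps


def effective_throughput_bound(gpu_demand, nic_capacity, fabric_path, storage):
    """Upper bound on throughput: the slowest element of the path sets the limit."""
    return min(gpu_demand, nic_capacity, fabric_path, storage)


# Hypothetical leaf: 48 x 100G server-facing ports, 8 x 400G uplinks.
ratio = oversubscription_ratio(48 * 100, 8 * 400)
print(f"oversubscription = {ratio:.2f}:1")  # 4800 / 3200 = 1.50:1

# Hypothetical per-node figures in Gbps; storage is the limiting term here.
bound = effective_throughput_bound(gpu_demand=380, nic_capacity=400,
                                   fabric_path=400, storage=120)
print(f"effective throughput <= {bound} Gbps")
```

In this example the fabric is only mildly oversubscribed, yet the job is still bounded by storage, which is the point of treating the path as a minimum over all stages.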
| Metric | Meaning | Exam use |
|---|---|---|
| Oversubscription | Downlink demand compared with uplink capacity | Lower oversubscription is preferred for synchronous distributed training |
| Bisection bandwidth | Available bandwidth between two halves of the cluster | Important for all-to-all and all-reduce traffic |
| Tail latency | High-percentile latency, not just average latency | A few slow flows can delay synchronized training |
| Queue depth | Amount of buffered traffic | Rising queues indicate congestion before drops appear |
| ECN marks | Congestion signal without dropping packets | Validate congestion management is active |
| PFC pause frames | Link-level pause for a priority | Useful when controlled; dangerous if persistent or spreading |
| GPU utilization | Time GPU is doing useful work | Low utilization can be network, storage, CPU, or orchestration related |
RoCEv2 and lossless Ethernet reference
| Item | Role | Key implementation point |
|---|---|---|
| RDMA | Direct memory access between hosts without heavy CPU involvement | Improves throughput/latency for GPU clusters and storage-like workloads |
| RoCEv2 | RDMA over UDP/IP Ethernet | Routable; depends on correct QoS, MTU, and congestion handling |
| PFC | Priority Flow Control; per-priority Layer 2 pause | Apply only to selected no-drop class, usually RDMA |
| ECN | Explicit Congestion Notification | Marks packets before drops; host/NIC congestion algorithm reacts |
| DCQCN | Data Center Quantized Congestion Notification | NIC-side congestion control commonly associated with RoCEv2 fabrics |
| DCB | Data Center Bridging feature set | Includes PFC (IEEE 802.1Qbb), ETS (IEEE 802.1Qaz), and DCBX |
| ETS | Enhanced Transmission Selection | Allocates bandwidth among traffic classes |
| DSCP | Layer 3 QoS marking | Useful across routed RoCEv2 fabrics |
| CoS / PCP | Layer 2 priority marking | Used for priority behavior on Ethernet links |
| MTU | Maximum transmission unit | Must be consistent end-to-end for intended jumbo behavior |
PFC, ECN, and drops
| Mechanism | Layer / scope | What happens | Why it matters | Trap |
|---|---|---|---|---|
| Tail drop | Queue overflow | Packet is dropped | TCP may recover; RDMA can suffer severe performance impact | Waiting for drops before investigating congestion |
| PFC | L2, link-local, per priority | Receiver pauses sender for a priority | Prevents loss in no-drop class | Overuse can create head-of-line blocking and pause storms |
| ECN | L3 marking, end-to-end signal | Switch marks packets instead of dropping | Sender reduces rate before severe congestion | ECN marking alone does nothing if hosts/NICs do not react |
| WRED/RED with ECN | Queue management | Marks or drops based on thresholds | Early congestion signaling | Bad thresholds can mark too late or too aggressively |
| DCQCN | Host/NIC algorithm | Adjusts RDMA transmit rate | Stabilizes RoCEv2 under congestion | Requires consistent network and NIC configuration |
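
To make the "bad thresholds" trap in the WRED/RED row concrete, here is a minimal sketch of the classic RED-style marking curve: no marking below a minimum threshold, a linear ramp up to a maximum probability, and marking of every packet at or above the maximum threshold. It uses instantaneous queue depth for simplicity, and the threshold values are hypothetical. Actual switch behavior and tunables depend on the platform.

```python
def ecn_mark_probability(queue_kb: float, min_kb: float, max_kb: float,
                         max_prob: float) -> float:
    """RED-style ECN marking probability for a given queue depth (kilobytes)."""
    if not 0 <= min_kb < max_kb:
        raise ValueError("thresholds must satisfy 0 <= min_kb < max_kb")
    if not 0 < max_prob <= 1:
        raise ValueError("max_prob must be in (0, 1]")
    if queue_kb <= min_kb:
        return 0.0  # queue is short: no congestion signal
    if queue_kb >= max_kb:
        return 1.0  # queue is long: mark every ECN-capable packet
    # Linear ramp between the two thresholds.
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)


# Hypothetical thresholds: start marking at 150 KB, mark everything at 3000 KB.
for depth in (100, 150, 1575, 3000):
    p = ecn_mark_probability(depth, min_kb=150, max_kb=3000, max_prob=0.1)
    print(f"queue {depth:>4} KB -> mark probability {p:.3f}")
```

A minimum threshold set too high delays the first marks until the queue is nearly full, while one set too low marks traffic during harmless short bursts.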
RoCEv2 implementation checklist
- Use a dedicated QoS class for RDMA traffic.
- Map RDMA traffic consistently: application/NIC marking → DSCP/CoS → switch qos-group → queue.
- Enable PFC only for the intended no-drop priority.
- Configure ECN/WRED behavior for early congestion signaling where supported.
- Keep MTU consistent across server NICs, switch interfaces, port channels, routed links, and overlays where applicable.
- Verify the trust boundary: do switches trust server markings, rewrite them, or classify by ACL?
- Avoid mixing RDMA with storage bursts, backup, checkpoint, or general TCP traffic in the same no-drop queue.
- Monitor PFC pause counters and ECN marks continuously; “no drops” is not proof of a healthy fabric.
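
Several checklist items above reduce to "the same values at every hop". The sketch below shows one way such a check could be automated. The `HopQos` model, its field names, and the sample values are hypothetical, and strict MTU equality is a simplification of the checklist's consistency requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopQos:
    """QoS-relevant settings of one hop (server NIC or switch). Hypothetical model."""
    name: str
    mtu: int
    rdma_dscp: int
    pfc_priorities: frozenset  # priorities that have PFC enabled
    ecn_enabled: bool


def find_inconsistencies(path, expected_pfc_priority):
    """Compare every hop with the first hop and with the intended PFC priority."""
    issues = []
    reference = path[0]
    for hop in path:
        if hop.mtu != reference.mtu:
            issues.append(f"{hop.name}: MTU {hop.mtu} != {reference.mtu}")
        if hop.rdma_dscp != reference.rdma_dscp:
            issues.append(f"{hop.name}: RDMA DSCP {hop.rdma_dscp} != {reference.rdma_dscp}")
        if hop.pfc_priorities != {expected_pfc_priority}:
            issues.append(
                f"{hop.name}: PFC on {sorted(hop.pfc_priorities)}, "
                f"expected only [{expected_pfc_priority}]"
            )
        if not hop.ecn_enabled:
            issues.append(f"{hop.name}: ECN not enabled for the RDMA queue")
    return issues


# Hypothetical path: the spine has a smaller MTU and PFC on an extra priority.
path = [
    HopQos("gpu-nic", 9000, 26, frozenset({3}), True),
    HopQos("leaf1", 9000, 26, frozenset({3}), True),
    HopQos("spine1", 1500, 26, frozenset({3, 4}), True),
]
for issue in find_inconsistencies(path, expected_pfc_priority=3):
    print(issue)
```

In practice the per-hop values would come from a source of truth or from device queries rather than being typed in by hand.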
Cisco Nexus QoS mental model
| QoS stage | Question to ask | NX-OS concept | Validation focus |
|---|---|---|---|
| Classification | Which packets are RDMA, storage, control, or best effort? | class-map type qos, ACL/DSCP/CoS matching | Packets enter the intended class |
| Marking | What internal forwarding class is used? | qos-group, DSCP/CoS rewrite where used | Marking is consistent across hops |
| Network QoS | Which classes are no-drop and what MTU applies? | policy-map type network-qos | PFC priority and MTU are correct |
| Queuing | How is bandwidth/buffering allocated? | policy-map type queuing | Queue behavior under congestion |
| Interface binding | Where do policies apply? | System QoS and interface policy attachment | No missing links in the path |
| Monitoring | Are queues pausing, marking, or dropping? | Interface, queuing, PFC, policy counters | Correlate counters with job symptoms |
Illustrative NX-OS QoS skeleton
Platform syntax and feature availability vary by Cisco Nexus model and NX-OS release. Treat this as a pattern, not a copy-paste answer.
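! Classify RDMA traffic by DSCP and assign an internal qos-group
class-map type qos match-any RDMA-QOS
  match dscp <rdma-dscp>
policy-map type qos AI-CLASSIFY
  class RDMA-QOS
    set qos-group <rdma-qos-group>
! Make the RDMA class no-drop (PFC) and set its MTU
class-map type network-qos RDMA-NO-DROP
  match qos-group <rdma-qos-group>
policy-map type network-qos AI-NETWORK-QOS
  class type network-qos RDMA-NO-DROP
    mtu <jumbo-mtu>
    pause pfc-cos <rdma-cos>
! Reserve bandwidth and enable ECN marking on the RDMA queue
! (WRED/ECN thresholds, where required, are platform-specific and omitted here)
policy-map type queuing AI-QUEUING
  class type queuing <rdma-queue-class>
    bandwidth percent <reserved-percent>
    random-detect ecn
! Attach the policies; the attachment point (system qos or interface) may vary by platform
system qos
  service-policy type qos input AI-CLASSIFY
  service-policy type network-qos AI-NETWORK-QOS
  service-policy type queuing output AI-QUEUING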
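
Before placeholders in a skeleton like this are filled in, the values can be sanity-checked. The following sketch validates a few fields and renders only the classification part. The DSCP and CoS ranges follow from their field widths (6 and 3 bits). The MTU bounds and the sample values are assumptions to verify against the target platform.

```python
def validate_qos_params(dscp, cos, mtu, bandwidth_percent):
    """Return a list of problems; an empty list means the values look sane."""
    errors = []
    if not 0 <= dscp <= 63:
        errors.append(f"DSCP {dscp} is outside 0-63 (6-bit field)")
    if not 0 <= cos <= 7:
        errors.append(f"CoS {cos} is outside 0-7 (3-bit field)")
    if not 1500 <= mtu <= 9216:  # assumed bounds; check the platform limit
        errors.append(f"MTU {mtu} is outside the assumed 1500-9216 range")
    if not 0 < bandwidth_percent <= 100:
        errors.append(f"bandwidth percent {bandwidth_percent} is outside 1-100")
    return errors


def render_classification(dscp, qos_group):
    """Fill the classification part of the skeleton with concrete values."""
    return "\n".join([
        "class-map type qos match-any RDMA-QOS",
        f"  match dscp {dscp}",
        "policy-map type qos AI-CLASSIFY",
        "  class RDMA-QOS",
        f"    set qos-group {qos_group}",
    ])


# Hypothetical values for illustration only.
params = {"dscp": 26, "cos": 3, "mtu": 9216, "bandwidth_percent": 50}
problems = validate_qos_params(**params)
if problems:
    print("\n".join(problems))
else:
    print(render_classification(params["dscp"], qos_group=3))
```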
Useful verification commands
show interface ethernet <slot/port>
show interface ethernet <slot/port> counters
show interface ethernet <slot/port> priority-flow-control
show queuing interface ethernet <slot/port>
show policy-map interface ethernet <slot/port>
show class-map type qos
show policy-map type network-qos
show policy-map type queuing
show lldp neighbors
show port-channel summary
show ip route
show bgp ipv4 unicast summary
show logging last <lines>
Routing, ECMP, and overlay decisions
| Topic | High-yield point | Troubleshooting clue |
|---|---|---|
| L3 underlay | Provides routed reachability and ECMP between leaves and spines | Missing route, failed adjacency, or asymmetric MTU causes traffic black holes |
| ECMP | Spreads flows across equal-cost paths | Large elephant flows can still hash unevenly |
| BGP underlay | Common for scalable leaf-spine designs | Check neighbor state, advertised prefixes, next hops |
| OSPF/IS-IS underlay | Also possible in routed fabrics | Check area/level, MTU, adjacency, passive interfaces |
| BFD | Speeds failure detection where implemented | False positives can flap paths if timers are too aggressive |
| VXLAN EVPN | Adds scalable L2/L3 overlay and tenant segmentation | Overlay reachability still depends on underlay health |
| VRF | Separates routing tables and tenants | Wrong VRF is a common cause of “reachable from one place only” |
| vPC | Dual-homed L2 access to a switch pair | Peer-link congestion or orphan-port behavior can affect flows |
| Multicast / BUM handling | Needed in some overlay designs | Misconfigured replication affects ARP/ND/flooding behavior |
| DCI | Interconnects sites/fabrics | Avoid assuming latency-sensitive training can span sites without specialized design |
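
The ECMP row notes that large elephant flows can still hash unevenly. The sketch below illustrates why: each flow is pinned to one path by a hash of its 5-tuple, so a handful of large flows may land on the same link. SHA-256 is only a stand-in for the hardware hash a switch actually uses, and the addresses, ports, and rates are hypothetical.

```python
import hashlib


def pick_path(five_tuple, n_paths):
    """Map a flow to one of n_paths; every packet of the flow takes the same path."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def load_per_path(flows, n_paths):
    """Sum offered load (Gbps) per path for (five_tuple, gbps) pairs."""
    load = [0.0] * n_paths
    for five_tuple, gbps in flows:
        load[pick_path(five_tuple, n_paths)] += gbps
    return load


# Hypothetical elephant flows: (src IP, dst IP, protocol, src port, dst port).
# UDP destination port 4791 is the RoCEv2 port; source ports differ per flow.
flows = [
    (("10.0.0.1", "10.0.1.1", "udp", 49152 + i, 4791), 100.0)
    for i in range(4)
]
for path, gbps in enumerate(load_per_path(flows, n_paths=4)):
    print(f"path {path}: {gbps:.0f} Gbps")
```

With only four flows and four paths, a perfectly even spread is unlikely. Many small flows average out, but a few large ones do not.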
Underlay vs overlay
| Requirement | Prefer |
|---|---|
| Simple high-performance backend RDMA fabric | Routed L3 underlay with ECMP |
| Multi-tenant application networks | VXLAN EVPN with VRFs |
| Need L2 adjacency for specific workloads | EVPN/VXLAN or controlled L2 design |
| Strict isolation between AI backend and user traffic | Separate fabrics or separate VRFs/classes |
| Operational consistency at scale | NDFC templates and intent-based fabric management |
Cisco UCS and GPU compute reference
| Area | What to validate | Why it matters |
|---|---|---|
| Firmware | Server, BIOS, GPU, NIC/HCA, storage controller versions | Mismatched firmware can break RDMA, driver compatibility, or performance |
| Drivers | GPU driver, CUDA stack where relevant, NIC/RDMA drivers | Host stack must align with hardware and workload framework |
| PCIe topology | GPU-to-NIC locality, NUMA domain, slot placement | Poor locality adds latency and CPU/memory overhead |
| GPU interconnect | PCIe, NVLink/NVSwitch where present | Scale-up bandwidth differs from scale-out fabric bandwidth |
| BIOS settings | Performance profile, virtualization, SR-IOV, power settings as required | Default power-saving settings can reduce throughput |
| NIC features | RoCEv2, PFC/ECN support, MTU, offloads | Host NIC must participate in congestion control |
| Power/cooling | Rack power, airflow, thermal headroom | Throttling looks like performance degradation, not a link failure |
| Server identity | Service profiles / server profiles, MAC/WWN/IP policies | Enables repeatable deployment and replacement |
| Inventory | Cisco Intersight / UCS Manager visibility | Speeds lifecycle, compliance, and fault isolation |
Cisco Intersight vs UCS Manager vs Nexus Dashboard
| Tool | Primary scope | Use for |
|---|---|---|
| Cisco Intersight | Server and infrastructure lifecycle management | UCS inventory, firmware policies, profiles, health, automation |
| Cisco UCS Manager | UCS domain management | Fabric Interconnect-attached UCS configuration and policies |
| Cisco Nexus Dashboard | Data center operational platform | Hosting apps for fabric operations, insights, and automation |
| Cisco Nexus Dashboard Fabric Controller | Fabric automation and lifecycle | Nexus fabric design, deployment, templates, consistency |
| Nexus Dashboard Insights | Visibility and assurance | Telemetry, anomalies, change impact, troubleshooting context |
Storage and data path reference
| Storage pattern | Best fit | Network consideration | Trap |
|---|---|---|---|
| NFS / NAS | Shared datasets, simpler operations | Throughput, metadata performance, mount design | Single mount or filer path becomes hot spot |
| Object storage | Large datasets, model artifacts, data lake workflows | HTTP/API throughput, caching, locality | Many small objects can stress metadata/API path |
| Parallel file system | Large-scale training datasets | High aggregate bandwidth and metadata scaling | Misconfigured clients can limit performance |
| NVMe/TCP | High-performance block over Ethernet | Separate QoS and congestion planning | Sharing RDMA no-drop class without design |
| Fibre Channel / SAN | Enterprise block storage environments | Separate FC fabric or converged design where used | Assuming SAN bandwidth equals training data throughput |
| Local NVMe cache | Hot data, preprocessing, temporary shards | Data distribution and cache warm-up | Cache miss storms during job start |
| Container registry | Images, model-serving components | Pull storms during scale-out | No local mirror or pre-pull strategy |
Storage vs backend RDMA
| Question | If yes | Design response |
|---|---|---|
| Does checkpointing coincide with training synchronization? | Storage writes can congest fabric | Separate QoS class or schedule/checkpoint tuning |
| Is dataset read throughput below GPU demand? | GPUs wait idle | Improve storage parallelism, caching, or data placement |
| Are storage and RDMA sharing uplinks? | Congestion coupling is possible | Monitor queues by class; reserve bandwidth carefully |
| Are small-file reads dominant? | Metadata may bottleneck | Optimize dataset format, caching, or parallel filesystem metadata |
| Is storage traffic marked the same as RDMA? | No-drop queue can be polluted | Reclassify and isolate storage traffic |
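
Two of the questions above come down to simple arithmetic: how much read throughput the GPUs need, and how long a checkpoint occupies the write path. The sketch below shows that arithmetic with hypothetical inputs. It ignores protocol overhead, caching, and contention.

```python
def required_read_gbps(num_gpus, samples_per_sec_per_gpu, sample_bytes):
    """Aggregate dataset read rate needed to keep all GPUs fed, in Gbps."""
    return num_gpus * samples_per_sec_per_gpu * sample_bytes * 8 / 1e9


def checkpoint_seconds(checkpoint_gbytes, write_gbps):
    """Time to write one checkpoint at a sustained write rate."""
    if write_gbps <= 0:
        raise ValueError("write rate must be positive")
    return checkpoint_gbytes * 8 / write_gbps


# Hypothetical job: 64 GPUs, 1000 samples/s per GPU, 150 KB per sample.
print(f"read demand: {required_read_gbps(64, 1000, 150_000):.1f} Gbps")

# Hypothetical checkpoint: 500 GB written at a sustained 100 Gbps.
print(f"checkpoint time: {checkpoint_seconds(500, 100):.0f} s")
```

If the storage path delivers less than the computed read demand, the GPUs wait. If checkpoint windows overlap training synchronization on shared uplinks, the two traffic types compete for the same queues.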
Security, segmentation, and governance
| Control | Use for | Exam focus |
|---|---|---|
| AAA with TACACS+/RADIUS | Centralized admin authentication and authorization | Role separation and auditability |
| RBAC | Limit operator privileges | Least privilege for fabric/server operations |
| Management VRF / OOB | Isolate device management | Reachability during fabric incidents |
| SSH/HTTPS only | Secure administrative access | Disable insecure management protocols |
| SNMPv3 / secure telemetry | Authenticated monitoring | Avoid cleartext community strings |
| ACLs | Restrict management and tenant traffic | Apply in correct direction and VRF |
| VRFs | Routing isolation | Tenant/job/environment segmentation |
| CoPP | Protect switch control plane | Prevent data-plane events from overwhelming CPU |
| Image/firmware governance | Trusted software lifecycle | Consistent versions, controlled upgrades |
| Secrets handling | Protect API tokens, registry credentials, keys | Avoid embedding secrets in templates or scripts |
AI-specific security considerations
- Separate management, storage, frontend, and backend fabric access.
- Restrict who can change QoS, PFC, ECN, and fabric templates; mistakes can affect the whole cluster.
- Protect datasets, model artifacts, checkpoints, and container registries.
- Use change control for firmware, driver, and CUDA-related stack changes.
- Monitor for configuration drift between rails, leaves, and server NICs.
- Keep automation credentials scoped and rotated.
Automation and operations decision table
| Need | Prefer | Why |
|---|---|---|
| Build or modify Nexus fabrics consistently | Cisco Nexus Dashboard Fabric Controller | Intent/templates reduce per-device drift |
| Manage UCS firmware and server profiles | Cisco Intersight or UCS Manager | Central lifecycle and policy control |
| Query switch state programmatically | NX-API, NETCONF/RESTCONF, gNMI where supported | Enables validation and telemetry workflows |
| Repeat configuration tasks | Ansible/Terraform or vendor-supported automation | Idempotent changes and version control |
| Validate pre/post change state | Automated checks plus Nexus telemetry | Catch MTU, QoS, adjacency, and counter regressions |
| Correlate incidents | Nexus Dashboard Insights, syslog, telemetry, job metrics | AI performance issues cross device boundaries |
| Operate Kubernetes-based AI workloads | Kubernetes tools plus infrastructure telemetry | Cluster scheduler symptoms may originate in network/storage |
Automation safeguards
- Maintain a source of truth for fabric topology, addressing, VRFs, QoS classes, and cabling.
- Validate generated config before deployment.
- Use staged rollout for QoS, PFC, ECN, and MTU changes.
- Capture pre-change counters and control-plane state.
- Confirm rollback steps before changing fabric-wide policies.
- Test one rail or failure domain when possible before broad rollout.
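
The "capture pre-change counters" safeguard can be paired with a simple post-change comparison. This sketch diffs two counter snapshots and reports any watched counter that increased. The counter names and the snapshot layout are hypothetical, and collecting the snapshots from devices is out of scope here.

```python
WATCHED = ("pfc_rx_pause", "pfc_tx_pause", "ecn_marked", "drops", "crc_errors")


def counter_regressions(before, after, watched=WATCHED):
    """Return (interface, counter, delta) for every watched counter that grew."""
    findings = []
    for iface in sorted(after):
        for name in watched:
            old = before.get(iface, {}).get(name, 0)
            new = after[iface].get(name, 0)
            if new > old:
                findings.append((iface, name, new - old))
    return findings


# Hypothetical snapshots taken before and after a staged QoS change.
before = {
    "Ethernet1/1": {"pfc_rx_pause": 10, "drops": 0, "crc_errors": 0},
    "Ethernet1/2": {"pfc_rx_pause": 0, "drops": 0, "crc_errors": 0},
}
after = {
    "Ethernet1/1": {"pfc_rx_pause": 10, "drops": 0, "crc_errors": 0},
    "Ethernet1/2": {"pfc_rx_pause": 4200, "drops": 37, "crc_errors": 0},
}
for iface, name, delta in counter_regressions(before, after):
    print(f"{iface}: {name} increased by {delta}")
```

A non-empty result after a staged rollout is a signal to stop and review before widening the change.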
Troubleshooting workflow
flowchart TD
A[Symptom: AI job slow or failing] --> B{Reachability issue?}
B -->|Yes| C[Check L1/L2/L3: link, VLAN/VRF, route, MTU, ACL]
B -->|No| D{Drops, ECN, or PFC counters?}
D -->|Drops| E[Check queue policy, congestion, bad optics, CRC/FEC, oversubscription]
D -->|ECN marks| F[Validate ECN thresholds and host/NIC congestion response]
D -->|PFC pause| G[Find congested receiver, no-drop queue, HOL blocking, pause propagation]
D -->|None obvious| H{GPU utilization low?}
H -->|Yes| I[Check storage, CPU preprocessing, NUMA, drivers, scheduler]
H -->|No| J[Check application, batch size, framework, job placement]
C --> K[Retest with counters cleared or time-bounded telemetry]
E --> K
F --> K
G --> K
I --> K
J --> K
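
The flowchart above can also be read as a small decision function. The sketch below encodes the same branches. Where several counters are non-zero at once it returns every matching check, which is an interpretation, since the flowchart does not say which branch wins.

```python
def next_checks(reachability_issue, drops=0, ecn_marks=0, pfc_pauses=0,
                gpu_util_low=False):
    """Return the checks suggested by the troubleshooting flowchart."""
    if reachability_issue:
        return ["Check L1/L2/L3: link, VLAN/VRF, route, MTU, ACL"]
    checks = []
    if drops > 0:
        checks.append("Check queue policy, congestion, bad optics, CRC/FEC, oversubscription")
    if ecn_marks > 0:
        checks.append("Validate ECN thresholds and host/NIC congestion response")
    if pfc_pauses > 0:
        checks.append("Find congested receiver, no-drop queue, HOL blocking, pause propagation")
    if checks:
        return checks
    if gpu_util_low:
        return ["Check storage, CPU preprocessing, NUMA, drivers, scheduler"]
    return ["Check application, batch size, framework, job placement"]


# Slow job, no drops, but PFC pauses are climbing: follow the PFC branch.
for step in next_checks(reachability_issue=False, pfc_pauses=1500):
    print(step)
```

Whichever branch is taken, the final step in the flowchart still applies: retest with cleared counters or time-bounded telemetry.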
Symptom-to-cause matrix
| Symptom | Likely causes | Verify | Corrective direction |
|---|---|---|---|
| RDMA connection fails | MTU mismatch, wrong DSCP/CoS, PFC disabled, ACL/VRF issue, NIC driver mismatch | Ping with large packet where appropriate, route/VRF, PFC/QoS counters, host RDMA tools | Align MTU, routing, QoS, NIC settings |
| RDMA works but slow | ECMP imbalance, congestion, ECN not reacting, PFC pause, storage bottleneck | Queue depth, ECN marks, PFC counters, link utilization, GPU utilization | Tune traffic placement, QoS, congestion control, storage path |
| PFC pause storm | Overloaded receiver, no-drop class too broad, buffer threshold issue, head-of-line blocking | PFC Rx/Tx by interface and priority, queue occupancy | Narrow no-drop traffic, relieve congestion, review thresholds |
| Drops in RDMA class | PFC not active, wrong priority mapping, queue/buffer pressure | Interface drops, policy counters, DSCP/CoS mapping | Fix classification and no-drop policy; reduce congestion |
| ECN marks but no performance improvement | Host/NIC not reacting, wrong traffic class, thresholds ineffective | NIC counters, switch ECN counters, DSCP mapping | Align NIC congestion control and switch ECN behavior |
| One rail congested | Cabling imbalance, hashing issue, failed link, uneven job placement | Per-rail utilization, LLDP, port-channel/ECMP state | Correct cabling, restore links, rebalance workloads |
| BGP adjacency down | IP mismatch, VRF error, ACL, MTU, authentication, interface down | Neighbor state, logs, interface status, route table | Fix underlay config and physical link |
| VXLAN tenant unreachable | VNI/VRF mismatch, EVPN route issue, NVE peer issue, underlay failure | EVPN routes, NVE peers, VRF routes | Correct overlay mapping and underlay reachability |
| GPU utilization low | Storage slow, CPU preprocessing slow, network congestion, scheduler placement | GPU metrics, storage metrics, queue counters, host CPU | Remove data path bottleneck; tune placement |
| High CRC/FEC errors | Optics/cable issue, dirty fiber, speed/FEC mismatch | Interface counters, transceiver detail, logs | Replace/clean optics/cables; align link settings |
| Intermittent job failures | Microbursts, thermal throttling, link flaps, driver issues | Time-correlated telemetry, environment sensors, logs | Correlate by timestamp and isolate failure domain |
High-yield distinctions
| Distinction | Know this |
|---|---|
| Lossless Ethernet vs no congestion | Lossless mechanisms reduce drops; they do not eliminate congestion or guarantee performance |
| PFC vs global pause | PFC pauses selected priorities; global pause stops all traffic on a link and is usually undesirable in data centers |
| ECN vs PFC | ECN signals congestion end-to-end; PFC pauses a local link priority |
| DSCP vs CoS | DSCP is L3 marking; CoS/PCP is L2 marking. Routed fabrics commonly need DSCP consistency |
| qos-group vs DSCP | qos-group is internal switch classification; DSCP is carried in the packet |
| type qos vs type network-qos vs type queuing | Classification/marking vs no-drop/MTU behavior vs scheduling/bandwidth behavior |
| Underlay vs overlay | Underlay provides IP reachability; overlay provides tenant/L2/L3 virtualization |
| ECMP vs port channel | ECMP balances routed next hops; port channels bundle links between the same logical neighbors |
| Scale-up vs scale-out | Scale-up uses local GPU interconnects inside a server/chassis; scale-out uses the network between servers |
| Training vs inference | Training is dominated by synchronized east-west and storage traffic; inference is dominated by service latency and availability |
| NDFC vs Intersight | NDFC manages Nexus fabric intent; Intersight manages server/infrastructure lifecycle |
| Control plane vs data plane | Control plane builds routes/state; data plane forwards traffic. Both must be healthy |
| Drops vs errors | Drops may be congestion/policy; errors often indicate physical or link-layer problems |
| Low average utilization vs microbursts | Average link use can look safe while short bursts fill queues |
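
The last distinction, low average utilization versus microbursts, is easy to show numerically. The sketch below feeds a bursty offered load into a single 100 Gbps link in 1 ms slots and tracks the backlog. The traffic pattern is hypothetical and the queue is modeled as unbounded (no drops, no PFC) purely to expose how large the backlog becomes.

```python
def simulate_queue(offered_gbps, link_gbps, slot_ms=1.0):
    """Return (average utilization, peak backlog in Mbit) for per-slot offered load."""
    queue_mbit = 0.0
    peak_mbit = 0.0
    for rate in offered_gbps:
        arrived = rate * slot_ms        # Gbps x ms = Mbit
        drained = link_gbps * slot_ms   # what the link can send in one slot
        queue_mbit = max(0.0, queue_mbit + arrived - drained)
        peak_mbit = max(peak_mbit, queue_mbit)
    avg_util = sum(offered_gbps) / len(offered_gbps) / link_gbps
    return avg_util, peak_mbit


# Hypothetical pattern over 1 s: five 10 ms incast bursts at 300 Gbps,
# with 10 Gbps of background traffic the rest of the time.
pattern = ([300.0] * 10 + [10.0] * 190) * 5
avg_util, peak_mbit = simulate_queue(pattern, link_gbps=100.0)
print(f"average utilization: {avg_util:.1%}")                                  # 24.5%
print(f"peak backlog: {peak_mbit:.0f} Mbit ({peak_mbit / 8:.0f} MB)")  # 2000 Mbit (250 MB)
```

An average below 25% looks safe on a utilization graph, yet each burst builds a backlog far larger than typical switch buffers, so in a real fabric it would trigger ECN marks, PFC pauses, or drops.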
Implementation review checklist
Fabric
- Leaf-spine cabling matches the intended topology and rail design.
- All uplinks/downlinks run at the expected speed and duplex with clean counters.
- Routing adjacencies are stable.
- ECMP paths are present and balanced enough for workload patterns.
- VRFs/VLANs/VNIs match the design.
- vPC peer links and keepalives are healthy where vPC is used.
- OOB management remains reachable during data-plane changes.
RoCEv2 / QoS
- RDMA traffic is classified consistently at ingress.
- DSCP/CoS/qos-group mapping is consistent across the fabric.
- PFC is enabled only on the intended priority.
- ECN marking is configured for the intended queue where supported.
- MTU is consistent across server NICs, switchports, port channels, routed links, and overlays.
- Queue counters are monitored before and after changes.
- Storage and checkpoint traffic do not pollute the RDMA no-drop queue.
Compute
- UCS/server firmware, NIC firmware, GPU drivers, and OS drivers are compatible.
- BIOS/performance settings match workload requirements.
- GPU-to-NIC locality is understood.
- Power and cooling are sufficient under sustained load.
- Intersight/UCS profiles reflect desired identity and firmware policy.
- Host NIC settings align with switch QoS, PFC, ECN, and MTU.
Storage
- Dataset path throughput matches expected GPU consumption.
- Checkpoint traffic is planned and monitored.
- Storage traffic has its own class or policy when needed.
- Metadata bottlenecks are considered for small-file workloads.
- Registry/model artifact pulls are cached or staged for scale-out events.
Operations
- NDFC/Intersight templates are version controlled or otherwise governed.
- Telemetry covers interfaces, queues, PFC, ECN, routes, server health, and job metrics.
- AAA/RBAC and management VRFs are in place.
- Pre-change and post-change validations are defined.
- Rollback steps are documented for QoS, MTU, routing, and firmware changes.
Exam preparation focus
For Cisco 300-640 DCAI, practice explaining not just what each component does, but why you would choose it in an AI data center:
- When to separate backend RDMA and frontend traffic.
- How RoCEv2 depends on PFC, ECN, MTU, and marking consistency.
- How to troubleshoot slow training when there are no obvious packet drops.
- How Cisco Nexus fabric operations differ from Cisco UCS/Intersight server lifecycle tasks.
- How NDFC, Nexus Dashboard, telemetry, and automation reduce drift.
- How storage, compute, and network bottlenecks interact.
Next step for practice
Work through timed scenario questions that force you to pick the best fabric design, QoS action, Cisco Nexus verification command, UCS/Intersight operation, or troubleshooting path from a realistic AI infrastructure symptom.