300-640 DCAI — Cisco Implementing Data Center AI Infrastructure Quick Review
Quick Review for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI): AI data center fabric, RoCE, QoS, compute, storage, automation, and troubleshooting.
Quick Review purpose
This Quick Review is for candidates preparing for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI). Use it to refresh high-yield concepts before moving into topic drills, mock exams, and detailed explanations.
The exam identity is:
| Item | Value |
|---|---|
| Vendor/provider | Cisco |
| Official exam title | Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) |
| Official exam code | 300-640 DCAI |
This page is an IT Mastery review aid. It is designed to pair with an IT Mastery practice plan built on original practice questions, a question bank, targeted topic drills, and detailed explanations.
High-yield mental model
AI infrastructure is not “just fast networking.” It is an end-to-end system where GPU utilization depends on the combined behavior of compute, NICs, fabric, storage, automation, and observability.
| Layer | What to connect quickly | Common candidate trap |
|---|---|---|
| AI workload | Training, inference, data ingest, checkpointing, east-west communication | Treating every AI workload as the same traffic pattern |
| Compute | CPU, GPU, memory, PCIe, NUMA locality, NIC placement | Troubleshooting the network while the bottleneck is host-side |
| Fabric | Leaf-spine design, ECMP, bandwidth, latency, failure domains | Assuming “link up” means the fabric is ready for AI traffic |
| RDMA/RoCE | Low-latency transport, lossless or low-loss behavior, congestion control | Confusing PFC, ECN, and end-host congestion reaction |
| QoS | Classification, marking, queueing, buffer management, no-drop classes | Enabling lossless behavior too broadly |
| Storage | Dataset read, write, checkpoint, object/file/block access patterns | Ignoring storage as a cause of low GPU utilization |
| Operations | Telemetry, baselines, change control, automation, templates | Making isolated changes without pre/post validation |
A strong 300-640 DCAI review mindset is: What is the bottleneck, where is it measured, and which control plane or data plane mechanism is responsible?
AI workload patterns to recognize
| Workload pattern | Infrastructure concern | Review focus |
|---|---|---|
| Distributed training | Heavy east-west traffic between GPU nodes | Fabric bandwidth, ECMP, RDMA, congestion control |
| Model inference | Latency, availability, scaling, north-south and service-to-service traffic | Load balancing, segmentation, observability |
| Data preprocessing | CPU, memory, storage, and network throughput | Storage path and host bottlenecks |
| Checkpointing | Large periodic writes to storage | Storage throughput, QoS isolation, burst handling |
| Model serving at scale | Predictable latency and fault isolation | Placement, traffic engineering, monitoring |
| Multi-tenant AI platform | Isolation between teams, workloads, or environments | VRFs, VLANs/VNIs, policy, RBAC, quota awareness |
Training versus inference
| Area | Training | Inference |
|---|---|---|
| Traffic profile | Large east-west exchanges, synchronization, data ingest | Client/service traffic, API calls, sometimes east-west microservices |
| Key risk | GPU idle time due to network/storage bottlenecks | Latency spikes and inconsistent response time |
| Design emphasis | High bandwidth, predictable congestion behavior | Availability, scaling, observability, segmentation |
| Troubleshooting clue | Low GPU utilization across many nodes | Tail latency, failed requests, capacity imbalance |
Data center fabric foundations
AI clusters often depend on a predictable, high-throughput data center fabric. For exam review, focus less on memorizing product names and more on why a design choice supports AI traffic.
Core fabric concepts
| Concept | What to know | Why it matters for AI infrastructure |
|---|---|---|
| Leaf-spine / Clos | Every leaf has multiple paths through spines | Supports scale-out bandwidth and predictable latency |
| ECMP | Equal-cost paths distribute flows across links | Prevents single-path congestion when hashing is effective |
| Oversubscription | Host-facing (downlink) capacity can exceed spine-facing (uplink) capacity, expressed as a ratio such as 2:1 | AI training can expose oversubscription quickly |
| Failure domains | Isolate failures by rack, leaf pair, spine, pod, or site | Prevents one issue from impacting all workloads |
| MTU consistency | Large frames must be supported end to end if used | Mismatches create drops, fragmentation, or poor performance |
| Underlay reachability | Loopbacks, routed links, and routing protocol health | Overlay and RDMA designs depend on stable reachability |
| Management network | Out-of-band or logically separate access | Required for recovery, automation, and observability |
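The oversubscription row above comes down to simple arithmetic. The following is a minimal sketch; the port counts and speeds are hypothetical examples, not values from any Cisco platform or from the exam.

```python
def oversubscription_ratio(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Return host-facing capacity divided by spine-facing capacity for one leaf."""
    downlink_capacity = downlink_count * downlink_gbps
    uplink_capacity = uplink_count * uplink_gbps
    if uplink_capacity <= 0:
        raise ValueError("uplink capacity must be positive")
    return downlink_capacity / uplink_capacity


# Hypothetical leaf for general workloads: 32 x 400G host ports, 16 x 400G uplinks.
general_leaf = oversubscription_ratio(32, 400, 16, 400)
# Hypothetical leaf for GPU nodes: 32 x 400G host ports, 32 x 400G uplinks.
ai_leaf = oversubscription_ratio(32, 400, 32, 400)

print(f"general leaf {general_leaf:.1f}:1, AI leaf {ai_leaf:.1f}:1")
```

A ratio of 1.0 means the leaf can forward all host traffic toward the spines at line rate. Anything above 1.0 relies on hosts not all sending at once, an assumption that synchronized training traffic tends to break.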
Quick design checks
Ask these questions when reviewing a scenario:
- Is traffic mostly east-west or north-south? Distributed AI training usually stresses east-west paths.
- Is the fabric nonblocking enough for the workload? A fabric that is acceptable for general virtualization may be insufficient for GPU clusters.
- Are paths symmetric and predictable? Asymmetry can complicate troubleshooting, hashing, and telemetry interpretation.
- Is ECMP actually distributing traffic? A single elephant flow, poor hashing inputs, or polarization can overload one path while others sit idle.
- Are MTU, QoS, and RDMA settings consistent from host to switch to host? Partial configuration is a common cause of intermittent failures.
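The ECMP question above is easier to reason about with a toy model. This sketch hashes flow 5-tuples onto four paths. It uses CRC32 as a stand-in hash (an assumption, since real switch hash functions differ), and the addresses, ports, and rates are hypothetical.

```python
import zlib

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def ecmp_path(flow, path_count):
    """Map a 5-tuple to a path index with CRC32 (a stand-in for a switch hash)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % path_count


def load_per_path(flows, path_count):
    """Sum offered load in Gbps per path; flows is a list of (5-tuple, gbps)."""
    load = {path: 0.0 for path in range(path_count)}
    for flow, gbps in flows:
        load[ecmp_path(flow, path_count)] += gbps
    return load


# Hypothetical traffic: 64 small flows with varied source ports, plus one elephant.
small_flows = [
    (("10.0.0.1", "10.0.1.1", "udp", 49152 + i, ROCEV2_UDP_PORT), 1.0)
    for i in range(64)
]
elephant = (("10.0.0.2", "10.0.1.2", "udp", 50000, ROCEV2_UDP_PORT), 200.0)

load = load_per_path(small_flows + [elephant], path_count=4)
for path, gbps in sorted(load.items()):
    print(f"path {path}: {gbps:.0f} Gbps")
```

Per-flow hashing keeps every packet of a flow on one path, so the 200 Gbps elephant lands entirely on a single uplink regardless of how evenly the small flows spread. That is why "ECMP is enabled" does not by itself prove balanced utilization.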
RoCEv2, RDMA, and lossless Ethernet
RDMA lets applications move data directly between memory regions on different hosts, bypassing much of the CPU and kernel network stack, which lowers latency and CPU overhead. In Ethernet AI fabrics, RoCEv2 carries RDMA over UDP/IP (UDP destination port 4791), so it can cross routed hops. As a result, the IP fabric, QoS policy, and host NIC configuration all matter.
What to remember
| Topic | Fast review |
|---|---|
| RDMA purpose | Reduce latency and CPU overhead for high-throughput communication |
| RoCEv2 | RDMA over UDP/IP; can operate across routed IP fabrics when designed correctly |
| Lossless behavior | Usually implemented only for selected traffic classes, not for all traffic |
| PFC | Pauses traffic per priority to avoid drops for a no-drop class |
| ECN | Marks packets before congestion becomes severe |
| End-host reaction | Hosts must react to congestion marks through configured congestion control behavior |
| MTU | Must be consistent end to end, including host, switch, and any routed path |
| Validation | Ping is not enough; validate with workload-like traffic and counters |
PFC versus ECN versus congestion control
| Mechanism | Operates where | Purpose | Common mistake |
|---|---|---|---|
| PFC | Layer 2, per priority (IEEE 802.1Qbb) | Pause a specific priority to prevent drops | Enabling it on too many classes or links |
| ECN | ECN bits in the IP header | Signal congestion before buffer overflow | Marking traffic without host reaction |
| DCQCN or similar host behavior | End host / NIC stack | Reduce sending rate after congestion signal | Configuring switches but ignoring NIC settings |
| QoS classification | Host and switch | Put RDMA traffic into the intended class | Mismatched DSCP/CoS/priority mappings |
| Buffer thresholds | Switch queues | Control when traffic is marked or paused | Thresholds set too low (marking or pausing prematurely) or too high (reacting too late) |
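To make the threshold interaction in the table concrete, here is a minimal sketch of a WRED-style ECN marking curve next to a PFC pause threshold. All threshold values are hypothetical. The model also simplifies by applying both mechanisms to one queue depth, whereas real switches typically evaluate ECN on egress queues and PFC on ingress buffers.

```python
def ecn_mark_probability(queue_kb, min_kb, max_kb, max_prob):
    """WRED-style ECN curve: 0 below min, linear ramp up to max_prob, 1.0 at or above max."""
    if queue_kb <= min_kb:
        return 0.0
    if queue_kb >= max_kb:
        return 1.0
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)


def pfc_pause_needed(queue_kb, xoff_kb):
    """Return True once buffered data reaches the (hypothetical) PFC xoff threshold."""
    return queue_kb >= xoff_kb


# Hypothetical thresholds, chosen so ECN starts marking well before PFC would pause.
ECN_MIN_KB, ECN_MAX_KB, ECN_MAX_PROB = 150, 3000, 0.10
PFC_XOFF_KB = 4000

for depth_kb in (100, 1575, 3500, 4200):
    probability = ecn_mark_probability(depth_kb, ECN_MIN_KB, ECN_MAX_KB, ECN_MAX_PROB)
    paused = pfc_pause_needed(depth_kb, PFC_XOFF_KB)
    print(f"depth={depth_kb} KB  ecn_mark_probability={probability:.3f}  pfc_pause={paused}")
```

The design intent illustrated here is ordering: ECN should give senders a chance to slow down first, with PFC acting as the backstop that prevents drops if they do not react in time.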
RDMA implementation checklist
Use this checklist when reviewing a configuration or troubleshooting scenario:
| Check | What you are looking for |
|---|---|
| Host NIC support | RDMA/RoCE capability, drivers, firmware, and correct mode |
| Switch support | Feature support and correct interface/policy application |
| Traffic classification | RDMA traffic mapped to the intended QoS class |
| PFC scope | Enabled only where required, typically for the RDMA priority |
| ECN marking | Thresholds and queues aligned with the congestion design |
| MTU | Consistent jumbo or standard MTU across all participating links |
| Routing | Stable underlay reachability and ECMP paths |
| Counters | Drops, pause frames, ECN marks, queue occupancy, retransmission symptoms |
| Workload validation | Tests that resemble the AI job, not only basic connectivity |
Common RoCE/RDMA traps
- “Lossless” does not mean “no congestion.” It means the design tries to avoid drops for selected traffic while still controlling congestion.
- PFC is not a substitute for good fabric design. A congested or oversubscribed design can still perform poorly.
- PFC can spread congestion. Pause behavior can create head-of-line blocking if applied carelessly.
- ECN requires end-host participation. Switch marking alone does not reduce sender rate.
- MTU mismatch can look intermittent. Small packets fit under the smallest MTU on the path, so small tests may pass while full-size frames from real workloads are dropped.
- One bad link can affect a whole job. Distributed training often waits for the slowest participant.
- QoS markings must be trusted and preserved intentionally. Do not assume DSCP, CoS, or priority values survive every boundary.
QoS review for AI fabrics
QoS in AI infrastructure is about protecting latency-sensitive and loss-sensitive traffic without starving other traffic.
| QoS function | Review question | Candidate mistake |
|---|---|---|
| Classification | How is traffic identified? | Assuming all traffic from GPU nodes is RDMA |
| Marking | Which DSCP/CoS/priority is assigned? | Marking at the host but not trusting or mapping at the switch |
| Queueing | Which queue carries RDMA or storage traffic? | Putting too many traffic types into one no-drop queue |
| Scheduling | How is bandwidth shared under congestion? | Overprioritizing one class until management or storage suffers |
| Policing/shaping | Is traffic limited at an edge or boundary? | Applying a limiter that breaks expected throughput |
| Buffer management | When are packets marked or paused? | Ignoring microbursts and queue thresholds |
| Verification | What counters prove behavior? | Relying only on interface up/up status |
Decision rule
When a scenario asks what to fix first, prioritize in this order:
- Correct classification and marking
- Correct queue and PFC/ECN policy
- Consistent MTU
- Host NIC congestion-control behavior
- Fabric capacity and ECMP distribution
- Workload-level validation
If the traffic is not classified correctly, every downstream QoS mechanism may be irrelevant.
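The first step in that order, classification and marking, can be checked mechanically. The sketch below compares what hosts mark against what a switch policy classifies. The DSCP values, class names, and dictionary layout are all hypothetical illustrations, not a Cisco configuration format.

```python
def find_mapping_mismatches(host_marking, switch_dscp_to_class, default_class="default"):
    """Return traffic types whose host DSCP does not land in the intended switch class."""
    mismatches = {}
    for traffic, (dscp, intended_class) in host_marking.items():
        # DSCP values the switch policy does not match fall into the default class.
        actual_class = switch_dscp_to_class.get(dscp, default_class)
        if actual_class != intended_class:
            mismatches[traffic] = {
                "dscp": dscp,
                "intended": intended_class,
                "actual": actual_class,
            }
    return mismatches


# Hypothetical host-side markings: traffic type -> (DSCP, intended class).
host_marking = {
    "rdma": (26, "no-drop"),
    "storage": (18, "storage"),
    "management": (16, "default"),
}
# Hypothetical switch policy: DSCP -> class. Note it matches DSCP 24, not 26.
switch_dscp_to_class = {24: "no-drop", 18: "storage"}

mismatches = find_mapping_mismatches(host_marking, switch_dscp_to_class)
print(mismatches)
```

In this example the RDMA traffic silently falls into the default class, so PFC and ECN settings on the no-drop queue never apply to it, no matter how carefully they were tuned.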
VXLAN EVPN, segmentation, and routing
AI infrastructure can use simple routed fabrics, overlays, or segmented multi-tenant designs. Be ready to identify the control plane and data plane responsibilities.
| Component | Role | Review focus |
|---|---|---|
| Underlay | Provides IP reachability between fabric nodes | Routing adjacency, loopbacks, ECMP, MTU |
| Overlay | Provides tenant or workload segmentation | VNIs, VRFs, anycast gateway, endpoint reachability |
| VXLAN | Encapsulates Layer 2 or Layer 3 tenant traffic over IP | VTEPs, VNIs, encapsulation overhead |
| EVPN | Control plane for endpoint and route information | BGP EVPN state, route types, import/export logic |
| Anycast gateway | Distributed default gateway across fabric | Consistent gateway IP/MAC behavior |
| VRF | Routing isolation | Correct route leaking or isolation policy |
Common EVPN/VXLAN review traps
- Underlay reachability must work before overlay troubleshooting is meaningful.
- A VTEP loopback issue can look like an endpoint issue.
- VNI/VRF mismatches can isolate workloads even when VLANs appear correct locally.
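- MTU must account for encapsulation overhead. VXLAN over an IPv4 underlay adds about 50 bytes per packet, so underlay links need a larger MTU than the hosts use.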
- Control-plane reachability and data-plane forwarding are related but not the same.
- Route import/export mistakes can create either black holes or unintended reachability.
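The MTU point in the list above can be worked through as arithmetic. This is a minimal sketch assuming an IPv4 underlay and no 802.1Q tag carried inside the encapsulated frame. The function names are illustrative only.

```python
# Header sizes in bytes for VXLAN over an IPv4 underlay.
INNER_ETHERNET_HEADER = 14  # original frame header carried inside the tunnel
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4_HEADER = 20

VXLAN_OVERHEAD = INNER_ETHERNET_HEADER + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4_HEADER


def required_underlay_mtu(host_ip_mtu):
    """Smallest underlay IP MTU that carries a full-size host packet without fragmentation."""
    return host_ip_mtu + VXLAN_OVERHEAD


def underlay_fits(host_ip_mtu, underlay_ip_mtu):
    """Return True if the underlay MTU leaves room for the VXLAN encapsulation."""
    return underlay_ip_mtu >= required_underlay_mtu(host_ip_mtu)


print(VXLAN_OVERHEAD)                # 50
print(required_underlay_mtu(9000))   # 9050
print(underlay_fits(9000, 9216))     # True
print(underlay_fits(1500, 1500))     # False
```

The last case is the classic trap: hosts and underlay both at 1500 bytes looks consistent, yet full-size tenant packets no longer fit once encapsulated.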
Compute, GPU, and host networking
AI infrastructure performance depends heavily on server architecture. A network configuration may be correct while the host still cannot feed GPUs efficiently.
| Area | What to review | Why it matters |
|---|---|---|
| GPU placement | Which GPUs are attached to which CPU/PCIe domains | Affects latency and throughput |
| NUMA locality | CPU, memory, NIC, and GPU proximity | Poor locality can reduce performance |
| PCIe capacity | Lanes, generations, oversubscription | Limits GPU/NIC data movement |
| NIC placement | NIC-to-GPU path, dual-homing, redundancy | Affects RDMA and traffic distribution |
| Firmware/drivers | Compatibility between NIC, GPU, OS, and platform | Mismatches cause instability or feature loss |
| SR-IOV / virtualization | Direct device access or virtual functions | Can improve performance but complicates policy |
| Container runtime | GPU device visibility and network attachment | Workload may fail despite working hardware |
| Time sync | NTP/PTP or platform time consistency | Helps logs, telemetry, and distributed operations |
Host-side troubleshooting clues
| Symptom | Possible host-side cause |
|---|---|
| GPU utilization low on one node | Local CPU, memory, PCIe, driver, or NIC issue |
| GPU utilization low across all nodes | Fabric, storage, synchronization, or workload design issue |
| RDMA test fails but IP works | NIC mode, driver, PFC/ECN mapping, firewall, or MTU issue |
| Performance differs between identical nodes | Firmware, cabling, slot placement, BIOS/platform settings |
| Container cannot access GPU | Runtime, device plugin, permissions, scheduling, or driver stack |
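The first two rows of the table describe a scoping decision: is low utilization confined to a few nodes or shared by most of them? A rough sketch of that triage follows. The 50 percent utilization threshold, the one-half cutoff for "shared", and the node names are arbitrary assumptions for illustration.

```python
def scope_low_gpu_utilization(util_by_node, low_threshold=50.0, shared_fraction=0.5):
    """Classify low GPU utilization as 'none', 'host' (few nodes), or 'shared' (many nodes)."""
    low_nodes = sorted(node for node, util in util_by_node.items() if util < low_threshold)
    if not low_nodes:
        return "none", low_nodes
    if len(low_nodes) / len(util_by_node) >= shared_fraction:
        # Most nodes are affected: look at fabric, storage, synchronization, or workload design.
        return "shared", low_nodes
    # Only a minority is affected: look at local CPU, memory, PCIe, driver, or NIC.
    return "host", low_nodes


# Hypothetical per-node GPU utilization percentages.
one_bad_node = {"gpu-node-1": 92.0, "gpu-node-2": 35.0, "gpu-node-3": 90.0, "gpu-node-4": 91.0}
all_nodes_low = {"gpu-node-1": 40.0, "gpu-node-2": 38.0, "gpu-node-3": 42.0, "gpu-node-4": 41.0}

print(scope_low_gpu_utilization(one_bad_node))
print(scope_low_gpu_utilization(all_nodes_low))
```

The output is only a starting hypothesis. A single slow node can also drag down a synchronized job and make every node look idle, so job timing and the other symptoms in the table still need to be checked.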
Storage and data pipeline review
AI systems are often starved by storage before the network fabric is fully used. Review how data enters, moves through, and leaves the training or inference pipeline.
| Storage pattern | Infrastructure concern | Exam-prep angle |
|---|---|---|
| Dataset reads | Sustained read throughput and metadata performance | GPU idle time may be storage-related |
| Checkpoint writes | Periodic large writes | Bursts can impact other traffic |
| Object storage | Scale and durability | Application access pattern matters |
| File storage | Shared dataset access | Metadata and small-file behavior can bottleneck |
| Block storage | Low-latency volumes | Multipathing and QoS may matter |
| NVMe/TCP or NVMe/RDMA | High-performance storage transport | MTU, congestion, and network isolation matter |
| Backup/replication | Background bandwidth usage | Can interfere with training if not controlled |
Storage troubleshooting decision points
- If all nodes slow down during checkpointing, check storage throughput, network queues, and QoS isolation.
- If only one node is slow, check local mount, path, NIC, driver, and cabling.
- If small-file workloads are slow, metadata performance may be the bottleneck.
- If large sequential reads are slow, check path bandwidth and storage backend limits.
- If storage and RDMA share links, verify classification and congestion behavior.
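For the checkpointing case above, a back-of-the-envelope estimate helps decide whether the network path or the storage backend is the likelier limit. The sketch below assumes every node writes in parallel and that protocol overhead is negligible. All sizes and rates are hypothetical.

```python
def checkpoint_write_estimate(checkpoint_gbytes, node_count, per_node_gbps, backend_gbytes_per_s):
    """Estimate checkpoint write time in seconds and name the limiting component."""
    # Convert aggregate NIC bandwidth from gigabits per second to gigabytes per second.
    network_gbytes_per_s = node_count * per_node_gbps / 8
    effective_gbytes_per_s = min(network_gbytes_per_s, backend_gbytes_per_s)
    if network_gbytes_per_s < backend_gbytes_per_s:
        limit = "network"
    else:
        limit = "storage backend"
    return checkpoint_gbytes / effective_gbytes_per_s, limit


# Hypothetical job: 2000 GB checkpoint, 16 nodes with 100 Gbps storage-facing NICs,
# and a storage backend that sustains 40 GB/s of writes.
seconds, limit = checkpoint_write_estimate(2000, 16, 100, 40)
print(f"about {seconds:.0f} s per checkpoint, limited by the {limit}")
```

Here the nodes could offer 200 GB/s in aggregate, but the backend accepts only 40 GB/s, so every checkpoint takes roughly 50 seconds no matter how well the fabric is tuned.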
Cisco operations, management, and automation
For Cisco data center AI infrastructure, be comfortable with how implementation and operations tools fit together. Do not treat the tools as magic boxes; understand what each one controls or observes.
| Cisco-related area | What to know conceptually |
|---|---|
| Cisco Nexus switching | Fabric interfaces, routing, QoS, telemetry, counters, software lifecycle |
| Cisco Nexus Dashboard Fabric Controller | Fabric design, deployment, templates, compliance, lifecycle operations |
| Cisco Nexus Dashboard / insights-style telemetry | Visibility, anomaly detection, flow/counter correlation, health views |
| Cisco UCS environments | Server policies, firmware, inventory, connectivity, compute lifecycle |
| Cisco Intersight | Cloud-based or connected operations model for infrastructure management and automation |
| APIs and automation | Repeatable configuration, validation, inventory, drift detection |
Automation review checklist
| Principle | Practical meaning |
|---|---|
| Idempotency | Reapplying automation should not create unintended changes |
| Source of truth | Inventory, addressing, VLAN/VNI/VRF, and policy data should be consistent |
| Pre-checks | Validate reachability, platform state, versions, and dependencies before change |
| Post-checks | Confirm control plane, counters, health, and intended policy after change |
| Rollback | Know how to restore known-good state |
| Drift detection | Identify manual changes that differ from intended state |
| Change scope | Understand blast radius before modifying templates or shared policy |
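Idempotency and drift detection from the checklist can be shown with plain dictionaries standing in for intended and actual device state. This is a conceptual sketch only. The keys and values are hypothetical and do not represent any Cisco API or template format.

```python
def plan_changes(intended, actual):
    """Return only the settings that must change to reach the intended state."""
    return {key: value for key, value in intended.items() if actual.get(key) != value}


def apply_changes(actual, changes):
    """Return a new state with the planned changes applied (the input is not modified)."""
    updated = dict(actual)
    updated.update(changes)
    return updated


def detect_drift(intended, actual):
    """Return {setting: (intended_value, actual_value)} for every setting that differs."""
    return {
        key: (intended[key], actual.get(key))
        for key in intended
        if actual.get(key) != intended[key]
    }


# Hypothetical source-of-truth values and device state.
intended = {"mtu": 9216, "pfc_priority": 3, "vrf": "ai-train"}
actual = {"mtu": 1500, "pfc_priority": 3, "vrf": "ai-train"}

first_plan = plan_changes(intended, actual)
after_first_run = apply_changes(actual, first_plan)
second_plan = plan_changes(intended, after_first_run)  # empty: nothing left to change

# Someone later changes the MTU by hand; drift detection reports the difference.
manually_edited = apply_changes(after_first_run, {"mtu": 9000})
drift = detect_drift(intended, manually_edited)

print(first_plan, second_plan, drift)
```

The empty second plan is what idempotency means in practice: rerunning the automation against a compliant device produces no changes.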
Observability signals to correlate
Do not troubleshoot with a single counter. Correlate:
- Interface errors and discards
- Queue drops and queue depth
- PFC pause frames
- ECN marks
- Link utilization and microburst indicators
- Routing adjacency state
- EVPN/VTEP state if overlays are used
- Host NIC counters
- GPU utilization
- Storage latency and throughput
- Application logs and job timing
Security, isolation, and governance
AI infrastructure frequently carries sensitive datasets, model artifacts, credentials, and multi-tenant workloads.
| Area | Review focus |
|---|---|
| Management plane | AAA, RBAC, secure access, logging, management VRF or network |
| Segmentation | VRFs, VLANs, VNIs, ACLs, policy boundaries |
| Tenant isolation | Prevent unintended communication between teams or environments |
| Secrets | Protect tokens, keys, registry credentials, and automation variables |
| Image and firmware integrity | Use approved versions and controlled updates |
| Logging | Maintain usable audit and troubleshooting data |
| Least privilege | Grant operators and automation only needed access |
Common security mistakes
- Reusing broad admin credentials in automation.
- Mixing management, storage, and workload traffic without clear policy.
- Allowing route leaking without an explicit purpose.
- Ignoring logging until after an incident.
- Treating AI lab environments as exempt from production controls.
Troubleshooting patterns
Fast symptom-to-check table
| Symptom | First checks | Likely area |
|---|---|---|
| RDMA traffic fails, normal IP works | NIC mode, MTU, QoS mapping, PFC/ECN, ACLs | Host/fabric QoS |
| Training job slow across many nodes | Fabric utilization, ECMP, queue stats, storage throughput | Fabric or storage |
| One node consistently slow | NIC counters, cabling, GPU/NUMA placement, driver/firmware | Host or access layer |
| PFC pause frames high | Congestion point, no-drop class scope, buffer thresholds | QoS/congestion |
| ECN marks high but no relief | Host congestion-control reaction, thresholds, workload burst | End-host/fabric |
| EVPN endpoint unreachable | Underlay reachability, VTEP state, VNI/VRF mapping | Overlay/control plane |
| Storage spikes during checkpoints | Storage backend, QoS, network class, write path | Storage/data pipeline |
| Automation change breaks many nodes | Template scope, source-of-truth error, rollback plan | Automation/governance |
Good troubleshooting order
- Define the failure. Is it loss, latency, low throughput, failed adjacency, or workload timeout?
- Scope the blast radius. One host, one rack, one fabric path, one tenant, or all workloads?
- Separate host, fabric, and storage. Use counters and tests that isolate each layer.
- Verify the control plane. Routing, EVPN, management reachability, and policy distribution.
- Verify the data plane. Interfaces, queues, drops, MTU, encapsulation, and path utilization.
- Check end-host settings. NIC mode, drivers, firmware, container/device access, congestion control.
- Validate with representative traffic. Basic ping is not enough for AI workloads.
- Change one variable at a time. Then compare pre/post telemetry.
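The last step, comparing pre/post telemetry, can be sketched as a comparison of counter growth over two observation windows of equal length. The counter names and values below are hypothetical, and the snapshots are assumed to come from cumulative counters.

```python
def window_delta(start_snapshot, end_snapshot):
    """Counter growth across one observation window of cumulative counters."""
    return {name: end_snapshot[name] - start_snapshot[name] for name in start_snapshot}


def counters_that_got_worse(pre_delta, post_delta, watched):
    """Return watched counters that grew more after the change than before it."""
    return {
        name: (pre_delta[name], post_delta[name])
        for name in watched
        if post_delta[name] > pre_delta[name]
    }


# Hypothetical snapshots taken at the start and end of two equal-length windows.
pre_delta = window_delta(
    {"pfc_pause_rx": 1000, "ecn_marked": 5000, "queue_drops": 10},
    {"pfc_pause_rx": 1200, "ecn_marked": 5600, "queue_drops": 10},
)
post_delta = window_delta(
    {"pfc_pause_rx": 2000, "ecn_marked": 7000, "queue_drops": 10},
    {"pfc_pause_rx": 2900, "ecn_marked": 7500, "queue_drops": 10},
)

worse = counters_that_got_worse(pre_delta, post_delta, ["pfc_pause_rx", "ecn_marked", "queue_drops"])
print(worse)
```

Comparing growth rather than raw values matters because cumulative counters only ever increase. A large absolute number proves nothing about what happened after the change.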
Last-minute review tables
Mechanism matching
| If the question says… | Think… |
|---|---|
| “Low-latency transfer that bypasses the CPU” | RDMA |
| “RDMA over routed IP Ethernet” | RoCEv2 |
| “Pause only one priority” | PFC |
| “Mark congestion before dropping” | ECN |
| “Sender slows after congestion signal” | Host congestion control |
| “Tenant segmentation over IP fabric” | VXLAN EVPN / VRF / VNI |
| “Distributed gateway on multiple leafs” | Anycast gateway |
| “Traffic uses one path while others idle” | ECMP hashing or polarization |
| “Small tests pass, real workload fails” | MTU, QoS, microbursts, workload scale |
Candidate mistakes to avoid
| Mistake | Better exam behavior |
|---|---|
| Choosing the fastest-looking fix | Identify the layer and mechanism first |
| Treating PFC as universally good | Limit no-drop behavior to required traffic |
| Ignoring host configuration | RDMA depends on NIC, driver, firmware, and OS settings |
| Overlooking storage | GPU idle time often starts with data access |
| Assuming overlay issue before checking underlay | Underlay reachability comes first |
| Trusting one metric | Correlate counters, telemetry, and workload symptoms |
| Memorizing commands without purpose | Know what each command or view proves |
| Practicing only definitions | Use scenario-based original practice questions |
How to connect this review to question-bank practice
Use this Quick Review first, then move into IT Mastery practice. The goal is not to reread theory; it is to force decision-making under exam-style conditions.
| Practice area | Best drill type | What detailed explanations should clarify |
|---|---|---|
| RoCE/RDMA | Scenario questions | Why PFC, ECN, MTU, and host settings interact |
| QoS | Configuration and troubleshooting drills | Which mechanism solves which symptom |
| Fabric design | Design-choice questions | Bandwidth, ECMP, failure domain, and scale tradeoffs |
| VXLAN EVPN | Control-plane/data-plane questions | Underlay versus overlay responsibility |
| Compute/GPU | Host bottleneck scenarios | NIC, PCIe, NUMA, driver, and firmware clues |
| Storage | Performance troubleshooting | Dataset, checkpoint, and backend bottlenecks |
| Automation | Change-control questions | Idempotency, validation, rollback, drift |
| Operations | Telemetry interpretation | Which counter or signal proves the issue |
Recommended practice loop
- Start with topic drills on RDMA, QoS, fabric design, and troubleshooting.
- Review every missed question using the detailed explanations, not just the correct answer.
- Build a short error log with three columns: concept missed, clue ignored, rule to remember.
- Move to mixed question bank sets to practice switching topics.
- Finish with timed mock exams only after your topic-level accuracy is stable.
Practical next step
Review the tables above, then begin targeted topic drills with original practice questions on RoCEv2, QoS, fabric design, compute/storage bottlenecks, and Cisco data center operations. Use the detailed explanations to turn each missed question into a specific rule you can apply on the real Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam.
Continue in IT Mastery
Use this Quick Review as a final concept map, then move into IT Mastery for focused topic drills, mixed practice sets, timed mock exams, and detailed explanations. The practice questions are original IT Mastery practice items; they are not official Cisco questions, copied live-exam content, or exam dumps.