300-640 DCAI — Cisco Implementing Data Center AI Infrastructure Quick Review

Last revised: July 1, 2026

Quick Review for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI): AI data center fabric, RoCE, QoS, compute, storage, automation, and troubleshooting.

Quick Review purpose

This Quick Review is for candidates preparing for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI). Use it to refresh high-yield concepts before moving into topic drills, mock exams, and detailed explanations.

The exam identity is:

Item	Value
Vendor/provider	Cisco
Official exam title	Cisco Implementing Data Center AI Infrastructure (300-640 DCAI)
Official exam code	300-640 DCAI

This page is IT Mastery review support. It is designed to pair naturally with an IT Mastery practice plan using original practice questions, a question bank, targeted topic drills, and detailed explanations.

High-yield mental model

AI infrastructure is not “just fast networking.” It is an end-to-end system where GPU utilization depends on the combined behavior of compute, NICs, fabric, storage, automation, and observability.

Layer	What to connect quickly	Common candidate trap
AI workload	Training, inference, data ingest, checkpointing, east-west communication	Treating every AI workload as the same traffic pattern
Compute	CPU, GPU, memory, PCIe, NUMA locality, NIC placement	Troubleshooting the network while the bottleneck is host-side
Fabric	Leaf-spine design, ECMP, bandwidth, latency, failure domains	Assuming “link up” means the fabric is ready for AI traffic
RDMA/RoCE	Low-latency transport, lossless or low-loss behavior, congestion control	Confusing PFC, ECN, and end-host congestion reaction
QoS	Classification, marking, queueing, buffer management, no-drop classes	Enabling lossless behavior too broadly
Storage	Dataset read, write, checkpoint, object/file/block access patterns	Ignoring storage as a cause of low GPU utilization
Operations	Telemetry, baselines, change control, automation, templates	Making isolated changes without pre/post validation

A strong 300-640 DCAI review mindset is: What is the bottleneck, where is it measured, and which control plane or data plane mechanism is responsible?

AI workload patterns to recognize

Workload pattern	Infrastructure concern	Review focus
Distributed training	Heavy east-west traffic between GPU nodes	Fabric bandwidth, ECMP, RDMA, congestion control
Model inference	Latency, availability, scaling, north-south and service-to-service traffic	Load balancing, segmentation, observability
Data preprocessing	CPU, memory, storage, and network throughput	Storage path and host bottlenecks
Checkpointing	Large periodic writes to storage	Storage throughput, QoS isolation, burst handling
Model serving at scale	Predictable latency and fault isolation	Placement, traffic engineering, monitoring
Multi-tenant AI platform	Isolation between teams, workloads, or environments	VRFs, VLANs/VNIs, policy, RBAC, quota awareness

Training versus inference

Area	Training	Inference
Traffic profile	Large east-west exchanges, synchronization, data ingest	Client/service traffic, API calls, sometimes east-west microservices
Key risk	GPU idle time due to network/storage bottlenecks	Latency spikes and inconsistent response time
Design emphasis	High bandwidth, predictable congestion behavior	Availability, scaling, observability, segmentation
Troubleshooting clue	Low GPU utilization across many nodes	Tail latency, failed requests, capacity imbalance

Data center fabric foundations

AI clusters often depend on a predictable, high-throughput data center fabric. For exam review, focus less on memorizing product names and more on why a design choice supports AI traffic.

Core fabric concepts

Concept	What to know	Why it matters for AI infrastructure
Leaf-spine / Clos	Every leaf has multiple paths through spines	Supports scale-out bandwidth and predictable latency
ECMP	Equal-cost paths distribute flows across links	Prevents single-path congestion when hashing is effective
Oversubscription	A leaf's aggregate host-facing (downlink) capacity can exceed its spine-facing (uplink) capacity	AI training can expose oversubscription quickly
Failure domains	Isolate failures by rack, leaf pair, spine, pod, or site	Prevents one issue from impacting all workloads
MTU consistency	Large frames must be supported end to end if used	Mismatches create drops, fragmentation, or poor performance
Underlay reachability	Loopbacks, routed links, and routing protocol health	Overlay and RDMA designs depend on stable reachability
Management network	Out-of-band or logically separate access	Required for recovery, automation, and observability
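
The oversubscription row above comes down to simple arithmetic. A minimal Python sketch, assuming a hypothetical leaf with 32 host-facing 400G ports and 8 spine-facing 400G uplinks (the port counts and the function name are illustrative, not taken from any Cisco platform):

```python
def oversubscription_ratio(downlink_gbps: list, uplink_gbps: list) -> float:
    """Return host-facing capacity divided by spine-facing capacity for one leaf."""
    total_up = sum(uplink_gbps)
    if total_up <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return sum(downlink_gbps) / total_up


# Hypothetical leaf: 32 x 400G host ports, 8 x 400G uplinks
ratio = oversubscription_ratio([400] * 32, [400] * 8)
print(f"{ratio:.1f}:1")  # 4.0:1
```

A ratio of 1.0 means uplink capacity matches downlink capacity. Anything above that means hosts on the leaf can offer more east-west traffic than the uplinks can carry.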

Quick design checks

Ask these questions when reviewing a scenario:

Is traffic mostly east-west or north-south? Distributed AI training usually stresses east-west paths.
Is the fabric close enough to nonblocking for the workload? A fabric that is acceptable for general virtualization may be insufficient for GPU clusters.
Are paths symmetric and predictable? Asymmetry can complicate troubleshooting, hashing, and telemetry interpretation.
Is ECMP actually distributing traffic? A single elephant flow, poor hashing inputs, or polarization can overload one path while others sit idle.
Are MTU, QoS, and RDMA settings consistent from host to switch to host? Partial configuration is a common cause of intermittent failures.
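
The ECMP question above is easier to reason about with a toy model. The sketch below hashes each flow's 5-tuple to one of several uplinks and sums the offered load per uplink. It is an illustration only: SHA-256 stands in for whatever hardware hash a real switch uses, and the addresses, ports, and rates are hypothetical. UDP destination port 4791 is the registered RoCEv2 port.

```python
import hashlib


def ecmp_path(flow: tuple, num_paths: int) -> int:
    """Map a flow's 5-tuple to one uplink index.

    SHA-256 is only a stand-in; real switches use their own hardware hash.
    """
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def load_per_path(flows: dict, num_paths: int) -> list:
    """Sum offered load (Gbps) per uplink for a set of flows."""
    load = [0.0] * num_paths
    for flow, gbps in flows.items():
        load[ecmp_path(flow, num_paths)] += gbps
    return load


# Hypothetical flows: (src IP, dst IP, protocol, src port, dst port) -> Gbps
flows = {
    ("10.0.0.1", "10.0.1.1", 17, 49152, 4791): 380.0,  # one elephant flow
    ("10.0.0.2", "10.0.1.2", 17, 49153, 4791): 5.0,
    ("10.0.0.3", "10.0.1.3", 17, 49154, 4791): 5.0,
}
print(load_per_path(flows, num_paths=4))
```

Because the hash is per flow, the elephant flow always lands on exactly one uplink no matter how many equal-cost paths exist. That is why link counts alone do not prove the load is balanced.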

RoCEv2, RDMA, and lossless Ethernet

RDMA lets applications move data directly between memory regions on different hosts, with lower CPU overhead and lower latency than a conventional socket path. In Ethernet AI fabrics, RoCEv2 carries RDMA over UDP/IP, so it can be routed across an IP fabric, and the fabric design, QoS policy, and host NIC configuration all affect how it behaves.

What to remember

Topic	Fast review
RDMA purpose	Reduce latency and CPU overhead for high-throughput communication
RoCEv2	RDMA over UDP/IP; can operate across routed IP fabrics when designed correctly
Lossless behavior	Usually implemented only for selected traffic classes, not for all traffic
PFC	Pauses traffic per priority to avoid drops for a no-drop class
ECN	Marks packets before congestion becomes severe
End-host reaction	Hosts must react to congestion marks through configured congestion control behavior
MTU	Must be consistent end to end, including host, switch, and any routed path
Validation	Ping is not enough; validate with workload-like traffic and counters

PFC versus ECN versus congestion control

Mechanism	Operates where	Purpose	Common mistake
PFC	Layer 2 priority	Pause a specific priority to prevent drops	Enabling it on too many classes or links
ECN	IP header marking	Signal congestion before buffer overflow	Marking traffic without host reaction
DCQCN or similar host behavior	End host / NIC stack	Reduce sending rate after congestion signal	Configuring switches but ignoring NIC settings
QoS classification	Host and switch	Put RDMA traffic into the intended class	Mismatched DSCP/CoS/priority mappings
Buffer thresholds	Switch queues	Control when traffic is marked or paused	Thresholds set too low (reacting too early) or too high (reacting too late)
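
To make the buffer-threshold row concrete, here is a minimal sketch of RED-style ECN marking: no marking below a minimum queue depth, a linear ramp up to a maximum probability, and marking of every packet above the maximum depth. The threshold values and parameter names are hypothetical, and actual platforms expose their own knobs and defaults.

```python
def ecn_mark_probability(queue_kb: float, k_min_kb: float,
                         k_max_kb: float, p_max: float) -> float:
    """Return the probability that a packet is ECN-marked at this queue depth."""
    if k_min_kb >= k_max_kb:
        raise ValueError("k_min_kb must be below k_max_kb")
    if queue_kb <= k_min_kb:
        return 0.0  # queue is shallow: no congestion signal
    if queue_kb >= k_max_kb:
        return 1.0  # queue is deep: mark every packet
    # Linear ramp between the two thresholds
    return p_max * (queue_kb - k_min_kb) / (k_max_kb - k_min_kb)


# Hypothetical thresholds: start marking at 150 KB, mark everything at 1500 KB
for depth in (100, 825, 2000):
    print(depth, ecn_mark_probability(depth, 150, 1500, 0.1))
```

Setting the minimum threshold too low marks traffic that was never at risk. Setting it too high means PFC or drops may occur before senders have been told to slow down.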

RDMA implementation checklist

Use this checklist when reviewing a configuration or troubleshooting scenario:

Check	What you are looking for
Host NIC support	RDMA/RoCE capability, drivers, firmware, and correct mode
Switch support	Feature support and correct interface/policy application
Traffic classification	RDMA traffic mapped to the intended QoS class
PFC scope	Enabled only where required, typically for the RDMA priority
ECN marking	Thresholds and queues aligned with the congestion design
MTU	Consistent jumbo or standard MTU across all participating links
Routing	Stable underlay reachability and ECMP paths
Counters	Drops, pause frames, ECN marks, queue occupancy, retransmission symptoms
Workload validation	Tests that resemble the AI job, not only basic connectivity

Common RoCE/RDMA traps

“Lossless” does not mean “no congestion.” It means the design tries to avoid drops for selected traffic while still controlling congestion.
PFC is not a substitute for good fabric design. A congested or oversubscribed design can still perform poorly.
PFC can spread congestion. Pause behavior can create head-of-line blocking if applied carelessly.
ECN requires end-host participation. Switch marking alone does not reduce sender rate.
MTU mismatch can look intermittent. Small tests may pass while real workloads fail.
One bad link can affect a whole job. Distributed training often waits for the slowest participant.
QoS markings must be trusted and preserved intentionally. Do not assume DSCP, CoS, or priority values survive every boundary.
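
The point that ECN requires end-host participation can be sketched with a heavily simplified sender model, loosely following the publicly described DCQCN rate-reduction rule. Treat it as an assumption-laden illustration: real NICs use separate timers and byte counters plus additional increase stages that are not modeled here, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Toy sender that reacts to congestion notification packets (CNPs)."""
    current_gbps: float
    target_gbps: float
    alpha: float = 1.0   # running estimate of congestion severity
    g: float = 1 / 256   # gain used to update alpha

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, then cut the sending rate."""
        self.target_gbps = self.current_gbps
        self.current_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self) -> None:
        """No CNP for a while: decay alpha and recover halfway to the target.

        Real implementations drive these two steps from separate timers.
        """
        self.alpha *= 1 - self.g
        self.current_gbps = (self.target_gbps + self.current_gbps) / 2


sender = DcqcnSender(current_gbps=400.0, target_gbps=400.0)
sender.on_cnp()
print(sender.current_gbps)   # 200.0
sender.on_quiet_interval()
print(sender.current_gbps)   # 300.0
```

If the NIC never runs logic like `on_cnp`, switch-side marking changes nothing: the sender keeps transmitting at the same rate and the queue keeps filling.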

QoS review for AI fabrics

QoS in AI infrastructure is about protecting latency-sensitive and loss-sensitive traffic without starving other traffic.

QoS function	Review question	Candidate mistake
Classification	How is traffic identified?	Assuming all traffic from GPU nodes is RDMA
Marking	Which DSCP/CoS/priority is assigned?	Marking at the host but not trusting or mapping at the switch
Queueing	Which queue carries RDMA or storage traffic?	Putting too many traffic types into one no-drop queue
Scheduling	How is bandwidth shared under congestion?	Overprioritizing one class until management or storage suffers
Policing/shaping	Is traffic limited at an edge or boundary?	Applying a limiter that breaks expected throughput
Buffer management	When are packets marked or paused?	Ignoring microbursts and queue thresholds
Verification	What counters prove behavior?	Relying only on interface up/up status
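
When reading packet captures or counters for the marking and buffer-management rows, it helps to remember that DSCP and ECN share one byte in the IPv4 header: the upper six bits are DSCP and the lower two bits are ECN. A small sketch that splits that byte (the DSCP value 26 in the example is a hypothetical choice for RDMA traffic, not a required value):

```python
ECN_NAMES = {
    0b00: "Not-ECT",  # sender does not support ECN
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced (marked by a switch)
}


def split_tos(tos: int) -> tuple:
    """Split the IPv4 ToS / Traffic Class byte into (DSCP, ECN name)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in one byte")
    return tos >> 2, ECN_NAMES[tos & 0b11]


# Hypothetical RDMA packet: DSCP 26, sent as ECT(0), then marked CE in the fabric
print(split_tos((26 << 2) | 0b10))  # (26, 'ECT(0)')
print(split_tos((26 << 2) | 0b11))  # (26, 'CE')
```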

Decision rule

When a scenario asks what to fix first, prioritize in this order:

1. Correct classification and marking
2. Correct queue and PFC/ECN policy
3. Consistent MTU
4. Host NIC congestion-control behavior
5. Fabric capacity and ECMP distribution
6. Workload-level validation

If the traffic is not classified correctly, every downstream QoS mechanism may be irrelevant.
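
A classification check can be sketched as a comparison between what hosts mark, how the switch maps those marks, and which queue each traffic type was meant to use. All DSCP values, queue names, and traffic labels below are hypothetical placeholders for whatever the design document specifies.

```python
def classification_gaps(host_dscp: dict, switch_dscp_to_queue: dict,
                        intended_queue: dict) -> list:
    """List traffic types whose DSCP does not land in the intended queue."""
    gaps = []
    for traffic, dscp in host_dscp.items():
        # Unmapped DSCP values are assumed to fall into a default queue
        actual = switch_dscp_to_queue.get(dscp, "default")
        wanted = intended_queue[traffic]
        if actual != wanted:
            gaps.append(f"{traffic}: DSCP {dscp} lands in {actual}, intended {wanted}")
    return gaps


host_dscp = {"rdma": 26, "cnp": 48, "storage": 18}
switch_dscp_to_queue = {26: "q3-no-drop", 48: "q6-strict"}
intended_queue = {"rdma": "q3-no-drop", "cnp": "q6-strict", "storage": "q2"}

for gap in classification_gaps(host_dscp, switch_dscp_to_queue, intended_queue):
    print(gap)  # storage: DSCP 18 lands in default, intended q2
```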

VXLAN EVPN, segmentation, and routing

AI infrastructure can use simple routed fabrics, overlays, or segmented multi-tenant designs. Be ready to identify the control plane and data plane responsibilities.

Component	Role	Review focus
Underlay	Provides IP reachability between fabric nodes	Routing adjacency, loopbacks, ECMP, MTU
Overlay	Provides tenant or workload segmentation	VNIs, VRFs, anycast gateway, endpoint reachability
VXLAN	Encapsulates Layer 2 or Layer 3 tenant traffic over IP	VTEPs, VNIs, encapsulation overhead
EVPN	Control plane for endpoint and route information	BGP EVPN state, route types, import/export logic
Anycast gateway	Distributed default gateway across fabric	Consistent gateway IP/MAC behavior
VRF	Routing isolation	Correct route leaking or isolation policy

Common EVPN/VXLAN review traps

Underlay reachability must work before overlay troubleshooting is meaningful.
A VTEP loopback issue can look like an endpoint issue.
VNI/VRF mismatches can isolate workloads even when VLANs appear correct locally.
MTU must account for encapsulation overhead.
Control-plane reachability and data-plane forwarding are related but not the same.
Route import/export mistakes can create either black holes or unintended reachability.
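
The MTU point above is simple arithmetic. VXLAN adds an outer IP header, a UDP header, a VXLAN header, and carries the inner Ethernet header as payload, which totals 50 bytes over an IPv4 underlay. A minimal sketch (function and parameter names are illustrative):

```python
INNER_ETHERNET = 14  # inner Ethernet header carried as payload
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4 = 20
OUTER_IPV6 = 40


def min_underlay_mtu(tenant_ip_mtu: int, underlay_ipv6: bool = False) -> int:
    """Smallest underlay IP MTU that carries a full-size tenant packet unfragmented."""
    outer_ip = OUTER_IPV6 if underlay_ipv6 else OUTER_IPV4
    return tenant_ip_mtu + INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + outer_ip


print(min_underlay_mtu(1500))  # 1550
print(min_underlay_mtu(9000))  # 9050
```

A host configured for a 9000-byte MTU behind a VTEP therefore needs underlay links that carry at least 9050-byte IP packets, which is why underlay MTUs are usually set with headroom above the tenant value.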

Compute, GPU, and host networking

AI infrastructure performance depends heavily on server architecture. A network configuration may be correct while the host still cannot feed GPUs efficiently.

Area	What to review	Why it matters
GPU placement	Which GPUs are attached to which CPU/PCIe domains	Affects latency and throughput
NUMA locality	CPU, memory, NIC, and GPU proximity	Poor locality can reduce performance
PCIe capacity	Lanes, generations, oversubscription	Limits GPU/NIC data movement
NIC placement	NIC-to-GPU path, dual-homing, redundancy	Affects RDMA and traffic distribution
Firmware/drivers	Compatibility between NIC, GPU, OS, and platform	Mismatches cause instability or feature loss
SR-IOV / virtualization	Direct device access or virtual functions	Can improve performance but complicates policy
Container runtime	GPU device visibility and network attachment	Workload may fail despite working hardware
Time sync	NTP/PTP or platform time consistency	Helps logs, telemetry, and distributed operations
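
The PCIe capacity row can be checked with a quick estimate. The sketch below computes the per-direction link rate after 128b/130b line encoding for PCIe generations 3 through 5 and compares it with a NIC's line rate. It ignores packet and protocol overhead, so real usable throughput is lower than these figures. The function names are illustrative.

```python
# Raw signaling rate per lane in GT/s
PCIE_GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding used by Gen3 to Gen5


def pcie_link_gbps(gen: int, lanes: int) -> float:
    """Per-direction link rate in Gbps after line encoding, before protocol overhead."""
    return PCIE_GT_PER_LANE[gen] * lanes * ENCODING_EFFICIENCY


def slot_can_feed_nic(gen: int, lanes: int, nic_gbps: float) -> bool:
    """True if the slot's link rate is at least the NIC's line rate."""
    return pcie_link_gbps(gen, lanes) >= nic_gbps


print(round(pcie_link_gbps(4, 16), 1))  # 252.1
print(round(pcie_link_gbps(5, 16), 1))  # 504.1
print(slot_can_feed_nic(4, 16, 400))    # False
```

Under these assumptions, a 400G NIC in a Gen4 x16 slot is limited by the slot, not by the fabric.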

Host-side troubleshooting clues

Symptom	Possible host-side cause
GPU utilization low on one node	Local CPU, memory, PCIe, driver, or NIC issue
GPU utilization low across all nodes	Fabric, storage, synchronization, or workload design issue
RDMA test fails but IP works	NIC mode, driver, PFC/ECN mapping, firewall, or MTU issue
Performance differs between identical nodes	Firmware, cabling, slot placement, BIOS/platform settings
Container cannot access GPU	Runtime, device plugin, permissions, scheduling, or driver stack
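
The first two rows of the table describe a scoping decision: is utilization low everywhere, or only on a few nodes? A minimal sketch of that triage, assuming per-node GPU utilization percentages have already been collected. The thresholds are hypothetical and would need tuning against a real baseline.

```python
from statistics import median


def triage_gpu_utilization(util_by_node: dict, low: float = 50.0,
                           outlier_gap: float = 20.0) -> tuple:
    """Classify low GPU utilization as cluster-wide, single-node, or healthy."""
    mid = median(util_by_node.values())
    if mid < low:
        # Most nodes are low: look at fabric, storage, synchronization, or job design
        return "cluster-wide", []
    outliers = sorted(node for node, util in util_by_node.items()
                      if mid - util > outlier_gap)
    if outliers:
        # A few nodes lag the rest: look at local CPU, memory, PCIe, driver, or NIC
        return "single-node", outliers
    return "healthy", []


print(triage_gpu_utilization({"n1": 92, "n2": 90, "n3": 35, "n4": 91}))
# ('single-node', ['n3'])
```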

Storage and data pipeline review

AI systems are often starved by storage before the network fabric is fully used. Review how data enters, moves through, and leaves the training or inference pipeline.

Storage pattern	Infrastructure concern	Exam-prep angle
Dataset reads	Sustained read throughput and metadata performance	GPU idle time may be storage-related
Checkpoint writes	Periodic large writes	Bursts can impact other traffic
Object storage	Scale and durability	Application access pattern matters
File storage	Shared dataset access	Metadata and small-file behavior can bottleneck
Block storage	Low-latency volumes	Multipathing and QoS may matter
NVMe/TCP or NVMe/RDMA	High-performance storage transport	MTU, congestion, and network isolation matter
Backup/replication	Background bandwidth usage	Can interfere with training if not controlled

Storage troubleshooting decision points

If all nodes slow down during checkpointing, check storage throughput, network queues, and QoS isolation.
If only one node is slow, check local mount, path, NIC, driver, and cabling.
If small-file workloads are slow, metadata performance may be the bottleneck.
If large sequential reads are slow, check path bandwidth and storage backend limits.
If storage and RDMA share links, verify classification and congestion behavior.
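
For the checkpointing case, a rough estimate shows how much training time a storage bottleneck can cost. The sketch assumes synchronous checkpoints, meaning GPUs wait until the write completes, and uses hypothetical sizes and throughput figures.

```python
def checkpoint_seconds(size_gb: float, write_gb_per_s: float) -> float:
    """Time to write one checkpoint at a sustained aggregate write rate."""
    if write_gb_per_s <= 0:
        raise ValueError("write throughput must be positive")
    return size_gb / write_gb_per_s


def stall_fraction(checkpoint_s: float, interval_s: float) -> float:
    """Share of wall-clock time spent checkpointing.

    interval_s is the time from the start of one checkpoint to the start of the next.
    """
    return checkpoint_s / interval_s


# Hypothetical job: 2000 GB checkpoint, 100 GB/s aggregate write, every 10 minutes
t = checkpoint_seconds(2000, 100)
print(t)                                 # 20.0
print(round(stall_fraction(t, 600), 3))  # 0.033
```

If the backend only sustains a quarter of that write rate, the same checkpoint takes four times as long, and the stall shows up as idle GPUs rather than as a network alarm.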

Cisco operations, management, and automation

For Cisco data center AI infrastructure, be comfortable with how implementation and operations tools fit together. You do not need to treat tools as magic boxes; understand what each tool controls or observes.

Cisco-related area	What to know conceptually
Cisco Nexus switching	Fabric interfaces, routing, QoS, telemetry, counters, software lifecycle
Cisco Nexus Dashboard Fabric Controller	Fabric design, deployment, templates, compliance, lifecycle operations
Cisco Nexus Dashboard / insights-style telemetry	Visibility, anomaly detection, flow/counter correlation, health views
Cisco UCS environments	Server policies, firmware, inventory, connectivity, compute lifecycle
Cisco Intersight	Cloud-based or connected operations model for infrastructure management and automation
APIs and automation	Repeatable configuration, validation, inventory, drift detection

Automation review checklist

Principle	Practical meaning
Idempotency	Reapplying automation should not create unintended changes
Source of truth	Inventory, addressing, VLAN/VNI/VRF, and policy data should be consistent
Pre-checks	Validate reachability, platform state, versions, and dependencies before change
Post-checks	Confirm control plane, counters, health, and intended policy after change
Rollback	Know how to restore known-good state
Drift detection	Identify manual changes that differ from intended state
Change scope	Understand blast radius before modifying templates or shared policy
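
Idempotency and drift detection from the table above can be illustrated with plain dictionaries standing in for intended and running configuration. This is a conceptual sketch only. It does not use any Cisco API, and the setting names are hypothetical.

```python
def detect_drift(intended: dict, running: dict) -> dict:
    """Return every setting whose running value differs from the intended value."""
    drift = {}
    for key in sorted(intended.keys() | running.keys()):
        want, have = intended.get(key), running.get(key)
        if want != have:
            drift[key] = {"intended": want, "running": have}
    return drift


def apply_intended(intended: dict, running: dict) -> int:
    """Push only the settings that differ and return how many were changed.

    Settings present only in `running` are left alone; they still show up as drift.
    """
    changes = 0
    for key, want in intended.items():
        if running.get(key) != want:
            running[key] = want
            changes += 1
    return changes


intended = {"mtu": 9216, "pfc_priority": 3}
running = {"mtu": 1500, "pfc_priority": 3, "extra_acl": "permit any"}

print(detect_drift(intended, running))
print(apply_intended(intended, running))  # 1
print(apply_intended(intended, running))  # 0, the second run changes nothing
```

The second call returning zero changes is the idempotency property. The leftover `extra_acl` entry is the kind of manual change that drift detection is meant to surface.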

Observability signals to correlate

Do not troubleshoot with a single counter. Correlate:

Interface errors and discards
Queue drops and queue depth
PFC pause frames
ECN marks
Link utilization and microburst indicators
Routing adjacency state
EVPN/VTEP state if overlays are used
Host NIC counters
GPU utilization
Storage latency and throughput
Application logs and job timing

Security, isolation, and governance

AI infrastructure frequently carries sensitive datasets, model artifacts, credentials, and multi-tenant workloads.

Area	Review focus
Management plane	AAA, RBAC, secure access, logging, management VRF or network
Segmentation	VRFs, VLANs, VNIs, ACLs, policy boundaries
Tenant isolation	Prevent unintended communication between teams or environments
Secrets	Protect tokens, keys, registry credentials, and automation variables
Image and firmware integrity	Use approved versions and controlled updates
Logging	Maintain usable audit and troubleshooting data
Least privilege	Grant operators and automation only needed access

Common security mistakes

Reusing broad admin credentials in automation.
Mixing management, storage, and workload traffic without clear policy.
Allowing route leaking without an explicit purpose.
Ignoring logging until after an incident.
Treating AI lab environments as exempt from production controls.

Troubleshooting patterns

Fast symptom-to-check table

Symptom	First checks	Likely area
RDMA traffic fails, normal IP works	NIC mode, MTU, QoS mapping, PFC/ECN, ACLs	Host/fabric QoS
Training job slow across many nodes	Fabric utilization, ECMP, queue stats, storage throughput	Fabric or storage
One node consistently slow	NIC counters, cabling, GPU/NUMA placement, driver/firmware	Host or access layer
PFC pause frames high	Congestion point, no-drop class scope, buffer thresholds	QoS/congestion
ECN marks high but no relief	Host congestion-control reaction, thresholds, workload burst	End-host/fabric
EVPN endpoint unreachable	Underlay reachability, VTEP state, VNI/VRF mapping	Overlay/control plane
Storage spikes during checkpoints	Storage backend, QoS, network class, write path	Storage/data pipeline
Automation change breaks many nodes	Template scope, source-of-truth error, rollback plan	Automation/governance

Good troubleshooting order

1. Define the failure. Is it loss, latency, low throughput, failed adjacency, or workload timeout?
2. Scope the blast radius. One host, one rack, one fabric path, one tenant, or all workloads?
3. Separate host, fabric, and storage. Use counters and tests that isolate each layer.
4. Verify the control plane. Routing, EVPN, management reachability, and policy distribution.
5. Verify the data plane. Interfaces, queues, drops, MTU, encapsulation, and path utilization.
6. Check end-host settings. NIC mode, drivers, firmware, container/device access, congestion control.
7. Validate with representative traffic. Basic ping is not enough for AI workloads.
8. Change one variable at a time. Then compare pre/post telemetry.
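
Comparing telemetry before and after a change can be sketched as a delta over counter snapshots. The counter names below are hypothetical, and in practice the snapshots would come from switch and host telemetry collected at known times.

```python
# Counters that should not grow after a healthy change (hypothetical names)
MUST_NOT_INCREASE = frozenset({"crc_errors", "queue_drops", "input_discards"})


def counter_deltas(pre: dict, post: dict) -> dict:
    """Return post minus pre for every counter present in both snapshots."""
    return {name: post[name] - pre[name] for name in pre if name in post}


def regressions(pre: dict, post: dict, watched=MUST_NOT_INCREASE) -> list:
    """Return watched counters that increased between the two snapshots."""
    deltas = counter_deltas(pre, post)
    return sorted(name for name in watched if deltas.get(name, 0) > 0)


pre = {"crc_errors": 0, "queue_drops": 12, "pfc_rx_pause": 400, "ecn_marked": 9000}
post = {"crc_errors": 0, "queue_drops": 57, "pfc_rx_pause": 410, "ecn_marked": 9100}

print(counter_deltas(pre, post))
print(regressions(pre, post))  # ['queue_drops']
```

Changing one variable at a time keeps this comparison meaningful, because any counter that moves can be attributed to that single change.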

Last-minute review tables

Mechanism matching

If the question says…	Think…
“Low-latency transfer that bypasses the CPU”	RDMA
“RDMA over routed IP Ethernet”	RoCEv2
“Pause only one priority”	PFC
“Mark congestion before dropping”	ECN
“Sender slows after congestion signal”	Host congestion control
“Tenant segmentation over IP fabric”	VXLAN EVPN / VRF / VNI
“Distributed gateway on multiple leafs”	Anycast gateway
“Traffic uses one path while others idle”	ECMP hashing or polarization
“Small tests pass, real workload fails”	MTU, QoS, microbursts, workload scale

Candidate mistakes to avoid

Mistake	Better exam behavior
Choosing the fastest-looking fix	Identify the layer and mechanism first
Treating PFC as universally good	Limit no-drop behavior to required traffic
Ignoring host configuration	RDMA depends on NIC, driver, firmware, and OS settings
Overlooking storage	GPU idle time often starts with data access
Assuming overlay issue before checking underlay	Underlay reachability comes first
Trusting one metric	Correlate counters, telemetry, and workload symptoms
Memorizing commands without purpose	Know what each command or view proves
Practicing only definitions	Use scenario-based original practice questions

How to connect this review to question-bank practice

Use this Quick Review first, then move into IT Mastery practice. The goal is not to reread theory; it is to force decision-making under exam-style conditions.

Practice area	Best drill type	What detailed explanations should clarify
RoCE/RDMA	Scenario questions	Why PFC, ECN, MTU, and host settings interact
QoS	Configuration and troubleshooting drills	Which mechanism solves which symptom
Fabric design	Design-choice questions	Bandwidth, ECMP, failure domain, and scale tradeoffs
VXLAN EVPN	Control-plane/data-plane questions	Underlay versus overlay responsibility
Compute/GPU	Host bottleneck scenarios	NIC, PCIe, NUMA, driver, and firmware clues
Storage	Performance troubleshooting	Dataset, checkpoint, and backend bottlenecks
Automation	Change-control questions	Idempotency, validation, rollback, drift
Operations	Telemetry interpretation	Which counter or signal proves the issue

Recommended practice loop

1. Start with topic drills on RDMA, QoS, fabric design, and troubleshooting.
2. Review every missed question using the detailed explanations, not just the correct answer.
3. Build a short error log with three columns: concept missed, clue ignored, rule to remember.
4. Move to mixed question bank sets to practice switching topics.
5. Finish with timed mock exams only after your topic-level accuracy is stable.

Practical next step

Review the tables above, then begin targeted topic drills with original practice questions on RoCEv2, QoS, fabric design, compute/storage bottlenecks, and Cisco data center operations. Use the detailed explanations to turn each missed question into a specific rule you can apply on the real Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam.

Continue in IT Mastery

Use this Quick Review as a final concept map, then move into IT Mastery for focused topic drills, mixed practice sets, timed mock exams, and detailed explanations. The practice questions are original IT Mastery practice items; they are not official Cisco questions, copied live-exam content, or exam dumps.
