300-640 DCAI — Cisco Implementing Data Center AI Infrastructure Exam Blueprint

A practical readiness blueprint for the Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam.

How to Use This Exam Blueprint

Use this independent Exam Blueprint to organize your preparation for the Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam. It translates likely readiness areas into practical tasks: what you should be able to explain, configure, validate, and troubleshoot.

No exact official weights are assumed here. Treat Cisco’s published exam information as the source of truth for current scope, then use this checklist to test whether you are operationally ready.

A strong candidate should be able to:

  • Explain why AI infrastructure stresses the data center differently from traditional enterprise workloads.
  • Choose appropriate Cisco data center fabric, compute, storage, automation, and observability approaches for AI workloads.
  • Understand lossless Ethernet, RDMA, RoCEv2, congestion control, and QoS design tradeoffs.
  • Validate configuration artifacts and troubleshoot degraded training or inference performance.
  • Reason through implementation scenarios, not just recall product names.

Topic-area readiness map

| Readiness area | What to review | Ready means you can… |
| --- | --- | --- |
| AI workload fundamentals | Training, inference, GPU clusters, east-west traffic, data pipelines, job behavior | Explain how AI workloads drive bandwidth, latency, storage, and resiliency requirements |
| Data center fabric design | Leaf-spine, ECMP, underlay/overlay, scale-out design, oversubscription, failure domains | Select a fabric approach for GPU clusters and justify capacity, redundancy, and operational tradeoffs |
| RDMA and lossless Ethernet | RoCEv2, PFC, ECN, DCB concepts, jumbo MTU, queue mapping | Explain end-to-end requirements for low-loss transport and identify misconfiguration risks |
| QoS and congestion management | Classification, marking, queuing, buffer behavior, priority classes, congestion signals | Map AI traffic to queues and troubleshoot drops, pauses, latency, or throughput collapse |
| Cisco data center switching | Nexus switching concepts, NX-OS validation, port channels, routing, VXLAN/EVPN where applicable | Interpret common Cisco show outputs and connect configuration choices to AI fabric behavior |
| Compute and GPU platform integration | GPU servers, NICs, DPUs, PCIe, NUMA awareness, firmware, drivers, power and cooling | Identify platform dependencies that affect AI job performance and reliability |
| Storage and data movement | File, object, block, distributed storage, caching, ingest, dataset locality | Match storage patterns to AI workload needs and detect storage bottlenecks |
| Automation and orchestration | Templates, APIs, infrastructure as code, fabric controllers, validation pipelines | Describe how repeatable provisioning reduces risk in large GPU-cluster deployments |
| Observability and telemetry | Streaming telemetry, interface counters, queue metrics, logs, flow data, GPU/node metrics | Build a troubleshooting view that connects application symptoms to infrastructure signals |
| Security and segmentation | AAA, RBAC, VRFs, tenant isolation, management-plane protection, secure automation | Apply least privilege and segmentation without breaking high-performance AI workflows |
| Operations and lifecycle | Firmware alignment, change windows, rollback plans, capacity planning, documentation | Plan safe changes in an environment where small mismatches can cause major performance loss |

AI infrastructure fundamentals

Core concepts to know

  • Difference between AI training, fine-tuning, inference, batch inference, and interactive inference.
  • Why distributed training creates heavy east-west traffic between GPU nodes.
  • Why storage throughput and data preprocessing can bottleneck expensive GPU capacity.
  • How low latency, high bandwidth, and predictable packet delivery affect training completion time.
  • Why “link is up” does not mean “fabric is healthy” for AI workloads.
  • How application behavior, framework communication patterns, NIC behavior, and network design interact.
  • Why tail latency and microbursts matter for synchronized distributed jobs.
  • How failure of one node, link, queue, or path can reduce the efficiency of an entire training job.
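
The last two points can be illustrated with a small sketch. It assumes a fully synchronous training step in which every worker must finish before the next step starts; the function name and the per-worker timings are hypothetical, chosen only to show the effect.

```python
def sync_step_time(worker_times_ms):
    """In a synchronous step, the slowest worker sets the step time."""
    return max(worker_times_ms)

# Hypothetical per-worker times (ms) for one training step.
healthy = [100, 101, 99, 100]
one_straggler = [100, 101, 99, 140]  # one worker delayed, e.g. by a congested path

print("healthy step:", sync_step_time(healthy), "ms")
print("step with one straggler:", sync_step_time(one_straggler), "ms")
print("average worker time with straggler:", sum(one_straggler) / len(one_straggler), "ms")
```

The average worker time barely moves, yet the whole step slows to the pace of the delayed worker, which is why tail behavior matters more than averages for synchronized jobs.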

Can you explain these distinctions?

| Prompt | You should be able to answer |
| --- | --- |
| Training vs inference | Which one is more likely to require large-scale GPU-to-GPU communication? Which one is more latency-sensitive to users? |
| North-south vs east-west | Why do AI clusters often emphasize east-west bandwidth inside the fabric? |
| Bandwidth vs latency | When is raw throughput the main concern, and when does jitter or tail latency become critical? |
| Oversubscription | Why might a traditional oversubscription ratio be unacceptable for a large training cluster? |
| Storage vs network bottleneck | How would symptoms differ if GPUs are waiting for data rather than waiting on inter-node communication? |
| Resiliency vs performance | Why can a redundant design still perform poorly if traffic hashing or queue policy is wrong? |

Data center fabric design for AI workloads

Fabric architecture checklist

  • Understand the role of leaf, spine, border, management, and out-of-band components.
  • Explain why scale-out fabrics are commonly used for GPU clusters.
  • Identify where ECMP helps and where it does not solve congestion by itself.
  • Know how link speed, port count, cabling, optics, and spine capacity affect cluster scale.
  • Understand failure-domain design: rack, leaf pair, spine, power domain, management domain.
  • Review when VXLAN/EVPN, VRFs, VLANs, and routed underlays may appear in a data center design.
  • Understand why consistent MTU and QoS treatment must be end-to-end for RDMA-style workloads.
  • Know the difference between production data traffic, storage traffic, management traffic, telemetry traffic, and control-plane traffic.
  • Be able to reason about adding a rack of GPU servers without creating hidden bottlenecks.
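
To see why ECMP does not solve congestion by itself, the following sketch hashes a handful of large flows onto equal-cost paths. It is a simplified model under stated assumptions: the SHA-256 hash is only a stand-in for a switch's hardware hash, and the addresses, ports, and 100 Gbps flow size are hypothetical. The one fixed detail is UDP destination port 4791, which RoCEv2 uses.

```python
import hashlib

def pick_path(flow, num_paths):
    """Map a flow 5-tuple to one path index using a stable hash."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def path_load(flows, num_paths):
    """Sum offered load (Gbps) per path; each flow stays on a single path."""
    load = {path: 0 for path in range(num_paths)}
    for flow, gbps in flows:
        load[pick_path(flow, num_paths)] += gbps
    return load

# Hypothetical: 8 long-lived flows of 100 Gbps each.
# Tuple layout: (src IP, dst IP, IP protocol 17 = UDP, src port, dst port).
flows = [((f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152 + i, 4791), 100)
         for i in range(1, 9)]

load = path_load(flows, num_paths=4)
print(load)
print("busiest path carries", max(load.values()), "of", sum(load.values()), "Gbps")
```

With only a few large flows, per-flow hashing has little statistical room to even out, so some paths can carry several flows while others carry none. An even split would be 200 Gbps per path here; any path above that is a potential hot spot.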

Capacity and oversubscription checks

Be ready to calculate or reason through basic bandwidth relationships. You do not need to memorize invented, exam-specific numbers; focus on the method.

\[ \text{Oversubscription ratio} = \frac{\text{Total server-facing bandwidth}}{\text{Total fabric-facing bandwidth}} \]

Use this to answer scenario questions such as:

  • If each rack adds more GPU nodes, do spine uplinks still provide enough aggregate capacity?
  • Does a proposed design meet the expected traffic pattern, or only the link-speed requirement?
  • Which is the bottleneck: server NIC, leaf uplink, spine capacity, storage path, or application pipeline?
  • What happens to effective bandwidth during a link or spine failure?
  • Are traffic patterns balanced enough for ECMP to use available paths efficiently?
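
As a worked example of the ratio above, the sketch below computes oversubscription for one leaf before and after an uplink failure. The port counts and speeds are hypothetical, not exam values; only the method matters.

```python
def oversubscription_ratio(server_ports, server_gbps, uplinks, uplink_gbps):
    """Total server-facing bandwidth divided by total fabric-facing bandwidth."""
    server_bw = server_ports * server_gbps
    fabric_bw = uplinks * uplink_gbps
    return server_bw / fabric_bw

# Hypothetical leaf: 16 x 400G server-facing ports, 8 x 800G uplinks.
normal = oversubscription_ratio(16, 400, 8, 800)

# Same leaf after losing one uplink.
degraded = oversubscription_ratio(16, 400, 7, 800)

print(f"normal:   {normal:.2f}:1")
print(f"degraded: {degraded:.2f}:1")
```

A design that is exactly nonblocking in steady state becomes oversubscribed as soon as a single uplink or spine is lost, so the failure case is worth computing alongside the nominal one.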

Design decision prompts

| Scenario | Decision points |
| --- | --- |
| New GPU training pod | Leaf-spine capacity, nonblocking requirements, cabling plan, power/cooling, management access, telemetry baseline |
| Expanding an existing fabric | Available spine ports, uplink utilization, QoS consistency, route scale, automation template changes |
| Mixed AI and general workloads | Traffic isolation, QoS classes, tenant segmentation, storage access, noisy-neighbor controls |
| Multi-tenant AI platform | VRF or segmentation model, RBAC, quotas or policy boundaries, observability per tenant |
| High-throughput storage access | Network path to storage, storage protocol behavior, cache strategy, congestion domain |
| Latency-sensitive inference | Placement close to consumers, load-balancing behavior, failure handling, monitoring of tail latency |

RDMA, RoCEv2, and lossless Ethernet readiness

Concepts to master

  • What RDMA is intended to provide and why AI workloads may benefit from it.
  • How RoCEv2 relies on IP networking while still requiring careful loss and congestion handling.
  • Why packet drops can affect RDMA traffic far more severely than they affect ordinary TCP applications.
  • The purpose and risk of Priority Flow Control.
  • The purpose of Explicit Congestion Notification.
  • How traffic classification and marking must remain consistent across hosts, switches, and paths.
  • Why jumbo MTU mismatch can cause hard-to-diagnose performance issues.
  • Why PFC should be scoped carefully and not treated as a universal “make everything lossless” setting.
  • How congestion spreading, pause storms, or head-of-line blocking can harm an AI fabric.
  • How host NIC settings and switch QoS policies must align.

PFC, ECN, and QoS comparison

| Control | Purpose | What to watch for |
| --- | --- | --- |
| QoS classification | Places traffic into the intended class or queue | Wrong DSCP/CoS marking, remarking at boundaries, mixed traffic in lossless queues |
| Priority Flow Control | Pauses traffic for selected priorities to avoid loss | Pause propagation, head-of-line blocking, enabling too broadly |
| ECN | Signals congestion before drops occur | Threshold mismatch, host response behavior, inconsistent configuration |
| Queuing policy | Allocates bandwidth and scheduling behavior | Starvation, wrong queue mapping, insufficient buffer allocation |
| MTU | Supports larger frames where required | End-to-end mismatch, silent fragmentation or drops, host-switch inconsistency |
| Congestion monitoring | Detects early signs of fabric stress | Ignoring queue depth, pause counters, microbursts, and retransmission symptoms |

End-to-end RoCE readiness checklist

  • Confirm server NIC, OS, driver, and firmware expectations.
  • Confirm switch interface speed, optics, cabling, and error counters.
  • Confirm MTU consistency from host to host across all paths.
  • Confirm DSCP/CoS marking at the host and preservation across the fabric.
  • Confirm queue mapping and lossless class configuration.
  • Confirm PFC is applied only where intended.
  • Confirm ECN behavior and congestion thresholds are consistent with the design.
  • Confirm routing matches the design and the expected equal-cost paths are present and in use.
  • Confirm storage or management traffic is not competing inside the same lossless class.
  • Confirm telemetry exists for drops, pause frames, queue depth, and utilization.
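
Part of this checklist can be expressed as a per-hop consistency check. The sketch below is illustrative only: the hop records, field names, and the values DSCP 26, priority 3, and MTU 9000 are assumptions standing in for whatever your design specifies, and in practice the records would be gathered from hosts and switches rather than typed in.

```python
def check_path(hops, required_mtu, expected_dscp, lossless_priority):
    """Return human-readable findings for hops that deviate from the design."""
    findings = []
    for hop in hops:
        if hop["mtu"] < required_mtu:
            findings.append(f"{hop['name']}: MTU {hop['mtu']} is below {required_mtu}")
        if hop["dscp"] != expected_dscp:
            findings.append(f"{hop['name']}: DSCP {hop['dscp']} differs from {expected_dscp}")
        if hop["pfc_priorities"] != {lossless_priority}:
            findings.append(f"{hop['name']}: PFC enabled on {sorted(hop['pfc_priorities'])}")
    return findings

# Hypothetical host-to-host path with two deliberate misconfigurations.
path = [
    {"name": "host-a",  "mtu": 9000, "dscp": 26, "pfc_priorities": {3}},
    {"name": "leaf-1",  "mtu": 9216, "dscp": 26, "pfc_priorities": {3}},
    {"name": "spine-1", "mtu": 1500, "dscp": 26, "pfc_priorities": {3}},
    {"name": "leaf-2",  "mtu": 9216, "dscp": 0,  "pfc_priorities": {3, 4}},
    {"name": "host-b",  "mtu": 9000, "dscp": 26, "pfc_priorities": {3}},
]

findings = check_path(path, required_mtu=9000, expected_dscp=26, lossless_priority=3)
for finding in findings:
    print(finding)
```

The MTU test uses "at least the required value" rather than strict equality because switch interfaces are often configured with a larger MTU than hosts; marking and PFC scope are compared exactly because any deviation changes queue behavior.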

Cisco data center switching and fabric operations

For the Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam, be comfortable connecting Cisco data center implementation concepts to AI infrastructure outcomes. Do not study commands in isolation; study what each command proves.

Cisco-oriented readiness tasks

  • Identify the role of Cisco Nexus switching in an AI data center fabric.
  • Interpret common NX-OS interface, routing, port-channel, QoS, and overlay validation outputs.
  • Understand how underlay routing supports equal-cost path selection and resilient forwarding.
  • Understand VXLAN/EVPN terminology where it appears in data center fabric designs.
  • Explain how VLANs, VNIs, VRFs, and route targets relate in overlay scenarios.
  • Validate that physical links, transceivers, port channels, and neighbors match the intended design.
  • Check that configuration is consistent across redundant fabric devices.
  • Recognize when a problem is physical, Layer 2, Layer 3, overlay, QoS, host, or application related.
  • Know why management-plane access, AAA, logging, and change control matter during fabric implementation.

Command and validation readiness

Be able to explain what you would look for in outputs similar to these. Exact syntax can vary by platform, software release, and configuration style.

show interface ethernet x/y
show interface ethernet x/y counters errors
show interface ethernet x/y transceiver details
show lldp neighbors
show port-channel summary
show running-config interface ethernet x/y
show ip route
show ip bgp summary
show bgp l2vpn evpn summary
show nve peers
show nve vni
show policy-map interface ethernet x/y
show queuing interface ethernet x/y
show interface ethernet x/y priority-flow-control
show logging logfile

What each validation should prove

| Validation target | Evidence to look for |
| --- | --- |
| Physical health | Link speed, duplex where relevant, optics status, CRC/errors, flaps, FEC-related symptoms |
| Cabling correctness | LLDP neighbor matches design, expected leaf/server/spine adjacency |
| Port-channel health | Members bundled, no suspended links, hashing suitable for expected flows |
| Underlay routing | Expected adjacencies, route reachability, ECMP paths present |
| Overlay status | Peers established, VNIs present, endpoints or routes learned as expected |
| QoS and queues | Traffic in expected class, drops or pauses understood, policy applied at correct interface |
| PFC/ECN behavior | Pause counters and congestion signals consistent with design, not unexpectedly increasing |
| Management readiness | AAA, logging, time synchronization, telemetry, backup, and rollback access available |
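
Several of these checks depend on whether a counter is increasing, not on its absolute value. A minimal sketch of that comparison follows, assuming two snapshots taken some interval apart. The counter names and values are hypothetical and do not correspond to any specific NX-OS output format.

```python
def rising_counters(before, after, watch):
    """Return the increase for each watched counter that grew between snapshots."""
    return {name: after[name] - before[name]
            for name in watch
            if after[name] > before[name]}

# Hypothetical snapshots for one interface, taken a few minutes apart.
before = {"rx_pause_prio3": 1200,  "tx_pause_prio3": 40, "queue3_drops": 0,  "crc_errors": 0}
after  = {"rx_pause_prio3": 98000, "tx_pause_prio3": 40, "queue3_drops": 15, "crc_errors": 0}

deltas = rising_counters(before, after, watch=list(before))
print(deltas)
```

A large historical pause count that is no longer moving tells a different story from a small count that is climbing during a job, so comparing snapshots against job timing is usually more informative than reading one output in isolation.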

Compute, GPU, and platform integration

Compute readiness checklist

  • Understand the relationship between GPU, CPU, memory, NIC, PCIe, storage, and operating system.
  • Explain why a GPU server can be network-bound, storage-bound, CPU-bound, or thermally constrained.
  • Know why NIC placement, PCIe topology, and NUMA locality can affect performance.
  • Review the role of firmware, BIOS settings, drivers, CUDA or accelerator software stacks where applicable.
  • Know how DPUs or smart NICs may affect networking, security, telemetry, or offload behavior.
  • Identify why consistent firmware and driver baselines matter across a training cluster.
  • Understand high-level Kubernetes or scheduler interactions if AI workloads are containerized.
  • Recognize that node health includes GPU, NIC, disk, thermal, power, and OS signals.

Platform dependency table

| Component | AI infrastructure concern | Exam-style readiness cue |
| --- | --- | --- |
| GPU | Utilization, memory, interconnect, thermal limits | Can you explain why GPUs are idle even when the job is running? |
| CPU | Data preprocessing, orchestration, interrupts, storage stack | Can you identify when the CPU is starving the GPU? |
| NIC | RDMA support, speed, queueing, offloads, firmware | Can you align NIC settings with fabric QoS? |
| PCIe / local interconnect | Bandwidth path between GPU, CPU, NIC, and storage | Can you spot a placement or topology bottleneck? |
| Memory | Dataset staging, host memory pressure, page behavior | Can you explain memory pressure symptoms? |
| Local storage | Caching, temporary files, checkpointing | Can you separate storage delay from network delay? |
| Power and cooling | Sustained GPU performance, throttling, rack density | Can you consider facilities constraints in an implementation plan? |
| Firmware and drivers | Compatibility and stability | Can you plan a safe baseline and rollback strategy? |

Storage and data pipeline readiness

AI infrastructure is not only a networking topic. If data cannot reach GPUs fast enough, the fabric may look idle while the job still performs poorly.

Storage topics to review

  • File, block, and object storage characteristics at a practical level.
  • Distributed file systems and parallel read behavior.
  • Dataset staging, caching, preprocessing, and checkpointing patterns.
  • High-throughput reads for training versus low-latency access for inference.
  • Storage network segmentation and QoS interaction.
  • Impact of many small files versus fewer large objects.
  • Backup, replication, and recovery expectations for datasets and model artifacts.
  • Metadata bottlenecks and control-plane pressure in large data pipelines.
  • Data locality: when moving compute to data is better than moving data to compute.

Storage decision checks

| If the scenario says… | Think about… |
| --- | --- |
| GPUs are underutilized | Data loader, storage throughput, preprocessing, CPU saturation, network path to storage |
| Training slows during checkpoints | Storage write path, burst handling, queue depth, shared fabric congestion |
| Many jobs read the same dataset | Caching strategy, object/file layout, storage fan-out, metadata scaling |
| Inference response time is inconsistent | Model loading, cache misses, storage latency, network path, autoscaling behavior |
| Storage traffic competes with RDMA traffic | Segmentation, QoS class separation, queue policy, congestion isolation |
| Dataset transfer affects training | Scheduling bulk transfers, rate limiting, path isolation, telemetry alerts |

Automation, orchestration, and repeatability

Automation readiness checklist

  • Explain why manual per-switch configuration is risky in large AI fabrics.
  • Understand the purpose of templates, golden configurations, and configuration drift detection.
  • Recognize where APIs, CLI automation, infrastructure as code, and controller-based workflows fit.
  • Know how Cisco data center management tools may be used for fabric operations where applicable.
  • Understand pre-change validation and post-change verification.
  • Be able to describe a safe deployment pipeline for adding racks, VLANs, VRFs, QoS policies, or telemetry.
  • Know how to automate without exposing credentials or bypassing change control.
  • Understand idempotency at a practical level: rerunning automation should not create unintended changes.
  • Know the difference between intended state, running state, and observed state.
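
Drift detection and idempotency can be sketched with plain dictionaries. This is a simplified model, not a real controller or device API: the setting names and values are hypothetical, and `converge` stands in for whatever tool actually pushes configuration.

```python
def config_drift(intended, running):
    """Return settings whose running value differs from the intended value."""
    drift = {}
    for key in sorted(set(intended) | set(running)):
        want, have = intended.get(key), running.get(key)
        if want != have:
            drift[key] = {"intended": want, "running": have}
    return drift

def converge(intended, running):
    """Idempotent apply: change only what differs and report what changed."""
    changes = config_drift(intended, running)
    for key, diff in changes.items():
        if diff["intended"] is None:
            running.pop(key, None)   # setting is not part of the intended state
        else:
            running[key] = diff["intended"]
    return changes

# Hypothetical intended (golden) state and the state found on one leaf.
intended = {"mtu": 9216, "pfc_priority": 3, "ecn": "enabled"}
running = {"mtu": 1500, "pfc_priority": 3, "ecn": "enabled", "legacy_policy": "old-qos"}

first_run = converge(intended, running)
second_run = converge(intended, running)
print("first run changed:", first_run)
print("second run changed:", second_run)
```

The first run reports and corrects the drift; the second run reports nothing because the running state already matches the intended state. That empty second result is the practical meaning of idempotency.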

Implementation artifact checklist

| Artifact | What it should contain |
| --- | --- |
| Physical topology | Rack layout, server-to-leaf mapping, spine links, cabling standards |
| IP address plan | Loopbacks, routed links, management, host networks, storage networks |
| VLAN/VRF/VNI map | Segmentation model, tenant boundaries, routing relationships |
| QoS policy map | Traffic classes, markings, queue mapping, PFC/ECN scope |
| Host configuration standard | NIC settings, MTU, driver/firmware baseline, OS prerequisites |
| Telemetry plan | Interface, queue, routing, overlay, host, GPU, storage, and application metrics |
| Change plan | Scope, dependencies, validation steps, rollback plan, maintenance window |
| Troubleshooting runbook | Symptom-to-signal mapping and escalation path |
| Security plan | AAA, RBAC, secrets handling, management access, audit logging |
| Capacity model | Port use, bandwidth, power, cooling, growth assumptions |

Observability and troubleshooting readiness

Signals to collect and correlate

  • Interface utilization, errors, discards, and link flaps.
  • Queue occupancy, tail drops, WRED/ECN-related counters where available.
  • PFC pause counters by priority.
  • Port-channel member state and load distribution.
  • Routing adjacency status and route changes.
  • Overlay peer and endpoint state where overlays are used.
  • Host NIC counters, RDMA counters, driver logs, and OS network statistics.
  • GPU utilization, memory use, temperature, and job-level metrics.
  • Storage throughput, latency, queue depth, metadata load, and cache hit rate.
  • Application logs from training frameworks, inference platforms, or job schedulers.

Troubleshooting workflow

| Step | Question | Examples of evidence |
| --- | --- | --- |
| 1. Define the symptom | Is it slow training, failed job, packet loss, link flap, or inconsistent inference latency? | Job logs, user report, monitoring alert |
| 2. Scope the blast radius | One host, one rack, one fabric, one tenant, or all jobs? | Affected nodes, interfaces, VRFs, queues |
| 3. Check recent change | Was there a config, firmware, cabling, routing, or policy change? | Change records, config diffs, automation logs |
| 4. Validate physical layer | Are links clean and stable? | Errors, optics, LLDP, FEC symptoms, flaps |
| 5. Validate routing/fabric | Are expected paths available? | Routing tables, adjacencies, overlay peers |
| 6. Validate QoS/lossless behavior | Are traffic classes, PFC, ECN, and queues behaving as intended? | Queue counters, pause frames, drops, markings |
| 7. Validate host and storage | Are GPUs waiting on network, CPU, storage, or data pipeline? | GPU metrics, NIC counters, storage metrics |
| 8. Confirm remediation | Did the fix restore performance without creating a new risk? | Before/after metrics, job completion time, alerts |

Security and governance readiness

Security topics to review

  • Management-plane isolation and secure administrative access.
  • AAA, RBAC, logging, and auditability.
  • Least privilege for operators, automation accounts, and service accounts.
  • Secure handling of API tokens, SSH keys, certificates, and secrets.
  • Segmentation using VRFs, VLANs, policy constructs, or other data center mechanisms.
  • Separation of management, storage, tenant, and training traffic where appropriate.
  • Secure telemetry export and log retention.
  • Firmware and software image integrity.
  • Secure baseline configuration for switches, servers, and orchestration systems.
  • Change governance for high-impact QoS, routing, and fabric-wide settings.

Security decision prompts

| Scenario | What a ready candidate considers |
| --- | --- |
| Shared GPU cluster | Tenant isolation, access control, data separation, audit logs |
| Automation account needs device access | Least privilege, credential storage, rotation, command authorization |
| Telemetry platform receives fabric data | Secure transport, RBAC, retention, sensitive metadata exposure |
| Developer needs troubleshooting access | Role-based visibility, temporary access, logging, approval path |
| New storage network added | Segmentation, firewall or policy path, QoS interaction, data protection |
| Emergency change requested | Risk, rollback, approvals, validation, post-change review |

Scenario and decision-point practice

Use these prompts to test whether you can reason through implementation choices.

Scenario 1: Distributed training is slower than expected

Checklist:

  • Compare expected vs actual GPU utilization.
  • Check whether all nodes are affected or only a rack/subset.
  • Review fabric utilization and ECMP path balance.
  • Check interface errors, discards, and link flaps.
  • Inspect PFC pause counters and queue drops.
  • Validate MTU consistency.
  • Confirm DSCP/CoS marking and queue mapping.
  • Check storage read throughput and data preprocessing load.
  • Review recent changes to fabric, host drivers, firmware, or job configuration.
  • Confirm whether the problem follows a host, link, rack, or workload.
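
The scoping steps in this checklist (all nodes or a subset; does the problem follow a host, rack, or link) can be sketched as a grouping exercise. The node inventory, utilization values, and the 70% threshold below are hypothetical; real data would come from your GPU and fabric telemetry.

```python
def slow_fraction_by(nodes, attribute, threshold):
    """Fraction of nodes below the GPU utilization threshold, per attribute value."""
    totals, slow = {}, {}
    for node in nodes:
        group = node[attribute]
        totals[group] = totals.get(group, 0) + 1
        if node["gpu_util"] < threshold:
            slow[group] = slow.get(group, 0) + 1
    return {group: slow.get(group, 0) / totals[group] for group in totals}

# Hypothetical job members with their rack, upstream leaf, and GPU utilization (%).
nodes = [
    {"name": "gpu-01", "rack": "r1", "leaf": "leaf-1", "gpu_util": 92},
    {"name": "gpu-02", "rack": "r1", "leaf": "leaf-1", "gpu_util": 90},
    {"name": "gpu-03", "rack": "r2", "leaf": "leaf-3", "gpu_util": 41},
    {"name": "gpu-04", "rack": "r2", "leaf": "leaf-3", "gpu_util": 38},
]

by_rack = slow_fraction_by(nodes, "rack", threshold=70)
by_leaf = slow_fraction_by(nodes, "leaf", threshold=70)
print("slow fraction by rack:", by_rack)
print("slow fraction by leaf:", by_leaf)
```

When every slow node shares one rack or leaf and no healthy node does, the shared element is the first place to look. When slow nodes are spread evenly, a fabric-wide policy, storage, or job-level cause becomes more likely.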

Scenario 2: RoCE traffic has intermittent performance collapse

Checklist:

  • Verify the intended lossless priority is marked correctly at the host.
  • Confirm the marking is preserved through the fabric.
  • Confirm PFC is enabled only for the intended class.
  • Look for pause storms or increasing pause counters.
  • Check ECN thresholds and host response behavior.
  • Confirm no bulk storage or backup traffic is sharing the lossless queue.
  • Validate MTU end-to-end.
  • Check for physical errors that could trigger retransmission or degraded performance.
  • Correlate congestion events with job timing.

Scenario 3: A new GPU rack is being added

Checklist:

  • Confirm available leaf ports, uplink capacity, and spine capacity.
  • Confirm cabling, optics, link speeds, and power/cooling readiness.
  • Update IP addressing, VLANs, VRFs, VNIs, and routing as needed.
  • Apply consistent QoS, PFC, ECN, and MTU policy.
  • Validate automation templates before deployment.
  • Confirm telemetry onboarding.
  • Run post-install validation before accepting production jobs.
  • Check whether the new rack changes oversubscription or failure-domain assumptions.

Scenario 4: Inference latency is inconsistent

Checklist:

  • Determine whether latency is network, application, storage, or model-loading related.
  • Check load balancer or service routing behavior if applicable.
  • Review CPU, GPU, memory, and network utilization.
  • Inspect tail latency, not just average latency.
  • Confirm that autoscaling or scheduling behavior is not moving workloads unpredictably.
  • Check whether storage cache misses or model fetches correlate with latency spikes.
  • Validate security controls are not introducing unexpected path changes or bottlenecks.
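
The advice to inspect tail latency rather than the average can be shown with a short calculation. The latency samples are hypothetical, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct% of n
    return ordered[rank - 1]

# Hypothetical inference latencies (ms): 98 fast requests and 2 slow ones,
# for example requests that hit a cold model load or a cache miss.
latencies_ms = [10] * 98 + [500] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms} ms, p50: {p50_ms} ms, p99: {p99_ms} ms")
```

The mean and median both look acceptable, while the 99th percentile exposes the slow requests that users actually notice. Correlating the timestamps of those outliers with model fetches, cache misses, or scheduling events is the next step.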

Common weak areas and traps

| Weak area | Why it hurts candidates | How to fix it |
| --- | --- | --- |
| Memorizing terms without traffic reasoning | AI infrastructure questions often require cause-and-effect thinking | Practice mapping workload behavior to fabric, host, and storage signals |
| Treating PFC as universally good | PFC can prevent drops but can also spread congestion | Know where it should apply and what counters reveal trouble |
| Ignoring host configuration | The switch may be correct while NIC, driver, MTU, or marking is wrong | Include host-side validation in every RoCE checklist |
| Looking only at average utilization | AI issues often involve microbursts, queues, pauses, or tail latency | Review queue, pause, and burst-related telemetry |
| Assuming ECMP guarantees balance | Flow hashing and traffic patterns can still create hot spots | Understand path distribution and flow characteristics |
| Forgetting storage | GPUs can be idle because data is not arriving fast enough | Include storage throughput and data pipeline checks |
| Confusing underlay and overlay symptoms | Routing reachability, VXLAN/EVPN state, and endpoint learning are different checks | Build a layered troubleshooting sequence |
| Skipping physical-layer evidence | Bad optics, cabling, or flaps can look like application instability | Start with interface and transceiver health |
| Overlooking change control | Fabric-wide QoS or MTU changes can have broad impact | Pair every change with validation and rollback |
| Studying product names only | The exam title is implementation-oriented | Practice “what would you configure, verify, or troubleshoot?” prompts |

Consolidated “Can you do this?” checklist

Before exam day, you should be able to answer “yes” to most of these.

Architecture and design

  • Can you describe a leaf-spine AI fabric and its failure domains?
  • Can you explain why AI training traffic is often east-west intensive?
  • Can you reason through oversubscription and capacity after a link or spine failure?
  • Can you choose when segmentation is needed and what it protects?
  • Can you identify when an overlay design changes troubleshooting steps?
  • Can you connect cabling, optics, port speed, and topology to cluster scale?

RDMA, QoS, and congestion

  • Can you explain RoCEv2 at a practical level?
  • Can you distinguish QoS classification, PFC, and ECN?
  • Can you identify symptoms of pause-related congestion?
  • Can you validate that traffic markings are preserved end-to-end?
  • Can you explain why MTU consistency matters?
  • Can you troubleshoot queue drops without assuming the application is at fault?

Cisco implementation and validation

  • Can you identify the Cisco data center devices and tools relevant to the scenario?
  • Can you interpret common NX-OS show commands for interface, port-channel, routing, overlay, and QoS state?
  • Can you explain what evidence proves a fabric is ready for AI workload testing?
  • Can you separate physical, routing, overlay, QoS, host, and storage problems?
  • Can you describe safe implementation steps for adding or changing fabric policy?

Compute, storage, and operations

  • Can you explain why GPUs may be idle even when the network is healthy?
  • Can you identify host-side dependencies for high-performance networking?
  • Can you compare file, block, and object storage concerns for AI workflows?
  • Can you plan telemetry that includes switches, hosts, GPUs, storage, and applications?
  • Can you describe a rollback plan for a fabric-wide configuration change?
  • Can you apply security controls without breaking performance-critical paths?

Final-week checklist

Seven to five days out

  • Re-read Cisco’s current public information for the Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam, including the exam name and code.
  • Build a one-page map of the major readiness areas: AI workloads, fabric, RDMA/QoS, compute, storage, automation, telemetry, security.
  • Review your weakest infrastructure layer first, not the topic you already know best.
  • Practice explaining PFC, ECN, QoS, MTU, and RoCEv2 out loud in plain language.
  • Review topology diagrams and trace traffic from GPU node to GPU node and GPU node to storage.

Four to two days out

  • Work through troubleshooting scenarios without looking at notes.
  • Review Cisco data center validation commands and what each output proves.
  • Practice identifying whether a symptom is physical, routing, overlay, QoS, host, storage, or application related.
  • Review automation and change-control workflows.
  • Revisit security, management-plane, and telemetry topics.

Final day

  • Do a quick pass through your personal weak-area notes.
  • Review definitions only after scenario practice, not instead of it.
  • Do not memorize unsupported numbers, quotas, or weights.
  • Focus on decision logic: what would you check first, what evidence confirms it, and what risk does the fix introduce?
  • Rest enough to read scenario wording carefully.

Practical next step

Use this Exam Blueprint as a gap-finding tool: mark each item as confident, needs review, or needs hands-on practice. Then spend your next study session on scenario questions and lab-style validation tasks for the weakest marked areas, especially RDMA/QoS behavior, Cisco fabric verification, and AI workload troubleshooting.