300-640 DCAI — Cisco Implementing Data Center AI Infrastructure Exam Blueprint

Last revised: July 1, 2026

A practical readiness blueprint for the Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam.

How to Use This Exam Blueprint

Use this independent Exam Blueprint to organize your preparation for the Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam. It translates likely readiness areas into practical tasks: what you should be able to explain, configure, validate, and troubleshoot.

No exact official weights are assumed here. Treat Cisco’s published exam information as the source of truth for current scope, then use this checklist to test whether you are operationally ready.

A strong candidate should be able to:

Explain why AI infrastructure stresses the data center differently from traditional enterprise workloads.
Choose appropriate Cisco data center fabric, compute, storage, automation, and observability approaches for AI workloads.
Understand lossless Ethernet, RDMA, RoCEv2, congestion control, and QoS design tradeoffs.
Validate configuration artifacts and troubleshoot degraded training or inference performance.
Reason through implementation scenarios, not just recall product names.

Topic-area readiness map

Readiness area	What to review	Ready means you can…
AI workload fundamentals	Training, inference, GPU clusters, east-west traffic, data pipelines, job behavior	Explain how AI workloads drive bandwidth, latency, storage, and resiliency requirements
Data center fabric design	Leaf-spine, ECMP, underlay/overlay, scale-out design, oversubscription, failure domains	Select a fabric approach for GPU clusters and justify capacity, redundancy, and operational tradeoffs
RDMA and lossless Ethernet	RoCEv2, PFC, ECN, DCB concepts, jumbo MTU, queue mapping	Explain end-to-end requirements for low-loss transport and identify misconfiguration risks
QoS and congestion management	Classification, marking, queuing, buffer behavior, priority classes, congestion signals	Map AI traffic to queues and troubleshoot drops, pauses, latency, or throughput collapse
Cisco data center switching	Nexus switching concepts, NX-OS validation, port channels, routing, VXLAN/EVPN where applicable	Interpret common Cisco show outputs and connect configuration choices to AI fabric behavior
Compute and GPU platform integration	GPU servers, NICs, DPUs, PCIe, NUMA awareness, firmware, drivers, power and cooling	Identify platform dependencies that affect AI job performance and reliability
Storage and data movement	File, object, block, distributed storage, caching, ingest, dataset locality	Match storage patterns to AI workload needs and detect storage bottlenecks
Automation and orchestration	Templates, APIs, infrastructure as code, fabric controllers, validation pipelines	Describe how repeatable provisioning reduces risk in large GPU-cluster deployments
Observability and telemetry	Streaming telemetry, interface counters, queue metrics, logs, flow data, GPU/node metrics	Build a troubleshooting view that connects application symptoms to infrastructure signals
Security and segmentation	AAA, RBAC, VRFs, tenant isolation, management-plane protection, secure automation	Apply least privilege and segmentation without breaking high-performance AI workflows
Operations and lifecycle	Firmware alignment, change windows, rollback plans, capacity planning, documentation	Plan safe changes in an environment where small mismatches can cause major performance loss

AI infrastructure fundamentals

Core concepts to know

Difference between AI training, fine-tuning, inference, batch inference, and interactive inference.
Why distributed training creates heavy east-west traffic between GPU nodes.
Why storage throughput and data preprocessing can bottleneck expensive GPU capacity.
How low latency, high bandwidth, and predictable packet delivery affect training completion time.
Why “link is up” does not mean “fabric is healthy” for AI workloads.
How application behavior, framework communication patterns, NIC behavior, and network design interact.
Why tail latency and microbursts matter for synchronized distributed jobs.
How failure of one node, link, queue, or path can reduce the efficiency of an entire training job.

Can you explain these distinctions?

Prompt	You should be able to answer
Training vs inference	Which one is more likely to require large-scale GPU-to-GPU communication? Which one is more latency-sensitive to users?
North-south vs east-west	Why do AI clusters often emphasize east-west bandwidth inside the fabric?
Bandwidth vs latency	When is raw throughput the main concern, and when does jitter or tail latency become critical?
Oversubscription	Why might a traditional oversubscription ratio be unacceptable for a large training cluster?
Storage vs network bottleneck	How would symptoms differ if GPUs are waiting for data rather than waiting on inter-node communication?
Resiliency vs performance	Why can a redundant design still perform poorly if traffic hashing or queue policy is wrong?

Data center fabric design for AI workloads

Fabric architecture checklist

Understand the role of leaf, spine, border, management, and out-of-band components.
Explain why scale-out fabrics are commonly used for GPU clusters.
Identify where ECMP helps and where it does not solve congestion by itself.
Know how link speed, port count, cabling, optics, and spine capacity affect cluster scale.
Understand failure-domain design: rack, leaf pair, spine, power domain, management domain.
Review when VXLAN/EVPN, VRFs, VLANs, and routed underlays may appear in a data center design.
Understand why consistent MTU and QoS treatment must be end-to-end for RDMA-style workloads.
Know the difference between production data traffic, storage traffic, management traffic, telemetry traffic, and control-plane traffic.
Be able to reason about adding a rack of GPU servers without creating hidden bottlenecks.
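
The point in the checklist above, that ECMP does not solve congestion by itself, can be illustrated with a minimal sketch. It assumes a small number of long-lived, high-rate flows, which is typical of distributed training. The CRC32 hash, the addresses, the source ports, and the 100 Gbps rates are illustrative assumptions and do not represent any real switch's hashing algorithm. UDP destination port 4791 is the registered RoCEv2 port.

```python
import zlib
from collections import Counter

ROCEV2_UDP_PORT = 4791  # registered UDP destination port for RoCEv2


def pick_path(flow, num_paths):
    """Map a flow 5-tuple to a path index; CRC32 is a stand-in, not a real switch hash."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths


def load_per_path(flows, num_paths):
    """Sum offered load (Gbps) per path for a list of (five_tuple, gbps) pairs."""
    load = Counter({path: 0 for path in range(num_paths)})
    for five_tuple, gbps in flows:
        load[pick_path(five_tuple, num_paths)] += gbps
    return load


# Hypothetical: 8 long-lived 100 Gbps flows between GPU nodes over 4 equal-cost uplinks.
flows = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", "udp", 49152 + i, ROCEV2_UDP_PORT), 100)
    for i in range(1, 9)
]
load = load_per_path(flows, num_paths=4)
for path, gbps in sorted(load.items()):
    print(f"path {path}: {gbps} Gbps")
print(f"busiest path carries {max(load.values())} of {sum(load.values())} Gbps")
```

A perfectly even spread would place 200 Gbps on each path. Because each flow is pinned to one path by its hash, a handful of large flows can land unevenly even when every path is up and equal cost, which is why path distribution needs to be measured rather than assumed.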

Capacity and oversubscription checks

Be ready to calculate or reason through basic bandwidth relationships. No exam-specific numbers are assumed here; focus on the method.

\[ \text{Oversubscription ratio} = \frac{\text{Total server-facing bandwidth}}{\text{Total fabric-facing bandwidth}} \]

Use this to answer scenario questions such as:

If each rack adds more GPU nodes, do spine uplinks still provide enough aggregate capacity?
Does a proposed design meet the expected traffic pattern, or only the link-speed requirement?
Which is the bottleneck: server NIC, leaf uplink, spine capacity, storage path, or application pipeline?
What happens to effective bandwidth during a link or spine failure?
Are traffic patterns balanced enough for ECMP to use available paths efficiently?
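
As a worked illustration of the ratio above, here is a minimal sketch that computes oversubscription for a single leaf, first healthy and then with one spine lost. The port counts and speeds are hypothetical and are not taken from any Cisco design guide or from the exam.

```python
def oversubscription_ratio(server_facing_gbps, fabric_facing_gbps):
    """Total server-facing bandwidth divided by total fabric-facing bandwidth."""
    if fabric_facing_gbps <= 0:
        raise ValueError("fabric-facing bandwidth must be positive")
    return server_facing_gbps / fabric_facing_gbps


# Hypothetical leaf: 32 server-facing ports at 400G,
# 16 uplinks at 800G spread as 2 uplinks to each of 8 spines.
server_bw = 32 * 400
uplink_bw = 16 * 800
healthy = oversubscription_ratio(server_bw, uplink_bw)

# One spine fails, so this leaf loses 2 of its 16 uplinks.
degraded = oversubscription_ratio(server_bw, 14 * 800)

print(f"healthy:  {healthy:.2f}:1")   # 1.00:1
print(f"degraded: {degraded:.2f}:1")  # 1.14:1
```

In this sketch, a design that is nonblocking on paper becomes oversubscribed after a single spine failure. That is the kind of reasoning the scenario questions above ask for.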

Design decision prompts

Scenario	Decision points
New GPU training pod	Leaf-spine capacity, nonblocking requirements, cabling plan, power/cooling, management access, telemetry baseline
Expanding an existing fabric	Available spine ports, uplink utilization, QoS consistency, route scale, automation template changes
Mixed AI and general workloads	Traffic isolation, QoS classes, tenant segmentation, storage access, noisy-neighbor controls
Multi-tenant AI platform	VRF or segmentation model, RBAC, quotas or policy boundaries, observability per tenant
High-throughput storage access	Network path to storage, storage protocol behavior, cache strategy, congestion domain
Latency-sensitive inference	Placement close to consumers, load-balancing behavior, failure handling, monitoring of tail latency

RDMA, RoCEv2, and lossless Ethernet readiness

Concepts to master

What RDMA is intended to provide and why AI workloads may benefit from it.
How RoCEv2 relies on IP networking while still requiring careful loss and congestion handling.
Why packet drops degrade RDMA traffic far more severely than they degrade ordinary TCP applications.
The purpose and risk of Priority Flow Control.
The purpose of Explicit Congestion Notification.
How traffic classification and marking must remain consistent across hosts, switches, and paths.
Why jumbo MTU mismatch can cause hard-to-diagnose performance issues.
Why PFC should be scoped carefully and not treated as a universal “make everything lossless” setting.
How congestion spreading, pause storms, or head-of-line blocking can harm an AI fabric.
How host NIC settings and switch QoS policies must align.

PFC, ECN, and QoS comparison

Control	Purpose	What to watch for
QoS classification	Places traffic into the intended class or queue	Wrong DSCP/CoS marking, remarking at boundaries, mixed traffic in lossless queues
Priority Flow Control	Pauses traffic for selected priorities to avoid loss	Pause propagation, head-of-line blocking, enabling too broadly
ECN	Signals congestion before drops occur	Threshold mismatch, host response behavior, inconsistent configuration
Queuing policy	Allocates bandwidth and scheduling behavior	Starvation, wrong queue mapping, insufficient buffer allocation
MTU	Supports larger frames where required	End-to-end mismatch, silent fragmentation or drops, host-switch inconsistency
Congestion monitoring	Detects early signs of fabric stress	Ignoring queue depth, pause counters, microbursts, and retransmission symptoms

End-to-end RoCE readiness checklist

Confirm server NIC, OS, driver, and firmware expectations.
Confirm switch interface speed, optics, cabling, and error counters.
Confirm MTU consistency from host to host across all paths.
Confirm DSCP/CoS marking at the host and preservation across the fabric.
Confirm queue mapping and lossless class configuration.
Confirm PFC is applied only where intended.
Confirm ECN behavior and congestion thresholds are consistent with the design.
Confirm routing and ECMP paths are consistent and balanced enough for the expected traffic pattern.
Confirm storage or management traffic is not competing inside the same lossless class.
Confirm telemetry exists for drops, pause frames, queue depth, and utilization.
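
Several items in the checklist above are consistency checks along a path, which lend themselves to a simple scripted comparison. The sketch below assumes the per-hop settings have already been collected into records. The `Hop` structure, device names, DSCP value 26, and priority 3 are hypothetical illustrations, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    role: str          # "host" or "switch"
    mtu: int
    roce_dscp: int     # DSCP value the hop expects for RoCE traffic
    pfc_priority: int  # priority on which PFC is enabled


def check_path(path):
    """Return findings for MTU, DSCP, and PFC priority inconsistencies along a path."""
    findings = []
    host_mtus = {hop.mtu for hop in path if hop.role == "host"}
    if len(host_mtus) > 1:
        findings.append(f"host MTU mismatch: {sorted(host_mtus)}")
    needed = max(host_mtus, default=0)
    for hop in path:
        # Switch MTU may exceed the host MTU, but must not be smaller.
        if hop.role == "switch" and hop.mtu < needed:
            findings.append(f"{hop.name}: MTU {hop.mtu} is below host MTU {needed}")
    for field in ("roce_dscp", "pfc_priority"):
        values = {getattr(hop, field) for hop in path}
        if len(values) > 1:
            findings.append(f"{field} differs along path: {sorted(values)}")
    return findings


# Hypothetical path with two deliberate faults: spine-1 MTU and leaf-2 DSCP.
path = [
    Hop("gpu-a", "host", 9000, 26, 3),
    Hop("leaf-1", "switch", 9216, 26, 3),
    Hop("spine-1", "switch", 1500, 26, 3),
    Hop("leaf-2", "switch", 9216, 24, 3),
    Hop("gpu-b", "host", 9000, 26, 3),
]
findings = check_path(path)
for finding in findings:
    print(finding)
```

The sketch only compares configured values. It does not prove that markings survive in live traffic, so counter and capture evidence is still needed.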

Cisco data center switching and fabric operations

For the Cisco Implementing Data Center AI Infrastructure (300-640 DCAI) exam, be comfortable connecting Cisco data center implementation concepts to AI infrastructure outcomes. Do not study commands in isolation; study what each command proves.

Cisco-oriented readiness tasks

Identify the role of Cisco Nexus switching in an AI data center fabric.
Interpret common NX-OS interface, routing, port-channel, QoS, and overlay validation outputs.
Understand how underlay routing supports equal-cost path selection and resilient forwarding.
Understand VXLAN/EVPN terminology where it appears in data center fabric designs.
Explain how VLANs, VNIs, VRFs, and route targets relate in overlay scenarios.
Validate that physical links, transceivers, port channels, and neighbors match the intended design.
Check that configuration is consistent across redundant fabric devices.
Recognize when a problem is physical, Layer 2, Layer 3, overlay, QoS, host, or application related.
Know why management-plane access, AAA, logging, and change control matter during fabric implementation.

Command and validation readiness

Be able to explain what you would look for in outputs similar to these. Exact syntax can vary by platform, software release, and configuration style.

show interface ethernet x/y
show interface ethernet x/y counters errors
show interface ethernet x/y transceiver details
show lldp neighbors
show port-channel summary
show running-config interface ethernet x/y
show ip route
show ip bgp summary
show bgp l2vpn evpn summary
show nve peers
show nve vni
show policy-map interface ethernet x/y
show queuing interface ethernet x/y
show interface ethernet x/y priority-flow-control
show logging logfile

What each validation should prove

Validation target	Evidence to look for
Physical health	Link speed, duplex where relevant, optics status, CRC/errors, flaps, FEC-related symptoms
Cabling correctness	LLDP neighbor matches design, expected leaf/server/spine adjacency
Port-channel health	Members bundled, no suspended links, hashing suitable for expected flows
Underlay routing	Expected adjacencies, route reachability, ECMP paths present
Overlay status	Peers established, VNIs present, endpoints or routes learned as expected
QoS and queues	Traffic in expected class, drops or pauses understood, policy applied at correct interface
PFC/ECN behavior	Pause counters and congestion signals consistent with design, not unexpectedly increasing
Management readiness	AAA, logging, time synchronization, telemetry, backup, and rollback access available
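
The "not unexpectedly increasing" test for pause counters in the table above is a comparison between two snapshots rather than a single reading. A minimal sketch follows, assuming the counters have already been extracted into dictionaries keyed by interface and priority. The snapshot format, interface names, counts, and the `max_delta` threshold are hypothetical and do not reflect actual NX-OS output.

```python
def pause_deltas(before, after):
    """Per-(interface, priority) growth in pause frames between two snapshots."""
    return {key: count - before.get(key, 0) for key, count in after.items()}


def flag_unexpected(deltas, lossless_priority, max_delta):
    """Flag pauses on the wrong priority, or growth above max_delta on the right one."""
    flagged = []
    for (interface, priority), delta in sorted(deltas.items()):
        if delta > 0 and priority != lossless_priority:
            flagged.append((interface, priority, "pause on a non-lossless priority"))
        elif delta > max_delta:
            flagged.append((interface, priority, "pause count rising faster than expected"))
    return flagged


# Hypothetical snapshots taken one minute apart; keys are (interface, priority).
before = {("Eth1/1", 3): 1200, ("Eth1/2", 3): 50, ("Eth1/2", 0): 0}
after = {("Eth1/1", 3): 1210, ("Eth1/2", 3): 90050, ("Eth1/2", 0): 7}

findings = flag_unexpected(pause_deltas(before, after), lossless_priority=3, max_delta=1000)
for finding in findings:
    print(finding)
```

What counts as "too fast" depends on the design and the baseline, so the threshold here is a placeholder to be replaced with a value derived from normal operation.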

Compute, GPU, and platform integration

Compute readiness checklist

Understand the relationship between GPU, CPU, memory, NIC, PCIe, storage, and operating system.
Explain why a GPU server can be network-bound, storage-bound, CPU-bound, or thermally constrained.
Know why NIC placement, PCIe topology, and NUMA locality can affect performance.
Review the role of firmware, BIOS settings, drivers, and CUDA or other accelerator software stacks where applicable.
Know how DPUs or smart NICs may affect networking, security, telemetry, or offload behavior.
Identify why consistent firmware and driver baselines matter across a training cluster.
Understand high-level Kubernetes or scheduler interactions if AI workloads are containerized.
Recognize that node health includes GPU, NIC, disk, thermal, power, and OS signals.

Platform dependency table

Component	AI infrastructure concern	Exam-style readiness cue
GPU	Utilization, memory, interconnect, thermal limits	Can you explain why GPUs are idle even when the job is running?
CPU	Data preprocessing, orchestration, interrupts, storage stack	Can you identify when the CPU is starving the GPU?
NIC	RDMA support, speed, queueing, offloads, firmware	Can you align NIC settings with fabric QoS?
PCIe / local interconnect	Bandwidth path between GPU, CPU, NIC, and storage	Can you spot a placement or topology bottleneck?
Memory	Dataset staging, host memory pressure, page behavior	Can you explain memory pressure symptoms?
Local storage	Caching, temporary files, checkpointing	Can you separate storage delay from network delay?
Power and cooling	Sustained GPU performance, throttling, rack density	Can you consider facilities constraints in an implementation plan?
Firmware and drivers	Compatibility and stability	Can you plan a safe baseline and rollback strategy?

Storage and data pipeline readiness

AI infrastructure is not only a network exam topic. If data cannot reach GPUs fast enough, the fabric may look idle while the job still performs poorly.

Storage topics to review

File, block, and object storage characteristics at a practical level.
Distributed file systems and parallel read behavior.
Dataset staging, caching, preprocessing, and checkpointing patterns.
High-throughput reads for training versus low-latency access for inference.
Storage network segmentation and QoS interaction.
Impact of many small files versus fewer large objects.
Backup, replication, and recovery expectations for datasets and model artifacts.
Metadata bottlenecks and control-plane pressure in large data pipelines.
Data locality: when moving compute to data is better than moving data to compute.

Storage decision checks

If the scenario says…	Think about…
GPUs are underutilized	Data loader, storage throughput, preprocessing, CPU saturation, network path to storage
Training slows during checkpoints	Storage write path, burst handling, queue depth, shared fabric congestion
Many jobs read the same dataset	Caching strategy, object/file layout, storage fan-out, metadata scaling
Inference response time is inconsistent	Model loading, cache misses, storage latency, network path, autoscaling behavior
Storage traffic competes with RDMA traffic	Segmentation, QoS class separation, queue policy, congestion isolation
Dataset transfer affects training	Scheduling bulk transfers, rate limiting, path isolation, telemetry alerts

Automation, orchestration, and repeatability

Automation readiness checklist

Explain why manual per-switch configuration is risky in large AI fabrics.
Understand the purpose of templates, golden configurations, and configuration drift detection.
Recognize where APIs, CLI automation, infrastructure as code, and controller-based workflows fit.
Know how Cisco data center management tools may be used for fabric operations where applicable.
Understand pre-change validation and post-change verification.
Be able to describe a safe deployment pipeline for adding racks, VLANs, VRFs, QoS policies, or telemetry.
Know how to automate without exposing credentials or bypassing change control.
Understand idempotency at a practical level: rerunning automation should not create unintended changes.
Know the difference between intended state, running state, and observed state.
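
Idempotency and the distinction between intended and running state, from the checklist above, can be shown with a small sketch. It models device state as a flat dictionary. The setting names and values are hypothetical, and real automation tools track far more than this.

```python
def plan_changes(intended, running):
    """Return only the settings whose running value differs from the intended value."""
    return {key: value for key, value in intended.items() if running.get(key) != value}


def apply_changes(running, changes):
    """Return a new running state with the planned changes applied."""
    updated = dict(running)
    updated.update(changes)
    return updated


# Hypothetical intended state from a template, and running state read from a device.
intended = {"mtu": 9216, "pfc_priority": 3, "ecn_enabled": True}
running = {"mtu": 1500, "pfc_priority": 3}

first_plan = plan_changes(intended, running)
running = apply_changes(running, first_plan)
second_plan = plan_changes(intended, running)

print(first_plan)   # {'mtu': 9216, 'ecn_enabled': True}
print(second_plan)  # {}
```

The empty second plan is the idempotency property: rerunning the automation against a compliant device changes nothing. A non-empty plan on a device that was previously compliant is one way to surface configuration drift.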

Implementation artifact checklist

Artifact	What it should contain
Physical topology	Rack layout, server-to-leaf mapping, spine links, cabling standards
IP address plan	Loopbacks, routed links, management, host networks, storage networks
VLAN/VRF/VNI map	Segmentation model, tenant boundaries, routing relationships
QoS policy map	Traffic classes, markings, queue mapping, PFC/ECN scope
Host configuration standard	NIC settings, MTU, driver/firmware baseline, OS prerequisites
Telemetry plan	Interface, queue, routing, overlay, host, GPU, storage, and application metrics
Change plan	Scope, dependencies, validation steps, rollback plan, maintenance window
Troubleshooting runbook	Symptom-to-signal mapping and escalation path
Security plan	AAA, RBAC, secrets handling, management access, audit logging
Capacity model	Port use, bandwidth, power, cooling, growth assumptions

Observability and troubleshooting readiness

Signals to collect and correlate

Interface utilization, errors, discards, and link flaps.
Queue occupancy, tail drops, WRED/ECN-related counters where available.
PFC pause counters by priority.
Port-channel member state and load distribution.
Routing adjacency status and route changes.
Overlay peer and endpoint state where overlays are used.
Host NIC counters, RDMA counters, driver logs, and OS network statistics.
GPU utilization, memory use, temperature, and job-level metrics.
Storage throughput, latency, queue depth, metadata load, and cache hit rate.
Application logs from training frameworks, inference platforms, or job schedulers.

Troubleshooting workflow

Step	Question	Examples of evidence
1. Define the symptom	Is it slow training, failed job, packet loss, link flap, or inconsistent inference latency?	Job logs, user report, monitoring alert
2. Scope the blast radius	One host, one rack, one fabric, one tenant, or all jobs?	Affected nodes, interfaces, VRFs, queues
3. Check recent change	Was there a config, firmware, cabling, routing, or policy change?	Change records, config diffs, automation logs
4. Validate physical layer	Are links clean and stable?	Errors, optics, LLDP, FEC symptoms, flaps
5. Validate routing/fabric	Are expected paths available?	Routing tables, adjacencies, overlay peers
6. Validate QoS/lossless behavior	Are traffic classes, PFC, ECN, and queues behaving as intended?	Queue counters, pause frames, drops, markings
7. Validate host and storage	Are GPUs waiting on network, CPU, storage, or data pipeline?	GPU metrics, NIC counters, storage metrics
8. Confirm remediation	Did the fix restore performance without creating a new risk?	Before/after metrics, job completion time, alerts
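
Step 2 of the workflow above, scoping the blast radius, often comes down to asking what the affected hosts have in common. Here is a minimal sketch, assuming an inventory that maps each host to its rack, leaf, and tenant. The host names and the inventory layout are hypothetical.

```python
def shared_attributes(affected, inventory, attributes=("rack", "leaf", "tenant")):
    """Return the attributes whose value is identical across every affected host."""
    common = {}
    for attr in attributes:
        values = {inventory[host][attr] for host in affected}
        if len(values) == 1:
            common[attr] = values.pop()
    return common


# Hypothetical inventory; in practice this would come from a source of truth.
inventory = {
    "gpu-01": {"rack": "r1", "leaf": "leaf-1a", "tenant": "research"},
    "gpu-02": {"rack": "r1", "leaf": "leaf-1a", "tenant": "vision"},
    "gpu-03": {"rack": "r2", "leaf": "leaf-2a", "tenant": "research"},
    "gpu-04": {"rack": "r2", "leaf": "leaf-2a", "tenant": "vision"},
}
affected = ["gpu-01", "gpu-02"]
print(shared_attributes(affected, inventory))  # {'rack': 'r1', 'leaf': 'leaf-1a'}
```

In this example, the affected hosts share a rack and a leaf but not a tenant. That points the next checks at the physical layer and that leaf's QoS and uplinks rather than at tenant policy.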

Security and governance readiness

Security topics to review

Management-plane isolation and secure administrative access.
AAA, RBAC, logging, and auditability.
Least privilege for operators, automation accounts, and service accounts.
Secure handling of API tokens, SSH keys, certificates, and secrets.
Segmentation using VRFs, VLANs, policy constructs, or other data center mechanisms.
Separation of management, storage, tenant, and training traffic where appropriate.
Secure telemetry export and log retention.
Firmware and software image integrity.
Secure baseline configuration for switches, servers, and orchestration systems.
Change governance for high-impact QoS, routing, and fabric-wide settings.

Security decision prompts

Scenario	What a ready candidate considers
Shared GPU cluster	Tenant isolation, access control, data separation, audit logs
Automation account needs device access	Least privilege, credential storage, rotation, command authorization
Telemetry platform receives fabric data	Secure transport, RBAC, retention, sensitive metadata exposure
Developer needs troubleshooting access	Role-based visibility, temporary access, logging, approval path
New storage network added	Segmentation, firewall or policy path, QoS interaction, data protection
Emergency change requested	Risk, rollback, approvals, validation, post-change review

Scenario and decision-point practice

Use these prompts to test whether you can reason through implementation choices.

Scenario 1: Distributed training is slower than expected

Checklist:

Compare expected vs actual GPU utilization.
Check whether all nodes are affected or only a rack/subset.
Review fabric utilization and ECMP path balance.
Check interface errors, discards, and link flaps.
Inspect PFC pause counters and queue drops.
Validate MTU consistency.
Confirm DSCP/CoS marking and queue mapping.
Check storage read throughput and data preprocessing load.
Review recent changes to fabric, host drivers, firmware, or job configuration.
Confirm whether the problem follows a host, link, rack, or workload.

Scenario 2: RoCE traffic has intermittent performance collapse

Checklist:

Verify the intended lossless priority is marked correctly at the host.
Confirm the marking is preserved through the fabric.
Confirm PFC is enabled only for the intended class.
Look for pause storms or increasing pause counters.
Check ECN thresholds and host response behavior.
Confirm no bulk storage or backup traffic is sharing the lossless queue.
Validate MTU end-to-end.
Check for physical errors that could trigger retransmission or degraded performance.
Correlate congestion events with job timing.

Scenario 3: A new GPU rack is being added

Checklist:

Confirm available leaf ports, uplink capacity, and spine capacity.
Confirm cabling, optics, link speeds, and power/cooling readiness.
Update IP addressing, VLANs, VRFs, VNIs, and routing as needed.
Apply consistent QoS, PFC, ECN, and MTU policy.
Validate automation templates before deployment.
Confirm telemetry onboarding.
Run post-install validation before accepting production jobs.
Check whether the new rack changes oversubscription or failure-domain assumptions.

Scenario 4: Inference latency is inconsistent

Checklist:

Determine whether latency is network, application, storage, or model-loading related.
Check load balancer or service routing behavior if applicable.
Review CPU, GPU, memory, and network utilization.
Inspect tail latency, not just average latency.
Confirm that autoscaling or scheduling behavior is not moving workloads unpredictably.
Check whether storage cache misses or model fetches correlate with latency spikes.
Validate security controls are not introducing unexpected path changes or bottlenecks.
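
The advice above to inspect tail latency rather than the average is easier to appreciate with numbers. The sketch below uses a nearest-rank percentile on made-up latency samples. The sample values are hypothetical, and production monitoring systems may compute percentiles differently.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n, in integer math
    return ordered[max(rank, 1) - 1]


# Hypothetical inference latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [900] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms} ms")  # 37.6 ms
print(f"p50:  {p50_ms} ms")   # 20 ms
print(f"p99:  {p99_ms} ms")   # 900 ms
```

The mean suggests a modest slowdown, while the 99th percentile shows that some users wait 45 times longer than the typical request. Correlating those slow requests with cache misses, model fetches, or rescheduling events is the next step.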

Common weak areas and traps

Weak area	Why it hurts candidates	How to fix it
Memorizing terms without traffic reasoning	AI infrastructure questions often require cause-and-effect thinking	Practice mapping workload behavior to fabric, host, and storage signals
Treating PFC as universally good	PFC can prevent drops but can also spread congestion	Know where it should apply and what counters reveal trouble
Ignoring host configuration	The switch may be correct while NIC, driver, MTU, or marking is wrong	Include host-side validation in every RoCE checklist
Looking only at average utilization	AI issues often involve microbursts, queues, pauses, or tail latency	Review queue, pause, and burst-related telemetry
Assuming ECMP guarantees balance	Flow hashing and traffic patterns can still create hot spots	Understand path distribution and flow characteristics
Forgetting storage	GPUs can be idle because data is not arriving fast enough	Include storage throughput and data pipeline checks
Confusing underlay and overlay symptoms	Routing reachability, VXLAN/EVPN state, and endpoint learning are different checks	Build a layered troubleshooting sequence
Skipping physical-layer evidence	Bad optics, cabling, or flaps can look like application instability	Start with interface and transceiver health
Overlooking change control	Fabric-wide QoS or MTU changes can have broad impact	Pair every change with validation and rollback
Studying product names only	The exam title is implementation-oriented	Practice “what would you configure, verify, or troubleshoot?” prompts

Consolidated “Can you do this?” checklist

Before exam day, you should be able to answer “yes” to most of these.

Architecture and design

Can you describe a leaf-spine AI fabric and its failure domains?
Can you explain why AI training traffic is often east-west intensive?
Can you reason through oversubscription and capacity after a link or spine failure?
Can you choose when segmentation is needed and what it protects?
Can you identify when an overlay design changes troubleshooting steps?
Can you connect cabling, optics, port speed, and topology to cluster scale?

RDMA, QoS, and congestion

Can you explain RoCEv2 at a practical level?
Can you distinguish QoS classification, PFC, and ECN?
Can you identify symptoms of pause-related congestion?
Can you validate that traffic markings are preserved end-to-end?
Can you explain why MTU consistency matters?
Can you troubleshoot queue drops without assuming the application is at fault?

Cisco implementation and validation

Can you identify the Cisco data center devices and tools relevant to the scenario?
Can you interpret common NX-OS show commands for interface, port-channel, routing, overlay, and QoS state?
Can you explain what evidence proves a fabric is ready for AI workload testing?
Can you separate physical, routing, overlay, QoS, host, and storage problems?
Can you describe safe implementation steps for adding or changing fabric policy?

Compute, storage, and operations

Can you explain why GPUs may be idle even when the network is healthy?
Can you identify host-side dependencies for high-performance networking?
Can you compare file, block, and object storage concerns for AI workflows?
Can you plan telemetry that includes switches, hosts, GPUs, storage, and applications?
Can you describe a rollback plan for a fabric-wide configuration change?
Can you apply security controls without breaking performance-critical paths?

Final-week checklist

Seven to five days out

Re-read Cisco's current public exam information for Cisco Implementing Data Center AI Infrastructure (300-640 DCAI), including the exam name and number.
Build a one-page map of the major readiness areas: AI workloads, fabric, RDMA/QoS, compute, storage, automation, telemetry, security.
Review your weakest infrastructure layer first, not the topic you already know best.
Practice explaining PFC, ECN, QoS, MTU, and RoCEv2 out loud in plain language.
Review topology diagrams and trace traffic from GPU node to GPU node and GPU node to storage.

Four to two days out

Work through troubleshooting scenarios without looking at notes.
Review Cisco data center validation commands and what each output proves.
Practice identifying whether a symptom is physical, routing, overlay, QoS, host, storage, or application related.
Review automation and change-control workflows.
Revisit security, management-plane, and telemetry topics.

Final day

Do a quick pass through your personal weak-area notes.
Review definitions only after scenario practice, not instead of it.
Do not memorize unsupported numbers, quotas, or weights.
Focus on decision logic: what would you check first, what evidence confirms it, and what risk does the fix introduce?
Rest enough to read scenario wording carefully.

Practical next step

Use this Exam Blueprint as a gap-finding tool: mark each item as confident, needs review, or needs hands-on practice. Then spend your next study session on scenario questions and lab-style validation tasks for the weakest marked areas, especially RDMA/QoS behavior, Cisco fabric verification, and AI workload troubleshooting.
