Free & Open Source

Know what your cluster can actually do.

Most teams don't know their real GPU performance. They guess based on vendor specs, then wonder why distributed training runs slower than expected. DeepLM Insights baselines your entire stack — compute, interconnect, and network — in one run.

Built for

Cluster Admins · ML Engineers · Researchers · Students · AI Builders · DevOps
01

GPU Compute & Memory

How fast are your GPUs — really?

Establish per-GPU compute throughput baselines, detect thermal throttling and vendor-imposed power limits, and validate memory bandwidth. Compare actual performance against vendor specs to find GPUs that aren't pulling their weight.

What you learn

  • Actual TFLOPS vs. vendor-published peak (FP16/BF16)
  • Whether power limits are set below spec
  • HBM bandwidth — are you hitting ±5% of rated speed?
  • Thermal throttle thresholds under sustained load
  • Which GPUs in your fleet are underperforming

Tests included

  • Peak FP16/BF16 TFLOPS: GEMM microbenchmark (cuBLAS / rocBLAS)
  • Vector math saturation: custom CUDA kernel + power monitoring
  • HBM bandwidth: Stream benchmark, peak read/write GB/s
  • Thermal ramp test: sustained GEMM, logging temperature + SM clock per second

Pass / Fail Thresholds

  • TFLOPS within ±5% of vendor spec
  • HBM bandwidth within ±5% of spec
  • Temperature stabilizes below throttle threshold
  • No throttle reason codes beyond [Active]
Script: baseline_gpu.py
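The TFLOPS check above reduces to simple arithmetic: a GEMM of shape M×K by K×N performs 2·M·N·K floating-point operations, so achieved TFLOPS falls out of the elapsed time. A minimal sketch in pure Python (no GPU required); the helper names are illustrative, not part of the toolkit, and the 989 TFLOPS figure is NVIDIA's published H100 SXM FP16 dense Tensor Core peak, used here only as an example spec:

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_s: float) -> float:
    """Achieved TFLOPS for one M x K @ K x N GEMM (2*M*N*K FLOPs)."""
    return (2.0 * m * n * k) / elapsed_s / 1e12

def within_spec(measured: float, spec: float, tol: float = 0.05) -> bool:
    """Pass if the measured figure is within +/- tol of the vendor spec."""
    return abs(measured - spec) / spec <= tol

# Hypothetical timing: an 8192^3 FP16 GEMM finishing in 1.25 ms
tflops = gemm_tflops(8192, 8192, 8192, 1.25e-3)
print(round(tflops, 1))            # 879.6 achieved TFLOPS
print(within_spec(tflops, 989.0))  # vs. 989 TFLOPS example spec -> False
```

The same `within_spec` check applies unchanged to the HBM bandwidth threshold.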
02

GPU Interconnect & Intra-Node

Are your GPUs actually talking to each other?

Validate GPU-to-GPU bandwidth within a node. Catch misconfigured NVLink bridges, disabled NVSwitch lanes, PCIe gen mismatches, and NUMA affinity issues that silently kill distributed training performance.

What you learn

  • NVLink peer-to-peer bandwidth between all GPU pairs
  • Whether NVSwitch mesh is fully connected or degraded
  • PCIe bandwidth matching expected gen (4 vs 5)
  • NUMA affinity — is each GPU on the right node?
  • Which link pairs are bottlenecking your all-reduce

Tests included

  • NVLink p2p bandwidth: p2pBandwidthLatencyTest, all GPU pairs
  • NVLink topology check: nvidia-smi nvlink --status
  • PCIe bandwidth: host-to-device transfer, gen validation
  • NUMA affinity check: nvidia-smi topo -m vs. expected

Pass / Fail Thresholds

  • NVLink p2p ≥ 450 GB/s per link pair (H100 SXM)
  • Zero NVLink errors
  • PCIe matches gen spec (~64 GB/s per direction for PCIe 5 x16)
  • All GPUs on expected NUMA nodes
Script: baseline_interconnect.py
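The pass/fail logic for the p2p test is a scan over the bandwidth matrix that p2pBandwidthLatencyTest prints. A sketch in pure Python; `weak_links` and the example matrix are illustrative (the numbers are made up), not the toolkit's actual parser:

```python
def weak_links(bw, min_gbps=450.0):
    """Flag GPU pairs whose measured p2p bandwidth (GB/s) falls below
    the pass threshold. `bw` is a symmetric per-pair matrix; the
    diagonal (local copies) is skipped."""
    bad = []
    n = len(bw)
    for i in range(n):
        for j in range(i + 1, n):
            if bw[i][j] < min_gbps:
                bad.append((i, j, bw[i][j]))
    return bad

# Hypothetical 4-GPU matrix: the 1<->3 pair is degraded
matrix = [
    [0, 468, 471, 465],
    [468, 0, 470, 212],
    [471, 470, 0, 469],
    [465, 212, 469, 0],
]
print(weak_links(matrix))  # [(1, 3, 212)]
```

A single degraded pair like this is exactly the kind of fault that only shows up as a mysteriously slow all-reduce.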
03

Network & Multi-Node Collectives

Can your cluster actually scale?

Baseline the full network stack — Ethernet throughput, InfiniBand RDMA bandwidth, and multi-node collective communication. This is where most clusters silently lose 30-50% of their theoretical distributed training performance.

What you learn

  • Per-NIC Ethernet throughput and asymmetry issues
  • InfiniBand port speeds and RDMA bandwidth at line rate
  • Whether NCCL is using IB transport or falling back to TCP
  • AllReduce bus bandwidth across your actual node count
  • Which node pairs have bad ports, cables, or switches

Tests included

  • Ethernet throughput: iperf3 single/multi-stream, bidirectional
  • IB/RDMA bandwidth: ib_read_bw, ib_write_bw (perftest suite)
  • NCCL transport validation: all_reduce_perf with NCCL_DEBUG=INFO
  • AllReduce sweep: nccl-tests, 1 KB → 8 GB message sizes

Pass / Fail Thresholds

  • IB link speed matches fabric spec (e.g., NDR 400 Gb/s)
  • RDMA write bandwidth ≥ 90% of line rate
  • NCCL transport confirmed as IB, not NET/Socket
  • AllReduce regression < 10% from baseline
Script: baseline_network.py
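The AllReduce "bus bandwidth" figure follows the nccl-tests convention: algorithm bandwidth is message size over time, and bus bandwidth scales it by 2(n-1)/n for AllReduce to make results comparable across rank counts. A small sketch (the function name and the timing values are hypothetical):

```python
def allreduce_busbw(size_bytes: int, elapsed_s: float, n_ranks: int) -> float:
    """Bus bandwidth (GB/s) per the nccl-tests convention for AllReduce:
    algbw = size / time, busbw = algbw * 2*(n-1)/n."""
    algbw = size_bytes / elapsed_s / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Hypothetical: an 8 GB message across 16 ranks completing in 25 ms
print(round(allreduce_busbw(8 * 1024**3, 0.025, 16), 1))  # 644.2
```

Comparing this busbw figure against the fabric's line rate tells you how much of the network AllReduce is actually using.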

Get started in 5 minutes.

Works on any SLURM or Kubernetes cluster with NVIDIA, AMD, or Intel GPUs.

1. Clone

   git clone https://github.com/deeplm/deeplm-insights.git
   cd deeplm-insights

2. Install

   pip install -r requirements.txt

3. Run baselines

   # Single node — GPU compute + interconnect
   python baseline_gpu.py
   python baseline_interconnect.py

   # Multi-node — network + collectives
   python baseline_network.py --nodes nodelist.txt

4. View results

   # Structured JSON output for each stage
   cat results/gpu_baseline.json
   cat results/interconnect_baseline.json
   cat results/network_baseline.json
Results are structured JSON — pipe into your monitoring stack or compare across runs.
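Comparing across runs can be as simple as diffing two result dicts against the 10% regression threshold. A sketch, assuming a flat metric-name-to-value JSON schema (the real output schema may differ; the metric names below are hypothetical):

```python
import json

def regressions(baseline: dict, current: dict, tol: float = 0.10) -> dict:
    """Report metrics that dropped by more than `tol` (10% matches the
    AllReduce pass/fail threshold) between two baseline runs."""
    out = {}
    for key, ref in baseline.items():
        now = current.get(key)
        if now is not None and (ref - now) / ref > tol:
            out[key] = {"baseline": ref, "current": now,
                        "drop_pct": round(100 * (ref - now) / ref, 1)}
    return out

# Hypothetical metric names and values
ref = {"fp16_tflops": 910.0, "hbm_gbps": 3200.0, "allreduce_busbw": 640.0}
now = {"fp16_tflops": 905.0, "hbm_gbps": 3190.0, "allreduce_busbw": 510.0}
print(json.dumps(regressions(ref, now), indent=2))
```

Here only the AllReduce number is flagged (a 20.3% drop); the small compute and memory wobbles stay within tolerance.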

Coming next

Storage baselining, checkpoint latency, and the full pre-flight harness

Stages 4 and 5 add storage I/O benchmarking (single-node and distributed), checkpoint latency profiling (sync and async), and a unified regression harness that runs as a SLURM prolog or K8s init container.

Stop guessing. Start measuring.

One command. Full cluster baseline. Free forever.

Get DeepLM Insights