Know what your cluster can actually do.
Most teams don't know their real GPU performance. They guess based on vendor specs, then wonder why distributed training runs slower than expected. DeepLM Insights baselines your entire stack — compute, interconnect, and network — in one run.
GPU Compute & Memory
How fast are your GPUs — really?
Establish per-GPU compute throughput baselines, detect thermal throttling and vendor-imposed power limits, and validate memory bandwidth. Compare actual performance against vendor specs to find GPUs that aren't pulling their weight.
What you learn
- Actual TFLOPS vs. vendor-published peak (FP16/BF16)
- Whether power limits are set below spec
- HBM bandwidth — are you within ±5% of rated speed?
- Thermal throttle thresholds under sustained load
- Which GPUs in your fleet are underperforming
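At its core, fleet-wide underperformer detection is a comparison of measured throughput against the vendor-rated peak with a tolerance. A minimal sketch of that logic, assuming you already have per-GPU measured FP16 TFLOPS in hand (function name, figures, and the 0.8 cutoff are illustrative, not DeepLM Insights' actual API):

```python
# Hypothetical sketch: flag GPUs whose measured FP16 throughput falls
# below a fraction of the vendor-published peak. The threshold and
# numbers below are illustrative only.

def flag_underperformers(measured_tflops, rated_peak_tflops, min_ratio=0.8):
    """Return (gpu_id, ratio) pairs whose measured/rated ratio is below min_ratio."""
    flagged = []
    for gpu_id, measured in sorted(measured_tflops.items()):
        ratio = measured / rated_peak_tflops
        if ratio < min_ratio:
            flagged.append((gpu_id, round(ratio, 2)))
    return flagged

# Example: four GPUs on a node rated at ~990 TFLOPS FP16 (H100-class)
measured = {0: 812.0, 1: 805.0, 2: 610.0, 3: 798.0}
print(flag_underperformers(measured, 990.0))  # → [(2, 0.62)]
```

In practice the interesting part is *why* GPU 2 is slow — a power limit set below spec and thermal throttling both show up as a depressed ratio like this, which is why the baselines above record power and thermal state alongside raw TFLOPS.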
baseline_gpu.py
GPU Interconnect & Intra-Node
Are your GPUs actually talking to each other?
Validate GPU-to-GPU bandwidth within a node. Catch misconfigured NVLink bridges, disabled NVSwitch lanes, PCIe gen mismatches, and NUMA affinity issues that silently kill distributed training performance.
What you learn
- NVLink peer-to-peer bandwidth between all GPU pairs
- Whether NVSwitch mesh is fully connected or degraded
- PCIe bandwidth matching expected gen (4 vs 5)
- NUMA affinity — is each GPU on the right node?
- Which link pairs are bottlenecking your all-reduce
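Finding the link pairs that bottleneck your all-reduce comes down to scanning the measured peer-to-peer bandwidth matrix for outliers. A small sketch, assuming the bandwidths have already been measured (function name, tolerance, and figures are illustrative):

```python
# Hypothetical sketch: given measured peer-to-peer bandwidths (GB/s)
# between GPU pairs, report links below the expected NVLink rate.
# Numbers are illustrative only.

def find_slow_links(p2p_gbps, expected_gbps, tolerance=0.9):
    """Return (src, dst, measured_gbps) for pairs below tolerance * expected."""
    slow = []
    for (src, dst), bw in sorted(p2p_gbps.items()):
        if bw < tolerance * expected_gbps:
            slow.append((src, dst, bw))
    return slow

# Example: a 4-GPU node expecting ~250 GB/s per direction over NVLink
p2p = {(0, 1): 248.3, (0, 2): 247.9, (0, 3): 96.4,   # (0, 3) fell back to PCIe
       (1, 2): 249.1, (1, 3): 247.5, (2, 3): 248.8}
print(find_slow_links(p2p, expected_gbps=250.0))  # → [(0, 3, 96.4)]
```

A pair running at roughly PCIe speed while the rest of the mesh hits NVLink rate is the classic signature of a misconfigured bridge or a disabled NVSwitch lane.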
baseline_interconnect.py
Network & Multi-Node Collectives
Can your cluster actually scale?
Baseline the full network stack — Ethernet throughput, InfiniBand RDMA bandwidth, and multi-node collective communication. This is where most clusters silently lose 30-50% of their theoretical distributed training performance.
What you learn
- Per-NIC Ethernet throughput and asymmetry issues
- InfiniBand port speeds and RDMA bandwidth at line rate
- Whether NCCL is using IB transport or falling back to TCP
- AllReduce bus bandwidth across your actual node count
- Which node pairs have bad ports, cables, or switches
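"AllReduce bus bandwidth" here means the nccl-tests-style number: algorithm bandwidth (bytes reduced per second) corrected by the ring all-reduce factor 2(n−1)/n, so it can be compared directly against link speed regardless of rank count. A minimal sketch of that conversion (function name is illustrative):

```python
def allreduce_bus_bandwidth(bytes_reduced, seconds, n_ranks):
    """Bus bandwidth for a ring all-reduce, in GB/s:
    busbw = (size / time) * 2 * (n - 1) / n
    This is the same correction nccl-tests applies when reporting busbw."""
    alg_bw = bytes_reduced / seconds                 # algorithm bandwidth, B/s
    bus_bw = alg_bw * 2 * (n_ranks - 1) / n_ranks    # per-link traffic
    return bus_bw / 1e9

# Example: reducing 1 GiB across 16 ranks in 25 ms
print(allreduce_bus_bandwidth(2**30, 0.025, 16))  # → ~80.5 GB/s
```

If that number sits well below your NIC or IB line rate at realistic message sizes, the bullets above — TCP fallback, bad ports, asymmetric links — are the usual suspects.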
baseline_network.py
Get started in 5 minutes.
Works on any SLURM or Kubernetes cluster with NVIDIA, AMD, or Intel GPUs.
git clone https://github.com/deeplm/deeplm-insights.git
cd deeplm-insights
pip install -r requirements.txt
# Single node — GPU compute + interconnect
python baseline_gpu.py
python baseline_interconnect.py
# Multi-node — network + collectives
python baseline_network.py --nodes nodelist.txt
# Structured JSON output for each stage
cat results/gpu_baseline.json
cat results/interconnect_baseline.json
cat results/network_baseline.json
Coming next
Storage baselining, checkpoint latency, and the full pre-flight harness
Stages 4 and 5 add storage I/O benchmarking (single-node and distributed), checkpoint latency profiling (sync and async), and a unified regression harness that runs as a SLURM prolog or K8s init container.