Free & Open Source

Know what your cluster can actually do.

Most teams don't know their real GPU performance. They guess based on vendor specs, then wonder why distributed training runs slower than expected. DeepLM Insights baselines your entire stack — compute, interconnect, and network — in one run.

Built for

Cluster Admins · ML Engineers · Researchers · Students · AI Builders · DevOps
01

GPU Compute & Memory

How fast are your GPUs — really?

Establish per-GPU compute throughput baselines, detect thermal throttling and vendor-imposed power limits, and validate memory bandwidth. Compare actual performance against vendor specs to find GPUs that aren't pulling their weight.

What you learn

  • Actual TFLOPS vs. vendor-published peak (FP16/BF16)
  • Whether power limits are set below spec
  • HBM bandwidth — are you hitting ±5% of rated speed?
  • Thermal throttle thresholds under sustained load
  • Which GPUs in your fleet are underperforming

Tests included

  • Peak FP16/BF16 TFLOPS: GEMM microbenchmark (cuBLAS / rocBLAS)
  • Vector math saturation: custom CUDA kernel + power monitoring
  • HBM bandwidth: Stream benchmark, peak read/write GB/s
  • Thermal ramp test: sustained GEMM, logging temperature + SM clock per second

Pass / Fail Thresholds

  • TFLOPS within ±5% of vendor spec
  • HBM bandwidth within ±5% of spec
  • Temperature stabilizes below throttle threshold
  • No throttle reason codes beyond [Active]
Script: baseline_gpu.py
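The TFLOPS check above reduces to simple arithmetic: a GEMM of shape M×K by K×N performs 2·M·N·K floating-point operations, so achieved TFLOPS falls out of the elapsed time. A minimal sketch in pure Python (no GPU required); the helper names are illustrative, not part of the toolkit, and the 989 TFLOPS figure is NVIDIA's published H100 SXM FP16 dense Tensor Core peak, used here only as an example spec:

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_s: float) -> float:
    """Achieved TFLOPS for one M x K @ K x N GEMM (2*M*N*K FLOPs)."""
    return (2.0 * m * n * k) / elapsed_s / 1e12

def within_spec(measured: float, spec: float, tol: float = 0.05) -> bool:
    """Pass if the measured figure is within +/- tol of the vendor spec."""
    return abs(measured - spec) / spec <= tol

# Hypothetical timing: an 8192^3 FP16 GEMM finishing in 1.25 ms
tflops = gemm_tflops(8192, 8192, 8192, 1.25e-3)
print(round(tflops, 1))            # 879.6 achieved TFLOPS
print(within_spec(tflops, 989.0))  # vs. 989 TFLOPS example spec -> False
```

The same `within_spec` check applies unchanged to the HBM bandwidth threshold.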
02

GPU Interconnect & Intra-Node

Are your GPUs actually talking to each other?

Validate GPU-to-GPU bandwidth within a node. Catch misconfigured NVLink bridges, disabled NVSwitch lanes, PCIe gen mismatches, and NUMA affinity issues that silently kill distributed training performance.

What you learn

  • NVLink peer-to-peer bandwidth between all GPU pairs
  • Whether NVSwitch mesh is fully connected or degraded
  • PCIe bandwidth matching expected gen (4 vs 5)
  • NUMA affinity — is each GPU on the right node?
  • Which link pairs are bottlenecking your all-reduce

Tests included

  • NVLink p2p bandwidth: p2pBandwidthLatencyTest, all GPU pairs
  • NVLink topology check: nvidia-smi nvlink --status
  • PCIe bandwidth: host-to-device transfer, gen validation
  • NUMA affinity check: nvidia-smi topo -m vs. expected

Pass / Fail Thresholds

  • NVLink p2p ≥ 450 GB/s per link pair (H100 SXM)
  • Zero NVLink errors
  • PCIe matches gen spec (~64 GB/s per direction for PCIe 5 x16)
  • All GPUs on expected NUMA nodes
Script: baseline_interconnect.py
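The pass/fail logic for the p2p test is a scan over the bandwidth matrix that p2pBandwidthLatencyTest prints. A sketch in pure Python; `weak_links` and the example matrix are illustrative (the numbers are made up), not the toolkit's actual parser:

```python
def weak_links(bw, min_gbps=450.0):
    """Flag GPU pairs whose measured p2p bandwidth (GB/s) falls below
    the pass threshold. `bw` is a symmetric per-pair matrix; the
    diagonal (local copies) is skipped."""
    bad = []
    n = len(bw)
    for i in range(n):
        for j in range(i + 1, n):
            if bw[i][j] < min_gbps:
                bad.append((i, j, bw[i][j]))
    return bad

# Hypothetical 4-GPU matrix: the 1<->3 pair is degraded
matrix = [
    [0, 468, 471, 465],
    [468, 0, 470, 212],
    [471, 470, 0, 469],
    [465, 212, 469, 0],
]
print(weak_links(matrix))  # [(1, 3, 212)]
```

A single degraded pair like this is exactly the kind of fault that only shows up as a mysteriously slow all-reduce.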
03

Network & Multi-Node Collectives

Can your cluster actually scale?

Baseline the full network stack — Ethernet throughput, InfiniBand RDMA bandwidth, and multi-node collective communication. This is where most clusters silently lose 30-50% of their theoretical distributed training performance.

What you learn

  • Per-NIC Ethernet throughput and asymmetry issues
  • InfiniBand port speeds and RDMA bandwidth at line rate
  • Whether NCCL is using IB transport or falling back to TCP
  • AllReduce bus bandwidth across your actual node count
  • Which node pairs have bad ports, cables, or switches

Tests included

  • Ethernet throughput: iperf3 single/multi-stream, bidirectional
  • IB/RDMA bandwidth: ib_read_bw, ib_write_bw (perftest suite)
  • NCCL transport validation: all_reduce_perf with NCCL_DEBUG=INFO
  • AllReduce sweep: nccl-tests, 1 KB → 8 GB message sizes

Pass / Fail Thresholds

  • IB link speed matches fabric spec (e.g., NDR 400 Gb/s)
  • RDMA write bandwidth ≥ 90% of line rate
  • NCCL transport confirmed as IB, not NET/Socket
  • AllReduce regression < 10% from baseline
Script: baseline_network.py
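The AllReduce "bus bandwidth" figure follows the nccl-tests convention: algorithm bandwidth is message size over time, and bus bandwidth scales it by 2(n-1)/n for AllReduce to make results comparable across rank counts. A small sketch (the function name and the timing values are hypothetical):

```python
def allreduce_busbw(size_bytes: int, elapsed_s: float, n_ranks: int) -> float:
    """Bus bandwidth (GB/s) per the nccl-tests convention for AllReduce:
    algbw = size / time, busbw = algbw * 2*(n-1)/n."""
    algbw = size_bytes / elapsed_s / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Hypothetical: an 8 GB message across 16 ranks completing in 25 ms
print(round(allreduce_busbw(8 * 1024**3, 0.025, 16), 1))  # 644.2
```

Comparing this busbw figure against the fabric's line rate tells you how much of the network AllReduce is actually using.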

Get started in 5 minutes.

Works on any SLURM or Kubernetes cluster with NVIDIA, AMD, or Intel GPUs.

1. Clone

   git clone https://github.com/deeplm/deeplm-insights.git
   cd deeplm-insights

2. Install

   pip install -r requirements.txt

3. Run baselines

   # Single node — GPU compute + interconnect
   python baseline_gpu.py
   python baseline_interconnect.py

   # Multi-node — network + collectives
   python baseline_network.py --nodes nodelist.txt

4. View results

   # Structured JSON output for each stage
   cat results/gpu_baseline.json
   cat results/interconnect_baseline.json
   cat results/network_baseline.json
Results are structured JSON — pipe into your monitoring stack or compare across runs.
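Comparing across runs can be as simple as diffing two result dicts against the 10% regression threshold. A sketch, assuming a flat metric-name-to-value JSON schema (the real output schema may differ; the metric names below are hypothetical):

```python
import json

def regressions(baseline: dict, current: dict, tol: float = 0.10) -> dict:
    """Report metrics that dropped by more than `tol` (10% matches the
    AllReduce pass/fail threshold) between two baseline runs."""
    out = {}
    for key, ref in baseline.items():
        now = current.get(key)
        if now is not None and (ref - now) / ref > tol:
            out[key] = {"baseline": ref, "current": now,
                        "drop_pct": round(100 * (ref - now) / ref, 1)}
    return out

# Hypothetical metric names and values
ref = {"fp16_tflops": 910.0, "hbm_gbps": 3200.0, "allreduce_busbw": 640.0}
now = {"fp16_tflops": 905.0, "hbm_gbps": 3190.0, "allreduce_busbw": 510.0}
print(json.dumps(regressions(ref, now), indent=2))
```

Here only the AllReduce number is flagged (a 20.3% drop); the small compute and memory wobbles stay within tolerance.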

Coming next

Storage baselining, checkpoint latency, and the full pre-flight harness

Stages 4 and 5 add storage I/O benchmarking (single-node and distributed), checkpoint latency profiling (sync and async), and a unified regression harness that runs as a SLURM prolog or K8s init container.

Stop guessing. Start measuring.

One command. Full cluster baseline. Free forever.

Get DeepLM Insights