
Introducing DeepLM Insights: Open-Source GPU Cluster Monitoring

DeepLM Team
Engineering

You can't optimize what you can't see.

Most HPC teams know their cluster is underperforming. They can feel it — jobs take longer than they should, GPUs sit idle between runs, and nobody knows which users are consuming the most resources or why certain jobs stall.

The problem isn't awareness. It's visibility.

Today we're open-sourcing DeepLM Insights — a complete monitoring stack for SLURM GPU clusters that gives you real-time, job-level insight into what your cluster is actually doing.

What You Get

Five purpose-built Grafana dashboards, backed by Prometheus metrics and a Cassandra data store:

  • Job Insights — Per-job CPU/GPU utilization, memory consumption, priority, and power draw. See exactly which jobs are efficient and which are wasting resources.
  • System Overview — Cluster-wide power consumption, GPU temperatures, fan speeds, and per-node CPU/memory utilization at a glance.
  • Live Jobs — Real-time monitoring of active jobs with a 5-second refresh. Know what's running right now.
  • Historical Jobs — Job duration analysis, completion rates, and CPU hours broken down by user. Understand long-term patterns.
  • Checkpoint Analysis — Compare sync vs. async checkpoint strategies side by side. Quantify stall time and overhead so you can make data-driven decisions about your checkpointing approach (a sketch of the underlying calculation follows this list).
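
For a feel of what the checkpoint comparison boils down to, here's a back-of-envelope sketch in Python (not DeepLM code; the numbers and field names are made up) of the kind of calculation the dashboard visualizes:

  # Hypothetical sketch: quantify checkpoint overhead as the fraction of
  # wall-clock time a job spends stalled on checkpoint writes.
  def checkpoint_overhead(stall_seconds, wall_seconds):
      """Fraction of job runtime lost to checkpoint stalls."""
      return stall_seconds / wall_seconds

  # Example: a sync strategy that blocks training on every write vs. an
  # async strategy that only stalls while snapshotting to host memory.
  sync_overhead = checkpoint_overhead(stall_seconds=1200, wall_seconds=36000)
  async_overhead = checkpoint_overhead(stall_seconds=90, wall_seconds=36000)
  print(f"sync: {sync_overhead:.2%}, async: {async_overhead:.2%}")  # 3.33% vs. 0.25%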

How It Works

DeepLM Insights integrates directly with your SLURM cluster through prolog and epilog hooks. When a job starts or finishes, the hooks push its metadata and telemetry to a lightweight Flask API. That API exposes Prometheus-formatted metrics, which feed into Grafana.
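
To make that concrete, a minimal epilog hook might look something like the sketch below. This is illustrative only: the endpoint URL and payload fields are assumptions, and the actual hooks ship in the repository.

  # Hypothetical epilog sketch: push job metadata from SLURM's environment
  # to the Insights API when a job finishes.
  import json, os, urllib.request

  payload = {
      "job_id": os.environ.get("SLURM_JOB_ID"),
      "user": os.environ.get("SLURM_JOB_USER"),
      "nodes": os.environ.get("SLURM_JOB_NODELIST"),
      "event": "epilog",
  }
  req = urllib.request.Request(
      "http://insights-api:5000/api/jobs",  # assumed endpoint, not the real one
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  urllib.request.urlopen(req, timeout=5)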

If you have NVIDIA Base Command Manager (BCM), the system pulls real GPU power, temperature, and utilization data from the BCM REST API. Without BCM, it falls back to TDP-based estimation — so you get useful power metrics either way.
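
The fallback idea is simple: scale each card's rated TDP by its current utilization. A rough sketch, with an assumed TDP table and an assumed idle floor (not the exact formula Insights uses):

  # Rough sketch of TDP-based power estimation (illustrative only):
  # without BCM telemetry, approximate draw by scaling the card's rated
  # TDP by its current utilization, with a floor for idle draw.
  TDP_WATTS = {"A100": 400, "H100": 700}  # rated TDP per GPU model

  def estimate_power(model: str, utilization: float, idle_fraction: float = 0.15) -> float:
      """Estimated watts for one GPU at a given utilization (0.0-1.0)."""
      return TDP_WATTS[model] * max(utilization, idle_fraction)

  print(estimate_power("H100", utilization=0.85))  # ~595 W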

SLURM Hooks → Flask API → Cassandra + Prometheus → Grafana
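
On the serving side, the Prometheus integration boils down to a /metrics endpoint. Here's a minimal sketch with Flask and prometheus_client; the metric name and labels are placeholders, not Insights' actual schema:

  # Minimal sketch of a Flask app exposing Prometheus-formatted metrics.
  from flask import Flask, Response
  from prometheus_client import Gauge, generate_latest, CONTENT_TYPE_LATEST

  app = Flask(__name__)
  gpu_util = Gauge("slurm_job_gpu_utilization", "Per-job GPU utilization (0-1)",
                   ["job_id", "user"])

  @app.route("/metrics")
  def metrics():
      # Prometheus scrapes this endpoint on its configured interval.
      return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

  # Elsewhere in the API, hook handlers would update the gauge, e.g.:
  # gpu_util.labels(job_id="12345", user="alice").set(0.92)

  if __name__ == "__main__":
      app.run(port=5000)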

The whole stack runs as a Docker Compose deployment. One docker compose up -d and you're live.

Why We're Open-Sourcing This

We built DeepLM Insights because every cluster we work with has the same problem: they're flying blind. Commercial monitoring tools are either too expensive, too generic, or don't integrate with SLURM at the job level.

This is table stakes. Every GPU cluster should have this level of visibility, regardless of budget. So we're giving it away.

It also serves a practical purpose: you can't optimize a cluster you don't understand. DeepLM Insights gives teams the data they need to identify waste — and when they're ready to act on it, DeepLM's optimization platform is there.

Get Started

  git clone https://github.com/DeepLM/Insights.git
  cd Insights
  cp .env.example .env
  docker compose up -d

Grafana is at http://localhost:3000. Dashboards are pre-provisioned. You'll be looking at real data within minutes of connecting your SLURM hooks.

Full setup instructions, Cassandra schema, and SLURM hook installation are in the README on GitHub.

What's Next

DeepLM Insights is the first of several open-source tools we're releasing. Next up: DeepLM Baseline — a benchmarking suite that tests your cluster's actual GPU compute, interconnect, and network performance against vendor specs. No more guessing whether your hardware is performing to spec.

If you're running a GPU cluster and want to stop flying blind, give Insights a try. Stars, issues, and PRs are welcome.
