Boltforge — GPU Cluster Scheduler

Boltforge

GPU cluster scheduler for model training

Boltforge is a scheduler built for one job: keeping every accelerator in your cluster pinned at full tilt. It packs training runs onto fragmented GPUs, preempts low-priority work in milliseconds, and checkpoints anything it bumps — so a spot reclaim costs you a few minutes, not a few days. The cards you already pay for finally earn their power bill.

Self-hosted or managed
Slurm & Kubernetes native
94% mean fleet utilization

boltforge :: cluster-east-1

$ bolt submit train.yaml --gpus 64xH100 --priority high

→ queued job  llama-70b-sft  (64 GPUs requested)
→ topology: packed onto 8 nodes, NVLink-local, 0 stragglers
→ preempted 2 dev jobs · checkpointed in 3.2s · requeued

  NODE        GPUS   UTIL    JOB
  gpu-07      8/8    99%     llama-70b-sft
  gpu-08      8/8    98%     llama-70b-sft
  gpu-11      8/8    97%     llama-70b-sft

→ scheduled in 412ms · fleet utilization now 94.1%
→ est. cost saved this hour vs. static pools: $1,840

$ _

Drops into the stack your training team already runs

Slurm
Kubernetes
Ray
NCCL / NVLink
PyTorch · JAX
Prometheus

Why a real scheduler matters

Your cluster isn't slow. It's half empty.

Most teams buy more GPUs to fix a scheduling problem. Boltforge finds the capacity you already own — the half-used nodes, the gaps between jobs, the cards idling overnight — and fills them before it lets you spend another dollar.

Topology-aware packing

Boltforge knows which GPUs share an NVLink island, a PCIe switch, and a network rail. It packs each job onto the tightest free interconnect, so an all-reduce never crawls across a slow hop. It bin-packs the whole fleet instead of grabbing the first idle card it finds.
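To make the packing idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Boltforge's actual placement code: the `Node` type, the `island` label, and the scoring rule (fewest nodes first, then least leftover capacity) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    island: str      # hypothetical label for a shared NVLink / rail group
    free_gpus: int


def _place_in_island(members: list[Node], gpus_needed: int) -> list[tuple[str, int]]:
    # If one node can hold the job, take the tightest fit to limit fragmentation.
    fits = [n for n in members if n.free_gpus >= gpus_needed]
    if fits:
        tightest = min(fits, key=lambda n: n.free_gpus)
        return [(tightest.name, gpus_needed)]
    # Otherwise fill the emptiest nodes first so the job touches as few nodes as possible.
    placement, remaining = [], gpus_needed
    for n in sorted(members, key=lambda n: n.free_gpus, reverse=True):
        take = min(n.free_gpus, remaining)
        placement.append((n.name, take))
        remaining -= take
        if remaining == 0:
            break
    return placement


def pack_job(nodes: list[Node], gpus_needed: int) -> Optional[list[tuple[str, int]]]:
    """Return [(node_name, gpus_taken), ...] inside one island, or None if no island fits."""
    islands: dict[str, list[Node]] = {}
    for node in nodes:
        if node.free_gpus > 0:
            islands.setdefault(node.island, []).append(node)

    best = None
    for members in islands.values():
        capacity = sum(n.free_gpus for n in members)
        if capacity < gpus_needed:
            continue  # this island cannot host the whole job
        placement = _place_in_island(members, gpus_needed)
        # Assumed scoring: fewer nodes wins, then the island with less spare capacity.
        score = (len(placement), capacity - gpus_needed)
        if best is None or score < best[0]:
            best = (score, placement)
    return None if best is None else best[1]
```

A first-fit scheduler would hand a 4-GPU job to whichever node it saw first. This sketch picks the node that the job fills exactly and leaves the empty 8-GPU nodes free for a larger run.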

Millisecond preemption

When a high-priority run lands, Boltforge picks which lower-priority jobs to evict in milliseconds, checkpoints them (a median of 3.2 seconds), hands over the GPUs, and requeues the bumped work the instant capacity frees. No human paging at midnight, no waiting for a job to politely finish.
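As a rough illustration of the eviction decision, the sketch below picks victims lowest priority first until enough GPUs are free. The `RunningJob` type, the "higher number means more important" convention, and the tie-break on job size are assumptions made for this example, not Boltforge's documented policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningJob:
    name: str
    priority: int          # assumed convention: higher means more important
    gpus: int
    preemptable: bool = True


def choose_victims(
    running: list[RunningJob],
    incoming_priority: int,
    gpus_needed: int,
    free_gpus: int = 0,
) -> Optional[list[RunningJob]]:
    """Return the jobs to evict, [] if none are needed, or None if the job cannot fit."""
    shortfall = gpus_needed - free_gpus
    if shortfall <= 0:
        return []
    # Only strictly lower-priority, preemptable jobs are candidates.
    # Lowest priority goes first; among equals, larger jobs go first so fewer are bumped.
    candidates = sorted(
        (j for j in running if j.preemptable and j.priority < incoming_priority),
        key=lambda j: (j.priority, -j.gpus),
    )
    victims = []
    for job in candidates:
        victims.append(job)
        shortfall -= job.gpus
        if shortfall <= 0:
            return victims
    return None  # evicting every candidate still would not free enough GPUs
```

Returning `None` instead of a partial list matters: bumping jobs that do not free enough room would waste their progress for nothing.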

Checkpoint-on-evict

Every preemptable job is snapshotted to fast storage before its GPUs are taken. A spot reclaim or a priority bump resumes from the last step, not from zero — so an interruption costs minutes, never a full restart.
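The resume-from-last-step behavior can be sketched with nothing but the Python standard library. This is a toy under stated assumptions: the `evict_requested` callback stands in for whatever signal the scheduler sends, the "training step" is just a counter, and the JSON file stands in for a real model checkpoint.

```python
import json
import os
from typing import Callable


def save_checkpoint(path: str, state: dict) -> None:
    # Write to a temp file, then rename, so a kill mid-write never leaves a torn checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    if not os.path.exists(path):
        return {"step": 0}  # fresh run
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, evict_requested: Callable[[int], bool]) -> str:
    """Run until done or evicted; always resume from the last saved step."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if evict_requested(state["step"]):
            save_checkpoint(path, state)   # snapshot before giving up the GPUs
            return "evicted"
        state["step"] += 1                 # stand-in for one optimizer step
    save_checkpoint(path, state)
    return "done"
```

Calling `train` again after an eviction picks up at the saved step, which is the whole point: the interruption costs the time to save and restore, not the steps already done.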

Spot + reserved arbitrage

Boltforge blends your reserved capacity with spot and burst cloud GPUs, routes each job to the cheapest pool that still hits its deadline, and drains spot gracefully the moment a reclaim warning fires.
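The routing rule, cheapest pool that still hits the deadline, is simple to sketch. Everything named here is hypothetical: the `Pool` fields, the assumption that a job finishes `wait_hours + run_hours` after submission, and the prices used in any example call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str
    price_per_gpu_hour: float   # illustrative figure, not a real quote
    wait_hours: float           # assumed expected queue wait before the job starts
    free_gpus: int


def pick_pool(
    pools: list[Pool], gpus: int, run_hours: float, hours_to_deadline: float
) -> Optional[Pool]:
    """Cheapest pool with enough GPUs that finishes before the deadline, else None."""
    feasible = [
        p for p in pools
        if p.free_gpus >= gpus and p.wait_hours + run_hours <= hours_to_deadline
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.price_per_gpu_hour * gpus * run_hours)
```

With a loose deadline the cheapest pool wins. As the deadline tightens, pools with long queue waits drop out of the feasible set and the job moves to pricier capacity that can start sooner.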

Fair-share with deadlines

Hierarchical quotas stop one team from starving the rest, while deadline-aware queuing still gets the launch-blocking run out the door on time. You set the priorities; the scheduler does the math, every cycle.
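One way to picture "fair-share with deadlines" is as a sort key over the queue. The sketch below is a flat, single-level simplification (real hierarchical quotas would nest teams under orgs), and the `QueuedJob` fields and the usage-to-quota ratio are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    name: str
    team: str
    deadline_critical: bool = False


def order_queue(
    jobs: list[QueuedJob], usage: dict[str, float], quota: dict[str, float]
) -> list[QueuedJob]:
    """Deadline-critical jobs first, then teams that have used the least of their quota."""
    def share_used(team: str) -> float:
        return usage.get(team, 0.0) / quota[team]

    # False sorts before True, so `not deadline_critical` puts critical jobs at the front.
    return sorted(jobs, key=lambda j: (not j.deadline_critical, share_used(j.team)))
```

A team that has burned 90% of its quota still gets its launch-blocking run to the front, but its ordinary jobs queue behind a team that has used 10%.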

What changes when the scheduler does its job

94%

Mean fleet utilization

412ms

Median time to schedule

3.2s

Median checkpoint-on-evict

2.4x

More training runs per GPU-month

Under the hood

Built by people who have watched a run die at 3am.

Long training jobs fail in boring, expensive ways: a node dies, a spot pool gets reclaimed, an interconnect degrades. Boltforge assumes every one of those will happen and keeps the run alive anyway.

Gang scheduling, done right

A distributed job gets all of its GPUs at once or none of them — no half-allocated run deadlocked waiting on the last node. Boltforge reserves the whole gang atomically or holds it in queue until the slot exists.
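The all-or-nothing rule can be sketched as a reservation that either commits in full or changes nothing. The `GpuPool` class and its single lock are assumptions for this example; a real scheduler would coordinate across many machines.

```python
import threading
from typing import Optional


class GpuPool:
    """Toy in-memory pool; node names and counts are illustrative."""

    def __init__(self, free_by_node: dict[str, int]):
        self._free = dict(free_by_node)
        self._lock = threading.Lock()

    def free_gpus(self) -> int:
        with self._lock:
            return sum(self._free.values())

    def reserve_gang(self, gpus_needed: int) -> Optional[dict[str, int]]:
        """Grant every GPU the job needs at once, or grant none and return None."""
        with self._lock:
            if sum(self._free.values()) < gpus_needed:
                return None  # job stays queued; nothing is partially allocated
            grant: dict[str, int] = {}
            remaining = gpus_needed
            # Emptiest nodes first, so the gang spans as few nodes as possible.
            for node, free in sorted(self._free.items(), key=lambda kv: -kv[1]):
                take = min(free, remaining)
                if take:
                    grant[node] = take
                    remaining -= take
                if remaining == 0:
                    break
            # Commit only after the full grant is known to exist.
            for node, take in grant.items():
                self._free[node] -= take
            return grant
```

Because the capacity check and the commit happen under one lock, two jobs can never each grab half of what they need and then wait on each other forever.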

Self-healing runs

When a node falls over mid-training, Boltforge catches the failure, pulls a healthy replacement from the pool, restores the latest checkpoint, and rejoins the collective — usually before your on-call phone has finished buzzing.

Live straggler detection

One slow GPU drags an entire data-parallel job down to its pace. Boltforge watches per-rank step times, flags the laggard, and reschedules around degraded hardware before it taxes the whole run.
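A simple version of per-rank straggler detection compares each rank's median step time with the median across all ranks. The 1.2x tolerance and the input shape are assumptions chosen for illustration, not Boltforge's tuned thresholds.

```python
from statistics import median


def find_stragglers(step_times: dict[int, list[float]], tolerance: float = 1.2) -> list[int]:
    """Return ranks whose median step time exceeds tolerance x the median across ranks.

    step_times maps rank -> recent step durations in seconds.
    """
    # Medians resist one-off spikes (a slow data load, a GC pause) better than means.
    per_rank = {rank: median(times) for rank, times in step_times.items() if times}
    if not per_rank:
        return []
    baseline = median(per_rank.values())
    return sorted(rank for rank, t in per_rank.items() if t > tolerance * baseline)
```

In a synchronous data-parallel job every rank waits for the slowest one each step, so a rank that is consistently 40 to 50 percent slower sets the pace for the whole run until it is moved off the degraded hardware.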

Per-job cost accounting

Every run carries a live GPU-hour and dollar meter, attributed to a team and a budget. Finance sees exactly which experiments burned the quarter — no more reverse-engineering a cloud invoice in a spreadsheet.
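The arithmetic behind a per-team meter is GPU-hours times the price of the pool each run used. The record fields and the prices below are hypothetical placeholders, not Boltforge's schema or real rates.

```python
from collections import defaultdict


def team_spend(
    usage_records: list[dict], price_per_gpu_hour: dict[str, float]
) -> dict[str, float]:
    """Sum dollars per team: gpus x hours x price of the pool the run used."""
    totals: dict[str, float] = defaultdict(float)
    for rec in usage_records:
        gpu_hours = rec["gpus"] * rec["hours"]
        totals[rec["team"]] += gpu_hours * price_per_gpu_hour[rec["pool"]]
    return dict(totals)
```

For example, a 64-GPU run for 2 hours on a pool priced at $2.00 per GPU-hour is 128 GPU-hours, or $256, attributed to the submitting team.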

Where it runs

One scheduler, every cluster shape.

Boltforge runs the same way whether you own four nodes or four thousand, on metal in your own cage or burst across three clouds. Pick the shape that matches your fleet.

On-prem

The bare-metal lab

A reserved cluster of H100s in your own racks. Boltforge maps the NVLink and rail topology once, then keeps every node packed and self-healing — no cloud egress, no surprise bill.

Hybrid

Own first, burst second

Run on your reserved cards until the queue backs up, then overflow to spot and on-demand cloud through Burst Bridge. Jobs land on the cheapest pool that still makes the deadline.

Multi-cloud

Spot across three providers

Treat AWS, GCP, and CoreWeave spot capacity as one schedulable pool. Boltforge chases the lowest price per GPU-hour and drains any pool the instant a reclaim warning lands.

Shared research

One cluster, twelve teams

Hierarchical fair-share quotas split a single fleet across rival teams without the weekly Slack war. The launch-critical run still ships; everyone else queues fairly behind it.

Air-gapped

The classified cage

A fully offline deployment with no outbound calls, plus SSO, SAML, audit logs, and custom rack profiles. The whole control plane runs inside your security boundary.

Dev + prod

Notebooks beside training

Low-priority dev and notebook sessions soak up idle cards between big runs, then get checkpointed and bumped within seconds the moment a production job needs the GPUs back.

From the cluster

ML teams stopped buying GPUs they didn't need.

“We were about to sign for another 256 H100s. Boltforge pushed our existing fleet from 60% to 93% utilization in a week and we tore up the order. It paid for itself before the first invoice cleared.”

Wen Li Zhao

Head of ML Infrastructure, Aperture Labs

“Spot reclaims used to mean a lost day and a grumpy researcher. Now a reclaim is a 3-second checkpoint and a requeue. Half my team doesn't even notice it happened anymore.”

Diego Marchetti

Staff Platform Engineer, Northwind AI

“The fair-share quotas ended the Slack wars over who gets the cluster this week. Boltforge enforces the policy, the launch-critical run still ships on time, and I got my evenings back.”

Amara Okonkwo

Director of Research Engineering, Helix Foundry

Pricing

Priced on the GPUs you save, not the ones you own.

Run the open scheduler free on your own cluster. Pay only when you want the managed control plane, hybrid burst, and a team in the loop. No per-seat tax on your researchers.

Open

The self-hosted scheduler, all of it.

$0 / forever

Topology-aware gang scheduling
Preemption + checkpoint-on-evict
Slurm & Kubernetes integration
Up to 64 GPUs
Community Slack

Scale

For teams running production training.

$12 / GPU / mo

Unlimited GPUs
Fleet Console + cost accounting
Burst Bridge to spot & cloud
Fair-share quotas & deadlines
Priority engineering support

Frontier

For labs training at the frontier.

Custom

Multi-thousand-GPU fleets
Air-gapped & on-prem deploy
SSO, SAML & audit logs
Custom topology & rack profiles
Dedicated scheduling engineer

The questions infra leads actually ask.

Do I have to rip out Slurm or Kubernetes?

No. Boltforge runs as a native Kubernetes scheduler or sits alongside Slurm as the policy and preemption layer. Your researchers keep submitting jobs the way they already do — sbatch, kubectl, or the Boltforge CLI — and the scheduler quietly makes better placement decisions underneath.

How does preemption avoid losing training progress?

Every preemptable job is checkpointed to fast storage before its GPUs are reclaimed — a median of 3.2 seconds in our fleets. When the job is requeued it resumes from the last step, not from scratch. You configure how aggressively low-priority work can be bumped; Boltforge handles the snapshot and the restore.

What does topology-aware actually buy me?

Collective operations like all-reduce are only as fast as the slowest link between ranks. Boltforge maps your NVLink islands, PCIe switches, and network rails, then packs each distributed job onto the tightest interconnect available. The result is fewer cross-node hops, less communication overhead, and noticeably faster steps on the exact same hardware.

Can it span on-prem and cloud GPUs at once?

Yes. Burst Bridge presents your owned cluster and spot or on-demand cloud GPUs as one schedulable pool. Jobs run on your own cards first; when the queue backs up, Boltforge overflows to the cheapest cloud capacity that still meets the deadline, then drains it gracefully the moment a reclaim warning fires or your own GPUs free up.

How do you measure 94% utilization?

We sample actual GPU compute occupancy per device, not whether a card is merely allocated. A GPU only counts as utilized when it's doing real work. The 94% figure is the mean across production fleets running mixed training and dev workloads — and the Fleet Console shows you your own number, live, so you never have to take ours on faith.
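The difference between "allocated" and "utilized" is easy to show in a few lines. This sketch assumes you already have per-device occupancy samples between 0 and 1 from your monitoring stack; the function name and input shape are hypothetical.

```python
def fleet_utilization(samples: dict[str, list[float]]) -> float:
    """Mean of per-device mean occupancy. Each sample is a fraction in [0, 1]."""
    per_device = [sum(s) / len(s) for s in samples.values() if s]
    if not per_device:
        return 0.0
    return sum(per_device) / len(per_device)


# Two GPUs, both allocated to jobs. One is busy, one sits idle waiting on data.
samples = {
    "gpu-0": [1.0, 1.0, 0.9],
    "gpu-1": [0.0, 0.0, 0.0],
}
utilization = fleet_utilization(samples)
```

By allocation alone this fleet would report 100%. Measured by occupancy it comes out a little under 50%, which is the number that reflects the work the cards are really doing.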

Will the open tier stay genuinely free?

Yes. The self-hosted scheduler — gang scheduling, preemption, checkpoint-on-evict, and Slurm and Kubernetes integration — is free forever for up to 64 GPUs. You pay for the managed control plane, hybrid burst, and support, never for the core scheduler and never per researcher seat.

Put your idle GPUs back to work tonight.

Install the open scheduler in an afternoon, or have a managed cluster provisioned in days. No sales call required to watch your utilization number climb.

Stop paying for idle GPUs.