Boltforge is a scheduler built for one job: keeping every accelerator in your cluster pinned at full tilt. It packs training runs onto fragmented GPUs, preempts low-priority work in milliseconds, and checkpoints anything it bumps — so a spot reclaim costs you a few minutes, not a few days. The cards you already pay for finally earn their power bill.
$ bolt submit train.yaml --gpus 64xH100 --priority high
→ queued job llama-70b-sft (64 GPUs requested)
→ topology: packed onto 8 nodes, NVLink-local, 0 stragglers
→ preempted 2 dev jobs · checkpointed in 3.2s · requeued
NODE GPUS UTIL JOB
gpu-07 8/8 99% llama-70b-sft
gpu-08 8/8 98% llama-70b-sft
gpu-11 8/8 97% llama-70b-sft
→ scheduled in 412ms · fleet utilization now 94.1%
→ est. cost saved this hour vs. static pools: $1,840
$ _Drops into the stack your training team already runs
Most teams buy more GPUs to fix a scheduling problem. Boltforge finds the capacity you already own — the half-used nodes, the gaps between jobs, the cards idling overnight — and fills them before it lets you spend another dollar.
Boltforge knows which GPUs share an NVLink island, a PCIe switch, and a network rail. It packs each job onto the tightest interconnect free, so an all-reduce never crawls across a slow hop. It bin-packs the whole fleet — not just grabs the first idle card it finds.
When a high-priority run lands, Boltforge checkpoints and evicts lower-priority jobs in milliseconds, hands over the GPUs, and requeues the bumped work the instant capacity frees. No human paging at midnight, no waiting for a job to politely finish.
Every preemptable job is snapshotted to fast storage before its GPUs are taken. A spot reclaim or a priority bump resumes from the last step, not from zero — so an interruption costs minutes, never a full restart.
Boltforge blends your reserved capacity with spot and burst cloud GPUs, routes each job to the cheapest pool that still hits its deadline, and drains spot gracefully the moment a reclaim warning fires.
Hierarchical quotas stop one team from starving the rest, while deadline-aware queuing still gets the launch-blocking run out the door on time. You set the priorities; the scheduler does the math, every cycle.
What changes when the scheduler does its job
Long training jobs fail in boring, expensive ways: a node dies, a spot pool gets reclaimed, an interconnect degrades. Boltforge assumes every one of those will happen and keeps the run alive anyway.
A distributed job gets all of its GPUs at once or none of them — no half-allocated run deadlocked waiting on the last node. Boltforge reserves the whole gang atomically or holds it in queue until the slot exists.
When a node falls over mid-training, Boltforge catches the failure, pulls a healthy replacement from the pool, restores the latest checkpoint, and rejoins the collective — usually before your on-call phone has finished buzzing.
One slow GPU drags an entire data-parallel job down to its pace. Boltforge watches per-rank step times, flags the laggard, and reschedules around degraded hardware before it taxes the whole run.
Every run carries a live GPU-hour and dollar meter, attributed to a team and a budget. Finance sees exactly which experiments burned the quarter — no more reverse-engineering a cloud invoice in a spreadsheet.
Boltforge runs the same way whether you own four nodes or four thousand, on metal in your own cage or burst across three clouds. Pick the shape that matches your fleet.
A reserved cluster of H100s in your own racks. Boltforge maps the NVLink and rail topology once, then keeps every node packed and self-healing — no cloud egress, no surprise bill.
Run on your reserved cards until the queue backs up, then overflow to spot and on-demand cloud through Burst Bridge. Jobs land on the cheapest pool that still makes the deadline.
Treat AWS, GCP, and CoreWeave spot capacity as one schedulable pool. Boltforge chases the lowest price per GPU-hour and drains any pool the instant a reclaim warning lands.
Hierarchical fair-share quotas split a single fleet across rival teams without the weekly Slack war. The launch-critical run still ships; everyone else queues fairly behind it.
A fully offline deployment with SSO, SAML, and audit logs, no outbound calls, and custom rack profiles. The whole control plane runs inside your security boundary.
Low-priority dev and notebook sessions soak up idle cards between big runs, then get checkpointed and bumped in milliseconds the moment a production job needs the GPUs back.
“We were about to sign for another 256 H100s. Boltforge pushed our existing fleet from 60% to 93% utilization in a week and we tore up the order. It paid for itself before the first invoice cleared.”
“Spot reclaims used to mean a lost day and a grumpy researcher. Now a reclaim is a 3-second checkpoint and a requeue. Half my team doesn't even notice it happened anymore.”
“The fair-share quotas ended the Slack wars over who gets the cluster this week. Boltforge enforces the policy, the launch-critical run still ships on time, and I got my evenings back.”
Run the open scheduler free on your own cluster. Pay only when you want the managed control plane, hybrid burst, and a team in the loop. No per-seat tax on your researchers.
The self-hosted scheduler, all of it.
For teams running production training.
For labs training at the frontier.
No. Boltforge runs as a native Kubernetes scheduler or sits alongside Slurm as the policy and preemption layer. Your researchers keep submitting jobs the way they already do — sbatch, kubectl, or the Boltforge CLI — and the scheduler quietly makes better placement decisions underneath.
Every preemptable job is checkpointed to fast storage before its GPUs are reclaimed — a median of 3.2 seconds in our fleets. When the job is requeued it resumes from the last step, not from scratch. You configure how aggressively low-priority work can be bumped; Boltforge handles the snapshot and the restore.
Collective operations like all-reduce are only as fast as the slowest link between ranks. Boltforge maps your NVLink islands, PCIe switches, and network rails, then packs each distributed job onto the tightest interconnect available. The result is fewer cross-node hops, less communication overhead, and noticeably faster steps on the exact same hardware.
Yes. Burst Bridge presents your owned cluster and spot or on-demand cloud GPUs as one schedulable pool. Jobs run on your own cards first; when the queue backs up, Boltforge overflows to the cheapest cloud capacity that still meets the deadline, then drains it gracefully the moment a reclaim warning fires or your own GPUs free up.
We sample actual GPU compute occupancy per device, not whether a card is merely allocated. A GPU only counts as utilized when it's doing real work. The 94% figure is the mean across production fleets running mixed training and dev workloads — and the Fleet Console shows you your own number, live, so you never have to take ours on faith.
Yes. The self-hosted scheduler — gang scheduling, preemption, checkpoint-on-evict, and Slurm and Kubernetes integration — is free forever for up to 64 GPUs. You pay for the managed control plane, hybrid burst, and support, never for the core scheduler and never per researcher seat.
Install the open scheduler in an afternoon, or have a managed cluster provisioned in days. No sales call required to watch your utilization number climb.