Embergrad — Distributed model fine-tuning & training orchestration

Distributed training, orchestrated

Embergrad is the orchestration layer for distributed fine-tuning. Describe a run once, and we shard it across a thousand GPUs in any cloud, survive every spot preemption, and hand you a converged checkpoint — without a single line of cluster glue code.

Powering training runs for frontier labs, model startups, and applied AI teams

Foundry Labs · Northlight AI · Tensorworks · Helio Research · Parallax · Brightforge

The orchestrator

You write the run. We run the cluster.

Distributed training is mostly plumbing — sharding, checkpointing, NCCL tuning, restarting jobs that died at hour nine. Embergrad owns all of it, so your team ships models instead of babysitting a cluster.

Parallelism, decided for you

Point Embergrad at a model and a GPU budget, and the planner picks the right mix of FSDP, tensor, pipeline, and sequence parallelism — sharding optimizer state and gradients to fit memory and keep the interconnect saturated. No more hand-tuning a topology for every model size.

Preemption is a non-event

Run on spot and interruptible capacity without fear. When a node gets reclaimed, the scheduler reshards the survivors, restores the last checkpoint, and resumes the step — usually before you'd notice in the loss curve.

Checkpoints that don't stall the GPUs

Asynchronous, sharded checkpointing streams state to object storage in the background every ninety seconds. Saving a 70B model no longer means a five-minute pause where every accelerator sits idle.
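To make the idea concrete, here is a minimal sketch of background checkpointing using only the Python standard library. It is not Embergrad's implementation: `AsyncCheckpointer`, `slow_upload`, and the in-memory `stored` dict are hypothetical stand-ins, and a real system would write sharded tensors to object storage rather than copy a small dict. The point it illustrates is that training blocks only for the snapshot, while the slow write happens on another thread.

```python
import copy
import threading
import time


class AsyncCheckpointer:
    """Toy background checkpointer: snapshot quickly, write slowly off-thread."""

    def __init__(self, write_fn):
        self._write_fn = write_fn  # slow writer, e.g. an object-storage upload
        self._thread = None

    def save(self, step, state):
        # Finish the previous write first so checkpoints land in order.
        self.wait()
        # The copy is the only part that blocks the training loop.
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(
            target=self._write_fn, args=(step, snapshot), daemon=True
        )
        self._thread.start()

    def wait(self):
        # Block until the in-flight write (if any) has completed.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


stored = {}  # hypothetical stand-in for an object-storage bucket


def slow_upload(step, snapshot):
    time.sleep(0.05)  # pretend the upload is slow
    stored[step] = snapshot


ckpt = AsyncCheckpointer(slow_upload)
state = {"weights": [0.0, 0.0]}
for step in range(1, 4):
    state["weights"] = [w + 1.0 for w in state["weights"]]  # "training" step
    ckpt.save(step, state)  # returns as soon as the snapshot is taken
ckpt.wait()  # flush the last write before exiting
```

Because each snapshot is a deep copy, later training steps cannot mutate a checkpoint that is still being written.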

Elastic by the step

Add capacity mid-run to finish before a deadline, or release it overnight when prices spike. Embergrad rebalances the data-parallel groups live — no restart, no reconfigure.

Every run, reproducible

Each launch pins the dataset version, base weights, hyperparameters, and container digest. Re-run a result from three weeks ago and get the same model, byte for byte.
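One way to picture a pinned launch is as a manifest with a stable fingerprint. The sketch below is an illustration under assumptions, not Embergrad's manifest format: the field names, the example values, and the `run_fingerprint` helper are all hypothetical. It hashes a canonical JSON encoding, so the same four pins always yield the same ID and any change to them yields a different one.

```python
import hashlib
import json


def run_fingerprint(manifest: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the manifest."""
    # sort_keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: every value here is a placeholder.
manifest = {
    "dataset_version": "support-tickets@v12",
    "base_weights": "example-org/base-70b@rev-abc123",
    "hyperparameters": {"lr": 2e-5, "global_batch_size": 512, "seed": 1234},
    "container_digest": "sha256:<digest-of-training-image>",
}

fingerprint = run_fingerprint(manifest)
print(fingerprint)
```

A fingerprint like this only identifies the inputs. Getting identical weights out also depends on deterministic kernels and a fixed topology, which this sketch does not address.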

How a run goes

From a config file to a converged checkpoint.

Four steps, no cluster engineering in between. You describe the run; the orchestrator handles every layer underneath it.

01 · Define

Declare the run

Hand Embergrad your training loop and a budget — '70B, 768 H100s, under $5K.' No topology, no launch scripts, no NCCL flags.

02 · Compile

Plan the topology

The planner profiles layer shapes and memory, then compiles a parallelism strategy that fits VRAM and saturates the interconnect.

03 · Schedule

Place on the cheapest fleet

The broker prices capacity across clouds and regions, books the cheapest pool that meets your bandwidth needs, and starts the run.

04 · Run

Converge through chaos

Spot reclaims trigger a live reshard, checkpoints stream every 90s, and a cost meter ticks against your ceiling until the loss hits target.

The numbers behind the fleet

57%

Median model FLOPs utilization

3,000+

GPUs schedulable in one run

90s

Checkpoint interval, async

99.95%

Run completion rate

Cost & visibility

See every dollar the run is burning, while it burns.

Training is the most expensive thing your team does. Embergrad puts a meter on it: live MFU, cost-to-convergence, and the cheapest capacity in any region, so a run never quietly torches a budget.

Multi-cloud capacity broker

Embergrad continuously prices H100 and A100 capacity across clouds and regions, then places your run on the cheapest pool that meets its interconnect needs — failing over to another provider when one runs dry.
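The placement rule described above can be sketched in a few lines. Everything here is assumed for illustration: the `Pool` fields, the provider names, and the prices are made up, and a real broker would also weigh data egress, quotas, and reclaim rates. The sketch filters pools by accelerator, free capacity, and interconnect bandwidth, then takes the cheapest survivor.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Pool:
    provider: str
    region: str
    gpu: str
    price_per_gpu_hour: float  # USD, illustrative only
    interconnect_gbps: int
    available_gpus: int


def pick_pool(pools, gpu, gpus_needed, min_interconnect_gbps) -> Optional[Pool]:
    """Cheapest pool that has the right GPU, enough of them, and enough bandwidth."""
    eligible = [
        p for p in pools
        if p.gpu == gpu
        and p.available_gpus >= gpus_needed
        and p.interconnect_gbps >= min_interconnect_gbps
    ]
    return min(eligible, key=lambda p: p.price_per_gpu_hour, default=None)


pools = [
    Pool("cloud-a", "us-east", "H100", 2.40, 3200, 512),
    Pool("cloud-b", "eu-west", "H100", 1.95, 400, 1024),   # cheap, slow interconnect
    Pool("cloud-c", "us-west", "H100", 2.10, 3200, 256),
    Pool("cloud-d", "us-central", "A100", 1.20, 1600, 2048),
]

first_choice = pick_pool(pools, "H100", 256, 1600)

# Failover: if the chosen provider runs dry, the same rule picks the next pool.
dry = [replace(p, available_gpus=0) if p.provider == "cloud-c" else p for p in pools]
fallback = pick_pool(dry, "H100", 256, 1600)
```

In this made-up fleet the cheapest H100 pool is skipped because its interconnect is too slow, which is the trade-off the broker exists to handle.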

Cost-to-convergence, live

A running estimate of total spend and time-to-target updates every step from real throughput, so you kill a doomed run early instead of reading the bill on Friday.
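As a rough illustration of the arithmetic behind such an estimate, the sketch below projects remaining hours and total spend from measured throughput. It assumes the target is expressed as a token count, which is a simplification: a real estimator would also model how fast the loss is falling. The function name and all the numbers are hypothetical.

```python
def estimate_run(tokens_done, tokens_target, tokens_per_second,
                 num_gpus, price_per_gpu_hour, spent_so_far):
    """Project (hours_left, projected_total_cost) from current throughput."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive to project a finish time")
    remaining = max(tokens_target - tokens_done, 0)
    hours_left = remaining / tokens_per_second / 3600
    projected_total = spent_so_far + hours_left * num_gpus * price_per_gpu_hour
    return hours_left, projected_total


# Illustrative numbers only: 64 GPUs at $2.50 per GPU-hour, $400 spent so far.
hours_left, projected_total = estimate_run(
    tokens_done=500_000_000,
    tokens_target=2_000_000_000,
    tokens_per_second=250_000,
    num_gpus=64,
    price_per_gpu_hour=2.50,
    spent_so_far=400.0,
)
print(f"{hours_left:.2f} h left, ${projected_total:.2f} projected total")
```

With these inputs the projection comes to about 1.7 hours and roughly $667 in total. Recomputing it every step from real throughput is what lets the number react when a run slows down.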

Loss & throughput in one pane

Stream loss, gradient norm, MFU, and per-rank utilization into one dashboard or straight to your Weights & Biases project. Spot a stalled rank or a diverging loss the moment it happens.

Hard budget ceilings

Set a dollar cap or a wall-clock SLO per run. Embergrad honors it — pausing, checkpointing, and alerting before the run blows through the limit.
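A dollar cap of this kind could be enforced with a check like the one sketched below. This is an assumption about the shape of the logic, not Embergrad's policy engine: `next_action`, the 90% alert threshold, and the idea of reserving enough budget to reach the next checkpoint are all illustrative choices.

```python
def next_action(spent, burn_rate_per_hour, cap,
                checkpoint_interval_s=90, alert_fraction=0.9):
    """Return "pause", "alert", or "continue" for a run with a dollar cap."""
    # Budget needed to keep running until the next checkpoint lands.
    reserve = burn_rate_per_hour * checkpoint_interval_s / 3600
    if spent + reserve >= cap:
        return "pause"      # stop before the cap is crossed, with state saved
    if spent >= alert_fraction * cap:
        return "alert"      # still running, but close to the ceiling
    return "continue"


# Illustrative run burning $160 per hour against a $5,000 cap.
print(next_action(spent=100.0, burn_rate_per_hour=160.0, cap=5000.0))
print(next_action(spent=4600.0, burn_rate_per_hour=160.0, cap=5000.0))
print(next_action(spent=4997.0, burn_rate_per_hour=160.0, cap=5000.0))
```

Reserving one checkpoint interval of spend means the pause happens with a fresh checkpoint on disk, so the run can resume later without losing work.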

Teams training on Embergrad

Smaller teams, bigger runs.

“We were burning a full engineer on babysitting jobs that died on spot capacity. Embergrad made preemption invisible — our 70B fine-tune finished on interruptible GPUs at a third of on-demand cost, untouched overnight.”

Mara Devlin

Head of Training, Foundry Labs

“The planner picked a better parallelism config than the one I'd hand-tuned for a month. MFU went from 41% to 57% on the same hardware, which is real money at our scale.”

Sasha Pohl

Staff ML Engineer, Tensorworks

“Cost-to-convergence is the feature I didn't know I needed. We caught a diverging run at hour two and saved eleven thousand dollars in compute before lunch.”

Idris Bello

Founding Engineer, Northlight AI

Pricing

Pay for GPU-hours, not for the orchestrator.

Bring your own cloud accounts or buy capacity through us. Either way you pay for compute plus a flat orchestration fee — no per-seat tax, no reserved-capacity minimums.

Builder

For experiments and small fine-tunes.

$0/mo

Up to 8 GPUs per run
Bring your own cloud
Automatic FSDP & checkpointing
Live loss & cost dashboard
Community support

Scale

For teams training in production.

8% of GPU spend

Up to 512 GPUs per run
Multi-cloud capacity broker
Spot preemption recovery
Elastic scaling mid-run
Budget ceilings & SLOs
Priority support

Frontier

For thousand-GPU runs and labs.

Custom

3,000+ GPUs per run
Reserved & private capacity
VPC peering + zero data retention
Custom parallelism policies
SSO + audit logs
Named training engineer

Questions, answered.

Do I have to rewrite my training code?

No. Wrap your existing PyTorch or Hugging Face training loop with the Embergrad SDK, or hand us a config and a container. We inject the distributed strategy, checkpointing, and elastic restart hooks — your model and optimizer code stay exactly as they are.

How does Embergrad survive spot preemption?

The scheduler checkpoints sharded state to object storage every ninety seconds. When a node is reclaimed, it reshards the surviving ranks, restores the latest checkpoint, and resumes the step. Most reclaims cost seconds, not a restart, and never corrupt the run.
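The recovery sequence can be sketched as a small planning function. This is a toy model under stated assumptions: shards are plain integers, ranks are reassigned round-robin, and `assign_shards` and `recover` are hypothetical names, not part of any Embergrad API.

```python
def assign_shards(num_shards, ranks):
    """Spread shard IDs across the given ranks round-robin."""
    if not ranks:
        raise ValueError("no surviving ranks to assign shards to")
    plan = {rank: [] for rank in ranks}
    for shard in range(num_shards):
        plan[ranks[shard % len(ranks)]].append(shard)
    return plan


def recover(num_shards, ranks, reclaimed, last_checkpoint_step):
    """Drop reclaimed ranks, reshard over survivors, resume from the last checkpoint."""
    survivors = [rank for rank in ranks if rank not in reclaimed]
    return {
        "resume_from_step": last_checkpoint_step,
        "plan": assign_shards(num_shards, survivors),
    }


# Four ranks hold eight shards; rank 2 is reclaimed after the step-1200 checkpoint.
before = assign_shards(8, [0, 1, 2, 3])
after = recover(8, [0, 1, 2, 3], reclaimed={2}, last_checkpoint_step=1200)
```

Every shard still has an owner after the reclaim, and the work lost is bounded by the time since the last checkpoint.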

Which clouds and accelerators do you support?

AWS, GCP, Azure, and major neoclouds, on H100, A100, and L40S. You can connect your own accounts so compute is billed directly to you, or buy capacity through our broker and get one invoice across providers.

How do you choose the parallelism strategy?

The planner profiles your model's layer shapes, memory footprint, and your GPU and interconnect budget, then searches combinations of FSDP, tensor, pipeline, and sequence parallelism for the highest throughput that fits in memory. You can pin or override any axis when you want manual control.
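A toy version of that search is sketched below. It is not Embergrad's planner and its cost model is deliberately crude: it assumes about 16 bytes of model state per parameter (a common rule of thumb for mixed-precision Adam), assumes that state is sharded across all GPUs, assumes activation memory shrinks with tensor and pipeline degree, and uses the smallest model-parallel degree as a stand-in for highest throughput. The `plan` function, its parameters, and the `pin` argument are hypothetical.

```python
from itertools import product

BYTES_PER_PARAM = 16  # assumed: fp16 weights and grads plus fp32 optimizer state


def plan(params_billion, activation_gb, num_gpus, gpu_mem_gb, max_tp=8, pin=None):
    """Pick (dp, tp, pp) that fits in memory with the least model parallelism."""
    pin = pin or {}
    state_gb = BYTES_PER_PARAM * params_billion  # billions of params * bytes = GB
    best = None
    for tp, pp in product(range(1, max_tp + 1), range(1, num_gpus + 1)):
        if num_gpus % (tp * pp):
            continue  # degrees must multiply to the GPU count
        cand = {"dp": num_gpus // (tp * pp), "tp": tp, "pp": pp}
        if any(cand[axis] != value for axis, value in pin.items()):
            continue  # respect manually pinned axes
        per_gpu_gb = state_gb / num_gpus + activation_gb / (tp * pp)
        if per_gpu_gb > gpu_mem_gb:
            continue  # does not fit in memory
        key = (tp * pp, pp)  # proxy score: less model parallelism, then fewer stages
        if best is None or key < best[0]:
            best = (key, cand)
    return None if best is None else best[1]


auto = plan(params_billion=70, activation_gb=200, num_gpus=64, gpu_mem_gb=80)
pinned = plan(70, 200, 64, 80, pin={"pp": 2})
too_small = plan(70, 200, 8, 16)
```

The `pin` argument mirrors the manual override: fixing one axis narrows the search while the remaining axes are still chosen automatically.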

Will the orchestrator slow my training down?

The opposite is the goal. Checkpointing is asynchronous, restarts reuse warm capacity, and the planner optimizes for model FLOPs utilization — most teams see MFU rise after they switch, because hand-tuned topologies rarely keep the interconnect this busy.

Are my data and weights private?

Always. Embergrad orchestrates training inside your cloud or an isolated tenant — we never train on your data, and Frontier plans support VPC peering with zero retention, so datasets and checkpoints never persist on our infrastructure.

Your next model is one run away.

Connect a cloud account, point us at a base model, and watch the loss curve fall. No cluster to build, no sales call to start.

Light the run. Watch it converge.