Embergrad is the orchestration layer for distributed fine-tuning. Describe a run once, and we shard it across a thousand GPUs in any cloud, survive every spot preemption, and hand you a converged checkpoint — without a single line of cluster glue code.
Powering training runs for frontier labs, model startups, and applied AI teams
Distributed training is mostly plumbing — sharding, checkpointing, NCCL tuning, restarting jobs that died at hour nine. Embergrad owns all of it, so your team ships models instead of babysitting a cluster.
Point Embergrad at a model and a GPU budget, and the planner picks the right mix of FSDP, tensor, pipeline, and sequence parallelism — sharding optimizer state and gradients to fit memory and keep the interconnect saturated. No more hand-tuning a topology for every model size.
Run on spot and interruptible capacity without fear. When a node gets reclaimed, the scheduler reshards the survivors, restores the last checkpoint, and resumes the step, usually before the interruption shows up in the loss curve.
Asynchronous, sharded checkpointing streams state to object storage in the background every ninety seconds. Saving a 70B model no longer means a five-minute pause where every accelerator sits idle.
Add capacity mid-run to finish before a deadline, or release it overnight when prices spike. Embergrad rebalances the data-parallel groups live, with no restart and no reconfiguration.
Each launch pins the dataset version, base weights, hyperparameters, and container digest. Re-run a result from three weeks ago and get the same model, byte for byte.
Four steps, no cluster engineering in between. You describe the run; the orchestrator handles every layer underneath it.
Hand Embergrad your training loop and a budget — '70B, 768 H100s, under $5K.' No topology, no launch scripts, no NCCL flags.
The planner profiles layer shapes and memory, then compiles a parallelism strategy that fits VRAM and saturates the interconnect.
The broker prices capacity across clouds and regions, books the cheapest pool that meets your bandwidth needs, and starts the run.
Spot reclaims reshard live, checkpoints stream every 90s, and a cost meter ticks against your ceiling until the loss hits target.
The numbers behind the fleet
Training is the most expensive thing your team does. Embergrad puts a meter on it: live MFU, cost-to-convergence, and the cheapest capacity in any region, so a run never quietly torches a budget.
Embergrad continuously prices H100 and A100 capacity across clouds and regions, then places your run on the cheapest pool that meets its interconnect needs — failing over to another provider when one runs dry.
A running estimate of total spend and time-to-target updates every step from real throughput, so you can kill a doomed run early instead of reading the bill on Friday.
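To make the idea concrete, here is a minimal sketch of how a running estimate like this could be computed. It is an illustration, not Embergrad's implementation: the names `project_run` and `exceeds_ceiling`, their parameters, and the assumption that the target is a fixed step count rather than a loss value are all hypothetical.

```python
def project_run(steps_done, target_steps, elapsed_s, gpus, usd_per_gpu_hour):
    """Extrapolate time-to-target and total spend from measured throughput.

    Hypothetical helper: assumes the target is a fixed step count and that
    throughput and price stay at their observed values.
    """
    if steps_done <= 0 or elapsed_s <= 0:
        raise ValueError("need at least one completed step to extrapolate")
    steps_per_s = steps_done / elapsed_s              # observed throughput
    remaining_s = max(target_steps - steps_done, 0) / steps_per_s
    total_hours = (elapsed_s + remaining_s) / 3600    # time spent + time still needed
    total_usd = total_hours * gpus * usd_per_gpu_hour
    return remaining_s, total_usd


def exceeds_ceiling(total_usd, ceiling_usd):
    """True when the projected total spend is above the run's dollar ceiling."""
    return total_usd > ceiling_usd


# Example: one hour in, 1,000 of 4,000 steps done, 8 GPUs at $2 per GPU-hour.
remaining_s, total_usd = project_run(1000, 4000, 3600, 8, 2.0)
```

In this example the run projects to about three more hours and roughly $64 in total, so a $50 ceiling would already be flagged after the first hour.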
Stream loss, gradient norm, MFU, and per-rank utilization into one dashboard or straight to your Weights & Biases project. Spot a stalled rank or a diverging loss the moment it happens.
Set a dollar cap or a wall-clock SLO per run. Embergrad honors it: it checkpoints, pauses, and alerts you before the run blows through the limit.
“We were burning a full engineer on babysitting jobs that died on spot capacity. Embergrad made preemption invisible — our 70B fine-tune finished on interruptible GPUs at a third of on-demand cost, untouched overnight.”
“The planner picked a better parallelism config than the one I'd hand-tuned for a month. MFU went from 41% to 57% on the same hardware, which is real money at our scale.”
“Cost-to-convergence is the feature I didn't know I needed. We caught a diverging run at hour two and saved eleven thousand dollars in compute before lunch.”
Bring your own cloud accounts or buy capacity through us. Either way you pay for compute plus a flat orchestration fee — no per-seat tax, no reserved-capacity minimums.
For experiments and small fine-tunes.
For teams training in production.
For thousand-GPU runs and labs.
No. Wrap your existing PyTorch or Hugging Face training loop with the Embergrad SDK, or hand us a config and a container. We inject the distributed strategy, checkpointing, and elastic restart hooks — your model and optimizer code stay exactly as they are.
The scheduler checkpoints sharded state to object storage every ninety seconds. When a node is reclaimed, it reshards the surviving ranks, restores the latest checkpoint, and resumes the step. Most reclaims cost seconds instead of a full restart, and they never corrupt the run.
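The recovery flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the scheduler's real code: `reshard` and `recover` are hypothetical names, shards are modeled as plain integer indices, and checkpoints are modeled as a list of completed step numbers.

```python
def reshard(num_shards, ranks):
    """Split shard indices 0..num_shards-1 into near-equal contiguous ranges,
    one range per surviving rank (hypothetical layout policy)."""
    base, extra = divmod(num_shards, len(ranks))
    plan, start = {}, 0
    for i, rank in enumerate(ranks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranks
        plan[rank] = range(start, start + size)
        start += size
    return plan


def recover(ranks, reclaimed, checkpoint_steps, num_shards):
    """Return (resume_step, shard_plan) after some ranks are reclaimed."""
    survivors = [r for r in ranks if r not in reclaimed]
    if not survivors:
        raise RuntimeError("no surviving ranks; the run must wait for capacity")
    if not checkpoint_steps:
        raise RuntimeError("no completed checkpoint to restore from")
    resume_step = max(checkpoint_steps)        # latest completed checkpoint
    return resume_step, reshard(num_shards, survivors)


# Example: ranks 3 and 5 of 8 are reclaimed; 12 shards are spread over the 6 survivors.
step, plan = recover(list(range(8)), {3, 5}, [100, 200, 300], 12)
```

Here the run resumes from step 300 and each surviving rank takes two shards. A real system would also have to move the shard data and rebuild communication groups, which this sketch leaves out.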
AWS, GCP, Azure, and major neoclouds, on H100, A100, and L40S. You can connect your own accounts so compute is billed directly to you, or buy capacity through our broker and get one invoice across providers.
The planner profiles your model's layer shapes and memory footprint, takes your GPU and interconnect budget into account, then searches combinations of FSDP, tensor, pipeline, and sequence parallelism for the highest throughput that fits in memory. You can pin or override any axis when you want manual control.
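For intuition, a search of this kind can be sketched as a brute-force loop over candidate layouts. Everything here is a toy under loud assumptions and not the actual planner: the `Plan` type, the `search` function, and the efficiency penalties (5% per doubling of tensor parallelism, 8% for FSDP) are made up for illustration, sequence parallelism and activation memory are ignored, and the only grounded pieces are the common 16-bytes-per-parameter estimate for mixed-precision Adam and the standard pipeline bubble term.

```python
from dataclasses import dataclass
from itertools import product

BYTES_PER_PARAM = 16  # mixed-precision Adam: fp16 weights + grads, fp32 optimizer state


@dataclass(frozen=True)
class Plan:
    dp: int      # data-parallel replicas
    tp: int      # tensor-parallel degree
    pp: int      # pipeline stages
    fsdp: bool   # shard model state across the data-parallel group too


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def state_gib_per_gpu(params, plan):
    """Model-state memory per GPU (weights, grads, optimizer), ignoring activations."""
    shards = plan.tp * plan.pp * (plan.dp if plan.fsdp else 1)
    return params * BYTES_PER_PARAM / shards / 2**30


def toy_efficiency(plan, microbatches=32):
    """Hypothetical throughput score in (0, 1]; higher is better."""
    pipeline = microbatches / (microbatches + plan.pp - 1)  # pipeline bubble term
    tensor = 0.95 ** (plan.tp.bit_length() - 1)             # assumed 5% loss per doubling
    sharding = 0.92 if plan.fsdp else 1.0                   # assumed FSDP comms cost
    return pipeline * tensor * sharding


def search(params, gpus, gib_per_gpu, max_tp=8):
    """Return the best-scoring plan that fits in memory, or None if nothing fits."""
    best = None
    for tp, pp in product(divisors(gpus), repeat=2):
        if tp > max_tp or gpus % (tp * pp):
            continue
        dp = gpus // (tp * pp)
        for fsdp in (False, True):
            plan = Plan(dp, tp, pp, fsdp)
            if state_gib_per_gpu(params, plan) > gib_per_gpu:
                continue
            score = toy_efficiency(plan)
            if best is None or score > best[0]:
                best = (score, plan)
    return best[1] if best else None


plan = search(params=70e9, gpus=768, gib_per_gpu=80)
```

Pinning an axis, as mentioned above, would correspond to filtering the candidate set (for example, keeping only plans with a fixed `tp`) before scoring.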
The opposite is the goal. Checkpointing is asynchronous, restarts reuse warm capacity, and the planner optimizes for model FLOPs utilization — most teams see MFU rise after they switch, because hand-tuned topologies rarely keep the interconnect this busy.
Always. Embergrad orchestrates training inside your cloud or an isolated tenant — we never train on your data, and Frontier plans support VPC peering with zero retention, so datasets and checkpoints never persist on our infrastructure.
Connect a cloud account, point us at a base model, and watch the loss curve fall. No cluster to build, no sales call to start.