Lumen
Inference infrastructure

Lumen serves and routes every open and frontier model behind one OpenAI-compatible endpoint. Stream the first token before a user blinks, reroute around a failing provider mid-response, and never pay for a GPU that's sitting idle.

  • ~40ms time-to-first-token
  • One endpoint, 120+ models
  • No GPU fleet to operate
drop-in — change one base URL
from openai import OpenAI

client = OpenAI(
  base_url="https://api.lumen.ai/v1",
  api_key=LUMEN_KEY,
)

stream = client.chat.completions.create(
  model="lumen/auto",   # router picks the rest
  messages=msgs,
  stream=True,
)
# first token in ~40ms

Inference under the hood for AI products, agent platforms, and dev tools

CortexRelaypointSynthwave AIVoltgridPolaris AgentsBacklineCortexRelaypointSynthwave AIVoltgridPolaris AgentsBackline
The serving layer

One endpoint.Every model.No cold starts.

Standing up vLLM, pinning CUDA versions, and overprovisioning H100s for a spike that may never come is someone else's job now. Lumen runs the fleet; your product just makes an API call.

First token in ~40ms

Speculative decoding, continuous batching, and a warm-pool scheduler hold tail latency flat. Median TTFT stays near 40ms while throughput climbs, so chat feels instant and agent loops stop stalling between steps.

120+ models, one schema

Llama, Mixtral, Qwen, DeepSeek, and frontier APIs all answer the same OpenAI-compatible request. Change the model string; leave the rest of your code untouched.

Streaming, end to end

True server-sent streaming with backpressure. The first word paints while the rest of the response is still being generated downstream.

Zero to thousands, by the second

Capacity tracks real traffic in real time. A request storm on Tuesday scales itself; a dead-quiet night bills you nothing.

KV-cache that remembers

Shared prefixes and long system prompts are cached across calls, so repeat context skips most of the prefill — less latency, lower cost, same output.

What the fleet does in a month

41ms
Median time-to-first-token
120+
Models on one endpoint
99.99%
Inference uptime
8T+
Tokens served monthly
Intelligent routing

A router that picksthe right modelfor every request.

Most prompts don't need your most expensive model. Lumen scores each request, sends it to the cheapest model that clears your quality bar, and fails over the instant a provider starts to degrade.

Quality-aware routing

Per-request scoring sends hard prompts to frontier models and easy ones to fast open models — lower spend, no visible drop in output quality.

Failover mid-stream

Health checks run continuously. When a provider slows or errors, traffic reroutes to a healthy replica during the response — before your user ever sees a spinner.

Cost & latency budgets

Set a price ceiling or a latency SLO on a route and the router honors it on every call, choosing the model that fits the constraint.

Shadow & canary traffic

Mirror live requests to a candidate model in the dark, compare outputs and latency side by side, then shift traffic once the numbers earn it.

Under the hood

The path of a single request

Every call crosses five hops before the first token returns. We tuned each one until the whole round trip disappeared.

<2ms

01 — Edge ingress

Your request lands at the nearest region, gets authed, and is tagged with your route's cost and latency budget.

router

02 — Difficulty scoring

A lightweight scorer reads the prompt and ranks the candidate models that can clear your quality bar.

no cold start

03 — Warm-pool dispatch

The request is handed to a model replica that's already loaded and batching — there is no spin-up to wait on.

~40ms TTFT

04 — Speculative decode

A draft model proposes tokens the target model verifies in parallel, so the first token streams almost immediately.

live failover

05 — Stream + watch

Tokens stream back with backpressure while health checks stand ready to reroute the tail if a replica falters.

Engineers on Lumen

What teams say after the migration.

We deleted our entire vLLM cluster and our p99 went down, not up. The router quietly cut our model spend 38% in the first month — nobody flagged a single quality regression.

P
Priya Nadkarni
Staff Engineer, Cortex

A provider had a bad night and we slept through it. Failover rerouted every request mid-stream and our error rate never moved off zero.

T
Tomas Eklund
Head of Platform, Relaypoint

Time-to-first-token is the number our users actually feel. Lumen got ours under 50ms across six models, and our agent loops finally read as real-time instead of laggy.

W
Wei Chen
Founding Engineer, Polaris Agents
Pricing

Pay per token, not per idle GPU.

Usage-based, no minimums, no reserved-capacity bill waiting at month end. Bring your own provider keys or buy capacity from us — it's one invoice either way.

Developer

For prototypes and side projects.

$0/mo
  • $5 free inference credit
  • All 120+ models
  • OpenAI-compatible API
  • Token streaming + auto-router
  • Community support
Most popular

Scale

For products in production.

$0.18/M tokens
  • Volume token pricing
  • Quality-aware routing + failover
  • KV-cache reuse
  • Latency & cost budgets
  • 99.99% uptime SLA
  • Priority support

Enterprise

For dedicated, high-volume workloads.

Custom
  • Reserved GPU capacity
  • Private model deployments
  • VPC peering + zero retention
  • Custom routing policies
  • SSO + audit logs
  • Named solutions engineer

Questions, answered.

How is Lumen faster than self-hosting the same model?

We run a warm GPU fleet with speculative decoding, continuous batching, and prefix caching tuned per model. Every call hits a replica that's already loaded, so there are no cold starts and median time-to-first-token sits around 40ms.

Is the API really drop-in compatible?

Yes. Point your existing OpenAI client at api.lumen.ai/v1, change the model string, and you're done. Chat, completions, embeddings, tool calls, and streaming all match the schema you already write against.

How does the router decide which model to use?

Each request is scored for difficulty and matched against the cost or latency budget on its route. Easy prompts go to fast open models, hard ones to frontier models, and you can pin a specific model any time you want to override the decision.

What happens when a model provider goes down?

Continuous health checks catch degradation in real time and reroute traffic to a healthy replica or alternate provider mid-stream. Failover is automatic and, in almost every case, invisible to your users.

Do you train on or retain our prompts?

No. Prompts and completions are never used for training. On Scale and Enterprise you can turn on zero retention, which drops payloads from our systems the moment the response is delivered.

Stream your first token in minutes.

Create a key, change one base URL, and call any model. No sales call stands between you and your first request.