Lumen — LLM Inference Serving & Routing

Lumen

Inference infrastructure

Lumen serves and routes every open and frontier model behind one OpenAI-compatible endpoint. Stream the first token before a user blinks, reroute around a failing provider mid-response, and never pay for a GPU that's sitting idle.

~40ms time-to-first-token
One endpoint, 120+ models
No GPU fleet to operate

drop-in — change one base URL

from openai import OpenAI

client = OpenAI(
  base_url="https://api.lumen.ai/v1",
  api_key=LUMEN_KEY,
)

stream = client.chat.completions.create(
  model="lumen/auto",   # router picks the rest
  messages=msgs,
  stream=True,
)
# first token in ~40ms

Inference under the hood for AI products, agent platforms, and dev tools

CortexRelaypointSynthwave AIVoltgridPolaris AgentsBacklineCortexRelaypointSynthwave AIVoltgridPolaris AgentsBackline

The serving layer

One endpoint.Every model.No cold starts.

Standing up vLLM, pinning CUDA versions, and overprovisioning H100s for a spike that may never come is someone else's job now. Lumen runs the fleet; your product just makes an API call.

First token in ~40ms

Speculative decoding, continuous batching, and a warm-pool scheduler hold tail latency flat. Median TTFT stays near 40ms while throughput climbs, so chat feels instant and agent loops stop stalling between steps.

120+ models, one schema

Llama, Mixtral, Qwen, DeepSeek, and frontier APIs all answer the same OpenAI-compatible request. Change the model string; leave the rest of your code untouched.

Streaming, end to end

True server-sent streaming with backpressure. The first word paints while the rest of the response is still being generated downstream.

Zero to thousands, by the second

Capacity tracks real traffic in real time. A request storm on Tuesday scales itself; a dead-quiet night bills you nothing.

KV-cache that remembers

Shared prefixes and long system prompts are cached across calls, so repeat context skips most of the prefill — less latency, lower cost, same output.

What the fleet does in a month

41ms

Median time-to-first-token

120+

Models on one endpoint

99.99%

Inference uptime

8T+

Tokens served monthly

Intelligent routing

A router that picksthe right modelfor every request.

Most prompts don't need your most expensive model. Lumen scores each request, sends it to the cheapest model that clears your quality bar, and fails over the instant a provider starts to degrade.

Quality-aware routing

Per-request scoring sends hard prompts to frontier models and easy ones to fast open models — lower spend, no visible drop in output quality.

Failover mid-stream

Health checks run continuously. When a provider slows or errors, traffic reroutes to a healthy replica during the response — before your user ever sees a spinner.

Cost & latency budgets

Set a price ceiling or a latency SLO on a route and the router honors it on every call, choosing the model that fits the constraint.

Shadow & canary traffic

Mirror live requests to a candidate model in the dark, compare outputs and latency side by side, then shift traffic once the numbers earn it.

Under the hood

The path of a single request

Every call crosses five hops before the first token returns. We tuned each one until the whole round trip disappeared.

<2ms

01 — Edge ingress

Your request lands at the nearest region, gets authed, and is tagged with your route's cost and latency budget.

router

02 — Difficulty scoring

A lightweight scorer reads the prompt and ranks the candidate models that can clear your quality bar.

no cold start

03 — Warm-pool dispatch

The request is handed to a model replica that's already loaded and batching — there is no spin-up to wait on.

~40ms TTFT

04 — Speculative decode

A draft model proposes tokens the target model verifies in parallel, so the first token streams almost immediately.

live failover

05 — Stream + watch

Tokens stream back with backpressure while health checks stand ready to reroute the tail if a replica falters.

Engineers on Lumen

What teams say after the migration.

“We deleted our entire vLLM cluster and our p99 went down, not up. The router quietly cut our model spend 38% in the first month — nobody flagged a single quality regression.”

Priya Nadkarni

Staff Engineer, Cortex

“A provider had a bad night and we slept through it. Failover rerouted every request mid-stream and our error rate never moved off zero.”

Tomas Eklund

Head of Platform, Relaypoint

“Time-to-first-token is the number our users actually feel. Lumen got ours under 50ms across six models, and our agent loops finally read as real-time instead of laggy.”

Wei Chen

Founding Engineer, Polaris Agents

Pricing

Pay per token, not per idle GPU.

Usage-based, no minimums, no reserved-capacity bill waiting at month end. Bring your own provider keys or buy capacity from us — it's one invoice either way.

Developer

For prototypes and side projects.

$0/mo

$5 free inference credit
All 120+ models
OpenAI-compatible API
Token streaming + auto-router
Community support

Scale

For products in production.

$0.18/M tokens

Volume token pricing
Quality-aware routing + failover
KV-cache reuse
Latency & cost budgets
99.99% uptime SLA
Priority support

Enterprise

For dedicated, high-volume workloads.

Custom

Reserved GPU capacity
Private model deployments
VPC peering + zero retention
Custom routing policies
SSO + audit logs
Named solutions engineer

Questions, answered.

How is Lumen faster than self-hosting the same model?

We run a warm GPU fleet with speculative decoding, continuous batching, and prefix caching tuned per model. Every call hits a replica that's already loaded, so there are no cold starts and median time-to-first-token sits around 40ms.

Is the API really drop-in compatible?

Yes. Point your existing OpenAI client at api.lumen.ai/v1, change the model string, and you're done. Chat, completions, embeddings, tool calls, and streaming all match the schema you already write against.

How does the router decide which model to use?

Each request is scored for difficulty and matched against the cost or latency budget on its route. Easy prompts go to fast open models, hard ones to frontier models, and you can pin a specific model any time you want to override the decision.

What happens when a model provider goes down?

Continuous health checks catch degradation in real time and reroute traffic to a healthy replica or alternate provider mid-stream. Failover is automatic and, in almost every case, invisible to your users.

Do you train on or retain our prompts?

No. Prompts and completions are never used for training. On Scale and Enterprise you can turn on zero retention, which drops payloads from our systems the moment the response is delivered.

Stream your first token in minutes.

Create a key, change one base URL, and call any model. No sales call stands between you and your first request.

From promptto first tokenin 40 milliseconds.