Lumen serves and routes every open and frontier model behind one OpenAI-compatible endpoint. Stream the first token before a user blinks, reroute around a failing provider mid-response, and never pay for a GPU that's sitting idle.
from openai import OpenAI
client = OpenAI(
base_url="https://api.lumen.ai/v1",
api_key=LUMEN_KEY,
)
stream = client.chat.completions.create(
model="lumen/auto", # router picks the rest
messages=msgs,
stream=True,
)
# first token in ~40msInference under the hood for AI products, agent platforms, and dev tools
Standing up vLLM, pinning CUDA versions, and overprovisioning H100s for a spike that may never come is someone else's job now. Lumen runs the fleet; your product just makes an API call.
Speculative decoding, continuous batching, and a warm-pool scheduler hold tail latency flat. Median TTFT stays near 40ms while throughput climbs, so chat feels instant and agent loops stop stalling between steps.
Llama, Mixtral, Qwen, DeepSeek, and frontier APIs all answer the same OpenAI-compatible request. Change the model string; leave the rest of your code untouched.
True server-sent streaming with backpressure. The first word paints while the rest of the response is still being generated downstream.
Capacity tracks real traffic in real time. A request storm on Tuesday scales itself; a dead-quiet night bills you nothing.
Shared prefixes and long system prompts are cached across calls, so repeat context skips most of the prefill — less latency, lower cost, same output.
What the fleet does in a month
Most prompts don't need your most expensive model. Lumen scores each request, sends it to the cheapest model that clears your quality bar, and fails over the instant a provider starts to degrade.
Per-request scoring sends hard prompts to frontier models and easy ones to fast open models — lower spend, no visible drop in output quality.
Health checks run continuously. When a provider slows or errors, traffic reroutes to a healthy replica during the response — before your user ever sees a spinner.
Set a price ceiling or a latency SLO on a route and the router honors it on every call, choosing the model that fits the constraint.
Mirror live requests to a candidate model in the dark, compare outputs and latency side by side, then shift traffic once the numbers earn it.
Every call crosses five hops before the first token returns. We tuned each one until the whole round trip disappeared.
Your request lands at the nearest region, gets authed, and is tagged with your route's cost and latency budget.
A lightweight scorer reads the prompt and ranks the candidate models that can clear your quality bar.
The request is handed to a model replica that's already loaded and batching — there is no spin-up to wait on.
A draft model proposes tokens the target model verifies in parallel, so the first token streams almost immediately.
Tokens stream back with backpressure while health checks stand ready to reroute the tail if a replica falters.
“We deleted our entire vLLM cluster and our p99 went down, not up. The router quietly cut our model spend 38% in the first month — nobody flagged a single quality regression.”
“A provider had a bad night and we slept through it. Failover rerouted every request mid-stream and our error rate never moved off zero.”
“Time-to-first-token is the number our users actually feel. Lumen got ours under 50ms across six models, and our agent loops finally read as real-time instead of laggy.”
Usage-based, no minimums, no reserved-capacity bill waiting at month end. Bring your own provider keys or buy capacity from us — it's one invoice either way.
For prototypes and side projects.
For products in production.
For dedicated, high-volume workloads.
We run a warm GPU fleet with speculative decoding, continuous batching, and prefix caching tuned per model. Every call hits a replica that's already loaded, so there are no cold starts and median time-to-first-token sits around 40ms.
Yes. Point your existing OpenAI client at api.lumen.ai/v1, change the model string, and you're done. Chat, completions, embeddings, tool calls, and streaming all match the schema you already write against.
Each request is scored for difficulty and matched against the cost or latency budget on its route. Easy prompts go to fast open models, hard ones to frontier models, and you can pin a specific model any time you want to override the decision.
Continuous health checks catch degradation in real time and reroute traffic to a healthy replica or alternate provider mid-stream. Failover is automatic and, in almost every case, invisible to your users.
No. Prompts and completions are never used for training. On Scale and Enterprise you can turn on zero retention, which drops payloads from our systems the moment the response is delivered.
Create a key, change one base URL, and call any model. No sales call stands between you and your first request.