Promptfoundry — LLM prompt engineering & evaluation platform

Promptfoundry

Prompt source control · evals · production tracing

Promptfoundry is the version-controlled home for every prompt your product runs. Edit a system prompt, run it against your golden set, and watch the regressions surface in seconds — instead of in a Monday-morning support ticket. No green eval, no merge.

Versioned prompts with diffs
Eval-gated releases
Bring any model, no lock-in

promptfoundry — support-triage · v18 → v19

$ pf eval run support-triage --against v18

  suite: support-triage   240 cases   4 graders

- You are a support agent. Be helpful and concise.
+ You are a support agent. If unsure, say so and
+ escalate. Never invent an order ID or refund policy.

  accuracy        91.2%   +3.4   ████████████░
  tone match      96.0%   +1.1   █████████████
  hallucination    0.8%   -0.6   ▏ (under 1.5% gate)
  cost / 1k        $1.84   -22%

  ✓ 238 pass   ⚠ 2 regressions   → review before merge
  gate: PASS — eligible to promote to production

The prompt layer behind AI products, copilots, and agent teams

Helmline · Quartz Copilot · Driftwork AI · Northpeak · Talia Health · Relayframe

The editor

An IDE for prompts, wired to a test suite.

The model underneath you changes without warning, and your prompt is the one thing you actually control. Promptfoundry gives that prompt source control, a review process, and a hard number for whether a change made things better or worse.

A real prompt editor

Variables, partials, and reusable system blocks with type-checked inputs and a live token count. Diff every revision side by side, leave inline comments, and roll back to any version the way you would a pull request — because that is exactly what a prompt change is.

Eval suites that gate releases

Bundle your hardest cases into a suite, attach graders, and block any prompt that drops below your quality bar. No green run, no ship.

Graders for fuzzy output

Score with exact-match, JSON-schema, regex, or an LLM judge against a rubric you write — so “is this answer actually good” turns into a number you can trend over time.

Every model, one pass

Run the same prompt across GPT, Claude, Gemini, and open models in a single sweep and compare quality, latency, and cost before you commit a dollar to any provider.

Regression radar

Every run is diffed against the last green one. The cases that broke float to the top with the exact output that changed, so you fix the regression instead of hunting for it.

How it works

Four steps from a guess to a guarantee.

The same loop every change runs through — author, prove, ship, and watch — so quality is something you can repeat instead of something you got lucky on.

01 · Author

Write it like code

Compose the prompt with typed variables and shared system blocks. Every save is a version with an author, a diff, and a comment thread.
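As a rough illustration of typed variables and shared system blocks, here is a minimal sketch in plain Python. The `PromptTemplate` class, its fields, and the `$name` placeholder syntax are assumptions made for this example, not Promptfoundry's actual format.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass(frozen=True)
class PromptTemplate:
    """Hypothetical prompt template with typed variables and shared partials."""
    body: str
    variables: dict[str, type]                      # variable name -> expected type
    partials: dict[str, str] = field(default_factory=dict)  # reusable system blocks

    def render(self, **values: object) -> str:
        # Reject missing or unexpected variables before rendering.
        missing = set(self.variables) - set(values)
        extra = set(values) - set(self.variables)
        if missing or extra:
            raise TypeError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
        # Check each value against its declared type.
        for name, expected in self.variables.items():
            if not isinstance(values[name], expected):
                raise TypeError(f"{name} must be {expected.__name__}")
        rendered = {name: str(value) for name, value in values.items()}
        # Partials and variables share one namespace; variables win on a clash.
        return Template(self.body).substitute({**self.partials, **rendered})


SAFETY_BLOCK = "Never invent an order ID or refund policy."

triage = PromptTemplate(
    body="You are a support agent. $safety\nTier: $tier. Open tickets: $open_tickets.",
    variables={"tier": str, "open_tickets": int},
    partials={"safety": SAFETY_BLOCK},
)
prompt_text = triage.render(tier="gold", open_tickets=3)
```

Passing `open_tickets="3"` raises a `TypeError` here, which is the kind of mistake a type-checked input is meant to catch before a prompt reaches a model.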

02 · Prove

Run the eval gate

Fire the suite against your golden set. Graders score each case and the run is compared to the last green one before anything is allowed to merge.
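The gate itself can be pictured as a set of thresholds checked against the suite's aggregate scores. The sketch below is an assumption about how such a check might look, not the platform's implementation. The 1.5% hallucination limit comes from the sample run at the top of this page, while the 90% accuracy floor and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical quality bar for one metric."""
    metric: str
    limit: float
    higher_is_better: bool = True


def gate(scores: dict[str, float], thresholds: list[Threshold]) -> tuple[bool, list[str]]:
    """Return (passed, failure messages) for a suite's aggregate scores."""
    failures: list[str] = []
    for t in thresholds:
        value = scores.get(t.metric)
        if value is None:
            failures.append(f"{t.metric}: no score reported")
            continue
        # Higher-is-better metrics must reach the limit;
        # lower-is-better metrics must stay strictly under it.
        ok = value >= t.limit if t.higher_is_better else value < t.limit
        if not ok:
            failures.append(f"{t.metric}: {value} against limit {t.limit}")
    return (not failures, failures)


thresholds = [
    Threshold("accuracy", 90.0),                                # hypothetical floor
    Threshold("hallucination", 1.5, higher_is_better=False),    # "under 1.5% gate"
]
passed, failures = gate({"accuracy": 91.2, "hallucination": 0.8}, thresholds)
```

With the scores from the sample run, `passed` is `True` and `failures` is empty; raise the hallucination score to 2.0 and the gate fails with one message.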

03 · Ship

Promote with a record

Merge the version that passed and promote it to production. The scores that cleared the gate travel with it, so every release has a receipt.

04 · Watch

Catch the drift

Live traffic gets traced and scored against the same baselines. When quality slips, the failing calls drop straight back into your test set.

What teams actually measure on Promptfoundry

240+

Cases in the median eval suite

8 sec

Typical full-suite run

31%

Avg. token cost cut after tuning

More prompt changes shipped per week

From the editor to production

Ship the prompt, then watch it under real traffic.

A prompt that aced your evals can still rot the moment live users hit it. Promptfoundry follows every prompt past the merge — capturing real calls, flagging the bad ones, and feeding them straight back into the suite that guards the next release.

Live tracing

Capture every prompt, completion, tool call, latency, and token cost in production with a one-line SDK wrap. Search and replay any call exactly as it ran.

Drift alerts

When a provider silently swaps a model version and your output quality slips, Promptfoundry catches the shift against your baselines and pages you before your users file a ticket.
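One simple way to picture a drift check is a rolling average of live quality scores compared against the baseline recorded at release. The sketch below assumes that approach. `DriftMonitor`, the window size, and the allowed drop are illustrative choices rather than documented behavior.

```python
from collections import deque


class DriftMonitor:
    """Hypothetical rolling-window check of live scores against a baseline."""

    def __init__(self, baseline: float, max_drop: float, window: int = 50) -> None:
        self.baseline = baseline
        self.max_drop = max_drop
        self._scores: deque[float] = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one scored call; return True when an alert should fire."""
        self._scores.append(score)
        if len(self._scores) < self._scores.maxlen:
            return False  # not enough traffic yet to judge
        rolling_mean = sum(self._scores) / len(self._scores)
        return rolling_mean < self.baseline - self.max_drop


monitor = DriftMonitor(baseline=0.9, max_drop=0.05, window=4)
alerts = [monitor.observe(s) for s in [0.9, 0.9, 0.9, 0.9, 0.5, 0.5, 0.5]]
```

A single bad call does not trip the alert on its own unless it pulls the window average below the allowed drop, which keeps noisy individual scores from paging anyone.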

Capture-to-eval

Turn a thumbs-down or a flagged production call into a permanent test case in one click, so today's failure becomes the regression guard that blocks it tomorrow.

Online A/B routing

Split live traffic between two prompt versions, watch win-rate and cost settle in real time, and promote the winner without shipping new code.
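A common way to split traffic without shipping new code is deterministic hash bucketing, so the same user always sees the same version while the share is adjusted from configuration. This is a sketch of that general technique, not a description of how Promptfoundry routes. The function name and parameters are hypothetical.

```python
import hashlib


def pick_version(user_id: str, experiment: str, share_b: float) -> str:
    """Assign a user to version "A" or "B" with a stable hash bucket.

    share_b is the fraction of traffic (0.0 to 1.0) sent to version B.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "B" if bucket < share_b * 10_000 else "A"


# The same user and experiment always land on the same version.
first = pick_version("user-42", "support-triage-v18-vs-v19", 0.5)
second = pick_version("user-42", "support-triage-v18-vs-v19", 0.5)
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user placed in B for one test is not automatically in B for the next.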

Teams on Promptfoundry

Prompt engineering stopped being guesswork.

“We used to change a system prompt, ship it, and learn it was broken from angry users. Now nothing merges without a green eval. Our hallucination rate is down two-thirds and I actually sleep the night before a launch.”

Renata Vasquez

Lead AI Engineer, Helmline

“The model-comparison view paid for the whole platform in a week. We found a smaller model that matched our quality at a third of the cost, and the evals proved it before we touched a line of production.”

Daniel Boon

Head of ML, Quartz Copilot

“Capture-to-eval is the feature I didn't know I was missing. Every production flag becomes a test case, so the same bug can't ship twice. Our regression suite basically writes itself now.”

Aisha Mensah

Founding Engineer, Driftwork AI

Pricing

Priced for the team writing the prompts.

Start free on a real project. Scale by seats and eval runs, never by a hidden markup on tokens. No annual contract to sign first.

Free

For solo builders and prototypes.

$0/mo

1 project, 2 editors
Prompt versioning + diffs
1,000 eval runs / mo
Exact-match & schema graders
Community support

Team

For teams shipping AI to production.

$149/mo

Unlimited projects, 10 editors
LLM-judge graders + golden sets
50,000 eval runs / mo
Production tracing + drift alerts
Eval-gated approvals
Priority support

Enterprise

For regulated, high-volume AI teams.

Custom

Unlimited editors + eval runs
Self-hosted or VPC deployment
Zero prompt retention
SSO, SAML & audit logs
Custom graders & SLAs
Named solutions engineer

Questions, answered straight.

Do I have to route my model calls through Promptfoundry?

No. You can author and evaluate prompts entirely in the platform and keep calling the model yourself. If you want live tracing and drift alerts, wrap your existing client with our one-line SDK. We observe the call asynchronously, so we never sit in the critical path or add latency for your users.
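To make "off the critical path" concrete, here is a minimal sketch of an asynchronous tracing wrapper: the model call returns as soon as the completion is ready, and the trace record is handed to a background thread. The `Tracer` class, its `wrap` and `flush` methods, and the record fields are assumptions for illustration, not the real SDK.

```python
import queue
import threading
import time
from typing import Any, Callable


class Tracer:
    """Hypothetical tracer that ships records from a background thread."""

    def __init__(self, sink: Callable[[dict[str, Any]], None], max_pending: int = 1000) -> None:
        self._sink = sink
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self) -> None:
        while True:
            record = self._queue.get()
            try:
                self._sink(record)
            except Exception:
                pass  # a failing sink must never affect the caller
            finally:
                self._queue.task_done()

    def wrap(self, call: Callable[[str], str]) -> Callable[[str], str]:
        def traced(prompt: str) -> str:
            start = time.perf_counter()
            completion = call(prompt)
            record = {
                "prompt": prompt,
                "completion": completion,
                "latency_s": time.perf_counter() - start,
            }
            try:
                self._queue.put_nowait(record)
            except queue.Full:
                pass  # drop the trace rather than block the request
            return completion
        return traced

    def flush(self) -> None:
        """Block until queued records are delivered (useful in tests)."""
        self._queue.join()


captured: list[dict[str, Any]] = []
tracer = Tracer(sink=captured.append)
fake_model = tracer.wrap(lambda prompt: prompt.upper())  # stand-in for a real client
reply = fake_model("hello")
tracer.flush()
```

The design choice worth noting is the bounded queue with `put_nowait`: under backpressure the sketch loses traces instead of slowing user requests.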

Which models and providers do you support?

Any provider with an API — OpenAI, Anthropic, Google, Mistral, and self-hosted open models behind an OpenAI-compatible endpoint. The same prompt and the same eval suite run across all of them, so switching or comparing models is a dropdown, not a rewrite.

How do evals grade open-ended answers?

Each case is scored by one or more graders: exact-match and JSON-schema for structured output, regex for format checks, and an LLM judge with a rubric you write for tone, faithfulness, or helpfulness. Scores roll up per suite, so quality becomes a trend line instead of a vibe.
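The deterministic graders are easy to picture as small functions that map an output to a score. The sketch below is illustrative only: the function names are hypothetical, the JSON check is a simplified required-keys test rather than full JSON Schema validation, and the LLM judge is left out because it needs a live model call.

```python
import json
import re
from statistics import mean
from typing import Callable

Grader = Callable[[str], float]  # takes a model output, returns a score in [0, 1]


def exact_match(expected: str) -> Grader:
    return lambda output: 1.0 if output.strip() == expected.strip() else 0.0


def regex_match(pattern: str) -> Grader:
    compiled = re.compile(pattern)
    return lambda output: 1.0 if compiled.search(output) else 0.0


def json_has_keys(required: dict[str, type]) -> Grader:
    """Simplified stand-in for a schema grader: required keys with expected types."""
    def grade(output: str) -> float:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(data, dict):
            return 0.0
        ok = all(isinstance(data.get(key), kind) for key, kind in required.items())
        return 1.0 if ok else 0.0
    return grade


def score_case(output: str, graders: list[Grader]) -> float:
    """Average the grader scores for one case."""
    return mean(grader(output) for grader in graders)


graders = [
    json_has_keys({"intent": str, "priority": int}),
    regex_match(r'"intent":\s*"(refund|billing|shipping)"'),
]
case_score = score_case('{"intent": "refund", "priority": 2}', graders)
```

Averaging is only one way to roll up scores; a suite could just as reasonably require every grader to pass.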

What exactly counts as a regression?

Every eval run is compared to the last one that passed. Any case that scored well before and scores worse now is flagged as a regression and surfaced with the exact output that changed. You review the diff and decide — the prompt can't merge until you do.
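Under that definition, finding regressions amounts to a per-case comparison between the last passing run and the current one. The sketch below assumes scores in the range 0 to 1 and treats 0.5 as the cutoff for "scored well". Both the cutoff and the function name are assumptions for this example.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    passing: float = 0.5,
) -> list[tuple[str, float, float]]:
    """Return (case_id, old_score, new_score) for cases that got worse.

    A case counts only if it scored at least `passing` in the baseline run
    and scores lower in the current run.
    """
    regressions = []
    for case_id, old in baseline.items():
        new = current.get(case_id)
        if new is None:
            continue  # case removed from the suite; nothing to compare
        if old >= passing and new < old:
            regressions.append((case_id, old, new))
    # Largest drop first, so the worst breakages surface at the top.
    regressions.sort(key=lambda r: r[2] - r[1])
    return regressions


last_green = {"refund-01": 1.0, "refund-02": 1.0, "tone-07": 0.9, "edge-13": 0.2}
this_run = {"refund-01": 1.0, "refund-02": 0.0, "tone-07": 0.6, "edge-13": 0.1}
regressions = find_regressions(last_green, this_run)
```

Here `edge-13` is not flagged even though its score fell, because it was already failing in the last green run.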

Are my prompts and customer data safe?

Prompts and traced data are never used to train any model, ours or a provider's. Team plans isolate data per workspace, and Enterprise adds self-hosted or VPC deployment with zero retention, SSO, and full audit logging for your compliance reviews.

Make your next prompt change a measured one.

Import a prompt, build an eval suite, and catch your first regression this afternoon. Free to start — no sales call to get past first.

Stop shipping prompts on a hunch.