Promptfoundry is the version-controlled home for every prompt your product runs. Edit a system prompt, run it against your golden set, and watch the regressions surface in seconds — instead of in a Monday-morning support ticket. No green eval, no merge.
$ pf eval run support-triage --against v18
suite: support-triage 240 cases 4 graders
- You are a support agent. Be helpful and concise.
+ You are a support agent. If unsure, say so and
+ escalate. Never invent an order ID or refund policy.
accuracy 91.2% +3.4 ████████████░
tone match 96.0% +1.1 █████████████
hallucination 0.8% -0.6 ▏ (under 1.5% gate)
cost / 1k $1.84 -22%
✓ 238 pass ⚠ 2 regressions → review before merge
gate: PASS - eligible to promote to production
The prompt layer behind AI products, copilots, and agent teams
The model underneath you changes without warning, and your prompt is the one thing you actually control. Promptfoundry gives that prompt source control, a review process, and a hard number for whether a change made things better or worse.
Variables, partials, and reusable system blocks with type-checked inputs and a live token count. Diff every revision side by side, leave inline comments, and roll back to any version. Review it the way you would a pull request, because that is exactly what a prompt change is.
Bundle your hardest cases into a suite, attach graders, and block any prompt that drops below your quality bar. No green run, no ship.
Score with exact-match, JSON-schema, regex, or an LLM judge against a rubric you write — so “is this answer actually good” turns into a number you can trend over time.
Run the same prompt across GPT, Claude, Gemini, and open models in a single sweep and compare quality, latency, and cost before you commit a dollar to any provider.
Every run is diffed against the last green one. The cases that broke float to the top with the exact output that changed, so you fix the regression instead of hunting for it.
The same loop every change runs through — author, prove, ship, and watch — so quality is something you can repeat instead of something you got lucky on.
Compose the prompt with typed variables and shared system blocks. Every save is a version with an author, a diff, and a comment thread.
Fire the suite against your golden set. Graders score each case and the run is compared to the last green one before anything is allowed to merge.
Merge the version that passed and promote it to production. The scores that cleared the gate travel with it, so every release has a receipt.
Live traffic gets traced and scored against the same baselines. When quality slips, the failing calls drop straight back into your test set.
What teams actually measure on Promptfoundry
A prompt that aced your evals can still rot the moment live users hit it. Promptfoundry follows every prompt past the merge — capturing real calls, flagging the bad ones, and feeding them straight back into the suite that guards the next release.
Capture every prompt, completion, tool call, latency, and token cost in production with a one-line SDK wrap. Search and replay any call exactly as it ran.
When a provider silently swaps a model version and your output quality slips, Promptfoundry catches the shift against your baselines and pages you before your users file a ticket.
Turn a thumbs-down or a flagged production call into a permanent test case in one click, so today's failure becomes the regression guard that blocks it tomorrow.
Split live traffic between two prompt versions, watch win-rate and cost settle in real time, and promote the winner without shipping new code.
“We used to change a system prompt, ship it, and learn it was broken from angry users. Now nothing merges without a green eval. Our hallucination rate is down two-thirds and I actually sleep the night before a launch.”
“The model-comparison view paid for the whole platform in a week. We found a smaller model that matched our quality at a third of the cost, and the evals proved it before we touched a line of production.”
“Capture-to-eval is the feature I didn't know I was missing. Every production flag becomes a test case, so the same bug can't ship twice. Our regression suite basically writes itself now.”
Start free on a real project. Scale by seats and eval runs, never by a hidden markup on tokens. No annual contract to sign first.
For solo builders and prototypes.
For teams shipping AI to production.
For regulated, high-volume AI teams.
No. You can author and evaluate prompts entirely in the platform and keep calling the model yourself. If you want live tracing and drift alerts, wrap your existing client with our one-line SDK. We observe the call asynchronously, so we never sit in the critical path or add latency for your users.
Any provider with an API — OpenAI, Anthropic, Google, Mistral, and self-hosted open models behind an OpenAI-compatible endpoint. The same prompt and the same eval suite run across all of them, so switching or comparing models is a dropdown, not a rewrite.
Each case is scored by one or more graders: exact-match and JSON-schema for structured output, regex for format checks, and an LLM judge with a rubric you write for tone, faithfulness, or helpfulness. Scores roll up per suite, so quality becomes a trend line instead of a vibe.
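As a rough illustration of how deterministic graders can turn outputs into numbers, here is a small sketch. The function names and the 0-to-1 scoring convention are assumptions made for this example, not the platform's API, and `json_has_keys` is a simplified stand-in for a full JSON-schema check. An LLM judge is left out because it would need a live model call.

```python
import json
import re


def exact_match(output, expected):
    """Score 1.0 when the output equals the expected text, ignoring outer whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def regex_match(output, pattern):
    """Score 1.0 when the pattern is found anywhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0


def json_has_keys(output, required_keys):
    """Simplified structure check: output must be a JSON object with every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if all(key in data for key in required_keys) else 0.0


def suite_score(case_scores):
    """Roll per-case scores up into one suite-level number (a plain mean)."""
    return sum(case_scores) / len(case_scores) if case_scores else 0.0
```

A plain mean is only one possible roll-up. Weighting critical cases more heavily is an equally reasonable choice.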
Every eval run is compared to the last one that passed. Any case that scored well before and scores worse now is flagged as a regression and surfaced with the exact output that changed. You review the diff and decide — the prompt can't merge until you do.
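The comparison described above can be sketched in a few lines. This is an illustrative approximation under stated assumptions: per-case scores are floats keyed by a case ID, and `find_regressions` and `tolerance` are hypothetical names, not part of the product.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return (case_id, old_score, new_score) for cases that scored worse than
    in the last passing run, with the largest drops first."""
    regressions = []
    for case_id, old_score in baseline.items():
        new_score = current.get(case_id)
        # Cases missing from the current run are skipped rather than flagged.
        if new_score is not None and new_score < old_score - tolerance:
            regressions.append((case_id, old_score, new_score))
    # new - old is most negative for the biggest drop, so ascending order puts it first.
    regressions.sort(key=lambda r: r[2] - r[1])
    return regressions


def gate_passes(regressions, reviewed_case_ids):
    """The merge stays blocked until every flagged case has been reviewed."""
    return all(case_id in reviewed_case_ids for case_id, _, _ in regressions)
```

For example, a baseline of `{"a": 1.0, "b": 0.5}` against a current run of `{"a": 0.0, "b": 0.75}` flags only case `a`, since `b` improved.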
Prompts and traced data are never used to train any model, ours or a provider's. Team plans isolate data per workspace, and Enterprise adds self-hosted or VPC deployment with zero retention, SSO, and full audit logging for your compliance reviews.
Import a prompt, build an eval suite, and catch your first regression this afternoon. Free to start — no sales call to get past first.