Now in private beta

Know which change actually improved your agent

You shipped a new prompt, swapped the model, and updated your tools in one release. The eval score went up. But which change helped? Isolate runs controlled experiments to find out.

isolate
$ isolate run --agent support-v2

The problem

Eval scores tell you that something changed,
not why

Every release bundles multiple changes together. A new prompt, a model swap, updated tools, better retrieval. When the score moves, nobody can say which change caused it.

3-5
changes per release

Prompt, model, tools, retrieval, routing — teams ship them all at once.

1
score delta

Your dashboard shows one number. It can't distinguish what helped from what hurt.

2ⁿ
possible combinations

With n components changed, there are 2ⁿ ways they can combine. Testing every combination by hand becomes impossible. You need systematic ablation.

How it works

One command.
Complete attribution.

01

Point it at your agent

Isolate reads your agent config — prompt, model, tools, retrieval pipeline. It snapshots the current state as a baseline.

isolate init --config agent.yaml
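
For illustration, a hypothetical agent.yaml that a baseline snapshot might describe. The field names are assumptions for this sketch, not Isolate's published schema:

agent: support-v2
prompt: prompts/system_v2.md
model: gpt-4o-mini
tools:
  - search_v3
retrieval:
  reranker: rag_reranker
eval: evals/accuracy.yaml
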
02

We run the experiments

Isolate generates variants automatically by swapping out individual components, then runs each variant against your eval suite in parallel. No manual test matrix.

isolate run --variants auto --eval accuracy
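
Conceptually, this is one-factor-at-a-time ablation: take the bundled release, revert one component at a time, and score each variant on the same eval suite. A minimal sketch of that idea in Python, with made-up component names; it is not Isolate's implementation:

# One-factor-at-a-time ablation sketch (illustrative, not Isolate's internals).
release = {                  # the bundled release under test
    "prompt": "prompt_v2",
    "model": "gpt-4o-mini",
    "search_tool": "search_v3",
    "reranker": "rag_reranker",
}
previous = {                 # the pre-release baseline
    "prompt": "prompt_v1",
    "model": "gpt-4o",
    "search_tool": "search_v2",
    "reranker": "none",
}

def ablation_variants(release, previous):
    """Yield configs that revert exactly one component to its old value."""
    for component, old_value in previous.items():
        variant = dict(release)
        variant[component] = old_value
        yield component, variant

for component, variant in ablation_variants(release, previous):
    print(f"variant without the new {component}")
    # each variant is then run against the same eval suite, in parallel
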
03

You get a clear answer

A report shows exactly which changes improved performance, which degraded it, and which had no effect. With statistical significance, not vibes.

isolate report --format table
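
"Statistical significance" here means comparing per-example eval scores for a variant against the baseline, not eyeballing two averages. A minimal sketch of that idea using a paired t-test; the scores are invented and the choice of test is an assumption, not necessarily what Isolate runs:

# Paired significance check on per-example eval scores (invented numbers).
from scipy.stats import ttest_rel

baseline_scores = [0.71, 0.64, 0.80, 0.58, 0.77, 0.69]  # same eval cases, old prompt
variant_scores  = [0.79, 0.70, 0.83, 0.66, 0.81, 0.75]  # same eval cases, prompt_v2

delta = sum(variant_scores) / len(variant_scores) - sum(baseline_scores) / len(baseline_scores)
result = ttest_rel(variant_scores, baseline_scores)     # paired: same cases, two configs
print(f"mean delta {delta:+.3f}, p = {result.pvalue:.3f}")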

Before & after

From “it went up” to knowing exactly why

Before
Typical eval dashboard
agent: support-v2
78.4% (+12.1%)
changes in this release:
• updated system prompt
• switched to gpt-4o-mini
• added search tool v3
• new rag reranker
Which change caused the +12.1%? Unknown.
After
Isolate attribution report
agent: support-v2
78.4% (+12.1%)
component attribution:
prompt_v2         +18.3%   p<0.01
rag_reranker       +9.7%   p<0.01
tool_search_v3     +1.2%   p=0.34
gpt-4o-mini        -4.1%   p<0.05
Keep prompt_v2 + rag_reranker. Revert gpt-4o-mini.

Use cases

For any team shipping agents to production

Customer support agents

You upgraded the model and rewrote the prompt. Resolution rate is up 15%. Was it the prompt or the model? Isolate tells you the prompt did the heavy lifting — the model swap actually hurt edge cases.

Coding assistants

New retrieval pipeline, updated instructions, context window change. Pass rate jumped. Isolate shows retrieval was the only change that mattered.

RAG pipelines

Chunk size, embedding model, reranker — all changed at once. Answer quality improved, but which component? Isolate ablates each one independently.

Multi-agent systems

When orchestrators delegate to sub-agents, a change in one agent can mask regressions in another. Isolate tests each agent in isolation.

Get early access

We're onboarding design partners for our private beta. Join the waitlist or book a call to discuss your eval workflow.

Join the waitlist
or
Book a demo call