Benchmarks
A benchmark is a set of evaluation scenarios replayed automatically against a pipeline. It is the tool for objectively measuring agent quality and catching regressions before they reach production.
Why
An agent that "seemed to work" can silently regress after:
- a mission change;
- an LLM change;
- a new tool being added.
Without a benchmark, you discover this in production. With a benchmark, the editor warns you before you publish.
Anatomy of a scenario
A scenario contains:
- A stimulus: the input the pipeline receives (message, payload, simulated phone call).
- An expected sequence: the conversation turns, the tools called, and the outputs produced.
- Checks: explicit assertions such as "the agent must call crm.search_customer", "the response must contain the word 'refund'", or "the pipeline must complete in fewer than 3 turns".
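To make the three parts concrete, here is a minimal sketch of how a scenario could be represented in Python. The class names (`Scenario`, `Check`), field names, and parameter keys are illustrative assumptions, not the product's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Check:
    """One explicit assertion about a run (hypothetical shape)."""
    kind: str                                   # e.g. "tool_called"
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Scenario:
    """A stimulus, the expected sequence, and the checks to assert."""
    name: str
    stimulus: dict[str, Any]                    # message, payload, simulated call
    expected_sequence: list[dict[str, Any]]     # turns, tools, outputs
    checks: list[Check]


# Hypothetical scenario built from the three example assertions above.
refund_scenario = Scenario(
    name="refund-request-nominal",
    stimulus={"type": "message", "text": "I want my money back for order 1042."},
    expected_sequence=[
        {"turn": 1, "tool": "crm.search_customer"},
        {"turn": 2, "output": "refund confirmation"},
    ],
    checks=[
        Check("tool_called", {"tool": "crm.search_customer"}),
        Check("output_contains", {"text": "refund"}),
        # "fewer than 3 turns" is expressed as "at most 2 turns"
        Check("turn_count_max", {"max_turns": 2}),
    ],
)
```

Keeping checks as plain data (a kind plus parameters) rather than as code is one way to keep them readable and easy to diff between versions.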
Check catalogue
| Check | Verifies |
|---|---|
| tool_called | A specific tool was called (optionally with specific args) |
| tool_not_called | A tool was not called (guards against false positives) |
| output_contains | The agent output contains a given text / pattern |
| output_matches_json | The JSON output validates against a given schema |
| turn_count_max | The pipeline completed in ≤ N turns |
| cost_max | The total cost is ≤ N credits |
| latency_max | The duration is ≤ N seconds |
| branch_taken | A condition routed to the correct branch |
Checks are extensible: on the Enterprise plan, you can add custom checks via an operator.
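As an illustration of what these checks verify, the sketch below evaluates several of them against a recorded run. The trace layout, the `CHECKS` registry, and `run_check` are assumptions made for this example and do not describe the platform's internals; `output_matches_json` and `branch_taken` are left out for brevity.

```python
import re

# Hypothetical trace: what one run of the pipeline produced.
trace = {
    "tool_calls": [
        {"tool": "crm.search_customer", "args": {"email": "a@example.com"}},
    ],
    "output": "Your refund has been issued.",
    "turns": 2,
    "cost_credits": 0.8,
    "latency_seconds": 4.2,
}


def tool_called(trace, tool, args=None):
    """True if `tool` was called, optionally with matching argument values."""
    return any(
        call["tool"] == tool
        and (args is None or all(call["args"].get(k) == v for k, v in args.items()))
        for call in trace["tool_calls"]
    )


# Registry mapping a check name to a predicate over the trace.
CHECKS = {
    "tool_called": tool_called,
    "tool_not_called": lambda t, tool: not tool_called(t, tool),
    "output_contains": lambda t, pattern: re.search(pattern, t["output"]) is not None,
    "turn_count_max": lambda t, n: t["turns"] <= n,
    "cost_max": lambda t, n: t["cost_credits"] <= n,
    "latency_max": lambda t, n: t["latency_seconds"] <= n,
}


def run_check(trace, kind, **params):
    """Look up a check by name and evaluate it against the trace."""
    return CHECKS[kind](trace, **params)


print(run_check(trace, "tool_called", tool="crm.search_customer"))  # True
print(run_check(trace, "tool_not_called", tool="billing.refund"))   # True
print(run_check(trace, "latency_max", n=3))                         # False
```

A registry of small predicates keeps each check short, and a custom check is just one more entry in the dictionary.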
Creating a benchmark
- Designer → Benchmarks → New benchmark.
- Choose the target pipeline.
- Add scenarios. You can:
  - Craft them manually (the most precise option).
  - Capture them from history: convert a real execution into a scenario (with anonymisation).
A good benchmark includes a mix of nominal cases, edge cases, and adversarial cases (manipulation attempts, ambiguity, missing data).
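The "capture from history" option can be pictured roughly as follows. This is a sketch under stated assumptions: the execution record layout, the `anonymise` and `scenario_from_execution` helpers, and the two regular expressions are hypothetical, and real anonymisation would need to cover far more than emails and phone numbers.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymise(text: str) -> str:
    """Mask emails first, then phone numbers, with fixed placeholders."""
    text = EMAIL.sub("<email>", text)
    return PHONE.sub("<phone>", text)


def scenario_from_execution(execution: dict) -> dict:
    """Turn a recorded execution (hypothetical shape) into a scenario draft."""
    tools = [step["tool"] for step in execution["steps"]]
    return {
        "stimulus": anonymise(execution["input"]),
        "expected_sequence": [{"tool": tool} for tool in tools],
        # Start with one tool_called check per observed tool call.
        "checks": [{"kind": "tool_called", "tool": tool} for tool in tools],
    }


execution = {
    "input": "Reach me at jane.doe@example.com or +33 6 12 34 56 78 please",
    "steps": [{"tool": "crm.search_customer"}],
}
draft = scenario_from_execution(execution)
print(draft["stimulus"])  # Reach me at <email> or <phone> please
```

A captured draft like this is a starting point: the generated checks usually need to be reviewed and tightened by hand before the scenario joins the benchmark.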
Running a benchmark
From the admin interface:
- Single run: immediate results, with a drawer showing per-scenario detail.
- Scheduled run: for example every night, or each time a new version is published.
The report shows:
- Pass rate per check
- Regression vs the previous version
- List of failed scenarios, with a diff against the expected result
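The first two report figures can be derived from raw check results along these lines. This is a minimal sketch: the `(scenario, check, passed)` tuple format and both function names are assumptions for illustration, not the report's actual data model.

```python
from collections import defaultdict


def pass_rate_per_check(results):
    """results: iterable of (scenario, check_kind, passed) tuples."""
    totals = defaultdict(lambda: [0, 0])  # check_kind -> [passed, total]
    for _scenario, kind, passed in results:
        totals[kind][0] += int(passed)
        totals[kind][1] += 1
    return {kind: p / t for kind, (p, t) in totals.items()}


def regressions(previous, current):
    """(scenario, check) pairs that passed in `previous` and fail in `current`."""
    passed_before = {(s, k) for s, k, ok in previous if ok}
    return sorted(
        (s, k) for s, k, ok in current if not ok and (s, k) in passed_before
    )


# Hypothetical results for two versions of the same pipeline.
previous = [
    ("refund", "tool_called", True),
    ("refund", "output_contains", True),
    ("angry", "tool_called", True),
    ("angry", "turn_count_max", False),
]
current = [
    ("refund", "tool_called", True),
    ("refund", "output_contains", False),
    ("angry", "tool_called", True),
    ("angry", "turn_count_max", False),
]

print(pass_rate_per_check(current))
# {'tool_called': 1.0, 'output_contains': 0.0, 'turn_count_max': 0.0}
print(regressions(previous, current))
# [('refund', 'output_contains')]
```

Note that the `angry` scenario fails `turn_count_max` in both versions, so it counts against the pass rate but is not reported as a regression.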
Best practices
- Start small — 5 well-targeted scenarios are worth more than 50 approximate ones.
- Benchmark before every publish. The editor can block publication if the benchmark regresses.
- Version your missions. Not just pipelines: missions evolve too.
- Keep checks readable. A check that requires 20 lines of explanation to understand is a check that needs to be broken down.