Measure pipeline quality with reproducible scenarios. Detect regressions before production.

Benchmarks

A benchmark is a set of evaluation scenarios replayed automatically against a pipeline. It is the tool for objectively measuring agent quality and catching regressions before they reach production.

Why

An agent that "seemed to work" can silently regress after:

a mission change;
an LLM change;
a new tool being added.

Without a benchmark, you discover this in production. With a benchmark, the editor warns you before you publish.

Anatomy of a scenario

A scenario contains:

A stimulus — the input the pipeline receives (message, payload, simulated phone call).
An expected sequence — the conversation turns, tools called, outputs produced.
Checks — explicit assertions: "the agent must call crm.search_customer", "the response must contain the word 'refund'", "the pipeline must complete in fewer than 3 turns".
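As an illustration, the three parts of a scenario could be modelled as plain data. This is only a sketch: the `Scenario` and `Check` classes, their field names, and the parameter keys are hypothetical and do not describe the product's actual storage format. Note that "fewer than 3 turns" translates to a `turn_count_max` of 2, since that check is inclusive.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Check:
    # Hypothetical shape: a check kind from the catalogue plus its parameters.
    kind: str
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Scenario:
    # Hypothetical shape mirroring the three parts described above.
    name: str
    stimulus: dict[str, Any]          # the input the pipeline receives
    expected_sequence: list[str]      # expected turns / tool calls, in order
    checks: list[Check]               # explicit assertions


refund_scenario = Scenario(
    name="refund request",
    stimulus={"type": "message", "text": "I want a refund for my last order"},
    expected_sequence=["crm.search_customer", "agent_reply"],
    checks=[
        Check("tool_called", {"tool": "crm.search_customer"}),
        Check("output_contains", {"text": "refund"}),
        # "fewer than 3 turns" means at most 2 turns.
        Check("turn_count_max", {"max": 2}),
    ],
)
```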

Check catalogue

Check	Verifies
`tool_called`	A specific tool was called (optionally with specific args)
`tool_not_called`	A tool was not called (guards against unwanted calls)
`output_contains`	The agent output contains a given text or pattern
`output_matches_json`	The JSON output validates against a given schema
`turn_count_max`	The pipeline completed in ≤ N turns
`cost_max`	The total cost is ≤ N credits
`latency_max`	The duration is ≤ N seconds
`branch_taken`	A condition routed to the correct branch

Checks are extensible: on the Enterprise plan, you can add custom checks via an operator.
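To make the semantics of the catalogue concrete, here is a minimal sketch of how several of these checks might be evaluated against the record of one execution. The `Trace` class, the `run_check` function, and the parameter names (`tool`, `pattern`, `max`) are assumptions made for this example; they are not the platform's real evaluation engine or operator API.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record of one pipeline execution."""
    tools_called: list[str] = field(default_factory=list)
    output: str = ""
    turns: int = 0
    cost_credits: float = 0.0
    latency_seconds: float = 0.0


def run_check(kind: str, params: dict, trace: Trace) -> bool:
    """Return True when the trace satisfies the check (illustrative only)."""
    if kind == "tool_called":
        return params["tool"] in trace.tools_called
    if kind == "tool_not_called":
        return params["tool"] not in trace.tools_called
    if kind == "output_contains":
        # Treat the parameter as a regular expression.
        return re.search(params["pattern"], trace.output) is not None
    if kind == "turn_count_max":
        return trace.turns <= params["max"]
    if kind == "cost_max":
        return trace.cost_credits <= params["max"]
    if kind == "latency_max":
        return trace.latency_seconds <= params["max"]
    raise ValueError(f"unknown check kind: {kind}")


trace = Trace(
    tools_called=["crm.search_customer"],
    output="Your refund has been issued.",
    turns=2,
    cost_credits=1.5,
    latency_seconds=4.2,
)
passed = run_check("tool_called", {"tool": "crm.search_customer"}, trace)
```

All limit checks in this sketch are inclusive, matching the "≤ N" wording in the table.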

Creating a benchmark

Designer → Benchmarks → New benchmark.
Choose the target pipeline.
Add scenarios. You can:
- Craft manually (the most precise option).
- Capture from history — convert a real execution into a scenario (with anonymisation).
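The anonymisation step applied when capturing from history is not specified here. As a rough idea of what such a step involves, the sketch below masks e-mail addresses and phone-like numbers in a captured message. The `anonymise` function and its two regular expressions are assumptions for illustration only; they are deliberately simple and would miss many kinds of personal data.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymise(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("<email>", text)
    text = PHONE_RE.sub("<phone>", text)
    return text


captured = "Contact jane.doe@example.com or +33 6 12 34 56 78"
masked = anonymise(captured)
```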

A good benchmark includes a mix of nominal cases, edge cases, and adversarial cases (manipulation attempts, ambiguity, missing data).

Running a benchmark

From the admin interface:

Single run — immediate results, drawer with per-scenario detail.
Scheduled run — every night, on each new version publication, etc.

The report shows:

Pass rate per check
Regression vs the previous version
List of failed scenarios, with a diff against the expected result
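The first two report figures can be derived from raw per-check results. The sketch below assumes each run is available as a list of `(scenario, check, passed)` tuples; that format and the function names are hypothetical, not an export format the product is known to offer.

```python
from collections import defaultdict


def pass_rate_per_check(results):
    """Map each check name to the fraction of its evaluations that passed."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for _scenario, check, passed in results:
        totals[check] += 1
        if passed:
            passes[check] += 1
    return {check: passes[check] / totals[check] for check in totals}


def regressions(previous, current):
    """List (scenario, check) pairs that passed in the previous run and fail now."""
    before = {(s, c): ok for s, c, ok in previous}
    return sorted(
        (s, c) for s, c, ok in current
        if not ok and before.get((s, c)) is True
    )


previous_run = [
    ("refund", "tool_called", True),
    ("refund", "output_contains", True),
    ("ambiguous", "tool_called", True),
]
current_run = [
    ("refund", "tool_called", True),
    ("refund", "output_contains", False),
    ("ambiguous", "tool_called", True),
]

rates = pass_rate_per_check(current_run)
regressed = regressions(previous_run, current_run)
```

Comparing individual `(scenario, check)` pairs rather than overall pass rates keeps a newly broken scenario visible even when another one was fixed in the same version.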

Best practices

Start small — 5 well-targeted scenarios are worth more than 50 approximate ones.
Benchmark before every publish. The editor can block publication if the benchmark regresses.
Version your missions, not just your pipelines: missions evolve too.
Keep checks readable. A check that takes 20 lines of explanation to understand should be broken down into smaller checks.
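The "benchmark before every publish" practice amounts to a gate on the benchmark result. A minimal sketch of such a gate is shown here, assuming the overall pass rates of the published and candidate versions are known; the `can_publish` function and its `tolerance` parameter are hypothetical and do not describe how the editor actually decides.

```python
def can_publish(previous_rate: float, current_rate: float,
                tolerance: float = 0.0) -> bool:
    """Allow publication unless the pass rate dropped by more than `tolerance`."""
    return current_rate >= previous_rate - tolerance


# Same pass rate as the published version: allowed.
unchanged_ok = can_publish(0.9, 0.9)
# Pass rate dropped from 90% to 80% with no tolerance: blocked.
regressed_ok = can_publish(0.9, 0.8)
```

A tolerance of zero is the strictest setting; a small positive value may be preferable when scenarios involve non-deterministic model output.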