Benchmarks

Measure pipeline quality with reproducible scenarios. Detect regressions before production.

A benchmark is a set of evaluation scenarios replayed automatically against a pipeline. It is the tool for objectively measuring agent quality and catching regressions before they reach production.

Why

An agent that "seemed to work" can silently regress after:

  • a mission change;
  • an LLM change;
  • a new tool being added.

Without a benchmark, you discover this in production. With a benchmark, the editor warns you before you publish.

Anatomy of a scenario

A scenario contains:

  • A stimulus — the input the pipeline receives (message, payload, simulated phone call).
  • An expected sequence — the conversation turns, tools called, outputs produced.
  • Checks — explicit assertions: "the agent must call crm.search_customer", "the response must contain the word 'refund'", "the pipeline must complete in fewer than 3 turns".
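The three parts of a scenario can be pictured as a small data structure. The sketch below is illustrative only: the class names, field names, and parameter keys (Scenario, Check, params, "max") are assumptions made for this example and do not describe the product's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Check:
    # Hypothetical representation of one assertion from the check catalogue.
    kind: str                                   # e.g. "tool_called"
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Scenario:
    # Hypothetical container mirroring the three parts described above.
    name: str
    stimulus: dict[str, Any]                    # input the pipeline receives
    expected_sequence: list[dict[str, Any]]     # turns, tool calls, outputs
    checks: list[Check]                         # explicit assertions


refund_scenario = Scenario(
    name="refund-request-nominal",
    stimulus={"type": "message", "text": "I would like a refund for my last order."},
    expected_sequence=[
        {"turn": 1, "tool": "crm.search_customer"},
        {"turn": 2, "output": "Your refund has been initiated."},
    ],
    checks=[
        Check("tool_called", {"tool": "crm.search_customer"}),
        Check("output_contains", {"text": "refund"}),
        # "fewer than 3 turns" is expressed as a maximum of 2.
        Check("turn_count_max", {"max": 2}),
    ],
)
```

Keeping the checks separate from the expected sequence means a scenario can still pass when the wording of a response changes, as long as the explicit assertions hold.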

Check catalogue

Check                 Verifies
tool_called           A specific tool was called (optionally with specific args)
tool_not_called       A tool was not called (anti-false-positive)
output_contains       The agent output contains a given text / pattern
output_matches_json   The JSON output validates a given schema
turn_count_max        The pipeline completed in ≤ N turns
cost_max              The total cost is ≤ N credits
latency_max           The duration is ≤ N seconds
branch_taken          A condition routed to the correct branch

Checks are extensible: on the Enterprise plan, you can add custom checks via an operator.
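To make the catalogue concrete, here is a minimal sketch of how a few of these checks could be evaluated against a recorded execution. The Trace fields, the evaluate function, and the parameter keys are hypothetical and chosen for illustration; they are not the product's API. The sketch also assumes that output_contains treats its pattern as a regular expression.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Trace:
    # Hypothetical summary of one pipeline execution.
    tools_called: list[str] = field(default_factory=list)
    output: str = ""
    turns: int = 0
    cost_credits: float = 0.0
    latency_seconds: float = 0.0


def evaluate(kind: str, params: dict, trace: Trace) -> bool:
    """Return True when the check passes for the given trace."""
    if kind == "tool_called":
        return params["tool"] in trace.tools_called
    if kind == "tool_not_called":
        return params["tool"] not in trace.tools_called
    if kind == "output_contains":
        # Assumption: the pattern is interpreted as a regular expression.
        return re.search(params["pattern"], trace.output) is not None
    if kind == "turn_count_max":
        return trace.turns <= params["max"]
    if kind == "cost_max":
        return trace.cost_credits <= params["max"]
    if kind == "latency_max":
        return trace.latency_seconds <= params["max"]
    raise ValueError(f"unsupported check kind: {kind}")


trace = Trace(
    tools_called=["crm.search_customer"],
    output="Your refund has been initiated.",
    turns=2,
    cost_credits=1.5,
    latency_seconds=4.2,
)

results = {
    "tool_called": evaluate("tool_called", {"tool": "crm.search_customer"}, trace),
    "tool_not_called": evaluate("tool_not_called", {"tool": "billing.charge"}, trace),
    "output_contains": evaluate("output_contains", {"pattern": "refund"}, trace),
    "turn_count_max": evaluate("turn_count_max", {"max": 2}, trace),
    "latency_max": evaluate("latency_max", {"max": 3}, trace),
}
```

In this example every check passes except latency_max, because the hypothetical execution took 4.2 seconds against a limit of 3.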

Creating a benchmark

  1. Designer → Benchmarks → New benchmark.
  2. Choose the target pipeline.
  3. Add scenarios. You can:
    • Craft manually (the most precise option).
    • Capture from history — convert a real execution into a scenario (with anonymisation).
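As an illustration of what anonymisation during capture might involve, the sketch below masks email addresses and phone-like numbers in a recorded execution before turning it into a scenario. The regular expressions, the anonymise and capture_as_scenario functions, and the shape of the execution dictionary are all assumptions for this example. Real anonymisation usually has to cover more than these two patterns (names, addresses, account numbers).

```python
import re

# Simplified patterns for illustration; they will not catch every format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymise(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("<email>", text)
    return PHONE.sub("<phone>", text)


def capture_as_scenario(execution: dict) -> dict:
    """Convert a recorded execution (hypothetical shape) into a scenario draft."""
    return {
        "stimulus": anonymise(execution["input"]),
        "expected_outputs": [anonymise(o) for o in execution["outputs"]],
        "checks": [],  # checks are added by hand after capture
    }


recorded = {
    "input": "Reach me at jane.doe@example.com or +33 6 12 34 56 78.",
    "outputs": ["Thanks, we will write to jane.doe@example.com shortly."],
}
draft = capture_as_scenario(recorded)
```

The captured draft starts with an empty check list on purpose: a real execution tells you what happened, not which parts of it were required.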

A good benchmark includes a mix: nominal case, edge cases, adversarial cases (manipulation attempts, ambiguity, missing data).

Running a benchmark

From the admin interface:

  • Single run — immediate results, drawer with per-scenario detail.
  • Scheduled run — every night, on each new version publication, etc.

The report shows:

  • Pass rate per check
  • Regression vs the previous version
  • List of failed scenarios, with a diff against the expected result
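The first two report figures are simple to derive from raw check results. The following sketch assumes each result is a (scenario, check, passed) tuple, which is a format invented for this example; it computes the pass rate per check and lists the scenario/check pairs that passed on the previous version but fail on the current one.

```python
from collections import defaultdict


def pass_rate_per_check(results):
    """results: iterable of (scenario, check_kind, passed) tuples."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for _scenario, kind, passed in results:
        totals[kind] += 1
        passes[kind] += int(passed)
    return {kind: passes[kind] / totals[kind] for kind in totals}


def regressions(previous, current):
    """Scenario/check pairs that passed before and fail now."""
    before = {(s, k): p for s, k, p in previous}
    return sorted(
        (s, k) for s, k, p in current
        if not p and before.get((s, k)) is True
    )


previous_run = [
    ("refund-nominal", "tool_called", True),
    ("refund-nominal", "output_contains", True),
    ("ambiguous-request", "tool_called", True),
]
current_run = [
    ("refund-nominal", "tool_called", True),
    ("refund-nominal", "output_contains", False),
    ("ambiguous-request", "tool_called", True),
]

rates = pass_rate_per_check(current_run)
regressed = regressions(previous_run, current_run)
```

Here rates is {"tool_called": 1.0, "output_contains": 0.0} and regressed contains the single pair ("refund-nominal", "output_contains"). A pair that is new in the current run is not counted as a regression, since there is no earlier result to compare it with.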

Best practices

  • Start small — 5 well-targeted scenarios are worth more than 50 approximate ones.
  • Benchmark before every publish. The editor can block publication if the benchmark regresses.
  • Version your missions. Not just pipelines: missions evolve too.
  • Keep checks readable. A check that requires 20 lines of explanation to understand is a check that needs to be broken down.
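If you want to reproduce the "benchmark before every publish" rule in your own tooling, for instance in a release script, the decision can be sketched as a comparison of per-check pass rates between two versions. The function name, the input format, and the tolerance parameter are assumptions for this example; how the editor itself decides to block publication is not specified here.

```python
def should_block_publish(previous_rates, current_rates, tolerance=0.0):
    """Return (block, dropped) where dropped maps each regressed check
    to its (before, after) pass rates.

    Assumption: a check missing from the current run counts as a 0.0 pass rate.
    """
    dropped = {}
    for check, before in previous_rates.items():
        after = current_rates.get(check, 0.0)
        if after < before - tolerance:
            dropped[check] = (before, after)
    return bool(dropped), dropped


previous_rates = {"tool_called": 1.0, "output_contains": 0.8}
current_rates = {"tool_called": 1.0, "output_contains": 0.6}

block, dropped = should_block_publish(previous_rates, current_rates)
lenient_block, _ = should_block_publish(previous_rates, current_rates, tolerance=0.25)
```

With the default tolerance the drop from 0.8 to 0.6 on output_contains blocks the publish; with a tolerance of 0.25 it does not. A non-zero tolerance can be useful when agent output is not fully deterministic, at the cost of letting small regressions through.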