April 28, 2026 · betool team

Real-time voice vs. batch — a phone AI agent is not a chatbot

Why a voice assistant demands a radically different architecture from a text chatbot, and what that means for latency, barge-in and model selection.

voice · architecture · latency

When a vendor pitches you a "phone AI agent", ask them one question: "What is your end-to-end latency between the end of the user's speech and the start of the response?"

If the answer exceeds 1.5 seconds, it is not a phone agent. It is a chatbot with an ASR / TTS module glued to the front.

The 800 ms rule

A caller tolerates a short silence after their question, but only up to a point: beyond 800 ms, the other end feels "frozen". At 1,500 ms, the caller says "hello?". At 3,000 ms, they hang up, assuming a line problem.

The target for a serious voice agent is therefore: under 800 ms between the end of speech and the first audible byte of the response.

Let's break it down.

The time budget in detail

[user speech]──▶[ASR streaming]──▶[LLM]──▶[TTS streaming]──▶[audio out]
      ~150ms          ~80ms        ~?       ~250ms             ~50ms

Of the ~800 ms available, roughly 270 ms are consumed by ASR + TTS + transport. That leaves 530 ms for the LLM — including prefill time, reasoning, and first-token generation.

Implications:

  • GPT-4 Turbo non-streaming: ~2,000 ms to generate a 200-token response. Out of the question.
  • GPT-4o streaming: ~400 ms to first token on good days, ~800 ms on bad ones. Acceptable but tight.
  • Claude Haiku streaming: ~250 ms to first token. Comfortable.
  • Llama 3 8B served by vLLM: ~150 ms to first token. Excellent (and private).
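The arithmetic behind these verdicts can be sketched in a few lines. This is a minimal illustration, not a benchmark: the 800 ms target, the 270 ms overhead and the time-to-first-token figures are the ones quoted above, while the function names and the "less than 25% headroom counts as tight" rule are assumptions made for the example.

```python
TARGET_MS = 800             # end of speech -> first audible response (from the text)
PIPELINE_OVERHEAD_MS = 270  # ASR + TTS + transport (from the text)


def llm_budget_ms(target_ms: int = TARGET_MS,
                  overhead_ms: int = PIPELINE_OVERHEAD_MS) -> int:
    """Time left for the LLM to produce its first token."""
    return target_ms - overhead_ms


def verdict(ttft_ms: int, budget_ms: int) -> str:
    """Classify a time to first token (TTFT) against the LLM budget."""
    if ttft_ms > budget_ms:
        return "over budget"
    headroom = budget_ms - ttft_ms
    # Assumed rule of thumb: under 25% headroom is considered tight.
    return "tight" if headroom < 0.25 * budget_ms else "comfortable"


if __name__ == "__main__":
    budget = llm_budget_ms()
    # TTFT figures quoted in the list above, in milliseconds.
    for name, ttft in [("GPT-4o (good day)", 400),
                       ("GPT-4o (bad day)", 800),
                       ("Claude Haiku", 250),
                       ("Llama 3 8B on vLLM", 150)]:
        print(f"{name}: {ttft} ms -> {verdict(ttft, budget)}")
```

Measure TTFT on your own traffic before relying on any of these figures: they vary with prompt length, region and load.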

A serious voice architecture therefore requires:

  1. Streaming everywhere: ASR, LLM and TTS, with not a single link in batch mode.
  2. An LLM selected for TTFT (time to first token), not for its MMLU score.
  3. TTS that starts speaking on the first received token, not after the full response is complete.
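The third requirement is the one most often missed, so here is a minimal sketch of what it means in code. Both the LLM and the TTS are stand-ins (a simulated token generator and a list of events); no real client library is used, and every name here is hypothetical. The point is only the ordering: audio is handed to TTS while the LLM is still generating.

```python
import asyncio
from typing import AsyncIterator, List


async def llm_tokens(events: List[str]) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM client (hypothetical)."""
    for token in ["Your ", "order ", "ships ", "tomorrow."]:
        await asyncio.sleep(0.01)  # simulated per-token generation delay
        yield token
    events.append("llm:done")


async def stream_to_tts(tokens: AsyncIterator[str], events: List[str]) -> None:
    """Forward each token to TTS as it arrives, without waiting for the full reply."""
    async for token in tokens:
        events.append(f"tts:{token}")  # stand-in for a streaming TTS call


def run_demo() -> List[str]:
    """Run the pipeline once and return the ordered list of events."""
    events: List[str] = []
    asyncio.run(stream_to_tts(llm_tokens(events), events))
    return events


if __name__ == "__main__":
    for event in run_demo():
        print(event)
```

In the event log, every TTS entry appears before the LLM reports completion. A batch pipeline would show the opposite order.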

Barge-in

On a text channel, the user waits for the bot to finish. On the phone, they interrupt: three times out of five in natural conversation.

A voice agent that cannot handle being cut off feels "robotic" — it keeps talking while the user is trying to interrupt. It is intolerable, and it immediately signals a poorly built bot.

Technical barge-in requires:

  • TTS stops immediately when the ASR detects incoming speech from the user.
  • The LLM discards the in-progress completion (without complaint).
  • The outgoing audio buffer is flushed instantly (otherwise a residual half-syllable is audible).
  • The conversation history notes that the response was not heard in full — to avoid inadvertently repeating it.
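The four requirements above can be sketched as a single handler. This is an illustration under stated assumptions, not betool's implementation: `AgentSession`, `Turn` and `on_user_speech_detected` are hypothetical names, the audio buffer is a plain list, and the text generated so far stands in for "what the caller actually heard" (a real system would track playback position).

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional

FULL_REPLY = ["Your ", "contract ", "renews ", "in ", "March."]


@dataclass
class Turn:
    role: str
    text: str
    interrupted: bool = False  # True when the caller cut the reply short


@dataclass
class AgentSession:
    history: List[Turn] = field(default_factory=list)
    audio_buffer: List[bytes] = field(default_factory=list)
    spoken_so_far: str = ""
    llm_task: Optional[asyncio.Task] = None

    async def on_user_speech_detected(self) -> None:
        """Barge-in: stop generating, drop queued audio, record the partial reply."""
        if self.llm_task is not None and not self.llm_task.done():
            self.llm_task.cancel()  # discard the in-progress completion
            try:
                await self.llm_task
            except asyncio.CancelledError:
                pass
        self.audio_buffer.clear()   # flush audio that has not been played yet
        self.history.append(Turn("assistant", self.spoken_so_far, interrupted=True))
        self.spoken_so_far = ""


async def demo() -> AgentSession:
    session = AgentSession()

    async def generate() -> None:
        # Simulated LLM + TTS loop: one word every 50 ms.
        for word in FULL_REPLY:
            session.spoken_so_far += word
            session.audio_buffer.append(word.encode())
            await asyncio.sleep(0.05)

    session.llm_task = asyncio.create_task(generate())
    await asyncio.sleep(0.12)       # the caller starts talking mid-reply
    await session.on_user_speech_detected()
    return session


if __name__ == "__main__":
    print(asyncio.run(demo()).history[-1])
```

The `interrupted` flag is what lets the next prompt tell the model that the caller only heard part of the previous answer.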

It is a technical detail that changes everything in how the interaction feels.

Handling uncertainty

On the phone, the ASR makes mistakes regularly. Where a text chatbot always receives the exact text typed by the user, the voice agent receives a frequently imperfect transcription.

Strategies that work:

  • Request confirmation on high-consequence items (case number, amount, unusual surname). Not systematically — that becomes unbearable.
  • Do not let the LLM act blindly on an ambiguous transcription. If the received phrase is "I want to cancel my contract" with a low confidence score, it is better to ask "Did you mean that you want to cancel your contract?" than to trigger the cancellation procedure.
  • Cut off derailments — an agent that drifts from the script should be brought back by a condition / operator node, not by a prompt that "should be enough".
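One way to express the first two strategies is a small gate between the transcription and the action. The sketch below is one possible policy, not a prescribed one: the intent names, the 0.80 threshold and the `Transcript` structure are all assumptions, and real ASR engines report confidence in different ways.

```python
from dataclasses import dataclass

# Hypothetical intents where a misheard phrase is costly.
HIGH_CONSEQUENCE_INTENTS = {"cancel_contract", "make_payment"}
CONFIDENCE_THRESHOLD = 0.80  # assumed value, to be tuned on real calls


@dataclass
class Transcript:
    text: str
    confidence: float  # ASR confidence in [0, 1]
    intent: str        # intent label produced upstream (hypothetical)


def next_action(transcript: Transcript) -> str:
    """Ask for confirmation only when the stakes are high and the ASR is unsure."""
    risky = transcript.intent in HIGH_CONSEQUENCE_INTENTS
    unsure = transcript.confidence < CONFIDENCE_THRESHOLD
    return "confirm" if risky and unsure else "proceed"


if __name__ == "__main__":
    heard = Transcript("I want to cancel my contract", 0.55, "cancel_contract")
    print(next_action(heard))
```

Because the gate is plain code and not a prompt instruction, it behaves the same way on every call, which is the same argument the last bullet makes for condition nodes.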

The phone is also a stack

Beyond the AI, a phone agent requires:

  • A SIP trunk from a carrier (Twilio, Voxbone, OVH, Sewan…).
  • A media gateway (LiveKit-SIP, FreeSWITCH, Asterisk) that speaks SIP on the carrier side and WebRTC on the platform side.
  • A transcriber robust to noise, accents and overlapping voices.
  • A synthesiser that correctly pronounces proper nouns, numbers and dates.

None of these elements is trivial. All of them interact.

What betool handles

betool's voice architecture — LiveKit + LiveKit-SIP + dedicated worker + cascaded LLM streaming — covers the above requirements out of the box. You remain in control of:

  • The SIP trunk (your carrier, your numbers).
  • The models (BYOK or private models).
  • The parameters (target latency, barge-in aggressiveness, human-voice fallback).

You do not configure the real-time plumbing — that is the platform's job. You design the conversation, which is the actual business challenge.