Real-time voice vs. batch — a phone AI agent is not a chatbot
When a vendor pitches you a "phone AI agent", ask them one question: "What is your end-to-end latency between the end of the user's speech and the start of the response?"
If the answer exceeds 1.5 seconds, it is not a phone agent. It is a chatbot with an ASR / TTS module glued to the front.
The 800 ms rule
A human on the phone tolerates silence after their question. But that silence has a threshold: beyond 800 ms, the other end feels "frozen". At 1,500 ms, the caller says "hello?". At 3,000 ms, they hang up assuming a line problem.
The target for a serious voice agent is therefore: under 800 ms between the end of speech and the first audible byte of the response.
Let's break it down.
The time budget in detail
[user speech]──▶[ASR streaming]──▶[LLM]──▶[TTS streaming]──▶[audio out]
~150ms ~80ms ~? ~250ms ~50ms
Of the ~800 ms available, roughly 270 ms are consumed by ASR + TTS + transport. That leaves 530 ms for the LLM — including prefill time, reasoning, and first-token generation.
Implications:
- GPT-4 Turbo non-streaming: ~2,000 ms to generate a 200-token response. Out of the question.
- GPT-4o streaming: ~400 ms to first token on good days, ~800 ms on bad ones. Acceptable but tight.
- Claude Haiku streaming: ~250 ms to first token. Comfortable.
- Llama 3 8B served by vLLM: ~150 ms to first token. Excellent (and private).
A serious voice architecture therefore requires:
- Everything is streaming — ASR, LLM, TTS. Not a single link in batch mode.
- The LLM model is selected for TTFT (time to first token), not for MMLU score.
- TTS starts speaking on the first received token, not after the full response is complete.
Barge-in
On a text channel, the user waits for the bot to finish. On the phone, they interrupt. Three times out of five, in natural conversations.
A voice agent that cannot handle being cut off feels "robotic" — it keeps talking while the user is trying to interrupt. It is intolerable, and it immediately signals a poorly built bot.
Technical barge-in requires:
- TTS stops immediately when the ASR detects outgoing vocal signal from the user.
- The LLM discards the in-progress completion (without complaint).
- The outgoing audio buffer is flushed instantly (otherwise a residual half-syllable is audible).
- The conversation history notes that the response was not heard in full — to avoid inadvertently repeating it.
It is a technical detail that changes everything in how the interaction feels.
Handling uncertainty
On the phone, the ASR makes mistakes regularly. Where a text chatbot always receives the exact text typed by the user, the voice agent receives a frequently imperfect transcription.
Strategies that work:
- Request confirmation on high-consequence items (case number, amount, unusual surname). Not systematically — that becomes unbearable.
- Do not blindly trust the LLM when the ASR is ambiguous. If the received phrase is "I want to cancel my contract" with a low confidence score, it is better to ask "Did you mean cancel your contract?" than to trigger the cancellation procedure.
- Cut off derailments — an agent that drifts from the script should be brought back by a condition / operator node, not by a prompt that "should be enough".
The phone is also a stack
Beyond the AI, a phone agent requires:
- A SIP trunk from a carrier (Twilio, Voxbone, OVH, Sewan…).
- A media gateway (LiveKit-SIP, FreeSWITCH, Asterisk) that speaks SIP on the carrier side and WebRTC on the platform side.
- A transcriber robust to noise, accents and overlapping voices.
- A synthesiser that correctly pronounces proper nouns, numbers and dates.
None of these elements is trivial. All of them interact.
What betool handles
betool's voice architecture — LiveKit + LiveKit-SIP + dedicated worker + cascaded LLM streaming — covers the above requirements out of the box. You remain in control of:
- The SIP trunk (your carrier, your numbers).
- The models (BYOK or private models).
- The parameters (target latency, barge-in aggressiveness, human-voice fallback).
You do not configure the real-time plumbing — that is the platform's job. You design the conversation, which is the actual business challenge.