PRODUCT

Voice agents that actually pick up the phone: inside Calleague

1 June 2026 · 2 min read

Real telephony, a realtime speech-to-text → LLM → text-to-speech pipeline and production RAG — what it takes for an AI voice agent to hold a natural call, and how Calleague is built for it.

A demo voice bot and a deployable phone agent are different animals. The gap is latency, grounding, and the unglamorous reality of real telephony.

The one-second bar

Humans leave about a 200-millisecond gap between conversational turns — a figure from cross-language research published in PNAS. No phone agent hits that, but it sets the instinct: over a real line you want the response back in under roughly a second, against a natural “mouth-to-ear” budget of around 1.1 seconds. Miss it and callers talk over the agent or hang up.

How Calleague makes the budget

The trick is overlap, not raw speed. Calleague streams realtime speech-to-text into the model, streams the model’s tokens into text-to-speech, and starts speaking before the model has finished generating its reply. The stages of the cascade overlap instead of each waiting for the previous one to finish. The cascade also keeps control: we can inject retrieved facts and guardrails at the model step, something a single end-to-end speech-to-speech model does not allow.
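To make the overlap concrete, here is a minimal sketch in Python that compares time-to-first-audio for a streamed cascade against one where each stage waits for the previous stage to finish. It is a toy model, not Calleague's implementation: the three stages are identical stubs, and STAGE_DELAY, stage, streamed_first_output, sequential_first_output and compare are hypothetical names and values chosen for illustration.

```python
import asyncio
import time

# Hypothetical per-item processing cost of each stage, in seconds.
# Illustrative only; these are not measured Calleague figures.
STAGE_DELAY = 0.02
STAGES = 3  # speech-to-text, LLM, text-to-speech


async def stage(in_q, out_q):
    """Process items one at a time, forwarding each as soon as it is done."""
    while True:
        item = await in_q.get()
        if item is None:              # end-of-stream marker
            await out_q.put(None)
            return
        await asyncio.sleep(STAGE_DELAY)
        await out_q.put(item)


async def streamed_first_output(items):
    """Time until the last stage emits its first item when stages overlap."""
    queues = [asyncio.Queue() for _ in range(STAGES + 1)]
    tasks = [asyncio.create_task(stage(queues[i], queues[i + 1]))
             for i in range(STAGES)]
    start = time.perf_counter()
    for item in items:
        queues[0].put_nowait(item)
    queues[0].put_nowait(None)
    await queues[-1].get()            # first chunk of "audio" arrives
    elapsed = time.perf_counter() - start
    await asyncio.gather(*tasks)      # let the rest of the pipeline drain
    return elapsed


async def sequential_first_output(items):
    """Time until first output when each stage waits for the previous one."""
    start = time.perf_counter()
    for _ in range(STAGES - 1):
        for _ in items:               # a stage finishes every item before handing over
            await asyncio.sleep(STAGE_DELAY)
    await asyncio.sleep(STAGE_DELAY)  # the last stage emits its first item
    return time.perf_counter() - start


def compare(n_items=10):
    """Return (streamed, sequential) time-to-first-output in seconds."""
    items = [f"chunk-{i}" for i in range(n_items)]
    streamed = asyncio.run(streamed_first_output(items))
    sequential = asyncio.run(sequential_first_output(items))
    return streamed, sequential


if __name__ == "__main__":
    streamed, sequential = compare()
    print(f"streamed: {streamed:.3f}s, sequential: {sequential:.3f}s")
```

In the streamed version the first output costs roughly one pass through each stage, however long the reply is. In the sequential version it grows with the length of the reply, and on a phone line that growth is what pushes the response past the one-second mark.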

Real telephony, not a web widget

Calleague speaks SIP, so it drops into the public phone network and your existing PBX or contact centre for both inbound support lines and outbound campaigns, handling narrowband call audio and keypad (DTMF) tones. It runs a visual workflow editor over a multi-model gateway, with desktop and embeddable surfaces.

Grounding is the accuracy story

An agent is only as good as what it retrieves. Calleague uses production RAG — hybrid retrieval that combines keyword (BM25) and vector search, then reranks the shortlist with a cross-encoder before answering. Keyword search matters because callers read out the exact tokens embeddings miss: order IDs, SKUs, error codes. Grounding with citations reduces hallucination; it does not eliminate it, which is why retrieval quality is the product.
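The retrieval flow described above can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not Calleague's code: the knowledge base is made up, character-trigram counts stand in for a real embedding model, a token-overlap function stands in for the cross-encoder, and the two rankings are merged with reciprocal rank fusion, which is one common choice rather than the method the product is confirmed to use. All function names are hypothetical.

```python
import math
from collections import Counter

DOCS = {  # toy knowledge base; contents are invented for illustration
    "kb1": "Order A1B2 shipped on Monday and arrives within five days",
    "kb2": "Refunds are issued to the original payment method",
    "kb3": "Error E42 means the router needs a firmware update",
}


def tokenize(text):
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank doc ids by BM25 score (keyword match, good for exact IDs and codes)."""
    tokens = {d: tokenize(t) for d, t in docs.items()}
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for t in tokens.values() if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (
                freq + k1 * (1 - b + b * len(toks) / avg_len))
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def trigram_vector(text):
    """Stand-in for an embedding model: character-trigram counts."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def vector_rank(query, docs):
    """Rank doc ids by cosine similarity to the query vector."""
    q = trigram_vector(query)
    scores = {d: cosine(q, trigram_vector(t)) for d, t in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


def overlap_score(query, passage):
    """Placeholder for a cross-encoder: share of query tokens in the passage."""
    q = set(tokenize(query))
    return len(q & set(tokenize(passage))) / len(q) if q else 0.0


def hybrid_search(query, docs, top_k=2, score_pair=overlap_score):
    """Keyword + vector retrieval, fused, then reranked over the shortlist."""
    fused = reciprocal_rank_fusion(
        [bm25_rank(query, docs), vector_rank(query, docs)])
    shortlist = fused[:top_k]
    return sorted(shortlist, key=lambda d: score_pair(query, docs[d]),
                  reverse=True)
```

Calling hybrid_search("where is order a1b2", DOCS) puts kb1 first, because the exact order ID matches on the keyword side. In a real system score_pair would call a cross-encoder model, and the passages that survive reranking would be passed to the LLM along with their ids so the answer can cite them.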

Where it runs

Because caller audio and transcripts are sensitive, the deployment model matters as much as the features. Calleague runs on-prem or as SaaS, keeping audio inside your perimeter where data residency and GDPR obligations require it.

Voice is the hardest surface to fake and the easiest to feel. Calleague is built so the call feels human and the data stays yours.
