How to Build an AI Voice Agent in 2026: The Production Playbook
Step-by-step playbook to build a production-grade AI voice agent in 2026. Architecture, code, latency budget, and the trade-offs you'll hit at each stage.
Published: April 28, 2026 · Updated: April 28, 2026 · Reading time: 13 minutes
This guide walks through the actual architecture, technology choices, and code patterns you'll need to ship a production-grade AI voice agent in 2026. It assumes you're a developer, you've used at least one cloud provider, and you're not interested in the marketing version of "voice AI."
By the end you'll have a working mental model for the four-layer stack, the latency budget at each layer, and a complete decision tree for picking providers. We'll show two paths: build it yourself from raw APIs (educational, painful), or build it on top of an orchestration platform like Burki (production, fast).
The four-layer architecture
Every production voice agent has the same four layers:
[caller phone] → [telephony] → [STT] → [LLM] → [TTS] → [back to caller]

The total round-trip from "caller stops talking" to "caller hears agent reply" must stay under ~1.5 seconds. The state-of-the-art in 2026 is closer to 600–800ms. Beyond ~1.8s, callers think the line dropped.
Layer 1: Telephony
Job: get the audio in and out of the PSTN, expose a streaming WebSocket of the caller's audio.
Pick from: Twilio, Telnyx, Vonage. All three expose a real-time media-stream WebSocket protocol. Twilio is the default; Telnyx is cheaper at high volume.
Latency budget: 50–100ms one-way (caller → your server). You don't get to optimize this.
Layer 2: STT (Speech-to-Text)
Job: take streaming audio chunks, return streaming transcription with low latency. Detect end-of-speech.
Pick from: Deepgram Nova 3 (default), AssemblyAI Universal, Azure Speech (compliance scenarios).
Latency budget: 200–400ms from end-of-speech to finalized transcript. Deepgram Nova 3 hits the lower end of this.
Layer 3: LLM (Large Language Model)
Job: take the transcript + context, return a streaming response.
Pick from: GPT-4o-mini (fast + cheap default), Claude Haiku, Llama 3.3 70B on Groq (fastest), GPT-4o (highest quality, slower).
Latency budget: 200–500ms time-to-first-token. The full response can stream over several seconds — what matters is when the first chunk arrives so TTS can start.
Layer 4: TTS (Text-to-Speech)
Job: take streaming text chunks, return streaming audio chunks with low latency.
Pick from: Cartesia Sonic (fastest), ElevenLabs Flash v2.5 (best speed-quality balance), ElevenLabs Multilingual v2 (multilingual quality).
Latency budget: 80–250ms time-to-first-byte. Cartesia hits the lower end.
Total latency budget
Adding everything: 50 + 300 + 350 + 150 = ~850ms for a tuned best-of-breed pipeline. Reality is often closer to 1.0–1.2s once you account for:
- Voice activity detection (VAD) windowing — typically 200–400ms of "is the caller really done?" buffer.
- Turn-taking heuristics — you don't want the agent interrupting.
- Audio resampling between layers.
- Network jitter.
Path 1: Build it yourself
This is the educational path. It will take a senior engineer 4–8 weeks to ship something production-ready. You'll learn a lot. You'll also rebuild things that have been built 1,000 times before.
High-level steps
- Telephony webhook + WebSocket: stand up a public HTTPS endpoint, register it with Twilio as a Voice webhook. When a call comes in, return TwiML pointing to your media-stream WebSocket. Open the WebSocket, decode the mu-law audio chunks (first sketch after this list).
- Streaming STT: open a streaming connection to Deepgram. Pipe audio chunks in, receive partial + final transcripts. Implement turn detection: when do you decide the caller is done speaking? (Second sketch below.)
- Streaming LLM: when you get a final transcript, append to the conversation history, send to OpenAI's streaming chat completions API. Receive token chunks as they arrive.
- Streaming TTS: as LLM tokens arrive, stream them in chunks to ElevenLabs/Cartesia's streaming TTS endpoint. Receive audio chunks back (the third sketch below pipelines this step and the previous one).
- Audio re-injection: encode the TTS audio chunks back into mu-law at 8kHz, send them back over the Twilio media-stream WebSocket.
- Interruption handling: if the caller starts speaking while the agent is talking, kill the in-flight TTS and start listening again (fourth sketch below, together with the state machine).
- State machine: track who's talking, transitions between listening/thinking/speaking states.
- Observability: log every state transition with timestamps so you can debug latency spikes.
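Here's a minimal sketch of step 1, assuming FastAPI and Twilio's media-stream protocol (base64 mu-law frames wrapped in JSON messages). The host in the TwiML is a placeholder for your own public endpoint.

```python
# Minimal Twilio voice webhook + media-stream receiver (FastAPI).
# Assumes a public HTTPS/WSS host; Twilio sends media as base64 mu-law at 8kHz.
import audioop  # stdlib mu-law codec (deprecated in 3.11+, removed in 3.13)
import base64
import json

from fastapi import FastAPI, WebSocket
from fastapi.responses import Response

app = FastAPI()

@app.post("/voice")
async def voice_webhook():
    # TwiML telling Twilio to open a media-stream WebSocket back to us.
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-host.example.com/media" />
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media")
async def media_stream(ws: WebSocket):
    await ws.accept()
    while True:
        msg = json.loads(await ws.receive_text())
        if msg["event"] == "media":
            # 8kHz mu-law -> 16-bit linear PCM, ready for the STT layer.
            pcm = audioop.ulaw2lin(base64.b64decode(msg["media"]["payload"]), 2)
            # forward pcm to your STT stream here
        elif msg["event"] == "stop":
            break
```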
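For step 2, a sketch of streaming that PCM into Deepgram's live WebSocket endpoint. The `endpointing=350` silence threshold and the query parameters shown are starting-point assumptions to tune against Deepgram's docs, not a complete option set.

```python
# Streaming STT sketch against Deepgram's live WebSocket API.
# Assumes DEEPGRAM_API_KEY in the environment and 8kHz linear16 PCM input.
import asyncio
import json
import os

import websockets  # pip install websockets

DG_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?encoding=linear16&sample_rate=8000&interim_results=true&endpointing=350"
)

async def transcribe(audio_queue: asyncio.Queue, on_turn_end):
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # websockets>=14 renames extra_headers to additional_headers.
    async with websockets.connect(DG_URL, extra_headers=headers) as dg:
        async def pump_audio():
            while (chunk := await audio_queue.get()) is not None:
                await dg.send(chunk)

        sender = asyncio.create_task(pump_audio())
        utterance = []
        async for raw in dg:
            msg = json.loads(raw)
            alt = msg.get("channel", {}).get("alternatives", [{}])[0]
            if msg.get("is_final") and alt.get("transcript"):
                utterance.append(alt["transcript"])
            # speech_final fires when Deepgram's endpointing decides the
            # caller has stopped talking: this is the turn-detection signal.
            if msg.get("speech_final") and utterance:
                await on_turn_end(" ".join(utterance))
                utterance.clear()
        sender.cancel()
```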
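Steps 3 and 4 pipelined: stream tokens from OpenAI's chat completions API, cut the buffer on sentence boundaries, and hand each finished sentence to TTS immediately so audio starts before the full reply exists. `tts_speak` is a placeholder for your streaming TTS client.

```python
# Stream the LLM reply and start TTS on the first complete sentence.
# Assumes the openai>=1.x SDK; tts_speak() is a placeholder for your TTS call.
import re

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
SENTENCE_END = re.compile(r"[.!?]\s")

async def respond(history: list[dict], tts_speak):
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=history,
        stream=True,
    )
    buffer = ""
    async for chunk in stream:
        if not chunk.choices:
            continue
        buffer += chunk.choices[0].delta.content or ""
        # Flush on sentence boundaries: TTS starts on the first sentence
        # instead of waiting for the full response to finish streaming.
        while (m := SENTENCE_END.search(buffer)):
            await tts_speak(buffer[: m.end()].strip())
            buffer = buffer[m.end():]
    if buffer.strip():
        await tts_speak(buffer.strip())
```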
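And steps 6 and 7 in miniature: a three-state turn manager where caller audio during SPEAKING cancels the in-flight TTS task. The class and method names are illustrative, not a fixed API.

```python
# Turn-taking state machine with barge-in (interruption) handling.
import asyncio
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()   # caller is talking, STT is running
    THINKING = auto()    # waiting on the LLM's first token
    SPEAKING = auto()    # streaming TTS audio back to the caller

class TurnManager:
    def __init__(self):
        self.state = State.LISTENING
        self.tts_task: asyncio.Task | None = None

    def on_turn_end(self, transcript: str, run_agent):
        self.state = State.THINKING
        self.tts_task = asyncio.create_task(run_agent(transcript))

    def on_caller_audio(self):
        # Barge-in: the caller started talking over the agent.
        if self.state == State.SPEAKING and self.tts_task:
            self.tts_task.cancel()  # kill the in-flight TTS
            self.tts_task = None
        self.state = State.LISTENING
```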
This is the minimum. Production extras:
- Reconnect logic if any WebSocket drops.
- Session recording with PII redaction.
- Function calling so the agent can hit your APIs mid-call.
- CRM writeback after the call ends.
- Compliance: HIPAA audit trail, PCI audio masking, GDPR deletion.
- Multi-tenant isolation if you have multiple customers.
- Carrier failover if your primary telephony degrades.
- Provider failover if your TTS or STT degrades.
- Cost accounting per call so you can charge customers.
By month 6 you'll have rebuilt 80% of what an orchestration platform gives you out of the box.
Path 2: Build on Burki
This is the production path. You skip the layer-1 wiring, the WebSocket plumbing, the state machine, the observability, the multi-tenancy, and the CRM writeback. You wire your business logic on top.
Five-step setup
- Create an assistant in the Burki dashboard. Pick the "best-of-breed English production agent" template:
  - STT: Deepgram Nova 3
  - LLM: OpenAI GPT-4o-mini
  - TTS: ElevenLabs Flash v2.5
  - Telephony: Twilio (or Telnyx)
- Set the system prompt — this is the agent's persona and instructions. A couple of paragraphs minimum.
- Connect a phone number — Burki provisions a Twilio number or hooks into yours via BYO mode.
- Add tools — function definitions the agent can call mid-call. These are JSON-Schema function declarations that Burki forwards to the LLM.
- Test — call the number, verify the agent responds, iterate on the prompt.
That's a working voice agent. Total time: ~30 minutes.
Adding your business logic
The hard part of voice AI is rarely the voice — it's the integrations. Burki handles voice; you handle business logic.
Define a tool the agent can call:
```json
{
  "name": "create_support_ticket",
  "description": "Create a support ticket in our system",
  "parameters": {
    "type": "object",
    "properties": {
      "summary": { "type": "string" },
      "priority": { "enum": ["low", "medium", "high"] },
      "customer_email": { "type": "string" }
    }
  }
}
```

When the agent calls this tool, Burki POSTs to your webhook with the parameters. You execute your business logic (create the ticket in Zendesk, in your DB, wherever) and return a response. The agent uses the response to continue the conversation naturally.
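Here's what the receiving end might look like, as a FastAPI sketch. The payload shape and endpoint path are assumptions for illustration; check Burki's webhook docs for the actual contract.

```python
# Hypothetical tool webhook: called when the agent invokes
# create_support_ticket. Payload shape is assumed for illustration.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TicketParams(BaseModel):
    summary: str
    priority: str = "medium"
    customer_email: str | None = None

@app.post("/tools/create_support_ticket")
async def create_support_ticket(params: TicketParams):
    ticket_id = save_ticket(params)  # your business logic (placeholder)
    # The returned text is what the agent folds into the conversation.
    return {"result": f"Ticket {ticket_id} created with priority {params.priority}."}

def save_ticket(params: TicketParams) -> str:
    ...  # create in Zendesk / your DB; return the new ticket id
    return "ZD-1042"
```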
CRM writeback
If you're on GoHighLevel, Salesforce, HubSpot, or Pipedrive, the CRM integration is a dropdown — see /integrations. Burki writes the transcript, summary, sentiment, and disposition to the right object in the right CRM at call end. Zero code.
The latency tuning checklist
Once your agent is live, the next 80% of the work is shaving latency. Checklist:
- Pick streaming everywhere — no batch APIs in the hot path. Verify each layer is genuinely streaming.
- Use Cartesia or ElevenLabs Flash for TTS unless you have a quality reason to use Turbo or Multilingual.
- Use GPT-4o-mini or Groq Llama for the LLM unless you need reasoning depth.
- Tune VAD — the silence threshold that triggers "caller is done" is usually too conservative. Drop it from 600ms to 350ms and measure.
- Send first-token audio early — start TTS on the first useful sentence, not on the full response.
- Pre-warm sessions — keep streaming connections to STT and TTS warm if you can predict an incoming call window.
- Co-locate — if your Burki region and your provider region don't match, latency suffers.
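To make the "measure" part concrete, here's a small per-turn timer you can wrap around the pipeline. The stage names below mirror the four layers; the logging setup is whatever you already use.

```python
# Per-turn latency breakdown: record a timestamp at each stage boundary
# so spikes are attributable to a specific layer.
import time

class TurnTimer:
    def __init__(self):
        self.marks = {"turn_start": time.monotonic()}

    def mark(self, stage: str):
        self.marks[stage] = time.monotonic()

    def report(self) -> dict[str, float]:
        # Milliseconds between consecutive stage boundaries.
        ordered = list(self.marks.items())
        return {
            f"{a}->{b}": round((tb - ta) * 1000, 1)
            for (a, ta), (b, tb) in zip(ordered, ordered[1:])
        }

# Usage per turn:
#   t = TurnTimer()              # caller stopped talking (speech_final)
#   t.mark("stt_final")          # finalized transcript arrived
#   t.mark("llm_first_token")    # first LLM chunk
#   t.mark("tts_first_byte")     # first TTS audio chunk
#   t.mark("audio_out")          # first frame sent back to Twilio
#   print(t.report())            # e.g. {"stt_final->llm_first_token": 312.4, ...}
```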
Compliance shortcuts
- HIPAA: Burki signs a BAA on Pro plans. Use Azure for STT/TTS if your compliance team requires Azure-only.
- PCI: Burki ships per-call audio masking — when the caller reads a card number, the audio frame is zeroed out from the recording while still passing through STT for tokenization.
- GDPR: per-region data residency, configurable retention, full deletion API.
Recommendation
If you're building a voice AI in 2026:
- For learning — build Path 1 once. You'll understand the trade-offs forever.
- For production — go Path 2. Burki gets you to production in days; the equivalent custom build is months.
Start Free — 200 minutes, no credit card. The full best-of-breed stack from this post is wired up by default.