Voice Pipeline
The voice pipeline is the real-time audio processing chain that powers every call. It is built on Pipecat, an open-source framework for voice and multimodal AI agents.
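As a rough, framework-agnostic sketch of the chain described below (plain strings and functions standing in for Pipecat processors, so none of these names are the real API), the ordered stages a frame passes through look like this:

```python
# Illustrative only: stage names mirror the architecture diagram below;
# the real system wires Pipecat processors, not plain strings.
STAGES = [
    "transport_input",               # deserialize provider frames
    "audio_recorder",                # capture raw PCM
    "stt",                           # streaming speech-to-text
    "call_event_logger",             # log user_spoke, track idle time
    "user_context_aggregator",       # accumulate user turn
    "llm",                           # streaming token generation
    "tts",                           # streaming text-to-speech
    "transcript_collector",          # capture assistant text
    "transport_output",              # serialize to provider frames
    "assistant_context_aggregator",  # accumulate bot turn
]

def trace_frame(frame: dict) -> dict:
    """Record the order in which a frame would visit each stage."""
    frame["trace"] = list(STAGES)
    return frame

frame = trace_frame({"audio": b"\x00\x01"})
```

The key property is strict ordering: audio is recorded before transcription, and the assistant's context is aggregated only after its audio has been serialized back to the provider.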
Pipeline Architecture
Telephony Provider (WebSocket)
│
▼
┌──────────────────────────────────────────────────────┐
│ Transport Input (deserialize provider frames) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Audio Recorder (capture raw PCM → WAV file) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ STT Service (streaming speech-to-text) │
│ Deepgram · Sarvam · OpenAI · ElevenLabs │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ CallEventLogger (log user_spoke, track idle time) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ User Context Aggregator (accumulate user turn) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ LLM Service (streaming token generation) │
│ OpenAI GPT-4.1 · Google Gemini · Grok │
│ + function calling (tools & transitions) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ TTS Service (streaming text-to-speech) │
│ Cartesia · ElevenLabs · Sarvam · Deepgram · OpenAI │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Transcript Collector (capture assistant text) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Transport Output (serialize → provider frames) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Assistant Context Aggregator (accumulate bot turn) │
└──────────────────────────────────────────────────────┘
Turn-Taking
Turn-taking determines when the user starts and stops speaking, and when the bot should respond. The system uses two different strategies depending on whether flow nodes are active.
Standard Mode (no flow nodes)
- Start-of-turn: VAD (Voice Activity Detection) → Transcription → MinWords(3)
- The user must speak at least 3 words before the bot considers it a real turn. This prevents false barge-ins from background noise.
- End-of-turn: `LocalSmartTurnAnalyzerV3`, an ONNX-based model (~265 ms latency) that detects natural pauses and sentence boundaries
- Mute strategies:
  - `MuteUntilFirstBotCompleteUserMuteStrategy` — user input is muted until the greeting finishes playing
  - `FunctionCallUserMuteStrategy` — user input is muted while a function call is executing
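The standard-mode gating above can be sketched as follows (class names and shapes are illustrative, not the actual Pipecat strategy API):

```python
# Illustrative sketch of standard-mode turn gating; not the real Pipecat classes.
class MinWordsGate:
    """Treat a transcript as a real turn only if it has enough words,
    which filters out false barge-ins from background noise."""
    def __init__(self, min_words: int = 3):
        self.min_words = min_words

    def is_real_turn(self, transcript: str) -> bool:
        return len(transcript.split()) >= self.min_words

class MuteUntilFirstBotComplete:
    """Drop user input until the greeting has finished playing."""
    def __init__(self):
        self.greeting_done = False

    def should_mute(self) -> bool:
        return not self.greeting_done

gate = MinWordsGate(3)
mute = MuteUntilFirstBotComplete()
```

With this shape, a one-word "uh" never opens a turn, and nothing the caller says during the greeting reaches the LLM.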
Flow Mode (with flow nodes)
- Start-of-turn: VAD → Transcription only (no MinWords gate)
- End-of-turn: `LocalSmartTurnAnalyzerV3` (same as standard)
- Mute strategies:
  - `CallbackUserMuteStrategy` — delegates to `FlowEngine.should_mute_user()` for per-node control
  - `FunctionCallUserMuteStrategy` — mutes during function execution
- Per-node `allow_interrupt` flag controls whether the user can interrupt bot speech
Idle Detection
A background task monitors silence during calls:
| Threshold | Action |
|---|---|
| 25 seconds of silence | Inject "Are you still there?" prompt into LLM |
| 50 seconds of silence | Say goodbye and end the call |
These thresholds are configurable per-agent via `pipeline_settings`.
The idle timer resets whenever the user speaks. In flow mode, the flow engine's `on_user_turn` callback also resets the timer and releases the transition lock.
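The threshold logic can be modeled as a simple synchronous decision function (the real monitor is a background task, and the 25 s / 50 s defaults are overridable per-agent as noted above; the function name is illustrative):

```python
# Sketch of the idle-detection decision: 25 s of silence triggers a
# re-engagement prompt, 50 s ends the call. Defaults match the table above.
def idle_action(silence_seconds: float,
                prompt_after: float = 25.0,
                hangup_after: float = 50.0) -> str:
    if silence_seconds >= hangup_after:
        return "end_call"        # say goodbye and hang up
    if silence_seconds >= prompt_after:
        return "inject_prompt"   # "Are you still there?"
    return "wait"
```

Checking the longer threshold first matters: at 55 seconds the call should end, not re-prompt.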
Greeting
The greeting uses a dual-gate mechanism:
- The greeting is queued only after both: the PipelineTask is created and the WebSocket client connects
- The greeting is sent as a `TTSSpeakFrame`, bypassing the LLM entirely — this saves ~1.3 seconds on first audio
Greeting modes
- Static greeting: If the agent has a `greeting` field, it is spoken directly via TTS
- Template greeting: Supports substitution from outbound call context (e.g., "Hello ")
- Dynamic greeting (flow mode only): If no greeting is set, the LLM generates one based on the initial node's instructions
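The dual-gate can be sketched as a small state holder that queues the greeting only once both conditions hold, in either arrival order (class and method names are illustrative, not the real implementation):

```python
# Sketch of the dual-gate: the greeting is queued only once BOTH the
# pipeline task exists and the WebSocket client is connected.
class GreetingGate:
    def __init__(self):
        self.task_ready = False
        self.client_connected = False
        self.queued = False

    def _maybe_queue(self):
        if self.task_ready and self.client_connected and not self.queued:
            self.queued = True  # here the real system pushes a TTSSpeakFrame

    def on_task_created(self):
        self.task_ready = True
        self._maybe_queue()

    def on_client_connected(self):
        self.client_connected = True
        self._maybe_queue()

gate = GreetingGate()
gate.on_client_connected()  # client may arrive before the task exists
gate.on_task_created()      # second gate opens -> greeting queued once
```

Order independence is the point: whichever event fires last triggers the queue, and the `queued` flag makes it idempotent.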
System Prompt Injection
Every agent gets these voice-specific rules appended to their system prompt:
- Keep responses under 2 sentences
- No markdown, bullet points, or special characters
- Spell out numbers ("twenty five dollars", not "$25")
- No filler openers ("Certainly", "Absolutely")
- Call `end_call` when the conversation is done
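A minimal sketch of the injection step, assuming the rules are appended as a single block after the agent's own prompt (constant and function names are illustrative):

```python
# Sketch of appending the voice-specific rules to an agent's system prompt.
VOICE_RULES = (
    "Keep responses under 2 sentences. "
    "No markdown, bullet points, or special characters. "
    "Spell out numbers (say 'twenty five dollars', not '$25'). "
    "No filler openers like 'Certainly' or 'Absolutely'. "
    "Call end_call when the conversation is done."
)

def build_system_prompt(agent_prompt: str) -> str:
    """Agent-specific instructions first, then the shared voice rules."""
    return f"{agent_prompt}\n\n{VOICE_RULES}"

prompt = build_system_prompt("You are a billing assistant for Acme Corp.")
```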
Function Calling
Standard Mode
The LLM sees all tools listed in the agent's configuration plus the built-in end_call function. When the LLM calls a tool:
- Tool registry looks up the function (webhook or built-in)
- For webhook tools: HTTP request is made to the configured endpoint
- Result is passed back to the LLM via `result_callback`
- LLM incorporates the result into its next response
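The lookup-then-callback flow above can be sketched like this (the registry, decorator, and callback shapes are illustrative, not the actual tool-registry API):

```python
# Sketch of standard-mode dispatch: look the function up in a registry,
# run it (webhook or built-in), hand the result back via a callback.
import asyncio

REGISTRY = {}

def tool(name):
    """Register a built-in tool under a name the LLM can call."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register

@tool("end_call")
async def end_call(args):
    return {"status": "call_ended"}

async def dispatch(name, args, result_callback):
    result = await REGISTRY[name](args)
    await result_callback(result)  # the LLM sees this on its next turn

results = []
async def capture(result):
    results.append(result)

asyncio.run(dispatch("end_call", {}, capture))
```

A webhook tool would follow the same path, with the registered function making the HTTP request to the configured endpoint before the callback fires.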
Flow Mode
The LLM only sees functions defined on the current flow node:
- Transition functions: Switch to a different node. The LLM context is rebuilt with the new node's instructions and tools.
- Tool functions: Fire webhook tools and return results to the LLM within the same node.
See Flow Engine for details on node transitions.
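As a sketch of node-scoped function exposure (node names, fields, and the data shape here are invented for illustration; the real behavior lives in the flow engine):

```python
# Sketch of flow mode: the LLM only sees the current node's functions,
# and a transition function swaps the active node (rebuilding context).
NODES = {
    "collect_info": {
        "functions": ["save_name", "goto_scheduling"],
        "transitions": {"goto_scheduling": "scheduling"},
    },
    "scheduling": {
        "functions": ["book_slot", "end_call"],
        "transitions": {},
    },
}

def visible_functions(node_id: str) -> list[str]:
    """Only the current node's functions are offered to the LLM."""
    return NODES[node_id]["functions"]

def apply_transition(node_id: str, fn_name: str) -> str:
    """Return the new active node; tool functions leave the node unchanged."""
    return NODES[node_id]["transitions"].get(fn_name, node_id)
```

So `book_slot` is simply invisible while `collect_info` is active, and calling `goto_scheduling` is what makes it appear.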
Recording
The AudioRecorder processor captures raw PCM audio frames throughout the call:
- Audio is buffered in memory during the call
- On disconnect, the buffer is written as a WAV file to `recordings/YYYY/MM/{call_id}.wav`
- The file path is saved to `Call.recording_path` in the database
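The buffer-then-write behavior can be sketched with the standard-library `wave` module (the class name mirrors the processor above, but the sample rate and width here are assumptions, and the real recorder writes to disk rather than returning bytes):

```python
# Sketch of the recorder: buffer raw PCM in memory during the call,
# then emit a WAV container on disconnect.
import io
import wave

class AudioRecorderSketch:
    def __init__(self, sample_rate=8000, channels=1, sample_width=2):
        self._buf = bytearray()
        self._rate = sample_rate
        self._channels = channels
        self._width = sample_width  # bytes per sample (2 = 16-bit PCM)

    def on_audio_frame(self, pcm: bytes):
        self._buf.extend(pcm)  # accumulate raw PCM in memory

    def to_wav(self) -> bytes:
        out = io.BytesIO()
        with wave.open(out, "wb") as wav:
            wav.setnchannels(self._channels)
            wav.setsampwidth(self._width)
            wav.setframerate(self._rate)
            wav.writeframes(bytes(self._buf))
        return out.getvalue()

rec = AudioRecorderSketch()
rec.on_audio_frame(b"\x00\x00" * 160)  # 20 ms of silence at 8 kHz, 16-bit
wav_bytes = rec.to_wav()
```

Buffering in memory keeps the hot path allocation-only; the WAV header is written once, at disconnect.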
AI Service Providers
Speech-to-Text
| Provider | Models | Default |
|---|---|---|
| Deepgram | nova-3-general, nova-2 | nova-3-general |
| Sarvam | saaras:v3, saarika:v2.5 | saaras:v3 |
| OpenAI | whisper-1 | whisper-1 |
| ElevenLabs | default | default |
Sarvam supports `stt_mode`: `transcribe`, `translate`, `verbatim`.
Large Language Model
| Provider | Models | Default |
|---|---|---|
| OpenAI | gpt-4.1-nano, gpt-4.1-mini, gpt-4o, o1, o3 | gpt-4.1-nano |
| Google | gemini-2.5-flash, gemini-2.5-pro | gemini-2.5-flash |
| Grok (xAI) | grok-3-beta | grok-3-beta |
Reasoning models (o1, o3) are discouraged for voice due to 5–15 second latency. If used, `reasoning_effort` is forced to `low`.
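The guardrail can be sketched as a settings builder that clamps the effort for reasoning models regardless of what was requested (the function name and config shape are illustrative):

```python
# Sketch of the reasoning-model guardrail: o1/o3 always get
# reasoning_effort "low", whatever the caller asked for.
REASONING_MODELS = {"o1", "o3"}

def llm_settings(provider: str, model: str,
                 reasoning_effort: str = "medium") -> dict:
    settings = {"provider": provider, "model": model}
    if model in REASONING_MODELS:
        # forced to low for voice latency, ignoring the requested value
        settings["reasoning_effort"] = "low"
    return settings
```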
Text-to-Speech
| Provider | Models | Default Voice |
|---|---|---|
| Cartesia | sonic-3, sonic-2 | Barbershop Man (79a125e8-...) |
| ElevenLabs | eleven_flash_v2_5, eleven_turbo_v2_5 | Rachel (21m00Tcm...) |
| Sarvam | bulbul:v2, bulbul:v3-beta | Anushka |
| Deepgram | aura-2 | Thalia (aura-2-thalia-en) |
| OpenAI | tts-1, tts-1-hd | Alloy |
Cartesia supports additional settings: `speed` (default 1.05) and `emotion`.
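The per-provider defaults in the table can be sketched as a lookup with overrides; the truncated voice IDs are left abbreviated as in the table, and the dict shape is an assumption, not the real config schema:

```python
# Sketch of per-provider TTS defaults from the table above. Cartesia's
# extra speed setting is applied only for that provider.
TTS_DEFAULTS = {
    "cartesia":   {"model": "sonic-3", "voice": "Barbershop Man", "speed": 1.05},
    "elevenlabs": {"model": "eleven_flash_v2_5", "voice": "Rachel"},
    "sarvam":     {"model": "bulbul:v2", "voice": "Anushka"},
    "deepgram":   {"model": "aura-2", "voice": "aura-2-thalia-en"},
    "openai":     {"model": "tts-1", "voice": "Alloy"},
}

def tts_settings(provider: str, **overrides) -> dict:
    """Start from the provider's defaults, then apply any overrides."""
    return {**TTS_DEFAULTS[provider], **overrides}
```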