Voice Pipeline
The voice pipeline is the real-time audio processing chain that powers every call. It is built on Pipecat, an open-source framework for voice and multimodal AI agents.
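As a rough, framework-agnostic sketch of the chain described below (plain strings and functions standing in for Pipecat processors, so none of these names are the real API), the ordered stages a frame passes through look like this:

```python
# Illustrative only: stage names mirror the architecture diagram below;
# the real system wires Pipecat processors, not plain strings.
STAGES = [
    "transport_input",               # deserialize provider frames
    "audio_recorder",                # capture raw PCM
    "stt",                           # streaming speech-to-text
    "call_event_logger",             # log user_spoke, track idle time
    "user_context_aggregator",       # accumulate user turn
    "llm",                           # streaming token generation
    "tts",                           # streaming text-to-speech
    "transcript_collector",          # capture assistant text
    "transport_output",              # serialize to provider frames
    "assistant_context_aggregator",  # accumulate bot turn
]

def trace_frame(frame: dict) -> dict:
    """Record the order in which a frame would visit each stage."""
    frame["trace"] = list(STAGES)
    return frame

frame = trace_frame({"audio": b"\x00\x01"})
```

The key property is strict ordering: audio is recorded before transcription, and the assistant's context is aggregated only after its audio has been serialized back to the provider.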
Pipeline Architecture
Telephony Provider (WebSocket)
│
▼
┌──────────────────────────────────────────────────────┐
│ Transport Input (deserialize provider frames) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Audio Recorder (capture raw PCM → WAV file) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ STT Service (streaming speech-to-text) │
│ Deepgram · Sarvam · OpenAI · ElevenLabs │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ CallEventLogger (log user_spoke, track idle time) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ User Context Aggregator (accumulate user turn) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ LLM Service (streaming token generation) │
│ OpenAI GPT-4.1 · Google Gemini · Grok │
│ + function calling (tools & transitions) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ TTS Service (streaming text-to-speech) │
│ Cartesia · ElevenLabs · Sarvam · Deepgram · OpenAI │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Transcript Collector (capture assistant text) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Transport Output (serialize → provider frames) │
└──┬───────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Assistant Context Aggregator (accumulate bot turn) │
└──────────────────────────────────────────────────────┘
Turn-Taking
Turn-taking determines when the user starts and stops speaking, and when the bot should respond. The system uses two different strategies depending on whether flow nodes are active.
Standard Mode (no flow nodes)
- Start-of-turn: VAD (Voice Activity Detection) → Transcription → MinWords(3)
- The user must speak at least 3 words before the bot considers it a real turn. This prevents false barge-ins from background noise.
- End-of-turn: `LocalSmartTurnAnalyzerV3`, an ONNX-based model (~265 ms latency) that detects natural pauses and sentence boundaries
- Mute strategies:
  - `MuteUntilFirstBotCompleteUserMuteStrategy` — user input is muted until the greeting finishes playing
  - `FunctionCallUserMuteStrategy` — user input is muted while a function call is executing
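The standard-mode gating above can be sketched as follows (class names and shapes are illustrative, not the actual Pipecat strategy API):

```python
# Illustrative sketch of standard-mode turn gating; not the real Pipecat classes.
class MinWordsGate:
    """Treat a transcript as a real turn only if it has enough words,
    which filters out false barge-ins from background noise."""
    def __init__(self, min_words: int = 3):
        self.min_words = min_words

    def is_real_turn(self, transcript: str) -> bool:
        return len(transcript.split()) >= self.min_words

class MuteUntilFirstBotComplete:
    """Drop user input until the greeting has finished playing."""
    def __init__(self):
        self.greeting_done = False

    def should_mute(self) -> bool:
        return not self.greeting_done

gate = MinWordsGate(3)
mute = MuteUntilFirstBotComplete()
```

With this shape, a one-word "uh" never opens a turn, and nothing the caller says during the greeting reaches the LLM.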
Flow Mode (with flow nodes)
- Start-of-turn: VAD → Transcription only (no MinWords gate)
- End-of-turn: `LocalSmartTurnAnalyzerV3` (same as standard)
- Mute strategies:
  - `CallbackUserMuteStrategy` — delegates to `FlowEngine.should_mute_user()` for per-node control
  - `FunctionCallUserMuteStrategy` — mutes during function execution
- Per-node `allow_interrupt` flag controls whether the user can interrupt bot speech
Idle Detection
A background task monitors silence during calls:
| Threshold | Action |
|---|---|
| 25 seconds of silence | Inject "Are you still there?" prompt into LLM |
| 50 seconds of silence | Say goodbye and end the call |
These thresholds are configurable per-agent via `pipeline_settings`.
The idle timer resets whenever the user speaks. In flow mode, the flow engine's `on_user_turn` callback also resets the timer and releases the transition lock.
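The threshold logic can be modeled as a simple synchronous decision function (the real monitor is a background task, and the 25 s / 50 s defaults are overridable per-agent as noted above; the function name is illustrative):

```python
# Sketch of the idle-detection decision: 25 s of silence triggers a
# re-engagement prompt, 50 s ends the call. Defaults match the table above.
def idle_action(silence_seconds: float,
                prompt_after: float = 25.0,
                hangup_after: float = 50.0) -> str:
    if silence_seconds >= hangup_after:
        return "end_call"        # say goodbye and hang up
    if silence_seconds >= prompt_after:
        return "inject_prompt"   # "Are you still there?"
    return "wait"
```

Checking the longer threshold first matters: at 55 seconds the call should end, not re-prompt.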
Greeting
The greeting uses a dual-gate mechanism:
- The greeting is queued only after both: the PipelineTask is created and the WebSocket client connects
- The greeting is sent as a `TTSSpeakFrame`, bypassing the LLM entirely — this saves ~1.3 seconds on first audio
Greeting modes
- Static greeting: If the agent has a `greeting` field, it is spoken directly via TTS
- Template greeting: Supports substitution from outbound call context (e.g., "Hello ")
- Dynamic greeting (flow mode only): If no greeting is set, the LLM generates one based on the initial node's instructions
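The dual-gate can be sketched as a small state holder that queues the greeting only once both conditions hold, in either arrival order (class and method names are illustrative, not the real implementation):

```python
# Sketch of the dual-gate: the greeting is queued only once BOTH the
# pipeline task exists and the WebSocket client is connected.
class GreetingGate:
    def __init__(self):
        self.task_ready = False
        self.client_connected = False
        self.queued = False

    def _maybe_queue(self):
        if self.task_ready and self.client_connected and not self.queued:
            self.queued = True  # here the real system pushes a TTSSpeakFrame

    def on_task_created(self):
        self.task_ready = True
        self._maybe_queue()

    def on_client_connected(self):
        self.client_connected = True
        self._maybe_queue()

gate = GreetingGate()
gate.on_client_connected()  # client may arrive before the task exists
gate.on_task_created()      # second gate opens -> greeting queued once
```

Order independence is the point: whichever event fires last triggers the queue, and the `queued` flag makes it idempotent.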
System Prompt Injection
Every agent gets these voice-specific rules appended to their system prompt:
- Keep responses under 2 sentences
- No markdown, bullet points, or special characters
- Spell out numbers ("twenty five dollars", not "$25")
- No filler openers ("Certainly", "Absolutely")
- Call `end_call` when the conversation is done
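A minimal sketch of the injection step, assuming the rules are appended as a single block after the agent's own prompt (constant and function names are illustrative):

```python
# Sketch of appending the voice-specific rules to an agent's system prompt.
VOICE_RULES = (
    "Keep responses under 2 sentences. "
    "No markdown, bullet points, or special characters. "
    "Spell out numbers (say 'twenty five dollars', not '$25'). "
    "No filler openers like 'Certainly' or 'Absolutely'. "
    "Call end_call when the conversation is done."
)

def build_system_prompt(agent_prompt: str) -> str:
    """Agent-specific instructions first, then the shared voice rules."""
    return f"{agent_prompt}\n\n{VOICE_RULES}"

prompt = build_system_prompt("You are a billing assistant for Acme Corp.")
```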
Function Calling
Standard Mode
The LLM sees all tools listed in the agent's configuration plus the built-in end_call function. When the LLM calls a tool:
- Tool registry looks up the function (webhook or built-in)
- For webhook tools: HTTP request is made to the configured endpoint
- Result is passed back to the LLM via `result_callback`
- LLM incorporates the result into its next response
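The lookup-then-callback flow above can be sketched like this (the registry, decorator, and callback shapes are illustrative, not the actual tool-registry API):

```python
# Sketch of standard-mode dispatch: look the function up in a registry,
# run it (webhook or built-in), hand the result back via a callback.
import asyncio

REGISTRY = {}

def tool(name):
    """Register a built-in tool under a name the LLM can call."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register

@tool("end_call")
async def end_call(args):
    return {"status": "call_ended"}

async def dispatch(name, args, result_callback):
    result = await REGISTRY[name](args)
    await result_callback(result)  # the LLM sees this on its next turn

results = []
async def capture(result):
    results.append(result)

asyncio.run(dispatch("end_call", {}, capture))
```

A webhook tool would follow the same path, with the registered function making the HTTP request to the configured endpoint before the callback fires.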
Flow Mode
The LLM only sees functions defined on the current flow node:
- Transition functions: Switch to a different node. The LLM context is rebuilt with the new node's instructions and tools.
- Tool functions: Fire webhook tools and return results to the LLM within the same node.
See Flow Engine for details on node transitions.
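As a sketch of node-scoped function exposure (node names, fields, and the data shape here are invented for illustration; the real behavior lives in the flow engine):

```python
# Sketch of flow mode: the LLM only sees the current node's functions,
# and a transition function swaps the active node (rebuilding context).
NODES = {
    "collect_info": {
        "functions": ["save_name", "goto_scheduling"],
        "transitions": {"goto_scheduling": "scheduling"},
    },
    "scheduling": {
        "functions": ["book_slot", "end_call"],
        "transitions": {},
    },
}

def visible_functions(node_id: str) -> list[str]:
    """Only the current node's functions are offered to the LLM."""
    return NODES[node_id]["functions"]

def apply_transition(node_id: str, fn_name: str) -> str:
    """Return the new active node; tool functions leave the node unchanged."""
    return NODES[node_id]["transitions"].get(fn_name, node_id)
```

So `book_slot` is simply invisible while `collect_info` is active, and calling `goto_scheduling` is what makes it appear.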
Recording
The AudioRecorder processor captures raw PCM audio frames throughout the call:
- Audio is buffered in memory during the call
- On disconnect, the buffer is written as a WAV file to `recordings/YYYY/MM/{call_id}.wav`
- The file path is saved to `Call.recording_path` in the database
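The buffer-then-write behavior can be sketched with the standard-library `wave` module (the class name mirrors the processor above, but the sample rate and width here are assumptions, and the real recorder writes to disk rather than returning bytes):

```python
# Sketch of the recorder: buffer raw PCM in memory during the call,
# then emit a WAV container on disconnect.
import io
import wave

class AudioRecorderSketch:
    def __init__(self, sample_rate=8000, channels=1, sample_width=2):
        self._buf = bytearray()
        self._rate = sample_rate
        self._channels = channels
        self._width = sample_width  # bytes per sample (2 = 16-bit PCM)

    def on_audio_frame(self, pcm: bytes):
        self._buf.extend(pcm)  # accumulate raw PCM in memory

    def to_wav(self) -> bytes:
        out = io.BytesIO()
        with wave.open(out, "wb") as wav:
            wav.setnchannels(self._channels)
            wav.setsampwidth(self._width)
            wav.setframerate(self._rate)
            wav.writeframes(bytes(self._buf))
        return out.getvalue()

rec = AudioRecorderSketch()
rec.on_audio_frame(b"\x00\x00" * 160)  # 20 ms of silence at 8 kHz, 16-bit
wav_bytes = rec.to_wav()
```

Buffering in memory keeps the hot path allocation-only; the WAV header is written once, at disconnect.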
AI Service Providers
Speech-to-Text
| Provider | Models | Default |
|---|---|---|
| Deepgram | nova-3-general, nova-2 | nova-3-general |
| Sarvam | saaras:v3, saarika:v2.5 | saaras:v3 |
| OpenAI | whisper-1 | whisper-1 |
| ElevenLabs | default | default |
Sarvam supports `stt_mode`: `transcribe`, `translate`, `verbatim`.
Large Language Model
| Provider | Models | Default |
|---|---|---|
| OpenAI | gpt-4.1-nano, gpt-4.1-mini, gpt-4o, o1, o3 | gpt-4.1-nano |
| Google | gemini-2.5-flash, gemini-2.5-pro | gemini-2.5-flash |
| Grok (xAI) | grok-3-beta | grok-3-beta |
Reasoning models (o1, o3) are discouraged for voice due to 5–15 second latency. If used, `reasoning_effort` is forced to `low`.
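The guardrail can be sketched as a settings builder that clamps the effort for reasoning models regardless of what was requested (the function name and config shape are illustrative):

```python
# Sketch of the reasoning-model guardrail: o1/o3 always get
# reasoning_effort "low", whatever the caller asked for.
REASONING_MODELS = {"o1", "o3"}

def llm_settings(provider: str, model: str,
                 reasoning_effort: str = "medium") -> dict:
    settings = {"provider": provider, "model": model}
    if model in REASONING_MODELS:
        # forced to low for voice latency, ignoring the requested value
        settings["reasoning_effort"] = "low"
    return settings
```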
Text-to-Speech
| Provider | Models | Default Voice |
|---|---|---|
| Cartesia | sonic-3, sonic-2 | Barbershop Man (79a125e8-...) |
| ElevenLabs | eleven_flash_v2_5, eleven_turbo_v2_5 | Rachel (21m00Tcm...) |
| Sarvam | bulbul:v2, bulbul:v3-beta | Anushka |
| Deepgram | aura-2 | Thalia (aura-2-thalia-en) |
| OpenAI | tts-1, tts-1-hd | Alloy |
Cartesia supports additional settings: `speed` (default 1.05) and `emotion`.
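The per-provider defaults in the table can be sketched as a lookup with overrides; the truncated voice IDs are left abbreviated as in the table, and the dict shape is an assumption, not the real config schema:

```python
# Sketch of per-provider TTS defaults from the table above. Cartesia's
# extra speed setting is applied only for that provider.
TTS_DEFAULTS = {
    "cartesia":   {"model": "sonic-3", "voice": "Barbershop Man", "speed": 1.05},
    "elevenlabs": {"model": "eleven_flash_v2_5", "voice": "Rachel"},
    "sarvam":     {"model": "bulbul:v2", "voice": "Anushka"},
    "deepgram":   {"model": "aura-2", "voice": "aura-2-thalia-en"},
    "openai":     {"model": "tts-1", "voice": "Alloy"},
}

def tts_settings(provider: str, **overrides) -> dict:
    """Start from the provider's defaults, then apply any overrides."""
    return {**TTS_DEFAULTS[provider], **overrides}
```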