Architecture

Overview

This platform runs configurable AI voice agents that handle inbound and outbound phone calls. It processes speech in real time using a modular pipeline, supports multi-step conversation flows, and persists call artifacts (transcripts, events, and recordings) for review.

System Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                         Telephony Providers                         │
│                   Twilio  ·  Exotel  ·  Vobiz                       │
└──────────┬──────────────────────────────────────────┬───────────────┘
           │  HTTP webhooks                           │  WebSocket (audio)
           ▼                                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          FastAPI Backend                             │
│                                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────────┐  │
│  │  REST API     │  │  Webhooks    │  │  WebSocket Handlers       │  │
│  │  /api/*       │  │  /telephony/ │  │  /ws/{provider}/{agent}   │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬────────────────┘  │
│         │                 │                      │                   │
│         ▼                 ▼                      ▼                   │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │                    Voice Pipeline (Pipecat)                  │    │
│  │                                                              │    │
│  │  Audio In → STT → LLM (+ tools) → TTS → Audio Out           │    │
│  │            ↑                                                 │    │
│  │       Flow Engine (optional multi-node state machine)        │    │
│  └──────────────────────────────────────────────────────────────┘    │
│         │                                                           │
│  ┌──────┴───────┐  ┌──────────────┐  ┌──────────────┐              │
│  │  PostgreSQL   │  │    Redis     │  │  Filesystem   │              │
│  │  calls,agents │  │  sessions    │  │  recordings   │              │
│  │  events,flows │  │              │  │               │              │
│  └──────────────┘  └──────────────┘  └──────────────┘              │
└─────────────────────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     React Dashboard (admin UI)                      │
│          Agents · Calls · Flow Editor · Tools · Settings            │
└─────────────────────────────────────────────────────────────────────┘

Component Summary

Backend API (`api/`)

FastAPI application providing:

Agent CRUD and AI-powered agent generation
Call history, events, and recording retrieval
Outbound call initiation (provider-agnostic)
Inbound telephony webhooks (TwiML, ExoML, Vobiz XML)
WebSocket endpoints for real-time audio streaming
Flow node management (visual editor backend)
Webhook tool management
Telephony settings configuration

Voice Runtime (`agent/`)

Real-time voice processing built on Pipecat:

pipeline.py — Assembles the STT → LLM → TTS chain with turn-taking strategies, idle detection, and recording
flow_engine.py — State machine that drives multi-node conversations with per-node instructions, tools, and transitions
outbound.py — Provider-agnostic outbound call initiation
recording.py — Captures raw audio frames and writes WAV files
processors.py — Custom frame processors for transcript collection and event logging
trace.py — Fire-and-forget async event persistence
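
The fire-and-forget persistence described for trace.py can be sketched with plain asyncio. This is a minimal illustration under assumptions, not the project's code: `EventTracer`, `emit`, `drain`, and the sink signature are hypothetical names chosen for the example.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("trace")

# Hypothetical sink type: any coroutine function that persists one event.
EventSink = Callable[[str, dict[str, Any]], Awaitable[None]]


class EventTracer:
    """Schedules event writes so the audio path never waits on storage."""

    def __init__(self, sink: EventSink) -> None:
        self._sink = sink
        # Keep strong references so pending tasks are not garbage collected.
        self._pending: set = set()

    def emit(self, event_type: str, payload: dict[str, Any]) -> None:
        """Schedule a write and return immediately (must run inside a loop)."""
        task = asyncio.create_task(self._safe_write(event_type, payload))
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)

    async def _safe_write(self, event_type: str, payload: dict[str, Any]) -> None:
        try:
            await self._sink(event_type, payload)
        except Exception:
            # A failed write is logged and swallowed so it cannot end the call.
            logger.warning("trace write failed for %s", event_type, exc_info=True)

    async def drain(self) -> None:
        """Wait for outstanding writes, for example when the call ends."""
        if self._pending:
            await asyncio.gather(*self._pending)
```

Holding task references in a set matters here: the event loop keeps only weak references to tasks, so a task created and immediately dropped may be collected before it finishes.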

Telephony Layer (`telephony/`)

Provider abstraction supporting three telephony backends:

Provider	Audio Format	Serializer
Twilio	mu-law 8 kHz	`TwilioFrameSerializer`
Exotel	PCM 8 kHz	`ExotelFrameSerializer`
Vobiz	mu-law 8 kHz	`VobizFrameSerializer`

Each provider implements: transport creation, webhook response building, and outbound call initiation.
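
One way to picture the three responsibilities is an abstract base class with a small registry. The sketch below is an assumption about shape only: the class, method, and registry names are hypothetical, and only the audio formats come from the table above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    encoding: str        # "mulaw" or "pcm"
    sample_rate_hz: int


# Audio formats as listed in the provider table.
AUDIO_FORMATS = {
    "twilio": AudioFormat("mulaw", 8000),
    "exotel": AudioFormat("pcm", 8000),
    "vobiz": AudioFormat("mulaw", 8000),
}


class TelephonyProvider(ABC):
    """Hypothetical base class; method names are illustrative."""

    name: str

    @property
    def audio_format(self) -> AudioFormat:
        return AUDIO_FORMATS[self.name]

    @abstractmethod
    def create_transport(self, websocket: object) -> object:
        """Wrap the provider WebSocket in a pipeline transport."""

    @abstractmethod
    def build_webhook_response(self, stream_url: str) -> str:
        """Return the provider-specific XML that points at stream_url."""

    @abstractmethod
    def initiate_outbound_call(self, to_number: str, callback_url: str) -> str:
        """Start a call through the provider REST API and return its call ID."""


PROVIDERS: dict[str, type[TelephonyProvider]] = {}


def register(cls: type[TelephonyProvider]) -> type[TelephonyProvider]:
    """Class decorator that makes a provider available by name."""
    PROVIDERS[cls.name] = cls
    return cls


def get_provider(name: str) -> TelephonyProvider:
    key = name.lower()
    if key not in PROVIDERS:
        raise ValueError(f"unsupported telephony provider: {name!r}")
    return PROVIDERS[key]()
```

With this shape, webhook and WebSocket handlers can look up a provider from the `{provider}` path segment and stay free of provider-specific branches.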

AI Services (`services/`)

Pluggable factories for each AI capability:

STT — Deepgram, Sarvam, OpenAI, ElevenLabs
LLM — OpenAI (GPT-4.1 family), Google Gemini, Grok (xAI)
TTS — Cartesia, ElevenLabs, Sarvam, Deepgram, OpenAI

Each agent selects a provider and model for every service through its database configuration.
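
A factory keyed by capability and provider name could look like the following. This is a sketch under assumptions: the registry, the decorator, the config layout, and the placeholder builders are all hypothetical, and the builders return plain dictionaries rather than real SDK clients.

```python
from typing import Callable

# Hypothetical builder type: takes a model name, returns a service object.
Builder = Callable[[str], object]

_FACTORIES: dict[tuple[str, str], Builder] = {}


def register_factory(kind: str, provider: str) -> Callable[[Builder], Builder]:
    """Register a builder for one (capability, provider) pair."""
    def decorator(builder: Builder) -> Builder:
        _FACTORIES[(kind, provider)] = builder
        return builder
    return decorator


def create_service(kind: str, provider: str, model: str) -> object:
    builder = _FACTORIES.get((kind, provider.lower()))
    if builder is None:
        raise ValueError(f"no {kind} factory registered for provider {provider!r}")
    return builder(model)


# Placeholder builders; a real one would construct the provider's client.
@register_factory("stt", "deepgram")
def build_deepgram_stt(model: str) -> object:
    return {"kind": "stt", "provider": "deepgram", "model": model}


@register_factory("llm", "openai")
def build_openai_llm(model: str) -> object:
    return {"kind": "llm", "provider": "openai", "model": model}


@register_factory("tts", "cartesia")
def build_cartesia_tts(model: str) -> object:
    return {"kind": "tts", "provider": "cartesia", "model": model}


def build_agent_services(agent_config: dict) -> dict[str, object]:
    """Resolve the STT, LLM, and TTS services named in an agent's config."""
    return {
        kind: create_service(kind, agent_config[kind]["provider"], agent_config[kind]["model"])
        for kind in ("stt", "llm", "tts")
    }
```

Adding a provider then means registering one more builder; the pipeline assembly code does not change.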

Tool System (`tools/`)

Function calling via webhook-based tools:

Tools are stored in the database with an HTTP endpoint, a parameter schema, and headers
The LLM calls tools during conversation; results are passed back as context
Flow nodes can scope which tools are available at each step
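
The points above can be sketched with the standard library. The field names, the JSON-Schema-like `required` key, and the default method and timeout are assumptions made for illustration; the example URL is a placeholder.

```python
import json
from dataclasses import dataclass, field
from urllib.request import Request


@dataclass(frozen=True)
class WebhookTool:
    tool_id: int
    name: str
    endpoint: str
    parameters: dict                      # assumed JSON-Schema-like object
    headers: dict = field(default_factory=dict)
    method: str = "POST"
    timeout_s: float = 10.0


def tools_for_node(tools: list[WebhookTool], node_tool_ids: list[int]) -> list[WebhookTool]:
    """Keep only the tools a flow node exposes to the LLM."""
    allowed = set(node_tool_ids)
    return [tool for tool in tools if tool.tool_id in allowed]


def build_request(tool: WebhookTool, arguments: dict) -> Request:
    """Validate required arguments and prepare the HTTP call (not sent here)."""
    missing = [key for key in tool.parameters.get("required", []) if key not in arguments]
    if missing:
        raise ValueError(f"tool {tool.name!r} is missing arguments: {missing}")
    headers = {"Content-Type": "application/json", **tool.headers}
    body = json.dumps(arguments).encode("utf-8")
    return Request(tool.endpoint, data=body, headers=headers, method=tool.method)
```

The prepared request could be sent with `urllib.request.urlopen(request, timeout=tool.timeout_s)`, and the decoded response body handed back to the LLM as the tool result.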

Persistence

PostgreSQL stores agents, calls, call events, flow nodes, tools, and telephony settings.

Redis provides session caching with TTL-based expiry.

Filesystem stores WAV recordings organized as recordings/YYYY/MM/{call_id}.wav.
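
The recording layout can be illustrated with a short sketch. It assumes the year and month come from the call start time and that audio is written as 16-bit mono PCM at 8 kHz; the function names are hypothetical.

```python
import wave
from datetime import datetime
from pathlib import Path


def recording_path(root: Path, call_id: str, started_at: datetime) -> Path:
    """Build recordings/YYYY/MM/{call_id}.wav under the given root."""
    return root / f"{started_at:%Y}" / f"{started_at:%m}" / f"{call_id}.wav"


def write_wav(path: Path, pcm_frames: bytes, sample_rate_hz: int = 8000) -> Path:
    """Write raw 16-bit mono PCM frames to a WAV file, creating directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)               # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm_frames)
    return path
```

Partitioning by year and month keeps any single directory small and lets retention cleanup skip whole directories once they are empty.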

Background Workers (`workers/`)

APScheduler starts with the application and runs two jobs:

Recording cleanup — Daily at 3 AM, deletes recordings older than retention threshold
Stale call cleanup — Every 5 minutes, marks calls stuck in IN_PROGRESS for >30 minutes as completed
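
The logic of the two jobs can be sketched as pure functions, independent of the scheduler. The record type, the `COMPLETED` status spelling, and the use of file modification time for retention are assumptions for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path

STALE_AFTER = timedelta(minutes=30)


@dataclass
class CallRecord:
    call_id: str
    status: str
    started_at: datetime


def close_stale_calls(calls: list[CallRecord], now: datetime) -> list[str]:
    """Mark calls stuck in IN_PROGRESS for over 30 minutes as completed."""
    closed = []
    for call in calls:
        if call.status == "IN_PROGRESS" and now - call.started_at > STALE_AFTER:
            call.status = "COMPLETED"
            closed.append(call.call_id)
    return closed


def expired_recordings(root: Path, now: datetime, retention_days: int) -> list[Path]:
    """List WAV files whose modification time is older than the retention window."""
    cutoff = (now - timedelta(days=retention_days)).timestamp()
    return sorted(p for p in root.rglob("*.wav") if p.stat().st_mtime < cutoff)
```

With APScheduler 3.x, functions like these are typically registered through `add_job` with an `"interval"` trigger (`minutes=5`) and a `"cron"` trigger (`hour=3`).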

Dashboard (`dashboard/`)

React 19 + TypeScript admin interface:

Agent management and configuration
Call history with transcript and recording playback
Visual flow editor (XYFlow canvas)
Webhook tool management
Telephony settings

Call Lifecycle

Inbound

1. Provider sends HTTP request to POST /telephony/{provider}/inbound/{agent_name}
2. Backend returns provider-specific XML response with WebSocket stream URL
3. Provider opens WebSocket to /ws/{provider}/{agent_name}
4. WebSocket handler creates/updates call record, starts voice pipeline
5. Pipeline runs: STT → LLM → TTS with real-time audio streaming
6. Transcript, events, and recording are persisted throughout the call
7. On disconnect, call is marked completed or failed
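
For Twilio, the XML response that hands the call over to a WebSocket stream uses the TwiML `<Connect>` and `<Stream>` verbs. The sketch below builds such a response with the standard library; the helper names and the host are hypothetical, and the ExoML and Vobiz XML shapes are not shown because this document does not specify them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def build_stream_url(host: str, provider: str, agent_name: str) -> str:
    """Build the wss:// URL for /ws/{provider}/{agent_name}."""
    return f"wss://{host}/ws/{provider}/{quote(agent_name, safe='')}"


def twilio_inbound_response(stream_url: str) -> str:
    """Return TwiML that tells Twilio to open a media stream to stream_url."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", url=stream_url)
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body
```

A webhook handler would return this string with an XML content type; the provider then opens the WebSocket and the voice pipeline starts.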

Outbound

1. Client calls POST /api/calls/outbound with phone number, agent, and optional context
2. Backend creates call record in RINGING state
3. Provider REST API initiates the call with callback URL
4. When the call is answered, the provider connects over WebSocket and the same pipeline runs as for inbound calls
5. Call context (custom parameters) is injected into the LLM system prompt
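
Context injection might be as simple as appending the parameters to the agent's system prompt. This is a sketch under assumptions: the function name and the "Call context" block format are hypothetical, not the project's actual prompt layout.

```python
from typing import Mapping, Optional


def inject_call_context(system_prompt: str, context: Optional[Mapping[str, object]]) -> str:
    """Append outbound call parameters to the system prompt as a labeled block."""
    if not context:
        return system_prompt
    # Sort keys so the rendered prompt is stable across calls.
    lines = [f"- {key}: {value}" for key, value in sorted(context.items())]
    return system_prompt.rstrip() + "\n\nCall context:\n" + "\n".join(lines)
```

Keeping the context in a separate, clearly labeled block makes it easy to see in the logs which parts of the prompt came from the caller of the API rather than from the agent definition.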

Data Model

Agent

Named voice persona with full configuration:

System prompt, greeting (supports templating)
STT, LLM, and TTS provider/model/settings
Pipeline settings (VAD, turn detection, idle thresholds)
Pre-call tool IDs, context variable schema
Associated flow nodes

Call

Individual conversation record:

Phone number, direction (inbound/outbound), status
Timestamps, duration, transcript
Recording path, provider metadata
Call context (outbound parameters)
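
A call record with guarded status transitions could be modeled as below. The RINGING and IN_PROGRESS names appear elsewhere in this document; the `COMPLETED` and `FAILED` spellings, the allowed-transition table, and all field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional


class CallStatus(str, Enum):
    RINGING = "RINGING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumed transition table: terminal states allow no further changes.
ALLOWED = {
    CallStatus.RINGING: {CallStatus.IN_PROGRESS, CallStatus.FAILED},
    CallStatus.IN_PROGRESS: {CallStatus.COMPLETED, CallStatus.FAILED},
    CallStatus.COMPLETED: set(),
    CallStatus.FAILED: set(),
}


@dataclass
class Call:
    phone_number: str
    direction: str                        # "inbound" or "outbound"
    started_at: datetime
    status: CallStatus = CallStatus.RINGING
    ended_at: Optional[datetime] = None
    transcript: list = field(default_factory=list)
    recording_path: Optional[str] = None
    context: dict = field(default_factory=dict)

    def advance(self, new_status: CallStatus, at: datetime) -> None:
        """Move to new_status, rejecting transitions the table does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"cannot go from {self.status.value} to {new_status.value}")
        self.status = new_status
        if not ALLOWED[new_status]:       # terminal state reached
            self.ended_at = at

    @property
    def duration_s(self) -> Optional[float]:
        if self.ended_at is None:
            return None
        return (self.ended_at - self.started_at).total_seconds()
```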

CallEvent

Time-ordered event log per call:

call_started, call_ended
user_spoke, agent_spoke
tool_called, tool_result
node_entered, node_transition
context_injected, error
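
Because events are time-ordered, a transcript can be derived from the log alone. The sketch below assumes each spoken event carries its text under a `text` payload key; that key and the field names are hypothetical.

```python
from dataclasses import dataclass, field

# Event type names as listed above.
EVENT_TYPES = frozenset({
    "call_started", "call_ended",
    "user_spoke", "agent_spoke",
    "tool_called", "tool_result",
    "node_entered", "node_transition",
    "context_injected", "error",
})


@dataclass
class CallEvent:
    call_id: str
    event_type: str
    timestamp: float                      # seconds since the epoch
    payload: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")


def build_transcript(events: list[CallEvent]) -> list[str]:
    """Rebuild a speaker-labeled transcript from the spoken events, in time order."""
    speakers = {"user_spoke": "user", "agent_spoke": "agent"}
    ordered = sorted(events, key=lambda event: event.timestamp)
    return [
        f"{speakers[event.event_type]}: {event.payload['text']}"
        for event in ordered
        if event.event_type in speakers
    ]
```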

FlowNode

Conversation step within an agent's flow:

Node key, position, initial/terminal flags
Role messages (persona) and task messages (instructions)
Transition functions (route to next node)
Tool IDs and pre-actions
Visual editor position
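
A flow of such nodes behaves like a small state machine. The sketch below is a simplified illustration, not the project's flow engine: it assumes exactly one initial node and models each transition function as a name that maps to the next node key.

```python
from dataclasses import dataclass, field


@dataclass
class FlowNode:
    key: str
    task_message: str
    # Maps a transition function name to the key of the next node.
    transitions: dict[str, str] = field(default_factory=dict)
    tool_ids: list[int] = field(default_factory=list)
    is_initial: bool = False
    is_terminal: bool = False


class FlowEngine:
    """Tracks the current node and applies transitions requested by the LLM."""

    def __init__(self, nodes: list[FlowNode]) -> None:
        self._nodes = {node.key: node for node in nodes}
        initial = [node for node in nodes if node.is_initial]
        if len(initial) != 1:
            raise ValueError("a flow needs exactly one initial node")
        self.current = initial[0]
        self.history = [self.current.key]

    def transition(self, function_name: str) -> FlowNode:
        if self.current.is_terminal:
            raise RuntimeError(f"node {self.current.key!r} is terminal")
        if function_name not in self.current.transitions:
            raise ValueError(f"{function_name!r} is not valid in node {self.current.key!r}")
        self.current = self._nodes[self.current.transitions[function_name]]
        self.history.append(self.current.key)
        return self.current
```

On each transition, the runtime would swap in the new node's task messages and restrict the LLM to that node's tool IDs.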

Tool

Webhook-based external integration:

Name, description, parameter schema
HTTP endpoint, method, headers, timeout

TelephonySettings

Provider credentials and concurrency limits (singleton row).

Design Constraints

Telephony audio is 8 kHz — the pipeline matches this to avoid transcoding overhead
WebSocket calls are stateful — production deployments need sticky routing
Recordings are local files — multi-instance deployments need shared storage
Each call runs its own Pipecat pipeline — capacity scales with CPU, network, and provider limits
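
The 8 kHz constraint translates into small, predictable frame sizes. The arithmetic below assumes 20 ms frames and 16-bit samples for PCM; both are illustrative choices rather than values stated in this document. Mu-law uses one byte per sample.

```python
# Bytes per sample for each encoding; 16-bit PCM is an assumption.
BYTES_PER_SAMPLE = {"mulaw": 1, "pcm16": 2}


def frame_bytes(encoding: str, sample_rate_hz: int = 8000, frame_ms: int = 20) -> int:
    """Size in bytes of one mono audio frame of frame_ms milliseconds."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]


def bitrate_kbps(encoding: str, sample_rate_hz: int = 8000) -> float:
    """Raw one-way audio bitrate in kilobits per second, before framing overhead."""
    return sample_rate_hz * BYTES_PER_SAMPLE[encoding] * 8 / 1000
```

Under these assumptions a 20 ms mu-law frame is 160 bytes (64 kbps per direction) and a 16-bit PCM frame is 320 bytes (128 kbps per direction), which gives a rough lower bound on per-call network usage.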
