Architecture

Overview

This platform runs configurable AI voice agents that handle inbound and outbound phone calls. It processes speech in real time using a modular pipeline, supports multi-step conversation flows, and persists call artifacts for review.

System Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                         Telephony Providers                         │
│                   Twilio  ·  Exotel  ·  Vobiz                       │
└──────────┬──────────────────────────────────────────┬───────────────┘
           │  HTTP webhooks                           │  WebSocket (audio)
           ▼                                          ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          FastAPI Backend                             │
│                                                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────────┐  │
│  │  REST API     │  │  Webhooks    │  │  WebSocket Handlers       │  │
│  │  /api/*       │  │  /telephony/ │  │  /ws/{provider}/{agent}   │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬────────────────┘  │
│         │                 │                      │                   │
│         ▼                 ▼                      ▼                   │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │                    Voice Pipeline (Pipecat)                  │    │
│  │                                                              │    │
│  │  Audio In → STT → LLM (+ tools) → TTS → Audio Out           │    │
│  │            ↑                                                 │    │
│  │       Flow Engine (optional multi-node state machine)        │    │
│  └──────────────────────────────────────────────────────────────┘    │
│         │                                                           │
│  ┌──────┴───────┐  ┌──────────────┐  ┌──────────────┐              │
│  │  PostgreSQL   │  │    Redis     │  │  Filesystem   │              │
│  │  calls,agents │  │  sessions    │  │  recordings   │              │
│  │  events,flows │  │              │  │               │              │
│  └──────────────┘  └──────────────┘  └──────────────┘              │
└─────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────┐
│                     React Dashboard (admin UI)                      │
│          Agents · Calls · Flow Editor · Tools · Settings            │
└─────────────────────────────────────────────────────────────────────┘

Component Summary

Backend API (api/)

FastAPI application providing:

  • Agent CRUD and AI-powered agent generation
  • Call history, events, and recording retrieval
  • Outbound call initiation (provider-agnostic)
  • Inbound telephony webhooks (TwiML, ExoML, Vobiz XML)
  • WebSocket endpoints for real-time audio streaming
  • Flow node management (visual editor backend)
  • Webhook tool management
  • Telephony settings configuration

Voice Runtime (agent/)

Real-time voice processing built on Pipecat:

  • pipeline.py — Assembles the STT → LLM → TTS chain with turn-taking strategies, idle detection, and recording
  • flow_engine.py — State machine that drives multi-node conversations with per-node instructions, tools, and transitions
  • outbound.py — Provider-agnostic outbound call initiation
  • recording.py — Captures raw audio frames and writes WAV files
  • processors.py — Custom frame processors for transcript collection and event logging
  • trace.py — Fire-and-forget async event persistence

Telephony Layer (telephony/)

Provider abstraction supporting three telephony backends:

Provider   Audio Format    Serializer
Twilio     mu-law 8 kHz    TwilioFrameSerializer
Exotel     PCM 8 kHz       ExotelFrameSerializer
Vobiz      mu-law 8 kHz    VobizFrameSerializer

Each provider implements: transport creation, webhook response building, and outbound call initiation.
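
One way to express this three-part contract is an abstract base class plus a small registry. The sketch below is illustrative only: TelephonyProvider, register, get_provider, and the method names are assumptions, not the identifiers used in telephony/.

```python
from abc import ABC, abstractmethod
from typing import Any


class TelephonyProvider(ABC):
    """Hypothetical base class covering the three provider responsibilities."""

    name: str
    audio_encoding: str
    sample_rate: int = 8000  # all three providers stream 8 kHz audio

    @abstractmethod
    def create_transport(self, websocket: Any) -> Any:
        """Wrap the provider WebSocket in a transport the pipeline can use."""

    @abstractmethod
    def build_webhook_response(self, stream_url: str) -> str:
        """Return provider-specific XML that points the call at stream_url."""

    @abstractmethod
    def initiate_call(self, to_number: str, callback_url: str) -> str:
        """Start an outbound call and return the provider's call identifier."""


_REGISTRY: dict = {}


def register(cls: type) -> type:
    """Class decorator that makes a provider discoverable by name."""
    _REGISTRY[cls.name] = cls
    return cls


def get_provider(name: str) -> TelephonyProvider:
    try:
        return _REGISTRY[name.lower()]()
    except KeyError:
        raise ValueError(f"unknown telephony provider: {name!r}") from None
```

With this shape, the rest of the backend can look a provider up from the {provider} path segment and stay unaware of which vendor is behind it.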

AI Services (services/)

Pluggable factories for each AI capability:

  • STT — Deepgram, Sarvam, OpenAI, ElevenLabs
  • LLM — OpenAI (GPT-4.1 family), Google Gemini, Grok (xAI)
  • TTS — Cartesia, ElevenLabs, Sarvam, Deepgram, OpenAI

Each agent selects a provider and model for every service (STT, LLM, TTS) through its database configuration.
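
A rough sketch of how a per-agent selection might be validated before any factory is called. The provider names come from the list above; the config shape, the short keys such as "gemini" and "grok", and the resolve_services helper are assumptions made for illustration.

```python
# Provider names taken from the lists above; the dictionary keys are assumed.
SUPPORTED = {
    "stt": {"deepgram", "sarvam", "openai", "elevenlabs"},
    "llm": {"openai", "gemini", "grok"},
    "tts": {"cartesia", "elevenlabs", "sarvam", "deepgram", "openai"},
}


def resolve_services(agent_config: dict) -> dict:
    """Validate and normalise the per-service provider/model choices.

    Returns a mapping such as {"stt": ("deepgram", "<model>"), ...}.
    """
    resolved = {}
    for service, providers in SUPPORTED.items():
        choice = agent_config.get(service)
        if choice is None:
            raise ValueError(f"agent config is missing a {service!r} section")
        provider = choice["provider"].lower()
        if provider not in providers:
            raise ValueError(f"unsupported {service} provider: {provider!r}")
        resolved[service] = (provider, choice["model"])
    return resolved
```

Failing fast here means a misconfigured agent is rejected when it is saved or loaded, rather than partway through a live call.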

Tool System (tools/)

Function calling via webhook-based tools:

  • Tools are stored in the database with HTTP endpoint, parameters schema, and headers
  • The LLM can call tools mid-conversation; each result is passed back to it as context for the next response
  • Flow nodes can scope which tools are available at each step
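
The sketch below shows how stored tool rows could be turned into function definitions for the LLM and filtered per flow node. It assumes an OpenAI-style function-calling format and dictionary-shaped tool rows; to_function_schema and tools_for_node are hypothetical helpers.

```python
def to_function_schema(tool: dict) -> dict:
    """Expose only what the LLM needs: name, description, parameter schema.

    The HTTP endpoint and headers stay server-side and are never sent to the model.
    """
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }


def tools_for_node(all_tools: list, node_tool_ids: list) -> list:
    """Return function schemas for the tools a flow node is allowed to use."""
    allowed = set(node_tool_ids)
    return [to_function_schema(t) for t in all_tools if t["id"] in allowed]
```

Scoping by node keeps the list of callable functions short at each step, which tends to reduce wrong tool selections.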

Persistence

PostgreSQL stores agents, calls, call events, flow nodes, tools, and telephony settings.

Redis provides session caching with TTL-based expiry.

Filesystem stores WAV recordings organized as recordings/YYYY/MM/{call_id}.wav.
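
The recording layout can be sketched with the standard library. This assumes the year and month come from the call's start time and that audio is written as mono 16-bit PCM; neither detail is stated above, and recording_path and write_wav are hypothetical helpers.

```python
import wave
from datetime import datetime, timezone
from pathlib import Path


def recording_path(root: Path, call_id: str, started_at: datetime) -> Path:
    """Build <root>/YYYY/MM/<call_id>.wav from the call start time (assumed)."""
    return root / f"{started_at:%Y}" / f"{started_at:%m}" / f"{call_id}.wav"


def write_wav(path: Path, pcm: bytes, sample_rate: int = 8000) -> Path:
    """Write raw PCM bytes as a mono, 16-bit WAV file (assumed format)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # bytes per sample
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return path
```

Partitioning by year and month keeps directories small and lets a retention job work through old recordings one month at a time.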

Background Workers (workers/)

APScheduler starts with the application and runs two jobs:

  • Recording cleanup — Daily at 3 AM, deletes recordings older than retention threshold
  • Stale call cleanup — Every 5 minutes, marks calls stuck in IN_PROGRESS for >30 minutes as completed
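
The stale-call rule can be written as a pure function that a scheduled job would call. This is a sketch under assumptions: calls are shown as dictionaries, the started_at and ended_at fields and the "COMPLETED" status string are hypothetical, and the real job presumably runs a database query instead.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)  # threshold from the job description above


def find_stale_calls(calls: list, now: datetime) -> list:
    """Return calls that have been IN_PROGRESS for longer than STALE_AFTER."""
    return [
        call
        for call in calls
        if call["status"] == "IN_PROGRESS" and now - call["started_at"] > STALE_AFTER
    ]


def mark_completed(stale_calls: list, now: datetime) -> int:
    """Close out stale calls in place and return how many were updated."""
    for call in stale_calls:
        call["status"] = "COMPLETED"
        call["ended_at"] = now
    return len(stale_calls)
```

Keeping the selection logic free of scheduler and database code makes the 30-minute rule easy to unit test on its own.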

Dashboard (dashboard/)

React 19 + TypeScript admin interface:

  • Agent management and configuration
  • Call history with transcript and recording playback
  • Visual flow editor (XYFlow canvas)
  • Webhook tool management
  • Telephony settings

Call Lifecycle

Inbound

  1. Provider sends HTTP request to POST /telephony/{provider}/inbound/{agent_name}
  2. Backend returns provider-specific XML response with WebSocket stream URL
  3. Provider opens WebSocket to /ws/{provider}/{agent_name}
  4. WebSocket handler creates/updates call record, starts voice pipeline
  5. Pipeline runs: STT → LLM → TTS with real-time audio streaming
  6. Transcript, events, and recording are persisted throughout the call
  7. On disconnect, call is marked completed or failed
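
Step 2 can be illustrated for Twilio, whose TwiML uses a Connect verb with a nested Stream element for bidirectional audio. The helper name and the public_host parameter are assumptions; the ExoML and Vobiz XML shapes are not shown because they are not described here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def build_twiml_stream_response(public_host: str, agent_name: str) -> str:
    """Return TwiML that tells Twilio to open a media WebSocket to the backend."""
    # The path mirrors the /ws/{provider}/{agent_name} route described above.
    stream_url = f"wss://{public_host}/ws/twilio/{quote(agent_name, safe='')}"
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": stream_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body
```

Building the XML with ElementTree rather than string formatting ensures the agent name and URL are escaped correctly.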

Outbound

  1. Client calls POST /api/calls/outbound with phone number, agent, and optional context
  2. Backend creates call record in RINGING state
  3. Provider REST API initiates the call with callback URL
  4. When answered, provider connects WebSocket — same pipeline runs
  5. Call context (custom parameters) is injected into the LLM system prompt
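
Step 5 might look like the following. The "Call context:" heading and the bullet layout are assumptions chosen for readability; how the project actually formats injected context is not specified.

```python
from typing import Mapping, Optional


def inject_call_context(
    system_prompt: str, context: Optional[Mapping[str, object]]
) -> str:
    """Append outbound call parameters to the system prompt.

    Returns the prompt unchanged when no context was supplied.
    """
    if not context:
        return system_prompt
    lines = [f"- {key}: {value}" for key, value in context.items()]
    return f"{system_prompt.rstrip()}\n\nCall context:\n" + "\n".join(lines)
```

For example, passing {"customer_name": "Asha"} appends a short block the model can draw on when greeting the callee.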

Data Model

Agent

Named voice persona with full configuration:

  • System prompt, greeting (supports templating)
  • STT, LLM, and TTS provider/model/settings
  • Pipeline settings (VAD, turn detection, idle thresholds)
  • Pre-call tool IDs, context variable schema
  • Associated flow nodes
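
The templating syntax for greetings is not specified here. As a rough sketch, assuming $name-style placeholders filled from the call's context variables, rendering could use the standard library's string.Template; render_greeting is a hypothetical helper.

```python
from string import Template


def render_greeting(greeting: str, variables: dict) -> str:
    """Fill $placeholders from variables, leaving unknown ones untouched.

    safe_substitute avoids raising mid-call when a variable is missing.
    """
    return Template(greeting).safe_substitute(variables)
```

Leaving unknown placeholders in place is a deliberate choice for this sketch. A stricter variant would validate the variables against the agent's context variable schema before the call starts.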

Call

Individual conversation record:

  • Phone number, direction (inbound/outbound), status
  • Timestamps, duration, transcript
  • Recording path, provider metadata
  • Call context (outbound parameters)
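
The status field can be modelled as a small state machine. The four statuses are the ones this page mentions (RINGING, IN_PROGRESS, completed, failed); the enum values and the allowed-transition table are assumptions.

```python
from enum import Enum


class CallStatus(str, Enum):
    RINGING = "ringing"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition table: terminal states have no outgoing edges.
ALLOWED = {
    CallStatus.RINGING: {CallStatus.IN_PROGRESS, CallStatus.FAILED},
    CallStatus.IN_PROGRESS: {CallStatus.COMPLETED, CallStatus.FAILED},
    CallStatus.COMPLETED: set(),
    CallStatus.FAILED: set(),
}


def advance(current: CallStatus, new: CallStatus) -> CallStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Guarding transitions this way prevents a late WebSocket disconnect from reopening a call that a cleanup job has already closed.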

CallEvent

Time-ordered event log per call:

  • call_started, call_ended
  • user_spoke, agent_spoke
  • tool_called, tool_result
  • node_entered, node_transition
  • context_injected, error
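
A compact way to represent this log in code is shown below. The event names are the ones listed above; the CallEvent field names and the timeline helper are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class EventType(str, Enum):
    CALL_STARTED = "call_started"
    CALL_ENDED = "call_ended"
    USER_SPOKE = "user_spoke"
    AGENT_SPOKE = "agent_spoke"
    TOOL_CALLED = "tool_called"
    TOOL_RESULT = "tool_result"
    NODE_ENTERED = "node_entered"
    NODE_TRANSITION = "node_transition"
    CONTEXT_INJECTED = "context_injected"
    ERROR = "error"


@dataclass(frozen=True)
class CallEvent:
    call_id: str
    type: EventType
    timestamp: datetime
    payload: dict = field(default_factory=dict)


def timeline(events: list) -> list:
    """Return events in chronological order (stable for equal timestamps)."""
    return sorted(events, key=lambda event: event.timestamp)
```

Because writes may be fire-and-forget, events can arrive out of order; sorting by timestamp at read time restores the sequence for review.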

FlowNode

Conversation step within an agent's flow:

  • Node key, position, initial/terminal flags
  • Role messages (persona) and task messages (instructions)
  • Transition functions (route to next node)
  • Tool IDs and pre-actions
  • Visual editor position
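
The node fields above map naturally onto a small state machine. The following is a simplified sketch, not flow_engine.py itself: the dataclass fields, the FlowEngine class, and the idea that a transition function name maps directly to a target node key are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FlowNode:
    key: str
    task_message: str
    transitions: dict = field(default_factory=dict)  # function name -> next node key
    tool_ids: list = field(default_factory=list)
    is_initial: bool = False
    is_terminal: bool = False


class FlowEngine:
    """Tracks the active node and applies transitions requested by the LLM."""

    def __init__(self, nodes: list):
        self.nodes = {node.key: node for node in nodes}
        initial = [node for node in nodes if node.is_initial]
        if len(initial) != 1:
            raise ValueError("a flow needs exactly one initial node")
        self.current = initial[0]

    def transition(self, function_name: str) -> FlowNode:
        """Move to the node that function_name routes to from the current node."""
        if self.current.is_terminal:
            raise RuntimeError(f"node {self.current.key!r} is terminal")
        try:
            next_key = self.current.transitions[function_name]
        except KeyError:
            raise ValueError(
                f"{function_name!r} is not a transition of {self.current.key!r}"
            ) from None
        self.current = self.nodes[next_key]
        return self.current
```

On each transition, the pipeline would swap in the new node's task message and tool list, so the LLM only sees instructions relevant to the current step.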

Tool

Webhook-based external integration:

  • Name, description, parameter schema
  • HTTP endpoint, method, headers, timeout
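
A stored tool carries everything needed to build the outgoing HTTP request. The sketch below uses only the standard library and assumes a JSON request body and JSON response; the Tool dataclass, build_request, and call_tool are hypothetical, and an async HTTP client would be the more likely choice inside a live pipeline.

```python
import json
import urllib.request
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    description: str
    parameters: dict
    endpoint: str
    method: str = "POST"
    headers: dict = field(default_factory=dict)
    timeout_s: float = 10.0


def build_request(tool: Tool, arguments: dict) -> urllib.request.Request:
    """Turn LLM-supplied arguments into an HTTP request for the tool's webhook."""
    body = json.dumps(arguments).encode("utf-8")
    headers = {"Content-Type": "application/json", **tool.headers}
    return urllib.request.Request(
        tool.endpoint, data=body, headers=headers, method=tool.method
    )


def call_tool(tool: Tool, arguments: dict) -> dict:
    """Send the request and decode a JSON response (blocking; sketch only)."""
    request = build_request(tool, arguments)
    with urllib.request.urlopen(request, timeout=tool.timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

The per-tool timeout matters in a voice setting: a webhook that hangs leaves the caller in silence, so a short limit with a spoken fallback is usually preferable.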

TelephonySettings

Provider credentials and concurrency limits (singleton row).

Design Constraints

  • Telephony audio is 8 kHz — the pipeline matches this to avoid transcoding overhead
  • WebSocket calls are stateful — production deployments need sticky routing
  • Recordings are local files — multi-instance deployments need shared storage
  • Each call runs its own Pipecat pipeline — capacity scales with CPU, network, and provider limits
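
The cost of the 8 kHz constraint is easy to quantify. The arithmetic below assumes 20 ms audio frames and, for linear PCM, 16-bit samples; mu-law uses one byte per sample. frame_bytes is a hypothetical helper.

```python
def frame_bytes(sample_rate_hz: int, bytes_per_sample: int, frame_ms: int = 20) -> int:
    """Size in bytes of one mono audio frame of frame_ms milliseconds."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


MULAW_8K = frame_bytes(8000, 1)   # 160 bytes per 20 ms frame
PCM16_8K = frame_bytes(8000, 2)   # 320 bytes per 20 ms frame
PCM16_16K = frame_bytes(16000, 2)  # 640 bytes per 20 ms frame
```

Running the pipeline at 16 kHz would double the per-frame payload relative to 8 kHz PCM and add a resampling step in each direction, without recovering detail the telephone network never carried.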