Applied AI

Experimental · 2025

Support Co-Pilot

Real-time AI assistance for customer support agents on live Twilio conference calls: streaming transcription, contextual suggestions, sentiment tracking, and knowledge-base retrieval over a single unified backend.

Support Co-Pilot is an AI-assisted, real-time assistance platform for live customer support calls. The customer and the agent share a Twilio conference call, while the agent keeps a web dashboard open that streams live transcription, context-aware response suggestions, sentiment tracking, and grounded answers from an internal knowledge base.

Support Co-Pilot started as a demo for OneNeural Labs to prove out a thesis: modern speech, reasoning, and retrieval models are now fast and cheap enough that an agent can be augmented in real time on an ordinary voice call, without retraining or workflow changes. The shipped demo runs end-to-end with sub-second transcript latency and sub-1.5 s suggestion generation, inside a 2-second end-to-end budget.

Overview

| Detail | Value |
| --- | --- |
| Category | Applied AI, real-time systems, contact-center tooling |
| Role | Architect, primary maintainer |
| Venture | OneNeural Labs |
| Repo | oneneural/copilot (Nx monorepo, TypeScript end to end) |
| Frontend | Next.js 15 App Router, Tailwind + DaisyUI, Framer Motion, Zustand |
| Backend | Node 20 + Express, native WebSocket, Server-Sent Events, Prisma |
| AI stack | Deepgram Nova-3, OpenAI GPT-5 Nano, Pinecone vector search |
| Voice | Twilio Conference + Media Streams |
| Auth / data | Clerk, Supabase Postgres, Redis |
| Timeline | Built Oct 2025, v0.1.0 demo drop Oct 11 2025 |
| Status | Experimental, demo-focused |

Why I Built It

Customer support teams historically get "AI assistance" as a post-call report: transcripts, QA scores, sentiment charts. That is useful for coaching, but it does nothing for the call that is happening right now. Meanwhile the speech, reasoning, and retrieval stacks have all crossed the latency thresholds needed for in-the-moment assistance:

  • Deepgram Nova-3 hits sub-500 ms streaming transcription with usable diarization.
  • GPT-5 Nano returns short, grounded completions in well under 1.5 seconds.
  • Vector search over a policy corpus in Pinecone is effectively free on the critical path.

Support Co-Pilot is the smallest end-to-end system that wires those three together into an agent-facing dashboard, so I could answer three product questions:

  1. Can a single Twilio conference give both a usable call experience and a clean audio stream for the AI?
  2. Can a unified Node server handle audio ingestion, AI orchestration, and dashboard push without splitting into microservices?
  3. Can the UX stay calm enough that an agent can actually use it while talking?

The v0.1.0 demo says yes to all three.

Conference Call Flow

sequenceDiagram
  participant Customer
  participant Twilio as Twilio Conference
  participant Agent
  participant API as Unified API
  participant Deepgram
  participant OpenAI
  participant Pinecone
  participant Dashboard as Agent Dashboard

  Customer->>Twilio: Dial support number
  Twilio->>API: Conference start webhook
  API->>Twilio: Dial agent into conference
  Twilio->>Agent: Ring agent's phone
  Agent-->>Twilio: Join conference
  Twilio->>API: Media stream (WebSocket, both tracks)
  Agent->>Dashboard: Open call view (SSE connect)

  loop During the call
    Twilio-->>API: μ-law audio frames
    API->>Deepgram: PCM16 stream
    Deepgram-->>API: Diarized transcript deltas
    API->>OpenAI: Prompt with transcript window
    API->>Pinecone: Query top-K KB chunks
    Pinecone-->>API: Matching policies / FAQs
    OpenAI-->>API: Suggestion text
    API-->>Dashboard: SSE: transcript, sentiment, suggestion, KB
  end

  Agent-->>Twilio: Hang up
  Twilio->>API: Conference end webhook
  API->>OpenAI: Summarization prompt
  API->>Dashboard: SSE: final summary + metrics
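
The μ-law to PCM16 step in the loop above is small enough to show in full. What follows is a minimal sketch of the standard G.711 μ-law expansion, not the code from the repo. It assumes each frame has already been turned into raw bytes (Twilio Media Streams deliver audio as base64 inside JSON messages, so the caller decodes that first), and the function names are illustrative.

```typescript
// G.711 mu-law expansion: each 8-bit code becomes one signed 16-bit sample.
const MU_LAW_BIAS = 0x84; // 132, the bias added during mu-law compression

function muLawByteToPcm16(code: number): number {
  const u = ~code & 0xff;           // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07; // 3-bit segment number
  const mantissa = u & 0x0f;        // 4-bit step within the segment
  const magnitude = (((mantissa << 3) + MU_LAW_BIAS) << exponent) - MU_LAW_BIAS;
  return (u & 0x80) !== 0 ? -magnitude : magnitude;
}

// Decode a whole frame of mu-law bytes into PCM16 samples.
// The output has one 16-bit sample per input byte.
function decodeMuLawFrame(frame: Uint8Array): Int16Array {
  return Int16Array.from(frame, (code) => muLawByteToPcm16(code));
}
```

The loudest codes (0x80 and 0x00) expand to +32124 and -32124, and 0xFF expands to silence, so the output always fits in a signed 16-bit sample.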

System Architecture

flowchart LR
  subgraph Clients
    AgentPhone[Agent phone]
    CustomerPhone[Customer phone]
    Browser["Agent dashboard<br/>Next.js 15"]
  end

  subgraph TwilioLayer["Twilio"]
    Conf[Twilio Conference]
    Media[Media Streams]
  end

  subgraph Server["Unified Server"]
    Gateway[REST API]
    WS[WebSocket handler]
    SSE[SSE broadcaster]
    AudioProc[Audio processor]
    Orchestrator[AI orchestrator]
    KB[KB service]
  end

  subgraph AI["AI and Data"]
    DG[Deepgram Nova-3]
    GPT[OpenAI GPT-5 Nano]
    PC[Pinecone]
    PG[(Supabase Postgres)]
    RedisCache[(Redis)]
  end

  AgentPhone --- Conf
  CustomerPhone --- Conf
  Conf --- Media
  Media ==> WS
  Browser <-->|HTTPS| Gateway
  SSE -->|SSE| Browser

  WS --> AudioProc --> DG
  DG --> Orchestrator
  Orchestrator --> GPT
  Orchestrator --> KB --> PC
  Orchestrator --> PG
  Orchestrator --> RedisCache
  Orchestrator --> SSE

What It Does

For the Agent

  • Live transcript. Diarized in real time, speaker-0 = Agent, speaker-1 = Customer; scrolls as the conversation evolves.
  • Contextual suggestions. GPT-5 Nano continuously drafts short response suggestions grounded in the last N turns, rendered as cards the agent can dismiss, copy, or speak.
  • Sentiment and flow. Running sentiment bar with "escalation risk" flags, so the agent sees tone shifts before they become objections.
  • Knowledge access. Top-K matching FAQs and policies from Pinecone, linked directly to the underlying document, so the agent can cite rather than guess.
  • Quick actions. One-click "send link", "escalate to tier-2", "create ticket" hooks that fire against the CRM API.
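
To make the live-transcript bullet concrete, here is a minimal sketch of folding diarized words into labelled turns. It assumes each word carries a numeric speaker index, as Deepgram returns when diarization is enabled. The `DiarizedWord` and `SpeakerTurn` shapes are simplified and hypothetical, and the fixed 0 = Agent mapping simply mirrors the convention described above.

```typescript
type SpeakerRole = "Agent" | "Customer";

interface DiarizedWord {
  word: string;
  speaker: number; // diarization index; 0 or 1 on a two-party call
  start: number;   // seconds from the start of the stream
}

interface SpeakerTurn {
  role: SpeakerRole;
  text: string;
  start: number;
}

// Assumption from the text above: speaker 0 is the agent, anyone else the customer.
function roleFor(speaker: number): SpeakerRole {
  return speaker === 0 ? "Agent" : "Customer";
}

// Merge consecutive words from the same speaker into a single turn.
function toTurns(words: DiarizedWord[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const w of words) {
    const role = roleFor(w.speaker);
    const last = turns[turns.length - 1];
    if (last !== undefined && last.role === role) {
      last.text += " " + w.word;
    } else {
      turns.push({ role, text: w.word, start: w.start });
    }
  }
  return turns;
}
```

Diarization indices are not guaranteed to line up with roles on every call, so a production version would need some way to confirm the mapping (for example, from which participant joined first) rather than hard-coding it.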

For the Supervisor

  • Active calls dashboard. A single view of every in-progress call with live sentiment, AHT, and agent name.
  • Drop-in review. Join any live transcript read-only to coach without interrupting.
  • Performance analytics. CSAT, handle time, and resolution metrics aggregated per agent and per topic.

For the Business

  • Faster resolution. Agents get the right answer while they are still talking to the customer.
  • Consistent answers. Suggestions come from the same policy corpus every agent is reading from.
  • Lower ramp time. New hires get real-time guidance instead of memorizing binders.

Real-Time AI Pipeline

The orchestrator is the only component that decides what goes out on SSE. Everything it sees comes from three streams: transcript deltas, sentiment scores, and KB hits. It produces one broadcast per transcript delta with any updated fields.

flowchart TB
  Audio[Audio frames]
  STT[Deepgram Nova-3<br/>speaker diarization]
  Buffer[Rolling transcript buffer<br/>last N seconds]
  Sent[Sentiment model]
  Sug[GPT-5 Nano<br/>suggestion chain]
  Search[Pinecone<br/>policy + FAQ index]
  Merge[Orchestrator merge]
  Out[SSE event]

  Audio --> STT --> Buffer
  Buffer --> Sent
  Buffer --> Sug
  Buffer --> Search
  Sent --> Merge
  Sug --> Merge
  Search --> Merge
  Merge --> Out
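
The merge step is easiest to see as a diff against the last broadcast. This is a minimal sketch under my own assumptions: the `CallState` shape, the sentiment scale, and the function name are hypothetical, and the real orchestrator may track more fields.

```typescript
interface CallState {
  transcript: string;
  sentiment: number;  // assumed scale: -1 (negative) to 1 (positive)
  suggestion: string;
  kbHits: string[];   // document titles or IDs
}

// Return only the fields that changed since the last broadcast,
// or null when nothing changed and no SSE event needs to go out.
function changedFields(prev: CallState, next: CallState): Partial<CallState> | null {
  const out: Partial<CallState> = {};
  if (next.transcript !== prev.transcript) out.transcript = next.transcript;
  if (next.sentiment !== prev.sentiment) out.sentiment = next.sentiment;
  if (next.suggestion !== prev.suggestion) out.suggestion = next.suggestion;
  // Arrays are compared by content, not by reference.
  if (JSON.stringify(next.kbHits) !== JSON.stringify(prev.kbHits)) out.kbHits = next.kbHits;
  return Object.keys(out).length > 0 ? out : null;
}
```

Sending only the changed fields is what lets the dashboard update each panel independently: a slow suggestion never holds back a transcript update.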

Latency Budget

The user-visible budget is the sum of: Twilio → server, audio decode, Deepgram, orchestration, SSE broadcast, and the browser's paint. The target allocation per stage:

| Stage | Target | Notes |
| --- | --- | --- |
| Twilio media stream to server | < 100 ms | WSS over public internet |
| μ-law decode to PCM16 | negligible | Done per frame, zero-copy |
| Deepgram Nova-3 partial transcript | < 500 ms | Streaming API, partial deltas |
| GPT-5 Nano suggestion | < 1.5 s | Short prompt, streamed completion |
| Pinecone top-K lookup | < 120 ms p95 | Hosted index, small chunk size |
| SSE broadcast to dashboard | < 100 ms | Same process, EventSource |
| Browser render | < 200 ms | React reconciliation of transcript |
| End-to-end transcript update | < 800 ms | |
| End-to-end suggestion update | < 2.0 s | |

Unless a row says otherwise, these are the p50 targets at which the demo feels natural. p99 is worse on slow networks, but the UX degrades gracefully because the dashboard updates fields independently.
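
One thing worth checking is how the per-stage ceilings relate to the two end-to-end rows. The sketch below just adds them up, assuming the stages run strictly back to back. That is the conservative reading: if retrieval overlaps with generation, the Pinecone term drops out. The constant names are mine.

```typescript
// Per-stage ceilings in milliseconds, copied from the latency table.
const ceilingsMs = {
  twilioToServer: 100,
  muLawDecode: 0, // listed as negligible
  deepgramPartial: 500,
  gptSuggestion: 1500,
  pineconeLookup: 120,
  sseBroadcast: 100,
  browserRender: 200,
};

const sumMs = (values: number[]): number => values.reduce((a, b) => a + b, 0);

// Transcript path: audio in, partial transcript out, pushed and painted.
const transcriptCeilingMs = sumMs([
  ceilingsMs.twilioToServer,
  ceilingsMs.muLawDecode,
  ceilingsMs.deepgramPartial,
  ceilingsMs.sseBroadcast,
  ceilingsMs.browserRender,
]);

// Suggestion path: the same, plus retrieval and generation in sequence.
const suggestionCeilingMs = sumMs([
  ceilingsMs.twilioToServer,
  ceilingsMs.muLawDecode,
  ceilingsMs.deepgramPartial,
  ceilingsMs.pineconeLookup,
  ceilingsMs.gptSuggestion,
  ceilingsMs.sseBroadcast,
  ceilingsMs.browserRender,
]);

console.log({ transcriptCeilingMs, suggestionCeilingMs });
// → { transcriptCeilingMs: 900, suggestionCeilingMs: 2520 }
```

Both sums sit above the end-to-end targets (800 ms and 2.0 s), so those targets only hold when most stages finish well under their individual ceilings. That is what you would expect from typical-case figures rather than worst-case ones.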

Repository Layout

flowchart LR
  Root[copilot-assist/]
  Apps[apps/]
  Web[apps/web<br/>Next.js 15]
  API[apps/api<br/>Express + WS]
  Pkgs[packages/]
  DB[packages/database<br/>Prisma]
  Shared[packages/shared<br/>types]
  Cfg[packages/config]
  Docs[docs/<br/>product + tech + guides]

  Root --> Apps
  Root --> Pkgs
  Root --> Docs
  Apps --> Web
  Apps --> API
  Pkgs --> DB
  Pkgs --> Shared
  Pkgs --> Cfg

Nx ties it together: pnpm nx serve web and pnpm nx serve api for local work, pnpm nx run-many --target=build --all in CI. pnpm nx graph is the source of truth for dependency direction across the workspace.
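
The main payoff of the shared package is that both apps agree on the shape of every dashboard event at compile time. A minimal sketch of what such a module could look like; the `DashboardEvent` union and its field names are hypothetical, not copied from packages/shared.

```typescript
// One discriminated union imported by both apps/api (producer) and apps/web (consumer).
type DashboardEvent =
  | { type: "transcript"; speaker: "Agent" | "Customer"; text: string }
  | { type: "sentiment"; score: number }
  | { type: "suggestion"; text: string }
  | { type: "kb"; hits: { title: string; url: string }[] };

// Exhaustive handling: adding a new event type without a case here
// becomes a compile error in whichever app forgot to handle it.
function describeEvent(event: DashboardEvent): string {
  switch (event.type) {
    case "transcript":
      return `${event.speaker}: ${event.text}`;
    case "sentiment":
      return `sentiment ${event.score.toFixed(2)}`;
    case "suggestion":
      return `suggest: ${event.text}`;
    case "kb":
      return `kb: ${event.hits.map((h) => h.title).join(", ")}`;
    default: {
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```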

Design Decisions and Trade-offs

  • One Twilio conference, not separate agent and customer calls. Routing the audio through a conference gives me a single mixed WebSocket stream and avoids double-billing minutes. The cost is that I have less direct control over per-track audio; diarization via Deepgram is what makes this work in practice.
  • Unified Node server instead of microservices. Audio ingestion, AI orchestration, and dashboard fan-out all live in one Express process with a WebSocket endpoint and an SSE endpoint. The latency budget does not tolerate the cross-service hops that a "proper" architecture would introduce at this scale, and splitting prematurely would have added deploy complexity without any real isolation benefit for a demo.
  • SSE for dashboard push, WebSocket for Twilio ingress. SSE is unidirectional, uses ordinary HTTP, auto-reconnects, and needs no client library. Twilio requires WebSocket for media streams, so that decision was made for me. Mixing the two is fine inside one process.
  • GPT-5 Nano over larger models. For short suggestion strings grounded in a tight transcript window, Nano is fast enough to stay inside the 2-second budget and accurate enough with prompt-level constraints. Larger models were tested and blew the latency budget without materially better output at this task.
  • Strict security posture: no tool use on the LLM. The LLM never invokes anything. It drafts strings; the dashboard and the agent decide what to do with them. This is the simplest way to keep the real-time loop safe and auditable.
  • Prisma + Supabase + Redis, not a bespoke store. The call tables, transcript rows, and summary records are all boring Postgres. Redis is only for live session state (SSE subscriber lists, audio buffers). Nothing clever is persisted.
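
Part of the case for SSE is how little there is to it on the wire. The sketch below shows the event framing plus an in-memory fan-out per call. It is an illustration under my own assumptions: `SseSink`, `CallBroadcaster`, and the per-call keying are hypothetical names, and the demo's subscriber lists live in Redis rather than in a local `Map`.

```typescript
// Anything we can write SSE text to, e.g. an HTTP response object.
interface SseSink {
  write(chunk: string): unknown;
}

// SSE framing: optional id, event name, one data line, then a blank line.
function formatSseEvent(event: string, data: unknown, id?: number): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`event: ${event}`);
  lines.push(`data: ${JSON.stringify(data)}`); // JSON keeps the payload on one line
  return lines.join("\n") + "\n\n";
}

class CallBroadcaster {
  private sinks = new Map<string, Set<SseSink>>();
  private nextId = 1;

  // Register a dashboard connection for a call; returns an unsubscribe function.
  subscribe(callId: string, sink: SseSink): () => void {
    let group = this.sinks.get(callId);
    if (group === undefined) {
      group = new Set<SseSink>();
      this.sinks.set(callId, group);
    }
    group.add(sink);
    const joined = group;
    return () => {
      joined.delete(sink);
    };
  }

  // Send one event to every subscriber of the call; returns how many received it.
  broadcast(callId: string, event: string, data: unknown): number {
    const frame = formatSseEvent(event, data, this.nextId++);
    const group = this.sinks.get(callId);
    if (group === undefined) return 0;
    group.forEach((sink) => sink.write(frame));
    return group.size;
  }
}
```

In an Express handler the sink would be the response object, after setting `Content-Type: text/event-stream`. The `id:` line is what lets a reconnecting browser report the last event it saw through the `Last-Event-ID` header.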

What I Learned

  • Diarization carries most of the demo. Without reliable speaker-0 vs speaker-1 tagging, every downstream feature (suggestions, sentiment, CSAT) degrades. Picking a speech provider whose diarization is actually stable in production was the single most important call.
  • The latency budget is a design constraint, not a spec item. Every added service, every added round-trip, came out of the same 2-second envelope. "Unify first, split when forced" kept the demo believable.
  • SSE is underused. For one-way server-to-client push, it is strictly simpler than WebSocket, and the auto-reconnect semantics saved me at least one bug I would have written by hand.
  • Grounded suggestions feel different from open-ended ones. Adding Pinecone hits to the suggestion prompt changed agent reactions from "neat" to "I'd use this". Without retrieval, suggestions drifted; with retrieval, they stayed on-policy.
  • Nx was worth the setup cost for a demo. A single pnpm nx run-many --target=build --all for CI and a shared type package between apps/web and apps/api removed enough friction that adding features stayed cheap even at demo scope.
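
The grounding point is easier to picture with a prompt in hand. This is a minimal sketch of combining a transcript window with retrieved policy chunks. The wording of the instructions, the `KbChunk` shape, and the default of six turns are all my assumptions for illustration, not the prompt the demo ships with.

```typescript
interface TranscriptTurn {
  role: "Agent" | "Customer";
  text: string;
}

interface KbChunk {
  title: string;
  text: string;
}

// Build a plain-text prompt from the last few turns and the retrieved chunks.
function buildSuggestionPrompt(
  turns: TranscriptTurn[],
  chunks: KbChunk[],
  maxTurns = 6,
): string {
  // Keep only the most recent turns so the prompt stays short.
  const recent = maxTurns > 0 ? turns.slice(-maxTurns) : [];
  const conversation = recent.map((t) => `${t.role}: ${t.text}`).join("\n");
  // Number the excerpts so a suggestion can point back at its source.
  const sources = chunks.map((c, i) => `[${i + 1}] ${c.title}\n${c.text}`).join("\n\n");
  return [
    "Draft one short reply the agent could say next.",
    "Use only the policy excerpts below. If they do not cover the question, say so.",
    "",
    "Policy excerpts:",
    sources.length > 0 ? sources : "(none retrieved)",
    "",
    "Recent conversation:",
    conversation,
  ].join("\n");
}
```

The model only ever receives text and returns text, which fits the no-tool-use decision above: retrieval happens in the orchestrator, and the prompt is the only place the two meet.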

Status and Roadmap

v0.1.0 is a complete, runnable demo. It is not a production contact-center product; the near-term items below are what it would take to get there:

  • Post-call analytics store. Durable summaries, coaching tags, and searchable transcripts beyond the session Redis cache.
  • CSAT calibration. Map live sentiment signals to historical CSAT outcomes so the "escalation risk" indicator is grounded, not just a mood ring.
  • Multi-tenant hardening. Proper org-scoped KB indexes, per-tenant Redis prefixes, SSO via Clerk organizations, and audit logging.
  • Agent feedback loop. One-click "this suggestion helped / didn't" that streams into a dataset for prompt tuning and model selection.
  • Deployment hardening. Cloud Run autoscaling profile, Cloudflare front door, and a proper load test against 50 concurrent conferences.

Tags: applied-ai, customer-support, real-time, speech-to-text, rag, sentiment-analysis, twilio, deepgram, openai, pinecone, next-js, nx-monorepo