Orbit Voice Protocol (OVP)

OVP is a bidirectional WebSocket protocol that connects a SIP call directly to an AI voice agent built on Orbit. Where Twilio ConversationRelay overloads a small handful of events and buries tool calls in the text channel, OVP gives every concern (audio, transcripts, LLM tokens, tool invocations, interrupts, handoffs, quality, cost) its own typed event.

Base endpoint: wss://voice.orbit.devotel.io/v1/orbit-voice
Subprotocol: orbit-voice/1.0
Audio format: PCM 16-bit mono 16kHz little-endian, 20ms frames
Event wire format: UTF-8 JSON with type discriminator

Why a new protocol

Twilio’s ConversationRelay is functional. We talked to teams running it in production and found three sharp edges that we could close with a clean-sheet design:
  1. prompt is two events welded together: sometimes “the user said this”, sometimes “send this to the LLM”. Developers have to apply heuristics to each frame to know which one they are looking at.
  2. Tool calls are unobservable — they ride the same text channel as the LLM’s spoken output. You cannot subscribe to “show me every tool the agent invoked, with arguments and results, in order”.
  3. One interrupt for three different semantics — caller barge-in, programmatic cancel, and human takeover all collapse onto one primitive. They should drive different LLM-context truncation policies.
OVP starts from those problems and adds first-class events for everything else the platform already tracks internally — quality metrics, latency breakdowns, running cost, human-handoff lifecycle, mid-call voice/model swaps.

Connection lifecycle

1. Open the WebSocket

GET /v1/orbit-voice?agent_id=agt_01HZK... HTTP/1.1
Host: voice.orbit.devotel.io
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: <base64_nonce>
Sec-WebSocket-Protocol: orbit-voice/1.0
Authorization: Bearer <signed_agent_token>
The signed agent token is a short-lived JWT issued by POST /v1/voice/agents/{agent_id}/tokens. It is scoped to one call leg, expires in 5 minutes, and carries tenant_id, agent_id, and call_id.

If the subprotocol header is missing or wrong, the handshake fails with HTTP 400 and an OVP_SUBPROTOCOL_MISMATCH body. If the JWT is missing, invalid, or expired, the handshake completes but the server immediately sends an error event with OVP_AUTH_FAILED and closes.
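
The handshake inputs above can be gathered in one place before opening the socket. The following is a minimal sketch, not the SDK's implementation: `buildHandshake` and `OvpHandshake` are hypothetical names, and the token is assumed to have already been fetched from the token endpoint.

```typescript
// Hypothetical helper: collects the values needed to open an OVP socket.
interface OvpHandshake {
  url: string;
  subprotocol: string;
  headers: Record<string, string>;
}

const OVP_ENDPOINT = "wss://voice.orbit.devotel.io/v1/orbit-voice";
const OVP_SUBPROTOCOL = "orbit-voice/1.0";

function buildHandshake(agentId: string, signedAgentToken: string): OvpHandshake {
  if (agentId.length === 0 || signedAgentToken.length === 0) {
    throw new Error("agentId and signedAgentToken are required");
  }
  return {
    // agent_id travels in the query string, as in the raw request above.
    url: `${OVP_ENDPOINT}?agent_id=${encodeURIComponent(agentId)}`,
    subprotocol: OVP_SUBPROTOCOL,
    headers: { Authorization: `Bearer ${signedAgentToken}` },
  };
}

// Placeholder values for illustration only.
const handshake = buildHandshake("agt_example", "example.jwt.token");
console.log(handshake.url);
```

Browser WebSocket objects cannot set an Authorization header, so this shape fits a server-side client that accepts custom headers, for example `new WebSocket(url, subprotocol, { headers })` in the Node `ws` package.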

2. Server sends session.start

First frame from the server. Includes the protocol version, agent config, audio invariants, and the set of tools the LLM may call. Clients should store the session_id — it appears on every subsequent event.

3. Bidirectional audio + events

  • Binary frames carry PCM samples in both directions. Each frame starts with a 12-byte header (see below).
  • Text frames carry JSON events. Every event has type, seq, ts, session_id.
Sequence numbers (seq) are monotonic per direction. The server’s outbound counter and the client’s outbound counter increment independently. Servers must reject seq regressions with OVP_SEQ_REGRESSION to defend against replay.
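
A receiver can enforce the per-direction rule with one small tracker per counter. This sketch assumes "monotonic" means strictly increasing and that gaps are allowed; the text does not say either way, and `SeqTracker` is a hypothetical name.

```typescript
// Tracks the last seq seen on ONE direction and flags regressions.
class SeqTracker {
  private last: number | null = null;

  // Returns true and records seq if it advances; false signals a regression
  // (the case a server would answer with OVP_SEQ_REGRESSION).
  accept(seq: number): boolean {
    if (!Number.isInteger(seq) || seq < 0) return false;
    // Assumption: a repeated seq counts as a regression.
    if (this.last !== null && seq <= this.last) return false;
    this.last = seq;
    return true;
  }

  get lastSeen(): number | null {
    return this.last;
  }
}

// The two counters are independent, so each direction gets its own tracker.
const inbound = new SeqTracker(); // server -> client
const outbound = new SeqTracker(); // client -> server

console.log(inbound.accept(0), inbound.accept(1), inbound.accept(1)); // true true false
```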

4. Resume on drop

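If the WebSocket drops, the client has 30 seconds to reconnect and send session.resume with the last seq it observed from the server. The server replays missed events and resumes the agent in place. A session.resume that arrives more than 30 seconds after the drop is rejected with OVP_RESUME_EXPIRED. Twilio ConversationRelay, by contrast, drops the call when the WebSocket drops.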
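
On the client side, the resume step amounts to checking the 30-second window and building one event. This is a sketch under assumptions: `last_server_seq` is the field name used in the event reference, while `buildResume` and its parameters are hypothetical.

```typescript
interface SessionResume {
  type: "session.resume";
  seq: number;
  ts: string;
  session_id: string;
  last_server_seq: number;
}

const RESUME_WINDOW_MS = 30000;

// Returns the resume event, or null when the 30s window has already passed
// and the server would be expected to answer with OVP_RESUME_EXPIRED.
function buildResume(
  sessionId: string,
  lastServerSeq: number,
  nextClientSeq: number,
  droppedAtMs: number,
  nowMs: number,
): SessionResume | null {
  if (nowMs - droppedAtMs > RESUME_WINDOW_MS) return null;
  return {
    type: "session.resume",
    seq: nextClientSeq,
    ts: new Date(nowMs).toISOString(),
    session_id: sessionId,
    last_server_seq: lastServerSeq,
  };
}

// Dropped at t=0, reconnecting 10 seconds later.
const resume = buildResume("ovps_example", 41, 7, 0, 10000);
console.log(JSON.stringify(resume));
```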

5. End

A session ends when the caller hangs up, the agent decides the call is over, the developer programmatically cancels, or a fatal error fires. The server emits session.end with a reason and final stats, then closes the WS with code 1000.

Binary audio frame format

Every binary WebSocket frame carries one 20ms PCM chunk prefixed by a 12-byte header.
offset  size  field          notes
0       1     version        always 1
1       1     flags          bit0=silence, bit1=dtmf-overlay, bit2=last-of-utterance
2       2     direction      uint16 LE; 0=client→server, 1=server→client
4       4     frame_seq      uint32 LE; monotonic per direction
8       4     rtp_timestamp  uint32 LE; caller-side RTP ts (mod 2^32)
12      640   payload        PCM16LE samples (320 samples × 2 bytes)
Total frame size: 652 bytes.

Why our own header instead of raw PCM? Because the agent pod needs to know about packet loss, silence, and DTMF overlay without a separate side channel, and the SBC’s RTP timestamp is the canonical clock for jitter-buffer math on the agent side.
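
The layout above maps directly onto a `DataView`. The following sketch packs and unpacks one frame using the offsets from the table; the function and constant names are illustrative and not taken from the SDK.

```typescript
const HEADER_BYTES = 12;
const PAYLOAD_BYTES = 640; // 320 samples x 2 bytes
const FLAG_SILENCE = 1 << 0;
const FLAG_DTMF_OVERLAY = 1 << 1;
const FLAG_LAST_OF_UTTERANCE = 1 << 2;

interface AudioFrame {
  version: number;
  flags: number;
  direction: number; // 0 = client->server, 1 = server->client
  frameSeq: number;
  rtpTimestamp: number;
  pcm: Uint8Array;
}

function encodeFrame(f: AudioFrame): Uint8Array {
  if (f.pcm.length !== PAYLOAD_BYTES) throw new Error("payload must be 640 bytes");
  const out = new Uint8Array(HEADER_BYTES + PAYLOAD_BYTES);
  const view = new DataView(out.buffer);
  view.setUint8(0, f.version);
  view.setUint8(1, f.flags);
  view.setUint16(2, f.direction, true); // true = little-endian
  view.setUint32(4, f.frameSeq, true);
  view.setUint32(8, f.rtpTimestamp >>> 0, true); // wraps mod 2^32
  out.set(f.pcm, HEADER_BYTES);
  return out;
}

function decodeFrame(bytes: Uint8Array): AudioFrame {
  if (bytes.length !== HEADER_BYTES + PAYLOAD_BYTES) throw new Error("expected 652 bytes");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const version = view.getUint8(0);
  if (version !== 1) throw new Error("unsupported frame version");
  return {
    version,
    flags: view.getUint8(1),
    direction: view.getUint16(2, true),
    frameSeq: view.getUint32(4, true),
    rtpTimestamp: view.getUint32(8, true),
    pcm: bytes.subarray(HEADER_BYTES),
  };
}

// One silent frame that also closes an utterance.
const demoFrame = encodeFrame({
  version: 1,
  flags: FLAG_SILENCE | FLAG_LAST_OF_UTTERANCE,
  direction: 0,
  frameSeq: 7,
  rtpTimestamp: 160000,
  pcm: new Uint8Array(PAYLOAD_BYTES),
});
console.log(demoFrame.length); // 652
```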

Event reference

Every JSON event extends the envelope:
{
  type: string;        // discriminator
  seq: number;         // monotonic per direction
  ts: string;          // ISO-8601 UTC
  session_id: string;  // assigned by server in session.start
}
Direction column: S→C = server emits, C→S = client emits, ↔ = both.
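
Before dispatching on type, a receiver will usually want to confirm that a text frame carries the four envelope fields. The sketch below is one way to do that with a type guard; `isEnvelope` and `parseEvent` are hypothetical helpers, and the checks (non-negative integer seq, parseable ts) are assumptions about what the schema enforces.

```typescript
interface OvpEnvelope {
  type: string;
  seq: number;
  ts: string;
  session_id: string;
}

function isEnvelope(value: unknown): value is OvpEnvelope {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const seq = v["seq"];
  const ts = v["ts"];
  return (
    typeof v["type"] === "string" &&
    typeof seq === "number" && Number.isInteger(seq) && seq >= 0 &&
    typeof ts === "string" && !Number.isNaN(Date.parse(ts)) &&
    typeof v["session_id"] === "string"
  );
}

// Returns the parsed envelope, or null for malformed JSON or missing fields.
function parseEvent(text: string): OvpEnvelope | null {
  try {
    const parsed: unknown = JSON.parse(text);
    return isEnvelope(parsed) ? parsed : null;
  } catch {
    return null;
  }
}

const sampleEvent = parseEvent(
  '{"type":"session.end","seq":15,"ts":"2026-05-14T12:34:56.000Z","session_id":"ovps_example"}',
);
console.log(sampleEvent?.type);
```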

Session lifecycle

| Event | Direction | Purpose |
| --- | --- | --- |
| session.start | S→C | Handshake. Protocol version, agent config, tools, audio invariants. |
| session.update | C→S | Mutate runtime config mid-call: voice, language, model, instructions, speech rate. |
| session.resume | C→S | Reconnect with state preserved. Server replays events past last_server_seq. |
| session.end | S→C | Terminate. Carries reason + final stats. |

session.start (server → client)

{
  "type": "session.start",
  "seq": 0,
  "ts": "2026-05-14T12:34:56.000Z",
  "session_id": "ovps_01HZK...",
  "protocol_version": "1.0",
  "call": {
    "call_id": "call_01HZK...",
    "from": "+14155551234",
    "to": "+18005550100",
    "direction": "inbound",
    "secure": true
  },
  "agent": {
    "agent_id": "agt_01HZK...",
    "name": "Front-desk Support",
    "instructions": "You are a friendly receptionist...",
    "model": "claude-sonnet-4-6",
    "voice_id": "cartesia_voice_abc",
    "language": "en-US",
    "tools": [
      {
        "name": "lookup_order",
        "description": "Fetch an order by id",
        "input_schema": { "type": "object", "properties": { "order_id": { "type": "string" } } },
        "streaming": false
      }
    ]
  },
  "audio": {
    "sample_rate_hz": 16000,
    "sample_width_bits": 16,
    "channels": 1,
    "frame_ms": 20
  }
}

session.update (client → server)

{
  "type": "session.update", "seq": 14, "ts": "...", "session_id": "...",
  "patch": { "voice_id": "cartesia_voice_xyz", "language": "es-ES" }
}
Mutable fields: voice_id, language, model, instructions, speech_rate, caller_context. A patch that is empty or references an immutable field is rejected with OVP_SESSION_UPDATE_REJECTED. Twilio ConversationRelay supports only a language swap.
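
A client can catch both rejection cases locally before sending. This is a minimal sketch: the field list comes from the paragraph above, `checkPatch` is a hypothetical helper, and treating unknown keys the same as immutable ones is an assumption.

```typescript
const MUTABLE_FIELDS = [
  "voice_id",
  "language",
  "model",
  "instructions",
  "speech_rate",
  "caller_context",
] as const;

type PatchCheck =
  | { ok: true }
  | { ok: false; code: "OVP_SESSION_UPDATE_REJECTED"; reason: string };

function checkPatch(patch: Record<string, unknown>): PatchCheck {
  const keys = Object.keys(patch);
  if (keys.length === 0) {
    return { ok: false, code: "OVP_SESSION_UPDATE_REJECTED", reason: "empty patch" };
  }
  const allowed: readonly string[] = MUTABLE_FIELDS;
  const rejected = keys.filter((k) => !allowed.includes(k));
  if (rejected.length > 0) {
    return {
      ok: false,
      code: "OVP_SESSION_UPDATE_REJECTED",
      reason: `not mutable: ${rejected.join(", ")}`,
    };
  }
  return { ok: true };
}

console.log(checkPatch({ voice_id: "cartesia_voice_xyz", language: "es-ES" }));
console.log(checkPatch({ agent_id: "agt_other" }));
```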

Audio path metadata

Binary frames carry the bytes; these JSON events carry the metadata around them.

| Event | Direction | Purpose |
| --- | --- | --- |
| audio.ingress | S→C | Ack of caller audio ingested in the last window (debugging + VAD UI). |
| audio.egress | S→C | TTS bytes streaming for an utterance. Pairs with binary frames. |
| audio.flush | S→C | Discard the agent’s audio queue immediately. Cause = barge_in / cancel / takeover / session_end. |

Transcripts

| Event | Direction | Purpose |
| --- | --- | --- |
| transcript.partial | S→C | Interim STT output. Not yet committed. |
| transcript.final | S→C | Committed STT output. Carries word-level timing. |

{
  "type": "transcript.final", "seq": 27, "ts": "...", "session_id": "...",
  "utterance_id": "utt_01HZK...",
  "speaker": "caller",
  "language": "en-US",
  "text": "What time do you close on Sundays?",
  "confidence": 0.94,
  "words": [
    { "word": "What", "start_ms": 0, "end_ms": 220, "confidence": 0.96 },
    { "word": "time", "start_ms": 220, "end_ms": 470, "confidence": 0.93 }
  ]
}
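
Word-level timing makes it straightforward to flag uncertain words for review. The sketch below uses the field names from the example event; the helper names and the 0.95 threshold are arbitrary choices for illustration.

```typescript
interface TranscriptWord {
  word: string;
  start_ms: number;
  end_ms: number;
  confidence: number;
}

// Words whose confidence falls below the given threshold.
function lowConfidenceWords(words: TranscriptWord[], threshold: number): string[] {
  return words.filter((w) => w.confidence < threshold).map((w) => w.word);
}

// Span from the first word's start to the last word's end, or 0 if empty.
function spokenDurationMs(words: TranscriptWord[]): number {
  const first = words[0];
  const last = words[words.length - 1];
  if (first === undefined || last === undefined) return 0;
  return last.end_ms - first.start_ms;
}

const sampleWords: TranscriptWord[] = [
  { word: "What", start_ms: 0, end_ms: 220, confidence: 0.96 },
  { word: "time", start_ms: 220, end_ms: 470, confidence: 0.93 },
];

console.log(lowConfidenceWords(sampleWords, 0.95)); // [ 'time' ]
console.log(spokenDurationMs(sampleWords)); // 470
```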

Agent reasoning + tool-use

OVP splits the LLM lifecycle into four distinct event families: input, thinking, output, and tool use (agent.tool.*). Twilio collapses these into one text channel.

| Event | Direction | Purpose |
| --- | --- | --- |
| agent.input | C→S | Push text INTO the LLM. source = caller_transcript / developer_inject / rag_context / system. |
| agent.thinking | S→C | Streaming LLM tokens (deltas). |
| agent.output | S→C | Committed agent response text. |
| agent.tool.invoke | S→C | LLM is calling a tool. Carries arguments. |
| agent.tool.streaming | S→C | Partial tool result, for tools that opted into streaming. |
| agent.tool.result | S→C | Final tool result + duration. |
| agent.tool.error | S→C | Tool failed. Carries retryable. |

{
  "type": "agent.tool.invoke", "seq": 41, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "tool_call_id": "tc_01HZK...",
  "name": "lookup_order",
  "arguments": { "order_id": "ord_12345" }
}
{
  "type": "agent.tool.result", "seq": 44, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "tool_call_id": "tc_01HZK...",
  "result": { "status": "shipped", "carrier": "UPS", "tracking": "1Z..." },
  "duration_ms": 412
}
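
Because invoke and result share a tool_call_id, a consumer can pair them without touching the text channel. The following sketch builds one log line per completed call; `summarizeToolCalls` is a hypothetical helper, and only the fields shown in the two example events are assumed.

```typescript
interface ToolInvoke {
  type: "agent.tool.invoke";
  tool_call_id: string;
  name: string;
  arguments: Record<string, unknown>;
}

interface ToolResult {
  type: "agent.tool.result";
  tool_call_id: string;
  result: unknown;
  duration_ms: number;
}

type ToolEvent = ToolInvoke | ToolResult;

function summarizeToolCalls(events: ToolEvent[]): string[] {
  const pending = new Map<string, ToolInvoke>();
  const lines: string[] = [];
  for (const e of events) {
    if (e.type === "agent.tool.invoke") {
      pending.set(e.tool_call_id, e);
      continue;
    }
    const invoke = pending.get(e.tool_call_id);
    if (invoke === undefined) continue; // result with no matching invoke
    pending.delete(e.tool_call_id);
    lines.push(
      `${invoke.name}(${JSON.stringify(invoke.arguments)}) returned in ${e.duration_ms}ms`,
    );
  }
  return lines;
}

const toolLog = summarizeToolCalls([
  { type: "agent.tool.invoke", tool_call_id: "tc_1", name: "lookup_order", arguments: { order_id: "ord_12345" } },
  { type: "agent.tool.result", tool_call_id: "tc_1", result: { status: "shipped" }, duration_ms: 412 },
]);
console.log(toolLog[0]); // lookup_order({"order_id":"ord_12345"}) returned in 412ms
```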

Interrupts — three distinct types

This is where Twilio’s interrupt falls short. OVP distinguishes:

| Event | Direction | LLM context behaviour |
| --- | --- | --- |
| agent.barge_in | S→C | Caller spoke over the bot. TTS truncated; LLM context truncated to what the caller actually heard (played_ms). Next turn references only delivered content. |
| agent.cancel | C→S | Developer programmatically killed the turn (guardrail, retry, etc.). LLM output discarded entirely. No context preserved. |
| supervisor.takeover | C→S | Human operator cuts in. AI pauses generating. Optional resumable: true lets the operator hand the call back. |
| supervisor.release | C→S | Operator hands back. May include context_note so the AI knows what was said while it was paused. |

{
  "type": "agent.barge_in", "seq": 33, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "played_bytes": 51200,
  "played_ms": 1600
}
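
The two played_* fields are related through the audio format: 16 kHz × 2 bytes per sample is 32 bytes per millisecond. The sketch below checks that relationship, assuming played_bytes counts PCM payload only and excludes the 12-byte frame headers (the example values are consistent with that reading). The helper names are illustrative.

```typescript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2;
const BYTES_PER_MS = (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE) / 1000; // 32
const FRAME_PAYLOAD_BYTES = 640;

function bytesToMs(bytes: number): number {
  return bytes / BYTES_PER_MS;
}

// Whole 20ms frames covered by the given number of PCM bytes.
function bytesToFrames(bytes: number): number {
  return Math.floor(bytes / FRAME_PAYLOAD_BYTES);
}

function isConsistent(e: { played_bytes: number; played_ms: number }): boolean {
  return bytesToMs(e.played_bytes) === e.played_ms;
}

console.log(bytesToMs(51200)); // 1600
console.log(bytesToFrames(51200)); // 80
console.log(isConsistent({ played_bytes: 51200, played_ms: 1600 })); // true
```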

Multi-modal sends

The agent can send media to the caller mid-call. Useful when the call is happening alongside SMS / WhatsApp / RCS (the customer’s phone has both channels available).

| Event | Direction | Purpose |
| --- | --- | --- |
| image.send | S→C | Send an image via SMS/MMS/WhatsApp/email/RCS. |
| file.attach | S→C | Send a file (PDF, etc.) via SMS/WhatsApp/email. |
| screen.share | ↔ | Start/stop a screen-share session (operator → caller). |

{
  "type": "image.send", "seq": 52, "ts": "...", "session_id": "...",
  "channel": "whatsapp",
  "media_url": "https://media.orbit.devotel.io/...",
  "caption": "Here's the receipt you asked for",
  "media_id": "med_01HZK..."
}

Realtime observability

These events are emitted continuously during the call. Pipe them to your dashboard, your billing UI, your QA tool — Orbit does not charge for them and they fire even when no one is subscribed (so reconnects can replay).

| Event | Direction | Cadence |
| --- | --- | --- |
| call.quality.update | S→C | Every 5s. Packet loss, jitter, RTT, MOS. |
| agent.latency.breakdown | S→C | Per turn. STT first-token / final, LLM first-token / final, TTS first-byte, total turn round-trip. |
| agent.cost.tick | S→C | Every 5s. USD micros, broken down by LLM / STT / TTS / voice termination. |

{
  "type": "agent.latency.breakdown", "seq": 47, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "stt_first_token_ms": 89,
  "stt_final_ms": 240,
  "llm_first_token_ms": 312,
  "llm_final_ms": 980,
  "tts_first_byte_ms": 132,
  "total_turn_ms": 1452
}

Human handoff lifecycle

| Event | Direction | Purpose |
| --- | --- | --- |
| agent.handoff_requested | S→C | AI decided it can’t handle the call. Carries summary + sentiment. |
| agent.handoff_accepted | C→S | An operator picked it up. Carries queue_wait_ms. |
| agent.handoff_completed | ↔ | Call transferred. agent_remains_observer controls whether the AI stays on muted. |
| agent.handoff_failed | C→S | No operator / timeout / declined. Carries fallback: continue_ai / voicemail / hangup. |
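
One way to consume these events is as a small state machine. The ordering used here (requested, then accepted, then completed, with failed reachable only from requested) is an assumption; the reference does not spell out which transitions are legal, and the names below are hypothetical.

```typescript
type HandoffState = "idle" | "requested" | "accepted" | "completed" | "failed";

type HandoffEventType =
  | "agent.handoff_requested"
  | "agent.handoff_accepted"
  | "agent.handoff_completed"
  | "agent.handoff_failed";

// Assumed transition table; completed and failed are treated as terminal.
const TRANSITIONS: Record<HandoffState, Partial<Record<HandoffEventType, HandoffState>>> = {
  idle: { "agent.handoff_requested": "requested" },
  requested: {
    "agent.handoff_accepted": "accepted",
    "agent.handoff_failed": "failed",
  },
  accepted: { "agent.handoff_completed": "completed" },
  completed: {},
  failed: {},
};

function advance(state: HandoffState, event: HandoffEventType): HandoffState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`unexpected ${event} in state ${state}`);
  }
  return next;
}

let handoffState: HandoffState = "idle";
handoffState = advance(handoffState, "agent.handoff_requested");
handoffState = advance(handoffState, "agent.handoff_accepted");
handoffState = advance(handoffState, "agent.handoff_completed");
console.log(handoffState); // completed
```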

DTMF

| Event | Direction | Purpose |
| --- | --- | --- |
| dtmf.received | S→C | Caller pressed a key. |
| dtmf.send | C→S | Agent / SBC pushes digits into the call (e.g. navigating an IVR). |

Error

{
  "type": "error", "seq": 88, "ts": "...", "session_id": "...",
  "code": "OVP_LLM_UPSTREAM_FAILED",
  "message": "Anthropic returned 503 after 3 retries",
  "recoverable": false,
  "details": { "upstream_status": 503, "retry_count": 3 }
}
recoverable: true means the WS stays open. recoverable: false is always followed by session.end.
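
A client-side handler can branch on recoverable to decide whether to keep streaming or wait for session.end. This is a sketch of that decision only; `planForError` and the action names are hypothetical.

```typescript
interface OvpErrorEvent {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
}

type ErrorAction =
  | { kind: "continue"; log: string }
  | { kind: "await_session_end"; log: string };

function planForError(e: OvpErrorEvent): ErrorAction {
  const log = `${e.code}: ${e.message}`;
  // recoverable=true: the socket stays open, so keep processing events.
  // recoverable=false: stop sending and wait for the session.end that follows.
  return e.recoverable
    ? { kind: "continue", log }
    : { kind: "await_session_end", log };
}

const errorPlan = planForError({
  type: "error",
  code: "OVP_LLM_UPSTREAM_FAILED",
  message: "Anthropic returned 503 after 3 retries",
  recoverable: false,
});
console.log(errorPlan.kind); // await_session_end
```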

Worked example: simple Q&A call

Caller dials a DID, asks one question, hangs up.
[server→client] session.start                seq=0
[client→server] (binary audio frames @ 20ms, direction=0, frame_seq 0..N)
[server→client] audio.ingress                seq=1
[server→client] transcript.partial           seq=2   "What time"
[server→client] transcript.partial           seq=3   "What time do you close"
[server→client] transcript.final             seq=4   "What time do you close on Sundays?"
[server→client] agent.thinking               seq=5   delta="We"
[server→client] agent.thinking               seq=6   delta=" close"
[server→client] agent.thinking               seq=7   delta=" at"
[server→client] agent.thinking               seq=8   delta=" 6pm"
[server→client] agent.output                 seq=9   "We close at 6pm on Sundays." final=true
[server→client] audio.egress                 seq=10  utterance=utt_a bytes_sent=12800 final=false
[server→client] (binary audio frames, direction=1, ~50 frames)
[server→client] audio.egress                 seq=11  utterance=utt_a bytes_sent=32000 final=true
[server→client] agent.latency.breakdown      seq=12
[server→client] call.quality.update          seq=13  (every 5s)
[server→client] agent.cost.tick              seq=14
[server→client] session.end                  seq=15  reason=caller_hangup

Worked example: tool call mid-conversation

[server→client] transcript.final             seq=12  "Where's my order 12345?"
[server→client] agent.thinking               seq=13  delta="Let me check"
[server→client] agent.tool.invoke            seq=14  name="lookup_order" args={"order_id":"12345"}
[server→client] (caller hears filler audio — agent.thinking text routed to a low-latency TTS)
[server→client] agent.tool.result            seq=15  result={status:"shipped",tracking:"1Z..."} duration_ms=412
[server→client] agent.thinking               seq=16  delta="Your"
[server→client] agent.thinking               seq=17  delta=" order"
[server→client] agent.output                 seq=18  "Your order shipped yesterday via UPS, tracking 1Z..." final=true
[server→client] audio.egress                 seq=19
[server→client] (binary audio frames)
Note: tool invocations are observable as their own events. A dashboard can render “agent called lookup_order with {order_id:12345}, returned in 412ms” without parsing the text channel.

Worked example: human handoff

[server→client] transcript.final             seq=22  "I want to talk to a person"
[server→client] agent.handoff_requested      seq=23  reason="caller_requested" caller_summary="..." sentiment="frustrated"
[client→server] (route to operator queue via your contact-center)
[client→server] agent.handoff_accepted       seq=24  operator_user_id="usr_..." queue_wait_ms=4200
[client→server] supervisor.takeover          seq=25  operator_user_id="usr_..." resumable=false
[server→client] audio.flush                  seq=26  cause="takeover"
[client→server] agent.handoff_completed      seq=27  operator_user_id="usr_..." agent_remains_observer=true
[server→client] (binary audio frames continue, but the agent's TTS is muted; AI may still emit transcripts)
[server→client] transcript.final             seq=28  speaker="caller" text="..." (observer mode)
[server→client] session.end                  seq=99  reason="transferred"

Error codes

| Code | When |
| --- | --- |
| OVP_AUTH_FAILED | JWT missing / invalid / expired. |
| OVP_SUBPROTOCOL_MISMATCH | Handshake didn’t negotiate orbit-voice/1.0. |
| OVP_EVENT_SCHEMA_INVALID | JSON event failed schema validation. |
| OVP_EVENT_DIRECTION_VIOLATION | Client sent a server-only event (or vice versa). |
| OVP_SEQ_REGRESSION | seq went backwards. |
| OVP_AUDIO_FRAME_INVALID | Binary frame header malformed. |
| OVP_RESUME_EXPIRED | session.resume came in > 30s after drop. |
| OVP_SESSION_UPDATE_REJECTED | patch was empty or referenced an immutable field. |
| OVP_TOOL_UNKNOWN | LLM tried to call a tool not in session.start.agent.tools. |
| OVP_TOOL_ARGS_INVALID | Tool arguments failed the declared input_schema. |
| OVP_LLM_UPSTREAM_FAILED | LLM provider returned a non-recoverable error. |
| OVP_STT_UPSTREAM_FAILED | STT provider returned a non-recoverable error. |
| OVP_TTS_UPSTREAM_FAILED | TTS provider returned a non-recoverable error. |
| OVP_CALLER_HANGUP | Caller hung up on the SIP side. |
| OVP_MAX_DURATION_EXCEEDED | Hit the platform-level call duration cap. |
| OVP_SPEND_CAP_EXCEEDED | Tenant agent-daily-spend cap reached. Call must terminate. |
| OVP_INTERNAL_ERROR | Bug, not config. Inspect Sentry. |

Compared to Twilio ConversationRelay

| Concern | Twilio ConversationRelay | Orbit Voice Protocol |
| --- | --- | --- |
| Overloaded events | prompt is “user said” + “send to LLM” | Split: transcript.final (STT) + agent.input (LLM input, typed source) |
| Tool calls | Buried in text channel | First-class: agent.tool.{invoke,streaming,result,error} |
| Interrupt types | 1 (interrupt) | 3 (barge_in, cancel, supervisor.takeover); each truncates LLM context differently |
| Streaming LLM tokens | Final text only | agent.thinking delta stream |
| Realtime quality | Not exposed | call.quality.update every 5s (loss / jitter / RTT / MOS) |
| Latency breakdown | Not exposed | agent.latency.breakdown per turn |
| Running cost | Not exposed | agent.cost.tick every 5s, USD micros |
| Multi-modal mid-call | Not supported | image.send, file.attach, screen.share |
| Human handoff | Orchestrate via REST | agent.handoff_{requested,accepted,completed,failed} |
| Resumable on WS drop | Drops the call | 30s server-state preservation + replay |
| Mid-call voice swap | language only | voice, language, model, instructions, speech_rate |
| Protocol versioning | None | protocol_version on every session.start |
| Audio frame header | None (raw mulaw bytes) | 12-byte header (seq, RTP ts, silence/dtmf/last flags) |

SDK

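import { OrbitVoiceClient } from "@devotel/sdk";

// Placeholder for your own UI hook; not part of the SDK.
function updateDashboard(totalCostUsdMicros: number): void {
  console.log("cost so far (USD):", totalCostUsdMicros / 1e6);
}

const client = new OrbitVoiceClient({
  agentId: "agt_01HZK...",
  callId: "call_01HZK...",
});

// Subscribe per event type before connecting so no early events are missed.
client.on("transcript.final", (e) => console.log(e.text));
client.on("agent.tool.invoke", (e) => console.log("tool:", e.name, e.arguments));
client.on("agent.cost.tick", (e) => updateDashboard(e.total_cost_usd_micros));

await client.connect();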

The SDK handles framing, sequence-number bookkeeping, automatic reconnect with resume, and direction validation. See the SDK reference for the full API.