Orbit Voice Protocol (OVP)
OVP is a bidirectional WebSocket protocol that connects a SIP call directly to an AI voice agent built on Orbit. Where Twilio ConversationRelay overloads a small handful of events and buries tool calls in the text channel, OVP gives every concern — audio, transcripts, LLM tokens, tool invocations, interrupts, handoffs, quality, cost — its own typed event.
Base endpoint: wss://voice.orbit.devotel.io/v1/orbit-voice
Subprotocol: orbit-voice/1.0
Audio format: PCM 16-bit mono 16kHz little-endian, 20ms frames
Event wire format: UTF-8 JSON with type discriminator
Why a new protocol
Twilio’s ConversationRelay is functional. We talked to teams running it in production and found three sharp edges that we could close with a clean-sheet design:
- prompt is two events welded together: sometimes “the user said this”, sometimes “send this to the LLM”. Developers have to apply heuristics to each frame to know which one they’re looking at.
- Tool calls are unobservable: they ride the same text channel as the LLM’s spoken output. You cannot subscribe to “show me every tool the agent invoked, with arguments and results, in order”.
- One interrupt for three different semantics: caller barge-in, programmatic cancel, and human takeover all collapse onto one primitive. They should drive different LLM-context truncation policies.
OVP starts from those problems and adds first-class events for everything else the platform already tracks internally — quality metrics, latency breakdowns, running cost, human-handoff lifecycle, mid-call voice/model swaps.
Connection lifecycle
1. Open the WebSocket
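GET /v1/orbit-voice?agent_id=agt_01HZK... HTTP/1.1
Host: voice.orbit.devotel.io
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: <random 16-byte nonce, base64-encoded>
Sec-WebSocket-Protocol: orbit-voice/1.0
Authorization: Bearer <signed_agent_token>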
The signed agent token is a short-lived JWT issued by POST /v1/voice/agents/{agent_id}/tokens. It scopes to one call leg, expires in 5 minutes, and carries tenant_id, agent_id, and call_id.
If the subprotocol header is missing or wrong, the handshake fails with HTTP 400 and an OVP_SUBPROTOCOL_MISMATCH body. If the JWT is bad, the handshake completes but the server immediately sends an error event with OVP_AUTH_FAILED and closes.
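As a rough illustration of the handshake inputs described above, the sketch below assembles the URL, subprotocol, and Authorization header for a connection attempt. The helper name buildHandshake and the HandshakeParams shape are hypothetical and are not part of the Orbit SDK; only the endpoint, the agent_id query parameter, the subprotocol string, and the Bearer scheme come from this page.

```typescript
// Hypothetical container for everything a WebSocket client needs to connect.
interface HandshakeParams {
  url: string;
  protocols: string[];
  headers: Record<string, string>;
}

// Build the handshake inputs for one call leg.
// signedAgentToken is the short-lived JWT issued by
// POST /v1/voice/agents/{agent_id}/tokens.
function buildHandshake(agentId: string, signedAgentToken: string): HandshakeParams {
  const base = "wss://voice.orbit.devotel.io/v1/orbit-voice";
  return {
    url: `${base}?agent_id=${encodeURIComponent(agentId)}`,
    // Must match exactly, otherwise the server answers HTTP 400.
    protocols: ["orbit-voice/1.0"],
    headers: { Authorization: `Bearer ${signedAgentToken}` },
  };
}
```

In Node, the ws package accepts these values as new WebSocket(url, protocols, { headers }). Browsers' built-in WebSocket cannot set an Authorization header, so this sketch assumes a server-side client.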
2. Server sends session.start
First frame from the server. Includes the protocol version, agent config, audio invariants, and the set of tools the LLM may call. Clients should store the session_id — it appears on every subsequent event.
3. Bidirectional audio + events
- Binary frames carry PCM samples in both directions. Each frame starts with a 12-byte header (see below).
- Text frames carry JSON events. Every event has type, seq, ts, session_id.
Sequence numbers (seq) are monotonic per direction. The server’s outbound counter and the client’s outbound counter increment independently. Servers must reject seq regressions with OVP_SEQ_REGRESSION to defend against replay.
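The per-direction rule above could be enforced with a small tracker, one instance per direction. This is a minimal sketch rather than the server's actual logic: the class name SeqTracker is hypothetical, and it assumes that a repeated seq counts as a regression and that gaps are allowed, neither of which this page states explicitly.

```typescript
// Tracks the highest seq seen in ONE direction and rejects anything
// that does not move forward. Assumption: equal values are rejected too.
class SeqTracker {
  private last = -1;

  accept(seq: number): void {
    if (!Number.isInteger(seq) || seq <= this.last) {
      throw new Error(`OVP_SEQ_REGRESSION: got ${seq} after ${this.last}`);
    }
    this.last = seq;
  }

  // Highest seq accepted so far, or -1 if none.
  lastSeen(): number {
    return this.last;
  }
}
```

A client would keep one tracker for inbound server events and a separate plain counter for its own outbound events, since the two directions count independently.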
4. Resume on drop
If the WebSocket drops, the client may reconnect within 30 seconds and send session.resume with the last seq it observed from the server. The server replays missed events and resumes the agent in place. Twilio ConversationRelay, by contrast, drops the call.
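A client-side sketch of the resume decision might look like the following. The field name last_server_seq is taken from the session.resume row in the event reference; the function name buildResume and its parameters are hypothetical, and the exact moment the server starts its 30-second clock is an assumption.

```typescript
const RESUME_WINDOW_MS = 30000; // 30 s window stated above

interface SessionResume {
  type: "session.resume";
  seq: number;
  ts: string;
  session_id: string;
  last_server_seq: number;
}

// Returns the event to send after reconnecting, or null when the
// window has passed and a resume attempt would be rejected.
function buildResume(
  sessionId: string,
  clientSeq: number,      // next value of the client's own outbound counter
  lastServerSeq: number,  // last seq observed from the server before the drop
  droppedAtMs: number,
  nowMs: number
): SessionResume | null {
  if (nowMs - droppedAtMs > RESUME_WINDOW_MS) return null;
  return {
    type: "session.resume",
    seq: clientSeq,
    ts: new Date(nowMs).toISOString(),
    session_id: sessionId,
    last_server_seq: lastServerSeq,
  };
}
```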
5. End
A call ends when the caller hangs up, the agent decides the call is over, the developer programmatically cancels, or a fatal error fires. The server emits session.end with a reason and final stats, then closes the WS with code 1000.
Audio frame header
Every binary WebSocket frame carries one 20ms PCM chunk prefixed by a 12-byte header.
| Offset (bytes) | Size (bytes) | Field | Notes |
|---|---|---|---|
| 0 | 1 | version | always 1 |
| 1 | 1 | flags | bit0=silence, bit1=dtmf-overlay, bit2=last-of-utterance |
| 2 | 2 | direction | uint16 LE; 0=client→server, 1=server→client |
| 4 | 4 | frame_seq | uint32 LE; monotonic per direction |
| 8 | 4 | rtp_timestamp | uint32 LE; caller-side RTP ts (mod 2^32) |
| 12 | 640 | PCM16LE samples | 320 samples × 2 bytes |
Total frame size: 652 bytes.
Why our own header instead of raw PCM? Because the agent pod needs to know about packet loss, silence, and DTMF overlay without a separate side channel — and the SBC’s RTP timestamp is the canonical clock for jitter-buffer math on the agent side.
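To make the byte layout concrete, here is a minimal sketch of packing and unpacking one frame with DataView. The type and function names (OvpAudioFrame, encodeFrame, decodeFrame) and the flag constants are illustrative and not taken from the SDK; the offsets, sizes, and endianness follow the header layout above.

```typescript
const HEADER_BYTES = 12;
const PAYLOAD_BYTES = 640; // 320 samples x 2 bytes = 20 ms at 16 kHz mono

// Flag bits as listed for the flags byte.
const FLAG_SILENCE = 1 << 0;
const FLAG_DTMF_OVERLAY = 1 << 1;
const FLAG_LAST_OF_UTTERANCE = 1 << 2;

interface OvpAudioFrame {
  version: number;      // always 1
  flags: number;
  direction: number;    // 0 = client->server, 1 = server->client
  frameSeq: number;
  rtpTimestamp: number; // wraps mod 2^32
  pcm: Uint8Array;      // raw PCM16LE bytes
}

function encodeFrame(f: OvpAudioFrame): Uint8Array {
  if (f.pcm.length !== PAYLOAD_BYTES) {
    throw new Error(`expected ${PAYLOAD_BYTES} PCM bytes, got ${f.pcm.length}`);
  }
  const out = new Uint8Array(HEADER_BYTES + PAYLOAD_BYTES);
  const view = new DataView(out.buffer);
  view.setUint8(0, f.version);
  view.setUint8(1, f.flags);
  view.setUint16(2, f.direction, true);    // little-endian
  view.setUint32(4, f.frameSeq, true);
  view.setUint32(8, f.rtpTimestamp, true);
  out.set(f.pcm, HEADER_BYTES);
  return out;
}

function decodeFrame(bytes: Uint8Array): OvpAudioFrame {
  if (bytes.length !== HEADER_BYTES + PAYLOAD_BYTES) {
    throw new Error(`OVP_AUDIO_FRAME_INVALID: length ${bytes.length}`);
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: view.getUint8(0),
    flags: view.getUint8(1),
    direction: view.getUint16(2, true),
    frameSeq: view.getUint32(4, true),
    rtpTimestamp: view.getUint32(8, true),
    pcm: bytes.subarray(HEADER_BYTES),
  };
}
```

A receiver can then test individual flags with a bitwise AND, for example (frame.flags & FLAG_SILENCE) !== 0, without any side channel.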
Event reference
Every JSON event extends the envelope:
{
type: string; // discriminator
seq: number; // monotonic per direction
ts: string; // ISO-8601 UTC
session_id: string; // assigned by server in session.start
}
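One way a client might check the envelope before dispatching on type is sketched below. The function name parseEnvelope is hypothetical, and reusing the OVP_EVENT_SCHEMA_INVALID code in the thrown message is only a labelling convenience; this page does not say how the official SDK validates events.

```typescript
interface Envelope {
  type: string;       // discriminator
  seq: number;        // monotonic per direction
  ts: string;         // ISO-8601 UTC
  session_id: string; // assigned by server in session.start
}

// Parse one text frame and verify the four envelope fields are present
// with the expected primitive types. Event-specific fields are left as-is.
function parseEnvelope(raw: string): Envelope {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null) {
    throw new Error("OVP_EVENT_SCHEMA_INVALID: event is not a JSON object");
  }
  const obj = value as Record<string, unknown>;
  if (
    typeof obj["type"] !== "string" ||
    typeof obj["seq"] !== "number" ||
    typeof obj["ts"] !== "string" ||
    typeof obj["session_id"] !== "string"
  ) {
    throw new Error("OVP_EVENT_SCHEMA_INVALID: missing or mistyped envelope field");
  }
  return obj as unknown as Envelope;
}
```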
Direction column: S→C = server emits, C→S = client emits, ↔ = both.
Session lifecycle
| Event | Dir | Purpose |
|---|---|---|
| session.start | S→C | Handshake. Protocol version, agent config, tools, audio invariants. |
| session.update | C→S | Mutate runtime config mid-call: voice, language, model, instructions, speech rate. |
| session.resume | C→S | Reconnect with state preserved. Server replays events past last_server_seq. |
| session.end | S→C | Terminate. Carries reason + final stats. |
session.start (server → client)
{
"type": "session.start",
"seq": 0,
"ts": "2026-05-14T12:34:56.000Z",
"session_id": "ovps_01HZK...",
"protocol_version": "1.0",
"call": {
"call_id": "call_01HZK...",
"from": "+14155551234",
"to": "+18005550100",
"direction": "inbound",
"secure": true
},
"agent": {
"agent_id": "agt_01HZK...",
"name": "Front-desk Support",
"instructions": "You are a friendly receptionist...",
"model": "claude-sonnet-4-6",
"voice_id": "cartesia_voice_abc",
"language": "en-US",
"tools": [
{
"name": "lookup_order",
"description": "Fetch an order by id",
"input_schema": { "type": "object", "properties": { "order_id": { "type": "string" } } },
"streaming": false
}
]
},
"audio": {
"sample_rate_hz": 16000,
"sample_width_bits": 16,
"channels": 1,
"frame_ms": 20
}
}
session.update (client → server)
{
"type": "session.update", "seq": 14, "ts": "...", "session_id": "...",
"patch": { "voice_id": "cartesia_voice_xyz", "language": "es-ES" }
}
Mutable fields: voice_id, language, model, instructions, speech_rate, caller_context. An empty patch, or one that references an immutable field, is rejected with OVP_SESSION_UPDATE_REJECTED. Twilio supports only a language swap.
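A client could pre-check a patch before sending it, so that obvious rejections never reach the server. The sketch below mirrors the mutable-field list above; the function name validatePatch and the wording of its messages are hypothetical, and the server remains the authority on what it accepts.

```typescript
// Fields this page lists as mutable via session.update.
const MUTABLE_FIELDS = [
  "voice_id",
  "language",
  "model",
  "instructions",
  "speech_rate",
  "caller_context",
];

// Returns a list of problems; an empty list means the patch looks sendable.
function validatePatch(patch: Record<string, unknown>): string[] {
  const keys = Object.keys(patch);
  if (keys.length === 0) return ["patch is empty"];
  return keys
    .filter((k) => MUTABLE_FIELDS.indexOf(k) === -1)
    .map((k) => `field "${k}" is not mutable`);
}
```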
Audio
Binary frames carry the bytes; these JSON events carry the metadata around them.
| Event | Dir | Purpose |
|---|---|---|
| audio.ingress | S→C | Ack of caller audio ingested in the last window (debugging + VAD UI). |
| audio.egress | S→C | TTS bytes streaming for an utterance. Pairs with binary frames. |
| audio.flush | S→C | Discard the agent’s audio queue immediately. Cause = barge_in / cancel / takeover / session_end. |
Transcripts
| Event | Dir | Purpose |
|---|---|---|
| transcript.partial | S→C | Interim STT output. Not yet committed. |
| transcript.final | S→C | Committed STT output. Carries word-level timing. |
{
"type": "transcript.final", "seq": 27, "ts": "...", "session_id": "...",
"utterance_id": "utt_01HZK...",
"speaker": "caller",
"language": "en-US",
"text": "What time do you close on Sundays?",
"confidence": 0.94,
"words": [
{ "word": "What", "start_ms": 0, "end_ms": 220, "confidence": 0.96 },
{ "word": "time", "start_ms": 220, "end_ms": 470, "confidence": 0.93 }
]
}
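The difference between interim and committed output can be illustrated with a small live-caption buffer. This is a sketch under two assumptions not confirmed by this page: that transcript.partial also carries utterance_id, and that each partial contains the full interim text so far rather than a delta. The class name TranscriptView is hypothetical.

```typescript
interface TranscriptEvent {
  type: "transcript.partial" | "transcript.final";
  utterance_id: string; // assumed to be present on partials as well
  text: string;
}

class TranscriptView {
  private committed: string[] = [];
  private interim = new Map<string, string>();

  apply(e: TranscriptEvent): void {
    if (e.type === "transcript.partial") {
      // Newer partials replace older ones for the same utterance.
      this.interim.set(e.utterance_id, e.text);
    } else {
      // A final commits the utterance and clears its interim text.
      this.interim.delete(e.utterance_id);
      this.committed.push(e.text);
    }
  }

  // Committed utterances first, then whatever is still interim.
  render(): string {
    return this.committed.concat(Array.from(this.interim.values())).join(" ");
  }
}
```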
Agent (LLM) events
OVP splits the LLM lifecycle into four event families (input, thinking, output, and tool calls), each with its own typed events. Twilio collapses these into one text channel.
| Event | Dir | Purpose |
|---|---|---|
| agent.input | C→S | Push text INTO the LLM. source = caller_transcript / developer_inject / rag_context / system. |
| agent.thinking | S→C | Streaming LLM tokens (deltas). |
| agent.output | S→C | Committed agent response text. |
| agent.tool.invoke | S→C | LLM is calling a tool. Carries arguments. |
| agent.tool.streaming | S→C | Partial tool result, for tools that opted into streaming. |
| agent.tool.result | S→C | Final tool result + duration. |
| agent.tool.error | S→C | Tool failed. Carries retryable. |
{
"type": "agent.tool.invoke", "seq": 41, "ts": "...", "session_id": "...",
"turn_id": "turn_01HZK...",
"tool_call_id": "tc_01HZK...",
"name": "lookup_order",
"arguments": { "order_id": "ord_12345" }
}
{
"type": "agent.tool.result", "seq": 44, "ts": "...", "session_id": "...",
"turn_id": "turn_01HZK...",
"tool_call_id": "tc_01HZK...",
"result": { "status": "shipped", "carrier": "UPS", "tracking": "1Z..." },
"duration_ms": 412
}
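Because invoke and result share a tool_call_id, a dashboard can pair them into an ordered timeline without touching the text channel. The following is a minimal sketch of that pairing; the ToolCall shape and the function name buildToolTimeline are hypothetical, while the event field names come from the two examples above.

```typescript
interface ToolInvoke {
  type: "agent.tool.invoke";
  tool_call_id: string;
  name: string;
  arguments: Record<string, unknown>;
}

interface ToolResult {
  type: "agent.tool.result";
  tool_call_id: string;
  result: unknown;
  duration_ms: number;
}

// One row per tool call; result fields stay undefined until the result arrives.
interface ToolCall {
  tool_call_id: string;
  name: string;
  arguments: Record<string, unknown>;
  result?: unknown;
  duration_ms?: number;
}

function buildToolTimeline(events: Array<ToolInvoke | ToolResult>): ToolCall[] {
  const calls: ToolCall[] = [];
  const byId = new Map<string, ToolCall>();
  for (const e of events) {
    if (e.type === "agent.tool.invoke") {
      const call: ToolCall = { tool_call_id: e.tool_call_id, name: e.name, arguments: e.arguments };
      calls.push(call); // preserves invocation order
      byId.set(e.tool_call_id, call);
    } else {
      const call = byId.get(e.tool_call_id);
      if (call) {
        call.result = e.result;
        call.duration_ms = e.duration_ms;
      }
    }
  }
  return calls;
}
```

A fuller version would also handle agent.tool.streaming and agent.tool.error, whose payload fields are not shown on this page.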
Interrupts — three distinct types
This is where Twilio’s interrupt falls short. OVP distinguishes:
| Event | Dir | LLM context behaviour |
|---|---|---|
| agent.barge_in | S→C | Caller spoke over the bot. TTS truncated; LLM context truncated to what the caller actually heard (played_ms). Next turn references only delivered content. |
| agent.cancel | C→S | Developer programmatically killed the turn (guardrail, retry, etc.). LLM output discarded entirely. No context preserved. |
| supervisor.takeover | C→S | Human operator cuts in. AI pauses generating. Optional resumable: true lets the operator hand the call back. |
| supervisor.release | C→S | Operator hands back. May include context_note so the AI knows what was said while it was paused. |
{
"type": "agent.barge_in", "seq": 33, "ts": "...", "session_id": "...",
"turn_id": "turn_01HZK...",
"played_bytes": 51200,
"played_ms": 1600
}
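The two played_* fields in the example are consistent with the audio invariants: at 16 kHz, 16-bit mono, one millisecond of audio is 32 bytes. The sketch below shows that arithmetic, assuming played_bytes counts PCM payload bytes only (excluding the 12-byte frame headers), which the example suggests but this page does not state. The helper names are hypothetical.

```typescript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit mono
const BYTES_PER_MS = (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE) / 1000; // 32
const PAYLOAD_BYTES_PER_FRAME = 640; // one 20 ms frame

// Milliseconds of audio the caller actually heard.
function playedMsFromBytes(playedBytes: number): number {
  return playedBytes / BYTES_PER_MS;
}

// Number of complete 20 ms frames delivered before the barge-in.
function wholeFramesPlayed(playedBytes: number): number {
  return Math.floor(playedBytes / PAYLOAD_BYTES_PER_FRAME);
}
```

For the event above, 51200 bytes works out to 1600 ms, or 80 complete frames.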
Multi-modal sends
The agent can send media to the caller mid-call. Useful when the call is happening alongside SMS / WhatsApp / RCS (the customer’s phone has both channels available).
| Event | Dir | Purpose |
|---|---|---|
| image.send | S→C | Send an image via SMS/MMS/WhatsApp/email/RCS. |
| file.attach | S→C | Send a file (PDF, etc.) via SMS/WhatsApp/email. |
| screen.share | ↔ | Start/stop a screen-share session (operator → caller). |
{
"type": "image.send", "seq": 52, "ts": "...", "session_id": "...",
"channel": "whatsapp",
"media_url": "https://media.orbit.devotel.io/...",
"caption": "Here's the receipt you asked for",
"media_id": "med_01HZK..."
}
Realtime observability
These events are emitted continuously during the call. Pipe them to your dashboard, your billing UI, your QA tool — Orbit does not charge for them and they fire even when no one is subscribed (so reconnects can replay).
| Event | Dir | Cadence |
|---|---|---|
| call.quality.update | S→C | Every 5s. Packet loss, jitter, RTT, MOS. |
| agent.latency.breakdown | S→C | Per turn. STT first-token / final, LLM first-token / final, TTS first-byte, total turn round-trip. |
| agent.cost.tick | S→C | Every 5s. USD micros, broken down by LLM / STT / TTS / voice termination. |
{
"type": "agent.latency.breakdown", "seq": 47, "ts": "...", "session_id": "...",
"turn_id": "turn_01HZK...",
"stt_first_token_ms": 89,
"stt_final_ms": 240,
"llm_first_token_ms": 312,
"llm_final_ms": 980,
"tts_first_byte_ms": 132,
"total_turn_ms": 1452
}
Human handoff lifecycle
| Event | Dir | Purpose |
|---|---|---|
| agent.handoff_requested | S→C | AI decided it can’t handle the call. Carries summary + sentiment. |
| agent.handoff_accepted | C→S | An operator picked it up. Carries queue_wait_ms. |
| agent.handoff_completed | ↔ | Call transferred. agent_remains_observer controls whether the AI stays on the call, muted. |
| agent.handoff_failed | C→S | No operator / timeout / declined. Carries fallback: continue_ai / voicemail / hangup. |
DTMF
| Event | Dir | Purpose |
|---|---|---|
| dtmf.received | S→C | Caller pressed a key. |
| dtmf.send | C→S | Agent / SBC pushes digits into the call (e.g. navigating an IVR). |
Error
{
"type": "error", "seq": 88, "ts": "...", "session_id": "...",
"code": "OVP_LLM_UPSTREAM_FAILED",
"message": "Anthropic returned 503 after 3 retries",
"recoverable": false,
"details": { "upstream_status": 503, "retry_count": 3 }
}
recoverable: true means the WS stays open. recoverable: false is always followed by session.end.
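A client handler might branch on recoverable as sketched below. The type OvpError, the NextStep labels, and the function name are all hypothetical; the only behaviour taken from this page is that a recoverable error leaves the WebSocket open and a non-recoverable one is followed by session.end.

```typescript
interface OvpError {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
}

type NextStep = "keep_open" | "expect_session_end";

// Decide what the client should expect after an error event.
function nextStepAfterError(e: OvpError): NextStep {
  // recoverable: true  -> the WS stays open, keep processing events.
  // recoverable: false -> stop sending, wait for session.end and the close.
  return e.recoverable ? "keep_open" : "expect_session_end";
}
```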
Worked example: simple Q&A call
Caller dials a DID, asks one question, hangs up.
[server→client] session.start seq=0
[client→server] (binary audio frames @ 20ms, direction=0, frame_seq 0..N)
[server→client] audio.ingress seq=1
[server→client] transcript.partial seq=2 "What time"
[server→client] transcript.partial seq=3 "What time do you close"
[server→client] transcript.final seq=4 "What time do you close on Sundays?"
[server→client] agent.thinking seq=5 delta="We"
[server→client] agent.thinking seq=6 delta=" close"
[server→client] agent.thinking seq=7 delta=" at"
[server→client] agent.thinking seq=8 delta=" 6pm"
[server→client] agent.output seq=9 "We close at 6pm on Sundays." final=true
[server→client] audio.egress seq=10 utterance=utt_a bytes_sent=12800 final=false
[server→client] (binary audio frames, direction=1, ~25 frames)
[server→client] audio.egress seq=11 utterance=utt_a bytes_sent=32000 final=true
[server→client] agent.latency.breakdown seq=12
[server→client] call.quality.update seq=13 (every 5s)
[server→client] agent.cost.tick seq=14
[server→client] session.end seq=15 reason=caller_hangup
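The relationship between the agent.thinking deltas and the committed agent.output in a trace like this can be sketched as a small reducer. The delta field name appears in the trace; the text field on agent.output and the function name collectTurnText are assumptions made for illustration.

```typescript
type AgentTextEvent =
  | { type: "agent.thinking"; delta: string }
  | { type: "agent.output"; text: string }; // field name "text" is assumed

// Concatenate streamed deltas for live display, and keep the committed
// output separately once it arrives.
function collectTurnText(events: AgentTextEvent[]): { streamed: string; committed: string | null } {
  let streamed = "";
  let committed: string | null = null;
  for (const e of events) {
    if (e.type === "agent.thinking") {
      streamed += e.delta;
    } else {
      committed = e.text;
    }
  }
  return { streamed, committed };
}
```

A UI would typically show the streamed text while the turn is in flight and swap in the committed text when agent.output arrives, since the trace shows only some of the deltas.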
Worked example: tool call
[server→client] transcript.final seq=12 "Where's my order 12345?"
[server→client] agent.thinking seq=13 delta="Let me check"
[server→client] agent.tool.invoke seq=14 name="lookup_order" args={"order_id":"12345"}
[server→client] (caller hears filler audio — agent.thinking text routed to a low-latency TTS)
[server→client] agent.tool.result seq=15 result={status:"shipped",tracking:"1Z..."} duration_ms=412
[server→client] agent.thinking seq=16 delta="Your"
[server→client] agent.thinking seq=17 delta=" order"
[server→client] agent.output seq=18 "Your order shipped yesterday via UPS, tracking 1Z..." final=true
[server→client] audio.egress seq=19
[server→client] (binary audio frames)
Note: tool invocations are observable as their own events. A dashboard can render “agent called lookup_order with {order_id:12345}, returned in 412ms” without parsing the text channel.
Worked example: human handoff
[server→client] transcript.final seq=22 "I want to talk to a person"
[server→client] agent.handoff_requested seq=23 reason="caller_requested" caller_summary="..." sentiment="frustrated"
[client→server] (route to operator queue via your contact-center)
[client→server] agent.handoff_accepted seq=24 operator_user_id="usr_..." queue_wait_ms=4200
[client→server] supervisor.takeover seq=25 operator_user_id="usr_..." resumable=false
[server→client] audio.flush seq=26 cause="takeover"
[client→server] agent.handoff_completed seq=27 operator_user_id="usr_..." agent_remains_observer=true
[server→client] (binary audio frames continue, but the agent's TTS is muted; AI may still emit transcripts)
[server→client] transcript.final seq=28 speaker="caller" text="..." (observer mode)
[server→client] session.end seq=99 reason="transferred"
Error codes
| Code | When |
|---|---|
| OVP_AUTH_FAILED | JWT missing / invalid / expired. |
| OVP_SUBPROTOCOL_MISMATCH | Handshake didn’t negotiate orbit-voice/1.0. |
| OVP_EVENT_SCHEMA_INVALID | JSON event failed schema validation. |
| OVP_EVENT_DIRECTION_VIOLATION | Client sent a server-only event (or vice versa). |
| OVP_SEQ_REGRESSION | seq went backwards. |
| OVP_AUDIO_FRAME_INVALID | Binary frame header malformed. |
| OVP_RESUME_EXPIRED | session.resume came in > 30s after drop. |
| OVP_SESSION_UPDATE_REJECTED | patch was empty or referenced an immutable field. |
| OVP_TOOL_UNKNOWN | LLM tried to call a tool not in session.start.agent.tools. |
| OVP_TOOL_ARGS_INVALID | Tool arguments failed the declared input_schema. |
| OVP_LLM_UPSTREAM_FAILED | LLM provider returned a non-recoverable error. |
| OVP_STT_UPSTREAM_FAILED | STT provider returned a non-recoverable error. |
| OVP_TTS_UPSTREAM_FAILED | TTS provider returned a non-recoverable error. |
| OVP_CALLER_HANGUP | Caller hung up on the SIP side. |
| OVP_MAX_DURATION_EXCEEDED | Hit the platform-level call duration cap. |
| OVP_SPEND_CAP_EXCEEDED | Tenant agent-daily-spend cap reached. Call must terminate. |
| OVP_INTERNAL_ERROR | Bug, not config. Inspect Sentry. |
Compared to Twilio ConversationRelay
| Concern | Twilio ConversationRelay | Orbit Voice Protocol |
|---|---|---|
| Overloaded events | prompt is “user said” + “send to LLM” | Split: transcript.final (STT) + agent.input (LLM input, typed source) |
| Tool calls | Buried in text channel | First-class: agent.tool.{invoke,streaming,result,error} |
| Interrupt types | 1 (interrupt) | 3 (barge_in, cancel, supervisor.takeover) — each truncates LLM context differently |
| Streaming LLM tokens | Final text only | agent.thinking delta stream |
| Realtime quality | Not exposed | call.quality.update every 5s (loss / jitter / RTT / MOS) |
| Latency breakdown | Not exposed | agent.latency.breakdown per turn |
| Running cost | Not exposed | agent.cost.tick every 5s, USD micros |
| Multi-modal mid-call | Not supported | image.send, file.attach, screen.share |
| Human handoff | Orchestrate via REST | agent.handoff_{requested,accepted,completed,failed} |
| Resumable on WS drop | Drops the call | 30s server-state preservation + replay |
| Mid-call voice swap | language only | voice, language, model, instructions, speech_rate |
| Protocol versioning | None | protocol_version on every session.start |
| Audio frame header | None — raw mulaw bytes | 12-byte header (seq, RTP ts, silence/dtmf/last flags) |
SDK
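import { OrbitVoiceClient } from "@devotel/sdk";

// Identify the agent and the call leg this client attaches to.
const client = new OrbitVoiceClient({
  agentId: "agt_01HZK...",
  callId: "call_01HZK...",
});

// Subscribe to typed events by name.
client.on("transcript.final", (e) => console.log(e.text));
client.on("agent.tool.invoke", (e) => console.log("tool:", e.name, e.arguments));
// updateDashboard is your own application function, not part of the SDK.
client.on("agent.cost.tick", (e) => updateDashboard(e.total_cost_usd_micros));

// Open the WebSocket and start receiving events.
await client.connect();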
The SDK handles framing, sequence-number bookkeeping, automatic reconnect with resume, and direction validation. Full reference: SDK reference.