Orbit Voice Protocol (OVP)

Design preview — the wire format below is defined and tested; the live gateway does not yet speak it. This page is the target specification for OVP. The typed events and binary-frame codec ship today in the @devotel/voice package and are covered by conformance tests, so you can build and validate against the wire format now. The production voice bridge at wss://voice.orbit.devotel.io/voice/ws/{session_id} currently runs an earlier audio pipeline whose frame layout and audio sample rate differ from the spec on this page — it does not yet negotiate the orbit-voice/1.0 subprotocol or emit these events. Do not point a client at the live gateway expecting this protocol until this banner is removed. The endpoint, hostname, and JWT-authenticated upgrade described below are the intended shape of the service, not its current behaviour.

OVP is a bidirectional WebSocket protocol that connects a SIP call directly to an AI voice agent built on Orbit. Where Twilio ConversationRelay overloads a small handful of events and buries tool calls in the text channel, OVP gives every concern (audio, transcripts, LLM tokens, tool invocations, interrupts, handoffs, quality, cost) its own typed event.

Base endpoint: wss://voice.orbit.devotel.io/voice/ws/{session_id}?token=<jwt>
Subprotocol: orbit-voice/1.0
Audio format: PCM 16-bit mono, 16 kHz, little-endian, 20 ms frames
Event wire format: UTF-8 JSON with a type discriminator

Why a new protocol

Twilio’s ConversationRelay is functional. We talked to teams running it in production and found three sharp edges that we could close with a clean-sheet design:

prompt is two events welded together: sometimes it means “the user said this”, sometimes “send this to the LLM”. Developers have to apply heuristics to each frame to tell which one they are looking at.
Tool calls are unobservable — they ride the same text channel as the LLM’s spoken output. You cannot subscribe to “show me every tool the agent invoked, with arguments and results, in order”.
One interrupt for three different semantics — caller barge-in, programmatic cancel, and human takeover all collapse onto one primitive. They should drive different LLM-context truncation policies.

OVP starts from those problems and adds first-class events for everything else the platform already tracks internally — quality metrics, latency breakdowns, running cost, human-handoff lifecycle, mid-call voice/model swaps.

Connection lifecycle

1. Open the WebSocket

The session id and its signed token are issued together when the call session is created — a browser softphone receives them from its session-start response, and the inbound SIP path receives them from the dialplan. The token travels in the token query parameter (not an Authorization header) so the gateway can authenticate the upgrade before the WebSocket handshake completes.

GET /voice/ws/{session_id}?token=<jwt> HTTP/1.1
Host: voice.orbit.devotel.io
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Protocol: orbit-voice/1.0

The token is a short-lived HS256 JWT. It is bound to a single session, expires within 5 minutes of issue, and carries tenant_id, session_id, and the call context. It is minted server-side as part of session creation; there is no public per-agent token endpoint today. If the subprotocol header is missing or wrong, the handshake fails with HTTP 400 and an OVP_SUBPROTOCOL_MISMATCH body. If the JWT is bad, the handshake completes but the server immediately sends an error event with OVP_AUTH_FAILED and closes.
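As a rough illustration of the upgrade request above, the sketch below assembles the connection URL. The helper name `buildOvpUrl` is hypothetical; only the host, path, `token` query parameter, and subprotocol string come from this page.

```typescript
// Subprotocol string the client must offer during the WebSocket upgrade.
const OVP_SUBPROTOCOL = "orbit-voice/1.0";

// Hypothetical helper: builds the OVP URL for one session.
// Both values are percent-encoded so that they cannot break the path or query.
function buildOvpUrl(sessionId: string, token: string): string {
  const base = "wss://voice.orbit.devotel.io/voice/ws/";
  return `${base}${encodeURIComponent(sessionId)}?token=${encodeURIComponent(token)}`;
}

// In a browser the pair would be passed to the standard WebSocket constructor:
//   const ws = new WebSocket(buildOvpUrl(sessionId, token), OVP_SUBPROTOCOL);
//   ws.binaryType = "arraybuffer"; // audio arrives as binary frames
```

The token goes in the query string here because that is where this page says the gateway reads it. Per the banner at the top of the page, this should only be pointed at a gateway that actually negotiates orbit-voice/1.0.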

2. Server sends `session.start`

First frame from the server. Includes the protocol version, agent config, audio invariants, and the set of tools the LLM may call. Clients should store the session_id — it appears on every subsequent event.

3. Bidirectional audio + events

Binary frames carry PCM samples in both directions. Each frame starts with a 12-byte header (see below).
Text frames carry JSON events. Every event has type, seq, ts, session_id.

Sequence numbers (seq) are monotonic per direction. The server’s outbound counter and the client’s outbound counter increment independently. Servers must reject seq regressions with OVP_SEQ_REGRESSION to defend against replay.
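One way to picture the per-direction bookkeeping is a small guard object, one per direction. This is a sketch, not the gateway's implementation: the class name `SeqGuard` is hypothetical, and treating a repeated seq as a regression is an assumption (the text above only says that seq must not go backwards).

```typescript
// Tracks the last sequence number seen in ONE direction.
// Assumption: a repeated seq counts as a regression too, since a replayed
// event would carry a number that was already seen. Gaps are allowed.
class SeqGuard {
  private last = -1; // the examples on this page start at seq=0

  // Returns null when seq is acceptable, or the error code to raise.
  check(seq: number): "OVP_SEQ_REGRESSION" | null {
    if (seq <= this.last) {
      return "OVP_SEQ_REGRESSION";
    }
    this.last = seq;
    return null;
  }
}

// One guard per direction, because the two counters advance independently.
const fromServer = new SeqGuard();
const fromClient = new SeqGuard();
```

Keeping two independent instances mirrors the rule that the server's and the client's outbound counters do not share state.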

4. Resume on drop

If the WebSocket drops, the client may reconnect within 30 seconds and send session.resume with the last seq it observed from the server. The server replays the events the client missed and resumes the agent in place. Twilio ConversationRelay, by contrast, drops the call.
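The resume step might look roughly like the sketch below. The helper `buildResume` and its parameters are hypothetical; the `last_server_seq` field name is the one used in the session lifecycle table later on this page. The local 30-second check is only a courtesy, since the server is what enforces the window.

```typescript
// Resume window stated above, in milliseconds.
const RESUME_WINDOW_MS = 30000;

interface SessionResume {
  type: "session.resume";
  seq: number;
  ts: string;
  session_id: string;
  last_server_seq: number; // last seq the client observed from the server
}

// Hypothetical helper: returns the event to send after reconnecting, or null
// when the drop is older than the resume window and a new session is needed.
function buildResume(
  sessionId: string,
  lastServerSeq: number,
  nextClientSeq: number,
  droppedAtMs: number,
  nowMs: number,
): SessionResume | null {
  if (nowMs - droppedAtMs > RESUME_WINDOW_MS) return null;
  return {
    type: "session.resume",
    seq: nextClientSeq,
    ts: new Date(nowMs).toISOString(),
    session_id: sessionId,
    last_server_seq: lastServerSeq,
  };
}
```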

5. End

The call ends when the caller hangs up, the agent decides the call is over, the developer programmatically cancels, or a fatal error fires. The server emits session.end with a reason and final stats, then closes the WebSocket with code 1000.

Binary audio frame format

Every binary WebSocket frame carries one 20ms PCM chunk prefixed by a 12-byte header.

offset  size  field          notes
0       1     version        always 1
1       1     flags          bit0=silence, bit1=dtmf-overlay, bit2=last-of-utterance
2       2     direction      uint16 LE; 0=client→server, 1=server→client
4       4     frame_seq      uint32 LE; monotonic per direction
8       4     rtp_timestamp  uint32 LE; caller-side RTP ts (mod 2^32)
12      640   PCM16LE samples (320 samples × 2 bytes)

Total frame size: 652 bytes. Why our own header instead of raw PCM? Because the agent pod needs to know about packet loss, silence, and DTMF overlay without a separate side channel — and the SBC’s RTP timestamp is the canonical clock for jitter-buffer math on the agent side.
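The layout above can be sketched as a small encoder and decoder. This is an illustration built only from the table, not the codec that ships in @devotel/voice; the names `AudioFrame`, `encodeFrame`, and `decodeFrame` are hypothetical.

```typescript
const HEADER_BYTES = 12;
const PCM_BYTES = 640; // 320 samples × 2 bytes = 20 ms at 16 kHz mono
const FLAG_SILENCE = 1 << 0;
const FLAG_DTMF_OVERLAY = 1 << 1;
const FLAG_LAST_OF_UTTERANCE = 1 << 2;

interface AudioFrame {
  flags: number;
  direction: 0 | 1; // 0 = client→server, 1 = server→client
  frameSeq: number;
  rtpTimestamp: number;
  pcm: Uint8Array; // 640 bytes of PCM16LE
}

function encodeFrame(f: AudioFrame): Uint8Array {
  if (f.pcm.length !== PCM_BYTES) throw new Error("PCM payload must be 640 bytes");
  const out = new Uint8Array(HEADER_BYTES + PCM_BYTES);
  const view = new DataView(out.buffer);
  view.setUint8(0, 1); // version, always 1
  view.setUint8(1, f.flags);
  view.setUint16(2, f.direction, true); // true = little-endian
  view.setUint32(4, f.frameSeq, true);
  view.setUint32(8, f.rtpTimestamp, true); // setUint32 wraps mod 2^32
  out.set(f.pcm, HEADER_BYTES);
  return out;
}

function decodeFrame(bytes: Uint8Array): AudioFrame {
  if (bytes.length !== HEADER_BYTES + PCM_BYTES) {
    throw new Error("OVP_AUDIO_FRAME_INVALID: bad length");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint8(0) !== 1) throw new Error("OVP_AUDIO_FRAME_INVALID: bad version");
  const rawDirection = view.getUint16(2, true);
  if (rawDirection > 1) throw new Error("OVP_AUDIO_FRAME_INVALID: bad direction");
  return {
    flags: view.getUint8(1),
    direction: rawDirection === 0 ? 0 : 1,
    frameSeq: view.getUint32(4, true),
    rtpTimestamp: view.getUint32(8, true),
    pcm: bytes.subarray(HEADER_BYTES), // a view onto the input, not a copy
  };
}
```

A flag is tested with a bitwise AND, for example `(frame.flags & FLAG_SILENCE) !== 0`. Mapping a malformed header to a thrown error that names OVP_AUDIO_FRAME_INVALID is a choice made for this sketch.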

Event reference

Every JSON event extends the envelope:

{
  type: string;        // discriminator
  seq: number;         // monotonic per direction
  ts: string;          // ISO-8601 UTC
  session_id: string;  // assigned by server in session.start
}

Direction column: S→C = server emits, C→S = client emits, ↔ = both.
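To make the envelope concrete, here is a minimal sketch of checking the four shared fields on an incoming text frame. The function name `parseEnvelope` is hypothetical, and the extra constraints (seq is a non-negative integer, ts parses as a date) are assumptions drawn from the examples on this page rather than stated rules.

```typescript
interface Envelope {
  type: string;       // discriminator
  seq: number;        // monotonic per direction
  ts: string;         // ISO-8601 UTC
  session_id: string; // assigned by server in session.start
}

// Parses one text frame and checks the four envelope fields.
// Returns null where a real implementation would raise OVP_EVENT_SCHEMA_INVALID.
function parseEnvelope(text: string): Envelope | null {
  let raw: unknown;
  try {
    raw = JSON.parse(text);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const rec = raw as Record<string, unknown>;
  const kind = rec["type"];
  const seq = rec["seq"];
  const ts = rec["ts"];
  const sessionId = rec["session_id"];
  const ok =
    typeof kind === "string" &&
    typeof seq === "number" && Number.isInteger(seq) && seq >= 0 &&
    typeof ts === "string" && !Number.isNaN(Date.parse(ts)) &&
    typeof sessionId === "string";
  return ok ? (raw as Envelope) : null;
}
```

After the envelope passes, a client would typically switch on `type` to reach the event-specific fields. The examples on this page that elide ts as "..." would not pass this check as written.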

Session lifecycle

Event	Dir	Purpose
`session.start`	S→C	Handshake. Protocol version, agent config, tools, audio invariants.
`session.update`	C→S	Mutate runtime config mid-call: voice, language, model, instructions, speech rate.
`session.resume`	C→S	Reconnect with state preserved. Server replays events past `last_server_seq`.
`session.end`	S→C	Terminate. Carries reason + final stats.

`session.start` (server → client)

{
  "type": "session.start",
  "seq": 0,
  "ts": "2026-05-14T12:34:56.000Z",
  "session_id": "ovps_01HZK...",
  "protocol_version": "1.0",
  "call": {
    "call_id": "call_01HZK...",
    "from": "+14155551234",
    "to": "+18005550100",
    "direction": "inbound",
    "secure": true
  },
  "agent": {
    "agent_id": "agt_01HZK...",
    "name": "Front-desk Support",
    "instructions": "You are a friendly receptionist...",
    "model": "claude-sonnet-4-6",
    "voice_id": "cartesia_voice_abc",
    "language": "en-US",
    "tools": [
      {
        "name": "lookup_order",
        "description": "Fetch an order by id",
        "input_schema": { "type": "object", "properties": { "order_id": { "type": "string" } } },
        "streaming": false
      }
    ]
  },
  "audio": {
    "sample_rate_hz": 16000,
    "sample_width_bits": 16,
    "channels": 1,
    "frame_ms": 20
  }
}

`session.update` (client → server)

{
  "type": "session.update", "seq": 14, "ts": "...", "session_id": "...",
  "patch": { "voice_id": "cartesia_voice_xyz", "language": "es-ES" }
}

Mutable fields: voice_id, language, model, instructions, speech_rate, caller_context. Empty patch is rejected with OVP_SESSION_UPDATE_REJECTED. Twilio supports only language swap.
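A client can catch the two rejection cases before sending. The sketch below assumes that any field outside the mutable list is immutable; `validatePatch` is a hypothetical name, not an SDK function.

```typescript
// Mutable fields listed above; anything else is treated as immutable here.
const MUTABLE_FIELDS: readonly string[] = [
  "voice_id",
  "language",
  "model",
  "instructions",
  "speech_rate",
  "caller_context",
];

// Returns the error code the server would be expected to send,
// or null if the patch looks acceptable.
function validatePatch(
  patch: Record<string, unknown>,
): "OVP_SESSION_UPDATE_REJECTED" | null {
  const keys = Object.keys(patch);
  if (keys.length === 0) return "OVP_SESSION_UPDATE_REJECTED"; // empty patch
  const allMutable = keys.every((k) => MUTABLE_FIELDS.indexOf(k) !== -1);
  return allMutable ? null : "OVP_SESSION_UPDATE_REJECTED";
}
```

This only checks field names. Whether a given value is acceptable (for example, an unknown voice_id) is left to the server.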

Audio path metadata

Binary frames carry the bytes; these JSON events carry the metadata around them.

Event	Dir	Purpose
`audio.ingress`	S→C	Ack of caller audio ingested in the last window (debugging + VAD UI).
`audio.egress`	S→C	TTS bytes streaming for an utterance. Pairs with binary frames.
`audio.flush`	S→C	Discard the agent’s audio queue immediately. Cause = barge_in / cancel / takeover / session_end.

Transcripts

Event	Dir	Purpose
`transcript.partial`	S→C	Interim STT output. Not yet committed.
`transcript.final`	S→C	Committed STT output. Carries word-level timing.

{
  "type": "transcript.final", "seq": 27, "ts": "...", "session_id": "...",
  "utterance_id": "utt_01HZK...",
  "speaker": "caller",
  "language": "en-US",
  "text": "What time do you close on Sundays?",
  "confidence": 0.94,
  "words": [
    { "word": "What", "start_ms": 0, "end_ms": 220, "confidence": 0.96 },
    { "word": "time", "start_ms": 220, "end_ms": 470, "confidence": 0.93 }
  ]
}

Agent reasoning + tool-use

OVP splits the LLM lifecycle into four distinct stages (input, thinking, output, and tool use), each with its own event types. Twilio collapses these into one text channel.

Event	Dir	Purpose
`agent.input`	C→S	Push text INTO the LLM. `source` = caller_transcript / developer_inject / rag_context / system.
`agent.thinking`	S→C	Streaming LLM tokens (deltas).
`agent.output`	S→C	Committed agent response text.
`agent.tool.invoke`	S→C	LLM is calling a tool. Carries arguments.
`agent.tool.streaming`	S→C	Partial tool result, for tools that opted into streaming.
`agent.tool.result`	S→C	Final tool result + duration.
`agent.tool.error`	S→C	Tool failed. Carries `retryable`.

{
  "type": "agent.tool.invoke", "seq": 41, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "tool_call_id": "tc_01HZK...",
  "name": "lookup_order",
  "arguments": { "order_id": "ord_12345" }
}

{
  "type": "agent.tool.result", "seq": 44, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "tool_call_id": "tc_01HZK...",
  "result": { "status": "shipped", "carrier": "UPS", "tracking": "1Z..." },
  "duration_ms": 412
}
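Because invoke and result share a tool_call_id, a client can rebuild an ordered tool timeline without touching the text channel. The sketch below shows one way to do that; the `ToolTimeline` class and the `ToolCallRecord` shape are hypothetical, and only the event fields shown in the two examples above are used.

```typescript
interface ToolInvoke {
  type: "agent.tool.invoke";
  tool_call_id: string;
  name: string;
  arguments: unknown;
}
interface ToolResult {
  type: "agent.tool.result";
  tool_call_id: string;
  result: unknown;
  duration_ms: number;
}
interface ToolCallRecord {
  name: string;
  arguments: unknown;
  result?: unknown;
  durationMs?: number;
}

// Collects tool calls in invocation order and fills in results as they arrive.
class ToolTimeline {
  // A Map iterates in insertion order, which preserves invocation order.
  private calls = new Map<string, ToolCallRecord>();

  onEvent(e: ToolInvoke | ToolResult): void {
    if (e.type === "agent.tool.invoke") {
      this.calls.set(e.tool_call_id, { name: e.name, arguments: e.arguments });
    } else {
      const call = this.calls.get(e.tool_call_id);
      if (call !== undefined) {
        call.result = e.result;
        call.durationMs = e.duration_ms;
      }
    }
  }

  list(): ToolCallRecord[] {
    return Array.from(this.calls.values());
  }
}
```

Results whose tool_call_id was never seen are ignored in this sketch. Handling agent.tool.streaming and agent.tool.error would follow the same lookup pattern.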

Interrupts — three distinct types

This is where Twilio’s interrupt falls short. OVP distinguishes:

Event	Dir	LLM context behaviour
`agent.barge_in`	S→C	Caller spoke over the bot. TTS truncated; LLM context truncated to what the caller actually heard (`played_ms`). Next turn references only delivered content.
`agent.cancel`	C→S	Developer programmatically killed the turn (guardrail, retry, etc.). LLM output discarded entirely. No context preserved.
`supervisor.takeover`	C→S	Human operator cuts in. AI pauses generating. Optional `resumable: true` lets the operator hand the call back.
`supervisor.release`	C→S	Operator hands back. May include `context_note` so the AI knows what was said while it was paused.

{
  "type": "agent.barge_in", "seq": 33, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "played_bytes": 51200,
  "played_ms": 1600
}
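The two fields in the example are consistent with the audio format declared in session.start: 16,000 samples per second at 2 bytes per sample is 32,000 bytes per second, so 51,200 bytes is 1,600 ms. A small sketch of that arithmetic, with a hypothetical helper name:

```typescript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit mono

// Converts a count of delivered PCM bytes into milliseconds of audio.
function playedMsFromBytes(playedBytes: number): number {
  return (playedBytes * 1000) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
}
```

The same function maps one 640-byte frame payload to 20 ms, which matches the frame duration used throughout this page.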

Multi-modal sends

The agent can send media to the caller mid-call. Useful when the call is happening alongside SMS / WhatsApp / RCS (the customer’s phone has both channels available).

Event	Dir	Purpose
`image.send`	S→C	Send an image via SMS/MMS/WhatsApp/email/RCS.
`file.attach`	S→C	Send a file (PDF, etc.) via SMS/WhatsApp/email.
`screen.share`	↔	Start/stop a screen-share session (operator → caller).

{
  "type": "image.send", "seq": 52, "ts": "...", "session_id": "...",
  "channel": "whatsapp",
  "media_url": "https://media.orbit.devotel.io/...",
  "caption": "Here's the receipt you asked for",
  "media_id": "med_01HZK..."
}

Realtime observability

These events are emitted continuously during the call. Pipe them to your dashboard, your billing UI, your QA tool — Orbit does not charge for them and they fire even when no one is subscribed (so reconnects can replay).

Event	Dir	Cadence
`call.quality.update`	S→C	Every 5s. Packet loss, jitter, RTT, MOS.
`agent.latency.breakdown`	S→C	Per turn. STT first-token / final, LLM first-token / final, TTS first-byte, total turn round-trip.
`agent.cost.tick`	S→C	Every 5s. USD micros, broken down by LLM / STT / TTS / voice termination.

{
  "type": "agent.latency.breakdown", "seq": 47, "ts": "...", "session_id": "...",
  "turn_id": "turn_01HZK...",
  "stt_first_token_ms": 89,
  "stt_final_ms": 240,
  "llm_first_token_ms": 312,
  "llm_final_ms": 980,
  "tts_first_byte_ms": 132,
  "total_turn_ms": 1452
}

Human handoff lifecycle

Event	Dir	Purpose
`agent.handoff_requested`	S→C	AI decided it can’t handle the call. Carries summary + sentiment.
`agent.handoff_accepted`	C→S	An operator picked it up. Carries `queue_wait_ms`.
`agent.handoff_completed`	↔	Call transferred. `agent_remains_observer` controls whether the AI stays on the call, muted.
`agent.handoff_failed`	C→S	No operator / timeout / declined. Carries `fallback` — continue_ai / voicemail / hangup.

DTMF

Event	Dir	Purpose
`dtmf.received`	S→C	Caller pressed a key.
`dtmf.send`	C→S	Agent / SBC pushes digits into the call (e.g. navigating an IVR).

Error

{
  "type": "error", "seq": 88, "ts": "...", "session_id": "...",
  "code": "OVP_LLM_UPSTREAM_FAILED",
  "message": "Anthropic returned 503 after 3 retries",
  "recoverable": false,
  "details": { "upstream_status": 503, "retry_count": 3 }
}

recoverable: true means the WS stays open. recoverable: false is always followed by session.end.
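On the client side the rule above reduces to a single branch. The sketch below is illustrative: `OvpErrorEvent` mirrors the example event, while `nextStepAfter` and its return values are hypothetical, and waiting for session.end rather than closing the socket first is an assumption about sensible client behaviour.

```typescript
interface OvpErrorEvent {
  type: "error";
  code: string;
  message: string;
  recoverable: boolean;
  details?: Record<string, unknown>;
}

type NextStep = "continue" | "await_session_end";

// Decides what a client should do after an error event.
function nextStepAfter(e: OvpErrorEvent): NextStep {
  // recoverable: true means the socket stays open, so the call carries on.
  // recoverable: false is always followed by session.end, so the client can
  // wait for that event (and its final stats) instead of closing first.
  return e.recoverable ? "continue" : "await_session_end";
}
```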

Worked example: simple Q&A call

Caller dials a DID, asks one question, hangs up.

[server→client] session.start                seq=0
[client→server] (binary audio frames @ 20ms, direction=0, frame_seq 0..N)
[server→client] audio.ingress                seq=1
[server→client] transcript.partial           seq=2   "What time"
[server→client] transcript.partial           seq=3   "What time do you close"
[server→client] transcript.final             seq=4   "What time do you close on Sundays?"
[server→client] agent.thinking               seq=5   delta="We"
[server→client] agent.thinking               seq=6   delta=" close"
[server→client] agent.thinking               seq=7   delta=" at"
[server→client] agent.thinking               seq=8   delta=" 6pm"
[server→client] agent.output                 seq=9   "We close at 6pm on Sundays." final=true
[server→client] audio.egress                 seq=10  utterance=utt_a bytes_sent=12800 final=false
[server→client] (binary audio frames, direction=1, ~25 frames)
[server→client] audio.egress                 seq=11  utterance=utt_a bytes_sent=32000 final=true
[server→client] agent.latency.breakdown      seq=12
[server→client] call.quality.update          seq=13  (every 5s)
[server→client] agent.cost.tick              seq=14
[server→client] session.end                  seq=15  reason=caller_hangup

Worked example: tool call mid-conversation

[server→client] transcript.final             seq=12  "Where's my order 12345?"
[server→client] agent.thinking               seq=13  delta="Let me check"
[server→client] agent.tool.invoke            seq=14  name="lookup_order" args={"order_id":"12345"}
[server→client] (caller hears filler audio — agent.thinking text routed to a low-latency TTS)
[server→client] agent.tool.result            seq=15  result={status:"shipped",tracking:"1Z..."} duration_ms=412
[server→client] agent.thinking               seq=16  delta="Your"
[server→client] agent.thinking               seq=17  delta=" order"
[server→client] agent.output                 seq=18  "Your order shipped yesterday via UPS, tracking 1Z..." final=true
[server→client] audio.egress                 seq=19
[server→client] (binary audio frames)

Note: tool invocations are observable as their own events. A dashboard can render “agent called lookup_order with {order_id:12345}, returned in 412ms” without parsing the text channel.

Worked example: human handoff

[server→client] transcript.final             seq=22  "I want to talk to a person"
[server→client] agent.handoff_requested      seq=23  reason="caller_requested" caller_summary="..." sentiment="frustrated"
[client→server] (route to operator queue via your contact-center)
[client→server] agent.handoff_accepted       seq=24  operator_user_id="usr_..." queue_wait_ms=4200
[client→server] supervisor.takeover          seq=25  operator_user_id="usr_..." resumable=false
[server→client] audio.flush                  seq=26  cause="takeover"
[client→server] agent.handoff_completed      seq=27  operator_user_id="usr_..." agent_remains_observer=true
[server→client] (binary audio frames continue, but the agent's TTS is muted; AI may still emit transcripts)
[server→client] transcript.final             seq=28  speaker="caller" text="..." (observer mode)
[server→client] session.end                  seq=99  reason="transferred"

Error codes

Code	When
`OVP_AUTH_FAILED`	JWT missing / invalid / expired.
`OVP_SUBPROTOCOL_MISMATCH`	Handshake didn’t negotiate `orbit-voice/1.0`.
`OVP_EVENT_SCHEMA_INVALID`	JSON event failed schema validation.
`OVP_EVENT_DIRECTION_VIOLATION`	Client sent a server-only event (or vice versa).
`OVP_SEQ_REGRESSION`	`seq` went backwards.
`OVP_AUDIO_FRAME_INVALID`	Binary frame header malformed.
`OVP_RESUME_EXPIRED`	`session.resume` came in > 30s after drop.
`OVP_SESSION_UPDATE_REJECTED`	`patch` was empty or referenced an immutable field.
`OVP_TOOL_UNKNOWN`	LLM tried to call a tool not in `session.start.agent.tools`.
`OVP_TOOL_ARGS_INVALID`	Tool arguments failed the declared `input_schema`.
`OVP_LLM_UPSTREAM_FAILED`	LLM provider returned a non-recoverable error.
`OVP_STT_UPSTREAM_FAILED`	STT provider returned a non-recoverable error.
`OVP_TTS_UPSTREAM_FAILED`	TTS provider returned a non-recoverable error.
`OVP_CALLER_HANGUP`	Caller hung up on the SIP side.
`OVP_MAX_DURATION_EXCEEDED`	Hit the platform-level call duration cap.
`OVP_SPEND_CAP_EXCEEDED`	Tenant agent-daily-spend cap reached. Call must terminate.
`OVP_INTERNAL_ERROR`	Bug, not config. Inspect Sentry.
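As an example of how one of these codes could be produced, the sketch below checks an event's type against the Dir column of the event tables. Only a subset of events is listed, the names `DIRECTIONS` and `checkDirection` are hypothetical, and leaving unknown types to schema validation is an assumption.

```typescript
type Sender = "client" | "server";
type Allowed = "server_only" | "client_only" | "both";

// Dir column copied from the event tables on this page (subset only).
const DIRECTIONS = new Map<string, Allowed>([
  ["session.start", "server_only"],
  ["session.update", "client_only"],
  ["session.resume", "client_only"],
  ["session.end", "server_only"],
  ["transcript.final", "server_only"],
  ["agent.input", "client_only"],
  ["agent.barge_in", "server_only"],
  ["agent.cancel", "client_only"],
  ["screen.share", "both"],
  ["dtmf.received", "server_only"],
  ["dtmf.send", "client_only"],
]);

// Returns the error code to raise, or null when the sender is allowed.
function checkDirection(
  eventType: string,
  sender: Sender,
): "OVP_EVENT_DIRECTION_VIOLATION" | null {
  const allowed = DIRECTIONS.get(eventType);
  // Unknown types are not a direction problem; schema validation handles them.
  if (allowed === undefined || allowed === "both") return null;
  const expected: Sender = allowed === "server_only" ? "server" : "client";
  return sender === expected ? null : "OVP_EVENT_DIRECTION_VIOLATION";
}
```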

Compared to Twilio ConversationRelay

Concern	Twilio ConversationRelay	Orbit Voice Protocol
Overloaded events	`prompt` is “user said” + “send to LLM”	Split: `transcript.final` (STT) + `agent.input` (LLM input, typed source)
Tool calls	Buried in text channel	First-class: `agent.tool.{invoke,streaming,result,error}`
Interrupt types	1 (`interrupt`)	3 (`agent.barge_in`, `agent.cancel`, `supervisor.takeover`), each of which truncates LLM context differently
Streaming LLM tokens	Final text only	`agent.thinking` delta stream
Realtime quality	Not exposed	`call.quality.update` every 5s (loss / jitter / RTT / MOS)
Latency breakdown	Not exposed	`agent.latency.breakdown` per turn
Running cost	Not exposed	`agent.cost.tick` every 5s, USD micros
Multi-modal mid-call	Not supported	`image.send`, `file.attach`, `screen.share`
Human handoff	Orchestrate via REST	`agent.handoff_{requested,accepted,completed,failed}`
Resumable on WS drop	Drops the call	30s server-state preservation + replay
Mid-call voice swap	language only	voice, language, model, instructions, speech_rate
Protocol versioning	None	`protocol_version` on every `session.start`
Audio frame header	None — raw mulaw bytes	12-byte header (seq, RTP ts, silence/dtmf/last flags)

SDK

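import { OrbitVoiceClient } from "@devotel/sdk";

const client = new OrbitVoiceClient({
  agentId: "agt_01HZK...",
  callId: "call_01HZK...",
});

// Log each committed caller utterance.
client.on("transcript.final", (e) => console.log(e.text));
// Log every tool the agent invokes, with its arguments.
client.on("agent.tool.invoke", (e) => console.log("tool:", e.name, e.arguments));
// updateDashboard is your own function; the tick carries the running cost in USD micros.
client.on("agent.cost.tick", (e) => updateDashboard(e.total_cost_usd_micros));

// Top-level await requires an ES module context.
await client.connect();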

The SDK handles framing, sequence-number bookkeeping, automatic reconnect with resume, and direction validation. Full reference: SDK reference.
