Mira Realtime
Mira Realtime is a low-latency, full-duplex voice API over WebSocket. Stream audio in both directions concurrently, receive partial user transcripts, and play back the model's streamed audio response. Designed for voice assistants, call-center bots and interactive characters.
Connection endpoint
wss://api.vmira.ai/v1/realtime
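Authentication uses WebSocket subprotocols. Pass two values when opening the socket: mira-realtime.v1 and openai-insecure-api-key.sk-mira-YOUR_KEY, where sk-mira-YOUR_KEY is your API key. This is the standard auth-via-subprotocol pattern for browser clients, since the browser WebSocket API cannot set custom request headers.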
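As a minimal sketch of the subprotocol auth described above, the helper below builds the two-element subprotocol list from an API key. The function name miraSubprotocols is hypothetical and not part of the Mira API. The validation reflects a general WebSocket rule rather than anything Mira-specific: each subprotocol must be an HTTP token, so a key containing spaces or separators would be rejected by the WebSocket constructor.

```javascript
// Hypothetical helper (not part of the Mira API): builds the subprotocol list
// passed as the second argument to `new WebSocket(url, protocols)`.

// Characters allowed in an HTTP token, which is what WebSocket subprotocols must be.
const TOKEN_RE = /^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$/;

function miraSubprotocols(apiKey) {
  if (typeof apiKey !== "string" || !TOKEN_RE.test(apiKey)) {
    throw new Error("API key contains characters not allowed in a WebSocket subprotocol");
  }
  return ["mira-realtime.v1", `openai-insecure-api-key.${apiKey}`];
}
```

Usage would look like new WebSocket("wss://api.vmira.ai/v1/realtime", miraSubprotocols(key)).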
Session flow
1. Connect: client opens the WebSocket with subprotocol auth; server replies with session.created
2. Configure: client sends session.update with voice, audio formats, system prompt and tools
3. Audio up: client streams audio frames in the configured input format as input_audio_buffer.append events
4. Partial transcripts: server emits conversation.item.input_audio_transcription.delta as recognition progresses
5. Audio down: server emits response.audio.delta; play each chunk immediately rather than waiting for the response to complete
6. Barge-in: if the client resumes sending input_audio_buffer.append while the server is still emitting response.audio.delta, the server auto-cancels the in-flight response
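The barge-in step has a client-side counterpart worth sketching. The server cancels the in-flight response, but any audio chunks the client has already received and queued would presumably keep playing unless the client discards them. The PlaybackQueue class and handleBargeIn function below are hypothetical illustrations, not part of the Mira API, and how the client detects that the user has started speaking is left to the application.

```javascript
// Hypothetical client-side playback queue (not part of the Mira API).
// Holds decoded audio chunks between arrival and playback.
class PlaybackQueue {
  constructor() {
    this.chunks = [];
  }

  // Queue a decoded chunk taken from a response.audio.delta event.
  enqueue(bytes) {
    this.chunks.push(bytes);
  }

  // Next chunk to hand to the audio device, or undefined if the queue is empty.
  next() {
    return this.chunks.shift();
  }

  // Discard everything not yet played; returns how many chunks were dropped.
  flush() {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }

  get length() {
    return this.chunks.length;
  }
}

// Call when the user starts speaking over the assistant, so stale audio
// from the cancelled response is not played after the interruption.
function handleBargeIn(queue) {
  return queue.flush();
}
```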
Minimal JavaScript example
// playPcm16, base64ToBytes, bytesToBase64 and micStream are not defined here;
// they are assumed to be supplied by the application.
const ws = new WebSocket(
  "wss://api.vmira.ai/v1/realtime",
  ["mira-realtime.v1", "openai-insecure-api-key.sk-mira-YOUR_KEY"],
);

ws.addEventListener("open", () => {
  // Configure the session
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "aria",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      instructions: "You are a friendly Mira voice assistant.",
    },
  }));
});

ws.addEventListener("message", (ev) => {
  const event = JSON.parse(ev.data);
  if (event.type === "response.audio.delta") {
    // Play each chunk as it arrives on the client
    playPcm16(base64ToBytes(event.delta));
  }
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    console.log("user (partial):", event.delta);
  }
});

// Send mic audio frames as they arrive
micStream.on("frame", (pcmBytes) => {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytesToBase64(pcmBytes),
  }));
});
Server vs client events
| Direction | Event type | Purpose |
| --- | --- | --- |
| Client → Server | session.update | Update session settings |
| Client → Server | input_audio_buffer.append | Send an audio frame |
| Client → Server | input_audio_buffer.commit | End the user turn |
| Client → Server | response.create | Request a model response |
| Server → Client | session.created | Session handshake confirmed |
| Server → Client | conversation.item.input_audio_transcription.delta | Partial user transcript |
| Server → Client | response.audio.delta | Streamed audio response chunk |
| Server → Client | response.done | Response finished |
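The minimal example does not use input_audio_buffer.commit or response.create, so here is a hedged sketch of how a client might end a turn explicitly and route incoming events by type. The function names endUserTurn and createDispatcher are hypothetical. Whether both client events are required, or whether the server ends turns on its own via VAD, is not stated here; the sketch assumes a manual turn where the client sends both, with no extra fields.

```javascript
// Hypothetical helpers (not part of the Mira API).

// Ends the user turn and asks for a model response.
// `send` is any function that transmits a string, e.g. (m) => ws.send(m).
function endUserTurn(send) {
  send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  send(JSON.stringify({ type: "response.create" }));
}

// Returns a function that parses a raw message and calls the handler
// registered for its event type. Returns true if a handler ran.
function createDispatcher(handlers) {
  return (raw) => {
    const event = JSON.parse(raw);
    // Own-property check so inherited keys like "toString" are never treated as handlers.
    if (Object.prototype.hasOwnProperty.call(handlers, event.type)) {
      handlers[event.type](event);
      return true;
    }
    return false;
  };
}
```

A dispatcher built this way could be attached with ws.addEventListener("message", (ev) => dispatch(ev.data)), keeping one small handler per event type from the table.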
Audio formats
- Input: pcm16 (16-bit PCM, 24 kHz, mono), g711_ulaw, g711_alaw
- Output: pcm16 (16-bit PCM, 24 kHz, mono), g711_ulaw, g711_alaw
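The minimal example calls base64ToBytes and bytesToBase64 without defining them, and browser audio APIs typically deliver samples as 32-bit floats rather than pcm16. The sketch below shows one possible implementation of those helpers plus a float-to-pcm16 converter. The name floatToPcm16 is hypothetical, and little-endian byte order is an assumption, since the byte order of pcm16 is not specified here.

```javascript
// Converts float samples in [-1, 1] to 16-bit signed PCM bytes.
// Assumption: little-endian byte order (not confirmed by the API docs).
function floatToPcm16(samples) {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return out;
}

// Encodes raw bytes as base64 for the `audio` field of input_audio_buffer.append.
function bytesToBase64(bytes) {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Decodes the base64 `delta` field of response.audio.delta into raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
  return out;
}
```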
Stream audio in 20–40 ms frames; this range is the sweet spot between latency and server-side VAD (voice activity detection) stability.
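To make the frame-size guidance concrete for pcm16: at 24 kHz, mono, 2 bytes per sample, a 20 ms frame is 480 samples (960 bytes) and a 40 ms frame is 960 samples (1920 bytes). The sketch below computes this and splits a larger buffer into frames. The names frameBytes and splitIntoFrames are hypothetical, and the figures apply only to pcm16, not to the g711 formats.

```javascript
// pcm16 parameters from the audio formats list: 24 kHz, mono, 16-bit samples.
const SAMPLE_RATE_HZ = 24000;
const BYTES_PER_SAMPLE = 2;

// Size in bytes of one pcm16 frame lasting `ms` milliseconds.
function frameBytes(ms, sampleRate = SAMPLE_RATE_HZ) {
  return Math.round((sampleRate * ms) / 1000) * BYTES_PER_SAMPLE;
}

// Yields consecutive frames of `ms` milliseconds from a Uint8Array of pcm16 bytes.
// The final frame may be shorter if the buffer is not a whole number of frames.
function* splitIntoFrames(pcm, ms = 20) {
  const size = frameBytes(ms);
  for (let offset = 0; offset < pcm.length; offset += size) {
    yield pcm.subarray(offset, Math.min(offset + size, pcm.length));
  }
}
```

Each yielded frame could then be base64-encoded and sent as one input_audio_buffer.append event.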
Billing is per second of connection time plus separate input/output audio rates. See /pricing for current rates and /docs/api/reference for the full event schema.