Mira Realtime

Mira Realtime is a low-latency, full-duplex voice API over WebSocket. You stream audio in both directions concurrently, receive partial transcripts of the user's speech, and play back the model's audio response as it streams. It is designed for voice assistants, call-center bots, and interactive characters.

Connection endpoint

wss://api.vmira.ai/v1/realtime

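Authentication uses WebSocket subprotocols. Pass two values when opening the connection: mira-realtime.v1 and openai-insecure-api-key.sk-mira-YOUR_KEY, where sk-mira-YOUR_KEY is your API key. This is the usual auth-via-subprotocol pattern for browser clients, which cannot set custom headers on a WebSocket handshake.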

Session flow

1. Connect: the client opens the WebSocket with subprotocol auth; the server replies with session.created
2. Configure: the client sends session.update with the voice, audio formats, system prompt, and tools
3. Audio up: the client streams audio frames in the configured input format as input_audio_buffer.append
4. Partial transcripts: the server emits conversation.item.input_audio_transcription.delta as recognition progresses
5. Audio down: the server emits response.audio.delta; play each chunk immediately rather than waiting for the response to complete
6. Barge-in: if the client resumes sending input_audio_buffer.append while response.audio.delta events are still streaming, the server automatically cancels the in-flight response
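
The server-side events in this flow can be handled with a small router keyed on the event type. The following is a minimal sketch, not the official client: createEventRouter, createSessionHandlers, send, play, and sessionConfig are hypothetical names introduced here, and the sketch assumes every server message is a JSON string with a type field.

```javascript
// Build a message handler that routes server events by their `type` field.
// `handlers` maps an event type to a callback; unknown types are ignored.
function createEventRouter(handlers) {
  return function onMessage(rawData) {
    const event = JSON.parse(rawData);
    if (Object.prototype.hasOwnProperty.call(handlers, event.type)) {
      handlers[event.type](event);
    }
    return event.type;
  };
}

// Hypothetical wiring for the flow above. `send` and `play` are app-supplied:
// `send` serializes and transmits a client event, `play` consumes base64 audio.
function createSessionHandlers({ send, play, sessionConfig }) {
  return {
    // Steps 1 and 2: configure once the server confirms the session.
    "session.created": () =>
      send({ type: "session.update", session: sessionConfig }),
    // Step 4: partial transcript text arrives in `delta`.
    "conversation.item.input_audio_transcription.delta": (e) =>
      console.log("user (partial):", e.delta),
    // Step 5: hand each audio chunk to the player as soon as it arrives.
    "response.audio.delta": (e) => play(e.delta),
    // Nothing to do here in this sketch; a real client might update UI state.
    "response.done": () => {},
  };
}
```

With a live socket, the router could be attached as ws.addEventListener("message", (ev) => route(ev.data)), where route is the function returned by createEventRouter.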

Minimal JavaScript example

const ws = new WebSocket(
  "wss://api.vmira.ai/v1/realtime",
  ["mira-realtime.v1", "openai-insecure-api-key.sk-mira-YOUR_KEY"],
);

ws.addEventListener("open", () => {
  // Configure the session
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "aria",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      instructions: "You are a friendly Mira voice assistant.",
    },
  }));
});

ws.addEventListener("message", (ev) => {
  const event = JSON.parse(ev.data);

  if (event.type === "response.audio.delta") {
    playPcm16(base64ToBytes(event.delta));   // play each chunk on arrival; both helpers are app-supplied
  }
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    console.log("user (partial):", event.delta);
  }
});

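// Send mic audio frames as they arrive; micStream is an app-supplied frame source
micStream.on("frame", (pcmBytes) => {
  // send() throws while the socket is still connecting, so drop early frames
  if (ws.readyState !== WebSocket.OPEN) return;
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytesToBase64(pcmBytes),
  }));
});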
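
The example calls base64ToBytes and bytesToBase64 without defining them. One possible implementation is sketched below, together with a hypothetical pcm16ToFloat32 helper for turning decoded chunks into the Float32 samples that the Web Audio API works with. The byte order of pcm16 is not stated above; little-endian is assumed here.

```javascript
// Decode a base64 string into raw bytes.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Encode raw bytes as base64. Bytes are converted in chunks so that large
// buffers do not exceed the argument limit of Function.prototype.apply.
function bytesToBase64(bytes) {
  const CHUNK = 0x8000;
  let binary = "";
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}

// Convert 16-bit PCM bytes to Float32 samples in the range [-1, 1).
// ASSUMPTION: samples are little-endian; a trailing odd byte is ignored.
function pcm16ToFloat32(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const samples = new Float32Array(Math.floor(bytes.byteLength / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}
```

A playPcm16 implementation could pass the output of pcm16ToFloat32 into an AudioBuffer and schedule it for playback; that part depends on the application's audio pipeline and is left out.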

Server vs client events

| Direction | Event type | Purpose |
| --- | --- | --- |
| Client → Server | session.update | Update session settings |
| Client → Server | input_audio_buffer.append | Send an audio frame |
| Client → Server | input_audio_buffer.commit | End the user turn |
| Client → Server | response.create | Request a model response |
| Server → Client | session.created | Session handshake confirmed |
| Server → Client | conversation.item.input_audio_transcription.delta | Partial user transcript |
| Server → Client | response.audio.delta | Streamed audio response chunk |
| Server → Client | response.done | Response finished |
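
The two turn-control events, input_audio_buffer.commit and response.create, do not appear in the minimal example. A sketch of how a client might end the user's turn explicitly, for instance in a push-to-talk interface, follows. The endUserTurn name is hypothetical, and the sketch assumes neither event needs fields beyond type; check the full event schema before relying on that. When the server's voice activity detection handles turn-taking, this step may be unnecessary.

```javascript
// Close the user's turn, then ask the model to respond.
// `ws` is any object with a send(string) method, such as a browser WebSocket.
// ASSUMPTION: both events are valid with only a `type` field.
function endUserTurn(ws) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
}
```

A push-to-talk button could call endUserTurn(ws) in its release handler.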

Audio formats

Input: pcm16 (16-bit PCM, 24 kHz, mono), g711_ulaw, g711_alaw
Output: pcm16 (16-bit PCM, 24 kHz, mono), g711_ulaw, g711_alaw

Stream audio in frames of 20–40 ms; this range balances latency against server-side VAD stability.
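
For pcm16 at 24 kHz mono, the recommended frame duration translates into a byte count per frame. The sketch below works through that arithmetic and splits a captured buffer into frames of that size; pcmFrameBytes and splitIntoFrames are hypothetical helper names, not part of the API.

```javascript
// Bytes in one frame of uncompressed PCM audio.
function pcmFrameBytes(sampleRateHz, frameMs, bytesPerSample = 2, channels = 1) {
  const samplesPerFrame = Math.round((sampleRateHz * frameMs) / 1000);
  return samplesPerFrame * bytesPerSample * channels;
}

// Split a PCM byte buffer into fixed-size frames. Bytes that do not fill a
// whole frame are returned as `remainder` so the caller can prepend them to
// the next capture buffer.
function splitIntoFrames(bytes, frameBytes) {
  const frames = [];
  let offset = 0;
  for (; offset + frameBytes <= bytes.length; offset += frameBytes) {
    frames.push(bytes.subarray(offset, offset + frameBytes));
  }
  return { frames, remainder: bytes.subarray(offset) };
}

console.log(pcmFrameBytes(24000, 20)); // 960  (480 samples x 2 bytes)
console.log(pcmFrameBytes(24000, 40)); // 1920 (960 samples x 2 bytes)
```

Each frame would then be base64-encoded and sent as one input_audio_buffer.append event.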

Billing combines a per-second charge for connection time with separate rates for input and output audio. See /pricing for current rates and /docs/api/reference for the full event schema.