Mira Realtime

Mira Realtime is a low-latency, full-duplex voice API over WebSocket. Stream audio in both directions concurrently, receive partial user transcripts, and play back the model's streamed audio response. It is designed for voice assistants, call-center bots, and interactive characters.

Connection endpoint

wss://api.vmira.ai/v1/realtime

Authentication uses WebSocket subprotocols. Pass two values when opening the socket: mira-realtime.v1 and openai-insecure-api-key.sk-mira-YOUR_KEY. This is the usual auth-via-subprotocol pattern for browser clients, where the WebSocket API cannot set custom request headers.

Session flow

  1. Connect: client opens a WebSocket with subprotocol auth; server replies with session.created
  2. Configure: client sends session.update with voice, audio format, system prompt and tools
  3. Audio up: client streams PCM/Opus frames as input_audio_buffer.append
  4. Partial transcripts: server emits conversation.item.input_audio_transcription.delta as recognition progresses
  5. Audio down: server emits response.audio.delta; play each chunk immediately rather than waiting for completion
  6. Barge-in: if the client resumes input_audio_buffer.append during response.audio.delta, the server auto-cancels the in-flight response
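
Steps 5 and 6 imply some client-side bookkeeping: chunks are queued for playback as they arrive, and anything still queued should probably be dropped once the user starts talking again, since the server has cancelled the rest of that response. A minimal sketch of that idea follows. PlaybackQueue is a hypothetical helper, not part of the Mira API, and how the app detects that the user has resumed speaking is left as an assumption.

```javascript
// Hypothetical client-side helper; not part of the Mira API.
// Holds decoded audio chunks until the audio output is ready for them.
class PlaybackQueue {
  constructor() {
    this.chunks = [];
  }

  // Call for every response.audio.delta chunk after base64 decoding.
  enqueue(chunk) {
    this.chunks.push(chunk);
  }

  // Call from the audio output when it wants the next chunk to play.
  next() {
    return this.chunks.shift(); // undefined when the queue is empty
  }

  // Call on barge-in: drop audio the user has talked over.
  // Returns how many chunks were discarded.
  flush() {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }
}

// Example: two chunks arrive, one is played, then the user interrupts.
const queue = new PlaybackQueue();
queue.enqueue(new Uint8Array([1, 2]));
queue.enqueue(new Uint8Array([3, 4]));
queue.next();
const dropped = queue.flush();
console.log(dropped); // 1
```

Without a local flush, the client could keep playing stale audio after the server has already moved on to the new user turn.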

Minimal JavaScript example

const ws = new WebSocket(
  "wss://api.vmira.ai/v1/realtime",
  ["mira-realtime.v1", "openai-insecure-api-key.sk-mira-YOUR_KEY"],
);

ws.addEventListener("open", () => {
  // Configure the session
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "aria",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      instructions: "You are a friendly Mira voice assistant.",
    },
  }));
});

ws.addEventListener("message", (ev) => {
  const event = JSON.parse(ev.data);

  if (event.type === "response.audio.delta") {
    // playPcm16 and base64ToBytes are app-provided helpers, not part of the API
    playPcm16(base64ToBytes(event.delta));   // play each chunk as it arrives
  }
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    console.log("user (partial):", event.delta);
  }
});

// Send mic audio frames as they arrive (micStream is an app-provided audio source)
micStream.on("frame", (pcmBytes) => {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: bytesToBase64(pcmBytes),
  }));
});
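
The example above calls bytesToBase64 and base64ToBytes without defining them. One possible implementation is sketched here, assuming audio bytes are held in a Uint8Array and that the global btoa and atob functions are available (browsers and recent Node.js versions). The function bodies are illustrative, not something the API ships.

```javascript
// Hypothetical implementations of the helpers the example assumes.

// Encode a Uint8Array of raw audio bytes as a base64 string.
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Decode a base64 string back into a Uint8Array.
function base64ToBytes(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Round trip on a few bytes, including values above 127.
const sample = new Uint8Array([0, 1, 254, 255]);
const encoded = bytesToBase64(sample);
console.log(encoded); // "AAH+/w=="
console.log(base64ToBytes(encoded)); // bytes 0, 1, 254, 255
```

Building the binary string byte by byte avoids the argument-count limits that String.fromCharCode(...bytes) can hit on large buffers; for 20–40 ms frames either approach would work.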

Server vs client events

Direction         Event type                                          Purpose
Client → Server   session.update                                      Update session settings
Client → Server   input_audio_buffer.append                           Send an audio frame
Client → Server   input_audio_buffer.commit                           End the user turn
Client → Server   response.create                                     Request a model response
Server → Client   session.created                                     Session handshake confirmed
Server → Client   conversation.item.input_audio_transcription.delta   Partial user transcript
Server → Client   response.audio.delta                                Streamed audio response chunk
Server → Client   response.done                                       Response finished
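
The two turn-related client events can be combined into a manual turn boundary, for example behind a push-to-talk button: commit the buffered audio, then request a response. The sketch below assumes that this ordering is what the server expects and that neither event needs fields beyond type. Both assumptions should be checked against the full event schema. endUserTurn is a hypothetical helper, and socket is anything with a send(string) method.

```javascript
// Hypothetical helper: close the user's turn and ask for a reply.
// Assumes commit and response.create need no fields beyond `type`.
function endUserTurn(socket) {
  const events = [
    { type: "input_audio_buffer.commit" },
    { type: "response.create" },
  ];
  for (const event of events) {
    socket.send(JSON.stringify(event));
  }
  return events.map((e) => e.type);
}

// Stand-in socket that records what would be sent over the wire.
const sent = [];
const fakeSocket = { send: (msg) => sent.push(msg) };
const sentTypes = endUserTurn(fakeSocket);
console.log(sentTypes);
```

If the session relies on server-side voice activity detection to find turn boundaries, these explicit events may be unnecessary.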

Audio formats

  • Input: pcm16 (16-bit PCM, 24 kHz, mono), g711_ulaw, g711_alaw
  • Output: pcm16 (16-bit PCM, 24 kHz, mono), g711_ulaw, g711_alaw
Stream 20–40 ms frames — that's the sweet spot between latency and server-side VAD stability.
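
For pcm16, the frame duration translates directly into a byte count: 24,000 samples per second at 2 bytes per sample means a 20 ms frame is 960 bytes and a 40 ms frame is 1,920 bytes. The sketch below does that arithmetic and splits a captured buffer into fixed-size frames. pcmFrameBytes and splitIntoFrames are hypothetical helper names, and the calculation applies only to the pcm16 format described above, not to the G.711 variants.

```javascript
// Bytes in one frame of raw PCM audio (defaults match pcm16: 24 kHz, 16-bit, mono).
function pcmFrameBytes(frameMs, sampleRate = 24000, bytesPerSample = 2, channels = 1) {
  return (sampleRate * frameMs / 1000) * bytesPerSample * channels;
}

// Split a byte buffer into fixed-size frames. A trailing partial frame is
// returned separately so the caller can prepend it to the next buffer.
function splitIntoFrames(bytes, frameBytes) {
  const frames = [];
  let offset = 0;
  while (offset + frameBytes <= bytes.length) {
    frames.push(bytes.subarray(offset, offset + frameBytes));
    offset += frameBytes;
  }
  return { frames, remainder: bytes.subarray(offset) };
}

console.log(pcmFrameBytes(20)); // 960
console.log(pcmFrameBytes(40)); // 1920

// 2000 captured bytes yield two full 20 ms frames plus 80 leftover bytes.
const { frames, remainder } = splitIntoFrames(new Uint8Array(2000), pcmFrameBytes(20));
console.log(frames.length, remainder.length); // 2 80
```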
Billing is per second of connection time plus separate input/output audio rates. See /pricing for current rates and /docs/api/reference for the full event schema.