Streaming

Streaming lets you receive the model's response as it is generated, rather than waiting for the full completion. This dramatically improves the user experience, especially for long responses, since text starts appearing almost instantly.

How to enable streaming

Add the stream: true parameter to your request body for /v1/chat/completions. Instead of a single JSON response, the server will send a series of events in Server-Sent Events (SSE) format.

Enable streaming
{
  "model": "mira",
  "stream": true,
  "messages": [
    { "role": "user", "content": "Tell me a story about space" }
  ]
}

Server-Sent Events format

When streaming, the server sends the response as a sequence of lines, each prefixed with data: and containing a JSON object. The final event is data: [DONE], indicating the end of the stream.

SSE stream format
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711000000,"model":"mira","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711000000,"model":"mira","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711000000,"model":"mira","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711000000,"model":"mira","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711000000,"model":"mira","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Stream event types

Each chunk in the stream contains a delta object inside choices[0]. The key fields are:

  • delta.role: appears in the first chunk and indicates the role ("assistant").
  • delta.content: a fragment of the response text. Concatenate all fragments to get the full response, as shown in the sketch below.
  • finish_reason: null during generation, "stop" on natural completion, "length" when the token limit is reached.

The usage object (token counts) is only included in non-streaming responses; token usage data is not available in streaming mode.
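
For illustration, here is how the delta fragments combine into a full message. This is a minimal sketch that reuses the sample chunks from the stream above, assuming they have already been parsed from their data: lines into Python dicts.

Python (assembling delta fragments)
# Sample chunks, as if already parsed from the "data: ..." lines above.
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": " there"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "!"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

fragments = []
finish_reason = None

for chunk in chunks:
    choice = chunk["choices"][0]
    content = choice["delta"].get("content")
    if content:
        fragments.append(content)
    if choice["finish_reason"] is not None:
        finish_reason = choice["finish_reason"]

print("".join(fragments))   # Hello there!
print(finish_reason)        # stop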

Implementation examples

cURL

cURL (streaming)
curl https://api.vmira.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-mira-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "mira",
    "stream": true,
    "messages": [
      { "role": "user", "content": "Write a poem about programming" }
    ]
  }'

The -N flag disables output buffering in cURL, letting you see chunks as they arrive.

Python

Python (streaming)
import json
import requests

response = requests.post(
    "https://api.vmira.ai/v1/chat/completions",
    headers={
        "Authorization": "Bearer sk-mira-YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "mira",
        "stream": True,
        "messages": [
            {"role": "user", "content": "Explain the theory of relativity"}
        ],
    },
    stream=True,  # Enable response streaming in requests
)

for line in response.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data: ") and decoded != "data: [DONE]":
            chunk = json.loads(decoded[6:])
            content = chunk["choices"][0]["delta"].get("content", "")
            if content:
                print(content, end="", flush=True)

print()  # Final newline

Python (OpenAI SDK)

Python (OpenAI SDK streaming)
from openai import OpenAI

client = OpenAI(
    api_key="sk-mira-YOUR_API_KEY",
    base_url="https://api.vmira.ai/v1",
)

stream = client.chat.completions.create(
    model="mira",
    messages=[
        {"role": "user", "content": "Tell me about quantum computers"}
    ],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

print()

JavaScript (fetch)

JavaScript (streaming with fetch)
const response = await fetch("https://api.vmira.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": "Bearer sk-mira-YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "mira",
    stream: true,
    messages: [
      { role: "user", content: "Hello! Tell me about yourself." },
    ],
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() || "";

  for (const line of lines) {
    if (line.startsWith("data: ") && line !== "data: [DONE]") {
      const chunk = JSON.parse(line.slice(6));
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        process.stdout.write(content); // Node.js
        // Or append to a DOM element in the browser
      }
    }
  }
}

console.log(); // Final newline

JavaScript (OpenAI SDK)

JavaScript (OpenAI SDK streaming)
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-mira-YOUR_API_KEY",
  baseURL: "https://api.vmira.ai/v1",
});

const stream = await client.chat.completions.create({
  model: "mira",
  messages: [
    { role: "user", content: "Explain recursion" },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

console.log();

Error handling in streams

If an error occurs during streaming (for example, a rate limit is exceeded), the server sends an event with an error field instead of a normal chunk. Always wrap stream processing in a try/catch (or try/except) block, as in the sketch below.

With streaming, the server returns HTTP status 200 as soon as the stream opens, before generation starts, so errors that occur during generation are delivered through the stream itself rather than via an HTTP status code.
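
A minimal sketch of defensive stream handling in Python, building on the earlier requests example. The exact shape of the mid-stream error event (an error object with a message field) is an assumption, so check the field names against real responses.

Python (stream error handling)
import json
import requests

try:
    response = requests.post(
        "https://api.vmira.ai/v1/chat/completions",
        headers={"Authorization": "Bearer sk-mira-YOUR_API_KEY"},
        json={
            "model": "mira",
            "stream": True,
            "messages": [{"role": "user", "content": "Hello"}],
        },
        stream=True,
        timeout=60,
    )
    response.raise_for_status()  # catches errors returned before the stream starts

    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode("utf-8")
        if not decoded.startswith("data: ") or decoded == "data: [DONE]":
            continue
        chunk = json.loads(decoded[6:])
        if "error" in chunk:  # error delivered mid-stream (assumed field name)
            raise RuntimeError(chunk["error"].get("message", "unknown stream error"))
        content = chunk["choices"][0]["delta"].get("content", "")
        if content:
            print(content, end="", flush=True)
except (requests.RequestException, json.JSONDecodeError, RuntimeError) as exc:
    print(f"\nStream failed: {exc}")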

When to use streaming

  • Chat interfaces: displaying text as it generates creates a natural conversational feel.
  • Long responses: the user sees the start of the response without waiting for the end.
  • Progress indicator: the arriving stream itself serves as a natural signal that the model is working.

Next steps