Mira Vision

Mira Vision is the set of image-understanding capabilities: detailed scene description, visual question answering, OCR and natural-language grounded detection. Pass images either to chat or to the dedicated detection endpoint.

Capabilities

  • Scene descriptiondetailed description of an image's contents
  • Visual QAanswer questions about an image
  • OCRextract printed and handwritten text, including Cyrillic
  • Object detectionfind an object by natural-language description via /v1/vision/locate — returns a normalized bounding box
  • Multi-image inputmultiple images in one chat message for comparison / composition

Images in chat

Pass an image inside the content array as an image_url part. Mix with regular text parts and add multiple images as needed.

cURL
curl https://api.vmira.ai/v1/chat/completions \
  -H "Authorization: Bearer $MIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mira",
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image? Extract all visible text." },
        { "type": "image_url",
          "image_url": { "url": "https://example.com/receipt.jpg" } }
      ]
    }]
  }'

Python (OpenAI-compatible SDK)

Python
from openai import OpenAI

client = OpenAI(
    api_key="sk-mira-YOUR_API_KEY",
    base_url="https://api.vmira.ai/v1",
)

response = client.chat.completions.create(
    model="mira",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe both photos and list the differences."},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        ],
    }],
)

print(response.choices[0].message.content)

Grounded detection — /v1/vision/locate

When you need precise object coordinates (cropping, overlays, animation), call /v1/vision/locate. Pass an image plus a natural-language target description; the endpoint returns a normalized bounding box (0..1) or a cells array in grid mode.

cURL
curl https://api.vmira.ai/v1/vision/locate \
  -H "Authorization: Bearer $MIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image": "https://example.com/storefront.jpg",
    "target": "the coffee-shop sign",
    "mode": "bbox"
  }'

Response (bbox mode)

JSON
{
  "ok": true,
  "x": 0.412,
  "y": 0.118,
  "width": 0.247,
  "height": 0.092
}

Parameters

/v1/vision/locate

  • imagerequired: URL or data-URI
  • targetnatural-language object description
  • modebbox (default) | grid
  • gridrequired for mode=grid: { cols, rows }
See /pricing for current per-call vision cost and /docs/api/reference for the full endpoint list.