Mira Vision
Mira Vision is the set of image-understanding capabilities: detailed scene description, visual question answering, OCR, and natural-language grounded detection. Pass images either to the chat endpoint (/v1/chat/completions) or to the dedicated detection endpoint (/v1/vision/locate).
Capabilities
- Scene description — detailed description of an image's contents
- Visual QA — answer questions about an image
- OCR — extract printed and handwritten text, including Cyrillic
- Object detection — find an object by natural-language description via /v1/vision/locate, which returns a normalized bounding box
- Multi-image input — multiple images in one chat message for comparison / composition
Images in chat
Pass an image inside the content array as an image_url part. Mix with regular text parts and add multiple images as needed.
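The content array described above can also be assembled programmatically. The following is a minimal sketch, not part of any Mira SDK: build_image_message is a hypothetical helper name, and the part shapes simply mirror the text and image_url parts used in the request examples on this page.

```python
from typing import Any


def build_image_message(prompt: str, image_urls: list[str]) -> dict[str, Any]:
    """Build one user message: a text part followed by image_url parts.

    Hypothetical helper for illustration; the part layout mirrors the
    request examples in this section.
    """
    content: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


message = build_image_message(
    "Describe both photos and list the differences.",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
print(len(message["content"]))  # 3: one text part plus two image parts
```

The resulting dictionary can be used as an element of the messages list in a chat completions request.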
cURL
curl https://api.vmira.ai/v1/chat/completions \
  -H "Authorization: Bearer $MIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mira",
    "messages": [{
      "role": "user",
      "content": [
        { "type": "text", "text": "What is in this image? Extract all visible text." },
        { "type": "image_url",
          "image_url": { "url": "https://example.com/receipt.jpg" } }
      ]
    }]
  }'
Python (OpenAI-compatible SDK)
Python
from openai import OpenAI

# Point the OpenAI-compatible client at the Mira API.
client = OpenAI(
    api_key="sk-mira-YOUR_API_KEY",
    base_url="https://api.vmira.ai/v1",
)

# One text part followed by two images to compare.
response = client.chat.completions.create(
    model="mira",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe both photos and list the differences."},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
Grounded detection — /v1/vision/locate
When you need precise object coordinates (cropping, overlays, animation), call /v1/vision/locate. Pass an image plus a natural-language target description; the endpoint returns a normalized bounding box (0..1) or a cells array in grid mode.
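Because the box is normalized, using it for cropping or overlays means scaling it by the image's pixel dimensions. The following is a minimal sketch, assuming that x and y give the top-left corner of the box and that all four values are fractions of the image width and height. This page does not state the origin convention, so verify it against a known image. The names NormalizedBox and to_pixel_box are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedBox:
    """Bounding box whose values are all in the 0..1 range."""
    x: float
    y: float
    width: float
    height: float


def _clamp(value: int, upper: int) -> int:
    """Keep a pixel coordinate inside [0, upper]."""
    return max(0, min(value, upper))


def to_pixel_box(
    box: NormalizedBox, image_width: int, image_height: int
) -> tuple[int, int, int, int]:
    """Convert a normalized box to (left, top, right, bottom) in pixels.

    Assumes x and y are the top-left corner (an assumption, not confirmed
    by the endpoint description).
    """
    left = round(box.x * image_width)
    top = round(box.y * image_height)
    right = round((box.x + box.width) * image_width)
    bottom = round((box.y + box.height) * image_height)
    return (
        _clamp(left, image_width),
        _clamp(top, image_height),
        _clamp(right, image_width),
        _clamp(bottom, image_height),
    )


# Values from the bbox-mode response example on this page,
# applied to a hypothetical 1000x500 image.
box = NormalizedBox(x=0.412, y=0.118, width=0.247, height=0.092)
print(to_pixel_box(box, 1000, 500))  # (412, 59, 659, 105)
```

The (left, top, right, bottom) order matches what Pillow's Image.crop expects, so the tuple can be passed to it directly if Pillow is used for cropping.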
cURL
curl https://api.vmira.ai/v1/vision/locate \
  -H "Authorization: Bearer $MIRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image": "https://example.com/storefront.jpg",
    "target": "the coffee-shop sign",
    "mode": "bbox"
  }'
Response (bbox mode)
JSON
{
  "ok": true,
  "x": 0.412,
  "y": 0.118,
  "width": 0.247,
  "height": 0.092
}
Parameters
/v1/vision/locate
- image — required: URL or data-URI
- target — natural-language object description
- mode — bbox (default) | grid
- grid — required for mode=grid: { cols, rows }
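The parameter rules above can be checked client-side before a request is sent. The sketch below is an illustration, not an official client: build_locate_payload and to_data_uri are hypothetical helper names, the default MIME type is an assumption (this page does not list accepted image formats), and the grid size of 4 by 3 is arbitrary.

```python
import base64
import json
from typing import Any, Optional


def to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_locate_payload(
    image: str,
    target: str,
    mode: str = "bbox",
    grid: Optional[dict[str, int]] = None,
) -> dict[str, Any]:
    """Build a JSON body for /v1/vision/locate from the listed parameters."""
    if mode not in ("bbox", "grid"):
        raise ValueError("mode must be 'bbox' or 'grid'")
    payload: dict[str, Any] = {"image": image, "target": target, "mode": mode}
    if mode == "grid":
        # grid is listed as required in grid mode: { cols, rows }
        if grid is None or "cols" not in grid or "rows" not in grid:
            raise ValueError("grid mode requires grid={'cols': ..., 'rows': ...}")
        payload["grid"] = {"cols": grid["cols"], "rows": grid["rows"]}
    return payload


# Placeholder bytes stand in for a real image file read from disk.
payload = build_locate_payload(
    to_data_uri(b"not-a-real-image"),
    "the coffee-shop sign",
    mode="grid",
    grid={"cols": 4, "rows": 3},
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST to /v1/vision/locate with the same Authorization header as in the cURL example.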
See /pricing for current per-call vision cost and /docs/api/reference for the full endpoint list.