vision - Image Understanding

Analyze local images or image URLs using the Vision API. Supports content description, text extraction (OCR), object recognition, and more.

Model Selection

The vision tool uses a multi-level auto-selection strategy with automatic fallback — no manual configuration required:

Main model — uses the currently configured main model for image recognition (must be a multimodal model)
Other configured models — auto-discovers other multimodal models with configured API keys as alternatives

If the current provider fails, the tool automatically tries the next one until it succeeds or all fail.
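The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the `Provider` type and the `analyze_with_fallback` function are hypothetical names, and each provider is assumed to be a callable that either returns an answer or raises an exception.

```python
from typing import Callable, Sequence

# Hypothetical type: a (name, callable) pair. The callable takes
# (image, question) and returns the answer, or raises on failure.
Provider = tuple[str, Callable[[str, str], str]]


def analyze_with_fallback(providers: Sequence[Provider], image: str, question: str) -> str:
    """Try each provider in order and return the first successful answer."""
    errors = []
    for name, call in providers:
        try:
            return call(image, question)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    # Reached only when every provider has failed (or none were given).
    raise RuntimeError("all vision providers failed: " + "; ".join(errors))
```

Placing the main model first in `providers` and the auto-discovered alternatives after it reproduces the order described in this section.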

Supported Models

Provider	Vision Model	Notes
OpenAI / Compatible	Main model	All OpenAI-protocol-compatible multimodal models
Qwen (DashScope)	Main model	e.g. qwen3.7-plus
Claude	Main model	Anthropic native image format
Gemini	Main model	inlineData format
Doubao	Main model	doubao-seed-2-0 series natively supported
Kimi (Moonshot)	Main model	kimi-k2.6, kimi-k2.5 natively supported
ERNIE	Main model	Defaults to the multimodal main model (e.g. `ernie-5.1`); falls back to `ernie-4.5-turbo-vl` when the main model is not multimodal
ZhipuAI	glm-5v-turbo	Always uses the dedicated vision model
MiniMax	MiniMax-Text-01	Always uses the dedicated vision model

ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.
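The per-provider rules in the table can be summarized in a small lookup. The sketch below is only an illustration of those rules: the provider keys (`"zhipuai"`, `"minimax"`, `"ernie"`) and the function `resolve_vision_model` are assumed names, and whether the main model is multimodal is passed in as a flag rather than detected.

```python
from typing import Optional

# Providers that always use a dedicated vision model (from the table above).
# The dictionary keys are illustrative, not the tool's real identifiers.
DEDICATED_VISION_MODELS = {
    "zhipuai": "glm-5v-turbo",
    "minimax": "MiniMax-Text-01",
}
ERNIE_FALLBACK = "ernie-4.5-turbo-vl"


def resolve_vision_model(provider: str, main_model: str, main_is_multimodal: bool) -> Optional[str]:
    """Return the model to use for vision, or None if the provider has none."""
    if provider in DEDICATED_VISION_MODELS:
        return DEDICATED_VISION_MODELS[provider]
    if main_is_multimodal:
        return main_model
    if provider == "ernie":
        # ERNIE falls back to its vision model when the main model is text-only.
        return ERNIE_FALLBACK
    return None
```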

When use_linkai=true, LinkAI’s multimodal model is used by default.

Custom Configuration

To specify the model used by the vision tool, set it in config.json, for example:

{
    "tools": {
        "vision": {
            "model": "gpt-4.1"
        }
    }
}

The specified model is used first, and the tool automatically routes to the corresponding provider based on the model name; on failure, it falls back to other configured providers. In most cases no configuration is needed — the tool works automatically as long as the main model supports multimodal input or any vision-capable API key is configured.
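As a rough sketch of how the override interacts with fallback, the snippet below reads the optional `tools.vision.model` key and puts that model at the front of the candidate list. The function names `vision_model_override` and `candidate_order` are hypothetical, and routing a model name to its provider is not shown because the source does not describe the routing rules.

```python
import json
from typing import Optional


def vision_model_override(config_text: str) -> Optional[str]:
    """Return tools.vision.model from a config.json string, or None if unset."""
    config = json.loads(config_text)
    model = config.get("tools", {}).get("vision", {}).get("model")
    return model or None


def candidate_order(override: Optional[str], configured: list[str]) -> list[str]:
    """Put the override first, then the other configured models, without duplicates."""
    ordered = [override] if override else []
    ordered += [m for m in configured if m != override]
    return ordered
```

With no override set, `candidate_order` simply returns the configured models in their existing order, which matches the zero-configuration case.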

Parameters

Parameter	Type	Required	Description
`image`	string	Yes	Local file path or HTTP(S) image URL
`question`	string	Yes	Question to ask about the image

Supported image formats: jpg, jpeg, png, gif, webp
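A call to the tool needs both parameters, and the image should be in one of the supported formats. The following check is a hedged sketch of that contract: `validate_vision_args` is a hypothetical helper, and judging the format by file extension is an assumption (the tool may inspect the actual content instead). URLs without an extension are let through for that reason.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED_FORMATS = {"jpg", "jpeg", "png", "gif", "webp"}


def validate_vision_args(args: dict) -> list[str]:
    """Return a list of problems with the arguments; an empty list means OK."""
    problems = []
    for key in ("image", "question"):
        value = args.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' is required and must be a non-empty string")

    image = args.get("image")
    if isinstance(image, str) and image.strip():
        parsed = urlparse(image)
        is_url = parsed.scheme in ("http", "https")
        # For URLs, look only at the path so query strings are ignored.
        path = parsed.path if is_url else image
        ext = PurePosixPath(path).suffix.lower().lstrip(".")
        if ext not in SUPPORTED_FORMATS and not (is_url and not ext):
            problems.append(f"unsupported image format: '{ext or 'none'}'")
    return problems
```

For example, `{"image": "https://example.com/a/cat.PNG?x=1", "question": "What is this?"}` passes, while a `.bmp` path or a missing `question` is reported.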

Use Cases

Describe image content
Extract text from images (OCR)
Identify objects, colors, scenes
Analyze screenshots and scanned documents

Images larger than 1 MB are automatically compressed before upload. All images, including those given as remote URLs, are converted to base64 for transmission to ensure compatibility with all model backends.
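The size check and base64 step can be illustrated with the standard library alone. This is a sketch under stated assumptions: "1 MB" is taken to mean 1,048,576 bytes, the helper names are hypothetical, and the `data:` URL form is only one common way to carry base64 images (some backends expect the raw base64 string plus a separate media type). The compression itself is omitted because it needs an imaging library.

```python
import base64
import mimetypes

# Assumption: the 1 MB threshold means 1024 * 1024 bytes.
COMPRESS_THRESHOLD = 1024 * 1024


def needs_compression(data: bytes) -> bool:
    """True when the image is larger than the compression threshold."""
    return len(data) > COMPRESS_THRESHOLD


def to_data_url(data: bytes, filename: str) -> str:
    """Encode image bytes as a base64 data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"
```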
