Model Selection
The vision tool uses a multi-level auto-selection strategy with automatic fallback, so no manual configuration is required:
- Main model: uses the currently configured main model for image recognition (must be a multimodal model)
- Other configured models: auto-discovers other multimodal models with configured API keys as alternatives
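The selection order above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `ModelCandidate`, `pick_candidates`, and `analyze_with_fallback` are hypothetical names, and the `call` argument stands in for whatever function sends the image to a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ModelCandidate:
    """Hypothetical record describing one configured model."""
    name: str
    multimodal: bool    # can the model accept image input?
    has_api_key: bool   # is an API key configured for it?


def pick_candidates(main: ModelCandidate,
                    others: List[ModelCandidate]) -> List[ModelCandidate]:
    """Order candidates: main model first, then other usable multimodal models."""
    ordered = []
    if main.multimodal:
        ordered.append(main)
    for model in others:
        if model.multimodal and model.has_api_key and model.name != main.name:
            ordered.append(model)
    return ordered


def analyze_with_fallback(candidates: List[ModelCandidate],
                          call: Callable[[ModelCandidate], str]) -> Tuple[str, str]:
    """Try each candidate in order and return (model name, answer) from the first success."""
    errors = []
    for model in candidates:
        try:
            return model.name, call(model)
        except Exception as exc:  # any failure moves on to the next candidate
            errors.append(f"{model.name}: {exc}")
    raise RuntimeError("no vision model succeeded: " + "; ".join(errors))
```

If the main model is text-only, it is skipped and the first discovered multimodal model with an API key is tried instead.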
Supported Models
| Provider | Vision Model | Notes |
|---|---|---|
| OpenAI / Compatible | Main model | All OpenAI-protocol-compatible multimodal models |
| Qwen (DashScope) | Main model | e.g. qwen3.7-plus |
| Claude | Main model | Anthropic native image format |
| Gemini | Main model | inlineData format |
| Doubao | Main model | doubao-seed-2-0 series natively supported |
| Kimi (Moonshot) | Main model | kimi-k2.6, kimi-k2.5 natively supported |
| ERNIE | Main model | Defaults to the multimodal main model (e.g. ernie-5.1); falls back to ernie-4.5-turbo-vl when the main model is not multimodal |
| ZhipuAI | glm-5v-turbo | Always uses the dedicated vision model |
| MiniMax | MiniMax-Text-01 | Always uses the dedicated vision model |
ZhipuAI and MiniMax text models do not support image understanding, so their dedicated vision models are always used automatically.
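The provider-specific rules in the table can be expressed as a small lookup. The sketch below is only an illustration of those rules: `resolve_vision_model` and the lowercase provider keys are assumptions made for this example, while the model names are taken from the table.

```python
# Providers whose dedicated vision model is always used (from the table above).
DEDICATED_VISION_MODELS = {
    "zhipuai": "glm-5v-turbo",
    "minimax": "MiniMax-Text-01",
}

# ERNIE falls back to this model when the main model is not multimodal.
ERNIE_FALLBACK = "ernie-4.5-turbo-vl"


def resolve_vision_model(provider: str, main_model: str,
                         main_is_multimodal: bool) -> str:
    """Return the model name to use for image understanding (hypothetical helper)."""
    key = provider.lower()
    if key in DEDICATED_VISION_MODELS:
        return DEDICATED_VISION_MODELS[key]
    if key == "ernie" and not main_is_multimodal:
        return ERNIE_FALLBACK
    # All other providers use the main model; fallback to other configured
    # models is outside the scope of this sketch.
    return main_model
```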
When use_linkai=true, LinkAI’s multimodal model is used by default.
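The Notes column in the table mentions different image formats per provider. As a rough illustration, the sketch below builds the image part of a request for three of them, following the shapes described in each provider's public API documentation. `build_image_part` and the provider labels are hypothetical, and the tool's real request code may differ.

```python
def build_image_part(provider: str, b64_data: str,
                     mime_type: str = "image/jpeg") -> dict:
    """Build the image portion of a message for the given provider (illustrative only)."""
    if provider == "openai":
        # OpenAI-protocol-compatible chat format: image passed as a data URL.
        return {
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{b64_data}"},
        }
    if provider == "claude":
        # Anthropic native image block with a base64 source.
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": mime_type, "data": b64_data},
        }
    if provider == "gemini":
        # Gemini REST format: inlineData part.
        return {"inlineData": {"mimeType": mime_type, "data": b64_data}}
    raise ValueError(f"unsupported provider label: {provider}")
```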
Custom Configuration
To specify the model used by Vision, configure it in config.json, for example:
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| image | string | Yes | Local file path or HTTP(S) image URL |
| question | string | Yes | Question to ask about the image |
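To make the two parameters concrete, here is a minimal sketch of how a caller might validate them and tell a local path from an HTTP(S) URL. `validate_args` and the returned `source` field are hypothetical and not part of the tool's documented interface.

```python
from urllib.parse import urlparse


def validate_args(args: dict) -> dict:
    """Check the required parameters and classify the image source (hypothetical helper)."""
    image = args.get("image")
    question = args.get("question")
    if not isinstance(image, str) or not image.strip():
        raise ValueError("'image' is required and must be a non-empty string")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("'question' is required and must be a non-empty string")

    # Anything that is not an http/https URL is treated as a local file path.
    scheme = urlparse(image).scheme.lower()
    source = "url" if scheme in ("http", "https") else "file"
    return {"image": image, "question": question, "source": source}


# Example arguments matching the parameter table (values are placeholders).
example = validate_args({
    "image": "https://example.com/screenshot.png",
    "question": "What text appears in this image?",
})
```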
Use Cases
- Describe image content
- Extract text from images (OCR)
- Identify objects, colors, scenes
- Analyze screenshots and scanned documents
Images larger than 1MB are automatically compressed before upload. All images (including remote URLs) are converted to base64 for transmission to ensure compatibility with all model backends.
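The size check and base64 step can be sketched as below. This is an assumption-laden illustration: `encode_image` is a hypothetical name, the 1 MB limit is read here as 1024 * 1024 bytes, and the actual compression is left to a caller-supplied `compress` function because the source does not say how images are compressed.

```python
import base64
from typing import Callable

# Assumption: the "1MB" threshold is interpreted as 1024 * 1024 bytes.
MAX_BYTES = 1024 * 1024


def encode_image(data: bytes, compress: Callable[[bytes], bytes]) -> str:
    """Compress oversized image bytes, then return them as a base64 string."""
    if len(data) > MAX_BYTES:
        data = compress(data)  # caller decides how to shrink the image
    return base64.b64encode(data).decode("ascii")
```

The same function would apply to remote images once their bytes have been downloaded, since the text above says URLs are also converted to base64.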
