Overview
Vision Language Models (VLMs) are a class of multimodal large models that support both image and text inputs, with the ability to understand image content and process cross-modal information. These models can output high-quality responses based on combined image and text information, and are widely used in scenarios such as image recognition, content understanding, and intelligent Q&A.Typical Use Cases
- Image Content Recognition & Description: Automatically recognize objects, colors, scenes, and spatial relationships in images, and generate natural language descriptions.
- Image-Text Comprehension: Combine image and text inputs to enable context-aware multi-turn conversations and complex task responses.
- Visual-Aided Q&A: Can serve as a supplement to OCR tools, recognizing text embedded in images and completing Q&A tasks.
- Future Applications: Suitable for intelligent visual assistants, robotic perception, augmented reality, and other interactive scenarios.
API Usage
Calling Vision Language Models requires the/chat/completions endpoint, which supports mixed image-text inputs.
Image Processing Parameters
Use thedetail field to set image processing precision, with the following options:
high: High resolution, preserves more details, suitable for fine-grained tasks.low: Low resolution, faster processing, suitable for real-time responses.auto: The system automatically selects the appropriate mode.
Message Format Examples
URL Image Format
Base64 Image Format
Base64 Image Encoding Example Code (Python)
Multi-Image Mode
Supports sending multiple images along with text as input. It is recommended to use no more than two images for optimal performance and comprehension.Supported Models
The following are the currently supported Vision Language Models (VLMs):Pricing
Image inputs for Vision Language Models are converted to tokens and calculated together with text for billing:- Token estimation rules for images vary slightly between models;
- Detailed pricing can be found on each model’s detail page.
API Call Example Code
Single Image Description
Multi-Image Comparative Analysis
FAQ & Notes
- Image resolution and clarity affect model recognition accuracy; clear image sources are recommended.
- Base64 encoding produces large payloads; it is recommended to keep images under 1MB.
- If you encounter issues, please refer to the platform developer documentation or submit a support ticket.