Overview

Vision Language Models (VLMs) are multimodal large models that accept both image and text inputs. They can understand image content, process cross-modal information, and generate high-quality responses from the combined image and text context. VLMs are widely used in scenarios such as image recognition, content understanding, and intelligent Q&A.

Typical Use Cases

  • Image Content Recognition & Description: Automatically recognize objects, colors, scenes, and spatial relationships in images, and generate natural language descriptions.
  • Image-Text Comprehension: Combine image and text inputs to enable context-aware multi-turn conversations and complex task responses.
  • Visual-Aided Q&A: Supplement OCR tools by recognizing text embedded in images and answering questions about it.
  • Future Applications: Suitable for intelligent visual assistants, robotic perception, augmented reality, and other interactive scenarios.

API Usage

Vision Language Models are called through the /chat/completions endpoint, which accepts mixed image-and-text input.

Image Processing Parameters

Use the detail field to set image processing precision, with the following options:
  • high: High resolution, preserves more details, suitable for fine-grained tasks.
  • low: Low resolution, faster processing, suitable for real-time responses.
  • auto: The system automatically selects the appropriate mode.

Message Format Examples

URL Image Format

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image.png",
        "detail": "high"
      }
    },
    {
      "type": "text",
      "text": "Please describe the scene in the image."
    }
  ]
}

Base64 Image Format

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "low"
      }
    },
    {
      "type": "text",
      "text": "What text content is in the image?"
    }
  ]
}

Base64 Image Encoding Example Code (Python)

import base64
import io

from PIL import Image

def image_to_base64(image_path):
    with Image.open(image_path) as img:
        # JPEG has no alpha channel, so convert first; otherwise saving
        # an RGBA or palette-mode image (e.g. many PNGs) raises an error.
        if img.mode != "RGB":
            img = img.convert("RGB")
        buffered = io.BytesIO()
        img.save(buffered, format="JPEG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

base64_image = image_to_base64("path/to/your/image.jpg")
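The encoded string is then embedded in a data URL inside the message object shown earlier. The helper below is a hypothetical convenience wrapper (`build_image_message` is not part of any SDK) that assembles a complete message from an already-encoded base64 string:

```python
import base64

def build_image_message(base64_image: str, question: str, detail: str = "low") -> dict:
    """Assemble a /chat/completions message with one base64 image and a text prompt."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    # Embed the encoded image as a data URL.
                    "url": f"data:image/jpeg;base64,{base64_image}",
                    "detail": detail,
                },
            },
            {"type": "text", "text": question},
        ],
    }

# Placeholder bytes stand in for a real image file here.
fake_b64 = base64.b64encode(b"\xff\xd8\xff\xe0 fake jpeg bytes").decode("utf-8")
message = build_image_message(fake_b64, "What text content is in the image?")
print(message["content"][0]["image_url"]["url"][:22])  # data:image/jpeg;base64
```

The resulting dict can be passed directly in the `messages` list of a chat completion request.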

Multi-Image Mode

Multiple images can be sent along with text in a single request. For optimal performance and comprehension, it is recommended to use no more than two images.

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/image1.png"
      }
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}"
      }
    },
    {
      "type": "text",
      "text": "Compare the common features of these two images."
    }
  ]
}

Supported Models

The following are the currently supported Vision Language Models (VLMs):

Pricing

Image inputs for Vision Language Models are converted to tokens and billed together with the text tokens:
  • Token estimation rules for images vary slightly between models;
  • Detailed pricing can be found on each model’s detail page.

API Call Example Code

Single Image Description

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.myrouter.ai/openai")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cityscape.jpg"}},
                {"type": "text", "text": "Describe the main buildings in the image."}
            ]
        }
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
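If you need the full reply as a single string rather than printing deltas as they arrive, you can accumulate the stream. The sketch below uses stand-in objects in place of the SDK's streaming chunks so it runs without a network call; with a real call, you would pass the `response` iterator from above instead of `fake_chunks`:

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Join the delta fragments of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta.content is typically None
            parts.append(delta)
    return "".join(parts)

# Stand-ins mimicking the shape of the SDK's streaming chunks.
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["The image ", "shows a ", "city skyline.", None]
]
print(collect_stream(fake_chunks))  # The image shows a city skyline.
```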

Multi-Image Comparative Analysis

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/product1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/product2.jpg"}},
                {"type": "text", "text": "Please compare the main differences between these two products."}
            ]
        }
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

FAQ & Notes

  • Image resolution and clarity affect recognition accuracy; use clear, high-quality image sources where possible.
  • Base64 encoding inflates payload size by about a third; it is recommended to keep images under 1 MB.
  • If you encounter issues, refer to the platform developer documentation or submit a support ticket.
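The 1 MB guideline above can be checked before sending. This is a stdlib-only sketch (`base64_size`, `fits_in_payload`, and `MAX_BYTES` are illustrative names, and 1 MB is the soft recommendation from this page, not a documented hard limit):

```python
import base64

MAX_BYTES = 1_000_000  # soft recommendation: keep encoded images under ~1 MB

def base64_size(raw: bytes) -> int:
    # Base64 inflates data by roughly 4/3 (plus padding),
    # so measure the encoded size directly.
    return len(base64.b64encode(raw))

def fits_in_payload(raw: bytes) -> bool:
    """Check whether the base64-encoded form stays under MAX_BYTES."""
    return base64_size(raw) <= MAX_BYTES

# In practice, pass the raw bytes of your image file:
#   with open("photo.jpg", "rb") as f:
#       ok = fits_in_payload(f.read())
print(base64_size(b"x" * 300))  # 400
```

If an image exceeds the limit, resize or re-compress it (e.g. with Pillow) before encoding.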