LFM2.5-VL-1.6B

LFM2.5-VL-1.6B is Liquid AI’s flagship vision-language model for image understanding, visual reasoning, and other multimodal tasks. It is built on LFM2.5 and uses a dynamic SigLIP2 image encoder.

Available formats: HF, GGUF, MLX, ONNX

Specifications

Property	Value
Parameters	1.6B
Context Length	32K tokens
Architecture	LFM2.5-VL (Dense)

Image Captioning

Detailed descriptions and alt-text

Visual Reasoning

Scene understanding and visual Q&A

OCR & Extraction

Text recognition and document parsing

Quick Start

The examples below cover three runtimes: Transformers, vLLM, and llama.cpp.

Transformers

Install:

uv pip install "transformers>=5.0.0" pillow torch

Download & Run:

from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2.5-VL-1.6B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
# IMPORTANT: tie lm_head to input embeddings (transformers v5 bug)
model.lm_head.weight = model.get_input_embeddings().weight

processor = AutoProcessor.from_pretrained(model_id)

url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = load_image(url)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=256,
)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
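
With decoder-only generation in Transformers, the sequences returned by `generate` normally start with the prompt tokens, so decoding `outputs` directly tends to echo the prompt before the answer. The sketch below shows one way to keep only the newly generated tokens. `strip_prompt` is a hypothetical helper, not part of Transformers, and the toy token ids are made up for illustration.

```python
def strip_prompt(output_ids, prompt_len):
    """Drop the first `prompt_len` token ids from every generated sequence.

    Works on nested lists and on 2-D tensors, since both support slicing.
    """
    return [seq[prompt_len:] for seq in output_ids]


# Toy example: plain lists stand in for token ids.
prompt_ids = [[101, 7, 8, 9]]
generated = [[101, 7, 8, 9, 42, 43, 2]]
new_tokens = strip_prompt(generated, len(prompt_ids[0]))
print(new_tokens)  # [[42, 43, 2]]
```

In the example above, this would presumably be called as `strip_prompt(outputs, inputs["input_ids"].shape[1])`, with the result passed to `processor.batch_decode` in place of `outputs`.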

vLLM

vLLM support for LFM vision models requires a specific vLLM build. Install it from source, pinned to the precompiled wheel commit below.

Install:

VLLM_PRECOMPILED_WHEEL_COMMIT=72506c98349d6bcd32b4e33eec7b5513453c1502 \
  VLLM_USE_PRECOMPILED=1 \
  uv pip install git+https://github.com/vllm-project/vllm.git

uv pip install "transformers>=5.0.0" pillow

Run:

from vllm import LLM, SamplingParams

IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"

llm = LLM(
    model="LiquidAI/LFM2.5-VL-1.6B",
    max_model_len=1024,
)

sampling_params = SamplingParams(
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=256,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        {"type": "text", "text": "Describe what you see in this image."},
    ],
}]

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
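
The example above passes a remote image by URL. For a local file, one common approach with OpenAI-style `image_url` content is to embed the image as a base64 data URL. The sketch below assumes the installed vLLM build accepts data URLs in `image_url`, which is worth verifying against your version. `image_to_data_url` and `build_image_message` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None:
        mime = "application/octet-stream"  # fallback when the extension is unknown
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(path, prompt):
    """Build a single-turn chat message in the same shape as `messages` above."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }]
```

The returned list could then be passed to `llm.chat(...)` together with `sampling_params`, in the same way as `messages`.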

llama.cpp

llama.cpp enables efficient CPU inference for vision models.

Install:

brew install llama.cpp

Or download pre-built binaries from the llama.cpp releases page.

Run:

llama-cli \
    -hf LiquidAI/LFM2.5-VL-1.6B-GGUF:Q4_0 \
    --image test_image.jpg \
    -p "What's in this image?" \
    -n 128 \
    --temp 0.1 --min-p 0.15 --repeat-penalty 1.05

The -hf flag downloads the model directly from Hugging Face. Use --image-max-tokens to control the image token budget.

For server deployment and advanced usage, see the llama.cpp guide.
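
All three quick-start examples on this page use the same sampling values (temperature 0.1, min-p 0.15, repetition penalty 1.05) under runtime-specific parameter names. As a rough sketch, those values can be kept in one place and mapped onto the `llama-cli` flags shown above. `SAMPLING` and `llama_cli_command` are hypothetical names, and the function only builds the argument list without running anything.

```python
import shlex

# Sampling values used by the quick-start examples on this page.
SAMPLING = {"temperature": 0.1, "min_p": 0.15, "repetition_penalty": 1.05}


def llama_cli_command(image_path, prompt, max_tokens=128,
                      model="LiquidAI/LFM2.5-VL-1.6B-GGUF:Q4_0"):
    """Return the llama-cli argument list for one image and one prompt."""
    return [
        "llama-cli",
        "-hf", model,
        "--image", str(image_path),
        "-p", prompt,
        "-n", str(max_tokens),
        "--temp", str(SAMPLING["temperature"]),
        "--min-p", str(SAMPLING["min_p"]),
        "--repeat-penalty", str(SAMPLING["repetition_penalty"]),
    ]


cmd = llama_cli_command("test_image.jpg", "What's in this image?")
print(shlex.join(cmd))  # shell-quoted form of the command, for copy-pasting
```

If `llama-cli` is on the PATH, the list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell-quoting problems with prompts that contain apostrophes.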
