LFM2-VL-3B

LFM2-VL-3B is Liquid AI’s highest-capacity multimodal model, delivering enhanced visual reasoning and detailed image understanding. It is ideal for complex vision tasks that require deeper comprehension.

Available formats: HF, GGUF, MLX, ONNX

Specifications

Property	Value
Parameters	3B
Context Length	32K tokens
Architecture	LFM2-VL (Dense)

Advanced Reasoning

Complex visual logic and analysis

Document Understanding

Detailed document and chart parsing

Multi-Image

Compare and reason across images
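As a rough illustration of the multi-image use case, the sketch below builds a single user turn that carries several images followed by one question. It reuses the message layout from the Quick Start below; the helper name `build_multi_image_message` and the placeholder image values are assumptions made for this example, not part of any library.

```python
def build_multi_image_message(images, question):
    """Build one user turn containing several images and a text question.

    `images` can be any objects the processor accepts for the "image" key
    (for example PIL images returned by `load_image`); plain strings are
    used here only as stand-ins.
    """
    # One {"type": "image"} entry per image, in the order they should be seen.
    content = [{"type": "image", "image": img} for img in images]
    # The text prompt goes last so it can refer to all images above it.
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Hypothetical placeholders; in practice these would be loaded images.
conversation = build_multi_image_message(
    ["first_image", "second_image"],
    "What differs between these two images?",
)
print(len(conversation[0]["content"]))
```

The resulting `conversation` list has the same shape as the single-image conversation in the Quick Start, so it should be usable with `processor.apply_chat_template` in the same way.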

Quick Start

Transformers

Install:

uv pip install "transformers>=5.0.0" pillow torch

Download & Run:

from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2-VL-3B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
# IMPORTANT: tie lm_head to input embeddings (transformers v5 bug)
model.lm_head.weight = model.get_input_embeddings().weight

processor = AutoProcessor.from_pretrained(model_id)

url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = load_image(url)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=256,
)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
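With decoder-only generation in Transformers, `generate` typically returns the prompt tokens followed by the newly generated ones, so the decoded response above is likely to include the prompt text as well. The sketch below shows one way to keep only the new tokens; `strip_prompt` is a hypothetical helper written for this example, and the token ids are toy values.

```python
def strip_prompt(output_ids, prompt_len):
    """Drop the first `prompt_len` token ids from every generated sequence."""
    return [row[prompt_len:] for row in output_ids]


# Toy example: a 3-token prompt followed by 2 generated tokens.
prompt_len = 3
output_ids = [[101, 7, 8, 42, 43]]

new_tokens = strip_prompt(output_ids, prompt_len)
print(new_tokens)  # [[42, 43]]
```

In the example above, the prompt length would be `inputs["input_ids"].shape[1]`, and slicing the tensor directly with `outputs[:, prompt_len:]` before calling `processor.batch_decode` should have the same effect.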

vLLM

vLLM support for LFM Vision Models requires a specific version. Install it from the custom source below.

Install:

VLLM_PRECOMPILED_WHEEL_COMMIT=72506c98349d6bcd32b4e33eec7b5513453c1502 \
  VLLM_USE_PRECOMPILED=1 \
  uv pip install git+https://github.com/vllm-project/vllm.git

uv pip install "transformers>=5.0.0" pillow

Run:

from vllm import LLM, SamplingParams

IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"

llm = LLM(
    model="LiquidAI/LFM2-VL-3B",
    max_model_len=1024,
)

sampling_params = SamplingParams(
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=256,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        {"type": "text", "text": "Describe what you see in this image."},
    ],
}]

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)

llama.cpp

llama.cpp enables efficient CPU inference for vision models.

Install:

brew install llama.cpp

Or download pre-built binaries from llama.cpp releases.

Run:

llama-cli \
    -hf LiquidAI/LFM2-VL-3B-GGUF:Q4_0 \
    --image test_image.jpg \
    -p "What's in this image?" \
    -n 128 \
    --temp 0.1 --min-p 0.15 --repeat-penalty 1.05

The -hf flag downloads the model directly from Hugging Face. Use --image-max-tokens to control the image token budget.

For server deployment and advanced usage, see the llama.cpp guide.
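To make the image token budget option concrete, here is a small sketch that assembles the llama-cli command shown above and optionally appends --image-max-tokens. The helper name `build_llama_cli_command` and the value 256 are assumptions chosen for illustration; the flags and sampling values are copied from the command above.

```python
import shlex


def build_llama_cli_command(image_path, prompt, image_max_tokens=None):
    """Return the llama-cli argument list for a single image prompt."""
    cmd = [
        "llama-cli",
        "-hf", "LiquidAI/LFM2-VL-3B-GGUF:Q4_0",
        "--image", image_path,
        "-p", prompt,
        "-n", "128",
        "--temp", "0.1",
        "--min-p", "0.15",
        "--repeat-penalty", "1.05",
    ]
    # Only add the budget flag when a value is given.
    if image_max_tokens is not None:
        cmd += ["--image-max-tokens", str(image_max_tokens)]
    return cmd


# 256 is an arbitrary illustrative budget, not a recommended setting.
cmd = build_llama_cli_command(
    "test_image.jpg", "What's in this image?", image_max_tokens=256
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd)` on a machine where llama-cli is installed; `shlex.join` is used here only to print a copy-pasteable shell line.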
