Use llama.cpp for CPU-only environments, local development, and edge or on-device deployment.
For GPU-accelerated inference at scale, consider using vLLM instead.
Installation

Pre-built Binaries
Download from the llama.cpp releases page. Files are named llama-<version>-bin-<os>-<feature>-<arch>.zip. Quick selection guide:
Windows (CPU) : llama-*-bin-win-avx2-x64.zip for Intel/AMD CPUs
Windows (NVIDIA GPU) : llama-*-bin-win-cu12-x64.zip (requires CUDA drivers)
macOS (Intel) : llama-*-bin-macos-x64.zip
macOS (Apple Silicon) : llama-*-bin-macos-arm64.zip
Linux : llama-*-bin-linux-x64.zip
After downloading, unzip and run from that directory.
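For example, to fetch and unpack a release from Python (the URL below is illustrative; substitute the asset matching your platform and the current release tag):

import io
import urllib.request
import zipfile

# Example asset following the llama-<version>-bin-<os>-<feature>-<arch>.zip
# pattern above; replace with the file for your platform and release.
url = "https://github.com/ggml-org/llama.cpp/releases/download/b7633/llama-b7633-bin-macos-arm64.zip"
with urllib.request.urlopen(url) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall("llama-bin")
# The llama-server and llama-cli executables are now in ./llama-bin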
Build from Source (macOS/Linux)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8
The compiled binaries will be in ./build/bin/. For detailed build instructions, including GPU support, see the llama.cpp documentation.
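To confirm the build succeeded, you can run one of the compiled tools and print its version, for example from Python (a minimal check; assumes you are still in the llama.cpp directory):

# Run the freshly built llama-cli and show its version/build info.
import subprocess

result = subprocess.run(["./build/bin/llama-cli", "--version"],
                        capture_output=True, text=True)
print(result.stdout or result.stderr)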
Downloading GGUF Models
llama.cpp uses the GGUF format, which stores quantized model weights for efficient inference. All LFM models are available in GGUF format on Hugging Face. See the Models page for all available GGUF models.
You can download LFM models in GGUF format from Hugging Face as follows:
uv pip install huggingface-hub
hf download LiquidAI/LFM2.5-1.2B-Instruct-GGUF lfm2.5-1.2b-instruct-q4_k_m.gguf --local-dir .
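The same download can be scripted with the huggingface_hub Python API (a sketch using the repo and filename above):

from huggingface_hub import hf_hub_download

# Downloads the quantized GGUF into the current directory and returns its path
path = hf_hub_download(
    repo_id="LiquidAI/LFM2.5-1.2B-Instruct-GGUF",
    filename="lfm2.5-1.2b-instruct-q4_k_m.gguf",
    local_dir=".",
)
print(path)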
Available quantization levels
Q4_0: 4-bit quantization, smallest size
Q4_K_M: 4-bit quantization, good balance of quality and size (recommended)
Q5_K_M: 5-bit quantization, better quality with moderate size increase
Q6_K: 6-bit quantization, excellent quality close to the original
Q8_0: 8-bit quantization, near-original quality
F16: 16-bit float, full precision
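As a rough guide to what these levels mean on disk, file size scales with bits per weight. A back-of-the-envelope estimate for a 1.2B-parameter model (the bits-per-weight figures below are approximate averages and ignore metadata overhead):

# Approximate GGUF file size: parameters * bits-per-weight / 8.
PARAMS = 1.2e9
bpw = {"Q4_0": 4.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}
for name, bits in bpw.items():
    print(f"{name}: ~{PARAMS * bits / 8 / 1e9:.1f} GB")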
Basic Usage
llama.cpp offers two main interfaces for running inference: llama-server (OpenAI-compatible server) and llama-cli (interactive CLI).
llama-server provides an OpenAI-compatible API for serving models locally.

Starting the Server:

llama-server -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF -c 4096 --port 8080

The -hf flag downloads the model directly from Hugging Face. Alternatively, use a local model file:

llama-server -m lfm2.5-1.2b-instruct-q4_k_m.gguf -c 4096 --port 8080
Key parameters:
-hf: Hugging Face model ID (downloads automatically)
-m: Path to local GGUF model file
-c: Context length (default: 4096)
--port: Server port (default: 8080)
-ngl 99: Offload layers to GPU (if available)
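After starting the server, you can verify it loaded the model by listing it through the OpenAI-compatible models endpoint (a quick sketch):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# llama-server exposes the loaded model under /v1/models
print([m.id for m in client.models.list().data])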
Using the Server: Once running at http://localhost:8080, use the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="lfm2.5-1.2b-instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
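The server also supports streaming responses through the same client; a minimal sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# stream=True yields chunks as tokens are generated instead of one final reply
stream = client.chat.completions.create(
    model="lfm2.5-1.2b-instruct",
    messages=[{"role": "user", "content": "Explain GGUF in one paragraph."}],
    temperature=0.1,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()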
Using curl:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "lfm2.5-1.2b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.1,
"top_k": 50,
"repetition_penalty": 1.05
}'
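Before sending requests from scripts, it can be useful to poll the server's health endpoint, which llama-server exposes at /health (a small sketch using requests):

import requests

# Returns 200 once the model has finished loading
r = requests.get("http://localhost:8080/health", timeout=5)
print(r.status_code, r.text)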
llama-cli provides an interactive terminal interface for chatting with models.

llama-cli -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF -c 4096 --color -i \
    --temp 0.1 --top-k 50 --repeat-penalty 1.05

The -hf flag downloads the model directly from Hugging Face. Alternatively, use a local model file:

llama-cli -m lfm2.5-1.2b-instruct-q4_k_m.gguf -c 4096 --color -i \
    --temp 0.1 --top-k 50 --repeat-penalty 1.05
Key parameters:
-hf: Hugging Face model ID (downloads automatically)
-m: Path to local GGUF model file
-c: Context length
--color: Colored output
-i: Interactive mode
-ngl 99: Offload layers to GPU (if available)
Press Ctrl+C to exit.
Generation Parameters
Control text generation behavior using parameters in the OpenAI-compatible API or command-line flags. Key parameters:
temperature (float): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
top_p (float): Nucleus sampling - limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
top_k (int): Limits to top-k most probable tokens. Typical range: 1-100
min_p (float): Filters tokens below min_p * max_probability. Typical range: 0.05-0.3
max_tokens / --n-predict (int): Maximum number of tokens to generate
repetition_penalty / --repeat-penalty (float): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
stop (str or list[str]): Strings that terminate generation when encountered
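To make the filtering parameters above concrete, here is a toy illustration of how top-k, top-p, and min-p each prune a probability distribution. This is a simplified sketch for intuition, not llama.cpp's actual sampler chain:

import numpy as np

def filter_probs(probs, top_k=50, top_p=0.95, min_p=0.05):
    order = np.argsort(probs)[::-1]          # token indices sorted by probability
    keep = set(order[:top_k].tolist())       # top-k: keep the k most probable tokens
    cum = np.cumsum(probs[order])            # top-p: smallest prefix covering top_p mass
    keep &= set(order[:int(np.searchsorted(cum, top_p)) + 1].tolist())
    keep &= set(np.flatnonzero(probs >= min_p * probs.max()).tolist())  # min-p cutoff
    mask = np.zeros_like(probs)
    idx = list(keep)
    mask[idx] = probs[idx]
    return mask / mask.sum()                 # renormalize the surviving tokens

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(filter_probs(probs, top_k=3, top_p=0.9, min_p=0.1))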
llama-server (OpenAI-compatible API) example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="lfm2.5-1.2b-instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    temperature=0.1,
    max_tokens=512,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
For command-line tools (llama-cli), use flags like --temp, --top-p, --top-k, --min-p, --repeat-penalty, and --n-predict.
Vision Models
LFM2-VL GGUF models can be used for multimodal inference with llama.cpp.
Quick Start with llama-cli
Download llama.cpp binaries and run vision inference directly:
wget https://github.com/ggml-org/llama.cpp/releases/download/b7633/llama-b7633-bin-ubuntu-x64.tar.gz
tar -xzf llama-b7633-bin-ubuntu-x64.tar.gz
Download a test image:
import requests

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
img_data = requests.get(image_url).content
with open("test_image.jpg", "wb") as f:
    f.write(img_data)
Run inference (works on CPU):
llama-b7633/llama-cli \
-hf LiquidAI/LFM2.5-VL-1.6B-GGUF:Q4_0 \
--image test_image.jpg \
--image-max-tokens 64 \
-p "What's in this image?" \
-n 128 \
--temp 0.1 --min-p 0.15 --repeat-penalty 1.05
The -hf flag downloads the model directly from Hugging Face. Use --image-max-tokens to control the image token budget.
Alternative: Manual Model Download
If you prefer to download models manually:
uv pip install huggingface-hub
hf download LiquidAI/LFM2-VL-1.6B-GGUF LFM2-VL-1.6B-Q8_0.gguf --local-dir .
hf download LiquidAI/LFM2-VL-1.6B-GGUF mmproj-LFM2-VL-1.6B-Q8_0.gguf --local-dir .
Run inference directly from the command line:

llama-mtmd-cli \
-m LFM2-VL-1.6B-Q8_0.gguf \
--mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf \
--image image.jpg \
-p "What is in this image?" \
-ngl 99
Start a vision model server with both the model and mmproj files:

llama-server \
-m LFM2-VL-1.6B-Q8_0.gguf \
--mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf \
-c 4096 \
--port 8080 \
-ngl 99
Use with the OpenAI Python client:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8080/v1",  # the hosted llama-server
    api_key="not-needed",
)

# Encode the image to base64
with open("image.jpg", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="lfm2.5-vl-1.6b",  # model name should match your server configuration
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
                {"type": "text", "text": "What's in this image?"},
            ],
        }
    ],
    temperature=0.1,
    max_tokens=256,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
Converting Custom Models
If you have a finetuned model or need to create a GGUF from a Hugging Face model:
# Clone llama.cpp if you haven't already
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert the model to GGUF at full or half precision
python convert_hf_to_gguf.py /path/to/your/model --outfile model-f16.gguf --outtype f16
# Quantize to a smaller format with llama-quantize (built alongside the other tools)
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

convert_hf_to_gguf.py's --outtype accepts full- and half-precision outputs (e.g., f32, f16, bf16, q8_0). K-quants such as q4_k_m, q5_k_m, and q6_k are produced in a second step with llama-quantize, as shown above.
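To sanity-check a converted file, you can inspect its metadata with the gguf Python package from the llama.cpp repository (a sketch; assumes pip install gguf and the output filename used above):

from gguf import GGUFReader

reader = GGUFReader("model-q4_k_m.gguf")
# Print a few metadata keys (architecture, context length, etc.)
for name in list(reader.fields)[:10]:
    print(name)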
Example Applications
For more comprehensive example applications using llama.cpp with LFM models, see the accompanying example repositories. The full list of llama.cpp language bindings can be found in the llama.cpp README.
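As one example of using a binding, here is a minimal sketch with llama-cpp-python (pip install llama-cpp-python; the model path assumes the GGUF downloaded earlier):

from llama_cpp import Llama

# Load the quantized model with a 4096-token context window
llm = Llama(model_path="lfm2.5-1.2b-instruct-q4_k_m.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is machine learning?"}],
    temperature=0.1,
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])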