Ollama is a command-line tool for running LLMs locally with a simple interface. It provides easy model management and serving with an OpenAI-compatible API.
Use Ollama for quick local model serving with a simple CLI or Docker-based deployment.
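Because Ollama exposes an OpenAI-compatible API (on port 11434 by default, under the /v1 path), any OpenAI-style client can talk to a locally served model. A minimal sketch using only the Python standard library; the model name "lfm2" is a placeholder for whatever `ollama list` reports on your machine:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint on the default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

# "lfm2" is a placeholder model name; substitute your local model tag.
payload = {
    "model": "lfm2",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.1,
}

def chat(url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `chat()` requires `ollama serve` (or the desktop app) to be running; the request body is the same shape an OpenAI client would send.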
The official Ollama v0.17.0 (the latest stable release from ollama.com) fails with a missing tensor 'output_norm.weight' error on the lfm2moe architecture. This affects all LFM MoE models (e.g., LFM2-24B-A2B, LFM2-8B-A1B). To run any LFM MoE model, you need v0.17.1-rc0 or later.
Replace {quantization} with your preferred quantization level (e.g., q4_k_m, q8_0). Then run the local model:
ollama run /path/to/model.gguf
Custom Setup with Modelfile
For custom configurations (specific quantization, chat template, or parameters), create a Modelfile: a plain text file named Modelfile (no extension) with the following content:
FROM /path/to/model.gguf

TEMPLATE """<|startoftext|><|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER temperature 0.1
PARAMETER top_k 50
PARAMETER repeat_penalty 1.05
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
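With the Modelfile in place, register the model with `ollama create` and start it with `ollama run`. A small Python sketch that wraps those two CLI calls; the tag `lfm2-custom` is an arbitrary local name I chose for illustration, not an official one:

```python
import shutil
import subprocess

# "lfm2-custom" is an arbitrary local tag (an assumption, not an official name).
MODEL_TAG = "lfm2-custom"

def create_cmd(tag: str, modelfile: str = "Modelfile") -> list:
    # `ollama create` registers a local model built from the Modelfile.
    return ["ollama", "create", tag, "-f", modelfile]

def run_cmd(tag: str) -> list:
    # `ollama run` opens an interactive session with the registered model.
    return ["ollama", "run", tag]

def register_and_run(tag: str = MODEL_TAG) -> None:
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama CLI not found on PATH")
    subprocess.run(create_cmd(tag), check=True)
    subprocess.run(run_cmd(tag), check=True)
```

Equivalently, from a shell: `ollama create lfm2-custom -f Modelfile` followed by `ollama run lfm2-custom`.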