Use Modal for serverless cloud deployments with instant autoscaling, GPU access, and production-ready inference serving.

Clone the repository

git clone https://github.com/Liquid4All/lfm-inference

Deployment

Launch commands:
cd modal

# deploy LFM2 8B MoE model
modal deploy deploy-vllm.py

# deploy another LFM2 model; MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm.py
See the full list of open-source LFM models on Hugging Face.
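
For orientation, the sketch below shows the general shape of a Modal + vLLM app like deploy-vllm.py: a container image with vLLM installed, a GPU-backed function that exposes vLLM's OpenAI-compatible web server, and the MODEL_NAME environment variable captured at deploy time. This is an illustrative sketch, not the repository's actual code; the app name, image, GPU type, and port are assumptions.

import os
import subprocess

import modal

# MODEL_NAME is read at deploy time; defaults to the LFM2 8B MoE model.
MODEL_NAME = os.environ.get("MODEL_NAME", "LiquidAI/LFM2-8B-A1B")

# Assumption: a plain Debian image with vLLM installed via pip.
# The chosen model name is baked into the image so the container sees it.
image = (
    modal.Image.debian_slim()
    .pip_install("vllm")
    .env({"MODEL_NAME": MODEL_NAME})
)

app = modal.App("lfm-vllm-sketch")  # hypothetical app name

@app.function(image=image, gpu="A100", timeout=600)  # GPU type is an assumption
@modal.web_server(port=8000, startup_timeout=600)
def serve():
    # Start vLLM's OpenAI-compatible server inside the container.
    model = os.environ["MODEL_NAME"]
    subprocess.Popen(
        ["vllm", "serve", model, "--host", "0.0.0.0", "--port", "8000"]
    )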

Production deployment

  • vLLM takes over 2 minutes to cold start. If you run the inference server in production, keep a minimum number of warm instances by setting min_containers = 1 and buffer_containers = 1; the buffer_containers setting is needed because all Modal GPUs are subject to preemption. See the Modal docs for details on cold start performance tuning. A sketch of where these settings go appears after this list.
  • Warm up the vLLM server after deployment by sending a single request; this warm-up step is already included in the deploy-vllm.py script.
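
Applied to the sketch above, the warm-pool settings from the first bullet sit on the function decorator. The excerpt below reuses the same assumed image, GPU type, and port; only the two warm-pool parameters are new:

@app.function(
    image=image,
    gpu="A100",            # assumption: GPU type
    min_containers=1,      # keep one warm instance; vLLM cold starts take over 2 min
    buffer_containers=1,   # spare container, since Modal GPUs can be preempted
)
@modal.web_server(port=8000, startup_timeout=600)
def serve():
    ...  # body as in the sketch above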

Test commands

Test the deployed server with the following curl commands (replace <modal-deployment-url> with your actual deployment URL):
# List deployed models
curl https://<modal-deployment-url>/v1/models

# Query the deployed LFM model
curl -X POST https://<modal-deployment-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2-8B-A1B",
    "messages": [
      {
        "role": "user",
        "content": "What is the melting temperature of silver?"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
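
The same chat completion request can be sent from Python, since vLLM exposes an OpenAI-compatible API. The base URL placeholder and the dummy API key below are assumptions to adapt to your deployment:

# pip install openai
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real key by default;
# "EMPTY" is a common placeholder. Replace the URL with your deployment URL.
client = OpenAI(
    base_url="https://<modal-deployment-url>/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="LiquidAI/LFM2-8B-A1B",
    messages=[
        {"role": "user", "content": "What is the melting temperature of silver?"}
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)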