Use Modal for serverless cloud deployments with instant autoscaling, GPU access, and production-ready inference serving.
Clone the repository
git clone https://github.com/Liquid4All/lfm-inference
Deployment
Launch command:
cd modal
# deploy LFM2 8B MoE model
modal deploy deploy-vllm.py
# deploy another LFM2 model; MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm.py
See the full list of open-source LFM models on Hugging Face.
Production deployment
- vLLM takes over 2 minutes to cold start, so for production inference it is recommended to keep warm instances by setting min_containers = 1 and buffer_containers = 1. The buffer_containers setting is necessary because all Modal GPUs are subject to preemption. See the Modal docs for details on cold start performance tuning.
- Warm up the vLLM server after deployment by sending a single request. This warm-up step is already included in the deploy-vllm.py script.
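The warm-instance settings above can be sketched as a Modal function configuration. This is an illustrative sketch only, not the contents of deploy-vllm.py: the app name, GPU type, and server launch command are assumptions, while `min_containers` and `buffer_containers` are the autoscaler parameters the bullet refers to.

```python
import modal

# Sketch only -- the real deploy-vllm.py in the repo may be structured differently.
app = modal.App("lfm-vllm")

@app.function(
    gpu="H100",            # assumed GPU type; choose one that fits your workload
    min_containers=1,      # keep one instance warm to avoid the ~2 min cold start
    buffer_containers=1,   # spare instance, since Modal GPUs can be preempted
)
@modal.web_server(8000)
def serve():
    # Launch vLLM's OpenAI-compatible server inside the container (sketch).
    import subprocess
    subprocess.Popen(
        ["vllm", "serve", "LiquidAI/LFM2-8B-A1B", "--port", "8000"]
    )
```

With both parameters set, one container stays resident to serve requests immediately, and Modal keeps a second one provisioning in the background so a preempted GPU does not drop you back to a cold start.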
Test commands
Test the deployed server with the following curl commands (replace <modal-deployment-url> with your actual deployment URL):
# List deployed models
curl https://<modal-deployment-url>/v1/models
# Query the deployed LFM model
curl -X POST https://<modal-deployment-url>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "LiquidAI/LFM2-8B-A1B",
"messages": [
{
"role": "user",
"content": "What is the melting temperature of silver?"
}
],
"max_tokens": 32,
"temperature": 0
}'
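The same request can be sent from Python. A minimal stdlib sketch, assuming the endpoint and payload of the curl command above; the `build_chat_request` helper is hypothetical, and the placeholder URL must be replaced with your actual deployment URL:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the deployed server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
        "temperature": 0,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "https://<modal-deployment-url>",  # replace with your deployment URL
    "LiquidAI/LFM2-8B-A1B",
    "What is the melting temperature of silver?",
)
# Uncomment to send the request against a live deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, any OpenAI-style client library pointed at `https://<modal-deployment-url>/v1` should also work.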