How to Deploy an LLM Inference Server: Complete Guide with vLLM, TGI & Ollama


Introduction: Why You Need an LLM Inference Server

Large Language Models (LLMs) have become essential for modern applications – from chatbots to code assistants, content generation to data analysis. But running these models efficiently at scale requires specialized infrastructure.

The challenge: the OpenAI API charges $0.03 per 1K input tokens for GPT-4, and more for output. For high-volume applications this quickly becomes expensive. Self-hosting inference servers can reduce costs by 10-50x while giving you complete control.

This comprehensive guide walks you through deploying production-ready LLM inference servers using three leading frameworks:

  • vLLM: Highest throughput, best for serving many users
  • Text Generation Inference (TGI): Hugging Face's optimized server
  • Ollama: Simplest setup, great for development

We'll deploy on Clore.ai for affordable GPU access, but these instructions work on any GPU cloud (Vast.ai, RunPod, AWS, etc.).


Prerequisites & Requirements

What You'll Need

Hardware:

  • GPU with minimum 24GB VRAM (RTX 3090, RTX 4090, A100, etc.)
  • 32GB+ system RAM recommended
  • 100GB+ SSD storage

Software:

  • Docker installed (we'll cover this)
  • SSH access to your GPU instance
  • Basic command-line knowledge

Models We'll Deploy:

  • LLaMA 2 7B (fits 24GB GPU)
  • Mistral 7B Instruct
  • CodeLlama 13B (requires 40GB+ VRAM)
  • Custom fine-tuned models

Expected Costs (on Clore.ai):

  • RTX 4090: $0.65/hr (~$468/month continuous)
  • RTX 3090: $0.35/hr (~$252/month continuous)
  • A100 40GB: $1.20/hr (~$864/month continuous)

Compare with the OpenAI API: 1B tokens/month at GPT-4 rates runs about $30,000/month. Self-hosting pays for itself quickly at scale.
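The arithmetic behind that comparison is worth making explicit; a minimal sketch using the prices quoted above (spot rates change, so treat these constants as illustrative):

```python
# Rough monthly cost comparison using the rates quoted in this guide
GPT4_PRICE_PER_1K = 0.03   # USD per 1K tokens (GPT-4, as cited above)
GPU_HOURLY = 0.65          # RTX 4090 on Clore.ai, USD/hr (illustrative)
HOURS_PER_MONTH = 720

def api_cost(tokens: int, price_per_1k: float = GPT4_PRICE_PER_1K) -> float:
    """Monthly API cost in USD for the given token volume."""
    return tokens / 1000 * price_per_1k

def self_host_cost(hourly: float = GPU_HOURLY,
                   hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost of renting one GPU continuously."""
    return hourly * hours

tokens = 1_000_000_000  # 1B tokens/month
print(f"API:       ${api_cost(tokens):,.0f}/month")
print(f"Self-host: ${self_host_cost():,.0f}/month")
```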


Part 1: Deploying with vLLM (Highest Throughput)

vLLM is a high-throughput inference engine optimized for serving LLMs. It uses PagedAttention to maximize GPU memory efficiency and supports continuous batching for optimal throughput.

Why vLLM?

  • 2-4x faster than naive implementations
  • Continuous batching: Automatically groups requests for efficiency
  • State-of-the-art optimizations: FlashAttention, PagedAttention
  • Production-ready: Widely adopted for production LLM serving
  • OpenAI-compatible API: Drop-in replacement for OpenAI SDK

Step 1: Rent a GPU on Clore.ai

# 1. Go to https://clore.ai and create an account
# 2. Browse marketplace for RTX 4090 or better
# 3. Filter: min 24GB VRAM, reasonable bandwidth
# 4. Click "Rent" and select duration
# 5. You'll receive SSH credentials

# SSH into your instance
ssh user@your-instance-ip

Step 2: Install Docker

# Update system
sudo apt update && sudo apt upgrade -y

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install NVIDIA Container Toolkit for GPU support
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

# Verify GPU access in Docker
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Step 3: Deploy vLLM with Docker

# Create directory for models
mkdir -p ~/models
cd ~/models

# Pull vLLM Docker image
docker pull vllm/vllm-openai:latest

# Run vLLM with LLaMA 2 7B
docker run -d \
  --name vllm-llama2 \
  --gpus all \
  -p 8000:8000 \
  -v ~/models:/models \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf \
  --dtype auto \
  --api-key your-secret-key \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95

# Check logs
docker logs -f vllm-llama2

What each parameter does:

  • --gpus all: Use all available GPUs
  • -p 8000:8000: Expose API on port 8000
  • -v ~/models:/models: Cache downloaded models
  • --dtype auto: Automatic precision (FP16/BF16)
  • --api-key: Protect your endpoint
  • --max-model-len 4096: Maximum context window
  • --gpu-memory-utilization 0.95: Use 95% of VRAM

Step 4: Test Your vLLM Server

# test_vllm.py
from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://your-instance-ip:8000/v1",
    api_key="your-secret-key",
)

# Make a request (OpenAI-compatible)
completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(completion.choices[0].message.content)

Run the test:

pip install openai
python test_vllm.py

Step 5: Advanced vLLM Configuration

Multi-GPU Support:

# For 2x RTX 4090 or 4x RTX 3090 instances
# (--tensor-parallel-size shards the model across that many GPUs)
docker run -d \
  --name vllm-multi \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-13b-chat-hf \
  --tensor-parallel-size 2 \
  --dtype auto \
  --api-key your-secret-key

Quantization for Memory Efficiency:

# AWQ 4-bit quantization (fits larger models in less VRAM)
docker run -d \
  --name vllm-awq \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --dtype auto \
  --api-key your-secret-key

# This fits LLaMA 13B in just 16GB VRAM!

Custom System Prompt:

# Create custom chat template
cat > ~/chat_template.jinja << 'EOF'
{% for message in messages %}
{% if message['role'] == 'system' %}System: {{ message['content'] }}
{% elif message['role'] == 'user' %}User: {{ message['content'] }}
{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}
{% endif %}
{% endfor %}
Assistant:
EOF

# Run with custom template
docker run -d \
  --name vllm-custom \
  --gpus all \
  -p 8000:8000 \
  -v ~/chat_template.jinja:/templates/chat.jinja \
  vllm/vllm-openai:latest \
  --model your-custom-model \
  --chat-template /templates/chat.jinja

Performance Benchmarks: vLLM

Tested on RTX 4090, LLaMA 2 7B:

  Concurrent Users | Throughput (tokens/s) | Latency (ms)
  -----------------|-----------------------|-------------
  1                | 85                    | 120
  10               | 780                   | 450
  50               | 2,100                 | 1,200
  100              | 2,450                 | 2,800

Impressive batching efficiency: throughput scales roughly 29x from 1 to 100 concurrent users, while per-request latency rises only about 23x despite 100x the load.


Part 2: Deploying with Text Generation Inference (TGI)

TGI is Hugging Face's production-ready inference server. It's tightly integrated with the Hugging Face ecosystem and offers excellent performance.

Why TGI?

  • Hugging Face native: Seamless integration with HF models
  • Flash Attention support: Faster inference with less memory
  • Token streaming: Real-time response generation
  • Safetensors support: Faster model loading
  • Production features: Metrics, distributed tracing

Step 1: Deploy TGI with Docker

# Pull TGI image
docker pull ghcr.io/huggingface/text-generation-inference:latest

# Run TGI with Mistral 7B
docker run -d \
  --name tgi-mistral \
  --gpus all \
  -p 8080:80 \
  -v ~/models:/data \
  --shm-size 1g \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --num-shard 1 \
  --max-total-tokens 8192 \
  --max-input-length 4096

# Monitor logs
docker logs -f tgi-mistral

Step 2: Test TGI Server

# Simple curl test
curl http://localhost:8080/generate \
  -X POST \
  -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":200}}' \
  -H 'Content-Type: application/json'

Python client:

# test_tgi.py
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://your-instance-ip:8080")

# Generate text
response = client.text_generation(
    prompt="Explain neural networks simply:",
    max_new_tokens=300,
    temperature=0.7,
    top_p=0.95,
    stream=False
)

print(response)

Streaming responses (for chatbot UX):

# streaming_test.py
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://your-instance-ip:8080")

for token in client.text_generation(
    prompt="Write a short story about AI:",
    max_new_tokens=500,
    stream=True  # Enable streaming
):
    print(token, end="", flush=True)

Step 3: Advanced TGI Configuration

Quantization (GPTQ):

# Deploy quantized model (fits 13B in 24GB VRAM)
docker run -d \
  --name tgi-quantized \
  --gpus all \
  -p 8080:80 \
  -v ~/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-2-13B-chat-GPTQ \
  --quantize gptq \
  --max-total-tokens 4096

Custom Models from Hugging Face:

# Deploy your fine-tuned model
docker run -d \
  --name tgi-custom \
  --gpus all \
  -p 8080:80 \
  -v ~/models:/data \
  -e HUGGING_FACE_HUB_TOKEN=your_hf_token \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id your-username/your-finetuned-model \
  --max-total-tokens 4096

TGI Performance Benchmarks

Tested on RTX 4090, Mistral 7B Instruct:

  Batch Size | Throughput (tokens/s) | Latency (ms)
  -----------|-----------------------|-------------
  1          | 92                    | 110
  8          | 650                   | 380
  32         | 1,850                 | 1,100

Excellent single-user latency, slightly lower max throughput than vLLM at high concurrency.


Part 3: Deploying with Ollama (Easiest Setup)

Ollama is the simplest way to run LLMs locally. Perfect for development, prototyping, and personal use.

Why Ollama?

  • Incredibly simple: One-command deployment
  • Model library: Pre-configured popular models
  • Automatic optimizations: Works out of the box
  • Cross-platform: Linux, macOS, Windows
  • REST API: Easy integration

Trade-off: Less fine-grained control than vLLM/TGI, slightly lower performance at scale.

Step 1: Install Ollama

# One-line install
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

Step 2: Pull and Run Models

# Download and run LLaMA 2 7B
ollama run llama2

# This starts an interactive chat session
>>> Hello! How are you?
I'm doing well, thank you for asking! ...

# Exit with /bye
>>> /bye

Available models:

# List models installed locally
ollama list

# Popular models
ollama pull llama2              # LLaMA 2 7B
ollama pull mistral             # Mistral 7B
ollama pull codellama           # Code LLaMA 7B
ollama pull llama2:13b          # LLaMA 2 13B (~8GB VRAM, 4-bit quantized)
ollama pull llama2:70b          # LLaMA 2 70B (~40GB VRAM, 4-bit quantized)

Step 3: Run Ollama as API Server

# Run the Ollama server in the background
# (the install script may already have started it as a systemd service)
ollama serve &

# Now you can make API requests
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Python client:

# test_ollama.py
import requests
import json

def generate(prompt, model="llama2"):
    response = requests.post(
        "http://your-instance-ip:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Test it
result = generate("Explain machine learning in one paragraph")
print(result)
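Ollama also exposes a chat endpoint, `/api/chat`, which accepts a message list and is more convenient for multi-turn conversations than `/api/generate`; a sketch against the same placeholder host:

```python
# chat_ollama.py -- multi-turn conversation via Ollama's /api/chat endpoint
import requests

def chat(messages, model="llama2", host="http://your-instance-ip:11434"):
    """Send the conversation so far; return the assistant's reply text."""
    response = requests.post(
        f"{host}/api/chat",
        json={"model": model, "messages": messages, "stream": False},
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# history = [{"role": "user", "content": "What is overfitting?"}]
# reply = chat(history)
# history += [{"role": "assistant", "content": reply},
#             {"role": "user", "content": "How do I prevent it?"}]
# print(chat(history))
```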

Streaming with Ollama:

# streaming_ollama.py
import requests
import json

def generate_stream(prompt, model="llama2"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True
    )
    
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk["response"], end="", flush=True)
            if chunk.get("done"):
                break

generate_stream("Write a haiku about AI")

Step 4: Create Custom Models with Ollama

# Create a Modelfile for custom behavior
cat > Modelfile << 'EOF'
FROM llama2

# Set temperature
PARAMETER temperature 0.8

# Set system prompt
SYSTEM You are a Python coding assistant. Always provide clean, well-commented code.

# Example conversation
MESSAGE user Write a function to calculate fibonacci
MESSAGE assistant Here's a clean implementation: ...
EOF

# Build custom model
ollama create python-assistant -f Modelfile

# Run it
ollama run python-assistant

Ollama Performance Benchmarks

Tested on RTX 4090, LLaMA 2 7B:

  Metric                      | Value
  ----------------------------|------------
  Cold start                  | 2.5 s
  Tokens/second (single user) | 78
  Memory usage                | 8.2 GB VRAM
  Latency (first token)       | 125 ms

Verdict: Great for development, acceptable for low-concurrency production. Use vLLM/TGI for high-traffic applications.


Part 4: Comparison & Choosing the Right Framework

Feature Comparison

  Feature               | vLLM              | TGI          | Ollama
  ----------------------|-------------------|--------------|------------
  Setup Difficulty      | Medium            | Medium       | Easy ✅
  Max Throughput        | Highest ✅        | High         | Medium
  Latency (single user) | Good              | Excellent ✅ | Good
  Memory Efficiency     | Excellent ✅      | Excellent    | Good
  API Style             | OpenAI-compatible | HuggingFace  | Simple REST
  Streaming             | Yes ✅            | Yes ✅       | Yes ✅
  Quantization          | AWQ, GPTQ         | GPTQ, EETQ   | Automatic
  Multi-GPU             | Yes ✅            | Yes ✅       | No
  Custom Models         | Yes ✅            | Yes ✅       | Limited
  Best For              | High-traffic APIs | HF ecosystem | Development

When to Use Each

Choose vLLM if:

  • Serving many concurrent users (10+)
  • Need OpenAI API compatibility
  • Want maximum throughput
  • Production web application

Choose TGI if:

  • Using Hugging Face models exclusively
  • Need seamless HF integration
  • Want built-in metrics/observability
  • Production with moderate concurrency

Choose Ollama if:

  • Prototyping and development
  • Personal projects
  • Simple deployment requirements
  • Low to moderate traffic

Part 5: Production Best Practices

Secure Your Inference Server

1. Use Authentication:

# vLLM with API key
--api-key your-very-secret-key-here

# TGI with reverse proxy (nginx)
apt install nginx

Nginx config (/etc/nginx/sites-available/llm-inference):

server {
    listen 443 ssl;
    server_name your-domain.com;
    
    ssl_certificate /etc/letsencrypt/live/your-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;
    
    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        
        # Basic auth
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}

2. Rate Limiting:

# rate_limiter.py (FastAPI middleware)
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("10/minute")  # 10 requests per minute per client IP
async def generate(request: Request, prompt: str):
    # Forward the prompt to your vLLM/TGI backend here
    pass

Monitor Performance

Prometheus metrics for vLLM:

# vLLM exposes metrics on :8000/metrics
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v ~/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

# Visualize with Grafana
docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana

Simple logging:

# log_requests.py
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_inference(prompt, response, latency_ms):
    logger.info({
        "timestamp": datetime.now().isoformat(),
        "prompt_length": len(prompt),
        "response_length": len(response),
        "latency_ms": latency_ms,
    })
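To give `log_inference` a real latency figure, time the backend call itself; a minimal sketch where `call_backend` stands in for whichever client function you use above:

```python
# timed_inference.py -- wrap any inference call with a latency timer
import time

def timed(call_backend, prompt: str):
    """Run call_backend(prompt); return (response, latency in milliseconds)."""
    start = time.perf_counter()
    response = call_backend(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    return response, latency_ms

# Example with a dummy backend standing in for a real client:
resp, ms = timed(lambda p: p.upper(), "hello")
print(resp, f"({ms:.3f} ms)")
```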

Auto-Scaling with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  vllm-1:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    ports:
      - "8001:8000"
    command: >
      --model meta-llama/Llama-2-7b-chat-hf
      --dtype auto

  vllm-2:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - "8002:8000"
    command: >
      --model meta-llama/Llama-2-7b-chat-hf
      --dtype auto

  load-balancer:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2

Load balancer config (nginx.conf):

upstream vllm_backends {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://vllm_backends;
        proxy_set_header Host $host;
    }
}
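With two replicas behind nginx, a small probe that bypasses the balancer helps you spot a dead backend before your users do. A sketch assuming recent vLLM images, which expose a `/health` endpoint, and the host ports mapped in the compose file above:

```python
# healthcheck.py -- probe each vLLM replica directly, bypassing the balancer
import requests

BACKENDS = ["http://localhost:8001", "http://localhost:8002"]

def check(url: str, timeout: float = 2.0) -> bool:
    """True if the backend answers its /health endpoint with HTTP 200."""
    try:
        return requests.get(f"{url}/health", timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for url in BACKENDS:
        print(f"{url}: {'up' if check(url) else 'DOWN'}")
```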

Part 6: Cost Analysis

Self-Hosting vs OpenAI API

Scenario: 10B tokens/month (high-volume application)

OpenAI API (GPT-3.5):

  • Input: $0.0015/1K tokens
  • Output: $0.002/1K tokens
  • Blended: $0.00175/1K tokens (assuming an even input/output split)
  • Monthly cost: $17,500

Self-Hosted on Clore.ai (LLaMA 2 7B):

  • RTX 4090: $0.65/hr
  • 720 hours/month (continuous)
  • Monthly cost: $468
  • Savings: $17,032/month (97%)

Break-even analysis:

  • Setup time: 2-4 hours
  • Learning curve: 8-16 hours
  • Total investment: ~20 hours
  • At $100/hr value: $2,000 investment
  • ROI: 1 month at moderate scale
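The break-even volume follows directly from the two rates; a sketch using the figures above (quoted prices, not live ones):

```python
# breakeven.py -- token volume where self-hosting beats the API
API_PRICE_PER_1K = 0.00175  # blended GPT-3.5 rate quoted above, USD
GPU_MONTHLY = 468.0         # RTX 4090 at $0.65/hr for 720 hr

def breakeven_tokens() -> float:
    """Monthly token volume at which API cost equals GPU rental."""
    return GPU_MONTHLY / API_PRICE_PER_1K * 1000

def monthly_savings(tokens: float) -> float:
    """Positive when self-hosting is cheaper at this volume."""
    return tokens / 1000 * API_PRICE_PER_1K - GPU_MONTHLY

print(f"Break-even: {breakeven_tokens() / 1e6:.0f}M tokens/month")
print(f"Savings at 10B tokens/month: ${monthly_savings(10e9):,.0f}")
```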

When Self-Hosting Makes Sense

Deploy your own server if:

  • Processing >20M tokens/month (the approximate break-even vs GPT-4 pricing)
  • Need data privacy/control
  • Want custom fine-tuned models
  • Have technical team to maintain

Stick with OpenAI API if:

  • Processing <10M tokens/month
  • No DevOps resources
  • Need GPT-4 level quality
  • Starting out / MVP stage

Troubleshooting Common Issues

Out of Memory (OOM)

Symptom: CUDA out of memory error

Solutions:

# 1. Reduce max-model-len
--max-model-len 2048  # Instead of 4096

# 2. Reduce GPU memory utilization
--gpu-memory-utilization 0.85  # Instead of 0.95

# 3. Use quantization
--quantization awq  # Or gptq

# 4. Use smaller model
# LLaMA 2 7B instead of 13B

Slow Inference

Symptom: Low tokens/second

Solutions:

# 1. Make sure FlashAttention is in use (bundled with recent vLLM images;
#    for bare-metal installs: pip install flash-attn)

# 2. Use better GPU (H100 > A100 > RTX 4090)
# Rent on Clore.ai: https://clore.ai

# 3. Optimize batch size
--max-num-batched-tokens 8192

# 4. Check CPU bottleneck
htop  # Should see GPU at 95%+, not CPU-bound

Model Download Fails

Symptom: Timeout downloading from HuggingFace

Solutions:

# 1. Pre-download model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf

# 2. Use HF token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token

# 3. Use local model path
docker run ... -v /path/to/model:/model --model /model

Conclusion: Deploying Your LLM Inference Server

You now have three proven methods to deploy production LLM inference servers:

  1. vLLM: Best for high-traffic applications (100+ concurrent users)
  2. TGI: Best for Hugging Face ecosystem integration
  3. Ollama: Best for rapid prototyping and development

Next Steps

  1. Rent a GPU on Clore.ai (RTX 4090 recommended)
  2. Choose your framework based on your needs
  3. Deploy and test with the Docker commands above
  4. Monitor and optimize for your specific workload
  5. Scale up as traffic grows

Cost savings of 90%+ compared to commercial APIs await. Start deploying your own LLM infrastructure today and join the open-source AI revolution.


Last updated: February 2025
