How to Deploy an LLM Inference Server: Complete Guide with vLLM, TGI & Ollama
Meta Description: Step-by-step guide to deploy LLM inference servers using vLLM, TGI, and Ollama on Clore.ai. Includes Docker templates, optimization tips, and benchmarks.
Introduction: Why You Need an LLM Inference Server
Large Language Models (LLMs) have become essential for modern applications – from chatbots to code assistants, content generation to data analysis. But running these models efficiently at scale requires specialized infrastructure.
The challenge: the OpenAI API charges $0.03 per 1K input tokens for GPT-4. For high-volume applications this adds up fast. Self-hosting an inference server can cut costs by 10-50x while giving you complete control.
This comprehensive guide walks you through deploying production-ready LLM inference servers using three leading frameworks:
- vLLM: Highest throughput, best for serving many users
- Text Generation Inference (TGI): Hugging Face's optimized server
- Ollama: Simplest setup, great for development
We'll deploy on Clore.ai for affordable GPU access, but these instructions work on any GPU cloud (Vast.ai, RunPod, AWS, etc.).
Prerequisites & Requirements
What You'll Need
Hardware:
- GPU with at least 24GB VRAM (RTX 3090, RTX 4090, A100, etc.)
- 32GB+ system RAM recommended
- 100GB+ SSD storage
Software:
- Docker installed (we'll cover this)
- SSH access to your GPU instance
- Basic command-line knowledge
Models We'll Deploy:
- LLaMA 2 7B (fits 24GB GPU)
- Mistral 7B Instruct
- CodeLlama 13B (requires 40GB+ VRAM)
- Custom fine-tuned models
Expected Costs (on Clore.ai):
- RTX 4090: $0.65/hr (~$468/month continuous)
- RTX 3090: $0.35/hr (~$252/month continuous)
- A100 40GB: $1.20/hr (~$864/month continuous)
For comparison, the same volume through the GPT-4 API (1B tokens at $0.03/1K) runs about $30,000/month. Self-hosting pays for itself quickly at scale.
Part 1: Deploying with vLLM (Recommended for Production)
vLLM is a high-throughput inference engine optimized for serving LLMs. It uses PagedAttention to maximize GPU memory efficiency and supports continuous batching for optimal throughput.
Why vLLM?
- 2-4x faster than naive implementations
- Continuous batching: Automatically groups requests for efficiency
- State-of-the-art optimizations: FlashAttention, PagedAttention
- Production-ready: Used in production by major companies such as Databricks
- OpenAI-compatible API: Drop-in replacement for OpenAI SDK
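To see why PagedAttention matters, here's a back-of-envelope KV-cache calculation. This is a sketch using approximate LLaMA 2 7B shape numbers (32 layers, 32 KV heads, head dimension 128) in FP16:

```python
# Back-of-envelope KV-cache sizing for LLaMA 2 7B in FP16.
# Shape numbers are approximate assumptions, not exact allocator behavior.

def kv_cache_bytes(seq_len: int, layers: int = 32, heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence: 2 tensors (K and V) per layer."""
    return 2 * layers * heads * head_dim * dtype_bytes * seq_len

# One full 4096-token sequence:
print(f"{kv_cache_bytes(4096) / 1e9:.2f} GB per full-length sequence")

# A request that has only generated 512 tokens holds 1/8 of that:
print(f"{kv_cache_bytes(512) / 1e9:.2f} GB at 512 tokens")
```

A naive server reserves the full context window per request up front; PagedAttention allocates fixed-size blocks on demand, so short generations hold only what they use. That reclaimed memory is where the headroom for continuous batching comes from.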
Step 1: Rent a GPU on Clore.ai
# 1. Go to https://clore.ai and create an account
# 2. Browse marketplace for RTX 4090 or better
# 3. Filter: min 24GB VRAM, reasonable bandwidth
# 4. Click "Rent" and select duration
# 5. You'll receive SSH credentials
# SSH into your instance
ssh user@your-instance-ip
Step 2: Install Docker
# Update system
sudo apt update && sudo apt upgrade -y
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Install NVIDIA Container Toolkit for GPU support
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Verify GPU access in Docker
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Step 3: Deploy vLLM with Docker
# Create directory for models
mkdir -p ~/models
cd ~/models
# Pull vLLM Docker image
docker pull vllm/vllm-openai:latest
# Run vLLM with LLaMA 2 7B
docker run -d \
--name vllm-llama2 \
--gpus all \
-p 8000:8000 \
-v ~/models:/models \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-2-7b-chat-hf \
--dtype auto \
--api-key your-secret-key \
--max-model-len 4096 \
--gpu-memory-utilization 0.95
# Check logs
docker logs -f vllm-llama2
What each parameter does:
- --gpus all: Use all available GPUs
- -p 8000:8000: Expose the API on port 8000
- -v ~/models:/models: Cache downloaded models
- --dtype auto: Automatic precision (FP16/BF16)
- --api-key: Protect your endpoint
- --max-model-len 4096: Maximum context window
- --gpu-memory-utilization 0.95: Use 95% of VRAM
Step 4: Test Your vLLM Server
# test_vllm.py
from openai import OpenAI
# Point to your vLLM server
client = OpenAI(
    base_url="http://your-instance-ip:8000/v1",
    api_key="your-secret-key",
)

# Make a request (OpenAI-compatible)
completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(completion.choices[0].message.content)
Run the test:
pip install openai
python test_vllm.py
Step 5: Advanced vLLM Configuration
Multi-GPU Support:
# For 2x RTX 4090 or 4x RTX 3090 instances
docker run -d \
--name vllm-multi \
--gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-2-13b-chat-hf \
--tensor-parallel-size 2 \
--dtype auto \
--api-key your-secret-key
Quantization for Memory Efficiency:
# AWQ 4-bit quantization (fits larger models in less VRAM)
docker run -d \
--name vllm-awq \
--gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model TheBloke/Llama-2-13B-chat-AWQ \
--quantization awq \
--dtype auto \
--api-key your-secret-key
# This fits LLaMA 13B in just 16GB VRAM!
Custom System Prompt:
# Create custom chat template
cat > ~/chat_template.jinja << 'EOF'
{% for message in messages %}
{% if message['role'] == 'system' %}System: {{ message['content'] }}
{% elif message['role'] == 'user' %}User: {{ message['content'] }}
{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}
{% endif %}
{% endfor %}
Assistant:
EOF
# Run with custom template
docker run -d \
--name vllm-custom \
--gpus all \
-p 8000:8000 \
-v ~/chat_template.jinja:/templates/chat.jinja \
vllm/vllm-openai:latest \
--model your-custom-model \
--chat-template /templates/chat.jinja
Performance Benchmarks: vLLM
Tested on RTX 4090, LLaMA 2 7B:
| Concurrent Users | Throughput (tokens/s) | Latency (ms) |
|---|---|---|
| 1 | 85 | 120 |
| 10 | 780 | 450 |
| 50 | 2,100 | 1,200 |
| 100 | 2,450 | 2,800 |
Impressive batching efficiency – scaling from 1 to 100 concurrent users multiplies throughput nearly 29x while latency grows about 23x, so per-user cost drops sharply under load.
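Reading the table another way, a quick script (numbers copied from the benchmark table above) computes the scaling multiples:

```python
# Per-user scaling from the vLLM benchmark table above:
# {concurrent_users: (throughput_tokens_per_s, latency_ms)}
bench = {1: (85, 120), 10: (780, 450), 50: (2100, 1200), 100: (2450, 2800)}

def scaling(users: int) -> tuple[float, float]:
    """(throughput multiple, latency multiple) relative to a single user."""
    tput, lat = bench[users]
    return tput / bench[1][0], lat / bench[1][1]

for n in (10, 50, 100):
    t_x, l_x = scaling(n)
    print(f"{n:>3} users: {t_x:.1f}x throughput for {l_x:.1f}x latency")
```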
Part 2: Deploying with Text Generation Inference (TGI)
TGI is Hugging Face's production-ready inference server. It's tightly integrated with the Hugging Face ecosystem and offers excellent performance.
Why TGI?
- Hugging Face native: Seamless integration with HF models
- Flash Attention support: Faster inference with less memory
- Token streaming: Real-time response generation
- Safetensors support: Faster model loading
- Production features: Metrics, distributed tracing
Step 1: Deploy TGI with Docker
# Pull TGI image
docker pull ghcr.io/huggingface/text-generation-inference:latest
# Run TGI with Mistral 7B
docker run -d \
--name tgi-mistral \
--gpus all \
-p 8080:80 \
-v ~/models:/data \
--shm-size 1g \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id mistralai/Mistral-7B-Instruct-v0.2 \
--num-shard 1 \
--max-total-tokens 8192 \
--max-input-length 4096
# Monitor logs
docker logs -f tgi-mistral
Step 2: Test TGI Server
# Simple curl test
curl http://localhost:8080/generate \
-X POST \
-d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":200}}' \
-H 'Content-Type: application/json'
Python client:
# test_tgi.py
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://your-instance-ip:8080")
# Generate text
response = client.text_generation(
    prompt="Explain neural networks simply:",
    max_new_tokens=300,
    temperature=0.7,
    top_p=0.95,
    stream=False
)
print(response)
Streaming responses (for chatbot UX):
# streaming_test.py
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://your-instance-ip:8080")
for token in client.text_generation(
    prompt="Write a short story about AI:",
    max_new_tokens=500,
    stream=True  # Enable streaming
):
    print(token, end="", flush=True)
Step 3: Advanced TGI Configuration
Quantization (GPTQ):
# Deploy quantized model (fits 13B in 24GB VRAM)
docker run -d \
--name tgi-quantized \
--gpus all \
-p 8080:80 \
-v ~/models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id TheBloke/Llama-2-13B-chat-GPTQ \
--quantize gptq \
--max-total-tokens 4096
Custom Models from Hugging Face:
# Deploy your fine-tuned model
docker run -d \
--name tgi-custom \
--gpus all \
-p 8080:80 \
-v ~/models:/data \
-e HUGGING_FACE_HUB_TOKEN=your_hf_token \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id your-username/your-finetuned-model \
--max-total-tokens 4096
TGI Performance Benchmarks
Tested on RTX 4090, Mistral 7B Instruct:
| Batch Size | Throughput (tokens/s) | Latency (ms) |
|---|---|---|
| 1 | 92 | 110 |
| 8 | 650 | 380 |
| 32 | 1,850 | 1,100 |
Excellent single-user latency, slightly lower max throughput than vLLM at high concurrency.
Part 3: Deploying with Ollama (Easiest Setup)
Ollama is the simplest way to run LLMs locally. Perfect for development, prototyping, and personal use.
Why Ollama?
- Incredibly simple: One-command deployment
- Model library: Pre-configured popular models
- Automatic optimizations: Works out of the box
- Cross-platform: Linux, macOS, Windows
- REST API: Easy integration
Trade-off: Less fine-grained control than vLLM/TGI, slightly lower performance at scale.
Step 1: Install Ollama
# One-line install
curl -fsSL https://ollama.ai/install.sh | sh
# Verify installation
ollama --version
Step 2: Pull and Run Models
# Download and run LLaMA 2 7B
ollama run llama2
# This starts an interactive chat session
>>> Hello! How are you?
I'm doing well, thank you for asking! ...
# Exit with /bye
>>> /bye
Available models:
# List locally downloaded models
ollama list
# Popular models
ollama pull llama2 # LLaMA 2 7B
ollama pull mistral # Mistral 7B
ollama pull codellama # Code LLaMA 7B
ollama pull llama2:13b # LLaMA 2 13B (4-bit build, ~10GB VRAM)
ollama pull llama2:70b # LLaMA 2 70B (4-bit build, ~40GB VRAM)
Step 3: Run Ollama as API Server
# Start the Ollama server (the Linux installer usually registers it as a systemd service already)
ollama serve &
# Now you can make API requests
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false
}'
Python client:
# test_ollama.py
import requests
import json
def generate(prompt, model="llama2"):
    response = requests.post(
        "http://your-instance-ip:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]
# Test it
result = generate("Explain machine learning in one paragraph")
print(result)
Streaming with Ollama:
# streaming_ollama.py
import requests
import json
def generate_stream(prompt, model="llama2"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True
    )
    for line in response.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk["response"], end="", flush=True)
            if chunk.get("done"):
                break
generate_stream("Write a haiku about AI")
Step 4: Create Custom Models with Ollama
# Create a Modelfile for custom behavior
cat > Modelfile << 'EOF'
FROM llama2
# Set temperature
PARAMETER temperature 0.8
# Set system prompt
SYSTEM You are a Python coding assistant. Always provide clean, well-commented code.
# Example conversation
MESSAGE user Write a function to calculate fibonacci
MESSAGE assistant Here's a clean implementation: ...
EOF
# Build custom model
ollama create python-assistant -f Modelfile
# Run it
ollama run python-assistant
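The custom model is served over the same REST API as any other. Here's a minimal client using only the standard library (no `requests` dependency); the host and model name match the setup above:

```python
# Query the custom model through Ollama's REST API (stdlib only).
import json
import urllib.request

def build_request(prompt: str, model: str = "python-assistant",
                  host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming /api/generate request for the given model."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_request("Write a function to reverse a string")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```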
Ollama Performance Benchmarks
Tested on RTX 4090, LLaMA 2 7B:
| Metric | Value |
|---|---|
| Cold start | 2.5s |
| Tokens/second (single user) | 78 |
| Memory usage | 8.2GB VRAM |
| Latency (first token) | 125ms |
Verdict: Great for development, acceptable for low-concurrency production. Use vLLM/TGI for high-traffic applications.
Part 4: Comparison & Choosing the Right Framework
Feature Comparison
| Feature | vLLM | TGI | Ollama |
|---|---|---|---|
| Setup Difficulty | Medium | Medium | Easy ✅ |
| Max Throughput | Highest ✅ | High | Medium |
| Latency (single user) | Good | Excellent ✅ | Good |
| Memory Efficiency | Excellent ✅ | Excellent | Good |
| API Style | OpenAI-compatible | HuggingFace | Simple REST |
| Streaming | Yes ✅ | Yes ✅ | Yes ✅ |
| Quantization | AWQ, GPTQ | GPTQ, EETQ | Automatic |
| Multi-GPU | Yes ✅ | Yes ✅ | No |
| Custom Models | Yes ✅ | Yes ✅ | Limited |
| Best For | High-traffic APIs | HF ecosystem | Development |
When to Use Each
Choose vLLM if:
- Serving many concurrent users (10+)
- Need OpenAI API compatibility
- Want maximum throughput
- Production web application
Choose TGI if:
- Using Hugging Face models exclusively
- Need seamless HF integration
- Want built-in metrics/observability
- Production with moderate concurrency
Choose Ollama if:
- Prototyping and development
- Personal projects
- Simple deployment requirements
- Low to moderate traffic
Part 5: Production Best Practices
Secure Your Inference Server
1. Use Authentication:
# vLLM with API key
--api-key your-very-secret-key-here
# TGI with reverse proxy (nginx)
apt install nginx
Nginx config (/etc/nginx/sites-available/llm-inference):
server {
    listen 443 ssl;
    server_name your-domain.com;

    ssl_certificate /etc/letsencrypt/live/your-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Basic auth
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}
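Here's a quick stdlib client for the proxied endpoint. One caveat worth knowing: HTTP basic auth and a bearer API key both travel in the `Authorization` header, so enable one or the other at the proxy, not both. The domain and key below are placeholders:

```python
# Minimal stdlib client for the HTTPS-proxied vLLM endpoint.
# Note: basic auth and a bearer key both use the Authorization header;
# pick one mechanism at the proxy. Domain and key are placeholders.
import json
import urllib.request

def chat_request(prompt: str, api_key: str,
                 base: str = "https://your-domain.com") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request with a bearer key."""
    payload = {
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

if __name__ == "__main__":
    req = chat_request("ping", "your-secret-key")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```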
2. Rate Limiting:
# rate_limiter.py (FastAPI middleware)
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/generate")
@limiter.limit("10/minute")  # 10 requests per minute per client IP
async def generate(request: Request, prompt: str):
    # Forward the prompt to your vLLM/TGI backend here
    pass
Monitor Performance
Prometheus metrics for vLLM:
# vLLM exposes metrics on :8000/metrics
docker run -d \
--name prometheus \
-p 9090:9090 \
-v ~/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Visualize with Grafana
docker run -d \
--name grafana \
-p 3000:3000 \
grafana/grafana
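You can also sanity-check the metrics endpoint without a full Prometheus stack. Below is a naive parser for the Prometheus text format; the `vllm:` metric prefix matches what recent vLLM builds expose, and the simple whitespace split will mishandle label values that contain spaces:

```python
# Pull a few numbers straight from vLLM's Prometheus endpoint.
import urllib.request

def parse_metrics(text: str, prefix: str = "vllm:") -> dict[str, float]:
    """Naive parse of Prometheus text format into {metric: value}."""
    out = {}
    for line in text.splitlines():
        if not line.startswith(prefix):
            continue  # also skips "# HELP" / "# TYPE" comment lines
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # ignore lines whose last field is not a number
    return out

if __name__ == "__main__":
    with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
        metrics = parse_metrics(resp.read().decode())
    for name, value in sorted(metrics.items()):
        print(f"{name} = {value}")
```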
Simple logging:
# log_requests.py
import logging
from datetime import datetime
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def log_inference(prompt, response, latency_ms):
    logger.info({
        "timestamp": datetime.now().isoformat(),
        "prompt_length": len(prompt),
        "response_length": len(response),
        "latency_ms": latency_ms,
    })
Auto-Scaling with Docker Compose
# docker-compose.yml
version: '3.8'

services:
  vllm-1:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    ports:
      - "8001:8000"
    command: >
      --model meta-llama/Llama-2-7b-chat-hf
      --dtype auto

  vllm-2:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    ports:
      - "8002:8000"
    command: >
      --model meta-llama/Llama-2-7b-chat-hf
      --dtype auto

  load-balancer:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2
Load balancer config (nginx.conf):
upstream vllm_backends {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_set_header Host $host;
    }
}
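Before pointing traffic at the balancer, it helps to probe each backend directly. vLLM's OpenAI server exposes a `/health` endpoint that returns 200 once the model is loaded; the ports below match the compose file above:

```python
# Probe each vLLM backend behind the load balancer.
import urllib.error
import urllib.request

BACKENDS = ["http://localhost:8001", "http://localhost:8002"]

def healthy(base_url: str, timeout: float = 3.0) -> bool:
    """True if the backend's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as r:
            return r.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, DNS failure, ...

if __name__ == "__main__":
    for url in BACKENDS:
        print(url, "up" if healthy(url) else "DOWN")
```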
Part 6: Cost Analysis
Self-Hosting vs OpenAI API
Scenario: 10B tokens/month (a high-volume application)
OpenAI API (GPT-3.5):
- Input: $0.0015/1K tokens
- Output: $0.002/1K tokens
- Average: $0.00175/1K tokens
- Monthly cost: $17,500
Self-Hosted on Clore.ai (LLaMA 2 7B):
- RTX 4090: $0.65/hr
- 720 hours/month (continuous)
- Monthly cost: $468
- Savings: $17,032/month (97%)
Break-even analysis:
- Setup time: 2-4 hours
- Learning curve: 8-16 hours
- Total investment: ~20 hours
- At $100/hr value: $2,000 investment
- ROI: 1 month at moderate scale
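The numbers above reduce to a small calculator you can adapt to your own pricing (the rates are the ones used in this section):

```python
# Break-even calculator: hourly GPU rental vs per-token API pricing.
# Rates below are the figures used in this section.

def api_cost(tokens: int, usd_per_1k: float = 0.00175) -> float:
    """Monthly API cost at a blended per-1K-token rate."""
    return tokens / 1000 * usd_per_1k

def selfhost_cost(hours: float = 720, usd_per_hour: float = 0.65) -> float:
    """Monthly cost of a continuously rented GPU."""
    return hours * usd_per_hour

def breakeven_tokens(usd_per_1k: float = 0.00175,
                     monthly_gpu: float = 468.0) -> float:
    """Monthly token volume where self-hosting starts winning."""
    return monthly_gpu / usd_per_1k * 1000

print(f"API, 10B tokens: ${api_cost(10_000_000_000):,.0f}/month")
print(f"RTX 4090 rental: ${selfhost_cost():,.0f}/month")
print(f"Break-even:      {breakeven_tokens() / 1e6:.0f}M tokens/month")
```

Against GPT-3.5-class pricing the break-even sits in the hundreds of millions of tokens per month; against GPT-4-class pricing ($0.03/1K and up) it drops to the tens of millions, which is why the savings at 10B tokens are so dramatic.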
When Self-Hosting Makes Sense
✅ Deploy your own server if:
- Processing tens of millions of tokens/month or more
- Need data privacy/control
- Want custom fine-tuned models
- Have technical team to maintain
❌ Stick with OpenAI API if:
- Processing under a few million tokens/month
- No DevOps resources
- Need GPT-4 level quality
- Starting out / MVP stage
Troubleshooting Common Issues
Out of Memory (OOM)
Symptom: CUDA out of memory error
Solutions:
# 1. Reduce max-model-len
--max-model-len 2048 # Instead of 4096
# 2. Reduce GPU memory utilization
--gpu-memory-utilization 0.85 # Instead of 0.95
# 3. Use quantization
--quantization awq # Or gptq
# 4. Use smaller model
# LLaMA 2 7B instead of 13B
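As a rule of thumb for choosing between these fixes: model weights alone need roughly params × bytes-per-weight, and the real footprint adds KV cache plus runtime overhead, so leave 20-30% headroom. A quick sketch:

```python
# Rule-of-thumb weight footprints behind the OOM fixes above.
# Real usage adds KV cache and runtime overhead on top of this.

def weights_gb(params_b: float, bits: int) -> float:
    """Weight memory in GB: 1e9*params_b params x bits/8 bytes each."""
    return params_b * bits / 8

for model, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    fp16, q4 = weights_gb(params, 16), weights_gb(params, 4)
    print(f"{model}: {fp16:.0f} GB fp16, {q4:.1f} GB 4-bit")
```

So a 13B model at FP16 (26 GB of weights) overflows a 24GB card before the KV cache is even counted, while the same model AWQ-quantized to 4 bits (6.5 GB) leaves plenty of room.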
Slow Inference
Symptom: Low tokens/second
Solutions:
# 1. Ensure flash attention is available (bundled in the vLLM image; for bare-metal installs)
pip install flash-attn
# 2. Use better GPU (H100 > A100 > RTX 4090)
# Rent on Clore.ai: https://clore.ai
# 3. Optimize batch size
--max-num-batched-tokens 8192
# 4. Check for a CPU bottleneck
htop        # CPU should not be pegged while the GPU idles
nvidia-smi  # GPU utilization should sit at 95%+
Model Download Fails
Symptom: Timeout downloading from HuggingFace
Solutions:
# 1. Pre-download model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf
# 2. Use HF token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token
# 3. Use local model path
docker run ... -v /path/to/model:/model --model /model
Conclusion: Deploying Your LLM Inference Server
You now have three proven methods to deploy production LLM inference servers:
- vLLM: Best for high-traffic applications (100+ concurrent users)
- TGI: Best for Hugging Face ecosystem integration
- Ollama: Best for rapid prototyping and development
Next Steps
- Rent a GPU on Clore.ai (RTX 4090 recommended)
- Choose your framework based on your needs
- Deploy and test with the Docker commands above
- Monitor and optimize for your specific workload
- Scale up as traffic grows
Resources
- vLLM Docs: https://docs.vllm.ai
- TGI Docs: https://huggingface.co/docs/text-generation-inference
- Ollama Docs: https://ollama.ai/docs
- Clore.ai Marketplace: https://clore.ai
Cost savings of 90%+ compared to commercial APIs await. Start deploying your own LLM infrastructure today and join the open-source AI revolution.
Last updated: February 2025