How to Fine-Tune LLaMA 3 on a Cloud GPU

Large language models (LLMs) have transformed what's possible with AI — from chatbots and code assistants to content generation and data analysis. Meta's LLaMA 3 family of models, with weights freely available under Meta's community license, has become one of the most popular foundations for building custom AI applications. But while LLaMA 3 is free to download, fine-tuning it to your specific use case requires serious GPU horsepower.

Fine-tuning adapts a pre-trained model to perform better on your specific data. Want a customer support bot that knows your product? Fine-tune LLaMA 3 on your support transcripts. Need a coding assistant for a specific framework? Fine-tune on your codebase documentation. The possibilities are endless — but the GPU requirements are real.

In this comprehensive tutorial, you'll learn how to fine-tune LLaMA 3 (8B and 70B parameter versions) on a cloud GPU rented from Clore.ai. We'll cover everything from choosing the right GPU and setting up the environment to running the training loop and saving your fine-tuned model. No expensive hardware needed — just a browser and a few dollars of compute budget.

Prerequisites

Before we start, here's what you'll need:

  • A Clore.ai account with funds ($5–$20 depending on model size and training duration)
  • A Hugging Face account to download LLaMA 3 weights (requires accepting Meta's license agreement)
  • Basic Python knowledge — we'll provide all the code, but understanding what it does helps
  • Your training dataset — a JSONL file with prompt-completion pairs (we'll cover format below)

Choosing the Right GPU for LLaMA 3 Fine-Tuning

GPU selection depends on the model size and fine-tuning method:

LLaMA 3 8B (LoRA/QLoRA Fine-Tuning)

  • Minimum: RTX 3090 (24 GB VRAM) — $0.06–$0.12/hr on Clore.ai
  • Recommended: RTX 4090 (24 GB VRAM) — $0.10–$0.25/hr on Clore.ai
  • Best: RTX 5090 (32 GB VRAM) — $0.30–$0.50/hr on Clore.ai

With QLoRA (4-bit quantization + LoRA adapters), the 8B model fits comfortably in 24 GB of VRAM with room for reasonable batch sizes.
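A quick back-of-envelope check (a rough sketch that counts only the weights, ignoring LoRA parameters, optimizer states, and activations) shows why 4-bit quantization is what makes 24 GB cards viable:

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

# LLaMA 3 8B weights at different precisions
for bits in (16, 8, 4):
    print(f"8B parameters @ {bits}-bit: ~{weight_gb(8e9, bits):.0f} GB")
```

The weights drop from ~16 GB at bf16 to ~4 GB at 4-bit; the rest of the card's VRAM then goes to activations, LoRA gradients, and optimizer states, which is why real usage during training lands around 18–22 GB rather than 4 GB.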

LLaMA 3 70B (QLoRA Fine-Tuning)

  • Minimum: A100 40GB — $0.40–$0.70/hr on Clore.ai
  • Recommended: A100 80GB — $0.70–$1.00/hr on Clore.ai
  • Best: H100 80GB — $1.00–$2.00/hr on Clore.ai

The 70B model is significantly larger and requires at least 40 GB of VRAM with 4-bit quantization. For full fine-tuning (not LoRA), you'll need multi-GPU setups.

LLaMA 3 8B (Full Fine-Tuning)

  • Minimum: 2x RTX 4090 or 1x A100 80GB
  • Recommended: 2x A100 80GB or 4x RTX 4090

Full fine-tuning updates all model parameters and requires significantly more VRAM than LoRA methods.

For a complete pricing guide, see our Top 10 Cheapest GPUs for AI Training.

Step 1: Rent a GPU on Clore.ai

1.1: Find a Suitable Server

Log into Clore.ai and navigate to the marketplace. For this tutorial, we'll fine-tune LLaMA 3 8B with QLoRA on an RTX 4090.

Filter the marketplace:

  1. GPU: RTX 4090
  2. Sort by: Price (low to high)
  3. Storage: At least 100 GB (model weights + datasets + checkpoints)
  4. RAM: 32 GB or more
  5. Internet: 500+ Mbps for faster model downloads

1.2: Configure Your Rental

Select a server and configure:

  • Docker image: Choose a PyTorch image (e.g., pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel, which matches the CUDA 12.4 wheels installed below) or a base Ubuntu image with NVIDIA drivers
  • SSH access: Enable (you'll need terminal access for most of this tutorial)
  • Ports: Expose port 8888 if you want to use Jupyter Notebook, or port 7860 for Gradio-based testing later
  • Rental type: On-demand for uninterrupted training, or GigaSPOT if you're comfortable with checkpointing

1.3: Start the Rental

Click "Rent" and wait for your server to come online (usually 1–3 minutes). Note the SSH connection details from your dashboard.

Step 2: Set Up the Environment

SSH into your rented server:

ssh root@your-server-address -p your-port

2.1: Update System and Install Dependencies

# Update package lists
apt update && apt upgrade -y

# Install essential tools
apt install -y git wget curl nano htop

# Verify GPU is detected
nvidia-smi

You should see your RTX 4090 listed with driver version and CUDA information.

2.2: Install Python Packages

# Upgrade pip
pip install --upgrade pip

# Install core ML libraries
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install fine-tuning stack
pip install transformers==4.45.0
pip install datasets==3.1.0
pip install accelerate==1.0.0
pip install peft==0.13.0
pip install bitsandbytes==0.44.0
pip install trl==0.12.0
pip install scipy
pip install sentencepiece
pip install protobuf

# Install FlashAttention 2 (required by the training script's
# attn_implementation="flash_attention_2"; prebuilt wheels exist for common
# CUDA/PyTorch combos, but source builds can take a while)
pip install flash-attn --no-build-isolation

# Install wandb for experiment tracking (optional but recommended)
pip install wandb
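After installation, a quick sanity check confirms PyTorch can see the GPU and that it supports bfloat16, which the training script relies on (`cuda_report` is just a throwaway helper name for this sketch):

```python
def cuda_report() -> str:
    """Summarize whether PyTorch can see a CUDA GPU with bf16 support."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA not available"
    name = torch.cuda.get_device_name(0)
    bf16 = torch.cuda.is_bf16_supported()
    return f"torch {torch.__version__}: {name}, bf16 supported: {bf16}"

print(cuda_report())
```

On a correctly configured RTX 4090 server you should see the device name and bf16 support reported as True; if not, revisit the driver and PyTorch installation before training.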

2.3: Authenticate with Hugging Face

# Install Hugging Face CLI
pip install huggingface_hub

# Login (you'll need your HF token)
huggingface-cli login

Enter your Hugging Face access token when prompted. Make sure you've accepted Meta's license agreement for LLaMA 3 on the Hugging Face model page.

Step 3: Prepare Your Training Data

3.1: Data Format

Fine-tuning data should be in a structured format. The most common is a JSONL file where each line is a conversation or instruction-response pair:

{"instruction": "What is your return policy?", "input": "", "output": "Our return policy allows returns within 30 days of purchase. Items must be in original condition with tags attached. Contact support@example.com to initiate a return."}
{"instruction": "How do I track my order?", "input": "Order #12345", "output": "You can track your order by visiting our tracking page at example.com/track and entering your order number. You'll see real-time updates on your shipment status."}
{"instruction": "Do you offer international shipping?", "input": "", "output": "Yes, we ship to over 50 countries worldwide. International shipping rates are calculated at checkout based on your destination and package weight. Delivery typically takes 7-14 business days."}
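Before uploading, it's worth validating the file — malformed lines otherwise surface as cryptic errors mid-training. A minimal checker might look like this (`validate_jsonl_lines` is a hypothetical helper, not part of any library):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}  # "input" may be empty, so only these are required

def validate_jsonl_lines(lines):
    """Return (ok_count, errors) for an iterable of JSONL lines."""
    ok, errors = 0, []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            errors.append(f"line {i}: invalid JSON ({e})")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            errors.append(f"line {i}: missing keys {sorted(missing)}")
        else:
            ok += 1
    return ok, errors

# Check your uploaded file with:
#   ok, errors = validate_jsonl_lines(open("/root/dataset.jsonl"))
sample = [
    '{"instruction": "Hi", "input": "", "output": "Hello!"}',
    '{"instruction": "Broken"',
]
ok, errors = validate_jsonl_lines(sample)
print(ok, len(errors))  # → 1 1
```

If the error list is non-empty, fix those lines before training — the SFT pipeline will either crash on them or silently learn from garbage.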

3.2: Upload Your Dataset

Upload your dataset to the server:

# Option 1: SCP from your local machine
scp -P your-port /path/to/your/dataset.jsonl root@your-server-address:/root/dataset.jsonl

# Option 2: Download from a URL
wget https://your-storage.com/dataset.jsonl -O /root/dataset.jsonl

# Option 3: Use a Hugging Face dataset
# (We'll handle this in the training script)

3.3: Data Quality Tips

The quality of your fine-tuning data directly impacts model performance:

  • Minimum 100 examples for noticeable effect; 1,000–10,000 for robust results
  • Consistent formatting — use the same instruction style throughout
  • Diverse examples — cover the range of inputs your model will encounter
  • Clean data — remove duplicates, fix typos, ensure accuracy
  • Balanced distribution — don't over-represent any single topic
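Exact-duplicate removal is easy to script. Here's a sketch (`dedupe` is a hypothetical helper) that keeps the first occurrence of each (instruction, input, output) triple:

```python
def dedupe(records):
    """Drop exact duplicate (instruction, input, output) triples, keeping first occurrence."""
    seen, unique = set(), []
    for r in records:
        key = (r.get("instruction", ""), r.get("input", ""), r.get("output", ""))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

records = [
    {"instruction": "A", "input": "", "output": "x"},
    {"instruction": "A", "input": "", "output": "x"},  # exact duplicate, dropped
    {"instruction": "B", "input": "", "output": "y"},
]
print(len(dedupe(records)))  # → 2
```

Near-duplicates (same question reworded) need fuzzier matching, but even exact deduplication catches the most common data-preparation mistake.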

Step 4: Fine-Tune LLaMA 3 8B with QLoRA

Now for the main event. We'll use QLoRA, which combines 4-bit quantization with Low-Rank Adaptation to dramatically reduce VRAM requirements while maintaining training quality.

4.1: Create the Training Script

nano /root/train_llama3.py

Paste the following script:

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer

# ============================================
# Configuration
# ============================================
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
DATASET_PATH = "/root/dataset.jsonl"
OUTPUT_DIR = "/root/llama3-finetuned"
MAX_SEQ_LENGTH = 2048
NUM_EPOCHS = 3
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4
LEARNING_RATE = 2e-4
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

# ============================================
# Load Tokenizer
# ============================================
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# ============================================
# Quantization Config (4-bit for QLoRA)
# ============================================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# ============================================
# Load Model
# ============================================
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # Requires `pip install flash-attn`; delete this line to fall back to the
    # default attention implementation if flash-attn isn't installed.
    attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)

# ============================================
# LoRA Configuration
# ============================================
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ============================================
# Load and Format Dataset
# ============================================
print("Loading dataset...")
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")

def format_instruction(sample):
    """Format dataset samples into LLaMA 3 chat template."""
    if sample.get("input", ""):
        text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{sample['instruction']}

Context: {sample['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{sample['output']}<|eot_id|>"""
    else:
        text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{sample['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{sample['output']}<|eot_id|>"""
    return {"text": text}

dataset = dataset.map(format_instruction)

# ============================================
# Training Arguments
# ============================================
# SFTConfig extends TrainingArguments with the SFT-specific options
# (max_seq_length, dataset_text_field, packing).
training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    report_to="wandb",  # Change to "none" to disable
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=True,  # packing fills each sequence, so length grouping isn't needed
)

# ============================================
# Initialize Trainer
# ============================================
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)

# ============================================
# Train!
# ============================================
print("Starting training...")
trainer.train()

# ============================================
# Save the Model
# ============================================
print("Saving model...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"Training complete! Model saved to {OUTPUT_DIR}")

4.2: Run the Training

cd /root
python train_llama3.py

The first run will download the LLaMA 3 model weights (~16 GB for the 8B model). Subsequent runs use the cached version.

4.3: Monitor Training

In a separate terminal (or tmux session), monitor GPU utilization:

watch -n 1 nvidia-smi

You should see GPU utilization near 100% and VRAM usage around 18–22 GB for the QLoRA 8B setup.

Training time depends on your dataset size and epochs:

Dataset Size   | Epochs | RTX 4090 Time | RTX 3090 Time
1,000 samples  | 3      | ~15–30 min    | ~25–45 min
5,000 samples  | 3      | ~1–2 hours    | ~2–3 hours
10,000 samples | 3      | ~2–4 hours    | ~4–7 hours
50,000 samples | 3      | ~10–20 hours  | ~18–35 hours
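These times scale roughly with the number of optimizer steps. A quick sketch of the arithmetic, using the script's defaults (batch size 4 with gradient accumulation 4, i.e. an effective batch of 16) and ignoring sequence packing, which merges short samples and reduces the count further:

```python
import math

def optimizer_steps(n_samples: int, epochs: int, batch_size: int, grad_accum: int) -> int:
    """Total optimizer steps for a run (before packing merges short samples)."""
    steps_per_epoch = math.ceil(n_samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs

print(optimizer_steps(5_000, 3, 4, 4))   # → 939
print(optimizer_steps(50_000, 3, 4, 4))  # → 9375
```

Multiply the step count by your observed seconds-per-step (visible in the training logs) to estimate total wall-clock time before committing to a long run.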

4.4: Estimated Costs on Clore.ai

Setup                  | GPU       | Duration  | Cost
1K samples, QLoRA 8B   | RTX 4090  | ~30 min   | $0.05–$0.12
5K samples, QLoRA 8B   | RTX 4090  | ~2 hours  | $0.20–$0.50
10K samples, QLoRA 8B  | RTX 4090  | ~4 hours  | $0.40–$1.00
10K samples, QLoRA 70B | A100 80GB | ~12 hours | $8.40–$12.00

These costs are remarkably low. Fine-tuning a state-of-the-art 8B parameter LLM for under $1 was unimaginable just two years ago.

Step 5: Test Your Fine-Tuned Model

After training completes, test your model:

5.1: Quick Inference Test

# test_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load fine-tuned LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_PATH)

# Generate a response
prompt = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is your return policy?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Run it:

python test_model.py

Your model should generate responses consistent with your training data.

5.2: Interactive Chat (Optional)

For a more interactive experience, install Gradio (pip install gradio) and create a simple chat interface:

# chat_interface.py
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)

def chat(message, history):
    prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response

demo = gr.ChatInterface(chat, title="Fine-Tuned LLaMA 3")
demo.launch(server_name="0.0.0.0", server_port=7860)

Access the chat at http://your-server-address:7860.

Step 6: Save and Export Your Model

6.1: Push to Hugging Face Hub

# Install git-lfs (only needed if you push via git;
# `huggingface-cli upload` uses the HTTP API instead)
apt install -y git-lfs
git lfs install

# Push the adapter to HF Hub
huggingface-cli upload your-username/llama3-finetuned /root/llama3-finetuned

6.2: Merge LoRA Weights (Optional)

If you want a standalone model without requiring the base model + adapter at inference time:

# merge_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"
MERGED_PATH = "/root/llama3-finetuned-merged"

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
model = model.merge_and_unload()

model.save_pretrained(MERGED_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(MERGED_PATH)

print(f"Merged model saved to {MERGED_PATH}")

6.3: Convert to GGUF for llama.cpp (Optional)

For local inference with llama.cpp:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt

# Build the llama.cpp binaries (the quantize tool is produced by this build)
cmake -B build && cmake --build build --config Release

# Convert to GGUF format
python convert_hf_to_gguf.py /root/llama3-finetuned-merged --outtype f16 --outfile llama3-finetuned.gguf

# Quantize to 4-bit for smaller file size
./build/bin/llama-quantize llama3-finetuned.gguf llama3-finetuned-q4_k_m.gguf q4_k_m

6.4: Download Your Model

Before terminating your rental, download your model weights:

# Compress the adapter
cd /root
tar -czf llama3-finetuned.tar.gz llama3-finetuned/

# Download via SCP (from your local machine)
scp -P your-port root@your-server-address:/root/llama3-finetuned.tar.gz ./

Advanced: Fine-Tuning LLaMA 3 70B

For the 70B model, the process is similar but requires more VRAM:

Hardware Requirements

  • QLoRA 70B: 1x A100 80GB ($0.70–$1.00/hr on Clore.ai)
  • Full Fine-Tuning 70B: 4–8x A100 80GB or 4–8x H100

Key Differences in Configuration

# For 70B, adjust these settings:
MODEL_NAME = "meta-llama/Meta-Llama-3.1-70B-Instruct"
BATCH_SIZE = 1  # Reduce batch size
GRADIENT_ACCUMULATION = 16  # Compensate with more accumulation
MAX_SEQ_LENGTH = 1024  # Shorter sequences save VRAM
LORA_R = 8  # Smaller rank saves VRAM

Multi-GPU Setup

For multi-GPU training, use accelerate:

# Configure accelerate for multi-GPU
accelerate config

# Run training with accelerate
accelerate launch train_llama3.py

Troubleshooting Common Issues

CUDA Out of Memory

  • Reduce BATCH_SIZE to 1 or 2
  • Reduce MAX_SEQ_LENGTH to 1024 or 512
  • Reduce LORA_R to 8
  • Ensure gradient_checkpointing=True
  • Switch to a GPU with more VRAM

Slow Training Speed

  • Enable Flash Attention 2 (attn_implementation="flash_attention_2")
  • Use bf16=True instead of fp16=True
  • Ensure packing=True in SFTTrainer for efficient batching
  • Upgrade to a faster GPU (RTX 4090 → RTX 5090 or H100)

Loss Not Decreasing

  • Check data quality and formatting
  • Try a lower learning rate (1e-4 or 5e-5)
  • Increase training epochs
  • Ensure your data isn't too repetitive

Model Generating Nonsense

  • Increase training data (aim for 1,000+ diverse examples)
  • Reduce learning rate to avoid catastrophic forgetting
  • Check that the prompt template matches training format exactly
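Template drift between training and inference is the most common cause: if the training data used Context: blocks and specific header tokens, the inference prompt must match character for character. One way to guard against this is to share a single prompt builder between the training and inference scripts (`build_prompt` is a hypothetical helper, shown here with the same template as the training script):

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Single source of truth for the LLaMA 3 chat template,
    importable from both the training and inference scripts."""
    user = instruction if not context else f"{instruction}\n\nContext: {context}"
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_prompt("What is your return policy?"))
```

Any formatting fix then propagates to both sides automatically, instead of silently diverging between train_llama3.py and test_model.py.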

Conclusion

Fine-tuning LLaMA 3 on a cloud GPU has never been more accessible or affordable. With Clore.ai, you can rent an RTX 4090 for $0.10–$0.25/hour and fine-tune an 8B parameter model for under $1. Even the massive 70B model can be fine-tuned for roughly $8–$12 using QLoRA on an A100 80GB.

Here's a recap of the workflow:

  1. Rent a GPU on Clore.ai (RTX 4090 for 8B, A100 80GB for 70B)
  2. Set up the environment with PyTorch, Transformers, PEFT, and TRL
  3. Prepare your dataset in JSONL format with instruction-output pairs
  4. Run QLoRA fine-tuning with the provided training script
  5. Test your model with inference scripts or a Gradio interface
  6. Save and export your adapter weights to Hugging Face or as a GGUF file

The era of custom LLMs is here, and it's accessible to everyone with a few dollars and a good dataset. What will you build?

Ready to fine-tune your first model? Rent a GPU on Clore.ai and start training today. For image generation workflows, check out our tutorial on How to Run Stable Diffusion on a Cloud GPU.
