How to Fine-Tune LLaMA 3 on a Cloud GPU
Large language models (LLMs) have transformed what's possible with AI — from chatbots and code assistants to content generation and data analysis. Meta's LLaMA 3 family of models, released with openly downloadable weights, has become one of the most popular foundations for building custom AI applications. But while LLaMA 3 is free to download, fine-tuning it to your specific use case requires serious GPU horsepower.
Fine-tuning adapts a pre-trained model to perform better on your specific data. Want a customer support bot that knows your product? Fine-tune LLaMA 3 on your support transcripts. Need a coding assistant for a specific framework? Fine-tune on your codebase documentation. The possibilities are endless — but the GPU requirements are real.
In this comprehensive tutorial, you'll learn how to fine-tune LLaMA 3 (8B and 70B parameter versions) on a cloud GPU rented from Clore.ai. We'll cover everything from choosing the right GPU and setting up the environment to running the training loop and saving your fine-tuned model. No expensive hardware needed — just a browser and a few dollars of compute budget.
Prerequisites
Before we start, here's what you'll need:
- A Clore.ai account with funds ($5–$20 depending on model size and training duration)
- A Hugging Face account to download LLaMA 3 weights (requires accepting Meta's license agreement)
- Basic Python knowledge — we'll provide all the code, but understanding what it does helps
- Your training dataset — a JSONL file with prompt-completion pairs (we'll cover format below)
Choosing the Right GPU for LLaMA 3 Fine-Tuning
GPU selection depends on the model size and fine-tuning method:
LLaMA 3 8B (LoRA/QLoRA Fine-Tuning)
- Minimum: RTX 3090 (24 GB VRAM) — $0.06–$0.12/hr on Clore.ai
- Recommended: RTX 4090 (24 GB VRAM) — $0.10–$0.25/hr on Clore.ai
- Best: RTX 5090 (32 GB VRAM) — $0.30–$0.50/hr on Clore.ai
With QLoRA (4-bit quantization + LoRA adapters), the 8B model fits comfortably in 24 GB of VRAM with room for reasonable batch sizes.
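A quick back-of-envelope calculation shows why. In NF4, the frozen base weights cost roughly half a byte per parameter (the 0.55 bytes/parameter figure below is an approximation that includes per-block quantization constants, not a measured value):

```python
# Illustrative arithmetic: the VRAM footprint of 4-bit (NF4) base weights.
# 0.55 bytes/param is an assumed figure: ~4 bits per weight plus
# per-block quantization constants.
BYTES_PER_PARAM_NF4 = 0.55

def base_weights_gb(n_params: float) -> float:
    """VRAM consumed by the frozen, 4-bit-quantized base weights alone."""
    return n_params * BYTES_PER_PARAM_NF4 / 1e9

print(f"8B base weights:  ~{base_weights_gb(8e9):.1f} GB")   # ~4.4 GB
print(f"70B base weights: ~{base_weights_gb(70e9):.1f} GB")  # ~38.5 GB
```

The rest of the VRAM observed during training goes to LoRA adapters, optimizer state, activations, and CUDA overhead, which is why actual usage lands well above the weight footprint alone.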
LLaMA 3 70B (QLoRA Fine-Tuning)
- Minimum: A100 40GB — $0.40–$0.70/hr on Clore.ai
- Recommended: A100 80GB — $0.70–$1.00/hr on Clore.ai
- Best: H100 80GB — $1.00–$2.00/hr on Clore.ai
The 70B model is significantly larger and requires at least 40 GB of VRAM with 4-bit quantization. For full fine-tuning (not LoRA), you'll need multi-GPU setups.
LLaMA 3 8B (Full Fine-Tuning)
- Minimum: 2x RTX 4090 or 1x A100 80GB
- Recommended: 2x A100 80GB or 4x RTX 4090
Full fine-tuning updates all model parameters and requires significantly more VRAM than LoRA methods.
For a complete pricing guide, see our Top 10 Cheapest GPUs for AI Training.
Step 1: Rent a GPU on Clore.ai
1.1: Find a Suitable Server
Log into Clore.ai and navigate to the marketplace. For this tutorial, we'll fine-tune LLaMA 3 8B with QLoRA on an RTX 4090.
Filter the marketplace:
- GPU: RTX 4090
- Sort by: Price (low to high)
- Storage: At least 100 GB (model weights + datasets + checkpoints)
- RAM: 32 GB or more
- Internet: 500+ Mbps for faster model downloads
1.2: Configure Your Rental
Select a server and configure:
- Docker image: Choose a PyTorch image (e.g., pytorch/pytorch:2.3.0-cuda12.4-cudnn9-devel) or a base Ubuntu image with NVIDIA drivers
- SSH access: Enable (you'll need terminal access for most of this tutorial)
- Ports: Expose port 8888 if you want to use Jupyter Notebook, or port 7860 for Gradio-based testing later
- Rental type: On-demand for uninterrupted training, or GigaSPOT if you're comfortable with checkpointing
1.3: Start the Rental
Click "Rent" and wait for your server to come online (usually 1–3 minutes). Note the SSH connection details from your dashboard.
Step 2: Set Up the Environment
SSH into your rented server:
ssh root@your-server-address -p your-port
2.1: Update System and Install Dependencies
# Update package lists
apt update && apt upgrade -y
# Install essential tools
apt install -y git wget curl nano htop
# Verify GPU is detected
nvidia-smi
You should see your RTX 4090 listed with driver version and CUDA information.
2.2: Install Python Packages
# Upgrade pip
pip install --upgrade pip
# Install core ML libraries
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Install fine-tuning stack
pip install transformers==4.45.0
pip install datasets==3.1.0
pip install accelerate==1.0.0
pip install peft==0.13.0
pip install bitsandbytes==0.44.0
pip install trl==0.12.0
pip install scipy
pip install sentencepiece
pip install protobuf
# Install wandb for experiment tracking (optional but recommended)
pip install wandb
2.3: Authenticate with Hugging Face
# Install Hugging Face CLI
pip install huggingface_hub
# Login (you'll need your HF token)
huggingface-cli login
Enter your Hugging Face access token when prompted. Make sure you've accepted Meta's license agreement for LLaMA 3 on the Hugging Face model page.
Step 3: Prepare Your Training Data
3.1: Data Format
Fine-tuning data should be in a structured format. The most common is a JSONL file where each line is a conversation or instruction-response pair:
{"instruction": "What is your return policy?", "input": "", "output": "Our return policy allows returns within 30 days of purchase. Items must be in original condition with tags attached. Contact support@example.com to initiate a return."}
{"instruction": "How do I track my order?", "input": "Order #12345", "output": "You can track your order by visiting our tracking page at example.com/track and entering your order number. You'll see real-time updates on your shipment status."}
{"instruction": "Do you offer international shipping?", "input": "", "output": "Yes, we ship to over 50 countries worldwide. International shipping rates are calculated at checkout based on your destination and package weight. Delivery typically takes 7-14 business days."}
3.2: Upload Your Dataset
Upload your dataset to the server:
# Option 1: SCP from your local machine
scp -P your-port /path/to/your/dataset.jsonl root@your-server-address:/root/dataset.jsonl
# Option 2: Download from a URL
wget https://your-storage.com/dataset.jsonl -O /root/dataset.jsonl
# Option 3: Use a Hugging Face dataset
# (We'll handle this in the training script)
3.3: Data Quality Tips
The quality of your fine-tuning data directly impacts model performance:
- Minimum 100 examples for noticeable effect; 1,000–10,000 for robust results
- Consistent formatting — use the same instruction style throughout
- Diverse examples — cover the range of inputs your model will encounter
- Clean data — remove duplicates, fix typos, ensure accuracy
- Balanced distribution — don't over-represent any single topic
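Before uploading, it's worth sanity-checking the file. Here is a minimal validator for the three-key schema shown above (the function name and checks are ours; adapt them to your own format):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Load a JSONL dataset, reporting malformed lines and dropping exact duplicates."""
    seen, records, errors = set(), [], []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                rec = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {lineno}: invalid JSON ({e.msg})")
                continue
            missing = REQUIRED_KEYS - rec.keys()
            if missing:
                errors.append(f"line {lineno}: missing keys {sorted(missing)}")
                continue
            key = (rec["instruction"], rec["input"], rec["output"])
            if key in seen:
                continue  # exact duplicate, drop it
            seen.add(key)
            records.append(rec)
    if errors:
        raise ValueError("\n".join(errors))
    return records
```

Run it once before training starts; a single malformed line is much cheaper to find now than midway through a paid GPU rental.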
Step 4: Fine-Tune LLaMA 3 8B with QLoRA
Now for the main event. We'll use QLoRA, which combines 4-bit quantization with Low-Rank Adaptation to dramatically reduce VRAM requirements while maintaining training quality.
4.1: Create the Training Script
nano /root/train_llama3.py
Paste the following script:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# ============================================
# Configuration
# ============================================
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
DATASET_PATH = "/root/dataset.jsonl"
OUTPUT_DIR = "/root/llama3-finetuned"
MAX_SEQ_LENGTH = 2048
NUM_EPOCHS = 3
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4
LEARNING_RATE = 2e-4
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
# ============================================
# Load Tokenizer
# ============================================
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# ============================================
# Quantization Config (4-bit for QLoRA)
# ============================================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# ============================================
# Load Model
# ============================================
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # Requires the flash-attn package (pip install flash-attn);
    # change to "sdpa" if it isn't installed.
    attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)
# ============================================
# LoRA Configuration
# ============================================
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ============================================
# Load and Format Dataset
# ============================================
print("Loading dataset...")
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
def format_instruction(sample):
    """Format dataset samples into the LLaMA 3 chat template."""
    if sample.get("input", ""):
        text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{sample['instruction']}
Context: {sample['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{sample['output']}<|eot_id|>"""
    else:
        text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{sample['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{sample['output']}<|eot_id|>"""
    return {"text": text}
dataset = dataset.map(format_instruction)
# ============================================
# Training Arguments
# ============================================
# SFTConfig extends TrainingArguments with SFT-specific options; in trl 0.12
# the fields max_seq_length, dataset_text_field, and packing belong here
# rather than being passed to SFTTrainer directly.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    group_by_length=True,
    report_to="wandb",  # Change to "none" to disable
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=True,
)
# ============================================
# Initialize Trainer
# ============================================
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)
# ============================================
# Train!
# ============================================
print("Starting training...")
trainer.train()
# ============================================
# Save the Model
# ============================================
print("Saving model...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Training complete! Model saved to {OUTPUT_DIR}")
4.2: Run the Training
cd /root
python train_llama3.py
The first run will download the LLaMA 3 model weights (~16 GB for the 8B model). Subsequent runs use the cached version.
4.3: Monitor Training
In a separate terminal (or tmux session), monitor GPU utilization:
watch -n 1 nvidia-smi
You should see GPU utilization near 100% and VRAM usage around 18–22 GB for the QLoRA 8B setup.
Training time depends on your dataset size and epochs:
| Dataset Size | Epochs | RTX 4090 Time | RTX 3090 Time |
|---|---|---|---|
| 1,000 samples | 3 | ~15–30 min | ~25–45 min |
| 5,000 samples | 3 | ~1–2 hours | ~2–3 hours |
| 10,000 samples | 3 | ~2–4 hours | ~4–7 hours |
| 50,000 samples | 3 | ~10–20 hours | ~18–35 hours |
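These figures follow from simple arithmetic: the number of optimizer updates is samples × epochs divided by the effective batch size (per-device batch × gradient accumulation). The seconds-per-step figure below is a rough assumption for this QLoRA setup on an RTX 4090, not a benchmark:

```python
import math

def optimizer_steps(samples, epochs, batch=4, grad_accum=4):
    """Optimizer updates for one run (sequence packing ignored for simplicity)."""
    effective_batch = batch * grad_accum
    return math.ceil(samples * epochs / effective_batch)

steps = optimizer_steps(1_000, 3)
print(steps)  # 188
# At an assumed ~5-10 seconds per step, ~188 steps works out to roughly 15-30 minutes.
```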
4.4: Estimated Costs on Clore.ai
| Setup | GPU | Duration | Cost |
|---|---|---|---|
| 1K samples, QLoRA 8B | RTX 4090 | ~30 min | $0.05–$0.12 |
| 5K samples, QLoRA 8B | RTX 4090 | ~2 hours | $0.20–$0.50 |
| 10K samples, QLoRA 8B | RTX 4090 | ~4 hours | $0.40–$1.00 |
| 10K samples, QLoRA 70B | A100 80GB | ~12 hours | $8.40–$12.00 |
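You can reproduce these numbers directly from the hourly price band:

```python
def rental_cost(hours, rate_low, rate_high):
    """USD cost range for a rental, given an hourly price band."""
    return (round(hours * rate_low, 2), round(hours * rate_high, 2))

# 10K-sample QLoRA 70B run: ~12 hours on an A100 80GB at $0.70-$1.00/hr
print(rental_cost(12, 0.70, 1.00))  # (8.4, 12.0)
```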
These costs are remarkably low. Fine-tuning a state-of-the-art 8B parameter LLM for under $1 was unimaginable just two years ago.
Step 5: Test Your Fine-Tuned Model
After training completes, test your model:
5.1: Quick Inference Test
# test_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"
# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load fine-tuned LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
# Generate a response
prompt = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
What is your return policy?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Run it:
python test_model.py
Your model should generate responses consistent with your training data.
5.2: Interactive Chat (Optional)
For a more interactive experience, create a simple Gradio interface:
# chat_interface.py
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
def chat(message, history):
    prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response
demo = gr.ChatInterface(chat, title="Fine-Tuned LLaMA 3")
demo.launch(server_name="0.0.0.0", server_port=7860)
Access the chat at http://your-server-address:7860.
Step 6: Save and Export Your Model
6.1: Push to Hugging Face Hub
# Install git-lfs for large files
apt install git-lfs
git lfs install
# Push the adapter to HF Hub
huggingface-cli upload your-username/llama3-finetuned /root/llama3-finetuned
6.2: Merge LoRA Weights (Optional)
If you want a standalone model without requiring the base model + adapter at inference time:
# merge_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"
MERGED_PATH = "/root/llama3-finetuned-merged"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
model = model.merge_and_unload()
model.save_pretrained(MERGED_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(MERGED_PATH)
print(f"Merged model saved to {MERGED_PATH}")
6.3: Convert to GGUF for llama.cpp (Optional)
For local inference with llama.cpp:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
# Build the llama.cpp binaries (needed for llama-quantize)
cmake -B build && cmake --build build --config Release -j
# Convert to GGUF format
python convert_hf_to_gguf.py /root/llama3-finetuned-merged --outtype f16 --outfile llama3-finetuned.gguf
# Quantize to 4-bit for smaller file size
./build/bin/llama-quantize llama3-finetuned.gguf llama3-finetuned-q4_k_m.gguf q4_k_m
6.4: Download Your Model
Before terminating your rental, download your model weights:
# Compress the adapter
cd /root
tar -czf llama3-finetuned.tar.gz llama3-finetuned/
# Download via SCP (from your local machine)
scp -P your-port root@your-server-address:/root/llama3-finetuned.tar.gz ./
Advanced: Fine-Tuning LLaMA 3 70B
For the 70B model, the process is similar but requires more VRAM:
Hardware Requirements
- QLoRA 70B: 1x A100 80GB ($0.70–$1.00/hr on Clore.ai)
- Full Fine-Tuning 70B: 4–8x A100 80GB or 4–8x H100
Key Differences in Configuration
# For 70B, adjust these settings:
MODEL_NAME = "meta-llama/Meta-Llama-3.1-70B-Instruct"
BATCH_SIZE = 1 # Reduce batch size
GRADIENT_ACCUMULATION = 16 # Compensate with more accumulation
MAX_SEQ_LENGTH = 1024 # Shorter sequences save VRAM
LORA_R = 8 # Smaller rank saves VRAM
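Note that the batch-size and accumulation changes cancel out: the optimizer still sees 16 samples per update, so the learning-rate schedule behaves comparably across both configurations:

```python
# Effective batch size = per-device batch x gradient accumulation steps
effective_8b = 4 * 4    # 8B config
effective_70b = 1 * 16  # 70B config
assert effective_8b == effective_70b == 16
```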
Multi-GPU Setup
For multi-GPU training, use accelerate:
# Configure accelerate for multi-GPU
accelerate config
# Run training with accelerate
accelerate launch train_llama3.py
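If you'd rather skip the interactive prompts, accelerate reads its settings from a YAML file (by default ~/.cache/huggingface/accelerate/default_config.yaml). A minimal two-GPU, bf16 sketch; verify the field values against what `accelerate config` generates for your machine:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 2  # one process per GPU
gpu_ids: all
machine_rank: 0
use_cpu: false
```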
Troubleshooting Common Issues
CUDA Out of Memory
- Reduce BATCH_SIZE to 1 or 2
- Reduce MAX_SEQ_LENGTH to 1024 or 512
- Reduce LORA_R to 8
- Ensure gradient_checkpointing=True
- Switch to a GPU with more VRAM
Slow Training Speed
- Enable Flash Attention 2 (attn_implementation="flash_attention_2")
- Use bf16=True instead of fp16=True
- Ensure packing=True is set for efficient batching
- Upgrade to a faster GPU (RTX 4090 → RTX 5090 or H100)
Loss Not Decreasing
- Check data quality and formatting
- Try a lower learning rate (1e-4 or 5e-5)
- Increase training epochs
- Ensure your data isn't too repetitive
Model Generating Nonsense
- Increase training data (aim for 1,000+ diverse examples)
- Reduce learning rate to avoid catastrophic forgetting
- Check that the prompt template matches training format exactly
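The simplest way to guarantee an exact match is to build prompts with one shared helper used by both the training and inference scripts. A sketch mirroring the training template used earlier (the helper name is ours):

```python
def llama3_user_prompt(instruction, context=""):
    """Build an inference prompt in exactly the format used during training."""
    body = instruction if not context else f"{instruction}\nContext: {context}"
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{body}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )

prompt = llama3_user_prompt("What is your return policy?")
```

Any drift between the two formats, even a missing newline, can noticeably degrade output quality.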
Conclusion
Fine-tuning LLaMA 3 on a cloud GPU has never been more accessible or affordable. With Clore.ai, you can rent an RTX 4090 for $0.10–$0.25/hour and fine-tune an 8B parameter model for under $1. Even the massive 70B model can be fine-tuned for roughly $8–$12 using QLoRA on an A100 80GB.
Here's a recap of the workflow:
- Rent a GPU on Clore.ai (RTX 4090 for 8B, A100 80GB for 70B)
- Set up the environment with PyTorch, Transformers, PEFT, and TRL
- Prepare your dataset in JSONL format with instruction-output pairs
- Run QLoRA fine-tuning with the provided training script
- Test your model with inference scripts or a Gradio interface
- Save and export your adapter weights to Hugging Face or as a GGUF file
The era of custom LLMs is here, and it's accessible to everyone with a few dollars and a good dataset. What will you build?
Ready to fine-tune your first model? Rent a GPU on Clore.ai and start training today. For image generation workflows, check out our tutorial on How to Run Stable Diffusion on a Cloud GPU.