How to Fine-Tune LLaMA 3 on a Cloud GPU
Large language models (LLMs) have transformed what's possible with AI — from chatbots and code assistants to content generation and data analysis. Meta's LLaMA 3 family of models, released with openly downloadable weights, has become one of the most popular foundations for building custom AI applications. But while LLaMA 3 is free to download, fine-tuning it to your specific use case requires serious GPU horsepower.
Fine-tuning adapts a pre-trained model to perform better on your specific data. Want a customer support bot that knows your product? Fine-tune LLaMA 3 on your support transcripts. Need a coding assistant for a specific framework? Fine-tune on your codebase documentation. The possibilities are endless — but the GPU requirements are real.
In this comprehensive tutorial, you'll learn how to fine-tune LLaMA 3 (8B and 70B parameter versions) on a cloud GPU rented from Clore.ai. We'll cover everything from choosing the right GPU and setting up the environment to running the training loop and saving your fine-tuned model. No expensive hardware needed — just a browser and a few dollars of compute budget.
Prerequisites
Before we start, here's what you'll need:
- A Clore.ai account with funds ($5–$20 depending on model size and training duration)
- A Hugging Face account to download LLaMA 3 weights (requires accepting Meta's license agreement)
- Basic Python knowledge — we'll provide all the code, but understanding what it does helps
- Your training dataset — a JSONL file with prompt-completion pairs (we'll cover format below)
Choosing the Right GPU for LLaMA 3 Fine-Tuning
GPU selection depends on the model size and fine-tuning method:
LLaMA 3 8B (LoRA/QLoRA Fine-Tuning)
- Minimum: RTX 3090 (24 GB VRAM) — $0.06–$0.12/hr on Clore.ai
- Recommended: RTX 4090 (24 GB VRAM) — $0.10–$0.25/hr on Clore.ai
- Best: RTX 5090 (32 GB VRAM) — $0.30–$0.50/hr on Clore.ai
With QLoRA (4-bit quantization + LoRA adapters), the 8B model fits comfortably in 24 GB of VRAM with room for reasonable batch sizes.
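A quick back-of-envelope calculation shows why. In NF4, the frozen base weights cost roughly half a byte per parameter (the 0.55 bytes/parameter figure below is an approximation that includes per-block quantization constants, not a measured value):

```python
# Illustrative arithmetic: the VRAM footprint of 4-bit (NF4) base weights.
# 0.55 bytes/param is an assumed figure: ~4 bits per weight plus
# per-block quantization constants.
BYTES_PER_PARAM_NF4 = 0.55

def base_weights_gb(n_params: float) -> float:
    """VRAM consumed by the frozen, 4-bit-quantized base weights alone."""
    return n_params * BYTES_PER_PARAM_NF4 / 1e9

print(f"8B base weights:  ~{base_weights_gb(8e9):.1f} GB")   # ~4.4 GB
print(f"70B base weights: ~{base_weights_gb(70e9):.1f} GB")  # ~38.5 GB
```

The rest of the VRAM observed during training goes to LoRA adapters, optimizer state, activations, and CUDA overhead, which is why actual usage lands well above the weight footprint alone.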
LLaMA 3 70B (QLoRA Fine-Tuning)
- Minimum: A100 40GB — $0.40–$0.70/hr on Clore.ai
- Recommended: A100 80GB — $0.70–$1.00/hr on Clore.ai
- Best: H100 80GB — $1.00–$2.00/hr on Clore.ai
The 70B model is significantly larger and requires at least 40 GB of VRAM with 4-bit quantization. For full fine-tuning (not LoRA), you'll need multi-GPU setups.
LLaMA 3 8B (Full Fine-Tuning)
- Minimum: 2x RTX 4090 or 1x A100 80GB
- Recommended: 2x A100 80GB or 4x RTX 4090
Full fine-tuning updates all model parameters and requires significantly more VRAM than LoRA methods.
For a complete pricing guide, see our Top 10 Cheapest GPUs for AI Training.
Step 1: Rent a GPU on Clore.ai
1.1: Find a Suitable Server
Log into Clore.ai and navigate to the marketplace. For this tutorial, we'll fine-tune LLaMA 3 8B with QLoRA on an RTX 4090.
Filter the marketplace:
- GPU: RTX 4090
- Sort by: Price (low to high)
- Storage: At least 100 GB (model weights + datasets + checkpoints)
- RAM: 32 GB or more
- Internet: 500+ Mbps for faster model downloads
1.2: Configure Your Rental
Select a server and configure:
- Docker image: Choose a PyTorch image (e.g., pytorch/pytorch:2.3.0-cuda12.4-cudnn9-devel) or a base Ubuntu image with NVIDIA drivers
- SSH access: Enable (you'll need terminal access for most of this tutorial)
- Ports: Expose port 8888 if you want to use Jupyter Notebook, or port 7860 for Gradio-based testing later
- Rental type: On-demand for uninterrupted training, or GigaSPOT if you're comfortable with checkpointing
1.3: Start the Rental
Click "Rent" and wait for your server to come online (usually 1–3 minutes). Note the SSH connection details from your dashboard.
Step 2: Set Up the Environment
SSH into your rented server:
ssh root@your-server-address -p your-port
2.1: Update System and Install Dependencies
# Update package lists
apt update && apt upgrade -y
# Install essential tools
apt install -y git wget curl nano htop
# Verify GPU is detected
nvidia-smi
You should see your RTX 4090 listed with driver version and CUDA information.
2.2: Install Python Packages
# Upgrade pip
pip install --upgrade pip
# Install core ML libraries
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Install fine-tuning stack
pip install transformers==4.45.0
pip install datasets==3.1.0
pip install accelerate==1.0.0
pip install peft==0.13.0
pip install bitsandbytes==0.44.0
pip install trl==0.12.0
pip install scipy
pip install sentencepiece
pip install protobuf
# Install wandb for experiment tracking (optional but recommended)
pip install wandb
2.3: Authenticate with Hugging Face
# Install Hugging Face CLI
pip install huggingface_hub
# Login (you'll need your HF token)
huggingface-cli login
Enter your Hugging Face access token when prompted. Make sure you've accepted Meta's license agreement for LLaMA 3 on the Hugging Face model page.
Step 3: Prepare Your Training Data
3.1: Data Format
Fine-tuning data should be in a structured format. The most common is a JSONL file where each line is a conversation or instruction-response pair:
{"instruction": "What is your return policy?", "input": "", "output": "Our return policy allows returns within 30 days of purchase. Items must be in original condition with tags attached. Contact support@example.com to initiate a return."}
{"instruction": "How do I track my order?", "input": "Order #12345", "output": "You can track your order by visiting our tracking page at example.com/track and entering your order number. You'll see real-time updates on your shipment status."}
{"instruction": "Do you offer international shipping?", "input": "", "output": "Yes, we ship to over 50 countries worldwide. International shipping rates are calculated at checkout based on your destination and package weight. Delivery typically takes 7-14 business days."}
3.2: Upload Your Dataset
Upload your dataset to the server:
# Option 1: SCP from your local machine
scp -P your-port /path/to/your/dataset.jsonl root@your-server-address:/root/dataset.jsonl
# Option 2: Download from a URL
wget https://your-storage.com/dataset.jsonl -O /root/dataset.jsonl
# Option 3: Use a Hugging Face dataset
# (We'll handle this in the training script)
3.3: Data Quality Tips
The quality of your fine-tuning data directly impacts model performance:
- Minimum 100 examples for noticeable effect; 1,000–10,000 for robust results
- Consistent formatting — use the same instruction style throughout
- Diverse examples — cover the range of inputs your model will encounter
- Clean data — remove duplicates, fix typos, ensure accuracy
- Balanced distribution — don't over-represent any single topic
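Before uploading, it's worth sanity-checking the file. Here is a minimal validator for the three-key schema shown above (the function name and checks are ours; adapt them to your own format):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Load a JSONL dataset, reporting malformed lines and dropping exact duplicates."""
    seen, records, errors = set(), [], []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                rec = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {lineno}: invalid JSON ({e.msg})")
                continue
            missing = REQUIRED_KEYS - rec.keys()
            if missing:
                errors.append(f"line {lineno}: missing keys {sorted(missing)}")
                continue
            key = (rec["instruction"], rec["input"], rec["output"])
            if key in seen:
                continue  # exact duplicate, drop it
            seen.add(key)
            records.append(rec)
    if errors:
        raise ValueError("\n".join(errors))
    return records
```

Run it once before training starts; a single malformed line is much cheaper to find now than midway through a paid GPU rental.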
Step 4: Fine-Tune LLaMA 3 8B with QLoRA
Now for the main event. We'll use QLoRA, which combines 4-bit quantization with Low-Rank Adaptation to dramatically reduce VRAM requirements while maintaining training quality.
4.1: Create the Training Script
nano /root/train_llama3.py
Paste the following script:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# ============================================
# Configuration
# ============================================
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
DATASET_PATH = "/root/dataset.jsonl"
OUTPUT_DIR = "/root/llama3-finetuned"
MAX_SEQ_LENGTH = 2048
NUM_EPOCHS = 3
BATCH_SIZE = 4
GRADIENT_ACCUMULATION = 4
LEARNING_RATE = 2e-4
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
# ============================================
# Load Tokenizer
# ============================================
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# ============================================
# Quantization Config (4-bit for QLoRA)
# ============================================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# ============================================
# Load Model
# ============================================
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    # Requires the flash-attn package (pip install flash-attn);
    # change to "sdpa" if it isn't installed.
    attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)
# ============================================
# LoRA Configuration
# ============================================
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ============================================
# Load and Format Dataset
# ============================================
print("Loading dataset...")
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
def format_instruction(sample):
    """Format dataset samples into the LLaMA 3 chat template."""
    if sample.get("input", ""):
        text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{sample['instruction']}
Context: {sample['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{sample['output']}<|eot_id|>"""
    else:
        text = f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
{sample['instruction']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{sample['output']}<|eot_id|>"""
    return {"text": text}
dataset = dataset.map(format_instruction)
# ============================================
# Training Arguments
# ============================================
# SFTConfig extends TrainingArguments with SFT-specific options; in trl 0.12
# the fields max_seq_length, dataset_text_field, and packing belong here
# rather than being passed to SFTTrainer directly.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    group_by_length=True,
    report_to="wandb",  # Change to "none" to disable
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_text_field="text",
    packing=True,
)
# ============================================
# Initialize Trainer
# ============================================
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
)
# ============================================
# Train!
# ============================================
print("Starting training...")
trainer.train()
# ============================================
# Save the Model
# ============================================
print("Saving model...")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Training complete! Model saved to {OUTPUT_DIR}")
4.2: Run the Training
cd /root
python train_llama3.py
The first run will download the LLaMA 3 model weights (~16 GB for the 8B model). Subsequent runs use the cached version.
4.3: Monitor Training
In a separate terminal (or tmux session), monitor GPU utilization:
watch -n 1 nvidia-smi
You should see GPU utilization near 100% and VRAM usage around 18–22 GB for the QLoRA 8B setup.
Training time depends on your dataset size and epochs:
| Dataset Size | Epochs | RTX 4090 Time | RTX 3090 Time |
|---|---|---|---|
| 1,000 samples | 3 | ~15–30 min | ~25–45 min |
| 5,000 samples | 3 | ~1–2 hours | ~2–3 hours |
| 10,000 samples | 3 | ~2–4 hours | ~4–7 hours |
| 50,000 samples | 3 | ~10–20 hours | ~18–35 hours |
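These figures follow from simple arithmetic: the number of optimizer updates is samples × epochs divided by the effective batch size (per-device batch × gradient accumulation). The seconds-per-step figure below is a rough assumption for this QLoRA setup on an RTX 4090, not a benchmark:

```python
import math

def optimizer_steps(samples, epochs, batch=4, grad_accum=4):
    """Optimizer updates for one run (sequence packing ignored for simplicity)."""
    effective_batch = batch * grad_accum
    return math.ceil(samples * epochs / effective_batch)

steps = optimizer_steps(1_000, 3)
print(steps)  # 188
# At an assumed ~5-10 seconds per step, ~188 steps works out to roughly 15-30 minutes.
```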
4.4: Estimated Costs on Clore.ai
| Setup | GPU | Duration | Cost |
|---|---|---|---|
| 1K samples, QLoRA 8B | RTX 4090 | ~30 min | $0.05–$0.12 |
| 5K samples, QLoRA 8B | RTX 4090 | ~2 hours | $0.20–$0.50 |
| 10K samples, QLoRA 8B | RTX 4090 | ~4 hours | $0.40–$1.00 |
| 10K samples, QLoRA 70B | A100 80GB | ~12 hours | $8.40–$12.00 |
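You can reproduce these numbers directly from the hourly price band:

```python
def rental_cost(hours, rate_low, rate_high):
    """USD cost range for a rental, given an hourly price band."""
    return (round(hours * rate_low, 2), round(hours * rate_high, 2))

# 10K-sample QLoRA 70B run: ~12 hours on an A100 80GB at $0.70-$1.00/hr
print(rental_cost(12, 0.70, 1.00))  # (8.4, 12.0)
```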
These costs are remarkably low. Fine-tuning a state-of-the-art 8B parameter LLM for under $1 was unimaginable just two years ago.
Step 5: Test Your Fine-Tuned Model
After training completes, test your model:
5.1: Quick Inference Test
# test_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"
# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load fine-tuned LoRA adapter
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
# Generate a response
prompt = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>
What is your return policy?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Run it:
python test_model.py
Your model should generate responses consistent with your training data.
5.2: Interactive Chat (Optional)
For a more interactive experience, create a simple Gradio interface:
# chat_interface.py
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
def chat(message, history):
    prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response
demo = gr.ChatInterface(chat, title="Fine-Tuned LLaMA 3")
demo.launch(server_name="0.0.0.0", server_port=7860)
Access the chat at http://your-server-address:7860.
Step 6: Save and Export Your Model
6.1: Push to Hugging Face Hub
# Install git-lfs for large files
apt install git-lfs
git lfs install
# Push the adapter to HF Hub
huggingface-cli upload your-username/llama3-finetuned /root/llama3-finetuned
6.2: Merge LoRA Weights (Optional)
If you want a standalone model without requiring the base model + adapter at inference time:
# merge_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"
ADAPTER_PATH = "/root/llama3-finetuned"
MERGED_PATH = "/root/llama3-finetuned-merged"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
model = model.merge_and_unload()
model.save_pretrained(MERGED_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(MERGED_PATH)
print(f"Merged model saved to {MERGED_PATH}")
6.3: Convert to GGUF for llama.cpp (Optional)
For local inference with llama.cpp:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
# Build the llama.cpp binaries (needed for llama-quantize)
cmake -B build && cmake --build build --config Release -j
# Convert to GGUF format
python convert_hf_to_gguf.py /root/llama3-finetuned-merged --outtype f16 --outfile llama3-finetuned.gguf
# Quantize to 4-bit for smaller file size
./build/bin/llama-quantize llama3-finetuned.gguf llama3-finetuned-q4_k_m.gguf q4_k_m
6.4: Download Your Model
Before terminating your rental, download your model weights:
# Compress the adapter
cd /root
tar -czf llama3-finetuned.tar.gz llama3-finetuned/
# Download via SCP (from your local machine)
scp -P your-port root@your-server-address:/root/llama3-finetuned.tar.gz ./
Advanced: Fine-Tuning LLaMA 3 70B
For the 70B model, the process is similar but requires more VRAM:
Hardware Requirements
- QLoRA 70B: 1x A100 80GB ($0.70–$1.00/hr on Clore.ai)
- Full Fine-Tuning 70B: 4–8x A100 80GB or 4–8x H100
Key Differences in Configuration
# For 70B, adjust these settings:
MODEL_NAME = "meta-llama/Meta-Llama-3.1-70B-Instruct"
BATCH_SIZE = 1 # Reduce batch size
GRADIENT_ACCUMULATION = 16 # Compensate with more accumulation
MAX_SEQ_LENGTH = 1024 # Shorter sequences save VRAM
LORA_R = 8 # Smaller rank saves VRAM
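Note that the batch-size and accumulation changes cancel out: the optimizer still sees 16 samples per update, so the learning-rate schedule behaves comparably across both configurations:

```python
# Effective batch size = per-device batch x gradient accumulation steps
effective_8b = 4 * 4    # 8B config
effective_70b = 1 * 16  # 70B config
assert effective_8b == effective_70b == 16
```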
Multi-GPU Setup
For multi-GPU training, use accelerate:
# Configure accelerate for multi-GPU
accelerate config
# Run training with accelerate
accelerate launch train_llama3.py
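If you'd rather skip the interactive prompts, accelerate reads its settings from a YAML file (by default ~/.cache/huggingface/accelerate/default_config.yaml). A minimal two-GPU, bf16 sketch; verify the field values against what `accelerate config` generates for your machine:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 2  # one process per GPU
gpu_ids: all
machine_rank: 0
use_cpu: false
```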
Troubleshooting Common Issues
CUDA Out of Memory
- Reduce BATCH_SIZE to 1 or 2
- Reduce MAX_SEQ_LENGTH to 1024 or 512
- Reduce LORA_R to 8
- Ensure gradient_checkpointing=True
- Switch to a GPU with more VRAM
Slow Training Speed
- Enable Flash Attention 2 (attn_implementation="flash_attention_2")
- Use bf16=True instead of fp16=True
- Ensure packing=True is set for efficient batching
- Upgrade to a faster GPU (RTX 4090 → RTX 5090 or H100)
Loss Not Decreasing
- Check data quality and formatting
- Try a lower learning rate (1e-4 or 5e-5)
- Increase training epochs
- Ensure your data isn't too repetitive
Model Generating Nonsense
- Increase training data (aim for 1,000+ diverse examples)
- Reduce learning rate to avoid catastrophic forgetting
- Check that the prompt template matches training format exactly
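The simplest way to guarantee an exact match is to build prompts with one shared helper used by both the training and inference scripts. A sketch mirroring the training template used earlier (the helper name is ours):

```python
def llama3_user_prompt(instruction, context=""):
    """Build an inference prompt in exactly the format used during training."""
    body = instruction if not context else f"{instruction}\nContext: {context}"
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{body}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )

prompt = llama3_user_prompt("What is your return policy?")
```

Any drift between the two formats, even a missing newline, can noticeably degrade output quality.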
Conclusion
Fine-tuning LLaMA 3 on a cloud GPU has never been more accessible or affordable. With Clore.ai, you can rent an RTX 4090 for $0.10–$0.25/hour and fine-tune an 8B parameter model for under $1. Even the massive 70B model can be fine-tuned for roughly $8–$12 using QLoRA on an A100 80GB.
Here's a recap of the workflow:
- Rent a GPU on Clore.ai (RTX 4090 for 8B, A100 80GB for 70B)
- Set up the environment with PyTorch, Transformers, PEFT, and TRL
- Prepare your dataset in JSONL format with instruction-output pairs
- Run QLoRA fine-tuning with the provided training script
- Test your model with inference scripts or a Gradio interface
- Save and export your adapter weights to Hugging Face or as a GGUF file
The era of custom LLMs is here, and it's accessible to everyone with a few dollars and a good dataset. What will you build?
Ready to fine-tune your first model? Rent a GPU on Clore.ai and start training today. For image generation workflows, check out our tutorial on How to Run Stable Diffusion on a Cloud GPU.