Why Fine-tune? When RAG and ICL Aren't Enough
In the previous tutorial, you learned that RAG and In-Context Learning solve 95% of LLM application needs. But what about the other 5%? When do you actually need to fine-tune a model?
⚠️ Critical Decision Point: Fine-tuning is powerful but expensive and time-consuming. Before fine-tuning, ask yourself: "Have I truly exhausted prompt engineering, few-shot learning, and RAG?" If not, go back and try those first.
Scenarios Where Fine-tuning Wins
1. Learning Specific Style/Tone
Problem: You need your chatbot to respond in your company's unique voice: formal, technical, friendly, etc.
Why RAG fails: RAG retrieves facts, not communication style. Few-shot examples can't capture nuanced tone across all situations.
Fine-tuning wins: Model internalizes the style through thousands of examples.
Example: A customer support chatbot trained on 5,000 past conversations learns your company's empathetic, professional tone automatically.
2. Domain-Specific Behavior
Problem: Need deep medical/legal/scientific reasoning that requires specialized knowledge patterns.
Why RAG fails: RAG can retrieve documents, but the model still needs to reason about them using domain expertise.
Fine-tuning wins: Model learns domain-specific reasoning patterns, terminology, and decision-making.
Example: A medical diagnosis assistant fine-tuned on 10,000 case studies learns clinical reasoning patterns that few-shot examples can't teach.
3. Latency & Cost Optimization
Problem: RAG adds retrieval latency (~200-500ms). ICL inflates token usage with long prompts.
Why RAG fails: Every query requires embedding generation, vector search, and longer context.
Fine-tuning wins: Knowledge is "baked into" model weights. No retrieval needed, faster inference.
RAG: 500ms retrieval + 1000 tokens
Fine-tuned: 0ms retrieval + 200 tokens
Result: 3x faster, 5x cheaper!
4. Complex Output Formatting
Problem: Need highly structured outputs (JSON, code, specific formats) with 100% consistency.
Why ICL fails: Few-shot examples can be inconsistent. Model might deviate from format on edge cases.
Fine-tuning wins: Model learns to always produce the exact format through thousands of examples.
Example: A code generation model trained on 50,000 examples learns to always produce valid, properly formatted code.
5. Task-Specific Specialization
Problem: Need superhuman performance on a narrow, specialized task.
Why general models fail: GPT-4 is a generalist. For specialized tasks, a smaller fine-tuned model can outperform it.
Fine-tuning wins: Smaller, cheaper model (7B) fine-tuned on specific task beats general 175B+ model.
Example: SQL query generation
Fine-tuned 7B model: 95% accuracy
GPT-4 few-shot: 87% accuracy
Cost: 50x cheaper!
6. Privacy & Compliance
Problem: Can't send sensitive data to external APIs (OpenAI, Anthropic). Need on-premise deployment.
Why APIs fail: Healthcare, finance, legal data often can't leave your infrastructure.
Fine-tuning wins: Train and deploy your own model on your hardware. Full data control.
Example: Healthcare: train Llama 2 on internal patient data, deploy on-premise. Zero external API calls.
The Decision Framework
START: "I need an LLM to do X"
  ↓
Question 1: Can few-shot examples (2-10) show the model how to do it?
  YES → Use In-Context Learning (ICL)
  NO → Continue
  ↓
Question 2: Is it primarily about accessing external knowledge/documents?
  YES → Use RAG
  NO → Continue
  ↓
Question 3: Do I have 1,000+ high-quality training examples?
  NO → Go back to ICL/RAG, you don't have enough data
  YES → Continue
  ↓
Question 4: Is it about style, behavior, or specialized reasoning?
  YES → Consider fine-tuning
  NO → Probably can solve with RAG + ICL
  ↓
Question 5: Have I truly exhausted prompt engineering?
  NO → Go back and try harder with prompts
  YES → Fine-tune!
  ↓
Question 6: Which fine-tuning approach?
  • Need 99% accuracy, have A100 GPUs? → Full fine-tuning
  • Consumer GPU, good balance? → LoRA
  • Minimal resources, 7-13B models? → QLoRA ✅ (most common)
💡 Real-World Stats:
- 70% of applications: Solved with RAG alone
- 20% of applications: RAG + ICL combination
- 8% of applications: Fine-tuning (usually QLoRA)
- 2% of applications: Full fine-tuning (big companies with resources)
Fine-tuning vs RAG: Side-by-Side Example
Scenario: Build a customer support bot for a SaaS company.
✅ Hybrid Approach (Best of Both Worlds): Use a fine-tuned model for style/tone + RAG for facts! Example: Fine-tune a 7B model on your support conversations (learns tone), then use RAG to retrieve current product docs (provides up-to-date facts). This combines the strengths of both approaches.
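To make the hybrid idea concrete, here is a minimal sketch (not a full implementation) of serving a fine-tuned model together with retrieval. The model path ./support-bot-merged and the retrieve_docs helper are hypothetical placeholders; any retriever (vector store, keyword search) can fill that role.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./support-bot-merged")   # hypothetical fine-tuned model
model = AutoModelForCausalLM.from_pretrained(
    "./support-bot-merged", torch_dtype=torch.float16, device_map="auto"
)

def retrieve_docs(question, k=3):
    """Hypothetical retriever: return top-k product-doc snippets (plug in a vector store here)."""
    return ["(retrieved doc snippet 1)", "(retrieved doc snippet 2)"]

def answer(question):
    context = "\n\n".join(retrieve_docs(question))    # facts come from retrieval
    prompt = (                                        # tone comes from the fine-tuned weights
        "### Instruction:\nAnswer the customer using the context below.\n\n"
        f"### Input:\n{context}\n\nQuestion: {question}\n\n### Response:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:\n")[-1]

print(answer("How do I reset my password?"))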
When NOT to Fine-tune
❌ Don't fine-tune if:
- <500 training examples: Not enough data for meaningful fine-tuning. Use ICL instead.
- Rapidly changing domain: If your knowledge updates daily/weekly, RAG is better.
- Multiple unrelated tasks: Fine-tuning specializes. For generalists, use GPT-4 + prompts.
- No clear success metric: How will you know if fine-tuning improved things? Need evaluation data.
- Budget/time constraints: Fine-tuning requires investment. RAG is faster to market.
- Primarily factual QA: RAG excels here. Fine-tuning won't help much.
Full Fine-tuning: Training All Parameters
Full fine-tuning, the classic form of supervised fine-tuning (SFT), means updating every single parameter in the model through backpropagation. If you're fine-tuning a 7B parameter model, you're training all 7 billion weights.
Scale: A 7B model has ~7 billion parameters. At 16-bit precision (FP16), that's 14GB just to load the model. During training, you also need memory for gradients (another 14GB), optimizer states (28GB for Adam), and activations. Total: ~60-80GB VRAM for full fine-tuning of a 7B model!
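As a rough sanity check, this tiny calculator reproduces the numbers above using the same accounting (FP16 weights, one gradient per weight, and two Adam states counted at weight precision); activations are extra and depend on batch size and sequence length.

# Back-of-the-envelope VRAM estimate for full fine-tuning, mirroring the accounting above.
def full_finetune_vram_gb(n_params, bytes_per_weight=2):
    weights = n_params * bytes_per_weight / 1e9      # FP16 model weights
    gradients = weights                              # one gradient per weight, same precision
    optimizer = 2 * weights                          # Adam's two moment estimates (as counted above)
    return {
        "weights_gb": weights,
        "gradients_gb": gradients,
        "optimizer_gb": optimizer,
        "total_without_activations_gb": weights + gradients + optimizer,
    }

print(full_finetune_vram_gb(7e9))
# {'weights_gb': 14.0, 'gradients_gb': 14.0, 'optimizer_gb': 28.0, 'total_without_activations_gb': 56.0}
# Add ~8-12GB of activations and you land in the range quoted above.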
When to Use Full Fine-tuning
✅ Use Full Fine-tuning When:
- You have access to A100/H100 GPUs (80GB VRAM)
- Need absolute maximum performance
- Training relatively small models (< 3B params)
- Have budget for expensive compute
- Domain shift is very large (e.g., new language)
❌ Skip Full Fine-tuning When:
- Using consumer GPUs (RTX 3090, 4090)
- Models are large (> 7B params)
- Budget is limited
- Quick experimentation needed
- Task is well-defined (LoRA/QLoRA work well)
Hardware Requirements by Model Size
⚠️ Reality Check: An A100 rents for about $1.50/hour, and a typical full fine-tuning run for a 7B model takes 10-20 hours = $15-30 per experiment. With hyperparameter tuning (5-10 runs), you're looking at $150-300 in compute costs.
The Full Fine-tuning Process
Here's a complete, production-ready fine-tuning pipeline using HuggingFace Transformers:
Step 1: Prepare Your Training Data
// train.jsonl (instruction-following format)
{"instruction": "Summarize this article", "input": "Article text here...", "output": "Summary text here..."}
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
{"instruction": "Answer the question", "input": "What is Python?", "output": "Python is a programming language..."}
// Recommendation: 1,000+ examples minimum, 10,000+ for best results
💡 Data Quality > Quantity: 1,000 high-quality, diverse examples beat 10,000 noisy, repetitive examples. Each example should be carefully reviewed, properly formatted, and representative of your task.
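Before training, it is worth sanity-checking the JSONL file. A minimal sketch; the file name and required keys follow the format shown above, so adjust them to your own schema.

import json

required_keys = {"instruction", "output"}   # "input" is optional in this format

def check_jsonl(path):
    """Report lines that are not valid JSON or are missing required keys."""
    checked, problems = 0, 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                problems += 1
                print(f"line {i}: not valid JSON")
                continue
            missing = required_keys - example.keys()
            if missing:
                problems += 1
                print(f"line {i}: missing keys {missing}")
            checked += 1
    print(f"{checked} examples checked, {problems} problems found")

check_jsonl("train.jsonl")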
Step 2: Complete Training Script
# install: pip install transformers datasets accelerate torch
import torch
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
# 1. Load your model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16, # Use FP16 for memory efficiency
device_map="auto" # Automatically distribute across GPUs
)
# Add padding token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id
# 2. Load and prepare dataset
dataset = load_dataset("json", data_files={
"train": "train.jsonl",
"validation": "validation.jsonl"
})
def format_instruction(example):
    """Convert instruction-input-output to a prompt"""
    instruction = example["instruction"]
    input_text = example.get("input", "")
    output = example["output"]
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    return {"text": prompt}
# Format all examples
dataset = dataset.map(format_instruction)
# 3. Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,  # Adjust based on your task
        padding="max_length"
    )
tokenized_dataset = dataset.map(
tokenize_function,
batched=True,
remove_columns=dataset["train"].column_names
)
# 4. Set up training arguments
training_args = TrainingArguments(
output_dir="./llama-2-7b-finetuned",
num_train_epochs=3, # 3-5 epochs typical
per_device_train_batch_size=4, # Adjust based on VRAM
per_device_eval_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
learning_rate=2e-5, # Low LR to prevent catastrophic forgetting
weight_decay=0.01,
warmup_steps=100, # Gradual learning rate warmup
lr_scheduler_type="cosine", # Cosine decay schedule
logging_steps=10,
evaluation_strategy="steps",
eval_steps=100,
save_steps=500,
save_total_limit=3, # Keep only 3 best checkpoints
load_best_model_at_end=True,
fp16=True, # Mixed precision training (faster, less memory)
report_to="tensorboard", # Log to TensorBoard
push_to_hub=False
)
# 5. Create data collator
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False # We're doing causal LM, not masked LM
)
# 6. Initialize Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"],
data_collator=data_collator
)
# 7. Train!
print("Starting training...")
trainer.train()
# 8. Save the final model
print("Saving model...")
trainer.save_model("./llama-2-7b-final")
tokenizer.save_pretrained("./llama-2-7b-final")
print("✅ Training complete!")
Step 3: Monitor Training
# Watch training progress in real-time
tensorboard --logdir ./llama-2-7b-finetuned/runs
# Monitor GPU usage
watch -n 1 nvidia-smi
# Expected training time for 7B model on A100:
# - 5,000 examples: ~8-12 hours
# - 10,000 examples: ~16-24 hours
# - 50,000 examples: ~80-120 hours
Step 4: Evaluate Your Model
# Load fine-tuned model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-final")
model = AutoModelForCausalLM.from_pretrained(
"./llama-2-7b-final",
torch_dtype=torch.float16,
device_map="auto"
)
# Test it!
def generate_response(instruction, input_text=""):
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("### Response:\n")[1].strip()
# Try it
print(generate_response(
"Summarize this article",
"Python is a high-level programming language known for its simplicity and readability..."
))
Key Hyperparameters Explained
Learning Rate
- 2e-5 to 5e-5: Standard range
- Too high (1e-4+): Catastrophic forgetting, unstable training
- Too low (1e-6): Very slow learning, might not improve
- Pro tip: Start with 2e-5, increase if loss plateaus
Epochs
- 3-5 epochs: Typical for most tasks
- 1-2 epochs: Large datasets (50K+ examples)
- 5-10 epochs: Small datasets (< 1K examples), but risk overfitting
- Pro tip: Monitor validation loss, stop when it increases
Batch Size
- Effective batch size 16-32: Good balance
- Use gradient accumulation: Simulate large batches on small GPUs
- Example: batch_size=4 × accumulation=4 = effective 16
- Pro tip: Larger batches = more stable, but need more memory
Warmup Steps
- 10% of total steps: Common heuristic
- Purpose: Gradually increase LR from 0 to target
- Why: Prevents early training instability
- Example: 10,000 steps total → 1,000 warmup steps (worked through in the snippet below)
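Here is that arithmetic with assumed example numbers (a 10,000-example dataset and the batch settings from the training script above):

# Effective batch size, total optimizer steps, and the ~10% warmup heuristic.
num_examples = 10_000                 # assumed dataset size
num_epochs = 3
per_device_batch_size = 4
gradient_accumulation_steps = 4

effective_batch_size = per_device_batch_size * gradient_accumulation_steps   # 16
steps_per_epoch = num_examples // effective_batch_size                        # 625
total_steps = steps_per_epoch * num_epochs                                    # 1875
warmup_steps = int(0.10 * total_steps)                                        # 187

print(effective_batch_size, total_steps, warmup_steps)   # 16 1875 187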
The Catastrophic Forgetting Problem
⚠️ Catastrophic Forgetting: When you fine-tune a model too aggressively, it "forgets" its general knowledge and only remembers your training data. The model becomes specialized but loses its general capabilities.
Before Fine-tuning:
Q: "What is Python?"
A: "Python is a high-level programming language created by Guido van Rossum..."
(General knowledge intact)
After Aggressive Fine-tuning (LR too high, too many epochs):
Q: "What is Python?"
A: "### Instruction:\nAnswer this question\n\n### Response:\n..."
(Model only knows training format, forgot general knowledge!)
Solution:
✅ Low learning rate (2e-5)
✅ Few epochs (3-5)
✅ Validation monitoring
✅ Early stopping if val loss increases
💡 How to Detect Catastrophic Forgetting:
- Test the model on general knowledge questions before and after fine-tuning (a quick probe is sketched below)
- If performance on general tasks drops significantly, you've overfitted
- Solution: Reduce LR, reduce epochs, or add general examples to training data
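Here is that probe as a minimal sketch. The question list is an assumption (use whatever benchmark you trust), and the model paths are the ones used earlier in this tutorial.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

general_probes = [                      # assumed hand-written probes
    "What is Python?",
    "Who wrote Romeo and Juliet?",
    "What is the capital of France?",
]

def answers(model_path):
    """Generate short answers to the probe questions with the given model."""
    tok = AutoTokenizer.from_pretrained(model_path)
    mdl = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    results = []
    for question in general_probes:
        inputs = tok(question, return_tensors="pt").to(mdl.device)
        output = mdl.generate(**inputs, max_new_tokens=64)
        results.append(tok.decode(output[0], skip_special_tokens=True))
    return results

before = answers("meta-llama/Llama-2-7b-hf")   # base model
after = answers("./llama-2-7b-final")          # your fine-tuned model
for question, b, a in zip(general_probes, before, after):
    print(question, "\n  base: ", b, "\n  tuned:", a)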
Cost Analysis: Full Fine-tuning
Budget Reality: Full fine-tuning a 7B model requires experimentation (5-10 runs to tune hyperparameters) = $120-240 in cloud costs. Plus data preparation time (1-2 weeks). This is why LoRA/QLoRA became so popular: they're 10x cheaper!
LoRA (Low-Rank Adaptation): The Game Changer
LoRA (Low-Rank Adaptation) is a breakthrough technique published by Microsoft researchers in 2021 that revolutionized LLM fine-tuning. Instead of updating billions of parameters, LoRA trains tiny adapter modules (< 1% of model size) that achieve 95-99% of full fine-tuning performance at a fraction of the cost.
The Revolution: LoRA made it possible to fine-tune 7B parameter models on consumer GPUs (RTX 3090, 4090) instead of requiring $10K A100 GPUs. This democratized fine-tuning for researchers, startups, and hobbyists.
The Core Idea: Low-Rank Decomposition
LoRA is based on a key insight: fine-tuning updates to model weights are low-rank. This means most of the "learning" happens in a low-dimensional subspace, not across all dimensions.
Standard Fine-tuning:
--------------------
Original Weight Matrix W: 4096 × 4096 = 16,777,216 parameters
During training: Update ALL 16M parameters
Memory needed: W + gradients + optimizer states = 4x model size

LoRA Fine-tuning:
-----------------
Original Weight Matrix W: 4096 × 4096 (frozen, not trained)
LoRA Adapter:
- Matrix A: 8 × 4096 = 32,768 parameters
- Matrix B: 4096 × 8 = 32,768 parameters
- Total: 65,536 parameters (0.4% of original!)
During training: Update only A and B
Memory needed: Only the adapters' gradients + optimizer states
Result: 100x less memory!
How LoRA Works: The Math
In a standard transformer layer, attention computation looks like:
Standard Forward Pass:
----------------------
output = W × input
Where W is a huge weight matrix (e.g., 4096 × 4096)

LoRA Forward Pass:
------------------
output = W × input + (B × A) × input
           ↑             ↑
        Frozen      Trained adapters
Where:
- W: Original pretrained weights (frozen)
- A: Low-rank matrix (r × 4096, e.g., r=8)
- B: Low-rank matrix (4096 × r)
- r: Rank (typically 8, 16, 32, or 64)
Key insight: (B × A) is a 4096 × 4096 matrix, but we only
train r × (4096 + 4096) parameters instead of 4096²!
💡 Example: For a 4096×4096 weight matrix:
- Full fine-tuning: 16,777,216 trainable parameters
- LoRA (r=8): 65,536 trainable parameters (256x reduction!)
- LoRA (r=16): 131,072 trainable parameters (128x reduction)
- LoRA (r=64): 524,288 trainable parameters (32x reduction)
A minimal PyTorch sketch of this adapter structure follows below.
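Here is that sketch: a self-contained LoRA-style linear layer in plain PyTorch (an illustration of the idea, not the PEFT library's internal implementation). The base weight is frozen; only the low-rank pair A, B trains, scaled by alpha / r.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                       # W stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)    # r x in_features
        self.B = nn.Parameter(torch.zeros(out_features, r))          # out_features x r, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # W x plus the low-rank update (B A) x, scaled by alpha / r
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65536 = 8*4096 + 4096*8, matching the counts above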
LoRA Configuration Parameters
Rank (r)
What it is: Dimensionality of the low-rank decomposition
- r=4-8: Very parameter-efficient, good for simple tasks
- r=16-32: Sweet spot for most tasks
- r=64-128: Complex tasks, closer to full fine-tuning
Trade-off: Higher r = more parameters = better performance but more memory
LoRA Alpha
What it is: Scaling factor for LoRA updates
- Formula: scaling = alpha / r
- Typical values: 16, 32, 64
- Rule of thumb: alpha = 2 × r
Effect: Controls how much LoRA adapters influence the output
Target Modules
What it is: Which layers get LoRA adapters
- Common: ["q_proj", "v_proj"] (query & value in attention)
- More aggressive: ["q_proj", "k_proj", "v_proj", "o_proj"]
- Full coverage: All linear layers
Trade-off: More modules = better performance but more parameters
LoRA Dropout
What it is: Dropout applied to LoRA layers
- Typical: 0.05 (5%)
- Purpose: Regularization to prevent overfitting
- When to increase: Small datasets (< 1K examples)
Effect: Helps generalization, especially on small datasets
Complete LoRA Fine-tuning Example
# Install: pip install peft transformers datasets accelerate
import torch
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer
)
from peft import LoraConfig, get_peft_model
# 1. Load base model (in FP16 for memory efficiency)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# 2. Configure LoRA
lora_config = LoraConfig(
r=16, # Rank: 16 is a good balance
lora_alpha=32, # Alpha: 2 × r
target_modules=[
"q_proj", # Query projection in attention
"k_proj", # Key projection
"v_proj", # Value projection
"o_proj", # Output projection
"gate_proj", # For Llama's FFN
"up_proj",
"down_proj"
],
lora_dropout=0.05, # 5% dropout for regularization
bias="none", # Don't train bias terms
task_type="CAUSAL_LM" # Causal language modeling
)
# 3. Apply LoRA to model
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Example output (approximate): trainable params: ~40M || all params: ~6.7B || trainable%: ~0.6%
# 4. Load and prepare dataset (same as full fine-tuning)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})
def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}
dataset = dataset.map(format_instruction)
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# 5. Training arguments (optimized for LoRA)
training_args = TrainingArguments(
output_dir="./llama-2-7b-lora",
num_train_epochs=3,
per_device_train_batch_size=8, # Can use larger batch size!
per_device_eval_batch_size=8,
gradient_accumulation_steps=2,
learning_rate=3e-4, # LoRA can use higher LR than full fine-tuning
weight_decay=0.01,
warmup_steps=100,
logging_steps=10,
evaluation_strategy="steps",
eval_steps=100,
save_steps=500,
fp16=True,
report_to="tensorboard"
)
# 6. Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"]
)
print("Starting LoRA training...")
trainer.train()
# 7. Save LoRA adapters (only ~10-50MB!)
model.save_pretrained("./llama-2-7b-lora-final")
tokenizer.save_pretrained("./llama-2-7b-lora-final")
print("✅ LoRA training complete!")
print("Saved adapter weights are only ~20MB instead of ~14GB!")
Loading and Using LoRA Models
# Load the base model + LoRA adapters
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
# Load LoRA adapters on top
model = PeftModel.from_pretrained(base_model, "./llama-2-7b-lora-final")
tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-lora-final")
# Use it!
def generate(instruction):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:\n")[1]
print(generate("Explain quantum computing in simple terms"))
# Merge LoRA weights into base model (optional, for deployment)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-2-7b-merged")
# Now you have a standalone model without needing the base + adapter separately!
LoRA Hardware Requirements
✅ LoRA's Killer Feature: You can fine-tune a 7B model on a consumer RTX 3090 (24GB, $1,200) instead of needing an A100 (80GB, $10,000). This 8x cost reduction democratized LLM fine-tuning!
LoRA vs Full Fine-tuning: Performance Comparison
Research shows LoRA achieves 95-99% of full fine-tuning performance across diverse tasks:
💡 Cost-Benefit Analysis:
- Full fine-tuning: 100% performance, $30/run (A100), 16 hours
- LoRA: 97.5% performance, $0 (own RTX 3090) or $3/run (cloud), 6 hours
- Verdict: LoRA is 10x cheaper for a 2.5% performance drop, an incredible trade-off!
LoRA Best Practices
✅ Do's
- Start with r=16, alpha=32 (good default)
- Target at least q_proj and v_proj
- Use higher learning rates (3e-4) than full fine-tuning
- Train for more epochs (3-5) since only adapters update
- Save adapters separately (only ~20MB!)
- Merge adapters before deployment for faster inference
❌ Don'ts
- Don't use r > 64 (diminishing returns, use full fine-tuning instead)
- Don't forget to freeze base model (LoRA does this automatically)
- Don't use too low r (< 4) for complex tasks
- Don't train on too few examples (< 500)
- Don't forget to merge adapters for production
QLoRA (Quantized LoRA): The Ultimate Memory Saver
QLoRA (Quantized Low-Rank Adaptation) takes LoRA one step further by combining it with 4-bit quantization. The result? You can fine-tune a 13B model on a single RTX 3090 (24GB) or a 70B model on a single A100 (80GB), something previously impossible without multiple high-end GPUs.
The Magic: QLoRA enabled researchers to fine-tune 65B parameter models on a single GPU for the first time. This democratization contributed to the explosion of open-source fine-tuned models in 2023 (Alpaca, Vicuna, WizardLM, etc.).
How QLoRA Works: Quantization + LoRA
QLoRA combines three key innovations:
1. 4-bit NormalFloat (NF4) Quantization
Stores model weights in 4-bit instead of 16-bit:
- 16-bit (FP16): 7B model = 14GB
- 4-bit (NF4): 7B model = 3.5GB (4x smaller!)
NF4: a 4-bit format optimized for normally distributed values, which is how trained neural network weights are typically distributed
2. Double Quantization
Quantize the quantization constants themselves:
- First quantization saves 4x memory
- Double quantization saves additional ~0.37 bits/param
- Total: 4.2x memory reduction with minimal quality loss
3. Paged Optimizers
Use CPU RAM when GPU memory is full:
- Optimizer states moved to CPU
- Transferred to GPU only when needed
- Prevents out-of-memory crashes
- Enables training on smaller GPUs
Memory Breakdown: QLoRA vs LoRA vs Full
Fine-tuning a 7B model (in FP16):
Full Fine-tuning:
├── Model weights: 14GB
├── Gradients: 14GB
├── Optimizer states (Adam): 28GB
└── Activations: 8-12GB
TOTAL: ~64-68GB (needs A100 80GB)

LoRA (r=16):
├── Model weights: 14GB (frozen, no gradients)
├── LoRA adapters: 0.02GB
├── LoRA gradients: 0.02GB
├── Optimizer states: 0.04GB
└── Activations: 8-12GB
TOTAL: ~22-26GB (fits RTX 3090 24GB, barely)

QLoRA (r=16, 4-bit):
├── Model weights (4-bit): 3.5GB (no gradients!)
├── LoRA adapters (FP16): 0.02GB
├── LoRA gradients: 0.02GB
├── Optimizer states: 0.04GB
└── Activations: 4-6GB
TOTAL: ~7.5-9.5GB (fits RTX 3060 12GB!)
⚠️ The Catch: QLoRA is ~20-30% slower than LoRA (due to quantization overhead) and achieves ~96% of full fine-tuning performance vs LoRA's 97.5%. But the memory savings are so dramatic that it's worth the trade-off for most use cases.
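The LoRA and QLoRA rows in the breakdown above follow from a small amount of arithmetic: frozen base weights (no gradients or optimizer states), tiny FP16 adapters with their own gradient and optimizer memory, plus activations. A rough sketch using the same per-parameter byte counts; the adapter size and activation figures are assumptions.

# Rough VRAM tally for PEFT on a 7B model, matching the breakdown above.
def peft_vram_gb(n_params=7e9, bytes_per_weight=2.0, adapter_params=10e6, activations_gb=10.0):
    base = n_params * bytes_per_weight / 1e9         # frozen base weights (no grads / optimizer)
    adapters = adapter_params * 2 / 1e9               # FP16 adapter weights
    adapter_states = adapter_params * (2 + 4) / 1e9   # adapter gradients + two optimizer states
    return round(base + adapters + adapter_states + activations_gb, 1)

print(peft_vram_gb(bytes_per_weight=2.0, activations_gb=10))   # LoRA:  ~24.1 GB (FP16 base)
print(peft_vram_gb(bytes_per_weight=0.5, activations_gb=5))    # QLoRA: ~8.6 GB (4-bit base)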
Complete QLoRA Implementation
# Install: pip install bitsandbytes peft transformers accelerate
import torch
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
BitsAndBytesConfig,
TrainingArguments,
Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# 1. Configure 4-bit quantization with NF4
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Use 4-bit quantization
bnb_4bit_quant_type="nf4", # NormalFloat4 (optimal for neural nets)
bnb_4bit_compute_dtype=torch.float16, # Compute in FP16 for speed
bnb_4bit_use_double_quant=True # Double quantization for extra savings
)
# 2. Load model in 4-bit
model_name = "meta-llama/Llama-2-13b-hf" # 13B model!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto", # Automatically distribute across available GPUs
trust_remote_code=True
)
# 3. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)
# 4. Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (approximate): trainable params: ~60M || all params: ~13B || trainable%: ~0.5%
# 6. Load dataset (same as before)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})
def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}
dataset = dataset.map(format_instruction)
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# 7. Training arguments (optimized for QLoRA)
training_args = TrainingArguments(
output_dir="./llama-2-13b-qlora",
num_train_epochs=3,
per_device_train_batch_size=4, # 4-bit allows larger batches
gradient_accumulation_steps=4,
learning_rate=2e-4, # Higher than full fine-tuning, similar to LoRA
warmup_steps=100,
logging_steps=10,
save_steps=500,
fp16=True, # Mixed precision
optim="paged_adamw_8bit", # 8-bit Adam with paging (crucial for QLoRA!)
lr_scheduler_type="cosine",
evaluation_strategy="steps",
eval_steps=100,
report_to="tensorboard"
)
# 8. Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"]
)
print("Starting QLoRA training of 13B model on consumer GPU...")
trainer.train()
# 9. Save
model.save_pretrained("./llama-2-13b-qlora-final")
tokenizer.save_pretrained("./llama-2-13b-qlora-final")
print("✅ QLoRA training complete! Saved adapters (~40MB).")
QLoRA Hardware Requirements: The Real Game-Changer
QLoRA's Superpower: A hobbyist with a $1,500 RTX 4090 can fine-tune a 13B model that outperforms GPT-3.5 on specialized tasks. This was unthinkable before QLoRA; you needed $30K+ in GPUs.
QLoRA Performance: How Much Quality Do You Lose?
QLoRA achieves 96-98% of full fine-tuning performance across tasks:
💡 Real Talk: QLoRA typically loses 2-4% performance vs full fine-tuning but uses 8x less memory. For most applications, this is an incredible trade-off. Plus, you can often recover lost performance by using a larger model (e.g., 13B QLoRA beats 7B full fine-tuning).
Comparison
Best Practices
- Start with prompting/RAG: Try those first; they're faster to ship
- Use QLoRA: Best cost-benefit for most use cases
- Match the learning rate to the method: ~2e-5 for full fine-tuning, ~1e-4 to 3e-4 for LoRA/QLoRA
- Evaluate regularly: On a validation set during training
- Don't overfit: 3-5 epochs usually optimal
- Use instruction data: Format data as instructions for best results
- Monitor perplexity: Track validation loss (see the snippet below)
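For the last point, perplexity is just the exponential of the validation cross-entropy loss, so it can be read straight off the Trainer's eval metrics. A minimal sketch, assuming the trainer object from the scripts above:

import math

eval_metrics = trainer.evaluate()                    # e.g. {"eval_loss": 1.73, ...}
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"val loss: {eval_metrics['eval_loss']:.3f}  perplexity: {perplexity:.1f}")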
Summary
What You've Learned:
- Full fine-tuning updates all parameters (expensive)
- LoRA trains small adapter modules (10x cheaper)
- QLoRA combines quantization + LoRA (most practical)
- QLoRA: a few million trainable params, roughly 8-10GB VRAM for a 7B model
- Start with RAG/prompting. Fine-tune only when needed
What's Next?
In our next tutorial, Inference Optimization, we'll learn how to deploy models efficiently.
Excellent! You now have the skills to fine-tune any LLM on your data efficiently!
Test Your Knowledge
Q1: What is the main advantage of Parameter-Efficient Fine-Tuning (PEFT)?
Q2: What does LoRA (Low-Rank Adaptation) do?
Q3: Which fine-tuning approach updates ALL model parameters?
Q4: What is the purpose of the validation set during fine-tuning?
Q5: When should you consider full fine-tuning instead of PEFT?