
Fine-tuning LLMs

Master different fine-tuning approaches. Learn full fine-tuning, LoRA, and QLoRA, and how to train on your own data efficiently.

πŸ“… Tutorial 5 πŸ“Š Intermediate


πŸ”§ Why Fine-tune? When RAG and ICL Aren't Enough

In the previous tutorial, you learned that RAG and In-Context Learning solve 95% of LLM application needs. But what about the other 5%? When do you actually need to fine-tune a model?

⚠️ Critical Decision Point: Fine-tuning is powerful but expensive and time-consuming. Before fine-tuning, ask yourself: "Have I truly exhausted prompt engineering, few-shot learning, and RAG?" If not, go back and try those first.

Scenarios Where Fine-tuning Wins

1. 🎨 Learning Specific Style/Tone

Problem: You need your chatbot to respond in your company's unique voiceβ€”formal, technical, friendly, etc.

Why RAG fails: RAG retrieves facts, not communication style. Few-shot examples can't capture nuanced tone across all situations.

Fine-tuning wins: Model internalizes the style through thousands of examples.

Example: Customer support chatbot 
trained on 5,000 past conversations 
learns your company's empathetic, 
professional tone automatically.

2. πŸ₯ Domain-Specific Behavior

Problem: Need deep medical/legal/scientific reasoning that requires specialized knowledge patterns.

Why RAG fails: RAG can retrieve documents, but the model still needs to reason about them using domain expertise.

Fine-tuning wins: Model learns domain-specific reasoning patterns, terminology, and decision-making.

Example: Medical diagnosis assistant 
fine-tuned on 10,000 case studies 
learns clinical reasoning patterns 
that few-shot examples can't teach.

3. ⚑ Latency & Cost Optimization

Problem: RAG adds retrieval latency (~200-500ms). ICL inflates token usage with long prompts.

Why RAG fails: Every query requires embedding generation, vector search, and longer context.

Fine-tuning wins: Knowledge is "baked into" model weightsβ€”no retrieval needed, faster inference.

RAG: 500ms retrieval + 1000 tokens
Fine-tuned: 0ms retrieval + 200 tokens
Result: 3x faster, 5x cheaper!

4. πŸ“ Complex Output Formatting

Problem: Need highly structured outputs (JSON, code, specific formats) with 100% consistency.

Why ICL fails: Few-shot examples can be inconsistent. Model might deviate from format on edge cases.

Fine-tuning wins: Model learns to always produce the exact format through thousands of examples.

Example: Code generation model 
trained on 50,000 examples learns 
to always produce valid, properly 
formatted code.

5. 🎯 Task-Specific Specialization

Problem: Need superhuman performance on a narrow, specialized task.

Why general models fail: GPT-4 is a generalist. For specialized tasks, a smaller fine-tuned model can outperform it.

Fine-tuning wins: Smaller, cheaper model (7B) fine-tuned on specific task beats general 175B+ model.

Example: SQL query generation
Fine-tuned 7B model: 95% accuracy
GPT-4 few-shot: 87% accuracy
Cost: 50x cheaper!

6. πŸ”’ Privacy & Compliance

Problem: Can't send sensitive data to external APIs (OpenAI, Anthropic). Need on-premise deployment.

Why APIs fail: Healthcare, finance, legal data often can't leave your infrastructure.

Fine-tuning wins: Train and deploy your own model on your hardware. Full data control.

Healthcare: Train Llama 2 on internal 
patient data, deploy on-premise.
Zero external API calls.

The Decision Framework

START: "I need an LLM to do X"
    ↓
Question 1: Can few-shot examples (2-10) show the model how to do it?
    βœ… YES β†’ Use In-Context Learning (ICL)
    ❌ NO β†’ Continue
    ↓
Question 2: Is it primarily about accessing external knowledge/documents?
    βœ… YES β†’ Use RAG
    ❌ NO β†’ Continue
    ↓
Question 3: Do I have 1,000+ high-quality training examples?
    ❌ NO β†’ Go back to ICL/RAG, you don't have enough data
    βœ… YES β†’ Continue
    ↓
Question 4: Is it about style, behavior, or specialized reasoning?
    βœ… YES β†’ Consider fine-tuning
    ❌ NO β†’ Probably can solve with RAG + ICL
    ↓
Question 5: Have I truly exhausted prompt engineering?
    ❌ NO β†’ Go back and try harder with prompts
    βœ… YES β†’ Fine-tune!
    ↓
Question 6: Which fine-tuning approach?
    β€’ Need 99% accuracy, have A100 GPUs? β†’ Full fine-tuning
    β€’ Consumer GPU, good balance? β†’ LoRA
    β€’ Minimal resources, 7-13B models? β†’ QLoRA βœ… (most common)
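
If it helps to see the same flow as code, here is a rough triage helper. It is purely illustrative: the boolean inputs are hypothetical names mirroring the questions above, not part of any library.

def choose_approach(few_shot_works, needs_external_knowledge,
                    enough_examples, style_or_reasoning,
                    prompts_exhausted, has_a100_class_gpu):
    """Rough triage mirroring the questions above; returns a suggestion, not a verdict."""
    if few_shot_works:
        return "In-Context Learning (ICL)"
    if needs_external_knowledge:
        return "RAG"
    if not enough_examples:                 # fewer than ~1,000 high-quality examples
        return "Back to ICL/RAG: not enough data to fine-tune"
    if not style_or_reasoning:
        return "RAG + ICL probably suffices"
    if not prompts_exhausted:
        return "Try harder with prompt engineering first"
    if has_a100_class_gpu:
        return "Full fine-tuning (if you truly need the last few percent)"
    return "LoRA, or QLoRA if VRAM is tight (most common choice)"

# Example: plenty of data, style-focused task, prompts exhausted, only a consumer GPU
print(choose_approach(False, False, True, True, True, False))  # -> LoRA/QLoRA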

πŸ’‘ Real-World Stats:

  • 70% of applications: Solved with RAG alone
  • 20% of applications: RAG + ICL combination
  • 8% of applications: Fine-tuning (usually QLoRA)
  • 2% of applications: Full fine-tuning (big companies with resources)

Fine-tuning vs RAG: Side-by-Side Example

Scenario: Build a customer support bot for a SaaS company.

Aspect | RAG Approach | Fine-tuning Approach
Setup Time | βœ… 1-2 days (index docs) | ⚠️ 1-2 weeks (collect data, train)
Data Needed | βœ… Documentation + FAQs | ⚠️ 5,000+ conversation examples
Knowledge Updates | βœ… Instant (add new docs) | ❌ Needs retraining
Learning Style/Tone | ⚠️ Moderate (via system prompt) | βœ… Excellent (learns from examples)
Accuracy on Facts | βœ… Excellent (retrieves docs) | ⚠️ Good (but can hallucinate)
Cost per Query | ⚠️ $0.002-0.005 (GPT-4 + retrieval) | βœ… ~$0.0001 (self-hosted 7B)
Latency | ⚠️ 800ms-1.5s (retrieval + generation) | βœ… 200-400ms (generation only)
Best For | Factual accuracy, frequent updates | Consistent style, low latency, privacy

βœ… Hybrid Approach (Best of Both Worlds): Use a fine-tuned model for style/tone + RAG for facts! Example: Fine-tune a 7B model on your support conversations (learns tone), then use RAG to retrieve current product docs (learns facts). This combines the strengths of both approaches.
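
Here is a minimal sketch of that hybrid pattern. Assumptions: "./support-bot-7b-merged" is a hypothetical fine-tuned (and merged) checkpoint, and retrieve_docs stands in for whatever vector search you actually use; the retrieved snippets are simply injected into the fine-tuned model's prompt.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "./support-bot-7b-merged"  # hypothetical fine-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16, device_map="auto")

def retrieve_docs(query, k=3):
    """Stand-in for your retriever (FAISS, Chroma, etc.). Returns top-k doc snippets."""
    return ["(retrieved product documentation snippet)"] * k

def answer(question):
    context = "\n".join(retrieve_docs(question))
    # The fine-tuned model supplies the tone; the retrieved context supplies current facts
    prompt = (f"### Instruction:\nAnswer the customer using the documentation below.\n\n"
              f"### Input:\n{context}\n\nQuestion: {question}\n\n### Response:\n")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    return tokenizer.decode(out[0], skip_special_tokens=True).split("### Response:\n")[-1].strip()

print(answer("How do I reset my API key?"))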

When NOT to Fine-tune

❌ Don't fine-tune if:

  • <500 training examples: Not enough data for meaningful fine-tuning. Use ICL instead.
  • Rapidly changing domain: If your knowledge updates daily/weekly, RAG is better.
  • Multiple unrelated tasks: Fine-tuning specializes. For generalists, use GPT-4 + prompts.
  • No clear success metric: How will you know if fine-tuning improved things? Need evaluation data.
  • Budget/time constraints: Fine-tuning requires investment. RAG is faster to market.
  • Primarily factual QA: RAG excels here. Fine-tuning won't help much.

πŸ’Ύ Full Fine-tuning: Training All Parameters

Full fine-tuning means updating every single parameter in the model through backpropagation (when done on labeled instruction data, this is the classic form of supervised fine-tuning, or SFT). If you're fine-tuning a 7B parameter model, you're training all 7 billion weights.

πŸ“Š Scale: A 7B model has ~7 billion parameters. At 16-bit precision (FP16), that's 14GB just to load the model. During training, you also need memory for gradients (another 14GB), optimizer states (28GB for Adam), and activations. Total: ~60-80GB VRAM for full fine-tuning of a 7B model!
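
A quick back-of-the-envelope check of those numbers (a sketch assuming FP16 weights and gradients, the ~28GB Adam estimate quoted above, and a rough guess for activations):

params = 7e9  # 7B parameters

weights_gb     = params * 2 / 1e9   # FP16 weights: 2 bytes/param       -> ~14 GB
gradients_gb   = params * 2 / 1e9   # FP16 gradients: 2 bytes/param     -> ~14 GB
optimizer_gb   = params * 4 / 1e9   # Adam moments (as estimated above) -> ~28 GB
activations_gb = 10                 # very rough; depends on batch size and sequence length

print(f"~{weights_gb + gradients_gb + optimizer_gb + activations_gb:.0f} GB VRAM")  # ~66 GB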

When to Use Full Fine-tuning

βœ… Use Full Fine-tuning When:

  • You have access to A100/H100 GPUs (80GB VRAM)
  • Need absolute maximum performance
  • Training relatively small models (< 3B params)
  • Have budget for expensive compute
  • Domain shift is very large (e.g., new language)

❌ Skip Full Fine-tuning When:

  • Using consumer GPUs (RTX 3090, 4090)
  • Models are large (> 7B params)
  • Budget is limited
  • Quick experimentation needed
  • Task is well-defined (LoRA/QLoRA work well)

Hardware Requirements by Model Size

Model Size | Model Weights | Full Fine-tune VRAM | GPU Required | Approx Cost
GPT-2 (124M) | ~500MB | ~4GB | RTX 3060 (12GB) | $300-500 GPU
GPT-2 XL (1.5B) | ~6GB | ~24GB | RTX 3090/4090 (24GB) | $1,000-1,500 GPU
Llama 2 (7B) | ~14GB | ~60-80GB | A100 (80GB) | $1.50/hr cloud or $10K GPU
Llama 2 (13B) | ~26GB | ~120GB | 2x A100 (80GB each) | $3/hr cloud or $20K GPUs
Llama 2 (70B) | ~140GB | ~500GB+ | 8x A100 (80GB each) | $12/hr cloud or $80K+ GPUs

⚠️ Reality Check: Full fine-tuning a 7B model on an A100 costs $1.50/hour. A typical training run takes 10-20 hours = $15-30 per experiment. With hyperparameter tuning (5-10 runs), you're looking at $150-300 in compute costs.

The Full Fine-tuning Process

Here's a complete, production-ready fine-tuning pipeline using HuggingFace Transformers:

Step 1: Prepare Your Training Data

// train.jsonl (instruction-following format)
{"instruction": "Summarize this article", "input": "Article text here...", "output": "Summary text here..."}
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
{"instruction": "Answer the question", "input": "What is Python?", "output": "Python is a programming language..."}

// Recommendation: 1,000+ examples minimum, 10,000+ for best results

πŸ’‘ Data Quality > Quantity: 1,000 high-quality, diverse examples beat 10,000 noisy, repetitive examples. Each example should be carefully reviewed, properly formatted, and representative of your task.
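
A quick sanity check of the file before training can save an expensive failed run. A minimal validation pass (standard library only, assuming the instruction/input/output schema shown above):

import json

REQUIRED = {"instruction", "output"}  # "input" is optional in this format

def validate_jsonl(path):
    problems = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                print(f"Line {i}: not valid JSON")
                problems += 1
                continue
            missing = REQUIRED - example.keys()
            if missing:
                print(f"Line {i}: missing fields {sorted(missing)}")
                problems += 1
            elif not str(example["output"]).strip():
                print(f"Line {i}: empty output")
                problems += 1
    print(f"Checked {path}: {problems} problem line(s)")

validate_jsonl("train.jsonl")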

Step 2: Complete Training Script

# install: pip install transformers datasets accelerate torch
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

# 1. Load your model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use FP16 for memory efficiency
    device_map="auto"  # Automatically distribute across GPUs
)

# Add padding token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

# 2. Load and prepare dataset
dataset = load_dataset("json", data_files={
    "train": "train.jsonl",
    "validation": "validation.jsonl"
})

def format_instruction(example):
    """Convert instruction-input-output to a prompt"""
    instruction = example["instruction"]
    input_text = example.get("input", "")
    output = example["output"]
    
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    
    return {"text": prompt}

# Format all examples
dataset = dataset.map(format_instruction)

# 3. Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,  # Adjust based on your task
        padding="max_length"
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

# 4. Set up training arguments
training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    num_train_epochs=3,  # 3-5 epochs typical
    per_device_train_batch_size=4,  # Adjust based on VRAM
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
    learning_rate=2e-5,  # Low LR to prevent catastrophic forgetting
    weight_decay=0.01,
    warmup_steps=100,  # Gradual learning rate warmup
    lr_scheduler_type="cosine",  # Cosine decay schedule
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=500,
    save_total_limit=3,  # Keep only the 3 most recent checkpoints
    load_best_model_at_end=True,
    fp16=True,  # Mixed precision training (faster, less memory)
    report_to="tensorboard",  # Log to TensorBoard
    push_to_hub=False
)

# 5. Create data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're doing causal LM, not masked LM
)

# 6. Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator
)

# 7. Train!
print("πŸš€ Starting training...")
trainer.train()

# 8. Save the final model
print("πŸ’Ύ Saving model...")
trainer.save_model("./llama-2-7b-final")
tokenizer.save_pretrained("./llama-2-7b-final")

print("βœ… Training complete!")

Step 3: Monitor Training

# Watch training progress in real-time
tensorboard --logdir ./llama-2-7b-finetuned/runs

# Monitor GPU usage
watch -n 1 nvidia-smi

# Expected training time for 7B model on A100:
# - 5,000 examples: ~8-12 hours
# - 10,000 examples: ~16-24 hours
# - 50,000 examples: ~80-120 hours

Step 4: Evaluate Your Model

# Load fine-tuned model
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-final")
model = AutoModelForCausalLM.from_pretrained(
    "./llama-2-7b-final",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Test it!
def generate_response(instruction, input_text=""):
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("### Response:\n")[1].strip()

# Try it
print(generate_response(
    "Summarize this article",
    "Python is a high-level programming language known for its simplicity and readability..."
))

Key Hyperparameters Explained

πŸ“Š Learning Rate

  • 2e-5 to 5e-5: Standard range
  • Too high (1e-4+): Catastrophic forgetting, unstable training
  • Too low (1e-6): Very slow learning, might not improve
  • Pro tip: Start with 2e-5, increase if loss plateaus

πŸ”„ Epochs

  • 3-5 epochs: Typical for most tasks
  • 1-2 epochs: Large datasets (50K+ examples)
  • 5-10 epochs: Small datasets (< 1K examples), but risk overfitting
  • Pro tip: Monitor validation loss, stop when it increases

πŸ“¦ Batch Size

  • Effective batch size 16-32: Good balance
  • Use gradient accumulation: Simulate large batches on small GPUs
  • Example: batch_size=4 Γ— accumulation=4 = effective 16
  • Pro tip: Larger batches = more stable, but need more memory

🌑️ Warmup Steps

  • 10% of total steps: Common heuristic
  • Purpose: Gradually increase LR from 0 to target
  • Why: Prevents early training instability
  • Example: 10,000 steps total β†’ 1,000 warmup steps

The Catastrophic Forgetting Problem

⚠️ Catastrophic Forgetting: When you fine-tune a model too aggressively, it "forgets" its general knowledge and only remembers your training data. The model becomes specialized but loses its general capabilities.

Before Fine-tuning:
Q: "What is Python?"
A: "Python is a high-level programming language created by Guido van Rossum..."
(General knowledge intact)

After Aggressive Fine-tuning (LR too high, too many epochs):
Q: "What is Python?"
A: "### Instruction:\nAnswer this question\n\n### Response:\n..."
(Model only knows training format, forgot general knowledge!)

Solution:
βœ… Low learning rate (2e-5)
βœ… Few epochs (3-5)
βœ… Validation monitoring
βœ… Early stopping if val loss increases

πŸ’‘ How to Detect Catastrophic Forgetting:

  • Test model on general knowledge questions before and after fine-tuning
  • If performance on general tasks drops significantly, you've overfitted
  • Solution: Reduce LR, reduce epochs, or add general examples to training data

Cost Analysis: Full Fine-tuning

Setup | Model | Training Time | Hardware | Cost per Run
Small Scale | GPT-2 (124M) | ~2 hours | RTX 3090 (24GB) | $0 (own GPU) or ~$0.50 cloud
Medium Scale | GPT-2 XL (1.5B) | ~8 hours | RTX 4090 (24GB) | $0 (own GPU) or ~$2 cloud
Production Scale | Llama 2 (7B) | ~16 hours | A100 (80GB) | ~$24 cloud per experiment
Enterprise Scale | Llama 2 (70B) | ~80 hours | 8x A100 (80GB) | ~$960 cloud per experiment

πŸ’Έ Budget Reality: Full fine-tuning a 7B model requires experimentation (5-10 runs to tune hyperparameters) = $120-240 in cloud costs. Plus data preparation time (1-2 weeks). This is why LoRA/QLoRA became so popularβ€”they're 10x cheaper!

⚑ LoRA (Low-Rank Adaptation): The Game Changer

LoRA (Low-Rank Adaptation) is a breakthrough technique published by Microsoft researchers in 2021 that revolutionized LLM fine-tuning. Instead of updating billions of parameters, LoRA trains tiny adapter modules (< 1% of model size) that achieve 95-99% of full fine-tuning performance at a fraction of the cost.

🎯 The Revolution: LoRA made it possible to fine-tune 7B parameter models on consumer GPUs (RTX 3090, 4090) instead of requiring $10K A100 GPUs. This democratized fine-tuning for researchers, startups, and hobbyists.

The Core Idea: Low-Rank Decomposition

LoRA is based on a key insight: fine-tuning updates to model weights are low-rank. This means most of the "learning" happens in a low-dimensional subspace, not across all dimensions.

Standard Fine-tuning:
--------------------
Original Weight Matrix W: 4096 Γ— 4096 = 16,777,216 parameters
During training: Update ALL 16M parameters
Memory needed: W + gradients + optimizer states = 4x model size

LoRA Fine-tuning:
-----------------
Original Weight Matrix W: 4096 Γ— 4096 (frozen, not trained)
LoRA Adapter:
  - Matrix A: 4096 Γ— 8 = 32,768 parameters
  - Matrix B: 8 Γ— 4096 = 32,768 parameters
  - Total: 65,536 parameters (0.4% of original!)

During training: Update only A and B
Memory needed: adapter gradients + optimizer states
               (the frozen W still has to be loaded, but needs no gradients or optimizer states)
Result: >100x fewer trainable parameters, and a fraction of the training memory

How LoRA Works: The Math

In a standard transformer layer, attention computation looks like:

Standard Forward Pass:
----------------------
output = W Γ— input

Where W is a huge weight matrix (e.g., 4096 Γ— 4096)

LoRA Forward Pass:
------------------
output = W Γ— input + (B Γ— A) Γ— input
         ↑           ↑
         Frozen      Trained adapters
         
Where:
- W: Original pretrained weights (frozen)
- A: Low-rank matrix (e.g., 4096 Γ— r, r=8)
- B: Low-rank matrix (e.g., r Γ— 4096)
- r: Rank (typically 8, 16, 32, or 64)

Key insight: (B Γ— A) is a 4096 Γ— 4096 matrix, but we only 
            train r Γ— (4096 + 4096) parameters instead of 4096Β²!

πŸ’‘ Example: For a 4096Γ—4096 weight matrix:

  • Full fine-tuning: 16,777,216 trainable parameters
  • LoRA (r=8): 65,536 trainable parameters (256x reduction!)
  • LoRA (r=16): 131,072 trainable parameters (128x reduction)
  • LoRA (r=64): 524,288 trainable parameters (32x reduction)

LoRA Configuration Parameters

πŸ“ Rank (r)

What it is: Dimensionality of the low-rank decomposition

  • r=4-8: Very parameter-efficient, good for simple tasks
  • r=16-32: Sweet spot for most tasks
  • r=64-128: Complex tasks, closer to full fine-tuning

Trade-off: Higher r = more parameters = better performance but more memory

🎚️ LoRA Alpha

What it is: Scaling factor for LoRA updates

  • Formula: scaling = alpha / r
  • Typical values: 16, 32, 64
  • Rule of thumb: alpha = 2 Γ— r

Effect: Controls how much LoRA adapters influence the output

🎯 Target Modules

What it is: Which layers get LoRA adapters

  • Common: ["q_proj", "v_proj"] (query & value in attention)
  • More aggressive: ["q_proj", "k_proj", "v_proj", "o_proj"]
  • Full coverage: All linear layers

Trade-off: More modules = better performance but more parameters

πŸ’§ LoRA Dropout

What it is: Dropout applied to LoRA layers

  • Typical: 0.05 (5%)
  • Purpose: Regularization to prevent overfitting
  • When to increase: Small datasets (< 1K examples)

Effect: Helps generalization, especially on small datasets
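
These choices all come down to a simple count: each adapted weight matrix of shape d_out Γ— d_in adds r Γ— (d_in + d_out) trainable parameters. A small estimator (the layer shapes below are Llama-2-7B's, assumed here for illustration; adjust for your model):

# Llama-2-7B shapes: hidden=4096, FFN=11008, 32 layers (adjust for other models)
SHAPES = {
    "q_proj": (4096, 4096), "k_proj": (4096, 4096), "v_proj": (4096, 4096), "o_proj": (4096, 4096),
    "gate_proj": (11008, 4096), "up_proj": (11008, 4096), "down_proj": (4096, 11008),
}

def lora_param_count(r, target_modules, num_layers=32):
    per_layer = sum(r * (d_in + d_out) for name, (d_out, d_in) in SHAPES.items() if name in target_modules)
    return per_layer * num_layers

print(f"{lora_param_count(16, ['q_proj', 'v_proj']):,}")  # ~8.4M (attention query/value only)
print(f"{lora_param_count(16, list(SHAPES)):,}")          # ~40M (all linear layers), still <1% of 7B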

Complete LoRA Fine-tuning Example

# Install: pip install peft transformers datasets accelerate
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model

# 1. Load base model (in FP16 for memory efficiency)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,  # Rank: 16 is a good balance
    lora_alpha=32,  # Alpha: 2 Γ— r
    target_modules=[
        "q_proj",   # Query projection in attention
        "k_proj",   # Key projection
        "v_proj",   # Value projection
        "o_proj",   # Output projection
        "gate_proj",  # For Llama's FFN
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.05,  # 5% dropout for regularization
    bias="none",  # Don't train bias terms
    task_type="CAUSAL_LM"  # Causal language modeling
)

# 3. Apply LoRA to model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
# Example output (exact counts depend on rank and target modules); with this config roughly:
# trainable params: ~40M || all params: ~6.7B || trainable%: ~0.6%

# 4. Load and prepare dataset (same as full fine-tuning)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})

def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_instruction)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# 5. Training arguments (optimized for LoRA)
training_args = TrainingArguments(
    output_dir="./llama-2-7b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # Can use larger batch size!
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=3e-4,  # LoRA can use higher LR than full fine-tuning
    weight_decay=0.01,
    warmup_steps=100,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=500,
    fp16=True,
    report_to="tensorboard"
)

# 6. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"]
)

print("πŸš€ Starting LoRA training...")
trainer.train()

# 7. Save LoRA adapters (a tiny file compared to the ~14GB base model)
model.save_pretrained("./llama-2-7b-lora-final")
tokenizer.save_pretrained("./llama-2-7b-lora-final")

print("βœ… LoRA training complete!")
print("πŸ’‘ Saved adapter weights are only ~20MB instead of ~14GB!")

Loading and Using LoRA Models

# Load the base model + LoRA adapters
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapters on top
model = PeftModel.from_pretrained(base_model, "./llama-2-7b-lora-final")
tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-lora-final")

# Use it!
def generate(instruction):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:\n")[1]

print(generate("Explain quantum computing in simple terms"))

# Merge LoRA weights into base model (optional, for deployment)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-2-7b-merged")
# Now you have a standalone model without needing the base + adapter separately!

LoRA Hardware Requirements

Model Size | Base Model | LoRA VRAM | GPU Options | Training Time
GPT-2 (124M) | ~500MB | ~2GB | RTX 3060 (12GB) βœ… | ~30 min
GPT-2 XL (1.5B) | ~6GB | ~8GB | RTX 3090 (24GB) βœ… | ~2 hours
Llama 2 (7B) | ~14GB | ~18GB | RTX 3090/4090 (24GB) βœ… | ~6 hours
Llama 2 (13B) | ~26GB | ~32GB | A100 (40GB) or RTX 6000 Ada | ~12 hours
Llama 2 (70B) | ~140GB | ~160GB | 2-3x A100 (80GB each) | ~48 hours

βœ… LoRA's Killer Feature: You can fine-tune a 7B model on a consumer RTX 3090 (24GB, $1,200) instead of needing an A100 (80GB, $10,000). This 8x cost reduction democratized LLM fine-tuning!

LoRA vs Full Fine-tuning: Performance Comparison

Research shows LoRA achieves 95-99% of full fine-tuning performance across diverse tasks:

Task | Full Fine-tune | LoRA (r=16) | Difference
Text Classification | 95.2% | 94.8% | -0.4%
Question Answering | 88.7% | 87.9% | -0.8%
Summarization (ROUGE) | 42.1 | 41.6 | -0.5
Code Generation | 72.3% | 71.1% | -1.2%
Average | 100% | 97.5% | -2.5%

πŸ’‘ Cost-Benefit Analysis:

  • Full fine-tuning: 100% performance, $30/run (A100), 16 hours
  • LoRA: 97.5% performance, $0 (own RTX 3090) or $3/run (cloud), 6 hours
  • Verdict: LoRA is 10x cheaper for 2.5% performance dropβ€”incredible trade-off!

LoRA Best Practices

βœ… Do's

  • Start with r=16, alpha=32 (good default)
  • Target at least q_proj and v_proj
  • Use higher learning rates (3e-4) than full fine-tuning
  • Train for more epochs (3-5) since only adapters update
  • Save adapters separately (only ~20MB!)
  • Merge adapters before deployment for faster inference

❌ Don'ts

  • Don't use r > 64 (diminishing returns, use full fine-tuning instead)
  • Don't forget to freeze base model (LoRA does this automatically)
  • Don't use too low r (< 4) for complex tasks
  • Don't train on too few examples (< 500)
  • Don't forget to merge adapters for production

πŸ”‹ QLoRA (Quantized LoRA): The Ultimate Memory Saver

QLoRA (Quantized Low-Rank Adaptation) takes LoRA one step further by combining it with 4-bit quantization. The result? You can fine-tune a 13B model on a single RTX 3090 (24GB) or a 70B model on a single A100 (80GB)β€”something previously impossible without multiple high-end GPUs.

πŸš€ The Magic: QLoRA made it possible to fine-tune a 65B-parameter model on a single 48GB GPU for the first time. This democratization fueled the explosion of open-source fine-tuned models in 2023 (Guanaco, plus countless community LoRA/QLoRA variants).

How QLoRA Works: Quantization + LoRA

QLoRA combines three key innovations:

1. 4-bit NormalFloat (NF4) Quantization

Stores model weights in 4-bit instead of 16-bit:

  • 16-bit (FP16): 7B model = 14GB
  • 4-bit (NF4): 7B model = 3.5GB (4x smaller!)

NF4: Optimized 4-bit format for normally distributed weights (like neural nets)

2. Double Quantization

Quantize the quantization constants themselves:

  • First quantization saves 4x memory
  • Double quantization saves additional ~0.37 bits/param
  • Total: 4.2x memory reduction with minimal quality loss

3. Paged Optimizers

Use CPU RAM when GPU memory is full:

  • Optimizer states moved to CPU
  • Transferred to GPU only when needed
  • Prevents out-of-memory crashes
  • Enables training on smaller GPUs

Memory Breakdown: QLoRA vs LoRA vs Full

Fine-tuning a 7B model (in FP16):

Full Fine-tuning:
β”œβ”€β”€ Model weights: 14GB
β”œβ”€β”€ Gradients: 14GB
β”œβ”€β”€ Optimizer states (Adam): 28GB
└── Activations: 8-12GB
    TOTAL: ~64-68GB (needs A100 80GB)

LoRA (r=16):
β”œβ”€β”€ Model weights: 14GB (frozen, no gradients)
β”œβ”€β”€ LoRA adapters: 0.02GB
β”œβ”€β”€ LoRA gradients: 0.02GB
β”œβ”€β”€ Optimizer states: 0.04GB
└── Activations: 8-12GB
    TOTAL: ~22-26GB (fits RTX 3090 24GB, barely)

QLoRA (r=16, 4-bit):
β”œβ”€β”€ Model weights (4-bit): 3.5GB (no gradients!)
β”œβ”€β”€ LoRA adapters (FP16): 0.02GB
β”œβ”€β”€ LoRA gradients: 0.02GB
β”œβ”€β”€ Optimizer states: 0.04GB
└── Activations: 4-6GB
    TOTAL: ~7.5-9.5GB (fits RTX 3060 12GB!)

⚠️ The Catch: QLoRA is ~20-30% slower than LoRA (due to quantization overhead) and achieves ~96% of full fine-tuning performance vs LoRA's 97.5%. But the memory savings are so dramatic that it's worth the trade-off for most use cases.

Complete QLoRA Implementation

# Install: pip install bitsandbytes peft transformers accelerate
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. Configure 4-bit quantization with NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4-bit quantization
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (optimal for neural nets)
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16 for speed
    bnb_4bit_use_double_quant=True  # Double quantization for extra savings
)

# 2. Load model in 4-bit
model_name = "meta-llama/Llama-2-13b-hf"  # 13B model!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically distribute across available GPUs
    trust_remote_code=True
)

# 3. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# 4. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (exact counts depend on rank and target modules); still well under 1% of all parameters

# 6. Load dataset (same as before)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})

def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_instruction)

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# 7. Training arguments (optimized for QLoRA)
training_args = TrainingArguments(
    output_dir="./llama-2-13b-qlora",
    num_train_epochs=3,
    per_device_train_batch_size=4,  # 4-bit allows larger batches
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # Higher than full fine-tuning; only the adapters are trained
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    fp16=True,  # Mixed precision
    optim="paged_adamw_8bit",  # 8-bit Adam with paging (crucial for QLoRA!)
    lr_scheduler_type="cosine",
    evaluation_strategy="steps",
    eval_steps=100,
    report_to="tensorboard"
)

# 8. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"]
)

print("πŸš€ Starting QLoRA training of 13B model on consumer GPU...")
trainer.train()

# 9. Save
model.save_pretrained("./llama-2-13b-qlora-final")
tokenizer.save_pretrained("./llama-2-13b-qlora-final")
print("βœ… QLoRA training complete! Saved adapters (~40MB).")

QLoRA Hardware Requirements: The Real Game-Changer

Model Size | Full Fine-tune | LoRA | QLoRA | GPU Needed (QLoRA)
Llama 2 (7B) | ~64GB | ~22GB | ~8GB | RTX 3060 (12GB) βœ…
Llama 2 (13B) | ~120GB | ~40GB | ~16GB | RTX 3090/4090 (24GB) βœ…
Llama 2 (70B) | ~600GB | ~200GB | ~48GB | A100 (80GB) βœ…
Mixtral (8x7B) | ~480GB | ~160GB | ~40GB | A100 (80GB) βœ…

🎯 QLoRA's Superpower: A hobbyist with a $1,500 RTX 4090 can fine-tune a 13B model that outperforms GPT-3.5 on specialized tasks. This was unthinkable before QLoRAβ€”you needed $30K+ in GPUs.

QLoRA Performance: How Much Quality Do You Lose?

QLoRA achieves 96-98% of full fine-tuning performance across tasks:

Task | Full FT | LoRA (r=16) | QLoRA (r=16)
Instruction Following | 91.2% | 90.1% (-1.1%) | 89.3% (-1.9%)
Code Generation | 68.5% | 67.2% (-1.3%) | 66.1% (-2.4%)
Summarization (ROUGE) | 45.7 | 44.9 (-0.8) | 44.1 (-1.6)
Average Retention | 100% | 98.5% | 96.8%

πŸ’‘ Real Talk: QLoRA loses 3-4% performance vs full fine-tuning but uses 8x less memory. For most applications, this is an incredible trade-off. Plus, you can often recover lost performance by using a larger model (e.g., 13B QLoRA beats 7B full fine-tuning).

πŸ“Š Comparison

Approach | Trainable Params | VRAM (7B) | Training Speed | Quality
Full | 7B (all) | ~60-80GB | Slow | ⭐⭐⭐⭐⭐
LoRA | <1% of params | ~18-26GB | Fast | ⭐⭐⭐⭐⭐
QLoRA | <1% of params | ~8-10GB | Fast (20-30% slower than LoRA) | ⭐⭐⭐⭐

πŸ“‹ Best Practices

  • Start with prompting/RAG: Try those first. They're faster
  • Use QLoRA: Best cost-benefit for most use cases
  • Low learning rates: 1e-4 to 5e-5 typical
  • Evaluate regularly: On validation set during training
  • Don't overfit: 3-5 epochs usually optimal
  • Use instruction data: Format data as instructions for best results
  • Monitor perplexity: Track validation loss

πŸ“‹ Summary

What You've Learned:

  • Full fine-tuning updates all parameters (expensive)
  • LoRA trains small adapter modules (10x cheaper)
  • QLoRA combines quantization + LoRA (most practical)
  • QLoRA: ~1M trainable params, 6-8GB VRAM for 7B models
  • Start with RAG/prompting. Fine-tune only when needed

What's Next?

In our next tutorial, Inference Optimization, we'll learn how to deploy models efficiently.

πŸŽ‰ Excellent! You now have the skills to fine-tune any LLM on your data efficiently!

Test Your Knowledge

Q1: What is the main advantage of Parameter-Efficient Fine-Tuning (PEFT)?

It makes models larger
It eliminates the need for training data
It updates only a small subset of parameters, reducing memory and compute requirements
It works without GPUs

Q2: What does LoRA (Low-Rank Adaptation) do?

It compresses the model permanently
It adds small trainable rank decomposition matrices to model layers while keeping original weights frozen
It removes layers from the model
It increases model size significantly

Q3: Which fine-tuning approach updates ALL model parameters?

LoRA
Prefix tuning
Adapter layers
Full fine-tuning

Q4: What is the purpose of the validation set during fine-tuning?

To monitor model performance and detect overfitting without using test data
To increase training speed
To generate more training data
To replace the test set

Q5: When should you consider full fine-tuning instead of PEFT?

Always, it's always better
Never, PEFT is always superior
When you have substantial domain-specific data and need maximum performance
Only for image models