Why Fine-tune? When RAG and ICL Aren't Enough
In the previous tutorial, you learned that RAG and In-Context Learning solve 95% of LLM application needs. But what about the other 5%? When do you actually need to fine-tune a model?
⚠️ Critical Decision Point: Fine-tuning is powerful but expensive and time-consuming. Before fine-tuning, ask yourself: "Have I truly exhausted prompt engineering, few-shot learning, and RAG?" If not, go back and try those first.
Scenarios Where Fine-tuning Wins
1. Learning Specific Style/Tone
Problem: You need your chatbot to respond in your company's unique voice: formal, technical, friendly, etc.
Why RAG fails: RAG retrieves facts, not communication style. Few-shot examples can't capture nuanced tone across all situations.
Fine-tuning wins: Model internalizes the style through thousands of examples.
Example: A customer support chatbot trained on 5,000 past conversations learns your company's empathetic, professional tone automatically.
2. Domain-Specific Behavior
Problem: Need deep medical/legal/scientific reasoning that requires specialized knowledge patterns.
Why RAG fails: RAG can retrieve documents, but the model still needs to reason about them using domain expertise.
Fine-tuning wins: Model learns domain-specific reasoning patterns, terminology, and decision-making.
Example: A medical diagnosis assistant fine-tuned on 10,000 case studies learns clinical reasoning patterns that few-shot examples can't teach.
3. Latency & Cost Optimization
Problem: RAG adds retrieval latency (~200-500ms). ICL inflates token usage with long prompts.
Why RAG fails: Every query requires embedding generation, vector search, and longer context.
Fine-tuning wins: Knowledge is "baked into" model weights. No retrieval needed, faster inference.
RAG: 500ms retrieval + 1000 tokens
Fine-tuned: 0ms retrieval + 200 tokens
Result: 3x faster, 5x cheaper!
4. Complex Output Formatting
Problem: Need highly structured outputs (JSON, code, specific formats) with 100% consistency.
Why ICL fails: Few-shot examples can be inconsistent. Model might deviate from format on edge cases.
Fine-tuning wins: Model learns to always produce the exact format through thousands of examples.
Example: A code generation model trained on 50,000 examples learns to always produce valid, properly formatted code.
5. Task-Specific Specialization
Problem: Need superhuman performance on a narrow, specialized task.
Why general models fail: GPT-4 is a generalist. For specialized tasks, a smaller fine-tuned model can outperform it.
Fine-tuning wins: Smaller, cheaper model (7B) fine-tuned on specific task beats general 175B+ model.
Example: SQL query generation
Fine-tuned 7B model: 95% accuracy
GPT-4 few-shot: 87% accuracy
Cost: 50x cheaper!
6. Privacy & Compliance
Problem: Can't send sensitive data to external APIs (OpenAI, Anthropic). Need on-premise deployment.
Why APIs fail: Healthcare, finance, legal data often can't leave your infrastructure.
Fine-tuning wins: Train and deploy your own model on your hardware. Full data control.
Example: Healthcare: train Llama 2 on internal patient data, deploy on-premise. Zero external API calls.
The Decision Framework
START: "I need an LLM to do X"
  ↓
Question 1: Can few-shot examples (2-10) show the model how to do it?
  YES → Use In-Context Learning (ICL)
  NO → Continue
  ↓
Question 2: Is it primarily about accessing external knowledge/documents?
  YES → Use RAG
  NO → Continue
  ↓
Question 3: Do I have 1,000+ high-quality training examples?
  NO → Go back to ICL/RAG, you don't have enough data
  YES → Continue
  ↓
Question 4: Is it about style, behavior, or specialized reasoning?
  YES → Consider fine-tuning
  NO → Probably can solve with RAG + ICL
  ↓
Question 5: Have I truly exhausted prompt engineering?
  NO → Go back and try harder with prompts
  YES → Fine-tune!
  ↓
Question 6: Which fine-tuning approach?
  • Need 99% accuracy, have A100 GPUs? → Full fine-tuning
  • Consumer GPU, good balance? → LoRA
  • Minimal resources, 7-13B models? → QLoRA ✅ (most common)
💡 Real-World Stats:
- 70% of applications: Solved with RAG alone
- 20% of applications: RAG + ICL combination
- 8% of applications: Fine-tuning (usually QLoRA)
- 2% of applications: Full fine-tuning (big companies with resources)
Fine-tuning vs RAG: Side-by-Side Example
Scenario: Build a customer support bot for a SaaS company.
✅ Hybrid Approach (Best of Both Worlds): Use a fine-tuned model for style/tone + RAG for facts! Example: Fine-tune a 7B model on your support conversations (learns tone), then use RAG to retrieve current product docs (provides up-to-date facts). This combines the strengths of both approaches.
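To make the hybrid idea concrete, here is a minimal sketch (not a full implementation) of serving a fine-tuned model together with retrieval. The model path ./support-bot-merged and the retrieve_docs helper are hypothetical placeholders; any retriever (vector store, keyword search) can fill that role.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./support-bot-merged")   # hypothetical fine-tuned model
model = AutoModelForCausalLM.from_pretrained(
    "./support-bot-merged", torch_dtype=torch.float16, device_map="auto"
)

def retrieve_docs(question, k=3):
    """Hypothetical retriever: return top-k product-doc snippets (plug in a vector store here)."""
    return ["(retrieved doc snippet 1)", "(retrieved doc snippet 2)"]

def answer(question):
    context = "\n\n".join(retrieve_docs(question))    # facts come from retrieval
    prompt = (                                        # tone comes from the fine-tuned weights
        "### Instruction:\nAnswer the customer using the context below.\n\n"
        f"### Input:\n{context}\n\nQuestion: {question}\n\n### Response:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:\n")[-1]

print(answer("How do I reset my password?"))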
When NOT to Fine-tune
❌ Don't fine-tune if:
- <500 training examples: Not enough data for meaningful fine-tuning. Use ICL instead.
- Rapidly changing domain: If your knowledge updates daily/weekly, RAG is better.
- Multiple unrelated tasks: Fine-tuning specializes. For generalists, use GPT-4 + prompts.
- No clear success metric: How will you know if fine-tuning improved things? Need evaluation data.
- Budget/time constraints: Fine-tuning requires investment. RAG is faster to market.
- Primarily factual QA: RAG excels here. Fine-tuning won't help much.
Full Fine-tuning: Training All Parameters
Full fine-tuning, the classic form of supervised fine-tuning (SFT), means updating every single parameter in the model through backpropagation. If you're fine-tuning a 7B parameter model, you're training all 7 billion weights.
Scale: A 7B model has ~7 billion parameters. At 16-bit precision (FP16), that's 14GB just to load the model. During training, you also need memory for gradients (another 14GB), optimizer states (28GB for Adam), and activations. Total: ~60-80GB VRAM for full fine-tuning of a 7B model!
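As a rough sanity check, this tiny calculator reproduces the numbers above using the same accounting (FP16 weights, one gradient per weight, and two Adam states counted at weight precision); activations are extra and depend on batch size and sequence length.

# Back-of-the-envelope VRAM estimate for full fine-tuning, mirroring the accounting above.
def full_finetune_vram_gb(n_params, bytes_per_weight=2):
    weights = n_params * bytes_per_weight / 1e9      # FP16 model weights
    gradients = weights                              # one gradient per weight, same precision
    optimizer = 2 * weights                          # Adam's two moment estimates (as counted above)
    return {
        "weights_gb": weights,
        "gradients_gb": gradients,
        "optimizer_gb": optimizer,
        "total_without_activations_gb": weights + gradients + optimizer,
    }

print(full_finetune_vram_gb(7e9))
# {'weights_gb': 14.0, 'gradients_gb': 14.0, 'optimizer_gb': 28.0, 'total_without_activations_gb': 56.0}
# Add ~8-12GB of activations and you land in the range quoted above.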
When to Use Full Fine-tuning
✅ Use Full Fine-tuning When:
- You have access to A100/H100 GPUs (80GB VRAM)
- Need absolute maximum performance
- Training relatively small models (< 3B params)
- Have budget for expensive compute
- Domain shift is very large (e.g., new language)
❌ Skip Full Fine-tuning When:
- Using consumer GPUs (RTX 3090, 4090)
- Models are large (> 7B params)
- Budget is limited
- Quick experimentation needed
- Task is well-defined (LoRA/QLoRA work well)
Hardware Requirements by Model Size
⚠️ Reality Check: An A100 rents for about $1.50/hour, and a typical full fine-tuning run for a 7B model takes 10-20 hours = $15-30 per experiment. With hyperparameter tuning (5-10 runs), you're looking at $150-300 in compute costs.
The Full Fine-tuning Process
Here's a complete, production-ready fine-tuning pipeline using HuggingFace Transformers:
Step 1: Prepare Your Training Data
// train.jsonl (instruction-following format)
{"instruction": "Summarize this article", "input": "Article text here...", "output": "Summary text here..."}
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
{"instruction": "Answer the question", "input": "What is Python?", "output": "Python is a programming language..."}
// Recommendation: 1,000+ examples minimum, 10,000+ for best results
💡 Data Quality > Quantity: 1,000 high-quality, diverse examples beat 10,000 noisy, repetitive examples. Each example should be carefully reviewed, properly formatted, and representative of your task.
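Before training, it is worth sanity-checking the JSONL file. A minimal sketch; the file name and required keys follow the format shown above, so adjust them to your own schema.

import json

required_keys = {"instruction", "output"}   # "input" is optional in this format

def check_jsonl(path):
    """Report lines that are not valid JSON or are missing required keys."""
    checked, problems = 0, 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                problems += 1
                print(f"line {i}: not valid JSON")
                continue
            missing = required_keys - example.keys()
            if missing:
                problems += 1
                print(f"line {i}: missing keys {missing}")
            checked += 1
    print(f"{checked} examples checked, {problems} problems found")

check_jsonl("train.jsonl")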
Step 2: Complete Training Script
# install: pip install transformers datasets accelerate torch
import torch
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
# 1. Load your model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16, # Use FP16 for memory efficiency
device_map="auto" # Automatically distribute across GPUs
)
# Add padding token if missing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id
# 2. Load and prepare dataset
dataset = load_dataset("json", data_files={
"train": "train.jsonl",
"validation": "validation.jsonl"
})
def format_instruction(example):
    """Convert instruction-input-output to a prompt"""
    instruction = example["instruction"]
    input_text = example.get("input", "")
    output = example["output"]
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
    return {"text": prompt}
# Format all examples
dataset = dataset.map(format_instruction)
# 3. Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,  # Adjust based on your task
        padding="max_length"
    )
tokenized_dataset = dataset.map(
tokenize_function,
batched=True,
remove_columns=dataset["train"].column_names
)
# 4. Set up training arguments
training_args = TrainingArguments(
output_dir="./llama-2-7b-finetuned",
num_train_epochs=3, # 3-5 epochs typical
per_device_train_batch_size=4, # Adjust based on VRAM
per_device_eval_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
learning_rate=2e-5, # Low LR to prevent catastrophic forgetting
weight_decay=0.01,
warmup_steps=100, # Gradual learning rate warmup
lr_scheduler_type="cosine", # Cosine decay schedule
logging_steps=10,
evaluation_strategy="steps",
eval_steps=100,
save_steps=500,
save_total_limit=3, # Keep only 3 best checkpoints
load_best_model_at_end=True,
fp16=True, # Mixed precision training (faster, less memory)
report_to="tensorboard", # Log to TensorBoard
push_to_hub=False
)
# 5. Create data collator
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False # We're doing causal LM, not masked LM
)
# 6. Initialize Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"],
data_collator=data_collator
)
# 7. Train!
print("Starting training...")
trainer.train()
# 8. Save the final model
print("Saving model...")
trainer.save_model("./llama-2-7b-final")
tokenizer.save_pretrained("./llama-2-7b-final")
print("✅ Training complete!")
Step 3: Monitor Training
# Watch training progress in real-time
tensorboard --logdir ./llama-2-7b-finetuned/runs
# Monitor GPU usage
watch -n 1 nvidia-smi
# Expected training time for 7B model on A100:
# - 5,000 examples: ~8-12 hours
# - 10,000 examples: ~16-24 hours
# - 50,000 examples: ~80-120 hours
Step 4: Evaluate Your Model
# Load fine-tuned model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-final")
model = AutoModelForCausalLM.from_pretrained(
"./llama-2-7b-final",
torch_dtype=torch.float16,
device_map="auto"
)
# Test it!
def generate_response(instruction, input_text=""):
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split("### Response:\n")[1].strip()
# Try it
print(generate_response(
"Summarize this article",
"Python is a high-level programming language known for its simplicity and readability..."
))
Key Hyperparameters Explained
Learning Rate
- 2e-5 to 5e-5: Standard range
- Too high (1e-4+): Catastrophic forgetting, unstable training
- Too low (1e-6): Very slow learning, might not improve
- Pro tip: Start with 2e-5, increase if loss plateaus
Epochs
- 3-5 epochs: Typical for most tasks
- 1-2 epochs: Large datasets (50K+ examples)
- 5-10 epochs: Small datasets (< 1K examples), but risk overfitting
- Pro tip: Monitor validation loss, stop when it increases
Batch Size
- Effective batch size 16-32: Good balance
- Use gradient accumulation: Simulate large batches on small GPUs
- Example: batch_size=4 × accumulation=4 = effective 16
- Pro tip: Larger batches = more stable, but need more memory
Warmup Steps
- 10% of total steps: Common heuristic
- Purpose: Gradually increase LR from 0 to target
- Why: Prevents early training instability
- Example: 10,000 steps total → 1,000 warmup steps (worked through in the snippet below)
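Here is that arithmetic with assumed example numbers (a 10,000-example dataset and the batch settings from the training script above):

# Effective batch size, total optimizer steps, and the ~10% warmup heuristic.
num_examples = 10_000                 # assumed dataset size
num_epochs = 3
per_device_batch_size = 4
gradient_accumulation_steps = 4

effective_batch_size = per_device_batch_size * gradient_accumulation_steps   # 16
steps_per_epoch = num_examples // effective_batch_size                        # 625
total_steps = steps_per_epoch * num_epochs                                    # 1875
warmup_steps = int(0.10 * total_steps)                                        # 187

print(effective_batch_size, total_steps, warmup_steps)   # 16 1875 187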
The Catastrophic Forgetting Problem
⚠️ Catastrophic Forgetting: When you fine-tune a model too aggressively, it "forgets" its general knowledge and only remembers your training data. The model becomes specialized but loses its general capabilities.
Before Fine-tuning:
Q: "What is Python?"
A: "Python is a high-level programming language created by Guido van Rossum..."
(General knowledge intact)
After Aggressive Fine-tuning (LR too high, too many epochs):
Q: "What is Python?"
A: "### Instruction:\nAnswer this question\n\n### Response:\n..."
(Model only knows training format, forgot general knowledge!)
Solution:
✅ Low learning rate (2e-5)
✅ Few epochs (3-5)
✅ Validation monitoring
✅ Early stopping if val loss increases
💡 How to Detect Catastrophic Forgetting:
- Test the model on general knowledge questions before and after fine-tuning (a quick probe is sketched below)
- If performance on general tasks drops significantly, you've overfitted
- Solution: Reduce LR, reduce epochs, or add general examples to training data
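Here is that probe as a minimal sketch. The question list is an assumption (use whatever benchmark you trust), and the model paths are the ones used earlier in this tutorial.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

general_probes = [                      # assumed hand-written probes
    "What is Python?",
    "Who wrote Romeo and Juliet?",
    "What is the capital of France?",
]

def answers(model_path):
    """Generate short answers to the probe questions with the given model."""
    tok = AutoTokenizer.from_pretrained(model_path)
    mdl = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    results = []
    for question in general_probes:
        inputs = tok(question, return_tensors="pt").to(mdl.device)
        output = mdl.generate(**inputs, max_new_tokens=64)
        results.append(tok.decode(output[0], skip_special_tokens=True))
    return results

before = answers("meta-llama/Llama-2-7b-hf")   # base model
after = answers("./llama-2-7b-final")          # your fine-tuned model
for question, b, a in zip(general_probes, before, after):
    print(question, "\n  base: ", b, "\n  tuned:", a)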
Cost Analysis: Full Fine-tuning
Budget Reality: Full fine-tuning a 7B model requires experimentation (5-10 runs to tune hyperparameters) = $120-240 in cloud costs. Plus data preparation time (1-2 weeks). This is why LoRA/QLoRA became so popular: they're 10x cheaper!
LoRA (Low-Rank Adaptation): The Game Changer
LoRA (Low-Rank Adaptation) is a breakthrough technique published by Microsoft researchers in 2021 that revolutionized LLM fine-tuning. Instead of updating billions of parameters, LoRA trains tiny adapter modules (< 1% of model size) that achieve 95-99% of full fine-tuning performance at a fraction of the cost.
The Revolution: LoRA made it possible to fine-tune 7B parameter models on consumer GPUs (RTX 3090, 4090) instead of requiring $10K A100 GPUs. This democratized fine-tuning for researchers, startups, and hobbyists.
The Core Idea: Low-Rank Decomposition
LoRA is based on a key insight: fine-tuning updates to model weights are low-rank. This means most of the "learning" happens in a low-dimensional subspace, not across all dimensions.
Standard Fine-tuning:
--------------------
Original Weight Matrix W: 4096 × 4096 = 16,777,216 parameters
During training: Update ALL 16M parameters
Memory needed: W + gradients + optimizer states = 4x model size

LoRA Fine-tuning:
-----------------
Original Weight Matrix W: 4096 × 4096 (frozen, not trained)
LoRA Adapter:
- Matrix A: 8 × 4096 = 32,768 parameters
- Matrix B: 4096 × 8 = 32,768 parameters
- Total: 65,536 parameters (0.4% of original!)
During training: Update only A and B
Memory needed: Only the adapters' gradients + optimizer states
Result: 100x less memory!
How LoRA Works: The Math
In a standard transformer layer, attention computation looks like:
Standard Forward Pass:
----------------------
output = W × input
Where W is a huge weight matrix (e.g., 4096 × 4096)

LoRA Forward Pass:
------------------
output = W × input + (B × A) × input
           ↑             ↑
        Frozen      Trained adapters
Where:
- W: Original pretrained weights (frozen)
- A: Low-rank matrix (r × 4096, e.g., r=8)
- B: Low-rank matrix (4096 × r)
- r: Rank (typically 8, 16, 32, or 64)
Key insight: (B × A) is a 4096 × 4096 matrix, but we only
train r × (4096 + 4096) parameters instead of 4096²!
💡 Example: For a 4096×4096 weight matrix:
- Full fine-tuning: 16,777,216 trainable parameters
- LoRA (r=8): 65,536 trainable parameters (256x reduction!)
- LoRA (r=16): 131,072 trainable parameters (128x reduction)
- LoRA (r=64): 524,288 trainable parameters (32x reduction)
A minimal PyTorch sketch of this adapter structure follows below.
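Here is that sketch: a self-contained LoRA-style linear layer in plain PyTorch (an illustration of the idea, not the PEFT library's internal implementation). The base weight is frozen; only the low-rank pair A, B trains, scaled by alpha / r.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                       # W stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)    # r x in_features
        self.B = nn.Parameter(torch.zeros(out_features, r))          # out_features x r, zero-init
        self.scaling = alpha / r

    def forward(self, x):
        # W x plus the low-rank update (B A) x, scaled by alpha / r
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65536 = 8*4096 + 4096*8, matching the counts above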
LoRA Configuration Parameters
Rank (r)
What it is: Dimensionality of the low-rank decomposition
- r=4-8: Very parameter-efficient, good for simple tasks
- r=16-32: Sweet spot for most tasks
- r=64-128: Complex tasks, closer to full fine-tuning
Trade-off: Higher r = more parameters = better performance but more memory
LoRA Alpha
What it is: Scaling factor for LoRA updates
- Formula: scaling = alpha / r
- Typical values: 16, 32, 64
- Rule of thumb: alpha = 2 × r
Effect: Controls how much LoRA adapters influence the output
Target Modules
What it is: Which layers get LoRA adapters
- Common: ["q_proj", "v_proj"] (query & value in attention)
- More aggressive: ["q_proj", "k_proj", "v_proj", "o_proj"]
- Full coverage: All linear layers
Trade-off: More modules = better performance but more parameters
LoRA Dropout
What it is: Dropout applied to LoRA layers
- Typical: 0.05 (5%)
- Purpose: Regularization to prevent overfitting
- When to increase: Small datasets (< 1K examples)
Effect: Helps generalization, especially on small datasets
Complete LoRA Fine-tuning Example
# Install: pip install peft transformers datasets accelerate
import torch
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
Trainer
)
from peft import LoraConfig, get_peft_model
# 1. Load base model (in FP16 for memory efficiency)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# 2. Configure LoRA
lora_config = LoraConfig(
r=16, # Rank: 16 is a good balance
lora_alpha=32, # Alpha: 2 × r
target_modules=[
"q_proj", # Query projection in attention
"k_proj", # Key projection
"v_proj", # Value projection
"o_proj", # Output projection
"gate_proj", # For Llama's FFN
"up_proj",
"down_proj"
],
lora_dropout=0.05, # 5% dropout for regularization
bias="none", # Don't train bias terms
task_type="CAUSAL_LM" # Causal language modeling
)
# 3. Apply LoRA to model
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Example output (approximate): trainable params: ~40M || all params: ~6.7B || trainable%: ~0.6%
# 4. Load and prepare dataset (same as full fine-tuning)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})
def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}
dataset = dataset.map(format_instruction)
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# 5. Training arguments (optimized for LoRA)
training_args = TrainingArguments(
output_dir="./llama-2-7b-lora",
num_train_epochs=3,
per_device_train_batch_size=8, # Can use larger batch size!
per_device_eval_batch_size=8,
gradient_accumulation_steps=2,
learning_rate=3e-4, # LoRA can use higher LR than full fine-tuning
weight_decay=0.01,
warmup_steps=100,
logging_steps=10,
evaluation_strategy="steps",
eval_steps=100,
save_steps=500,
fp16=True,
report_to="tensorboard"
)
# 6. Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"]
)
print("Starting LoRA training...")
trainer.train()
# 7. Save LoRA adapters (only ~10-50MB!)
model.save_pretrained("./llama-2-7b-lora-final")
tokenizer.save_pretrained("./llama-2-7b-lora-final")
print("✅ LoRA training complete!")
print("Saved adapter weights are only ~20MB instead of ~14GB!")
Loading and Using LoRA Models
# Load the base model + LoRA adapters
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
# Load LoRA adapters on top
model = PeftModel.from_pretrained(base_model, "./llama-2-7b-lora-final")
tokenizer = AutoTokenizer.from_pretrained("./llama-2-7b-lora-final")
# Use it!
def generate(instruction):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:\n")[1]
print(generate("Explain quantum computing in simple terms"))
# Merge LoRA weights into base model (optional, for deployment)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama-2-7b-merged")
# Now you have a standalone model without needing the base + adapter separately!
LoRA Hardware Requirements
✅ LoRA's Killer Feature: You can fine-tune a 7B model on a consumer RTX 3090 (24GB, $1,200) instead of needing an A100 (80GB, $10,000). This 8x cost reduction democratized LLM fine-tuning!
LoRA vs Full Fine-tuning: Performance Comparison
Research shows LoRA achieves 95-99% of full fine-tuning performance across diverse tasks:
💡 Cost-Benefit Analysis:
- Full fine-tuning: 100% performance, $30/run (A100), 16 hours
- LoRA: 97.5% performance, $0 (own RTX 3090) or $3/run (cloud), 6 hours
- Verdict: LoRA is 10x cheaper for a 2.5% performance drop, an incredible trade-off!
LoRA Best Practices
✅ Do's
- Start with r=16, alpha=32 (good default)
- Target at least q_proj and v_proj
- Use higher learning rates (3e-4) than full fine-tuning
- Train for more epochs (3-5) since only adapters update
- Save adapters separately (only ~20MB!)
- Merge adapters before deployment for faster inference
❌ Don'ts
- Don't use r > 64 (diminishing returns, use full fine-tuning instead)
- Don't forget to freeze base model (LoRA does this automatically)
- Don't use too low r (< 4) for complex tasks
- Don't train on too few examples (< 500)
- Don't forget to merge adapters for production
QLoRA (Quantized LoRA): The Ultimate Memory Saver
QLoRA (Quantized Low-Rank Adaptation) takes LoRA one step further by combining it with 4-bit quantization. The result? You can fine-tune a 13B model on a single RTX 3090 (24GB) or a 70B model on a single A100 (80GB), something previously impossible without multiple high-end GPUs.
The Magic: QLoRA enabled researchers to fine-tune 65B parameter models on a single GPU for the first time. This democratization contributed to the explosion of open-source fine-tuned models in 2023 (Alpaca, Vicuna, WizardLM, etc.).
How QLoRA Works: Quantization + LoRA
QLoRA combines three key innovations:
1. 4-bit NormalFloat (NF4) Quantization
Stores model weights in 4-bit instead of 16-bit:
- 16-bit (FP16): 7B model = 14GB
- 4-bit (NF4): 7B model = 3.5GB (4x smaller!)
NF4: a 4-bit format optimized for normally distributed values, which is how trained neural network weights are typically distributed
2. Double Quantization
Quantize the quantization constants themselves:
- First quantization saves 4x memory
- Double quantization saves additional ~0.37 bits/param
- Total: 4.2x memory reduction with minimal quality loss
3. Paged Optimizers
Use CPU RAM when GPU memory is full:
- Optimizer states moved to CPU
- Transferred to GPU only when needed
- Prevents out-of-memory crashes
- Enables training on smaller GPUs
Memory Breakdown: QLoRA vs LoRA vs Full
Fine-tuning a 7B model (in FP16):
Full Fine-tuning:
├── Model weights: 14GB
├── Gradients: 14GB
├── Optimizer states (Adam): 28GB
└── Activations: 8-12GB
TOTAL: ~64-68GB (needs A100 80GB)

LoRA (r=16):
├── Model weights: 14GB (frozen, no gradients)
├── LoRA adapters: 0.02GB
├── LoRA gradients: 0.02GB
├── Optimizer states: 0.04GB
└── Activations: 8-12GB
TOTAL: ~22-26GB (fits RTX 3090 24GB, barely)

QLoRA (r=16, 4-bit):
├── Model weights (4-bit): 3.5GB (no gradients!)
├── LoRA adapters (FP16): 0.02GB
├── LoRA gradients: 0.02GB
├── Optimizer states: 0.04GB
└── Activations: 4-6GB
TOTAL: ~7.5-9.5GB (fits RTX 3060 12GB!)
⚠️ The Catch: QLoRA is ~20-30% slower than LoRA (due to quantization overhead) and achieves ~96% of full fine-tuning performance vs LoRA's 97.5%. But the memory savings are so dramatic that it's worth the trade-off for most use cases.
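The LoRA and QLoRA rows in the breakdown above follow from a small amount of arithmetic: frozen base weights (no gradients or optimizer states), tiny FP16 adapters with their own gradient and optimizer memory, plus activations. A rough sketch using the same per-parameter byte counts; the adapter size and activation figures are assumptions.

# Rough VRAM tally for PEFT on a 7B model, matching the breakdown above.
def peft_vram_gb(n_params=7e9, bytes_per_weight=2.0, adapter_params=10e6, activations_gb=10.0):
    base = n_params * bytes_per_weight / 1e9         # frozen base weights (no grads / optimizer)
    adapters = adapter_params * 2 / 1e9               # FP16 adapter weights
    adapter_states = adapter_params * (2 + 4) / 1e9   # adapter gradients + two optimizer states
    return round(base + adapters + adapter_states + activations_gb, 1)

print(peft_vram_gb(bytes_per_weight=2.0, activations_gb=10))   # LoRA:  ~24.1 GB (FP16 base)
print(peft_vram_gb(bytes_per_weight=0.5, activations_gb=5))    # QLoRA: ~8.6 GB (4-bit base)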
Complete QLoRA Implementation
# Install: pip install bitsandbytes peft transformers accelerate
import torch
from datasets import load_dataset
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
BitsAndBytesConfig,
TrainingArguments,
Trainer
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# 1. Configure 4-bit quantization with NF4
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Use 4-bit quantization
bnb_4bit_quant_type="nf4", # NormalFloat4 (optimal for neural nets)
bnb_4bit_compute_dtype=torch.float16, # Compute in FP16 for speed
bnb_4bit_use_double_quant=True # Double quantization for extra savings
)
# 2. Load model in 4-bit
model_name = "meta-llama/Llama-2-13b-hf" # 13B model!
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto", # Automatically distribute across available GPUs
trust_remote_code=True
)
# 3. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)
# 4. Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (approximate): trainable params: ~60M || all params: ~13B || trainable%: ~0.5%
# 6. Load dataset (same as before)
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})
def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}
dataset = dataset.map(format_instruction)
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# 7. Training arguments (optimized for QLoRA)
training_args = TrainingArguments(
output_dir="./llama-2-13b-qlora",
num_train_epochs=3,
per_device_train_batch_size=4, # 4-bit allows larger batches
gradient_accumulation_steps=4,
learning_rate=2e-4, # Higher than full fine-tuning, similar to LoRA
warmup_steps=100,
logging_steps=10,
save_steps=500,
fp16=True, # Mixed precision
optim="paged_adamw_8bit", # 8-bit Adam with paging (crucial for QLoRA!)
lr_scheduler_type="cosine",
evaluation_strategy="steps",
eval_steps=100,
report_to="tensorboard"
)
# 8. Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"]
)
print("Starting QLoRA training of 13B model on consumer GPU...")
trainer.train()
# 9. Save
model.save_pretrained("./llama-2-13b-qlora-final")
tokenizer.save_pretrained("./llama-2-13b-qlora-final")
print("✅ QLoRA training complete! Saved adapters (~40MB).")
QLoRA Hardware Requirements: The Real Game-Changer
QLoRA's Superpower: A hobbyist with a $1,500 RTX 4090 can fine-tune a 13B model that outperforms GPT-3.5 on specialized tasks. This was unthinkable before QLoRA; you needed $30K+ in GPUs.
QLoRA Performance: How Much Quality Do You Lose?
QLoRA achieves 96-98% of full fine-tuning performance across tasks:
💡 Real Talk: QLoRA typically loses 2-4% performance vs full fine-tuning but uses 8x less memory. For most applications, this is an incredible trade-off. Plus, you can often recover lost performance by using a larger model (e.g., 13B QLoRA beats 7B full fine-tuning).
Comparison
Best Practices
- Start with prompting/RAG: Try those first; they're faster to ship
- Use QLoRA: Best cost-benefit for most use cases
- Match the learning rate to the method: ~2e-5 for full fine-tuning, ~1e-4 to 3e-4 for LoRA/QLoRA
- Evaluate regularly: On a validation set during training
- Don't overfit: 3-5 epochs usually optimal
- Use instruction data: Format data as instructions for best results
- Monitor perplexity: Track validation loss (see the snippet below)
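For the last point, perplexity is just the exponential of the validation cross-entropy loss, so it can be read straight off the Trainer's eval metrics. A minimal sketch, assuming the trainer object from the scripts above:

import math

eval_metrics = trainer.evaluate()                    # e.g. {"eval_loss": 1.73, ...}
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"val loss: {eval_metrics['eval_loss']:.3f}  perplexity: {perplexity:.1f}")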
Summary
What You've Learned:
- Full fine-tuning updates all parameters (expensive)
- LoRA trains small adapter modules (10x cheaper)
- QLoRA combines quantization + LoRA (most practical)
- QLoRA: a few million trainable params, roughly 8-10GB VRAM for a 7B model
- Start with RAG/prompting. Fine-tune only when needed
What's Next?
In our next tutorial, Inference Optimization, we'll learn how to deploy models efficiently.
Excellent! You now have the skills to fine-tune any LLM on your data efficiently!
Test Your Knowledge
Q1: What is the main advantage of Parameter-Efficient Fine-Tuning (PEFT)?
Q2: What does LoRA (Low-Rank Adaptation) do?
Q3: Which fine-tuning approach updates ALL model parameters?
Q4: What is the purpose of the validation set during fine-tuning?
Q5: When should you consider full fine-tuning instead of PEFT?