🎯 Project Overview
In this project, you'll build a sentiment analysis classifier by fine-tuning BERT on the IMDb movie reviews dataset. You'll learn the complete workflow from data preparation to deployment.
What You'll Build
Data Pipeline
Load and preprocess 50k movie reviews. Split train/val/test sets properly.
Fine-tuned Model
Train BERT-base on sentiment classification. Achieve >90% accuracy.
Evaluation System
Measure accuracy, F1, precision, recall. Analyze errors and confusion matrix.
Deployment API
Serve model via FastAPI. Build simple web interface for real-time predictions.
📋 Prerequisites
- Python 3.8+ installed
- Basic PyTorch knowledge (tensors, models)
- GPU recommended (Colab T4 free tier works)
- Completed LLM tutorials 1-5
⏱️ Time Breakdown
- Setup: 10 minutes (install libraries, download data)
- Data Exploration: 15 minutes (understand dataset)
- Training: 30 minutes (fine-tune BERT)
- Evaluation: 15 minutes (test and analyze)
- Deployment: 20 minutes (create API)
🔧 Step 1: Environment Setup
Install Dependencies
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets accelerate evaluate scikit-learn
pip install fastapi uvicorn pandas numpy matplotlib seaborn
# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
Project Structure
bert-sentiment-classifier/
├── data/
│   └── imdb/                 # Downloaded dataset
├── models/
│   └── bert-sentiment/       # Saved model checkpoints
├── notebooks/
│   └── exploration.ipynb     # Data exploration
├── src/
│   ├── train.py              # Training script
│   ├── evaluate.py           # Evaluation script
│   ├── predict.py            # Inference
│   └── api.py                # FastAPI server
├── requirements.txt
└── README.md
💡 Using Google Colab?
If you don't have a GPU, use Google Colab (free T4 GPU). Go to Runtime → Change runtime type → GPU (T4). Training will take ~20 minutes instead of ~2 hours on CPU.
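A quick sanity check you can run in a Colab cell before training (a minimal sketch; it only confirms that a CUDA device is visible):
import torch

# Confirm the GPU runtime is active before starting training
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected - switch the runtime type to GPU before training")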
📊 Step 2: Data Preparation
Load IMDb Dataset
# data_preparation.py
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load IMDb dataset (50k reviews: 25k train, 25k test)
print("Loading IMDb dataset...")
dataset = load_dataset("imdb")
print(f"Train size: {len(dataset['train'])}")
print(f"Test size: {len(dataset['test'])}")
# Examine a sample
sample = dataset['train'][0]
print(f"\nSample review:")
print(f"Text: {sample['text'][:200]}...")
print(f"Label: {sample['label']} (0=negative, 1=positive)")
# Check label distribution
train_labels = [ex['label'] for ex in dataset['train']]
test_labels = [ex['label'] for ex in dataset['test']]
print(f"\nTrain distribution:")
print(f"Negative: {train_labels.count(0)} ({train_labels.count(0)/len(train_labels)*100:.1f}%)")
print(f"Positive: {train_labels.count(1)} ({train_labels.count(1)/len(train_labels)*100:.1f}%)")
# Visualize
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.countplot(x=train_labels)
plt.title('Train Set Label Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
lengths = [len(text.split()) for text in dataset['train']['text'][:1000]]  # word counts for the first 1,000 reviews
plt.hist(lengths, bins=50)
plt.title('Review Length Distribution (words)')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('data_exploration.png')
print("\nSaved visualization to data_exploration.png")
📊 Expected Output
Train size: 25000
Test size: 25000

Sample review:
Text: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked...
Label: 1 (0=negative, 1=positive)

Train distribution:
Negative: 12500 (50.0%)
Positive: 12500 (50.0%)
Create Train/Validation Split
# Split training data into train (80%) and validation (20%)
train_testvalid = dataset['train'].train_test_split(test_size=0.2, seed=42)
# Final splits
train_dataset = train_testvalid['train'] # 20k samples
val_dataset = train_testvalid['test'] # 5k samples
test_dataset = dataset['test'] # 25k samples
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")
⚠️ Common Mistake: Don't test on the training set! Always hold out a separate test set that the model never sees during training.
🔤 Step 3: Tokenization & Data Loading
Initialize Tokenizer
from transformers import AutoTokenizer
# Load BERT tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Test tokenization
sample_text = "This movie was absolutely fantastic! I loved every minute of it."
tokens = tokenizer(sample_text, truncation=True, padding=True, max_length=512)
print("Original text:", sample_text)
print("\nTokenized:")
print("Input IDs:", tokens['input_ids'][:10], "...")
print("Attention Mask:", tokens['attention_mask'][:10], "...")
# Decode back to text
decoded = tokenizer.decode(tokens['input_ids'])
print("\nDecoded:", decoded)
Tokenize Dataset
def tokenize_function(examples):
    """Tokenize a batch of texts"""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=512  # BERT's max sequence length
    )
# Tokenize all datasets (batched for speed)
print("Tokenizing datasets...")
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)
# Set format for PyTorch
tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_val.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
print("Tokenization complete!")
Create DataLoaders
from torch.utils.data import DataLoader
# Training: batch_size=16, shuffle=True
train_dataloader = DataLoader(tokenized_train, batch_size=16, shuffle=True)
# Validation/Test: batch_size=32, shuffle=False
val_dataloader = DataLoader(tokenized_val, batch_size=32, shuffle=False)
test_dataloader = DataLoader(tokenized_test, batch_size=32, shuffle=False)
print(f"Training batches: {len(train_dataloader)}")
print(f"Validation batches: {len(val_dataloader)}")
print(f"Test batches: {len(test_dataloader)}")
# Examine a batch
batch = next(iter(train_dataloader))
print(f"\nBatch keys: {batch.keys()}")
print(f"Input IDs shape: {batch['input_ids'].shape}") # [16, 512]
print(f"Labels shape: {batch['label'].shape}") # [16]
💡 Batch Size Guidelines
GPU memory (a quick way to check yours is sketched below):
- T4 (16GB): batch_size=16
- V100 (16GB): batch_size=24
- RTX 3090 (24GB): batch_size=32
- A100 (40GB): batch_size=64
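If you are not sure how much memory your card has, a small helper like this (an assumption-based sketch whose thresholds mirror the table above) picks a starting batch size:
import torch

# Suggest a starting per-device batch size from total GPU memory (rough thresholds)
batch_size = 8  # conservative fallback for CPU or unknown hardware
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if total_gb >= 40:
        batch_size = 64
    elif total_gb >= 24:
        batch_size = 32
    elif total_gb >= 16:
        batch_size = 24
    else:
        batch_size = 16
print(f"Suggested per-device batch size: {batch_size}")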
🚀 Step 4: Model Training
Initialize Model
from transformers import AutoModelForSequenceClassification
import torch
# Load pre-trained BERT with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # Binary classification (negative/positive)
    problem_type="single_label_classification"
)
# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(f"Model loaded on: {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
Training Configuration
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# Define evaluation metrics
def compute_metrics(eval_pred):
    """Compute accuracy, F1, precision, recall"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='binary'),
        'precision': precision_score(labels, predictions, average='binary'),
        'recall': recall_score(labels, predictions, average='binary')
    }
# Training arguments
training_args = TrainingArguments(
    output_dir='./models/bert-sentiment',
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    # Evaluation
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,  # Keep only the 2 most recent checkpoints
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    # Logging
    logging_dir='./logs',
    logging_steps=100,
    report_to="none",  # Disable wandb/tensorboard
    # Performance
    fp16=torch.cuda.is_available(),  # Mixed precision training (faster)
    dataloader_num_workers=4,
)
print("Training configuration:")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Mixed precision: {training_args.fp16}")
Train the Model
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)
# Start training
print("\n🚀 Starting training...")
print("This will take ~20 minutes on T4 GPU, ~2 hours on CPU\n")
train_result = trainer.train()
# Print training summary
print("\n✅ Training complete!")
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"Samples/second: {train_result.metrics['train_samples_per_second']:.2f}")
print(f"Final loss: {train_result.metrics['train_loss']:.4f}")
# Save model
trainer.save_model('./models/bert-sentiment-final')
tokenizer.save_pretrained('./models/bert-sentiment-final')
print("\nModel saved to ./models/bert-sentiment-final")
📊 Expected Training Output
🚀 Starting training...

Epoch 1/3
Step 100:  loss=0.3245, eval_acc=0.8720, eval_f1=0.8698
Step 500:  loss=0.2012, eval_acc=0.9140, eval_f1=0.9128
Epoch 1 complete: avg_loss=0.2234

Epoch 2/3
Step 1000: loss=0.1456, eval_acc=0.9280, eval_f1=0.9275
Step 1500: loss=0.1189, eval_acc=0.9320, eval_f1=0.9318
Epoch 2 complete: avg_loss=0.1398

Epoch 3/3
Step 2000: loss=0.0892, eval_acc=0.9345, eval_f1=0.9342
Step 2500: loss=0.0745, eval_acc=0.9360, eval_f1=0.9358
Epoch 3 complete: avg_loss=0.0856

✅ Training complete!
Training time: 1234.56 seconds
Final loss: 0.0856
Best F1 score: 0.9358
💡 Training Tips
- Overfitting? Add dropout, reduce epochs, or use more data
- Slow training? Enable fp16 mixed precision (roughly 2x faster)
- Out of memory? Reduce batch_size or use gradient accumulation (see the sketch below)
- Poor performance? Try a lower learning rate (1e-5) or more epochs
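One way to apply the out-of-memory tip: keep the effective batch size at 32 by accumulating gradients over several smaller steps. A minimal sketch that reuses the TrainingArguments import from above (all other arguments stay as before):
# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps = 32
low_memory_args = TrainingArguments(
    output_dir='./models/bert-sentiment',
    per_device_train_batch_size=8,   # fits on smaller GPUs
    gradient_accumulation_steps=4,   # accumulate 4 steps before each optimizer update
    learning_rate=2e-5,
    num_train_epochs=3,
)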
📊 Step 5: Model Evaluation
Evaluate on Test Set
# Evaluate on held-out test set
print("Evaluating on test set...")
test_results = trainer.evaluate(tokenized_test)
print("\n📊 Test Set Results:")
print(f"Accuracy: {test_results['eval_accuracy']:.4f}")
print(f"F1 Score: {test_results['eval_f1']:.4f}")
print(f"Precision: {test_results['eval_precision']:.4f}")
print(f"Recall: {test_results['eval_recall']:.4f}")
Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Get predictions
predictions = trainer.predict(tokenized_test)
pred_labels = np.argmax(predictions.predictions, axis=-1)
true_labels = predictions.label_ids
# Compute confusion matrix
cm = confusion_matrix(true_labels, pred_labels)
# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=['Negative', 'Positive']
)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - BERT Sentiment Classifier')
plt.savefig('confusion_matrix.png')
print("\nConfusion matrix saved to confusion_matrix.png")
# Calculate per-class metrics
tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives: {tn} ({tn/(tn+fp)*100:.1f}% of actual negatives)")
print(f"False Positives: {fp} ({fp/(tn+fp)*100:.1f}% of actual negatives)")
print(f"False Negatives: {fn} ({fn/(tp+fn)*100:.1f}% of actual positives)")
print(f"True Positives: {tp} ({tp/(tp+fn)*100:.1f}% of actual positives)")
Error Analysis
# Find misclassified examples
# Convert logits to probabilities so "confidence" is a value between 0 and 1
probs = torch.softmax(torch.tensor(predictions.predictions), dim=-1).numpy()
errors = []
for i, (pred, true) in enumerate(zip(pred_labels, true_labels)):
    if pred != true:
        errors.append({
            'index': i,
            'text': test_dataset[i]['text'],
            'true_label': true,
            'predicted_label': pred,
            'confidence': float(probs[i].max())
        })
print(f"\nTotal errors: {len(errors)} ({len(errors)/len(test_dataset)*100:.2f}%)")
# Show 5 most confident wrong predictions
errors_sorted = sorted(errors, key=lambda x: x['confidence'], reverse=True)
print("\n🔍 Top 5 Most Confident Errors:\n")
for i, error in enumerate(errors_sorted[:5], 1):
    true_label = 'Positive' if error['true_label'] == 1 else 'Negative'
    pred_label = 'Positive' if error['predicted_label'] == 1 else 'Negative'
    print(f"{i}. True: {true_label} | Predicted: {pred_label} (conf: {error['confidence']:.3f})")
    print(f"   Text: {error['text'][:150]}...")
    print()
Test on Custom Examples
def predict_sentiment(text, model, tokenizer):
    """Predict sentiment for a single text"""
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt', truncation=True,
                       padding=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # Predict
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=-1)
    # Get prediction
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][prediction].item()
    sentiment = 'Positive 😊' if prediction == 1 else 'Negative 😞'
    return {
        'sentiment': sentiment,
        'confidence': confidence,
        'positive_prob': probs[0][1].item(),
        'negative_prob': probs[0][0].item()
    }
# Test examples
test_reviews = [
    "This movie was absolutely amazing! Best film I've seen all year.",
    "Terrible waste of time. I want my money back.",
    "It was okay, not great but not terrible either.",
    "Brilliant acting and stunning cinematography. Highly recommend!",
    "Boring and predictable. Fell asleep halfway through."
]
print("🎬 Custom Review Predictions:\n")
for review in test_reviews:
    result = predict_sentiment(review, model, tokenizer)
    print(f"Review: {review}")
    print(f"Prediction: {result['sentiment']} (confidence: {result['confidence']:.3f})")
    print(f"Probabilities: Negative={result['negative_prob']:.3f}, Positive={result['positive_prob']:.3f}\n")
📊 Expected Results
Target Metrics (BERT-base on IMDb):
- Accuracy: 93-94%
- F1 Score: 93-94%
- Training time: ~20 minutes (T4 GPU)
🚀 Step 6: Deployment
Create FastAPI Server
# api.py - Production-ready API
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import logging
# Initialize logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Initialize FastAPI
app = FastAPI(title="BERT Sentiment Analysis API", version="1.0.0")
# Load model at startup
@app.on_event("startup")
async def load_model():
    global model, tokenizer, device
    logger.info("Loading model...")
    model_path = "./models/bert-sentiment-final"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    logger.info(f"Model loaded on {device}")
class ReviewRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    sentiment: str
    confidence: float
    positive_probability: float
    negative_probability: float
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: ReviewRequest):
    """Predict sentiment for a review"""
    try:
        # Validate input
        if not request.text or len(request.text.strip()) == 0:
            raise HTTPException(status_code=400, detail="Text cannot be empty")
        if len(request.text) > 5000:
            raise HTTPException(status_code=400, detail="Text too long (max 5000 chars)")
        # Tokenize
        inputs = tokenizer(
            request.text,
            return_tensors='pt',
            truncation=True,
            padding=True,
            max_length=512
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)
        # Extract results
        prediction = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][prediction].item()
        sentiment = 'positive' if prediction == 1 else 'negative'
        return PredictionResponse(
            sentiment=sentiment,
            confidence=confidence,
            positive_probability=probs[0][1].item(),
            negative_probability=probs[0][0].item()
        )
    except HTTPException:
        # Re-raise validation errors so clients see the 400, not a generic 500
        raise
    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail="Prediction failed")
@app.get("/health")
async def health():
    """Health check endpoint"""
    return {"status": "healthy", "model": "bert-base-uncased"}

@app.get("/")
async def root():
    """API info"""
    return {
        "name": "BERT Sentiment Analysis API",
        "version": "1.0.0",
        "endpoints": {
            "POST /predict": "Predict sentiment",
            "GET /health": "Health check",
            "GET /docs": "API documentation"
        }
    }
# Run: uvicorn api:app --host 0.0.0.0 --port 8000 --reload
Test the API
# Start server
uvicorn api:app --host 0.0.0.0 --port 8000
# In another terminal, test with curl
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text": "This movie was fantastic!"}'
# Expected response:
# {
# "sentiment": "positive",
# "confidence": 0.987,
# "positive_probability": 0.987,
# "negative_probability": 0.013
# }
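If you prefer Python to curl, a small client sketch using the requests library works too (it assumes the server above is running on localhost:8000):
# test_api.py - minimal Python client for the running API
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "This movie was fantastic!"},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g. {"sentiment": "positive", "confidence": ..., ...}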
Simple Web Interface
<!-- index.html - Simple web UI -->
<!DOCTYPE html>
<html>
<head>
<title>Sentiment Analyzer</title>
<style>
body { font-family: Arial; max-width: 600px; margin: 50px auto; padding: 20px; }
textarea { width: 100%; height: 150px; padding: 10px; font-size: 14px; }
button { background: #3b82f6; color: white; padding: 10px 20px; border: none;
border-radius: 5px; cursor: pointer; font-size: 16px; }
button:hover { background: #2563eb; }
#result { margin-top: 20px; padding: 20px; border-radius: 10px; display: none; }
.positive { background: #d1fae5; border: 2px solid #10b981; }
.negative { background: #fee2e2; border: 2px solid #ef4444; }
</style>
</head>
<body>
<h1>🎬 Movie Review Sentiment Analyzer</h1>
<p>Enter a movie review to analyze its sentiment:</p>
<textarea id="review" placeholder="Type your review here..."></textarea>
<br><br>
<button onclick="analyzeSentiment()">Analyze Sentiment</button>
<div id="result"></div>
<script>
async function analyzeSentiment() {
const text = document.getElementById('review').value;
if (!text.trim()) {
alert('Please enter a review');
return;
}
const response = await fetch('http://localhost:8000/predict', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: text })
});
const data = await response.json();
const resultDiv = document.getElementById('result');
resultDiv.className = data.sentiment;
resultDiv.style.display = 'block';
const emoji = data.sentiment === 'positive' ? '😊' : '😞';
const sentiment = data.sentiment.charAt(0).toUpperCase() + data.sentiment.slice(1);
resultDiv.innerHTML = `
<h2>${emoji} ${sentiment} Sentiment</h2>
<p>Confidence: ${(data.confidence * 100).toFixed(1)}%</p>
<p>Positive: ${(data.positive_probability * 100).toFixed(1)}%</p>
<p>Negative: ${(data.negative_probability * 100).toFixed(1)}%</p>
`;
}
</script>
</body>
</html>
🚀 Deployment Options
- Local: uvicorn api:app (development); a programmatic alternative is sketched below
- Docker: Containerize with Dockerfile
- Cloud: Deploy to AWS, GCP, Azure (with GPU)
- Serverless: AWS Lambda + API Gateway (small models)
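For local development you can also start the server from Python instead of the CLI; a minimal sketch (it assumes api.py from above sits in the same directory):
# serve.py - start the API programmatically (development only)
import uvicorn

if __name__ == "__main__":
    uvicorn.run("api:app", host="0.0.0.0", port=8000, reload=True)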
🎯 Step 7: Improvements & Extensions
Model Improvements
Use Larger Model
Try BERT-large or RoBERTa for +1-2% accuracy; training takes 2-3x longer (see the sketch after this list).
More Data
Combine IMDb with Yelp, Amazon reviews. More diverse data = better generalization.
Hyperparameter Tuning
Grid search: learning rates (1e-5, 2e-5, 5e-5), batch sizes, dropout rates.
Multi-class
Expand to 5-star ratings instead of binary (negative/positive).
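For the larger-model and multi-class ideas, only the model name and label count change; the rest of the training code stays the same. A sketch (the 5-label setup assumes a dataset with 0-4 star labels, such as Yelp reviews; IMDb itself is binary):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Swap in a stronger backbone - tokenizer and Trainer code are unchanged
model_name = "roberta-base"  # or "bert-large-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# For 5-star ratings, change the label count (requires a dataset with labels 0-4)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)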
Production Enhancements
# 1. Add caching for repeated queries
from functools import lru_cache
import hashlib

@lru_cache(maxsize=1000)
def predict_cached(text_hash):
    # Cache predictions by text hash (hash long reviews with hashlib first)
    pass

# 2. Batch prediction endpoint
from typing import List

@app.post("/predict_batch")
async def predict_batch(reviews: List[str]):
    # Process multiple reviews at once
    pass
# 3. Model versioning
@app.get("/model_info")
async def model_info():
    return {
        "model": "bert-base-uncased",
        "fine_tuned_on": "IMDb",
        "version": "1.0.0",
        "accuracy": 0.934
    }
# 4. Rate limiting
from slowapi import Limiter
limiter = Limiter(key_func=lambda: "global")

@app.post("/predict")
@limiter.limit("10/minute")
async def predict(...):
    # Limit to 10 requests per minute
    pass
🏆 Challenge Extensions
- Multi-lingual: Fine-tune mBERT on reviews in Spanish, French, etc.
- Aspect-based: Classify sentiment per aspect (acting, plot, visuals)
- Zero-shot: Compare with GPT-3.5 zero-shot (no training); a free local baseline is sketched below
- Distillation: Compress to DistilBERT (2x faster, 40% smaller)
- Real-time: Deploy with WebSockets for streaming predictions
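For the zero-shot comparison, one baseline that needs no API key (not GPT-3.5, but a free stand-in) is the transformers zero-shot pipeline backed by an NLI model; a minimal sketch:
from transformers import pipeline

# Zero-shot sentiment with an NLI model - no fine-tuning involved
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "This movie was absolutely amazing! Best film I've seen all year.",
    candidate_labels=["positive", "negative"],
)
print(result["labels"][0], round(result["scores"][0], 3))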
📝 Complete Code Summary
Key Files Created
- data_preparation.py: Load and explore IMDb dataset
- train.py: Fine-tune BERT on sentiment classification
- evaluate.py: Test model and analyze errors
- api.py: FastAPI server for predictions
- index.html: Simple web interface
What You Learned
- ✅ Load and preprocess text datasets with HuggingFace
- ✅ Tokenize text for BERT models
- ✅ Fine-tune pre-trained models with Trainer API
- ✅ Evaluate with accuracy, F1, precision, recall
- ✅ Analyze errors with confusion matrix
- ✅ Deploy model via FastAPI
- ✅ Build web interface for predictions
📊 Expected Final Results
| Metric | Value | Notes |
|---|---|---|
| Test Accuracy | 93-94% | Strong result for BERT-base (larger models score higher) |
| F1 Score | 93-94% | Balanced performance |
| Training Time | ~20 minutes | On T4 GPU (Colab free tier) |
| Inference Time | ~50ms | Per review on GPU |
| Model Size | ~440MB | BERT-base parameters |
🎉 Congratulations! You've built a production-ready sentiment classifier from scratch. You can now:
- Fine-tune any HuggingFace model on any classification task
- Deploy ML models via REST APIs
- Evaluate model performance with proper metrics
- Build end-to-end ML projects independently
📚 Resources & Next Steps
Code Repository
Full project code available at: github.com/your-repo/bert-sentiment-classifier
Further Reading
Next Projects
- Project 2: Build a RAG Chatbot with vector search
- Project 3: Deploy a Fine-tuned LLM at scale
Test Your Knowledge
Q1: What library is commonly used for fine-tuning transformer models?
Q2: What is the purpose of tokenization in BERT fine-tuning?
Q3: Which metric is commonly used for evaluating classification models?
Q4: What happens during the training loop?
Q5: Why do we use a validation set during fine-tuning?