šŸš€ HANDS-ON PROJECT

Project 1: Fine-tune BERT for Sentiment Classification

Build a complete sentiment analysis system from scratch: train, evaluate, and deploy a production-ready classifier.

ā±ļø 90-120 minutes šŸ”§ Hands-on Project šŸ’» Python, HuggingFace, PyTorch

šŸŽÆ Project Overview

In this project, you'll build a sentiment analysis classifier by fine-tuning BERT on the IMDb movie reviews dataset. You'll learn the complete workflow from data preparation to deployment.

What You'll Build

šŸ“Š Data Pipeline

Load and preprocess 50k movie reviews. Split train/val/test sets properly.

🧠 Fine-tuned Model

Train BERT-base on sentiment classification. Achieve >90% accuracy.

šŸ“ˆ Evaluation System

Measure accuracy, F1, precision, recall. Analyze errors and confusion matrix.

šŸš€ Deployment API

Serve model via FastAPI. Build simple web interface for real-time predictions.

šŸ“š Prerequisites

  • Python 3.8+ installed
  • Basic PyTorch knowledge (tensors, models)
  • GPU recommended (Colab T4 free tier works)
  • Completed LLM tutorials 1-5

ā±ļø Time Breakdown

  • Setup: 10 minutes (install libraries, download data)
  • Data Exploration: 15 minutes (understand dataset)
  • Training: 30 minutes (fine-tune BERT)
  • Evaluation: 15 minutes (test and analyze)
  • Deployment: 20 minutes (create API)

šŸ”§ Step 1: Environment Setup

Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets accelerate evaluate scikit-learn
pip install fastapi uvicorn pandas numpy matplotlib seaborn

# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"

Project Structure

bert-sentiment-classifier/
ā”œā”€ā”€ data/
│   └── imdb/                 # Downloaded dataset
ā”œā”€ā”€ models/
│   └── bert-sentiment/       # Saved model checkpoints
ā”œā”€ā”€ notebooks/
│   └── exploration.ipynb     # Data exploration
ā”œā”€ā”€ src/
│   ā”œā”€ā”€ train.py             # Training script
│   ā”œā”€ā”€ evaluate.py          # Evaluation script
│   ā”œā”€ā”€ predict.py           # Inference
│   └── api.py               # FastAPI server
ā”œā”€ā”€ requirements.txt
└── README.md
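
For reference, a requirements.txt matching the install commands above might look like the following (versions left unpinned here; pin them for reproducible builds). torchvision and torchaudio from the install command are not imported anywhere in this project, so they are omitted.

# requirements.txt (unpinned sketch)
torch
transformers
datasets
accelerate
evaluate
scikit-learn
fastapi
uvicorn
pandas
numpy
matplotlib
seaborn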

šŸ’” Using Google Colab?

If you don't have a GPU, use Google Colab (free T4 GPU). Go to Runtime → Change runtime type → GPU (T4). Training will take ~20 minutes instead of 2 hours on CPU.
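
Either way, a quick sanity check that PyTorch actually sees the GPU can save you from an unintended CPU-only run:

# GPU sanity check (works in a Colab cell or a local Python session)
import torch

if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected - training will fall back to CPU and be much slower")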

šŸ“Š Step 2: Data Preparation

Load IMDb Dataset

# data_preparation.py
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load IMDb dataset (50k reviews: 25k train, 25k test)
print("Loading IMDb dataset...")
dataset = load_dataset("imdb")

print(f"Train size: {len(dataset['train'])}")
print(f"Test size: {len(dataset['test'])}")

# Examine a sample
sample = dataset['train'][0]
print(f"\nSample review:")
print(f"Text: {sample['text'][:200]}...")
print(f"Label: {sample['label']} (0=negative, 1=positive)")

# Check label distribution
train_labels = [ex['label'] for ex in dataset['train']]
test_labels = [ex['label'] for ex in dataset['test']]

print(f"\nTrain distribution:")
print(f"Negative: {train_labels.count(0)} ({train_labels.count(0)/len(train_labels)*100:.1f}%)")
print(f"Positive: {train_labels.count(1)} ({train_labels.count(1)/len(train_labels)*100:.1f}%)")

# Visualize
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
sns.countplot(x=train_labels)
plt.title('Train Set Label Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')

plt.subplot(1, 2, 2)
# Slicing a Dataset returns a dict of columns, so index the 'text' column first
lengths = [len(text.split()) for text in dataset['train'][:1000]['text']]
plt.hist(lengths, bins=50)
plt.title('Review Length Distribution (words)')
plt.xlabel('Word Count')
plt.ylabel('Frequency')

plt.tight_layout()
plt.savefig('data_exploration.png')
print("\nSaved visualization to data_exploration.png")

šŸ“ˆ Expected Output

Train size: 25000
Test size: 25000

Sample review:
Text: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked...

Label: 1 (0=negative, 1=positive)

Train distribution:
Negative: 12500 (50.0%)
Positive: 12500 (50.0%)

Create Train/Validation Split

# Split training data into train (80%) and validation (20%)
train_testvalid = dataset['train'].train_test_split(test_size=0.2, seed=42)

# Final splits
train_dataset = train_testvalid['train']  # 20k samples
val_dataset = train_testvalid['test']     # 5k samples
test_dataset = dataset['test']            # 25k samples

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

āš ļø Common Mistake: Don't test on the training set! Always hold out a separate test set that the model never sees during training.

🧠 Step 3: Tokenization & Data Loading

Initialize Tokenizer

from transformers import AutoTokenizer

# Load BERT tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Test tokenization
sample_text = "This movie was absolutely fantastic! I loved every minute of it."
tokens = tokenizer(sample_text, truncation=True, padding=True, max_length=512)

print("Original text:", sample_text)
print("\nTokenized:")
print("Input IDs:", tokens['input_ids'][:10], "...")
print("Attention Mask:", tokens['attention_mask'][:10], "...")

# Decode back to text
decoded = tokenizer.decode(tokens['input_ids'])
print("\nDecoded:", decoded)

Tokenize Dataset

def tokenize_function(examples):
    """Tokenize a batch of texts"""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=512  # BERT's max sequence length
    )

# Tokenize all datasets (batched for speed)
print("Tokenizing datasets...")
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_val.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

print("Tokenization complete!")

Create DataLoaders

from torch.utils.data import DataLoader

# Training: batch_size=16, shuffle=True
train_dataloader = DataLoader(tokenized_train, batch_size=16, shuffle=True)

# Validation/Test: batch_size=32, shuffle=False
val_dataloader = DataLoader(tokenized_val, batch_size=32, shuffle=False)
test_dataloader = DataLoader(tokenized_test, batch_size=32, shuffle=False)

print(f"Training batches: {len(train_dataloader)}")
print(f"Validation batches: {len(val_dataloader)}")
print(f"Test batches: {len(test_dataloader)}")

# Examine a batch
batch = next(iter(train_dataloader))
print(f"\nBatch keys: {batch.keys()}")
print(f"Input IDs shape: {batch['input_ids'].shape}")  # [16, 512]
print(f"Labels shape: {batch['label'].shape}")  # [16]

šŸ’” Batch Size Guidelines

GPU Memory:

  • 16GB (T4): batch_size=16
  • 16GB (V100): batch_size=24
  • 24GB (RTX 3090): batch_size=32
  • 40GB (A100): batch_size=64
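
If your GPU has less memory than these guidelines assume, gradient accumulation lets you keep the same effective batch size with a smaller per-step batch. A minimal sketch using the TrainingArguments introduced in Step 4 (the specific numbers are illustrative):

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
# Here 8 * 2 = 16, matching the batch size above with roughly half the memory per step
training_args = TrainingArguments(
    output_dir='./models/bert-sentiment',
    num_train_epochs=3,
    per_device_train_batch_size=8,    # smaller per-step batch to fit in memory
    gradient_accumulation_steps=2,    # accumulate gradients over 2 steps before updating
    learning_rate=2e-5,
    fp16=torch.cuda.is_available(),
)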

šŸŽ“ Step 4: Model Training

Initialize Model

from transformers import AutoModelForSequenceClassification
import torch

# Load pre-trained BERT with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # Binary classification (negative/positive)
    problem_type="single_label_classification"
)

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(f"Model loaded on: {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

Training Configuration

from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Define evaluation metrics
def compute_metrics(eval_pred):
    """Compute accuracy, F1, precision, recall"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='binary'),
        'precision': precision_score(labels, predictions, average='binary'),
        'recall': recall_score(labels, predictions, average='binary')
    }

# Training arguments
training_args = TrainingArguments(
    output_dir='./models/bert-sentiment',
    
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    
    # Evaluation
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,  # Keep at most 2 checkpoints on disk
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    
    # Logging
    logging_dir='./logs',
    logging_steps=100,
    report_to="none",  # Disable wandb/tensorboard
    
    # Performance
    fp16=torch.cuda.is_available(),  # Mixed precision training (faster)
    dataloader_num_workers=4,
)

print("Training configuration:")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Mixed precision: {training_args.fp16}")

Train the Model

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)

# Start training
print("\nšŸš€ Starting training...")
print("This will take ~20 minutes on T4 GPU, ~2 hours on CPU\n")

train_result = trainer.train()

# Print training summary
print("\nāœ… Training complete!")
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"Samples/second: {train_result.metrics['train_samples_per_second']:.2f}")
print(f"Final loss: {train_result.metrics['train_loss']:.4f}")

# Save model
trainer.save_model('./models/bert-sentiment-final')
tokenizer.save_pretrained('./models/bert-sentiment-final')
print("\nModel saved to ./models/bert-sentiment-final")

šŸ“Š Expected Training Output

šŸš€ Starting training...

Epoch 1/3
Step   100: loss=0.3245, eval_acc=0.8720, eval_f1=0.8698
Step   500: loss=0.2012, eval_acc=0.9140, eval_f1=0.9128
Epoch 1 complete: avg_loss=0.2234

Epoch 2/3
Step  1000: loss=0.1456, eval_acc=0.9280, eval_f1=0.9275
Step  1500: loss=0.1189, eval_acc=0.9320, eval_f1=0.9318
Epoch 2 complete: avg_loss=0.1398

Epoch 3/3
Step  2000: loss=0.0892, eval_acc=0.9345, eval_f1=0.9342
Step  2500: loss=0.0745, eval_acc=0.9360, eval_f1=0.9358
Epoch 3 complete: avg_loss=0.0856

āœ… Training complete!
Training time: 1234.56 seconds
Final loss: 0.0856
Best F1 score: 0.9358

šŸ’” Training Tips

  • Overfitting? Add dropout, reduce epochs, or use more data
  • Slow training? Enable fp16 mixed precision (2x faster)
  • Out of memory? Reduce batch_size or use gradient accumulation
  • Poor performance? Try lower learning rate (1e-5) or more epochs
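
If the validation metrics start degrading while the training loss keeps falling, one option is early stopping via the Trainer's built-in callback. A sketch (it relies on load_best_model_at_end=True and metric_for_best_model='f1', both already set above):

from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
    # Stop if eval F1 fails to improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)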

šŸ“ˆ Step 5: Model Evaluation

Evaluate on Test Set

# Evaluate on held-out test set
print("Evaluating on test set...")
test_results = trainer.evaluate(tokenized_test)

print("\nšŸ“Š Test Set Results:")
print(f"Accuracy:  {test_results['eval_accuracy']:.4f}")
print(f"F1 Score:  {test_results['eval_f1']:.4f}")
print(f"Precision: {test_results['eval_precision']:.4f}")
print(f"Recall:    {test_results['eval_recall']:.4f}")

Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Get predictions
predictions = trainer.predict(tokenized_test)
pred_labels = np.argmax(predictions.predictions, axis=-1)
true_labels = predictions.label_ids

# Compute confusion matrix
cm = confusion_matrix(true_labels, pred_labels)

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=['Negative', 'Positive']
)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - BERT Sentiment Classifier')
plt.savefig('confusion_matrix.png')
print("\nConfusion matrix saved to confusion_matrix.png")

# Calculate per-class metrics
tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives:  {tn} ({tn/(tn+fp)*100:.1f}%)")
print(f"False Positives: {fp} ({fp/(tn+fp)*100:.1f}%)")
print(f"False Negatives: {fn} ({fn/(tp+fn)*100:.1f}%)")
print(f"True Positives:  {tp} ({tp/(tp+fn)*100:.1f}%)")

Error Analysis

# Find misclassified examples
# Convert raw logits to probabilities so the "confidence" values are interpretable
logits = predictions.predictions
logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

errors = []
for i, (pred, true) in enumerate(zip(pred_labels, true_labels)):
    if pred != true:
        errors.append({
            'index': i,
            'text': test_dataset[i]['text'],
            'true_label': true,
            'predicted_label': pred,
            'confidence': probs[i][pred]
        })

print(f"\nTotal errors: {len(errors)} ({len(errors)/len(test_dataset)*100:.2f}%)")

# Show 5 most confident wrong predictions
errors_sorted = sorted(errors, key=lambda x: x['confidence'], reverse=True)

print("\nšŸ” Top 5 Most Confident Errors:\n")
for i, error in enumerate(errors_sorted[:5], 1):
    true_label = 'Positive' if error['true_label'] == 1 else 'Negative'
    pred_label = 'Positive' if error['predicted_label'] == 1 else 'Negative'
    
    print(f"{i}. True: {true_label} | Predicted: {pred_label} (conf: {error['confidence']:.3f})")
    print(f"   Text: {error['text'][:150]}...")
    print()

Test on Custom Examples

def predict_sentiment(text, model, tokenizer):
    """Predict sentiment for a single text"""
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt', truncation=True, 
                      padding=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Predict
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=-1)
    
    # Get prediction
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][prediction].item()
    
    sentiment = 'Positive 😊' if prediction == 1 else 'Negative šŸ˜ž'
    
    return {
        'sentiment': sentiment,
        'confidence': confidence,
        'positive_prob': probs[0][1].item(),
        'negative_prob': probs[0][0].item()
    }

# Test examples
test_reviews = [
    "This movie was absolutely amazing! Best film I've seen all year.",
    "Terrible waste of time. I want my money back.",
    "It was okay, not great but not terrible either.",
    "Brilliant acting and stunning cinematography. Highly recommend!",
    "Boring and predictable. Fell asleep halfway through."
]

print("šŸŽ¬ Custom Review Predictions:\n")
for review in test_reviews:
    result = predict_sentiment(review, model, tokenizer)
    print(f"Review: {review}")
    print(f"Prediction: {result['sentiment']} (confidence: {result['confidence']:.3f})")
    print(f"Probabilities: Negative={result['negative_prob']:.3f}, Positive={result['positive_prob']:.3f}\n")

šŸ“Š Expected Results

Target Metrics (BERT-base on IMDb):

  • Accuracy: 93-94%
  • F1 Score: 93-94%
  • Training time: ~20 minutes (T4 GPU)

šŸš€ Step 6: Deployment

Create FastAPI Server

# api.py - Production-ready API
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import logging

# Initialize logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI
app = FastAPI(title="BERT Sentiment Analysis API", version="1.0.0")

# Load model at startup
@app.on_event("startup")
async def load_model():
    global model, tokenizer, device
    
    logger.info("Loading model...")
    model_path = "./models/bert-sentiment-final"
    
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    
    logger.info(f"Model loaded on {device}")

class ReviewRequest(BaseModel):
    text: str
    
class PredictionResponse(BaseModel):
    sentiment: str
    confidence: float
    positive_probability: float
    negative_probability: float

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: ReviewRequest):
    """Predict sentiment for a review"""
    try:
        # Validate input
        if not request.text or len(request.text.strip()) == 0:
            raise HTTPException(status_code=400, detail="Text cannot be empty")
        
        if len(request.text) > 5000:
            raise HTTPException(status_code=400, detail="Text too long (max 5000 chars)")
        
        # Tokenize
        inputs = tokenizer(
            request.text,
            return_tensors='pt',
            truncation=True,
            padding=True,
            max_length=512
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)
        
        # Extract results
        prediction = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][prediction].item()
        
        sentiment = 'positive' if prediction == 1 else 'negative'
        
        return PredictionResponse(
            sentiment=sentiment,
            confidence=confidence,
            positive_probability=probs[0][1].item(),
            negative_probability=probs[0][0].item()
        )
        
    except HTTPException:
        # Let validation errors (400s) pass through instead of becoming 500s
        raise
    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail="Prediction failed")

@app.get("/health")
async def health():
    """Health check endpoint"""
    return {"status": "healthy", "model": "bert-base-uncased"}

@app.get("/")
async def root():
    """API info"""
    return {
        "name": "BERT Sentiment Analysis API",
        "version": "1.0.0",
        "endpoints": {
            "POST /predict": "Predict sentiment",
            "GET /health": "Health check",
            "GET /docs": "API documentation"
        }
    }

# Run: uvicorn api:app --host 0.0.0.0 --port 8000 --reload

Test the API

# Start server
uvicorn api:app --host 0.0.0.0 --port 8000

# In another terminal, test with curl
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "This movie was fantastic!"}'

# Expected response:
# {
#   "sentiment": "positive",
#   "confidence": 0.987,
#   "positive_probability": 0.987,
#   "negative_probability": 0.013
# }
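
The same check from Python, using the requests library (not in the install list above, so pip install requests if needed; assumes the server is running on localhost:8000):

# test_api.py - minimal client-side check
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "This movie was fantastic!"},
    timeout=10,
)
print(response.status_code)   # 200
print(response.json())        # {'sentiment': 'positive', 'confidence': ..., ...}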

Simple Web Interface

<!-- index.html - Simple web UI -->
<!DOCTYPE html>
<html>
<head>
    <title>Sentiment Analyzer</title>
    <style>
        body { font-family: Arial; max-width: 600px; margin: 50px auto; padding: 20px; }
        textarea { width: 100%; height: 150px; padding: 10px; font-size: 14px; }
        button { background: #3b82f6; color: white; padding: 10px 20px; border: none; 
                 border-radius: 5px; cursor: pointer; font-size: 16px; }
        button:hover { background: #2563eb; }
        #result { margin-top: 20px; padding: 20px; border-radius: 10px; display: none; }
        .positive { background: #d1fae5; border: 2px solid #10b981; }
        .negative { background: #fee2e2; border: 2px solid #ef4444; }
    </style>
</head>
<body>
    <h1>šŸŽ¬ Movie Review Sentiment Analyzer</h1>
    <p>Enter a movie review to analyze its sentiment:</p>
    
    <textarea id="review" placeholder="Type your review here..."></textarea>
    <br><br>
    <button onclick="analyzeSentiment()">Analyze Sentiment</button>
    
    <div id="result"></div>
    
    <script>
        async function analyzeSentiment() {
            const text = document.getElementById('review').value;
            
            if (!text.trim()) {
                alert('Please enter a review');
                return;
            }
            
            const response = await fetch('http://localhost:8000/predict', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ text: text })
            });
            
            const data = await response.json();
            
            const resultDiv = document.getElementById('result');
            resultDiv.className = data.sentiment;
            resultDiv.style.display = 'block';
            
            const emoji = data.sentiment === 'positive' ? '😊' : 'šŸ˜ž';
            const sentiment = data.sentiment.charAt(0).toUpperCase() + data.sentiment.slice(1);
            
            resultDiv.innerHTML = `
                <h2>${emoji} ${sentiment} Sentiment</h2>
                <p>Confidence: ${(data.confidence * 100).toFixed(1)}%</p>
                <p>Positive: ${(data.positive_probability * 100).toFixed(1)}%</p>
                <p>Negative: ${(data.negative_probability * 100).toFixed(1)}%</p>
            `;
        }
    </script>
</body>
</html>
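
One practical note: if you open index.html straight from disk or serve it from a different origin, the browser will block the fetch to localhost:8000 unless the API sends CORS headers. A minimal (development-only) fix is FastAPI's CORS middleware, added near the top of api.py:

# api.py (addition) - allow the web page to call the API from another origin during development
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],   # fine for local testing; restrict to your real domain in production
    allow_methods=["*"],
    allow_headers=["*"],
)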

šŸš€ Deployment Options

  • Local: uvicorn api:app (development)
  • Docker: Containerize with a Dockerfile (see the sketch below)
  • Cloud: Deploy to AWS, GCP, Azure (with GPU)
  • Serverless: AWS Lambda + API Gateway (small models)
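
For the Docker option, a minimal Dockerfile sketch (it assumes the fine-tuned model was saved to models/bert-sentiment-final and that requirements.txt lists the packages from Step 1):

# Dockerfile - minimal sketch for CPU inference
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/api.py .
COPY models/bert-sentiment-final ./models/bert-sentiment-final

EXPOSE 8000
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]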

šŸŽÆ Step 7: Improvements & Extensions

Model Improvements

⚔ Use a Larger Model

Try BERT-large or RoBERTa for +1-2% accuracy. Training takes 2-3x longer.

šŸ“š More Data

Combine IMDb with Yelp and Amazon reviews. More diverse data means better generalization.

šŸ”§ Hyperparameter Tuning

Grid search over learning rates (1e-5, 2e-5, 5e-5), batch sizes, and dropout rates. A sweep sketch follows below.

šŸŽ­ Multi-class

Expand to 5-star ratings instead of binary (negative/positive).
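
The grid search mentioned above can be as simple as a loop that retrains with each candidate learning rate and keeps the best validation F1. A sketch reusing the objects from Steps 3-4 (the candidate values and directory names are illustrative):

# Minimal learning-rate sweep using the existing datasets and metrics
best_f1, best_lr = 0.0, None

for lr in [1e-5, 2e-5, 5e-5]:
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args = TrainingArguments(
        output_dir=f'./models/sweep-lr-{lr}',
        num_train_epochs=2,
        per_device_train_batch_size=16,
        learning_rate=lr,
        evaluation_strategy="epoch",
        report_to="none",
        fp16=torch.cuda.is_available(),
    )
    trainer = Trainer(model=model, args=args, train_dataset=tokenized_train,
                      eval_dataset=tokenized_val, compute_metrics=compute_metrics)
    trainer.train()
    f1 = trainer.evaluate()['eval_f1']
    if f1 > best_f1:
        best_f1, best_lr = f1, lr

print(f"Best learning rate: {best_lr} (validation F1 = {best_f1:.4f})")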

Production Enhancements

# 1. Add caching for repeated queries
from functools import lru_cache
import hashlib

@lru_cache(maxsize=1000)
def predict_cached(text_hash):
    # Cache predictions by text hash
    pass

# 2. Batch prediction endpoint
from typing import List

@app.post("/predict_batch")
async def predict_batch(reviews: List[str]):
    # Process multiple reviews at once (tokenize as one batch for better GPU utilization)
    pass

# 3. Model versioning
@app.get("/model_info")
async def model_info():
    return {
        "model": "bert-base-uncased",
        "fine_tuned_on": "IMDb",
        "version": "1.0.0",
        "accuracy": 0.934
    }

# 4. Rate limiting (using the slowapi package)
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # rate-limit per client IP
app.state.limiter = limiter

@app.post("/predict")
@limiter.limit("10/minute")
async def predict(request: Request, review: ReviewRequest):
    # Limit to 10 requests per minute; slowapi requires the `request` argument
    pass

šŸ† Challenge Extensions

  1. Multi-lingual: Fine-tune mBERT on reviews in Spanish, French, etc.
  2. Aspect-based: Classify sentiment per aspect (acting, plot, visuals)
  3. Zero-shot: Compare with GPT-3.5 zero-shot (no training)
  4. Distillation: Compress to DistilBERT (~40% smaller, ~60% faster; see the sketch after this list)
  5. Real-time: Deploy with WebSockets for streaming predictions
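
For the distillation extension, the simplest starting point is to fine-tune a distilled checkpoint directly by swapping the model name; the rest of the pipeline (tokenization, Trainer, API) stays the same. Proper teacher-student distillation of your fine-tuned BERT is a further step beyond this sketch:

# Drop-in swap to a smaller checkpoint; re-run Steps 3-4 unchanged afterwards
model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)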

šŸ“‹ Complete Code Summary

Key Files Created

  • data_preparation.py: Load and explore IMDb dataset
  • train.py: Fine-tune BERT on sentiment classification
  • evaluate.py: Test model and analyze errors
  • api.py: FastAPI server for predictions
  • index.html: Simple web interface

What You Learned

  • āœ… Load and preprocess text datasets with HuggingFace
  • āœ… Tokenize text for BERT models
  • āœ… Fine-tune pre-trained models with Trainer API
  • āœ… Evaluate with accuracy, F1, precision, recall
  • āœ… Analyze errors with confusion matrix
  • āœ… Deploy model via FastAPI
  • āœ… Build web interface for predictions

šŸ“Š Expected Final Results

  • Test Accuracy: 93-94% (typical for BERT-base on IMDb)
  • F1 Score: 93-94% (balanced performance across classes)
  • Training Time: ~20 minutes on a T4 GPU (Colab free tier)
  • Inference Time: ~50ms per review on GPU
  • Model Size: ~440MB (BERT-base, ~110M parameters in fp32)

šŸŽ‰ Congratulations! You've built a production-ready sentiment classifier from scratch. You can now:

  • Fine-tune any HuggingFace model on any classification task
  • Deploy ML models via REST APIs
  • Evaluate model performance with proper metrics
  • Build end-to-end ML projects independently

šŸ”— Resources & Next Steps

Code Repository

Full project code available at: github.com/your-repo/bert-sentiment-classifier

Next Projects

  • Project 2: Build a RAG Chatbot with vector search
  • Project 3: Deploy a Fine-tuned LLM at scale

Test Your Knowledge

Q1: What library is commonly used for fine-tuning transformer models?

  • NumPy
  • Pandas
  • Hugging Face Transformers
  • Matplotlib

Q2: What is the purpose of tokenization in BERT fine-tuning?

  • To increase model size
  • To convert text into numerical tokens that the model can process
  • To remove stopwords
  • To translate text

Q3: Which metric is commonly used for evaluating classification models?

  • Only accuracy
  • Only loss
  • Only perplexity
  • Accuracy, precision, recall, and F1-score

Q4: What happens during the training loop?

  • Forward pass, loss calculation, backward pass, and parameter updates
  • Only data loading
  • Only model evaluation
  • Only tokenization

Q5: Why do we use a validation set during fine-tuning?

  • To increase training speed
  • To reduce model size
  • To monitor performance and detect overfitting
  • To generate more data