🎯 Project Overview
In this project, you'll build a sentiment analysis classifier by fine-tuning BERT on the IMDb movie reviews dataset. You'll learn the complete workflow from data preparation to deployment.
What You'll Build
Data Pipeline
Load and preprocess 50k movie reviews. Split train/val/test sets properly.
Fine-tuned Model
Train BERT-base on sentiment classification. Achieve >90% accuracy.
Evaluation System
Measure accuracy, F1, precision, recall. Analyze errors and confusion matrix.
Deployment API
Serve model via FastAPI. Build simple web interface for real-time predictions.
📋 Prerequisites
- Python 3.8+ installed
- Basic PyTorch knowledge (tensors, models)
- GPU recommended (Colab T4 free tier works)
- Completed LLM tutorials 1-5
⏱️ Time Breakdown
- Setup: 10 minutes (install libraries, download data)
- Data Exploration: 15 minutes (understand dataset)
- Training: 30 minutes (fine-tune BERT)
- Evaluation: 15 minutes (test and analyze)
- Deployment: 20 minutes (create API)
🔧 Step 1: Environment Setup
Install Dependencies
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets accelerate evaluate scikit-learn
pip install fastapi uvicorn pandas numpy matplotlib seaborn
# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
Project Structure
bert-sentiment-classifier/
├── data/
│   └── imdb/                 # Downloaded dataset
├── models/
│   └── bert-sentiment/       # Saved model checkpoints
├── notebooks/
│   └── exploration.ipynb     # Data exploration
├── src/
│   ├── train.py              # Training script
│   ├── evaluate.py           # Evaluation script
│   ├── predict.py            # Inference
│   └── api.py                # FastAPI server
├── requirements.txt
└── README.md
💡 Using Google Colab?
If you don't have a GPU, use Google Colab (free T4 GPU). Go to Runtime → Change runtime type → GPU (T4). Training will take ~20 minutes instead of ~2 hours on CPU.
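A quick sanity check you can run in a Colab cell before training (a minimal sketch; it only confirms that a CUDA device is visible):
import torch

# Confirm the GPU runtime is active before starting training
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected - switch the runtime type to GPU before training")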
📊 Step 2: Data Preparation
Load IMDb Dataset
# data_preparation.py
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load IMDb dataset (50k reviews: 25k train, 25k test)
print("Loading IMDb dataset...")
dataset = load_dataset("imdb")
print(f"Train size: {len(dataset['train'])}")
print(f"Test size: {len(dataset['test'])}")
# Examine a sample
sample = dataset['train'][0]
print(f"\nSample review:")
print(f"Text: {sample['text'][:200]}...")
print(f"Label: {sample['label']} (0=negative, 1=positive)")
# Check label distribution
train_labels = [ex['label'] for ex in dataset['train']]
test_labels = [ex['label'] for ex in dataset['test']]
print(f"\nTrain distribution:")
print(f"Negative: {train_labels.count(0)} ({train_labels.count(0)/len(train_labels)*100:.1f}%)")
print(f"Positive: {train_labels.count(1)} ({train_labels.count(1)/len(train_labels)*100:.1f}%)")
# Visualize
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.countplot(x=train_labels)
plt.title('Train Set Label Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
lengths = [len(text.split()) for text in dataset['train']['text'][:1000]]  # word counts for the first 1,000 reviews
plt.hist(lengths, bins=50)
plt.title('Review Length Distribution (words)')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.tight_layout()
plt.savefig('data_exploration.png')
print("\nSaved visualization to data_exploration.png")
📊 Expected Output
Train size: 25000
Test size: 25000

Sample review:
Text: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked...
Label: 1 (0=negative, 1=positive)

Train distribution:
Negative: 12500 (50.0%)
Positive: 12500 (50.0%)
Create Train/Validation Split
# Split training data into train (80%) and validation (20%)
train_testvalid = dataset['train'].train_test_split(test_size=0.2, seed=42)
# Final splits
train_dataset = train_testvalid['train'] # 20k samples
val_dataset = train_testvalid['test'] # 5k samples
test_dataset = dataset['test'] # 25k samples
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")
⚠️ Common Mistake: Don't test on the training set! Always hold out a separate test set that the model never sees during training.
🔤 Step 3: Tokenization & Data Loading
Initialize Tokenizer
from transformers import AutoTokenizer
# Load BERT tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Test tokenization
sample_text = "This movie was absolutely fantastic! I loved every minute of it."
tokens = tokenizer(sample_text, truncation=True, padding=True, max_length=512)
print("Original text:", sample_text)
print("\nTokenized:")
print("Input IDs:", tokens['input_ids'][:10], "...")
print("Attention Mask:", tokens['attention_mask'][:10], "...")
# Decode back to text
decoded = tokenizer.decode(tokens['input_ids'])
print("\nDecoded:", decoded)
Tokenize Dataset
def tokenize_function(examples):
    """Tokenize a batch of texts"""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=512  # BERT's max sequence length
    )
# Tokenize all datasets (batched for speed)
print("Tokenizing datasets...")
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)
# Set format for PyTorch
tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_val.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
print("Tokenization complete!")
Create DataLoaders
from torch.utils.data import DataLoader
# Training: batch_size=16, shuffle=True
train_dataloader = DataLoader(tokenized_train, batch_size=16, shuffle=True)
# Validation/Test: batch_size=32, shuffle=False
val_dataloader = DataLoader(tokenized_val, batch_size=32, shuffle=False)
test_dataloader = DataLoader(tokenized_test, batch_size=32, shuffle=False)
print(f"Training batches: {len(train_dataloader)}")
print(f"Validation batches: {len(val_dataloader)}")
print(f"Test batches: {len(test_dataloader)}")
# Examine a batch
batch = next(iter(train_dataloader))
print(f"\nBatch keys: {batch.keys()}")
print(f"Input IDs shape: {batch['input_ids'].shape}") # [16, 512]
print(f"Labels shape: {batch['label'].shape}") # [16]
💡 Batch Size Guidelines
GPU memory (a quick way to check yours is sketched below):
- T4 (16GB): batch_size=16
- V100 (16GB): batch_size=24
- RTX 3090 (24GB): batch_size=32
- A100 (40GB): batch_size=64
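If you are not sure how much memory your card has, a small helper like this (an assumption-based sketch whose thresholds mirror the table above) picks a starting batch size:
import torch

# Suggest a starting per-device batch size from total GPU memory (rough thresholds)
batch_size = 8  # conservative fallback for CPU or unknown hardware
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if total_gb >= 40:
        batch_size = 64
    elif total_gb >= 24:
        batch_size = 32
    elif total_gb >= 16:
        batch_size = 24
    else:
        batch_size = 16
print(f"Suggested per-device batch size: {batch_size}")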
🚀 Step 4: Model Training
Initialize Model
from transformers import AutoModelForSequenceClassification
import torch
# Load pre-trained BERT with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # Binary classification (negative/positive)
    problem_type="single_label_classification"
)
# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(f"Model loaded on: {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
Training Configuration
from transformers import TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
# Define evaluation metrics
def compute_metrics(eval_pred):
    """Compute accuracy, F1, precision, recall"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions, average='binary'),
        'precision': precision_score(labels, predictions, average='binary'),
        'recall': recall_score(labels, predictions, average='binary')
    }
# Training arguments
training_args = TrainingArguments(
    output_dir='./models/bert-sentiment',
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    # Evaluation
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,  # Keep only the 2 most recent checkpoints
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    # Logging
    logging_dir='./logs',
    logging_steps=100,
    report_to="none",  # Disable wandb/tensorboard
    # Performance
    fp16=torch.cuda.is_available(),  # Mixed precision training (faster)
    dataloader_num_workers=4,
)
print("Training configuration:")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Mixed precision: {training_args.fp16}")
Train the Model
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
)
# Start training
print("\n🚀 Starting training...")
print("This will take ~20 minutes on T4 GPU, ~2 hours on CPU\n")
train_result = trainer.train()
# Print training summary
print("\n✅ Training complete!")
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"Samples/second: {train_result.metrics['train_samples_per_second']:.2f}")
print(f"Final loss: {train_result.metrics['train_loss']:.4f}")
# Save model
trainer.save_model('./models/bert-sentiment-final')
tokenizer.save_pretrained('./models/bert-sentiment-final')
print("\nModel saved to ./models/bert-sentiment-final")
📊 Expected Training Output
🚀 Starting training...

Epoch 1/3
Step 100:  loss=0.3245, eval_acc=0.8720, eval_f1=0.8698
Step 500:  loss=0.2012, eval_acc=0.9140, eval_f1=0.9128
Epoch 1 complete: avg_loss=0.2234

Epoch 2/3
Step 1000: loss=0.1456, eval_acc=0.9280, eval_f1=0.9275
Step 1500: loss=0.1189, eval_acc=0.9320, eval_f1=0.9318
Epoch 2 complete: avg_loss=0.1398

Epoch 3/3
Step 2000: loss=0.0892, eval_acc=0.9345, eval_f1=0.9342
Step 2500: loss=0.0745, eval_acc=0.9360, eval_f1=0.9358
Epoch 3 complete: avg_loss=0.0856

✅ Training complete!
Training time: 1234.56 seconds
Final loss: 0.0856
Best F1 score: 0.9358
💡 Training Tips
- Overfitting? Add dropout, reduce epochs, or use more data
- Slow training? Enable fp16 mixed precision (roughly 2x faster)
- Out of memory? Reduce batch_size or use gradient accumulation (see the sketch below)
- Poor performance? Try a lower learning rate (1e-5) or more epochs
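One way to apply the out-of-memory tip: keep the effective batch size at 32 by accumulating gradients over several smaller steps. A minimal sketch that reuses the TrainingArguments import from above (all other arguments stay as before):
# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps = 32
low_memory_args = TrainingArguments(
    output_dir='./models/bert-sentiment',
    per_device_train_batch_size=8,   # fits on smaller GPUs
    gradient_accumulation_steps=4,   # accumulate 4 steps before each optimizer update
    learning_rate=2e-5,
    num_train_epochs=3,
)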
📊 Step 5: Model Evaluation
Evaluate on Test Set
# Evaluate on held-out test set
print("Evaluating on test set...")
test_results = trainer.evaluate(tokenized_test)
print("\n📊 Test Set Results:")
print(f"Accuracy: {test_results['eval_accuracy']:.4f}")
print(f"F1 Score: {test_results['eval_f1']:.4f}")
print(f"Precision: {test_results['eval_precision']:.4f}")
print(f"Recall: {test_results['eval_recall']:.4f}")
Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Get predictions
predictions = trainer.predict(tokenized_test)
pred_labels = np.argmax(predictions.predictions, axis=-1)
true_labels = predictions.label_ids
# Compute confusion matrix
cm = confusion_matrix(true_labels, pred_labels)
# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=['Negative', 'Positive']
)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - BERT Sentiment Classifier')
plt.savefig('confusion_matrix.png')
print("\nConfusion matrix saved to confusion_matrix.png")
# Calculate per-class metrics
tn, fp, fn, tp = cm.ravel()
print(f"\nTrue Negatives: {tn} ({tn/(tn+fp)*100:.1f}% of actual negatives)")
print(f"False Positives: {fp} ({fp/(tn+fp)*100:.1f}% of actual negatives)")
print(f"False Negatives: {fn} ({fn/(tp+fn)*100:.1f}% of actual positives)")
print(f"True Positives: {tp} ({tp/(tp+fn)*100:.1f}% of actual positives)")
Error Analysis
# Find misclassified examples
# Convert logits to probabilities so "confidence" is a value between 0 and 1
probs = torch.softmax(torch.tensor(predictions.predictions), dim=-1).numpy()
errors = []
for i, (pred, true) in enumerate(zip(pred_labels, true_labels)):
    if pred != true:
        errors.append({
            'index': i,
            'text': test_dataset[i]['text'],
            'true_label': true,
            'predicted_label': pred,
            'confidence': float(probs[i].max())
        })
print(f"\nTotal errors: {len(errors)} ({len(errors)/len(test_dataset)*100:.2f}%)")
# Show 5 most confident wrong predictions
errors_sorted = sorted(errors, key=lambda x: x['confidence'], reverse=True)
print("\n🔍 Top 5 Most Confident Errors:\n")
for i, error in enumerate(errors_sorted[:5], 1):
    true_label = 'Positive' if error['true_label'] == 1 else 'Negative'
    pred_label = 'Positive' if error['predicted_label'] == 1 else 'Negative'
    print(f"{i}. True: {true_label} | Predicted: {pred_label} (conf: {error['confidence']:.3f})")
    print(f"   Text: {error['text'][:150]}...")
    print()
Test on Custom Examples
def predict_sentiment(text, model, tokenizer):
    """Predict sentiment for a single text"""
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt', truncation=True,
                       padding=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # Predict
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=-1)
    # Get prediction
    prediction = torch.argmax(probs, dim=-1).item()
    confidence = probs[0][prediction].item()
    sentiment = 'Positive 😊' if prediction == 1 else 'Negative 😞'
    return {
        'sentiment': sentiment,
        'confidence': confidence,
        'positive_prob': probs[0][1].item(),
        'negative_prob': probs[0][0].item()
    }
# Test examples
test_reviews = [
    "This movie was absolutely amazing! Best film I've seen all year.",
    "Terrible waste of time. I want my money back.",
    "It was okay, not great but not terrible either.",
    "Brilliant acting and stunning cinematography. Highly recommend!",
    "Boring and predictable. Fell asleep halfway through."
]
print("🎬 Custom Review Predictions:\n")
for review in test_reviews:
    result = predict_sentiment(review, model, tokenizer)
    print(f"Review: {review}")
    print(f"Prediction: {result['sentiment']} (confidence: {result['confidence']:.3f})")
    print(f"Probabilities: Negative={result['negative_prob']:.3f}, Positive={result['positive_prob']:.3f}\n")
📊 Expected Results
Target Metrics (BERT-base on IMDb):
- Accuracy: 93-94%
- F1 Score: 93-94%
- Training time: ~20 minutes (T4 GPU)
🚀 Step 6: Deployment
Create FastAPI Server
# api.py - Production-ready API
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import logging
# Initialize logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Initialize FastAPI
app = FastAPI(title="BERT Sentiment Analysis API", version="1.0.0")
# Load model at startup
@app.on_event("startup")
async def load_model():
    global model, tokenizer, device
    logger.info("Loading model...")
    model_path = "./models/bert-sentiment-final"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    logger.info(f"Model loaded on {device}")
class ReviewRequest(BaseModel):
    text: str

class PredictionResponse(BaseModel):
    sentiment: str
    confidence: float
    positive_probability: float
    negative_probability: float
@app.post("/predict", response_model=PredictionResponse)
async def predict(request: ReviewRequest):
    """Predict sentiment for a review"""
    try:
        # Validate input
        if not request.text or len(request.text.strip()) == 0:
            raise HTTPException(status_code=400, detail="Text cannot be empty")
        if len(request.text) > 5000:
            raise HTTPException(status_code=400, detail="Text too long (max 5000 chars)")
        # Tokenize
        inputs = tokenizer(
            request.text,
            return_tensors='pt',
            truncation=True,
            padding=True,
            max_length=512
        )
        inputs = {k: v.to(device) for k, v in inputs.items()}
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)
        # Extract results
        prediction = torch.argmax(probs, dim=-1).item()
        confidence = probs[0][prediction].item()
        sentiment = 'positive' if prediction == 1 else 'negative'
        return PredictionResponse(
            sentiment=sentiment,
            confidence=confidence,
            positive_probability=probs[0][1].item(),
            negative_probability=probs[0][0].item()
        )
    except HTTPException:
        # Re-raise validation errors so clients see the 400, not a generic 500
        raise
    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail="Prediction failed")
@app.get("/health")
async def health():
    """Health check endpoint"""
    return {"status": "healthy", "model": "bert-base-uncased"}

@app.get("/")
async def root():
    """API info"""
    return {
        "name": "BERT Sentiment Analysis API",
        "version": "1.0.0",
        "endpoints": {
            "POST /predict": "Predict sentiment",
            "GET /health": "Health check",
            "GET /docs": "API documentation"
        }
    }
# Run: uvicorn api:app --host 0.0.0.0 --port 8000 --reload
Test the API
# Start server
uvicorn api:app --host 0.0.0.0 --port 8000
# In another terminal, test with curl
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text": "This movie was fantastic!"}'
# Expected response:
# {
# "sentiment": "positive",
# "confidence": 0.987,
# "positive_probability": 0.987,
# "negative_probability": 0.013
# }
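If you prefer Python to curl, a small client sketch using the requests library works too (it assumes the server above is running on localhost:8000):
# test_api.py - minimal Python client for the running API
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "This movie was fantastic!"},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g. {"sentiment": "positive", "confidence": ..., ...}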
Simple Web Interface
<!-- index.html - Simple web UI -->
<!DOCTYPE html>
<html>
<head>
<title>Sentiment Analyzer</title>
<style>
body { font-family: Arial; max-width: 600px; margin: 50px auto; padding: 20px; }
textarea { width: 100%; height: 150px; padding: 10px; font-size: 14px; }
button { background: #3b82f6; color: white; padding: 10px 20px; border: none;
border-radius: 5px; cursor: pointer; font-size: 16px; }
button:hover { background: #2563eb; }
#result { margin-top: 20px; padding: 20px; border-radius: 10px; display: none; }
.positive { background: #d1fae5; border: 2px solid #10b981; }
.negative { background: #fee2e2; border: 2px solid #ef4444; }
</style>
</head>
<body>
<h1>🎬 Movie Review Sentiment Analyzer</h1>
<p>Enter a movie review to analyze its sentiment:</p>
<textarea id="review" placeholder="Type your review here..."></textarea>
<br><br>
<button onclick="analyzeSentiment()">Analyze Sentiment</button>
<div id="result"></div>
<script>
async function analyzeSentiment() {
const text = document.getElementById('review').value;
if (!text.trim()) {
alert('Please enter a review');
return;
}
const response = await fetch('http://localhost:8000/predict', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ text: text })
});
const data = await response.json();
const resultDiv = document.getElementById('result');
resultDiv.className = data.sentiment;
resultDiv.style.display = 'block';
const emoji = data.sentiment === 'positive' ? '😊' : '😞';
const sentiment = data.sentiment.charAt(0).toUpperCase() + data.sentiment.slice(1);
resultDiv.innerHTML = `
<h2>${emoji} ${sentiment} Sentiment</h2>
<p>Confidence: ${(data.confidence * 100).toFixed(1)}%</p>
<p>Positive: ${(data.positive_probability * 100).toFixed(1)}%</p>
<p>Negative: ${(data.negative_probability * 100).toFixed(1)}%</p>
`;
}
</script>
</body>
</html>
🚀 Deployment Options
- Local: uvicorn api:app (development); a programmatic alternative is sketched below
- Docker: Containerize with Dockerfile
- Cloud: Deploy to AWS, GCP, Azure (with GPU)
- Serverless: AWS Lambda + API Gateway (small models)
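For local development you can also start the server from Python instead of the CLI; a minimal sketch (it assumes api.py from above sits in the same directory):
# serve.py - start the API programmatically (development only)
import uvicorn

if __name__ == "__main__":
    uvicorn.run("api:app", host="0.0.0.0", port=8000, reload=True)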
🎯 Step 7: Improvements & Extensions
Model Improvements
Use Larger Model
Try BERT-large or RoBERTa for +1-2% accuracy; training takes 2-3x longer (see the sketch after this list).
More Data
Combine IMDb with Yelp, Amazon reviews. More diverse data = better generalization.
Hyperparameter Tuning
Grid search: learning rates (1e-5, 2e-5, 5e-5), batch sizes, dropout rates.
Multi-class
Expand to 5-star ratings instead of binary (negative/positive).
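For the larger-model and multi-class ideas, only the model name and label count change; the rest of the training code stays the same. A sketch (the 5-label setup assumes a dataset with 0-4 star labels, such as Yelp reviews; IMDb itself is binary):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Swap in a stronger backbone - tokenizer and Trainer code are unchanged
model_name = "roberta-base"  # or "bert-large-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# For 5-star ratings, change the label count (requires a dataset with labels 0-4)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)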
Production Enhancements
# 1. Add caching for repeated queries
from functools import lru_cache
import hashlib

@lru_cache(maxsize=1000)
def predict_cached(text_hash):
    # Cache predictions by text hash (hash long reviews with hashlib first)
    pass

# 2. Batch prediction endpoint
from typing import List

@app.post("/predict_batch")
async def predict_batch(reviews: List[str]):
    # Process multiple reviews at once
    pass
# 3. Model versioning
@app.get("/model_info")
async def model_info():
    return {
        "model": "bert-base-uncased",
        "fine_tuned_on": "IMDb",
        "version": "1.0.0",
        "accuracy": 0.934
    }
# 4. Rate limiting
from slowapi import Limiter
limiter = Limiter(key_func=lambda: "global")

@app.post("/predict")
@limiter.limit("10/minute")
async def predict(...):
    # Limit to 10 requests per minute
    pass
🏆 Challenge Extensions
- Multi-lingual: Fine-tune mBERT on reviews in Spanish, French, etc.
- Aspect-based: Classify sentiment per aspect (acting, plot, visuals)
- Zero-shot: Compare with GPT-3.5 zero-shot (no training); a free local baseline is sketched below
- Distillation: Compress to DistilBERT (2x faster, 40% smaller)
- Real-time: Deploy with WebSockets for streaming predictions
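For the zero-shot comparison, one baseline that needs no API key (not GPT-3.5, but a free stand-in) is the transformers zero-shot pipeline backed by an NLI model; a minimal sketch:
from transformers import pipeline

# Zero-shot sentiment with an NLI model - no fine-tuning involved
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "This movie was absolutely amazing! Best film I've seen all year.",
    candidate_labels=["positive", "negative"],
)
print(result["labels"][0], round(result["scores"][0], 3))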
📝 Complete Code Summary
Key Files Created
- data_preparation.py: Load and explore IMDb dataset
- train.py: Fine-tune BERT on sentiment classification
- evaluate.py: Test model and analyze errors
- api.py: FastAPI server for predictions
- index.html: Simple web interface
What You Learned
- ✅ Load and preprocess text datasets with HuggingFace
- ✅ Tokenize text for BERT models
- ✅ Fine-tune pre-trained models with Trainer API
- ✅ Evaluate with accuracy, F1, precision, recall
- ✅ Analyze errors with confusion matrix
- ✅ Deploy model via FastAPI
- ✅ Build web interface for predictions
📊 Expected Final Results
| Metric | Value | Notes |
|---|---|---|
| Test Accuracy | 93-94% | Strong result for BERT-base (larger models score higher) |
| F1 Score | 93-94% | Balanced performance |
| Training Time | ~20 minutes | On T4 GPU (Colab free tier) |
| Inference Time | ~50ms | Per review on GPU |
| Model Size | ~440MB | BERT-base parameters |
🎉 Congratulations! You've built a production-ready sentiment classifier from scratch. You can now:
- Fine-tune any HuggingFace model on any classification task
- Deploy ML models via REST APIs
- Evaluate model performance with proper metrics
- Build end-to-end ML projects independently
📚 Resources & Next Steps
Code Repository
Full project code available at: github.com/your-repo/bert-sentiment-classifier
Further Reading
Next Projects
- Project 2: Build a RAG Chatbot with vector search
- Project 3: Deploy a Fine-tuned LLM at scale
Test Your Knowledge
Q1: What library is commonly used for fine-tuning transformer models?
Q2: What is the purpose of tokenization in BERT fine-tuning?
Q3: Which metric is commonly used for evaluating classification models?
Q4: What happens during the training loop?
Q5: Why do we use a validation set during fine-tuning?