šŸš€ HANDS-ON PROJECT

Project 2: Build a RAG Chatbot with Vector Search

Create an intelligent chatbot that answers questions from your documents using retrieval-augmented generation

ā±ļø 90-120 minutes šŸ”§ Hands-on Project šŸ’» Python, LangChain, ChromaDB, OpenAI

šŸŽÆ Project Overview

Build a production-ready RAG chatbot that answers questions based on your documents. You'll implement document ingestion, vector search, conversational memory, and a web interface.

What You'll Build

  • šŸ“„ Document Ingestion: Load PDFs, text files, and web pages. Chunk intelligently and create embeddings.
  • šŸ” Vector Search: Store embeddings in ChromaDB. Retrieve relevant context with semantic similarity.
  • šŸ’¬ Conversational Bot: Generate answers with GPT-4. Remember conversation history. Cite sources.
  • 🌐 Web Interface: Build a Gradio UI for chatting. Upload documents, ask questions, see sources.

šŸ—ļø System Architecture

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                        USER INTERFACE                        │
│         (Gradio: Upload docs, Ask questions, View chat)     │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                       │
                       ā–¼
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    DOCUMENT INGESTION                        │
│  1. Load documents (PDF, TXT, URL)                          │
│  2. Split into chunks (RecursiveCharacterTextSplitter)      │
│  3. Generate embeddings (OpenAI ada-002)                    │
│  4. Store in vector DB (ChromaDB)                           │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                       │
                       ā–¼
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    RETRIEVAL PIPELINE                        │
│  1. User asks question                                       │
│  2. Convert question to embedding                            │
│  3. Search ChromaDB for similar chunks (top k=4)            │
│  4. Return relevant context                                  │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                       │
                       ā–¼
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                   GENERATION PIPELINE                        │
│  1. Combine: question + retrieved context + chat history    │
│  2. Send to GPT-4 with system prompt                        │
│  3. Generate answer with source citations                   │
│  4. Save conversation to memory                             │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
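
Before building each stage separately, here is a minimal end-to-end sketch of the architecture above collapsed into one script. It assumes OPENAI_API_KEY is set in your environment and uses a placeholder filename (example.pdf); the real project splits these stages across ingest.py, retriever.py, and chatbot.py.

# Minimal end-to-end sketch of the pipeline above (placeholder filename)
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# 1. Ingest: load, chunk, embed, store
docs = PyPDFLoader("data/documents/example.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 2. Retrieve: embed the question and fetch the most similar chunks
question = "What is this document about?"
context = "\n\n".join(d.page_content for d in vectorstore.similarity_search(question, k=4))

# 3. Generate: answer grounded in the retrieved context
llm = ChatOpenAI(model="gpt-4-turbo-preview")
answer = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)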
                    

šŸ“š Prerequisites

  • Python 3.8+ installed
  • OpenAI API key (GPT-4 access recommended)
  • Basic understanding of embeddings and vector search
  • Completed Tutorial 4: In-Context Learning & RAG

ā±ļø Time Breakdown

  • Setup: 15 minutes (install libraries, API keys)
  • Document Ingestion: 20 minutes (load, chunk, embed)
  • Vector Search: 15 minutes (ChromaDB setup)
  • Chat Pipeline: 25 minutes (LangChain chains, memory)
  • Web Interface: 15 minutes (Gradio UI)

šŸ”§ Step 1: Environment Setup

Install Dependencies

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install core packages
pip install langchain langchain-community langchain-openai
pip install chromadb tiktoken
pip install pypdf unstructured  # Document loaders
pip install gradio  # Web interface
pip install openai python-dotenv

# Verify installation
python -c "import langchain; print(f'LangChain: {langchain.__version__}')"
python -c "import chromadb; print(f'ChromaDB: {chromadb.__version__}')"

Set Up API Keys

# Create .env file
cat > .env << EOF
OPENAI_API_KEY=sk-your-openai-api-key-here
EOF

# Or export directly
export OPENAI_API_KEY="sk-your-key-here"

āš ļø API Costs: OpenAI embeddings cost ~$0.10 per 1M tokens. For this project, expect ~$0.50-2 total depending on document size.

Project Structure

rag-chatbot/
ā”œā”€ā”€ data/
│   ā”œā”€ā”€ documents/           # PDFs, TXT files
│   └── chroma_db/          # Vector database storage
ā”œā”€ā”€ src/
│   ā”œā”€ā”€ ingest.py           # Document ingestion
│   ā”œā”€ā”€ retriever.py        # Vector search
│   ā”œā”€ā”€ chatbot.py          # Main chat logic
│   └── app.py              # Gradio interface
ā”œā”€ā”€ .env                     # API keys
ā”œā”€ā”€ requirements.txt
└── README.md

šŸ“„ Step 2: Document Ingestion

Load Documents

# ingest.py - Document loading and processing
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    DirectoryLoader,
    WebBaseLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

def load_documents(data_dir="./data/documents"):
    """Load documents from various sources"""
    documents = []
    
    # Load PDFs
    pdf_loader = DirectoryLoader(
        data_dir,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader
    )
    documents.extend(pdf_loader.load())
    print(f"Loaded {len(documents)} pages from PDFs")
    
    # Load text files
    txt_loader = DirectoryLoader(
        data_dir,
        glob="**/*.txt",
        loader_cls=TextLoader
    )
    text_docs = txt_loader.load()
    documents.extend(text_docs)
    print(f"Loaded {len(text_docs)} text files")
    
    # Load from URLs (optional)
    urls = [
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        # Add more URLs as needed
    ]
    
    if urls:
        web_loader = WebBaseLoader(urls)
        web_docs = web_loader.load()
        documents.extend(web_docs)
        print(f"Loaded {len(web_docs)} web pages")
    
    print(f"\nTotal documents loaded: {len(documents)}")
    return documents

# Test document loading
if __name__ == "__main__":
    docs = load_documents()
    
    # Examine first document
    if docs:
        print(f"\nFirst document:")
        print(f"Source: {docs[0].metadata.get('source', 'Unknown')}")
        print(f"Content preview: {docs[0].page_content[:200]}...")

Chunk Documents Intelligently

def chunk_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into chunks for embedding"""
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]  # Try these in order
    )
    
    chunks = text_splitter.split_documents(documents)
    
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")
    
    return chunks

# Test chunking
if __name__ == "__main__":
    docs = load_documents()
    chunks = chunk_documents(docs)
    
    # Examine a chunk
    print(f"\nSample chunk:")
    print(f"Content: {chunks[0].page_content[:300]}...")
    print(f"Metadata: {chunks[0].metadata}")

šŸ’” Chunking Strategy

Chunk Size Guidelines (a sizing experiment is sketched after this list):

  • Small chunks (200-500): More precise retrieval, but may lack context
  • Medium chunks (500-1000): Balanced (recommended)
  • Large chunks (1000-2000): More context, but less precise
  • Overlap (100-200): Prevents information loss at boundaries
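
The sketch below is a rough helper (it assumes the load_documents() function from ingest.py is in scope) for comparing chunk_size and chunk_overlap settings on your own corpus before spending money on embeddings.

# Rough sketch: compare splitter settings before creating any embeddings.
# Assumes load_documents() from ingest.py above is in scope.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def compare_chunking(documents, settings=((500, 100), (1000, 200), (2000, 200))):
    """Print chunk count and average chunk length for several configurations."""
    for chunk_size, chunk_overlap in settings:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )
        chunks = splitter.split_documents(documents)
        avg = sum(len(c.page_content) for c in chunks) / max(len(chunks), 1)
        print(f"size={chunk_size}, overlap={chunk_overlap}: "
              f"{len(chunks)} chunks, avg {avg:.0f} chars")

if __name__ == "__main__":
    compare_chunking(load_documents())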

Create Embeddings & Store in Vector DB

def create_vector_store(chunks, persist_directory="./data/chroma_db"):
    """Create ChromaDB vector store from document chunks"""
    
    # Initialize OpenAI embeddings
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )
    
    # Create and persist vector store
    print("Creating embeddings and storing in ChromaDB...")
    print("This may take a few minutes depending on document size...")
    
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
        collection_name="rag_documents"
    )
    
    print(f"Vector store created with {vectorstore._collection.count()} vectors")
    print(f"Persisted to: {persist_directory}")
    
    return vectorstore

# Complete ingestion pipeline
def ingest_documents(data_dir="./data/documents", persist_dir="./data/chroma_db"):
    """Complete document ingestion pipeline"""
    
    # Load documents
    documents = load_documents(data_dir)
    
    if not documents:
        print("No documents found!")
        return None
    
    # Chunk documents
    chunks = chunk_documents(documents)
    
    # Create vector store
    vectorstore = create_vector_store(chunks, persist_dir)
    
    print("\nāœ… Document ingestion complete!")
    return vectorstore

if __name__ == "__main__":
    vectorstore = ingest_documents()

šŸ“Š Expected Output

Loaded 15 pages from PDFs
Loaded 3 text files
Loaded 1 web pages

Total documents loaded: 19

Split 19 documents into 87 chunks
Average chunk size: 892 chars

Creating embeddings and storing in ChromaDB...
This may take a few minutes depending on document size...

Vector store created with 87 vectors
Persisted to: ./data/chroma_db

āœ… Document ingestion complete!

šŸ” Step 3: Vector Search & Retrieval

Initialize Retriever

# retriever.py - Semantic search functionality
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
import os

def load_vector_store(persist_directory="./data/chroma_db"):
    """Load existing ChromaDB vector store"""
    
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )
    
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
        collection_name="rag_documents"
    )
    
    print(f"Loaded vector store with {vectorstore._collection.count()} documents")
    return vectorstore

def create_retriever(vectorstore, k=4, search_type="similarity"):
    """Create retriever with configurable search parameters"""
    
    retriever = vectorstore.as_retriever(
        search_type=search_type,  # "similarity" or "mmr" (max marginal relevance)
        search_kwargs={
            "k": k,  # Number of documents to retrieve
            "fetch_k": 20  # For MMR: fetch more, then rerank
        }
    )
    
    return retriever

# Test retrieval
if __name__ == "__main__":
    vectorstore = load_vector_store()
    retriever = create_retriever(vectorstore, k=4)
    
    # Test query
    query = "What is artificial intelligence?"
    docs = retriever.get_relevant_documents(query)
    
    print(f"\nQuery: {query}")
    print(f"Retrieved {len(docs)} documents:\n")
    
    for i, doc in enumerate(docs, 1):
        print(f"{i}. Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"   Content: {doc.page_content[:200]}...")
        print()

Advanced Retrieval Strategies

def filtered_similarity_search(vectorstore, query, k=4):
    """Similarity search with optional metadata filtering.

    Note: this is still pure vector search; a true hybrid retriever
    (vector + keyword) is sketched in the strategies comparison below.
    """

    # 1. Semantic search (vector similarity)
    semantic_docs = vectorstore.similarity_search(query, k=k)

    # 2. Optional: restrict retrieval to a specific source via metadata filter
    #    (uncomment and adjust the filename to use it)
    # filtered_docs = vectorstore.similarity_search(
    #     query,
    #     k=k,
    #     filter={"source": "important_document.pdf"}
    # )

    return semantic_docs

def mmr_search(vectorstore, query, k=4, diversity=0.3):
    """Maximum Marginal Relevance - balance relevance and diversity"""
    
    # MMR retrieves diverse results (avoids redundant chunks)
    docs = vectorstore.max_marginal_relevance_search(
        query,
        k=k,
        fetch_k=20,  # Fetch 20, return top k diverse ones
        lambda_mult=diversity  # 0=max diversity, 1=max relevance
    )
    
    return docs

def retrieval_with_scores(vectorstore, query, k=4):
    """Get documents with similarity scores"""
    
    docs_with_scores = vectorstore.similarity_search_with_score(query, k=k)
    
    print(f"Query: {query}\n")
    for i, (doc, score) in enumerate(docs_with_scores, 1):
        print(f"{i}. Score: {score:.4f}")
        print(f"   Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"   Content: {doc.page_content[:150]}...")
        print()
    
    return docs_with_scores

# Test different retrieval strategies
if __name__ == "__main__":
    vectorstore = load_vector_store()
    query = "Explain machine learning algorithms"
    
    print("=== Similarity Search ===")
    docs1 = filtered_similarity_search(vectorstore, query)
    
    print("\n=== MMR Search (Diverse Results) ===")
    docs2 = mmr_search(vectorstore, query)
    
    print("\n=== With Similarity Scores ===")
    docs3 = retrieval_with_scores(vectorstore, query)

šŸŽÆ Retrieval Strategies Comparison

  • Similarity Search: Most relevant chunks (may have duplicates)
  • MMR: Balances relevance + diversity (best for broad questions)
  • Metadata Filtering: Restrict to specific sources/dates
  • Hybrid: Combine vector + keyword search (often the most robust; sketched below)
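
For the hybrid option, one common pattern (a sketch, not part of the project code above) blends the ChromaDB retriever with a BM25 keyword retriever via LangChain's EnsembleRetriever. It needs the rank_bm25 package (pip install rank_bm25) and access to the raw chunks from the ingestion step, and the exact import paths can vary between LangChain versions.

# Sketch: hybrid retrieval = BM25 keyword search + vector similarity search.
# Assumes `chunks` from ingestion and `vectorstore` loaded as above;
# requires `pip install rank_bm25`. Verify imports for your LangChain version.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

def create_hybrid_retriever(chunks, vectorstore, k=4):
    """Blend keyword (BM25) and semantic (vector) retrieval results."""
    bm25 = BM25Retriever.from_documents(chunks)
    bm25.k = k

    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": k})

    # Equal weighting of keyword and semantic results; tune per corpus
    return EnsembleRetriever(retrievers=[bm25, vector_retriever], weights=[0.5, 0.5])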

šŸ’¬ Step 4: Conversational RAG Pipeline

Build Chat Chain with Memory

# chatbot.py - Main RAG chatbot logic
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
import os

def create_chatbot(vectorstore):
    """Create RAG chatbot with conversational memory"""
    
    # Initialize LLM
    llm = ChatOpenAI(
        model="gpt-4-turbo-preview",
        temperature=0.7,
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )
    
    # Create retriever
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 4, "fetch_k": 20}
    )
    
    # Initialize conversation memory
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer"
    )
    
    # Custom system prompt
    system_prompt = """You are a helpful AI assistant that answers questions based on the provided context.

Instructions:
1. Answer questions using ONLY the information from the context
2. If the answer is not in the context, say "I don't have enough information to answer that"
3. Always cite your sources by mentioning the document name
4. Be concise but complete
5. If asked about previous messages, use the chat history

Context:
{context}

Question: {question}

Answer:"""
    
    # Create conversational retrieval chain
    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=True,
        combine_docs_chain_kwargs={
            "prompt": PromptTemplate(
                template=system_prompt,
                input_variables=["context", "question"]
            )
        }
    )
    
    return qa_chain

# Chat function
def chat(qa_chain, question):
    """Send a question to the chatbot"""
    
    result = qa_chain({"question": question})
    
    answer = result["answer"]
    sources = result["source_documents"]
    
    return {
        "answer": answer,
        "sources": sources,
        "chat_history": result.get("chat_history", [])
    }

# Test the chatbot
if __name__ == "__main__":
    from retriever import load_vector_store
    
    # Load vector store
    vectorstore = load_vector_store()
    
    # Create chatbot
    qa_chain = create_chatbot(vectorstore)
    
    # Test conversation
    print("RAG Chatbot initialized. Type 'quit' to exit.\n")
    
    while True:
        question = input("You: ")
        if question.lower() in ['quit', 'exit', 'q']:
            break
        
        result = chat(qa_chain, question)
        
        print(f"\nBot: {result['answer']}\n")
        print(f"Sources ({len(result['sources'])}):")
        for i, doc in enumerate(result['sources'], 1):
            source = doc.metadata.get('source', 'Unknown')
            print(f"  {i}. {source}")
        print()

Add Source Citations

def format_answer_with_sources(answer, sources):
    """Append a numbered list of unique sources to the answer"""
    
    # Extract unique sources, numbering them sequentially
    unique_sources = {}
    for doc in sources:
        source = doc.metadata.get('source', 'Unknown')
        if source not in unique_sources:
            unique_sources[source] = len(unique_sources) + 1
    
    # Add source list
    formatted_answer = f"{answer}\n\n**Sources:**\n"
    for source, num in unique_sources.items():
        formatted_answer += f"{num}. {source}\n"
    
    return formatted_answer

def chat_with_citations(qa_chain, question):
    """Chat with formatted citations"""
    
    result = chat(qa_chain, question)
    formatted = format_answer_with_sources(result['answer'], result['sources'])
    
    return {
        "answer": formatted,
        "sources": result['sources']
    }

Handle Follow-up Questions

# Test conversation flow
if __name__ == "__main__":
    from retriever import load_vector_store
    
    vectorstore = load_vector_store()
    qa_chain = create_chatbot(vectorstore)
    
    # Multi-turn conversation
    questions = [
        "What is machine learning?",
        "What are some applications of it?",  # "it" refers to ML
        "How does it differ from traditional programming?"  # Still talking about ML
    ]
    
    print("=== Multi-turn Conversation ===\n")
    for q in questions:
        result = chat(qa_chain, q)
        print(f"Q: {q}")
        print(f"A: {result['answer']}\n")

šŸ’” Memory Management

Memory Types (swap-in examples are sketched after this list):

  • ConversationBufferMemory: Store all messages (simple, unlimited growth)
  • ConversationBufferWindowMemory: Keep last N messages (prevents overflow)
  • ConversationSummaryMemory: Summarize old messages (best for long conversations)
  • ConversationSummaryBufferMemory: Hybrid approach (most balanced)
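
Swapping memory types only changes the memory object passed in create_chatbot(). For example (a sketch; check that these classes are available in your installed LangChain version):

# Sketch: drop-in alternatives to ConversationBufferMemory.
from langchain.memory import (
    ConversationBufferWindowMemory,
    ConversationSummaryBufferMemory,
)

# Keep only the last 5 exchanges (bounds prompt size)
window_memory = ConversationBufferWindowMemory(
    k=5,
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",
)

# Summarize older turns once the history exceeds ~1000 tokens
summary_memory = ConversationSummaryBufferMemory(
    llm=llm,  # reuse the ChatOpenAI instance from create_chatbot()
    max_token_limit=1000,
    memory_key="chat_history",
    return_messages=True,
    output_key="answer",
)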

🌐 Step 5: Web Interface with Gradio

Create Interactive UI

# app.py - Gradio web interface
import gradio as gr
from chatbot import create_chatbot, chat_with_citations
from retriever import load_vector_store
from ingest import ingest_documents
import os

# Global chatbot instance
qa_chain = None

def initialize_chatbot():
    """Initialize or reinitialize the chatbot"""
    global qa_chain
    
    # Check if vector store exists
    if not os.path.exists("./data/chroma_db"):
        return "āš ļø No documents found. Please upload documents first."
    
    try:
        vectorstore = load_vector_store()
        qa_chain = create_chatbot(vectorstore)
        return "āœ… Chatbot initialized successfully!"
    except Exception as e:
        return f"āŒ Error: {str(e)}"

def upload_documents(files):
    """Handle document uploads"""
    if not files:
        return "No files uploaded"
    
    # Create documents directory
    os.makedirs("./data/documents", exist_ok=True)
    
    # Save uploaded files
    for file in files:
        filename = os.path.basename(file.name)
        destination = f"./data/documents/{filename}"
        
        # Copy file
        with open(file.name, 'rb') as src:
            with open(destination, 'wb') as dst:
                dst.write(src.read())
    
    # Ingest documents
    try:
        ingest_documents()
        initialize_chatbot()
        return f"āœ… Successfully uploaded and processed {len(files)} document(s)"
    except Exception as e:
        return f"āŒ Error processing documents: {str(e)}"

def chat_interface(message, history):
    """Handle a chat message and return the updated chat history"""
    global qa_chain
    history = history or []
    
    if qa_chain is None:
        response = "Please initialize the chatbot first by uploading documents."
    else:
        try:
            result = chat_with_citations(qa_chain, message)
            response = result['answer']
        except Exception as e:
            response = f"Error: {str(e)}"
    
    # Append the (user, bot) turn and clear the input box
    history.append((message, response))
    return "", history

# Create Gradio interface
def create_ui():
    """Create Gradio web interface"""
    
    with gr.Blocks(title="RAG Chatbot", theme=gr.themes.Soft()) as demo:
        gr.Markdown("# šŸ¤– RAG Chatbot with Document Q&A")
        gr.Markdown("Upload documents, then ask questions about them!")
        
        with gr.Tab("šŸ’¬ Chat"):
            chatbot = gr.Chatbot(height=500)
            msg = gr.Textbox(
                placeholder="Ask a question about your documents...",
                label="Your Question"
            )
            clear = gr.Button("Clear Chat")
            
            msg.submit(chat_interface, [msg, chatbot], [msg, chatbot])
            clear.click(lambda: None, None, chatbot, queue=False)
        
        with gr.Tab("šŸ“„ Upload Documents"):
            gr.Markdown("### Upload your documents (PDF, TXT)")
            
            file_upload = gr.File(
                file_count="multiple",
                label="Upload Documents",
                file_types=[".pdf", ".txt"]
            )
            
            upload_btn = gr.Button("Process Documents", variant="primary")
            upload_status = gr.Textbox(label="Status", interactive=False)
            
            upload_btn.click(
                upload_documents,
                inputs=file_upload,
                outputs=upload_status
            )
        
        with gr.Tab("ā„¹ļø About"):
            gr.Markdown("""
            ### How it works
            
            1. **Upload Documents**: Add PDFs or text files
            2. **Processing**: Documents are chunked and embedded
            3. **Ask Questions**: Chat about your documents
            4. **Get Answers**: Responses include source citations
            
            ### Features
            - Semantic search with vector embeddings
            - Conversational memory (remembers context)
            - Source citations for transparency
            - Supports PDF and text files
            
            ### Technologies
            - **LangChain**: RAG pipeline
            - **ChromaDB**: Vector database
            - **OpenAI**: Embeddings and GPT-4
            - **Gradio**: Web interface
            """)
    
    return demo

# Launch the app
if __name__ == "__main__":
    # Try to initialize chatbot if data exists
    if os.path.exists("./data/chroma_db"):
        initialize_chatbot()
        print("āœ… Chatbot initialized with existing data")
    else:
        print("āš ļø No existing data. Please upload documents first.")
    
    # Create and launch UI
    demo = create_ui()
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False  # Set to True to create public link
    )

šŸš€ Launch the Application

# Run the Gradio app
python app.py

# Open browser to: http://localhost:7860

# To create public link (accessible from anywhere):
# demo.launch(share=True)

āœ… Your RAG chatbot is now running!

  • Go to http://localhost:7860 in your browser
  • Upload documents (PDFs, TXT files)
  • Wait for processing (~30 seconds for 10 pages)
  • Start asking questions!

šŸ“Š Step 6: Testing & Evaluation

Test Different Question Types

# test_chatbot.py - Comprehensive testing
from chatbot import create_chatbot, chat
from retriever import load_vector_store

def test_question_types():
    """Test different types of questions"""
    
    vectorstore = load_vector_store()
    qa_chain = create_chatbot(vectorstore)
    
    test_cases = [
        # Factual questions
        ("What is the definition of machine learning?", "factual"),
        
        # Comparison questions
        ("What's the difference between supervised and unsupervised learning?", "comparison"),
        
        # List questions
        ("What are the main applications of AI?", "list"),
        
        # Out-of-scope questions
        ("What's the weather today?", "out-of-scope"),
        
        # Follow-up questions
        ("Tell me more about neural networks", "follow-up"),
        ("What are the advantages?", "follow-up-pronoun"),
    ]
    
    print("=== Testing Different Question Types ===\n")
    
    for question, q_type in test_cases:
        result = chat(qa_chain, question)
        
        print(f"Type: {q_type}")
        print(f"Q: {question}")
        print(f"A: {result['answer'][:200]}...")
        print(f"Sources: {len(result['sources'])}")
        print("-" * 80)
        print()

if __name__ == "__main__":
    test_question_types()

Measure Retrieval Quality

def evaluate_retrieval(vectorstore, test_queries):
    """Evaluate retrieval quality"""
    
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    
    print("=== Retrieval Quality Analysis ===\n")
    
    for query in test_queries:
        docs = retriever.get_relevant_documents(query)
        
        print(f"Query: {query}")
        print(f"Retrieved: {len(docs)} documents")
        
        # Check diversity (unique sources)
        sources = [doc.metadata.get('source') for doc in docs]
        unique_sources = len(set(sources))
        print(f"Unique sources: {unique_sources}/{len(docs)}")
        
        # Check relevance (manual inspection needed)
        print("Top result preview:")
        print(f"  {docs[0].page_content[:150]}...")
        print()

if __name__ == "__main__":
    test_queries = [
        "What is deep learning?",
        "Explain natural language processing",
        "What are neural networks?"
    ]
    
    vectorstore = load_vector_store()
    evaluate_retrieval(vectorstore, test_queries)

šŸ“ˆ Expected Performance

Typical Metrics:

  • Response Time: 2-4 seconds (including retrieval + generation)
  • Retrieval Accuracy: 80-90% (relevant docs in top 4)
  • Answer Quality: High when docs contain answer
  • Source Coverage: 2-4 unique sources per answer
  • Hallucination Rate: <5% (with good prompting)

šŸš€ Step 7: Production Enhancements

Add Caching for Faster Responses

import hashlib
import redis  # pip install redis; assumes a Redis server is running locally

# Connect to Redis (adjust host/port for your deployment)
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

# For a simple process-local cache, functools.lru_cache on a retrieval
# helper also works; Redis is shown here because it survives restarts.

def get_cached_answer(qa_chain, question):
    """Return a cached answer if this exact question was asked before"""
    question_hash = hashlib.md5(question.encode()).hexdigest()
    
    # Check the cache first
    cached = redis_client.get(f"answer:{question_hash}")
    if cached:
        return cached
    
    # Cache miss: generate a new answer
    result = chat(qa_chain, question)
    
    # Cache for 1 hour
    redis_client.setex(
        f"answer:{question_hash}",
        3600,
        result['answer']
    )
    
    return result['answer']

Add User Feedback

# In the Gradio interface (add inside the Chat tab of create_ui)
def rate_answer(rating):
    """Collect user feedback; save_feedback is a placeholder for your own storage"""
    # save_feedback(rating, question, answer)  # e.g. append to a database or CSV
    return f"Thanks for your feedback! (Rating: {rating}/5)"

with gr.Row():
    rating = gr.Slider(1, 5, step=1, label="Rate this answer")
    submit_rating = gr.Button("Submit Rating")

feedback_status = gr.Textbox(label="Feedback", interactive=False)
submit_rating.click(rate_answer, inputs=rating, outputs=feedback_status)

Monitoring & Logging

import logging
from datetime import datetime

# Setup logging
logging.basicConfig(
    filename='chatbot.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def chat_with_logging(qa_chain, question):
    """Chat with comprehensive logging"""
    
    start_time = datetime.now()
    
    try:
        result = chat(qa_chain, question)
        
        latency = (datetime.now() - start_time).total_seconds()
        
        logging.info(f"Question: {question}")
        logging.info(f"Answer length: {len(result['answer'])}")
        logging.info(f"Sources: {len(result['sources'])}")
        logging.info(f"Latency: {latency:.2f}s")
        
        return result
        
    except Exception as e:
        logging.error(f"Error: {e}")
        raise

šŸŽÆ Production Checklist

  • āœ… Document processing pipeline (PDF, TXT, URLs)
  • āœ… Vector database with efficient retrieval
  • āœ… Conversational memory for context
  • āœ… Source citations for transparency
  • āœ… Web interface (Gradio)
  • āœ… Error handling and validation
  • āœ… Caching for performance
  • āœ… User feedback collection
  • āœ… Logging and monitoring
  • āœ… API key security (.env)

šŸ† Extensions & Improvements

Advanced Features

  • šŸŽ­ Multi-modal RAG: Add support for images, tables, charts. Use GPT-4 Vision for visual Q&A.
  • 🌐 Multi-lingual: Support documents in multiple languages. Use multilingual embeddings.
  • šŸ”„ Real-time Updates: Auto-sync documents from Google Drive, Notion, or Confluence.
  • šŸ‘„ Multi-user: Add authentication, per-user document collections, shared workspaces.

Deployment Options

# Docker deployment
docker build -t rag-chatbot .
docker run -p 7860:7860 -v $(pwd)/data:/app/data rag-chatbot

# Cloud deployment (AWS, GCP, Azure)
# Use managed vector DB: Pinecone, Weaviate, or pgvector

# Serverless deployment
# Use Lambda + API Gateway + DynamoDB

šŸŽ“ Challenge Extensions

  1. Hybrid Search: Combine vector search with BM25 keyword search
  2. Query Rewriting: Use LLM to reformulate ambiguous questions
  3. Answer Confidence: Add confidence scores to answers
  4. Streaming Responses: Stream answers token-by-token for better UX (see the sketch below)
  5. Voice Interface: Add speech-to-text and text-to-speech
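
For extension 4, ChatOpenAI can already yield tokens incrementally; the sketch below streams a response outside the retrieval chain (wiring streaming through ConversationalRetrievalChain and Gradio takes additional plumbing).

# Sketch: stream tokens from the LLM directly (challenge extension 4).
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.7)

prompt = "Summarize retrieval-augmented generation in two sentences."
for chunk in llm.stream(prompt):
    # Each chunk carries a partial message; print tokens as they arrive
    print(chunk.content, end="", flush=True)
print()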

šŸ“‹ Complete Code Summary

Project Structure

rag-chatbot/
ā”œā”€ā”€ data/
│   ā”œā”€ā”€ documents/           # Your PDFs, TXT files
│   │   ā”œā”€ā”€ ai_overview.pdf
│   │   ā”œā”€ā”€ ml_guide.txt
│   │   └── ...
│   └── chroma_db/          # Vector database storage
│       └── ...
ā”œā”€ā”€ src/
│   ā”œā”€ā”€ ingest.py           # Document loading, chunking, embedding
│   ā”œā”€ā”€ retriever.py        # Vector search and retrieval
│   ā”œā”€ā”€ chatbot.py          # RAG pipeline with memory
│   └── app.py              # Gradio web interface
ā”œā”€ā”€ .env                     # OPENAI_API_KEY=sk-...
ā”œā”€ā”€ requirements.txt         # Dependencies
└── README.md               # Documentation

What You Learned

  • āœ… Load and process documents (PDF, TXT, web pages)
  • āœ… Chunk documents intelligently with overlap
  • āœ… Create embeddings with OpenAI ada-002
  • āœ… Store and query vector database (ChromaDB)
  • āœ… Build RAG pipeline with LangChain
  • āœ… Add conversational memory for context
  • āœ… Implement semantic search (similarity, MMR)
  • āœ… Generate answers with source citations
  • āœ… Create web interface with Gradio
  • āœ… Deploy production-ready chatbot

šŸ“Š Cost Estimate

Component       Usage                     Cost
Embeddings      100k tokens (one-time)    $0.10
GPT-4 Queries   100 questions             $3.00
ChromaDB        Local storage             Free
Total (dev)     Testing + 100 queries     ~$3.50
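
To estimate the embedding cost for your own corpus before ingesting, you can count tokens with tiktoken; a sketch (the $0.10 per 1M tokens rate assumes text-embedding-ada-002 and may change, so check current OpenAI pricing):

# Sketch: rough embedding-cost estimate for your documents.
import tiktoken
from ingest import load_documents  # run alongside src/ingest.py

def estimate_embedding_cost(documents, price_per_million=0.10):
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by ada-002
    total_tokens = sum(len(enc.encode(doc.page_content)) for doc in documents)
    cost = total_tokens / 1_000_000 * price_per_million
    print(f"{total_tokens:,} tokens -> approx ${cost:.2f} to embed")
    return cost

if __name__ == "__main__":
    estimate_embedding_cost(load_documents())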

šŸŽ‰ Congratulations! You've built a production-ready RAG chatbot. You can now:

  • Build document Q&A systems for any domain
  • Implement semantic search over large document collections
  • Create conversational AI with grounded responses
  • Deploy LangChain applications to production

šŸ”— Resources & Next Steps

Code Repository

Full project code: github.com/your-repo/rag-chatbot

Next Project

  • Project 3: Deploy a Fine-tuned LLM at scale with vLLM

Test Your Knowledge

Q1: What does RAG stand for?

Random Access Generation
Rapid Application Gateway
Retrieval-Augmented Generation
Recursive Adaptive Grounding

Q2: What is the purpose of vector embeddings in RAG systems?

To make the model larger
To enable semantic similarity search for relevant document retrieval
To compress text
To translate languages

Q3: Which database is commonly used for vector storage in RAG applications?

MySQL
SQLite
MongoDB
ChromaDB, Pinecone, or FAISS

Q4: What are the main steps in a RAG pipeline?

Retrieve relevant documents, augment prompt with context, generate response
Only generate responses
Only store embeddings
Only tokenize text

Q5: What is the key benefit of RAG over vanilla LLMs?

RAG makes models smaller
RAG reduces latency
RAG grounds responses in external knowledge, reducing hallucinations and enabling domain-specific answers
RAG eliminates the need for GPUs