Project Overview
Build a production-ready RAG chatbot that answers questions based on your documents. You'll implement document ingestion, vector search, conversational memory, and a web interface.
What You'll Build
- Document Ingestion: Load PDFs, text files, and web pages. Chunk intelligently and create embeddings.
- Vector Search: Store embeddings in ChromaDB. Retrieve relevant context with semantic similarity.
- Conversational Bot: Generate answers with GPT-4. Remember conversation history. Cite sources.
- Web Interface: Build a Gradio UI for chatting. Upload documents, ask questions, see sources.
System Architecture
┌──────────────────────────────────────────────────────────────┐
│                        USER INTERFACE                        │
│       (Gradio: Upload docs, Ask questions, View chat)        │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                      DOCUMENT INGESTION                      │
│  1. Load documents (PDF, TXT, URL)                           │
│  2. Split into chunks (RecursiveCharacterTextSplitter)       │
│  3. Generate embeddings (OpenAI ada-002)                     │
│  4. Store in vector DB (ChromaDB)                            │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                      RETRIEVAL PIPELINE                      │
│  1. User asks question                                       │
│  2. Convert question to embedding                            │
│  3. Search ChromaDB for similar chunks (top k=4)             │
│  4. Return relevant context                                  │
└───────────────────────┬──────────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────────┐
│                     GENERATION PIPELINE                      │
│  1. Combine: question + retrieved context + chat history     │
│  2. Send to GPT-4 with system prompt                         │
│  3. Generate answer with source citations                    │
│  4. Save conversation to memory                              │
└──────────────────────────────────────────────────────────────┘
Prerequisites
- Python 3.8+ installed
- OpenAI API key (GPT-4 access recommended)
- Basic understanding of embeddings and vector search
- Completed Tutorial 4: In-Context Learning & RAG
Time Breakdown
- Setup: 15 minutes (install libraries, API keys)
- Document Ingestion: 20 minutes (load, chunk, embed)
- Vector Search: 15 minutes (ChromaDB setup)
- Chat Pipeline: 25 minutes (LangChain chains, memory)
- Web Interface: 15 minutes (Gradio UI)
Step 1: Environment Setup
Install Dependencies
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install core packages
pip install langchain langchain-community langchain-openai
pip install chromadb tiktoken
pip install pypdf unstructured # Document loaders
pip install gradio # Web interface
pip install openai python-dotenv
# Verify installation
python -c "import langchain; print(f'LangChain: {langchain.__version__}')"
python -c "import chromadb; print(f'ChromaDB: {chromadb.__version__}')"
Set Up API Keys
# Create .env file
cat > .env << EOF
OPENAI_API_KEY=sk-your-openai-api-key-here
EOF
# Or export directly
export OPENAI_API_KEY="sk-your-key-here"
⚠️ API Costs: OpenAI embeddings cost ~$0.10 per 1M tokens. For this project, expect ~$0.50-2 total depending on document size.
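If you want a rough cost estimate before embedding anything, you can count tokens locally with tiktoken (installed above). A minimal sketch, assuming the ada-002 rate of ~$0.10 per 1M tokens; the sample text is a placeholder:

# estimate_cost.py - rough embedding cost estimate (price assumed: ~$0.10 per 1M tokens)
import tiktoken

def estimate_embedding_cost(texts, price_per_million_tokens=0.10):
    """Count tokens with cl100k_base, the encoding used by text-embedding-ada-002."""
    encoding = tiktoken.get_encoding("cl100k_base")
    total_tokens = sum(len(encoding.encode(text)) for text in texts)
    cost = total_tokens / 1_000_000 * price_per_million_tokens
    return total_tokens, cost

if __name__ == "__main__":
    sample_texts = ["Replace this with the text of your documents..."]
    tokens, cost = estimate_embedding_cost(sample_texts)
    print(f"{tokens} tokens -> ~${cost:.4f} to embed")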
Project Structure
rag-chatbot/
├── data/
│   ├── documents/          # PDFs, TXT files
│   └── chroma_db/          # Vector database storage
├── src/
│   ├── ingest.py           # Document ingestion
│   ├── retriever.py        # Vector search
│   ├── chatbot.py          # Main chat logic
│   └── app.py              # Gradio interface
├── .env                    # API keys
├── requirements.txt
└── README.md
Step 2: Document Ingestion
Load Documents
# ingest.py - Document loading and processing
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    DirectoryLoader,
    WebBaseLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

def load_documents(data_dir="./data/documents"):
    """Load documents from various sources"""
    documents = []

    # Load PDFs
    pdf_loader = DirectoryLoader(
        data_dir,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader
    )
    documents.extend(pdf_loader.load())
    print(f"Loaded {len(documents)} pages from PDFs")

    # Load text files
    txt_loader = DirectoryLoader(
        data_dir,
        glob="**/*.txt",
        loader_cls=TextLoader
    )
    text_docs = txt_loader.load()
    documents.extend(text_docs)
    print(f"Loaded {len(text_docs)} text files")

    # Load from URLs (optional)
    urls = [
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
        # Add more URLs as needed
    ]
    if urls:
        web_loader = WebBaseLoader(urls)
        web_docs = web_loader.load()
        documents.extend(web_docs)
        print(f"Loaded {len(web_docs)} web pages")

    print(f"\nTotal documents loaded: {len(documents)}")
    return documents

# Test document loading
if __name__ == "__main__":
    docs = load_documents()

    # Examine first document
    if docs:
        print(f"\nFirst document:")
        print(f"Source: {docs[0].metadata.get('source', 'Unknown')}")
        print(f"Content preview: {docs[0].page_content[:200]}...")
Chunk Documents Intelligently
def chunk_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into chunks for embedding"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]  # Try these in order
    )
    chunks = text_splitter.split_documents(documents)

    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} chars")
    return chunks

# Test chunking
if __name__ == "__main__":
    docs = load_documents()
    chunks = chunk_documents(docs)

    # Examine a chunk
    print(f"\nSample chunk:")
    print(f"Content: {chunks[0].page_content[:300]}...")
    print(f"Metadata: {chunks[0].metadata}")
Chunking Strategy
Chunk Size Guidelines:
- Small chunks (200-500 characters): more precise retrieval, but each chunk may lack context
- Medium chunks (500-1000 characters): balanced (recommended)
- Large chunks (1000-2000 characters): more context per chunk, but less precise retrieval
- Overlap (100-200 characters): prevents information loss at chunk boundaries (a comparison sketch follows this list)
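To see how these settings behave before committing, you can compare splitter configurations side by side. A minimal sketch; the sample text is a placeholder for a passage from your own documents:

# Compare chunk counts for a few splitter settings (illustrative only)
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = "Replace this with a long passage from one of your documents. " * 50

for size, overlap in [(300, 50), (1000, 200), (2000, 200)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    pieces = splitter.split_text(sample_text)
    avg = sum(len(p) for p in pieces) / len(pieces)
    print(f"chunk_size={size}, overlap={overlap}: {len(pieces)} chunks, avg {avg:.0f} chars")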
Create Embeddings & Store in Vector DB
def create_vector_store(chunks, persist_directory="./data/chroma_db"):
    """Create ChromaDB vector store from document chunks"""
    # Initialize OpenAI embeddings
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )

    # Create and persist vector store
    print("Creating embeddings and storing in ChromaDB...")
    print("This may take a few minutes depending on document size...")

    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
        collection_name="rag_documents"
    )

    print(f"Vector store created with {vectorstore._collection.count()} vectors")
    print(f"Persisted to: {persist_directory}")
    return vectorstore

# Complete ingestion pipeline
def ingest_documents(data_dir="./data/documents", persist_dir="./data/chroma_db"):
    """Complete document ingestion pipeline"""
    # Load documents
    documents = load_documents(data_dir)
    if not documents:
        print("No documents found!")
        return None

    # Chunk documents
    chunks = chunk_documents(documents)

    # Create vector store
    vectorstore = create_vector_store(chunks, persist_dir)

    print("\n✅ Document ingestion complete!")
    return vectorstore

if __name__ == "__main__":
    vectorstore = ingest_documents()
Expected Output
Loaded 15 pages from PDFs
Loaded 3 text files
Loaded 1 web pages
Total documents loaded: 19
Split 19 documents into 87 chunks
Average chunk size: 892 chars
Creating embeddings and storing in ChromaDB...
This may take a few minutes depending on document size...
Vector store created with 87 vectors
Persisted to: ./data/chroma_db
✅ Document ingestion complete!
Step 3: Vector Search & Retrieval
Initialize Retriever
# retriever.py - Semantic search functionality
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os

# Load OPENAI_API_KEY from .env when run standalone
load_dotenv()

def load_vector_store(persist_directory="./data/chroma_db"):
    """Load existing ChromaDB vector store"""
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
        collection_name="rag_documents"
    )
    print(f"Loaded vector store with {vectorstore._collection.count()} documents")
    return vectorstore

def create_retriever(vectorstore, k=4, search_type="similarity"):
    """Create retriever with configurable search parameters"""
    retriever = vectorstore.as_retriever(
        search_type=search_type,  # "similarity" or "mmr" (max marginal relevance)
        search_kwargs={
            "k": k,          # Number of documents to retrieve
            "fetch_k": 20    # For MMR: fetch more, then rerank
        }
    )
    return retriever

# Test retrieval
if __name__ == "__main__":
    vectorstore = load_vector_store()
    retriever = create_retriever(vectorstore, k=4)

    # Test query
    query = "What is artificial intelligence?"
    docs = retriever.get_relevant_documents(query)

    print(f"\nQuery: {query}")
    print(f"Retrieved {len(docs)} documents:\n")
    for i, doc in enumerate(docs, 1):
        print(f"{i}. Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"   Content: {doc.page_content[:200]}...")
        print()
Advanced Retrieval Strategies
def hybrid_search(vectorstore, query, k=4):
    """Combine similarity search with keyword matching"""
    # 1. Semantic search (vector similarity)
    semantic_docs = vectorstore.similarity_search(query, k=k)

    # 2. Add metadata filtering (optional)
    # Example: only retrieve from a specific source
    # (shown for illustration; this function returns the plain semantic results)
    filtered_docs = vectorstore.similarity_search(
        query,
        k=k,
        filter={"source": "important_document.pdf"}
    )

    return semantic_docs

def mmr_search(vectorstore, query, k=4, diversity=0.3):
    """Maximum Marginal Relevance - balance relevance and diversity"""
    # MMR retrieves diverse results (avoids redundant chunks)
    docs = vectorstore.max_marginal_relevance_search(
        query,
        k=k,
        fetch_k=20,            # Fetch 20, return top k diverse ones
        lambda_mult=diversity  # 0=max diversity, 1=max relevance
    )
    return docs

def retrieval_with_scores(vectorstore, query, k=4):
    """Get documents with similarity scores"""
    docs_with_scores = vectorstore.similarity_search_with_score(query, k=k)

    print(f"Query: {query}\n")
    for i, (doc, score) in enumerate(docs_with_scores, 1):
        print(f"{i}. Score: {score:.4f}")
        print(f"   Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"   Content: {doc.page_content[:150]}...")
        print()
    return docs_with_scores

# Test different retrieval strategies
if __name__ == "__main__":
    vectorstore = load_vector_store()
    query = "Explain machine learning algorithms"

    print("=== Similarity Search ===")
    docs1 = hybrid_search(vectorstore, query)

    print("\n=== MMR Search (Diverse Results) ===")
    docs2 = mmr_search(vectorstore, query)

    print("\n=== With Similarity Scores ===")
    docs3 = retrieval_with_scores(vectorstore, query)
Retrieval Strategies Comparison
- Similarity Search: Most relevant chunks (may have duplicates)
- MMR: Balances relevance + diversity (best for broad questions)
- Metadata Filtering: Restrict to specific sources/dates
- Hybrid: Combine vector + keyword search, most robust (see the sketch below)
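Note that hybrid_search() above still only runs vector similarity. For a true vector + keyword hybrid you can blend a BM25 retriever with the vector retriever using LangChain's EnsembleRetriever. A minimal sketch, assuming rank_bm25 is installed (pip install rank_bm25) and that chunks is the list of document chunks from Step 2; the 50/50 weights are just a starting point:

# Hybrid retrieval: blend BM25 keyword search with vector similarity
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

def create_hybrid_retriever(vectorstore, chunks, k=4):
    """Return an ensemble retriever that merges keyword and semantic results."""
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = k

    vector_retriever = vectorstore.as_retriever(search_kwargs={"k": k})

    return EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.5, 0.5]  # tune to favor keyword vs. semantic matches
    )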
Step 4: Conversational RAG Pipeline
Build Chat Chain with Memory
# chatbot.py - Main RAG chatbot logic
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from dotenv import load_dotenv
import os

# Load OPENAI_API_KEY from .env
load_dotenv()

def create_chatbot(vectorstore):
    """Create RAG chatbot with conversational memory"""
    # Initialize LLM
    llm = ChatOpenAI(
        model="gpt-4-turbo-preview",
        temperature=0.7,
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )

    # Create retriever
    retriever = vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 4, "fetch_k": 20}
    )

    # Initialize conversation memory
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer"
    )

    # Custom system prompt
    system_prompt = """You are a helpful AI assistant that answers questions based on the provided context.

Instructions:
1. Answer questions using ONLY the information from the context
2. If the answer is not in the context, say "I don't have enough information to answer that"
3. Always cite your sources by mentioning the document name
4. Be concise but complete
5. If asked about previous messages, use the chat history

Context:
{context}

Question: {question}

Answer:"""

    # Create conversational retrieval chain
    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=True,
        combine_docs_chain_kwargs={
            "prompt": PromptTemplate(
                template=system_prompt,
                input_variables=["context", "question"]
            )
        }
    )
    return qa_chain

# Chat function
def chat(qa_chain, question):
    """Send a question to the chatbot"""
    result = qa_chain({"question": question})
    answer = result["answer"]
    sources = result["source_documents"]
    return {
        "answer": answer,
        "sources": sources,
        "chat_history": result.get("chat_history", [])
    }

# Test the chatbot
if __name__ == "__main__":
    from retriever import load_vector_store

    # Load vector store
    vectorstore = load_vector_store()

    # Create chatbot
    qa_chain = create_chatbot(vectorstore)

    # Test conversation
    print("RAG Chatbot initialized. Type 'quit' to exit.\n")
    while True:
        question = input("You: ")
        if question.lower() in ['quit', 'exit', 'q']:
            break

        result = chat(qa_chain, question)
        print(f"\nBot: {result['answer']}\n")
        print(f"Sources ({len(result['sources'])}):")
        for i, doc in enumerate(result['sources'], 1):
            source = doc.metadata.get('source', 'Unknown')
            print(f"  {i}. {source}")
        print()
Add Source Citations
def format_answer_with_sources(answer, sources):
    """Format answer with inline source citations"""
    # Extract unique sources, numbering them sequentially
    unique_sources = {}
    for doc in sources:
        source = doc.metadata.get('source', 'Unknown')
        if source not in unique_sources:
            unique_sources[source] = len(unique_sources) + 1

    # Add source list
    formatted_answer = f"{answer}\n\n**Sources:**\n"
    for source, num in unique_sources.items():
        formatted_answer += f"{num}. {source}\n"
    return formatted_answer

def chat_with_citations(qa_chain, question):
    """Chat with formatted citations"""
    result = chat(qa_chain, question)
    formatted = format_answer_with_sources(result['answer'], result['sources'])
    return {
        "answer": formatted,
        "sources": result['sources']
    }
Handle Follow-up Questions
# Test conversation flow
if __name__ == "__main__":
    from retriever import load_vector_store

    vectorstore = load_vector_store()
    qa_chain = create_chatbot(vectorstore)

    # Multi-turn conversation
    questions = [
        "What is machine learning?",
        "What are some applications of it?",                 # "it" refers to ML
        "How does it differ from traditional programming?"   # Still talking about ML
    ]

    print("=== Multi-turn Conversation ===\n")
    for q in questions:
        result = chat(qa_chain, q)
        print(f"Q: {q}")
        print(f"A: {result['answer']}\n")
Memory Management
Memory Types (a swap-in sketch follows this list):
- ConversationBufferMemory: Stores all messages (simple, but grows without limit)
- ConversationBufferWindowMemory: Keeps the last N messages (prevents overflow)
- ConversationSummaryMemory: Summarizes old messages (best for long conversations)
- ConversationSummaryBufferMemory: Hybrid approach (most balanced)
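Swapping memory types is a small change to create_chatbot(). A minimal sketch of the window and summary-buffer variants; the k and max_token_limit values are illustrative:

# Alternative memory configurations (drop-in replacements for ConversationBufferMemory)
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryBufferMemory

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.7)

# Keep only the last 5 exchanges
window_memory = ConversationBufferWindowMemory(
    k=5,
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)

# Summarize older turns once the history grows past ~1000 tokens
summary_memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1000,
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)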
Step 5: Web Interface with Gradio
Create Interactive UI
# app.py - Gradio web interface
import gradio as gr
from chatbot import create_chatbot, chat_with_citations
from retriever import load_vector_store
from ingest import ingest_documents
import os

# Global chatbot instance
qa_chain = None

def initialize_chatbot():
    """Initialize or reinitialize the chatbot"""
    global qa_chain

    # Check if vector store exists
    if not os.path.exists("./data/chroma_db"):
        return "⚠️ No documents found. Please upload documents first."

    try:
        vectorstore = load_vector_store()
        qa_chain = create_chatbot(vectorstore)
        return "✅ Chatbot initialized successfully!"
    except Exception as e:
        return f"❌ Error: {str(e)}"

def upload_documents(files):
    """Handle document uploads"""
    if not files:
        return "No files uploaded"

    # Create documents directory
    os.makedirs("./data/documents", exist_ok=True)

    # Save uploaded files
    for file in files:
        filename = os.path.basename(file.name)
        destination = f"./data/documents/{filename}"
        # Copy file
        with open(file.name, 'rb') as src:
            with open(destination, 'wb') as dst:
                dst.write(src.read())

    # Ingest documents
    try:
        ingest_documents()
        initialize_chatbot()
        return f"✅ Successfully uploaded and processed {len(files)} document(s)"
    except Exception as e:
        return f"❌ Error processing documents: {str(e)}"

def chat_interface(message, history):
    """Handle chat messages and append them to the chat history"""
    global qa_chain
    history = history or []

    if qa_chain is None:
        history.append((message, "Please initialize the chatbot first by uploading documents."))
        return "", history

    try:
        result = chat_with_citations(qa_chain, message)
        history.append((message, result['answer']))
    except Exception as e:
        history.append((message, f"Error: {str(e)}"))
    return "", history

# Create Gradio interface
def create_ui():
    """Create Gradio web interface"""
    with gr.Blocks(title="RAG Chatbot", theme=gr.themes.Soft()) as demo:
        gr.Markdown("# RAG Chatbot with Document Q&A")
        gr.Markdown("Upload documents, then ask questions about them!")

        with gr.Tab("Chat"):
            chatbot = gr.Chatbot(height=500)
            msg = gr.Textbox(
                placeholder="Ask a question about your documents...",
                label="Your Question"
            )
            clear = gr.Button("Clear Chat")

            # Clear the textbox and update the chat history on submit
            msg.submit(chat_interface, [msg, chatbot], [msg, chatbot])
            clear.click(lambda: None, None, chatbot, queue=False)

        with gr.Tab("Upload Documents"):
            gr.Markdown("### Upload your documents (PDF, TXT)")
            file_upload = gr.File(
                file_count="multiple",
                label="Upload Documents",
                file_types=[".pdf", ".txt"]
            )
            upload_btn = gr.Button("Process Documents", variant="primary")
            upload_status = gr.Textbox(label="Status", interactive=False)

            upload_btn.click(
                upload_documents,
                inputs=file_upload,
                outputs=upload_status
            )

        with gr.Tab("About"):
            gr.Markdown("""
### How it works
1. **Upload Documents**: Add PDFs or text files
2. **Processing**: Documents are chunked and embedded
3. **Ask Questions**: Chat about your documents
4. **Get Answers**: Responses include source citations

### Features
- Semantic search with vector embeddings
- Conversational memory (remembers context)
- Source citations for transparency
- Supports PDF and text files

### Technologies
- **LangChain**: RAG pipeline
- **ChromaDB**: Vector database
- **OpenAI**: Embeddings and GPT-4
- **Gradio**: Web interface
""")

    return demo

# Launch the app
if __name__ == "__main__":
    # Try to initialize chatbot if data exists
    if os.path.exists("./data/chroma_db"):
        initialize_chatbot()
        print("✅ Chatbot initialized with existing data")
    else:
        print("⚠️ No existing data. Please upload documents first.")

    # Create and launch UI
    demo = create_ui()
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False  # Set to True to create public link
    )
Launch the Application
# Run the Gradio app
python app.py
# Open browser to: http://localhost:7860
# To create public link (accessible from anywhere):
# demo.launch(share=True)
✅ Your RAG chatbot is now running!
- Go to http://localhost:7860 in your browser
- Upload documents (PDFs, TXT files)
- Wait for processing (~30 seconds for 10 pages)
- Start asking questions!
Step 6: Testing & Evaluation
Test Different Question Types
# test_chatbot.py - Comprehensive testing
from chatbot import create_chatbot, chat
from retriever import load_vector_store

def test_question_types():
    """Test different types of questions"""
    vectorstore = load_vector_store()
    qa_chain = create_chatbot(vectorstore)

    test_cases = [
        # Factual questions
        ("What is the definition of machine learning?", "factual"),
        # Comparison questions
        ("What's the difference between supervised and unsupervised learning?", "comparison"),
        # List questions
        ("What are the main applications of AI?", "list"),
        # Out-of-scope questions
        ("What's the weather today?", "out-of-scope"),
        # Follow-up questions
        ("Tell me more about neural networks", "follow-up"),
        ("What are the advantages?", "follow-up-pronoun"),
    ]

    print("=== Testing Different Question Types ===\n")
    for question, q_type in test_cases:
        result = chat(qa_chain, question)
        print(f"Type: {q_type}")
        print(f"Q: {question}")
        print(f"A: {result['answer'][:200]}...")
        print(f"Sources: {len(result['sources'])}")
        print("-" * 80)
        print()

if __name__ == "__main__":
    test_question_types()
Measure Retrieval Quality
def evaluate_retrieval(vectorstore, test_queries):
    """Evaluate retrieval quality"""
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    print("=== Retrieval Quality Analysis ===\n")
    for query in test_queries:
        docs = retriever.get_relevant_documents(query)

        print(f"Query: {query}")
        print(f"Retrieved: {len(docs)} documents")

        # Check diversity (unique sources)
        sources = [doc.metadata.get('source') for doc in docs]
        unique_sources = len(set(sources))
        print(f"Unique sources: {unique_sources}/{len(docs)}")

        # Check relevance (manual inspection needed)
        print("Top result preview:")
        print(f"  {docs[0].page_content[:150]}...")
        print()

test_queries = [
    "What is deep learning?",
    "Explain natural language processing",
    "What are neural networks?"
]

vectorstore = load_vector_store()
evaluate_retrieval(vectorstore, test_queries)
Expected Performance
Typical Metrics:
- Response Time: 2-4 seconds (including retrieval + generation)
- Retrieval Accuracy: 80-90% (relevant docs in the top 4; see the hit-rate sketch below)
- Answer Quality: High when the documents contain the answer
- Source Coverage: 2-4 unique sources per answer
- Hallucination Rate: <5% (with good prompting)
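To put a number on retrieval accuracy for your own corpus, you can check how often the expected source lands in the top-k results. A minimal sketch; the labeled (query, expected source) pairs are hypothetical and should be replaced with examples from your documents:

# Rough top-k hit-rate check for retrieval accuracy
from retriever import load_vector_store

def retrieval_hit_rate(vectorstore, labeled_queries, k=4):
    """Fraction of queries whose expected source appears in the top-k results."""
    hits = 0
    for query, expected_source in labeled_queries:
        docs = vectorstore.similarity_search(query, k=k)
        sources = [doc.metadata.get("source", "") for doc in docs]
        if any(expected_source in source for source in sources):
            hits += 1
    return hits / len(labeled_queries)

if __name__ == "__main__":
    # Hypothetical labels - replace with real query/source pairs from your data
    labeled = [
        ("What is deep learning?", "ai_overview.pdf"),
        ("Explain natural language processing", "ml_guide.txt"),
    ]
    vectorstore = load_vector_store()
    print(f"Hit rate @ top-4: {retrieval_hit_rate(vectorstore, labeled):.0%}")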
Step 7: Production Enhancements
Add Caching for Faster Responses
# Two caching options: an in-process LRU cache and a shared Redis cache.
# Assumes a local Redis server (pip install redis) and the chat()/qa_chain from chatbot.py.
from functools import lru_cache
import hashlib

import redis

from chatbot import chat

redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

@lru_cache(maxsize=100)
def cached_retrieval(query_hash):
    """Cache retrieval results in-process (keyed by a hash of the query)"""
    # Implement cached retrieval logic here (e.g., look up and return documents)
    pass

def get_cached_answer(qa_chain, question):
    """Check the Redis cache before querying the LLM"""
    question_hash = hashlib.md5(question.encode()).hexdigest()

    # Check if answer is cached
    cached = redis_client.get(f"answer:{question_hash}")
    if cached:
        return cached

    # Generate new answer
    result = chat(qa_chain, question)

    # Cache for 1 hour
    redis_client.setex(
        f"answer:{question_hash}",
        3600,
        result['answer']
    )
    return result['answer']
Add User Feedback
# In the Gradio interface (inside create_ui()); save_feedback() and feedback_text
# are placeholders to wire up to your own storage and UI components
def rate_answer(rating):
    """Collect user feedback"""
    # Save to database (placeholder - store alongside the question and answer)
    save_feedback(rating, question, answer)
    return f"Thanks for your feedback! (Rating: {rating}/5)"

with gr.Row():
    rating = gr.Slider(1, 5, step=1, label="Rate this answer")
    submit_rating = gr.Button("Submit Rating")

submit_rating.click(rate_answer, inputs=rating, outputs=feedback_text)
Monitoring & Logging
import logging
from datetime import datetime

# Setup logging
logging.basicConfig(
    filename='chatbot.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def chat_with_logging(qa_chain, question):
    """Chat with comprehensive logging"""
    start_time = datetime.now()
    try:
        result = chat(qa_chain, question)
        latency = (datetime.now() - start_time).total_seconds()

        logging.info(f"Question: {question}")
        logging.info(f"Answer length: {len(result['answer'])}")
        logging.info(f"Sources: {len(result['sources'])}")
        logging.info(f"Latency: {latency:.2f}s")

        return result
    except Exception as e:
        logging.error(f"Error: {e}")
        raise
Production Checklist
- ✅ Document processing pipeline (PDF, TXT, URLs)
- ✅ Vector database with efficient retrieval
- ✅ Conversational memory for context
- ✅ Source citations for transparency
- ✅ Web interface (Gradio)
- ✅ Error handling and validation
- ✅ Caching for performance
- ✅ User feedback collection
- ✅ Logging and monitoring
- ✅ API key security (.env)
Extensions & Improvements
Advanced Features
- Multi-modal RAG: Add support for images, tables, and charts. Use GPT-4 Vision for visual Q&A.
- Multi-lingual: Support documents in multiple languages using multilingual embeddings (see the sketch below).
- Real-time Updates: Auto-sync documents from Google Drive, Notion, or Confluence.
- Multi-user: Add authentication, per-user document collections, and shared workspaces.
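For the multilingual idea, one option is to swap the OpenAI embeddings for a multilingual sentence-transformers model via LangChain's HuggingFaceEmbeddings wrapper (in newer releases it lives in the langchain_huggingface package). A minimal sketch, assuming sentence-transformers is installed; the model name is one common choice, not a requirement, and documents must be re-ingested into this new collection with these embeddings:

# Multilingual embeddings: a possible swap for create_vector_store() / load_vector_store()
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

multilingual_embeddings = HuggingFaceEmbeddings(
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)

# Keep a separate collection; vectors from different embedding models must not be mixed
vectorstore = Chroma(
    persist_directory="./data/chroma_db_multilingual",
    embedding_function=multilingual_embeddings,
    collection_name="rag_documents_multilingual"
)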
Deployment Options
# Docker deployment
docker build -t rag-chatbot .
docker run -p 7860:7860 -v $(pwd)/data:/app/data rag-chatbot
# Cloud deployment (AWS, GCP, Azure)
# Use managed vector DB: Pinecone, Weaviate, or pgvector
# Serverless deployment
# Use Lambda + API Gateway + DynamoDB
Challenge Extensions
- Hybrid Search: Combine vector search with BM25 keyword search
- Query Rewriting: Use an LLM to reformulate ambiguous questions
- Answer Confidence: Add confidence scores to answers
- Streaming Responses: Stream answers token-by-token for better UX (see the sketch after this list)
- Voice Interface: Add speech-to-text and text-to-speech
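For the streaming extension, ChatOpenAI can already emit tokens as they are generated; a minimal standalone sketch (wiring streaming through ConversationalRetrievalChain and the Gradio UI takes extra plumbing and is left as part of the challenge):

# Stream tokens from the LLM as they are generated
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.7)

for chunk in llm.stream("Summarize what a RAG pipeline does in two sentences."):
    print(chunk.content, end="", flush=True)
print()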
Complete Code Summary
Project Structure
rag-chatbot/
├── data/
│   ├── documents/          # Your PDFs, TXT files
│   │   ├── ai_overview.pdf
│   │   ├── ml_guide.txt
│   │   └── ...
│   └── chroma_db/          # Vector database storage
│       └── ...
├── src/
│   ├── ingest.py           # Document loading, chunking, embedding
│   ├── retriever.py        # Vector search and retrieval
│   ├── chatbot.py          # RAG pipeline with memory
│   └── app.py              # Gradio web interface
├── .env                    # OPENAI_API_KEY=sk-...
├── requirements.txt        # Dependencies
└── README.md               # Documentation
What You Learned
- ✅ Load and process documents (PDF, TXT, web pages)
- ✅ Chunk documents intelligently with overlap
- ✅ Create embeddings with OpenAI ada-002
- ✅ Store and query a vector database (ChromaDB)
- ✅ Build a RAG pipeline with LangChain
- ✅ Add conversational memory for context
- ✅ Implement semantic search (similarity, MMR)
- ✅ Generate answers with source citations
- ✅ Create a web interface with Gradio
- ✅ Deploy a production-ready chatbot
Cost Estimate
| Component | Usage | Cost |
|---|---|---|
| Embeddings | 100k tokens (one-time) | ~$0.01 |
| GPT-4 Queries | 100 questions | ~$3.00 |
| ChromaDB | Local storage | Free |
| Total (dev) | Testing + 100 queries | ~$3.00 |
Congratulations! You've built a production-ready RAG chatbot. You can now:
- Build document Q&A systems for any domain
- Implement semantic search over large document collections
- Create conversational AI with grounded responses
- Deploy LangChain applications to production
Resources & Next Steps
Code Repository
Full project code: github.com/your-repo/rag-chatbot
Further Reading
Next Project
- Project 3: Deploy a Fine-tuned LLM at scale with vLLM
Test Your Knowledge
Q1: What does RAG stand for?
Q2: What is the purpose of vector embeddings in RAG systems?
Q3: Which database is commonly used for vector storage in RAG applications?
Q4: What are the main steps in a RAG pipeline?
Q5: What is the key benefit of RAG over vanilla LLMs?