🎯 Project Overview
Research is time-consuming. You search Google, read articles, synthesize information, and write reports. What if an AI agent could do this autonomously? In this project, you'll build an intelligent research assistant that:
- Searches the web using multiple search engines (Google, DuckDuckGo, Wikipedia)
- Reads and extracts content from websites intelligently
- Summarizes findings from multiple sources
- Generates reports with citations and structured information
- Handles errors gracefully when sources are unavailable
- Tracks progress and shows reasoning steps
🌍 Real-World Impact: Companies like Perplexity.ai and Bing Chat use similar architectures. This project demonstrates enterprise-level agent design!
What You'll Build
┌───────────────────────────────────────────┐
│         Research Assistant Agent          │
│                                           │
│  User Query: "Explain quantum computing"  │
└─────────────────────┬─────────────────────┘
                      │
                      ▼
           ┌──────────────────────┐
           │   Planning Module    │
           │  (Break into steps)  │
           └──────────┬───────────┘
                      │
         ┌────────────┼────────────┐
         │            │            │
         ▼            ▼            ▼
     ┌────────┐  ┌────────┐  ┌────────┐
     │ Google │  │  Wiki  │  │ DuckD. │   ← Search Tools
     └───┬────┘  └───┬────┘  └───┬────┘
         │           │           │
         └───────────┼───────────┘
                     │
                     ▼
           ┌──────────────────────┐
           │  Content Extraction  │
           │    (Read & Parse)    │
           └──────────┬───────────┘
                      │
                      ▼
           ┌──────────────────────┐
           │    Summarization     │
           │  (Synthesize Info)   │
           └──────────┬───────────┘
                      │
                      ▼
           ┌──────────────────────┐
           │  Report Generation   │
           │   (Final Document)   │
           └──────────────────────┘
🛠️ Setup & Dependencies
1. Install Required Packages
# Core dependencies
pip install langchain langchain-community langchain-openai
pip install openai python-dotenv
# Search & web tools
pip install google-search-results wikipedia duckduckgo-search
pip install beautifulsoup4 requests html2text
# Optional: For better PDF handling
pip install pypdf pdfplumber
2. Set Up API Keys
Create a .env file in your project root:
# OpenAI API key (required)
OPENAI_API_KEY=your_openai_key_here
# SerpAPI key for Google search (optional but recommended)
SERPAPI_API_KEY=your_serpapi_key_here
# Note: Wikipedia and DuckDuckGo don't require API keys
💡 API Keys:
- OpenAI: Get from platform.openai.com
- SerpAPI: Free 100 searches/month at serpapi.com
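Once the `.env` file exists, it's worth a quick pre-flight check that the keys actually load before wiring up the agent. This is a minimal sketch; `check_keys` is an illustrative helper, not part of the project code:

```python
import os

# Load .env if python-dotenv is installed; the agent itself calls
# load_dotenv() later, so this is only a pre-flight check.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

def check_keys() -> dict:
    """Report which API keys are configured in the environment."""
    return {
        "openai": os.getenv("OPENAI_API_KEY") is not None,    # required
        "serpapi": os.getenv("SERPAPI_API_KEY") is not None,  # optional
    }

print(check_keys())
```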
3. Create Project Structure
research-assistant/
├── .env                 # API keys
├── research_agent.py    # Main agent code
├── tools/
│   ├── search_tools.py  # Search implementations
│   ├── web_tools.py     # Web scraping
│   └── report_tools.py  # Report generation
├── outputs/             # Generated reports
└── requirements.txt     # Dependencies
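The requirements.txt in the tree can simply list the packages installed earlier (unpinned here; pin whatever versions you actually install):

```text
langchain
langchain-community
langchain-openai
openai
python-dotenv
google-search-results
wikipedia
duckduckgo-search
beautifulsoup4
requests
html2text
```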
💻 Building the Research Assistant
Step 1: Create Search Tools
"""
Search tools for the research assistant
"""
from langchain.tools import Tool
from langchain_community.utilities import SerpAPIWrapper, WikipediaAPIWrapper
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
import os
from typing import List, Dict

class SearchTools:
    """Collection of search tools"""

    def __init__(self):
        # Initialize search APIs
        self.serpapi_key = os.getenv("SERPAPI_API_KEY")

        # Google results via SerpAPI (the google-search-results package);
        # SerpAPIWrapper reads SERPAPI_API_KEY from the environment
        if self.serpapi_key:
            self.google_search = SerpAPIWrapper()

        # Wikipedia (always available)
        self.wikipedia = WikipediaAPIWrapper()

        # DuckDuckGo (always available, no API key needed)
        self.ddg_search = DuckDuckGoSearchAPIWrapper()
def search_google(self, query: str) -> str:
"""Search Google and return results"""
try:
if not self.serpapi_key:
return "Google search unavailable (no API key)"
results = self.google_search.run(query)
            return f"Google Results:\n{results}"
except Exception as e:
return f"Google search failed: {str(e)}"
def search_wikipedia(self, query: str) -> str:
"""Search Wikipedia and return article summary"""
try:
results = self.wikipedia.run(query)
# Truncate to first 1000 characters
            return f"Wikipedia Summary:\n{results[:1000]}..."
except Exception as e:
return f"Wikipedia search failed: {str(e)}"
def search_duckduckgo(self, query: str) -> str:
"""Search DuckDuckGo and return results"""
try:
results = self.ddg_search.run(query)
            return f"DuckDuckGo Results:\n{results}"
except Exception as e:
return f"DuckDuckGo search failed: {str(e)}"
def get_tools(self) -> List[Tool]:
"""Get all search tools as LangChain Tools"""
tools = [
Tool(
name="Google Search",
func=self.search_google,
description="Search Google for current information. Use for recent events, news, and general web content."
),
Tool(
name="Wikipedia Search",
func=self.search_wikipedia,
description="Search Wikipedia for factual, encyclopedic information. Best for historical facts, definitions, and well-established knowledge."
),
Tool(
name="DuckDuckGo Search",
func=self.search_duckduckgo,
description="Search DuckDuckGo for web results. Good fallback when other searches fail. Privacy-focused."
)
]
return tools
# Test the tools
if __name__ == "__main__":
search_tools = SearchTools()
# Test Wikipedia
result = search_tools.search_wikipedia("Artificial Intelligence")
print(result)
# Test DuckDuckGo
result = search_tools.search_duckduckgo("latest AI developments")
print(result)
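Because each search method returns an error string instead of raising, the engines chain naturally into a fallback: try Google first, then Wikipedia, then DuckDuckGo. A sketch (`search_with_fallback` is an illustrative helper, not part of the tutorial code):

```python
def search_with_fallback(tools, query: str) -> str:
    """Try each engine in order and return the first usable result.

    Relies on the convention above: SearchTools methods return strings
    containing 'failed' or 'unavailable' instead of raising exceptions.
    """
    for method in ("search_google", "search_wikipedia", "search_duckduckgo"):
        result = getattr(tools, method)(query)
        if "failed" not in result and "unavailable" not in result:
            return result
    return "All search engines failed."
```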
Step 2: Create Web Scraping Tools
"""
Web scraping and content extraction tools
"""
import requests
from bs4 import BeautifulSoup
import html2text
from typing import Optional
from langchain.tools import Tool
class WebTools:
"""Tools for fetching and parsing web content"""
def __init__(self):
self.html2text = html2text.HTML2Text()
self.html2text.ignore_links = False
self.html2text.ignore_images = True
def fetch_webpage(self, url: str) -> str:
"""Fetch and extract text content from a webpage"""
try:
# Fetch page
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
# Parse HTML
soup = BeautifulSoup(response.content, 'html.parser')
# Remove script and style elements
for script in soup(["script", "style", "nav", "footer", "header"]):
script.decompose()
# Get text
text = self.html2text.handle(str(soup))
# Clean and truncate
text = text.strip()
if len(text) > 3000:
                text = text[:3000] + "\n\n[Content truncated...]"
return f"Content from {url}:\\n{text}"
except requests.exceptions.Timeout:
return f"Error: Timeout fetching {url}"
except requests.exceptions.RequestException as e:
return f"Error fetching {url}: {str(e)}"
except Exception as e:
return f"Error parsing {url}: {str(e)}"
def extract_links(self, url: str) -> str:
"""Extract all links from a webpage"""
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Extract all links
links = []
for link in soup.find_all('a', href=True):
href = link['href']
if href.startswith('http'):
links.append(href)
# Return first 10 links
links = links[:10]
            return f"Links found on {url}:\n" + "\n".join(links)
except Exception as e:
return f"Error extracting links from {url}: {str(e)}"
def get_tools(self) -> list:
"""Get web tools as LangChain Tools"""
return [
Tool(
name="Fetch Webpage",
func=self.fetch_webpage,
description="Fetch and read the full text content of a webpage given its URL. Returns cleaned text content."
),
Tool(
name="Extract Links",
func=self.extract_links,
description="Extract all hyperlinks from a webpage. Useful for finding related resources."
)
]
# Test
if __name__ == "__main__":
web_tools = WebTools()
content = web_tools.fetch_webpage("https://en.wikipedia.org/wiki/Artificial_intelligence")
print(content[:500])
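Note that `extract_links` silently drops relative links (`/about`, `../index.html`). If you want those too, `urllib.parse.urljoin` can resolve them against the page URL; a sketch, where `resolve_links` is an illustrative helper:

```python
from urllib.parse import urljoin

def resolve_links(base_url: str, hrefs: list) -> list:
    """Resolve relative hrefs against base_url, keeping only http(s) URLs."""
    absolute = []
    for href in hrefs:
        full = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if full.startswith(("http://", "https://")):
            absolute.append(full)
    return absolute
```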
Step 3: Build the Research Agent
"""
Autonomous Research Assistant Agent
"""
import os
from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from tools.search_tools import SearchTools
from tools.web_tools import WebTools
from datetime import datetime
import json
# Load environment variables
load_dotenv()
class ResearchAgent:
"""Autonomous research assistant that can search and synthesize information"""
def __init__(self, model: str = "gpt-4-turbo-preview", verbose: bool = True):
self.model = model
self.verbose = verbose
# Initialize LLM
self.llm = ChatOpenAI(
model=model,
temperature=0.7,
openai_api_key=os.getenv("OPENAI_API_KEY")
)
# Initialize tools
self.search_tools = SearchTools()
self.web_tools = WebTools()
# Combine all tools
self.tools = self.search_tools.get_tools() + self.web_tools.get_tools()
# Create agent
self.agent_executor = self._create_agent()
def _create_agent(self) -> AgentExecutor:
"""Create the research agent with tools"""
# System prompt
system_prompt = """You are an expert research assistant. Your goal is to help users research topics thoroughly by:
1. **Planning**: Break down the research question into sub-questions
2. **Searching**: Use multiple search tools (Google, Wikipedia, DuckDuckGo) to gather information
3. **Reading**: Fetch and read relevant webpages when needed
4. **Synthesizing**: Combine information from multiple sources
5. **Citing**: Always cite your sources with URLs
**Research Process:**
- Start by searching Wikipedia for foundational knowledge
- Use Google or DuckDuckGo for recent information and diverse perspectives
- Fetch full webpage content when you need detailed information
- Cross-reference facts across multiple sources
- Organize findings clearly with headings and bullet points
**Output Format:**
Your final report should include:
- Executive Summary (2-3 sentences)
- Key Findings (bullet points)
- Detailed Analysis (organized sections)
- Sources Cited (URLs with descriptions)
Be thorough but concise. Prioritize accuracy over speed.
"""
# Create prompt template
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}"),
MessagesPlaceholder(variable_name="agent_scratchpad")
])
# Create agent
agent = create_openai_tools_agent(
llm=self.llm,
tools=self.tools,
prompt=prompt
)
# Create executor
return AgentExecutor(
agent=agent,
tools=self.tools,
verbose=self.verbose,
max_iterations=10,
handle_parsing_errors=True
)
def research(self, query: str) -> dict:
"""Conduct research on a topic"""
        print(f"\n🔍 Starting research on: {query}\n")
print("=" * 60)
try:
# Run agent
result = self.agent_executor.invoke({
"input": f"""Research this topic thoroughly: {query}
Please provide a comprehensive report with:
1. Executive Summary
2. Key Findings
3. Detailed Analysis
4. Sources Cited"""
})
# Extract output
report = result["output"]
            print("\n" + "=" * 60)
            print("✅ Research Complete!")
            print("=" * 60)
return {
"success": True,
"query": query,
"report": report,
"timestamp": datetime.now().isoformat()
}
except Exception as e:
            print(f"\n❌ Research failed: {str(e)}")
return {
"success": False,
"query": query,
"error": str(e),
"timestamp": datetime.now().isoformat()
}
def save_report(self, result: dict, filename: str = None):
"""Save research report to file"""
if not filename:
# Generate filename from query and timestamp
            query_slug = "".join(c if c.isalnum() else "_" for c in result["query"][:30])
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"outputs/research_{query_slug}_{timestamp}.md"
# Create outputs directory if it doesn't exist
os.makedirs("outputs", exist_ok=True)
# Format report
        content = f"""# Research Report: {result['query']}

**Generated:** {result['timestamp']}
**Status:** {'✅ Success' if result['success'] else '❌ Failed'}

---

{result.get('report', result.get('error', 'No content'))}

---

*Generated by AI Research Assistant*
"""
# Save to file
with open(filename, 'w', encoding='utf-8') as f:
f.write(content)
        print(f"\n📄 Report saved to: {filename}")
# Main execution
if __name__ == "__main__":
# Create agent
agent = ResearchAgent(model="gpt-4-turbo-preview", verbose=True)
# Example research queries
queries = [
"What is quantum computing and how does it differ from classical computing?",
"Explain the latest developments in large language models as of 2024",
"What are the environmental impacts of AI and data centers?"
]
# Research first query
result = agent.research(queries[0])
# Save report
agent.save_report(result)
# Print report
if result["success"]:
        print("\n" + "=" * 60)
print("RESEARCH REPORT")
print("=" * 60)
print(result["report"])
✅ Checkpoint: Test Your Agent
Run the agent with a simple query:
python research_agent.py
You should see the agent:
- Planning its research approach
- Searching multiple sources
- Reading relevant content
- Generating a comprehensive report
🚀 Advanced Features
Add Memory for Context
from langchain.memory import ConversationBufferMemory
class ResearchAgentWithMemory(ResearchAgent):
"""Research agent with conversation memory"""
def __init__(self, model: str = "gpt-4-turbo-preview", verbose: bool = True):
super().__init__(model, verbose)
# Add memory
self.memory = ConversationBufferMemory(
memory_key="chat_history",
return_messages=True
)
def research_followup(self, query: str) -> dict:
"""Research with access to previous conversation"""
# Get conversation history
history = self.memory.load_memory_variables({})
# Add to prompt
enhanced_query = f"""Previous context: {history}
New research question: {query}"""
result = self.research(enhanced_query)
# Save to memory
self.memory.save_context(
{"input": query},
{"output": result.get("report", "")}
)
return result
# Usage
agent = ResearchAgentWithMemory()
result1 = agent.research("What is quantum computing?")
result2 = agent.research_followup("How is it used in cryptography?") # Has context!
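One caveat: `ConversationBufferMemory` keeps the entire history, so long sessions steadily inflate the prompt. A capped buffer keeps only the last k exchanges. Here is the idea in plain Python as a conceptual sketch (not LangChain's API; LangChain ships its own windowed memory classes):

```python
from collections import deque

class WindowMemory:
    """Keep only the most recent k (user, agent) exchanges."""

    def __init__(self, k: int = 3):
        self.turns = deque(maxlen=k)  # older turns fall off automatically

    def save(self, user: str, agent: str):
        self.turns.append((user, agent))

    def context(self) -> str:
        return "\n".join(f"User: {u}\nAgent: {a}" for u, a in self.turns)
```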
Add Progress Tracking
from langchain.callbacks import StdOutCallbackHandler
from typing import Any
class ProgressCallback(StdOutCallbackHandler):
"""Custom callback to track progress"""
def __init__(self):
super().__init__()
self.steps = []
self.current_step = 0
def on_tool_start(self, serialized: dict, input_str: str, **kwargs):
"""Called when tool starts"""
self.current_step += 1
tool_name = serialized.get("name", "Unknown")
        print(f"\n🔧 Step {self.current_step}: Using {tool_name}")
print(f" Input: {input_str[:100]}...")
self.steps.append({
"step": self.current_step,
"tool": tool_name,
"input": input_str
})
def on_tool_end(self, output: str, **kwargs):
"""Called when tool completes"""
        print(f"   ✅ Complete: {output[:100]}...")
# Use with agent
agent = ResearchAgent(verbose=True)
callback = ProgressCallback()
result = agent.agent_executor.invoke(
{"input": "Research quantum computing"},
config={"callbacks": [callback]}
)
# View progress
print(f"\n📊 Research completed in {len(callback.steps)} steps")
Add Cost Tracking
from langchain.callbacks import get_openai_callback
def research_with_cost_tracking(agent: ResearchAgent, query: str):
"""Research and track costs"""
with get_openai_callback() as cb:
result = agent.research(query)
# Print cost summary
    print(f"\n💰 Cost Summary:")
print(f" Tokens used: {cb.total_tokens}")
print(f" Prompt tokens: {cb.prompt_tokens}")
print(f" Completion tokens: {cb.completion_tokens}")
print(f" Total cost: ${cb.total_cost:.4f}")
# Add to result
result["cost"] = {
"tokens": cb.total_tokens,
"cost_usd": cb.total_cost
}
return result
# Usage
agent = ResearchAgent()
result = research_with_cost_tracking(agent, "Explain blockchain technology")
💪 Challenges & Extensions
🔥 Challenge 1: Multi-Topic Research
Modify the agent to research multiple related topics and compare findings.
def compare_research(topics: list) -> dict:
"""Research multiple topics and compare"""
agent = ResearchAgent()
results = {}
for topic in topics:
results[topic] = agent.research(topic)
# Generate comparison report
comparison_prompt = f"""Compare these research findings:
{json.dumps(results, indent=2)}
Provide a comparative analysis highlighting:
1. Common themes
2. Key differences
3. Unique insights from each
"""
# TODO: Use agent to generate comparison
pass
# Test
compare_research([
"Quantum computing",
"Classical computing",
"Neuromorphic computing"
])
🔥 Challenge 2: Add PDF Research
Enable the agent to read and analyze PDF documents.
import pypdf
def read_pdf_tool(file_path: str) -> str:
"""Read and extract text from PDF"""
try:
reader = pypdf.PdfReader(file_path)
text = ""
for page in reader.pages:
text += page.extract_text()
return text[:3000] # Truncate
except Exception as e:
return f"Error reading PDF: {e}"
# Add to agent tools
Tool(
name="Read PDF",
func=read_pdf_tool,
description="Read and extract text from a PDF file"
)
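Long PDFs won't fit in a single prompt, so rather than truncating at 3000 characters you can split the extracted text into overlapping chunks and summarize piece by piece. A sketch (the chunk sizes are arbitrary assumptions):

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list:
    """Split text into chunks of `size` chars, overlapping by `overlap`."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # overlap preserves context across boundaries
    return chunks
```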
🔥 Challenge 3: Build a Web Interface
Create a Streamlit web interface for the research assistant.
import streamlit as st
st.title("🔍 AI Research Assistant")
query = st.text_input("What would you like to research?")
if st.button("Start Research"):
with st.spinner("Researching..."):
agent = ResearchAgent(verbose=False)
result = agent.research(query)
if result["success"]:
st.success("Research complete!")
st.markdown(result["report"])
else:
st.error(f"Research failed: {result['error']}")
# Run with: streamlit run app.py
🚀 Production Deployment
1. Add Error Handling & Retries
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def search_with_retry(search_func, query):
"""Search with automatic retries"""
return search_func(query)
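If you'd rather not add tenacity as a dependency, the same exponential-backoff pattern is a few lines of stdlib Python (a sketch; `retry_call` is an illustrative helper):

```python
import time

def retry_call(func, *args, attempts: int = 3, base_delay: float = 1.0):
    """Call func(*args), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(min(base_delay * (2 ** attempt), 10))
```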
2. Add Rate Limiting
from ratelimit import limits, sleep_and_retry
@sleep_and_retry
@limits(calls=10, period=60) # 10 calls per minute
def rate_limited_search(query):
"""Rate-limited search"""
return search(query)
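The `ratelimit` decorators above are also third-party; a minimal stdlib throttle that sleeps between calls looks like this (a sketch; `Throttle` is an illustrative helper):

```python
import time

class Throttle:
    """Enforce a minimum interval between calls by sleeping."""

    def __init__(self, calls_per_minute: int):
        self.interval = 60.0 / calls_per_minute
        self._last = 0.0

    def wait(self):
        # Sleep off whatever remains of the interval since the last call
        delay = self.interval - (time.monotonic() - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```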
3. Cache Results
import functools
import hashlib
@functools.lru_cache(maxsize=100)
def cached_research(query_hash: str):
"""Cache research results"""
# Implementation
pass
# Usage
query = "Explain blockchain technology"
query_hash = hashlib.md5(query.encode()).hexdigest()
result = cached_research(query_hash)
⚠️ Production Considerations:
- API Limits: Monitor OpenAI and search API usage
- Cost Control: Set max_tokens limits to control costs
- Error Handling: Implement comprehensive try-catch blocks
- Logging: Log all searches and results for debugging
- Security: Validate all URLs before fetching content
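The last bullet deserves code: fetching arbitrary URLs suggested by an LLM is an SSRF risk. A minimal validator might look like this (a sketch; `is_safe_url` is an illustrative helper, and a real deployment should also resolve hostnames to IPs before checking):

```python
import ipaddress
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Allow only http(s) URLs and reject literal private/loopback IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        if ipaddress.ip_address(parsed.hostname).is_private:
            return False  # covers loopback and RFC 1918 ranges
    except ValueError:
        pass  # hostname is a domain name, not an IP literal
    return True
```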
🎯 Key Takeaways
- Agent Architecture: Tools + LLM + Planning = Autonomous behavior
- Tool Design: Each tool should have clear purpose and error handling
- Search Strategy: Use multiple sources for comprehensive coverage
- Error Handling: Graceful degradation when tools fail
- Cost Management: Track tokens and implement caching
- Iterative Improvement: Start simple, add features incrementally
📚 Next Steps
- Project 2: Multi-Agent Code Review System - Learn agent collaboration
- Project 3: Business Process Automation - Build practical automation
- Agent Evaluation & Safety - Learn to evaluate and secure agents
- Production Agent Systems - Deploy at scale
💡 Share Your Project: Built something cool? Share it on Twitter/LinkedIn with #AIAgents and tag @AITutorialsSite!
🎉 Congratulations!
You've built an autonomous AI research assistant that can search, read, and synthesize information!
← Back to AI Agents Course