🚀 From Model to Product
You've learned how LLMs work, how to fine-tune them, and how to optimize inference. Now comes the hardest part: building production systems that real users rely on.
A model in isolation isn't useful. You need to build systems around it: APIs, error handling, monitoring, user interfaces, cost management, and safety guardrails.
💡 The 10/90 Rule
Building the model is 10% of the work. The other 90% is engineering: integration, reliability, monitoring, user experience, cost optimization, and handling edge cases.
📊 Production Reality Check
| Challenge | Development | Production |
|---|---|---|
| Latency | 3-5 seconds is fine | Users leave after 2s |
| Cost | $0.01 per request | $1,000/day at 100k users |
| Errors | Restart script | Must handle gracefully |
| Scale | 1 request at a time | 1,000 concurrent requests |
| Safety | Trusted input | Adversarial attacks |
🏗️ Application Architecture Layers
1. User Interface
Web UI, mobile app, Slack bot, API client. How users interact with your LLM.
2. API Layer
FastAPI/Flask server. Input validation, rate limiting, auth, request routing.
3. LLM Logic
Prompt construction, RAG retrieval, agent tools, conversation management.
4. Model Serving
vLLM/TensorRT-LLM. Efficient inference, batching, caching, quantization.
5. Data Layer
PostgreSQL for conversations, Redis for caching, vector DB for RAG.
6. Monitoring
Logging, metrics, alerts. Track latency, costs, quality, errors.
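To make these layers concrete, here is a minimal sketch of one request flowing through them. The helper functions (validate_request, retrieve_context, call_model, save_conversation, log_metrics) are hypothetical placeholders for the components covered later in this tutorial.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # 2. API layer

class ChatRequest(BaseModel):
    user_id: str
    text: str

@app.post("/chat")  # 1. UI (web, mobile, Slack) calls this endpoint
async def chat(req: ChatRequest):
    validate_request(req)                              # 2. API layer: validation, auth, rate limits
    context = retrieve_context(req.text)               # 3. LLM logic: prompt construction, RAG, tools
    answer = call_model(req.text, context)             # 4. Model serving: vLLM / OpenAI API
    save_conversation(req.user_id, req.text, answer)   # 5. Data layer: Postgres, Redis, vector DB
    log_metrics(req.user_id, answer)                   # 6. Monitoring: latency, cost, errors
    return {"response": answer}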
🎯 Real Example: Customer Support Bot
Frontend: React chat widget
API: FastAPI with auth + rate limiting
Logic: RAG (retrieve docs) + Agent (create tickets)
Model: GPT-4-turbo (for complex queries) + GPT-3.5 (for simple ones)
Storage: PostgreSQL (tickets) + Pinecone (docs)
Monitoring: Datadog for metrics, Sentry for errors
Cost: $0.015/conversation, handles 10k conversations/day = $150/day
🤖 Types of LLM Applications
Chatbots
Conversational interface. Remember context, handle multi-turn conversations.
Agents
Autonomous systems that use tools (APIs, databases) to solve tasks.
Pipelines
Chain multiple LLM calls and tools for complex tasks.
Content Apps
Generate text: summarization, translation, writing assistance.
💬 Building Production Chatbots
Chatbots are the most common LLM application. They seem simple but have hidden complexity: memory management, context windows, latency optimization, and safety.
🏗️ Chatbot Architecture
- UI: Chat interface (web, mobile, Slack, Discord, Teams)
- API: Backend endpoint for processing messages
- Memory: Store and retrieve conversation history
- LLM: Generate contextual responses
- Tools/RAG: Retrieve data, call APIs, take actions
- Safety: Content moderation, input validation
💾 Memory Strategies
| Strategy | What it stores | Best for | Cost |
|---|---|---|---|
| Buffer memory | Last N messages | Short conversations | Low |
| Summary memory | Summary + recent messages | Long conversations | Medium (summarization calls) |
| Vector memory | Embeddings of messages | Retrieving relevant context | High (embeddings + search) |
| Entity memory | Extracted key facts | Remembering user preferences | Medium (extraction calls) |
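The context-window examples later in this section show buffer, summary, and vector-style memory; entity memory is the one strategy without code elsewhere, so here is a minimal LangChain sketch (using the same langchain 0.0.x-style imports as the rest of this tutorial):
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationEntityMemory
from langchain.memory.prompt import ENTITY_MEMORY_CONVERSATION_TEMPLATE

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# The LLM extracts key facts ("entities") from each turn and stores them separately
memory = ConversationEntityMemory(llm=llm)

conversation = ConversationChain(
    llm=llm,
    prompt=ENTITY_MEMORY_CONVERSATION_TEMPLATE,  # prompt template that injects stored entities
    memory=memory
)

conversation.run("My name is Dana and I prefer short, bullet-point answers.")
# Later turns can draw on the extracted facts (see memory.entity_store)
conversation.run("What format do I like my answers in?")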
🚀 Production Chatbot Implementation
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain
from langchain.callbacks import get_openai_callback
import redis
import json
app = FastAPI()
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.7)
# Redis for conversation storage
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)
class Message(BaseModel):
text: str
conversation_id: str
user_id: str
def get_conversation_memory(conversation_id: str):
"""Retrieve or create conversation memory"""
history = redis_client.get(f"conversation:{conversation_id}")
memory = ConversationSummaryBufferMemory(
llm=llm,
max_token_limit=2000, # Keep recent context under 2k tokens
return_messages=True
)
if history:
# Load previous messages
messages = json.loads(history)
for msg in messages:
if msg['role'] == 'user':
memory.chat_memory.add_user_message(msg['content'])
else:
memory.chat_memory.add_ai_message(msg['content'])
return memory
def save_conversation(conversation_id: str, user_msg: str, ai_msg: str):
"""Save conversation to Redis with 24hr expiration"""
history = redis_client.get(f"conversation:{conversation_id}")
messages = json.loads(history) if history else []
messages.append({"role": "user", "content": user_msg})
messages.append({"role": "assistant", "content": ai_msg})
# Keep only last 50 messages to prevent memory explosion
messages = messages[-50:]
redis_client.setex(
f"conversation:{conversation_id}",
86400, # 24 hours
json.dumps(messages)
)
@app.post("/chat")
async def chat(message: Message):
"""Main chat endpoint with error handling and monitoring"""
try:
# Get conversation memory
memory = get_conversation_memory(message.conversation_id)
# Create conversation chain
conversation = ConversationChain(
llm=llm,
memory=memory,
verbose=True
)
# Generate response with cost tracking
with get_openai_callback() as cb:
response = conversation.run(message.text)
# Log metrics
print(f"Tokens used: {cb.total_tokens}, Cost: ${cb.total_cost:.4f}")
# Save to database
save_conversation(message.conversation_id, message.text, response)
return {
"response": response,
"conversation_id": message.conversation_id,
"tokens_used": cb.total_tokens,
"cost": cb.total_cost
}
except Exception as e:
print(f"Error in chat endpoint: {e}")
raise HTTPException(status_code=500, detail="Failed to generate response")
# Run: uvicorn app:app --host 0.0.0.0 --port 8000
⚡ Context Window Management
# Handle long conversations that exceed context window
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryBufferMemory
# Strategy 1: Sliding window (keep last N messages)
memory = ConversationBufferWindowMemory(k=10)  # Last 10 messages
# Strategy 2: Summarization (summarize old messages)
memory = ConversationSummaryBufferMemory(
llm=llm,
max_token_limit=2000 # When history exceeds 2k tokens, summarize oldest
)
# Strategy 3: Token counting (precise control)
from langchain.memory import ConversationTokenBufferMemory
memory = ConversationTokenBufferMemory(
llm=llm,
max_token_limit=4000 # Exactly 4k tokens max
)
# Strategy 4: Hybrid (summary + recent messages + vector search)
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import VectorStoreRetrieverMemory
from langchain.vectorstores import FAISS
# Recent messages
recent_memory = ConversationBufferWindowMemory(k=5)
# Semantic search over full history (conversation_history = list of past message strings)
vector_store = FAISS.from_texts(conversation_history, OpenAIEmbeddings())
vector_memory = VectorStoreRetrieverMemory(
    retriever=vector_store.as_retriever(search_kwargs={"k": 3})
)
# Combine: recent context + semantically relevant past messages
🎯 Advanced Features
# 1. User-specific system prompts
user_context = get_user_profile(user_id) # Get preferences, history, etc.
system_prompt = f"""
You are a helpful assistant for {user_context['name']}.
User preferences: {user_context['preferences']}
Conversation style: {user_context['style']}
"""
# 2. Streaming responses (for better UX)
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = ChatOpenAI(
streaming=True,
callbacks=[StreamingStdOutCallbackHandler()]
)
# 3. Multi-turn planning (for complex tasks)
from langchain.chains import LLMChain
# First: understand intent
intent_chain = LLMChain(llm=llm, prompt=intent_prompt)
intent = intent_chain.run(user_message)
# Second: route to appropriate handler
if intent == "question":
response = rag_qa_chain.run(user_message)
elif intent == "task":
response = agent.run(user_message)
else:
response = general_conversation.run(user_message)
⚠️ Common Pitfalls
1. Context Overflow: Don't concatenate unlimited messages. Use summarization or sliding windows.
2. Memory Leaks: Clean up old conversations. Set a TTL (time-to-live) in Redis.
3. No Fallbacks: Always return a graceful error message when the LLM call fails (see the retry/fallback sketch below).
4. Ignoring Latency: Stream responses and show a "typing..." indicator.
5. Poor Safety: Filter harmful inputs and moderate outputs.
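For pitfall 3, a common pattern is to retry transient failures with exponential backoff and then fall back to a cheaper model or a canned message. A minimal sketch (call_llm is a hypothetical wrapper around your LLM client):
import time

FALLBACK_MESSAGE = "Sorry, I'm having trouble answering right now. Please try again in a moment."

def generate_with_fallback(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return call_llm("gpt-4-turbo-preview", prompt)
        except Exception:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff
    # Primary model kept failing: try a cheaper model once, then return a canned reply
    try:
        return call_llm("gpt-3.5-turbo", prompt)
    except Exception:
        return FALLBACK_MESSAGE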
💰 Cost Optimization
- Caching: Cache common responses ("What's your name?", "Hello")
- Model routing: Use GPT-3.5 for simple queries and GPT-4 for complex ones (see the sketch below)
- Token limits: Set max_tokens to prevent runaway costs
- Rate limiting: Limit requests per user (e.g., 10/minute)
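Here is a minimal sketch of the model-routing idea from the list above. The complexity heuristic is just an illustration; in practice you might use a small classifier or let the cheap model decide when to escalate.
def classify_complexity(text: str) -> str:
    """Naive heuristic: long or analytical queries go to the stronger model"""
    complex_markers = ["why", "compare", "analyze", "step by step", "explain"]
    if len(text) > 400 or any(m in text.lower() for m in complex_markers):
        return "complex"
    return "simple"

def pick_model(text: str) -> str:
    return "gpt-4-turbo-preview" if classify_complexity(text) == "complex" else "gpt-3.5-turbo"

# Usage:
# llm = ChatOpenAI(model=pick_model(user_message))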
🤖 Building Intelligent Agents
Agents are LLMs that can use tools autonomously. They reason about what actions to take, execute those actions using tools (APIs, databases, calculators), and iterate until the task is complete.
Think of agents as LLMs with hands - they can interact with the world, not just generate text.
🔄 The ReAct Framework (Reasoning + Acting)
- Thought: "I need to find the current Tesla stock price"
- Action: Use Search tool with query "Tesla stock price today"
- Observation: "Tesla (TSLA) is trading at $242.50"
- Thought: "Now I need to multiply by 1.1"
- Action: Use Calculator tool with "242.50 * 1.1"
- Observation: "266.75"
- Thought: "I have the answer"
- Answer: "If Tesla stock increases by 10%, it would be $266.75"
🧠 Why ReAct Works
By forcing the LLM to verbalize its reasoning before each action, we get better planning, easier debugging, and fewer hallucinations. The model "thinks out loud" like a human solving a problem.
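Under the hood, a ReAct agent is just a prompt that forces this Thought/Action/Observation loop, plus a parser that extracts each Action and feeds the tool's result back in as the Observation. A simplified sketch of such a prompt (LangChain's actual template differs in the details):
REACT_PROMPT = """Answer the question using the following tools:
{tool_descriptions}

Use this format:
Question: the input question
Thought: reason about what to do next
Action: the tool to use, one of [{tool_names}]
Action Input: the input to the tool
Observation: the result of the action
... (Thought/Action/Observation can repeat)
Thought: I now know the final answer
Final Answer: the answer to the original question

Question: {question}
Thought:"""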
🛠️ Building Custom Tools
from langchain.agents import Tool
import requests
import sqlite3
# Tool 1: Web Search
def web_search(query: str) -> str:
"""Search the web using Tavily API"""
response = requests.post(
"https://api.tavily.com/search",
json={"query": query, "max_results": 3},
headers={"Authorization": f"Bearer {TAVILY_API_KEY}"}
)
results = response.json()["results"]
return "\n".join([f"{r['title']}: {r['snippet']}" for r in results])
# Tool 2: Database Query
def query_database(sql: str) -> str:
"""Execute SQL query on customer database"""
conn = sqlite3.connect('customers.db')
cursor = conn.cursor()
# Safety: only allow SELECT queries
if not sql.strip().upper().startswith('SELECT'):
return "Error: Only SELECT queries allowed"
try:
cursor.execute(sql)
results = cursor.fetchall()
return str(results)
except Exception as e:
return f"Error: {e}"
finally:
conn.close()
# Tool 3: Send Email
def send_email(to: str, subject: str, body: str) -> str:
"""Send email to customer"""
# Integrate with SendGrid, Mailgun, etc.
# For safety, require human approval for production
return f"Email queued for approval: To {to}, Subject: {subject}"
# Tool 4: Calculator
from math import *
def calculator(expression: str) -> str:
"""Evaluate mathematical expressions"""
try:
# Safety: use eval carefully, whitelist only math functions
result = eval(expression, {"__builtins__": {}}, {"sqrt": sqrt, "pow": pow})
return str(result)
except Exception as e:
return f"Error: {e}"
# Define tools with clear descriptions (crucial for LLM to use correctly)
tools = [
Tool(
name="Search",
func=web_search,
description="Search the web for current information. Use for facts, news, prices, etc. Input: search query as string"
),
Tool(
name="DatabaseQuery",
func=query_database,
description="Query customer database using SQL. Use for customer info, orders, analytics. Input: SQL SELECT query"
),
Tool(
name="SendEmail",
func=send_email,
description="Send email to customer. Requires to, subject, body. Input: 'to@email.com | Subject | Body text'"
),
Tool(
name="Calculator",
func=calculator,
description="Evaluate math expressions. Use for calculations. Input: mathematical expression as string"
)
]
🤖 Creating the Agent
from langchain.agents import initialize_agent, AgentType
from langchain.chat_models import ChatOpenAI
from langchain.callbacks import get_openai_callback
# Initialize LLM (GPT-4 works best for agents)
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
# Create agent with tools
agent = initialize_agent(
tools=tools,
llm=llm,
agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
verbose=True, # See reasoning steps
max_iterations=10, # Prevent infinite loops
max_execution_time=60, # Timeout after 60 seconds
early_stopping_method="generate" # Return partial answer on timeout
)
# Example 1: Multi-step query
with get_openai_callback() as cb:
response = agent.run(
"Find the top 3 customers by order count and email them a thank you"
)
print(f"Cost: ${cb.total_cost:.4f}")
# Agent reasoning:
# Thought: Need to query database for top customers
# Action: DatabaseQuery
# Action Input: SELECT customer_email, COUNT(*) as orders FROM orders GROUP BY customer_email ORDER BY orders DESC LIMIT 3
# Observation: [('alice@example.com', 45), ('bob@example.com', 38), ('charlie@example.com', 32)]
# Thought: Now send thank you emails
# Action: SendEmail (3 times)
# Final Answer: Sent thank you emails to top 3 customers
# Example 2: Real-time data + calculation
response = agent.run(
"What's the market cap of Apple? Compare it to Microsoft."
)
# Agent reasoning:
# Thought: Need current stock prices and share counts
# Action: Search "Apple stock price market cap"
# Observation: Apple market cap is $2.8 trillion
# Action: Search "Microsoft stock price market cap"
# Observation: Microsoft market cap is $3.1 trillion
# Thought: Compare the two
# Action: Calculator "3.1 / 2.8"
# Observation: 1.107
# Final Answer: Microsoft's market cap ($3.1T) is 1.11x larger than Apple's ($2.8T)
🔒 Agent Safety & Guardrails
Agents are powerful but risky. They can call APIs, modify databases, send emails. Always implement safety measures.
# Guardrail 1: Human-in-the-loop for critical actions
def send_email_with_approval(to: str, subject: str, body: str) -> str:
"""Queue email for human approval"""
approval_needed.append({"to": to, "subject": subject, "body": body})
return "Email queued for approval. Will be reviewed by human."
# Guardrail 2: Read-only tools
def query_database_readonly(sql: str) -> str:
"""Only allow SELECT queries"""
if not sql.strip().upper().startswith('SELECT'):
return "Error: Only SELECT queries allowed for safety"
# ... execute query
# Guardrail 3: Cost limits
class CostLimitAgent:
def __init__(self, max_cost=1.0):
self.total_cost = 0
self.max_cost = max_cost
def run(self, query):
with get_openai_callback() as cb:
response = agent.run(query)
self.total_cost += cb.total_cost
if self.total_cost > self.max_cost:
raise Exception(f"Cost limit exceeded: ${self.total_cost:.4f}")
return response
# Guardrail 4: Timeout and max iterations
agent = initialize_agent(
tools,
llm,
agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
max_iterations=10, # Stop after 10 tool calls
max_execution_time=60, # Timeout after 60 seconds
early_stopping_method="generate" # Return partial answer
)
# Guardrail 5: Input validation
def validate_input(user_input: str) -> bool:
"""Check for prompt injections and unsafe inputs"""
# Check for common injection patterns
dangerous_patterns = [
"ignore previous instructions",
"disregard all rules",
"system:",
"<|im_start|>"
]
for pattern in dangerous_patterns:
if pattern in user_input.lower():
return False
return True
if not validate_input(user_query):
return "Invalid input detected"
📈 Agent Performance Tips
Clear Tool Descriptions
The more specific your tool descriptions, the better the agent knows when to use them. Bad: "search tool". Good: "Search web for current facts, news, stock prices".
Use GPT-4
GPT-3.5 struggles with multi-step reasoning. GPT-4 is much better at planning and tool use. Worth the extra cost.
Set Timeouts
Agents can get stuck in loops. Always set max_iterations and max_execution_time.
Log Everything
Save all agent reasoning steps. Essential for debugging and improving prompts.
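For the "Log Everything" tip, LangChain callbacks are one way to capture every reasoning step. A minimal sketch (prints for brevity; in production you would write these to your logging/metrics pipeline):
from langchain.callbacks.base import BaseCallbackHandler

class AgentStepLogger(BaseCallbackHandler):
    """Logs each tool call, tool result, and final answer the agent produces"""
    def on_agent_action(self, action, **kwargs):
        # action.tool / action.tool_input / action.log hold the reasoning step
        print(f"[agent] tool={action.tool} input={action.tool_input}")

    def on_tool_end(self, output, **kwargs):
        print(f"[agent] observation={output}")

    def on_agent_finish(self, finish, **kwargs):
        print(f"[agent] final={finish.return_values}")

# Usage:
# agent.run("Find the top 3 customers by order count", callbacks=[AgentStepLogger()])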
🌍 Real-World Agent Examples
- Customer Support: Look up orders, check inventory, process refunds
- Data Analysis: Query databases, create visualizations, export reports
- Research: Search papers, summarize findings, compare approaches
- DevOps: Check server status, restart services, analyze logs
- Sales: Look up leads, send emails, schedule meetings (CRM integration)
⚙️ Production Deployment & Scaling
Deploying an LLM application means making it fast, reliable, scalable, and cost-effective. This section covers Docker, Kubernetes, monitoring, and real-world deployment strategies.
🚀 Production-Ready FastAPI Server
from fastapi import FastAPI, HTTPException, Depends, Header, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, validator
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
import logging
import time
from typing import Optional
app = FastAPI(title="LLM Chat API", version="1.0.0")
# Rate limiting (prevent abuse)
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
# CORS middleware
app.add_middleware(
CORSMiddleware,
allow_origins=["https://yourapp.com"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@app.on_event("startup")
async def load_model():
global llm
logger.info("Loading LLM...")
llm = ChatOpenAI(model="gpt-4-turbo-preview")
logger.info("LLM loaded successfully")
class ChatRequest(BaseModel):
text: str
conversation_id: Optional[str] = None
user_id: str
@validator('text')
def validate_text(cls, v):
if len(v) > 5000:
raise ValueError('Message too long')
if not v.strip():
raise ValueError('Message cannot be empty')
return v
# API key authentication (valid_api_keys is loaded at startup, e.g. from an env var or secrets store)
async def verify_api_key(x_api_key: str = Header(...)):
    if x_api_key not in valid_api_keys:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return x_api_key
@app.post("/chat")
@limiter.limit("10/minute") # 10 requests per minute
async def chat(
request: ChatRequest,
api_key: str = Depends(verify_api_key)
):
start_time = time.time()
try:
# Content moderation
if is_unsafe_content(request.text):
raise HTTPException(status_code=400, detail="Unsafe content")
# Generate response
response = await generate_response(request.text)
# Save to database
await save_message(request.conversation_id, request.text, response)
latency = (time.time() - start_time) * 1000
logger.info(f"Request completed: {latency:.2f}ms, {response['tokens']} tokens")
return {
"response": response['text'],
"tokens_used": response['tokens'],
"latency_ms": latency
}
except Exception as e:
logger.error(f"Error: {e}", exc_info=True)
raise HTTPException(status_code=500, detail="Internal error")
@app.get("/health")
async def health_check():
return {"status": "healthy", "model_loaded": llm is not None}
# Run: uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4
🐳 Docker Deployment
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["gunicorn", "app:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000"]
# docker-compose.yml
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- DATABASE_URL=${DATABASE_URL}
- REDIS_URL=redis://redis:6379
depends_on:
- redis
- postgres
redis:
image: redis:7-alpine
ports:
- "6379:6379"
postgres:
image: postgres:15-alpine
environment:
- POSTGRES_DB=llm_app
- POSTGRES_USER=admin
- POSTGRES_PASSWORD=${DB_PASSWORD}
volumes:
- postgres_data:/var/lib/postgresql/data
nginx:
image: nginx:alpine
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
depends_on:
- api
volumes:
postgres_data:
# Run: docker-compose up -d
☸️ Kubernetes (for scale)
# kubernetes-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-api
spec:
replicas: 3 # 3 instances
selector:
matchLabels:
app: llm-api
template:
metadata:
labels:
app: llm-api
spec:
containers:
- name: api
image: your-registry/llm-api:latest
ports:
- containerPort: 8000
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
readinessProbe:
httpGet:
path: /health
port: 8000
# Deploy: kubectl apply -f kubernetes-deployment.yaml
📊 Monitoring & Logging
# Prometheus metrics
from prometheus_client import Counter, Histogram
request_count = Counter(
'llm_requests_total',
'Total LLM requests',
['endpoint', 'status']
)
request_latency = Histogram(
'llm_request_duration_seconds',
'Request latency'
)
token_usage = Counter(
'llm_tokens_used_total',
'Total tokens consumed'
)
# In your endpoint:
request_count.labels(endpoint='/chat', status='success').inc()
request_latency.observe(latency_seconds)
token_usage.inc(tokens_used)
💰 Cost Tracking
class CostTracker:
def __init__(self):
self.costs = {}
def track_request(self, user_id: str, tokens: int, model: str):
prices = {
'gpt-4-turbo': 0.03, # per 1k tokens
'gpt-3.5-turbo': 0.0015
}
cost = (tokens / 1000) * prices[model]
self.costs[user_id] = self.costs.get(user_id, 0) + cost
if self.costs[user_id] > 10.0: # $10 limit
send_alert(f"User {user_id} exceeded budget")
return cost
⚡ Caching Strategy
import redis
import hashlib
redis_client = redis.Redis()
def get_cached_response(query: str):
cache_key = hashlib.md5(query.encode()).hexdigest()
return redis_client.get(cache_key)
def cache_response(query: str, response: str):
cache_key = hashlib.md5(query.encode()).hexdigest()
redis_client.setex(cache_key, 3600, response) # 1 hour TTL
# In your endpoint:
cached = get_cached_response(request.text)
if cached:
return {"response": cached, "from_cache": True}
🎯 Deployment Comparison
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| OpenAI API | Zero setup, reliable | Expensive at scale | MVPs, prototypes |
| vLLM | 5-10x faster, cheaper | Need GPU infrastructure | High traffic |
| llama.cpp | CPU, offline | Slower | Privacy, edge |
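If the table points you toward self-hosting, here is a minimal vLLM sketch for offline batch generation (the model name is just an example; you need a GPU with enough memory, and vLLM also ships an OpenAI-compatible HTTP server for online serving):
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example open-weights model
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize the benefits of caching LLM responses."], params)
print(outputs[0].outputs[0].text)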
🔒 Security Checklist
- ✅ API keys in environment variables (not hardcoded)
- ✅ Rate limiting (10-100 req/min per user)
- ✅ Input validation (length, content, format)
- ✅ HTTPS only (TLS certificates)
- ✅ Authentication (API keys or OAuth)
- ✅ Content moderation (filter harmful inputs/outputs)
- ✅ Logging (without sensitive data)
- ✅ Firewall rules (restrict IPs if possible)
⚠️ Safety, Quality & Monitoring
Production LLMs need safety guardrails to prevent harm, quality checks to ensure accuracy, and monitoring to track performance and costs.
🛡️ Input Validation & Moderation
# 1. Content moderation (calling the OpenAI moderation endpoint directly)
import openai
user_input = "How do I make explosives?"
result = openai.Moderation.create(input=user_input)["results"][0]
if result["flagged"]:
    # result["categories"] = {"violence": True, "hate": False, ...}
    return "I can't assist with that request."
# 2. Prompt injection detection
def detect_prompt_injection(text: str) -> bool:
"""Detect attempts to hijack the system prompt"""
injection_patterns = [
"ignore previous instructions",
"disregard all above",
"you are now",
"system:",
"assistant:",
"<|im_start|>",
"<|im_end|>"
]
text_lower = text.lower()
for pattern in injection_patterns:
if pattern in text_lower:
return True
return False
if detect_prompt_injection(user_input):
return "Invalid request format."
# 3. Input length limits
if len(user_input) > 5000:
return "Message too long. Please keep it under 5000 characters."
# 4. PII detection (don't log sensitive data)
import re
def contains_pii(text: str) -> bool:
"""Detect credit cards, SSNs, etc."""
patterns = {
'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
}
for name, pattern in patterns.items():
if re.search(pattern, text):
logger.warning(f"PII detected: {name}")
return True
return False
✅ Output Quality Checks
# 1. Fact-checking with RAG
def verify_with_rag(response: str, query: str) -> dict:
"""Check if response is grounded in retrieved documents"""
docs = retriever.get_relevant_documents(query)
# Check if response contains facts from docs
doc_text = " ".join([d.page_content for d in docs])
# Simple check: are response claims in the docs?
response_sentences = response.split('.')
grounded_count = 0
for sentence in response_sentences:
if sentence.strip() in doc_text:
grounded_count += 1
grounding_score = grounded_count / len(response_sentences)
return {
'grounded': grounding_score > 0.5,
'score': grounding_score,
'sources': [d.metadata for d in docs]
}
# 2. Confidence scoring
import math

def get_confidence_score(logprobs: list) -> float:
    """Use token log probabilities to estimate confidence"""
    # Higher average log probability (lower perplexity) = higher confidence
    avg_logprob = sum(logprobs) / len(logprobs)
    return math.exp(avg_logprob)  # Convert back to a probability-like score
# 3. Hallucination detection
def check_hallucination(response: str, context: str) -> bool:
"""Check if response makes claims not in context"""
# Use a separate LLM to verify
verification_prompt = f"""
Context: {context}
Response: {response}
Does the response make claims not supported by the context? Answer yes or no.
"""
result = verification_llm.generate(verification_prompt)
return "yes" in result.lower()
# 4. Add uncertainty disclaimers
def add_disclaimer(response: str, confidence: float) -> str:
"""Add disclaimers for low-confidence responses"""
if confidence < 0.5:
return f"β οΈ Note: I'm not very confident about this answer. Please verify independently.\n\n{response}"
elif confidence < 0.7:
return f"π‘ {response}\n\n(This is my best interpretation, but please double-check if accuracy is critical.)"
else:
return response
📊 Comprehensive Monitoring
# Production monitoring setup
import structlog
from prometheus_client import Counter, Histogram, Gauge
from datadog import statsd
import sentry_sdk
# Structured logging
logger = structlog.get_logger()
# Metrics
REQUEST_COUNT = Counter('llm_requests_total', 'Total requests', ['endpoint', 'status', 'model'])
REQUEST_LATENCY = Histogram('llm_latency_seconds', 'Request latency', ['endpoint'])
TOKEN_USAGE = Counter('llm_tokens_total', 'Token usage', ['model', 'type'])
COST = Counter('llm_cost_dollars', 'Total cost', ['model'])
ACTIVE_USERS = Gauge('llm_active_users', 'Active users')
ERROR_RATE = Counter('llm_errors_total', 'Total errors', ['error_type'])
# Error tracking with Sentry
sentry_sdk.init(dsn="YOUR_SENTRY_DSN")
@app.post("/chat")
async def chat(request: ChatRequest):
log = logger.bind(
user_id=request.user_id,
conversation_id=request.conversation_id,
endpoint='/chat'
)
start_time = time.time()
try:
log.info("request_received", message_length=len(request.text))
# Generate response
with get_openai_callback() as cb:
response = await generate_response(request.text)
latency = time.time() - start_time
# Update metrics
REQUEST_COUNT.labels(endpoint='/chat', status='success', model='gpt-4').inc()
REQUEST_LATENCY.labels(endpoint='/chat').observe(latency)
TOKEN_USAGE.labels(model='gpt-4', type='input').inc(cb.prompt_tokens)
TOKEN_USAGE.labels(model='gpt-4', type='output').inc(cb.completion_tokens)
COST.labels(model='gpt-4').inc(cb.total_cost)
# Datadog custom metrics
statsd.increment('llm.requests')
statsd.histogram('llm.latency', latency * 1000) # milliseconds
statsd.gauge('llm.tokens', cb.total_tokens)
log.info(
"request_completed",
latency_ms=latency * 1000,
tokens=cb.total_tokens,
cost=cb.total_cost
)
return response
except Exception as e:
ERROR_RATE.labels(error_type=type(e).__name__).inc()
log.error(
"request_failed",
error=str(e),
error_type=type(e).__name__
)
# Send to Sentry
sentry_sdk.capture_exception(e)
raise HTTPException(status_code=500, detail="Internal error")
📈 Quality Monitoring Dashboard
# Track user satisfaction
@app.post("/feedback")
async def feedback(conversation_id: str, rating: int, comment: str = None):
"""Collect user feedback"""
save_feedback(conversation_id, rating, comment)
# Alert if ratings drop
recent_avg = get_average_rating(last_n=100)
if recent_avg < 3.5:
send_alert(f"Quality alert: Average rating dropped to {recent_avg}")
return {"status": "ok"}
# A/B testing framework
def select_model(user_id: str) -> str:
"""Route 10% of users to GPT-4, 90% to GPT-3.5"""
if hash(user_id) % 10 == 0:
return "gpt-4-turbo-preview"
else:
return "gpt-3.5-turbo"
# Cost alerts
async def check_cost_limits():
"""Run hourly to check if costs are too high"""
hourly_cost = get_costs_last_hour()
daily_cost = get_costs_today()
if hourly_cost > 50: # $50/hour
send_urgent_alert(f"High cost: ${hourly_cost}/hour")
if daily_cost > 500: # $500/day budget
# Automatically switch to cheaper model
app.state.default_model = "gpt-3.5-turbo"
send_alert("Switched to GPT-3.5 due to budget limits")
📝 Logging Best Practices
DO Log
Timestamps, user IDs, request IDs, latency, tokens, costs, errors, model versions
DON'T Log
Passwords, credit cards, SSNs, private medical info, API keys, sensitive user messages
Analyze
Response times by endpoint, error rates by error type, costs by user/model, user satisfaction trends
Alert On
Error rate >1%, P95 latency >5s, hourly cost >$50, average rating <3.5
🎯 Production Checklist
- ✅ Rate limiting (prevent abuse)
- ✅ Content moderation (filter harmful content)
- ✅ Input validation (length, format, PII)
- ✅ Output quality checks (fact-checking, confidence)
- ✅ Error handling (graceful failures, retries)
- ✅ Monitoring (logs, metrics, alerts)
- ✅ Cost tracking (budgets, alerts)
- ✅ User feedback (ratings, comments)
- ✅ A/B testing (compare models/prompts)
- ✅ Security (HTTPS, auth, API keys)
- ✅ Compliance (GDPR, data retention)
- ✅ Documentation (API docs, deployment guide)
📋 LLM Development Checklist
- ✔️ Choose model (GPT-4, open-source, fine-tuned?)
- ✔️ Implement core LLM logic
- ✔️ Add RAG/tools if needed
- ✔️ Build conversation management
- ✔️ Implement error handling
- ✔️ Add input/output safety checks
- ✔️ Create API server
- ✔️ Add monitoring/logging
- ✔️ Performance optimization (quantization, caching)
- ✔️ Cost tracking and limits
- ✔️ User feedback mechanism
- ✔️ Deploy to production
- ✔️ Continuous monitoring and improvement
🎉 Summary & Congratulations!
What You've Learned in This Course:
- ✅ What LLMs are and how they work
- ✅ Tokenization and embeddings
- ✅ Prompt engineering techniques
- ✅ In-context learning and RAG
- ✅ Fine-tuning strategies (full, LoRA, QLoRA)
- ✅ Inference optimization (quantization, distillation)
- ✅ Building production applications (chatbots, agents)
Your Next Steps
Build something! Start with a simple chatbot or RAG system. Iterate based on user feedback. The best way to learn is by doing.
- Try a small LLM locally with llama.cpp
- Build a RAG chatbot over your documents
- Fine-tune a model on your data (QLoRA)
- Deploy to production
- Share your project on GitHub
🎉 Congratulations! You've completed the LLMs & Transformers course. You now have the skills to build cutting-edge LLM applications. The future of AI is in your hands!
Get Your Completion Certificate
Showcase your LLMs & Transformers expertise!
🏆 Your certificate includes:
- ✅ Official completion verification
- ✅ Unique certificate ID
- ✅ Shareable on LinkedIn, Twitter, and resume
- ✅ Public verification page
- ✅ Professional PDF download
Test Your Knowledge
Q1: What is the main purpose of a system prompt in an LLM application?
Q2: What is the benefit of streaming responses in LLM applications?
Q3: Why is error handling critical in production LLM applications?
Q4: What should you implement to prevent excessive API costs?
Q5: What is the purpose of conversation memory in chatbot applications?