🚀 From Model to Product
You've learned how LLMs work, how to fine-tune them, and how to optimize inference. Now comes the hardest part: building production systems that real users rely on.
A model in isolation isn't useful. You need to build systems around it: APIs, error handling, monitoring, user interfaces, cost management, and safety guardrails.
💡 The 10/90 Rule
Building the model is 10% of the work. The other 90% is engineering: integration, reliability, monitoring, user experience, cost optimization, and handling edge cases.
📊 Production Reality Check
| Challenge | Development | Production |
|---|---|---|
| Latency | 3-5 seconds is fine | Users leave after 2s |
| Cost | $0.01 per request | $1,000/day at 100k users |
| Errors | Restart script | Must handle gracefully |
| Scale | 1 request at a time | 1,000 concurrent requests |
| Safety | Trusted input | Adversarial attacks |
🏗️ Application Architecture Layers
1. User Interface
Web UI, mobile app, Slack bot, API client. How users interact with your LLM.
2. API Layer
FastAPI/Flask server. Input validation, rate limiting, auth, request routing.
3. LLM Logic
Prompt construction, RAG retrieval, agent tools, conversation management.
4. Model Serving
vLLM/TensorRT-LLM. Efficient inference, batching, caching, quantization.
5. Data Layer
PostgreSQL for conversations, Redis for caching, vector DB for RAG.
6. Monitoring
Logging, metrics, alerts. Track latency, costs, quality, errors.
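To make these layers concrete, here is a minimal sketch of one request flowing through them. The helper functions (validate_request, retrieve_context, call_model, save_conversation, log_metrics) are hypothetical placeholders for the components covered later in this tutorial.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # 2. API layer

class ChatRequest(BaseModel):
    user_id: str
    text: str

@app.post("/chat")  # 1. UI (web, mobile, Slack) calls this endpoint
async def chat(req: ChatRequest):
    validate_request(req)                              # 2. API layer: validation, auth, rate limits
    context = retrieve_context(req.text)               # 3. LLM logic: prompt construction, RAG, tools
    answer = call_model(req.text, context)             # 4. Model serving: vLLM / OpenAI API
    save_conversation(req.user_id, req.text, answer)   # 5. Data layer: Postgres, Redis, vector DB
    log_metrics(req.user_id, answer)                   # 6. Monitoring: latency, cost, errors
    return {"response": answer}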
🎯 Real Example: Customer Support Bot
Frontend: React chat widget
API: FastAPI with auth + rate limiting
Logic: RAG (retrieve docs) + Agent (create tickets)
Model: GPT-4-turbo (for complex queries) + GPT-3.5 (for simple ones)
Storage: PostgreSQL (tickets) + Pinecone (docs)
Monitoring: Datadog for metrics, Sentry for errors
Cost: $0.015/conversation, handles 10k conversations/day = $150/day
🤖 Types of LLM Applications
Chatbots
Conversational interface. Remember context, handle multi-turn conversations.
Agents
Autonomous systems that use tools (APIs, databases) to solve tasks.
Pipelines
Chain multiple LLM calls and tools for complex tasks.
Content Apps
Generate text: summarization, translation, writing assistance.
💬 Building Production Chatbots
Chatbots are the most common LLM application. They seem simple but have hidden complexity: memory management, context windows, latency optimization, and safety.
🏗️ Chatbot Architecture
- UI: Chat interface (web, mobile, Slack, Discord, Teams)
- API: Backend endpoint for processing messages
- Memory: Store and retrieve conversation history
- LLM: Generate contextual responses
- Tools/RAG: Retrieve data, call APIs, take actions
- Safety: Content moderation, input validation
💾 Memory Strategies
| Strategy | What it stores | Best for | Cost |
|---|---|---|---|
| Buffer memory | Last N messages | Short conversations | Low |
| Summary memory | Summary + recent messages | Long conversations | Medium (summarization calls) |
| Vector memory | Embeddings of messages | Retrieving relevant context | High (embeddings + search) |
| Entity memory | Extracted key facts | Remembering user preferences | Medium (extraction calls) |
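The context-window examples later in this section show buffer, summary, and vector-style memory; entity memory is the one strategy without code elsewhere, so here is a minimal LangChain sketch (using the same langchain 0.0.x-style imports as the rest of this tutorial):
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationEntityMemory
from langchain.memory.prompt import ENTITY_MEMORY_CONVERSATION_TEMPLATE

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# The LLM extracts key facts ("entities") from each turn and stores them separately
memory = ConversationEntityMemory(llm=llm)

conversation = ConversationChain(
    llm=llm,
    prompt=ENTITY_MEMORY_CONVERSATION_TEMPLATE,  # prompt template that injects stored entities
    memory=memory
)

conversation.run("My name is Dana and I prefer short, bullet-point answers.")
# Later turns can draw on the extracted facts (see memory.entity_store)
conversation.run("What format do I like my answers in?")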
🚀 Production Chatbot Implementation
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationSummaryBufferMemory
from langchain.chains import ConversationChain
from langchain.callbacks import get_openai_callback
import redis
import json
app = FastAPI()
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0.7)
# Redis for conversation storage
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)
class Message(BaseModel):
text: str
conversation_id: str
user_id: str
def get_conversation_memory(conversation_id: str):
"""Retrieve or create conversation memory"""
history = redis_client.get(f"conversation:{conversation_id}")
memory = ConversationSummaryBufferMemory(
llm=llm,
max_token_limit=2000, # Keep recent context under 2k tokens
return_messages=True
)
if history:
# Load previous messages
messages = json.loads(history)
for msg in messages:
if msg['role'] == 'user':
memory.chat_memory.add_user_message(msg['content'])
else:
memory.chat_memory.add_ai_message(msg['content'])
return memory
def save_conversation(conversation_id: str, user_msg: str, ai_msg: str):
"""Save conversation to Redis with 24hr expiration"""
history = redis_client.get(f"conversation:{conversation_id}")
messages = json.loads(history) if history else []
messages.append({"role": "user", "content": user_msg})
messages.append({"role": "assistant", "content": ai_msg})
# Keep only last 50 messages to prevent memory explosion
messages = messages[-50:]
redis_client.setex(
f"conversation:{conversation_id}",
86400, # 24 hours
json.dumps(messages)
)
@app.post("/chat")
async def chat(message: Message):
"""Main chat endpoint with error handling and monitoring"""
try:
# Get conversation memory
memory = get_conversation_memory(message.conversation_id)
# Create conversation chain
conversation = ConversationChain(
llm=llm,
memory=memory,
verbose=True
)
# Generate response with cost tracking
with get_openai_callback() as cb:
response = conversation.run(message.text)
# Log metrics
print(f"Tokens used: {cb.total_tokens}, Cost: ${cb.total_cost:.4f}")
# Save to database
save_conversation(message.conversation_id, message.text, response)
return {
"response": response,
"conversation_id": message.conversation_id,
"tokens_used": cb.total_tokens,
"cost": cb.total_cost
}
except Exception as e:
print(f"Error in chat endpoint: {e}")
raise HTTPException(status_code=500, detail="Failed to generate response")
# Run: uvicorn app:app --host 0.0.0.0 --port 8000
⚡ Context Window Management
# Handle long conversations that exceed context window
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryBufferMemory
# Strategy 1: Sliding window (keep last N messages)
memory = ConversationBufferWindowMemory(k=10)  # Last 10 messages
# Strategy 2: Summarization (summarize old messages)
memory = ConversationSummaryBufferMemory(
llm=llm,
max_token_limit=2000 # When history exceeds 2k tokens, summarize oldest
)
# Strategy 3: Token counting (precise control)
from langchain.memory import ConversationTokenBufferMemory
memory = ConversationTokenBufferMemory(
llm=llm,
max_token_limit=4000 # Exactly 4k tokens max
)
# Strategy 4: Hybrid (summary + recent messages + vector search)
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import VectorStoreRetrieverMemory
from langchain.vectorstores import FAISS
# Recent messages
recent_memory = ConversationBufferWindowMemory(k=5)
# Semantic search over full history (conversation_history = list of past message strings)
vector_store = FAISS.from_texts(conversation_history, OpenAIEmbeddings())
vector_memory = VectorStoreRetrieverMemory(
    retriever=vector_store.as_retriever(search_kwargs={"k": 3})
)
# Combine: recent context + semantically relevant past messages
🎯 Advanced Features
# 1. User-specific system prompts
user_context = get_user_profile(user_id) # Get preferences, history, etc.
system_prompt = f"""
You are a helpful assistant for {user_context['name']}.
User preferences: {user_context['preferences']}
Conversation style: {user_context['style']}
"""
# 2. Streaming responses (for better UX)
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = ChatOpenAI(
streaming=True,
callbacks=[StreamingStdOutCallbackHandler()]
)
# 3. Multi-turn planning (for complex tasks)
from langchain.chains import LLMChain
# First: understand intent
intent_chain = LLMChain(llm=llm, prompt=intent_prompt)
intent = intent_chain.run(user_message)
# Second: route to appropriate handler
if intent == "question":
response = rag_qa_chain.run(user_message)
elif intent == "task":
response = agent.run(user_message)
else:
response = general_conversation.run(user_message)
⚠️ Common Pitfalls
1. Context Overflow: Don't concatenate unlimited messages. Use summarization or sliding windows.
2. Memory Leaks: Clean up old conversations. Set a TTL (time-to-live) in Redis.
3. No Fallbacks: Always return a graceful error message when the LLM call fails (see the retry/fallback sketch below).
4. Ignoring Latency: Stream responses and show a "typing..." indicator.
5. Poor Safety: Filter harmful inputs and moderate outputs.
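For pitfall 3, a common pattern is to retry transient failures with exponential backoff and then fall back to a cheaper model or a canned message. A minimal sketch (call_llm is a hypothetical wrapper around your LLM client):
import time

FALLBACK_MESSAGE = "Sorry, I'm having trouble answering right now. Please try again in a moment."

def generate_with_fallback(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return call_llm("gpt-4-turbo-preview", prompt)
        except Exception:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s backoff
    # Primary model kept failing: try a cheaper model once, then return a canned reply
    try:
        return call_llm("gpt-3.5-turbo", prompt)
    except Exception:
        return FALLBACK_MESSAGE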
💰 Cost Optimization
- Caching: Cache common responses ("What's your name?", "Hello")
- Model routing: Use GPT-3.5 for simple queries and GPT-4 for complex ones (see the sketch below)
- Token limits: Set max_tokens to prevent runaway costs
- Rate limiting: Limit requests per user (e.g., 10/minute)
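Here is a minimal sketch of the model-routing idea from the list above. The complexity heuristic is just an illustration; in practice you might use a small classifier or let the cheap model decide when to escalate.
def classify_complexity(text: str) -> str:
    """Naive heuristic: long or analytical queries go to the stronger model"""
    complex_markers = ["why", "compare", "analyze", "step by step", "explain"]
    if len(text) > 400 or any(m in text.lower() for m in complex_markers):
        return "complex"
    return "simple"

def pick_model(text: str) -> str:
    return "gpt-4-turbo-preview" if classify_complexity(text) == "complex" else "gpt-3.5-turbo"

# Usage:
# llm = ChatOpenAI(model=pick_model(user_message))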
🤖 Building Intelligent Agents
Agents are LLMs that can use tools autonomously. They reason about what actions to take, execute those actions using tools (APIs, databases, calculators), and iterate until the task is complete.
Think of agents as LLMs with hands - they can interact with the world, not just generate text.
🔄 The ReAct Framework (Reasoning + Acting)
- Thought: "I need to find the current Tesla stock price"
- Action: Use Search tool with query "Tesla stock price today"
- Observation: "Tesla (TSLA) is trading at $242.50"
- Thought: "Now I need to multiply by 1.1"
- Action: Use Calculator tool with "242.50 * 1.1"
- Observation: "266.75"
- Thought: "I have the answer"
- Answer: "If Tesla stock increases by 10%, it would be $266.75"
🧠 Why ReAct Works
By forcing the LLM to verbalize its reasoning before each action, we get better planning, easier debugging, and fewer hallucinations. The model "thinks out loud" like a human solving a problem.
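Under the hood, a ReAct agent is just a prompt that forces this Thought/Action/Observation loop, plus a parser that extracts each Action and feeds the tool's result back in as the Observation. A simplified sketch of such a prompt (LangChain's actual template differs in the details):
REACT_PROMPT = """Answer the question using the following tools:
{tool_descriptions}

Use this format:
Question: the input question
Thought: reason about what to do next
Action: the tool to use, one of [{tool_names}]
Action Input: the input to the tool
Observation: the result of the action
... (Thought/Action/Observation can repeat)
Thought: I now know the final answer
Final Answer: the answer to the original question

Question: {question}
Thought:"""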
🛠️ Building Custom Tools
from langchain.agents import Tool
import requests
import sqlite3
# Tool 1: Web Search
def web_search(query: str) -> str:
"""Search the web using Tavily API"""
response = requests.post(
"https://api.tavily.com/search",
json={"query": query, "max_results": 3},
headers={"Authorization": f"Bearer {TAVILY_API_KEY}"}
)
results = response.json()["results"]
return "\n".join([f"{r['title']}: {r['snippet']}" for r in results])
# Tool 2: Database Query
def query_database(sql: str) -> str:
"""Execute SQL query on customer database"""
conn = sqlite3.connect('customers.db')
cursor = conn.cursor()
# Safety: only allow SELECT queries
if not sql.strip().upper().startswith('SELECT'):
return "Error: Only SELECT queries allowed"
try:
cursor.execute(sql)
results = cursor.fetchall()
return str(results)
except Exception as e:
return f"Error: {e}"
finally:
conn.close()
# Tool 3: Send Email
def send_email(to: str, subject: str, body: str) -> str:
"""Send email to customer"""
# Integrate with SendGrid, Mailgun, etc.
# For safety, require human approval for production
return f"Email queued for approval: To {to}, Subject: {subject}"
# Tool 4: Calculator
from math import *
def calculator(expression: str) -> str:
"""Evaluate mathematical expressions"""
try:
# Safety: use eval carefully, whitelist only math functions
result = eval(expression, {"__builtins__": {}}, {"sqrt": sqrt, "pow": pow})
return str(result)
except Exception as e:
return f"Error: {e}"
# Define tools with clear descriptions (crucial for LLM to use correctly)
tools = [
Tool(
name="Search",
func=web_search,
description="Search the web for current information. Use for facts, news, prices, etc. Input: search query as string"
),
Tool(
name="DatabaseQuery",
func=query_database,
description="Query customer database using SQL. Use for customer info, orders, analytics. Input: SQL SELECT query"
),
Tool(
name="SendEmail",
func=send_email,
description="Send email to customer. Requires to, subject, body. Input: 'to@email.com | Subject | Body text'"
),
Tool(
name="Calculator",
func=calculator,
description="Evaluate math expressions. Use for calculations. Input: mathematical expression as string"
)
]
🤖 Creating the Agent
from langchain.agents import initialize_agent, AgentType
from langchain.chat_models import ChatOpenAI
from langchain.callbacks import get_openai_callback
# Initialize LLM (GPT-4 works best for agents)
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)
# Create agent with tools
agent = initialize_agent(
tools=tools,
llm=llm,
agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
verbose=True, # See reasoning steps
max_iterations=10, # Prevent infinite loops
max_execution_time=60, # Timeout after 60 seconds
early_stopping_method="generate" # Return partial answer on timeout
)
# Example 1: Multi-step query
with get_openai_callback() as cb:
response = agent.run(
"Find the top 3 customers by order count and email them a thank you"
)
print(f"Cost: ${cb.total_cost:.4f}")
# Agent reasoning:
# Thought: Need to query database for top customers
# Action: DatabaseQuery
# Action Input: SELECT customer_email, COUNT(*) as orders FROM orders GROUP BY customer_email ORDER BY orders DESC LIMIT 3
# Observation: [('alice@example.com', 45), ('bob@example.com', 38), ('charlie@example.com', 32)]
# Thought: Now send thank you emails
# Action: SendEmail (3 times)
# Final Answer: Sent thank you emails to top 3 customers
# Example 2: Real-time data + calculation
response = agent.run(
"What's the market cap of Apple? Compare it to Microsoft."
)
# Agent reasoning:
# Thought: Need current stock prices and share counts
# Action: Search "Apple stock price market cap"
# Observation: Apple market cap is $2.8 trillion
# Action: Search "Microsoft stock price market cap"
# Observation: Microsoft market cap is $3.1 trillion
# Thought: Compare the two
# Action: Calculator "3.1 / 2.8"
# Observation: 1.107
# Final Answer: Microsoft's market cap ($3.1T) is 1.11x larger than Apple's ($2.8T)
🔒 Agent Safety & Guardrails
Agents are powerful but risky. They can call APIs, modify databases, send emails. Always implement safety measures.
# Guardrail 1: Human-in-the-loop for critical actions
def send_email_with_approval(to: str, subject: str, body: str) -> str:
"""Queue email for human approval"""
approval_needed.append({"to": to, "subject": subject, "body": body})
return "Email queued for approval. Will be reviewed by human."
# Guardrail 2: Read-only tools
def query_database_readonly(sql: str) -> str:
"""Only allow SELECT queries"""
if not sql.strip().upper().startswith('SELECT'):
return "Error: Only SELECT queries allowed for safety"
# ... execute query
# Guardrail 3: Cost limits
class CostLimitAgent:
def __init__(self, max_cost=1.0):
self.total_cost = 0
self.max_cost = max_cost
def run(self, query):
with get_openai_callback() as cb:
response = agent.run(query)
self.total_cost += cb.total_cost
if self.total_cost > self.max_cost:
raise Exception(f"Cost limit exceeded: ${self.total_cost:.4f}")
return response
# Guardrail 4: Timeout and max iterations
agent = initialize_agent(
tools,
llm,
agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
max_iterations=10, # Stop after 10 tool calls
max_execution_time=60, # Timeout after 60 seconds
early_stopping_method="generate" # Return partial answer
)
# Guardrail 5: Input validation
def validate_input(user_input: str) -> bool:
"""Check for prompt injections and unsafe inputs"""
# Check for common injection patterns
dangerous_patterns = [
"ignore previous instructions",
"disregard all rules",
"system:",
"<|im_start|>"
]
for pattern in dangerous_patterns:
if pattern in user_input.lower():
return False
return True
if not validate_input(user_query):
return "Invalid input detected"
📈 Agent Performance Tips
Clear Tool Descriptions
The more specific your tool descriptions, the better the agent knows when to use them. Bad: "search tool". Good: "Search web for current facts, news, stock prices".
Use GPT-4
GPT-3.5 struggles with multi-step reasoning. GPT-4 is much better at planning and tool use. Worth the extra cost.
Set Timeouts
Agents can get stuck in loops. Always set max_iterations and max_execution_time.
Log Everything
Save all agent reasoning steps. Essential for debugging and improving prompts.
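For the "Log Everything" tip, LangChain callbacks are one way to capture every reasoning step. A minimal sketch (prints for brevity; in production you would write these to your logging/metrics pipeline):
from langchain.callbacks.base import BaseCallbackHandler

class AgentStepLogger(BaseCallbackHandler):
    """Logs each tool call, tool result, and final answer the agent produces"""
    def on_agent_action(self, action, **kwargs):
        # action.tool / action.tool_input / action.log hold the reasoning step
        print(f"[agent] tool={action.tool} input={action.tool_input}")

    def on_tool_end(self, output, **kwargs):
        print(f"[agent] observation={output}")

    def on_agent_finish(self, finish, **kwargs):
        print(f"[agent] final={finish.return_values}")

# Usage:
# agent.run("Find the top 3 customers by order count", callbacks=[AgentStepLogger()])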
🌍 Real-World Agent Examples
- Customer Support: Look up orders, check inventory, process refunds
- Data Analysis: Query databases, create visualizations, export reports
- Research: Search papers, summarize findings, compare approaches
- DevOps: Check server status, restart services, analyze logs
- Sales: Look up leads, send emails, schedule meetings (CRM integration)
⚙️ Production Deployment & Scaling
Deploying an LLM application means making it fast, reliable, scalable, and cost-effective. This section covers Docker, Kubernetes, monitoring, and real-world deployment strategies.
🚀 Production-Ready FastAPI Server
from fastapi import FastAPI, HTTPException, Depends, Header, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, validator
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
import logging
import time
from typing import Optional
app = FastAPI(title="LLM Chat API", version="1.0.0")
# Rate limiting (prevent abuse)
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
# CORS middleware
app.add_middleware(
CORSMiddleware,
allow_origins=["https://yourapp.com"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@app.on_event("startup")
async def load_model():
global llm
logger.info("Loading LLM...")
llm = ChatOpenAI(model="gpt-4-turbo-preview")
logger.info("LLM loaded successfully")
class ChatRequest(BaseModel):
text: str
conversation_id: Optional[str] = None
user_id: str
@validator('text')
def validate_text(cls, v):
if len(v) > 5000:
raise ValueError('Message too long')
if not v.strip():
raise ValueError('Message cannot be empty')
return v
# API key authentication (valid_api_keys is loaded at startup, e.g. from an env var or secrets store)
async def verify_api_key(x_api_key: str = Header(...)):
    if x_api_key not in valid_api_keys:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return x_api_key
@app.post("/chat")
@limiter.limit("10/minute") # 10 requests per minute
async def chat(
request: ChatRequest,
api_key: str = Depends(verify_api_key)
):
start_time = time.time()
try:
# Content moderation
if is_unsafe_content(request.text):
raise HTTPException(status_code=400, detail="Unsafe content")
# Generate response
response = await generate_response(request.text)
# Save to database
await save_message(request.conversation_id, request.text, response)
latency = (time.time() - start_time) * 1000
logger.info(f"Request completed: {latency:.2f}ms, {response['tokens']} tokens")
return {
"response": response['text'],
"tokens_used": response['tokens'],
"latency_ms": latency
}
except Exception as e:
logger.error(f"Error: {e}", exc_info=True)
raise HTTPException(status_code=500, detail="Internal error")
@app.get("/health")
async def health_check():
return {"status": "healthy", "model_loaded": llm is not None}
# Run: uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4
🐳 Docker Deployment
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["gunicorn", "app:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000"]
# docker-compose.yml
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- DATABASE_URL=${DATABASE_URL}
- REDIS_URL=redis://redis:6379
depends_on:
- redis
- postgres
redis:
image: redis:7-alpine
ports:
- "6379:6379"
postgres:
image: postgres:15-alpine
environment:
- POSTGRES_DB=llm_app
- POSTGRES_USER=admin
- POSTGRES_PASSWORD=${DB_PASSWORD}
volumes:
- postgres_data:/var/lib/postgresql/data
nginx:
image: nginx:alpine
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
depends_on:
- api
volumes:
postgres_data:
# Run: docker-compose up -d
☸️ Kubernetes (for scale)
# kubernetes-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-api
spec:
replicas: 3 # 3 instances
selector:
matchLabels:
app: llm-api
template:
metadata:
labels:
app: llm-api
spec:
containers:
- name: api
image: your-registry/llm-api:latest
ports:
- containerPort: 8000
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
readinessProbe:
httpGet:
path: /health
port: 8000
# Deploy: kubectl apply -f kubernetes-deployment.yaml
📊 Monitoring & Logging
# Prometheus metrics
from prometheus_client import Counter, Histogram
request_count = Counter(
'llm_requests_total',
'Total LLM requests',
['endpoint', 'status']
)
request_latency = Histogram(
'llm_request_duration_seconds',
'Request latency'
)
token_usage = Counter(
'llm_tokens_used_total',
'Total tokens consumed'
)
# In your endpoint:
request_count.labels(endpoint='/chat', status='success').inc()
request_latency.observe(latency_seconds)
token_usage.inc(tokens_used)
💰 Cost Tracking
class CostTracker:
def __init__(self):
self.costs = {}
def track_request(self, user_id: str, tokens: int, model: str):
prices = {
'gpt-4-turbo': 0.03, # per 1k tokens
'gpt-3.5-turbo': 0.0015
}
cost = (tokens / 1000) * prices[model]
self.costs[user_id] = self.costs.get(user_id, 0) + cost
if self.costs[user_id] > 10.0: # $10 limit
send_alert(f"User {user_id} exceeded budget")
return cost
⚡ Caching Strategy
import redis
import hashlib
redis_client = redis.Redis()
def get_cached_response(query: str):
cache_key = hashlib.md5(query.encode()).hexdigest()
return redis_client.get(cache_key)
def cache_response(query: str, response: str):
cache_key = hashlib.md5(query.encode()).hexdigest()
redis_client.setex(cache_key, 3600, response) # 1 hour TTL
# In your endpoint:
cached = get_cached_response(request.text)
if cached:
return {"response": cached, "from_cache": True}
🎯 Deployment Comparison
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| OpenAI API | Zero setup, reliable | Expensive at scale | MVPs, prototypes |
| vLLM | 5-10x faster, cheaper | Need GPU infrastructure | High traffic |
| llama.cpp | CPU, offline | Slower | Privacy, edge |
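If the table points you toward self-hosting, here is a minimal vLLM sketch for offline batch generation (the model name is just an example; you need a GPU with enough memory, and vLLM also ships an OpenAI-compatible HTTP server for online serving):
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example open-weights model
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize the benefits of caching LLM responses."], params)
print(outputs[0].outputs[0].text)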
🔒 Security Checklist
- ✅ API keys in environment variables (not hardcoded)
- ✅ Rate limiting (10-100 req/min per user)
- ✅ Input validation (length, content, format)
- ✅ HTTPS only (TLS certificates)
- ✅ Authentication (API keys or OAuth)
- ✅ Content moderation (filter harmful inputs/outputs)
- ✅ Logging (without sensitive data)
- ✅ Firewall rules (restrict IPs if possible)
⚠️ Safety, Quality & Monitoring
Production LLMs need safety guardrails to prevent harm, quality checks to ensure accuracy, and monitoring to track performance and costs.
🛡️ Input Validation & Moderation
# 1. Content moderation (calling the OpenAI moderation endpoint directly)
import openai
user_input = "How do I make explosives?"
result = openai.Moderation.create(input=user_input)["results"][0]
if result["flagged"]:
    # result["categories"] = {"violence": True, "hate": False, ...}
    return "I can't assist with that request."
# 2. Prompt injection detection
def detect_prompt_injection(text: str) -> bool:
"""Detect attempts to hijack the system prompt"""
injection_patterns = [
"ignore previous instructions",
"disregard all above",
"you are now",
"system:",
"assistant:",
"<|im_start|>",
"<|im_end|>"
]
text_lower = text.lower()
for pattern in injection_patterns:
if pattern in text_lower:
return True
return False
if detect_prompt_injection(user_input):
return "Invalid request format."
# 3. Input length limits
if len(user_input) > 5000:
return "Message too long. Please keep it under 5000 characters."
# 4. PII detection (don't log sensitive data)
import re
def contains_pii(text: str) -> bool:
"""Detect credit cards, SSNs, etc."""
patterns = {
'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
}
for name, pattern in patterns.items():
if re.search(pattern, text):
logger.warning(f"PII detected: {name}")
return True
return False
✅ Output Quality Checks
# 1. Fact-checking with RAG
def verify_with_rag(response: str, query: str) -> dict:
"""Check if response is grounded in retrieved documents"""
docs = retriever.get_relevant_documents(query)
# Check if response contains facts from docs
doc_text = " ".join([d.page_content for d in docs])
# Simple check: are response claims in the docs?
response_sentences = response.split('.')
grounded_count = 0
for sentence in response_sentences:
if sentence.strip() in doc_text:
grounded_count += 1
grounding_score = grounded_count / len(response_sentences)
return {
'grounded': grounding_score > 0.5,
'score': grounding_score,
'sources': [d.metadata for d in docs]
}
# 2. Confidence scoring
import math

def get_confidence_score(logprobs: list) -> float:
    """Use token log probabilities to estimate confidence"""
    # Higher average log probability (lower perplexity) = higher confidence
    avg_logprob = sum(logprobs) / len(logprobs)
    return math.exp(avg_logprob)  # Convert back to a probability-like score
# 3. Hallucination detection
def check_hallucination(response: str, context: str) -> bool:
"""Check if response makes claims not in context"""
# Use a separate LLM to verify
verification_prompt = f"""
Context: {context}
Response: {response}
Does the response make claims not supported by the context? Answer yes or no.
"""
result = verification_llm.generate(verification_prompt)
return "yes" in result.lower()
# 4. Add uncertainty disclaimers
def add_disclaimer(response: str, confidence: float) -> str:
"""Add disclaimers for low-confidence responses"""
if confidence < 0.5:
return f"β οΈ Note: I'm not very confident about this answer. Please verify independently.\n\n{response}"
elif confidence < 0.7:
return f"π‘ {response}\n\n(This is my best interpretation, but please double-check if accuracy is critical.)"
else:
return response
📊 Comprehensive Monitoring
# Production monitoring setup
import structlog
from prometheus_client import Counter, Histogram, Gauge
from datadog import statsd
import sentry_sdk
# Structured logging
logger = structlog.get_logger()
# Metrics
REQUEST_COUNT = Counter('llm_requests_total', 'Total requests', ['endpoint', 'status', 'model'])
REQUEST_LATENCY = Histogram('llm_latency_seconds', 'Request latency', ['endpoint'])
TOKEN_USAGE = Counter('llm_tokens_total', 'Token usage', ['model', 'type'])
COST = Counter('llm_cost_dollars', 'Total cost', ['model'])
ACTIVE_USERS = Gauge('llm_active_users', 'Active users')
ERROR_RATE = Counter('llm_errors_total', 'Total errors', ['error_type'])
# Error tracking with Sentry
sentry_sdk.init(dsn="YOUR_SENTRY_DSN")
@app.post("/chat")
async def chat(request: ChatRequest):
log = logger.bind(
user_id=request.user_id,
conversation_id=request.conversation_id,
endpoint='/chat'
)
start_time = time.time()
try:
log.info("request_received", message_length=len(request.text))
# Generate response
with get_openai_callback() as cb:
response = await generate_response(request.text)
latency = time.time() - start_time
# Update metrics
REQUEST_COUNT.labels(endpoint='/chat', status='success', model='gpt-4').inc()
REQUEST_LATENCY.labels(endpoint='/chat').observe(latency)
TOKEN_USAGE.labels(model='gpt-4', type='input').inc(cb.prompt_tokens)
TOKEN_USAGE.labels(model='gpt-4', type='output').inc(cb.completion_tokens)
COST.labels(model='gpt-4').inc(cb.total_cost)
# Datadog custom metrics
statsd.increment('llm.requests')
statsd.histogram('llm.latency', latency * 1000) # milliseconds
statsd.gauge('llm.tokens', cb.total_tokens)
log.info(
"request_completed",
latency_ms=latency * 1000,
tokens=cb.total_tokens,
cost=cb.total_cost
)
return response
except Exception as e:
ERROR_RATE.labels(error_type=type(e).__name__).inc()
log.error(
"request_failed",
error=str(e),
error_type=type(e).__name__
)
# Send to Sentry
sentry_sdk.capture_exception(e)
raise HTTPException(status_code=500, detail="Internal error")
📈 Quality Monitoring Dashboard
# Track user satisfaction
@app.post("/feedback")
async def feedback(conversation_id: str, rating: int, comment: str = None):
"""Collect user feedback"""
save_feedback(conversation_id, rating, comment)
# Alert if ratings drop
recent_avg = get_average_rating(last_n=100)
if recent_avg < 3.5:
send_alert(f"Quality alert: Average rating dropped to {recent_avg}")
return {"status": "ok"}
# A/B testing framework
def select_model(user_id: str) -> str:
"""Route 10% of users to GPT-4, 90% to GPT-3.5"""
if hash(user_id) % 10 == 0:
return "gpt-4-turbo-preview"
else:
return "gpt-3.5-turbo"
# Cost alerts
async def check_cost_limits():
"""Run hourly to check if costs are too high"""
hourly_cost = get_costs_last_hour()
daily_cost = get_costs_today()
if hourly_cost > 50: # $50/hour
send_urgent_alert(f"High cost: ${hourly_cost}/hour")
if daily_cost > 500: # $500/day budget
# Automatically switch to cheaper model
app.state.default_model = "gpt-3.5-turbo"
send_alert("Switched to GPT-3.5 due to budget limits")
📝 Logging Best Practices
DO Log
Timestamps, user IDs, request IDs, latency, tokens, costs, errors, model versions
DON'T Log
Passwords, credit cards, SSNs, private medical info, API keys, sensitive user messages
Analyze
Response times by endpoint, error rates by error type, costs by user/model, user satisfaction trends
Alert On
Error rate >1%, P95 latency >5s, hourly cost >$50, average rating <3.5
🎯 Production Checklist
- ✅ Rate limiting (prevent abuse)
- ✅ Content moderation (filter harmful content)
- ✅ Input validation (length, format, PII)
- ✅ Output quality checks (fact-checking, confidence)
- ✅ Error handling (graceful failures, retries)
- ✅ Monitoring (logs, metrics, alerts)
- ✅ Cost tracking (budgets, alerts)
- ✅ User feedback (ratings, comments)
- ✅ A/B testing (compare models/prompts)
- ✅ Security (HTTPS, auth, API keys)
- ✅ Compliance (GDPR, data retention)
- ✅ Documentation (API docs, deployment guide)
📋 LLM Development Checklist
- ✔️ Choose model (GPT-4, open-source, fine-tuned?)
- ✔️ Implement core LLM logic
- ✔️ Add RAG/tools if needed
- ✔️ Build conversation management
- ✔️ Implement error handling
- ✔️ Add input/output safety checks
- ✔️ Create API server
- ✔️ Add monitoring/logging
- ✔️ Performance optimization (quantization, caching)
- ✔️ Cost tracking and limits
- ✔️ User feedback mechanism
- ✔️ Deploy to production
- ✔️ Continuous monitoring and improvement
🎉 Summary & Congratulations!
What You've Learned in This Course:
- ✅ What LLMs are and how they work
- ✅ Tokenization and embeddings
- ✅ Prompt engineering techniques
- ✅ In-context learning and RAG
- ✅ Fine-tuning strategies (full, LoRA, QLoRA)
- ✅ Inference optimization (quantization, distillation)
- ✅ Building production applications (chatbots, agents)
Your Next Steps
Build something! Start with a simple chatbot or RAG system. Iterate based on user feedback. The best way to learn is by doing.
- Try a small LLM locally with llama.cpp
- Build a RAG chatbot over your documents
- Fine-tune a model on your data (QLoRA)
- Deploy to production
- Share your project on GitHub
🎉 Congratulations! You've completed the LLMs & Transformers course. You now have the skills to build cutting-edge LLM applications. The future of AI is in your hands!
Get Your Completion Certificate
Showcase your LLMs & Transformers expertise!
🏆 Your certificate includes:
- ✅ Official completion verification
- ✅ Unique certificate ID
- ✅ Shareable on LinkedIn, Twitter, and resume
- ✅ Public verification page
- ✅ Professional PDF download
Test Your Knowledge
Q1: What is the main purpose of a system prompt in an LLM application?
Q2: What is the benefit of streaming responses in LLM applications?
Q3: Why is error handling critical in production LLM applications?
Q4: What should you implement to prevent excessive API costs?
Q5: What is the purpose of conversation memory in chatbot applications?