You've built prompts. You've tested them. They work great... on your laptop, with the examples you chose.
But will they work in production? When thousands of users hit them with unpredictable inputs? When costs scale? When a bad output damages your brand?
This module teaches professional prompt engineering: the unglamorous but essential practices that separate hobby projects from production systems.
By the end of this module, you'll know how to:
- Evaluate prompt quality with quantitative rubrics
- Use LLM-as-judge for automated testing
- Implement safety guardrails and content filters
- Monitor costs, errors, and performance in production
- Version and manage prompts like code
- Handle edge cases and graceful degradation
Concept 1: Evaluating Prompt Quality
You can't improve what you don't measure. Before deploying prompts, establish clear evaluation criteria.
The 5-Criteria Rubric
| Criterion | Poor (1-2) | Good (3-4) | Excellent (5) |
|---|---|---|---|
| Relevance | Off-topic or vague | Mostly on-topic | Perfectly addresses request |
| Accuracy | Contains errors | Mostly correct | Factually accurate, verifiable |
| Completeness | Missing key elements | Covers main points | Comprehensive, nothing missing |
| Format | Wrong structure | Acceptable format | Perfect format match |
| Tone | Inappropriate | Acceptable | Perfect for audience |
How to Use the Rubric
Score each output from 1 to 5 on every criterion, then average the scores. Track those averages across prompt versions so you can tell whether a change actually improved quality.
Task-Specific Metrics
Beyond the rubric, measure what matters for your use case (a small classification-metrics sketch follows this list):
- Summarization: ROUGE score (overlap with reference summaries)
- Classification: Precision, recall, F1 score
- Code generation: Pass rate on unit tests
- Translation: BLEU score (similarity to human translations)
- Creative writing: Diversity score (unique words/phrases)
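As one concrete example, here is a minimal sketch of scoring a classification prompt with scikit-learn; the ticket labels and predictions are hypothetical, not part of the module's reference code.

```python
# Minimal sketch: scoring a classification prompt with scikit-learn.
# Assumes you already collected the model's predicted labels for a labeled test set.
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground-truth labels and model predictions for 8 support tickets
y_true = ["billing", "bug", "billing", "feature", "bug", "bug", "feature", "billing"]
y_pred = ["billing", "bug", "bug", "feature", "bug", "billing", "feature", "billing"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```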
Concept 2: LLM-as-Judge Pattern
Manual evaluation doesn't scale. Use ChatGPT itself to evaluate outputs; it's fast, consistent, and surprisingly accurate.
Basic LLM-as-Judge Prompt
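A minimal judge prompt might look like the following; the exact criteria and scale are illustrative and should be adapted to your task.

```text
You are an impartial evaluator. Score the response below on a 1-5 scale for each
criterion: relevance, accuracy, completeness, format, and tone.

Original request: {user_request}
Response to evaluate: {model_output}

Return only JSON, e.g. {"relevance": 4, "accuracy": 5, "completeness": 3, "format": 5, "tone": 4}
```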
Advanced: Pairwise Comparison
Instead of scoring individually, have ChatGPT compare two outputs side by side; pairwise comparison is more reliable for subjective quality.
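A sketch of a pairwise judge prompt (illustrative wording; randomize which output appears as A or B to reduce position bias):

```text
You are comparing two candidate responses to the same request.

Request: {user_request}
Response A: {output_a}
Response B: {output_b}

Which response better satisfies the request in terms of accuracy, completeness,
and tone? Answer with exactly one of "A", "B", or "tie", followed by one sentence
of justification.
```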
Building an Evaluation Pipeline
```python
# Python pseudo-code for automated evaluation.
# call_chatgpt() and parse_json() are assumed helpers: one wraps the chat API,
# the other extracts a JSON object from the judge's reply.
def evaluate_prompt(prompt, test_cases):
    scores = []
    for test_input in test_cases:
        # Generate output
        output = call_chatgpt(prompt + test_input)

        # Evaluate with LLM-as-judge
        eval_prompt = f"""
        Evaluate this output on clarity, accuracy, completeness (1-5 each):
        Output: {output}
        Return JSON: {{"clarity": X, "accuracy": X, "completeness": X}}
        """
        evaluation = call_chatgpt(eval_prompt)
        scores.append(parse_json(evaluation))

    # Average the three criteria for each test case, then average across cases
    per_case = [sum(s.values()) / len(s) for s in scores]
    return sum(per_case) / len(per_case)

# Test prompt v1
score_v1 = evaluate_prompt(prompt_v1, test_cases)  # 3.8/5

# Improve prompt, test v2
score_v2 = evaluate_prompt(prompt_v2, test_cases)  # 4.4/5, better!
```
LLM-as-Judge Limitations:
- Can be biased (prefers verbose responses, certain styles)
- May miss factual errors (doesn't verify facts)
- Should be calibrated against human judgments
- Best combined with other metrics, not used alone
Concept 3: Safety, Privacy & Guardrails
Content Filtering
Prevent harmful, biased, or off-brand outputs before they reach users.
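One common building block is OpenAI's moderation endpoint, which flags categories such as hate, violence, and self-harm. A minimal sketch, assuming the openai Python SDK (v1.x) is installed and OPENAI_API_KEY is set:

```python
# Minimal content-filter sketch using OpenAI's moderation endpoint.
# Blocks a reply before it reaches the user if the moderation model flags it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_safe(text: str) -> bool:
    result = client.moderations.create(input=text)
    return not result.results[0].flagged

def filtered_reply(model_output: str) -> str:
    if is_safe(model_output):
        return model_output
    # Fallback message instead of surfacing a flagged output
    return "Sorry, I can't help with that request."
```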
Data Privacy Checklist
NEVER include in prompts:
- Passwords, API keys, credentials: they get logged and can leak
- SSNs, passport numbers, government IDs: privacy violation
- Protected health information (PHI/HIPAA): legal risk
- Financial account details: security risk
- Trade secrets and confidential business data: only with an Enterprise plan and a Data Processing Agreement in place
Best practices:
- Use placeholders: "Customer with ID {customer_id}" instead of actual PII (a redaction sketch follows this list)
- Anonymize data: "A 35-year-old patient" instead of "John Smith, age 35"
- Scope permissions: only give the AI access to the data it needs
- Enterprise plans: use OpenAI Business/Enterprise with zero data retention
- Audit logs: track what data was sent to the ChatGPT APIs
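A minimal sketch of placeholder-based redaction using regular expressions; the patterns and placeholder names are illustrative assumptions, and real systems usually pair this with a dedicated PII-detection library.

```python
# Minimal PII-redaction sketch: swap obvious identifiers for placeholders
# before the text is sent to the API. Patterns here are illustrative only.
import re

REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "{ssn}"),          # US SSN format
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "{email}"),  # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "{card_number}"), # long digit runs
]

def redact(text: str) -> str:
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Reach John at john.smith@example.com, SSN 123-45-6789"))
# -> "Reach John at {email}, SSN {ssn}"
```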
Bias Detection & Mitigation
Probe prompts with inputs that vary only in demographic details (names, genders, locations) and compare the outputs; systematic differences in tone, quality, or recommendations signal bias that needs prompt or policy changes.
Concept 4: Monitoring & Observability
Key Metrics to Track
| Metric | Why It Matters | Target/Alert |
|---|---|---|
| Latency (response time) | User experience, timeout risks | P95 < 5s, alert if > 10s |
| Token usage (cost) | Budget control, optimization opportunities | Track avg per request, alert if +50% spike |
| Error rate | Reliability, need for retries | Target < 1%, alert if > 5% |
| Quality score (from eval) | Output quality degradation over time | Weekly average > 4.0/5 |
| Fallback rate | How often safety filters block outputs | < 2% (higher = too restrictive) |
Logging Strategy
```python
# Comprehensive logging for production prompts.
# calculate_cost() is assumed to be a helper that maps token counts to dollars
# using your model's current pricing.
import json
import logging
import time

import openai

def call_chatgpt_with_logging(prompt, user_id, request_id):
    start_time = time.time()
    try:
        # Make the API call
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )

        latency = time.time() - start_time
        tokens_used = response.usage.total_tokens
        cost = calculate_cost(tokens_used, "gpt-4o")

        # Log success
        log_data = {
            "timestamp": time.time(),
            "request_id": request_id,
            "user_id": user_id,
            "model": "gpt-4o",
            "latency_seconds": latency,
            "tokens_used": tokens_used,
            "cost_usd": cost,
            "status": "success",
            "prompt_version": "v2.3",  # track which prompt version produced this
        }
        logging.info(json.dumps(log_data))

        return response.choices[0].message.content

    except Exception as e:
        # Log the failure with enough context to debug it later
        log_data = {
            "timestamp": time.time(),
            "request_id": request_id,
            "user_id": user_id,
            "status": "error",
            "error_type": type(e).__name__,
            "error_message": str(e),
            "prompt_version": "v2.3",
        }
        logging.error(json.dumps(log_data))
        raise
```
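The calculate_cost helper referenced above isn't defined in the module; a rough sketch might look like the following, where the per-token rates are placeholders you should replace with figures from OpenAI's current pricing page (real billing also prices input and output tokens differently).

```python
# Rough cost estimate from total tokens. Rates below are PLACEHOLDERS, not
# current OpenAI pricing; look up real input/output rates for your model.
PLACEHOLDER_RATES_PER_1K_TOKENS = {
    "gpt-4o": 0.005,
    "gpt-4o-mini": 0.0005,
}

def calculate_cost(tokens_used: int, model: str) -> float:
    rate = PLACEHOLDER_RATES_PER_1K_TOKENS.get(model, 0.01)
    return tokens_used / 1000 * rate
```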
Alert Examples
Set up alerts for:
- Cost spike: if daily cost > $200 (or 2x normal), alert the ops team
- Error spike: if error rate > 5% for 5+ minutes, page on-call
- Latency degradation: if P95 latency > 10s, investigate the bottleneck
- Quality drop: if eval score < 3.5 across 100+ samples, review the prompt
- Rate limit hit: if you see 429 errors, back off and retry (see the sketch below) or upgrade your rate-limit tier
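A minimal exponential-backoff sketch for retrying after 429 (rate limit) responses; the exception name assumes the openai v1 SDK, and the retry counts and delays are arbitrary starting points.

```python
# Retry with exponential backoff when the API returns 429 (rate limited).
import time

import openai

def call_with_backoff(prompt, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)   # wait before retrying
            delay *= 2          # double the wait each time
```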
Concept 5: Prompt Versioning & Deployment
Treat Prompts Like Code
Store prompts in version control (Git), test before deploying, and roll back if needed.
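As one concrete (and assumed, not prescribed) layout, prompts can live as plain text files in a prompts/ directory committed to the repo, loaded by a tiny helper like the one below; the A/B-testing example that follows uses the same load_prompt name.

```python
# Minimal prompt loader: prompts are plain .txt files committed to Git,
# e.g. prompts/customer_support_v2.txt. The directory layout is an assumption.
from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(filename: str) -> str:
    return (PROMPT_DIR / filename).read_text(encoding="utf-8")
```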
A/B Testing Prompts
Never deploy untested changes to all users. Run A/B tests to validate improvements.
```python
# A/B testing framework.
# Note: Python's built-in hash() is randomized per process for strings,
# so a stable hash keeps each user in the same bucket across runs.
import hashlib

def route_to_prompt_version(user_id):
    # 80% get the stable version, 20% get the new version
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return load_prompt("customer_support_v2.txt")  # Control
    else:
        return load_prompt("customer_support_v3.txt")  # Test

# After 1 week, compare metrics:
# v2 (control): 4.4/5 quality, 2.8s latency, 1.2% error rate
# v3 (test):    4.6/5 quality, 3.2s latency, 0.9% error rate
# Conclusion: v3 wins on quality and error rate at a small latency cost -> roll out to 100%
```
Rollback Plan
Always have a rollback strategy:
- Keep previous version: Don't delete old prompts, archive them
- Monitor new deployments: Watch metrics closely for 24-48 hours
- Quick rollback: Single command/click to revert to previous version
- Post-mortem: If rollback needed, document what went wrong
Concept 6: Handling Edge Cases
Common Edge Cases
Empty Input
Problem: User submits blank form
Solution: Validate input before calling API, return friendly error
Extremely Long Input
Problem: Exceeds token limit (128K for GPT-4o)
Solution: Truncate or chunk (see the token-count sketch below), or reject with a helpful message
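A minimal truncation sketch using the tiktoken tokenizer; the 120,000-token budget and the cl100k_base encoding are illustrative choices (newer tiktoken releases expose o200k_base, which matches GPT-4o more closely).

```python
# Count tokens and truncate oversized input before sending it to the API.
import tiktoken

ENCODING = tiktoken.get_encoding("cl100k_base")  # swap for your model's encoding

def truncate_to_budget(text: str, max_tokens: int = 120_000) -> str:
    tokens = ENCODING.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Keep the first max_tokens tokens; chunking is the alternative for long documents
    return ENCODING.decode(tokens[:max_tokens])
```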
Non-English Input
Problem: Prompt assumes English, user sends Spanish
Solution: Detect language, route to multilingual prompt
Malformed Output
Problem: Expected JSON, got text explanation
Solution: Retry with stronger formatting instructions (sketched below)
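A minimal retry sketch for the malformed-output case, reusing the assumed call_chatgpt helper from the evaluation pipeline; the retry count and the corrective instruction wording are assumptions.

```python
# Retry when the model returns prose instead of the JSON we asked for.
import json

def get_json_output(prompt, max_attempts=3):
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_chatgpt(current_prompt)  # assumed helper wrapping the chat API
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Re-ask with a stronger, more explicit formatting instruction
            current_prompt = (
                prompt
                + "\n\nReturn ONLY a valid JSON object. No explanations, no markdown fences."
            )
    raise ValueError("Model did not return valid JSON after retries")
```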
Graceful Degradation Pattern
When the primary call fails or times out, degrade in steps instead of surfacing a raw error: try a retry or a cheaper model, then a cached response, and finally a clear, friendly error message. The timeout handler below shows the cached-response fallback in practice.
Timeout Handling
```python
# Timeout handling with a graceful fallback.
# call_chatgpt_async() and get_cached_response() are assumed helpers.
import asyncio
import logging

async def call_with_timeout(prompt, timeout_seconds=10):
    try:
        # Race the API call against the timeout
        response = await asyncio.wait_for(
            call_chatgpt_async(prompt),
            timeout=timeout_seconds,
        )
        return response
    except asyncio.TimeoutError:
        # Fallback: return a cached response or a friendly error message
        logging.warning(f"Timeout after {timeout_seconds}s")
        return get_cached_response() or "Request timed out. Try a shorter query."
```
Concept 7: Production Readiness Checklist
Before deploying prompts to production, verify every item:
Pre-Deployment Checklist
- Prompt evaluated on 20+ diverse test inputs with a quality score above your target
- Input validation and content filtering in place (pre and post)
- No PII, credentials, or confidential data sent in prompts; placeholders and anonymization applied
- Logging captures latency, tokens, cost, errors, and prompt version for every request
- Alerts configured for cost spikes, error rates, latency, and quality drops
- Prompt stored in version control with documentation and a tested rollback path
- Edge cases handled: empty input, oversized input, non-English input, malformed output
- Timeouts, retries, and graceful-degradation fallbacks implemented
Final Challenge: Production-Ready Prompt System
Capstone: Build & Deploy Your First Production Prompt
Your Mission: Take a prompt from development to production with full professional practices.
Deliverables:
1. Prompt Design
   - Choose a real use case (customer support, content generation, data analysis)
   - Write the prompt with clear instructions, examples, and constraints
   - Store it in version control with documentation
2. Evaluation System
   - Create an evaluation rubric (5 criteria)
   - Test on 20+ diverse inputs
   - Calculate quality scores and identify weaknesses
   - Iterate until the score is above 4.0/5
3. Safety & Guardrails
   - Implement input validation
   - Add content filtering (pre and post)
   - Handle edge cases (empty, long, non-English)
4. Monitoring & Logging
   - Log every request: timestamp, latency, tokens, cost
   - Track quality scores over time
   - Set up 2-3 alerts (cost spike, error rate, quality drop)
5. Production Deployment
   - Deploy to a test group (10-20% of traffic)
   - Monitor for 48 hours
   - If successful: full rollout
   - If issues: roll back and iterate
Bonus Points:
- Build LLM-as-judge automated evaluation
- Create A/B test comparing two prompt versions
- Implement graceful degradation with fallbacks
- Write post-deployment report with metrics
Summary: Your Production Prompt Engineering Toolkit
You've learned the essentials of professional prompt engineering:
- Evaluation: 5-criteria rubric, LLM-as-judge, quantitative metrics
- Safety: Content filtering, PII protection, bias detection
- Monitoring: Logging, metrics tracking, alerts
- Versioning: Git storage, A/B testing, rollback plans
- Error handling: Retries, timeouts, graceful degradation
- Production readiness: Comprehensive deployment checklist
Congratulations! You've Completed the ChatGPT Prompt Engineering Course
From basic prompts to production systems, you now have the skills to:
- Craft expert-level prompts across any domain
- Use advanced techniques (CoT, few-shot, JSON mode, function calling)
- Build multimodal workflows with vision and voice
- Deploy production-grade AI systems with safety and monitoring
- Optimize costs and reliability at scale
You're ready to build professional AI applications. Now go create something amazing!
Production-Grade AI Tools
Ready to deploy AI at scale? These professional platforms handle monitoring, evaluation, and production infrastructure:
LangSmith - LLM Observability Platform
LangChain | Free tier + paid plans
Professional LLM ops: Debug, test, evaluate, and monitor your LLM applications. Track every prompt, response, latency, and cost. Built by the creators of LangChain for production-grade AI systems.
Perfect for: Engineering teams deploying AI at scale. See exactly which prompts work, which fail, where latency spikes, and how much each request costs. Essential for production debugging and optimization.
- Tracing: Track every LLM call with full context
- Evaluation: Automated testing and scoring
- Monitoring: Real-time dashboards for errors, latency, costs
- Datasets: Build test suites for regression testing
OpenAI API Platform
OpenAI | Pay-as-you-go
Build production apps: Official OpenAI API with GPT-4, GPT-4o, o1, function calling, vision, and voice. Programmatic access to integrate ChatGPT into your applications, websites, and workflows.
Use Case: Build customer support chatbots, content generation pipelines, data analysis tools, or AI-powered features in your product. Pay only for what you use (roughly $0.002-0.10 per 1K tokens, depending on the model).
- All Models: GPT-4o, o1, embeddings, vision, TTS
- Function Calling: Connect LLMs to your tools and APIs
- Fine-Tuning: Customize models on your data
- Usage Dashboard: Track costs and usage in real-time
PromptLayer - Prompt Management
promptlayer.com | Free tier available
Version control for prompts: Log, search, and manage all your LLM requests. Track prompt versions, A/B test variations, and see performance metrics across your entire team's AI usage.
- Request Logging: Every prompt and response stored
- Version Control: Git-like management for prompts
- Analytics: Cost, latency, and quality metrics
- Team Collaboration: Share prompts across organization
Course Complete!
You've mastered ChatGPT Prompt Engineering from fundamentals to production deployment.
You're now equipped to build powerful AI-driven solutions. Time to make it official!
Knowledge Check
Test your understanding of production best practices for ChatGPT!
Question 1: What is the most important factor in production ChatGPT deployments?
Question 2: What should you implement to handle rate limits?
Question 3: Why is prompt versioning important in production?
Question 4: What's the best approach for handling sensitive data?
Question 5: What metrics should you monitor in production?
Ready to make it official?
Get Your Completion Certificate
Demonstrate your prompt engineering expertise to employers and clients!
Your certificate includes:
- Official completion verification
- Unique certificate ID
- Shareable on LinkedIn, Twitter, and resume
- Public verification page
- Professional PDF download