🎯 Why Bayesian Statistics?
"There's a 70% chance of rain tomorrow." "This email is 95% likely to be spam." "The model is 80% confident this is a cat." These are Bayesian statements—they express uncertainty as probability. Unlike frequentist statistics, which treats parameters as fixed unknowns, Bayesian statistics treats everything as a probability distribution.
Bayesian thinking is how humans naturally reason: we start with beliefs (priors), observe data (likelihood), and update our beliefs (posterior). This framework is incredibly powerful for AI: it handles small data elegantly, quantifies uncertainty naturally, incorporates domain knowledge systematically, and updates beliefs incrementally as new evidence arrives.
Bayesian methods power: spam filters (Naive Bayes), recommendation systems (Bayesian collaborative filtering), A/B testing with continuous monitoring, medical diagnosis systems, search engines (probabilistic information retrieval), reinforcement learning (Bayesian optimization), and AutoML hyperparameter tuning. Modern deep learning uses Bayesian neural networks for uncertainty quantification in safety-critical systems like autonomous vehicles.
🔄 Bayesian vs Frequentist Paradigms
These are two fundamentally different philosophies for statistical inference. Understanding both makes you a better data scientist.
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Parameters | Fixed but unknown | Random variables with distributions |
| Probability | Long-run frequency | Degree of belief |
| Inference | Point estimates, p-values, confidence intervals | Posterior distributions, credible intervals |
| Interpretation | "If we repeated this experiment..." | "Given this data, I believe..." |
| Prior knowledge | Cannot incorporate formally | Incorporated via prior distribution |
| Small samples | Problematic (need large n) | Handles naturally via priors |
Example: Estimating Coin Bias
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Observe 7 heads in 10 flips
heads = 7
n = 10
print("="*70)
print("Estimating Coin Bias: Frequentist vs Bayesian")
print("="*70)
# FREQUENTIST APPROACH
print("\n🔵 FREQUENTIST APPROACH")
print("-" * 70)
# Point estimate (MLE)
p_mle = heads / n
print(f"Point estimate (MLE): p̂ = {p_mle:.2f}")
# 95% confidence interval (Wald method)
se = np.sqrt(p_mle * (1 - p_mle) / n)
ci_freq = stats.norm.interval(0.95, loc=p_mle, scale=se)
print(f"95% CI: [{ci_freq[0]:.3f}, {ci_freq[1]:.3f}]")
print(f"\nInterpretation: If we repeated this experiment many times,")
print(f"95% of intervals would contain the true (fixed) value of p")
print(f"⚠️ We CANNOT say 'p has 95% probability of being in this interval'")
# BAYESIAN APPROACH
print("\n🟢 BAYESIAN APPROACH")
print("-" * 70)
# Prior: Beta(2, 2) - slightly favor fair coins
alpha_prior = 2
beta_prior = 2
print(f"Prior: Beta({alpha_prior}, {beta_prior}) - weak belief coin is fair")
# Likelihood: Binomial(n=10, k=7)
print(f"Data: {heads} heads in {n} flips")
# Posterior: Beta(alpha_prior + heads, beta_prior + tails)
alpha_post = alpha_prior + heads
beta_post = beta_prior + (n - heads)
print(f"Posterior: Beta({alpha_post}, {beta_post})")
# Posterior mean
p_bayes = alpha_post / (alpha_post + beta_post)
print(f"\nPosterior mean: {p_bayes:.3f}")
# 95% credible interval
ci_bayes = stats.beta.interval(0.95, alpha_post, beta_post)
print(f"95% Credible interval: [{ci_bayes[0]:.3f}, {ci_bayes[1]:.3f}]")
print(f"\nInterpretation: Given the data, there's a 95% probability")
print(f"that p is between {ci_bayes[0]:.3f} and {ci_bayes[1]:.3f}")
print(f"✅ This is the intuitive interpretation we want!")
# Visualize
p_values = np.linspace(0, 1, 1000)
# Prior
prior = stats.beta.pdf(p_values, alpha_prior, beta_prior)
# Likelihood (scaled for visualization)
likelihood = stats.binom.pmf(heads, n, p_values)
likelihood = likelihood / likelihood.max() * prior.max()
# Posterior
posterior = stats.beta.pdf(p_values, alpha_post, beta_post)
plt.figure(figsize=(10, 6))
plt.plot(p_values, prior, 'b--', linewidth=2, label='Prior: Beta(2,2)')
plt.plot(p_values, likelihood, 'g:', linewidth=2, label='Likelihood (scaled)')
plt.plot(p_values, posterior, 'r-', linewidth=2, label='Posterior: Beta(9,5)')
plt.axvline(p_bayes, color='red', linestyle='--', alpha=0.5, label=f'Posterior mean: {p_bayes:.2f}')
plt.axvspan(ci_bayes[0], ci_bayes[1], alpha=0.2, color='red', label='95% Credible Interval')
plt.xlabel('Coin Bias (p)')
plt.ylabel('Density')
plt.title('Bayesian Updating: Prior × Likelihood → Posterior')
plt.legend()
plt.grid(True, alpha=0.3)
# plt.show()
print(f"\n💡 Key Insight:")
print(f"Bayesian posterior is a compromise between prior and data.")
print(f"With more data, posterior converges regardless of prior.")
Use Frequentist: Large samples, no prior knowledge, want objective analysis, regulatory requirements (FDA, etc.)
Use Bayesian: Small samples, have domain knowledge, need uncertainty quantification, sequential decision-making, complex hierarchical models
🧮 Bayes' Theorem: The Foundation
Bayes' theorem is the engine of Bayesian inference. It tells us how to update beliefs in light of new evidence.
P(θ | D) = P(D | θ) × P(θ) / P(D)
Posterior = Likelihood × Prior / Evidence
Components Explained
- Prior P(θ): What we believe about parameter θ before seeing data
- Likelihood P(D | θ): Probability of observing data D given θ
- Evidence P(D): Total probability of data (normalizing constant)
- Posterior P(θ | D): Updated belief about θ after seeing data D
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def bayesian_updating_demo():
"""Demonstrate how Bayes' theorem updates beliefs"""
# Medical test example
print("🏥 Medical Diagnosis with Bayes' Theorem")
print("="*70)
# Scenario: Testing for rare disease
prevalence = 0.01 # 1% of population has disease
sensitivity = 0.95 # P(+test | disease) - true positive rate
specificity = 0.90 # P(-test | no disease) - true negative rate
print(f"Disease prevalence: {prevalence*100}%")
print(f"Test sensitivity: {sensitivity*100}%")
print(f"Test specificity: {specificity*100}%")
# Prior: P(disease)
p_disease = prevalence
p_no_disease = 1 - prevalence
print(f"\nPRIOR (before test):")
print(f" P(Disease) = {p_disease:.4f}")
print(f" P(No Disease) = {p_no_disease:.4f}")
# Likelihood: P(positive test | condition)
p_pos_given_disease = sensitivity
p_pos_given_no_disease = 1 - specificity
# Evidence: P(positive test) - total probability
p_positive = (p_pos_given_disease * p_disease +
p_pos_given_no_disease * p_no_disease)
print(f"\nLIKELIHOOD:")
print(f" P(+test | Disease) = {p_pos_given_disease:.4f}")
print(f" P(+test | No Disease) = {p_pos_given_no_disease:.4f}")
print(f"\nEVIDENCE:")
print(f" P(+test) = {p_positive:.4f}")
# Posterior: P(disease | positive test) - Bayes' theorem!
p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
print(f"\nPOSTERIOR (after positive test):")
print(f" P(Disease | +test) = {p_disease_given_pos:.4f}")
print(f" = {p_disease_given_pos*100:.2f}%")
print(f"\n💡 Key Insight:")
print(f"Even with positive test (95% sensitivity),")
print(f"only {p_disease_given_pos*100:.1f}% chance of disease")
print(f"because disease is rare (1% prevalence)!")
# What if we test positive twice?
print(f"\n" + "="*70)
print(f"Sequential Updating: What if second test is also positive?")
print(f"="*70)
# First test posterior becomes second test prior
p_disease_after_1st = p_disease_given_pos
p_no_disease_after_1st = 1 - p_disease_after_1st
# Second test (assuming independent)
p_positive_2nd = (p_pos_given_disease * p_disease_after_1st +
p_pos_given_no_disease * p_no_disease_after_1st)
p_disease_after_2nd = (p_pos_given_disease * p_disease_after_1st) / p_positive_2nd
print(f"After 1st positive: {p_disease_after_1st*100:.2f}%")
print(f"After 2nd positive: {p_disease_after_2nd*100:.2f}%")
print(f"\nConfidence increased dramatically with more evidence!")
return p_disease, p_disease_given_pos, p_disease_after_2nd
# Run demonstration
probs = bayesian_updating_demo()
People often ignore base rates (priors) and focus only on test accuracy. With 95% sensitivity but only 1% prevalence, most positive results are false positives! This is why Bayesian thinking matters—it forces you to consider base rates.
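To see this in concrete counts: in a population of 10,000, about 100 people have the disease and 95 of them test positive, while 10% of the 9,900 healthy people (990) also test positive. A positive result therefore places you in a group of 95 + 990 = 1,085 people, of whom only 95, about 8.8%, are actually sick—matching the posterior computed above.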
🔗 Conjugate Priors
Some prior-likelihood pairs result in posteriors from the same family as the prior. These "conjugate" relationships make Bayesian inference computationally tractable.
| Likelihood | Conjugate Prior | Example |
|---|---|---|
| Binomial | Beta | Coin flips, conversion rates |
| Poisson | Gamma | Event counts, traffic rates |
| Normal (known σ) | Normal | Measurement errors |
| Normal (unknown σ) | Normal-Inverse-Gamma | General continuous data |
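The update rule follows the same add-the-data pattern for each pair. For instance, here is a minimal Gamma-Poisson sketch (the hourly event counts are hypothetical) for estimating a traffic rate:
import numpy as np
from scipy import stats
# Gamma(α, β) prior on a Poisson rate λ (β is the rate parameter)
alpha_prior, beta_prior = 2.0, 1.0  # weak prior: expected rate = α/β = 2 events/hour
counts = np.array([3, 5, 4, 6, 2])  # hypothetical observed events per hour
# Conjugate update: α ← α + Σ counts, β ← β + number of observations
alpha_post = alpha_prior + counts.sum()
beta_post = beta_prior + len(counts)
print(f"Posterior: Gamma({alpha_post:.0f}, {beta_post:.0f})")
print(f"Posterior mean rate: {alpha_post / beta_post:.2f} events/hour")
ci = stats.gamma.interval(0.95, alpha_post, scale=1/beta_post)
print(f"95% credible interval: [{ci[0]:.2f}, {ci[1]:.2f}]")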
Beta-Binomial: The Workhorse
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def beta_binomial_conjugacy():
"""Demonstrate Beta-Binomial conjugacy for A/B testing"""
print("🧪 Bayesian A/B Test: Email Subject Lines")
print("="*70)
# Scenario: Testing two email subject lines
# Variant A: 45 opens out of 200 sends
# Variant B: 58 opens out of 200 sends
# Prior: Beta(1, 1) = Uniform [0,1] - no prior knowledge
alpha_prior = 1
beta_prior = 1
# Data
opens_a = 45
sends_a = 200
opens_b = 58
sends_b = 200
print(f"Variant A: {opens_a}/{sends_a} opens ({opens_a/sends_a*100:.1f}%)")
print(f"Variant B: {opens_b}/{sends_b} opens ({opens_b/sends_b*100:.1f}%)")
# Posterior for A: Beta(alpha_prior + opens_a, beta_prior + (sends_a - opens_a))
alpha_post_a = alpha_prior + opens_a
beta_post_a = beta_prior + (sends_a - opens_a)
# Posterior for B
alpha_post_b = alpha_prior + opens_b
beta_post_b = beta_prior + (sends_b - opens_b)
print(f"\nPosterior A: Beta({alpha_post_a}, {beta_post_a})")
print(f"Posterior B: Beta({alpha_post_b}, {beta_post_b})")
# Posterior means
mean_a = alpha_post_a / (alpha_post_a + beta_post_a)
mean_b = alpha_post_b / (alpha_post_b + beta_post_b)
print(f"\nPosterior means:")
print(f" E[p_A | data] = {mean_a:.4f}")
print(f" E[p_B | data] = {mean_b:.4f}")
# Credible intervals
ci_a = stats.beta.interval(0.95, alpha_post_a, beta_post_a)
ci_b = stats.beta.interval(0.95, alpha_post_b, beta_post_b)
print(f"\n95% Credible Intervals:")
print(f" A: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
print(f" B: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")
# Probability B > A (via Monte Carlo)
np.random.seed(42)
samples_a = np.random.beta(alpha_post_a, beta_post_a, 10000)
samples_b = np.random.beta(alpha_post_b, beta_post_b, 10000)
prob_b_better = (samples_b > samples_a).mean()
print(f"\n🎯 Decision Metric:")
print(f"P(p_B > p_A | data) = {prob_b_better:.4f} = {prob_b_better*100:.2f}%")
if prob_b_better > 0.95:
print(f"✅ Strong evidence that B is better! Launch B.")
elif prob_b_better > 0.90:
print(f"⚠️ Moderate evidence B is better. Consider more data.")
else:
print(f"❌ Not enough evidence. Continue testing.")
# Visualize posteriors
p_values = np.linspace(0, 0.5, 1000)
pdf_a = stats.beta.pdf(p_values, alpha_post_a, beta_post_a)
pdf_b = stats.beta.pdf(p_values, alpha_post_b, beta_post_b)
plt.figure(figsize=(10, 6))
plt.plot(p_values, pdf_a, 'b-', linewidth=2, label=f'Variant A')
plt.plot(p_values, pdf_b, 'r-', linewidth=2, label=f'Variant B')
plt.axvline(mean_a, color='blue', linestyle='--', alpha=0.5)
plt.axvline(mean_b, color='red', linestyle='--', alpha=0.5)
plt.xlabel('Open Rate')
plt.ylabel('Posterior Density')
plt.title(f'Bayesian A/B Test: P(B better) = {prob_b_better:.2%}')
plt.legend()
plt.grid(True, alpha=0.3)
# plt.show()
return prob_b_better
# Run demo
prob = beta_binomial_conjugacy()
🤖 Bayesian Machine Learning
Many ML algorithms have Bayesian interpretations or implementations. Let's explore the most important ones.
1. Naive Bayes Classifier (Revisited)
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import numpy as np
# Email spam classification
emails = [
("Win free money now!", "spam"),
("Meeting at 3pm tomorrow", "ham"),
("Congratulations! You won!", "spam"),
("Can you review this document?", "ham"),
("Claim your prize today", "spam"),
("Lunch on Thursday?", "ham"),
("Limited time offer!!!", "spam"),
("Project deadline reminder", "ham"),
] * 50 # Repeat to have more data
texts, labels = zip(*emails)
# Vectorize
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
y = np.array([1 if label == "spam" else 0 for label in labels])
# Train Naive Bayes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb = MultinomialNB(alpha=1.0) # alpha is Laplace smoothing (prior)
nb.fit(X_train, y_train)
print("📧 Naive Bayes Email Classifier")
print("="*70)
print(f"Training accuracy: {nb.score(X_train, y_train):.3f}")
print(f"Test accuracy: {nb.score(X_test, y_test):.3f}")
# Test new emails
new_emails = [
"Free prize waiting for you",
"Can we schedule a meeting",
]
X_new = vectorizer.transform(new_emails)
predictions = nb.predict(X_new)
probabilities = nb.predict_proba(X_new)
print(f"\nPredictions with probabilities:")
for email, pred, prob in zip(new_emails, predictions, probabilities):
label = "spam" if pred == 1 else "ham"
confidence = prob[pred]
print(f"\n '{email}'")
print(f" → {label.upper()} (confidence: {confidence:.3f})")
print(f" P(spam|text) = {prob[1]:.3f}")
print(f" P(ham|text) = {prob[0]:.3f}")
print(f"\n💡 Why 'Naive'?")
print(f"Assumes word probabilities are independent given class.")
print(f"P(spam|'free','prize') = P(spam) × P('free'|spam) × P('prize'|spam)")
print(f"This is naive but works surprisingly well in practice!")
2. Bayesian Linear Regression
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
def bayesian_linear_regression():
"""Bayesian approach to linear regression with uncertainty"""
np.random.seed(42)
# Generate data: y = 2x + 1 + noise
n_points = 10
X_train = np.linspace(0, 10, n_points)
y_train = 2 * X_train + 1 + np.random.normal(0, 2, n_points)
print("📈 Bayesian Linear Regression")
print("="*70)
print(f"Training on {n_points} data points")
# For simplicity, assume known noise variance
sigma_noise = 2.0
# Prior: Normal distribution for slope and intercept
# P(slope) ~ N(0, 10²) - vague prior (std 10, variance 100)
# P(intercept) ~ N(0, 10²)
# Frequentist fit (for comparison)
slope_freq = np.cov(X_train, y_train)[0, 1] / np.var(X_train, ddof=1)  # ddof=1 to match np.cov
intercept_freq = y_train.mean() - slope_freq * X_train.mean()
print(f"\nFrequentist estimate:")
print(f" y = {slope_freq:.2f}x + {intercept_freq:.2f}")
# Bayesian approach: posterior distribution over parameters
# (Simplified - in practice use MCMC or variational inference)
# Posterior mean (MAP estimate with vague prior ≈ MLE)
X_design = np.column_stack([np.ones(len(X_train)), X_train])
# Posterior covariance
prior_cov = np.eye(2) * 100 # Vague prior
post_cov_inv = np.linalg.inv(prior_cov) + (X_design.T @ X_design) / sigma_noise**2
post_cov = np.linalg.inv(post_cov_inv)
# Posterior mean
post_mean = post_cov @ (X_design.T @ y_train) / sigma_noise**2
slope_bayes = post_mean[1]
intercept_bayes = post_mean[0]
print(f"\nBayesian estimate (posterior mean):")
print(f" y = {slope_bayes:.2f}x + {intercept_bayes:.2f}")
# Uncertainty in predictions
X_test = np.linspace(-2, 12, 100)
X_test_design = np.column_stack([np.ones(len(X_test)), X_test])
# Posterior predictive mean
y_pred = X_test_design @ post_mean
# Posterior predictive variance (epistemic + aleatoric)
pred_var = np.array([x @ post_cov @ x for x in X_test_design]) + sigma_noise**2
pred_std = np.sqrt(pred_var)
# 95% credible interval
ci_lower = y_pred - 1.96 * pred_std
ci_upper = y_pred + 1.96 * pred_std
# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', s=50, label='Training data', zorder=3)
plt.plot(X_test, y_pred, 'r-', linewidth=2, label='Posterior mean', zorder=2)
plt.fill_between(X_test, ci_lower, ci_upper, alpha=0.3, color='red',
label='95% Credible interval', zorder=1)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Bayesian Linear Regression: Uncertainty Quantification')
plt.legend()
plt.grid(True, alpha=0.3)
# plt.show()
print(f"\n💡 Key Insight:")
print(f"Uncertainty is high where we have no data (extrapolation).")
print(f"Uncertainty decreases where we have observations.")
print(f"This is automatic in Bayesian approach!")
return post_mean, post_cov
# Run demo
params = bayesian_linear_regression()
3. Bayesian Optimization for Hyperparameter Tuning
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
def bayesian_optimization_demo():
"""Simple Bayesian Optimization example"""
print("🎯 Bayesian Optimization: Finding Optimal Hyperparameter")
print("="*70)
# Objective function (unknown to optimizer)
# Simulating model performance as function of learning rate
def objective(x):
return -((x - 0.7)**2) * np.sin(14 * x) # Complex function
# Initial random samples
np.random.seed(42)
X_init = np.random.uniform(0, 1, 3).reshape(-1, 1)
y_init = objective(X_init).ravel()
print(f"Starting with {len(X_init)} random evaluations")
# Bayesian optimization loop
X_samples = X_init.copy()
y_samples = y_init.copy()
n_iterations = 7
for iteration in range(n_iterations):
# Fit Gaussian Process
kernel = ConstantKernel(1.0) * RBF(length_scale=0.1)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, random_state=42)
gp.fit(X_samples, y_samples)
# Acquisition function: Upper Confidence Bound (UCB)
X_test = np.linspace(0, 1, 1000).reshape(-1, 1)
mu, sigma = gp.predict(X_test, return_std=True)
# UCB = mean + kappa * std (exploration-exploitation trade-off)
kappa = 2.0
ucb = mu + kappa * sigma
# Find next point to sample
next_x = X_test[np.argmax(ucb)]
next_y = objective(next_x).item()  # extract scalar for appending and formatted printing
# Add to samples
X_samples = np.vstack([X_samples, next_x])
y_samples = np.append(y_samples, next_y)
print(f"Iteration {iteration+1}: x={next_x[0]:.3f}, f(x)={next_y:.3f}")
# Best found
best_idx = np.argmax(y_samples)
best_x = X_samples[best_idx]
best_y = y_samples[best_idx]
# True optimum (for comparison)
X_true = np.linspace(0, 1, 10000).reshape(-1, 1)
y_true = objective(X_true).ravel()
true_best_idx = np.argmax(y_true)
true_best_x = X_true[true_best_idx]
true_best_y = y_true[true_best_idx]
print(f"\n🏆 Results:")
print(f"Best found: x={best_x[0]:.3f}, f(x)={best_y:.3f}")
print(f"True optimum: x={true_best_x[0]:.3f}, f(x)={true_best_y:.3f}")
print(f"Found {best_y/true_best_y*100:.1f}% of optimal in {len(X_samples)} evaluations")
print(f"\n💡 Why Bayesian Optimization?")
print(f"- Uses Gaussian Process to model objective function")
print(f"- Balances exploration (uncertain regions) and exploitation (promising regions)")
print(f"- Much more sample-efficient than grid/random search")
print(f"- Perfect for expensive evaluations (deep learning hyperparameters)")
return best_x, best_y
# Run demo
best = bayesian_optimization_demo()
Modern research extends Bayesian methods to neural networks. Bayesian neural networks place distributions over network weights for uncertainty quantification; Monte Carlo dropout approximates this cheaply by keeping dropout active at inference time; variational inference makes approximate posteriors tractable in high dimensions. These techniques appear in safety-critical applications: autonomous vehicles (knowing when the model is uncertain), medical diagnosis (quantifying confidence), and reinforcement learning (balancing exploration and exploitation).
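As a minimal sketch of the Monte Carlo dropout idea (using PyTorch, which is not otherwise part of this tutorial's stack; the tiny untrained network is purely illustrative): keep dropout active at prediction time, make many stochastic forward passes, and read the spread of the outputs as an uncertainty estimate.
import torch
import torch.nn as nn
# Toy regression net with dropout (untrained - we only illustrate the mechanics)
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1))
x = torch.tensor([[0.5]])
model.train()  # keep dropout ACTIVE at inference time - the MC dropout trick
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(100)])  # 100 stochastic forward passes
print(f"Predictive mean: {preds.mean():.4f}, std (uncertainty): {preds.std():.4f}")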
🎲 MCMC: Sampling from Posteriors
For complex models, we can't compute posterior distributions analytically. Markov Chain Monte Carlo (MCMC) methods let us sample from posteriors even when we can't write them down.
The Idea
Instead of computing P(θ | D) exactly, we generate samples {θ₁, θ₂, ..., θₙ} that follow the posterior distribution. With enough samples, we can approximate any posterior quantity (mean, variance, credible intervals, etc.).
Metropolis-Hastings Algorithm (Simple MCMC)
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
def metropolis_hastings_demo():
"""Demonstrate Metropolis-Hastings algorithm"""
print("🎲 Metropolis-Hastings MCMC Sampling")
print("="*70)
# Target distribution: mixture of two Gaussians (hard to sample directly)
def target_distribution(x):
return (0.3 * stats.norm.pdf(x, -2, 0.5) +
0.7 * stats.norm.pdf(x, 3, 1.0))
# Metropolis-Hastings algorithm
def metropolis_hastings(target, n_samples=10000, proposal_std=1.0):
samples = []
current = 0.0 # Starting point
accepted = 0
for i in range(n_samples):
# Propose new state
proposal = current + np.random.normal(0, proposal_std)
# Calculate acceptance ratio
acceptance_ratio = target(proposal) / target(current)
# Accept or reject
if np.random.uniform() < acceptance_ratio:
current = proposal
accepted += 1
samples.append(current)
acceptance_rate = accepted / n_samples
return np.array(samples), acceptance_rate
# Run MCMC
samples, acc_rate = metropolis_hastings(target_distribution, n_samples=10000)
print(f"Generated {len(samples)} samples")
print(f"Acceptance rate: {acc_rate:.3f}")
# Compute statistics from samples
print(f"\nPosterior statistics (from samples):")
print(f" Mean: {samples.mean():.3f}")
print(f" Median: {np.median(samples):.3f}")
print(f" Std: {samples.std():.3f}")
# 95% credible interval
ci = np.percentile(samples, [2.5, 97.5])
print(f" 95% Credible Interval: [{ci[0]:.3f}, {ci[1]:.3f}]")
# Visualize
x = np.linspace(-5, 6, 1000)
true_pdf = target_distribution(x)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# True distribution vs samples
axes[0, 0].plot(x, true_pdf, 'r-', linewidth=2, label='True distribution')
axes[0, 0].hist(samples[1000:], bins=50, density=True, alpha=0.6, label='MCMC samples')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Samples vs True Distribution')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Trace plot
axes[0, 1].plot(samples[:1000], alpha=0.7)
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel('Value')
axes[0, 1].set_title('Trace Plot (first 1000 samples)')
axes[0, 1].grid(True, alpha=0.3)
# Autocorrelation
autocorr = np.correlate(samples - samples.mean(), samples - samples.mean(), mode='full')
autocorr = autocorr[len(autocorr)//2:]
autocorr = autocorr / autocorr[0]
axes[1, 0].plot(autocorr[:100])
axes[1, 0].set_xlabel('Lag')
axes[1, 0].set_ylabel('Autocorrelation')
axes[1, 0].set_title('Autocorrelation Plot')
axes[1, 0].grid(True, alpha=0.3)
# Running mean
running_mean = np.cumsum(samples) / np.arange(1, len(samples) + 1)
axes[1, 1].plot(running_mean)
axes[1, 1].axhline(samples.mean(), color='r', linestyle='--', label='Final mean')
axes[1, 1].set_xlabel('Iteration')
axes[1, 1].set_ylabel('Running Mean')
axes[1, 1].set_title('Convergence: Running Mean')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
# plt.show()
print(f"\n💡 Key MCMC Concepts:")
print(f"- Trace plot: Should look like 'fuzzy caterpillar' (good mixing)")
print(f"- Autocorrelation: Should decay quickly (independent samples)")
print(f"- Running mean: Should converge (chain has converged)")
print(f"- Burn-in: Discard first samples (before convergence)")
return samples
# Run demo
samples = metropolis_hastings_demo()
In practice, use specialized libraries: PyMC (formerly PyMC3) for probabilistic programming in Python, Stan for fast Hamiltonian Monte Carlo, TensorFlow Probability for Bayesian deep learning, and Pyro for deep probabilistic programming. These handle complex models automatically and are far more efficient than basic Metropolis-Hastings.
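For example, the Beta-Binomial coin model from earlier takes only a few lines in PyMC (a sketch assuming PyMC v4+ is installed; pm.sample runs the NUTS variant of Hamiltonian Monte Carlo automatically):
import pymc as pm
with pm.Model():
    p = pm.Beta("p", alpha=2, beta=2)           # prior: Beta(2, 2)
    pm.Binomial("obs", n=10, p=p, observed=7)   # likelihood: 7 heads in 10 flips
    idata = pm.sample(2000, tune=1000, random_seed=42)
print(float(idata.posterior["p"].mean()))       # ≈ 0.64, matching the conjugate Beta(9, 5) answer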
💻 Practice Exercises
Exercise 1: Medical Diagnosis
Scenario: Disease affects 0.5% of population. Test has 99% sensitivity and 95% specificity.
- If you test positive, what's probability you have disease?
- If you test negative, what's probability you're healthy?
- Why is the answer to question 1 so surprising?
Exercise 2: Beta-Binomial Updating
Scenario: Prior belief: Beta(2, 2). Observe 8 successes in 10 trials.
- What is the posterior distribution?
- What is the posterior mean?
- Calculate 95% credible interval
- How does this differ from a uniform Beta(1,1) prior?
Exercise 3: Bayesian A/B Test
Scenario: Control: 120/1000 conversions. Treatment: 140/1000 conversions.
- Use Beta-Binomial model to estimate P(treatment > control)
- Would you launch treatment if threshold is 95%?
- How many more samples needed to reach 95% confidence?
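A starter sketch for this exercise, reusing the Beta-Binomial machinery from above (the launch decision and the sample-size question are left to you):
import numpy as np
# Posteriors under a uniform Beta(1, 1) prior
a_c, b_c = 1 + 120, 1 + (1000 - 120)   # control: Beta(121, 881)
a_t, b_t = 1 + 140, 1 + (1000 - 140)   # treatment: Beta(141, 861)
# Monte Carlo estimate of P(p_treatment > p_control | data)
rng = np.random.default_rng(0)
samples_c = rng.beta(a_c, b_c, 100_000)
samples_t = rng.beta(a_t, b_t, 100_000)
print(f"P(treatment > control) ≈ {(samples_t > samples_c).mean():.3f}")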
Exercise 4: Prior Sensitivity
Challenge: Compare three priors for coin flip problem (5 heads in 10 flips):
- Uniform: Beta(1, 1)
- Skeptical: Beta(10, 10) - believes coin is fair
- Optimistic: Beta(8, 2) - believes heads likely
- Calculate posterior for each prior
- How do posteriors differ?
- Which prior is most affected by data?
Exercise 5: Credible vs Confidence Intervals
Scenario: Sample mean = 75, sample std = 10, n = 25. True population mean unknown.
- Calculate 95% confidence interval (frequentist)
- Calculate 95% credible interval (Bayesian with vague prior)
- Write correct interpretation for each
- Which interpretation is more intuitive?
📝 Summary
Congratulations! You've completed the Statistics for AI course. You've mastered the Bayesian paradigm—the foundation of probabilistic thinking in AI:
🎯 Bayesian Paradigm
Probability as belief. Parameters as distributions. Update beliefs with data via Bayes' theorem. More intuitive than frequentist approach.
🧮 Bayes' Theorem
Posterior = Likelihood × Prior / Evidence. Foundation of all Bayesian inference. Combines data with prior knowledge systematically.
🔗 Conjugate Priors
Beta-Binomial, Gamma-Poisson, Normal-Normal. Analytical posteriors. Foundation for efficient Bayesian inference in simple models.
🤖 Bayesian ML
Naive Bayes, Bayesian regression, Bayesian optimization, Bayesian neural networks. Uncertainty quantification built-in.
🎲 MCMC Sampling
Sample from complex posteriors. Metropolis-Hastings, Hamiltonian MC. Modern tools: PyMC, Stan, TensorFlow Probability.
📊 Practical Applications
A/B testing, spam filtering, medical diagnosis, hyperparameter tuning, recommender systems. Bayesian thinking everywhere in modern AI.
You've mastered statistics for AI across 6 comprehensive tutorials: descriptive statistics, probability foundations, distributions, sampling & CLT, hypothesis testing, and Bayesian inference. You now have the statistical toolkit to: understand research papers, design rigorous experiments, build probabilistic models, quantify uncertainty, and make data-driven decisions with confidence. This is the foundation every AI practitioner needs!
🎯 Test Your Knowledge
Question 1: In Bayesian inference, the posterior distribution represents:
Question 2: A 95% credible interval means:
Question 3: For a Binomial likelihood, the conjugate prior is:
Question 4: The main advantage of Bayesian methods over frequentist is:
Question 5: MCMC methods are used to:
Question 6: In Bayesian A/B testing, to decide if treatment is better we calculate:
Question 7: The "Naive" in Naive Bayes refers to:
Question 8: Bayesian optimization is particularly useful for: