
Bayesian Statistics & Inference

Think like a Bayesian: update beliefs with evidence and build probabilistic models for uncertain AI systems

📅 Tutorial 6 🎓 Advanced ⏱️ 70 min

🎓 Final Tutorial! Complete this to earn your Free Statistics for AI Certificate
Shareable on LinkedIn • Verified by AITutorials.site • No signup required

🎯 Why Bayesian Statistics?

"There's a 70% chance of rain tomorrow." "This email is 95% likely to be spam." "The model is 80% confident this is a cat." These are Bayesian statements—they express uncertainty as probability. Unlike frequentist statistics, which treats parameters as fixed unknowns, Bayesian statistics treats everything as a probability distribution.

Bayesian thinking is how humans naturally reason: we start with beliefs (priors), observe data (likelihood), and update our beliefs (posterior). This framework is incredibly powerful for AI: it handles small data elegantly, quantifies uncertainty naturally, incorporates domain knowledge systematically, and updates beliefs incrementally as new evidence arrives.

💡 Real-World Impact

Bayesian methods power: spam filters (Naive Bayes), recommendation systems (Bayesian collaborative filtering), A/B testing with continuous monitoring, medical diagnosis systems, search engines (probabilistic information retrieval), reinforcement learning (Bayesian optimization), and AutoML hyperparameter tuning. Modern deep learning uses Bayesian neural networks for uncertainty quantification in safety-critical systems like autonomous vehicles.

🔄 Bayesian vs Frequentist Paradigms

These are two fundamentally different philosophies for statistical inference. Understanding both makes you a better data scientist.

Aspect          | Frequentist                                      | Bayesian
Parameters      | Fixed but unknown                                | Random variables with distributions
Probability     | Long-run frequency                               | Degree of belief
Inference       | Point estimates, p-values, confidence intervals  | Posterior distributions, credible intervals
Interpretation  | "If we repeated this experiment..."              | "Given this data, I believe..."
Prior knowledge | Cannot incorporate formally                      | Incorporated via prior distribution
Small samples   | Problematic (need large n)                       | Handles naturally via priors

Example: Estimating Coin Bias

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Observe 7 heads in 10 flips
heads = 7
n = 10

print("="*70)
print("Estimating Coin Bias: Frequentist vs Bayesian")
print("="*70)

# FREQUENTIST APPROACH
print("\n🔵 FREQUENTIST APPROACH")
print("-" * 70)

# Point estimate (MLE)
p_mle = heads / n
print(f"Point estimate (MLE): p̂ = {p_mle:.2f}")

# 95% confidence interval (Wald method)
se = np.sqrt(p_mle * (1 - p_mle) / n)
ci_freq = stats.norm.interval(0.95, loc=p_mle, scale=se)
print(f"95% CI: [{ci_freq[0]:.3f}, {ci_freq[1]:.3f}]")
print(f"\nInterpretation: If we repeated this experiment many times,")
print(f"95% of intervals would contain the true (fixed) value of p")
print(f"⚠️  We CANNOT say 'p has 95% probability of being in this interval'")

# BAYESIAN APPROACH
print("\n🟢 BAYESIAN APPROACH")
print("-" * 70)

# Prior: Beta(2, 2) - slightly favor fair coins
alpha_prior = 2
beta_prior = 2
print(f"Prior: Beta({alpha_prior}, {beta_prior}) - weak belief coin is fair")

# Likelihood: Binomial(n=10, k=7)
print(f"Data: {heads} heads in {n} flips")

# Posterior: Beta(alpha_prior + heads, beta_prior + tails)
alpha_post = alpha_prior + heads
beta_post = beta_prior + (n - heads)
print(f"Posterior: Beta({alpha_post}, {beta_post})")

# Posterior mean
p_bayes = alpha_post / (alpha_post + beta_post)
print(f"\nPosterior mean: {p_bayes:.3f}")

# 95% credible interval
ci_bayes = stats.beta.interval(0.95, alpha_post, beta_post)
print(f"95% Credible interval: [{ci_bayes[0]:.3f}, {ci_bayes[1]:.3f}]")
print(f"\nInterpretation: Given the data, there's a 95% probability")
print(f"that p is between {ci_bayes[0]:.3f} and {ci_bayes[1]:.3f}")
print(f"✅ This is the intuitive interpretation we want!")

# Visualize
p_values = np.linspace(0, 1, 1000)

# Prior
prior = stats.beta.pdf(p_values, alpha_prior, beta_prior)

# Likelihood (scaled for visualization)
likelihood = stats.binom.pmf(heads, n, p_values)
likelihood = likelihood / likelihood.max() * prior.max()

# Posterior
posterior = stats.beta.pdf(p_values, alpha_post, beta_post)

plt.figure(figsize=(10, 6))
plt.plot(p_values, prior, 'b--', linewidth=2, label='Prior: Beta(2,2)')
plt.plot(p_values, likelihood, 'g:', linewidth=2, label='Likelihood (scaled)')
plt.plot(p_values, posterior, 'r-', linewidth=2, label='Posterior: Beta(9,5)')
plt.axvline(p_bayes, color='red', linestyle='--', alpha=0.5, label=f'Posterior mean: {p_bayes:.2f}')
plt.axvspan(ci_bayes[0], ci_bayes[1], alpha=0.2, color='red', label='95% Credible Interval')

plt.xlabel('Coin Bias (p)')
plt.ylabel('Density')
plt.title('Bayesian Updating: Prior × Likelihood → Posterior')
plt.legend()
plt.grid(True, alpha=0.3)
# plt.show()

print(f"\n💡 Key Insight:")
print(f"Bayesian posterior is a compromise between prior and data.")
print(f"With more data, posterior converges regardless of prior.")

✅ When to Use Each

Use Frequentist: Large samples, no prior knowledge, want objective analysis, regulatory requirements (FDA, etc.)
Use Bayesian: Small samples, have domain knowledge, need uncertainty quantification, sequential decision-making, complex hierarchical models

🧮 Bayes' Theorem: The Foundation

Bayes' theorem is the engine of Bayesian inference. It tells us how to update beliefs in light of new evidence.

P(θ | D) = P(D | θ) × P(θ) / P(D)

Posterior = Likelihood × Prior / Evidence

Components Explained

  • Prior P(θ): What we believe about parameter θ before seeing data
  • Likelihood P(D | θ): Probability of observing data D given θ
  • Evidence P(D): Total probability of data (normalizing constant)
  • Posterior P(θ | D): Updated belief about θ after seeing data D

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def bayesian_updating_demo():
    """Demonstrate how Bayes' theorem updates beliefs"""
    
    # Medical test example
    print("🏥 Medical Diagnosis with Bayes' Theorem")
    print("="*70)
    
    # Scenario: Testing for rare disease
    prevalence = 0.01  # 1% of population has disease
    sensitivity = 0.95  # P(+test | disease) - true positive rate
    specificity = 0.90  # P(-test | no disease) - true negative rate
    
    print(f"Disease prevalence: {prevalence*100}%")
    print(f"Test sensitivity: {sensitivity*100}%")
    print(f"Test specificity: {specificity*100}%")
    
    # Prior: P(disease)
    p_disease = prevalence
    p_no_disease = 1 - prevalence
    
    print(f"\nPRIOR (before test):")
    print(f"  P(Disease) = {p_disease:.4f}")
    print(f"  P(No Disease) = {p_no_disease:.4f}")
    
    # Likelihood: P(positive test | condition)
    p_pos_given_disease = sensitivity
    p_pos_given_no_disease = 1 - specificity
    
    # Evidence: P(positive test) - total probability
    p_positive = (p_pos_given_disease * p_disease + 
                  p_pos_given_no_disease * p_no_disease)
    
    print(f"\nLIKELIHOOD:")
    print(f"  P(+test | Disease) = {p_pos_given_disease:.4f}")
    print(f"  P(+test | No Disease) = {p_pos_given_no_disease:.4f}")
    
    print(f"\nEVIDENCE:")
    print(f"  P(+test) = {p_positive:.4f}")
    
    # Posterior: P(disease | positive test) - Bayes' theorem!
    p_disease_given_pos = (p_pos_given_disease * p_disease) / p_positive
    
    print(f"\nPOSTERIOR (after positive test):")
    print(f"  P(Disease | +test) = {p_disease_given_pos:.4f}")
    print(f"  = {p_disease_given_pos*100:.2f}%")
    
    print(f"\n💡 Key Insight:")
    print(f"Even with positive test (95% sensitivity),")
    print(f"only {p_disease_given_pos*100:.1f}% chance of disease")
    print(f"because disease is rare (1% prevalence)!")
    
    # What if we test positive twice?
    print(f"\n" + "="*70)
    print(f"Sequential Updating: What if second test is also positive?")
    print(f"="*70)
    
    # First test posterior becomes second test prior
    p_disease_after_1st = p_disease_given_pos
    p_no_disease_after_1st = 1 - p_disease_after_1st
    
    # Second test (assuming independent)
    p_positive_2nd = (p_pos_given_disease * p_disease_after_1st + 
                      p_pos_given_no_disease * p_no_disease_after_1st)
    
    p_disease_after_2nd = (p_pos_given_disease * p_disease_after_1st) / p_positive_2nd
    
    print(f"After 1st positive: {p_disease_after_1st*100:.2f}%")
    print(f"After 2nd positive: {p_disease_after_2nd*100:.2f}%")
    print(f"\nConfidence increased dramatically with more evidence!")
    
    return p_disease, p_disease_given_pos, p_disease_after_2nd

# Run demonstration
probs = bayesian_updating_demo()

⚠️ Base Rate Fallacy

People often ignore base rates (priors) and focus only on test accuracy. With a 1% prevalence, even a test with 95% sensitivity and 90% specificity produces mostly false positives among its positive results. This is why Bayesian thinking matters: it forces you to account for the base rate.
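
As a quick sanity check on that claim, the short snippet below reuses the prevalence, sensitivity, and specificity from the demo above and counts how many of the positive results come from healthy people.

# Base rate fallacy check, using the same example numbers as the demo above
population = 100_000
prevalence, sensitivity, specificity = 0.01, 0.95, 0.90

sick = population * prevalence                 # 1,000 people actually have the disease
healthy = population - sick                    # 99,000 do not

true_positives = sick * sensitivity            # 950 sick people test positive
false_positives = healthy * (1 - specificity)  # 9,900 healthy people test positive

share_false = false_positives / (true_positives + false_positives)
print(f"Share of positives that are false: {share_false:.1%}")  # ≈ 91% with these numbers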

🔗 Conjugate Priors

Some prior-likelihood pairs result in posteriors from the same family as the prior. These "conjugate" relationships make Bayesian inference computationally tractable.

Likelihood         | Conjugate Prior      | Example
Binomial           | Beta                 | Coin flips, conversion rates
Poisson            | Gamma                | Event counts, traffic rates
Normal (known σ)   | Normal               | Measurement errors
Normal (unknown σ) | Normal-Inverse-Gamma | General continuous data
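
The same update mechanics apply to the other rows of the table. As a minimal sketch (with made-up event counts), the Gamma-Poisson pair works like this: a Gamma(α, β) prior on a Poisson rate becomes Gamma(α + Σxᵢ, β + n) after observing n counts.

import numpy as np
from scipy import stats

# Hypothetical example: prior belief about events per minute, then 5 observed minutes
alpha_prior, beta_prior = 2.0, 1.0       # Gamma(shape, rate) prior on the Poisson rate λ
counts = np.array([3, 5, 4, 6, 2])       # observed counts (made up for illustration)

# Conjugate update: Gamma(alpha + sum(counts), beta + n)
alpha_post = alpha_prior + counts.sum()
beta_post = beta_prior + len(counts)

post_mean = alpha_post / beta_post
ci = stats.gamma.interval(0.95, alpha_post, scale=1/beta_post)
print(f"Posterior: Gamma({alpha_post:.0f}, {beta_post:.0f}), posterior mean rate ≈ {post_mean:.2f}")
print(f"95% credible interval for λ: [{ci[0]:.2f}, {ci[1]:.2f}]")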

Beta-Binomial: The Workhorse

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def beta_binomial_conjugacy():
    """Demonstrate Beta-Binomial conjugacy for A/B testing"""
    
    print("🧪 Bayesian A/B Test: Email Subject Lines")
    print("="*70)
    
    # Scenario: Testing two email subject lines
    # Variant A: 45 opens out of 200 sends
    # Variant B: 58 opens out of 200 sends
    
    # Prior: Beta(1, 1) = Uniform [0,1] - no prior knowledge
    alpha_prior = 1
    beta_prior = 1
    
    # Data
    opens_a = 45
    sends_a = 200
    opens_b = 58
    sends_b = 200
    
    print(f"Variant A: {opens_a}/{sends_a} opens ({opens_a/sends_a*100:.1f}%)")
    print(f"Variant B: {opens_b}/{sends_b} opens ({opens_b/sends_b*100:.1f}%)")
    
    # Posterior for A: Beta(alpha_prior + opens_a, beta_prior + (sends_a - opens_a))
    alpha_post_a = alpha_prior + opens_a
    beta_post_a = beta_prior + (sends_a - opens_a)
    
    # Posterior for B
    alpha_post_b = alpha_prior + opens_b
    beta_post_b = beta_prior + (sends_b - opens_b)
    
    print(f"\nPosterior A: Beta({alpha_post_a}, {beta_post_a})")
    print(f"Posterior B: Beta({alpha_post_b}, {beta_post_b})")
    
    # Posterior means
    mean_a = alpha_post_a / (alpha_post_a + beta_post_a)
    mean_b = alpha_post_b / (alpha_post_b + beta_post_b)
    
    print(f"\nPosterior means:")
    print(f"  E[p_A | data] = {mean_a:.4f}")
    print(f"  E[p_B | data] = {mean_b:.4f}")
    
    # Credible intervals
    ci_a = stats.beta.interval(0.95, alpha_post_a, beta_post_a)
    ci_b = stats.beta.interval(0.95, alpha_post_b, beta_post_b)
    
    print(f"\n95% Credible Intervals:")
    print(f"  A: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
    print(f"  B: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")
    
    # Probability B > A (via Monte Carlo)
    np.random.seed(42)
    samples_a = np.random.beta(alpha_post_a, beta_post_a, 10000)
    samples_b = np.random.beta(alpha_post_b, beta_post_b, 10000)
    
    prob_b_better = (samples_b > samples_a).mean()
    
    print(f"\n🎯 Decision Metric:")
    print(f"P(p_B > p_A | data) = {prob_b_better:.4f} = {prob_b_better*100:.2f}%")
    
    if prob_b_better > 0.95:
        print(f"✅ Strong evidence that B is better! Launch B.")
    elif prob_b_better > 0.90:
        print(f"⚠️  Moderate evidence B is better. Consider more data.")
    else:
        print(f"❌ Not enough evidence. Continue testing.")
    
    # Visualize posteriors
    p_values = np.linspace(0, 0.5, 1000)
    pdf_a = stats.beta.pdf(p_values, alpha_post_a, beta_post_a)
    pdf_b = stats.beta.pdf(p_values, alpha_post_b, beta_post_b)
    
    plt.figure(figsize=(10, 6))
    plt.plot(p_values, pdf_a, 'b-', linewidth=2, label=f'Variant A')
    plt.plot(p_values, pdf_b, 'r-', linewidth=2, label=f'Variant B')
    plt.axvline(mean_a, color='blue', linestyle='--', alpha=0.5)
    plt.axvline(mean_b, color='red', linestyle='--', alpha=0.5)
    
    plt.xlabel('Open Rate')
    plt.ylabel('Posterior Density')
    plt.title(f'Bayesian A/B Test: P(B better) = {prob_b_better:.2%}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    # plt.show()
    
    return prob_b_better

# Run demo
prob = beta_binomial_conjugacy()

🤖 Bayesian Machine Learning

Many ML algorithms have Bayesian interpretations or implementations. Let's explore the most important ones.

1. Naive Bayes Classifier (Revisited)

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import numpy as np

# Email spam classification
emails = [
    ("Win free money now!", "spam"),
    ("Meeting at 3pm tomorrow", "ham"),
    ("Congratulations! You won!", "spam"),
    ("Can you review this document?", "ham"),
    ("Claim your prize today", "spam"),
    ("Lunch on Thursday?", "ham"),
    ("Limited time offer!!!", "spam"),
    ("Project deadline reminder", "ham"),
] * 50  # Repeat to have more data

texts, labels = zip(*emails)

# Vectorize
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
y = np.array([1 if label == "spam" else 0 for label in labels])

# Train Naive Bayes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = MultinomialNB(alpha=1.0)  # alpha is Laplace smoothing (prior)
nb.fit(X_train, y_train)

print("📧 Naive Bayes Email Classifier")
print("="*70)
print(f"Training accuracy: {nb.score(X_train, y_train):.3f}")
print(f"Test accuracy: {nb.score(X_test, y_test):.3f}")

# Test new emails
new_emails = [
    "Free prize waiting for you",
    "Can we schedule a meeting",
]

X_new = vectorizer.transform(new_emails)
predictions = nb.predict(X_new)
probabilities = nb.predict_proba(X_new)

print(f"\nPredictions with probabilities:")
for email, pred, prob in zip(new_emails, predictions, probabilities):
    label = "spam" if pred == 1 else "ham"
    confidence = prob[pred]
    print(f"\n  '{email}'")
    print(f"  → {label.upper()} (confidence: {confidence:.3f})")
    print(f"     P(spam|text) = {prob[1]:.3f}")
    print(f"     P(ham|text) = {prob[0]:.3f}")

print(f"\n💡 Why 'Naive'?")
print(f"Assumes word probabilities are independent given class.")
print(f"P(spam|'free','prize') = P(spam) × P('free'|spam) × P('prize'|spam)")
print(f"This is naive but works surprisingly well in practice!")

2. Bayesian Linear Regression

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def bayesian_linear_regression():
    """Bayesian approach to linear regression with uncertainty"""
    
    np.random.seed(42)
    
    # Generate data: y = 2x + 1 + noise
    n_points = 10
    X_train = np.linspace(0, 10, n_points)
    y_train = 2 * X_train + 1 + np.random.normal(0, 2, n_points)
    
    print("📈 Bayesian Linear Regression")
    print("="*70)
    print(f"Training on {n_points} data points")
    
    # For simplicity, assume known noise variance
    sigma_noise = 2.0
    
    # Prior: Normal distribution for slope and intercept
    # P(slope) ~ N(0, 10) - vague prior
    # P(intercept) ~ N(0, 10)
    
    # Frequentist fit (for comparison)
    slope_freq = np.cov(X_train, y_train)[0, 1] / np.var(X_train, ddof=1)  # match ddof=1 used by np.cov
    intercept_freq = y_train.mean() - slope_freq * X_train.mean()
    
    print(f"\nFrequentist estimate:")
    print(f"  y = {slope_freq:.2f}x + {intercept_freq:.2f}")
    
    # Bayesian approach: posterior distribution over parameters
    # (Simplified - in practice use MCMC or variational inference)
    
    # Posterior mean (MAP estimate with vague prior ≈ MLE)
    X_design = np.column_stack([np.ones(len(X_train)), X_train])
    
    # Posterior covariance
    prior_cov = np.eye(2) * 100  # Vague prior
    post_cov_inv = np.linalg.inv(prior_cov) + (X_design.T @ X_design) / sigma_noise**2
    post_cov = np.linalg.inv(post_cov_inv)
    
    # Posterior mean
    post_mean = post_cov @ (X_design.T @ y_train) / sigma_noise**2
    
    slope_bayes = post_mean[1]
    intercept_bayes = post_mean[0]
    
    print(f"\nBayesian estimate (posterior mean):")
    print(f"  y = {slope_bayes:.2f}x + {intercept_bayes:.2f}")
    
    # Uncertainty in predictions
    X_test = np.linspace(-2, 12, 100)
    X_test_design = np.column_stack([np.ones(len(X_test)), X_test])
    
    # Posterior predictive mean
    y_pred = X_test_design @ post_mean
    
    # Posterior predictive variance (epistemic + aleatoric)
    pred_var = np.array([x @ post_cov @ x for x in X_test_design]) + sigma_noise**2
    pred_std = np.sqrt(pred_var)
    
    # 95% credible interval
    ci_lower = y_pred - 1.96 * pred_std
    ci_upper = y_pred + 1.96 * pred_std
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.scatter(X_train, y_train, color='blue', s=50, label='Training data', zorder=3)
    plt.plot(X_test, y_pred, 'r-', linewidth=2, label='Posterior mean', zorder=2)
    plt.fill_between(X_test, ci_lower, ci_upper, alpha=0.3, color='red', 
                     label='95% Credible interval', zorder=1)
    
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title('Bayesian Linear Regression: Uncertainty Quantification')
    plt.legend()
    plt.grid(True, alpha=0.3)
    # plt.show()
    
    print(f"\n💡 Key Insight:")
    print(f"Uncertainty is high where we have no data (extrapolation).")
    print(f"Uncertainty decreases where we have observations.")
    print(f"This is automatic in Bayesian approach!")
    
    return post_mean, post_cov

# Run demo
params = bayesian_linear_regression()

3. Bayesian Optimization for Hyperparameter Tuning

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

def bayesian_optimization_demo():
    """Simple Bayesian Optimization example"""
    
    print("🎯 Bayesian Optimization: Finding Optimal Hyperparameter")
    print("="*70)
    
    # Objective function (unknown to optimizer)
    # Simulating model performance as function of learning rate
    def objective(x):
        return -((x - 0.7)**2) * np.sin(14 * x)  # Complex function
    
    # Initial random samples
    np.random.seed(42)
    X_init = np.random.uniform(0, 1, 3).reshape(-1, 1)
    y_init = objective(X_init).ravel()
    
    print(f"Starting with {len(X_init)} random evaluations")
    
    # Bayesian optimization loop
    X_samples = X_init.copy()
    y_samples = y_init.copy()
    
    n_iterations = 7
    
    for iteration in range(n_iterations):
        # Fit Gaussian Process
        kernel = ConstantKernel(1.0) * RBF(length_scale=0.1)
        gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, random_state=42)
        gp.fit(X_samples, y_samples)
        
        # Acquisition function: Upper Confidence Bound (UCB)
        X_test = np.linspace(0, 1, 1000).reshape(-1, 1)
        mu, sigma = gp.predict(X_test, return_std=True)
        
        # UCB = mean + kappa * std (exploration-exploitation trade-off)
        kappa = 2.0
        ucb = mu + kappa * sigma
        
        # Find next point to sample
        next_x = X_test[np.argmax(ucb)]
        next_y = objective(next_x).item()  # extract scalar so formatting and appending work cleanly
        
        # Add to samples
        X_samples = np.vstack([X_samples, next_x])
        y_samples = np.append(y_samples, next_y)
        
        print(f"Iteration {iteration+1}: x={next_x[0]:.3f}, f(x)={next_y:.3f}")
    
    # Best found
    best_idx = np.argmax(y_samples)
    best_x = X_samples[best_idx]
    best_y = y_samples[best_idx]
    
    # True optimum (for comparison)
    X_true = np.linspace(0, 1, 10000).reshape(-1, 1)
    y_true = objective(X_true).ravel()
    true_best_idx = np.argmax(y_true)
    true_best_x = X_true[true_best_idx]
    true_best_y = y_true[true_best_idx]
    
    print(f"\n🏆 Results:")
    print(f"Best found: x={best_x[0]:.3f}, f(x)={best_y:.3f}")
    print(f"True optimum: x={true_best_x[0]:.3f}, f(x)={true_best_y:.3f}")
    print(f"Found {best_y/true_best_y*100:.1f}% of optimal in {len(X_samples)} evaluations")
    
    print(f"\n💡 Why Bayesian Optimization?")
    print(f"- Uses Gaussian Process to model objective function")
    print(f"- Balances exploration (uncertain regions) and exploitation (promising regions)")
    print(f"- Much more sample-efficient than grid/random search")
    print(f"- Perfect for expensive evaluations (deep learning hyperparameters)")
    
    return best_x, best_y

# Run demo
best = bayesian_optimization_demo()

💡 Bayesian Deep Learning

Modern research extends Bayesian methods to neural networks: Bayesian neural networks place distributions over weights for uncertainty quantification; Monte Carlo dropout approximates Bayesian inference by keeping dropout active at prediction time; variational inference makes posteriors tractable in high dimensions. These ideas are used in safety-critical applications such as autonomous vehicles (knowing when the model is uncertain), medical diagnosis (quantifying confidence), and reinforcement learning (balancing exploration and exploitation).
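
To make the Monte Carlo dropout idea concrete, here is a minimal NumPy sketch with toy, untrained weights (a stand-in for a real trained network): dropout stays active at prediction time, the same input goes through many stochastic forward passes, and the spread of the outputs serves as an uncertainty estimate.

import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network with made-up, fixed weights
W1, b1 = rng.normal(size=(1, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)), np.zeros(1)

def forward(x, dropout_rate=0.5):
    """One stochastic forward pass with dropout left ON (inverted dropout scaling)."""
    h = np.maximum(0, x @ W1 + b1)              # ReLU hidden layer
    mask = rng.random(h.shape) > dropout_rate   # random dropout mask
    h = h * mask / (1 - dropout_rate)
    return h @ W2 + b2

x = np.array([[0.3]])
predictions = np.array([forward(x) for _ in range(200)]).ravel()

print(f"Predictive mean: {predictions.mean():.3f}")
print(f"Predictive std (uncertainty estimate): {predictions.std():.3f}")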

🎲 MCMC: Sampling from Posteriors

For complex models, we can't compute posterior distributions analytically. Markov Chain Monte Carlo (MCMC) methods let us sample from posteriors even when we can't write them down.

The Idea

Instead of computing P(θ | D) exactly, we generate samples {θ₁, θ₂, ..., θₙ} that follow the posterior distribution. With enough samples, we can approximate any posterior quantity (mean, variance, credible intervals, etc.).

Metropolis-Hastings Algorithm (Simple MCMC)

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def metropolis_hastings_demo():
    """Demonstrate Metropolis-Hastings algorithm"""
    
    print("🎲 Metropolis-Hastings MCMC Sampling")
    print("="*70)
    
    # Target distribution: mixture of two Gaussians (hard to sample directly)
    def target_distribution(x):
        return (0.3 * stats.norm.pdf(x, -2, 0.5) + 
                0.7 * stats.norm.pdf(x, 3, 1.0))
    
    # Metropolis-Hastings algorithm
    def metropolis_hastings(target, n_samples=10000, proposal_std=1.0):
        samples = []
        current = 0.0  # Starting point
        
        accepted = 0
        
        for i in range(n_samples):
            # Propose new state
            proposal = current + np.random.normal(0, proposal_std)
            
            # Calculate acceptance ratio
            acceptance_ratio = target(proposal) / target(current)
            
            # Accept or reject
            if np.random.uniform() < acceptance_ratio:
                current = proposal
                accepted += 1
            
            samples.append(current)
        
        acceptance_rate = accepted / n_samples
        return np.array(samples), acceptance_rate
    
    # Run MCMC
    samples, acc_rate = metropolis_hastings(target_distribution, n_samples=10000)
    
    print(f"Generated {len(samples)} samples")
    print(f"Acceptance rate: {acc_rate:.3f}")
    
    # Compute statistics from samples
    print(f"\nPosterior statistics (from samples):")
    print(f"  Mean: {samples.mean():.3f}")
    print(f"  Median: {np.median(samples):.3f}")
    print(f"  Std: {samples.std():.3f}")
    
    # 95% credible interval
    ci = np.percentile(samples, [2.5, 97.5])
    print(f"  95% Credible Interval: [{ci[0]:.3f}, {ci[1]:.3f}]")
    
    # Visualize
    x = np.linspace(-5, 6, 1000)
    true_pdf = target_distribution(x)
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # True distribution vs samples
    axes[0, 0].plot(x, true_pdf, 'r-', linewidth=2, label='True distribution')
    axes[0, 0].hist(samples[1000:], bins=50, density=True, alpha=0.6, label='MCMC samples')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('Density')
    axes[0, 0].set_title('Samples vs True Distribution')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # Trace plot
    axes[0, 1].plot(samples[:1000], alpha=0.7)
    axes[0, 1].set_xlabel('Iteration')
    axes[0, 1].set_ylabel('Value')
    axes[0, 1].set_title('Trace Plot (first 1000 samples)')
    axes[0, 1].grid(True, alpha=0.3)
    
    # Autocorrelation
    # Autocorrelation of the mean-centered chain
    autocorr = np.correlate(samples - samples.mean(), samples - samples.mean(), mode='full')
    autocorr = autocorr[len(autocorr)//2:]
    autocorr = autocorr / autocorr[0]
    
    axes[1, 0].plot(autocorr[:100])
    axes[1, 0].set_xlabel('Lag')
    axes[1, 0].set_ylabel('Autocorrelation')
    axes[1, 0].set_title('Autocorrelation Plot')
    axes[1, 0].grid(True, alpha=0.3)
    
    # Running mean
    running_mean = np.cumsum(samples) / np.arange(1, len(samples) + 1)
    axes[1, 1].plot(running_mean)
    axes[1, 1].axhline(samples.mean(), color='r', linestyle='--', label='Final mean')
    axes[1, 1].set_xlabel('Iteration')
    axes[1, 1].set_ylabel('Running Mean')
    axes[1, 1].set_title('Convergence: Running Mean')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    # plt.show()
    
    print(f"\n💡 Key MCMC Concepts:")
    print(f"- Trace plot: Should look like 'fuzzy caterpillar' (good mixing)")
    print(f"- Autocorrelation: Should decay quickly (independent samples)")
    print(f"- Running mean: Should converge (chain has converged)")
    print(f"- Burn-in: Discard first samples (before convergence)")
    
    return samples

# Run demo
samples = metropolis_hastings_demo()

✅ Modern MCMC Tools

In practice, use specialized libraries: PyMC (formerly PyMC3) for probabilistic programming in Python, Stan for fast Hamiltonian Monte Carlo, TensorFlow Probability for Bayesian deep learning, and Pyro for deep probabilistic programming. These handle complex models automatically and are far more efficient than basic Metropolis-Hastings.
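
As a taste of what these tools look like, here is a minimal sketch of the coin-bias model from the start of this tutorial written in PyMC (assuming PyMC ≥ 5 and ArviZ are installed; defaults may differ slightly between versions).

import pymc as pm
import arviz as az

with pm.Model():
    p = pm.Beta("p", alpha=2, beta=2)             # prior: Beta(2, 2)
    pm.Binomial("heads", n=10, p=p, observed=7)   # likelihood: 7 heads in 10 flips
    idata = pm.sample(2000, tune=1000, chains=2)  # NUTS (Hamiltonian Monte Carlo) sampling

print(az.summary(idata, var_names=["p"]))         # posterior mean, sd, HDI, diagnostics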

💻 Practice Exercises

Exercise 1: Medical Diagnosis

Scenario: Disease affects 0.5% of population. Test has 99% sensitivity and 95% specificity.

  1. If you test positive, what's probability you have disease?
  2. If you test negative, what's probability you're healthy?
  3. Why is the answer to question 1 so surprising?

Exercise 2: Beta-Binomial Updating

Scenario: Prior belief: Beta(2, 2). Observe 8 successes in 10 trials.

  1. What is the posterior distribution?
  2. What is the posterior mean?
  3. Calculate 95% credible interval
  4. How does this differ from a uniform Beta(1,1) prior?

Exercise 3: Bayesian A/B Test

Scenario: Control: 120/1000 conversions. Treatment: 140/1000 conversions.

  1. Use Beta-Binomial model to estimate P(treatment > control)
  2. Would you launch treatment if threshold is 95%?
  3. How many more samples needed to reach 95% confidence?

Exercise 4: Prior Sensitivity

Challenge: Compare three priors for coin flip problem (5 heads in 10 flips):

  • Uniform: Beta(1, 1)
  • Skeptical: Beta(10, 10) - believes coin is fair
  • Optimistic: Beta(8, 2) - believes heads likely
  1. Calculate posterior for each prior
  2. How do posteriors differ?
  3. Which prior is most affected by data?

Exercise 5: Credible vs Confidence Intervals

Scenario: Sample mean = 75, sample std = 10, n = 25. True population mean unknown.

  1. Calculate 95% confidence interval (frequentist)
  2. Calculate 95% credible interval (Bayesian with vague prior)
  3. Write correct interpretation for each
  4. Which interpretation is more intuitive?

📝 Summary

Congratulations! You've completed the Statistics for AI course. You've mastered the Bayesian paradigm—the foundation of probabilistic thinking in AI:

🎯 Bayesian Paradigm

Probability as belief. Parameters as distributions. Update beliefs with data via Bayes' theorem. More intuitive than frequentist approach.

🧮 Bayes' Theorem

Posterior = Likelihood × Prior / Evidence. Foundation of all Bayesian inference. Combines data with prior knowledge systematically.

🔗 Conjugate Priors

Beta-Binomial, Gamma-Poisson, Normal-Normal. Analytical posteriors. Foundation for efficient Bayesian inference in simple models.

🤖 Bayesian ML

Naive Bayes, Bayesian regression, Bayesian optimization, Bayesian neural networks. Uncertainty quantification built-in.

🎲 MCMC Sampling

Sample from complex posteriors. Metropolis-Hastings, Hamiltonian MC. Modern tools: PyMC, Stan, TensorFlow Probability.

📊 Practical Applications

A/B testing, spam filtering, medical diagnosis, hyperparameter tuning, recommender systems. Bayesian thinking everywhere in modern AI.

🎓 Course Complete!

You've mastered statistics for AI across 6 comprehensive tutorials: descriptive statistics, probability foundations, distributions, sampling & CLT, hypothesis testing, and Bayesian inference. You now have the statistical toolkit to: understand research papers, design rigorous experiments, build probabilistic models, quantify uncertainty, and make data-driven decisions with confidence. This is the foundation every AI practitioner needs!

🏆 Ready to Earn Your Certificate?

Complete the quiz below to demonstrate your mastery and earn your free Statistics for AI certificate!


🎯 Test Your Knowledge

Question 1: In Bayesian inference, the posterior distribution represents:

a) Our belief before seeing data
b) The probability of the data
c) Our updated belief after observing data
d) The likelihood function

Question 2: A 95% credible interval means:

a) 95% of repeated experiments will contain the true parameter
b) There's 95% probability the parameter is in this interval given the data
c) The parameter is definitely in this range
d) Same as a confidence interval

Question 3: For a Binomial likelihood, the conjugate prior is:

a) Beta distribution
b) Gamma distribution
c) Normal distribution
d) Uniform distribution

Question 4: The main advantage of Bayesian methods over frequentist is:

a) Always gives the same answer
b) Computationally faster
c) Doesn't require assumptions
d) Natural uncertainty quantification and incorporation of prior knowledge

Question 5: MCMC methods are used to:

a) Calculate exact posteriors
b) Avoid using priors
c) Sample from complex posterior distributions
d) Compute p-values

Question 6: In Bayesian A/B testing, to decide if treatment is better we calculate:

a) A p-value
b) P(treatment > control | data)
c) A confidence interval
d) The z-statistic

Question 7: The "Naive" in Naive Bayes refers to:

a) Assumption that features are independent given the class
b) It's a simple algorithm
c) It doesn't use Bayes' theorem correctly
d) It ignores prior probabilities

Question 8: Bayesian optimization is particularly useful for:

a) Large datasets with many evaluations
b) Computing exact solutions
c) Expensive function evaluations like hyperparameter tuning
d) Linear regression problems