
Hypothesis Testing & A/B Testing

Learn to make data-driven decisions using statistical tests and run rigorous A/B experiments

📅 Tutorial 5 📊 Intermediate ⏱️ 80 min


🎯 Why Hypothesis Testing Matters

Should you launch the new website design? Does the drug actually work? Is your ML model truly better than the baseline? These aren't questions of opinion—they're questions that demand statistical evidence. Hypothesis testing provides a rigorous framework for making decisions under uncertainty.

Every day, tech companies run thousands of A/B tests to optimize products. Netflix tests thumbnails, Amazon tests checkout flows, Google tests search algorithms. These tests all use hypothesis testing to separate real improvements from random noise. Master this, and you'll make better decisions backed by data, not hunches.

💡 Real-World Impact

Google famously tested 41 shades of blue for ad links and found the optimal shade generated $200M additional revenue. Microsoft's Bing ran an A/B test that increased revenue by $100M annually from a simple UI change. Facebook runs over 10,000 A/B tests per year. Understanding hypothesis testing isn't optional—it's how modern tech companies make billion-dollar decisions.

🔬 What is Hypothesis Testing?

Hypothesis testing is a statistical method for making decisions about population parameters based on sample data. It's a formalized way of saying "Is this effect real, or could it just be random chance?"

The Process

  1. State hypotheses: Null hypothesis (H₀) and alternative hypothesis (H₁)
  2. Choose significance level: Usually α = 0.05 (5% chance of false positive)
  3. Collect data: Gather sample(s) from population(s)
  4. Calculate test statistic: Convert data to a single number (t, z, χ², etc.)
  5. Find p-value: Probability of seeing your result if H₀ is true
  6. Make decision: If p-value < α, reject H₀; otherwise, fail to reject H₀

Null vs Alternative Hypothesis

  • Null Hypothesis (H₀): The status quo, "no effect", "no difference". What we assume is true until proven otherwise.
  • Alternative Hypothesis (H₁ or Hₐ): What we're trying to prove, the "interesting" result, the claim.
import numpy as np
from scipy import stats

# Example: Is a coin fair?
# H₀: p = 0.5 (coin is fair)
# H₁: p ≠ 0.5 (coin is biased)

np.random.seed(42)

# Flip coin 100 times
flips = np.random.choice(['H', 'T'], size=100, p=[0.5, 0.5])
heads = (flips == 'H').sum()
p_observed = heads / len(flips)

print("Coin Fairness Test:")
print(f"Flips: {len(flips)}")
print(f"Heads: {heads}")
print(f"Observed proportion: {p_observed:.3f}")
print(f"\nH₀: p = 0.5 (fair coin)")
print(f"H₁: p ≠ 0.5 (biased coin)")

# Binomial test (binomtest replaces the deprecated binom_test in SciPy >= 1.7)
p_value = stats.binomtest(int(heads), n=len(flips), p=0.5, alternative='two-sided').pvalue
print(f"\np-value: {p_value:.4f}")

# Decision
alpha = 0.05
if p_value < alpha:
    print(f"Decision: Reject H₀ (p < {alpha})")
    print("Conclusion: Coin appears to be biased")
else:
    print(f"Decision: Fail to reject H₀ (p ≥ {alpha})")
    print("Conclusion: No evidence coin is biased")
⚠️ "Fail to Reject" ≠ "Accept"

We never "accept" the null hypothesis. We either reject it (have evidence against it) or fail to reject it (don't have enough evidence). Absence of evidence is not evidence of absence!

📊 Understanding P-values

The p-value is the probability of observing your data (or something more extreme) if the null hypothesis is true. It's NOT the probability that H₀ is true!

Correct Interpretation

  • p = 0.03: "If there's really no effect, we'd see results this extreme 3% of the time"
  • NOT: "3% chance the null hypothesis is true"
  • NOT: "97% chance the alternative is true"
  • NOT: "The effect size is 3%"

What P-values Tell Us

  • Small p-value (< 0.05): Data is unusual under H₀, evidence against null
  • Large p-value (≥ 0.05): Data is compatible with H₀, no strong evidence
  • P-values are continuous: p=0.049 and p=0.051 are essentially the same; don't be dogmatic about the 0.05 threshold
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Visualize p-value concept
np.random.seed(42)

# Scenario: Testing if a new teaching method improves test scores
# H₀: μ = 70 (no improvement)
# H₁: μ > 70 (improvement)

population_mean_h0 = 70
population_std = 10
sample_size = 30

# Sample data (actual class with new method)
sample = np.random.normal(73, 10, sample_size)  # True mean is 73
sample_mean = sample.mean()
se = population_std / np.sqrt(sample_size)

# Calculate test statistic (z-score)
z_statistic = (sample_mean - population_mean_h0) / se

# Calculate p-value (one-tailed: testing if > 70)
p_value = 1 - stats.norm.cdf(z_statistic)

print("Teaching Method Experiment:")
print(f"H₀: μ = {population_mean_h0} (no improvement)")
print(f"H₁: μ > {population_mean_h0} (improvement)")
print(f"\nSample size: {sample_size}")
print(f"Sample mean: {sample_mean:.2f}")
print(f"Standard error: {se:.2f}")
print(f"Z-statistic: {z_statistic:.3f}")
print(f"P-value: {p_value:.4f}")

# Visualize
x = np.linspace(60, 80, 1000)
y = stats.norm.pdf(x, population_mean_h0, se)

plt.figure(figsize=(10, 5))
plt.plot(x, y, linewidth=2, color='steelblue', label='Null distribution')
plt.axvline(population_mean_h0, color='red', linestyle='--', linewidth=2, label=f'H₀: μ={population_mean_h0}')
plt.axvline(sample_mean, color='green', linestyle='--', linewidth=2, label=f'Observed: {sample_mean:.1f}')

# Shade p-value region
x_fill = x[x >= sample_mean]
y_fill = stats.norm.pdf(x_fill, population_mean_h0, se)
plt.fill_between(x_fill, y_fill, alpha=0.3, color='coral', label=f'p-value={p_value:.4f}')

plt.xlabel('Test Score Mean')
plt.ylabel('Probability Density')
plt.title('P-value: Probability of Observing This Result (or More Extreme) if H₀ is True')
plt.legend()
plt.grid(True, alpha=0.3)
# plt.show()  # Uncomment to display

if p_value < 0.05:
    print(f"\nDecision: Reject H₀ (p={p_value:.4f} < 0.05)")
    print("Conclusion: New teaching method significantly improves scores")
else:
    print(f"\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ 0.05)")
    print("Conclusion: No significant evidence of improvement")
✅ P-value Best Practices

1. Set α before collecting data: Don't p-hack by trying different tests.
2. Report the exact p-value: Say p=0.032, not just "p < 0.05".
3. Report effect size too: Statistical significance ≠ practical importance.
4. Consider confidence intervals: They provide more information than p-values alone. (A short example combining these practices follows below.)
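As a minimal sketch of these practices combined (the data here are made up purely for illustration), the snippet below reports the exact p-value, Cohen's d, and a 95% confidence interval for a difference in means instead of a bare significant/not-significant verdict:

import numpy as np
from scipy import stats

# Hypothetical data for illustration only
np.random.seed(0)
alpha = 0.05  # chosen before looking at the data
group_a = np.random.normal(5.0, 1.5, 60)
group_b = np.random.normal(5.4, 1.5, 60)

# Exact p-value (not just "p < 0.05")
t_stat, p_value = stats.ttest_ind(group_b, group_a)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Effect size (Cohen's d)
pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_std
print(f"Cohen's d = {cohens_d:.3f}")

# 95% CI for the difference in means (t-based, equal-variance assumption)
diff = group_b.mean() - group_a.mean()
se_diff = pooled_std * np.sqrt(1/len(group_a) + 1/len(group_b))
dof = len(group_a) + len(group_b) - 2
ci = stats.t.interval(0.95, dof, loc=diff, scale=se_diff)
print(f"Difference = {diff:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")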

⚠️ Type I and Type II Errors

Hypothesis testing isn't perfect—we can make two types of mistakes. Understanding these errors is crucial for designing experiments correctly.

                       H₀ is Actually True                H₀ is Actually False
Reject H₀              Type I Error                       Correct Decision
                       (False Positive)                   (True Positive)
                       Probability = α (usually 0.05)     Power = 1-β
Fail to Reject H₀      Correct Decision                   Type II Error
                       (True Negative)                    (False Negative)
                       Probability = 1-α                  Probability = β

Understanding the Errors

  • Type I Error (α): Rejecting H₀ when it's actually true. "False alarm". We control this with significance level α.
  • Type II Error (β): Failing to reject H₀ when it's actually false. "Missed detection". Related to statistical power.
  • Statistical Power (1-β): Probability of correctly rejecting false H₀. Higher power = better at detecting real effects.
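These error rates can also be checked by simulation. The sketch below (illustrative parameters only: n=30 per group, a true effect of 0.5 standard deviations under H₁) repeats a two-sample t-test many times and counts how often each type of error occurs:

import numpy as np
from scipy import stats

np.random.seed(1)
n_sims, n, alpha = 5000, 30, 0.05

# Under H0 (no true difference): any rejection is a Type I error
type_i = 0
for _ in range(n_sims):
    a = np.random.normal(0, 1, n)
    b = np.random.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        type_i += 1

# Under H1 (true difference of 0.5 SD): any non-rejection is a Type II error
type_ii = 0
for _ in range(n_sims):
    a = np.random.normal(0, 1, n)
    b = np.random.normal(0.5, 1, n)
    if stats.ttest_ind(a, b).pvalue >= alpha:
        type_ii += 1

print(f"Empirical Type I error rate (should be close to {alpha}): {type_i / n_sims:.3f}")
print(f"Empirical Type II error rate (beta): {type_ii / n_sims:.3f}")
print(f"Empirical power (1 - beta): {1 - type_ii / n_sims:.3f}")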

Real-World Examples

# Medical test analogy
print("Medical Test Interpretation:")
print("H₀: Patient is healthy")
print("H₁: Patient has disease\n")

print("Type I Error (α = False Positive):")
print("  → Diagnose disease when patient is healthy")
print("  → Patient undergoes unnecessary treatment")
print("  → Anxiety, cost, side effects\n")

print("Type II Error (β = False Negative):")
print("  → Miss disease when patient is sick")
print("  → Disease goes untreated")
print("  → Potentially fatal consequences\n")

print("Which is worse? Depends on context!")
print("  → Serious disease: Type II worse (missing it is dangerous)")
print("  → Benign condition: Type I worse (false alarm causes harm)")

print("\n" + "="*60)
print("A/B Test Analogy:")
print("H₀: New design has same conversion as old")
print("H₁: New design has higher conversion\n")

print("Type I Error:")
print("  → Launch 'better' design that's actually worse")
print("  → Lose revenue, waste engineering time\n")

print("Type II Error:")
print("  → Miss actually better design")
print("  → Opportunity cost of not improving")

Power Analysis

from scipy import stats
import numpy as np

def calculate_power(effect_size, sample_size, alpha=0.05):
    """Calculate statistical power of a two-sided, two-sample t-test"""
    
    # For a two-sample t-test with equal group sizes:
    # effect_size = Cohen's d = (mean1 - mean2) / pooled_std
    # sample_size = n per group
    
    # Non-centrality parameter
    ncp = effect_size * np.sqrt(sample_size / 2)
    
    # Critical value for a two-sided test at level alpha
    df = 2 * sample_size - 2
    critical_value = stats.t.ppf(1 - alpha/2, df=df)
    
    # Power = P(reject H0 | H1 is true), using the noncentral t distribution
    # (the lower rejection region contributes negligibly for positive effects)
    power = 1 - stats.nct.cdf(critical_value, df=df, nc=ncp)
    
    return power

# Example: How does power change with sample size and effect size?
print("Statistical Power Analysis:")
print("="*60)

effect_sizes = [0.2, 0.5, 0.8]  # Small, medium, large (Cohen's d)
sample_sizes = [30, 50, 100, 200, 500]

print(f"{'Sample Size':<15} {'Small (d=0.2)':<20} {'Medium (d=0.5)':<20} {'Large (d=0.8)'}")
print("-"*75)

for n in sample_sizes:
    powers = [calculate_power(d, n) for d in effect_sizes]
    print(f"{n:<15} {powers[0]:<20.3f} {powers[1]:<20.3f} {powers[2]:.3f}")

print("\nKey Insights:")
print("  → Larger effects easier to detect (higher power)")
print("  → Larger samples give higher power")
print("  → Power ≥ 0.80 is typically desired")
print("  → Detecting small effects requires large samples")
⚠️ The Trade-off

Decreasing α (Type I error rate) increases β (Type II error rate) for fixed sample size. You can't minimize both simultaneously! Solution: increase sample size or accept the trade-off based on consequences of each error type.
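To make the trade-off concrete, here is a small sketch (normal-approximation power for a two-sided, two-sample test; the effect size and sample size are arbitrary illustrative values) showing that tightening α lowers power, i.e. raises β, when n is held fixed:

import numpy as np
from scipy import stats

def approx_power_two_sample(effect_size, n_per_group, alpha):
    """Normal-approximation power for a two-sided, two-sample test."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = effect_size * np.sqrt(n_per_group / 2)   # shift of the test statistic under H1
    # P(|Z| > z_crit) when Z is centered at ncp instead of 0
    return (1 - stats.norm.cdf(z_crit - ncp)) + stats.norm.cdf(-z_crit - ncp)

effect_size = 0.3    # assumed standardized effect (Cohen's d)
n_per_group = 100    # fixed sample size per group

print(f"{'alpha':<10}{'power':<10}{'beta (Type II rate)'}")
for alpha in [0.10, 0.05, 0.01, 0.001]:
    power = approx_power_two_sample(effect_size, n_per_group, alpha)
    print(f"{alpha:<10}{power:<10.3f}{1 - power:.3f}")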

📈 T-Tests

T-tests compare means when the population variance is unknown, which is almost always the case in practice; the t-distribution's heavier tails account for that extra uncertainty, and this matters most with small to moderate samples. They're the workhorse of hypothesis testing.

1. One-Sample T-Test

Compare sample mean to a known value.

from scipy import stats
import numpy as np

# Example: Is average user session time different from 5 minutes?
np.random.seed(42)

# Sample data (in minutes)
session_times = np.random.normal(5.5, 1.2, 40)

# Hypothesis
mu_0 = 5.0  # Historical average
print("One-Sample T-Test:")
print(f"H₀: μ = {mu_0} minutes")
print(f"H₁: μ ≠ {mu_0} minutes")
print(f"\nSample size: {len(session_times)}")
print(f"Sample mean: {session_times.mean():.3f} minutes")
print(f"Sample std: {session_times.std(ddof=1):.3f}")

# Perform t-test
t_statistic, p_value = stats.ttest_1samp(session_times, mu_0)

print(f"\nT-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value:.4f}")

# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H₀ (p={p_value:.4f} < {alpha})")
    print("Conclusion: Average session time is significantly different from 5 minutes")
else:
    print(f"\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ {alpha})")
    print("Conclusion: No significant difference from 5 minutes")

# Confidence interval
ci = stats.t.interval(0.95, len(session_times)-1, 
                      loc=session_times.mean(),
                      scale=stats.sem(session_times))
print(f"\n95% CI: [{ci[0]:.3f}, {ci[1]:.3f}] minutes")

2. Independent Two-Sample T-Test

Compare means of two independent groups.

# Example: Does new UI improve engagement time?
np.random.seed(42)

# Control group (old UI)
control = np.random.normal(5.0, 1.5, 50)

# Treatment group (new UI)
treatment = np.random.normal(5.8, 1.5, 50)

print("Independent Two-Sample T-Test:")
print("H₀: μ_treatment = μ_control (no difference)")
print("H₁: μ_treatment ≠ μ_control (there is a difference)")

print(f"\nControl group (n={len(control)}):")
print(f"  Mean: {control.mean():.3f}, Std: {control.std(ddof=1):.3f}")

print(f"\nTreatment group (n={len(treatment)}):")
print(f"  Mean: {treatment.mean():.3f}, Std: {treatment.std(ddof=1):.3f}")

print(f"\nObserved difference: {treatment.mean() - control.mean():.3f}")

# Perform t-test (assuming equal variances)
t_statistic, p_value = stats.ttest_ind(treatment, control)

print(f"\nT-statistic: {t_statistic:.3f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H₀ (p={p_value:.4f} < {alpha})")
    print("Conclusion: New UI significantly affects engagement time")
    
    # Effect size (Cohen's d)
    pooled_std = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    cohens_d = (treatment.mean() - control.mean()) / pooled_std
    print(f"Effect size (Cohen's d): {cohens_d:.3f}")
    
    if abs(cohens_d) < 0.2:
        print("Effect size: Negligible")
    elif abs(cohens_d) < 0.5:
        print("Effect size: Small")
    elif abs(cohens_d) < 0.8:
        print("Effect size: Medium")
    else:
        print("Effect size: Large")
else:
    print(f"\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ {alpha})")
    print("Conclusion: No significant difference")

3. Paired T-Test

Compare means when samples are related (before/after, matched pairs).

# Example: Before/after training program
np.random.seed(42)

# 30 employees measured before and after training
before_training = np.random.normal(70, 10, 30)
improvement = np.random.normal(5, 8, 30)  # Average 5 point improvement
after_training = before_training + improvement

print("Paired T-Test (Before/After Training):")
print("H₀: μ_difference = 0 (no improvement)")
print("H₁: μ_difference > 0 (improvement)")

print(f"\nSample size: {len(before_training)} employees")
print(f"Before training: Mean={before_training.mean():.2f}, Std={before_training.std(ddof=1):.2f}")
print(f"After training:  Mean={after_training.mean():.2f}, Std={after_training.std(ddof=1):.2f}")

differences = after_training - before_training
print(f"\nDifferences: Mean={differences.mean():.2f}, Std={differences.std(ddof=1):.2f}")

# Perform paired t-test
t_statistic, p_value = stats.ttest_rel(after_training, before_training)

# One-tailed test (testing if improvement > 0)
p_value_one_tailed = p_value / 2 if t_statistic > 0 else 1 - p_value/2

print(f"\nT-statistic: {t_statistic:.3f}")
print(f"P-value (one-tailed): {p_value_one_tailed:.4f}")

alpha = 0.05
if p_value_one_tailed < alpha:
    print(f"\nDecision: Reject H₀ (p={p_value_one_tailed:.4f} < {alpha})")
    print("Conclusion: Training significantly improves performance")
else:
    print(f"\nDecision: Fail to reject H₀ (p={p_value_one_tailed:.4f} ≥ {alpha})")
    print("Conclusion: No significant improvement from training")

🎲 Chi-Square Test

Chi-square tests are used for categorical data. They test whether observed frequencies differ from expected frequencies or whether two categorical variables are independent.

Chi-Square Test of Independence

Test if two categorical variables are related.

from scipy.stats import chi2_contingency
import numpy as np
import pandas as pd

# Example: Does device type affect purchase behavior?
# Contingency table: Device x Purchase Decision

data = {
    'Device': ['Mobile']*300 + ['Desktop']*200 + ['Tablet']*100,
    'Purchased': (['Yes']*45 + ['No']*255 +  # Mobile
                  ['Yes']*50 + ['No']*150 +  # Desktop
                  ['Yes']*15 + ['No']*85)    # Tablet
}

df = pd.DataFrame(data)

# Create contingency table
contingency_table = pd.crosstab(df['Device'], df['Purchased'])
print("Chi-Square Test of Independence:")
print("H₀: Device type and purchase behavior are independent")
print("H₁: Device type and purchase behavior are related\n")

print("Observed Frequencies:")
print(contingency_table)
print()

# Calculate percentages
print("Purchase Rates by Device:")
for device in contingency_table.index:
    total = contingency_table.loc[device].sum()
    yes = contingency_table.loc[device, 'Yes']
    rate = yes / total * 100
    print(f"  {device}: {yes}/{total} = {rate:.1f}%")

# Perform chi-square test
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square statistic: {chi2_stat:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value:.4f}")

print("\nExpected Frequencies (if independent):")
print(pd.DataFrame(expected, 
                   index=contingency_table.index, 
                   columns=contingency_table.columns))

alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H₀ (p={p_value:.4f} < {alpha})")
    print("Conclusion: Device type and purchase behavior are significantly related")
    print("Recommendation: Optimize conversion funnels differently per device")
else:
    print(f"\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ {alpha})")
    print("Conclusion: No significant relationship between device and purchases")

Chi-Square Goodness-of-Fit Test

Test if observed distribution matches expected distribution.

from scipy.stats import chisquare

# Example: Are website visitors evenly distributed across days of week?
observed = np.array([152, 145, 148, 142, 156, 180, 177])  # Mon-Sun
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

print("Chi-Square Goodness-of-Fit Test:")
print("H₀: Visitors are evenly distributed across days")
print("H₁: Visitors are not evenly distributed\n")

print("Observed visitors:")
for day, count in zip(days, observed):
    print(f"  {day}: {count}")

total = observed.sum()
expected = np.array([total/7] * 7)  # Equal distribution

print(f"\nExpected per day (if equal): {expected[0]:.1f}")

# Perform test
chi2_stat, p_value = chisquare(observed, expected)

print(f"\nChi-square statistic: {chi2_stat:.3f}")
print(f"P-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nDecision: Reject H₀ (p={p_value:.4f} < {alpha})")
    print("Conclusion: Traffic is not evenly distributed")
    print("Weekend (Sat-Sun) appears to have higher traffic")
else:
    print(f"\nDecision: Fail to reject H₀ (p={p_value:.4f} ≥ {alpha})")
    print("Conclusion: Traffic is relatively evenly distributed")

🧪 A/B Testing Framework

A/B testing is hypothesis testing applied to product decisions. It's how tech companies make data-driven improvements by showing different versions to different users and measuring which performs better.

Complete A/B Test Workflow

import numpy as np
from scipy import stats
import pandas as pd

class ABTest:
    """Complete A/B testing framework"""
    
    def __init__(self, name, metric='conversion'):
        self.name = name
        self.metric = metric
        self.data = {'control': [], 'treatment': []}
        
    def add_observation(self, variant, value):
        """Add a data point (0 or 1 for conversion, numeric for continuous)"""
        if variant in ['control', 'treatment']:
            self.data[variant].append(value)
    
    def calculate_sample_size(self, baseline_rate, min_detectable_effect, 
                             alpha=0.05, power=0.80):
        """Calculate required sample size per variant"""
        
        # For proportion test (conversion rate)
        p1 = baseline_rate
        p2 = baseline_rate * (1 + min_detectable_effect)
        
        z_alpha = stats.norm.ppf(1 - alpha/2)
        z_beta = stats.norm.ppf(power)
        
        p_pooled = (p1 + p2) / 2
        
        n = ((z_alpha * np.sqrt(2 * p_pooled * (1-p_pooled)) + 
              z_beta * np.sqrt(p1*(1-p1) + p2*(1-p2))) / (p2 - p1)) ** 2
        
        return int(np.ceil(n))
    
    def analyze_conversion(self, alpha=0.05):
        """Analyze A/B test for conversion rate (binary metric)"""
        
        control = np.array(self.data['control'])
        treatment = np.array(self.data['treatment'])
        
        n_control = len(control)
        n_treatment = len(treatment)
        conversions_control = control.sum()
        conversions_treatment = treatment.sum()
        
        rate_control = conversions_control / n_control
        rate_treatment = conversions_treatment / n_treatment
        
        # Pooled proportion
        p_pooled = (conversions_control + conversions_treatment) / (n_control + n_treatment)
        
        # Standard error
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_control + 1/n_treatment))
        
        # Z-test
        z_stat = (rate_treatment - rate_control) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))  # Two-tailed
        
        # Confidence interval for difference
        se_diff = np.sqrt(rate_control*(1-rate_control)/n_control + 
                         rate_treatment*(1-rate_treatment)/n_treatment)
        ci_diff = stats.norm.interval(0.95, loc=rate_treatment - rate_control, 
                                      scale=se_diff)
        
        # Relative improvement
        relative_improvement = (rate_treatment - rate_control) / rate_control * 100
        
        results = {
            'control_n': n_control,
            'treatment_n': n_treatment,
            'control_rate': rate_control,
            'treatment_rate': rate_treatment,
            'absolute_diff': rate_treatment - rate_control,
            'relative_improvement': relative_improvement,
            'z_statistic': z_stat,
            'p_value': p_value,
            'ci_lower': ci_diff[0],
            'ci_upper': ci_diff[1],
            'significant': p_value < alpha
        }
        
        return results
    
    def print_results(self, results):
        """Pretty print A/B test results"""
        
        print(f"\n{'='*70}")
        print(f"A/B Test Results: {self.name}")
        print(f"{'='*70}")
        
        print(f"\nSample Sizes:")
        print(f"  Control:   {results['control_n']:,}")
        print(f"  Treatment: {results['treatment_n']:,}")
        
        print(f"\nConversion Rates:")
        print(f"  Control:   {results['control_rate']:.4f} ({results['control_rate']*100:.2f}%)")
        print(f"  Treatment: {results['treatment_rate']:.4f} ({results['treatment_rate']*100:.2f}%)")
        
        print(f"\nDifference:")
        print(f"  Absolute:  {results['absolute_diff']:+.4f} ({results['absolute_diff']*100:+.2f}pp)")
        print(f"  Relative:  {results['relative_improvement']:+.2f}%")
        
        print(f"\nStatistical Test:")
        print(f"  Z-statistic: {results['z_statistic']:.3f}")
        print(f"  P-value:     {results['p_value']:.4f}")
        print(f"  95% CI:      [{results['ci_lower']:.4f}, {results['ci_upper']:.4f}]")
        
        print(f"\nDecision:")
        if results['significant']:
            print(f"  ✅ SIGNIFICANT (p={results['p_value']:.4f} < 0.05)")
            if results['relative_improvement'] > 0:
                print(f"  Treatment is better! Launch it.")
            else:
                print(f"  Control is better! Don't launch treatment.")
        else:
            print(f"  ❌ NOT SIGNIFICANT (p={results['p_value']:.4f} ≥ 0.05)")
            print(f"  No clear winner. Consider:")
            print(f"    - Running test longer for more data")
            print(f"    - Test may be underpowered for this effect size")

# Example: Email campaign A/B test
np.random.seed(42)

test = ABTest("Email Subject Line Test")

# Simulate test data
# Control: 10% conversion rate
# Treatment: 12% conversion rate (20% relative improvement)

n_per_variant = 1000

control_conversions = np.random.binomial(1, 0.10, n_per_variant)
treatment_conversions = np.random.binomial(1, 0.12, n_per_variant)

for conv in control_conversions:
    test.add_observation('control', conv)
    
for conv in treatment_conversions:
    test.add_observation('treatment', conv)

# Required sample size calculation
print("Sample Size Planning:")
required_n = test.calculate_sample_size(
    baseline_rate=0.10,
    min_detectable_effect=0.20,  # 20% relative improvement
    alpha=0.05,
    power=0.80
)
print(f"Required sample size per variant: {required_n:,}")
print(f"Total samples needed: {2*required_n:,}")
print(f"Actual samples collected: {2*n_per_variant:,}")

# Analyze results
results = test.analyze_conversion()
test.print_results(results)

Common A/B Testing Mistakes

⚠️ Avoid These Pitfalls

1. Peeking: Checking results before planned sample size. Inflates Type I error rate.
2. Multiple comparisons: Testing 20 variations without correction means you should expect roughly one false positive at α = 0.05! (A correction is sketched after this list.)
3. Ignoring practical significance: p < 0.05 but 0.01% improvement → not worth launching.
4. Running too short: Need full week to account for day-of-week effects.
5. Sample ratio mismatch: If you planned a 50/50 split, check that you actually got close to 50/50 before trusting the results.
6. Skipping a staged rollout: Launch to a small percentage of users (e.g., 10%) first to catch any issues before a full rollout.
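Pitfall 2 is easy to demonstrate with a simulation (all numbers below are illustrative). Twenty A/A-style comparisons are run where no true difference exists, so any "significant" result is a false positive; uncorrected testing at α = 0.05 yields about one on average, while a simple Bonferroni correction (test each comparison at α divided by the number of tests) keeps the family-wise false-positive rate near 5%:

import numpy as np
from scipy import stats

np.random.seed(42)

n_tests = 20          # e.g., 20 variations each compared against control
alpha = 0.05
n_per_group = 1000
true_rate = 0.10      # identical conversion rate everywhere -> any "win" is a false positive

p_values = []
for _ in range(n_tests):
    control = np.random.binomial(1, true_rate, n_per_group)
    variant = np.random.binomial(1, true_rate, n_per_group)
    # Two-proportion z-test with pooled standard error
    p_pool = (control.sum() + variant.sum()) / (2 * n_per_group)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n_per_group))
    z = (variant.mean() - control.mean()) / se
    p_values.append(2 * (1 - stats.norm.cdf(abs(z))))

p_values = np.array(p_values)
print(f"'Significant' results without correction: {(p_values < alpha).sum()} / {n_tests}")
print(f"'Significant' results with Bonferroni (alpha/{n_tests}): {(p_values < alpha / n_tests).sum()} / {n_tests}")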

💻 Practice Exercises

Exercise 1: One-Sample T-Test

Scenario: Historical average response time is 2.5 seconds. After optimization, you measure 50 requests: mean=2.3s, std=0.6s.

  1. Set up null and alternative hypotheses
  2. Calculate t-statistic and p-value
  3. At α=0.05, did optimization significantly improve response time?

Exercise 2: Independent T-Test

Scenario: Testing new algorithm vs old:

  • Old: n=40, mean=85, std=12
  • New: n=40, mean=91, std=13

  1. Is the new algorithm significantly better?
  2. Calculate effect size (Cohen's d)
  3. What would happen with larger sample sizes?
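Hint for Exercises 1 and 2: both give only summary statistics, not raw data. A one-sample t-statistic can be computed directly from the mean, standard deviation, and n, and scipy.stats provides ttest_ind_from_stats for the two-sample case. The sketch below plugs in the exercise numbers (one-tailed for Exercise 1, since the claim is that response time decreased); interpreting the results is left to you:

import numpy as np
from scipy import stats

# Exercise 1: one-sample t-test from summary statistics
# H0: mu = 2.5 s, H1: mu < 2.5 s (optimization reduced response time)
n, mean, std, mu0 = 50, 2.3, 0.6, 2.5
t_stat = (mean - mu0) / (std / np.sqrt(n))
p_one_tailed = stats.t.cdf(t_stat, df=n - 1)   # lower tail: testing for a decrease
print(f"Exercise 1: t = {t_stat:.3f}, one-tailed p = {p_one_tailed:.4f}")

# Exercise 2: independent two-sample t-test from summary statistics (two-sided)
t_stat2, p_two_tailed = stats.ttest_ind_from_stats(
    mean1=91, std1=13, nobs1=40,   # new algorithm
    mean2=85, std2=12, nobs2=40,   # old algorithm
    equal_var=True
)
pooled_std = np.sqrt((13**2 + 12**2) / 2)
cohens_d = (91 - 85) / pooled_std
print(f"Exercise 2: t = {t_stat2:.3f}, two-sided p = {p_two_tailed:.4f}, d = {cohens_d:.3f}")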

Exercise 3: Chi-Square Test

Scenario: Testing if ad click-through rate differs by age group:

  • 18-25: 45 clicks / 500 views
  • 26-40: 62 clicks / 600 views
  • 41+: 38 clicks / 400 views

  1. Create contingency table
  2. Perform chi-square test
  3. Are click rates significantly different across age groups?

Exercise 4: A/B Test Design

Scenario: Current checkout flow has 15% completion rate. You want to detect 10% relative improvement (15% → 16.5%) with 80% power at α=0.05.

  1. Calculate required sample size per variant
  2. If 500 users/day, how long will test take?
  3. What if you want to detect 5% improvement instead?

Exercise 5: Power Analysis

Challenge: You ran an A/B test with n=200 per variant. Control: 12% conversion, Treatment: 14% conversion. p-value = 0.15 (not significant). Was the test underpowered? Calculate what sample size you'd need to detect this 2pp difference with 80% power.

📝 Summary

You've mastered hypothesis testing and A/B testing—the foundation of data-driven decision making:

🔬 Hypothesis Testing

Formal framework for testing claims. Null vs alternative, p-values, significance levels. Never "accept" null, only reject or fail to reject.

📊 P-values

Probability of data if H₀ true. Small p-value = evidence against null. Not probability H₀ is true! Report exact value and effect size.

⚠️ Error Types

Type I (α): False positive. Type II (β): False negative. Power = 1-β. Trade-off requires careful consideration of consequences.

📈 T-Tests

Compare means. One-sample, independent, paired. Use when n is small or σ unknown. Report confidence intervals and effect sizes.

🎲 Chi-Square

For categorical data. Test independence or goodness-of-fit. Essential for conversion rate analysis and categorical relationships.

🧪 A/B Testing

Rigorous experimentation framework. Calculate sample size, avoid peeking, test practical significance. How billion-dollar decisions are made.

✅ Key Takeaway

Hypothesis testing transforms intuition into evidence-based decisions. Every A/B test at Google, Facebook, Netflix uses these exact methods. Master this, and you can: design rigorous experiments, avoid costly mistakes, prove your ML improvements are real, and make data-driven decisions with confidence. Statistical significance ≠ practical importance—always report effect sizes and confidence intervals!

🎯 Test Your Knowledge

Question 1: A p-value of 0.03 means:

a) 3% chance the null hypothesis is true
b) 97% chance the alternative is true
c) If H₀ is true, we'd see results this extreme 3% of the time
d) The effect size is 3%

Question 2: Type I error is:

a) Rejecting H₀ when it's actually true (false positive)
b) Failing to reject H₀ when it's false (false negative)
c) Accepting H₀ when it's true
d) Using the wrong statistical test

Question 3: Statistical power is:

a) The probability of Type I error
b) The probability of correctly rejecting a false H₀
c) Always equal to 1 - α
d) The effect size of the test

Question 4: Which test should you use to compare means of two independent groups?

a) One-sample t-test
b) Paired t-test
c) Independent two-sample t-test
d) Chi-square test

Question 5: In A/B testing, "peeking" refers to:

a) Looking at competitor tests
b) Checking user feedback
c) Reviewing the experimental design
d) Checking results before reaching planned sample size

Question 6: If you increase sample size, what happens to statistical power?

a) Decreases
b) Increases
c) Stays the same
d) Depends on the effect size

Question 7: Chi-square test is appropriate for:

a) Comparing means of continuous variables
b) Testing correlation between continuous variables
c) Testing independence of categorical variables
d) Calculating confidence intervals

Question 8: When conducting hypothesis tests, we:

a) Either reject H₀ or fail to reject H₀
b) Either reject H₀ or accept H₀
c) Either accept H₁ or reject H₁
d) Prove that H₀ is true