Did you know? 80% of data science work involves understanding and cleaning data before any model building begins. Descriptive statistics is your first and most powerful tool for exploring datasets.
Statistics is the language of data science and AI. Before we can build intelligent models, we need to understand our data - and that starts with descriptive statistics. In this tutorial, you'll learn how to summarize datasets using measures of central tendency, understand data spread with measures of variability, and create visualizations that reveal insights hidden in numbers.
Before training any machine learning model, you must understand your data. Descriptive statistics reveal data quality issues, help identify features for modeling, and guide preprocessing decisions. Companies like Netflix use descriptive statistics to understand user behavior, while healthcare AI systems use it to detect anomalies in patient data.
What is Descriptive Statistics?
Descriptive statistics are numerical and graphical methods for summarizing and describing data. They help us answer critical questions:
- What's a typical value in my dataset?
- How spread out are the values?
- Are there any outliers or unusual patterns?
- What does the distribution of data look like?
Think of descriptive statistics as the "executive summary" of your data - they distill thousands or millions of data points into a few key numbers that tell the story.
Measures of Central Tendency
Central tendency describes the "center" or "typical" value of a dataset. The three main measures are mean, median, and mode.
1. Mean (Average)
The mean is the sum of all values divided by the number of values. It's the most commonly used measure of center.
# Calculate mean
import numpy as np
data = [23, 25, 27, 29, 31, 33, 35, 37]
mean = np.mean(data)
print(f"Mean: {mean}") # Output: 30.0
# Manual calculation
manual_mean = sum(data) / len(data)
print(f"Manual Mean: {manual_mean}") # Output: 30.0
When to use: Ideal for symmetric distributions without outliers (normal distributions).
Limitation: Sensitive to extreme values (outliers). A single very large or small value can dramatically change the mean.
2. Median (Middle Value)
The median is the middle value when data is sorted. If there's an even number of values, it's the average of the two middle values.
# Calculate median - more robust to outliers
data = [23, 25, 27, 29, 31, 33, 35, 37, 100] # Added outlier
median = np.median(data)
mean = np.mean(data)
print(f"Median: {median}") # Output: 31.0 (not affected by outlier)
print(f"Mean: {mean:.2f}") # Output: 37.78 (affected by outlier)
# The median stays stable while mean gets pulled by the outlier
When to use: With skewed distributions or data containing outliers, since the median is robust to extreme values.
Real-world example: House prices - where a few mansions shouldn't inflate the "typical" price. The median home price gives a better sense of the market.
3. Mode (Most Frequent)
The mode is the value that appears most frequently in the dataset.
from scipy import stats
data = [1, 2, 2, 3, 4, 4, 4, 5, 6]
mode_result = stats.mode(data, keepdims=True)
print(f"Mode: {mode_result.mode[0]}") # Output: 4
print(f"Count: {mode_result.count[0]}") # Output: 3 (appears 3 times)
# For categorical data
colors = ['red', 'blue', 'red', 'green', 'red', 'blue']
mode_color = max(set(colors), key=colors.count)
print(f"Most common color: {mode_color}") # Output: red
When to use: Best for categorical data or discrete values (e.g., most common product purchased, most frequent customer segment, shoe size).
You're analyzing salaries at a tech company. A few executives earn $500K+ while most employees earn $50K-$80K. Which measure tells the real story?
- Mean: $120K (inflated by executives)
- Median: $65K (typical employee salary)
- Mode: $60K (most common salary)
The median best represents the "typical" employee experience here!
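Here's a minimal sketch of this scenario with made-up salary figures (the numbers are illustrative, so the results differ slightly from the rounded values above):
import numpy as np
from scipy import stats
# Hypothetical salary data: mostly $50K-$80K, plus two executives
salaries = [50_000]*3 + [60_000]*5 + [70_000]*4 + [80_000]*3 + [500_000, 700_000]
print(f"Mean: ${np.mean(salaries):,.0f}") # ~$128K - pulled up by the executives
print(f"Median: ${np.median(salaries):,.0f}") # $70,000 - the typical employee
print(f"Mode: ${stats.mode(salaries, keepdims=True).mode[0]:,}") # $60,000 - most common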
Measures of Variability (Spread)
Variability measures tell us how spread out the data is from the center. Two datasets can have the same mean but vastly different spreads!
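To make that concrete, here's a tiny example (values chosen purely for illustration) of two datasets that share a mean of 50 but have very different spreads:
import numpy as np
a = [48, 49, 50, 51, 52] # tightly clustered
b = [10, 30, 50, 70, 90] # widely spread
print(np.mean(a), np.mean(b)) # 50.0 50.0 - identical means
print(np.std(a, ddof=1), np.std(b, ddof=1)) # ~1.58 vs ~31.62 - very different spreads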
1. Range
The range is the difference between maximum and minimum values - the simplest measure of spread.
data = [23, 25, 27, 29, 31, 33, 35, 37]
data_range = np.max(data) - np.min(data)
print(f"Range: {data_range}") # Output: 14
# Also using built-in functions
print(f"Min: {np.min(data)}, Max: {np.max(data)}") # Output: Min: 23, Max: 37
Limitation: Only uses two values, ignores everything in between, and is very sensitive to outliers.
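To illustrate (reusing the earlier data with a 100 added), a single outlier can blow up the range:
data_out = [23, 25, 27, 29, 31, 33, 35, 37, 100] # one outlier added
print(f"Range: {np.max(data_out) - np.min(data_out)}") # Output: 77 (versus 14 without the outlier)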
2. Variance
The variance measures the average squared deviation from the mean. It tells us how far data points spread from the average.
data = [23, 25, 27, 29, 31, 33, 35, 37]
# Variance - uses ddof=1 for sample variance (Bessel's correction)
variance = np.var(data, ddof=1)
print(f"Variance: {variance:.2f}") # Output: 26.57
# Manual calculation to understand the concept
mean = np.mean(data)
squared_diffs = [(x - mean)**2 for x in data]
manual_variance = sum(squared_diffs) / (len(data) - 1)
print(f"Manual Variance: {manual_variance:.2f}")
# Show the squared differences
print(f"Mean: {mean}")
for x in data:
    print(f"{x}: deviation = {x - mean:.2f}, squared = {(x - mean)**2:.2f}")
Formula: s² = Σ(xᵢ - x̄)² / (n - 1), where x̄ is the sample mean
Note: We use (n-1) for sample variance (Bessel's correction) to get an unbiased estimate of the population variance.
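To see Bessel's correction in action, this small comparison (using the same data as above) shows how the ddof argument controls the denominator in NumPy:
# ddof=0 divides by n (population formula); ddof=1 divides by n-1 (sample formula)
print(np.var(data, ddof=0)) # 21.0 - population variance
print(np.var(data, ddof=1)) # 24.0 - sample variance (Bessel's correction)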
3. Standard Deviation
The standard deviation is the square root of variance, expressed in the same units as the original data - making it much more interpretable!
std_dev = np.std(data, ddof=1)
print(f"Standard Deviation: {std_dev:.2f}") # Output: 5.15
# Relationship with variance
print(f"Square root of variance: {np.sqrt(variance):.2f}") # Same as std_dev
# Interpretation
print(f"Mean: {mean:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"Range: [{mean - std_dev:.2f}, {mean + std_dev:.2f}]")
# Output: On average, data points deviate from the mean by about 5.15 units
For normally distributed data:
- ~68% of data falls within 1 standard deviation of the mean
- ~95% falls within 2 standard deviations
- ~99.7% falls within 3 standard deviations
This rule helps identify outliers: values beyond 3 standard deviations are extremely rare!
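You can verify the 68-95-99.7 rule empirically with simulated data - a minimal sketch (the seed and sample size here are arbitrary choices):
import numpy as np
rng = np.random.default_rng(42) # arbitrary seed for reproducibility
samples = rng.normal(loc=100, scale=15, size=100_000)
mean, std = samples.mean(), samples.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(samples - mean) <= k * std)
    print(f"Within {k} std dev: {within:.1%}") # roughly 68.3%, 95.4%, 99.7%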
4. Interquartile Range (IQR)
The IQR is the range of the middle 50% of the data - the most robust measure of spread.
data = [23, 25, 27, 29, 31, 33, 35, 37, 100] # With outlier
q1 = np.percentile(data, 25) # First quartile (25th percentile)
q3 = np.percentile(data, 75) # Third quartile (75th percentile)
iqr = q3 - q1
print(f"Q1 (25th percentile): {q1}") # Output: 27.0
print(f"Q3 (75th percentile): {q3}") # Output: 36.0
print(f"IQR: {iqr}") # Output: 9.0
# IQR is not affected by the outlier (100)
# Standard deviation would be much higher due to the outlier
# Outlier detection using IQR method
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(f"Outliers: {outliers}") # Output: [100]
Advantage: Robust to outliers - the extreme values don't affect the IQR much.
Use Case: Outlier detection - values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR are potential outliers (used in box plots).
Data Visualization
Visualizations reveal patterns that numbers alone cannot show. Always visualize your data!
1. Histograms
Histograms show the distribution of data by dividing it into bins and counting frequencies.
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data (normally distributed)
data = np.random.normal(loc=100, scale=15, size=1000)
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black', alpha=0.7, color='#3b82f6')
# Add mean and median lines
plt.axvline(np.mean(data), color='red', linestyle='--',
            linewidth=2, label=f'Mean: {np.mean(data):.1f}')
plt.axvline(np.median(data), color='green', linestyle='--',
            linewidth=2, label=f'Median: {np.median(data):.1f}')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
# For normal distributions, mean ≈ median
2. Box Plots
Box plots visualize quartiles, median, and outliers in a compact format.
# Box plot showing quartiles and outliers
data_with_outliers = np.concatenate([
    np.random.normal(100, 15, 100), # Normal data
    [150, 160, 45, 35] # Outliers
])
plt.figure(figsize=(10, 6))
plt.boxplot(data_with_outliers, vert=False)
plt.xlabel('Value', fontsize=12)
plt.title('Box Plot: Quartiles and Outliers', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.show()
# What a box plot shows:
# • Box edges: Q1 (25th) and Q3 (75th percentiles) - this is the IQR
# • Line in box: Median (Q2, 50th percentile)
# • Whiskers: Extend to min/max within 1.5×IQR from quartiles
# • Dots/circles: Outliers beyond whiskers
3. Summary Statistics with Pandas
Pandas provides a comprehensive summary of all statistics at once.
import pandas as pd
# Create DataFrame with multiple features
df = pd.DataFrame({
    'feature_a': np.random.normal(100, 15, 1000), # Normal distribution
    'feature_b': np.random.exponential(50, 1000), # Exponential (right-skewed)
    'feature_c': np.random.uniform(0, 100, 1000) # Uniform distribution
})
# Get comprehensive summary - ONE LINE!
summary = df.describe()
print(summary)
# Output shows:
# • count: Number of non-null values
# • mean: Average
# • std: Standard deviation
# • min: Minimum value
# • 25%: First quartile (Q1)
# • 50%: Median (Q2)
# • 75%: Third quartile (Q3)
# • max: Maximum value
# Additional statistics
print(f"\nVariance:\n{df.var()}")
print(f"\nSkewness (asymmetry):\n{df.skew()}")
print(f"\nKurtosis (tail heaviness):\n{df.kurt()}")
Real-World Application: Analyzing ML Features
Let's apply descriptive statistics to the famous Iris dataset used in machine learning.
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import skew
# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]
# 1. Summary statistics
print("=" * 50)
print("SUMMARY STATISTICS")
print("=" * 50)
print(df.describe())
# 2. Check for skewness (asymmetry in distribution)
print("\n" + "=" * 50)
print("SKEWNESS ANALYSIS")
print("=" * 50)
for col in iris.feature_names:
    skewness = skew(df[col])
    print(f"{col}: skewness = {skewness:.3f}")
    if abs(skewness) < 0.5:
        print("  → Nearly symmetric (normal)")
    elif skewness > 0:
        print("  → Right-skewed (tail extends right)")
    else:
        print("  → Left-skewed (tail extends left)")
# 3. Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, col in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
    # Histogram with mean and median
    ax.hist(df[col], bins=20, edgecolor='black', alpha=0.7, color='#3b82f6')
    ax.axvline(df[col].mean(), color='red', linestyle='--',
               linewidth=2, label='Mean')
    ax.axvline(df[col].median(), color='green', linestyle='--',
               linewidth=2, label='Median')
    ax.set_xlabel(col, fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# 4. Check for outliers using IQR method
def find_outliers(data):
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    return outliers, lower_bound, upper_bound
print("\n" + "=" * 50)
print("OUTLIER DETECTION")
print("=" * 50)
for col in iris.feature_names:
    outliers, lower, upper = find_outliers(df[col])
    print(f"\n{col}:")
    print(f"  Bounds: [{lower:.2f}, {upper:.2f}]")
    if len(outliers) > 0:
        print(f"  Found {len(outliers)} outliers: {outliers.values}")
    else:
        print("  No outliers detected")
Key insights from this analysis:
- Sepal width has a nearly symmetric distribution (mean ≈ median)
- Petal length shows a bimodal distribution (two peaks) - indicating two distinct groups
- Some features have outliers - important to handle before ML modeling
- Different features have different scales - normalization may be needed (see the sketch below)
Next Steps: These insights guide feature engineering and preprocessing decisions!
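As one possible preprocessing step, here's a brief standardization sketch using scikit-learn's StandardScaler (one common choice; this tutorial doesn't prescribe a specific scaler):
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Rescale each feature to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df.describe().loc[['mean', 'std']])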
Practice Exercises
Exercise 1: Load any dataset of your choice (try the Boston Housing dataset or Titanic dataset from Kaggle). Calculate mean, median, mode, variance, and standard deviation for all numerical features. Which features have the highest variability? Why might this matter for ML?
Exercise 2: Create a Python function analyze_data(df) that takes a DataFrame and returns a dictionary with: mean, median, std, min, max, Q1, Q3, IQR, and a list of outliers for each numerical column. Test it on the Iris dataset.
Exercise 3: Generate two datasets: one normally distributed (np.random.normal) and one exponentially distributed (np.random.exponential). Plot histograms for both and calculate all measures of central tendency. Explain why the mean and median differ more in one than the other.
Exercise 4: You're analyzing customer purchase amounts at an e-commerce site. Most customers spend $20-$50, but a few spend $500+. Should you use the mean or median to report the "average" purchase? Create synthetic data and demonstrate your answer with visualizations.
Key Takeaways
- Mean is sensitive to outliers; use the median for skewed data. Mode is best for categorical data.
- Standard deviation is more interpretable than variance (same units as the data). IQR is robust to outliers.
- Always visualize your data! Numbers miss important patterns that graphs reveal instantly.
- Use the IQR method or the 3-sigma rule to detect outliers. Decide whether to remove, transform, or keep them.
- Descriptive statistics guide feature engineering and preprocessing, and help detect data quality issues early.
In Tutorial 2, we'll dive into Probability Foundations - learning how to quantify uncertainty and make predictions with probability rules, conditional probability, and Bayes' theorem. You'll discover how probability powers every ML algorithm!
Knowledge Check
Test your understanding of descriptive statistics concepts. Select the best answer for each question.
Question 1: Central Tendency with Outliers
You're analyzing salaries at a company. A few executives earn $500K+ while most employees earn $50K-$80K. Which measure of central tendency best represents the "typical" salary?
Question 2: Standard Deviation Interpretation
A dataset has mean = 100 and standard deviation = 15. Assuming normal distribution, approximately what percentage of data falls between 85 and 115?
Question 3: Variance vs Standard Deviation
Why is standard deviation more commonly reported than variance?
Question 4: IQR and Outliers
For a dataset with Q1 = 25 and Q3 = 75, which value would be considered an outlier using the 1.5×IQR rule?
Question 5: Mean vs Median Comparison
If mean > median significantly, what does this tell you about the data distribution?
Question 6: Choosing the Right Measure
Which measure of central tendency is most appropriate for categorical data like "favorite color"?