Did you know? 80% of data science work involves understanding and cleaning data before any model building begins. Descriptive statistics is your first and most powerful tool for exploring datasets.
Statistics is the language of data science and AI. Before we can build intelligent models, we need to understand our data - and that starts with descriptive statistics. In this tutorial, you'll learn how to summarize datasets using measures of central tendency, understand data spread with measures of variability, and create visualizations that reveal insights hidden in numbers.
Before training any machine learning model, you must understand your data. Descriptive statistics reveal data quality issues, help identify features for modeling, and guide preprocessing decisions. Companies like Netflix use descriptive statistics to understand user behavior, while healthcare AI systems use it to detect anomalies in patient data.
What is Descriptive Statistics?
Descriptive statistics are numerical and graphical methods for summarizing and describing data. They help us answer critical questions:
- What's a typical value in my dataset?
- How spread out are the values?
- Are there any outliers or unusual patterns?
- What does the distribution of data look like?
Think of descriptive statistics as the "executive summary" of your data - they distill thousands or millions of data points into a few key numbers that tell the story.
Measures of Central Tendency
Central tendency describes the "center" or "typical" value of a dataset. The three main measures are mean, median, and mode.
1. Mean (Average)
The mean is the sum of all values divided by the number of values. It's the most commonly used measure of center.
# Calculate mean
import numpy as np
data = [23, 25, 27, 29, 31, 33, 35, 37]
mean = np.mean(data)
print(f"Mean: {mean}") # Output: 30.0
# Manual calculation
manual_mean = sum(data) / len(data)
print(f"Manual Mean: {manual_mean}") # Output: 30.0
When to use: Ideal for symmetric distributions without outliers (normal distributions).
Limitation: Sensitive to extreme values (outliers). A single very large or small value can dramatically change the mean.
2. Median (Middle Value)
The median is the middle value when data is sorted. If there's an even number of values, it's the average of the two middle values.
# Calculate median - more robust to outliers
data = [23, 25, 27, 29, 31, 33, 35, 37, 100] # Added outlier
median = np.median(data)
mean = np.mean(data)
print(f"Median: {median}") # Output: 31.0 (not affected by outlier)
print(f"Mean: {mean:.2f}") # Output: 37.78 (affected by outlier)
# The median stays stable while mean gets pulled by the outlier
When to use: With skewed distributions or data containing outliers, since the median is robust to extreme values.
Real-world example: House prices - where a few mansions shouldn't inflate the "typical" price. The median home price gives a better sense of the market.
3. Mode (Most Frequent)
The mode is the value that appears most frequently in the dataset.
from scipy import stats
data = [1, 2, 2, 3, 4, 4, 4, 5, 6]
mode_result = stats.mode(data, keepdims=True)
print(f"Mode: {mode_result.mode[0]}") # Output: 4
print(f"Count: {mode_result.count[0]}") # Output: 3 (appears 3 times)
# For categorical data
colors = ['red', 'blue', 'red', 'green', 'red', 'blue']
mode_color = max(set(colors), key=colors.count)
print(f"Most common color: {mode_color}") # Output: red
When to use: Best for categorical data or discrete values (e.g., most common product purchased, most frequent customer segment, shoe size).
You're analyzing salaries at a tech company. A few executives earn $500K+ while most employees earn $50K-$80K. Which measure tells the real story?
- Mean: $120K (inflated by executives)
- Median: $65K (typical employee salary)
- Mode: $60K (most common salary)
The median best represents the "typical" employee experience here!
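Here's a minimal sketch of this scenario with made-up salary figures (the numbers are illustrative, so the results differ slightly from the rounded values above):
import numpy as np
from scipy import stats
# Hypothetical salary data: mostly $50K-$80K, plus two executives
salaries = [50_000]*3 + [60_000]*5 + [70_000]*4 + [80_000]*3 + [500_000, 700_000]
print(f"Mean: ${np.mean(salaries):,.0f}") # ~$128K - pulled up by the executives
print(f"Median: ${np.median(salaries):,.0f}") # $70,000 - the typical employee
print(f"Mode: ${stats.mode(salaries, keepdims=True).mode[0]:,}") # $60,000 - most common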
Measures of Variability (Spread)
Variability measures tell us how spread out the data is from the center. Two datasets can have the same mean but vastly different spreads!
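To make that concrete, here's a tiny example (values chosen purely for illustration) of two datasets that share a mean of 50 but have very different spreads:
import numpy as np
a = [48, 49, 50, 51, 52] # tightly clustered
b = [10, 30, 50, 70, 90] # widely spread
print(np.mean(a), np.mean(b)) # 50.0 50.0 - identical means
print(np.std(a, ddof=1), np.std(b, ddof=1)) # ~1.58 vs ~31.62 - very different spreads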
1. Range
The range is the difference between maximum and minimum values - the simplest measure of spread.
data = [23, 25, 27, 29, 31, 33, 35, 37]
data_range = np.max(data) - np.min(data)
print(f"Range: {data_range}") # Output: 14
# Also using built-in functions
print(f"Min: {np.min(data)}, Max: {np.max(data)}") # Output: Min: 23, Max: 37
Limitation: Only uses two values, ignores everything in between, and is very sensitive to outliers.
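To illustrate (reusing the earlier data with a 100 added), a single outlier can blow up the range:
data_out = [23, 25, 27, 29, 31, 33, 35, 37, 100] # one outlier added
print(f"Range: {np.max(data_out) - np.min(data_out)}") # Output: 77 (versus 14 without the outlier)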
2. Variance
The variance measures the average squared deviation from the mean. It tells us how far data points spread from the average.
data = [23, 25, 27, 29, 31, 33, 35, 37]
# Variance - uses ddof=1 for sample variance (Bessel's correction)
variance = np.var(data, ddof=1)
print(f"Variance: {variance:.2f}") # Output: 26.57
# Manual calculation to understand the concept
mean = np.mean(data)
squared_diffs = [(x - mean)**2 for x in data]
manual_variance = sum(squared_diffs) / (len(data) - 1)
print(f"Manual Variance: {manual_variance:.2f}")
# Show the squared differences
print(f"Mean: {mean}")
for x in data:
    print(f"{x}: deviation = {x - mean:.2f}, squared = {(x - mean)**2:.2f}")
Formula: s² = Σ(xᵢ - x̄)² / (n - 1), where x̄ is the sample mean
Note: We use (n-1) for sample variance (Bessel's correction) to get an unbiased estimate of the population variance.
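To see Bessel's correction in action, this small comparison (using the same data as above) shows how the ddof argument controls the denominator in NumPy:
# ddof=0 divides by n (population formula); ddof=1 divides by n-1 (sample formula)
print(np.var(data, ddof=0)) # 21.0 - population variance
print(np.var(data, ddof=1)) # 24.0 - sample variance (Bessel's correction)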
3. Standard Deviation
The standard deviation is the square root of variance, expressed in the same units as the original data - making it much more interpretable!
std_dev = np.std(data, ddof=1)
print(f"Standard Deviation: {std_dev:.2f}") # Output: 5.15
# Relationship with variance
print(f"Square root of variance: {np.sqrt(variance):.2f}") # Same as std_dev
# Interpretation
print(f"Mean: {mean:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"Range: [{mean - std_dev:.2f}, {mean + std_dev:.2f}]")
# Output: On average, data points deviate from the mean by about 5.15 units
For normally distributed data:
- ~68% of data falls within 1 standard deviation of the mean
- ~95% falls within 2 standard deviations
- ~99.7% falls within 3 standard deviations
This rule helps identify outliers: values beyond 3 standard deviations are extremely rare!
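You can verify the 68-95-99.7 rule empirically with simulated data - a minimal sketch (the seed and sample size here are arbitrary choices):
import numpy as np
rng = np.random.default_rng(42) # arbitrary seed for reproducibility
samples = rng.normal(loc=100, scale=15, size=100_000)
mean, std = samples.mean(), samples.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(samples - mean) <= k * std)
    print(f"Within {k} std dev: {within:.1%}") # roughly 68.3%, 95.4%, 99.7%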
4. Interquartile Range (IQR)
The IQR is the range of the middle 50% of the data - the most robust measure of spread.
data = [23, 25, 27, 29, 31, 33, 35, 37, 100] # With outlier
q1 = np.percentile(data, 25) # First quartile (25th percentile)
q3 = np.percentile(data, 75) # Third quartile (75th percentile)
iqr = q3 - q1
print(f"Q1 (25th percentile): {q1}") # Output: 27.0
print(f"Q3 (75th percentile): {q3}") # Output: 36.0
print(f"IQR: {iqr}") # Output: 9.0
# IQR is not affected by the outlier (100)
# Standard deviation would be much higher due to the outlier
# Outlier detection using IQR method
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(f"Outliers: {outliers}") # Output: [100]
Advantage: Robust to outliers - the extreme values don't affect the IQR much.
Use Case: Outlier detection - values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR are potential outliers (used in box plots).
Data Visualization
Visualizations reveal patterns that numbers alone cannot show. Always visualize your data!
1. Histograms
Histograms show the distribution of data by dividing it into bins and counting frequencies.
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data (normally distributed)
data = np.random.normal(loc=100, scale=15, size=1000)
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black', alpha=0.7, color='#3b82f6')
# Add mean and median lines
plt.axvline(np.mean(data), color='red', linestyle='--',
            linewidth=2, label=f'Mean: {np.mean(data):.1f}')
plt.axvline(np.median(data), color='green', linestyle='--',
            linewidth=2, label=f'Median: {np.median(data):.1f}')
plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Data', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
# For normal distributions, mean ≈ median
2. Box Plots
Box plots visualize quartiles, median, and outliers in a compact format.
# Box plot showing quartiles and outliers
data_with_outliers = np.concatenate([
    np.random.normal(100, 15, 100), # Normal data
    [150, 160, 45, 35] # Outliers
])
plt.figure(figsize=(10, 6))
plt.boxplot(data_with_outliers, vert=False)
plt.xlabel('Value', fontsize=12)
plt.title('Box Plot: Quartiles and Outliers', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.show()
# What a box plot shows:
# • Box edges: Q1 (25th) and Q3 (75th percentiles) - this is the IQR
# • Line in box: Median (Q2, 50th percentile)
# • Whiskers: Extend to min/max within 1.5×IQR from quartiles
# • Dots/circles: Outliers beyond whiskers
3. Summary Statistics with Pandas
Pandas provides a comprehensive summary of all statistics at once.
import pandas as pd
# Create DataFrame with multiple features
df = pd.DataFrame({
    'feature_a': np.random.normal(100, 15, 1000), # Normal distribution
    'feature_b': np.random.exponential(50, 1000), # Exponential (right-skewed)
    'feature_c': np.random.uniform(0, 100, 1000) # Uniform distribution
})
# Get comprehensive summary - ONE LINE!
summary = df.describe()
print(summary)
# Output shows:
# • count: Number of non-null values
# • mean: Average
# • std: Standard deviation
# • min: Minimum value
# • 25%: First quartile (Q1)
# • 50%: Median (Q2)
# • 75%: Third quartile (Q3)
# • max: Maximum value
# Additional statistics
print(f"\nVariance:\n{df.var()}")
print(f"\nSkewness (asymmetry):\n{df.skew()}")
print(f"\nKurtosis (tail heaviness):\n{df.kurt()}")
Real-World Application: Analyzing ML Features
Let's apply descriptive statistics to the famous Iris dataset used in machine learning.
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import skew
# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]
# 1. Summary statistics
print("=" * 50)
print("SUMMARY STATISTICS")
print("=" * 50)
print(df.describe())
# 2. Check for skewness (asymmetry in distribution)
print("\n" + "=" * 50)
print("SKEWNESS ANALYSIS")
print("=" * 50)
for col in iris.feature_names:
    skewness = skew(df[col])
    print(f"{col}: skewness = {skewness:.3f}")
    if abs(skewness) < 0.5:
        print("  → Nearly symmetric (normal)")
    elif skewness > 0:
        print("  → Right-skewed (tail extends right)")
    else:
        print("  → Left-skewed (tail extends left)")
# 3. Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, col in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
    # Histogram with mean and median
    ax.hist(df[col], bins=20, edgecolor='black', alpha=0.7, color='#3b82f6')
    ax.axvline(df[col].mean(), color='red', linestyle='--',
               linewidth=2, label='Mean')
    ax.axvline(df[col].median(), color='green', linestyle='--',
               linewidth=2, label='Median')
    ax.set_xlabel(col, fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# 4. Check for outliers using IQR method
def find_outliers(data):
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    return outliers, lower_bound, upper_bound
print("\n" + "=" * 50)
print("OUTLIER DETECTION")
print("=" * 50)
for col in iris.feature_names:
    outliers, lower, upper = find_outliers(df[col])
    print(f"\n{col}:")
    print(f"  Bounds: [{lower:.2f}, {upper:.2f}]")
    if len(outliers) > 0:
        print(f"  Found {len(outliers)} outliers: {outliers.values}")
    else:
        print("  No outliers detected")
Key insights from this analysis:
- Sepal width has a nearly symmetric distribution (mean ≈ median)
- Petal length shows a bimodal distribution (two peaks) - indicating two distinct groups
- Some features have outliers - important to handle before ML modeling
- Different features have different scales - normalization may be needed (see the sketch below)
Next Steps: These insights guide feature engineering and preprocessing decisions!
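As one possible preprocessing step, here's a brief standardization sketch using scikit-learn's StandardScaler (one common choice; this tutorial doesn't prescribe a specific scaler):
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Rescale each feature to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df.describe().loc[['mean', 'std']])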
Practice Exercises
Exercise 1: Load any dataset of your choice (try the Boston Housing dataset or Titanic dataset from Kaggle). Calculate mean, median, mode, variance, and standard deviation for all numerical features. Which features have the highest variability? Why might this matter for ML?
Exercise 2: Create a Python function analyze_data(df) that takes a DataFrame and returns a dictionary with: mean, median, std, min, max, Q1, Q3, IQR, and a list of outliers for each numerical column. Test it on the Iris dataset.
Exercise 3: Generate two datasets: one normally distributed (np.random.normal) and one exponentially distributed (np.random.exponential). Plot histograms for both and calculate all measures of central tendency. Explain why the mean and median differ more in one than the other.
Exercise 4: You're analyzing customer purchase amounts at an e-commerce site. Most customers spend $20-$50, but a few spend $500+. Should you use the mean or median to report the "average" purchase? Create synthetic data and demonstrate your answer with visualizations.
Key Takeaways
- Mean is sensitive to outliers; use the median for skewed data. Mode is best for categorical data.
- Standard deviation is more interpretable than variance (same units as the data). IQR is robust to outliers.
- Always visualize your data! Numbers miss important patterns that graphs reveal instantly.
- Use the IQR method or the 3-sigma rule to detect outliers. Decide whether to remove, transform, or keep them.
- Descriptive statistics guide feature engineering and preprocessing, and help detect data quality issues early.
In Tutorial 2, we'll dive into Probability Foundations - learning how to quantify uncertainty and make predictions with probability rules, conditional probability, and Bayes' theorem. You'll discover how probability powers every ML algorithm!
Knowledge Check
Test your understanding of descriptive statistics concepts. Select the best answer for each question.
Question 1: Central Tendency with Outliers
You're analyzing salaries at a company. A few executives earn $500K+ while most employees earn $50K-$80K. Which measure of central tendency best represents the "typical" salary?
Question 2: Standard Deviation Interpretation
A dataset has mean = 100 and standard deviation = 15. Assuming normal distribution, approximately what percentage of data falls between 85 and 115?
Question 3: Variance vs Standard Deviation
Why is standard deviation more commonly reported than variance?
Question 4: IQR and Outliers
For a dataset with Q1 = 25 and Q3 = 75, which value would be considered an outlier using the 1.5×IQR rule?
Question 5: Mean vs Median Comparison
If mean > median significantly, what does this tell you about the data distribution?
Question 6: Choosing the Right Measure
Which measure of central tendency is most appropriate for categorical data like "favorite color"?