Welcome to Decision Trees
Decision Trees are one of the most intuitive and powerful machine learning algorithms. Unlike Logistic Regression's mathematical approach, Decision Trees mimic how humans make decisions by asking a series of yes/no questions about data.
They're also the foundation for ensemble methods like Random Forests and Gradient Boosting, making them essential to understand.
What is a Decision Tree?
Definition
A Decision Tree is a tree-structured model that makes predictions by repeatedly asking questions about features. Each question splits the data into smaller subsets until reaching a prediction (leaf).
Imagine deciding whether to go to the beach:
- Is it sunny? (If yes, continue)
- Is the water warm? (If yes, continue)
- Am I free? (If yes, go to beach!)
Decision Trees follow exactly this logic!
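As a quick illustration, here is the beach decision written as plain nested if/else checks, a hand-built decision tree (the function and its inputs are invented for this example):
# A hand-built "decision tree" for the beach example above (illustrative only)
def go_to_beach(sunny: bool, water_warm: bool, free: bool) -> str:
    if sunny:               # root node: "Is it sunny?"
        if water_warm:      # internal node: "Is the water warm?"
            if free:        # internal node: "Am I free?"
                return "Go to the beach!"   # leaf node
            return "Stay home"              # leaf node
        return "Stay home"
    return "Stay home"

print(go_to_beach(sunny=True, water_warm=True, free=True))    # Go to the beach!
print(go_to_beach(sunny=True, water_warm=False, free=False))  # Stay home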
Tree Components
- Root Node: First decision (top of tree)
- Internal Nodes: Decision points (questions)
- Branches: Outcomes (yes/no, left/right)
- Leaf Nodes: Final predictions
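As a minimal sketch (using the Iris data that appears later in this tutorial), you can print a small tree as text with sklearn's export_text and see these components directly: the root question at the top, internal questions indented below it, and leaves marked with class predictions.
# Print a tiny tree as text so root, internal nodes, and leaves are visible
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris.data, iris.target)

# Lines containing "<=" are decision nodes; lines containing "class:" are leaves
print(export_text(small_tree, feature_names=list(iris.feature_names)))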
How Do Trees Learn to Split?
The key challenge: How does the algorithm decide which feature and threshold to use for each split?
Gini Index & Information Gain
Decision Trees use impurity metrics to evaluate candidate splits. The most common is the Gini Index, defined for a node as Gini = 1 - Σ p_k², where p_k is the fraction of samples in class k (entropy / information gain is an alternative criterion that behaves similarly):
- Gini = 0: Pure node (all samples from one class) - a perfect split!
- Gini = 0.5 for two perfectly mixed classes - the worst case (in general the maximum is 1 - 1/K for K classes)
The algorithm tries all possible splits and picks the one whose children have the lowest weighted Gini (i.e., the purest children).
Example: deciding customer loyalty:
- Test "Purchase > $100?" - suppose the resulting groups are about 80% one class (fairly pure children)
- Test "Age > 35?" - suppose the resulting groups are only about 60% one class (still quite mixed)
- The first split is chosen because its children are "purer" (lower weighted Gini) - see the sketch below
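To make this concrete, here is a minimal sketch that computes the weighted Gini impurity of two candidate splits by hand; the loyalty labels below are invented to roughly match the percentages in the example:
# Compare two candidate splits by their weighted Gini impurity
import numpy as np

def gini(labels):
    """Gini impurity of one node: 1 - sum(p_k^2) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(left_labels, right_labels):
    """Weighted Gini impurity of the two children produced by a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# Hypothetical loyalty labels (1 = loyal, 0 = not loyal) for ten customers
left_a, right_a = [1, 1, 1, 1, 0], [0, 0, 0, 0, 1]   # Split A: "Purchase > $100?" -> fairly pure children
left_b, right_b = [1, 1, 1, 0, 0], [1, 1, 0, 0, 0]   # Split B: "Age > 35?"        -> still mixed

print(f"Split A weighted Gini: {split_gini(left_a, right_a):.3f}")  # lower, so this split is chosen
print(f"Split B weighted Gini: {split_gini(left_b, right_b):.3f}")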
Decision Trees in Python
Complete Classification Example: Iris Flower Dataset
# Complete Decision Tree Classification Example
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
# Create and train tree
tree = DecisionTreeClassifier(
criterion='gini', # or 'entropy'
max_depth=3, # prevent overfitting
min_samples_split=2,
min_samples_leaf=1,
random_state=42
)
tree.fit(X_train, y_train)
# Make predictions
y_pred = tree.predict(X_test)
y_proba = tree.predict_proba(X_test)
# Evaluate
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"\nTraining Accuracy: {train_accuracy:.2%}")
print(f"Test Accuracy: {test_accuracy:.2%}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Cross-validation for robust evaluation
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
print(f"\n5-Fold CV Accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std() * 2:.2%})")
# Feature importance
print("\nFeature Importance:")
for feature, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"  {feature}: {importance:.2%}")
# Predict for new sample
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]]) # Typical setosa measurements
prediction = tree.predict(new_flower)
probabilities = tree.predict_proba(new_flower)[0]
print(f"\nNew flower prediction: {iris.target_names[prediction[0]]}")
for class_name, prob in zip(iris.target_names, probabilities):
    print(f"  {class_name}: {prob:.1%}")
Visualizing Decision Trees
# Visualization 1: Tree Structure
plt.figure(figsize=(20, 10))
plot_tree(
tree,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, # color nodes by class
rounded=True, # rounded boxes
fontsize=10
)
plt.title("Decision Tree for Iris Classification", fontsize=16)
plt.tight_layout()
plt.savefig('decision_tree_structure.png', dpi=300, bbox_inches='tight')
plt.show()
# Visualization 2: Decision Boundaries (2D projection)
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Use only 2 features for visualization
X_2d = X[:, [2, 3]] # petal length, petal width
tree_2d = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_2d.fit(X_2d, y)
# Create mesh
x_min, x_max = X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5
y_min, y_max = X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
# Predict on mesh
Z = tree_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FF6B6B', '#4ECDC4', '#45B7D1']))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=ListedColormap(['#FF0000', '#00FF00', '#0000FF']),
edgecolor='black', s=50)
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Decision Tree Decision Boundaries')
plt.colorbar()
plt.show()
# Visualization 3: Feature Importance Bar Chart
importances = tree.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), [iris.feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importance in Decision Tree')
plt.tight_layout()
plt.show()
Saving and Loading Models
# Save trained model
import joblib
# Save
joblib.dump(tree, 'decision_tree_model.pkl')
print("Model saved!")
# Load
loaded_tree = joblib.load('decision_tree_model.pkl')
print("Model loaded!")
# Verify it works
test_prediction = loaded_tree.predict(X_test[:5])
print(f"Predictions: {test_prediction}")
# Alternative: Pickle
import pickle
with open('tree_model.pickle', 'wb') as f:
    pickle.dump(tree, f)
with open('tree_model.pickle', 'rb') as f:
    loaded_tree2 = pickle.load(f)
Advanced: Comparing Different Hyperparameters
# Compare trees with different max_depth
from sklearn.model_selection import validation_curve
param_range = range(1, 15)
train_scores, test_scores = validation_curve(
DecisionTreeClassifier(random_state=42),
X_train, y_train,
param_name="max_depth",
param_range=param_range,
cv=5,
scoring="accuracy"
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_mean, label='Training score', color='blue', marker='o')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
plt.plot(param_range, test_mean, label='Validation score', color='red', marker='s')
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.15, color='red')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Training vs Validation Accuracy by Tree Depth')
plt.legend(loc='best')
plt.grid(True)
plt.show()
print(f"Optimal max_depth: {param_range[np.argmax(test_mean)]}")
print(f"Best validation accuracy: {np.max(test_mean):.2%}")
Feature Importance: Decision Trees automatically tell you which features matter most! This is incredibly valuable for business insights and feature selection.
Pro Tip: Always visualize your tree! Use plot_tree() to understand how decisions are made. If your tree is too deep to visualize, it's probably overfitting.
Regression Trees: Predicting Continuous Values
Decision Trees aren't just for classification! Regression Trees predict continuous values by averaging the target values in leaf nodes.
Classification Trees:
- Split criterion: Gini Index or Entropy
- Leaf prediction: Majority class
- Example: "This email is Spam"
Regression Trees:
- Split criterion: MSE (Mean Squared Error) or MAE
- Leaf prediction: Average of target values
- Example: "This house costs $250,000"
# Regression Tree: Predicting House Prices
from sklearn.tree import DecisionTreeRegressor
import numpy as np
# Sample data: [size_sqft, bedrooms, age_years]
X_train = np.array([
[1500, 3, 10],
[2000, 4, 5],
[1200, 2, 20],
[2500, 4, 2],
[1800, 3, 8]
])
y_train = np.array([250000, 350000, 180000, 420000, 300000])
# Train regression tree
tree_reg = DecisionTreeRegressor(
max_depth=3,
min_samples_split=2,
random_state=42
)
tree_reg.fit(X_train, y_train)
# Predict
new_house = np.array([[1700, 3, 12]])
price = tree_reg.predict(new_house)
print(f"Predicted price: ${price[0]:,.0f}")
# Feature importance
for feature, importance in zip(['Size', 'Bedrooms', 'Age'], tree_reg.feature_importances_):
    print(f"{feature}: {importance:.2%}")
How Regression Trees Split
Instead of Gini, regression trees minimize MSE (Mean Squared Error) when splitting:
- Try the split "Size ≤ 1600 sqft"
- Left child: houses [1200, 1500] → average price = $215k
- Right child: houses [1800, 2000, 2500] → average price ≈ $357k
- Compute the weighted MSE of each child's prediction (its mean price) against the actual prices
- Pick the split with the lowest weighted MSE (see the sketch below)
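You can check this arithmetic with a short sketch; the five houses match the training data in the regression example above:
# Check the split arithmetic by hand for "Size <= threshold"
import numpy as np

sizes  = np.array([1500, 2000, 1200, 2500, 1800])
prices = np.array([250000, 350000, 180000, 420000, 300000])

def node_mse(y):
    """MSE of a node whose prediction is the mean of its target values."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def split_mse(threshold):
    """Weighted MSE of the two children produced by 'size <= threshold'."""
    left, right = prices[sizes <= threshold], prices[sizes > threshold]
    n = len(prices)
    return (len(left) / n) * node_mse(left) + (len(right) / n) * node_mse(right)

for t in [1600, 1900, 2200]:
    print(f"Size <= {t}: weighted MSE = {split_mse(t):,.0f}")
# The regressor greedily picks the threshold with the lowest weighted MSE.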
Overfitting: The Main Problem
Decision Trees have a critical weakness: they can grow arbitrarily deep and memorize training data (overfitting).
An unrestricted tree will keep splitting until every leaf is perfectly pure, meaning it memorizes the training data but fails catastrophically on new data. This is called high variance.
Overfitting Example:
Training accuracy: 100% (memorized perfectly)
Test accuracy: 65% (terrible on new data)
The tree learned noise and outliers instead of patterns!
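Here is a minimal sketch of that effect on a noisy synthetic dataset (the dataset and exact numbers are illustrative, not from this tutorial's Iris example):
# Unrestricted vs. depth-limited tree on noisy synthetic data (illustrative)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_noisy, y_noisy = make_classification(n_samples=500, n_features=20, n_informative=5,
                                        flip_y=0.15, random_state=0)  # flip_y adds label noise
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(X_noisy, y_noisy, test_size=0.3, random_state=0)

for depth in [None, 4]:
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xn_tr, yn_tr)
    print(f"max_depth={depth}: train={t.score(Xn_tr, yn_tr):.2%}, test={t.score(Xn_te, yn_te):.2%}")
# Expect roughly 100% training accuracy but a much lower test score for the
# unrestricted tree, and a smaller train/test gap once depth is limited.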
Pre-Pruning: Stop Growing Early
Pre-pruning prevents overfitting by stopping tree growth early using hyperparameters:
| Hyperparameter | What It Does | Typical Values |
|---|---|---|
| max_depth | Maximum tree depth | 3-10 (start with 5) |
| min_samples_split | Min samples needed to split a node | 10-50 |
| min_samples_leaf | Min samples required in leaf nodes | 5-20 |
| max_leaf_nodes | Maximum number of leaf nodes | 10-100 |
| min_impurity_decrease | Min impurity reduction needed to split | 0.001-0.01 |
# Comprehensive Overfitting Control
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Option 1: Manual tuning
tree = DecisionTreeClassifier(
max_depth=5, # Limit depth
min_samples_split=20, # Need 20+ samples to split
min_samples_leaf=10, # Need 10+ samples per leaf
max_leaf_nodes=20, # Max 20 leaves
min_impurity_decrease=0.01, # Only split if impurity reduces by 0.01+
random_state=42
)
tree.fit(X_train, y_train)
print(f"Training accuracy: {tree.score(X_train, y_train):.2%}")
print(f"Test accuracy: {tree.score(X_test, y_test):.2%}")
# Option 2: GridSearch for best hyperparameters
param_grid = {
'max_depth': [3, 5, 7, 10],
'min_samples_split': [10, 20, 50],
'min_samples_leaf': [5, 10, 20]
}
grid_search = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid,
cv=5, # 5-fold cross-validation
scoring='accuracy'
)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")
print(f"Test score: {grid_search.score(X_test, y_test):.2%}")
Post-Pruning: Grow Then Cut
Post-pruning (also called "pruning") grows a full tree, then removes branches that don't improve validation performance:
1. Grow the complete tree (no restrictions)
2. Evaluate each branch on a validation set
3. Remove branches that don't improve validation accuracy
4. Repeat until no more branches can be removed
Advantage: Can discover optimal tree size automatically
Disadvantage: Computationally expensive (grows full tree first)
# Post-pruning using cost complexity pruning (sklearn 0.22+)
from sklearn.tree import DecisionTreeClassifier
# Train full tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
print(f"Full tree depth: {full_tree.get_depth()}")
print(f"Full tree leaves: {full_tree.get_n_leaves()}")
# Get cost complexity path
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas
# Train trees with different alpha values
trees = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    trees.append(tree)
# Find the best alpha on held-out data (here the test set stands in for a separate validation set)
train_scores = [tree.score(X_train, y_train) for tree in trees]
test_scores = [tree.score(X_test, y_test) for tree in trees]
best_idx = np.argmax(test_scores)
print(f"\nBest alpha: {alphas[best_idx]:.6f}")
print(f"Pruned tree depth: {trees[best_idx].get_depth()}")
print(f"Pruned tree leaves: {trees[best_idx].get_n_leaves()}")
print(f"Test accuracy: {test_scores[best_idx]:.2%}")
Pro Tip: Start with pre-pruning (max_depth, min_samples_split), as it's faster and usually sufficient. Only use post-pruning if you need to squeeze out the last few percent of accuracy.
Strengths & Limitations
Strengths:
- Highly interpretable: you can visualize and explain every decision
- Handles non-linear patterns: works with curved, complex relationships
- Built-in feature importance: automatically identifies which features matter
Limitations:
- Prone to overfitting: can memorize training data without careful tuning
- Unstable: small changes in the data can drastically change the tree
- Greedy algorithm: makes locally optimal choices that may not be globally best
When to Use Decision Trees
- ✅ Need interpretable models you can explain to stakeholders
- ✅ Non-linear relationships in your data
- ✅ Want automatic feature importance
- ✅ Mixed data types (numerical and categorical)
- ✅ As a foundation for ensemble methods (Random Forests, XGBoost)
- ✅ Quick baseline model for any classification/regression problem
- ❌ Avoid when: you need the highest accuracy (use ensembles instead)
- ❌ Avoid when: the data has very high dimensionality (many features)
- ❌ Avoid when: extrapolation is needed - trees can't predict beyond the training range (see the sketch below)
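One way to see the extrapolation limitation (a small sketch on invented data): a regression tree trained on x between 0 and 10 returns the mean of its outermost leaf for any x beyond the training range, so its prediction goes flat instead of following the trend.
# Trees cannot extrapolate: predictions go flat outside the training range
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_demo = np.arange(0, 10, 0.5).reshape(-1, 1)   # x in [0, 10)
y_demo = 3 * x_demo.ravel() + 2                 # simple linear trend

reg = DecisionTreeRegressor(max_depth=4, random_state=42).fit(x_demo, y_demo)

for x in [5.0, 9.5, 20.0, 100.0]:
    print(f"x={x:>6}: tree predicts {reg.predict([[x]])[0]:.1f}, linear trend gives {3 * x + 2:.1f}")
# The predictions for x=20 and x=100 are identical: both fall into the outermost leaf.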
Pro Tip: Decision Trees are rarely used alone in production. They're usually combined with ensemble methods like Random Forests or Gradient Boosting for better performance and stability.
Troubleshooting Guide
| Problem | Symptoms | Solution |
|---|---|---|
| Overfitting | Training accuracy 95%+, test accuracy ~70% | Reduce max_depth (try 3-7); increase min_samples_split; use pruning (ccp_alpha) |
| Underfitting | Both training and test accuracy low | Increase max_depth; decrease min_samples_split; add more features |
| Imbalanced classes | Model always predicts the majority class | Use class_weight='balanced'; oversample the minority class (SMOTE); use stratified splits |
| Tree too deep | Can't visualize the tree; 100+ nodes | Set max_depth=5 or lower; use max_leaf_nodes=20; consider ensemble methods |
| Slow training | Takes minutes to fit the tree | Reduce max_features; sample the data for prototyping; set a max_depth limit |
| Unstable predictions | Tree changes drastically with small data changes | Use Random Forests; set random_state for reproducibility; collect more data |
| Poor feature importance | All features show ~0% importance | Remove constant/duplicate features; check for data leakage (note: trees don't require feature scaling, so StandardScaler won't change importances) |
Common Code Issues
# ❌ WRONG: Fitting on the entire dataset
tree.fit(X, y)
score = tree.score(X, y)  # This is training accuracy, misleading!

# ✅ RIGHT: Always use a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
tree.fit(X_train, y_train)
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)  # Real performance

# ❌ WRONG: Not setting random_state
tree = DecisionTreeClassifier()  # Results not reproducible

# ✅ RIGHT: Set random_state for reproducibility
tree = DecisionTreeClassifier(random_state=42)

# ❌ WRONG: Ignoring class imbalance
# 95% class 0, 5% class 1 -> model predicts all class 0

# ✅ RIGHT: Handle class imbalance
tree = DecisionTreeClassifier(class_weight='balanced', random_state=42)

# ❌ WRONG: Not handling categorical features
# sklearn requires numeric input

# ✅ RIGHT: Encode categorical features
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['category_encoded'] = le.fit_transform(X['category'])
Practice Projects
Project 1: Iris Classification
Difficulty: Beginner
Dataset: sklearn.datasets.load_iris()
Goal: Classify iris species based on measurements
Tasks:
- Train a decision tree with max_depth=3
- Visualize the tree with plot_tree()
- Calculate accuracy on the test set
- Identify the most important feature
Project 2: Credit Card Fraud Detection
Difficulty: Intermediate
Dataset: Kaggle Credit Card Fraud Detection
Goal: Detect fraudulent transactions
Tasks:
- Handle highly imbalanced classes (use class_weight)
- Compare the Gini vs Entropy criterion
- Tune hyperparameters with GridSearchCV
- Calculate precision, recall, and F1-score
Project 3: House Price Prediction
Difficulty: Intermediate
Dataset: California Housing (sklearn.datasets.fetch_california_housing(); the older Boston Housing dataset has been removed from recent scikit-learn releases)
Goal: Predict house prices using regression trees
Tasks:
- Use DecisionTreeRegressor
- Compare the squared_error (MSE) vs absolute_error (MAE) criterion
- Visualize actual vs predicted prices
- Find the optimal max_depth (3-15)
Project 4: Email Spam Filter
Difficulty: Advanced
Dataset: SMS Spam Collection or Enron Email
Goal: Build a spam classifier from text
Tasks:
- Extract features with TfidfVectorizer
- Train a tree on the text features
- Identify the most important words (feature_importances_)
- Compare with a Naive Bayes baseline
Project Starter Code
# Template for Practice Projects
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np
# 1. Load data
# X, y = load_your_dataset()
# 2. Explore data
print(f"Samples: {len(X)}, Features: {X.shape[1]}")
print(f"Class distribution: {np.bincount(y)}")
# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 4. Train tree
tree = DecisionTreeClassifier(
max_depth=5,
min_samples_split=10,
random_state=42
)
tree.fit(X_train, y_train)
# 5. Evaluate
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Training: {train_acc:.2%}, Test: {test_acc:.2%}")
# 6. Cross-validation
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
print(f"CV: {cv_scores.mean():.2%} (+/- {cv_scores.std()*2:.2%})")
# 7. Feature importance
if hasattr(X, 'columns'):  # pandas DataFrame
    for feat, imp in zip(X.columns, tree.feature_importances_):
        if imp > 0.05:  # Only show important features
            print(f"{feat}: {imp:.1%}")
# 8. Detailed evaluation
y_pred = tree.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Challenge: After completing the basic projects, try building an ensemble! Combine 10 decision trees trained on random data subsets, then average their predictions. This is the foundation of Random Forests (next tutorial)!
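If you want a starting point for the challenge, here is a rough sketch that assumes the X_train/X_test split from the Iris example earlier; it uses a majority vote, the classification analogue of averaging:
# Hand-rolled mini-ensemble: 10 trees on bootstrap samples + majority vote
# (assumes X_train, X_test, y_train, y_test from the Iris example above)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
forest = []
for i in range(10):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample (with replacement)
    member = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    forest.append(member)

# Collect each tree's predictions and take the most common class per test sample
all_preds = np.array([member.predict(X_test) for member in forest])          # shape (10, n_test)
ensemble_pred = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, all_preds)

single = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"Single tree test accuracy: {single.score(X_test, y_test):.2%}")
print(f"10-tree ensemble accuracy: {(ensemble_pred == y_test).mean():.2%}")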
Summary
What You've Learned:
- Recursive partitioning algorithm
- Binary splits at each node
- Classification & regression trees
- Gini Index (default, faster)
- Entropy (information gain)
- MSE for regression trees
- Pre-pruning (max_depth, min_samples)
- Post-pruning (ccp_alpha)
- Cross-validation for tuning
- sklearn DecisionTreeClassifier/Regressor
- Feature importance extraction
- Visualization with plot_tree()
Key Takeaways
| Aspect | Takeaway |
|---|---|
| Best feature | Interpretability - you can explain every decision to stakeholders |
| Biggest weakness | Overfitting - requires careful hyperparameter tuning |
| When to use | Quick baselines, interpretable models, feature selection |
| When to avoid | Need maximum accuracy (use ensembles), extrapolation tasks |
| Next evolution | Random Forests, Gradient Boosting (combine many trees) |
Quick Reference Card
# Decision Tree Essentials Cheat Sheet
# Classification
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(
criterion='gini', # or 'entropy'
max_depth=5, # CRITICAL: controls overfitting
min_samples_split=20, # min samples to split
min_samples_leaf=10, # min samples in leaf
random_state=42 # reproducibility
)
tree.fit(X_train, y_train)
# Regression
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(
criterion='squared_error', # or 'absolute_error'
max_depth=5
)
# Key Methods
tree.predict(X_test) # predictions
tree.predict_proba(X_test) # probabilities
tree.feature_importances_ # importance scores
tree.get_depth() # tree depth
tree.get_n_leaves() # leaf count
# Visualization
from sklearn.tree import plot_tree
plot_tree(tree, filled=True, feature_names=['f1', 'f2'])
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
params = {'max_depth': [3,5,7], 'min_samples_split': [10,20,50]}
grid = GridSearchCV(tree, params, cv=5)
grid.fit(X_train, y_train)
What's Next?
In the next tutorial, Random Forests, we'll learn how to combine multiple Decision Trees to create a more powerful and stable model. You'll discover:
- ✅ Ensemble Learning - combining multiple weak learners
- ✅ Bagging - bootstrap aggregating for variance reduction
- ✅ Feature Randomness - decorrelating the trees
- ✅ Out-of-Bag Evaluation - free validation without a separate test set
- ✅ Practical Benefits - better accuracy, less overfitting, far less tuning needed
Congratulations! You've now mastered Decision Trees - one of the most interpretable and foundational ML algorithms. You understand recursive partitioning, splitting criteria (Gini vs Entropy), overfitting control, and how to implement trees in Python. You're ready for ensemble methods!
Knowledge Check
Test your understanding of Decision Trees!