Welcome to Decision Trees
Decision Trees are one of the most intuitive and powerful machine learning algorithms. Unlike Logistic Regression's mathematical approach, Decision Trees mimic how humans make decisions by asking a series of yes/no questions about data.
They're also the foundation for ensemble methods like Random Forests and Gradient Boosting, making them essential to understand.
What is a Decision Tree?
Definition
A Decision Tree is a tree-structured model that makes predictions by repeatedly asking questions about features. Each question splits the data into smaller subsets until reaching a prediction (leaf).
Imagine deciding whether to go to the beach:
- Is it sunny? (If yes, continue)
- Is the water warm? (If yes, continue)
- Am I free? (If yes, go to beach!)
Decision Trees follow exactly this logic!
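As a quick illustration, here is the beach decision written as plain nested if/else checks, a hand-built decision tree (the function and its inputs are invented for this example):
# A hand-built "decision tree" for the beach example above (illustrative only)
def go_to_beach(sunny: bool, water_warm: bool, free: bool) -> str:
    if sunny:               # root node: "Is it sunny?"
        if water_warm:      # internal node: "Is the water warm?"
            if free:        # internal node: "Am I free?"
                return "Go to the beach!"   # leaf node
            return "Stay home"              # leaf node
        return "Stay home"
    return "Stay home"

print(go_to_beach(sunny=True, water_warm=True, free=True))    # Go to the beach!
print(go_to_beach(sunny=True, water_warm=False, free=False))  # Stay home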
Tree Components
- Root Node: First decision (top of tree)
- Internal Nodes: Decision points (questions)
- Branches: Outcomes (yes/no, left/right)
- Leaf Nodes: Final predictions
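As a minimal sketch (using the Iris data that appears later in this tutorial), you can print a small tree as text with sklearn's export_text and see these components directly: the root question at the top, internal questions indented below it, and leaves marked with class predictions.
# Print a tiny tree as text so root, internal nodes, and leaves are visible
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(iris.data, iris.target)

# Lines containing "<=" are decision nodes; lines containing "class:" are leaves
print(export_text(small_tree, feature_names=list(iris.feature_names)))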
How Do Trees Learn to Split?
The key challenge: How does the algorithm decide which feature and threshold to use for each split?
Gini Index & Information Gain
Decision Trees use impurity metrics to evaluate candidate splits. The most common is the Gini Index, defined for a node as Gini = 1 - Σ p_k², where p_k is the fraction of samples in class k (entropy / information gain is an alternative criterion that behaves similarly):
- Gini = 0: Pure node (all samples from one class) - a perfect split!
- Gini = 0.5 for two perfectly mixed classes - the worst case (in general the maximum is 1 - 1/K for K classes)
The algorithm tries all possible splits and picks the one whose children have the lowest weighted Gini (i.e., the purest children).
Example: deciding customer loyalty:
- Test "Purchase > $100?" - suppose the resulting groups are about 80% one class (fairly pure children)
- Test "Age > 35?" - suppose the resulting groups are only about 60% one class (still quite mixed)
- The first split is chosen because its children are "purer" (lower weighted Gini) - see the sketch below
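To make this concrete, here is a minimal sketch that computes the weighted Gini impurity of two candidate splits by hand; the loyalty labels below are invented to roughly match the percentages in the example:
# Compare two candidate splits by their weighted Gini impurity
import numpy as np

def gini(labels):
    """Gini impurity of one node: 1 - sum(p_k^2) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gini(left_labels, right_labels):
    """Weighted Gini impurity of the two children produced by a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# Hypothetical loyalty labels (1 = loyal, 0 = not loyal) for ten customers
left_a, right_a = [1, 1, 1, 1, 0], [0, 0, 0, 0, 1]   # Split A: "Purchase > $100?" -> fairly pure children
left_b, right_b = [1, 1, 1, 0, 0], [1, 1, 0, 0, 0]   # Split B: "Age > 35?"        -> still mixed

print(f"Split A weighted Gini: {split_gini(left_a, right_a):.3f}")  # lower, so this split is chosen
print(f"Split B weighted Gini: {split_gini(left_b, right_b):.3f}")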
Decision Trees in Python
Complete Classification Example: Iris Flower Dataset
# Complete Decision Tree Classification Example
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
# Create and train tree
tree = DecisionTreeClassifier(
criterion='gini', # or 'entropy'
max_depth=3, # prevent overfitting
min_samples_split=2,
min_samples_leaf=1,
random_state=42
)
tree.fit(X_train, y_train)
# Make predictions
y_pred = tree.predict(X_test)
y_proba = tree.predict_proba(X_test)
# Evaluate
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"\nTraining Accuracy: {train_accuracy:.2%}")
print(f"Test Accuracy: {test_accuracy:.2%}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Cross-validation for robust evaluation
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
print(f"\n5-Fold CV Accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std() * 2:.2%})")
# Feature importance
print("\nFeature Importance:")
for feature, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"  {feature}: {importance:.2%}")
# Predict for new sample
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]]) # Typical setosa measurements
prediction = tree.predict(new_flower)
probabilities = tree.predict_proba(new_flower)[0]
print(f"\nNew flower prediction: {iris.target_names[prediction[0]]}")
for class_name, prob in zip(iris.target_names, probabilities):
    print(f"  {class_name}: {prob:.1%}")
Visualizing Decision Trees
# Visualization 1: Tree Structure
plt.figure(figsize=(20, 10))
plot_tree(
tree,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, # color nodes by class
rounded=True, # rounded boxes
fontsize=10
)
plt.title("Decision Tree for Iris Classification", fontsize=16)
plt.tight_layout()
plt.savefig('decision_tree_structure.png', dpi=300, bbox_inches='tight')
plt.show()
# Visualization 2: Decision Boundaries (2D projection)
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# Use only 2 features for visualization
X_2d = X[:, [2, 3]] # petal length, petal width
tree_2d = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_2d.fit(X_2d, y)
# Create mesh
x_min, x_max = X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5
y_min, y_max = X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
# Predict on mesh
Z = tree_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FF6B6B', '#4ECDC4', '#45B7D1']))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap=ListedColormap(['#FF0000', '#00FF00', '#0000FF']),
edgecolor='black', s=50)
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.title('Decision Tree Decision Boundaries')
plt.colorbar()
plt.show()
# Visualization 3: Feature Importance Bar Chart
importances = tree.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), [iris.feature_names[i] for i in indices], rotation=45)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importance in Decision Tree')
plt.tight_layout()
plt.show()
Saving and Loading Models
# Save trained model
import joblib
# Save
joblib.dump(tree, 'decision_tree_model.pkl')
print("Model saved!")
# Load
loaded_tree = joblib.load('decision_tree_model.pkl')
print("Model loaded!")
# Verify it works
test_prediction = loaded_tree.predict(X_test[:5])
print(f"Predictions: {test_prediction}")
# Alternative: Pickle
import pickle
with open('tree_model.pickle', 'wb') as f:
    pickle.dump(tree, f)
with open('tree_model.pickle', 'rb') as f:
    loaded_tree2 = pickle.load(f)
Advanced: Comparing Different Hyperparameters
# Compare trees with different max_depth
from sklearn.model_selection import validation_curve
param_range = range(1, 15)
train_scores, test_scores = validation_curve(
DecisionTreeClassifier(random_state=42),
X_train, y_train,
param_name="max_depth",
param_range=param_range,
cv=5,
scoring="accuracy"
)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_mean, label='Training score', color='blue', marker='o')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.15, color='blue')
plt.plot(param_range, test_mean, label='Validation score', color='red', marker='s')
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.15, color='red')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Training vs Validation Accuracy by Tree Depth')
plt.legend(loc='best')
plt.grid(True)
plt.show()
print(f"Optimal max_depth: {param_range[np.argmax(test_mean)]}")
print(f"Best validation accuracy: {np.max(test_mean):.2%}")
Feature Importance: Decision Trees automatically tell you which features matter most! This is incredibly valuable for business insights and feature selection.
Pro Tip: Always visualize your tree! Use plot_tree() to understand how decisions are made. If your tree is too deep to visualize, it's probably overfitting.
Regression Trees: Predicting Continuous Values
Decision Trees aren't just for classification! Regression Trees predict continuous values by averaging the target values in leaf nodes.
Classification Trees:
- Split criterion: Gini Index or Entropy
- Leaf prediction: Majority class
- Example: "This email is Spam"
Regression Trees:
- Split criterion: MSE (Mean Squared Error) or MAE
- Leaf prediction: Average of target values
- Example: "This house costs $250,000"
# Regression Tree: Predicting House Prices
from sklearn.tree import DecisionTreeRegressor
import numpy as np
# Sample data: [size_sqft, bedrooms, age_years]
X_train = np.array([
[1500, 3, 10],
[2000, 4, 5],
[1200, 2, 20],
[2500, 4, 2],
[1800, 3, 8]
])
y_train = np.array([250000, 350000, 180000, 420000, 300000])
# Train regression tree
tree_reg = DecisionTreeRegressor(
max_depth=3,
min_samples_split=2,
random_state=42
)
tree_reg.fit(X_train, y_train)
# Predict
new_house = np.array([[1700, 3, 12]])
price = tree_reg.predict(new_house)
print(f"Predicted price: ${price[0]:,.0f}")
# Feature importance
for feature, importance in zip(['Size', 'Bedrooms', 'Age'], tree_reg.feature_importances_):
    print(f"{feature}: {importance:.2%}")
How Regression Trees Split
Instead of Gini, regression trees minimize MSE (Mean Squared Error) when splitting:
- Try the split "Size ≤ 1600 sqft"
- Left child: houses [1200, 1500] → average price = $215k
- Right child: houses [1800, 2000, 2500] → average price ≈ $357k
- Compute the weighted MSE of each child's prediction (its mean price) against the actual prices
- Pick the split with the lowest weighted MSE (see the sketch below)
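You can check this arithmetic with a short sketch; the five houses match the training data in the regression example above:
# Check the split arithmetic by hand for "Size <= threshold"
import numpy as np

sizes  = np.array([1500, 2000, 1200, 2500, 1800])
prices = np.array([250000, 350000, 180000, 420000, 300000])

def node_mse(y):
    """MSE of a node whose prediction is the mean of its target values."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def split_mse(threshold):
    """Weighted MSE of the two children produced by 'size <= threshold'."""
    left, right = prices[sizes <= threshold], prices[sizes > threshold]
    n = len(prices)
    return (len(left) / n) * node_mse(left) + (len(right) / n) * node_mse(right)

for t in [1600, 1900, 2200]:
    print(f"Size <= {t}: weighted MSE = {split_mse(t):,.0f}")
# The regressor greedily picks the threshold with the lowest weighted MSE.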
Overfitting: The Main Problem
Decision Trees have a critical weakness: they can grow arbitrarily deep and memorize training data (overfitting).
An unrestricted tree will keep splitting until every leaf is perfectly pure, meaning it memorizes the training data but fails catastrophically on new data. This is called high variance.
Overfitting Example:
Training accuracy: 100% (memorized perfectly)
Test accuracy: 65% (terrible on new data)
The tree learned noise and outliers instead of patterns!
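Here is a minimal sketch of that effect on a noisy synthetic dataset (the dataset and exact numbers are illustrative, not from this tutorial's Iris example):
# Unrestricted vs. depth-limited tree on noisy synthetic data (illustrative)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_noisy, y_noisy = make_classification(n_samples=500, n_features=20, n_informative=5,
                                        flip_y=0.15, random_state=0)  # flip_y adds label noise
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(X_noisy, y_noisy, test_size=0.3, random_state=0)

for depth in [None, 4]:
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xn_tr, yn_tr)
    print(f"max_depth={depth}: train={t.score(Xn_tr, yn_tr):.2%}, test={t.score(Xn_te, yn_te):.2%}")
# Expect roughly 100% training accuracy but a much lower test score for the
# unrestricted tree, and a smaller train/test gap once depth is limited.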
Pre-Pruning: Stop Growing Early
Pre-pruning prevents overfitting by stopping tree growth early using hyperparameters:
| Hyperparameter | What It Does | Typical Values |
|---|---|---|
| max_depth | Maximum tree depth | 3-10 (start with 5) |
| min_samples_split | Min samples needed to split a node | 10-50 |
| min_samples_leaf | Min samples required in leaf nodes | 5-20 |
| max_leaf_nodes | Maximum number of leaf nodes | 10-100 |
| min_impurity_decrease | Min impurity reduction needed to split | 0.001-0.01 |
# Comprehensive Overfitting Control
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Option 1: Manual tuning
tree = DecisionTreeClassifier(
max_depth=5, # Limit depth
min_samples_split=20, # Need 20+ samples to split
min_samples_leaf=10, # Need 10+ samples per leaf
max_leaf_nodes=20, # Max 20 leaves
min_impurity_decrease=0.01, # Only split if impurity reduces by 0.01+
random_state=42
)
tree.fit(X_train, y_train)
print(f"Training accuracy: {tree.score(X_train, y_train):.2%}")
print(f"Test accuracy: {tree.score(X_test, y_test):.2%}")
# Option 2: GridSearch for best hyperparameters
param_grid = {
'max_depth': [3, 5, 7, 10],
'min_samples_split': [10, 20, 50],
'min_samples_leaf': [5, 10, 20]
}
grid_search = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid,
cv=5, # 5-fold cross-validation
scoring='accuracy'
)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.2%}")
print(f"Test score: {grid_search.score(X_test, y_test):.2%}")
Post-Pruning: Grow Then Cut
Post-pruning (also called "pruning") grows a full tree, then removes branches that don't improve validation performance:
1. Grow the complete tree (no restrictions)
2. Evaluate each branch on a validation set
3. Remove branches that don't improve validation accuracy
4. Repeat until no more branches can be removed
Advantage: Can discover optimal tree size automatically
Disadvantage: Computationally expensive (grows full tree first)
# Post-pruning using cost complexity pruning (sklearn 0.22+)
from sklearn.tree import DecisionTreeClassifier
# Train full tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
print(f"Full tree depth: {full_tree.get_depth()}")
print(f"Full tree leaves: {full_tree.get_n_leaves()}")
# Get cost complexity path
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas
# Train trees with different alpha values
trees = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    trees.append(tree)
# Find the best alpha on held-out data (here the test set stands in for a separate validation set)
train_scores = [tree.score(X_train, y_train) for tree in trees]
test_scores = [tree.score(X_test, y_test) for tree in trees]
best_idx = np.argmax(test_scores)
print(f"\nBest alpha: {alphas[best_idx]:.6f}")
print(f"Pruned tree depth: {trees[best_idx].get_depth()}")
print(f"Pruned tree leaves: {trees[best_idx].get_n_leaves()}")
print(f"Test accuracy: {test_scores[best_idx]:.2%}")
Pro Tip: Start with pre-pruning (max_depth, min_samples_split), as it's faster and usually sufficient. Only use post-pruning if you need to squeeze out the last few percent of accuracy.
Strengths & Limitations
Strengths:
- Highly interpretable: you can visualize and explain every decision
- Handles non-linear patterns: works with curved, complex relationships
- Built-in feature importance: automatically identifies which features matter
Limitations:
- Prone to overfitting: can memorize training data without careful tuning
- Unstable: small changes in the data can drastically change the tree
- Greedy algorithm: makes locally optimal choices that may not be globally best
When to Use Decision Trees
- ✅ Need interpretable models you can explain to stakeholders
- ✅ Non-linear relationships in your data
- ✅ Want automatic feature importance
- ✅ Mixed data types (numerical and categorical)
- ✅ As a foundation for ensemble methods (Random Forests, XGBoost)
- ✅ Quick baseline model for any classification/regression problem
- ❌ Avoid when: you need the highest accuracy (use ensembles instead)
- ❌ Avoid when: the data has very high dimensionality (many features)
- ❌ Avoid when: extrapolation is needed - trees can't predict beyond the training range (see the sketch below)
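One way to see the extrapolation limitation (a small sketch on invented data): a regression tree trained on x between 0 and 10 returns the mean of its outermost leaf for any x beyond the training range, so its prediction goes flat instead of following the trend.
# Trees cannot extrapolate: predictions go flat outside the training range
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_demo = np.arange(0, 10, 0.5).reshape(-1, 1)   # x in [0, 10)
y_demo = 3 * x_demo.ravel() + 2                 # simple linear trend

reg = DecisionTreeRegressor(max_depth=4, random_state=42).fit(x_demo, y_demo)

for x in [5.0, 9.5, 20.0, 100.0]:
    print(f"x={x:>6}: tree predicts {reg.predict([[x]])[0]:.1f}, linear trend gives {3 * x + 2:.1f}")
# The predictions for x=20 and x=100 are identical: both fall into the outermost leaf.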
Pro Tip: Decision Trees are rarely used alone in production. They're usually combined with ensemble methods like Random Forests or Gradient Boosting for better performance and stability.
Troubleshooting Guide
| Problem | Symptoms | Solution |
|---|---|---|
| Overfitting | Training accuracy 95%+, test accuracy ~70% | Reduce max_depth (try 3-7); increase min_samples_split; use pruning (ccp_alpha) |
| Underfitting | Both training and test accuracy low | Increase max_depth; decrease min_samples_split; add more features |
| Imbalanced classes | Model always predicts the majority class | Use class_weight='balanced'; oversample the minority class (SMOTE); use stratified splits |
| Tree too deep | Can't visualize the tree; 100+ nodes | Set max_depth=5 or lower; use max_leaf_nodes=20; consider ensemble methods |
| Slow training | Takes minutes to fit the tree | Reduce max_features; sample the data for prototyping; set a max_depth limit |
| Unstable predictions | Tree changes drastically with small data changes | Use Random Forests; set random_state for reproducibility; collect more data |
| Poor feature importance | All features show ~0% importance | Remove constant/duplicate features; check for data leakage (note: trees don't require feature scaling, so StandardScaler won't change importances) |
Common Code Issues
# ❌ WRONG: Fitting on the entire dataset
tree.fit(X, y)
score = tree.score(X, y)  # This is training accuracy, misleading!

# ✅ RIGHT: Always use a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
tree.fit(X_train, y_train)
train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)  # Real performance

# ❌ WRONG: Not setting random_state
tree = DecisionTreeClassifier()  # Results not reproducible

# ✅ RIGHT: Set random_state for reproducibility
tree = DecisionTreeClassifier(random_state=42)

# ❌ WRONG: Ignoring class imbalance
# 95% class 0, 5% class 1 -> model predicts all class 0

# ✅ RIGHT: Handle class imbalance
tree = DecisionTreeClassifier(class_weight='balanced', random_state=42)

# ❌ WRONG: Not handling categorical features
# sklearn requires numeric input

# ✅ RIGHT: Encode categorical features
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['category_encoded'] = le.fit_transform(X['category'])
Practice Projects
Project 1: Iris Classification
Difficulty: Beginner
Dataset: sklearn.datasets.load_iris()
Goal: Classify iris species based on measurements
Tasks:
- Train a decision tree with max_depth=3
- Visualize the tree with plot_tree()
- Calculate accuracy on the test set
- Identify the most important feature
Project 2: Credit Card Fraud Detection
Difficulty: Intermediate
Dataset: Kaggle Credit Card Fraud Detection
Goal: Detect fraudulent transactions
Tasks:
- Handle highly imbalanced classes (use class_weight)
- Compare the Gini vs Entropy criterion
- Tune hyperparameters with GridSearchCV
- Calculate precision, recall, and F1-score
Project 3: House Price Prediction
Difficulty: Intermediate
Dataset: California Housing (sklearn.datasets.fetch_california_housing(); the older Boston Housing dataset has been removed from recent scikit-learn releases)
Goal: Predict house prices using regression trees
Tasks:
- Use DecisionTreeRegressor
- Compare the squared_error (MSE) vs absolute_error (MAE) criterion
- Visualize actual vs predicted prices
- Find the optimal max_depth (3-15)
Project 4: Email Spam Filter
Difficulty: Advanced
Dataset: SMS Spam Collection or Enron Email
Goal: Build a spam classifier from text
Tasks:
- Extract features with TfidfVectorizer
- Train a tree on the text features
- Identify the most important words (feature_importances_)
- Compare with a Naive Bayes baseline
Project Starter Code
# Template for Practice Projects
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np
# 1. Load data
# X, y = load_your_dataset()
# 2. Explore data
print(f"Samples: {len(X)}, Features: {X.shape[1]}")
print(f"Class distribution: {np.bincount(y)}")
# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 4. Train tree
tree = DecisionTreeClassifier(
max_depth=5,
min_samples_split=10,
random_state=42
)
tree.fit(X_train, y_train)
# 5. Evaluate
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Training: {train_acc:.2%}, Test: {test_acc:.2%}")
# 6. Cross-validation
cv_scores = cross_val_score(tree, X_train, y_train, cv=5)
print(f"CV: {cv_scores.mean():.2%} (+/- {cv_scores.std()*2:.2%})")
# 7. Feature importance
if hasattr(X, 'columns'):  # pandas DataFrame
    for feat, imp in zip(X.columns, tree.feature_importances_):
        if imp > 0.05:  # Only show important features
            print(f"{feat}: {imp:.1%}")
# 8. Detailed evaluation
y_pred = tree.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Challenge: After completing the basic projects, try building an ensemble! Combine 10 decision trees trained on random data subsets, then average their predictions. This is the foundation of Random Forests (next tutorial)!
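If you want a starting point for the challenge, here is a rough sketch that assumes the X_train/X_test split from the Iris example earlier; it uses a majority vote, the classification analogue of averaging:
# Hand-rolled mini-ensemble: 10 trees on bootstrap samples + majority vote
# (assumes X_train, X_test, y_train, y_test from the Iris example above)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
forest = []
for i in range(10):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample (with replacement)
    member = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    forest.append(member)

# Collect each tree's predictions and take the most common class per test sample
all_preds = np.array([member.predict(X_test) for member in forest])          # shape (10, n_test)
ensemble_pred = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, all_preds)

single = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"Single tree test accuracy: {single.score(X_test, y_test):.2%}")
print(f"10-tree ensemble accuracy: {(ensemble_pred == y_test).mean():.2%}")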
Summary
What You've Learned:
- Recursive partitioning algorithm
- Binary splits at each node
- Classification & regression trees
- Gini Index (default, faster)
- Entropy (information gain)
- MSE for regression trees
- Pre-pruning (max_depth, min_samples)
- Post-pruning (ccp_alpha)
- Cross-validation for tuning
- sklearn DecisionTreeClassifier/Regressor
- Feature importance extraction
- Visualization with plot_tree()
Key Takeaways
| Aspect | Takeaway |
|---|---|
| Best feature | Interpretability - you can explain every decision to stakeholders |
| Biggest weakness | Overfitting - requires careful hyperparameter tuning |
| When to use | Quick baselines, interpretable models, feature selection |
| When to avoid | Need maximum accuracy (use ensembles), extrapolation tasks |
| Next evolution | Random Forests, Gradient Boosting (combine many trees) |
Quick Reference Card
# Decision Tree Essentials Cheat Sheet
# Classification
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(
criterion='gini', # or 'entropy'
max_depth=5, # CRITICAL: controls overfitting
min_samples_split=20, # min samples to split
min_samples_leaf=10, # min samples in leaf
random_state=42 # reproducibility
)
tree.fit(X_train, y_train)
# Regression
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(
criterion='squared_error', # or 'absolute_error'
max_depth=5
)
# Key Methods
tree.predict(X_test) # predictions
tree.predict_proba(X_test) # probabilities
tree.feature_importances_ # importance scores
tree.get_depth() # tree depth
tree.get_n_leaves() # leaf count
# Visualization
from sklearn.tree import plot_tree
plot_tree(tree, filled=True, feature_names=['f1', 'f2'])
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
params = {'max_depth': [3,5,7], 'min_samples_split': [10,20,50]}
grid = GridSearchCV(tree, params, cv=5)
grid.fit(X_train, y_train)
What's Next?
In the next tutorial, Random Forests, we'll learn how to combine multiple Decision Trees to create a more powerful and stable model. You'll discover:
- ✅ Ensemble Learning - combining multiple weak learners
- ✅ Bagging - bootstrap aggregating for variance reduction
- ✅ Feature Randomness - decorrelating the trees
- ✅ Out-of-Bag Evaluation - free validation without a separate test set
- ✅ Practical Benefits - better accuracy, less overfitting, far less tuning needed
Congratulations! You've now mastered Decision Trees - one of the most interpretable and foundational ML algorithms. You understand recursive partitioning, splitting criteria (Gini vs Entropy), overfitting control, and how to implement trees in Python. You're ready for ensemble methods!
Knowledge Check
Test your understanding of Decision Trees!