Machine learning algorithms work with numbers, not text. When your dataset contains categorical variables like "Color", "City", or "Product Type", you need to convert them into numerical format. This process is called categorical encoding, and choosing the right encoding method can significantly impact your model's performance.
🎯 What You'll Learn
- Five essential encoding techniques and when to use each
- Handling ordinal vs nominal categorical variables
- Avoiding common encoding pitfalls (dummy variable trap, data leakage)
- Implementing encoders in scikit-learn and category_encoders
- Encoding in production pipelines
Understanding Categorical Variables
Before encoding, it's crucial to understand the type of categorical variable you're working with.
Nominal vs Ordinal Categories
Nominal categories have no inherent order:
- Color: Red, Blue, Green (no natural ordering)
- City: New York, London, Tokyo
- Product Type: Electronics, Clothing, Food
Ordinal categories have a meaningful order:
- Education: High School < Bachelor's < Master's < PhD
- Rating: Poor < Fair < Good < Excellent
- Size: Small < Medium < Large < XL
⚠️ Common Mistake
Using label encoding (0, 1, 2...) for nominal categories creates false ordinal relationships. The model will think "Blue" is somehow "between" "Red" and "Green"!
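Here is a quick illustration with hypothetical toy numbers (not from any dataset in this tutorial): encode Red=0, Blue=1, Green=2 and fit a linear model; because its predictions must lie on a straight line, Blue can never stand out from Red and Green.
# Hypothetical toy example: a label-encoded nominal feature in a linear model
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[0], [1], [2], [0], [1], [2]])  # Red=0, Blue=1, Green=2
y = np.array([1.0, 5.0, 2.0, 1.0, 5.0, 2.0])  # Blue actually has the highest target
model = LinearRegression().fit(X, y)
print(model.predict([[0], [1], [2]]))
# ~[2.17 2.67 3.17] - Blue is predicted "between" Red and Green, far from its true value of 5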
# Example: Identifying categorical types
import pandas as pd
import numpy as np
# Create sample dataset
df = pd.DataFrame({
'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
'size': ['Small', 'Medium', 'Large', 'Small', 'XL'],
'city': ['NYC', 'LA', 'SF', 'NYC', 'LA'],
'rating': ['Poor', 'Good', 'Excellent', 'Fair', 'Good']
})
print("Categorical columns:")
print(df.dtypes)
print("\nUnique values:")
for col in df.columns:
    print(f"{col}: {df[col].unique()}")
# Output:
# color object
# size object
# city object
# rating object
#
# Unique values:
# color: ['Red' 'Blue' 'Green'] # Nominal
# size: ['Small' 'Medium' 'Large' 'XL'] # Ordinal
# city: ['NYC' 'LA' 'SF'] # Nominal
# rating: ['Poor' 'Good' 'Excellent' 'Fair'] # Ordinal
Method 1: One-Hot Encoding (Dummy Variables)
Best for: Nominal categories with low cardinality (few unique values)
One-hot encoding creates a new binary column for each category. Only one column has a value of 1 (hot), while others are 0.
How It Works
# Original data
colors = ['Red', 'Blue', 'Green', 'Red']
# After one-hot encoding:
# Red Blue Green
# 1 0 0 (Red)
# 0 1 0 (Blue)
# 0 0 1 (Green)
# 1 0 0 (Red)
Implementation with Pandas
import pandas as pd
# Simple one-hot encoding
df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})
# Method 1: Using pd.get_dummies()
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')
print(df_encoded)
# color_Blue color_Green color_Red
# 0 0 0 1
# 1 1 0 0
# 2 0 1 0
# 3 0 0 1
# 4 1 0 0
# Method 2: Drop first column to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color', drop_first=True)
print(df_encoded)
# color_Green color_Red
# 0 0 1
# 1 0 0
# 2 1 0
# 3 0 1
# 4 0 0
Implementation with Scikit-learn
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Create encoder
encoder = OneHotEncoder(sparse_output=False, drop='first')
# Sample data
data = np.array(['Red', 'Blue', 'Green', 'Red', 'Blue']).reshape(-1, 1)
# Fit and transform
encoded = encoder.fit_transform(data)
print("Encoded shape:", encoded.shape)
print("Categories:", encoder.categories_)
print("\nEncoded data:")
print(encoded)
# Get feature names
feature_names = encoder.get_feature_names_out(['color'])
print("\nFeature names:", feature_names)
# Output:
# Encoded shape: (5, 2)
# Categories: [array(['Blue', 'Green', 'Red'], dtype='<U5')]
#
# Encoded data:
# [[0. 1.]
#  [0. 0.]
#  [1. 0.]
#  [0. 1.]
#  [0. 0.]]
#
# Feature names: ['color_Green' 'color_Red']
💡 The Dummy Variable Trap
When using one-hot encoding with linear models, always drop one category (using drop_first=True). This prevents perfect multicollinearity where columns are linearly dependent.
Why? If you have Red, Blue, and Green columns, knowing two tells you the third: if Red=0 and Blue=0, then Green must be 1.
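You can check this linear dependence directly; a minimal sketch:
import pandas as pd
dummies = pd.get_dummies(pd.Series(['Red', 'Blue', 'Green', 'Red'], name='color'))
print(dummies.sum(axis=1))  # every row sums to 1, so any one column equals 1 minus the sum of the others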
Pros and Cons
✅ Advantages:
- No false ordinal relationships created
- Works well with linear models
- Easy to interpret
❌ Disadvantages:
- Creates many columns for high-cardinality features (curse of dimensionality; see the sketch below)
- Sparse matrices (mostly zeros) require more memory
- Poor for categories with 50+ unique values
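To see how quickly columns pile up, here is a rough sketch using a made-up high-cardinality user_id column:
import numpy as np
import pandas as pd
user_ids = pd.Series(np.random.choice([f'user_{i}' for i in range(1000)], size=5000), name='user_id')
print(pd.get_dummies(user_ids).shape)  # roughly (5000, 1000): one binary column per unique user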
Method 2: Label Encoding
Best for: Ordinal categories (with natural order) and tree-based models
Label encoding assigns each category a unique integer (0, 1, 2, 3...). Simple and memory-efficient, but creates ordinal relationships.
Implementation
from sklearn.preprocessing import LabelEncoder
# For ordinal categories (has order)
education = ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
encoder = LabelEncoder()
encoded = encoder.fit_transform(education)
print("Original:", education)
print("Encoded:", encoded)
print("Classes:", encoder.classes_)
# Output:
# Original: ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
# Encoded: [1 0 2 3 0 2]
# Classes: ['Bachelor' 'High School' 'Master' 'PhD']
# Note: Alphabetical order! We need to fix this for true ordinal data
Handling True Ordinal Data
import pandas as pd
# Define the correct order
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
# Method 1: Using pandas Categorical with ordered=True
df = pd.DataFrame({'education': ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor']})
df['education_encoded'] = pd.Categorical(
df['education'],
categories=education_order,
ordered=True
).codes
print(df)
# education education_encoded
# 0 Bachelor 1
# 1 Master 2
# 2 High School 0
# 3 PhD 3
# 4 Bachelor 1
# Method 2: Using OrdinalEncoder from scikit-learn
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[education_order])
df['education_encoded2'] = encoder.fit_transform(df[['education']])
print("\nWith OrdinalEncoder:")
print(df)
When to Use Label Encoding
✅ Good Use Cases
- Tree-based models: Random Forest, XGBoost, LightGBM can handle label encoded data well because they split on thresholds
- Ordinal categories: When there's a clear ranking (Small → Medium → Large)
- Target variable: For classification tasks (converting class labels; see the sketch below)
❌ Avoid For
- Nominal categories with linear models: Creates false ordinal relationships
- High cardinality nominal data: Random integer assignments are meaningless
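For the target-variable use case, LabelEncoder is the standard tool; a minimal sketch with made-up class labels:
from sklearn.preprocessing import LabelEncoder
y = ['cat', 'dog', 'cat', 'bird']
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(y_encoded)  # [1 2 1 0] - alphabetical order: bird=0, cat=1, dog=2
print(le.inverse_transform(y_encoded))  # recover the original class labels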
# Example: Label encoding with tree models
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import numpy as np
# Sample data: City (nominal) and Purchase (target)
cities = ['NYC', 'LA', 'SF', 'NYC', 'LA', 'SF', 'NYC', 'LA']
purchases = [1, 0, 1, 1, 0, 1, 1, 0]
# Label encode cities
le = LabelEncoder()
cities_encoded = le.fit_transform(cities).reshape(-1, 1)
# Train Random Forest (tree-based model handles label encoding)
rf = RandomForestClassifier(random_state=42)
rf.fit(cities_encoded, purchases)
print("Feature importance:", rf.feature_importances_)
print("Accuracy:", rf.score(cities_encoded, purchases))
# Trees will split: "if city <= 1.5 then..."
# This works because trees find patterns, not linear relationships
Method 3: Ordinal Encoding
Best for: Categories with explicit ranking/order
Ordinal encoding is similar to label encoding but allows you to explicitly define the order. Perfect for categories like "Low", "Medium", "High".
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
# Data with multiple ordinal features
df = pd.DataFrame({
'size': ['M', 'L', 'S', 'XL', 'M', 'S'],
'priority': ['Low', 'High', 'Medium', 'Low', 'High', 'Medium'],
'rating': ['Fair', 'Excellent', 'Poor', 'Good', 'Excellent', 'Fair']
})
# Define ordering for each feature
size_order = ['S', 'M', 'L', 'XL']
priority_order = ['Low', 'Medium', 'High']
rating_order = ['Poor', 'Fair', 'Good', 'Excellent']
# Create encoder with explicit categories
encoder = OrdinalEncoder(
categories=[size_order, priority_order, rating_order]
)
# Fit and transform
df_encoded = pd.DataFrame(
encoder.fit_transform(df),
columns=['size_enc', 'priority_enc', 'rating_enc']
)
# Combine with original
result = pd.concat([df, df_encoded], axis=1)
print(result)
# size priority rating size_enc priority_enc rating_enc
# 0 M Low Fair 1.0 0.0 1.0
# 1 L High Excellent 2.0 2.0 3.0
# 2 S Medium Poor 0.0 1.0 0.0
# 3 XL Low Good 3.0 0.0 2.0
# 4 M High Excellent 1.0 2.0 3.0
# 5 S Medium Fair 0.0 1.0 1.0
Handling Unknown Categories
from sklearn.preprocessing import OrdinalEncoder
# Training data
train_data = pd.DataFrame({'size': ['S', 'M', 'L']})
# Create encoder that handles unknown categories
encoder = OrdinalEncoder(
categories=[['S', 'M', 'L', 'XL']],
handle_unknown='use_encoded_value',
unknown_value=-1 # Assign -1 to unknown categories
)
encoder.fit(train_data)
# Test data with new category
test_data = pd.DataFrame({'size': ['M', 'XXL', 'L']}) # XXL is unknown
encoded_test = encoder.transform(test_data)
print("Test data encoded:")
print(encoded_test)
# Output:
# [[ 1.] # M
# [-1.] # XXL (unknown)
# [ 2.]] # L
Method 4: Target Encoding (Mean Encoding)
Best for: High cardinality nominal categories (many unique values)
Target encoding replaces each category with the mean of the target variable for that category. Very powerful but requires careful implementation to avoid data leakage.
How It Works
# Example: City and Purchase behavior
# City | Purchase
# NYC | 1
# LA | 0
# NYC | 1
# SF | 1
# LA | 0
# NYC | 1
# Target encoding calculates mean purchase rate per city:
# NYC: (1+1+1)/3 = 1.0
# LA: (0+0)/2 = 0.0
# SF: (1)/1 = 1.0
# Then replaces city names with these values
Basic Implementation
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'city': ['NYC', 'LA', 'SF', 'NYC', 'LA', 'SF', 'NYC', 'Chicago'],
'purchase': [1, 0, 1, 1, 0, 1, 1, 0]
})
# Calculate mean purchase rate per city
target_means = df.groupby('city')['purchase'].mean()
print("Target means per city:")
print(target_means)
# Output:
# city
# Chicago 0.0
# LA 0.0
# NYC 1.0
# SF 1.0
# Apply target encoding
df['city_encoded'] = df['city'].map(target_means)
print("\nEncoded data:")
print(df)
⚠️ Critical: Avoiding Data Leakage
The Problem: Basic target encoding uses information from the target variable, which causes overfitting. The model learns to memorize training patterns rather than generalize.
Solution: Use cross-validation or leave-one-out encoding.
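A minimal leave-one-out sketch, reusing the df, 'city', and 'purchase' columns from above (each row is encoded from every other row's target, never its own):
stats = df.groupby('city')['purchase'].agg(['sum', 'count'])
sums = df['city'].map(stats['sum'])
counts = df['city'].map(stats['count'])
global_mean = df['purchase'].mean()
# Subtract each row's own target before averaging; single-occurrence categories fall back to the global mean
df['city_loo'] = ((sums - df['purchase']) / (counts - 1)).fillna(global_mean)
print(df)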
Proper Implementation: K-Fold Target Encoding
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
def target_encode_kfold(df, categorical_col, target_col, n_folds=5):
    """
    Perform target encoding with K-fold cross-validation to prevent leakage.
    """
    # Initialize encoded column
    encoded_col = np.zeros(len(df))
    # Calculate global mean for unseen categories
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Split data
        train_df = df.iloc[train_idx]
        # Calculate target means from training fold only
        target_means = train_df.groupby(categorical_col)[target_col].mean()
        # Apply to validation fold
        encoded_col[val_idx] = df.iloc[val_idx][categorical_col].map(target_means)
        # Fill unknown categories with global mean
        encoded_col[val_idx] = np.where(
            np.isnan(encoded_col[val_idx]),
            global_mean,
            encoded_col[val_idx]
        )
    return encoded_col
# Example usage
df = pd.DataFrame({
'city': ['NYC', 'LA', 'SF'] * 30 + ['Chicago'] * 10,
'purchase': np.random.binomial(1, 0.6, 100)
})
df['city_encoded'] = target_encode_kfold(df, 'city', 'purchase', n_folds=5)
print(df.head(10))
print("\nEncoding statistics:")
print(df.groupby('city')['city_encoded'].mean())
Using Category Encoders Library
# Install: pip install category_encoders
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Sample high-cardinality data
df = pd.DataFrame({
'product_id': [f'P{i%50}' for i in range(1000)], # 50 unique products
'sales': np.random.randint(0, 2, 1000)
})
# Split data
X = df[['product_id']]
y = df['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create target encoder with smoothing
encoder = TargetEncoder(smoothing=1.0) # Smoothing prevents overfitting
# Fit on training data only
encoder.fit(X_train, y_train)
# Transform both sets
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
print("Training set encoded:")
print(X_train_encoded.head())
print("\nTest set encoded:")
print(X_test_encoded.head())
Pros and Cons
✅ Advantages:
- Handles high cardinality (100s or 1000s of categories) efficiently
- Creates only one column per feature (no dimensionality explosion)
- Captures relationship between category and target
❌ Disadvantages:
- Risk of data leakage if not implemented carefully
- Can overfit on rare categories (see the smoothing sketch below)
- Only works for supervised learning (requires target variable)
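A common mitigation for rare categories is to shrink each category mean toward the global mean; a minimal sketch (the hypothetical parameter m controls the shrinkage, and category_encoders uses a related but not identical formula):
import pandas as pd
def smoothed_target_means(df, cat_col, target_col, m=10.0):
    """Blend each category's mean with the global mean; categories with few rows are pulled furthest."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    return (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)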
Method 5: Frequency Encoding
Best for: High cardinality nominal categories, unsupervised learning
Frequency encoding replaces each category with its frequency count or percentage in the dataset. Simple, fast, and no data leakage risk.
Implementation
import pandas as pd
# Sample data with high cardinality
df = pd.DataFrame({
'user_id': ['U1', 'U2', 'U1', 'U3', 'U1', 'U2', 'U4', 'U1'],
'product': ['A', 'B', 'A', 'C', 'A', 'B', 'D', 'A']
})
# Method 1: Count frequency
df['user_frequency'] = df['user_id'].map(df['user_id'].value_counts())
df['product_frequency'] = df['product'].map(df['product'].value_counts())
print("Frequency encoding:")
print(df)
# user_id product user_frequency product_frequency
# 0 U1 A 4 4
# 1 U2 B 2 2
# 2 U1 A 4 4
# 3 U3 C 1 1
# 4 U1 A 4 4
# 5 U2 B 2 2
# 6 U4 D 1 1
# 7 U1 A 4 4
# Method 2: Normalize to percentage
df['user_freq_pct'] = df['user_id'].map(
df['user_id'].value_counts(normalize=True)
)
print("\nNormalized frequency:")
print(df[['user_id', 'user_frequency', 'user_freq_pct']])
Handling Training and Test Sets
from sklearn.base import BaseEstimator, TransformerMixin
class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Frequency encoder that handles train/test splits properly.
    """
    def __init__(self, normalize=False):
        self.normalize = normalize
        self.frequency_map = {}
    def fit(self, X, y=None):
        """Learn frequency mapping from training data."""
        for col in X.columns:
            if self.normalize:
                self.frequency_map[col] = X[col].value_counts(normalize=True).to_dict()
            else:
                self.frequency_map[col] = X[col].value_counts().to_dict()
        return self
    def transform(self, X):
        """Apply frequency encoding."""
        X_encoded = X.copy()
        for col in X.columns:
            # Map known categories, unknown get 0
            X_encoded[col] = X[col].map(self.frequency_map[col]).fillna(0)
        return X_encoded
# Example usage
from sklearn.model_selection import train_test_split
# Generate data
df = pd.DataFrame({
'city': ['NYC'] * 50 + ['LA'] * 30 + ['SF'] * 20 + ['Chicago'] * 10
})
# Split data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Fit encoder on training data
encoder = FrequencyEncoder(normalize=True)
encoder.fit(train_df[['city']])
# Transform both sets
train_encoded = encoder.transform(train_df[['city']])
test_encoded = encoder.transform(test_df[['city']])
print("Training frequencies:")
print(encoder.frequency_map)
print("\nTest set encoded (first 5 rows):")
print(test_encoded.head())
When to Use Frequency Encoding
✅ Good Use Cases
- High cardinality features: User IDs, product IDs with 1000+ unique values
- Unsupervised learning: No target variable available (unlike target encoding)
- Time series: Capture popularity trends over time (see the sketch below)
- Quick baseline: Fast and simple implementation
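For the time-series use case, count only past occurrences so there is no look-ahead; a minimal sketch with hypothetical 'date' and 'product' columns:
import pandas as pd
events = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-04']),
    'product': ['A', 'A', 'B', 'A', 'B'],
}).sort_values('date')
events['prior_count'] = events.groupby('product').cumcount()  # occurrences of this product before the current row
print(events)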
Comparing Encoding Methods
Here's a practical comparison to help you choose the right encoding method:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Generate sample data
np.random.seed(42)
n_samples = 1000
df = pd.DataFrame({
'category_low': np.random.choice(['A', 'B', 'C'], n_samples), # 3 categories
'category_high': np.random.choice([f'Cat_{i}' for i in range(50)], n_samples), # 50 categories
'target': np.random.binomial(1, 0.5, n_samples)
})
# Test different encoding methods
results = []
# 1. One-Hot Encoding (low cardinality)
ohe = OneHotEncoder(sparse_output=False, drop='first')
X_onehot = ohe.fit_transform(df[['category_low']])
lr = LogisticRegression(max_iter=1000)
score_lr_onehot = cross_val_score(lr, X_onehot, df['target'], cv=5, scoring='accuracy').mean()
results.append(('One-Hot + Logistic Regression (low card)', score_lr_onehot))
# 2. Label Encoding (tree-based model)
le = LabelEncoder()
X_label = le.fit_transform(df['category_high']).reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
score_rf_label = cross_val_score(rf, X_label, df['target'], cv=5, scoring='accuracy').mean()
results.append(('Label + Random Forest (high card)', score_rf_label))
# 3. Frequency Encoding (high cardinality)
freq_map = df['category_high'].value_counts(normalize=True).to_dict()
X_freq = df['category_high'].map(freq_map).values.reshape(-1, 1)
gb = GradientBoostingClassifier(random_state=42)
score_gb_freq = cross_val_score(gb, X_freq, df['target'], cv=5, scoring='accuracy').mean()
results.append(('Frequency + Gradient Boosting (high card)', score_gb_freq))
# Display results
print("Encoding Method Comparison:")
print("=" * 60)
for method, score in results:
print(f"{method:50s} {score:.4f}")
# Output will vary but shows relative performance
📊 Quick Decision Guide
| Scenario | Best Method | Why? |
|---|---|---|
| Nominal, 2-10 categories, Linear model | One-Hot | No false ordinal relationships |
| Ordinal with clear order | Ordinal | Preserves meaningful ranking |
| Nominal, any size, Tree model | Label | Trees handle it well, memory efficient |
| High cardinality (50+), Supervised | Target | Captures target relationship, 1 column |
| High cardinality, Unsupervised | Frequency | No target needed, simple, fast |
Building a Production-Ready Encoding Pipeline
In real projects, you need to handle multiple categorical features with different encoding strategies in a single pipeline.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({
'color': np.random.choice(['Red', 'Blue', 'Green'], 1000),
'size': np.random.choice(['S', 'M', 'L', 'XL'], 1000),
'education': np.random.choice(['HS', 'Bachelor', 'Master', 'PhD'], 1000),
'city': np.random.choice([f'City_{i}' for i in range(20)], 1000),
'price': np.random.uniform(10, 100, 1000),
'target': np.random.binomial(1, 0.5, 1000)
})
# Define column types
nominal_low = ['color'] # One-hot encode
ordinal = ['size', 'education'] # Ordinal encode
nominal_high = ['city'] # Label encode for tree model
numeric = ['price']
# Define ordinal mappings
size_order = [['S', 'M', 'L', 'XL']]
education_order = [['HS', 'Bachelor', 'Master', 'PhD']]
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(drop='first', sparse_output=False), nominal_low),
('ordinal', OrdinalEncoder(categories=size_order + education_order), ordinal),
('label', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), nominal_high), # integer codes for the tree model; unseen cities map to -1
('numeric', 'passthrough', numeric)
],
remainder='drop' # Drop any other columns
)
# Create full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train
X = df.drop('target', axis=1)
y = df['target']
pipeline.fit(X, y)
# Predict on new data
new_data = pd.DataFrame({
'color': ['Red'],
'size': ['L'],
'education': ['Master'],
'city': ['City_5'],
'price': [45.0]
})
prediction = pipeline.predict(new_data)
print(f"Prediction: {prediction[0]}")
print(f"Model score: {pipeline.score(X, y):.4f}")
Saving and Loading Encoders
import joblib
# Save the entire pipeline (includes fitted encoders)
joblib.dump(pipeline, 'encoding_pipeline.pkl')
# Load and use later
loaded_pipeline = joblib.load('encoding_pipeline.pkl')
# Make predictions with loaded pipeline
new_predictions = loaded_pipeline.predict(new_data)
print(f"Prediction from loaded model: {new_predictions[0]}")
# The pipeline remembers:
# - Category mappings
# - Ordinal orders
# - Unknown category handling
# - Feature names and order
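You can also inspect what the loaded pipeline learned, using the step and transformer names defined above:
ct = loaded_pipeline.named_steps['preprocessor']
print(ct.named_transformers_['onehot'].categories_)  # learned one-hot categories
print(ct.named_transformers_['ordinal'].categories_)  # learned ordinal orders
print(ct.get_feature_names_out())  # final feature names and their order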
Common Pitfalls and How to Avoid Them
1. Data Leakage in Target Encoding
❌ Wrong Way
# WRONG: Calculating target means on entire dataset
target_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_means)
# Then splitting train/test
# Result: Test data leakage! Model has seen test target values
✅ Right Way
# RIGHT: Calculate on training data only
train_target_means = train_df.groupby('category')['target'].mean()
train_df['category_encoded'] = train_df['category'].map(train_target_means)
test_df['category_encoded'] = test_df['category'].map(train_target_means).fillna(train_df['target'].mean())  # unseen categories fall back to the training mean
2. Forgetting to Drop First Column in One-Hot Encoding
# For linear models, always drop one category
# ❌ Wrong: Creates multicollinearity
df_encoded = pd.get_dummies(df, columns=['color']) # 3 columns created
# ✅ Right: Drops first category
df_encoded = pd.get_dummies(df, columns=['color'], drop_first=True) # 2 columns
3. Using Label Encoding for Nominal Categories with Linear Models
# ❌ Wrong: Color has no order, but label encoding creates one
colors = ['Red', 'Blue', 'Green']
# Encoded as [0, 1, 2] - implies Red < Blue < Green
# Linear model will treat (Red + Green) / 2 = Blue (nonsense!)
# ✅ Right: Use one-hot encoding for nominal + linear models
df_encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
4. Not Handling Unknown Categories in Production
from sklearn.preprocessing import OneHotEncoder
# ✅ Handle unknown categories gracefully
encoder = OneHotEncoder(
handle_unknown='ignore', # Don't error on new categories
sparse_output=False
)
# Train on limited categories
train = pd.DataFrame({'city': ['NYC', 'LA', 'SF']})
encoder.fit(train)
# Test with new category
test = pd.DataFrame({'city': ['NYC', 'Chicago']}) # Chicago is new
encoded = encoder.transform(test)
# Chicago gets [0, 0, 0] - all zeros (ignored)
🧠 Knowledge Check
Question 1: When should you use one-hot encoding?
Question 2: What is the dummy variable trap?
Question 3: Which encoding method is safe to use with tree-based models and nominal categories?
Question 4: What's the main risk of target encoding?
Question 5: You have a 'User_ID' feature with 10,000 unique values for a supervised learning task. Best encoding?
Question 6: For education levels (High School, Bachelor, Master, PhD), which encoding preserves order?
Question 7: When should you use frequency encoding?
Question 8: In a production pipeline, how should you handle categories in test data not seen during training?
💻 Practice Exercises
Exercise 1: Multi-Method Encoding
Given this dataset, apply appropriate encoding for each feature:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'country': ['USA', 'UK', 'Canada', 'USA', 'Germany'] * 100, # Nominal, low cardinality
'rating': ['Poor', 'Good', 'Excellent', 'Fair', 'Good'] * 100, # Ordinal
'product_id': [f'P{i}' for i in range(100)] * 5, # Nominal, high cardinality (100 unique values, 500 rows)
'sales': np.random.randint(0, 2, 500) # Target variable
})
# TODO:
# 1. One-hot encode 'country'
# 2. Ordinal encode 'rating' (Poor < Fair < Good < Excellent)
# 3. Target encode 'product_id' with K-fold
# 4. Create a pipeline that combines all encodings
# 5. Evaluate with a Random Forest classifier
Exercise 2: Handling Unknown Categories
Create an encoder that handles new categories in production:
# Training data
train_df = pd.DataFrame({
'city': ['NYC', 'LA', 'SF', 'NYC', 'LA'],
'size': ['S', 'M', 'L', 'M', 'S']
})
# Test data with unknown categories
test_df = pd.DataFrame({
'city': ['NYC', 'Chicago', 'LA'], # Chicago is new
'size': ['M', 'XXL', 'S'] # XXL is new
})
# TODO:
# 1. Create encoders that handle unknown categories
# 2. For city: use frequency encoding, assign 0 to unknown
# 3. For size: use ordinal encoding, assign -1 to unknown
# 4. Transform both train and test sets
Exercise 3: Avoiding Data Leakage
Fix this leaky target encoding implementation:
# WRONG implementation (has data leakage)
df = pd.DataFrame({
'category': ['A', 'B', 'C', 'A', 'B', 'C'] * 100,
'target': np.random.binomial(1, 0.5, 600)
})
# Calculate target means on entire dataset
target_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_means)
# Then split
train_df = df[:500]
test_df = df[500:]
# TODO:
# 1. Identify the data leakage problem
# 2. Fix it using proper train/test split
# 3. Implement K-fold target encoding
# 4. Compare scores with leaky vs proper implementation
Exercise 4: Encoding Comparison
Compare different encoding methods on the same dataset:
from sklearn.datasets import fetch_openml
# Load categorical dataset (adult income dataset)
data = fetch_openml('adult', version=2, as_frame=True)
df = data.frame
# Features: workclass, education, marital-status, occupation, etc.
# Target: income (>50K or <=50K)
# TODO:
# 1. Apply different encodings to 'occupation' (15 unique values)
# - One-hot encoding
# - Label encoding
# - Target encoding
# - Frequency encoding
# 2. Train Logistic Regression with each encoding
# 3. Train Random Forest with each encoding
# 4. Compare accuracy scores
# 5. Analyze which encoding works best for which model
Exercise 5: Production Pipeline
Build a complete production-ready encoding pipeline:
# TODO:
# 1. Create a dataset with mixed categorical types:
# - 2 nominal low cardinality features
# - 1 ordinal feature
# - 1 nominal high cardinality feature (50+ categories)
# - 3 numeric features
#
# 2. Build a ColumnTransformer that:
# - One-hot encodes nominal low cardinality
# - Ordinal encodes ordinal feature
# - Target encodes high cardinality feature
# - StandardScales numeric features
#
# 3. Create a Pipeline with the transformer + classifier
# 4. Implement proper train/test split
# 5. Save the pipeline with joblib
# 6. Load and make predictions on new data
# 7. Add error handling for unknown categories
📝 Summary
You've learned five powerful encoding techniques for transforming categorical variables:
Key Takeaways
- One-Hot Encoding: Best for nominal categories with 2-10 unique values. Creates binary columns. Drop one column for linear models to avoid dummy variable trap.
- Label Encoding: Assigns integers 0, 1, 2... Good for tree-based models and ordinal data, but creates false ordinal relationships for nominal categories with linear models.
- Ordinal Encoding: Like label encoding but with explicit ordering. Essential for true ordinal features like "Small < Medium < Large".
- Target Encoding: Replaces categories with mean target value. Powerful for high cardinality (100+ categories), but requires K-fold implementation to prevent data leakage.
- Frequency Encoding: Replaces with frequency/count. Good for high cardinality in unsupervised learning. Simple and no leakage risk.
🎯 Decision Framework
- Identify category type: Nominal (no order) vs Ordinal (has order)
- Count unique values: Low (2-10) vs High (50+) cardinality
- Consider model type: Linear models need one-hot, tree models handle label encoding
- Check for target variable: Target encoding only for supervised learning
- Prevent data leakage: Always fit encoders on training data only
In the next tutorial, you'll learn about Feature Scaling and Normalization - preparing numerical features for machine learning algorithms. See you there! 🚀
🎯 Ready for the Next Challenge?
Continue your feature engineering journey