Machine learning algorithms work with numbers, not text. When your dataset contains categorical variables like "Color", "City", or "Product Type", you need to convert them into numerical format. This process is called categorical encoding, and choosing the right encoding method can significantly impact your model's performance.
🎯 What You'll Learn
- Five essential encoding techniques and when to use each
- Handling ordinal vs nominal categorical variables
- Avoiding common encoding pitfalls (dummy variable trap, data leakage)
- Implementing encoders in scikit-learn and category_encoders
- Encoding in production pipelines
Understanding Categorical Variables
Before encoding, it's crucial to understand the type of categorical variable you're working with.
Nominal vs Ordinal Categories
Nominal categories have no inherent order:
- Color: Red, Blue, Green (no natural ordering)
- City: New York, London, Tokyo
- Product Type: Electronics, Clothing, Food
Ordinal categories have a meaningful order:
- Education: High School < Bachelor's < Master's < PhD
- Rating: Poor < Fair < Good < Excellent
- Size: Small < Medium < Large < XL
⚠️ Common Mistake
Using label encoding (0, 1, 2...) for nominal categories creates false ordinal relationships. The model will think "Blue" is somehow "between" "Red" and "Green"!
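Here is a quick illustration with hypothetical toy numbers (not from any dataset in this tutorial): encode Red=0, Blue=1, Green=2 and fit a linear model; because its predictions must lie on a straight line, Blue can never stand out from Red and Green.
# Hypothetical toy example: a label-encoded nominal feature in a linear model
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[0], [1], [2], [0], [1], [2]])  # Red=0, Blue=1, Green=2
y = np.array([1.0, 5.0, 2.0, 1.0, 5.0, 2.0])  # Blue actually has the highest target
model = LinearRegression().fit(X, y)
print(model.predict([[0], [1], [2]]))
# ~[2.17 2.67 3.17] - Blue is predicted "between" Red and Green, far from its true value of 5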
# Example: Identifying categorical types
import pandas as pd
import numpy as np
# Create sample dataset
df = pd.DataFrame({
'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
'size': ['Small', 'Medium', 'Large', 'Small', 'XL'],
'city': ['NYC', 'LA', 'SF', 'NYC', 'LA'],
'rating': ['Poor', 'Good', 'Excellent', 'Fair', 'Good']
})
print("Categorical columns:")
print(df.dtypes)
print("\nUnique values:")
for col in df.columns:
    print(f"{col}: {df[col].unique()}")
# Output:
# color object
# size object
# city object
# rating object
#
# Unique values:
# color: ['Red' 'Blue' 'Green'] # Nominal
# size: ['Small' 'Medium' 'Large' 'XL'] # Ordinal
# city: ['NYC' 'LA' 'SF'] # Nominal
# rating: ['Poor' 'Good' 'Excellent' 'Fair'] # Ordinal
Method 1: One-Hot Encoding (Dummy Variables)
Best for: Nominal categories with low cardinality (few unique values)
One-hot encoding creates a new binary column for each category. Only one column has a value of 1 (hot), while others are 0.
How It Works
# Original data
colors = ['Red', 'Blue', 'Green', 'Red']
# After one-hot encoding:
# Red Blue Green
# 1 0 0 (Red)
# 0 1 0 (Blue)
# 0 0 1 (Green)
# 1 0 0 (Red)
Implementation with Pandas
import pandas as pd
# Simple one-hot encoding
df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red', 'Blue']})
# Method 1: Using pd.get_dummies()
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')
print(df_encoded)
# color_Blue color_Green color_Red
# 0 0 0 1
# 1 1 0 0
# 2 0 1 0
# 3 0 0 1
# 4 1 0 0
# Method 2: Drop first column to avoid multicollinearity
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color', drop_first=True)
print(df_encoded)
# color_Green color_Red
# 0 0 1
# 1 0 0
# 2 1 0
# 3 0 1
# 4 0 0
Implementation with Scikit-learn
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Create encoder
encoder = OneHotEncoder(sparse_output=False, drop='first')
# Sample data
data = np.array(['Red', 'Blue', 'Green', 'Red', 'Blue']).reshape(-1, 1)
# Fit and transform
encoded = encoder.fit_transform(data)
print("Encoded shape:", encoded.shape)
print("Categories:", encoder.categories_)
print("\nEncoded data:")
print(encoded)
# Get feature names
feature_names = encoder.get_feature_names_out(['color'])
print("\nFeature names:", feature_names)
# Output:
# Encoded shape: (5, 2)
# Categories: [array(['Blue', 'Green', 'Red'], dtype='<U5')]
#
# Encoded data:
# [[0. 1.]
#  [0. 0.]
#  [1. 0.]
#  [0. 1.]
#  [0. 0.]]
#
# Feature names: ['color_Green' 'color_Red']
💡 The Dummy Variable Trap
When using one-hot encoding with linear models, always drop one category (using drop_first=True). This prevents perfect multicollinearity where columns are linearly dependent.
Why? If you have Red, Blue, and Green columns, knowing two tells you the third: if Red=0 and Blue=0, then Green must be 1.
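You can check this linear dependence directly; a minimal sketch:
import pandas as pd
dummies = pd.get_dummies(pd.Series(['Red', 'Blue', 'Green', 'Red'], name='color'))
print(dummies.sum(axis=1))  # every row sums to 1, so any one column equals 1 minus the sum of the others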
Pros and Cons
✅ Advantages:
- No false ordinal relationships created
- Works well with linear models
- Easy to interpret
❌ Disadvantages:
- Creates many columns for high-cardinality features (curse of dimensionality; see the sketch below)
- Sparse matrices (mostly zeros) require more memory
- Poor for categories with 50+ unique values
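To see how quickly columns pile up, here is a rough sketch using a made-up high-cardinality user_id column:
import numpy as np
import pandas as pd
user_ids = pd.Series(np.random.choice([f'user_{i}' for i in range(1000)], size=5000), name='user_id')
print(pd.get_dummies(user_ids).shape)  # roughly (5000, 1000): one binary column per unique user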
Method 2: Label Encoding
Best for: Ordinal categories (with natural order) and tree-based models
Label encoding assigns each category a unique integer (0, 1, 2, 3...). Simple and memory-efficient, but creates ordinal relationships.
Implementation
from sklearn.preprocessing import LabelEncoder
# For ordinal categories (has order)
education = ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
encoder = LabelEncoder()
encoded = encoder.fit_transform(education)
print("Original:", education)
print("Encoded:", encoded)
print("Classes:", encoder.classes_)
# Output:
# Original: ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
# Encoded: [1 0 2 3 0 2]
# Classes: ['Bachelor' 'High School' 'Master' 'PhD']
# Note: Alphabetical order! We need to fix this for true ordinal data
Handling True Ordinal Data
import pandas as pd
# Define the correct order
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
# Method 1: Using pandas Categorical with ordered=True
df = pd.DataFrame({'education': ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor']})
df['education_encoded'] = pd.Categorical(
df['education'],
categories=education_order,
ordered=True
).codes
print(df)
# education education_encoded
# 0 Bachelor 1
# 1 Master 2
# 2 High School 0
# 3 PhD 3
# 4 Bachelor 1
# Method 2: Using OrdinalEncoder from scikit-learn
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[education_order])
df['education_encoded2'] = encoder.fit_transform(df[['education']])
print("\nWith OrdinalEncoder:")
print(df)
When to Use Label Encoding
✅ Good Use Cases
- Tree-based models: Random Forest, XGBoost, LightGBM can handle label encoded data well because they split on thresholds
- Ordinal categories: When there's a clear ranking (Small → Medium → Large)
- Target variable: For classification tasks (converting class labels; see the sketch below)
❌ Avoid For
- Nominal categories with linear models: Creates false ordinal relationships
- High cardinality nominal data: Random integer assignments are meaningless
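For the target-variable use case, LabelEncoder is the standard tool; a minimal sketch with made-up class labels:
from sklearn.preprocessing import LabelEncoder
y = ['cat', 'dog', 'cat', 'bird']
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(y_encoded)  # [1 2 1 0] - alphabetical order: bird=0, cat=1, dog=2
print(le.inverse_transform(y_encoded))  # recover the original class labels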
# Example: Label encoding with tree models
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import numpy as np
# Sample data: City (nominal) and Purchase (target)
cities = ['NYC', 'LA', 'SF', 'NYC', 'LA', 'SF', 'NYC', 'LA']
purchases = [1, 0, 1, 1, 0, 1, 1, 0]
# Label encode cities
le = LabelEncoder()
cities_encoded = le.fit_transform(cities).reshape(-1, 1)
# Train Random Forest (tree-based model handles label encoding)
rf = RandomForestClassifier(random_state=42)
rf.fit(cities_encoded, purchases)
print("Feature importance:", rf.feature_importances_)
print("Accuracy:", rf.score(cities_encoded, purchases))
# Trees will split: "if city <= 1.5 then..."
# This works because trees find patterns, not linear relationships
Method 3: Ordinal Encoding
Best for: Categories with explicit ranking/order
Ordinal encoding is similar to label encoding but allows you to explicitly define the order. Perfect for categories like "Low", "Medium", "High".
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
# Data with multiple ordinal features
df = pd.DataFrame({
'size': ['M', 'L', 'S', 'XL', 'M', 'S'],
'priority': ['Low', 'High', 'Medium', 'Low', 'High', 'Medium'],
'rating': ['Fair', 'Excellent', 'Poor', 'Good', 'Excellent', 'Fair']
})
# Define ordering for each feature
size_order = ['S', 'M', 'L', 'XL']
priority_order = ['Low', 'Medium', 'High']
rating_order = ['Poor', 'Fair', 'Good', 'Excellent']
# Create encoder with explicit categories
encoder = OrdinalEncoder(
categories=[size_order, priority_order, rating_order]
)
# Fit and transform
df_encoded = pd.DataFrame(
encoder.fit_transform(df),
columns=['size_enc', 'priority_enc', 'rating_enc']
)
# Combine with original
result = pd.concat([df, df_encoded], axis=1)
print(result)
# size priority rating size_enc priority_enc rating_enc
# 0 M Low Fair 1.0 0.0 1.0
# 1 L High Excellent 2.0 2.0 3.0
# 2 S Medium Poor 0.0 1.0 0.0
# 3 XL Low Good 3.0 0.0 2.0
# 4 M High Excellent 1.0 2.0 3.0
# 5 S Medium Fair 0.0 1.0 1.0
Handling Unknown Categories
from sklearn.preprocessing import OrdinalEncoder
# Training data
train_data = pd.DataFrame({'size': ['S', 'M', 'L']})
# Create encoder that handles unknown categories
encoder = OrdinalEncoder(
categories=[['S', 'M', 'L', 'XL']],
handle_unknown='use_encoded_value',
unknown_value=-1 # Assign -1 to unknown categories
)
encoder.fit(train_data)
# Test data with new category
test_data = pd.DataFrame({'size': ['M', 'XXL', 'L']}) # XXL is unknown
encoded_test = encoder.transform(test_data)
print("Test data encoded:")
print(encoded_test)
# Output:
# [[ 1.] # M
# [-1.] # XXL (unknown)
# [ 2.]] # L
Method 4: Target Encoding (Mean Encoding)
Best for: High cardinality nominal categories (many unique values)
Target encoding replaces each category with the mean of the target variable for that category. Very powerful but requires careful implementation to avoid data leakage.
How It Works
# Example: City and Purchase behavior
# City | Purchase
# NYC | 1
# LA | 0
# NYC | 1
# SF | 1
# LA | 0
# NYC | 1
# Target encoding calculates mean purchase rate per city:
# NYC: (1+1+1)/3 = 1.0
# LA: (0+0)/2 = 0.0
# SF: (1)/1 = 1.0
# Then replaces city names with these values
Basic Implementation
import pandas as pd
import numpy as np
# Sample data
df = pd.DataFrame({
'city': ['NYC', 'LA', 'SF', 'NYC', 'LA', 'SF', 'NYC', 'Chicago'],
'purchase': [1, 0, 1, 1, 0, 1, 1, 0]
})
# Calculate mean purchase rate per city
target_means = df.groupby('city')['purchase'].mean()
print("Target means per city:")
print(target_means)
# Output:
# city
# Chicago 0.0
# LA 0.0
# NYC 1.0
# SF 1.0
# Apply target encoding
df['city_encoded'] = df['city'].map(target_means)
print("\nEncoded data:")
print(df)
⚠️ Critical: Avoiding Data Leakage
The Problem: Basic target encoding uses information from the target variable, which causes overfitting. The model learns to memorize training patterns rather than generalize.
Solution: Use cross-validation or leave-one-out encoding.
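A minimal leave-one-out sketch, reusing the df, 'city', and 'purchase' columns from above (each row is encoded from every other row's target, never its own):
stats = df.groupby('city')['purchase'].agg(['sum', 'count'])
sums = df['city'].map(stats['sum'])
counts = df['city'].map(stats['count'])
global_mean = df['purchase'].mean()
# Subtract each row's own target before averaging; single-occurrence categories fall back to the global mean
df['city_loo'] = ((sums - df['purchase']) / (counts - 1)).fillna(global_mean)
print(df)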
Proper Implementation: K-Fold Target Encoding
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
def target_encode_kfold(df, categorical_col, target_col, n_folds=5):
    """
    Perform target encoding with K-fold cross-validation to prevent leakage.
    """
    # Initialize encoded column
    encoded_col = np.zeros(len(df))
    # Calculate global mean for unseen categories
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Split data
        train_df = df.iloc[train_idx]
        # Calculate target means from training fold only
        target_means = train_df.groupby(categorical_col)[target_col].mean()
        # Apply to validation fold
        encoded_col[val_idx] = df.iloc[val_idx][categorical_col].map(target_means)
        # Fill unknown categories with global mean
        encoded_col[val_idx] = np.where(
            np.isnan(encoded_col[val_idx]),
            global_mean,
            encoded_col[val_idx]
        )
    return encoded_col
# Example usage
df = pd.DataFrame({
'city': ['NYC', 'LA', 'SF'] * 30 + ['Chicago'] * 10,
'purchase': np.random.binomial(1, 0.6, 100)
})
df['city_encoded'] = target_encode_kfold(df, 'city', 'purchase', n_folds=5)
print(df.head(10))
print("\nEncoding statistics:")
print(df.groupby('city')['city_encoded'].mean())
Using Category Encoders Library
# Install: pip install category_encoders
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Sample high-cardinality data
df = pd.DataFrame({
'product_id': [f'P{i%50}' for i in range(1000)], # 50 unique products
'sales': np.random.randint(0, 2, 1000)
})
# Split data
X = df[['product_id']]
y = df['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create target encoder with smoothing
encoder = TargetEncoder(smoothing=1.0) # Smoothing prevents overfitting
# Fit on training data only
encoder.fit(X_train, y_train)
# Transform both sets
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
print("Training set encoded:")
print(X_train_encoded.head())
print("\nTest set encoded:")
print(X_test_encoded.head())
Pros and Cons
✅ Advantages:
- Handles high cardinality (100s or 1000s of categories) efficiently
- Creates only one column per feature (no dimensionality explosion)
- Captures relationship between category and target
❌ Disadvantages:
- Risk of data leakage if not implemented carefully
- Can overfit on rare categories (see the smoothing sketch below)
- Only works for supervised learning (requires target variable)
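A common mitigation for rare categories is to shrink each category mean toward the global mean; a minimal sketch (the hypothetical parameter m controls the shrinkage, and category_encoders uses a related but not identical formula):
import pandas as pd
def smoothed_target_means(df, cat_col, target_col, m=10.0):
    """Blend each category's mean with the global mean; categories with few rows are pulled furthest."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    return (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)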
Method 5: Frequency Encoding
Best for: High cardinality nominal categories, unsupervised learning
Frequency encoding replaces each category with its frequency count or percentage in the dataset. Simple, fast, and no data leakage risk.
Implementation
import pandas as pd
# Sample data with high cardinality
df = pd.DataFrame({
'user_id': ['U1', 'U2', 'U1', 'U3', 'U1', 'U2', 'U4', 'U1'],
'product': ['A', 'B', 'A', 'C', 'A', 'B', 'D', 'A']
})
# Method 1: Count frequency
df['user_frequency'] = df['user_id'].map(df['user_id'].value_counts())
df['product_frequency'] = df['product'].map(df['product'].value_counts())
print("Frequency encoding:")
print(df)
# user_id product user_frequency product_frequency
# 0 U1 A 4 4
# 1 U2 B 2 2
# 2 U1 A 4 4
# 3 U3 C 1 1
# 4 U1 A 4 4
# 5 U2 B 2 2
# 6 U4 D 1 1
# 7 U1 A 4 4
# Method 2: Normalize to percentage
df['user_freq_pct'] = df['user_id'].map(
df['user_id'].value_counts(normalize=True)
)
print("\nNormalized frequency:")
print(df[['user_id', 'user_frequency', 'user_freq_pct']])
Handling Training and Test Sets
from sklearn.base import BaseEstimator, TransformerMixin
class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Frequency encoder that handles train/test splits properly.
    """
    def __init__(self, normalize=False):
        self.normalize = normalize
        self.frequency_map = {}
    def fit(self, X, y=None):
        """Learn frequency mapping from training data."""
        for col in X.columns:
            if self.normalize:
                self.frequency_map[col] = X[col].value_counts(normalize=True).to_dict()
            else:
                self.frequency_map[col] = X[col].value_counts().to_dict()
        return self
    def transform(self, X):
        """Apply frequency encoding."""
        X_encoded = X.copy()
        for col in X.columns:
            # Map known categories, unknown get 0
            X_encoded[col] = X[col].map(self.frequency_map[col]).fillna(0)
        return X_encoded
# Example usage
from sklearn.model_selection import train_test_split
# Generate data
df = pd.DataFrame({
'city': ['NYC'] * 50 + ['LA'] * 30 + ['SF'] * 20 + ['Chicago'] * 10
})
# Split data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
# Fit encoder on training data
encoder = FrequencyEncoder(normalize=True)
encoder.fit(train_df[['city']])
# Transform both sets
train_encoded = encoder.transform(train_df[['city']])
test_encoded = encoder.transform(test_df[['city']])
print("Training frequencies:")
print(encoder.frequency_map)
print("\nTest set encoded (first 5 rows):")
print(test_encoded.head())
When to Use Frequency Encoding
✅ Good Use Cases
- High cardinality features: User IDs, product IDs with 1000+ unique values
- Unsupervised learning: No target variable available (unlike target encoding)
- Time series: Capture popularity trends over time (see the sketch below)
- Quick baseline: Fast and simple implementation
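For the time-series use case, count only past occurrences so there is no look-ahead; a minimal sketch with hypothetical 'date' and 'product' columns:
import pandas as pd
events = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-04']),
    'product': ['A', 'A', 'B', 'A', 'B'],
}).sort_values('date')
events['prior_count'] = events.groupby('product').cumcount()  # occurrences of this product before the current row
print(events)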
Comparing Encoding Methods
Here's a practical comparison to help you choose the right encoding method:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Generate sample data
np.random.seed(42)
n_samples = 1000
df = pd.DataFrame({
'category_low': np.random.choice(['A', 'B', 'C'], n_samples), # 3 categories
'category_high': np.random.choice([f'Cat_{i}' for i in range(50)], n_samples), # 50 categories
'target': np.random.binomial(1, 0.5, n_samples)
})
# Test different encoding methods
results = []
# 1. One-Hot Encoding (low cardinality)
ohe = OneHotEncoder(sparse_output=False, drop='first')
X_onehot = ohe.fit_transform(df[['category_low']])
lr = LogisticRegression(max_iter=1000)
score_lr_onehot = cross_val_score(lr, X_onehot, df['target'], cv=5, scoring='accuracy').mean()
results.append(('One-Hot + Logistic Regression (low card)', score_lr_onehot))
# 2. Label Encoding (tree-based model)
le = LabelEncoder()
X_label = le.fit_transform(df['category_high']).reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
score_rf_label = cross_val_score(rf, X_label, df['target'], cv=5, scoring='accuracy').mean()
results.append(('Label + Random Forest (high card)', score_rf_label))
# 3. Frequency Encoding (high cardinality)
freq_map = df['category_high'].value_counts(normalize=True).to_dict()
X_freq = df['category_high'].map(freq_map).values.reshape(-1, 1)
gb = GradientBoostingClassifier(random_state=42)
score_gb_freq = cross_val_score(gb, X_freq, df['target'], cv=5, scoring='accuracy').mean()
results.append(('Frequency + Gradient Boosting (high card)', score_gb_freq))
# Display results
print("Encoding Method Comparison:")
print("=" * 60)
for method, score in results:
print(f"{method:50s} {score:.4f}")
# Output will vary but shows relative performance
📊 Quick Decision Guide
| Scenario | Best Method | Why? |
|---|---|---|
| Nominal, 2-10 categories, Linear model | One-Hot | No false ordinal relationships |
| Ordinal with clear order | Ordinal | Preserves meaningful ranking |
| Nominal, any size, Tree model | Label | Trees handle it well, memory efficient |
| High cardinality (50+), Supervised | Target | Captures target relationship, 1 column |
| High cardinality, Unsupervised | Frequency | No target needed, simple, fast |
Building a Production-Ready Encoding Pipeline
In real projects, you need to handle multiple categorical features with different encoding strategies in a single pipeline.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Sample dataset
df = pd.DataFrame({
'color': np.random.choice(['Red', 'Blue', 'Green'], 1000),
'size': np.random.choice(['S', 'M', 'L', 'XL'], 1000),
'education': np.random.choice(['HS', 'Bachelor', 'Master', 'PhD'], 1000),
'city': np.random.choice([f'City_{i}' for i in range(20)], 1000),
'price': np.random.uniform(10, 100, 1000),
'target': np.random.binomial(1, 0.5, 1000)
})
# Define column types
nominal_low = ['color'] # One-hot encode
ordinal = ['size', 'education'] # Ordinal encode
nominal_high = ['city'] # Label encode for tree model
numeric = ['price']
# Define ordinal mappings
size_order = [['S', 'M', 'L', 'XL']]
education_order = [['HS', 'Bachelor', 'Master', 'PhD']]
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(drop='first', sparse_output=False), nominal_low),
('ordinal', OrdinalEncoder(categories=size_order + education_order), ordinal),
('label', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), nominal_high), # integer codes for the tree model; unseen cities map to -1
('numeric', 'passthrough', numeric)
],
remainder='drop' # Drop any other columns
)
# Create full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train
X = df.drop('target', axis=1)
y = df['target']
pipeline.fit(X, y)
# Predict on new data
new_data = pd.DataFrame({
'color': ['Red'],
'size': ['L'],
'education': ['Master'],
'city': ['City_5'],
'price': [45.0]
})
prediction = pipeline.predict(new_data)
print(f"Prediction: {prediction[0]}")
print(f"Model score: {pipeline.score(X, y):.4f}")
Saving and Loading Encoders
import joblib
# Save the entire pipeline (includes fitted encoders)
joblib.dump(pipeline, 'encoding_pipeline.pkl')
# Load and use later
loaded_pipeline = joblib.load('encoding_pipeline.pkl')
# Make predictions with loaded pipeline
new_predictions = loaded_pipeline.predict(new_data)
print(f"Prediction from loaded model: {new_predictions[0]}")
# The pipeline remembers:
# - Category mappings
# - Ordinal orders
# - Unknown category handling
# - Feature names and order
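You can also inspect what the loaded pipeline learned, using the step and transformer names defined above:
ct = loaded_pipeline.named_steps['preprocessor']
print(ct.named_transformers_['onehot'].categories_)  # learned one-hot categories
print(ct.named_transformers_['ordinal'].categories_)  # learned ordinal orders
print(ct.get_feature_names_out())  # final feature names and their order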
Common Pitfalls and How to Avoid Them
1. Data Leakage in Target Encoding
❌ Wrong Way
# WRONG: Calculating target means on entire dataset
target_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_means)
# Then splitting train/test
# Result: Test data leakage! Model has seen test target values
✅ Right Way
# RIGHT: Calculate on training data only
train_target_means = train_df.groupby('category')['target'].mean()
train_df['category_encoded'] = train_df['category'].map(train_target_means)
test_df['category_encoded'] = test_df['category'].map(train_target_means).fillna(train_df['target'].mean())  # unseen categories fall back to the training mean
2. Forgetting to Drop First Column in One-Hot Encoding
# For linear models, always drop one category
# ❌ Wrong: Creates multicollinearity
df_encoded = pd.get_dummies(df, columns=['color']) # 3 columns created
# ✅ Right: Drops first category
df_encoded = pd.get_dummies(df, columns=['color'], drop_first=True) # 2 columns
3. Using Label Encoding for Nominal Categories with Linear Models
# ❌ Wrong: Color has no order, but label encoding creates one
colors = ['Red', 'Blue', 'Green']
# Encoded as [0, 1, 2] - implies Red < Blue < Green
# Linear model will treat (Red + Green) / 2 = Blue (nonsense!)
# ✅ Right: Use one-hot encoding for nominal + linear models
df_encoded = pd.get_dummies(df, columns=['color'], drop_first=True)
4. Not Handling Unknown Categories in Production
from sklearn.preprocessing import OneHotEncoder
# ✅ Handle unknown categories gracefully
encoder = OneHotEncoder(
handle_unknown='ignore', # Don't error on new categories
sparse_output=False
)
# Train on limited categories
train = pd.DataFrame({'city': ['NYC', 'LA', 'SF']})
encoder.fit(train)
# Test with new category
test = pd.DataFrame({'city': ['NYC', 'Chicago']}) # Chicago is new
encoded = encoder.transform(test)
# Chicago gets [0, 0, 0] - all zeros (ignored)
🧠 Knowledge Check
Question 1: When should you use one-hot encoding?
Question 2: What is the dummy variable trap?
Question 3: Which encoding method is safe to use with tree-based models and nominal categories?
Question 4: What's the main risk of target encoding?
Question 5: You have a 'User_ID' feature with 10,000 unique values for a supervised learning task. Best encoding?
Question 6: For education levels (High School, Bachelor, Master, PhD), which encoding preserves order?
Question 7: When should you use frequency encoding?
Question 8: In a production pipeline, how should you handle categories in test data not seen during training?
💻 Practice Exercises
Exercise 1: Multi-Method Encoding
Given this dataset, apply appropriate encoding for each feature:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'country': ['USA', 'UK', 'Canada', 'USA', 'Germany'] * 100, # Nominal, low cardinality
'rating': ['Poor', 'Good', 'Excellent', 'Fair', 'Good'] * 100, # Ordinal
'product_id': [f'P{i}' for i in range(100)] * 5, # Nominal, high cardinality (100 unique values, 500 rows)
'sales': np.random.randint(0, 2, 500) # Target variable
})
# TODO:
# 1. One-hot encode 'country'
# 2. Ordinal encode 'rating' (Poor < Fair < Good < Excellent)
# 3. Target encode 'product_id' with K-fold
# 4. Create a pipeline that combines all encodings
# 5. Evaluate with a Random Forest classifier
Exercise 2: Handling Unknown Categories
Create an encoder that handles new categories in production:
# Training data
train_df = pd.DataFrame({
'city': ['NYC', 'LA', 'SF', 'NYC', 'LA'],
'size': ['S', 'M', 'L', 'M', 'S']
})
# Test data with unknown categories
test_df = pd.DataFrame({
'city': ['NYC', 'Chicago', 'LA'], # Chicago is new
'size': ['M', 'XXL', 'S'] # XXL is new
})
# TODO:
# 1. Create encoders that handle unknown categories
# 2. For city: use frequency encoding, assign 0 to unknown
# 3. For size: use ordinal encoding, assign -1 to unknown
# 4. Transform both train and test sets
Exercise 3: Avoiding Data Leakage
Fix this leaky target encoding implementation:
# WRONG implementation (has data leakage)
df = pd.DataFrame({
'category': ['A', 'B', 'C', 'A', 'B', 'C'] * 100,
'target': np.random.binomial(1, 0.5, 600)
})
# Calculate target means on entire dataset
target_means = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_means)
# Then split
train_df = df[:500]
test_df = df[500:]
# TODO:
# 1. Identify the data leakage problem
# 2. Fix it using proper train/test split
# 3. Implement K-fold target encoding
# 4. Compare scores with leaky vs proper implementation
Exercise 4: Encoding Comparison
Compare different encoding methods on the same dataset:
from sklearn.datasets import fetch_openml
# Load categorical dataset (adult income dataset)
data = fetch_openml('adult', version=2, as_frame=True)
df = data.frame
# Features: workclass, education, marital-status, occupation, etc.
# Target: income (>50K or <=50K)
# TODO:
# 1. Apply different encodings to 'occupation' (15 unique values)
# - One-hot encoding
# - Label encoding
# - Target encoding
# - Frequency encoding
# 2. Train Logistic Regression with each encoding
# 3. Train Random Forest with each encoding
# 4. Compare accuracy scores
# 5. Analyze which encoding works best for which model
Exercise 5: Production Pipeline
Build a complete production-ready encoding pipeline:
# TODO:
# 1. Create a dataset with mixed categorical types:
# - 2 nominal low cardinality features
# - 1 ordinal feature
# - 1 nominal high cardinality feature (50+ categories)
# - 3 numeric features
#
# 2. Build a ColumnTransformer that:
# - One-hot encodes nominal low cardinality
# - Ordinal encodes ordinal feature
# - Target encodes high cardinality feature
# - StandardScales numeric features
#
# 3. Create a Pipeline with the transformer + classifier
# 4. Implement proper train/test split
# 5. Save the pipeline with joblib
# 6. Load and make predictions on new data
# 7. Add error handling for unknown categories
📝 Summary
You've learned five powerful encoding techniques for transforming categorical variables:
Key Takeaways
- One-Hot Encoding: Best for nominal categories with 2-10 unique values. Creates binary columns. Drop one column for linear models to avoid dummy variable trap.
- Label Encoding: Assigns integers 0, 1, 2... Good for tree-based models and ordinal data, but creates false ordinal relationships for nominal categories with linear models.
- Ordinal Encoding: Like label encoding but with explicit ordering. Essential for true ordinal features like "Small < Medium < Large".
- Target Encoding: Replaces categories with mean target value. Powerful for high cardinality (100+ categories), but requires K-fold implementation to prevent data leakage.
- Frequency Encoding: Replaces with frequency/count. Good for high cardinality in unsupervised learning. Simple and no leakage risk.
🎯 Decision Framework
- Identify category type: Nominal (no order) vs Ordinal (has order)
- Count unique values: Low (2-10) vs High (50+) cardinality
- Consider model type: Linear models need one-hot, tree models handle label encoding
- Check for target variable: Target encoding only for supervised learning
- Prevent data leakage: Always fit encoders on training data only
In the next tutorial, you'll learn about Feature Scaling and Normalization - preparing numerical features for machine learning algorithms. See you there! 🚀
🎯 Ready for the Next Challenge?
Continue your feature engineering journey