PY-L5-18 · Feature Engineering — Making Data Usable

Learning Goals

3 min

Encode categories (one-hot, ordinal) the scikit-learn way.
Scale numeric features (Standard, MinMax) inside a pipeline.
Bin continuous values and create interaction / ratio features.
Use ColumnTransformer to apply different transforms per column.

Warm-Up · The Feature Is the Model

5 min

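Predicting house price? A raw "year built" is awkward to reason about, while "age = current_year − year_built" states directly what matters. A ratio such as "rooms per floor" can add signal that neither column carries alone. Same data, smarter features, often a better model, without changing the algorithm at all. One caution: a feature computed from the target itself (for example "price per square foot" when you are predicting price) leaks the answer and must not be used as an input.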

Today's big idea

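Algorithms are commodities; features are where domain knowledge lives. Encoding is required hygiene for most estimators, and scaling matters for distance- and gradient-based models (tree ensembles such as random forests are largely insensitive to it). Creating new features from your understanding of the problem is where the real wins come from.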

New Concept · The Transform Toolkit

14 min

Encoding categories

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# nominal (no order): one-hot
OneHotEncoder(handle_unknown="ignore")

# ordinal (has order): map to 0,1,2...
OrdinalEncoder(categories=[["small", "medium", "large"]])

Scaling numbers

from sklearn.preprocessing import StandardScaler, MinMaxScaler

StandardScaler()    # mean 0, std 1 — the default
MinMaxScaler()      # squash to [0, 1] — when bounds matter
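
A quick sketch of what each scaler does to a single column. The four income values are made up for illustration. It also shows that a scaler fitted on one set of rows can map later rows outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical incomes; scalers expect a 2-D array (rows x columns).
income = np.array([[3000.0], [5000.0], [7000.0], [9000.0]])

# StandardScaler subtracts the mean and divides by the (population) std.
standardized = StandardScaler().fit_transform(income)
print(standardized.ravel().round(3))   # [-1.342 -0.447  0.447  1.342]

# MinMaxScaler maps the fitted min to 0 and the fitted max to 1.
minmax = MinMaxScaler().fit(income)
squashed = minmax.transform(income)
print(squashed.ravel().round(3))       # [0.    0.333 0.667 1.   ]

# A later value above the fitted max lands above 1.
later = minmax.transform(np.array([[12000.0]]))
print(later.ravel())                   # [1.5]
```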

Binning continuous values

import pandas as pd
df["age_group"] = pd.cut(df["age"],
    bins=[0, 12, 19, 35, 60, 200],
    labels=["child", "teen", "young", "adult", "senior"],
    include_lowest=True)   # otherwise age 0 falls outside the first bin and becomes NaN

Binning turns a number into categories — useful when the relationship is non-linear (e.g., risk by age bracket).
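
Bin edges are where mistakes hide, so here is a small sketch that runs the same bins on hand-picked boundary ages (the ages are invented for illustration). By default pd.cut uses right-closed intervals, so 12 is still a "child" and 13 is a "teen". The sketch passes include_lowest=True so that age 0 lands in the first bucket.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or next to the bin edges.
ages = pd.DataFrame({"age": [0, 12, 13, 19, 35, 36, 61]})

ages["age_group"] = pd.cut(
    ages["age"],
    bins=[0, 12, 19, 35, 60, 200],
    labels=["child", "teen", "young", "adult", "senior"],
    include_lowest=True,   # makes the first interval [0, 12] instead of (0, 12]
)

groups = ages["age_group"].astype(str).tolist()
print(groups)
# ['child', 'child', 'teen', 'teen', 'young', 'adult', 'senior']
```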

Creating features

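df["age"]             = 2026 - df["year_built"]        # 2026 is the reference year; update as needed
df["price_per_sqft"]  = df["price"] / df["sqft"]       # leaks the target if you are predicting price
df["rooms_per_floor"] = df["rooms"] / df["floors"]
df["is_weekend"]      = df["date"].dt.dayofweek >= 5   # needs a datetime column; 5 = Sat, 6 = Sun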
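
The one-liners above assume a df that already has these columns. Here is a runnable sketch on three invented houses; the homes table and the REFERENCE_YEAR constant are assumptions for illustration. It skips the price-based ratio on purpose, on the assumption that price is the value being predicted.

```python
import pandas as pd

REFERENCE_YEAR = 2026  # assumed "current year"; hypothetical constant

# Hypothetical listings, invented for this example.
homes = pd.DataFrame({
    "year_built": [1990, 2010, 2020],
    "rooms": [6, 8, 9],
    "floors": [1, 2, 3],
    "date": pd.to_datetime(["2024-06-01", "2024-06-03", "2024-06-09"]),
})

# Difference feature: how old the house is.
homes["age"] = REFERENCE_YEAR - homes["year_built"]

# Ratio feature: how rooms are spread across floors.
homes["rooms_per_floor"] = homes["rooms"] / homes["floors"]

# Date-part feature: dayofweek is 0 for Monday through 6 for Sunday.
homes["is_weekend"] = homes["date"].dt.dayofweek >= 5

print(homes[["age", "rooms_per_floor", "is_weekend"]])
```

2024-06-01 is a Saturday and 2024-06-09 is a Sunday, so the first and last rows are flagged as weekend listings.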

ColumnTransformer — different transforms per column

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pre = ColumnTransformer([
    ("num", StandardScaler(),               ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
])

model = Pipeline([
    ("prep", pre),
    ("clf",  RandomForestClassifier(random_state=0)),
])
model.fit(X_train, y_train)

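This is the professional pattern: one object that knows how to transform each column and train the model. Because the scaler and encoder live inside the pipeline, calling fit on the training rows alone (or passing the pipeline to cross_val_score) means their statistics never see the test rows, which removes preprocessing leakage. The whole thing deploys as a single unit.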

Worked Example · A Real Preprocessing Pipeline

12 min

# pipeline.py — mixed-type data, one clean pipeline
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# np.nan (not None) marks missing values, matching SimpleImputer's
# default missing_values=np.nan for both numeric and text columns
df = pd.DataFrame({
    "age":    [25, 40, 35, np.nan, 52, 29, 60, 41],
    "income": [3000, 8000, 5000, 4500, 12000, 3200, 9000, 7000],
    "city":   ["KL", "JB", "KL", "Penang", "JB", "KL", np.nan, "Penang"],
    "churn":  [0, 1, 0, 0, 1, 0, 1, 1],
})
y = df.pop("churn")

num_cols = ["age", "income"]
cat_cols = ["city"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

model = Pipeline([("prep", pre),
                  ("clf", RandomForestClassifier(random_state=0))])

print("CV accuracy:", cross_val_score(model, df, y, cv=4).mean().round(3))

Read the diff

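Numeric columns get median-imputed then scaled; the categorical column gets mode-imputed then one-hot encoded, all in one object. The missing age and the missing city are filled in automatically, and a city that was never seen during training is encoded as all zeros instead of raising an error. You can hand model raw rows with the same columns and it applies the same cleaning every time. With only 8 rows the printed accuracy is illustrative, not a meaningful estimate. This single-object structure is common in production ML systems.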
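
If you want to check what the pipeline learned, you can open it up after fitting. The sketch below rebuilds the same pipeline, fits it on all 8 rows, reads the imputers' learned fill values, and then predicts on two messy new rows. The new_customers table is invented for illustration, and missing values are written as np.nan on the assumption that SimpleImputer is left at its default missing_values=np.nan.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age":    [25, 40, 35, np.nan, 52, 29, 60, 41],
    "income": [3000, 8000, 5000, 4500, 12000, 3200, 9000, 7000],
    "city":   ["KL", "JB", "KL", "Penang", "JB", "KL", np.nan, "Penang"],
    "churn":  [0, 1, 0, 0, 1, 0, 1, 1],
})
y = df.pop("churn")
num_cols, cat_cols = ["age", "income"], ["city"]

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
model = Pipeline([("prep", pre),
                  ("clf", RandomForestClassifier(random_state=0))])
model.fit(df, y)

# Reach into the fitted pipeline: step name -> transformer name -> step name.
prep = model.named_steps["prep"]
num_imputer = prep.named_transformers_["num"].named_steps["imp"]
cat_imputer = prep.named_transformers_["cat"].named_steps["imp"]
learned_medians = dict(zip(num_cols, num_imputer.statistics_))
learned_mode_city = cat_imputer.statistics_[0]
print(learned_medians)     # age median 40.0, income median 6000.0
print(learned_mode_city)   # KL

# Hypothetical new rows: a missing age, an unseen city, a missing city.
new_customers = pd.DataFrame({
    "age": [np.nan, 33],
    "income": [6000, 4000],
    "city": ["Ipoh", np.nan],
})
predictions = model.predict(new_customers)
print(predictions)
```

The fill values are the median and mode of the rows passed to fit, which is why fitting on training data only matters.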

Try It Yourself

13 min

01 🟢 Engineer one feature

On any dataset, create one new feature from existing columns (a ratio, a difference, or a date part). Show it improves (or doesn't) CV accuracy.

02 🟡 Bin a number

Bin a continuous column into 4 sensible buckets with pd.cut. Plot a countplot of the buckets.

03 🔴 Full ColumnTransformer

Build a ColumnTransformer for a dataset with both numeric and categorical columns, impute + scale + encode, and train a model. Compare CV accuracy to a naive "drop all categoricals" baseline.

Mini-Challenge · Feature Ablation

8 min

Measure each feature's contribution by "ablation": train with all features, then drop each one and re-measure CV accuracy. The biggest accuracy drop = the most important feature. Print a ranked list.

One possible solution

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# X: DataFrame of numeric features with no missing values; y: class labels
base = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
drops = {}
for col in X.columns:
    reduced = X.drop(columns=[col])
    acc = cross_val_score(RandomForestClassifier(random_state=0),
                          reduced, y, cv=5).mean()
    drops[col] = base - acc      # how much we lost by removing it

for col, d in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"  {col:<20} contributes {d:+.3f}")

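Non-negotiables: a baseline, leave-one-feature-out re-training, ranked contributions. Ablation is a model-agnostic way to find what matters, though small differences can be cross-validation noise and correlated features can mask each other's contribution.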
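
The solution above assumes X is already numeric and complete. For mixed-type data, one option is to rebuild the preprocessing pipeline for each reduced column set. The sketch below does that on synthetic data where churn is generated mostly from income. The data, the build_model and cv_accuracy helpers, and n_estimators=50 are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (invented): churn is driven by income plus some noise.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "income": rng.normal(6000, 2000, n),
    "city": rng.choice(["KL", "JB", "Penang"], n),
})
y = (X["income"] + rng.normal(0, 500, n) > 6000).astype(int)

NUMERIC = ("age", "income")   # which columns count as numeric
CATEGORICAL = ("city",)       # which columns count as categorical


def build_model(num_cols, cat_cols):
    """Build a fresh pipeline for whichever columns are still present."""
    transformers = []
    if num_cols:
        transformers.append(("num", Pipeline([
            ("imp", SimpleImputer(strategy="median")),
            ("sc", StandardScaler())]), num_cols))
    if cat_cols:
        transformers.append(("cat", Pipeline([
            ("imp", SimpleImputer(strategy="most_frequent")),
            ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols))
    return Pipeline([
        ("prep", ColumnTransformer(transformers)),
        ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])


def cv_accuracy(columns):
    """Mean 5-fold CV accuracy using only the given columns."""
    num_cols = [c for c in columns if c in NUMERIC]
    cat_cols = [c for c in columns if c in CATEGORICAL]
    return cross_val_score(build_model(num_cols, cat_cols),
                           X[columns], y, cv=5).mean()


all_cols = list(X.columns)
base = cv_accuracy(all_cols)
# Positive value = accuracy lost when the column is removed.
drops = {col: base - cv_accuracy([c for c in all_cols if c != col])
         for col in all_cols}

ranking = sorted(drops.items(), key=lambda kv: -kv[1])
for col, d in ranking:
    print(f"  {col:<10} contributes {d:+.3f}")
```

Because the labels were generated from income, removing income should cost far more accuracy than removing age or city, whose contributions should hover around zero.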

Recap

3 min

Encode categories (one-hot/ordinal), scale numbers, bin non-linear values, and craft new features from domain knowledge. Wrap it all in a ColumnTransformer + Pipeline so preprocessing is fitted on the training data only and deploys as one unit. Better features often beat better algorithms. Next: your first end-to-end project.

Vocabulary Card

feature engineering: Transforming and creating features to help the model learn.
ColumnTransformer: Applies different preprocessing to different columns in one object.
SimpleImputer: Fills missing values (median, mean, most-frequent) as a pipeline step.
ablation: Removing a feature to measure how much it contributed.

Homework

4 min

Take a mixed-type dataset. Build a full preprocessing pipeline. Engineer at least one new feature and run ablation to see if it helps. Report the before/after CV accuracy and whether your new feature earned its place.
