Learning Goals (3 min)
- Encode categories (one-hot, ordinal) the scikit-learn way.
- Scale numeric features (Standard, MinMax) inside a pipeline.
- Bin continuous values and create interaction / ratio features.
- Use ColumnTransformer to apply different transforms per column.
Warm-Up · The Feature Is the Model (5 min)
Predicting house price? A raw "year built" is a weak feature, but "age = current_year − year_built" is a strong one. "Price per square foot" of comparable past sales might be stronger still (computed from other sales, never from the price you are trying to predict). Same data, smarter features, better model, all without changing the algorithm.
Algorithms are commodities; features are where domain knowledge lives. Encoding and scaling are basic hygiene (scaling matters most for distance-based and gradient-based models, much less for trees); creating new features from your understanding of the problem is where the real wins come from.
New Concept · The Transform Toolkit (14 min)
Encoding categories

    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    # nominal (no order): one-hot
    OneHotEncoder(handle_unknown="ignore")

    # ordinal (has order): map to 0, 1, 2, ...
    OrdinalEncoder(categories=[["small", "medium", "large"]])
Scaling numbers

    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    StandardScaler()  # mean 0, std 1: the usual default choice
    MinMaxScaler()    # squash to [0, 1]: use when bounds matter
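A small sketch of what each scaler does to one column. The income values are made up; the point is that a fitted scaler stores its statistics and reuses them on new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical incomes: one column, five rows.
income = np.array([[3000.0], [5000.0], [7000.0], [9000.0], [11000.0]])

standard = StandardScaler().fit(income)
standardized = standard.transform(income).ravel()  # mean 0, std 1

minmax = MinMaxScaler().fit(income)
squashed = minmax.transform(income).ravel()        # min -> 0, max -> 1

# New data reuses the fitted min and max, so a value outside the
# fitted range lands outside [0, 1].
new_squashed = minmax.transform(np.array([[13000.0]])).ravel()

print(standardized)
print(squashed)
print(new_squashed)
```

That last line is worth remembering: MinMaxScaler only guarantees [0, 1] for the data it was fitted on.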
Binning continuous values
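
    import pandas as pd

    # assumes df has a numeric "age" column
    df["age_group"] = pd.cut(
        df["age"],
        bins=[0, 12, 19, 35, 60, 200],
        labels=["child", "teen", "young", "adult", "senior"],
        include_lowest=True,  # keep age 0 in "child" instead of NaN
    )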
Binning turns a number into categories — useful when the relationship is non-linear (e.g., risk by age bracket).
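The bin edges deserve a second look. With pd.cut defaults, each bin includes its right edge and excludes its left one, so a value sitting exactly on the lowest edge falls outside every bin. A minimal sketch with made-up ages on the boundaries:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0, 5, 12, 13, 19, 20, 35, 36, 60, 61])
bins = [0, 12, 19, 35, 60, 200]
labels = ["child", "teen", "young", "adult", "senior"]

# Default: intervals are (0, 12], (12, 19], ... so age 0 gets no bin (NaN).
default_groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 12].
grouped = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default_groups.isna().sum())   # how many ages fell outside all bins
print(grouped.value_counts(sort=False))
```

Checking for NaN after binning is a cheap way to catch values that fell outside your edges.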
Creating features
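
    # assumes df["date"] is already a datetime column (e.g. via pd.to_datetime)
    df["age"] = 2026 - df["year_built"]                 # current year minus year built
    df["price_per_sqft"] = df["price"] / df["sqft"]     # only when price is an input, not the target
    df["rooms_per_floor"] = df["rooms"] / df["floors"]
    df["is_weekend"] = df["date"].dt.dayofweek >= 5     # Saturday = 5, Sunday = 6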
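Here is the same idea as a runnable sketch on a tiny made-up table. The function name add_features, the constant REFERENCE_YEAR and the sample rows are assumptions for illustration; the point is that the date column has to be parsed before .dt works, and that returning a copy keeps the raw data untouched.

```python
import pandas as pd

REFERENCE_YEAR = 2026  # assumed "current year" for the age feature

def add_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of raw with engineered columns added."""
    out = raw.copy()
    out["date"] = pd.to_datetime(out["date"])           # strings -> datetime
    out["age"] = REFERENCE_YEAR - out["year_built"]
    out["rooms_per_floor"] = out["rooms"] / out["floors"]
    out["is_weekend"] = out["date"].dt.dayofweek >= 5   # Saturday = 5, Sunday = 6
    return out

# Hypothetical listings; the dates are a Saturday, a Monday and a Sunday.
listings = pd.DataFrame({
    "year_built": [1990, 2010, 2020],
    "rooms": [6, 4, 8],
    "floors": [2, 1, 2],
    "date": ["2026-01-03", "2026-01-05", "2026-01-04"],
})

features = add_features(listings)
print(features[["age", "rooms_per_floor", "is_weekend"]])
```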
ColumnTransformer — different transforms per column

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier

    pre = ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
    ])

    model = Pipeline([
        ("prep", pre),
        ("clf", RandomForestClassifier(random_state=0)),
    ])

    # X_train is a DataFrame with the four columns above; y_train holds the labels
    model.fit(X_train, y_train)
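This is the professional pattern: one object that knows how to transform each column and train the model. Calling fit on the training split learns the scaling statistics and category lists from that split alone, so nothing from the test set leaks into preprocessing, and the whole thing deploys as a single unit.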
Worked Example · A Real Preprocessing Pipeline (12 min)

    # pipeline.py: mixed-type data, one clean pipeline
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.ensemble import RandomForestClassifier

    # np.nan marks missing values; it matches SimpleImputer's default missing_values=np.nan
    df = pd.DataFrame({
        "age":    [25, 40, 35, np.nan, 52, 29, 60, 41],
        "income": [3000, 8000, 5000, 4500, 12000, 3200, 9000, 7000],
        "city":   ["KL", "JB", "KL", "Penang", "JB", "KL", np.nan, "Penang"],
        "churn":  [0, 1, 0, 0, 1, 0, 1, 1],
    })
    y = df.pop("churn")  # target; df now holds only the features

    num_cols = ["age", "income"]
    cat_cols = ["city"]

    pre = ColumnTransformer([
        ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                          ("sc", StandardScaler())]), num_cols),
        ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                          ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ])

    model = Pipeline([("prep", pre), ("clf", RandomForestClassifier(random_state=0))])
    print("CV accuracy:", cross_val_score(model, df, y, cv=4).mean().round(3))
Read the diff
Numeric columns get median-imputed then scaled; the categorical column gets mode-imputed then one-hot encoded, all in one object. The missing age and the missing city are handled automatically. You can hand model raw rows with the same columns and it cleans them before predicting. Many production ML systems are structured this way.
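To check the "hand it raw rows" claim, the sketch below rebuilds the same kind of pipeline, fits it once, and then predicts on two new rows that contain a missing age, a missing city and a city never seen in training. The frame new_customers and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25, 40, 35, np.nan, 52, 29, 60, 41],
    "income": [3000, 8000, 5000, 4500, 12000, 3200, 9000, 7000],
    "city": ["KL", "JB", "KL", "Penang", "JB", "KL", np.nan, "Penang"],
    "churn": [0, 1, 0, 0, 1, 0, 1, 1],
})
y = train.pop("churn")

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])
model = Pipeline([("prep", pre), ("clf", RandomForestClassifier(random_state=0))])
model.fit(train, y)

# Hypothetical new rows: missing age, unseen city, missing city.
new_customers = pd.DataFrame({
    "age": [np.nan, 33],
    "income": [4000, 10000],
    "city": ["Ipoh", np.nan],
})
predictions = model.predict(new_customers)

# The medians used to fill gaps were learned from the training rows.
num_imputer = model.named_steps["prep"].named_transformers_["num"].named_steps["imp"]
learned_medians = num_imputer.statistics_

print(predictions)
print(learned_medians)  # median age and median income from train
```

With eight training rows the predictions themselves mean little; what matters is that predict runs on messy input without any manual cleaning.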
Try It Yourself (13 min)
- On any dataset, create one new feature from existing columns (a ratio, a difference, or a date part). Show whether it improves CV accuracy.
- Bin a continuous column into 4 sensible buckets with pd.cut. Plot a countplot of the buckets.
- Build a ColumnTransformer for a dataset with both numeric and categorical columns, impute + scale + encode, and train a model. Compare CV accuracy to a naive "drop all categoricals" baseline.
Mini-Challenge · Feature Ablation (8 min)
Measure each feature's contribution by "ablation": train with all features, then drop each one and re-measure CV accuracy. The biggest accuracy drop = the most important feature. Print a ranked list.
One possible solution

    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    # assumes X is a DataFrame of features and y the target
    base = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

    drops = {}
    for col in X.columns:
        reduced = X.drop(columns=[col])
        acc = cross_val_score(RandomForestClassifier(random_state=0), reduced, y, cv=5).mean()
        drops[col] = base - acc  # how much we lost by removing it

    # largest drop first
    for col, d in sorted(drops.items(), key=lambda kv: -kv[1]):
        print(f"  {col:<20} contributes {d:+.3f}")
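Non-negotiables: a baseline, re-training with one feature left out at a time, ranked contributions. Ablation is a model-agnostic way to find what matters, though small differences can be cross-validation noise and correlated features can cover for each other.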
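One way to build trust in an ablation loop is to run it on data where the answer is known. The sketch below is an illustration on synthetic data: the function name ablation_ranking and the columns signal, noise_a and noise_b are hypothetical, and the target is built from signal alone, so signal should top the ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def ablation_ranking(X, y, cv=5):
    """Return (feature, accuracy_drop) pairs, largest drop first."""
    def score(frame):
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        return cross_val_score(clf, frame, y, cv=cv).mean()

    base = score(X)  # baseline with every feature present
    drops = {col: base - score(X.drop(columns=[col])) for col in X.columns}
    return sorted(drops.items(), key=lambda kv: -kv[1])

# Synthetic data: only "signal" determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y = (X["signal"] > 0).astype(int)

ranking = ablation_ranking(X, y)
for col, drop in ranking:
    print(f"  {col:<10} contributes {drop:+.3f}")
```

The noise columns should show drops near zero, sometimes slightly negative, which gives a feel for how much of a "contribution" is just noise.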
Recap (3 min)
Encode categories (one-hot/ordinal), scale numbers, bin values whose effect is non-linear, and craft new features from domain knowledge. Wrap it all in a ColumnTransformer + Pipeline so preprocessing is fitted on training data only and deploys as one unit. Better features often beat better algorithms. Next: your first end-to-end project.
Vocabulary Card
- feature engineering: Transforming and creating features to help the model learn.
- ColumnTransformer: Applies different preprocessing to different columns in one object.
- SimpleImputer: Fills missing values (median, mean, most-frequent) as a pipeline step.
- ablation: Removing a feature to measure how much it contributed.
Homework (4 min)
Take a mixed-type dataset. Build a full preprocessing pipeline. Engineer at least one new feature and run ablation to see if it helps. Report the before/after CV accuracy and whether your new feature earned its place.
Combine pipeline.py with the ablation loop. Honest answer: not every engineered feature helps, so report it either way.