Learning Goals
3 min- Drop unusable columns; choose features.
- Fill or drop missing values for ML.
- Encode categorical columns to numbers (one-hot & label).
- Separate
Xandyand confirm shapes & dtypes.
Warm-Up · Models Eat Numbers
5 minA scikit-learn model cannot train on this:
name city age plan churned Aisyah Kuala Lumpur 29 premium no Wei Jie Penang NaN free yes
It needs every cell to be a number, no missing values, and the label separated out. Today we transform the table above into a clean numeric X and a 0/1 y.
Models eat numbers, not strings, and choke on NaN. "Data prep" means: drop junk → fix missing → encode categories → split X/y. Do it in that order, every time.
New Concept · The Prep Pipeline
14 min1. Drop columns that can't help
import pandas as pd df = pd.read_csv("customers.csv") df = df.drop(columns=["name", "customer_id"]) # identifiers ≠ features
2. Handle missing values
# numeric: fill with the median (robust to outliers) df["age"] = df["age"].fillna(df["age"].median()) # categorical: fill with the most common value df["plan"] = df["plan"].fillna(df["plan"].mode()[0]) # or drop rows that are missing the label df = df.dropna(subset=["churned"])
3a. Encode binary / ordinal with map
df["churned"] = df["churned"].map({"no": 0, "yes": 1}) df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})
3b. One-hot encode nominal categories
For categories with no order (city, plan), one-hot makes a 0/1 column per value:
df = pd.get_dummies(df, columns=["city", "plan"], drop_first=True) # city → city_Penang, city_JohorBahru, ... (each 0 or 1)
drop_first=True drops one redundant column (if it's not any other city, it's the dropped one) — avoids the "dummy variable trap".
4. Split X and y
y = df["churned"] X = df.drop(columns=["churned"]) print(X.shape, y.shape) print(X.dtypes) # should be all numeric now
Worked Example · Prep a Churn Dataset
12 min# prep.py — raw CSV → clean X, y import pandas as pd df = pd.read_csv("customers.csv") print("raw shape:", df.shape) # 1. drop identifiers df = df.drop(columns=["name", "customer_id"], errors="ignore") # 2. missing values df["age"] = df["age"].fillna(df["age"].median()) df["plan"] = df["plan"].fillna(df["plan"].mode()[0]) df = df.dropna(subset=["churned"]) # 3. encode df["churned"] = df["churned"].map({"no": 0, "yes": 1}) df = pd.get_dummies(df, columns=["city", "plan"], drop_first=True) # 4. split y = df["churned"] X = df.drop(columns=["churned"]) print("X shape:", X.shape) print("y balance:\n", y.value_counts()) print("all numeric?", (X.dtypes != object).all()) print(X.head(3))
Sample output
raw shape: (500, 6) X shape: (498, 7) y balance: 0 351 1 147 Name: churned, dtype: int64 all numeric? True
Read the diff
Five lines turned a messy human-readable table into a numeric matrix. The y.value_counts() check is important — if 95% of rows are one class, accuracy will lie to you (Lesson 11). Always look at the label balance before training.
Try It Yourself
13 minTake a yes/no column and map it to 1/0. Confirm the dtype is now int.
One-hot encode a categorical column with get_dummies(drop_first=True). Print the new column names.
Write prepare(df, label_col) that returns (X, y): drops object columns it can't encode, fills numeric NaN with median, one-hots remaining categories, splits off the label.
Hint
def prepare(df, label_col): df = df.copy() y = df.pop(label_col) # fill numeric NaN for c in df.select_dtypes("number"): df[c] = df[c].fillna(df[c].median()) # one-hot the rest df = pd.get_dummies(df, drop_first=True) return df, y
Mini-Challenge · Prep Report
8 minExtend prepare to also return a report dict: original shape, final shape, columns dropped, % missing filled, label balance. Print it so the prep is auditable.
Show one possible solution
def prepare(df, label_col): report = {"raw_shape": df.shape} df = df.copy() y = df.pop(label_col) obj_before = list(df.columns) for c in df.select_dtypes("number"): n_missing = df[c].isna().sum() if n_missing: df[c] = df[c].fillna(df[c].median()) report[f"filled_{c}"] = int(n_missing) df = pd.get_dummies(df, drop_first=True) report["final_shape"] = df.shape report["label_balance"] = y.value_counts().to_dict() return df, y, report X, y, rep = prepare(pd.read_csv("customers.csv"), "churned") import json; print(json.dumps(rep, indent=2, default=str))
Non-negotiables: pure function returning X, y AND a report; the report records what was changed.
Recap
3 minPrep order: drop junk → fix missing → encode categories → split X/y. Models need all-numeric, NaN-free data and a separated label. map for binary/ordinal, get_dummies for nominal. Always check the label balance. Next we visualise before we model.
Vocabulary Card
- one-hot encoding
- Turning a category into one 0/1 column per value.
- label encoding
- Mapping categories to integers — only safe when the categories have an order.
- imputation
- Filling missing values (e.g., with the median or mode).
- class balance
- How evenly the label values are split; imbalance breaks naive accuracy.
Homework
4 minTake a real CSV with at least one categorical column and some missing values. Write a prep script that outputs a clean numeric X, a numeric y, and a JSON prep-report. Save X and y to disk (CSV or .npy) so the next lessons can load them.
Reuse the prepare function from the mini-challenge. Add X.to_csv("X.csv", index=False) and y.to_csv("y.csv", index=False) at the end.