PY-L5-06 · Pandas for AI — Preparing Data

Learning Goals

3 min

Drop unusable columns; choose features.
Fill or drop missing values for ML.
Encode categorical columns to numbers (one-hot & label).
Separate X and y and confirm shapes & dtypes.

Warm-Up · Models Eat Numbers

5 min

A scikit-learn model cannot train on this:

name      city          age   plan       churned
Aisyah    Kuala Lumpur  29    premium    no
Wei Jie   Penang        NaN   free       yes

It needs every cell to be a number, no missing values, and the label separated out. Today we transform the table above into a clean numeric X and a 0/1 y.

Today's big idea

Models eat numbers, not strings, and choke on NaN. "Data prep" means: drop junk → fix missing → encode categories → split X/y. Do it in that order, every time.

New Concept · The Prep Pipeline

14 min

1. Drop columns that can't help

import pandas as pd
df = pd.read_csv("customers.csv")
df = df.drop(columns=["name", "customer_id"])   # identifiers ≠ features

2. Handle missing values

# numeric: fill with the median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# categorical: fill with the most common value
df["plan"] = df["plan"].fillna(df["plan"].mode()[0])

# or drop rows that are missing the label
df = df.dropna(subset=["churned"])

3a. Encode binary / ordinal with map

df["churned"] = df["churned"].map({"no": 0, "yes": 1})
df["size"]    = df["size"].map({"small": 0, "medium": 1, "large": 2})

3b. One-hot encode nominal categories

For categories with no order (city, plan), one-hot makes a 0/1 column per value:

df = pd.get_dummies(df, columns=["city", "plan"], drop_first=True)
# city → city_Penang, city_JohorBahru, ...  (each 0 or 1)

drop_first=True drops one redundant column (if it's not any other city, it's the dropped one) — avoids the "dummy variable trap".

4. Split X and y

y = df["churned"]
X = df.drop(columns=["churned"])
print(X.shape, y.shape)
print(X.dtypes)   # should be all numeric now

Worked Example · Prep a Churn Dataset

12 min

# prep.py — raw CSV → clean X, y
import pandas as pd

df = pd.read_csv("customers.csv")
print("raw shape:", df.shape)

# 1. drop identifiers
df = df.drop(columns=["name", "customer_id"], errors="ignore")

# 2. missing values
df["age"]    = df["age"].fillna(df["age"].median())
df["plan"]   = df["plan"].fillna(df["plan"].mode()[0])
df = df.dropna(subset=["churned"])

# 3. encode
df["churned"] = df["churned"].map({"no": 0, "yes": 1})
df = pd.get_dummies(df, columns=["city", "plan"], drop_first=True)

# 4. split
y = df["churned"]
X = df.drop(columns=["churned"])

print("X shape:", X.shape)
print("y balance:\n", y.value_counts())
print("all numeric?", (X.dtypes != object).all())
print(X.head(3))

Sample output

raw shape: (500, 6)
X shape: (498, 7)
y balance:
 0    351
 1    147
Name: churned, dtype: int64
all numeric? True

Read the diff

Five lines turned a messy human-readable table into a numeric matrix. The y.value_counts() check is important — if 95% of rows are one class, accuracy will lie to you (Lesson 11). Always look at the label balance before training.

Try It Yourself

13 min

01 🟢 Map a binary label

Take a yes/no column and map it to 1/0. Confirm the dtype is now int.

02 🟡 One-hot a category

One-hot encode a categorical column with get_dummies(drop_first=True). Print the new column names.

03 🔴 Full prep function

Write prepare(df, label_col) that returns (X, y): drops object columns it can't encode, fills numeric NaN with median, one-hots remaining categories, splits off the label.

Hint

def prepare(df, label_col):
    df = df.copy()
    y = df.pop(label_col)
    # fill numeric NaN
    for c in df.select_dtypes("number"):
        df[c] = df[c].fillna(df[c].median())
    # one-hot the rest
    df = pd.get_dummies(df, drop_first=True)
    return df, y

Mini-Challenge · Prep Report

8 min

Extend prepare to also return a report dict: original shape, final shape, columns dropped, % missing filled, label balance. Print it so the prep is auditable.

Show one possible solution

def prepare(df, label_col):
    report = {"raw_shape": df.shape}
    df = df.copy()
    y = df.pop(label_col)

    obj_before = list(df.columns)
    for c in df.select_dtypes("number"):
        n_missing = df[c].isna().sum()
        if n_missing:
            df[c] = df[c].fillna(df[c].median())
            report[f"filled_{c}"] = int(n_missing)
    df = pd.get_dummies(df, drop_first=True)

    report["final_shape"] = df.shape
    report["label_balance"] = y.value_counts().to_dict()
    return df, y, report

X, y, rep = prepare(pd.read_csv("customers.csv"), "churned")
import json; print(json.dumps(rep, indent=2, default=str))

Non-negotiables: pure function returning X, y AND a report; the report records what was changed.

Recap

3 min

Prep order: drop junk → fix missing → encode categories → split X/y. Models need all-numeric, NaN-free data and a separated label. map for binary/ordinal, get_dummies for nominal. Always check the label balance. Next we visualise before we model.

Vocabulary Card

one-hot encoding: Turning a category into one 0/1 column per value.
label encoding: Mapping categories to integers — only safe when the categories have an order.
imputation: Filling missing values (e.g., with the median or mode).
class balance: How evenly the label values are split; imbalance breaks naive accuracy.

Homework

4 min

Take a real CSV with at least one categorical column and some missing values. Write a prep script that outputs a clean numeric X, a numeric y, and a JSON prep-report. Save X and y to disk (CSV or .npy) so the next lessons can load them.

# numeric: fill with the median (robust to outliers) df["age"] = df["age"].fillna(df["age"].median()) # categorical: fill with the most common value df["plan"] = df["plan"].fillna(df["plan"].mode()[0]) # or drop rows that are missing the label df = df.dropna(subset=["churned"])

# prep.py — raw CSV → clean X, y import pandas as pd df = pd.read_csv("customers.csv") print("raw shape:", df.shape) # 1. drop identifiers df = df.drop(columns=["name", "customer_id"], errors="ignore") # 2. missing values df["age"] = df["age"].fillna(df["age"].median()) df["plan"] = df["plan"].fillna(df["plan"].mode()[0]) df = df.dropna(subset=["churned"]) # 3. encode df["churned"] = df["churned"].map({"no": 0, "yes": 1}) df = pd.get_dummies(df, columns=["city", "plan"], drop_first=True) # 4. split y = df["churned"] X = df.drop(columns=["churned"]) print("X shape:", X.shape) print("y balance:\n", y.value_counts()) print("all numeric?", (X.dtypes != object).all()) print(X.head(3))

def prepare(df, label_col): df = df.copy() y = df.pop(label_col) # fill numeric NaN for c in df.select_dtypes("number"): df[c] = df[c].fillna(df[c].median()) # one-hot the rest df = pd.get_dummies(df, drop_first=True) return df, y

def prepare(df, label_col): report = {"raw_shape": df.shape} df = df.copy() y = df.pop(label_col) obj_before = list(df.columns) for c in df.select_dtypes("number"): n_missing = df[c].isna().sum() if n_missing: df[c] = df[c].fillna(df[c].median()) report[f"filled_{c}"] = int(n_missing) df = pd.get_dummies(df, drop_first=True) report["final_shape"] = df.shape report["label_balance"] = y.value_counts().to_dict() return df, y, report X, y, rep = prepare(pd.read_csv("customers.csv"), "churned") import json; print(json.dumps(rep, indent=2, default=str))