PY-L4-29 · Pandas — Cleaning (NaN, dtypes, duplicates)

Learning Goals

3 min

Detect missing values with isna / notna; count per column.
Decide: dropna, fillna(value), or per-column strategy.
Fix dtypes with astype and to_datetime.
Find and remove duplicate rows with duplicated / drop_duplicates.

Warm-Up · Where Is the Dirt?

5 min

import pandas as pd
df = pd.read_csv("dirty.csv")     # something with missing values
print(df.isna().sum())            # missing per column
print(df.duplicated().sum())      # number of duplicate rows

Two questions, two one-liners. Run them on every new dataset before any analysis.

Today's big idea

You can't clean what you can't see. Always start with isna().sum() and duplicated().sum(). Then decide for each column: drop, fill, or leave alone.

New Concept · NaN, dtypes, duplicates

14 min

Detecting missing

df.isna()              # DataFrame of booleans
df["score"].isna().sum()  # count missing in one column
df.isna().sum()        # missing per column
df.isna().any(axis=1).sum()  # rows with ANY missing value

Handling missing

# Option A: drop
df.dropna()                       # any row with any NaN
df.dropna(subset=["score"])       # only if score is NaN
df.dropna(thresh=3)               # keep rows with ≥3 non-NaN

# Option B: fill
df.fillna({"score": 0, "name": "Unknown"})
df["score"].fillna(df["score"].mean(), inplace=False)
df["score"].fillna(method="ffill")     # carry last good value forward

Pick the strategy that matches the data. "Missing score" might mean "not yet taken" (fill with NaN, keep) or "0 attempted, didn't show" (fill with 0). Document the choice in a comment.

Fixing dtypes

df["id"] = df["id"].astype(str)
df["price"] = df["price"].astype(float)
df["date"]  = pd.to_datetime(df["date"], errors="coerce")
                                  # errors="coerce" → NaT for unparseable

to_datetime with errors="coerce" is the cleanest way to bring messy date columns under control — unparseable values become NaT (the date equivalent of NaN).

Duplicates

df.duplicated().sum()              # how many exact-match dupes
df.duplicated(subset=["id"]).sum() # dupe by one column
df.drop_duplicates()               # remove
df.drop_duplicates(subset=["id"], keep="last")
                                  # keep the most recent

Worked Example · Clean an Orders File

12 min

import pandas as pd

df = pd.read_csv("orders_dirty.csv")

print("\nbefore:")
print(df.isna().sum())
print(f"duplicates: {df.duplicated(subset=['order_id']).sum()}")
print(df.dtypes)

# 1. Parse dates (handles mixed formats)
df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=False)

# 2. Force numeric columns
for col in ["quantity", "price"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# 3. Drop rows missing required fields
df = df.dropna(subset=["customer", "product", "quantity", "price", "date"])

# 4. Drop duplicate order_ids, keep first
df = df.drop_duplicates(subset=["order_id"])

# 5. Trim and Title-case text columns
for col in ["customer", "product"]:
    df[col] = df[col].str.strip().str.title()

print("\nafter:")
print(df.isna().sum())
print(f"duplicates: {df.duplicated(subset=['order_id']).sum()}")
print(df.head())

Sample output

before:
order_id     0
customer     1
product      0
quantity     2
price        1
date         0
dtype: int64
duplicates: 1
order_id     int64
customer    object
product     object
quantity    object   ← had "one" in there
price       object   ← had blanks
date        object

after:
(all zero missing, no duplicate ids, dtypes now int / float / datetime)

Read the diff

This is yesterday's janitor script — but in 5 lines instead of 50, because pandas vectorises every step. to_numeric(errors="coerce") turned the word "one" into NaN, which dropna then removed. Less code, fewer bugs.

Try It Yourself

13 min

01 🟢 Drop missing customer rows

Use dropna(subset=...) and confirm the row count went down.

02 🟡 Fill price with median

Replace missing prices with the median of the non-missing prices. Use fillna with median().

Hint

df["price"] = df["price"].fillna(df["price"].median())

03 🔴 Dedupe with priority

Where two rows have the same order_id, keep the one with the larger total. Hint: sort by total descending first, then drop_duplicates(subset="order_id").

Hint

df["total"] = df["quantity"] * df["price"]
df = (df.sort_values("total", ascending=False)
        .drop_duplicates(subset="order_id"))

Mini-Challenge · Cleaning Report

8 min

Wrap the worked example into a function clean(df) → (clean_df, report) where report is a dict containing:

rows in / out
dropped because of missing values
dropped because of duplicates
per-column missing counts before cleaning

Show one possible solution

def clean(df):
    report = {"rows_in": len(df),
              "missing_before": df.isna().sum().to_dict()}
    df["date"]     = pd.to_datetime(df["date"], errors="coerce")
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    df["price"]    = pd.to_numeric(df["price"], errors="coerce")
    n0 = len(df)
    df = df.dropna(subset=["customer","product","quantity","price","date"])
    report["dropped_missing"] = n0 - len(df)
    n1 = len(df)
    df = df.drop_duplicates(subset="order_id")
    report["dropped_dupes"] = n1 - len(df)
    report["rows_out"] = len(df)
    return df, report

Non-negotiables: the function is pure (no I/O), returns both the cleaned frame AND the report dict.

Recap

3 min

Three cleanup tools cover most real-world dirt: isna + dropna/fillna for missing values; astype/to_numeric/to_datetime for dtypes; drop_duplicates for duplicates. Vectorised — way faster than Python loops. Tomorrow: merge — pandas's JOIN.

Homework

4 min

Take your real-CSV. Write clean_pipeline.py with a function that returns (clean_df, report). Save the cleaned version as clean.csv and the report as clean_report.json. Document each transformation with a one-line comment.

import pandas as pd df = pd.read_csv("dirty.csv") # something with missing values print(df.isna().sum()) # missing per column print(df.duplicated().sum()) # number of duplicate rows

df.isna() # DataFrame of booleans df["score"].isna().sum() # count missing in one column df.isna().sum() # missing per column df.isna().any(axis=1).sum() # rows with ANY missing value

# Option A: drop df.dropna() # any row with any NaN df.dropna(subset=["score"]) # only if score is NaN df.dropna(thresh=3) # keep rows with ≥3 non-NaN # Option B: fill df.fillna({"score": 0, "name": "Unknown"}) df["score"].fillna(df["score"].mean(), inplace=False) df["score"].fillna(method="ffill") # carry last good value forward

df["id"] = df["id"].astype(str) df["price"] = df["price"].astype(float) df["date"] = pd.to_datetime(df["date"], errors="coerce") # errors="coerce" → NaT for unparseable

df.duplicated().sum() # how many exact-match dupes df.duplicated(subset=["id"]).sum() # dupe by one column df.drop_duplicates() # remove df.drop_duplicates(subset=["id"], keep="last") # keep the most recent

import pandas as pd df = pd.read_csv("orders_dirty.csv") print("\nbefore:") print(df.isna().sum()) print(f"duplicates: {df.duplicated(subset=['order_id']).sum()}") print(df.dtypes) # 1. Parse dates (handles mixed formats) df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=False) # 2. Force numeric columns for col in ["quantity", "price"]: df[col] = pd.to_numeric(df[col], errors="coerce") # 3. Drop rows missing required fields df = df.dropna(subset=["customer", "product", "quantity", "price", "date"]) # 4. Drop duplicate order_ids, keep first df = df.drop_duplicates(subset=["order_id"]) # 5. Trim and Title-case text columns for col in ["customer", "product"]: df[col] = df[col].str.strip().str.title() print("\nafter:") print(df.isna().sum()) print(f"duplicates: {df.duplicated(subset=['order_id']).sum()}") print(df.head())

before: order_id 0 customer 1 product 0 quantity 2 price 1 date 0 dtype: int64 duplicates: 1 order_id int64 customer object product object quantity object ← had "one" in there price object ← had blanks date object after: (all zero missing, no duplicate ids, dtypes now int / float / datetime)

def clean(df): report = {"rows_in": len(df), "missing_before": df.isna().sum().to_dict()} df["date"] = pd.to_datetime(df["date"], errors="coerce") df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce") df["price"] = pd.to_numeric(df["price"], errors="coerce") n0 = len(df) df = df.dropna(subset=["customer","product","quantity","price","date"]) report["dropped_missing"] = n0 - len(df) n1 = len(df) df = df.drop_duplicates(subset="order_id") report["dropped_dupes"] = n1 - len(df) report["rows_out"] = len(df) return df, report