Learning Goals
3 min- Detect missing values with
isna/notna; count per column. - Decide:
dropna,fillna(value), or per-column strategy. - Fix dtypes with
astypeandto_datetime. - Find and remove duplicate rows with
duplicated/drop_duplicates.
Warm-Up · Where Is the Dirt?
5 minimport pandas as pd df = pd.read_csv("dirty.csv") # something with missing values print(df.isna().sum()) # missing per column print(df.duplicated().sum()) # number of duplicate rows
Two questions, two one-liners. Run them on every new dataset before any analysis.
You can't clean what you can't see. Always start with isna().sum() and duplicated().sum(). Then decide for each column: drop, fill, or leave alone.
New Concept · NaN, dtypes, duplicates
14 minDetecting missing
df.isna() # DataFrame of booleans df["score"].isna().sum() # count missing in one column df.isna().sum() # missing per column df.isna().any(axis=1).sum() # rows with ANY missing value
Handling missing
# Option A: drop df.dropna() # any row with any NaN df.dropna(subset=["score"]) # only if score is NaN df.dropna(thresh=3) # keep rows with ≥3 non-NaN # Option B: fill df.fillna({"score": 0, "name": "Unknown"}) df["score"].fillna(df["score"].mean(), inplace=False) df["score"].fillna(method="ffill") # carry last good value forward
Pick the strategy that matches the data. "Missing score" might mean "not yet taken" (fill with NaN, keep) or "0 attempted, didn't show" (fill with 0). Document the choice in a comment.
Fixing dtypes
df["id"] = df["id"].astype(str) df["price"] = df["price"].astype(float) df["date"] = pd.to_datetime(df["date"], errors="coerce") # errors="coerce" → NaT for unparseable
to_datetime with errors="coerce" is the cleanest way to bring messy date columns under control — unparseable values become NaT (the date equivalent of NaN).
Duplicates
df.duplicated().sum() # how many exact-match dupes df.duplicated(subset=["id"]).sum() # dupe by one column df.drop_duplicates() # remove df.drop_duplicates(subset=["id"], keep="last") # keep the most recent
Worked Example · Clean an Orders File
12 minimport pandas as pd df = pd.read_csv("orders_dirty.csv") print("\nbefore:") print(df.isna().sum()) print(f"duplicates: {df.duplicated(subset=['order_id']).sum()}") print(df.dtypes) # 1. Parse dates (handles mixed formats) df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=False) # 2. Force numeric columns for col in ["quantity", "price"]: df[col] = pd.to_numeric(df[col], errors="coerce") # 3. Drop rows missing required fields df = df.dropna(subset=["customer", "product", "quantity", "price", "date"]) # 4. Drop duplicate order_ids, keep first df = df.drop_duplicates(subset=["order_id"]) # 5. Trim and Title-case text columns for col in ["customer", "product"]: df[col] = df[col].str.strip().str.title() print("\nafter:") print(df.isna().sum()) print(f"duplicates: {df.duplicated(subset=['order_id']).sum()}") print(df.head())
Sample output
before: order_id 0 customer 1 product 0 quantity 2 price 1 date 0 dtype: int64 duplicates: 1 order_id int64 customer object product object quantity object ← had "one" in there price object ← had blanks date object after: (all zero missing, no duplicate ids, dtypes now int / float / datetime)
Read the diff
This is yesterday's janitor script — but in 5 lines instead of 50, because pandas vectorises every step. to_numeric(errors="coerce") turned the word "one" into NaN, which dropna then removed. Less code, fewer bugs.
Try It Yourself
13 minUse dropna(subset=...) and confirm the row count went down.
Replace missing prices with the median of the non-missing prices. Use fillna with median().
Hint
df["price"] = df["price"].fillna(df["price"].median())
Where two rows have the same order_id, keep the one with the larger total. Hint: sort by total descending first, then drop_duplicates(subset="order_id").
Hint
df["total"] = df["quantity"] * df["price"] df = (df.sort_values("total", ascending=False) .drop_duplicates(subset="order_id"))
Mini-Challenge · Cleaning Report
8 minWrap the worked example into a function clean(df) → (clean_df, report) where report is a dict containing:
- rows in / out
- dropped because of missing values
- dropped because of duplicates
- per-column missing counts before cleaning
Show one possible solution
def clean(df): report = {"rows_in": len(df), "missing_before": df.isna().sum().to_dict()} df["date"] = pd.to_datetime(df["date"], errors="coerce") df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce") df["price"] = pd.to_numeric(df["price"], errors="coerce") n0 = len(df) df = df.dropna(subset=["customer","product","quantity","price","date"]) report["dropped_missing"] = n0 - len(df) n1 = len(df) df = df.drop_duplicates(subset="order_id") report["dropped_dupes"] = n1 - len(df) report["rows_out"] = len(df) return df, report
Non-negotiables: the function is pure (no I/O), returns both the cleaned frame AND the report dict.
Recap
3 minThree cleanup tools cover most real-world dirt: isna + dropna/fillna for missing values; astype/to_numeric/to_datetime for dtypes; drop_duplicates for duplicates. Vectorised — way faster than Python loops. Tomorrow: merge — pandas's JOIN.
Homework
4 minTake your real-CSV. Write clean_pipeline.py with a function that returns (clean_df, report). Save the cleaned version as clean.csv and the report as clean_report.json. Document each transformation with a one-line comment.
The answer is your clean_pipeline.py, plus the two output files. Match the shape of the mini-challenge.