Learning Goals
3 min- Build a boolean Series with comparisons (
df["x"] > 10). - Filter rows with
df[mask]ordf.loc[mask]. - Combine masks with
&,|,~— and parentheses. - Use
isin,between,str.contains, anddf.query().
Warm-Up · A Mask Is Just a Series
5 minimport pandas as pd df = pd.read_csv("students.csv") mask = df["score"] > 80 print(mask)
0 True 1 False 2 True 3 False Name: score, dtype: bool
A mask is a Series of booleans, one per row. When you index with it, pandas keeps the True rows.
Filtering = comparing to make a mask, then indexing with that mask. Combine masks with & (AND), | (OR), ~ (NOT). Always parenthesise — operator precedence will burn you otherwise.
New Concept · Masks & Helpers
14 minSimple filters
df[df["score"] >= 80] # rows where score >= 80 df[df["product"] == "Nasi"] # rows for one product df[df["price"] != 0] # rows with non-zero price
Combine with logical operators
# AND df[(df["score"] >= 80) & (df["age"] == 13)] # OR df[(df["score"] >= 90) | (df["score"] < 60)] # NOT df[~(df["product"] == "Roti")]
Use &, |, ~ — not and/or/not. Each side must be parenthesised because & binds tighter than >= in Python.
isin — multiple values
df[df["product"].isin(["Roti", "Milo"])]
between — inclusive range
df[df["score"].between(70, 90)]
String tests via .str
df[df["name"].str.startswith("A")] df[df["name"].str.contains("ei")] df[df["name"].str.lower().str.contains("mei")]
query — readable filters
df.query("score >= 80 and age == 13") df.query("product in ['Roti', 'Milo']") df.query("name.str.startswith('A')", engine="python")
df.query takes a string and is often the most readable filter for complex conditions. Pass engine="python" for .str calls inside the string.
Edit only the matched rows
# Bump every Form 1A score by 5 (using yesterday's safe pattern) df.loc[df["class"] == "Form 1A", "score"] += 5
Worked Example · Sales Filtering Drill
12 minimport pandas as pd df = pd.read_csv("clean.csv", parse_dates=["date"]) df["total"] = df["quantity"] * df["price"] # 1. Orders ≥ RM 10 print(df[df["total"] >= 10]) # 2. Only Roti or Milo print(df[df["product"].isin(["Roti", "Milo"])]) # 3. Big orders on or after 2026-05-02 mask = (df["total"] >= 10) & (df["date"] >= "2026-05-02") print(df[mask]) # 4. Same with query print(df.query("total >= 10 and date >= '2026-05-02'")) # 5. Top customers ($ ≥ RM 20 total spend) using groupby + filter spend = df.groupby("customer")["total"].sum() print(spend[spend >= 20])
Read the diff
Five filters, same dataframe, three syntaxes (chained masks, isin, query) — pick whichever is most readable for the query at hand. The last one shows a group-then-filter pattern; we'll formalise it in Lesson 28.
Try It Yourself
13 minReturn only rows where quantity >= 3.
Hint
df[df["quantity"] >= 3]
Rows where product is Nasi OR Milo AND price > 2.
Hint
df[df["product"].isin(["Nasi", "Milo"]) & (df["price"] > 2)]
Apply a 10% discount to every "Milo" order: add a discount column equal to 0 by default, then set it to 0.10 for Milo rows.
Hint
df["discount"] = 0.0 df.loc[df["product"] == "Milo", "discount"] = 0.10 print(df[["product", "discount"]])
Mini-Challenge · Sliding Window Filter
8 minGiven any dated DataFrame, write a function last_n_days(df, n) that returns rows from the last n days (relative to today). Then return only the rows from the last 7 days where total > mean(total).
Show one possible solution
from datetime import date, timedelta def last_n_days(df, n): cutoff = pd.Timestamp(date.today() - timedelta(days=n)) return df[df["date"] >= cutoff] recent = last_n_days(df, 7) hot = recent[recent["total"] > recent["total"].mean()] print(hot)
Non-negotiables: re-using the function, computing the mean of the slice not the whole df.
Recap
3 minA boolean mask is a Series of True/False; index with it to keep matching rows. Combine masks with & | ~ and parentheses. isin, between, .str.* and df.query express common patterns concisely. Filtering is half the job in data work — the other half is grouping, which is tomorrow.
Homework
4 minOn your real-CSV, write five filters:
- One comparison.
- One AND.
- One OR.
- One
isin. - One
str.contains.
For each, write the resulting row count.
for label, mask in [ ("score >= 80", df["score"] >= 80), ("score >= 80 AND age == 13", (df["score"] >= 80) & (df["age"] == 13)), ("score >= 90 OR score < 60", (df["score"] >= 90) | (df["score"] < 60)), ("class isin [A, B]", df["class"].isin(["Form 1A", "Form 1B"])), ("name contains 'ei'", df["name"].str.contains("ei")), ]: print(f" {label:<40} → {mask.sum()} rows")