PY-L4-27 · Pandas — Filtering with Boolean Masks

Learning Goals

3 min

Build a boolean Series with comparisons (df["x"] > 10).
Filter rows with df[mask] or df.loc[mask].
Combine masks with &, |, ~ — and parentheses.
Use isin, between, str.contains, and df.query().

Warm-Up · A Mask Is Just a Series

5 min

import pandas as pd
df = pd.read_csv("students.csv")

mask = df["score"] > 80
print(mask)

0     True
1    False
2     True
3    False
Name: score, dtype: bool

A mask is a Series of booleans, one per row. When you index with it, pandas keeps the True rows.

Today's big idea

Filtering = comparing to make a mask, then indexing with that mask. Combine masks with & (AND), | (OR), ~ (NOT). Always parenthesise — operator precedence will burn you otherwise.

New Concept · Masks & Helpers

14 min

Simple filters

df[df["score"] >= 80]            # rows where score >= 80
df[df["product"] == "Nasi"]      # rows for one product
df[df["price"] != 0]             # rows with non-zero price

Combine with logical operators

# AND
df[(df["score"] >= 80) & (df["age"] == 13)]

# OR
df[(df["score"] >= 90) | (df["score"] < 60)]

# NOT
df[~(df["product"] == "Roti")]

Use &, |, ~ — not and/or/not. Each side must be parenthesised because & binds tighter than >= in Python.

isin — multiple values

df[df["product"].isin(["Roti", "Milo"])]

between — inclusive range

df[df["score"].between(70, 90)]

String tests via .str

df[df["name"].str.startswith("A")]
df[df["name"].str.contains("ei")]
df[df["name"].str.lower().str.contains("mei")]

query — readable filters

df.query("score >= 80 and age == 13")
df.query("product in ['Roti', 'Milo']")
df.query("name.str.startswith('A')", engine="python")

df.query takes a string and is often the most readable filter for complex conditions. Pass engine="python" for .str calls inside the string.

Edit only the matched rows

# Bump every Form 1A score by 5 (using yesterday's safe pattern)
df.loc[df["class"] == "Form 1A", "score"] += 5

Worked Example · Sales Filtering Drill

12 min

import pandas as pd
df = pd.read_csv("clean.csv", parse_dates=["date"])
df["total"] = df["quantity"] * df["price"]

# 1. Orders ≥ RM 10
print(df[df["total"] >= 10])

# 2. Only Roti or Milo
print(df[df["product"].isin(["Roti", "Milo"])])

# 3. Big orders on or after 2026-05-02
mask = (df["total"] >= 10) & (df["date"] >= "2026-05-02")
print(df[mask])

# 4. Same with query
print(df.query("total >= 10 and date >= '2026-05-02'"))

# 5. Top customers ($ ≥ RM 20 total spend) using groupby + filter
spend = df.groupby("customer")["total"].sum()
print(spend[spend >= 20])

Read the diff

Five filters, same dataframe, three syntaxes (chained masks, isin, query) — pick whichever is most readable for the query at hand. The last one shows a group-then-filter pattern; we'll formalise it in Lesson 28.

Try It Yourself

13 min

01 🟢 Single condition

Return only rows where quantity >= 3.

Hint

df[df["quantity"] >= 3]

02 🟡 AND + OR mixed

Rows where product is Nasi OR Milo AND price > 2.

Hint

df[df["product"].isin(["Nasi", "Milo"]) & (df["price"] > 2)]

03 🔴 Combine with safe assignment

Apply a 10% discount to every "Milo" order: add a discount column equal to 0 by default, then set it to 0.10 for Milo rows.

Hint

df["discount"] = 0.0
df.loc[df["product"] == "Milo", "discount"] = 0.10
print(df[["product", "discount"]])

Mini-Challenge · Sliding Window Filter

8 min

Given any dated DataFrame, write a function last_n_days(df, n) that returns rows from the last n days (relative to today). Then return only the rows from the last 7 days where total > mean(total).

Show one possible solution

from datetime import date, timedelta

def last_n_days(df, n):
    cutoff = pd.Timestamp(date.today() - timedelta(days=n))
    return df[df["date"] >= cutoff]

recent = last_n_days(df, 7)
hot    = recent[recent["total"] > recent["total"].mean()]
print(hot)

Non-negotiables: re-using the function, computing the mean of the slice not the whole df.

Recap

3 min

A boolean mask is a Series of True/False; index with it to keep matching rows. Combine masks with & | ~ and parentheses. isin, between, .str.* and df.query express common patterns concisely. Filtering is half the job in data work — the other half is grouping, which is tomorrow.

Homework

4 min

On your real-CSV, write five filters:

One comparison.
One AND.
One OR.
One isin.
One str.contains.

For each, write the resulting row count.

for label, mask in [
    ("score >= 80",                df["score"] >= 80),
    ("score >= 80 AND age == 13",  (df["score"] >= 80) & (df["age"] == 13)),
    ("score >= 90 OR score < 60",  (df["score"] >= 90) | (df["score"] < 60)),
    ("class isin [A, B]",          df["class"].isin(["Form 1A", "Form 1B"])),
    ("name contains 'ei'",         df["name"].str.contains("ei")),
]:
    print(f"  {label:<40} → {mask.sum()} rows")

import pandas as pd df = pd.read_csv("clean.csv", parse_dates=["date"]) df["total"] = df["quantity"] * df["price"] # 1. Orders ≥ RM 10 print(df[df["total"] >= 10]) # 2. Only Roti or Milo print(df[df["product"].isin(["Roti", "Milo"])]) # 3. Big orders on or after 2026-05-02 mask = (df["total"] >= 10) & (df["date"] >= "2026-05-02") print(df[mask]) # 4. Same with query print(df.query("total >= 10 and date >= '2026-05-02'")) # 5. Top customers ($ ≥ RM 20 total spend) using groupby + filter spend = df.groupby("customer")["total"].sum() print(spend[spend >= 20])

from datetime import date, timedelta def last_n_days(df, n): cutoff = pd.Timestamp(date.today() - timedelta(days=n)) return df[df["date"] >= cutoff] recent = last_n_days(df, 7) hot = recent[recent["total"] > recent["total"].mean()] print(hot)