PY-L4-28 · Pandas — Grouping & Aggregating

Learning Goals

3 min

Bucket rows with df.groupby("col").
Apply one aggregation (mean, sum, count, min, max).
Apply many aggregations at once with .agg({...}).
Group by multiple columns; use as_index=False for tidy output.

Warm-Up · The Split-Apply-Combine Pattern

5 min

SPLIT    rows → groups (one per unique value)
APPLY    a function to each group (mean, sum, count, custom)
COMBINE  glue results into a single table

That's the mental model. Every groupby follows this rhythm — split, apply, combine.

Today's big idea

You don't write loops to summarise data in pandas. groupby + an aggregation does it in one line, vectorised under the hood.

New Concept · groupby & agg

14 min

Single column, single aggregate

# Average score per class
df.groupby("class")["score"].mean()

class
Form 1A    81.5
Form 1B    74.25
Name: score, dtype: float64

Multiple aggregates on one column

df.groupby("class")["score"].agg(["count", "mean", "min", "max"])

        count   mean  min  max
class
Form 1A     4  81.50   65   95
Form 1B     4  74.25   55   92

Different aggregates per column

df.groupby("class").agg(
    students   = ("name",  "count"),
    avg_score  = ("score", "mean"),
    best_score = ("score", "max"),
    avg_age    = ("age",   "mean"),
)

Named-aggregation syntax — modern pandas. The result has tidy named columns.

Group by multiple columns

df.groupby(["class", "age"])["score"].mean()

class    age
Form 1A  13    91.5
         14    75.0
Form 1B  13    92.0
         14    66.0

Multi-key groupby creates a MultiIndex on the result. Add .reset_index() if you want a flat DataFrame.

as_index=False

df.groupby("class", as_index=False)["score"].mean()

     class  score
0  Form 1A  81.50
1  Form 1B  74.25

Filter groups (groupby + filter)

# Keep classes whose mean score is above 75
df.groupby("class").filter(lambda g: g["score"].mean() > 75)

Worked Example · Revenue Pivot

12 min

import pandas as pd

df = pd.read_csv("clean.csv", parse_dates=["date"])
df["total"] = df["quantity"] * df["price"]

# Per-product revenue
rev = (df.groupby("product", as_index=False)
         .agg(orders=("order_id", "count"),
              units=("quantity", "sum"),
              revenue=("total", "sum"),
              avg_order=("total", "mean"))
         .sort_values("revenue", ascending=False))
print(rev)

print("\nper day:")
daily = df.groupby(df["date"].dt.date)["total"].sum().rename("revenue")
print(daily)

print("\nproduct × day pivot:")
pivot = df.pivot_table(index="product", columns=df["date"].dt.date,
                       values="total", aggfunc="sum", fill_value=0)
print(pivot)

Sample output

  product  orders  units  revenue  avg_order
0    Nasi       2      5     40.0      20.00
1    Milo       2      6     18.00      9.00
2    Roti       1      3      4.50      4.50

per day:
2026-05-01     10.50
2026-05-02     24.00
2026-05-03     28.00

product × day pivot:
date     2026-05-01  2026-05-02  2026-05-03
product
Milo            6.0         0.0        12.0
Nasi            0.0        24.0        16.0
Roti            4.5         0.0         0.0

Read the diff

Three different summaries from the same data: one-axis (product), one-axis (day), two-axis (pivot table). Pivot tables are just groupby with the result reshaped — pandas knows the moves you used to do in Excel.

Try It Yourself

13 min

01 🟢 Mean by group

Compute average price per product.

Hint

df.groupby("product")["price"].mean()

02 🟡 Multi-stat

For each product: count of orders, sum of quantity, max of total.

Hint

df.groupby("product").agg(
    orders=("order_id", "count"),
    units =("quantity", "sum"),
    biggest=("total",   "max"),
)

03 🔴 Hot days

Find dates where total revenue was above the overall mean. Use groupby + filter conceptually (or compute the mean first, then filter the grouped result).

Hint

daily = df.groupby(df["date"].dt.date)["total"].sum()
print(daily[daily > daily.mean()])

Mini-Challenge · Weekday × Product Pivot

8 min

Make a pivot showing total quantity per weekday × product. Rows = weekday (Monday … Sunday), columns = product, values = sum of quantity. Sort the weekday axis Monday-first.

Show one possible solution

df["weekday"] = df["date"].dt.day_name()
order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]

pivot = (df.pivot_table(index="weekday", columns="product",
                        values="quantity", aggfunc="sum", fill_value=0)
           .reindex(order))
print(pivot)

Non-negotiables: weekday name from .dt.day_name(), reindex to fix order.

Recap

3 min

Split-apply-combine. df.groupby splits; .agg applies one or many aggregations; pandas combines. pivot_table is groupby with the result reshaped into rows × columns. as_index=False keeps things flat. Next: cleaning, where reality lives.

Homework

4 min

On a real CSV with at least one categorical column and one numeric column, produce:

One single-key groupby with three aggregates.
One two-key groupby.
One pivot_table.
One sentence interpreting the most interesting finding.

import pandas as pd df = pd.read_csv("clean.csv", parse_dates=["date"]) df["total"] = df["quantity"] * df["price"] # Per-product revenue rev = (df.groupby("product", as_index=False) .agg(orders=("order_id", "count"), units=("quantity", "sum"), revenue=("total", "sum"), avg_order=("total", "mean")) .sort_values("revenue", ascending=False)) print(rev) print("\nper day:") daily = df.groupby(df["date"].dt.date)["total"].sum().rename("revenue") print(daily) print("\nproduct × day pivot:") pivot = df.pivot_table(index="product", columns=df["date"].dt.date, values="total", aggfunc="sum", fill_value=0) print(pivot)

product orders units revenue avg_order 0 Nasi 2 5 40.0 20.00 1 Milo 2 6 18.00 9.00 2 Roti 1 3 4.50 4.50 per day: 2026-05-01 10.50 2026-05-02 24.00 2026-05-03 28.00 product × day pivot: date 2026-05-01 2026-05-02 2026-05-03 product Milo 6.0 0.0 12.0 Nasi 0.0 24.0 16.0 Roti 4.5 0.0 0.0

df["weekday"] = df["date"].dt.day_name() order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"] pivot = (df.pivot_table(index="weekday", columns="product", values="quantity", aggfunc="sum", fill_value=0) .reindex(order)) print(pivot)