Learning Goals
3 min- Bucket rows with
df.groupby("col"). - Apply one aggregation (
mean,sum,count,min,max). - Apply many aggregations at once with
.agg({...}). - Group by multiple columns; use
as_index=Falsefor tidy output.
Warm-Up · The Split-Apply-Combine Pattern
5 minSPLIT rows → groups (one per unique value) APPLY a function to each group (mean, sum, count, custom) COMBINE glue results into a single table
That's the mental model. Every groupby follows this rhythm — split, apply, combine.
You don't write loops to summarise data in pandas. groupby + an aggregation does it in one line, vectorised under the hood.
New Concept · groupby & agg
14 minSingle column, single aggregate
# Average score per class df.groupby("class")["score"].mean()
class Form 1A 81.5 Form 1B 74.25 Name: score, dtype: float64
Multiple aggregates on one column
df.groupby("class")["score"].agg(["count", "mean", "min", "max"])
count mean min max class Form 1A 4 81.50 65 95 Form 1B 4 74.25 55 92
Different aggregates per column
df.groupby("class").agg( students = ("name", "count"), avg_score = ("score", "mean"), best_score = ("score", "max"), avg_age = ("age", "mean"), )
Named-aggregation syntax — modern pandas. The result has tidy named columns.
Group by multiple columns
df.groupby(["class", "age"])["score"].mean()
class age
Form 1A 13 91.5
14 75.0
Form 1B 13 92.0
14 66.0Multi-key groupby creates a MultiIndex on the result. Add .reset_index() if you want a flat DataFrame.
as_index=False
df.groupby("class", as_index=False)["score"].mean()
class score 0 Form 1A 81.50 1 Form 1B 74.25
Filter groups (groupby + filter)
# Keep classes whose mean score is above 75 df.groupby("class").filter(lambda g: g["score"].mean() > 75)
Worked Example · Revenue Pivot
12 minimport pandas as pd df = pd.read_csv("clean.csv", parse_dates=["date"]) df["total"] = df["quantity"] * df["price"] # Per-product revenue rev = (df.groupby("product", as_index=False) .agg(orders=("order_id", "count"), units=("quantity", "sum"), revenue=("total", "sum"), avg_order=("total", "mean")) .sort_values("revenue", ascending=False)) print(rev) print("\nper day:") daily = df.groupby(df["date"].dt.date)["total"].sum().rename("revenue") print(daily) print("\nproduct × day pivot:") pivot = df.pivot_table(index="product", columns=df["date"].dt.date, values="total", aggfunc="sum", fill_value=0) print(pivot)
Sample output
product orders units revenue avg_order 0 Nasi 2 5 40.0 20.00 1 Milo 2 6 18.00 9.00 2 Roti 1 3 4.50 4.50 per day: 2026-05-01 10.50 2026-05-02 24.00 2026-05-03 28.00 product × day pivot: date 2026-05-01 2026-05-02 2026-05-03 product Milo 6.0 0.0 12.0 Nasi 0.0 24.0 16.0 Roti 4.5 0.0 0.0
Read the diff
Three different summaries from the same data: one-axis (product), one-axis (day), two-axis (pivot table). Pivot tables are just groupby with the result reshaped — pandas knows the moves you used to do in Excel.
Try It Yourself
13 minCompute average price per product.
Hint
df.groupby("product")["price"].mean()
For each product: count of orders, sum of quantity, max of total.
Hint
df.groupby("product").agg( orders=("order_id", "count"), units =("quantity", "sum"), biggest=("total", "max"), )
Find dates where total revenue was above the overall mean. Use groupby + filter conceptually (or compute the mean first, then filter the grouped result).
Hint
daily = df.groupby(df["date"].dt.date)["total"].sum() print(daily[daily > daily.mean()])
Mini-Challenge · Weekday × Product Pivot
8 minMake a pivot showing total quantity per weekday × product. Rows = weekday (Monday … Sunday), columns = product, values = sum of quantity. Sort the weekday axis Monday-first.
Show one possible solution
df["weekday"] = df["date"].dt.day_name() order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"] pivot = (df.pivot_table(index="weekday", columns="product", values="quantity", aggfunc="sum", fill_value=0) .reindex(order)) print(pivot)
Non-negotiables: weekday name from .dt.day_name(), reindex to fix order.
Recap
3 minSplit-apply-combine. df.groupby splits; .agg applies one or many aggregations; pandas combines. pivot_table is groupby with the result reshaped into rows × columns. as_index=False keeps things flat. Next: cleaning, where reality lives.
Homework
4 minOn a real CSV with at least one categorical column and one numeric column, produce:
- One single-key groupby with three aggregates.
- One two-key groupby.
- One pivot_table.
- One sentence interpreting the most interesting finding.
The deliverable is your three tables + one-sentence finding. Examples follow the lesson template — substitute your own column names.