PY-L4-31 · Pandas — Sorting & Top-N Analysis

Learning Goals

3 min

Sort rows by one or many columns; ascending/descending per-column.
Use nlargest / nsmallest for fast top-N.
Rank rows with .rank() and the method= option.
Combine sort + group for "top N per group".

Warm-Up · The Simplest Sort

5 min

df.sort_values("score", ascending=False)

That's 95% of what you'll ever need. The other 5%: multiple keys, NA placement, sort by index.

Today's big idea

sort_values returns a new DataFrame; the original is unchanged. Always assign the result back if you want the order to stick.

New Concept · Sort, Top-N, Rank

14 min

Multiple sort keys

df.sort_values(["class", "score"], ascending=[True, False])
# class asc, then within class score desc

nlargest / nsmallest — faster than sort + head

df.nlargest(3, "score")          # top 3 by score
df.nsmallest(3, "score")         # bottom 3
df.nlargest(5, ["score", "age"]) # primary score, tiebreaker age

Internally these don't fully sort the frame — they use partial sorting, which is faster for large data.

Sort by index

df.sort_index()                    # by row index
df.sort_index(axis=1)              # by column name

Rank

df["rank"] = df["score"].rank(ascending=False, method="dense")

method= options: average (default), min, max, first, dense. For leaderboards dense avoids gaps after ties.

Top N per group

# Top 2 students per class
(df.sort_values(["class", "score"], ascending=[True, False])
   .groupby("class").head(2))

Bonus — sort by a derived value

df.iloc[df["name"].str.len().argsort()]   # by length of name

Worked Example · Top-N Dashboard

12 min

import pandas as pd

df = pd.read_csv("clean.csv", parse_dates=["date"])
df["total"] = df["quantity"] * df["price"]

# 1. Top 5 customers by spend
top_customers = (df.groupby("customer", as_index=False)["total"].sum()
                   .nlargest(5, "total")
                   .reset_index(drop=True))
print(top_customers)

# 2. Worst day by revenue
worst_day = (df.groupby(df["date"].dt.date)["total"].sum()
               .nsmallest(1))
print(f"\n📉 worst day: {worst_day.index[0]}  RM {worst_day.values[0]:.2f}")

# 3. Top 2 products per day
top_per_day = (df.groupby(["date", "product"], as_index=False)["total"].sum()
                 .sort_values(["date", "total"], ascending=[True, False])
                 .groupby("date").head(2))
print("\n🏆 top 2 per day:")
print(top_per_day)

# 4. Add a rank
df["rank_today"] = (df.groupby(df["date"].dt.date)["total"]
                      .rank(ascending=False, method="dense"))

Read the diff

Same DataFrame, four different reports. The combo groupby + sort + head is the Pythonic "top N per group" — easier to read than window functions in SQL once you've used it twice.

Try It Yourself

13 min

01 🟢 Top 3 by total

Print the three rows with the largest total.

02 🟡 Per-class ranking

Add a class_rank column to students.csv (1 = top of each class).

Hint

df["class_rank"] = (df.groupby("class")["score"]
                       .rank(ascending=False, method="dense"))

03 🔴 Biggest week-on-week drop

For a daily-revenue series, compute the day-on-day change. Find the day with the biggest negative change.

Hint

daily = df.groupby(df["date"].dt.date)["total"].sum()
diff  = daily.diff()
worst = diff.nsmallest(1)
print(f"biggest drop: {worst.index[0]}  RM {worst.values[0]:.2f}")

Mini-Challenge · Climbers & Fallers

8 min

Imagine two daily snapshots (yesterday, today). Each has columns product, revenue. Find the top 3 climbers (biggest gainers) and top 3 fallers between the two days.

Show one possible solution

merged = (yest.merge(today, on="product", suffixes=("_y", "_t"), how="outer")
              .fillna(0))
merged["delta"] = merged["revenue_t"] - merged["revenue_y"]

print("🚀 climbers:")
print(merged.nlargest(3, "delta")[["product", "delta"]])

print("\n📉 fallers:")
print(merged.nsmallest(3, "delta")[["product", "delta"]])

Non-negotiables: outer merge so products on one day only still count; fillna(0); nlargest/nsmallest by delta.

Recap

3 min

sort_values + slicing covers nearly every ordering need. nlargest/nsmallest are faster when you only want N. rank with method="dense" is the leaderboard helper. Combine sort with groupby to compute top-N per group. Tomorrow we level up with the big challenge.

Homework

4 min

From your real CSV, produce four ranked outputs:

Top 5 by your most important numeric column.
Bottom 5 by the same.
Top 3 per group, for any sensible group column.
A "biggest mover" report comparing two slices of your data (months, weeks, categories).

import pandas as pd df = pd.read_csv("clean.csv", parse_dates=["date"]) df["total"] = df["quantity"] * df["price"] # 1. Top 5 customers by spend top_customers = (df.groupby("customer", as_index=False)["total"].sum() .nlargest(5, "total") .reset_index(drop=True)) print(top_customers) # 2. Worst day by revenue worst_day = (df.groupby(df["date"].dt.date)["total"].sum() .nsmallest(1)) print(f"\n📉 worst day: {worst_day.index[0]} RM {worst_day.values[0]:.2f}") # 3. Top 2 products per day top_per_day = (df.groupby(["date", "product"], as_index=False)["total"].sum() .sort_values(["date", "total"], ascending=[True, False]) .groupby("date").head(2)) print("\n🏆 top 2 per day:") print(top_per_day) # 4. Add a rank df["rank_today"] = (df.groupby(df["date"].dt.date)["total"] .rank(ascending=False, method="dense"))

merged = (yest.merge(today, on="product", suffixes=("_y", "_t"), how="outer") .fillna(0)) merged["delta"] = merged["revenue_t"] - merged["revenue_y"] print("🚀 climbers:") print(merged.nlargest(3, "delta")[["product", "delta"]]) print("\n📉 fallers:") print(merged.nsmallest(3, "delta")[["product", "delta"]])