Learning Goals
3 min- Use
pd.merge(left, right, on="key")for an inner join. - Switch join style with
how="left"/"right"/"outer". - Join on multi-column keys, or on differently-named keys (
left_on/right_on). - Stack frames vertically with
pd.concat.
Warm-Up · Two Frames, One Key
5 minimport pandas as pd pupils = pd.DataFrame({ "id": [1, 2, 3, 4], "name": ["Aisyah", "Wei Jie", "Suresh", "Mei"], "class_id": [1, 1, 2, 2], }) classes = pd.DataFrame({ "class_id": [1, 2, 3], "name": ["Form 1A", "Form 1B", "Form 1C"], }) print(pupils.merge(classes, on="class_id", suffixes=("", "_class")))
id name class_id name_class 0 1 Aisyah 1 Form 1A 1 2 Wei Jie 1 Form 1A 2 3 Suresh 2 Form 1B 3 4 Mei 2 Form 1B
Same column name on both sides (name) → use suffixes so pandas knows what to call them in the result.
pd.merge = SQL JOIN. Same four flavours, same gotchas. If you understood Lessons 20/21, you already know this.
New Concept · merge & concat
14 minFour join styles
how="inner" keep only matches in BOTH (default) how="left" keep all left rows; NaN where right misses how="right" keep all right rows; NaN where left misses how="outer" keep everything from both
pupils.merge(classes, on="class_id", how="left") pupils.merge(classes, on="class_id", how="outer", suffixes=("_p", "_c"))
Multi-column join
a.merge(b, on=["year", "subject"])
Differently-named keys
orders.merge(customers, left_on="customer_id", right_on="id") # drops the duplicate id column afterwards .drop(columns="id")
indicator=True — see who matched
merged = pupils.merge(classes, on="class_id", how="outer", indicator=True) print(merged["_merge"].value_counts())
_merge both 4 right_only 1 ← Form 1C had no pupil left_only 0 Name: count, dtype: int64
Super useful for debugging unexpected join results.
concat — stack frames
# Same columns, more rows all_pupils = pd.concat([pupils_2024, pupils_2025, pupils_2026], ignore_index=True) # Different columns, side-by-side combined = pd.concat([df_a, df_b], axis=1)
concat is for stacking — when both frames have the same shape and you want to add more rows (or more columns side-by-side).
Worked Example · Sales + Customers + Products
12 minimport pandas as pd sales = pd.read_csv("sales.csv") # date, customer_id, product_id, qty customers = pd.read_csv("customers.csv") # id, name, city products = pd.read_csv("products.csv") # id, name, price # Two merges to enrich every sale with customer name and product name + price joined = (sales .merge(customers, left_on="customer_id", right_on="id", suffixes=("", "_c")) .drop(columns="id") .merge(products, left_on="product_id", right_on="id", suffixes=("", "_p")) .drop(columns="id")) joined["revenue"] = joined["qty"] * joined["price"] # Top-spending customers top = (joined.groupby("name")["revenue"].sum() .sort_values(ascending=False) .head(5)) print(top)
Read the diff
Chained merges enrich the sales frame with two lookup tables, the same way a SQL query would join three tables. Then groupby + sum + head gives you the top customers — what would have been a 30-line script in plain Python becomes 10 lines, vectorised.
Try It Yourself
13 minMerge pupils with classes on class_id. Confirm Form 1C disappears.
Switch to outer + indicator. Print the value_counts of the _merge column.
Make two small DataFrames sales_2024 and sales_2025 with the same columns. concat them, add a column year that's 2024 or 2025 (use keys= if you like). Group by year and sum.
Hint
combined = pd.concat({"2024": sales_2024, "2025": sales_2025}, names=["year", None]).reset_index() print(combined.groupby("year")["revenue"].sum())
Mini-Challenge · Customers Who Never Bought Product X
8 minFind every customer who has NEVER bought product "Nasi". Use a left merge + filter.
Show one possible solution
nasi_sales = sales[sales["product_id"] == NASI_ID][["customer_id"]].drop_duplicates() no_nasi = customers.merge(nasi_sales, left_on="id", right_on="customer_id", how="left", indicator=True) print(no_nasi[no_nasi["_merge"] == "left_only"][["name"]])
Non-negotiables: left merge keeps every customer; _merge == "left_only" isolates those with no Nasi match.
Recap
3 minmerge = JOIN; concat = stack. Four join styles with how=. left_on/right_on for differently-named keys; suffixes for collisions. indicator=True when in doubt. Tomorrow we sort + slice for top-N reports.
Homework
4 minBuild a small two-frame scenario from your own data (or two CSVs). Run all four merge styles on it. For each, write the row count of the result and one sentence interpreting why it differs from the next style.
for how in ["inner", "left", "right", "outer"]: n = len(pupils.merge(classes, on="class_id", how=how)) print(f" how={how:<6} rows={n}")