PY-L5-08 · Seaborn Deep Dive — pairplot, heatmap, distributions

Learning Goals

3 min

Draw a pairplot and read the diagonal vs off-diagonal.
Compute a correlation matrix and visualise it with a heatmap.
Spot redundant (highly correlated) features.
Compare distributions across classes with kdeplot / violinplot.

Warm-Up · One Plot, Many Answers

5 min

import seaborn as sns
sns.set_theme(style="ticks")
df = sns.load_dataset("penguins").dropna()
sns.pairplot(df, hue="species")

That single call draws a whole grid: every numeric feature plotted against every other, coloured by species, with each feature's distribution on the diagonal. It's one of the fastest ways to get a feel for a small dataset.

Today's big idea

Before modelling, ask two questions: which features separate the classes (pairplot), and which features are redundant (heatmap)? Answering both helps you avoid training bloated, confused models.

New Concept · The Three Plots

14 min

1. pairplot — the all-pairs grid

sns.pairplot(df, hue="species", diag_kind="kde",
             vars=["bill_length_mm", "flipper_length_mm", "body_mass_g"])

Diagonal = each feature's distribution per class.
Off-diagonal = scatter of two features. Look for panels where the colours separate cleanly — those are your best feature pairs.
Pass vars= to limit columns (a pairplot of 20 features is unreadable).

2. correlation heatmap — which features move together

import matplotlib.pyplot as plt

corr = df.select_dtypes("number").corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, square=True)
plt.title("Feature correlations")
plt.tight_layout()

Correlation ranges -1 .. +1
 +1  move together perfectly (linear)
  0  no linear relationship
 -1  move in exactly opposite directions (linear)

Two features with correlation near ±1 are largely redundant: they carry almost the same linear information. Keep one; dropping the other simplifies the model with little loss.
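To make the scale concrete, here is a small sketch on made-up numbers (the demo frame and its column names are illustrative assumptions). It shows exact linear copies landing at +1 and -1, and a column that is fully determined by x yet has a correlation of about 0 because the relationship is not linear.

```python
import numpy as np
import pandas as pd

x = np.arange(-5, 6, dtype=float)   # -5, -4, ..., 5 (symmetric around 0)
demo = pd.DataFrame({
    "x": x,
    "double_x": 2 * x + 1,    # exact linear copy of x
    "neg_x": -3 * x,          # exact linear mirror of x
    "x_squared": x ** 2,      # determined by x, but not linearly
})

# Pearson correlation of every column with x.
r = demo.corr()["x"]
print(r.round(2))
```

The x_squared row is the reason to read 0 as "no linear relationship" rather than "unrelated": a heatmap alone cannot rule out curved dependencies, which is one reason to look at the pairplot as well.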

3. distribution comparisons

# overlaid smooth curves per class
sns.kdeplot(data=df, x="flipper_length_mm", hue="species", fill=True)

# violin = boxplot + distribution shape
sns.violinplot(data=df, x="species", y="body_mass_g")

If a feature's curves are well separated per class, that feature is likely to be predictive. If they sit on top of each other, the feature is unlikely to help on its own.
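Eyeballing overlap can be backed by a rough number. The sketch below is one possible way to do it, not a seaborn feature: overlap is a hypothetical helper that compares two normalised histograms on shared bins, and the data is synthetic.

```python
import numpy as np

def overlap(a, b, bins=20):
    """Shared area of two normalised histograms: 0 = disjoint, 1 = identical."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)      # same bins for both classes
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    pa = pa / pa.sum()                         # counts -> proportions
    pb = pb / pb.sum()
    return float(np.minimum(pa, pb).sum())

rng = np.random.default_rng(1)
# Invented data: one feature whose class means differ, one where they do not.
separated_a, separated_b = rng.normal(190, 5, 300), rng.normal(215, 5, 300)
mixed_a, mixed_b = rng.normal(200, 5, 300), rng.normal(200, 5, 300)

print(f"separated feature overlap: {overlap(separated_a, separated_b):.2f}")
print(f"mixed feature overlap:     {overlap(mixed_a, mixed_b):.2f}")
```

A low overlap corresponds to curves that sit apart in the kdeplot; a high overlap corresponds to curves stacked on top of each other. The result depends on the bin count, so treat it as a rough guide only.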

Worked Example · Find the Best Features

12 min

# feature_scout.py: which features should I train on?
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
df = sns.load_dataset("penguins").dropna()
num = df.select_dtypes("number")

# 1. correlation heatmap: find redundancy
corr = num.corr()
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlations")
plt.tight_layout()
plt.savefig("corr.png", dpi=150)
plt.close()

# 2. report the strongest correlation (excluding the diagonal)
strength = corr.abs().to_numpy(copy=True)  # writable copy; leaves corr untouched
np.fill_diagonal(strength, 0)              # ignore each feature's r = 1 with itself
i, j = np.unravel_index(strength.argmax(), strength.shape)
pair = (corr.index[i], corr.columns[j])
print(f"most redundant pair: {pair}  (r = {corr.loc[pair]:.2f})")

# 3. pairplot of three chosen features
grid = sns.pairplot(df, hue="species",
                    vars=["bill_length_mm", "flipper_length_mm", "body_mass_g"])
grid.savefig("pairs.png", dpi=120)

Sample output

most redundant pair: ('flipper_length_mm', 'body_mass_g')  (r = 0.87)

Read the diff

The heatmap flagged flipper length and body mass as 0.87 correlated: heavier penguins tend to have longer flippers. You might keep just one of the two. The pairplot then shows that bill_length_mm and flipper_length_mm together separate the three species fairly cleanly, so those two features alone would likely train a strong classifier.
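The script reports only the single strongest pair. To see the runners-up as well, one option is to rank every pair by |r|. This is a sketch on synthetic data; the column names and the hidden size variable are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
size = rng.normal(0, 1, n)          # hidden "overall size" driver (invented)
num = pd.DataFrame({
    "flipper": size + rng.normal(0, 0.3, n),
    "mass": size + rng.normal(0, 0.3, n),
    "bill": rng.normal(0, 1, n),     # unrelated to size in this toy data
})

corr = num.corr()

# combinations() yields each unordered pair once, so no diagonal and no mirror duplicates.
pairs = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
pairs.sort(key=lambda t: abs(t[2]), reverse=True)

for a, b, r in pairs:
    print(f"  {a:<8} {b:<8} r = {r:+.2f}")
```

Sorting on the absolute value keeps strong negative correlations near the top, while printing the signed r preserves the direction.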

Try It Yourself

13 min

01 🟢 Heatmap

On the iris dataset, compute and plot the correlation heatmap. Which two features are most correlated?

02 🟡 Targeted pairplot

Pairplot iris with only the two petal features + species hue. Are the classes separable using just petals?

03 🔴 Auto-drop redundant features

Write a function that drops one of any feature pair with |correlation| > 0.9. Return the reduced feature list.

Hint

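def drop_redundant(num_df, thresh=0.9):
    """Greedy filter: drop a feature if it correlates above thresh with one already kept."""
    corr = num_df.corr().abs()
    cols, dropped = list(num_df.columns), set()
    for i, a in enumerate(cols):
        if a in dropped:
            continue  # a was removed already, so it must not knock out others
        for b in cols[i+1:]:
            if b not in dropped and corr.loc[a, b] > thresh:
                dropped.add(b)
    # returns (kept feature list, set of dropped features)
    return [c for c in cols if c not in dropped], dropped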
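A common vectorised variant of the same idea, sketched here as an alternative rather than the expected answer. It drops any column that correlates above the threshold with any earlier column, whether or not that earlier column was itself dropped, so it can be slightly more aggressive than a pairwise loop. The name drop_redundant_upper and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_redundant_upper(num_df, thresh=0.9):
    corr = num_df.corr().abs()
    # Keep only entries above the diagonal: each column is compared with earlier columns.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    dropped = [col for col in upper.columns if (upper[col] > thresh).any()]
    kept = [col for col in num_df.columns if col not in dropped]
    return kept, dropped

rng = np.random.default_rng(3)
base = rng.normal(0, 1, 200)
toy = pd.DataFrame({
    "length_cm": base,
    "length_in": base / 2.54,            # same measurement in other units
    "width_cm": rng.normal(0, 1, 200),   # independent of length in this toy data
})

kept, dropped = drop_redundant_upper(toy)
print("kept:", kept)
print("dropped:", dropped)
```

Because the first column of a correlated pair is always the survivor, column order decides which feature is kept; reorder the columns first if one of the two is cheaper to measure or easier to explain.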

Mini-Challenge · Feature Report Card

8 min

For a labelled dataset, rank features by how well they separate the classes. A cheap proxy: the range of the per-class means (largest minus smallest), divided by the overall standard deviation. Print a ranked list: your "these are the features worth keeping" report.

Show one possible solution

import seaborn as sns
df = sns.load_dataset("penguins").dropna()
num = df.select_dtypes("number").columns

scores = {}
for col in num:
    means = df.groupby("species")[col].mean()
    spread = means.max() - means.min()
    scores[col] = spread / df[col].std()

for col, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"  {col:<22} separability {s:.2f}")

Non-negotiables: a per-feature separability score, ranked descending. The top features are your candidates; this echoes real feature-selection methods.
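The mean-range proxy ignores how spread out each class is and how many samples it has. One established score that accounts for both is the one-way ANOVA F statistic (between-class variance over within-class variance). The sketch below computes it by hand with pandas on synthetic data; f_score and the toy columns are hypothetical names, and this is offered as an optional extension, not as the required solution.

```python
import numpy as np
import pandas as pd

def f_score(df, feature, label):
    """One-way ANOVA F: between-class variance over within-class variance."""
    groups = df.groupby(label)[feature]
    n_total, n_classes = len(df), groups.ngroups
    grand_mean = df[feature].mean()
    # Spread of the class means around the grand mean, weighted by class size.
    between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum() / (n_classes - 1)
    # Spread of the samples around their own class mean.
    within = ((df[feature] - groups.transform("mean")) ** 2).sum() / (n_total - n_classes)
    return between / within

rng = np.random.default_rng(4)
toy = pd.DataFrame({
    "species": np.repeat(["A", "B", "C"], 50),
    # Invented data: class means of 0, 3 and 6 for one feature, identical means for the other.
    "separating": rng.normal(0, 1, 150) + np.repeat([0.0, 3.0, 6.0], 50),
    "noise": rng.normal(0, 1, 150),
})

scores = {col: f_score(toy, col, "species") for col in ["separating", "noise"]}
for col, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"  {col:<12} F = {s:.1f}")
```

A large F means the class means differ by much more than the scatter inside each class; values near 1 are what you would expect from a feature that carries no class information.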

Recap

3 min

pairplot shows every feature pair (diagonal = distributions, off-diagonal = scatter); the correlation heatmap finds redundant features; kde/violin plots compare distributions per class. Use them to pick informative, non-redundant features before training. Next we start training real models.

Vocabulary Card

pairplot: Grid of scatter plots for every feature pair, coloured by class.
correlation: How strongly two numeric features move together, from −1 to +1.
heatmap: Colour-coded grid of values — great for correlation matrices.
redundant feature: One that's nearly a copy of another; usually safe to drop.

Homework

4 min

On your own dataset: produce a correlation heatmap, a targeted pairplot, and a ranked feature-separability list. Write a short note: which 2-3 features you'd train on and why, plus any you'd drop for redundancy.

