Learning Goals
3 min
- Draw a pairplot and read the diagonal vs off-diagonal.
- Compute a correlation matrix and visualise it with a heatmap.
- Spot redundant (highly correlated) features.
- Compare distributions across classes with kdeplot/violinplot.
Warm-Up · One Plot, Many Answers
5 min
    import seaborn as sns

    sns.set_theme(style="ticks")
    df = sns.load_dataset("penguins").dropna()
    sns.pairplot(df, hue="species")
That single call draws a whole grid: every numeric feature plotted against every other, coloured by species, with distributions on the diagonal. It's the fastest way to understand a small dataset.
Before modelling, ask two questions: which features separate the classes? (pairplot) and which features are redundant? (heatmap). Answering both saves you from training bloated, confused models.
New Concept · The Three Plots
14 min
1. pairplot: the all-pairs grid
    sns.pairplot(df, hue="species", diag_kind="kde",
                 vars=["bill_length_mm", "flipper_length_mm", "body_mass_g"])
- Diagonal = each feature's distribution per class.
- Off-diagonal = scatter of two features. Look for panels where the colours separate cleanly — those are your best feature pairs.
- Pass vars= to limit columns (a pairplot of 20 features is unreadable).
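To see why limiting the columns matters, the sketch below counts the panels an all-pairs grid would hold for a given number of features. The helper name `pairplot_panels` is made up for illustration; it only does arithmetic and does not call seaborn.

```python
from itertools import combinations

def pairplot_panels(features):
    """Count the panels in an all-pairs grid over `features`.

    Hypothetical helper for illustration: a full grid has one row and one
    column per feature, so n features give n * n panels.
    """
    n = len(features)
    unique_pairs = list(combinations(features, 2))  # each unordered pair once
    return {
        "total_panels": n * n,             # whole grid, diagonal included
        "diagonal": n,                     # one distribution per feature
        "unique_pairs": len(unique_pairs), # distinct scatter plots
    }

three = pairplot_panels(["bill_length_mm", "flipper_length_mm", "body_mass_g"])
twenty = pairplot_panels([f"feature_{i}" for i in range(20)])

print(three)   # {'total_panels': 9, 'diagonal': 3, 'unique_pairs': 3}
print(twenty)  # {'total_panels': 400, 'diagonal': 20, 'unique_pairs': 190}
```

The panel count grows with the square of the feature count, which is why three or four hand-picked columns usually read better than the full grid.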
2. correlation heatmap: which features move together
    import matplotlib.pyplot as plt

    corr = df.select_dtypes("number").corr()
    plt.figure(figsize=(6, 5))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                center=0, square=True)
    plt.title("Feature correlations")
    plt.tight_layout()
Correlation ranges from -1 to +1:
    +1   the two features move together perfectly
     0   no linear relationship
    -1   the two features move in perfectly opposite directions
Two features with correlation near ±1 are redundant: they carry nearly the same information. Keep one; dropping the other simplifies the model with little loss.
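To make the scale concrete, here is a minimal sketch that computes the Pearson correlation by hand with NumPy on made-up values. The helper `pearson_r` is hypothetical and written only for illustration; in practice `DataFrame.corr()` does this for every column pair.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-movement divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.linspace(-3, 3, 61)          # made-up values, symmetric around 0

r_copy = pearson_r(x, 2 * x + 5)    # rescaled and shifted copy of x
r_flip = pearson_r(x, -x)           # mirror image of x
r_curve = pearson_r(x, x ** 2)      # depends on x, but not linearly

print(f"copy: {r_copy:.2f}  flip: {r_flip:.2f}  curve: {abs(r_curve):.2f}")
```

The rescaled copy scores +1 and the mirror image -1, which is the redundant case. The squared feature scores essentially 0 even though it is fully determined by x, so a value near 0 rules out a linear relationship, not every relationship.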
3. distribution comparisons
    # overlaid smooth curves per class
    sns.kdeplot(data=df, x="flipper_length_mm", hue="species", fill=True)

    # start a new figure so the violin plot is not drawn on the kde axes
    plt.figure()

    # violin = boxplot + distribution shape
    sns.violinplot(data=df, x="species", y="body_mass_g")
If a feature's curves are well separated per class, that feature is likely to be predictive. If they sit on top of each other, the feature won't help much on its own.
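One way to put a number on "well separated" versus "on top of each other" is to measure how much two class histograms overlap. The sketch below uses synthetic values rather than the penguins data; the helper `histogram_overlap` and the bin count of 30 are assumptions made for illustration.

```python
import numpy as np

def histogram_overlap(a, b, bins=30):
    """Shared area of two normalised histograms: 0 = disjoint, 1 = identical."""
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)   # same bin edges for both classes
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    pa = pa / pa.sum()                       # counts -> proportions
    pb = pb / pb.sum()
    return float(np.minimum(pa, pb).sum())

rng = np.random.default_rng(0)

# Two made-up classes whose means are far apart relative to their spread
separated = histogram_overlap(rng.normal(190, 5, 500), rng.normal(220, 5, 500))

# Two made-up classes with almost the same mean
overlapping = histogram_overlap(rng.normal(200, 5, 500), rng.normal(201, 5, 500))

print(f"separated classes:   overlap = {separated:.2f}")
print(f"overlapping classes: overlap = {overlapping:.2f}")
```

A low overlap corresponds to kde curves that stand apart; a high overlap corresponds to curves stacked on top of each other.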
Worked Example · Find the Best Features
12 min
    # feature_scout.py: which features should I train on?
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.set_theme(style="whitegrid")
    df = sns.load_dataset("penguins").dropna()
    num = df.select_dtypes("number")

    # 1. correlation heatmap: find redundancy
    corr = num.corr()
    plt.figure(figsize=(6, 5))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
    plt.title("Correlations")
    plt.tight_layout()
    plt.savefig("corr.png", dpi=150)
    plt.clf()

    # 2. report the strongest correlation (excluding the diagonal)
    # mask() returns a new frame with the diagonal set to 0, so nothing is
    # written into the underlying array (which can be read-only in newer pandas)
    c = corr.abs().mask(np.eye(len(corr), dtype=bool), 0)
    pair = c.stack().idxmax()
    print(f"most redundant pair: {pair} (r = {corr.loc[pair]:.2f})")

    # 3. pairplot of three chosen features
    sns.pairplot(df, hue="species",
                 vars=["bill_length_mm", "flipper_length_mm", "body_mass_g"])
    plt.savefig("pairs.png", dpi=120)
Sample output
    most redundant pair: ('flipper_length_mm', 'body_mass_g') (r = 0.87)
Read the diff
The heatmap flagged flipper length and body mass as strongly correlated (r = 0.87): heavier penguins tend to have longer flippers, so you might keep just one of the two. The pairplot then shows that bill_length_mm and flipper_length_mm together separate the three species cleanly; those two features alone are likely enough to train a strong classifier.
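Reporting only the single strongest pair can hide other near-duplicates. As a minimal sketch, the hypothetical helper `ranked_pairs` below lists every unordered feature pair, strongest correlation first. It runs on synthetic stand-in data, not the penguins dataset, and the column names are invented.

```python
import numpy as np
import pandas as pd

def ranked_pairs(num_df):
    """Return (feature_a, feature_b, r) for each unordered pair, strongest first."""
    corr = num_df.corr()
    cols = list(corr.columns)
    pairs = [
        (cols[i], cols[j], float(corr.iat[i, j]))
        for i in range(len(cols))
        for j in range(i + 1, len(cols))   # upper triangle: skip diagonal and mirrors
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic stand-in data with invented column names
rng = np.random.default_rng(1)
size = rng.normal(0, 1, 300)
demo = pd.DataFrame({
    "length": size + rng.normal(0, 0.3, 300),   # tracks size closely
    "mass": size + rng.normal(0, 0.3, 300),     # tracks size closely
    "noise": rng.normal(0, 1, 300),             # unrelated to size
})

for a, b, r in ranked_pairs(demo):
    print(f"{a:<8} {b:<8} r = {r:+.2f}")
```

Sorting on the absolute value keeps strong negative correlations near the top as well, since they are just as redundant as strong positive ones.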
Try It Yourself
13 min
On the iris dataset, compute and plot the correlation heatmap. Which two features are most correlated?
Pairplot iris with only the two petal features + species hue. Are the classes separable using just petals?
Write a function that drops one of any feature pair with |correlation| > 0.9. Return the reduced feature list.
Hint
    def drop_redundant(num_df, thresh=0.9):
        corr = num_df.corr().abs()
        keep, dropped = list(num_df.columns), set()
        for i, a in enumerate(keep):
            if a in dropped:
                continue  # a feature that was dropped should not knock out others
            for b in keep[i+1:]:
                if b not in dropped and corr.loc[a, b] > thresh:
                    dropped.add(b)
        return [c for c in keep if c not in dropped], dropped
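A quick way to check a function like this is to run it on synthetic data where the redundancy is planted in advance. The sketch below is one possible variant, under the hypothetical name `drop_redundant_sketch`, that only lets kept features drop others. The data and column names are invented.

```python
import numpy as np
import pandas as pd

def drop_redundant_sketch(num_df, thresh=0.9):
    """Greedy filter: keep the first feature of each highly correlated pair."""
    corr = num_df.corr().abs()
    cols = list(num_df.columns)
    dropped = set()
    for i, a in enumerate(cols):
        if a in dropped:
            continue                      # only kept features may drop others
        for b in cols[i + 1:]:
            if b not in dropped and corr.loc[a, b] > thresh:
                dropped.add(b)
    return [c for c in cols if c not in dropped], dropped

# Synthetic data with one planted near-duplicate (not a real dataset)
rng = np.random.default_rng(0)
base = rng.normal(0, 1, 200)
demo = pd.DataFrame({
    "base": base,
    "near_copy": base + rng.normal(0, 0.05, 200),  # almost identical to base
    "independent": rng.normal(0, 1, 200),
})

kept, dropped = drop_redundant_sketch(demo)
print(kept, dropped)   # ['base', 'independent'] {'near_copy'}
```

Because the filter is greedy, the result depends on column order: whichever member of a correlated pair comes first is the one that survives.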
Mini-Challenge · Feature Report Card
8 min
For a labelled dataset, rank features by how well they separate the classes. A cheap proxy: the spread of the per-class means (largest minus smallest), divided by the overall standard deviation. Print a ranked list: your "these are the features worth keeping" report.
One possible solution
    import seaborn as sns

    df = sns.load_dataset("penguins").dropna()
    num = df.select_dtypes("number").columns

    scores = {}
    for col in num:
        means = df.groupby("species")[col].mean()
        spread = means.max() - means.min()
        scores[col] = spread / df[col].std()

    for col, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"  {col:<22} separability {s:.2f}")
Non-negotiables: a per-feature separability score, ranked descending. The top features are your candidates; this echoes filter-style feature-selection methods, which score each feature against the labels before any model is trained.
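One standard score in the same spirit is the one-way ANOVA F-statistic, which compares the variance between class means with the variance inside each class. The sketch below computes it by hand on made-up numbers; `f_statistic` is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np

def f_statistic(values, labels):
    """One-way ANOVA F: between-class variance over within-class variance."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = values.mean()

    between = 0.0   # how far each class mean sits from the overall mean
    within = 0.0    # how spread out the values are inside each class
    for c in classes:
        v = values[labels == c]
        between += len(v) * (v.mean() - grand_mean) ** 2
        within += ((v - v.mean()) ** 2).sum()

    between /= len(classes) - 1
    within /= len(values) - len(classes)
    return float(between / within)

labels = ["a", "a", "a", "b", "b", "b"]

separating = f_statistic([1, 2, 3, 4, 5, 6], labels)   # class means 2 and 5
flat = f_statistic([1, 5, 3, 1, 5, 3], labels)         # class means 3 and 3

print(separating)  # 13.5
print(flat)        # 0.0
```

Unlike the mean-spread proxy, this score also penalises features whose classes are individually noisy, because the within-class variance sits in the denominator.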
Recap
3 min
A pairplot shows every feature pair (diagonal = distributions, off-diagonal = scatter); the correlation heatmap finds redundant features; kde/violin plots compare distributions per class. Use them to pick informative, non-redundant features before training. Next we start training real models.
Vocabulary Card
- pairplot: Grid of scatter plots for every feature pair, coloured by class.
- correlation: How strongly two numeric features move together, from −1 to +1.
- heatmap: Colour-coded grid of values; great for correlation matrices.
- redundant feature: One that's nearly a copy of another; safe to drop.
Homework
4 min
On your own dataset: produce a correlation heatmap, a targeted pairplot, and a ranked feature-separability list. Write a short note: which 2-3 features you'd train on and why, plus any you'd drop for redundancy.
The deliverable is two figures + a feature recommendation. Combine the heatmap and separability code from this lesson.