PY-L5-07 · Seaborn 101 — Pretty Plots Fast

Learning Goals

3 min

Install seaborn; load a built-in dataset.
Draw histplot, countplot, scatterplot, boxplot in one line each.
Colour points by a category with hue=.
Read each chart for an ML insight (separability, imbalance, outliers).

Warm-Up · Look Before You Train

5 min

pip install seaborn   # run this in a terminal, not inside Python

import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins")   # sample dataset, downloaded on first use (needs internet)
print(penguins.head())
print(penguins["species"].value_counts())

Today's big idea

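Charts answer "is this problem even learnable?" before you waste an hour training. If the classes overlap completely on every feature you have, no model will separate them. If one class is 1% of the data, a model that always predicts the other class scores 99% accuracy while finding nothing, so accuracy will mislead. You see all of this in seconds.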
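To make the accuracy trap concrete, here is a minimal sketch using made-up labels (99 "healthy", 1 "sick"; both names are invented for illustration). It measures what a "model" that always predicts the majority class would score.

```python
from collections import Counter

# Made-up labels for illustration: a 1% minority class.
labels = ["healthy"] * 99 + ["sick"] * 1

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A "model" that ignores its input and always predicts the majority class.
predictions = [majority_class] * len(labels)

baseline_accuracy = majority_count / len(labels)
sick_found = sum(1 for pred, true in zip(predictions, labels)
                 if pred == true == "sick")

print(f"Always predicting '{majority_class}' scores {baseline_accuracy:.0%} accuracy")
print(f"Sick cases found: {sick_found} of {counts['sick']}")
```

Any real model on such data has to beat this baseline on the minority class, not just on overall accuracy.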

New Concept · The Four Workhorses

14 min

histplot — distribution of one number

sns.histplot(data=penguins, x="body_mass_g", bins=30, kde=True)
plt.show()

Shows the shape — is it bell-shaped, skewed, bimodal? kde=True overlays a smooth curve.
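To demystify the smooth curve: a kernel density estimate places a small bell curve on every data point and averages them. The sketch below does this by hand with made-up masses and a hand-picked bandwidth of 0.3 (an assumption for illustration; seaborn chooses a bandwidth for you). The helper name simple_kde is hypothetical.

```python
import math

def simple_kde(samples, bandwidth):
    """Return a function giving the Gaussian kernel density estimate at x."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        bumps = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return bumps / norm

    return density

# Made-up body masses in kg: one group near 3.6, another near 5.2.
masses_kg = [3.4, 3.6, 3.7, 3.8, 5.0, 5.2, 5.3]
density = simple_kde(masses_kg, bandwidth=0.3)

for x in (3.7, 4.4, 5.2):
    print(f"density at {x} kg: {density(x):.3f}")
```

The curve is high where points crowd together and dips in the gap between the two groups, which is how a bimodal shape shows up.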

countplot — frequency of a category

sns.countplot(data=penguins, x="species")
plt.show()

Instantly reveals class balance. Three roughly equal bars = balanced; one tiny bar = imbalance you must handle.

scatterplot — relationship of two numbers, coloured by class

sns.scatterplot(data=penguins, x="bill_length_mm",
                y="flipper_length_mm", hue="species")
plt.show()

Often the most useful chart for a classification problem. If the colours form separate clusters, the classes are separable on these two features, and a classifier should do well. If they're a jumbled mess, expect trouble unless other features separate them.
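One rough way to put a number on what the scatter shows is to ask how often a point lies closest to the centre of its own class. The sketch below is an illustration on synthetic points, not a standard seaborn feature; the helpers make_blobs and nearest_centroid_accuracy are hypothetical names.

```python
import random

def make_blobs(centers, n_per_class, seed):
    """Generate 2-D points scattered (std 1.0) around each class centre."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            labels.append(label)
    return points, labels

def nearest_centroid_accuracy(points, labels):
    """Fraction of points whose nearest class centroid is their own class.

    Scored on the same points used to compute the centroids, so it is
    an optimistic, eyeball-level check rather than a real evaluation.
    """
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    centroids = {
        label: (sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts))
        for label, pts in groups.items()
    }
    correct = 0
    for (x, y), label in zip(points, labels):
        nearest = min(
            centroids,
            key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
        )
        correct += nearest == label
    return correct / len(points)

# Two classes far apart vs. two classes drawn around the same centre.
separated = make_blobs({"A": (0, 0), "B": (8, 8)}, n_per_class=100, seed=0)
jumbled = make_blobs({"A": (0, 0), "B": (0, 0)}, n_per_class=100, seed=0)

separated_score = nearest_centroid_accuracy(*separated)
jumbled_score = nearest_centroid_accuracy(*jumbled)
print(f"separated clusters: {separated_score:.2f}")
print(f"jumbled clusters:   {jumbled_score:.2f}")
```

A score near 1.0 matches clean clusters in the scatter; a score near chance (0.5 for two equal classes) matches the jumbled mess.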

boxplot — distribution per category, with outliers

sns.boxplot(data=penguins, x="species", y="body_mass_g")
plt.show()

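Box = middle 50% of the values, line inside the box = median, whiskers = the range of the remaining typical values, dots = points beyond the whiskers (potential outliers). Compares groups and flags extremes.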
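By default, seaborn's boxplot draws a point as a dot when it lies more than 1.5 times the interquartile range (IQR) beyond the box. A minimal sketch of that rule on made-up numbers, using only the standard library (the helper name iqr_outliers is hypothetical):

```python
import statistics

def iqr_outliers(values, whis=1.5):
    """Return (outliers, (low_fence, high_fence)) using the whis * IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return outliers, (low_fence, high_fence)

# Made-up measurements with one obvious extreme value.
values = [10, 12, 13, 14, 15, 16, 18, 40]
outliers, (low_fence, high_fence) = iqr_outliers(values)

print(f"fences: {low_fence} to {high_fence}")   # fences: 7.125 to 22.125
print(f"outliers: {outliers}")                  # outliers: [40]
```

A dot on the chart is a prompt to investigate, not proof of an error: it may be a typo, or a genuinely unusual penguin.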

Make it look good

sns.set_theme(style="whitegrid")    # call once at the top
plt.figure(figsize=(8, 5))
# ... your plot ...
plt.title("Penguin body mass by species")
plt.tight_layout()
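Filled in, the template above might look like the sketch below. It uses a tiny made-up DataFrame instead of the real penguins data so it runs offline, and it relies on the fact that seaborn's plotting functions return a matplotlib Axes, which lets you set the title and axis labels in one call. The output filename is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme(style="whitegrid")    # call once at the top

# Made-up values for illustration only (not the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "body_mass_g": [3500, 3700, 3800, 3600, 4900, 5100, 5200, 5000],
})

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=toy, x="species", y="body_mass_g", ax=ax)

# Readable title and labels with units, instead of raw column names.
ax.set(title="Penguin body mass by species",
       xlabel="Species", ylabel="Body mass (g)")

fig.tight_layout()
fig.savefig("styled_boxplot.png", dpi=150)
```

Add plt.show() at the end if you want the window to pop up as well as saving the file.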

Worked Example · Is This Learnable?

12 min

# explore.py — a 4-panel "should I even train?" check
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
df = sns.load_dataset("penguins").dropna()

fig, axes = plt.subplots(2, 2, figsize=(11, 8))

# class balance
sns.countplot(data=df, x="species", ax=axes[0, 0])
axes[0, 0].set_title("Class balance")

# one feature's distribution
sns.histplot(data=df, x="flipper_length_mm", hue="species",
             kde=True, ax=axes[0, 1])
axes[0, 1].set_title("Flipper length distribution")

# separability
sns.scatterplot(data=df, x="bill_length_mm", y="flipper_length_mm",
                hue="species", ax=axes[1, 0])
axes[1, 0].set_title("Are classes separable?")

# per-class spread
sns.boxplot(data=df, x="species", y="body_mass_g", ax=axes[1, 1])
axes[1, 1].set_title("Body mass by species")

fig.tight_layout()
fig.savefig("explore.png", dpi=150)
plt.show()

Read the charts

Four panels answer four ML questions: balance (mildly uneven: Chinstrap has roughly half as many rows as Adelie, worth noting but not alarming), distribution (Gentoo flippers peak well above the other two species, whose distributions overlap), separability (the scatter shows three loose clusters, so a classifier should work), spread (no wild outliers). You now expect a good model before training one. That intuition is what separates careful ML practitioners from button-pushers.

Try It Yourself

13 min

01 🟢 Histogram

Load the tips dataset (sns.load_dataset("tips")). Plot the distribution of total_bill.

02 🟡 Coloured scatter

Scatter total_bill vs tip, coloured by time (lunch/dinner). Is there a relationship?

Hint

tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()

03 🔴 Spot the imbalance

Load the titanic dataset. Use countplot on survived and on class. Describe one imbalance and how it might bias a naive model.

Mini-Challenge · The 4-Panel EDA

8 min

Pick any seaborn built-in dataset (iris, diamonds, titanic). Build the same 4-panel "is this learnable?" figure: class balance, a distribution, a separability scatter, and a boxplot. Write a 3-sentence verdict.

Show the structure

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")
fig, ax = plt.subplots(2, 2, figsize=(11, 8))
sns.countplot(data=df, x="species", ax=ax[0, 0])               # class balance
sns.histplot(data=df, x="petal_length", hue="species", ax=ax[0, 1])   # distribution
sns.scatterplot(data=df, x="petal_length", y="petal_width",
                hue="species", ax=ax[1, 0])                    # separability
sns.boxplot(data=df, x="species", y="sepal_length", ax=ax[1, 1])      # per-class spread
fig.tight_layout()
fig.savefig("iris_eda.png", dpi=150)
# Verdict: balanced, petal features separate the species cleanly,
# so even a simple classifier should score very high.

Recap

3 min

Seaborn draws statistical charts in one line from a DataFrame. histplot for distributions, countplot for balance, scatterplot + hue for separability, boxplot for per-group spread. Always explore before training — charts tell you whether a problem is learnable. Tomorrow: the three deeper seaborn plots.

Vocabulary Card

EDA: Exploratory Data Analysis — looking at data before modelling.
hue: Seaborn kwarg that colours points/bars by a category.
separability: How cleanly classes form distinct regions — predicts classifier success.
kde: Kernel Density Estimate — a smooth curve approximating a distribution.

Homework

4 min

On your own prepped dataset from Lesson 6, build the 4-panel EDA figure and write a verdict: is this problem learnable, balanced, and separable? What will be the hardest part for a model?
