Learning Goals
3 min
- Install seaborn; load a built-in dataset.
- Draw histplot, countplot, scatterplot, boxplot in one line each.
- Colour points by a category with hue=.
- Read each chart for an ML insight (separability, imbalance, outliers).
Warm-Up · Look Before You Train
5 min
pip install seaborn
import seaborn as sns
import matplotlib.pyplot as plt
penguins = sns.load_dataset("penguins")  # built-in sample
print(penguins.head())
print(penguins["species"].value_counts())
Charts answer "is this problem even learnable?" before you waste an hour training. If the classes overlap completely, no model will separate them. If one class is 1% of the data, accuracy will mislead. You see all of this in seconds.
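To make the "accuracy will mislead" point concrete, here is a minimal sketch using made-up labels (not a seaborn dataset). It assumes a hypothetical "model" that always predicts the majority class, and compares its accuracy with its recall on the rare class.

```python
# Hypothetical labels: 1% positive class, 99% negative class.
y_true = [1] * 10 + [0] * 990

# A do-nothing "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of rows where the prediction matches the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the rare class: positives found / positives that exist.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy looks excellent while the rare class is never detected, which is why a class-balance chart is worth drawing before trusting any accuracy number.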
New Concept · The Four Workhorses
14 min
histplot: distribution of one number
sns.histplot(data=penguins, x="body_mass_g", bins=30, kde=True)
plt.show()
Shows the shape — is it bell-shaped, skewed, bimodal? kde=True overlays a smooth curve.
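A rough numeric companion to reading the histogram's shape, sketched with the standard library only: when the mean sits well above the median, the distribution probably has a long right tail. The values are invented for illustration and are not penguin measurements.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are large.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20, 35]

mean = statistics.mean(values)
median = statistics.median(values)

# Mean above median hints at right skew; mean below median hints at left skew.
if mean > median:
    shape = "right-skewed"
elif mean < median:
    shape = "left-skewed"
else:
    shape = "roughly symmetric"

print(f"mean={mean:.2f}, median={median}, shape={shape}")
```

This is a heuristic, not a test: it cannot detect a bimodal shape, which is one reason the histogram itself is still worth drawing.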
countplot: frequency of a category
sns.countplot(data=penguins, x="species")
plt.show()
Instantly reveals class balance. Three roughly equal bars = balanced; one tiny bar = imbalance you must handle.
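The bar heights in a count plot are category counts, so the same balance check can be done numerically. The sketch below uses a hypothetical helper, `class_shares`, and invented labels; neither comes from seaborn.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the rows, largest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical labels for illustration only.
labels = ["cat"] * 90 + ["dog"] * 9 + ["bird"] * 1
shares = class_shares(labels)

for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

A class with a share near zero is the "one tiny bar" case: it is worth deciding how to handle it before training.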
scatterplot: relationship of two numbers, coloured by class
sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species")
plt.show()
The most useful ML chart. If colours form separate clusters, the classes are separable — a classifier will do well. If they're a jumbled mess, expect trouble.
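One way to put a rough number on what the scatter shows is a nearest-centroid check: assign each point to the class whose mean point is closest and count how often that matches the true label. This is a sketch under assumptions, not a method from this lesson; the helper name `nearest_centroid_accuracy` and both point sets are hypothetical.

```python
import math
from collections import defaultdict


def nearest_centroid_accuracy(points, labels):
    """Share of points whose nearest class centroid matches their own label."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    # Centroid = per-class mean of each coordinate.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in groups.items()
    }

    hits = 0
    for point, label in zip(points, labels):
        guess = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        hits += guess == label
    return hits / len(points)


labels = ["a", "a", "a", "b", "b", "b"]
# Hypothetical points: two tight, well-separated clusters.
separated = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
# Hypothetical points: the two classes are interleaved.
jumbled = [(1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (3, 2)]

print(nearest_centroid_accuracy(separated, labels))
print(nearest_centroid_accuracy(jumbled, labels))
```

The score is computed on the same points used to build the centroids, so it is optimistic. Treat it as a quick sanity check alongside the chart, not as a model evaluation.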
boxplot: distribution per category, with outliers
sns.boxplot(data=penguins, x="species", y="body_mass_g")
plt.show()
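Box = middle 50% (the interquartile range, IQR), line = median, whiskers = the most extreme points within 1.5 × IQR of the box, dots = points beyond the whiskers, drawn as outliers. Compares groups and flags extremes.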
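The dots on a box plot come from a simple rule: with seaborn's default `whis=1.5`, a point is drawn as an outlier when it lies more than 1.5 times the interquartile range beyond the box. A minimal sketch of that rule follows, using a hypothetical helper `iqr_outliers` and invented data. The quartile interpolation here may differ slightly from the plotting library's.

```python
import statistics


def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    # method="inclusive" interpolates linearly between neighbouring data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(sample))
```

Running the same check per category gives a list of the rows behind each dot, which is useful when deciding whether an extreme is a data-entry error or a real observation.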
Make it look good
sns.set_theme(style="whitegrid")  # call once at the top
plt.figure(figsize=(8, 5))
# ... your plot ...
plt.title("Penguin body mass by species")
plt.tight_layout()
Worked Example · Is This Learnable?
12 min
# explore.py: a 4-panel "should I even train?" check
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
df = sns.load_dataset("penguins").dropna()
fig, axes = plt.subplots(2, 2, figsize=(11, 8))

# class balance
sns.countplot(data=df, x="species", ax=axes[0, 0])
axes[0, 0].set_title("Class balance")

# one feature's distribution
sns.histplot(data=df, x="flipper_length_mm", hue="species", kde=True, ax=axes[0, 1])
axes[0, 1].set_title("Flipper length distribution")

# separability
sns.scatterplot(data=df, x="bill_length_mm", y="flipper_length_mm", hue="species", ax=axes[1, 0])
axes[1, 0].set_title("Are classes separable?")

# per-class spread
sns.boxplot(data=df, x="species", y="body_mass_g", ax=axes[1, 1])
axes[1, 1].set_title("Body mass by species")

fig.tight_layout()
fig.savefig("explore.png", dpi=150)
plt.show()
Read the figure
Four panels answer four ML questions: balance (uneven but not severe: Chinstrap is the smallest class, at roughly a fifth of the rows), distribution (clear peaks per species), separability (the scatter shows three loose clusters, so a classifier should work), spread (no wild outliers). You now expect a good model before training one. That intuition is what separates careful ML practitioners from button-pushers.
Try It Yourself
13 min
Load the tips dataset (sns.load_dataset("tips")). Plot the distribution of total_bill.
Scatter total_bill vs tip, coloured by time (lunch/dinner). Is there a relationship?
Hint
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()
Load the titanic dataset. Use countplot on survived and on class. Describe one imbalance and how it might bias a naive model.
Mini-Challenge · The 4-Panel EDA
8 min
Pick any seaborn built-in dataset (iris, diamonds, titanic). Build the same 4-panel "is this learnable?" figure: class balance, a distribution, a separability scatter, and a boxplot. Write a 3-sentence verdict.
Show the structure
df = sns.load_dataset("iris")
fig, ax = plt.subplots(2, 2, figsize=(11, 8))
sns.countplot(data=df, x="species", ax=ax[0, 0])
sns.histplot(data=df, x="petal_length", hue="species", ax=ax[0, 1])
sns.scatterplot(data=df, x="petal_length", y="petal_width", hue="species", ax=ax[1, 0])
sns.boxplot(data=df, x="species", y="sepal_length", ax=ax[1, 1])
fig.tight_layout()
fig.savefig("iris_eda.png", dpi=150)
# Verdict: balanced, petal features separate the species cleanly,
# so even a simple classifier should score very high.
Recap
3 min
Seaborn draws statistical charts in one line from a DataFrame. Use histplot for distributions, countplot for balance, scatterplot + hue for separability, and boxplot for per-group spread. Always explore before training: charts tell you whether a problem is learnable. Tomorrow: the three deeper seaborn plots.
Vocabulary Card
- EDA: Exploratory Data Analysis, looking at data before modelling.
- hue: Seaborn kwarg that colours points/bars by a category.
- separability: How cleanly classes form distinct regions, which predicts classifier success.
- kde: Kernel Density Estimate, a smooth curve approximating a distribution.
Homework
4 min
On your own prepped dataset from Lesson 6, build the 4-panel EDA figure and write a verdict: is this problem learnable, balanced, and separable? What will be the hardest part for a model?
The deliverable is the figure + verdict. Reuse the 4-panel structure from the mini-challenge with your own columns.