PY-L5-25 · Overfitting — When AI Memorises

Learning Goals

3 min

Define overfitting vs underfitting via the bias-variance idea.
Spot overfitting from the train-vs-validation gap.
Name the cures: more data, simpler model, regularisation, early stopping.
Demonstrate overfitting and fix it.

Warm-Up · The Student Who Memorised

5 min

One student memorises every past exam answer word-for-word — perfect on practice, lost on a new question. Another learns the underlying concepts — slightly worse on practice, great on anything new. The second "generalises". ML models can do either.

underfit   too simple — bad on train AND test
good fit    learns the real pattern — good on both
overfit     memorised noise — great on train, bad on test

Today's big idea

The gap between training score and validation/test score is the overfitting signal. Train high + val low = overfit. Both low = underfit. Close and decent = good. You ALREADY have the diagnostic — the train-vs-val curve.
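The rule above can be sketched as a tiny helper. This is only an illustration: the function name `diagnose_fit` and the thresholds `low` and `max_gap` are assumptions chosen for this example, not standard values, and sensible cut-offs depend on the task.

```python
def diagnose_fit(train_score, val_score, low=0.80, max_gap=0.05):
    """Label a model from its train and validation accuracy.

    `low` and `max_gap` are illustrative thresholds, not standard values.
    """
    gap = train_score - val_score
    if train_score < low:
        return "underfit"      # the model cannot even fit the data it trained on
    if gap > max_gap:
        return "overfit"       # great on train, clearly worse on validation
    return "good fit"          # close and decent


for train, val in [(0.62, 0.60), (0.95, 0.93), (1.00, 0.78)]:
    print(train, val, "->", diagnose_fit(train, val))
```

The three pairs are made-up scores, one for each row of the underfit / good fit / overfit table.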

New Concept · Spot It, Fix It

14 min

The tell-tale curve

accuracy
 │ train ────────────────  ← keeps rising toward 100%
 │ val   ──────╮
 │             ╰────╮       ← peaks, then FALLS  ← overfitting starts here
 └───────────────────── epochs

When validation accuracy peaks and then declines while training keeps climbing, the model has started memorising. The peak is where you'd stop.
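Finding that peak is a one-liner once you have the validation curve as a list. A minimal sketch, assuming a hand-written curve (the numbers in `val_curve` are invented for illustration; in practice the list would come from a training history):

```python
def peak_epoch(val_acc):
    """Return (epoch_index, value) of the highest validation accuracy."""
    best = max(range(len(val_acc)), key=lambda i: val_acc[i])
    return best, val_acc[best]


# Made-up validation curve that rises, peaks, then declines
val_curve = [0.70, 0.82, 0.88, 0.91, 0.90, 0.89, 0.87]
epoch, value = peak_epoch(val_curve)
print(f"validation peaks at epoch {epoch} (accuracy {value})")
# -> validation peaks at epoch 3 (accuracy 0.91)
```

Epochs are counted from 0 here, matching the index of a Python list.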

The cures (in order of preference)

1. MORE DATA           the best fix — more examples, harder to memorise
2. SIMPLER MODEL       fewer layers/neurons, smaller max_depth
3. REGULARISATION      penalise complexity (L2, dropout — next lesson)
4. EARLY STOPPING      stop at the val peak
5. DATA AUGMENTATION   (images) flip/rotate to fake more data
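Cure 5 is the easiest to picture in code. A minimal sketch using only NumPy, assuming a tiny square array stands in for a greyscale image (the helper name `augment` is hypothetical; real projects usually rely on a library's augmentation tools):

```python
import numpy as np


def augment(image):
    """Return the original image plus a horizontal flip and a 90-degree rotation."""
    return [image, np.fliplr(image), np.rot90(image)]


image = np.arange(9).reshape(3, 3)   # stand-in for a tiny 3x3 greyscale image
variants = augment(image)
print(len(variants), "training examples from 1 original")
```

Each variant keeps the same label as the original, so the model sees more varied inputs without anyone collecting new data.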

Underfitting — the opposite

Both train and val are low and flat.
Cure: a MORE powerful model (more capacity), more features, train longer,
      or re-tune the learning rate.
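Underfitting is easy to reproduce with a model that is too simple for the data. A minimal sketch, assuming a toy curved dataset and polynomial fits as the "model" (the data, the even/odd split, and the helper `mse` are all invented for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x**2 + rng.normal(0, 0.3, size=x.size)   # curved pattern plus a little noise

# Even/odd split into train and validation points
x_tr, y_tr, x_va, y_va = x[::2], y[::2], x[1::2], y[1::2]


def mse(degree):
    """Fit a polynomial of the given degree on train; return (train MSE, val MSE)."""
    coeffs = np.polyfit(x_tr, y_tr, degree)

    def err(xs, ys):
        return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

    return err(x_tr, y_tr), err(x_va, y_va)


for degree in (1, 2):
    tr, va = mse(degree)
    print(f"degree {degree}: train MSE {tr:.2f}, val MSE {va:.2f}")
```

The straight line (degree 1) has a large error on both splits, which is the underfit signature; giving the model enough capacity to bend (degree 2) brings both errors down together.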

The sweet spot is a balance

Increase model capacity until validation stops improving; that's the edge of overfitting. Too little capacity = underfit; too much = overfit. The right amount depends on how much data you have.
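The same search can be sketched without a neural net. In this illustration polynomial degree stands in for model capacity; the sine-wave data, the sample sizes, and the degree range are assumptions made for the example, not part of the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n):
    """Noisy samples of a sine wave on [-1, 1] (a made-up toy problem)."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.2, n)


x_tr, y_tr = make_data(20)    # small training set
x_va, y_va = make_data(200)   # larger validation set


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


degrees = list(range(1, 10))  # polynomial degree stands in for model capacity
train_err, val_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_tr, y_tr, d)
    train_err.append(mse(coeffs, x_tr, y_tr))
    val_err.append(mse(coeffs, x_va, y_va))

best = degrees[int(np.argmin(val_err))]
print("degree with lowest validation error:", best)
```

Training error can only go down as the degree grows, so it cannot tell you when to stop; the validation error is what marks the sweet spot.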

Worked Example · Make a Net Overfit, Then Fix It

12 min

# overfit_demo.py — too-big net on too-little data
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

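X, y = load_breast_cancer(return_X_y=True)
# use only 60 samples to make overfitting easy to trigger
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=60, random_state=0)
# fit the scaler on the training rows only, so no validation statistics leak in
scaler = StandardScaler().fit(Xtr)
Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)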

def big_net():
    return keras.Sequential([
        layers.Input(shape=(30,)),
        layers.Dense(256, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

m = big_net()
m.compile("adam", "binary_crossentropy", metrics=["accuracy"])
h = m.fit(Xtr, ytr, validation_data=(Xte, yte),
          epochs=200, batch_size=8, verbose=0)

plt.plot(h.history["accuracy"], label="train")
plt.plot(h.history["val_accuracy"], label="val")
plt.xlabel("epoch"); plt.ylabel("accuracy"); plt.legend()
plt.title("Overfitting: train ↑, val ↓")
plt.savefig("overfit.png", dpi=150)
# format to 3 decimals so a perfect score prints as 1.000 rather than 1.0
print(f"final train acc: {h.history['accuracy'][-1]:.3f}")
print(f"final val   acc: {h.history['val_accuracy'][-1]:.3f}")

Sample output

final train acc: 1.000
final val   acc: 0.912

Train hits 100% (the net has memorised the 60 samples); val lags and wobbles. Exact numbers vary between runs because the weights start from random values. Now the fix: smaller net + early stopping:

from tensorflow.keras.callbacks import EarlyStopping

small = keras.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
small.compile("adam", "binary_crossentropy", metrics=["accuracy"])
es = EarlyStopping(monitor="val_loss", patience=15, restore_best_weights=True)
small.fit(Xtr, ytr, validation_data=(Xte, yte),
          epochs=200, batch_size=8, callbacks=[es], verbose=0)
print("fixed val acc:", round(small.evaluate(Xte, yte, verbose=0)[1], 3))

Read the diff

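The smaller net can't memorise as easily, and early stopping halts once validation loss stops improving for 15 epochs (patience=15), then restores the best weights. The train-val gap shrinks and the held-out score is more honest. One caveat: this demo uses the same held-out split for early stopping and for the final score, so in a real project keep a separate test set that training never sees. Two of the cheapest cures, less capacity and early stopping, handle many cases of overfitting.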
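To see what a patience rule does, here is a simplified pure-Python sketch. It is not Keras's implementation; the function name `early_stop_epoch` and the loss values are invented for illustration.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a patience-based early-stopping rule.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs; `best_epoch` is the epoch whose
    weights would be restored.
    """
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch                     # out of patience: stop here
    return len(val_losses) - 1, best_epoch                   # never triggered: ran to the end


losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.40, 0.44]
stop, best = early_stop_epoch(losses, patience=3)
print(f"stopped at epoch {stop}, best weights from epoch {best}")
# -> stopped at epoch 6, best weights from epoch 3
```

A larger patience tolerates more wobble in the validation curve before giving up, at the cost of a few extra epochs of training.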

Try It Yourself

13 min

01 🟢 Spot it

Given three train/val curve descriptions, label each underfit / good / overfit.

02 🟡 More data helps

Re-run the overfit demo with train_size=400 instead of 60. Does the train-val gap shrink?

03 🔴 Tree depth revisited

Recreate the overfitting curve for a decision tree by plotting train vs CV accuracy as max_depth grows. Find the depth where they diverge.

Mini-Challenge · Capacity Sweep

8 min

Train nets with hidden sizes [4, 16, 64, 256] on a small dataset. Plot final train accuracy AND val accuracy vs hidden size. Identify where val stops improving (the sweet spot) and where overfitting begins.

Show the structure

sizes = [4, 16, 64, 256]
train_acc, val_acc = [], []
for s in sizes:
    m = keras.Sequential([layers.Input(shape=(30,)),
                          layers.Dense(s, activation="relu"),
                          layers.Dense(1, activation="sigmoid")])
    m.compile("adam", "binary_crossentropy", metrics=["accuracy"])
    h = m.fit(Xtr, ytr, validation_data=(Xte, yte),
              epochs=100, batch_size=8, verbose=0)
    train_acc.append(h.history["accuracy"][-1])
    val_acc.append(h.history["val_accuracy"][-1])
plt.plot(sizes, train_acc, "o-", label="train")
plt.plot(sizes, val_acc, "o-", label="val")
plt.xscale("log"); plt.legend(); plt.xlabel("hidden size")

Non-negotiables: the train line keeps rising; the val line peaks then plateaus/falls — the gap IS overfitting.

Recap

3 min

Overfitting = memorising training data; the tell is a train-vs-val gap. Underfitting = too simple, both scores low. Cures in order: more data, simpler model, regularisation, early stopping, augmentation. Always watch the validation curve. Next: the two regularisation tricks — dropout and weight penalties.

Vocabulary Card

overfitting: High train score, low test score — memorised noise, didn't generalise.
underfitting: Low scores everywhere — model too simple to capture the pattern.
generalisation: Performing well on data the model never saw — the actual goal.
early stopping: Halting training at the validation peak to avoid memorising.

Homework

4 min

Deliberately overfit a model on a small dataset, capture the curve, then apply two cures (e.g., less capacity + early stopping, or more data). Show before/after curves and the improved test score. One paragraph: which cure helped most and why.
