PY-L5-26 · Dropout & Regularisation

Learning Goals

3 min

Explain dropout: randomly zero neurons during training.
Add Dropout layers to a Keras model.
Add L2 weight regularisation to a Dense layer.
Show the train-val gap shrink after regularising.

Warm-Up · Don't Rely on One Star Player

5 min

A team that depends on one superstar collapses when that player is injured. Bench random players at each practice and everyone else has to step up, so the whole team becomes robust. Dropout does the same thing to neurons.

Today's big idea

Regularisation = adding a constraint that discourages complexity. Dropout forces redundancy (no neuron can be a single point of failure). L2 keeps weights small (smooth, simple functions). Both make the net generalise better.

New Concept · Dropout & L2

14 min

Dropout

from tensorflow.keras import layers

layers.Dropout(0.3)   # during training, randomly zero 30% of inputs

Applied between Dense layers.
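Active only during training, when Keras also scales the surviving values up by 1 / (1 - rate) so the expected total is unchanged. At prediction time all neurons are used (Keras handles this automatically).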
Typical rates: 0.2-0.5. Higher = stronger regularisation.
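
The behaviour in the list above can be sketched in plain Python. This is a minimal illustration, not Keras' implementation: the function name dropout_sketch and its arguments are hypothetical. It shows the one detail the layer handles for you, namely that survivors are scaled up during training so nothing needs rescaling at prediction time.

```python
import random

def dropout_sketch(values, rate, training, rng=random):
    """Illustrative inverted dropout on a list of activations (hypothetical helper).

    Assumes 0 <= rate < 1.
    """
    if not training or rate == 0.0:
        # Prediction time: every value passes through unchanged.
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    # Each value is zeroed with probability `rate`; survivors are scaled up.
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)          # fixed seed so the run is repeatable
acts = [1.0] * 10_000           # made-up activations, all equal to 1
train_out = dropout_sketch(acts, rate=0.3, training=True, rng=rng)
infer_out = dropout_sketch(acts, rate=0.3, training=False)

zero_fraction = train_out.count(0.0) / len(train_out)
mean_train = sum(train_out) / len(train_out)
print(f"zeroed during training: {zero_fraction:.2f}")   # close to 0.30
print(f"mean during training:   {mean_train:.2f}")      # close to 1.00
print(f"unchanged at inference: {infer_out == acts}")
```

Roughly 30% of the values are dropped, yet the average stays near 1 because the rest were scaled by 1 / 0.7.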

Where it goes

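from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(30,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.4),                    # drops 40% of the 128 outputs above
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),                    # drops 40% of the 64 outputs above
    layers.Dense(1, activation="sigmoid"),  # no dropout after the output layer
])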

L2 weight regularisation

from tensorflow.keras import regularizers

layers.Dense(64, activation="relu",
             kernel_regularizer=regularizers.l2(0.001))

L2 adds λ × sum(weights²) to the loss. The optimiser now has to balance fitting the data against keeping weights small — which smooths the function and curbs memorising. λ (e.g. 0.001) controls the strength.
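
To make the arithmetic concrete, here is a small sketch of the penalty for one layer. The weights and the data-loss value are made-up numbers for illustration, and l2_penalty is a hypothetical helper, not a Keras function.

```python
def l2_penalty(weights, lam):
    """Return lam * sum(w**2) over an iterable of weights (illustrative helper)."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]      # made-up weights for one small layer
data_loss = 0.30                # made-up cross-entropy on a batch
lam = 0.001

penalty = l2_penalty(weights, lam)   # 0.001 * (0.25 + 1.0 + 4.0)
total_loss = data_loss + penalty
print(f"penalty = {penalty:.5f}, total = {total_loss:.5f}")
# prints: penalty = 0.00525, total = 0.30525
```

Doubling every weight quadruples the penalty, which is why large weights are discouraged far more strongly than small ones.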

L1 vs L2

L2 (ridge)  shrinks all weights toward 0 — smooth, the usual choice
L1 (lasso)  drives some weights exactly to 0 — sparse, feature selection
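
One way to see the difference is to apply each penalty's shrinkage step to the same weights, ignoring the data loss. This is a sketch under assumptions: the step sizes are arbitrary, and the L1 step uses soft-thresholding (the update that lasso solvers use), which is not necessarily what a Keras optimiser does internally. Both function names are hypothetical.

```python
def l2_step(weights, shrink):
    """Multiply every weight by (1 - shrink): all get smaller, none reaches zero."""
    return [w * (1.0 - shrink) for w in weights]

def l1_step(weights, threshold):
    """Soft-thresholding: pull every weight toward zero by `threshold`, clipping at zero."""
    out = []
    for w in weights:
        shrunk = abs(w) - threshold
        if shrunk <= 0.0:
            out.append(0.0)        # small weights are set exactly to zero
        elif w > 0:
            out.append(shrunk)
        else:
            out.append(-shrunk)
    return out

weights = [2.0, -0.5, 0.05, -0.02]   # made-up weights
after_l2 = l2_step(weights, shrink=0.1)
after_l1 = l1_step(weights, threshold=0.1)
print("L2:", after_l2)   # every weight is smaller, all still non-zero
print("L1:", after_l1)   # the two smallest weights are now exactly 0.0
```

L2 shrinks in proportion to the weight's size, so small weights barely move; L1 subtracts the same amount from every weight, so small ones are wiped out.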

BatchNorm (a bonus stabiliser)

layers.BatchNormalization() normalises activations between layers — speeds training and adds mild regularisation. Common in bigger nets; optional here.
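
As a rough sketch of what "normalises activations" means, the snippet below standardises one neuron's outputs across a batch. It assumes training-time behaviour with the layer's learnable scale and shift left at 1 and 0; the real layer also learns those two parameters and keeps moving averages for prediction time. batch_norm_sketch, the eps value and the sample numbers are illustrative only.

```python
import math

def batch_norm_sketch(batch, eps=1e-3):
    """Standardise one feature across a batch: (x - mean) / sqrt(var + eps)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

activations = [10.0, 12.0, 14.0, 20.0]   # made-up outputs of one neuron
normed = batch_norm_sketch(activations)
print([round(v, 2) for v in normed])     # centred on 0 with spread close to 1
```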

Worked Example · Regularise the Overfitter

12 min

# regularise.py — same overfit-prone setup, now with dropout + L2
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers, regularizers
import matplotlib.pyplot as plt

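X, y = load_breast_cancer(return_X_y=True)
# Only 80 training samples, so the unregularised net overfits easily.
Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=80, random_state=0)
# Fit the scaler on the training split only, so no held-out statistics leak in.
scaler = StandardScaler().fit(Xtr)
Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)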

def model(reg=False):
    if reg:
        return keras.Sequential([
            layers.Input(shape=(30,)),
            layers.Dense(128, activation="relu",
                         kernel_regularizer=regularizers.l2(0.001)),
            layers.Dropout(0.4),
            layers.Dense(64, activation="relu",
                         kernel_regularizer=regularizers.l2(0.001)),
            layers.Dropout(0.4),
            layers.Dense(1, activation="sigmoid"),
        ])
    return keras.Sequential([
        layers.Input(shape=(30,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

fig, ax = plt.subplots(1, 2, figsize=(11, 4))
for i, reg in enumerate([False, True]):
    m = model(reg)
    m.compile("adam", "binary_crossentropy", metrics=["accuracy"])
    h = m.fit(Xtr, ytr, validation_data=(Xte, yte),
              epochs=200, batch_size=8, verbose=0)
    ax[i].plot(h.history["accuracy"], label="train")
    ax[i].plot(h.history["val_accuracy"], label="val")
    ax[i].set_title("with reg" if reg else "no reg")
    ax[i].legend()
fig.savefig("reg_compare.png", dpi=150)

What you'll see

no reg : train typically shoots to 1.0 while val lags, leaving a wide gap
with reg: train rises more slowly and val tracks it closely, leaving a smaller gap

Read the diff

Regularisation deliberately makes the model worse on the training data, and that's the point. Because it prevents memorisation, the train and val curves stay close, so the model that ships generalises better. Training accuracy is not the goal; the gap is what you manage.
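
Since the gap is the quantity to manage, it helps to compute it rather than eyeball it. A minimal sketch, assuming a history dictionary shaped like the h.history used in the worked example. train_val_gap is a hypothetical helper, and the numbers are invented to show the calculation; they are not results from a real run.

```python
def train_val_gap(history, last_n=10):
    """Average (train accuracy - val accuracy) over the last `last_n` epochs."""
    train = history["accuracy"][-last_n:]
    val = history["val_accuracy"][-last_n:]
    return sum(t - v for t, v in zip(train, val)) / len(train)

# Invented curves, three epochs each, purely to exercise the function.
no_reg = {"accuracy": [0.95, 1.00, 1.00], "val_accuracy": [0.90, 0.91, 0.90]}
with_reg = {"accuracy": [0.90, 0.94, 0.95], "val_accuracy": [0.90, 0.93, 0.93]}

gap_no_reg = train_val_gap(no_reg)
gap_with_reg = train_val_gap(with_reg)
print(f"gap without reg: {gap_no_reg:.3f}")
print(f"gap with reg:    {gap_with_reg:.3f}")
```

In the worked example, calling train_val_gap(h.history) after m.fit(...) would give one number per panel. Averaging over the last few epochs smooths out epoch-to-epoch noise.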

Try It Yourself

13 min

01 🟢 Add dropout

Take any overfitting net and insert two Dropout(0.3) layers. Compare the train-val gap before and after.

02 🟡 Dropout-rate sweep

Try dropout rates 0.0, 0.2, 0.5, 0.8. Plot final val accuracy vs rate. Too much dropout hurts — find the sweet spot.

03 🔴 L2 strength sweep

Sweep L2 λ in [0, 1e-4, 1e-3, 1e-2, 1e-1]. Plot train and val accuracy vs λ. What happens at very high λ?

Answer

At very high λ the model is forced toward tiny weights — it underfits, and both train and val accuracy fall. Regularisation has a sweet spot, just like model capacity.

Mini-Challenge · The Best-Generalising Net

8 min

Combine everything: a reasonably-sized net + dropout + L2 + early stopping. Aim for the smallest train-val gap while keeping val accuracy high. Report your final architecture and the gap you achieved.
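
Early stopping is the one ingredient not shown in code in this lesson. Keras provides it as the keras.callbacks.EarlyStopping callback (with monitor, patience and restore_best_weights arguments), passed to fit through callbacks=[...]. The pure-Python sketch below only illustrates the patience rule on a made-up list of validation losses; early_stop_epoch is a hypothetical helper, not part of Keras.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) under a simple patience rule.

    Training stops once `patience` epochs in a row fail to beat the best loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out: every epoch was trained.
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50]  # made-up values
stop, best = early_stop_epoch(val_losses, patience=3)
print(stop, best)   # prints: 5 2
```

The run stops at epoch 5 and never sees the later improvement at epoch 6, which is the price of a small patience value.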

Recap

3 min

Dropout randomly silences neurons during training, forcing redundancy. L2 penalises large weights for smoother functions. Both reduce overfitting by adding a constraint. They have sweet spots — too much causes underfitting. Combine with early stopping for robust nets. That completes the neural-net toolkit; next we apply it to images.

Vocabulary Card

regularisation: Techniques that discourage model complexity to improve generalisation.
dropout: Randomly zeroing a fraction of neurons during training.
L2 / weight decay: Adding a penalty proportional to the sum of squared weights to the loss, which discourages large weights.
BatchNormalization: Normalising activations between layers; speeds training, mild regularisation.

Homework

4 min

On an overfit-prone dataset, produce a 2-panel figure: no-regularisation vs full-regularisation (dropout + L2 + early stopping) train/val curves. Report both test accuracies and the gap reduction. Write one paragraph on the trade-off you observed.
