Learning Goals
3 min
- Explain dropout: randomly zero neurons during training.
- Add Dropout layers to a Keras model.
- Add L2 weight regularisation to a Dense layer.
- Show the train-val gap shrink after regularising.
Warm-Up · Don't Rely on One Star Player
5 min
A team that depends on one superstar collapses when that player is injured. Coach the team to function with random players benched each practice and everyone improves: the team becomes robust. Dropout does exactly this to neurons.
Regularisation = adding a constraint that discourages complexity. Dropout forces redundancy (no neuron can be a single point of failure). L2 keeps weights small (smooth, simple functions). Both make the net generalise better.
New Concept · Dropout & L2
14 min
Dropout

    from tensorflow.keras import layers

    layers.Dropout(0.3)  # during training, randomly zero 30% of inputs

- Applied between Dense layers.
- Active only during training; at prediction time all neurons are used (Keras handles this automatically).
- Typical rates: 0.2-0.5. Higher = stronger regularisation.
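To make the training-versus-prediction difference concrete, here is a minimal NumPy sketch of the "inverted dropout" scheme that the Keras Dropout layer follows: surviving inputs are scaled up by 1 / (1 - rate) during training so the expected activation is unchanged, and nothing is altered at prediction time. The function name `dropout_forward`, the dummy activations, and the seed are illustrative assumptions, not Keras internals.

```python
import numpy as np

def dropout_forward(x, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of x during training."""
    if not training or rate == 0.0:
        return x  # prediction time: every unit is used, unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob   # True where the unit survives
    return x * mask / keep_prob              # rescale survivors to keep the mean

rng = np.random.default_rng(0)
x = np.ones((1000, 64))                      # dummy activations, all equal to 1

train_out = dropout_forward(x, rate=0.3, training=True, rng=rng)
test_out = dropout_forward(x, rate=0.3, training=False, rng=rng)

print("fraction zeroed in training:", np.mean(train_out == 0))
print("mean activation in training:", train_out.mean())
print("mean activation at prediction:", test_out.mean())
```

The zeroed fraction should land near the chosen rate, while the mean activation stays near 1 in both modes. That is why the layer needs no manual switch between training and prediction.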
Where it goes
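
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(30,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.4),                    # dropout after the first hidden layer
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.4),                    # and after the second
        layers.Dense(1, activation="sigmoid"),  # no dropout after the output layer
    ])
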
L2 weight regularisation

    from tensorflow.keras import layers, regularizers

    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001))

L2 adds λ × sum(weights²) to the loss. The optimiser now has to balance fitting the data against keeping weights small — which smooths the function and curbs memorising. λ (e.g. 0.001) controls the strength.
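The penalty term can be computed by hand. The sketch below assumes two hypothetical weight matrices and a made-up data loss of 0.25, and simply evaluates λ × sum(weights²) for each matrix before adding it to the loss; the helper names are illustrative.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lam * sum(weights**2), the term added to the loss."""
    return lam * np.sum(np.square(weights))

def total_loss(data_loss, weight_matrices, lam):
    """Data loss plus the L2 penalty summed over every regularised layer."""
    return data_loss + sum(l2_penalty(w, lam) for w in weight_matrices)

small = np.array([[0.1, -0.2], [0.3, 0.1]])    # hypothetical small weights
large = np.array([[3.0, -4.0], [5.0, 2.0]])    # hypothetical large weights

print(l2_penalty(small, lam=0.001))            # tiny penalty
print(l2_penalty(large, lam=0.001))            # much larger penalty
print(total_loss(0.25, [small, large], lam=0.001))
```

Because the penalty grows with the square of each weight, large weights cost far more than small ones. That is the pressure that keeps the learned function smooth.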
L1 vs L2
L2 (ridge): shrinks all weights toward 0. Smooth; the usual choice.
L1 (lasso): drives some weights exactly to 0. Sparse; acts as feature selection.
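The contrast shows up when only the penalty acts on a few weights. This sketch is an illustration under stated assumptions: the L2 update is a plain gradient step on λ × sum(w²), while the L1 update uses the proximal (soft-threshold) form, chosen here because it makes the exact zeros visible. It is not a description of how Keras applies either penalty, and the weights, learning rate, and λ are exaggerated, hypothetical values.

```python
import numpy as np

def l2_shrink(w, lr, lam):
    """One gradient step on lam * sum(w**2): scales every weight toward 0."""
    return w - lr * 2.0 * lam * w

def l1_shrink(w, lr, lam):
    """Proximal (soft-threshold) step on lam * sum(|w|): small weights hit 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([2.0, -0.8, 0.03, -0.01])   # hypothetical weights
lr, lam = 0.1, 0.5                        # exaggerated so the effect is visible

w_l2, w_l1 = w.copy(), w.copy()
for _ in range(10):
    w_l2 = l2_shrink(w_l2, lr, lam)
    w_l1 = l1_shrink(w_l1, lr, lam)

print("L2:", w_l2)   # every weight smaller, none exactly zero
print("L1:", w_l1)   # the two smallest weights are exactly zero
```

L2 shrinks each weight in proportion to its size, so small weights fade but never vanish. L1 subtracts a fixed amount per step, so small weights are clipped to zero.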
BatchNorm (a bonus stabiliser)
layers.BatchNormalization() normalises activations between layers — speeds training and adds mild regularisation. Common in bigger nets; optional here.
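As a rough picture of what "normalises activations" means, the NumPy sketch below standardises each unit over a batch and then applies a scale (`gamma`) and shift (`beta`). It covers only the training-time computation; the Keras layer additionally tracks moving averages for use at prediction time. The function name, the `eps` value, and the random activations are assumptions for illustration.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalise each feature over the batch, then apply scale and shift."""
    mean = x.mean(axis=0)                     # per-unit mean over the batch
    var = x.var(axis=0)                       # per-unit variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # roughly zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
acts = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # made-up batch: 32 rows, 4 units

out = batch_norm_train(acts, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3))   # close to 0 for each unit
print(out.std(axis=0).round(3))    # close to 1 for each unit
```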
Worked Example · Regularise the Overfitter
12 min

    # regularise.py: same overfit-prone setup, now with dropout + L2
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers
    import matplotlib.pyplot as plt

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)
    # train_size=80 keeps only 80 training rows, which makes overfitting easy
    Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=80, random_state=0)

    def model(reg=False):
        if reg:
            # regularised: L2 on both hidden layers, dropout after each
            return keras.Sequential([
                layers.Input(shape=(30,)),
                layers.Dense(128, activation="relu",
                             kernel_regularizer=regularizers.l2(0.001)),
                layers.Dropout(0.4),
                layers.Dense(64, activation="relu",
                             kernel_regularizer=regularizers.l2(0.001)),
                layers.Dropout(0.4),
                layers.Dense(1, activation="sigmoid"),
            ])
        # baseline: same architecture, no regularisation
        return keras.Sequential([
            layers.Input(shape=(30,)),
            layers.Dense(128, activation="relu"),
            layers.Dense(64, activation="relu"),
            layers.Dense(1, activation="sigmoid"),
        ])

    # one panel per model: left = no regularisation, right = regularised
    fig, ax = plt.subplots(1, 2, figsize=(11, 4))
    for i, reg in enumerate([False, True]):
        m = model(reg)
        m.compile("adam", "binary_crossentropy", metrics=["accuracy"])
        h = m.fit(Xtr, ytr, validation_data=(Xte, yte),
                  epochs=200, batch_size=8, verbose=0)
        ax[i].plot(h.history["accuracy"], label="train")
        ax[i].plot(h.history["val_accuracy"], label="val")
        ax[i].set_title("with reg" if reg else "no reg")
        ax[i].legend()
    fig.savefig("reg_compare.png", dpi=150)

What you'll see
no reg:   train shoots to 1.0, val lags with a wide gap
with reg: train rises more slowly, val tracks it closely, so the gap is smaller
Read the diff
Regularisation deliberately makes the model worse on training data — and that's the point. By preventing memorisation, the train and val curves stay close, so the model that ships generalises better. The training accuracy is not the goal; the gap is what you manage.
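Since the gap is the quantity being managed, it helps to turn it into a number. The helper below is a small sketch that averages the last few epochs of a Keras-style history dictionary (the `accuracy` and `val_accuracy` keys used in regularise.py). The function name and the two example dictionaries are hypothetical; the numbers are made up to exercise the function and are not measured results.

```python
def train_val_gap(history, last_n=10):
    """Mean train accuracy minus mean val accuracy over the last `last_n` epochs."""
    train = history["accuracy"][-last_n:]
    val = history["val_accuracy"][-last_n:]
    return sum(train) / len(train) - sum(val) / len(val)

# Made-up curves purely to exercise the function; not measured results.
fake_no_reg = {"accuracy": [0.90, 0.98, 1.00, 1.00],
               "val_accuracy": [0.88, 0.91, 0.92, 0.92]}
fake_with_reg = {"accuracy": [0.85, 0.92, 0.95, 0.96],
                 "val_accuracy": [0.86, 0.92, 0.94, 0.95]}

print(round(train_val_gap(fake_no_reg, last_n=2), 3))
print(round(train_val_gap(fake_with_reg, last_n=2), 3))
```

In the worked example the same function could be called as `train_val_gap(h.history)` inside the loop to report one gap per model.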
Try It Yourself
13 min
Take any overfitting net and insert two Dropout(0.3) layers. Compare the train-val gap before and after.
Try dropout rates 0.0, 0.2, 0.5, 0.8. Plot final val accuracy vs rate. Too much dropout hurts — find the sweet spot.
Sweep L2 λ in [0, 1e-4, 1e-3, 1e-2, 1e-1]. Plot train and val accuracy vs λ. What happens at very high λ?
Answer
At very high λ the model is forced toward tiny weights — it underfits, and both train and val accuracy fall. Regularisation has a sweet spot, just like model capacity.
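The same underfitting effect can be seen without training a network. The sketch below swaps in a deliberately simpler model, closed-form ridge regression on synthetic data, so it runs instantly; it illustrates the trend and is not a substitute for the Keras sweep in the exercise. All data and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))                       # synthetic inputs
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])      # synthetic ground-truth weights
y = X @ true_w + rng.normal(scale=0.1, size=80)    # targets with a little noise

def ridge_fit(X, y, lam):
    """Closed-form minimiser of ||Xw - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lams = [0.0, 1e-2, 1.0, 1e2, 1e4]
norms, errors = [], []
for lam in lams:
    w = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(w))                # size of the weight vector
    errors.append(np.mean((X @ w - y) ** 2))       # training error
    print(f"lam={lam:<8g} |w|={norms[-1]:.3f}  train MSE={errors[-1]:.3f}")
```

As λ grows, the weight norm falls and the training error rises. At the largest value the weights are so small that the model can no longer fit even the training data.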
Mini-Challenge · The Best-Generalising Net
8 min
Combine everything: a reasonably-sized net + dropout + L2 + early stopping. Aim for the smallest train-val gap while keeping val accuracy high. Report your final architecture and the gap you achieved.
Recap
3 min
Dropout randomly silences neurons during training, forcing redundancy. L2 penalises large weights for smoother functions. Both reduce overfitting by adding a constraint. They have sweet spots: too much causes underfitting. Combine with early stopping for robust nets. That completes the neural-net toolkit; next we apply it to images.
Vocabulary Card
- regularisation: Techniques that discourage model complexity to improve generalisation.
- dropout: Randomly zeroing a fraction of neurons during training.
- L2 / weight decay: Adding a penalty on large weights to the loss.
- BatchNormalization: Normalising activations between layers; speeds training, mild regularisation.
Homework
4 min
On an overfit-prone dataset, produce a 2-panel figure: no-regularisation vs full-regularisation (dropout + L2 + early stopping) train/val curves. Report both test accuracies and the gap reduction. One paragraph on the trade-off you observed.
Reuse regularise.py and add EarlyStopping. The trade-off paragraph should note that regularisation lowered training accuracy but improved (or stabilised) the test score.
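In Keras, early stopping is available as `keras.callbacks.EarlyStopping` (with arguments such as `monitor`, `patience`, and `restore_best_weights`), passed to `fit` through `callbacks=[...]`. To show the idea behind `patience` without training anything, here is a plain-Python sketch of a simple patience rule. It is a simplified illustration, not the library's exact logic, and the function name and loss values are made up.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) under a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0                                  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch      # stop: no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch    # ran out of epochs without stopping

# Made-up validation losses: improve, then drift upward.
losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.7]
stop, best = early_stop_epoch(losses, patience=3)
print("stopped at epoch", stop, "best epoch was", best)
```

Training halts three epochs after the best validation loss. Restoring the weights from the best epoch is what keeps the final model from drifting back into overfitting.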