PY-L5-24 · Training a Neural Net — Loss & Optimisers

Learning Goals

3 min

Picture training as rolling downhill on a loss landscape.
Explain gradient, learning rate, and a training step.
Know what Adam / SGD do and why Adam is the default.
Diagnose a learning rate that's too high or too low from the loss curve.

Warm-Up · Rolling Downhill

5 min

loss
 │   *                         ← bad weights, high error
 │    \
 │     \__                     gradient points UP the slope;
 │        \___                 we step the OPPOSITE way (downhill)
 │            \_____ * ←       good weights, low error
 └────────────────────── weight value

The loss is "how wrong are we". The gradient tells us which direction makes the loss increase, so training repeatedly nudges the weights the opposite way (downhill) to lower the loss. That's gradient descent.

Today's big idea

Training = minimise the loss by gradient descent. The gradient gives the direction; the learning rate sets the step size; the optimiser (Adam, SGD) is the strategy for taking smart steps. Keras computes the gradients for you (backpropagation).

New Concept · Gradient Descent

14 min

One step, in words

1. forward pass:  predict, compute loss
2. backward pass: compute gradient of loss wrt every weight (backprop)
3. update:        weight = weight − learning_rate × gradient
4. repeat for every batch, every epoch
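
The four steps above can be sketched in plain NumPy for a model with just two weights. This is an illustration under stated assumptions, not how Keras is implemented: the toy data (y = 2x + 1 plus noise), the names w and b, and the settings learning_rate = 0.1 and batch_size = 8 are all made up for the example, and the gradients are written by hand because the model is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up for this sketch): y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(64, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.05, size=64)

w, b = 0.0, 0.0            # the two "weights" being trained
learning_rate = 0.1        # assumed value, chosen for the toy problem
batch_size = 8

for epoch in range(50):                          # 4. repeat ... every epoch
    order = rng.permutation(len(X))              # shuffle so batches differ
    for start in range(0, len(X), batch_size):   # 4. ... for every batch
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]

        # 1. forward pass: predict, compute loss (mean squared error)
        pred = w * xb + b
        loss = np.mean((pred - yb) ** 2)

        # 2. backward pass: gradient of the loss wrt each weight
        grad_w = np.mean(2 * (pred - yb) * xb)
        grad_b = np.mean(2 * (pred - yb))

        # 3. update: weight = weight - learning_rate * gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))   # should land close to 2 and 1
```

In a real network step 2 is the part backprop automates; steps 1, 3 and 4 look much the same.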

Learning rate — the most important dial

too high   → overshoots, loss bounces or explodes (NaN)
too low    → crawls, takes forever, may get stuck
just right → loss drops smoothly to a low plateau

from tensorflow.keras.optimizers import Adam
model.compile(optimizer=Adam(learning_rate=0.001),  # 0.001 is Keras's default for Adam
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

SGD vs Adam

SGD   plain steps downhill; needs careful tuning + momentum
Adam  adapts the step size per-weight automatically;
      the sensible default: start here, it works well on most problems with little tuning
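
To make "adapts the step size per-weight" concrete, here is a simplified sketch of a single update under each rule. It follows the published Adam update rule and is not Keras's own code; the function names sgd_step and adam_step, the state dictionary, and the two example gradients are assumptions for illustration. The default values mirror the ones Keras documents for Adam.

```python
import numpy as np

def sgd_step(w, g, lr):
    """Plain SGD: the same learning rate multiplies every gradient."""
    return w - lr * g

def adam_step(w, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; `state` holds the running averages and step count."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g        # average gradient
    state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2   # average squared gradient
    m_hat = state["m"] / (1 - beta1 ** state["t"])           # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Two weights whose gradients differ by a factor of 1000 (made-up numbers)
w = np.array([1.0, 1.0])
g = np.array([100.0, 0.1])

state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}
sgd_move = w - sgd_step(w, g, lr=0.001)
adam_move = w - adam_step(w, g, state, lr=0.001)

print("SGD  step per weight:", sgd_move)    # scales with each gradient
print("Adam step per weight:", adam_move)   # roughly lr for both weights
```

With SGD the first weight moves 1000 times further than the second. Adam divides each gradient by its own running scale, so on this first step both weights move by about the learning rate, which is why it needs less hand-tuning.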

Gradient descent by hand (1 variable)

# minimise f(w) = (w - 3)**2 ; the minimum is at w = 3
def f(w):  return (w - 3) ** 2
def grad(w): return 2 * (w - 3)

w = 0.0
lr = 0.1
for step in range(20):
    w = w - lr * grad(w)
print(round(w, 3))     # → 2.965  (closing in on the minimum at 3)

That loop IS training, in miniature — just one weight and a toy loss. Keras does this for millions of weights using calculus you don't have to write.
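
If you want to convince yourself that a hand-written gradient is right, one option is to compare it with a numerical slope estimate. The sketch below does that for the toy loss; numeric_grad and the step h = 1e-5 are assumptions for illustration. It is only a sanity check: Keras gets its gradients from backpropagation, not from finite differences.

```python
def f(w):
    return (w - 3) ** 2

def grad(w):
    # The hand-derived gradient of the toy loss
    return 2 * (w - 3)

def numeric_grad(func, w, h=1e-5):
    """Central-difference estimate of the slope of func at w."""
    return (func(w + h) - func(w - h)) / (2 * h)

# Left of the minimum the slope is negative, right of it positive,
# so "step against the gradient" always moves w towards 3.
for w in [0.0, 3.0, 5.0]:
    print(w, grad(w), round(numeric_grad(f, w), 4))
```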

Worked Example · Watch the Learning Rate

12 min

# lr_demo.py — same net, three learning rates
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

def make():
    m = keras.Sequential([
        layers.Input(shape=(4,)),
        layers.Dense(16, activation="relu"),
        layers.Dense(3, activation="softmax"),
    ])
    return m

plt.figure(figsize=(8, 5))
for lr in [0.0001, 0.01, 1.0]:
    m = make()
    m.compile(optimizer=Adam(learning_rate=lr),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    h = m.fit(Xtr, ytr, epochs=60, batch_size=8, verbose=0)
    plt.plot(h.history["loss"], label=f"lr={lr}")

plt.xlabel("epoch"); plt.ylabel("training loss"); plt.legend()
plt.title("Effect of learning rate")
plt.savefig("lr_demo.png", dpi=150)

What you'll see

lr=0.0001  loss drops painfully slowly (steps too small)
lr=0.01    loss drops fast and smoothly (just right)
lr=1.0     loss is erratic / stuck high (steps too large, overshooting)

Read the curve

The loss curve is your training dashboard. Smooth steady decline = healthy. Crawling = LR too low. Spiky or rising = LR too high. When a net "won't learn", the learning rate is the first thing to check — and Adam at ~0.001 is the safe starting point.
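
The same three regimes can be reproduced in a fraction of a second on the one-weight toy loss (w - 3)**2, with no Keras needed. This is a sketch: the helper run and the three learning rates are chosen for illustration, and a real network's thresholds will be different.

```python
def grad(w):
    return 2 * (w - 3)            # gradient of the toy loss (w - 3)**2

def run(lr, steps=20, w=0.0):
    """Return the loss after each gradient-descent step."""
    losses = []
    for _ in range(steps):
        w = w - lr * grad(w)
        losses.append((w - 3) ** 2)
    return losses

for lr in [0.001, 0.1, 1.1]:
    losses = run(lr)
    print(f"lr={lr}: first loss {losses[0]:.3g}, last loss {losses[-1]:.3g}")
```

Read the printed numbers the way you would read a curve: barely moving means the rate is too low, shrinking towards zero is healthy, and growing means the rate is too high.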

Try It Yourself

13 min

01 🟢 Hand-rolled descent

Minimise f(w) = (w-5)**2 + 2 with gradient descent. Print w after each step and watch it approach 5.

02 🟡 Break it

Set the learning rate to 1.5 in the toy example. What happens? Explain why it diverges.

Answer

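With lr=1.5 every update multiplies the distance from the minimum by (1 − 2 × 1.5) = −2: w jumps to the other side of the minimum and lands twice as far away each time, so its magnitude grows without bound. For this toy function any learning rate above 1 diverges.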

03 🔴 SGD vs Adam

Train the iris net once with SGD() and once with Adam() at the same epochs. Plot both loss curves. Which converges faster?

Mini-Challenge · Loss-Curve Doctor

8 min

Write a function that takes a Keras history and prints a diagnosis: "still improving — train longer", "converged", "diverging — lower LR", or "overfitting — val loss rising". Use simple rules on the loss/val_loss arrays.

Show one possible solution

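import math

def diagnose(history):
    loss = history.history["loss"]
    val  = history.history.get("val_loss")
    # NaN never compares as "greater", so test for it explicitly
    if math.isnan(loss[-1]) or loss[-1] > loss[0]:
        return "diverging — lower the learning rate"
    # validation loss has climbed more than 10% above its best value
    if val and val[-1] > min(val) * 1.1:
        return "overfitting — val loss rising; add regularisation / early stop"
    # over 2% better than 5 epochs ago (or than the first epoch on short runs)
    if loss[-1] < loss[-min(5, len(loss))] * 0.98:
        return "still improving — train more epochs"
    return "converged — looks done"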

Non-negotiables: detect the four common patterns from the loss arrays. This codifies the "read the curve" skill.
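
One way to check a loss-curve doctor without training anything is to feed it hand-made curves. The sketch below assumes a hypothetical helper, fake_history, that imitates only the .history dictionary of a Keras History object. It also repeats the sample solution's rules with shortened labels so the block runs on its own, and the four curves are invented numbers.

```python
from types import SimpleNamespace

def diagnose(history):
    # Same rules as the sample solution, with shortened labels
    loss = history.history["loss"]
    val = history.history.get("val_loss")
    if loss[-1] > loss[0]:
        return "diverging"
    if val and val[-1] > min(val) * 1.1:
        return "overfitting"
    if loss[-1] < loss[-5] * 0.98:
        return "still improving"
    return "converged"

def fake_history(loss, val_loss=None):
    """Hypothetical stand-in for a Keras History: only .history is provided."""
    data = {"loss": loss}
    if val_loss is not None:
        data["val_loss"] = val_loss
    return SimpleNamespace(history=data)

# Invented curves, one per pattern
cases = {
    "diverging": fake_history([1.0, 1.5, 3.0, 8.0, 20.0]),
    "overfitting": fake_history([1.0, 0.6, 0.4, 0.3, 0.25, 0.2],
                                val_loss=[1.0, 0.7, 0.6, 0.65, 0.75, 0.9]),
    "still improving": fake_history([1.0, 0.8, 0.65, 0.5, 0.4, 0.3]),
    "converged": fake_history([1.0, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1]),
}

for expected, history in cases.items():
    print(f"{expected:>16} -> {diagnose(history)}")
```

If a label on the right does not match the name on the left, the rule thresholds need adjusting.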

Recap

3 min

Training minimises loss by gradient descent: forward (loss), backward (gradients via backprop), update (step downhill). Learning rate is the step size — too high diverges, too low crawls. Adam adapts per-weight and is the default. The loss curve is your dashboard; read it to diagnose problems. Next: the universal enemy — overfitting.

Vocabulary Card

gradient descent: Iteratively stepping weights downhill to reduce loss.
backpropagation: The algorithm that computes gradients for every weight efficiently.
learning rate: Step size per update — the key training hyperparameter.
optimiser: The strategy for taking steps; Adam is the usual default.

Homework

4 min

Run the learning-rate sweep on any dataset. Save the three loss curves on one chart, label them, and write a sentence for each: too low / just right / too high. Apply your loss-curve doctor to the "just right" run.
