Learning Goals
3 min
- Picture training as rolling downhill on a loss landscape.
- Explain gradient, learning rate, and a training step.
- Know what Adam / SGD do and why Adam is the default.
- Diagnose a learning rate that's too high or too low from the loss curve.
Warm-Up · Rolling Downhill
5 min
    loss
     │ *  ← bad weights, high error
     │  \
     │   \__            gradient points UP the slope;
     │      \___        we step the OPPOSITE way (downhill)
     │          \_____ *  ← good weights, low error
     └────────────────────── weight value
The loss is "how wrong are we". The gradient tells which way the loss increases. Training repeatedly steps the weights a little bit downhill, lowering the loss. That's gradient descent.
Training = minimise the loss by gradient descent. The gradient gives the direction; the learning rate sets the step size; the optimiser (Adam, SGD) is the strategy for taking smart steps. Keras computes the gradients for you (backpropagation).
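To make "the gradient gives the direction" concrete, here is a minimal sketch on a made-up one-weight model (a sigmoid followed by a squared error, with x = 2 and y = 1 chosen only for illustration). It computes the gradient with the chain rule, which is the idea backpropagation applies layer by layer, checks it against a finite-difference slope, and takes one step the opposite way. The function names and the step size of 0.5 are assumptions for this example, not Keras API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x=2.0, y=1.0):
    # forward pass: prediction, then squared error
    return (sigmoid(w * x) - y) ** 2

def grad_chain_rule(w, x=2.0, y=1.0):
    # backward pass: multiply the local derivatives from the loss back to w
    p = sigmoid(w * x)
    dloss_dp = 2.0 * (p - y)
    dp_dz = p * (1.0 - p)
    dz_dw = x
    return dloss_dp * dp_dz * dz_dw

def grad_numeric(w, eps=1e-6):
    # slope estimated by nudging w slightly in both directions
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)

w = -1.0
g = grad_chain_rule(w)
w_new = w - 0.5 * g          # step OPPOSITE to the gradient
print("chain rule:", g, " finite difference:", grad_numeric(w))
print("loss before:", loss(w), " loss after:", loss(w_new))
```

The two gradient values should agree to several decimal places, and the loss after the step should be lower than before it.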
New Concept · Gradient Descent
14 min
One step, in words
1. forward pass: predict, compute loss
2. backward pass: compute gradient of loss wrt every weight (backprop)
3. update: weight = weight − learning_rate × gradient
4. repeat for every batch, every epoch
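The four steps above can be sketched in plain Python for a two-parameter model, y = w*x + b, fitted with mean squared error. The data, learning rate, batch size and epoch count are illustrative assumptions; Keras runs the same loop for you, with backprop supplying the gradients.

```python
# Toy data generated from y = 2x + 1 (made-up values for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0
learning_rate = 0.02
batch_size = 2

def mse(w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

start_loss = mse(w, b)
for epoch in range(500):                      # 4. repeat for every epoch...
    for i in range(0, len(xs), batch_size):   # ...and every batch
        xb, yb = xs[i:i + batch_size], ys[i:i + batch_size]
        # 1. forward pass: predict and measure the error
        errors = [w * x + b - y for x, y in zip(xb, yb)]
        # 2. backward pass: gradient of the batch MSE wrt w and b
        grad_w = sum(2 * e * x for e, x in zip(errors, xb)) / len(xb)
        grad_b = sum(2 * e for e in errors) / len(xb)
        # 3. update: weight = weight - learning_rate * gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
end_loss = mse(w, b)
print("w =", round(w, 2), " b =", round(b, 2))
print("loss:", start_loss, "->", end_loss)
```

With these settings the parameters should end up close to w = 2 and b = 1, the values used to generate the data.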
Learning rate — the most important dial
too high → overshoots, loss bounces or explodes (NaN)
too low → crawls, takes forever, may get stuck
just right → loss drops smoothly to a low plateau
    from tensorflow.keras.optimizers import Adam

    # model is any Keras model you have already built
    model.compile(optimizer=Adam(learning_rate=0.001),   # 0.001 is Keras's default for Adam
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
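The three regimes listed above can be reproduced without a neural network. This is a small sketch on an assumed toy loss, loss(w) = w**2, starting from w = 5; the three learning rates are picked only to land in each regime for this particular function.

```python
def final_loss(lr, steps=30, w=5.0):
    # Minimise loss(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w = w - lr * 2 * w
    return w ** 2

for lr in (0.001, 0.3, 1.1):
    print(f"lr={lr}: loss after 30 steps = {final_loss(lr):.3g}")
```

The starting loss is 25. With lr=0.001 it has barely moved after 30 steps, with lr=0.3 it is essentially zero, and with lr=1.1 it is far above where it started.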
SGD vs Adam
SGD: plain steps downhill; needs careful tuning + momentum
Adam: adapts the step size per-weight automatically; the sensible default. Start here: it works well on most problems.
Gradient descent by hand (1 variable)
    # minimise f(w) = (w - 3)**2 ; the minimum is at w = 3
    def f(w):
        return (w - 3) ** 2

    def grad(w):
        return 2 * (w - 3)

    w = 0.0
    lr = 0.1
    for step in range(20):
        w = w - lr * grad(w)   # step opposite to the gradient
    print(round(w, 3))   # → 2.965 (almost rolled to the minimum at 3)
That loop IS training, in miniature — just one weight and a toy loss. Keras does this for millions of weights using calculus you don't have to write.
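The same by-hand style can illustrate what "adapts the step size per-weight" means in the SGD vs Adam comparison. The sketch below uses a hypothetical two-weight loss whose slopes differ by a factor of 1000 and writes out the standard Adam update with its usual defaults (beta1 = 0.9, beta2 = 0.999). The loss, the learning rates and the step count are assumptions chosen for illustration; in practice you would use Keras's built-in optimisers rather than this loop.

```python
import math

# Hypothetical loss with very different slopes per weight:
# loss(a, b) = 50*(a - 1)**2 + 0.05*(b - 1)**2, minimum at a = b = 1
def grads(a, b):
    return 100.0 * (a - 1.0), 0.1 * (b - 1.0)

def run_sgd(lr=0.01, steps=300):
    # lr is capped by the steep weight a: above 0.02 it would diverge
    a, b = 0.0, 0.0
    for _ in range(steps):
        ga, gb = grads(a, b)
        a, b = a - lr * ga, b - lr * gb
    return a, b

def run_adam(lr=0.05, steps=300, beta1=0.9, beta2=0.999, eps=1e-8):
    w = [0.0, 0.0]
    m = [0.0, 0.0]   # running mean of gradients, one per weight
    v = [0.0, 0.0]   # running mean of squared gradients, one per weight
    for t in range(1, steps + 1):
        g = grads(*w)
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction for early steps
            v_hat = v[i] / (1 - beta2 ** t)
            # dividing by sqrt(v_hat) rescales the step separately for each weight
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w[0], w[1]

sgd_a, sgd_b = run_sgd()
adam_a, adam_b = run_adam()
print("SGD :", round(sgd_a, 3), round(sgd_b, 3))
print("Adam:", round(adam_a, 3), round(adam_b, 3))
```

Plain SGD gets the steep weight a to 1 almost immediately, but the shallow weight b has covered only about a quarter of the distance after 300 steps. Adam's per-weight scaling should bring both weights close to 1.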
Worked Example · Watch the Learning Rate
12 min
    # lr_demo.py: same net, three learning rates
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.optimizers import Adam
    import matplotlib.pyplot as plt

    X, y = load_iris(return_X_y=True)
    X = StandardScaler().fit_transform(X)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

    def make():
        m = keras.Sequential([
            layers.Input(shape=(4,)),
            layers.Dense(16, activation="relu"),
            layers.Dense(3, activation="softmax"),
        ])
        return m

    plt.figure(figsize=(8, 5))
    for lr in [0.0001, 0.01, 1.0]:
        m = make()
        m.compile(optimizer=Adam(lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
        h = m.fit(Xtr, ytr, epochs=60, batch_size=8, verbose=0)
        plt.plot(h.history["loss"], label=f"lr={lr}")
    plt.xlabel("epoch"); plt.ylabel("training loss"); plt.legend()
    plt.title("Effect of learning rate")
    plt.savefig("lr_demo.png", dpi=150)
What you'll see
lr=0.0001 → loss drops painfully slowly (undershooting)
lr=0.01 → loss drops fast and smoothly (just right)
lr=1.0 → loss is erratic / stuck high (overshooting)
Read the curve
The loss curve is your training dashboard. Smooth steady decline = healthy. Crawling = LR too low. Spiky or rising = LR too high. When a net "won't learn", the learning rate is the first thing to check — and Adam at ~0.001 is the safe starting point.
Try It Yourself
13 min
Minimise f(w) = (w-5)**2 + 2 with gradient descent. Print w after each step and watch it approach 5.
Set the learning rate to 1.5 in the toy example. What happens? Explain why it diverges.
Answer
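With lr=1.5 each step overshoots the minimum and lands further away than it started, so w swings from side to side with growing size and heads towards ±infinity. For f(w) = (w - 3)**2 the update multiplies the distance from the minimum by (1 − 2 × lr) = −2, so that distance doubles on every step. The step size is bigger than the curvature of the loss can tolerate; on this function any learning rate above 1 diverges.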
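A short sketch of that answer, reusing the toy function f(w) = (w - 3)**2 and its gradient from the by-hand example. The choice of four steps is arbitrary, just enough to show the pattern.

```python
def grad(w):
    return 2 * (w - 3)      # gradient of f(w) = (w - 3)**2

w, lr = 0.0, 1.5
history = [w]
for step in range(4):
    w = w - lr * grad(w)
    history.append(w)
print(history)   # → [0.0, 9.0, -9.0, 27.0, -45.0]
```

The distance from the minimum at 3 goes 3, 6, 12, 24, 48: it doubles each step while w flips from one side of the minimum to the other.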
Train the iris net once with SGD() and once with Adam() at the same epochs. Plot both loss curves. Which converges faster?
Mini-Challenge · Loss-Curve Doctor
8 min
Write a function that takes a Keras history and returns a diagnosis: "still improving: train longer", "converged", "diverging: lower LR", or "overfitting: val loss rising". Use simple rules on the loss/val_loss arrays.
Show one possible solution
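    import math

    def diagnose(history):
        loss = history.history["loss"]
        val = history.history.get("val_loss")   # None if no validation data was used
        # Loss is NaN or ended higher than it started: the steps are too big.
        if math.isnan(loss[-1]) or loss[-1] > loss[0]:
            return "diverging: lower the learning rate"
        # Validation loss has climbed more than 10% above its best value.
        if val and val[-1] > min(val) * 1.1:
            return "overfitting: val loss rising; add regularisation / early stop"
        # Compare with the loss 5 epochs back (or the first epoch on short runs).
        earlier = loss[-5] if len(loss) >= 5 else loss[0]
        if loss[-1] < earlier * 0.98:
            return "still improving: train more epochs"
        return "converged: looks done"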
Non-negotiables: detect the four common patterns from the loss arrays. This codifies the "read the curve" skill.
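One way to check the four patterns without training anything is to feed the function hand-written loss curves. The sketch below uses `types.SimpleNamespace` as a stand-in for a Keras history (only its `.history` dict is imitated) and includes a condensed variant of the solution above so that it runs on its own. The curves are invented numbers, not results from a real run.

```python
import math
from types import SimpleNamespace

def diagnose(history):
    # condensed variant of the solution above
    loss = history.history["loss"]
    val = history.history.get("val_loss")
    if math.isnan(loss[-1]) or loss[-1] > loss[0]:
        return "diverging"
    if val and val[-1] > min(val) * 1.1:
        return "overfitting"
    earlier = loss[-5] if len(loss) >= 5 else loss[0]
    return "still improving" if loss[-1] < earlier * 0.98 else "converged"

def fake_history(loss, val_loss=None):
    # hypothetical helper: mimics the .history dict of a Keras History object
    data = {"loss": loss}
    if val_loss is not None:
        data["val_loss"] = val_loss
    return SimpleNamespace(history=data)

cases = {
    "improving": fake_history([1.0, 0.8, 0.6, 0.5, 0.4, 0.3]),
    "flat": fake_history([1.0, 0.3, 0.2, 0.2, 0.2, 0.2, 0.2]),
    "blown up": fake_history([1.0, 1.5, 3.0]),
    "val rising": fake_history([1.0, 0.5, 0.3, 0.2, 0.1, 0.05],
                               [1.0, 0.6, 0.5, 0.6, 0.7, 0.8]),
}
results = {name: diagnose(h) for name, h in cases.items()}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

Each fake curve is built to trigger exactly one rule, so a wrong verdict points at the rule that needs adjusting.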
Recap
3 min
Training minimises loss by gradient descent: forward (loss), backward (gradients via backprop), update (step downhill). Learning rate is the step size: too high diverges, too low crawls. Adam adapts per-weight and is the default. The loss curve is your dashboard; read it to diagnose problems. Next: the universal enemy, overfitting.
Vocabulary Card
- gradient descent: Iteratively stepping weights downhill to reduce loss.
- backpropagation: The algorithm that computes gradients for every weight efficiently.
- learning rate: Step size per update; the key training hyperparameter.
- optimiser: The strategy for taking steps; Adam is the usual default.
Homework
4 min
Run the learning-rate sweep on any dataset. Save the three loss curves on one chart, label them, and write a sentence for each: too low / just right / too high. Apply your loss-curve doctor to the "just right" run.
Combine lr_demo.py with the diagnose function. The labelled chart is the deliverable.