Learning Goals
3 min
- Plot and explain ReLU, sigmoid, tanh, softmax.
- Know the default recipe: ReLU hidden, sigmoid/softmax output.
- Understand the "dying ReLU" and "vanishing gradient" problems in one line each.
- Match the output activation to the task.
Warm-Up · Why Squash at All?
5 min
Stack two linear layers with no activation and the maths collapses: W2(W1·x) = (W2·W1)·x, which is still just one linear function. The activation in between is the only thing that lets a deep net express more than a single linear map.
Activations add non-linearity. ReLU is the default for hidden layers (fast, simple). The OUTPUT activation is dictated by the task: sigmoid for one yes/no, softmax for "pick one of N classes", none for regression.
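The collapse of stacked linear layers can be checked by hand on a tiny example. This is a minimal sketch: the matrices W1 and W2 and the input x are made-up values chosen so the arithmetic is easy to follow, and bias terms are left out.

```python
import numpy as np

# Illustrative values only (not from the lesson), picked for easy arithmetic.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

# Two linear layers with no activation in between...
stacked = W2 @ (W1 @ x)
# ...equal one linear layer whose weight matrix is W2 @ W1.
collapsed = (W2 @ W1) @ x
print(np.allclose(stacked, collapsed))    # True

# With a ReLU between the layers, the single-matrix shortcut no longer matches.
with_relu = W2 @ np.maximum(0, W1 @ x)
print(np.allclose(with_relu, collapsed))  # False
```

Here W1 @ x is [-1, 1]. Without an activation the two entries cancel in the second layer; with ReLU the -1 is clipped to 0 first, so the output changes from 0 to 1.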
New Concept · The Squashers
14 min
ReLU — the hidden-layer default
    import numpy as np

    def relu(z):
        return np.maximum(0, z)  # negatives → 0, positives unchanged

relu(-3) = 0    relu(0) = 0    relu(5) = 5
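Cheap to compute, and its gradient is exactly 1 for positive inputs, so it does not vanish there and training is fast. The standard choice for hidden layers. (Downside: a neuron whose input is negative for every example outputs 0 and receives zero gradient, so it stops learning. This is "dying ReLU"; variants like LeakyReLU mitigate it.)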
Sigmoid — binary output
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))  # any number → (0, 1) probability
Use on the OUTPUT for a single yes/no. Avoid in deep hidden layers — it saturates (flat at the ends), causing vanishing gradients that stall learning.
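To see what "saturates" means in numbers, the sketch below evaluates the sigmoid's derivative, sigmoid(z) · (1 − sigmoid(z)), at a few inputs. The helper name sigmoid_grad and the sample z values are illustrative choices, not part of the lesson's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid, written in terms of its own output.
    s = sigmoid(z)
    return s * (1 - s)

# The slope peaks at 0.25 (at z = 0) and shrinks quickly as |z| grows.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigmoid = {sigmoid(z):.5f}   gradient = {sigmoid_grad(z):.5f}")
```

Even at its best the slope is only 0.25, and by z = 5 it is already below 0.01. That flat region is where learning stalls.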
tanh — like sigmoid but centred at 0
    def tanh(z):
        return np.tanh(z)  # range (-1, 1)
Sometimes used in hidden layers / RNNs. Still saturates, so ReLU usually wins.
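"Like sigmoid but centred at 0" can be made exact: tanh(z) equals 2·sigmoid(2z) − 1, a sigmoid stretched from (0, 1) to (−1, 1). A short numerical check of that identity (the test grid z is an arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-6, 6, 13)             # arbitrary test points
rescaled = 2 * sigmoid(2 * z) - 1      # stretch (0, 1) to (-1, 1) and recentre

print(np.allclose(rescaled, np.tanh(z)))   # True
print(sigmoid(0.0), np.tanh(0.0))          # 0.5 0.0
```

Because tanh is a rescaled sigmoid, it inherits the same flat tails, which is why it still saturates.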
Softmax — multi-class output
    def softmax(z):
        e = np.exp(z - z.max())  # subtract max for stability
        return e / e.sum()       # turns N scores into N probabilities that sum to 1
softmax([2.0, 1.0, 0.1]) → [0.66, 0.24, 0.10] (sums to 1)
Use on the OUTPUT layer for "pick one of N classes". The biggest score becomes the most probable class.
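The "subtract max for stability" step is easiest to appreciate by comparing it with a version that skips it. In this sketch, softmax_naive and the shifted score vector big are hypothetical names and values introduced only for the comparison.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                # overflows to inf once scores get large
    return e / e.sum()

def softmax(z):
    e = np.exp(z - z.max())      # largest exponent becomes exp(0) = 1
    return e / e.sum()

small = np.array([2.0, 1.0, 0.1])
big = small + 1000.0             # same gaps between scores, just shifted up

# Shifting every score by the same amount does not change the probabilities.
print(np.allclose(softmax(big), softmax(small)))   # True
print(softmax(small).argmax())                     # 0, the biggest score wins

# The naive version breaks on the shifted scores: inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    print(np.isnan(softmax_naive(big)).all())      # True
```

Subtracting the maximum is safe precisely because softmax only depends on the differences between scores.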
The cheat sheet
Layer             Activation
hidden            ReLU (default)
output, binary    sigmoid (1 node)
output, multi     softmax (N nodes)
output, number    none / linear (regression)
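One way to read the cheat sheet is as a tiny forward pass: a ReLU hidden layer followed by whichever output activation the task calls for. This is a rough sketch only. The forward function, the OUTPUT_ACTIVATIONS lookup, the layer sizes and the random weights are all assumptions made for illustration, and biases are omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Output activation per task, mirroring the cheat sheet.
OUTPUT_ACTIVATIONS = {
    "binary": sigmoid,
    "multiclass": softmax,
    "regression": lambda z: z,   # "none" means the raw score passes through
}

def forward(x, W_hidden, W_out, task):
    """One ReLU hidden layer, then the task's output activation (no biases)."""
    h = relu(W_hidden @ x)
    return OUTPUT_ACTIVATIONS[task](W_out @ h)

rng = np.random.default_rng(0)       # made-up weights, just to exercise the shapes
x = rng.normal(size=4)
W_hidden = rng.normal(size=(5, 4))

print(forward(x, W_hidden, rng.normal(size=(1, 5)), "binary"))      # 1 node, value between 0 and 1
print(forward(x, W_hidden, rng.normal(size=(3, 5)), "multiclass"))  # 3 nodes, values sum to 1
print(forward(x, W_hidden, rng.normal(size=(1, 5)), "regression"))  # 1 node, any real number
```

Only the last line of forward changes between tasks; the hidden layer is identical in all three calls.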
Worked Example · Plot Them All
12 min
    # activations.py: visualise the four squashers
    import numpy as np
    import matplotlib.pyplot as plt

    z = np.linspace(-6, 6, 200)
    relu = np.maximum(0, z)
    sigmoid = 1 / (1 + np.exp(-z))
    tanh = np.tanh(z)

    fig, ax = plt.subplots(1, 3, figsize=(12, 4))
    ax[0].plot(z, relu); ax[0].set_title("ReLU")
    ax[1].plot(z, sigmoid); ax[1].set_title("Sigmoid")
    ax[2].plot(z, tanh); ax[2].set_title("tanh")
    for a in ax:
        a.axhline(0, color="gray", lw=0.5)
        a.axvline(0, color="gray", lw=0.5)
    fig.tight_layout()
    fig.savefig("activations.png", dpi=150)

    # softmax demo (it needs a vector, not a curve)
    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    print("softmax([2,1,0.1]) =", softmax(np.array([2, 1, 0.1])).round(2))
What the plots show
ReLU       hard kink at 0, linear after; fast, no upper saturation
Sigmoid    S-curve flattening to 0 and 1; saturates at both ends
tanh       S-curve flattening to -1 and 1; centred at zero

softmax([2,1,0.1]) = [0.66 0.24 0.1]
Read the diff
See how sigmoid and tanh go flat at the extremes. Flat means "tiny gradient", and because backpropagation multiplies these gradients layer by layer, deep networks of sigmoids learn slowly (the vanishing-gradient problem). ReLU stays linear for positives, so its gradient there is exactly 1 and never vanishes. That property is a large part of why ReLU made deep networks practical to train.
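The effect of depth can be sketched with a deliberately simplified calculation: take the slope of each activation at one input and multiply it by itself once per layer. This ignores the weight matrices that also enter the real backpropagated gradient, and the values z = 2.0 and depth = 10 are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def relu_grad(z):
    return float(z > 0)      # slope is 1 for positive inputs, 0 otherwise

z = 2.0       # illustrative positive pre-activation, the same at every layer
depth = 10    # number of layers the gradient passes back through

for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("ReLU", relu_grad)]:
    per_layer = grad(z)
    print(f"{name:8s} per layer = {per_layer:.4f}   after {depth} layers = {per_layer ** depth:.2e}")
```

The sigmoid and tanh products shrink by many orders of magnitude over ten layers, while the ReLU product stays at 1 for this positive input.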
Try It Yourself
13 min
Write softmax and confirm its output always sums to 1, for several input vectors.
For each task, name the OUTPUT activation: (a) spam/not-spam, (b) digit 0-9, (c) predict temperature, (d) cat/dog/bird.
Answer
a) sigmoid (1 node)
b) softmax (10 nodes)
c) none / linear
d) softmax (3 nodes)
Show a neuron whose weights push every input negative — its ReLU output is always 0, and (you'll learn) its gradient is 0 too, so it never updates. Then swap to LeakyReLU and show it stays alive.
Hint
    def leaky_relu(z, a=0.01):
        return np.where(z > 0, z, a * z)

    print(leaky_relu(np.array([-5, -1, 0, 2])))  # tiny negative slope keeps it alive
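One possible way to set up the dead-neuron demonstration is sketched below. The inputs X, weights w and bias b are invented so that every pre-activation comes out negative. The "slope" arrays stand in for the local gradient that backpropagation would multiply by; this is not a full training loop.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)

# Made-up data: non-negative inputs, and a neuron with negative weights and bias.
X = np.array([[0.5, 1.0],
              [2.0, 0.25],
              [1.5, 3.0]])
w = np.array([-1.0, -2.0])
b = -0.5

z = X @ w + b                  # negative for every row: -3, -3, -8
print(relu(z))                 # all zeros: the neuron never fires
print(leaky_relu(z))           # small negative values instead of zeros

# Local slope of each activation at these z values.
relu_slope = (z > 0).astype(float)          # all 0: no gradient, weights never move
leaky_slope = np.where(z > 0, 1.0, 0.01)    # all 0.01: small but non-zero
print(relu_slope, leaky_slope)
```

With a zero slope on every example, no update can ever reach w or b through this neuron. The 0.01 slope of LeakyReLU lets some gradient through, so the weights can still move.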
Mini-Challenge · Activation Quiz Builder
8 min
Write a function recommend_output(task_type, n_classes) that returns the right output activation and number of output nodes for a given task. Cover binary, multiclass, and regression.
Show one possible solution
    def recommend_output(task_type, n_classes=None):
        if task_type == "regression":
            return ("linear (none)", 1)
        if task_type == "binary":
            return ("sigmoid", 1)
        if task_type == "multiclass":
            return ("softmax", n_classes)
        raise ValueError("unknown task")

    print(recommend_output("binary"))          # ('sigmoid', 1)
    print(recommend_output("multiclass", 10))  # ('softmax', 10)
    print(recommend_output("regression"))      # ('linear (none)', 1)
Non-negotiables: correct activation + node count per task. In this recipe the hidden layers default to ReLU; only the output activation changes with the task.
Recap
3 min
Activations add the non-linearity that makes depth worthwhile. Default recipe: ReLU in hidden layers; sigmoid (1 node) for binary output; softmax (N nodes) for multiclass; linear/none for regression. Sigmoid/tanh saturate (vanishing gradients) so avoid them in deep hidden layers. Next: actually building this in Keras.
Vocabulary Card
- ReLU: max(0, z). The default hidden-layer activation.
- softmax: Turns N scores into N probabilities summing to 1; used for multiclass output.
- vanishing gradient: When saturated activations make gradients tiny, stalling deep training.
- dying ReLU: A neuron stuck outputting 0 forever; LeakyReLU mitigates it.
Homework
4 min
Make a one-page cheat sheet (image or markdown) of the four activations: formula, plot/shape, range, where to use it, one gotcha. You'll reference this for the rest of Level 5.
Use activations.py to generate the plots and annotate the cheat sheet with the cheat-sheet table from this lesson.