PY-L5-22 · Activation Functions — ReLU, Sigmoid, Softmax

Learning Goals

3 min

Plot and explain ReLU, sigmoid, tanh, softmax.
Know the default recipe: ReLU hidden, sigmoid/softmax output.
Understand the "dying ReLU" and "vanishing gradient" problems in one line each.
Match the output activation to the task.

Warm-Up · Why Squash at All?

5 min

Stack two linear layers with no activation and the maths collapses: W2(W1·x) = (W2·W1)·x, which is still just one linear function. The activation in between is the only thing that lets a deep net be more expressive than a single linear layer.
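The collapse can be checked numerically. This is a minimal sketch: the layer shapes, the random seed, and the names W1, W2 and x are illustrative assumptions, not part of the lesson's own code.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable (arbitrary choice)
W1 = rng.normal(size=(4, 3))     # first "layer": 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))     # second "layer": 4 units -> 2 outputs
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)       # two linear layers, no activation in between
one_layer = (W2 @ W1) @ x        # a single merged linear layer
collapsed = np.allclose(two_layers, one_layer)
print(collapsed)                 # True: the two computations agree

# With a ReLU in between, the merged matrix no longer reproduces the output in general.
with_relu = W2 @ np.maximum(0, W1 @ x)
print(two_layers, with_relu)
```

Without the activation the two versions agree up to floating-point rounding, so the extra layer adds no expressive power.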

Today's big idea

Activations add non-linearity. ReLU is the default for hidden layers (fast, simple). The OUTPUT activation is dictated by the task: sigmoid for one yes/no, softmax for "pick one of N classes", none for regression.

New Concept · The Squashers

14 min

ReLU — the hidden-layer default

def relu(z): return np.maximum(0, z)
# negatives → 0, positives unchanged

relu(-3) = 0     relu(0) = 0     relu(5) = 5

Cheap to compute and fast to train. Its gradient is exactly 1 for positive inputs, so it does not vanish there. The standard choice for hidden layers. (Downside: a neuron whose input is always negative outputs 0 forever, the "dying ReLU" problem; variants like LeakyReLU mitigate it.)
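The slope of ReLU can be sketched in the same style. The helper name relu_grad is hypothetical, and treating the slope at exactly 0 as 0 is a convention assumed here, not something the lesson specifies.

```python
import numpy as np

def relu_grad(z):
    """Slope of ReLU: 1 where z > 0, otherwise 0 (slope at exactly 0 taken as 0 by assumption)."""
    z = np.asarray(z, dtype=float)
    return (z > 0).astype(float)

grads = relu_grad([-3.0, 0.0, 5.0])
print(grads)   # [0. 0. 1.]
```

A slope of 0 for every input is what keeps a "dead" neuron from updating, while a slope of 1 on the positive side passes the gradient through unchanged.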

Sigmoid — binary output

def sigmoid(z): return 1 / (1 + np.exp(-z))
# any number → (0, 1) probability

Use on the OUTPUT for a single yes/no. Avoid in deep hidden layers — it saturates (flat at the ends), causing vanishing gradients that stall learning.
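To make "saturates" concrete, here is a small sketch of the sigmoid's slope, using the standard identity sigmoid'(z) = sigmoid(z) · (1 − sigmoid(z)). The helper name sigmoid_grad and the sample inputs are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1 - s)

# The slope shrinks quickly as z moves away from 0.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))
```

The slope is never larger than 0.25 and is close to zero once the input is a few units away from 0, which is the flat region the text describes.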

tanh — like sigmoid but centred at 0

def tanh(z): return np.tanh(z)   # range (-1, 1)

Sometimes used in hidden layers / RNNs. Still saturates, so ReLU usually wins.
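The "like sigmoid" relationship is exact: tanh(z) = 2 · sigmoid(2z) − 1. The sketch below checks that identity numerically; the variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-6, 6, 200)
rescaled_sigmoid = 2 * sigmoid(2 * z) - 1      # stretch sigmoid to (-1, 1) and centre it at 0
same_curve = np.allclose(np.tanh(z), rescaled_sigmoid)
print(same_curve)   # True
```

Because tanh is just a rescaled sigmoid, it inherits the same flat tails and therefore the same saturation problem.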

Softmax — multi-class output

def softmax(z):
    z = np.asarray(z, dtype=float)   # accept plain lists as well as arrays
    e = np.exp(z - z.max())          # subtract max for stability
    return e / e.sum()
# turns N scores into N probabilities that sum to 1

softmax([2.0, 1.0, 0.1]) → [0.66, 0.24, 0.10]   (sums to 1)

Use on the OUTPUT layer for "pick one of N classes". The biggest score becomes the most probable class.
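Why subtract the max? A rough sketch, assuming deliberately large scores: the names softmax_naive and softmax_stable and the test vector are hypothetical, chosen only to show the overflow.

```python
import numpy as np

def softmax_naive(z):
    # Hypothetical "unsafe" version: exponentiates the raw scores directly.
    z = np.asarray(z, dtype=float)
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    # Same idea as the lesson's softmax: shift by the max before exponentiating.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

big = [1000.0, 1001.0, 1002.0]

# exp(1000) overflows float64 to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)

stable = softmax_stable(big)
small = softmax_stable([0.0, 1.0, 2.0])   # the same scores shifted down by 1000

print(naive)
print(stable, small)
```

Shifting every score by the same constant leaves the probabilities unchanged, so subtracting the max costs nothing and keeps the exponentials within floating-point range.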

The cheat sheet

Layer            Activation
hidden           ReLU (default)
output, binary   sigmoid (1 node)
output, multi    softmax (N nodes)
output, number   none / linear (regression)

Worked Example · Plot Them All

12 min

# activations.py — visualise the four squashers
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-6, 6, 200)
relu    = np.maximum(0, z)
sigmoid = 1 / (1 + np.exp(-z))
tanh    = np.tanh(z)

fig, ax = plt.subplots(1, 3, figsize=(12, 4))
ax[0].plot(z, relu);    ax[0].set_title("ReLU")
ax[1].plot(z, sigmoid); ax[1].set_title("Sigmoid")
ax[2].plot(z, tanh);    ax[2].set_title("tanh")
for a in ax:
    a.axhline(0, color="gray", lw=0.5); a.axvline(0, color="gray", lw=0.5)
fig.tight_layout(); fig.savefig("activations.png", dpi=150)

# softmax demo (it needs a vector, not a curve)
def softmax(v):
    e = np.exp(v - v.max()); return e / e.sum()
print("softmax([2,1,0.1]) =", softmax(np.array([2, 1, 0.1])).round(2))

What the plots show

ReLU     hard kink at 0, linear after — fast, no upper saturation
Sigmoid  S-curve flattening to 0 and 1 — saturates at both ends
tanh     S-curve flattening to -1 and 1 — centred at zero
softmax([2,1,0.1]) = [0.66 0.24 0.1]

Read the plots

See how sigmoid and tanh go flat at the extremes. Flat means "tiny gradient", which is why deep networks of sigmoids learn slowly (the vanishing-gradient problem). ReLU stays linear for positives, so its gradient never vanishes there, and that property is a major reason ReLU made deep networks practical to train.
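A back-of-the-envelope sketch of why flat slopes hurt in deep stacks: backpropagation multiplies one activation slope per layer. The depth of 10 is an arbitrary assumption, and the weights are ignored here to isolate the activation's contribution.

```python
depth = 10                        # number of stacked layers (illustrative choice)

# Best case for sigmoid: every unit sits at z = 0, where the slope peaks at 0.25.
sigmoid_best_case = 0.25 ** depth

# ReLU on an active path: the slope is 1 at every layer.
relu_active_path = 1.0 ** depth

print(sigmoid_best_case)          # below one millionth
print(relu_active_path)           # 1.0
```

Even in its best case the sigmoid shrinks the signal by at least a factor of four per layer, whereas an active ReLU path passes it through at full strength.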

Try It Yourself

13 min

01 🟢 Implement softmax

Write softmax and confirm its output always sums to 1, for several input vectors.

02 🟡 Pick the activation

For each task, name the OUTPUT activation: (a) spam/not-spam, (b) digit 0-9, (c) predict temperature, (d) cat/dog/bird.

Answer

a) sigmoid (1 node)   b) softmax (10 nodes)
c) none / linear      d) softmax (3 nodes)

03 🔴 Dying ReLU demo

Show a neuron whose weights push every input negative — its ReLU output is always 0, and (you'll learn) its gradient is 0 too, so it never updates. Then swap to LeakyReLU and show it stays alive.

Hint

def leaky_relu(z, a=0.01): return np.where(z > 0, z, a*z)
print(leaky_relu(np.array([-5, -1, 0, 2])))  # tiny negative slope keeps it alive

Mini-Challenge · Activation Quiz Builder

8 min

Write a function recommend_output(task_type, n_classes) that returns the right output activation and number of output nodes for a given task. Cover binary, multiclass, and regression.

Show one possible solution

def recommend_output(task_type, n_classes=None):
    if task_type == "regression":
        return ("linear (none)", 1)
    if task_type == "binary":
        return ("sigmoid", 1)
    if task_type == "multiclass":
        if n_classes is None or n_classes < 2:
            raise ValueError("multiclass needs n_classes >= 2")
        return ("softmax", n_classes)
    raise ValueError("unknown task")

print(recommend_output("binary"))            # ('sigmoid', 1)
print(recommend_output("multiclass", 10))    # ('softmax', 10)
print(recommend_output("regression"))        # ('linear (none)', 1)

Non-negotiables: correct activation + node count per task. Hidden layers default to ReLU; only the output activation changes with the task.

Recap

3 min

Activations add the non-linearity that makes depth worthwhile. Default recipe: ReLU in hidden layers; sigmoid (1 node) for binary output; softmax (N nodes) for multiclass; linear/none for regression. Sigmoid/tanh saturate (vanishing gradients) so avoid them in deep hidden layers. Next: actually building this in Keras.

Vocabulary Card

ReLU: max(0, z). The default hidden-layer activation.
softmax: Turns N scores into N probabilities summing to 1 — multiclass output.
vanishing gradient: When saturated activations make gradients tiny, stalling deep training.
dying ReLU: A neuron stuck outputting 0 forever; LeakyReLU mitigates it.

Homework

4 min

Make a one-page cheat sheet (image or markdown) of the four activations: formula, plot/shape, range, where to use it, one gotcha. You'll reference this for the rest of Level 5.
