The activation function · NorthGradient

In the last lesson a neuron produced a raw number $z$ that could be any size. On its own that number is just a straight-line combination of the inputs. To build something that can bend and curve, each neuron passes $z$ through an activation function, a fixed nonlinear function applied to that raw number.

Without a nonlinear activation, a deep stack of neurons collapses into a single linear function, no matter how many layers you add.

Squashing the sum

The activation function takes the neuron’s raw output $z$ and returns the neuron’s final output $a$:

$$a = \sigma(z)$$

Here $z$ is the weighted sum from the previous lesson, $\sigma$ stands for whichever activation function is chosen, and $a$ is the activated output that the neuron actually passes on. Two common choices:

$$\text{ReLU}(z) = \max(0, z)$$

ReLU returns $z$ when $z$ is positive and returns $0$ otherwise. It keeps positive signals and discards negative ones.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

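This is the sigmoid, the specific function that the symbol $\sigma$ most often denotes. Here $e \approx 2.718$ is the base of the natural exponential, and $z$ is the raw input. The output always lies between $0$ and $1$: it approaches $1$ as $z$ becomes large and positive, and approaches $0$ as $z$ becomes large and negative, so it never blows up no matter how extreme $z$ is.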
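To see the two definitions side by side, here is a minimal sketch that evaluates both on a handful of inputs. The function names `relu` and `sigmoid` and the sample values are illustrative choices made for this example, not something fixed by the lesson.

```python
import math

def relu(z):
    """Return z when z is positive, otherwise 0."""
    return max(0.0, z)

def sigmoid(z):
    """Return 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# illustrative sample inputs, chosen only for demonstration
samples = [-10.0, -1.0, 0.0, 1.0, 10.0]

for z in samples:
    print(f"z = {z:6.1f}   relu = {relu(z):5.1f}   sigmoid = {sigmoid(z):.5f}")
```

On these samples ReLU grows in step with positive inputs, while the sigmoid flattens out near $0$ and $1$. One caveat about this direct formula: for very negative inputs (roughly below $-709$ in double precision) `math.exp(-z)` overflows and raises an error, which is why the sketch sticks to moderate values.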

Figure: The same neuron, now with its raw output $z$ passing through an activation function to produce the final output $a$.

A worked example

Take the value from the previous lesson, $z = -1$, and apply each function:

$$\text{ReLU}(-1) = \max(0, -1) = 0$$

$$\sigma(-1) = \frac{1}{1 + e^{1}} \approx \frac{1}{3.718} \approx 0.269$$

In code:

import math

# the neuron's raw weighted sum from the previous lesson
z = -1.0

# ReLU keeps positive values and replaces negatives with zero
relu = max(0.0, z)

# sigmoid squashes any number into the range between 0 and 1
sigmoid = 1 / (1 + math.exp(-z))

print(relu)     # 0.0
print(sigmoid)  # 0.2689414213699951

Why the nonlinearity matters

Suppose neurons had no activation, so each layer just computed a weighted sum. Stacking two such layers would give $\mathbf{W}_2(\mathbf{W}_1 \mathbf{x}) = (\mathbf{W}_2 \mathbf{W}_1)\mathbf{x}$, where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the two layers’ weights. The product $\mathbf{W}_2 \mathbf{W}_1$ is itself just one matrix, so the whole stack reduces to a single weighted sum. Adding more layers only multiplies in more matrices, and the result is still one matrix, so the network can represent nothing that a single layer could not. If each layer also adds a bias, the conclusion is the same: the stack reduces to one weighted sum plus one bias. The nonlinear $\sigma$ placed between layers is exactly what breaks this collapse and lets depth add real power.
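The collapse can be checked numerically. The sketch below uses small made-up $2 \times 2$ weight matrices (`W1`, `W2`) and hypothetical helper functions (`matvec`, `matmul`, `relu_net`) written in plain Python; none of these values or names come from the lesson. It compares two stacked linear layers against the single merged matrix, then places ReLU between the layers and tests a property every linear map must satisfy, $f(-\mathbf{x}) = -f(\mathbf{x})$.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# made-up weights and input, for illustration only
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

def relu_net(v):
    """Two layers with ReLU applied between them."""
    hidden = [max(0.0, h) for h in matvec(W1, v)]
    return matvec(W2, hidden)

# two linear layers applied in sequence vs. one merged matrix
stacked = matvec(W2, matvec(W1, x))
merged = matvec(matmul(W2, W1), x)
print(stacked)  # [-0.5, 5.5]
print(merged)   # [-0.5, 5.5]

# with ReLU in between, flipping the input's sign no longer flips the output
print(relu_net(x))              # [1.5, 4.5]
print(relu_net([-1.0, -1.0]))   # [2.0, -1.0]
```

The first two results agree, as the algebra predicts. The last two do not negate each other, so no single matrix can reproduce the ReLU network on every input.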

In the next lesson, we will line up many of these neurons side by side into a single layer and write the whole layer as one compact equation.