Sigmoid was used initially because of the logistic regression analogy, but the field has moved firmly toward ReLU for hidden layers. Understanding why is essential for building effective networks.
Three most common activation functions:
- Sigmoid:
g(z) = 1/(1+e⁻ᶻ). Output: (0, 1). Use only at the output layer for binary classification. Problem: saturates (goes flat) at both extremes → small gradients → slow learning.
- ReLU (Rectified Linear Unit):
g(z) = max(0, z). Output: [0, ∞). Use for hidden layers. Only flat for z < 0, not at both extremes. Faster to compute than sigmoid.
- Linear:
g(z) = z. Output: (−∞, ∞). Use at the output layer for regression problems where ŷ can be any real number.
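The three functions above can be sketched directly in NumPy; this is a minimal illustration, not a library implementation:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1/(1+e^(-z)); output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # g(z) = max(0, z); output in [0, inf)
    return np.maximum(0.0, z)

def linear(z):
    # g(z) = z; output in (-inf, inf)
    return z

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # ≈ [0.0067 0.5 0.9933]
print(relu(z))     # [0. 0. 5.]
print(linear(z))   # [-5. 0. 5.]
```

Note how sigmoid squeezes both extremes toward 0 and 1, while ReLU passes positive values through unchanged.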
Why ReLU dominates hidden layers:
- Faster computation:
max(0, z) is cheaper than computing e⁻ᶻ.
- Fewer flat regions: Sigmoid is flat at both extremes → small gradients throughout → slow learning. ReLU is flat only for z < 0 → stronger gradients in the active region.
Other options, such as tanh, LeakyReLU, and swish, appear in the literature and occasionally work slightly better. ReLU is the safe default for hidden layers in most architectures today.
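The gradient argument can be checked numerically. The sketch below compares the derivative of sigmoid, g'(z) = g(z)(1 − g(z)), against ReLU's subgradient (taking 0 at z = 0, a common convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative g(z)(1 - g(z)): peaks at 0.25 near z = 0,
    # vanishes for large |z| (saturation at both extremes)
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # subgradient: 1 for z > 0, 0 for z <= 0
    return np.where(np.asarray(z) > 0, 1.0, 0.0)

for z in (-10.0, 0.0, 10.0):
    print(z, float(sigmoid_grad(z)), float(relu_grad(z)))
```

At z = ±10 the sigmoid gradient is below 10⁻⁴ while ReLU's gradient in the active region stays at 1, which is the "stronger gradients" point in concrete numbers.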
Interview-Ready Deepening
Source-backed reinforcement: these points restate and deepen the key takeaways, with an emphasis on production tradeoffs.
- ReLU, linear activation, and why hidden layers should almost never use sigmoid.
- In the demand-prediction example, awareness should be allowed to take any nonnegative value, from 0 up to very large numbers, which matches ReLU's [0, ∞) output range.
- ReLU, with its distinctive capitalization, stands for the somewhat arcane term rectified linear unit: g(z) = max(0, z).
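To make the placement advice concrete (ReLU in hidden layers, sigmoid only at the output for binary classification), here is a minimal NumPy forward pass. The layer sizes and random weights are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4 inputs -> 3 hidden units -> 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

def forward(x):
    a1 = relu(W1 @ x + b1)        # hidden layer: ReLU
    return sigmoid(W2 @ a1 + b2)  # output layer: sigmoid -> probability

p = forward(rng.normal(size=4))
print(p)  # a single value in (0, 1)
```

For a regression head, the only change would be returning W2 @ a1 + b2 directly (a linear output).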
Tradeoffs You Should Be Able to Explain
- More expressive models improve fit but can reduce interpretability and raise overfitting risk.
- Higher optimization speed can reduce training time but may increase instability if learning dynamics are not monitored.
- Feature-rich pipelines improve performance ceilings but increase maintenance and monitoring complexity.
First-time learner note: Read each model as a dataflow system: inputs become representations, representations become scores, and scores become decisions through a chosen loss and thresholding policy.
Production note: Track three things relentlessly in ML systems: data shape contracts, evaluation methodology, and the operational meaning of the model's errors. The most expensive failures come from one of those three.
Activation functions determine the range and shape of what a neuron can represent. ReLU lets hidden units express zero or arbitrarily large positive responses, which is often a better fit for graded internal concepts such as awareness, intensity, or count-like evidence than forcing everything into a probability-like 0 to 1 range.
Edge case: ReLU can create inactive units when z stays negative. That is still usually a better default than saturating every hidden unit with sigmoid, but it is one reason people later explore variants such as LeakyReLU.
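The LeakyReLU variant mentioned above addresses exactly this edge case. A minimal sketch (the slope value 0.01 is a common convention, not prescribed by the source):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Small negative slope alpha keeps a nonzero gradient for z < 0,
    # so a unit stuck in the negative region can still learn
    # (avoiding the "dead ReLU" problem).
    return np.where(z > 0, z, alpha * z)

z = np.array([-100.0, -1.0, 0.0, 2.0])
print(leaky_relu(z))  # negatives scaled by 0.01, positives unchanged
```

Compared with plain ReLU, only the z < 0 branch changes; the active region behaves identically.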