Activation Functions
Activation functions are the decision-makers of neural networks — they introduce non-linearity, enable feature selection, and determine whether learning can happen at all. A practical guide to Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax.
Executive Summary
Activation functions are the decision-makers of neural networks — they determine whether a neuron fires and how strongly. More precisely, they introduce non-linearity to the system.
Without activation functions, neural networks are nothing more than glorified linear regression models, regardless of how many layers you stack. They normalize outputs, enable feature selection, and facilitate gradient flow during backpropagation.
The Problem They Solve
Consider stacking multiple linear transformations: W₂(W₁x + b₁) + b₂. This is algebraically equivalent to a single linear transformation. Activation functions break this linearity, allowing networks to approximate arbitrarily complex functions.
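The collapse of stacked linear layers can be verified numerically. This is a minimal sketch with arbitrary random weights: composing two affine maps gives exactly one affine map with W = W₂W₁ and b = W₂b₁ + b₂.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers, no activation in between
y_stacked = W2 @ (W1 @ x + b1) + b2

# The algebraically equivalent single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

assert np.allclose(y_stacked, y_single)  # identical, regardless of depth
```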
Five Core Activation Functions
Sigmoid
Maps any input to the range (0, 1). Historically popular, now primarily used in binary classification output layers.
σ(x) = 1 / (1 + e^(−x))
Limitation: Vanishing gradients. When inputs are very large or very small, the gradient approaches zero — learning effectively stops in early layers.
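The saturation behaviour is easy to see from the derivative σ′(x) = σ(x)(1 − σ(x)), which peaks at 0.25 at the origin and decays toward zero for large |x|. A small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); maximum value is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))          # 0.25 — the largest gradient sigmoid can pass
print(sigmoid_grad(10.0) < 1e-4)  # True — deep in saturation, learning stalls
```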
Tanh (Hyperbolic Tangent)
Maps inputs to (−1, 1). Zero-centered, which makes it preferable over Sigmoid for RNN hidden layers. Still suffers from vanishing gradients at the extremes.
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
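A quick check of the two properties mentioned above: Tanh is an odd (zero-centered) function, yet its gradient 1 − tanh²(x) still vanishes at the extremes, just like Sigmoid's.

```python
import math

# Zero-centered: tanh(-x) = -tanh(x), unlike sigmoid's strictly positive range
assert math.tanh(-2.0) == -math.tanh(2.0)
print(math.tanh(0.0))  # 0.0

# But saturation remains: the gradient 1 - tanh(x)^2 collapses for large |x|
print(1.0 - math.tanh(5.0) ** 2 < 1e-3)  # True
```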
ReLU (Rectified Linear Unit)
The industry standard for hidden layers in CNNs and deep networks. Computationally trivial and empirically effective.
ReLU(x) = max(0, x)
Limitation: The "dying ReLU" problem. Neurons receiving consistently negative inputs output zero permanently and receive zero gradients — they stop learning entirely.
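ReLU and its gradient can be sketched in two lines. The gradient makes the dying-ReLU failure mode explicit: any neuron whose pre-activation stays negative receives a gradient of exactly zero and never updates.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise — a neuron stuck at x <= 0
    # gets no gradient signal at all ("dying ReLU")
    return 1.0 if x > 0 else 0.0

print(relu(3.5), relu(-2.0))            # 3.5 0.0
print(relu_grad(3.5), relu_grad(-2.0))  # 1.0 0.0
```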
Leaky ReLU
Fixes dying ReLU by allowing a small negative slope (α ≈ 0.01). Commonly used in GANs and architectures sensitive to neuron death.
LeakyReLU(x) = x if x > 0, else αx
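The fix is a one-line change to ReLU: multiply negative inputs by a small slope α instead of zeroing them, so the gradient on the negative side is α rather than 0.

```python
def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by alpha, not clamped to zero,
    # so the neuron always receives a non-zero gradient
    return x if x > 0 else alpha * x

print(leaky_relu(3.0))   # 3.0 — positive side is identical to ReLU
print(leaky_relu(-5.0))  # small negative value (alpha * x), not 0
```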
Softmax
Converts a raw output vector (logits) into a probability distribution summing to 1.0. Used almost exclusively in the output layer of multi-class classification networks.
Softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
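A sketch of the formula in plain Python. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick: it leaves the result mathematically unchanged but prevents overflow in exp for large logits.

```python
import math

def softmax(xs):
    m = max(xs)  # shift for numerical stability; does not change the result
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])   # [0.659, 0.242, 0.099]
print(abs(sum(probs) - 1.0) < 1e-12)  # True — a valid probability distribution
```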
Why This Matters for Training
During backpropagation, weight updates are proportional to the activation function's derivative at each point. When that derivative approaches zero — as with Sigmoid and Tanh at saturation — gradients vanish across layers and learning stalls.
This is why ReLU and its variants dominate modern deep learning: their derivatives are either 0 or 1 (or a small constant), preventing gradient collapse across most inputs.
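A toy illustration of why this matters with depth: gradients multiply across layers, so a per-layer derivative below 1 shrinks exponentially, while ReLU's derivative of 1 passes the signal through intact. The 10-layer depth and the sample input x = 2 here are arbitrary choices for illustration.

```python
import math

x = 2.0  # a mildly saturated input for sigmoid

s = 1.0 / (1.0 + math.exp(-x))
sigmoid_layer_grad = s * (1.0 - s)  # ≈ 0.105 at x = 2
relu_layer_grad = 1.0               # ReLU derivative for any x > 0

depth = 10
# Backprop multiplies these derivatives across all layers
print(sigmoid_layer_grad ** depth < 1e-9)  # True — gradient has vanished
print(relu_layer_grad ** depth)            # 1.0 — gradient survives intact
```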
Practical Selection Guide
| Use Case | Recommended Function |
|---|---|
| Hidden layers (CNN, MLP) | ReLU |
| Hidden layers (GAN, dying neuron risk) | Leaky ReLU |
| RNN hidden layers | Tanh |
| Binary classification output | Sigmoid |
| Multi-class classification output | Softmax |
The choice of activation function is not cosmetic — it directly determines whether your network can learn at all.
Key Takeaways
- Core Concept: machine-learning
- Difficulty: Intermediate/Advanced
- Author: Gökçe Akçıl (Senior AI/ML Engineer)