Activation Functions
Activation functions are the decision-makers of neural networks — they introduce non-linearity, enable feature selection, and determine whether learning can happen at all. A practical guide to Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax.
Executive Summary
Activation functions are the decision-makers of neural networks — they determine whether a neuron fires and how strongly. More precisely, they introduce non-linearity to the system.
Without activation functions, neural networks are nothing more than glorified linear regression models, regardless of how many layers you stack. They normalize outputs, enable feature selection, and facilitate gradient flow during backpropagation.
The Problem They Solve
Consider stacking multiple linear transformations: W₂(W₁x + b₁) + b₂. This is algebraically equivalent to a single linear transformation. Activation functions break this linearity, allowing networks to approximate arbitrarily complex functions.
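The collapse of stacked linear layers can be verified numerically. This is a minimal sketch with arbitrary random weights: composing two affine maps gives exactly one affine map with W = W₂W₁ and b = W₂b₁ + b₂.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers, no activation in between
y_stacked = W2 @ (W1 @ x + b1) + b2

# The algebraically equivalent single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
y_single = W @ x + b

assert np.allclose(y_stacked, y_single)  # identical, regardless of depth
```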
Five Core Activation Functions
Sigmoid
Maps any input to the range (0, 1). Historically popular, now primarily used in binary classification output layers.
σ(x) = 1 / (1 + e^(−x))
Limitation: Vanishing gradients. When inputs are very large or very small, the gradient approaches zero — learning effectively stops in early layers.
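The saturation behaviour is easy to see from the derivative σ′(x) = σ(x)(1 − σ(x)), which peaks at 0.25 at the origin and decays toward zero for large |x|. A small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); maximum value is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))          # 0.25 — the largest gradient sigmoid can pass
print(sigmoid_grad(10.0) < 1e-4)  # True — deep in saturation, learning stalls
```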
Tanh (Hyperbolic Tangent)
Maps inputs to (−1, 1). Zero-centered, which makes it preferable over Sigmoid for RNN hidden layers. Still suffers from vanishing gradients at the extremes.
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
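A quick check of the two properties mentioned above: Tanh is an odd (zero-centered) function, yet its gradient 1 − tanh²(x) still vanishes at the extremes, just like Sigmoid's.

```python
import math

# Zero-centered: tanh(-x) = -tanh(x), unlike sigmoid's strictly positive range
assert math.tanh(-2.0) == -math.tanh(2.0)
print(math.tanh(0.0))  # 0.0

# But saturation remains: the gradient 1 - tanh(x)^2 collapses for large |x|
print(1.0 - math.tanh(5.0) ** 2 < 1e-3)  # True
```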
ReLU (Rectified Linear Unit)
The industry standard for hidden layers in CNNs and deep networks. Computationally trivial and empirically effective.
ReLU(x) = max(0, x)
Limitation: The "dying ReLU" problem. Neurons receiving consistently negative inputs output zero permanently and receive zero gradients — they stop learning entirely.
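ReLU and its gradient can be sketched in two lines. The gradient makes the dying-ReLU failure mode explicit: any neuron whose pre-activation stays negative receives a gradient of exactly zero and never updates.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise — a neuron stuck at x <= 0
    # gets no gradient signal at all ("dying ReLU")
    return 1.0 if x > 0 else 0.0

print(relu(3.5), relu(-2.0))            # 3.5 0.0
print(relu_grad(3.5), relu_grad(-2.0))  # 1.0 0.0
```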
Leaky ReLU
Fixes dying ReLU by allowing a small negative slope (α ≈ 0.01). Commonly used in GANs and architectures sensitive to neuron death.
LeakyReLU(x) = x if x > 0, else αx
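The fix is a one-line change to ReLU: multiply negative inputs by a small slope α instead of zeroing them, so the gradient on the negative side is α rather than 0.

```python
def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by alpha, not clamped to zero,
    # so the neuron always receives a non-zero gradient
    return x if x > 0 else alpha * x

print(leaky_relu(3.0))   # 3.0 — positive side is identical to ReLU
print(leaky_relu(-5.0))  # small negative value (alpha * x), not 0
```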
Softmax
Converts a raw output vector (logits) into a probability distribution summing to 1.0. Used almost exclusively in the output layer of multi-class classification networks.
Softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
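A sketch of the formula in plain Python. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick: it leaves the result mathematically unchanged but prevents overflow in exp for large logits.

```python
import math

def softmax(xs):
    m = max(xs)  # shift for numerical stability; does not change the result
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])   # [0.659, 0.242, 0.099]
print(abs(sum(probs) - 1.0) < 1e-12)  # True — a valid probability distribution
```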
Why This Matters for Training
During backpropagation, weight updates are proportional to the activation function's derivative at each point. When that derivative approaches zero — as with Sigmoid and Tanh at saturation — gradients vanish across layers and learning stalls.
This is why ReLU and its variants dominate modern deep learning: their derivatives are either 0 or 1 (or a small constant), preventing gradient collapse across most inputs.
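A toy illustration of why this matters with depth: gradients multiply across layers, so a per-layer derivative below 1 shrinks exponentially, while ReLU's derivative of 1 passes the signal through intact. The 10-layer depth and the sample input x = 2 here are arbitrary choices for illustration.

```python
import math

x = 2.0  # a mildly saturated input for sigmoid

s = 1.0 / (1.0 + math.exp(-x))
sigmoid_layer_grad = s * (1.0 - s)  # ≈ 0.105 at x = 2
relu_layer_grad = 1.0               # ReLU derivative for any x > 0

depth = 10
# Backprop multiplies these derivatives across all layers
print(sigmoid_layer_grad ** depth < 1e-9)  # True — gradient has vanished
print(relu_layer_grad ** depth)            # 1.0 — gradient survives intact
```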
Practical Selection Guide
| Use Case | Recommended Function |
|---|---|
| Hidden layers (CNN, MLP) | ReLU |
| Hidden layers (GAN, dying neuron risk) | Leaky ReLU |
| RNN hidden layers | Tanh |
| Binary classification output | Sigmoid |
| Multi-class classification output | Softmax |
The choice of activation function is not cosmetic — it directly determines whether your network can learn at all.
Key Takeaways
- Core Concept: machine-learning
- Difficulty: Intermediate/Advanced
- Author: Gökçe Akçıl (Senior AI/ML Engineer)