The purpose of an activation function is to add non-linearity to the neural network.
Let’s suppose we have a neural network working without activation functions.
In that case, every neuron would only perform a linear transformation on the inputs using the weights and biases. It wouldn’t matter how many hidden layers we attach to the network: all layers would behave in the same way, because the composition of two linear functions is itself a linear function.
Although the neural network becomes simpler, it could not learn any complex task, and our model would be nothing more than a linear regression model.
Non-linear activation functions remove this limitation: stacking layers becomes meaningful, because each layer’s output is then a non-linear combination of its inputs.
Sigmoid / Logistic Activation Function
This function takes any real value as input and outputs values in the range of 0 to 1.
Sigmoid function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The derivative of the Sigmoid function:
$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$
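A minimal NumPy sketch of the sigmoid and its derivative (the function names are my own, for illustration):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); largest at x = 0 (0.25),
    # and approaches 0 as |x| grows (the saturation behind vanishing gradients)
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))             # ~[0.0067, 0.5, 0.9933]
print(sigmoid_derivative(x))  # ~[0.0066, 0.25, 0.0066]
```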
Advantages of the Sigmoid Function
It is commonly used in models where we have to predict a probability as the output. Since a probability exists only in the range 0 to 1, sigmoid is the right choice because of its range.
The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in output values. This is represented by the S-shape of the sigmoid activation function.
Limitation of the Sigmoid function
The gradient is only significant for inputs close to zero; for large positive or negative inputs the function saturates and the gradient becomes very small, which gives rise to the vanishing gradient problem discussed later. The output is also not centered around zero.
(Figure: comparison of Sigmoid with ReLU and other activation functions)
Tanh Function (Hyperbolic Tangent)
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
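A quick NumPy check (illustrative only; the inputs are arbitrary) that tanh outputs are zero-centered, unlike sigmoid’s:

```python
import numpy as np

def tanh(x):
    # tanh(x) = (e^x - e^-x) / (e^x + e^-x); output range is (-1, 1)
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x))         # symmetric around 0: ~[-0.964, -0.462, 0.0, 0.462, 0.964]
print(tanh(x).mean())  # ~0.0 for symmetric inputs -> zero-centered outputs
```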
Advantages of the Tanh Function
The output is zero-centered (values lie between -1 and 1), so outputs can easily be interpreted as strongly negative, neutral, or strongly positive, and the roughly zero mean of the activations makes learning for the next layer easier.
Limitation of the Tanh function
Like sigmoid, tanh saturates for large positive or negative inputs, so it also suffers from the vanishing gradient problem.
ReLU Function / Rectified Linear Unit
$$f(x) = \max(0, x)$$
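A minimal NumPy sketch of ReLU and its gradient (names are illustrative):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 for x > 0 and 0 for x < 0 (undefined at exactly 0; 0 is used here)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0.  0.  0.  1.  1. ]  <- zero gradient for all negative inputs
```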
Advantages of the ReLU Function
Since only the neurons with positive input are activated, ReLU is far more computationally efficient than sigmoid and tanh, and its non-saturating behaviour for positive inputs accelerates the convergence of gradient descent.
Limitation of the ReLU function
All negative inputs are mapped to zero, so their gradient is zero as well; neurons whose input stays negative stop updating their weights entirely. This is known as the Dying ReLU problem.
Leaky ReLU Function
$$f(x) = \max(0.1x,\ x)$$
Leaky ReLU is an improved version of the ReLU function designed to solve the Dying ReLU problem: it has a small positive slope in the negative region.
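A sketch of Leaky ReLU with the fixed 0.1 slope used in the formula above; note the gradient is no longer zero for negative inputs:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    # f(x) = max(slope * x, x) for 0 < slope < 1
    return np.where(x >= 0, x, slope * x)

def leaky_relu_grad(x, slope=0.1):
    # gradient is 1 for x >= 0 and `slope` for x < 0, so negative inputs still learn (slowly)
    return np.where(x >= 0, 1.0, slope)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))       # [-0.3  -0.05  0.5   3.  ]
print(leaky_relu_grad(x))  # [ 0.1   0.1   1.    1.  ]
```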
Limitation of the Leaky ReLU function
The gradient for negative inputs is a small constant value, which makes learning the model parameters for that region time-consuming.
Parametric ReLU Function
$$f(x) = \max(\alpha x,\ x)$$
Parametric ReLU is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis.
This function takes the slope of the negative part as an argument α, where α is the slope parameter for negative values; the most appropriate value of α is learnt through backpropagation.
The parameterized ReLU function is used when the leaky ReLU function still fails at solving the problem of dead neurons, and the relevant information is not successfully passed to the next layer.
This function’s limitation is that it may perform differently for different problems depending upon the value of slope parameter α \alpha α.
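A minimal PyTorch sketch of Parametric ReLU in which α is a learnable parameter updated by backpropagation (PyTorch also ships a built-in torch.nn.PReLU; this hand-rolled module is only to show the idea):

```python
import torch
import torch.nn as nn

class ParametricReLU(nn.Module):
    """f(x) = x for x >= 0, alpha * x for x < 0, with alpha learned by backprop."""
    def __init__(self, init_alpha: float = 0.25):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, x):
        return torch.where(x >= 0, x, self.alpha * x)

act = ParametricReLU()
x = torch.tensor([-2.0, 3.0], requires_grad=True)
y = act(x).sum()
y.backward()
print(act.alpha.grad)  # tensor(-2.): the gradient w.r.t. alpha comes only from the negative input
```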
Exponential Linear Units (ELUs) Function
$$f(x) = \begin{cases} x & \text{for } x \geq 0 \\ \alpha\,(e^{x} - 1) & \text{for } x < 0 \end{cases}$$
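A NumPy sketch of ELU (α = 1.0 is assumed here; the output smoothly approaches -α for very negative inputs):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x >= 0, alpha * (e^x - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(x))  # ~[-0.993, -0.632, 0.0, 2.0]
```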
Advantages of the ELU Function
Unlike ReLU, ELU produces negative outputs that smoothly approach -α, which pushes the mean activation closer to zero and avoids the dead ReLU problem.
Limitation of the ELU function
The exponential operation makes it more expensive to compute, and α is a fixed hyperparameter rather than a learnt value.
Softmax Function
The Softmax activation function makes things easy for multi-class classification problems: it turns a vector of raw scores into a probability distribution over the classes.
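For reference, softmax(x_i) = e^{x_i} / Σ_j e^{x_j}. A numerically stable NumPy sketch (subtracting the maximum before exponentiating is a standard trick and does not change the result):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability, then normalize the exponentials.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 -> a valid probability distribution over the classes
```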
Swish
It is a self-gated activation function developed by researchers at Google.
$$f(x) = x \cdot \operatorname{sigmoid}(x)$$
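A quick NumPy sketch of Swish showing the smooth, non-monotonic dip it has for moderate negative inputs (inputs chosen arbitrarily):

```python
import numpy as np

def swish(x):
    # f(x) = x * sigmoid(x), written as x / (1 + e^(-x))
    return x / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(swish(x))  # ~[-0.0005, -0.269, 0.0, 0.731, 9.9995]
# Large negative inputs go to ~0 (sparsity), small negative inputs are kept,
# and the curve dips below 0 before rising again (non-monotonic).
```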
Here are a few advantages of the Swish activation function over ReLU:
Swish is a smooth function, meaning it does not abruptly change direction like ReLU does near x = 0. Rather, it bends smoothly from 0 towards values < 0 and then upwards again.
ReLU zeroes out all negative values, yet small negative values may still be relevant for capturing patterns underlying the data. Swish keeps small negative values while still zeroing out large negative values for the sake of sparsity, making it a win-win situation.
The Swish function being non-monotonic enhances the expression of the input data and of the weights to be learnt.
Gaussian Error Linear Unit (GELU)
$$f(x) = x\,P(X \leq x) = x\,\Phi(x) \approx 0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715x^{3}\bigr)\right]\right)$$
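A NumPy sketch of the tanh approximation of GELU given above (the constants come from the formula, not from any particular library):

```python
import numpy as np

def gelu(x):
    # 0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x^3) ))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(gelu(x))  # ~[-0.0036, -0.159, 0.0, 0.841, 2.996]
```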
The Gaussian Error Linear Unit (GELU) activation function is compatible with BERT, ROBERTa, ALBERT, and other top NLP models. This activation function is motivated by combining properties from dropout, zoneout, and ReLUs.
The GELU nonlinearity performs better than ReLU and ELU activations, with performance improvements reported across tasks in computer vision, natural language processing, and speech recognition.
Scaled Exponential Linear Unit (SELU)
SELU was introduced in self-normalizing networks and takes care of internal normalization, which means each layer preserves the mean and variance from the previous layers. SELU enables this normalization by adjusting the mean and variance.
SELU has both positive and negative values to shift the mean, which was impossible for the ReLU activation function since it cannot output negative values.
Gradients can be used to adjust the variance. The activation function needs a region with a gradient larger than one to increase it.
$$f(\alpha, x) = \lambda \begin{cases} \alpha\,(e^{x} - 1) & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$$
SELU has predefined values for α and λ.
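A NumPy sketch of SELU using the fixed constants from the self-normalizing networks paper (α ≈ 1.6733, λ ≈ 1.0507):

```python
import numpy as np

# Predefined constants from the SELU / self-normalizing networks paper (approximate).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # lambda * x for x >= 0, lambda * alpha * (e^x - 1) for x < 0
    return LAMBDA * np.where(x >= 0, x, ALPHA * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(selu(x))  # ~[-1.671, -1.111, 0.0, 1.051, 3.152]
```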
Here’s the main advantage of SELU over ReLU: internal normalization is faster than external normalization, which means the network converges faster.
SELU is a relatively new activation function, and more research is needed on architectures such as CNNs and RNNs, where it is still comparatively unexplored.
There are two challenges you might encounter when training your deep neural networks.
The first is the vanishing gradient problem. Like the sigmoid function, certain activation functions squish an ample input space into a small output space between 0 and 1.
Therefore, a large change in the input of the sigmoid function causes only a small change in the output, so the derivative is small. For shallow networks with only a few layers that use these activations, this isn’t a big problem.
However, when more layers are used, the gradients can become too small for training to work effectively.
Exploding gradients are problems where significant error gradients accumulate and result in very large updates to neural network model weights during training.
Exploding gradients can make the network unstable, and the learning cannot be completed.
The values of the weights can also become so large as to overflow and result in something called NaN values.
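A small NumPy illustration (toy numbers, not a real network) of both problems: the product of sigmoid derivatives shrinks towards zero as depth grows, while repeatedly multiplying by large weights blows values up until they overflow:

```python
import numpy as np

# Vanishing: the sigmoid derivative is at most 0.25, so backpropagating through
# many sigmoid layers multiplies many small numbers together.
max_sigmoid_grad = 0.25
for depth in (2, 10, 30):
    print(depth, max_sigmoid_grad ** depth)  # 0.0625, ~9.5e-07, ~8.7e-19

# Exploding: repeatedly multiplying by a weight larger than 1 grows without bound
# and eventually overflows float32 to inf (NumPy emits an overflow warning);
# further arithmetic such as inf - inf then yields NaN.
value = np.float32(1.0)
for _ in range(300):
    value *= np.float32(10.0)
print(value)  # inf
```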
You need to match your activation function for your output layer based on the type of prediction problem that you are solving—specifically, the type of predicted variable.
Here’s what you should keep in mind.
As a rule of thumb, you can begin by using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.
And here are a few other guidelines to help you out:
- ReLU should only be used in the hidden layers.
- Sigmoid/Logistic and Tanh should be avoided in the hidden layers of deep networks, as they make the model more susceptible to vanishing gradients during training.
Finally, a few rules for choosing the activation function for your output layer based on the type of prediction problem that you are solving:
- Regression: linear activation (no activation on the output).
- Binary classification: Sigmoid.
- Multiclass classification: Softmax.
- Multilabel classification: Sigmoid.
The activation function used in hidden layers is typically chosen based on the type of neural network architecture: ReLU is the usual choice for convolutional networks and MLPs, while Tanh and/or Sigmoid are still common inside recurrent networks.
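A minimal PyTorch sketch (layer sizes are arbitrary) of the rules above: ReLU in the hidden layers and an output activation matched to the task:

```python
import torch.nn as nn

# Shared hidden layers: ReLU is the usual default here.
hidden = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
)

# Output layer activation depends on the prediction problem.
regression_head = nn.Linear(16, 1)                                      # linear output for regression
binary_head = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())             # binary classification
multiclass_head = nn.Sequential(nn.Linear(16, 5), nn.Softmax(dim=-1))   # multiclass classification
multilabel_head = nn.Sequential(nn.Linear(16, 5), nn.Sigmoid())         # multilabel classification
# Note: with nn.CrossEntropyLoss you would feed raw logits and omit the Softmax layer.
```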