Activation Functions in Deep Learning / Neural Networks


SAURABH MISHRA
7 min read · Oct 22, 2020

In simple terms, an activation function is a function used to compute the output of a neuron from its inputs through some mathematical operation. The activation function attached to each neuron helps us decide whether that neuron should be activated or not, depending on whether it is contributing to the model's prediction.

In neural networks we mostly use non-linear activation functions, but the question is: why?

Let's say we use only linear activation functions in a neural network. Then each layer is just a linear transformation of its input, and the whole network can be collapsed into a single matrix multiplication. But does that solve our complex real-world problems?

The answer is no.

The same holds when all neurons have affine activation functions (i.e. an activation function of the form f(x) = a*x + c, where a and c are constants, a generalization of linear activation functions): the network as a whole is just an affine transformation from input to output, which is not very exciting either.
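
To see this concretely, here is a minimal sketch (assuming NumPy, purely for illustration) showing that two stacked linear layers collapse into a single linear layer:

    import numpy as np

    rng = np.random.default_rng(0)

    x = rng.normal(size=(4, 3))    # a batch of 4 inputs with 3 features
    W1 = rng.normal(size=(3, 5))   # weights of the first linear layer
    W2 = rng.normal(size=(5, 2))   # weights of the second linear layer

    # Two linear layers applied one after the other ...
    two_layers = (x @ W1) @ W2

    # ... are equivalent to one layer whose weight matrix is W1 @ W2.
    one_layer = x @ (W1 @ W2)

    print(np.allclose(two_layers, one_layer))  # True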

We need non-linear activation functions to approximate non-linear functions: most real-world problems are highly complex and non-linear, and cannot be approximated with linear functions.

The purpose of a non-linear activation function is to bring non-linearity into the network so that the network can recognize the non-linear patterns in the data or the problem.

In the diagram above, the input is x and the weight is w.

Here we multiply the weights with the inputs, sum all of these products, and apply the activation function on top of that sum. What we get as output is then passed on to the next hidden layer, and so on, or, in the case of a single hidden layer, it can be given directly to the output layer.
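
As a concrete illustration, a single neuron's forward pass might look like the sketch below (assuming NumPy; sigmoid is used here only as a placeholder activation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron_forward(x, w, b, activation=sigmoid):
        # Weighted sum of the inputs plus a bias, passed through the activation.
        z = np.dot(w, x) + b
        return activation(z)

    x = np.array([0.5, -1.2, 3.0])   # inputs
    w = np.array([0.1, 0.4, -0.2])   # weights
    b = 0.05                         # bias
    print(neuron_forward(x, w, b))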

Types of Activation Functions

(Here we will look at the activation functions that are most commonly used in neural networks.)

1. Sigmoid

This, along with tanh (which we will see later), was one of the most popular activation functions in the 80s and 90s.

Here is the chart of the sigmoid function and its derivative.

This function, sigmoid(x) = 1 / (1 + e^(-x)), squashes its input into the range between 0 and 1, as can be seen in the chart shown above. The function is not zero-centred: its output is always positive, and negative inputs are mapped to values between 0 and 0.5.

When we look at the second chart, we can see that the derivative of the sigmoid ranges between 0 and 0.25.

Derivatives matter most when we need to update the weights during backpropagation. The derivative of the sigmoid can be expressed in terms of the sigmoid itself, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), so it is easy to compute. However, since the derivative only ranges from 0 to 0.25, we encounter the vanishing gradient problem, which makes it tough to achieve convergence.
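
An illustrative NumPy sketch of the sigmoid and its derivative:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # expressed in terms of sigmoid itself

    x = np.array([-5.0, 0.0, 5.0])
    print(sigmoid(x))                 # values between 0 and 1
    print(sigmoid_derivative(x))      # peaks at 0.25 when x = 0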

Advantages of Sigmoid Functions:

  1. Easy to interpret, as the output is a clear value between 0 and 1.
  2. The derivative of the sigmoid is expressed in terms of the sigmoid itself, so it is easy to compute.
  3. Normalizes the output between 0 and 1, so it can be used as a probability whenever we have a binary classification problem.

Disadvantages of Sigmoid Functions:

  1. The output is not zero-centred.
  2. As it is an exponential function, it can be computationally expensive and time consuming.
  3. It is very prone to the vanishing gradient problem, especially in deep neural networks.

2. Tanh

Tanh is the hyperbolic tangent function. It is very similar to the sigmoid but is symmetric around the origin, which means it is a zero-centred function.

tanh and derivative of tanh

As you can see, tanh is very similar to the sigmoid, but it ranges from -1 to 1. Hence, once any input passes through the tanh function, the output can be positive or negative.

The derivative of tanh ranges from 0 to 1 (0 ≤ tanh'(x) ≤ 1).

The derivative of the tanh function can also be expressed in terms of tanh itself: tanh'(x) = 1 - tanh²(x).

Because tanh is a zero-centred function, it is generally preferred over the sigmoid in hidden layers; the sigmoid is mostly used in the output layer for binary classification.
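
A minimal NumPy sketch of tanh and its derivative:

    import numpy as np

    def tanh(x):
        return np.tanh(x)

    def tanh_derivative(x):
        return 1.0 - np.tanh(x) ** 2   # expressed in terms of tanh itself

    x = np.array([-2.0, 0.0, 2.0])
    print(tanh(x))              # values between -1 and 1, zero-centred
    print(tanh_derivative(x))   # peaks at 1 when x = 0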

Advantages of Tanh Functions:

  1. Easy to interpret, as the output is a clear value between -1 and 1.
  2. The derivative of tanh is expressed in terms of tanh itself, so it is easy to compute.
  3. It is a zero-centred function, so it takes negative values into account.

Disadvantages of Tanh Functions:

  1. As it is an exponential function, it can be computationally expensive and time consuming.
  2. The vanishing gradient problem is still encountered with this function.

3. ReLU

ReLU stands for rectified linear unit and is the most popular activation function to date.

It clips all negative values to zero, and for all values greater than zero it returns the value itself, i.e. f(x) = max(0, x). Since the function is not smooth at zero, it is not differentiable there (not a very big problem in practice); the derivative is defined everywhere else.

Since the derivative of ReLU can only be either 0 or 1, the derivative is zero for all negative inputs, and hence we have the problem of dead activations, where neurons are never activated and stop learning.
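
An illustrative NumPy sketch of ReLU and its derivative (treating the derivative at 0 as 0, a common convention):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_derivative(x):
        return (x > 0).astype(float)   # 1 for positive inputs, 0 otherwise

    x = np.array([-3.0, -0.5, 0.0, 2.0])
    print(relu(x))             # [0. 0. 0. 2.]
    print(relu_derivative(x))  # [0. 0. 0. 1.] -> zero gradient for negative inputs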

Advantages of ReLU Functions:

  1. Simple to understand and interpret, as one can treat it like an if-else condition on negative and positive values.
  2. It is computationally fast and inexpensive.
  3. It handles the vanishing and exploding gradient problems better than sigmoid and tanh, hence convergence is very fast.

Disadvantages of ReLU Functions:

  1. The gradient is exactly zero whenever a negative input is encountered, which can be a problem during backpropagation (dead activations).
  2. Not a zero-centred function, as the output is either zero or the input value itself.

4. Leaky ReLU

The Leaky ReLU function was intended to be a better version of ReLU, since it also handles negative values, but in practice ReLU is still often found to work better. Let's see how Leaky ReLU tries to overcome ReLU's problem.

For all negative inputs the output is not zero; instead, for every negative x, the output is a small fraction of the input such as 0.1x (0.01x is also a common choice), and for positive inputs the output is the value itself.

Leaky ReLU has this advantage over ReLU: it solves the problem of dead activations, since the gradient for negative values is non-zero. In the real world, however, plain ReLU is still found to be more powerful and more widely used than Leaky ReLU.
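
A minimal NumPy sketch of Leaky ReLU (the slope of 0.01 for negative inputs is a common default, used here as an assumption):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        return np.where(x > 0, x, alpha * x)

    def leaky_relu_derivative(x, alpha=0.01):
        return np.where(x > 0, 1.0, alpha)   # non-zero gradient for negative inputs

    x = np.array([-3.0, 0.0, 2.0])
    print(leaky_relu(x))             # [-0.03  0.    2.  ]
    print(leaky_relu_derivative(x))  # [0.01 0.01 1.  ]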

5. Parameterised ReLU

This is another variant of ReLU, aimed at solving the problem of ReLU's dead activations for negative values.

Here, for all the negative values, we introduce a new parameter 'a' which typically lies between 0 and 1 and which is trainable. Because the slope is learned, convergence can be faster compared with Leaky ReLU.
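
A sketch of the PReLU forward pass and its gradient with respect to the learnable slope a (NumPy, purely illustrative; in practice frameworks such as PyTorch provide this directly, e.g. torch.nn.PReLU):

    import numpy as np

    def prelu(x, a):
        return np.where(x > 0, x, a * x)

    def prelu_grad_a(x, a):
        # Gradient of the output with respect to the learnable slope a:
        # zero for positive inputs, x itself for negative inputs.
        return np.where(x > 0, 0.0, x)

    x = np.array([-2.0, -0.5, 1.5])
    a = 0.25                     # current value of the learnable slope
    print(prelu(x, a))           # [-0.5   -0.125  1.5  ]
    print(prelu_grad_a(x, a))    # [-2.  -0.5  0. ]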

6. ELU (Exponential Linear Unit)

This was also aimed at solving the problems of ReLU, and in theory (and in the papers that introduced it) it has advantages over ReLU.

It can be considered a (nearly) zero-centred function and it also solves the dead ReLU issue, but it is computationally very expensive, and in real-world problems it is often not found to be as efficient as ReLU.
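
An illustrative NumPy sketch of ELU, using the usual definition ELU(x) = x for x > 0 and alpha * (e^x - 1) otherwise, with alpha = 1.0 assumed as the default:

    import numpy as np

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def elu_derivative(x, alpha=1.0):
        return np.where(x > 0, 1.0, alpha * np.exp(x))   # smooth, non-zero for negatives

    x = np.array([-2.0, 0.0, 3.0])
    print(elu(x))             # negative inputs saturate towards -alpha
    print(elu_derivative(x))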

7. Softmax

For any vector of length K, softmax compresses it into a real vector of the same length K, with each value in the range (0, 1) and the elements of the vector summing to 1: softmax(z)_i = e^(z_i) / Σ_j e^(z_j).

The softmax function is used for multi-class classification: it returns the probability of a data point belonging to each specific class, using the expression shared above.
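
A numerically stable NumPy sketch of softmax (subtracting the maximum before exponentiating is a common trick to avoid overflow):

    import numpy as np

    def softmax(z):
        shifted = z - np.max(z)       # improves numerical stability
        exps = np.exp(shifted)
        return exps / np.sum(exps)

    logits = np.array([2.0, 1.0, 0.1])
    probs = softmax(logits)
    print(probs)              # values in (0, 1)
    print(probs.sum())        # 1.0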

8. Swish (A Self-Gated Function)

y = x * sigmoid(x)

Although Swish is less popular, in deeper models it is often found to perform better than ReLU.

Swish's design was inspired by the use of the sigmoid function for gating in LSTMs and highway networks. Swish uses the same value as both the input and the gate, which simplifies the gating mechanism; this is called self-gating.

The advantage of self-gating is that it only requires a single scalar input, while normal gating requires multiple scalar inputs. This feature enables self-gated activation functions such as Swish to easily replace activation functions that take a single scalar as input (such as ReLU) without changing the hidden capacity or the number of parameters.
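
A minimal NumPy sketch of Swish, following the expression y = x * sigmoid(x) given above:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def swish(x):
        return x * sigmoid(x)    # the input gates itself ("self-gating")

    x = np.array([-4.0, -1.0, 0.0, 2.0])
    print(swish(x))              # smooth, slightly negative for negative inputs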

Thanks for Reading.

If you found this post useful, some claps 👏 would be great extra motivation. Feel free to ask questions and share your suggestions.

Reach me at :

LinkedIn : https://www.linkedin.com/in/saurabh-mishra-553ab188/

Github : https://github.com/SaurabhMishra779
