What are activation functions?

Abidi Ghofrane
7 min read · Jan 7, 2021

All about activation functions and their purpose: linear, binary step, sigmoid, tanh, ReLU, softmax… What are the differences, and which should you use?

In order for us to really understand activation functions, let’s have a quick recap on how a Neural Network works.

What are Neural Networks?

When our brains read something, we use our visual cortex, containing up to 40 million neurons with billions of connections between them. We do it unconsciously, without realizing how tough a job it actually is. But when we try to write a program that does it algorithmically, it turns out to be not that simple.

A neural network is a collection or group of neurons organized into layers. These neurons are the core processing units of the network.

How does a neural network work?

First, we have the input layer, which receives the input, and the output layer, which produces the final prediction. In between, we have hidden layers, which perform most of the computation required by our network.

Neurons of one layer are connected to neurons of the next layer through channels. Each of these channels is assigned a numerical value known as a weight. The weight controls how much influence an input has on the output: a small weight means a change in that input barely affects the output, while a larger weight changes the output more significantly. The inputs are multiplied by their corresponding weights, and their sum is sent as input to the neurons in the hidden layer. Each of these neurons is also associated with a numerical value called the bias, an offset added to the weighted sum that shifts the neuron’s output, helping close the gap between the function’s output and its intended output.

z = w1*x1 + w2*x2 + … + wn*xn + b

This value z is passed through a threshold function called the activation function. The result of the activation function determines whether this particular neuron gets activated or not. What does this mean? An activated neuron transmits its value to the neurons of the next layer over the channels; in this way data is propagated through the network, which is called forward propagation. In the output layer, the neuron with the highest value determines the output. When the output values are probabilities, they range between 0 and 1.

Simply put, a neuron calculates a weighted sum of its inputs, adds a bias, and then passes the value to an activation function that decides whether the neurons in the next layer should consider this neuron as “fired” (activated) or not; those neurons then do the same thing. We keep repeating this process until we reach the last layer. The final output value is the prediction, and we compute the difference between it and the expected output, which is the label.

Then we use that error value to compute the partial derivatives of the error with respect to the weights in each layer. We update the weights with these values and repeat the process until the error is as small as possible.
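To make the forward pass concrete, here is a minimal NumPy sketch of a single neuron: a weighted sum of the inputs plus a bias, passed through an activation function (sigmoid is used here purely as an illustration, and the input, weight, and bias values are arbitrary examples).

import numpy as np

# One neuron's forward pass: z = w1*x1 + w2*x2 + ... + wn*xn + b
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # example inputs
w = np.array([0.4, 0.7, -0.2])   # example weights (one per channel)
b = 0.1                          # example bias

z = np.dot(w, x) + b             # weighted sum plus bias (z is about -1.14 here)
a = sigmoid(z)                   # activation decides how strongly the neuron "fires" (about 0.24)
print(z, a)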

An additional aspect of activation functions is that they must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. There are multiple types of activation functions:

Binary step activation function

It is very basic, and it comes to mind whenever we try to bound the output: if the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise the neuron is not activated.

f(x) = 1 if x >= 0 else 0

Binary step function

The problem with binary step functions is that you can only have two possible outputs, 1 or 0: they do not allow multi-value outputs.
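A minimal NumPy sketch of the binary step function, using 0 as the threshold as in the formula above:

import numpy as np

# Binary step: 1 when the input reaches the threshold (0 here), otherwise 0.
def binary_step(x):
    return np.where(x >= 0, 1, 0)

print(binary_step(np.array([-2.0, 0.0, 3.5])))  # [0 1 1]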

Linear activation function

It is also called the identity function when a = 1 and b = 0. A linear function takes the form

f(x)=a*x + b

Linear function with b = 0

The problem with linear functions is that gradient descent in the backpropagation phase, which uses the derivative of the activation function, cannot do much with them: the derivative of a linear function is a constant with no relation to the input, so it is not good for learning. Also, all layers will collapse into one: no matter how many layers the neural network has, the last layer will be a linear function of the first layer (because a linear combination of linear functions is still a linear function).

PS: You can still use linear activation functions in a linear regression model.
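A minimal sketch of a linear activation in NumPy; the slope a and intercept b are arbitrary example parameters:

import numpy as np

# Linear activation: f(x) = a*x + b. Its derivative is the constant a,
# which is why gradient descent cannot learn anything useful from it.
def linear(x, a=1.0, b=0.0):
    return a * x + b

print(linear(np.array([-2.0, 0.0, 3.5])))  # [-2.   0.   3.5]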

Non-linear activation functions

Generally, non-linear activation functions are used, since they allow backpropagation: their derivative is not constant and depends on the input.

Sigmoid activation function

The sigmoid activation function is very simple. It takes any real value as input and outputs a value that’s always between 0 and 1.

It is non-linear, monotonic, continuously differentiable, and has a (0, 1) output range. Its main advantage is that it is simple and works well for classifiers. The big disadvantage is that it gives rise to the problem of “vanishing gradients”: for very high or very low values of x, there is almost no change in the output, so the gradient becomes vanishingly small. This can result in the network refusing to learn further, or being too slow to reach an accurate prediction.

The second problem is that its output isn’t zero-centered: it ranges from 0 to 1, so the values coming out of the function are always positive. That makes the gradients of the weights in a layer all positive or all negative, so the gradient updates zig-zag in different directions, which makes optimization harder.
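Here is a small NumPy sketch of the sigmoid and its derivative; the near-zero derivative for large |x| is exactly the vanishing-gradient issue described above:

import numpy as np

# Sigmoid squashes any real value into (0, 1).
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Its derivative, sigmoid(x) * (1 - sigmoid(x)), is close to zero for large |x|.
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

xs = np.array([-10.0, 0.0, 10.0])
print(sigmoid(xs))             # roughly [0.00005, 0.5, 0.99995]
print(sigmoid_derivative(xs))  # roughly [0.00005, 0.25, 0.00005]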

TanH / Hyperbolic Tangent activation function

Tanh is a rescaled logistic sigmoid function. It squashes a real number to the range [-1, 1] instead of [0, 1], so its output is zero-centered, which makes optimization easier.

Tanh activation function

So in practice it is close to the sigmoid function, and just like the sigmoid, the problem of the vanishing gradient persists.
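NumPy ships tanh directly, so a quick sketch only needs a call to np.tanh; note how large inputs are squashed toward the ends of [-1, 1]:

import numpy as np

# Tanh squashes a real value into (-1, 1) and is zero-centered.
print(np.tanh(np.array([-10.0, 0.0, 10.0])))  # roughly [-1.  0.  1.]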

ReLU (Rectified Linear Unit) Activation function

This activation function has become very popular in the last few years. It is simply f(x) = max(0, x). It is non-linear and provides similar benefits to the sigmoid, but with better performance.

ReLU function

Its main advantage is that it allows the network to converge very quickly, and it is less expensive to compute than sigmoid and tanh. But one problem ReLU sometimes has is that some units can be fragile during training and “die”: a large gradient flowing through a neuron can cause a weight update that makes the neuron never activate on any data point again, so the gradients flowing through it will always be 0 from that point on.
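A minimal NumPy sketch of ReLU; note that every negative input maps to exactly 0, which is where the dying-ReLU issue comes from:

import numpy as np

# ReLU: f(x) = max(0, x). Cheap to compute, but outputs (and gradients)
# are exactly 0 for all negative inputs.
def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]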

So a variant of the ReLU function was introduced:

Leaky ReLU activation function

This variation of ReLU has a small positive slope in the negative region, so it still enables backpropagation even for negative input values, which prevents the dying ReLU problem.

Leaky ReLU function
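A minimal NumPy sketch of leaky ReLU; the slope for negative inputs (alpha, set to 0.01 here as a common example value) keeps the gradient non-zero:

import numpy as np

# Leaky ReLU: x for positive inputs, alpha * x for negative inputs.
def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.5])))  # [-0.02  0.    3.5 ]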

Softmax Activation function

Unlike the functions above, softmax works on a whole vector of values at once: it converts the raw output scores of the network into probabilities between 0 and 1 that sum to 1, which makes it the natural choice for multi-class classification.

Softmax function

Here are the main properties of the softmax function, followed by a small sketch:

  • Able to handle multiple classes, where most other activation functions handle only one class — it normalizes the output for each class to a value between 0 and 1 and divides by their sum, giving the probability of the input belonging to each class.
  • Useful for output neurons — typically softmax is used only in the output layer, for neural networks that need to classify inputs into multiple categories.
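A minimal NumPy sketch of softmax over three example class scores; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result:

import numpy as np

# Softmax turns raw scores into probabilities that sum to 1.
def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # example raw outputs for 3 classes
print(softmax(scores))              # roughly [0.659, 0.242, 0.099]
print(softmax(scores).sum())        # 1.0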

What activation function should I use?

It depends on the problem; the goal is easy and quick convergence of the network.

  • Sigmoid functions generally work well for classifiers.
  • The ReLU function is a good general-purpose activation function and is used in most cases these days.
  • If dead neurons appear in the network, the leaky ReLU function is the better choice.
  • The ReLU function should only be used in the hidden layers.
  • A good rule of thumb is to begin with the ReLU function and move to other activation functions if ReLU doesn’t give the best results.



Abidi Ghofrane

Software engineering student at Holberton School Tunis