How does sigmoid activation work in multi-class classification problems?



I know that for a problem with multiple classes we usually use softmax, but can we also use sigmoid? I have tried implementing digit classification with sigmoid at the output layer, and it works. What I don't understand is how it works.

bharath chandra

Posted 2018-10-06T08:41:48.900

Reputation: 61



If your task is a classification in which the labels are mutually exclusive (each input has exactly one label), you have to use softmax. If the inputs of your classification task can have multiple labels each, your classes are not mutually exclusive and you can use a sigmoid for each output. In the former case, you should choose the output entry with the maximum value as the prediction. In the latter case, each class has an activation value coming from the last sigmoid; if an activation is greater than 0.5, you can say that class is present in the input.
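A minimal NumPy sketch of the two decision rules described above (the logits are illustrative values, not from a real network):

```python
import numpy as np

# Example raw scores from a network's last layer (illustrative values)
logits = np.array([1.2, -0.5, 3.1, 0.2])

# Mutually exclusive classes: softmax, then pick the single best entry
softmax = np.exp(logits) / np.exp(logits).sum()
predicted_class = int(np.argmax(softmax))

# Multi-label classes: sigmoid per output, then threshold each at 0.5
sigmoid = 1.0 / (1.0 + np.exp(-logits))
predicted_labels = sigmoid > 0.5   # each class decided independently
```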


Posted 2018-10-06T08:41:48.900

Reputation: 12 077

Yes sir, but my intention is to know how they work within the network. For example, consider a training example: using softmax I get an expected value of 3 when the actual output is 4, so the two can be compared and the weights adjusted. But when using sigmoid I always get outputs between 0 and 1, so how can I compare these with the actual output, which can be anything between 0 and 9? I am getting an accuracy of 98% when using sigmoid and 99% when using softmax, but I don't understand how sigmoid is working. – bharath chandra – 2018-10-06T23:22:07.333
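On the comparison this comment asks about: with ten output neurons, the true digit is usually one-hot encoded, so each sigmoid output is compared against its own 0/1 target rather than against the digit itself. A sketch with hypothetical output values:

```python
import numpy as np

label = 4                            # the true digit
target = np.zeros(10)
target[label] = 1.0                  # one-hot: [0,0,0,0,1,0,0,0,0,0]

outputs = np.full(10, 0.1)
outputs[4] = 0.9                     # hypothetical sigmoid outputs

# Each of the 10 outputs gets its own error signal against its own
# 0/1 target; the loss sums these, and weights are adjusted per neuron.
per_output_error = outputs - target
```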

I didn't understand. – Media – 2018-10-07T06:56:50.127

@Media He/she is asking why, even though the nature of her data is multi-class, employing sigmoid still works. – sariii – 2020-04-22T05:10:17.257

@sariii Maybe due to the nature of the problem at hand; not all tasks are complicated. It can also happen due to a lack of enough data. It all depends on the dispersion of the classes: if they are well separable in the space, the classes are very distinct. – Media – 2020-04-22T06:33:42.013


softmax() gives you a probability distribution, which means all outputs sum to 1, while sigmoid() only ensures that each neuron's output is between 0 and 1.

In the case of digit classification with sigmoid(), you will have 10 output neurons, each with a value between 0 and 1. You can then take the biggest of them and classify the input as that digit.
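Taking the biggest sigmoid output gives the same prediction as taking the biggest softmax output, because both functions are monotonically increasing in the raw score and so preserve the ordering of the logits. A quick check with illustrative logits:

```python
import numpy as np

logits = np.array([0.3, 2.0, -1.0, 0.9])   # illustrative raw scores

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Both activations preserve the ordering of the logits,
# so argmax picks the same class either way.
same_winner = (np.argmax(softmax) == np.argmax(sigmoid) == np.argmax(logits))
```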


Posted 2018-10-06T08:41:48.900

Reputation: 508

So what you are saying is that both work the same? Softmax calculates the probability of one neuron with respect to all the others and then returns the neuron that has the maximum probability, whereas sigmoid generates an output for each neuron independently and the neuron that has the maximum output is returned. Please correct me if I am wrong. – bharath chandra – 2018-10-07T03:00:24.833

Yes, both work the same way. Softmax is an extension of sigmoid to the multi-class classification problem: softmax in multiclass logistic regression with K=2 takes the form of the sigmoid function. – Preet – 2019-02-10T11:30:14.980
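The K=2 claim can be checked numerically: a two-class softmax over the logits [z, 0] reduces exactly to the sigmoid of z (the value of z here is arbitrary):

```python
import numpy as np

z = 1.7                              # an arbitrary logit

# Two-class softmax over [z, 0]
two_class = np.exp([z, 0.0]) / np.exp([z, 0.0]).sum()
sigmoid_z = 1.0 / (1.0 + np.exp(-z))

# softmax([z, 0])[0] = e^z / (e^z + 1) = 1 / (1 + e^-z) = sigmoid(z)
gap = abs(two_class[0] - sigmoid_z)
```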


@bharath chandra A softmax function will never give 3 as output; it always outputs real values between 0 and 1. A sigmoid function also gives outputs between 0 and 1. The difference is that with the former, the sum of all the outputs is always equal to 1 (due to the mutually exclusive nature of the classes), while with the latter the sum of all the outputs need not equal 1 (since each output is independent).
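A numerical illustration of that difference (the logits are illustrative):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 1.5])

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

softmax_total = softmax.sum()   # always exactly 1 (mutually exclusive)
sigmoid_total = sigmoid.sum()   # generally not 1 (independent outputs)
```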

PS Nayak

Posted 2018-10-06T08:41:48.900

Reputation: 143


For beginners: you may read this Quora answer, which explains the pros and cons of sigmoid activation and softmax probability. There are 6 answers at the time of writing, for inclusiveness: Sigmoid vs Softmax

Answer Highlights :

  • If you look at the softmax function, the sum of all softmax units is supposed to be 1. With sigmoid it's not really necessary.

  • In binary classification the sigmoid and softmax functions are the same, whereas in multi-class classification we use the softmax function.

  • If you’re using one-hot encoding, then I strongly recommend using softmax.

What I noticed (to the best of my knowledge): softmax is a probability distribution over the various possible classes (multi-class) in our sample space, and all classes must be predefined in advance, via one-hot encoding, before anything is passed to the softmax activation layer (for example, after tokenization and word stemming in NLP to homogenize the data).

For non-beginners: on the official Keras page, the softmax documentation is given as:


keras.activations.softmax(x, axis=-1)

Softmax activation function.

Arguments:

    x: Input tensor.
    axis: Integer, axis along which the softmax normalization is applied.

Returns:

    Tensor, output of softmax transformation.

Raises:

    ValueError: In case dim(x) == 1.

While for sigmoid it is given as:



keras.activations.sigmoid(x)

Sigmoid activation function.

Arguments:

    x: Input tensor.

Returns:

    The sigmoid activation: 1 / (1 + exp(-x)).
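The two formulas quoted from the docs can be mirrored in plain NumPy, including softmax's `axis` argument (this is a sketch of the math, not the Keras implementation itself):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability, then normalize along `axis`
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    # Element-wise 1 / (1 + exp(-x)), as in the quoted docs
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

batch = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])
probs = softmax(batch)   # each row (last axis) sums to 1
```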

nikhil swami

Posted 2018-10-06T08:41:48.900

Reputation: 109