Soft voting takes into account the predicted probability of each class. To combine the predictions, it averages the probability that the base learners assign to each class and selects the class with the highest average probability. In the simple case of three base learners and two classes, we take the predicted probability for each class and average it across the three learners.
Using our previous example, and by taking the average of each column for Table 1, we can expand it, adding a row for the average probability.
The following table shows the predicted probabilities for each class by each learner, as well as the average probability:
| | Class A | Class B | Class C |
|---|---|---|---|
| Learner 1 | 0.5 | 0.3 | 0.2 |
| Learner 2 | 0 | 0.48 | 0.52 |
| Learner 3 | 0.4 | 0.3 | 0.3 |
| Average | 0.3 | 0.36 | 0.34 |
As we can see, class A has an average probability of 0.3, class B has an average probability of 0.36, and class C has an average probability of 0.34, making class B the winner. Note that class B is not selected by any base learner as the predicted class, but by combining the predicted probabilities, class B arises as the best compromise between the predictions.
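The averaging step can be reproduced in a few lines of plain Python. This is only a minimal sketch: the function names `soft_vote` and `hard_votes` are made up for illustration, and the probabilities are the ones from the table above.

```python
# Predicted class probabilities from the table:
# one row per learner, columns ordered as classes A, B, C.
classes = ["A", "B", "C"]
probabilities = [
    [0.5, 0.3, 0.2],    # Learner 1
    [0.0, 0.48, 0.52],  # Learner 2
    [0.4, 0.3, 0.3],    # Learner 3
]


def soft_vote(probabilities):
    """Return the per-class average probability across all learners."""
    n_learners = len(probabilities)
    return [sum(column) / n_learners for column in zip(*probabilities)]


def hard_votes(probabilities):
    """Return the index of each learner's most probable class."""
    return [row.index(max(row)) for row in probabilities]


averages = soft_vote(probabilities)
soft_winner = classes[averages.index(max(averages))]
individual_picks = [classes[i] for i in hard_votes(probabilities)]

print([round(a, 2) for a in averages])            # [0.3, 0.36, 0.34]
print("Soft voting winner:", soft_winner)         # B
print("Individual predictions:", individual_picks)  # ['A', 'C', 'A']
```

The individual predictions are A, C, and A, so a hard vote would select class A, while the averaged probabilities favor class B.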
In order for soft voting to be more effective than hard voting, the base classifiers must produce good estimates regarding the probability that a sample belongs to a specific class. If the probabilities are meaningless (for example, if they are always 100% for one class and 0% for all others), soft voting could be even worse than hard voting.
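A small sketch can illustrate why such degenerate probabilities carry no extra information. The learners and their outputs below are hypothetical, chosen only for illustration: each one always assigns 100% to a single class. Under that assumption, the averaged probabilities are exactly the share of hard votes each class receives.

```python
from collections import Counter

classes = ["A", "B", "C"]

# Hypothetical learners that always output 100% for one class
# and 0% for all others (illustrative values, not from the text).
one_hot = [
    [1.0, 0.0, 0.0],  # Learner 1 predicts A
    [0.0, 0.0, 1.0],  # Learner 2 predicts C
    [1.0, 0.0, 0.0],  # Learner 3 predicts A
]

n_learners = len(one_hot)

# Soft voting: average each class column across learners.
averages = [sum(column) / n_learners for column in zip(*one_hot)]

# Hard voting: count each learner's top class, then convert to shares.
picks = [classes[row.index(max(row))] for row in one_hot]
counts = Counter(picks)
vote_shares = [counts[c] / n_learners for c in classes]

print(averages)
print(vote_shares)
```

Both lists are identical here, so in this setting the averaging step only restates the hard vote counts.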
For more on the impossibility theorem, see Arrow, K.J., 1950. A difficulty in the concept of social welfare. Journal of Political Economy, 58(4), pp. 328-346.