Voting can be utilized in order to combine different clusterings of the same dataset. It is similar to voting for supervised learning, as each model (base learner) contributes to the final result with a vote. Here arises a problem of linking two clusters originating from two different clusterings. As each model will produce different clusters with different centers, we have to link similar clusters originating from different models. This is accomplished by linking together clusters that share the greatest number of instances. For example, assume that the following table and figure clusterings have occurred for a particular dataset:
The following table depicts each instance's cluster assignments for the three different clusterings.
Instance |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
Clustering 1 |
0 |
0 |
2 |
2 |
2 |
0 |
0 |
1 |
0 |
2 |
Clustering 2 |
1 |
1 |
2 |
2 |
2 |
1 |
0 |
1 |
1 |
2 |
Clustering 3 |
0 |
0 |
2 |
2 |
2 |
1 |
0 |
1 |
1 |
2 |
Using the preceding mapping, we can calculate the co-association matrix for each instance. This matrix indicates how many times a pair of instances has been assigned to the same cluster:
Instances |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
1 |
3 |
3 |
0 |
0 |
0 |
2 |
2 |
1 |
2 |
0 |
2 |
3 |
3 |
0 |
0 |
0 |
2 |
2 |
1 |
2 |
0 |
3 |
0 |
0 |
3 |
3 |
3 |
0 |
0 |
0 |
0 |
3 |
4 |
0 |
0 |
3 |
3 |
3 |
0 |
0 |
0 |
0 |
3 |
5 |
0 |
0 |
3 |
3 |
3 |
0 |
0 |
0 |
0 |
3 |
6 |
2 |
2 |
0 |
0 |
0 |
3 |
1 |
0 |
3 |
0 |
7 |
2 |
2 |
0 |
0 |
0 |
1 |
3 |
0 |
1 |
0 |
8 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
3 |
2 |
0 |
9 |
2 |
2 |
0 |
0 |
0 |
3 |
1 |
2 |
3 |
0 |
10 |
0 |
0 |
3 |
3 |
3 |
0 |
0 |
0 |
0 |
3 |
By dividing each element with the number of base learners in the ensemble, and clustering together samples that have a value greater than 0.5, we get the following cluster assignments:
Instance |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
Voting clustering |
0 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
1 |
As it is evident, the clustering is more stable. Furthermore, it is apparent that two clusters are sufficient for this dataset. By plotting the data and their cluster membership, we can see that there are two distinct groups, which is exactly what the voting ensemble was able to model, although each base learner generated three distinct cluster centers: