Co-occurrence matrix linkage treats the co-occurrence matrix as a matrix of pairwise distances between instances and uses those distances to perform hierarchical clustering. Clustering stops when no element of the matrix has a value greater than the threshold. We repeat the earlier example once more, this time calling the finish_co_occ_linkage function with threshold=0.5 and reading the results through the 'co_occ_linkage' key:
# --- SECTION 1 ---
# Libraries and data loading
import openensembles as oe
import pandas as pd
import sklearn.metrics
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
# --- SECTION 2 ---
# Create the data object
cluster_data = oe.data(pd.DataFrame(bc.data), bc.feature_names)
# --- SECTION 3 ---
# Create the ensembles and calculate the homogeneity score
for K in [2, 3, 4, 5, 6, 7]:
    for ensemble_size in [3, 4, 5]:
        ensemble = oe.cluster(cluster_data)
        for i in range(ensemble_size):
            name = f'kmeans_{ensemble_size}_{i}'
            ensemble.cluster('parent', 'kmeans', name, K)
        preds = ensemble.finish_co_occ_linkage(threshold=0.5)
        print(f'K: {K}, size {ensemble_size}:', end=' ')
        print('%.2f' % sklearn.metrics.homogeneity_score(
            bc.target, preds.labels['co_occ_linkage']))
The following table summarizes the results. Notice that this method outperforms the other two. Its results are also more stable, and it takes less time to execute than either of them:
K | Size | Homogeneity |
2 | 3 | 0.42 |
2 | 4 | 0.42 |
2 | 5 | 0.42 |
3 | 3 | 0.47 |
3 | 4 | 0.47 |
3 | 5 | 0.45 |
4 | 3 | 0.58 |
4 | 4 | 0.58 |
4 | 5 | 0.58 |
5 | 3 | 0.60 |
5 | 4 | 0.60 |
5 | 5 | 0.60 |
6 | 3 | 0.59 |
6 | 4 | 0.62 |
6 | 5 | 0.62 |
7 | 3 | 0.62 |
7 | 4 | 0.63 |
7 | 5 | 0.63 |
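To make the linkage step more concrete, the following is a minimal sketch of the idea using NumPy and SciPy. It is not the OpenEnsembles implementation. The function names co_occurrence_matrix and co_occ_linkage_sketch are hypothetical, and both the use of average linkage and the conversion of co-occurrence frequencies into distances as 1 - frequency are assumptions made for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def co_occurrence_matrix(labelings):
    """Fraction of base clusterings that place each pair of instances together."""
    labelings = np.asarray(labelings)
    n_clusterings, n_instances = labelings.shape
    co_occ = np.zeros((n_instances, n_instances))
    for labels in labelings:
        # True where two instances received the same label in this clustering
        co_occ += labels[:, None] == labels[None, :]
    return co_occ / n_clusterings


def co_occ_linkage_sketch(labelings, threshold=0.5, method='average'):
    """Hypothetical helper: hierarchical clustering on 1 - co-occurrence."""
    co_occ = co_occurrence_matrix(labelings)
    # Assumption: frequent co-occurrence means small distance
    distances = 1.0 - co_occ
    np.fill_diagonal(distances, 0.0)
    condensed = squareform(distances, checks=False)
    tree = linkage(condensed, method=method)
    # Cut the tree where merges would join clusters farther apart than 1 - threshold
    return fcluster(tree, t=1.0 - threshold, criterion='distance')


# Three toy base clusterings of six instances (made-up labels)
base_labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
consensus = co_occ_linkage_sketch(base_labelings, threshold=0.5)
print(consensus)
```

In this toy example, instances 0 to 2 end up sharing one consensus label and instances 3 to 5 another, even though the third base clustering disagrees about instance 2. The label numbering of the base clusterings does not matter, because only pairwise agreement enters the matrix.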