Using OpenEnsembles

OpenEnsembles is a Python library dedicated to ensemble methods for clustering. In this section, we will present its usage and employ it to cluster some of our example datasets. To install the library, the pip install openensembles command must be executed in the terminal. Although it leverages scikit-learn, its interface is different. One major difference is that data must be passed in the form of the data class implemented by OpenEnsembles. The constructor has two input parameters: a pandas DataFrame containing the data and a list containing the feature names:

# --- SECTION 1 ---
# Libraries and data loading
import openensembles as oe
import pandas as pd
import sklearn.metrics

from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# --- SECTION 2 ---
# Create the data object
cluster_data = oe.data(pd.DataFrame(bc.data), bc.feature_names)

In order to create a cluster ensemble, a cluster class object must be created, with the data object passed as its parameter:

ensemble = oe.cluster(cluster_data)

In this example, we will calculate the homogeneity score for a number of K values and ensemble sizes. In order to add a base learner to the ensemble, the cluster method of the cluster class must be called. The method accepts the following arguments: source_name, which denotes the name of the source data matrix; algorithm, which dictates the algorithm the base learner will utilize; output_name, which will be the dictionary key for accessing the results of the specific base learner; and K, the number of clusters for the specific base learner. Finally, in order to compute the final cluster memberships through majority voting, the finish_majority_vote method must be called. The only parameter that must be specified is the threshold value:

# --- SECTION 3 ---
# Create the ensembles and calculate the homogeneity score
for K in [2, 3, 4, 5, 6, 7]:
    for ensemble_size in [3, 4, 5]:
        ensemble = oe.cluster(cluster_data)
        for i in range(ensemble_size):
            name = f'kmeans_{ensemble_size}_{i}'
            ensemble.cluster('parent', 'kmeans', name, K)

        preds = ensemble.finish_majority_vote(threshold=0.5)
        print(f'K: {K}, size {ensemble_size}:', end=' ')
        print('%.2f' % sklearn.metrics.homogeneity_score(
            bc.target, preds.labels['majority_vote']))
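
Since output_name serves as the dictionary key under which each base learner's results are stored, the assignments of an individual base learner can also be inspected directly. The following is a minimal sketch, not part of the original listing, assuming the loop above has just completed (so name still holds the key of the last k-means learner that was added):

# Hypothetical example: inspect a single base learner's cluster
# assignments through its output_name key (here, the last learner
# added in the loop above)
single_labels = ensemble.labels[name]
print(single_labels[:10])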

Five clusters consistently produce the best results across all three ensemble sizes, with homogeneity scores of around 0.6, while larger values of K yield unstable scores. The results are summarized in the following table:

K    Size    Homogeneity
2    3       0.42
2    4       0.42
2    5       0.42
3    3       0.45
3    4       0.47
3    5       0.47
4    3       0.58
4    4       0.58
4    5       0.58
5    3       0.60
5    4       0.61
5    5       0.60
6    3       0.35
6    4       0.47
6    5       0.35
7    3       0.27
7    4       0.63
7    5       0.37

OpenEnsembles majority vote cluster homogeneity for the breast cancer dataset

If we embed the data into two dimensions with t-SNE and repeat the experiment, we get the homogeneity scores shown in the table below.
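
The listing for this transformation is not reproduced here; the snippet below is a minimal sketch that assumes scikit-learn's TSNE (with default parameters) is used to embed the data before a new data object is constructed. The tsne_embedding and tsne_data names are illustrative:

# --- Sketch: t-SNE embedding ---
# Embed the data in two dimensions and rebuild the OpenEnsembles
# data object (TSNE parameters are left at their defaults)
from sklearn.manifold import TSNE

tsne_embedding = TSNE(n_components=2).fit_transform(bc.data)
tsne_data = oe.data(pd.DataFrame(tsne_embedding), [0, 1])

# The loop from Section 3 is then repeated with tsne_data in place
# of cluster_data, for example:
ensemble = oe.cluster(tsne_data)
for i in range(3):
    ensemble.cluster('parent', 'kmeans', f'kmeans_3_{i}', 5)

preds = ensemble.finish_majority_vote(threshold=0.5)
print('%.2f' % sklearn.metrics.homogeneity_score(
    bc.target, preds.labels['majority_vote']))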

K    Size    Homogeneity
2    3       0.42
2    4       0.42
2    5       0.42
3    3       0.59
3    4       0.59
3    5       0.59
4    3       0.61
4    4       0.61
4    5       0.61
5    3       0.61
5    4       0.61
5    5       0.61
6    3       0.65
6    4       0.65
6    5       0.65
7    3       0.66
7    4       0.66
7    5       0.66

Majority vote cluster homogeneity for the transformed breast cancer dataset