Graph closure

Graph closure creates a graph from the co-occurrence matrix. Every instance is treated as a node, and every element of the matrix (an instance pair) is a candidate edge: pairs that have a higher value than the threshold are connected by an edge. Following this, cliques of a specified size (the number of nodes in the clique) are formed. A clique is a subset of the graph's nodes in which every two nodes are neighbors. Finally, the cliques are combined to form unique clusters.

In OpenEnsembles, graph closure is implemented by the finish_graph_closure function of the cluster class. The clique_size parameter determines the number of nodes in each clique. The threshold parameter determines the minimum co-occurrence that a pair must have in order to be connected by an edge in the graph.

Similar to the previous example, we will use graph closure to cluster the breast cancer dataset. Notice that the only change in the code is the use of finish_graph_closure instead of finish_majority_vote. First, we load the libraries and the dataset, and create the OpenEnsembles data object:

# --- SECTION 1 ---
# Libraries and data loading
import openensembles as oe
import pandas as pd
import sklearn.metrics

from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# --- SECTION 2 ---
# Create the data object
cluster_data = oe.data(pd.DataFrame(bc.data), bc.feature_names)

Then, we create the ensemble and use graph_closure in order to combine the cluster results. Notice that the dictionary key also changes to 'graph_closure':

# --- SECTION 3 ---
# Create the ensembles and calculate the homogeneity score
for K in [2, 3, 4, 5, 6, 7]:
    for ensemble_size in [3, 4, 5]:
        ensemble = oe.cluster(cluster_data)
        # Add ensemble_size K-means runs to the ensemble
        for i in range(ensemble_size):
            name = f'kmeans_{ensemble_size}_{i}'
            ensemble.cluster('parent', 'kmeans', name, K)

        # Combine the base clusterings with graph closure
        preds = ensemble.finish_graph_closure(threshold=0.5)
        print(f'K: {K}, size {ensemble_size}:', end=' ')
        print('%.2f' % sklearn.metrics.homogeneity_score(
            bc.target, preds.labels['graph_closure']))
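Homogeneity, the score computed above with sklearn.metrics.homogeneity_score, measures whether each cluster contains members of a single class only. As a rough illustration of what the number means, the following sketch recomputes it from its standard definition, h = 1 - H(C|K) / H(C), in plain Python. The homogeneity helper is a hypothetical name used only here; it is not part of scikit-learn or OpenEnsembles, and the toy labels are made up for the example.

```python
from collections import Counter
from math import log


def homogeneity(classes, clusters):
    """h = 1 - H(C|K) / H(C), where C are true classes and K are clusters."""
    n = len(classes)
    class_counts = Counter(classes)
    cluster_counts = Counter(clusters)
    joint_counts = Counter(zip(classes, clusters))

    # Entropy of the true classes, H(C)
    h_c = -sum(c / n * log(c / n) for c in class_counts.values())
    if h_c == 0:
        return 1.0  # a single class is trivially homogeneous

    # Conditional entropy of the classes given the clusters, H(C|K)
    h_c_given_k = -sum(
        count / n * log(count / cluster_counts[k])
        for (_, k), count in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c


truth = [0, 0, 1, 1]
print(homogeneity(truth, [0, 0, 1, 1]))  # clusters match the classes
print(homogeneity(truth, [0, 1, 2, 3]))  # every instance in its own cluster
print(homogeneity(truth, [0, 0, 0, 0]))  # one cluster mixing both classes
```

The first two labelings are perfectly homogeneous, while the single mixed cluster scores zero. Since splitting clusters further cannot lower this score, it helps to keep K in mind when comparing values.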

The effect of K and the ensemble size on the clustering quality is similar to majority voting, although it does not achieve the same level of performance. The results are depicted in the following table:

K    Size    Homogeneity
2    3       0.42
2    4       0.42
2    5       0.42
3    3       0.47
3    4       0
3    5       0.47
4    3       0.58
4    4       0.58
4    5       0.58
5    3       0.6
5    4       0.5
5    5       0.5
6    3       0.6
6    4       0.03
6    5       0.62
7    3       0.63
7    4       0.27
7    5       0.27

Homogeneity for graph closure clustering on the raw breast cancer data
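To make the steps described at the start of this section concrete, the following is a minimal pure-Python sketch of graph closure on a toy ensemble. It is not the OpenEnsembles implementation: the helper names co_occurrence and graph_closure are hypothetical, the clique search is brute force, and the rule that cliques sharing at least one node are merged into one cluster is a simplifying assumption made for illustration.

```python
from itertools import combinations


def co_occurrence(labelings):
    """Fraction of base clusterings that place each pair in the same cluster."""
    n = len(labelings[0])
    return [
        [sum(run[i] == run[j] for run in labelings) / len(labelings)
         for j in range(n)]
        for i in range(n)
    ]


def graph_closure(co_matrix, threshold=0.5, clique_size=3):
    """Illustrative only: threshold the matrix, find cliques, merge them."""
    n = len(co_matrix)

    # Brute-force search for groups of clique_size nodes in which every
    # pair has a co-occurrence above the threshold (that is, an edge)
    cliques = [
        nodes for nodes in combinations(range(n), clique_size)
        if all(co_matrix[a][b] > threshold for a, b in combinations(nodes, 2))
    ]

    # Assumption: cliques that share a node belong to the same cluster.
    # A small union-find structure performs the merging.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    in_clique = set()
    for nodes in cliques:
        in_clique.update(nodes)
        for other in nodes[1:]:
            parent[find(other)] = find(nodes[0])

    # Number the merged groups 0, 1, ...; nodes in no clique are labeled -1
    labels = [-1] * n
    roots = {}
    for node in sorted(in_clique):
        labels[node] = roots.setdefault(find(node), len(roots))
    return labels


# Three made-up base clusterings of seven instances
runs = [
    [0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
]
labels = graph_closure(co_occurrence(runs), threshold=0.5, clique_size=3)
print(labels)
```

In this toy ensemble, the last instance is grouped with the first three in two of the three runs, so its co-occurrence with them (2/3) exceeds the threshold and it joins their cluster. Raising threshold above 2/3 would remove those edges and leave it outside every clique.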