Mohammad Shabbir Hasan; Zhong-Hui Duan Department of Computer Science, College of Arts and Sciences, University of Akron, Akron, USA
DNA microarray technology is a very useful tool to get information for various biological processes by providing a convenient way to investigate expression levels of thousands of genes in a collection of related samples. Moreover, as a discovery tool, DNA microarrays have the potential to enhance the understanding of the molecular basis of different diseases by simultaneously providing information on thousands of genes. Researchers using DNA microarray technology often find it interesting and meaningful to group genes with similar expression patterns together. k-means and hierarchical clustering are two widely used clustering algorithms used for this purpose. However, both of these algorithms suffer from some limitations. Although the k-means clustering algorithm is an efficient algorithm in producing tight clusters, it requires a prespecified number of clusters as input, and the performance of this clustering algorithm mostly depend on this specification. Hierarchical clustering, on the other hand, produces a hierarchy of clusters that is very informative; however, this advantage comes at the cost of low efficiency. This chapter describes a new clustering algorithm called hierarchical k-means that was used to overcome these limitations. In this algorithm, the number of clusters is determined from the hierarchy of clusters produced by the hierarchical clustering algorithm, and then this information is fed into k-means to produce the final clusters. After applying this proposed algorithm to microarray data of lung adenocarcinoma followed by the functional investigation of representative genes from the group of normal tissues and KRAS mutation tissues, we found that this proposed algorithm can group genes with similar expression patterns together. Therefore, in future studies, this proposed algorithm can definitely be used for clustering microarray data.
Gene products such as proteins or RNA are created from the inheritable information contained in a gene (Hunter and Holm, 1992). Traditional molecular biology focuses on studying individual genes in isolation for determining gene functions. However, it is not suitable for determining complex gene interactions or for explaining the nature of complex biological processes due to the large number of genes. For this purpose, examining the expression pattern of a large number of genes in parallel is required (Michaels et al., 1998). With the advancement of large-scale transcription profiling technology, DNA microarrays have become a useful tool that allows the analysis of the gene expression pattern at the genome level (Gresham et al., 2008). In genetic-mapping studies, DNA microarrays have been widely used on polymorphisms between parental genotypes and have facilitated the discovery of gene expression markers (Gresham et al., 2008; Wang et al., 2009). Due to its importance, efficient algorithms are necessary to analyze the DNA microarray data set accurately (Hasan, 2013). Studies have showed that a group of genes with similar gene expressions are likely to have related gene functions (Mount, 2004). Therefore, how to find the genes that share similar expression patterns across samples is an important question that is frequently asked in the DNA microarray studies (Qin et al., 2014).
Clustering, which is a useful technique to constitute unknown groupings of objects (Kaufman and Rousseeuw, 2009), has become an important part of gene expression data analysis (Qin et al., 2014; Eisen et al., 1998). By investigating the clusters of genes having similar expression patterns across samples, researchers can elucidate gene functions, genetic pathways, and regulatory circuits. Clustering helps to find a distinct pattern for each cluster, as well as more information about functional similarities and gene interactions within the cluster (Hasan and Duan, 2014). For clustering DNA microarray data, a good number of algorithms have been developed that include k-means (Tavazoie et al., 1999), hierarchical clustering (Eisen et al., 1998; Luo et al., 2003; Wen et al., 1998), self-organizing maps (Tamayo et al., 1999; Törönen et al., 1999; He et al., 2003), support vector machines (Brown et al., 2000), Bayesian networks (Friedman et al., 2000), and fuzzy logic approach (Woolf and Wang, 2000). In addition to these algorithms, there are others that use genomic information, along with gene expression data, to improve clustering efficiency. Algorithms that fall into this category include an ontology-driven clustering algorithm (Wang et al., 2005) and the ones that use information about TS2 upstream regions of the coding sequences and gene expression profiles to get more biologically relevant clusters (Holmes and Bruno, 2000; Barash and Friedman, 2002; Kasturi et al., 2003).
Among the existing clustering algorithms, k-means and hierarchical clustering algorithms are the most commonly used. k-means is computationally faster than hierarchical clustering and produces tighter clusters than the hierarchical clustering algorithm. On the other hand, the hierarchical clustering algorithm computes a complete hierarchy of clusters and hence is more informative than k-means. Despite these advantages, both of these algorithms suffer from some limitations. The performance of k-means clustering depends on how effectively the initial number of clusters (i.e., the value of k) is determined, and the advantage of hierarchical clustering comes at the cost of low efficiency. Moreover, being computationally expensive, both of these algorithms impede the wide use of these algorithms in gene expression data analysis (Garai and Chaudhuri, 2004; Ushizawa et al., 2004; Bolshakova et al., 2005). As a solution to this problem, a combined approach was proposed by Chen et al. (2005), who first applied the k-means algorithm to determine the k clusters and then fed these clusters into the hierarchical clustering technique to shorten the merging cluster time and generate a treelike dendrogram. However, this solution still suffers from the limitation of determining the initial value for k (Hasan, 2013; Hasan and Duan, 2014).
In this chapter, we propose a new algorithm, hierarchical k-means, that combines the advantages of both k-means and the hierarchical clustering algorithm to overcome their limitations. Combining different algorithms to overcome their own limitations and produce better results is a popular approach in research (Che et al., 2011, 2012; Hasan et al., 2012). In this proposed algorithm, initially we applied the hierarchical clustering algorithm and then used the result to decide the initial number of clusters and fed this information into k-means clustering to obtain the final clusters. Since similar gene expression profiles indicate similarity in their gene functionalities (Azuaje and Dopazo, 2005), after applying the proposed algorithm to the microarray data set of lung adenocarcinoma using gene ontology (GO) annotations, we explored the change in the enrichment of molecular functionalities of the genes of each cluster for normal tissue and KRAS-positive tissues. Our results showed that in each cluster, genes were grouped together based on their expression pattern and molecular functions, which indicate the correctness of this proposed algorithm.
k-means clustering algorithm: For clustering genes, k-means clustering, a well-known method for cluster analysis partition expression levels of n genes into k clusters, so that the total distance between the cluster’s genes and its corresponding centroid, representative of the cluster, is minimized. In short, the goal is to partition the n genes into k sets Si, i = 1, 2…, k in order to minimize the within-cluster sum of squares (WCSS), defined as
where provides the distance between a gene and the cluster’s centroid.
In this clustering algorithm, the initial cluster centroids are selected randomly. After that, each gene is assigned to the closest cluster centroid. Then each cluster centroid is moved to the mean of the points assigned to it. This algorithm converges when the assignments no longer change. Algorithm 4.1 shows the pseudocode of the k-means clustering algorithm.
Hierarchical clustering algorithm: In gene clustering, hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters (as its name indicates). This clustering method organizes genes into tree structures based on their relation. The basic idea is to assemble a set of genes into a tree, where genes are joined by very short branches if they have very great similarity to each other, and by increasingly long branches as their similarity decreases.
The approaches for hierarchical clustering can be classified into two groups: agglomerative and divisive. The agglomerative approach is a “bottom-up” approach, where each gene starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. On the other hand, divisive approach is a “top-down” approach, where all genes starts in one cluster and splits are performed recursively as one moves down the hierarchy. In this chapter, we mainly focus on the agglomerative approach for hierarchical clustering.
The first step in hierarchical clustering is to calculate the distance matrix between the genes in the data set. The clustering starts once this matrix of distances is computed. The agglomerative hierarchical clustering technique consists of repeated cycles where the two closest genes having the smallest distance are joined by a node known as a pseudonode. The two joined genes are removed from the list of genes being processed and replaced by the pseudonode that represents the new branch. The distances between this pseudonode and all other remaining genes are computed, and the process is repeated until only one node remains. Note that there are a variety of ways to compute distances while dealing with a pseudonode: centroid linkage, single linkage, complete linkage, and average linkage. In this chapter, we use average linkage, which defines the distance between two clusters as the average pairwise distance between genes in cluster Ci and Cj calculated using Eq. (4.2):
where δ(x, y) is typically given by the Euclidean distance calculated using Eq. (4.3):
The pseudocode of agglomerative hierarchical clustering using average linkage is illustrated in Algorithm 4.2.
Hierarchical k-means: In this proposed algorithm, we selected the value of k (i.e., the number of clusters) in a systematic way. Initially, we used the agglomerative hierarchical clustering algorithm for clustering the data set using average linkage and then checked at what level the distance between two consecutive nodes of the hierarchy was the maximum. Using this information, the value of k is determined, which is then fed into the k-means clustering algorithm to produce the final clusters. In both algorithms, the Pearson correlation coefficient (r) was used as the similarity metric between two samples and 1 − r was used as the distance metric. Algorithm 4.3 shows the pseudocode of the proposed algorithm.
Lung adenocarcinoma, the most frequent type of non-small-cell lung cancer (NSCLC) accounts for more than 50% of NSCLC, and the percentage is increasing (Okayama et al., 2012). Recent studies revealed that activation of the EGFR, KRAS, and ALK genes defines three different pathways that are responsible for a considerable fraction (30%–60%) of lung adenocarcinomas (Pao and Girard, 2011; Ihle et al., 2012; Janku et al., 2010; Bronte et al., 2010; Gerber and Minna, 2010). The data set used in this research contains expression profiles for 246 samples, of which 20 samples belonged to normal lung tissue. Out of the remaining 226 lung adenocarcinoma samples, 127 were with EGFR mutation, 20 with KRAS mutation, 11 with EML4-ALK fusion, and 68 with triple negative cases. The platform used for this data set was GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array. This data set was collected from the GEO database (accession number GSE31210). The data set contained 54,675 genes. In this study, we considered 40 samples consisting of 20 samples from normal tissues and 20 samples from KRAS-positive tissues.
To determine the differentially expressed genes, we performed paired student t-test (Hsu and Lachenbruch, 2008) and Bonferroni corrections (Bonferroni, 1936), followed by the calculation of the value of fold change of the genes. In this study, after performing Bonferroni correction, we selected the genes as the most differentially expressed ones, which have adjusted p-values ≤ 0.05. In addition, we considered only those genes where the value of fold change (increase or decrease) is significant; i.e., the average fold change between cancer and normal is ≥ 2. Besides this preprocessing, we considered only those genes that are associated with molecular functions according to Gene Ontology (GO).
After performing the t-test, we obtained 21,880 genes having significant p-values (≤ 0.05). We performed Bonferroni correction on these genes and found 1988 genes that had a significantly adjusted p-value (≤ 0.05). Adding the fold change criterion, we reduced the set of differentially expressed genes to 1005. We then performed another step of filtering to keep only those genes that have GO terms and responsible for molecular functions. Finally, we came up with 464 genes in the final data set. The final data set is given partially in Table 4.1, and the complete data set is available in http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE31210 (Accessed 06/18/2013).
Table 4.1
A Brief Overview of the Final data set
Affymatrix ID | Gene Symbol | Samples | ||
GSM 773551 | … | GSM 773784 | ||
1555579_s_at | PTPRM | 3441.22 | … | 3569.13 |
211986_at | AHNAK | 4395.68 | … | 7080.40 |
222392_x_at | PERP | 21707.73 | … | 11350.53 |
236715_x_at | UACA | 1303.01 | … | 1867.76 |
244704_at | NFYB | 124.08 | … | 277.49 |
… | … | … | … | … |
211237_s_at | FGFR4 | 22.41 | … | 11.07 |
203980_at | FABP4 | 257.25 | … | 920.44 |
207302_at | SGCG | 47.09 | … | 9.61 |
210081_at | AGER | 241.63 | … | 2001.28 |
217046_s_at | AGER | 132.42 | … | 1016.05 |
The result of hierarchical clustering for normal tissue data set is shown in Figure 4.1. There are 463 interior nodes in the tree where each node is labeled based on the increasing order of its height. Therefore, the ID for the root is 463. To determine the number of clusters from the output of hierarchical clustering, we used a bar graph to show the difference of height between two consecutive interior nodes (see Figure 4.2).
From Figure 4.2, we can see that the difference is the maximum for nodes 461 and 462. As there is a total of 463 nodes in the tree, node 461 is on level 3 from the top. So, according to the proposed algorithm, the total number of clusters for k-means clustering should be 4.
Similarly, the number of clusters for the KRAS-positive data set can also be determined. Figure 4.3 shows the hierarchical clustering of KRAS-positive data set, and Figure 4.4 shows the height difference between two consecutive nodes. The results indicate that the number of clusters for KRAS-positive data set should be 4.
After determining the value for the initial number of clusters (k), we passed the value to k-means algorithms, and k numbers of clusters were formed for both normal and KRAS-positive tissues. We explored their common features (genes) and explained the change of molecular function of the genes captured in the clusters of both normal tissue and KRAS-positive tissue using GO annotations. For comparing the molecular function of the clusters of normal tissue and KRAS-positive tissues, we took one cluster from the normal tissue data set and one from the KRAS-positive data set that have the maximum number of common genes. Table 4.2 shows the clusters that we selected for comparing their molecular functions with the number of genes they have in common.
Table 4.2
List of the Clusters to Be Compared for the Alteration in Molecular Function
Clusters to Compare | Number of Genes in Common | |
Normal Tissue | KRAS-Positive | |
Cluster 1 | Cluster 1 | 20 |
Cluster 2 | Cluster 3 | 52 |
Cluster 3 | Cluster 4 | 46 |
Cluster 4 | Cluster 2 | 69 |
We explored the molecular functions of the genes in each cluster using GO annotations. Relationships among the genes were represented using a directed acyclic graph (DAG), termed the GO graph. We used a web-based tool called the Gene Ontology Enrichment Analysis Software Toolkit (GOEAST) (Zheng and Wang, 2008) to generate these graphs. This graph displays enriched Gene Ontology IDs (GOIDs) and their hierarchical relationships in molecular function GO categories. Figures 4.5 and 4.6 show the GO graph for cluster 1 for normal tissue and KRAS-positive tissue data set, respectively.
In Figures 4.5 and 4.6, boxes represent GO terms, each labeled by its GOID and term definition. Note that significantly enriched GO terms are shaded yellow. The degree of color saturation of each node is positively correlated with the significance of enrichment of the corresponding GO term. Nonsignificant GO terms within the hierarchical tree are shown as white boxes. In both of these graphs, edges stand for connections between different GO terms. Edges colored in red stand for the relationship between two enriched GO terms, black solid edges stand for the relationship between enriched and unenriched terms, and black dashed edges stand for the relationship between two unenriched GO terms.
In brief, these two figures show that the significant GO terms GO: 0005488 (binding) and GO: 0005515 (protein binding) remain the same in both clusters. GO terms such as GO: 0030234 (Enzyme Regulator Activity), GO: 0019207 (Kinase Regulator Activity), GO: 0019210 (Kinase Inhibitor Activity), GO: 0019887 (Protein Kinase Regulator Activity), and GO: 0004860 (Protein Kinase Inhibitor Activity), which are unenriched in normal tissue, become highly enriched in the KRAS-positive tissues, indicating that our proposed algorithm can cluster representative genes of both data sets correctly.
To compare the enrichment status of the two clusters better, we used Multi-GOEAST, which is an advanced version of GOEAST, and it is helpful to identify the hidden correlation between the two clusters (Zheng and Wang, 2008). Figure 4.7 shows the comparative GO graph for cluster 1 of both data sets.
In the comparative GO graph, significantly enriched GO terms in both clusters are marked yellow, and light yellow color indicates the GO terms that are enriched in both clusters. Nodes marked with coral pink indicate the GO terms that are enriched in the normal tissue data set but not in the KRAS-positive data set. In addition, green nodes represent the GO terms that are unenriched in normal tissue but enriched in KRAS-positive tissue. Note that the degree of color saturation of each node is positively correlated with the significance of enrichment of the corresponding GO term.
Table 4.3 lists the genes associated with the GO terms that are enriched in cluster 1 of the KRAS-positive tissue data set, but not in cluster 1 of the normal tissue data set. These GO terms are marked green in the comparative GO graph shown in Figure 4.7. We believe that these are responsible for the alteration of the molecular activity in the cell and are linked to the development of KRAS lung cancer. Similarly, we can generate and compare the GO enrichment graph for the rest of the clusters (see supplementary materials).
Table 4.3
GO Terms and Pathways That Are Enriched in Molecular Functions of the Genes of Cluster 1 of KRAS-Positive Tissue but Unenriched in the Genes of Cluster 1 of Normal Tissue Data Set
GO ID | GO Term | Associated Genes | Pathway |
GO:0030234 | Enzyme regulator activity | TIMP3 | 1 Matrix_metalloproteinases |
CDKN1C | G1_to_S_cell_cycle_reactome | ||
PAK1 | Integrin mediated_cell_adhesion_KEGG | ||
ECT2 | N/A | ||
RALGPS2 | N/A | ||
SFN | Calcium_regulation_in_cardiac_cells | ||
Smooth_muscle_contraction | |||
GO:0019207 | Kinase regulator activity | CDKN1C | G1_to_S_cell_cycle_reactome |
SFN | Calcium_regulation_in_cardiac_cells | ||
Smooth_muscle_contraction | |||
GO:0004857 | Enzyme inhibitor activity | TIMP3 | Matrix_metalloproteinases |
CDKN1C | G1_to_S_cell_cycle_reactome | ||
SFN | Calcium_regulation_in_cardiac_cells | ||
Smooth_muscle_contraction | |||
GO:0019887 | Protein kinase regulator activity | CDKN1C | G1_to_S_cell_cycle_reactome |
SFN | Calcium_regulation_in_cardiac_cells | ||
Smooth_muscle_contraction | |||
GO:0019210 | Kinase inhibitor activity | CDKN1C | G1_to_S_cell_cycle_reactome |
SFN | Calcium_regulation_in_cardiac_cells | ||
Smooth_muscle_contraction | |||
GO:0004860 | Protein kinase inhibitor activity | CDKN1C | G1_to_S_cell_cycle_reactome |
SFN | Calcium_regulation_in_cardiac_cells | ||
Smooth_muscle_contraction | |||
GO:0051018 | Protein kinase A binding | AKAP12 | G_protein_signaling |
GO:0008179 | Adenylate Cyclase binding | AKAP12 | G_protein_signaling |
In this chapter, we propose hierarchical k-means, a new combined clustering algorithm designed to cluster genes in a microarray data set based on their expression levels. In this algorithm, using the output from hierarchical clustering, we systematically determined the value of k required for k-means clustering. This way, the proposed algorithm overcomes the limitation of k-means clustering. This proposed algorithm takes advantage of the ability of hierarchical clustering to get a complete hierarchy of clusters and uses this information in k-means clustering to produce tighter clusters.
In this study, we examined 40 samples and 464 genes from the data set of lung adenocarcinoma, which is one of the most frequent types of NSCLC. Out of the 40 samples, 20 were from normal tissue and 20 were from KRAS-positive tissue. We applied t-test, Bonferroni correction, and fold change cutoff techniques to find the significantly differentially expressed genes, and among them, only the genes having GO terms and responsible for molecular functions were included in the final data set.
After applying the proposed clustering algorithms, we obtained four clusters for both the normal tissue data set and KRAS-positive data set. Hereafter, we examined the genes contained in each cluster with respect to their molecular functions based on GO annotation to see what changes in the enrichment of the molecular functions of genes took place from normal tissues to KRAS-positive tissues. This way, after checking the change in enrichment of the GO terms, we verified that the proposed algorithm can cluster representative genes of both data sets based on their expression patterns. The coherent approach presented in this chapter shows its correctness to cluster genes, and we believe that it can be generalized for clustering other types of large data sets as well.