6 Mining deeper on GO for protein subcellular localization

The methods described in the previous chapters use gene ontology (GO) information for single-label and multi-label protein subcellular localization prediction. Their performance is significantly better than that of conventional methods that use non-GO features. However, these GO-based methods use only the occurrences of GO terms as features and disregard the relationships among the GO terms. This chapter describes some novel methods that mine deeper into the GO database for protein subcellular localization. The methods leverage not only the GO term occurrences but also the interterm relationships. Previous work related to semantic similarity is presented first. Then, a multi-label predictor, SS-Loc, based on semantic similarity over GO is discussed. Finally, a hybrid-feature predictor, HybridGO-Loc, based on both GO term occurrences and semantic similarity, is elaborated.

6.1 Related work

The GO comprises three orthogonal taxonomies whose terms describe the cellular components, biological processes, and molecular functions of gene products. The GO terms in each taxonomy are organized within a directed acyclic graph. These terms are linked by structural relationships, of which the most important are the “is-a” relationship (parent and child) and the “part-of” relationship (part and whole) [159, 225].

Figure 6.1 shows a screenshot of the query results (under child terms) of the GOA webserver using the GO term GO:0000187 as the search key. As can be seen, GO:0000187 has six “is-a” child terms (GO:0000169, GO:0000199, GO:0071883, GO:0007257, GO:0071508, and GO:0035419) and one “part-of” child term (GO:0004708). Figure 6.2 shows a fragment of the GO graph, in which the hierarchical relationships between an ancestor GO term (GO:0000187) and its child terms are given. As can be seen, a GO term may have many ancestor GO terms. For example, GO:0038080 has an “is-a” ancestor GO term GO:0000169 and a “part-of” ancestor GO term GO:0004708. Moreover, the common ancestor GO:0000187 of GO:0000169 and GO:0004708 is also an ancestor GO term of GO:0038080. The ancestors of GO:0000187 (not shown in the graph) are also ancestors of GO:0038080.

Recently, the GO consortium has enriched the ontology with more structural relationships, such as “positively-regulates”, “negatively-regulates”, and “has-part” [200, 201]. These relationships reflect that the GO hierarchical tree for each taxonomy contains redundant information, for which semantic similarity over GO terms can be found.


Fig. 6.1: A screenshot of the query results listing child terms of the GOA webserver (http://www.ebi.ac.uk/GOA) using a GO term (GO:0000187).


Fig. 6.2: A fragment of GO graph showing the hierarchical relationships between an ancestor GO term (GO:0000187) and its child terms. For simplicity, 0000187 stands for GO:0000187; similar abbreviations apply to other GO terms. Solid arrows represent “is-a” relationships, and dashed arrows represent “part-of” relationships.

Since the relationships between GO terms reflect the associations between different gene products, protein sequences annotated with GO terms can be compared on the basis of semantic similarity measures. Semantic similarity over GO has been extensively studied and has been applied to many biological problems, including protein function prediction [164, 246], subnuclear localization prediction [118], protein-protein interaction inference [82, 229, 234], and microarray clustering [236]. The performance of these predictors depends on whether the similarity measure is relevant to the biological problem. Over the years, a number of semantic similarity measures have been proposed, some of which have been used in natural language processing.

Semantic similarity measures can be applied at the GO-term level or the gene-product level. At the GO-term level, methods are roughly categorized as node-based and edge-based. The node-based measures rely mainly on the concept of the information content of terms, which was proposed by Resnik [174] for natural language processing. Later, Lord et al. [126] applied this idea to measure the semantic similarity among GO terms. Lin et al. [123] proposed a method based on information theory and structural information. Subsequently, more node-based measures [19, 58, 181] were proposed. Edge-based measures use the length or the depth of different paths between terms and/or their common ancestors [31, 168, 228, 238]. At the gene-product level, the two most common methods are pairwise approaches [175, 185, 199, 216, 220] and groupwise approaches [29, 98, 144, 186]. Pairwise approaches measure the similarity between two gene products by combining the semantic similarities between their terms. Groupwise approaches, on the other hand, directly group the GO terms of a gene product as a set, a graph, or a vector, and then calculate the similarity by set similarity techniques, graph matching techniques, or vector similarity techniques. More recently, Pesquita et al. [165] reviewed the semantic similarity measures applied to biomedical ontologies, and Guzzi et al. [84] provided a comprehensive review of the relationship between semantic similarity measures and biological features.

6.2 SS-Loc: Using semantic similarity over GO

This section proposes a novel predictor, SS-Loc, based on GO semantic similarity for multi-label protein subcellular localization prediction. The proposed predictor differs from other predictors in that (1) it formulates the feature vectors from the semantic similarity over the gene ontology, which carries richer information than GO term occurrences alone; (2) it adopts a new strategy to incorporate richer and more useful homologous information from more distant homologs rather than using only the top homologs; and (3) it adopts a new decision scheme for an SVM classifier, so that it can effectively deal with datasets containing both single-label and multi-label proteins. Results on a recent benchmark dataset demonstrate that these three properties enable the proposed predictor to accurately predict multi-location proteins and to outperform three state-of-the-art predictors.

SS-Loc retrieves GO terms in a way similar to GOASVM and mGOASVM, which ensures that each protein corresponds to at least one GO term. Subsequently, the similarity between these sets of GO terms is measured using the semantic similarity elaborated below.

6.2.1 Semantic similarity measures

To obtain the GO semantic similarity between two proteins, we should start by introducing the semantic similarity between two GO terms. Semantic similarity (SS) is a measure for quantifying the similarity between categorical data (e.g., words in documents), where the notion of similarity is based on the likeness of meanings in the data. It was originally developed by Resnik [174] for natural language processing. The idea is to evaluate semantic similarity in an “is-a” taxonomy using the shared information content of categorical data.

In the context of gene ontology, the semantic similarity between two GO terms is based on their most specific common ancestor in the GO hierarchy. The procedure for obtaining the semantic similarity between two GO terms is shown in Figure 6.3. Given two GO terms x and y, we first find their respective ancestors by searching the GO database. Then, their common ancestors are identified, which constitute the GO term set A(x,y). Subsequently, the semantic similarity sim(x,y) between GO terms x and y is computed by using the information contents of the common ancestors, which are elaborated below.
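The first two steps of this procedure (collecting ancestors and intersecting them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the term names and the `PARENTS` map are hypothetical toy data rather than real GO identifiers, only "is-a" edges are modeled, and each term is treated as an ancestor of itself, a convention that the text does not spell out.

```python
# Toy "is-a" parent map (hypothetical term names, not real GO identifiers).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
    "e": ["c", "d"],
}


def ancestors(term, parents):
    """Return every term reachable from `term` via parent links.

    Assumption: the term itself is included in its own ancestor set.
    """
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def common_ancestors(x, y, parents):
    """Return the set A(x, y) of ancestors shared by terms x and y."""
    return ancestors(x, parents) & ancestors(y, parents)
```

With this toy graph, `common_ancestors("c", "d", PARENTS)` contains only `"a"` and `"root"`, because `"b"` is an ancestor of `"d"` but not of `"c"`.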

The relationships between GO terms in the GO hierarchy, such as “is-a” ancestor-child or “part-of” ancestor-child, can be obtained from the SQL database through the link: http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz. Note that only the “is-a” relationship is considered here for semantic similarity analysis [123]. The semantic similarity between two GO terms x and y is defined as [174]

\[
\mathrm{sim}(x,y) = \max_{c \in A(x,y)} \left[ -\log(p(c)) \right], \tag{6.1}
\]

where A(x,y) is the set of ancestor GO terms of both x and y, -log(p(c)) is the information content of the GO term c, and p(c) is the relative frequency of occurrence, which is defined as the number of gene products annotated to the GO term c divided by the number of all the gene products annotated to the GO taxonomy.
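As a small worked illustration of Resnik's measure, the sketch below computes the information content -log(p(c)) from annotation counts and takes the maximum over the common ancestors. Everything in the toy data is an assumption made for the example: the term names, the precomputed `ANCESTORS` sets (each term includes itself), and the `ANNOTATION_COUNT` values, which are taken to count gene products annotated to a term or to any of its descendants.

```python
import math

# Hypothetical toy data. Ancestor sets include the term itself (assumption).
ANCESTORS = {
    "root": {"root"},
    "a": {"a", "root"},
    "c": {"c", "a", "root"},
    "d": {"d", "a", "root"},
}
# Assumed counts of gene products annotated to each term (or its descendants).
ANNOTATION_COUNT = {"root": 100, "a": 40, "c": 10, "d": 5}
TOTAL = ANNOTATION_COUNT["root"]


def information_content(term):
    """Information content -log(p(term)), with p the relative frequency."""
    p = ANNOTATION_COUNT[term] / TOTAL
    return -math.log(p)


def sim_resnik(x, y):
    """Largest information content among the common ancestors of x and y."""
    common = ANCESTORS[x] & ANCESTORS[y]
    return max(information_content(c) for c in common)
```

Here the terms `"c"` and `"d"` share the ancestors `"a"` and `"root"`; the root has zero information content, so `sim_resnik("c", "d")` equals the information content of `"a"`, that is, -log(0.4).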

While Resnik’s measure is effective in quantifying the shared information between two GO terms, it ignores the “distance” between the terms and their common ancestors in the GO hierarchy. The distance can be derived from the relative positions of GO terms in the GO hierarchical structure. To further incorporate the structural information of the GO hierarchy into the similarity measure, extensions of Resnik’s measure have been proposed in the literature. They are Lin’s measure [123], Jiang’s measure [103], and relevance similarity (RS) [181].


Fig. 6.3: Flowchart of obtaining termwise semantic similarity between two GO terms.

Given two GO terms x and y, the similarities by Lin’s, Jiang’s, and RS measures are

\[
\mathrm{sim}_{\mathrm{Lin}}(x,y) = \max_{c \in A(x,y)} \left( \frac{2 \cdot [-\log(p(c))]}{-\log(p(x)) - \log(p(y))} \right), \tag{6.2}
\]

\[
\mathrm{sim}_{\mathrm{Jiang}}(x,y) = \max_{c \in A(x,y)} \left( \frac{1}{1 - \log(p(x)) - \log(p(y)) + 2 \cdot \log(p(c))} \right), \tag{6.3}
\]

\[
\mathrm{sim}_{\mathrm{RS}}(x,y) = \max_{c \in A(x,y)} \left( \frac{2 \cdot [-\log(p(c))]}{-\log(p(x)) - \log(p(y))} \cdot (1 - p(c)) \right), \tag{6.4}
\]

respectively. Among the three measures, simLin(x,y) and simJiang(x,y) are relative measures that depend on the difference in information content between the terms and their common ancestors rather than on the absolute information content of the ancestors. On the other hand, simRS(x,y) attempts to incorporate the depth of the lowest common ancestor (LCA) of x and y into the similarity measure. This is achieved by weighting Lin’s measure with the probability that c is not the LCA of x and y. The rationale behind the weighting is that the deeper the GO terms x and y are located in the GO hierarchy, the more specific they and their LCA will be. As a result, the number of gene products annotated to the common ancestor will be small. By weighting Lin’s measure with 1 − p(c), the relevance measure puts more emphasis on the common ancestors that are more specific and deeper in the GO hierarchy. To simplify the notation, we refer to simLin(x,y), simJiang(x,y), and simRS(x,y) as sim1(x,y), sim2(x,y), and sim3(x,y), respectively.
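To make the differences among the three measures concrete, the following sketch evaluates them on a toy hierarchy. It assumes the commonly cited forms of the measures: Lin as 2·IC(c) / (IC(x) + IC(y)), Jiang as 1 / (1 + IC(x) + IC(y) − 2·IC(c)), and RS as Lin weighted by 1 − p(c), each maximized over the common ancestors c, where IC(t) = −log(p(t)). The probabilities in `P`, the `ANCESTORS` sets, and the term names are hypothetical.

```python
import math

# Hypothetical relative frequencies p(t) and ancestor sets (self included).
P = {"root": 1.0, "a": 0.4, "c": 0.1, "d": 0.05}
ANCESTORS = {
    "root": {"root"},
    "a": {"a", "root"},
    "c": {"c", "a", "root"},
    "d": {"d", "a", "root"},
}


def ic(term):
    """Information content -log(p(term))."""
    return -math.log(P[term])


def _best(x, y, score):
    """Maximize a per-ancestor score over the common ancestors of x and y."""
    return max(score(c) for c in ANCESTORS[x] & ANCESTORS[y])


def sim_lin(x, y):
    # Undefined when both terms have p = 1 (zero denominator); the toy
    # calls below avoid that case.
    denom = ic(x) + ic(y)
    return _best(x, y, lambda c: 2 * ic(c) / denom)


def sim_jiang(x, y):
    return _best(x, y, lambda c: 1.0 / (1.0 + ic(x) + ic(y) - 2 * ic(c)))


def sim_rs(x, y):
    denom = ic(x) + ic(y)
    return _best(x, y, lambda c: 2 * ic(c) / denom * (1.0 - P[c]))
```

For identical terms, Lin's and Jiang's measures both give 1, whereas the RS measure gives 1 − p(x), so a term is only rated as highly similar to itself when it is specific (small p).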

6.2.2 SS vector construction

Based on the semantic similarity between two GO terms, we adopted a continuous measure proposed in [236] to calculate the similarity between two proteins. Specifically, given two proteins $\mathbb{P}_i$ and $\mathbb{P}_j$, we retrieved their corresponding GO term sets $P_i$ and $P_j$, as described in Section 4.1.2. Then we computed the semantic similarity between the two sets of GO terms $\{P_i, P_j\}$ as follows:

\[
S_l(P_i, P_j) = \sum_{x \in P_i} \max_{y \in P_j} \mathrm{sim}_l(x,y), \tag{6.5}
\]

where $l \in \{1,2,3\}$, and $\mathrm{sim}_l(x,y)$ is defined in equations (6.2)-(6.4). $S_l(P_j, P_i)$ is computed in the same way by swapping $P_i$ and $P_j$. Finally, the overall similarity between the two proteins is given by

\[
SS_l(P_i, P_j) = \frac{S_l(P_i, P_j) + S_l(P_j, P_i)}{S_l(P_i, P_i) + S_l(P_j, P_j)}, \tag{6.6}
\]

where $l \in \{1,2,3\}$. In the sequel, we refer to the SS measures by Lin, Jiang, and RS as SS1, SS2, and SS3, respectively.

For a testing protein $\mathbb{Q}_t$ with GO term set $Q_t$, a GO semantic similarity (SS) vector $\mathbf{q}_t^{SS_l}$ can be obtained by computing the semantic similarity between $Q_t$ and the GO term set of each of the training proteins $\{P_i\}_{i=1}^{N}$, where N is the number of training proteins. Thus, $\mathbb{Q}_t$ can be represented by an N-dimensional vector:

\[
\mathbf{q}_t^{SS_l} = \left[ SS_l(Q_t, P_1), \ldots, SS_l(Q_t, P_i), \ldots, SS_l(Q_t, P_N) \right]^{\mathsf{T}}, \tag{6.7}
\]

where $l \in \{1,2,3\}$. In other words, $\mathbf{q}_t^{SS_l}$ represents the SS vector obtained by using the l-th SS measure.
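The construction of an SS vector can be sketched as below. The sketch assumes one particular form of the protein-level measure: a directed score that sums, over the GO terms of one protein, the best termwise similarity to any GO term of the other protein, and an overall similarity that adds the two directed scores and normalizes by the two self-scores. The function names, the toy termwise similarity table `TOY_SIM`, and the term names are all hypothetical; any termwise measure can be passed in as `sim`.

```python
# Hypothetical symmetric termwise similarities between three toy terms.
TOY_SIM = {
    ("a", "a"): 1.0, ("b", "b"): 1.0, ("c", "c"): 1.0,
    ("a", "b"): 0.5, ("a", "c"): 0.2, ("b", "c"): 0.1,
}


def toy_sim(x, y):
    """Look up the toy termwise similarity regardless of argument order."""
    return TOY_SIM[(x, y)] if (x, y) in TOY_SIM else TOY_SIM[(y, x)]


def directed_score(terms_i, terms_j, sim):
    """Sum over x in terms_i of the best match sim(x, y) for y in terms_j."""
    return sum(max(sim(x, y) for y in terms_j) for x in terms_i)


def protein_similarity(terms_i, terms_j, sim):
    """Symmetric, self-normalized similarity between two GO term sets."""
    numerator = (directed_score(terms_i, terms_j, sim)
                 + directed_score(terms_j, terms_i, sim))
    denominator = (directed_score(terms_i, terms_i, sim)
                   + directed_score(terms_j, terms_j, sim))
    return numerator / denominator


def ss_vector(query_terms, training_term_sets, sim):
    """N-dimensional SS vector: similarity of the query to each training set."""
    return [protein_similarity(query_terms, terms, sim)
            for terms in training_term_sets]
```

For example, `ss_vector(["a"], [["a"], ["b"], ["c"]], toy_sim)` yields one entry per training protein, with the first entry equal to 1 because the query and the first training protein carry the same term. The vector length grows with the number of training proteins, which is why the resulting features are high-dimensional.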

6.3 HybridGO-Loc: Hybridizing GO frequency and semantic similarity features

This section proposes a multi-label subcellular-localization predictor, HybridGO-Loc, that leverages not only the GO term occurrences but also the interterm relationships. This is achieved by hybridizing the GO frequencies of occurrences and the semantic similarity between GO terms.

The flowchart of HybridGO-Loc is shown in Figure 6.4. First, given a protein, its GO frequency vector and GO SS vector are obtained through the methods specified in Sections 4.1 and 6.2, respectively. Specifically, given a protein, a set of GO terms is retrieved by searching against the gene ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequencies of GO occurrences and the semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive-decision-based multi-label support vector machine (SVM) classifier is used to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid-feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors.


Fig. 6.4: Flowchart of HybridGO-Loc.

6.3.1 Hybridization of two GO features

As described in Sections 4.1.3 and 6.2.1, the GO frequency features (equation (4.3)) use the frequency of occurrences of GO terms, while the GO SS features (equations (6.2)-(6.4)) use the semantic similarity between GO terms. These two features are developed from two different perspectives. It is therefore reasonable to believe that these two kinds of information complement each other. Based on this assumption, we combine these two GO features and form a hybridized vector as

\[
\mathbf{q}_t^{H_l} = \begin{bmatrix} \mathbf{q}_t^{GO} \\ \mathbf{q}_t^{SS_l} \end{bmatrix}, \tag{6.8}
\]

where $l \in \{1,2,3\}$, $\mathbf{q}_t^{GO}$ is the GO frequency vector of the t-th query protein (Section 4.1.3), and $\mathbf{q}_t^{SS_l}$ is its SS vector derived from the l-th SS measure (Section 6.2.2). In other words, $\mathbf{q}_t^{H_l}$ represents the hybridized-feature vector that combines the GO frequency features and the SS features derived from the l-th SS measure. We refer to them as Hybrid1, Hybrid2, and Hybrid3, respectively.
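As a minimal sketch of the hybridization step, and assuming that combining the two features means stacking the GO frequency vector on top of the SS vector, the operation reduces to a concatenation. The function name and the toy numbers are hypothetical.

```python
def hybrid_vector(go_frequency_vector, ss_vector):
    """Concatenate GO frequency features and SS features into one vector.

    Assumption: hybridization is plain concatenation, with no rescaling
    or weighting of either part.
    """
    return list(go_frequency_vector) + list(ss_vector)


# Toy example: three GO-term frequencies followed by two SS entries.
example = hybrid_vector([2, 0, 1], [0.5, 1.0])
```

The dimension of the hybrid vector is the number of distinct GO terms plus the number of training proteins, so the two parts may differ greatly in scale and length. Whether any normalization is applied before concatenation is not stated in the text.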

6.3.2 Multi-label multiclass SVM classification

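The hybridized-feature vectors obtained in Section 6.3.1 are used for training multi-label one-vs-rest support vector machines (SVMs). Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary SVMs are trained, one for each class. Denote the hybrid GO vector of the t-th query protein as $\mathbf{q}_t^{H_l}$, where the l-th SS measure of Section 6.2.1 is used. Given the t-th query protein $\mathbb{Q}_t$, the score of the m-th SVM using the l-th SS measure is

\[
s_{m,l}(\mathbb{Q}_t) = \sum_{r \in \mathcal{S}_{m,l}} \alpha_{m,r}\, y_{m,r}\, K\!\left(\mathbf{p}_r^{H_l}, \mathbf{q}_t^{H_l}\right) + b_m, \tag{6.9}
\]

where $\mathbf{p}_r^{H_l}$ is the hybrid GO vector derived from the r-th training protein $\mathbb{P}_r$ (see equation (6.8)), $\mathcal{S}_{m,l}$ is the set of support vector indexes corresponding to the m-th SVM, $\alpha_{m,r}$ are the Lagrange multipliers, $y_{m,r} \in \{-1,+1\}$ indicates whether the r-th training protein belongs to the m-th class or not, $b_m$ is the bias term of the m-th SVM, and $K(\cdot,\cdot)$ is a kernel function. Here, the linear kernel was used.

Unlike the single-label problem, where each protein has only one predicted label, a multi-label protein could have more than one predicted label.

In this work, we compared the two decision schemes introduced in Sections 5.3 and 5.4 for this multi-label problem. In the first scheme, the predicted subcellular location(s) of the t-th query protein are given by

\[
\mathcal{M}_l(\mathbb{Q}_t) =
\begin{cases}
\bigcup_{m=1}^{M} \{ m : s_{m,l}(\mathbb{Q}_t) > 0 \}, & \text{if } \exists\, m \in \{1,\ldots,M\} \text{ such that } s_{m,l}(\mathbb{Q}_t) > 0; \\
\arg\max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t), & \text{otherwise}.
\end{cases} \tag{6.10}
\]

The second scheme is an improved version of the first one in that the decision threshold is dependent on the test protein. Specifically, the predicted subcellular location(s) of the t-th query protein are given as follows:

\[
\mathcal{M}_l(\mathbb{Q}_t) =
\begin{cases}
\bigcup_{m=1}^{M} \left\{ m : s_{m,l}(\mathbb{Q}_t) \ge \min\{1.0,\, f(s_{\max,l}(\mathbb{Q}_t))\} \right\}, & \text{if } \exists\, m \in \{1,\ldots,M\} \text{ such that } s_{m,l}(\mathbb{Q}_t) > 0; \\
\arg\max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t), & \text{otherwise}.
\end{cases} \tag{6.11}
\]

In equation (6.11), $f(s_{\max,l}(\mathbb{Q}_t))$ is a function of $s_{\max,l}(\mathbb{Q}_t)$, where

\[
s_{\max,l}(\mathbb{Q}_t) = \max_{m=1}^{M} s_{m,l}(\mathbb{Q}_t). \tag{6.12}
\]

In this work, we used a linear function as follows:

\[
f(s_{\max,l}(\mathbb{Q}_t)) = \theta\, s_{\max,l}(\mathbb{Q}_t), \tag{6.13}
\]

where $\theta \in [0.0, 1.0]$ is a hyper-parameter that can be optimized through cross-validation.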
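The two decision schemes can be contrasted with a short sketch that operates on the M SVM scores of one query protein. It rests on stated assumptions: in the first scheme every class with a positive score is predicted, in the second scheme a class is predicted when its score reaches min(1.0, θ·s_max), and in both schemes the top-scoring class is returned when no score is positive. The function names and the example scores are hypothetical.

```python
def decide_fixed(scores):
    """First scheme: all classes with a positive score, else the best class."""
    positives = [m for m, s in enumerate(scores) if s > 0]
    if positives:
        return positives
    return [scores.index(max(scores))]


def decide_adaptive(scores, theta):
    """Second scheme: threshold adapts to the query's maximum score.

    Assumption: the threshold is min(1.0, theta * s_max) with theta in [0, 1].
    """
    s_max = max(scores)
    if s_max > 0:
        threshold = min(1.0, theta * s_max)
        return [m for m, s in enumerate(scores) if s >= threshold]
    return [scores.index(s_max)]


# Toy SVM scores for one query protein over M = 4 classes.
scores = [1.5, 0.4, -0.2, 0.9]
fixed_labels = decide_fixed(scores)
adaptive_labels = decide_adaptive(scores, theta=0.5)
```

With these scores the first scheme predicts classes 0, 1, and 3, while the adaptive scheme with θ = 0.5 uses the threshold 0.75 and drops the weakly positive class 1. Setting θ = 0 makes the adaptive scheme behave like the first one (up to scores that are exactly zero), and θ = 1 keeps only classes scoring at least min(1.0, s_max).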

In fact, besides SVMs, many other machine learning models, such as hidden Markov models (HMMs) and neural networks (NNs) [3, 152], have been used in protein subcellular-localization predictors. However, HMMs and NNs are not suitable for GO-based predictors because of the high dimensionality of GO vectors. The main reason is that under such conditions, HMMs and NNs can be easily overtrained, which leads to poor performance. On the other hand, linear SVMs handle high-dimensional data well, because even if the number of training samples is smaller than the feature dimension, linear SVMs are still able to find an optimal solution.

6.4 Summary

This chapter presented two predictors, namely SS-Loc and HybridGO-Loc, both of which extract deeper GO information, i.e. GO semantic similarity, for multi-location protein subcellular localization.

SS-Loc extends the interterm relationship of GO terms to an intergroup relationship of GO term groups, which is used to represent the similarity between proteins. The similarity vectors are then classified by multi-label SVM classifiers. Building on this, HybridGO-Loc combines the features of GO occurrences and GO semantic similarity to generate hybrid feature vectors. Several semantic similarity measures were compared, and two different decision schemes were tried. The superior performance of HybridGO-Loc (see Chapter 9) also demonstrates that these two features are complementary to each other.

