3 Legitimacy of using gene ontology information

Before proposing subcellular-location predictors based on gene ontology (GO) information, this chapter addresses some concerns about the legitimacy of using GO information for protein subcellular localization. There are three main concerns: (1) Can GO-based methods be replaced by a lookup table using the cellular component GO terms as the keys and the component categories as the hashed values? (2) Are cellular component GO terms the only information necessary for protein subcellular localization? (3) Are GO-based methods equivalent to transferring annotations from BLAST homologs? These concerns are addressed in the following sections.

3.1 Direct table lookup?

Those who are skeptical about GO-based prediction methods are likely to raise the following question: If a protein has already been annotated with cellular component GO terms, is it still necessary to predict its subcellular localization? This sounds like a legitimate question. The GO comprises three orthogonal categories whose terms describe the cellular components, biological processes, and molecular functions of gene products, so the cellular component GO terms already suggest the subcellular localization, and prediction would seem to be merely a procedure of converting the annotation into another format. In other words, all we need is to create a lookup table (hash table) using the cellular component GO terms as the keys and the component categories as the hashed values.

To answer this question, let us first consider some facts. Most of the existing “non-GO predictors” were established based on the proteins in the Swiss-Prot Database, in which the subcellular locations are experimentally determined. Is it logical to conclude that all of these methods have nothing to predict? Obviously not. In fairness, as long as the input is a query protein sequence and the output is its subcellular location(s), the predictor should be deemed a valid protein subcellular-location predictor. In fact, most of the existing GO predictors, such as iLoc-Euk [52] and iLoc-Hum [53], use only protein sequence information to predict the subcellular locations, without adding any GO information to the input. That is to say, these GO predictors use the same input as the non-GO predictors. Therefore, GO-based predictors should also be regarded as valid predictors.

Here we explain why the simple table-lookup method mentioned above is undesirable. Although the cellular component ontology is directly related to subcellular localization, we cannot simply use its GO terms to determine the subcellular locations of proteins. The first reason is that some proteins do not have cellular component GO terms. Second, even for proteins annotated with cellular component GO terms, it is inappropriate to use these terms alone to determine their subcellular localizations, because a protein could have multiple cellular component GO terms that map to different subcellular localizations, and these are highly likely to be inconsistent with the true subcellular locations of the protein. Third, according to [44], proteins with annotated subcellular localization in Swiss-Prot may still be marked as “cellular component unknown” in the GO database. Because of these limitations, it is necessary to use the other two ontologies as well, because they are also relevant (although not directly) to the subcellular localization of proteins.

To further exemplify the analysis above, we created lookup tables for subcellular localization prediction in both the single-label and the multi-label case, as described in the following subsections.

3.1.1 Table-lookup procedure for single-label prediction

To exemplify the discussion above for the single-location case, we created a lookup table (Table 3.1) and developed a table-lookup procedure to predict the subcellular localization of the proteins in the EU16 dataset (Table 8.1). Table 3.1 has two types of GO terms: essential GO terms and child GO terms. As the name implies, the essential GO terms, as identified by Huang et al. [99], are GO terms that are essential or critical for subcellular localization prediction. In addition to the essential GO terms, their direct descendants (known as child terms) also possess direct localization information. The relationships between child terms and their parent terms include “is a”, “part of”, and “occurs in” [126]. The former two correspond to cellular component GO terms and the third one typically corresponds to biological process GO terms. As we are more interested in cellular component GO terms, the “occurs in” relationship will not be considered. For ease of reference, we refer to both essential GO terms and their child terms as “explicit GO terms”.

For each class in Table 3.1, the child terms were obtained by presenting the corresponding essential GO term to the QuickGO server [17] and then excluding those child terms that do not appear in the proteins of the EU16 dataset. Note that if we use the cellular-component names as the search keys, QuickGO will give us more than 49 cellular component GO terms, suggesting that the 49 explicit GO terms are only a tiny subset of all relevant GO terms (in our method, we have more than 5000 relevant GO terms). Even for such a small number of explicit GO terms, many proteins have explicit GO terms spanning several classes.

Given a query sequence, we first obtain its “GO-term” set from the GO Annotation Database. Then, if only one of the terms in this set matches an essential GO term in Table 3.1, the subcellular location of this query protein is predicted to be the one corresponding to this matched GO term. For example, if the set of GO terms contains GO:0005618, then this query protein is predicted as “cell wall”. Further, if none of the terms in this set matches an essential GO term but one of them matches a child term in Table 3.1, then the query protein is predicted as belonging to the class associated with this child GO term. For example, if no essential GO terms can be found in the set but GO:0009274 is found, then the query protein is predicted as “cell wall”.

Table 3.1: Explicit GO terms for the EU16 dataset. Explicit GO terms include essential GO terms and their child terms that appear in the proteins of the dataset. The definition of essential GO terms can be found in [99]. Here the relationship only includes “is a” and “part of”, because only cellular component GO terms are analyzed here. CC: cellular components, including cell wall (CEL), centriole (CEN), chloroplast (CHL), cyanelle (CYA), cytoplasm (CYT), cytoskeleton (CYK), endoplasmic reticulum (ER), extracellular (EXT), Golgi apparatus (GOL), lysosome (LYS), mitochondrion (MIT), nucleus (NUC), peroxisome (PER), plasma membrane (PM), plastid (PLA) and vacuole (VAC); relationship: the relationship between child terms and their parent essential GO terms; No. of terms: the total number of explicit GO terms in a particular class.

| Class | CC | Essential term | Child terms (relationship) | No. of terms |
|---|---|---|---|---|
| 1 | CEL | GO:0005618 | GO:0009274 (is a), GO:0009277 (is a), GO:0009505 (is a), GO:0031160 (is a) | 5 |
| 2 | CEN | GO:0005814 | None | 1 |
| 3 | CHL | GO:0009507 | None | 1 |
| 4 | CYA | GO:0009842 | GO:0034060 (part of) | 2 |
| 5 | CYT | GO:0005737 | GO:0016528 (is a), GO:0044444 (part of) | 3 |
| 6 | CYK | GO:0005856 | GO:0001533 (is a), GO:0030863 (is a), GO:0015629 (is a), GO:0015630 (is a), GO:0045111 (is a), GO:0044430 (part of) | 7 |
| 7 | ER | GO:0005783 | GO:0005791 (is a), GO:0044432 (part of) | 3 |
| 8 | EXT | GO:0030198 | None | 1 |
| 9 | GOL | GO:0005794 | None | 1 |
| 10 | LYS | GO:0005764 | GO:0042629 (is a), GO:0005765 (part of), GO:0043202 (part of) | 4 |
| 11 | MIT | GO:0005739 | None | 1 |
| 12 | NUC | GO:0005634 | GO:0043073 (is a), GO:0045120 (is a), GO:0044428 (part of) | 4 |
| 13 | PER | GO:0005777 | GO:0020015 (is a), GO:0009514 (is a) | 3 |
| 14 | PM | GO:0005886 | GO:0042383 (is a), GO:0044459 (part of) | 3 |
| 15 | PLA | GO:0009536 | GO:0009501 (is a), GO:0009507 (is a), GO:0009509 (is a), GO:0009513 (is a), GO:0009842 (is a) | 6 |
| 16 | VAC | GO:0005773 | GO:0000322 (is a), GO:0000323 (is a), GO:0005776 (is a) | 4 |
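
The single-label procedure of Section 3.1.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in our experiments: the names `ESSENTIAL`, `CHILD`, and `lookup_single` are hypothetical, only three classes of Table 3.1 are included, and reporting a protein as “ambiguous” when its terms map to more than one class is an assumption made here, since the procedure above only defines the case of a unique match.

```python
# Subset of Table 3.1 (three of the 16 classes); other rows are added the same way.
ESSENTIAL = {
    "GO:0005618": "cell wall",
    "GO:0005737": "cytoplasm",
    "GO:0005634": "nucleus",
}
CHILD = {
    "GO:0009274": "cell wall",  # is a
    "GO:0044444": "cytoplasm",  # part of
    "GO:0043073": "nucleus",    # is a
}


def lookup_single(go_terms):
    """Return (location, status) for the GO-term set of one query protein.

    Essential terms are checked first; child terms are used only when
    no essential term matches. A prediction is made only when the
    matched terms agree on exactly one class (assumption, see lead-in).
    """
    terms = set(go_terms)
    for status, table in (("essential", ESSENTIAL), ("child", CHILD)):
        classes = {table[t] for t in terms if t in table}
        if len(classes) == 1:
            return classes.pop(), status
        if len(classes) > 1:
            return None, "ambiguous"
    return None, "no explicit term"


print(lookup_single({"GO:0005618"}))                # ('cell wall', 'essential')
print(lookup_single({"GO:0009274"}))                # ('cell wall', 'child')
print(lookup_single({"GO:0005618", "GO:0005634"}))  # (None, 'ambiguous')
```

The last call shows the failure mode discussed in Section 3.1.3: a protein whose explicit GO terms span two classes cannot be assigned a single location by lookup alone.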

3.1.2 Table-lookup procedure for multi-label prediction

To exemplify the above discussion for the multi-location case, we created a lookup table (Table 3.2) and developed a table-lookup procedure to predict the subcellular localization of the proteins in the virus dataset (see Table 8.5a). Similar to the single-location case, Table 3.2 has two types of GO terms: essential GO terms and child GO terms. As the name implies, the essential GO terms [99] are GO terms that are essential or critical for the subcellular localization prediction. In addition to the essential GO terms, their direct descendants (known as child terms) also possess direct localization information. The relationships between child terms and their parent terms include “is a”, “part of”, and “occurs in” [126]. The former two correspond to cellular component GO terms and the third one typically corresponds to biological process GO terms. As we are more interested in cellular component GO terms, the “occurs in” relationship will not be considered. For ease of reference, we refer to both essential GO terms and their child terms as “explicit GO terms”.

Table 3.2: Explicit GO terms for the virus dataset. Explicit GO terms include essential GO terms and their child terms. The definition of essential GO terms can be found in [99]. Here the relationship includes “is a” and “part of” only, because only cellular component GO terms are analyzed here. CC: cellular components, including viral capsid (VC), host cell membrane (HCM), host endoplasmic reticulum (HER), host cytoplasm (HCYT), host nucleus (HNUC) and secreted (SEC); relationship: the relationship between child terms and their parent essential GO terms; No. of terms: the total number of explicit GO terms in a particular class.

| Class | CC | Essential term | Child terms (relationship) | No. of terms |
|---|---|---|---|---|
| 1 | VC | GO:0019028 | GO:0046727 (part of), GO:0046798 (part of), GO:0046806 (part of), GO:0019013 (part of), GO:0019029 (is a), GO:0019030 (is a) | 7 |
| 2 | HCM | GO:0033644 | GO:0044155 (part of), GO:0044084 (part of), GO:0044385 (part of), GO:0044160 (is a), GO:0044162 (is a), GO:0085037 (is a), GO:0085042 (is a), GO:0085039 (is a), GO:0020002 (is a), GO:0044167 (is a), GO:0044173 (is a), GO:0044175 (is a), GO:0044178 (is a), GO:0044384 (is a), GO:0033645 (is a), GO:0044231 (is a), GO:0044188 (is a), GO:0044191 (is a), GO:0044200 (is a) | 20 |
| 3 | HER | GO:0044165 | GO:0044166 (part of), GO:0044167 (part of), GO:0044168 (is a), GO:0044170 (is a) | 5 |
| 4 | HCYT | GO:0030430 | GO:0033655 (part of) | 2 |
| 5 | HNUC | GO:0042025 | GO:0044094 (part of) | 2 |
| 6 | SEC | GO:0005576 | GO:0048046 (is a), GO:0044421 (part of) | 3 |

For each class in Table 3.2, the child terms were obtained by presenting the corresponding essential GO term to the QuickGO server [17]. In our method, we have more than 300 relevant GO terms for the virus dataset. Even for such a small number of explicit GO terms, many proteins have explicit GO terms spanning several classes.

Given a query sequence, we first obtain its “GO-term” set from the GO Annotation Database. Then, if one (or more than one) of the terms in this set matches an essential GO term in Table 3.2, the subcellular location set of this query protein is predicted to be the one (or the ones) corresponding to the matched GO term(s). For example, if the set of GO terms contains GO:0019028, then this query protein is predicted as “viral capsid”; or if the set of GO terms contains both GO:0030430 and GO:0042025, then this query protein is predicted as “host cytoplasm” and “host nucleus”. Further, if none of the terms in this set matches an essential GO term but one (or more than one) of them matches a child term in Table 3.2, then the query protein is predicted as belonging to the class(es) associated with the matched child GO term(s). For example, if no essential GO terms can be found in the set but GO:0019030 is found, then the query protein is predicted as “viral capsid”; or if GO:0044155, GO:0044166 and GO:0033655 are found, then the query protein is predicted as “host cell membrane”, “host endoplasmic reticulum”, and “host cytoplasm”.
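
The multi-label procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ESSENTIAL_VIRUS`, `CHILD_VIRUS`, and `lookup_multi` are hypothetical, and only one child term per class of Table 3.2 is included.

```python
# Essential terms of Table 3.2 and one child term per class (subset).
ESSENTIAL_VIRUS = {
    "GO:0019028": "viral capsid",
    "GO:0033644": "host cell membrane",
    "GO:0044165": "host endoplasmic reticulum",
    "GO:0030430": "host cytoplasm",
    "GO:0042025": "host nucleus",
    "GO:0005576": "secreted",
}
CHILD_VIRUS = {
    "GO:0019030": "viral capsid",                # is a
    "GO:0044155": "host cell membrane",          # part of
    "GO:0044166": "host endoplasmic reticulum",  # part of
    "GO:0033655": "host cytoplasm",              # part of
    "GO:0044094": "host nucleus",                # part of
    "GO:0048046": "secreted",                    # is a
}


def lookup_multi(go_terms):
    """Return the set of predicted locations for one query protein.

    All classes matched by essential terms are returned; child terms
    are consulted only when no essential term matches. An empty set
    means that the protein has no explicit GO term.
    """
    terms = set(go_terms)
    hits = {ESSENTIAL_VIRUS[t] for t in terms if t in ESSENTIAL_VIRUS}
    if hits:
        return hits
    return {CHILD_VIRUS[t] for t in terms if t in CHILD_VIRUS}


# The examples given in the text:
print(sorted(lookup_multi({"GO:0030430", "GO:0042025"})))
print(sorted(lookup_multi({"GO:0044155", "GO:0044166", "GO:0033655"})))
```

Note that GO:0044167 is listed under two classes in Table 3.2 (HCM and HER), so a full version of the child table would need to map each term to a set of classes rather than to a single class.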

3.1.3 Problems of table lookup

A major problem of this table-lookup procedure is that the GO terms of a query protein may contain several essential GO terms and/or child terms that span more classes than the number of true subcellular locations, causing over-prediction or inconsistent classification decisions. For example, in the EU16 single-location dataset, 713 (out of 2423) proteins have explicit GO terms that map to more than one class, and 513 (out of 2423) proteins do not have any explicit GO terms. This means that about 51% (1226/2423) of the proteins in the dataset cannot be predicted using only explicit GO terms; only 1197 (49%) of the 2423 proteins have explicit GO terms that map to a unique (consistent) subcellular location. In the multi-label virus dataset, 121 (out of 207) proteins have explicit GO terms that map to one location, and 69, 14, and 3 proteins have explicit GO terms that map to two, three, and four locations, respectively. Comparing with the true locations, a total of 139 proteins have explicit GO terms that are consistent with their true locations, of which 107 are single-label proteins, 30 are two-label proteins, and 2 are three-label proteins. This means that only about 67% (139/207) of the proteins are likely to be predicted correctly.
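
The percentages quoted above follow directly from the counts. The short check below reproduces the arithmetic; the variable names are illustrative only, and all counts are taken from the paragraph above.

```python
# EU16 single-location dataset
eu16_total = 2423
eu16_multi_class = 713   # explicit GO terms map to more than one class
eu16_no_explicit = 513   # no explicit GO terms at all
eu16_unpredictable = eu16_multi_class + eu16_no_explicit
eu16_unique = eu16_total - eu16_unpredictable

# Virus multi-label dataset: number of mapped locations -> number of proteins
virus_by_num_locations = {1: 121, 2: 69, 3: 14, 4: 3}
virus_total = sum(virus_by_num_locations.values())
virus_consistent = 107 + 30 + 2  # single-, two-, and three-label proteins


def pct(part, whole):
    """Percentage of `part` in `whole`."""
    return 100.0 * part / whole


print(f"EU16 not predictable: {eu16_unpredictable}/{eu16_total} "
      f"= {pct(eu16_unpredictable, eu16_total):.1f}%")
print(f"EU16 unique mapping:  {eu16_unique}/{eu16_total} "
      f"= {pct(eu16_unique, eu16_total):.1f}%")
print(f"Virus consistent:     {virus_consistent}/{virus_total} "
      f"= {pct(virus_consistent, virus_total):.1f}%")
```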

Note that this table-lookup procedure only incorporates the explicit GO terms. If more cellular component GO terms, or even GO terms from the other two ontologies, are used to infer the subcellular locations, more proteins are likely to be over-predicted. This analysis suggests that direct table lookup is not a desirable approach, which motivates us to develop machine-learning methods for GO-based subcellular localization prediction.

3.2 Using only cellular component GO terms?

Some people dispute the effectiveness of GO-based methods by claiming that only cellular component GO terms are necessary and that GO terms in the other two categories play no role in determining the subcellular localization. They argue that cellular component GO terms are directly associated with the labels to be predicted, and that only these GO terms are useful for determining protein subcellular localization.

This concern has been explicitly and directly addressed by Lu and Hunter [128], who demonstrated that GO molecular function terms are also predictive of subcellular localization, particularly for nucleus, extracellular space, membrane, mitochondrion, endoplasmic reticulum, and Golgi apparatus. Their in-depth analysis of the correlation between the molecular function GO terms and localization provides an explanation of why GO-based methods outperform sequence-based methods. Mei et al. [142] also conducted extensive experiments on the MultiLoc [92], BaCelLo [167], and Euk-mPLoc [47] datasets to show that not only do cellular component GO terms play significant roles in estimating the kernel weights of the proposed classifier and in training the prediction model, but GO terms in the molecular function and biological process categories also make considerable contributions to the final predictions. This is understandable: although GO terms in molecular functions and biological processes have no direct implications for protein subcellular localization, proteins can only properly exert their functions in particular physiological contexts and participate in certain biological processes within amenable cellular compartments. Therefore, it is logically acceptable that all categories of GO terms should be considered for accurate prediction of protein subcellular localization.

3.3 Equivalent to homologous transfer?

Even though GO-based methods can predict novel proteins based on the GO information obtained from their homologous proteins [213, 215], some researchers still argue that the prediction is equivalent to using the annotated localization of the homologs (i.e. using BLAST [5] with homologous transfer). They argue that GO-based methods are in fact equivalent to mining the annotations of homologous proteins retrieved by BLAST, whose subcellular localization information is well annotated or experimentally determined. In other words, they consider GO-based methods as having nothing to do with machine learning; instead, they think these methods simply assign the subcellular localization information of the homologs to the target proteins.

This claim is shown to be untenable in Table 9.4 of Chapter 9, which demonstrates that GO-based methods remarkably outperform methods that only use BLAST and homologous transfer. More details of the procedures can be found in Section 9.1.4 of Chapter 9. In addition, Briesemeister et al. [24] also suggested that using BLAST alone is not sufficient for reliable prediction.
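
For concreteness, the homologous-transfer baseline that the critics have in mind can be sketched as below. This is only a generic illustration, not the procedure of Section 9.1.4: the function name `homologous_transfer`, the input format (a list of `(accession, e_value)` pairs assumed to come from a BLAST search), the e-value cutoff, and the placeholder accessions are all assumptions made here.

```python
def homologous_transfer(hits, known_locations, max_evalue=1e-3):
    """Copy the annotated location(s) of the best annotated homolog.

    hits:            iterable of (accession, e_value) pairs, assumed to
                     come from a BLAST search (lower e-value = better).
    known_locations: dict mapping accession -> set of annotated locations.
    max_evalue:      hypothetical cutoff; weaker hits are ignored.
    Returns an empty set when no acceptable annotated homolog exists.
    """
    for accession, e_value in sorted(hits, key=lambda hit: hit[1]):
        if e_value > max_evalue:
            break  # remaining hits are weaker still
        if accession in known_locations:
            return set(known_locations[accession])
    return set()


# Placeholder accessions, for illustration only.
annotations = {"HOMOLOG_A": {"nucleus"}, "HOMOLOG_B": {"cytoplasm"}}
blast_hits = [("HOMOLOG_B", 1e-5), ("HOMOLOG_A", 1e-30)]
print(homologous_transfer(blast_hits, annotations))  # {'nucleus'}
```

No model is trained in such a baseline: the prediction is a direct copy of a homolog's annotation, which is the behavior critics attribute to GO-based methods.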

3.4 More reasons for using GO information

As suggested by Chou [38], as long as the input to a predictor is the sequence information of the query protein, without any GO annotation information, and the output is the subcellular localization information, there is no difference between non-GO-based methods and GO-based methods; both should be regarded as equally legitimate for subcellular localization prediction.

Some other papers [48, 222] also provide strong arguments supporting the legitimacy of using GO information for subcellular localization. In particular, as suggested by [48], the good performance of GO-based methods is due to the fact that the feature vectors of proteins in the GO space can better reflect their subcellular locations than those in the Euclidean space or any other simple geometric space.

