10 Properties of the proposed predictors

This chapter further discusses the advantages and disadvantages of the proposed predictors, especially the multi-label predictors. First, the impact of noisy data in the GOA Database on performance is discussed. Next, the chapter examines why the predictors proposed in this book improve multi-label classification (AD-SVM and mPLR-Loc), feature extraction (SS-Loc and HybridGO-Loc), and the finding of relevant subspaces (RP-SVM and R3P-Loc). Finally, a comprehensive comparison of all of the predictors is presented.

10.1 Noisy data in the GOA Database

As stated in Section 4.1.1, the GOA Database is constructed by various biological research communities around the world. It is therefore possible that the same protein is annotated by different contributing groups of the GO consortium. In this case, a protein that is involved in the same biological process, performs the same molecular function, or is located in the same cellular component may be annotated by different research groups using different or even contradictory GO terms, which may result in inaccurate or inconsistent GO annotations. In other words, there are inevitably some noisy data or outliers in the GOA Database. These noisy data and outliers may negatively affect the performance of machine-learning-based approaches.

Regarding this concern, we must first admit that noisy data and outliers are likely to exist in the GOA Database, and that unfortunately we cannot easily distinguish them from correct GO annotations. Only wet-lab experimentalists can use biological knowledge to identify them. However, we remain optimistic about our proposed predictors for the following reasons:

1. The GOA Database follows guidelines to ensure that its data are of high quality. First, GO annotations are classified as electronic annotations, literature annotations, and sequence-based annotations. Each annotation entry is labeled with an evidence code that indicates its source. For example, the evidence code “IEA” specifies that the corresponding GO annotation is inferred from electronic (computational) means, whereas “EXP” indicates that the GO annotation is inferred from biological experiments. This information may help users to distinguish between different kinds of annotations.

2. In this book, term-frequency information is used to emphasize those annotations that have been confirmed by different research groups. From our observations, the same GO term for annotating a protein may appear several times in the GOA Database, where each appearance may be associated with a different evidence code and may be extracted from a different contributing database. This means that such a GO term has been validated several times by different research groups and in different ways, all of which lead to the same annotation. On the contrary, if different research groups annotate the same protein with different GO terms whose information is contradictory, the frequencies of occurrences of these GO terms for this protein will be low. In other words, the more frequently a GO term is used for annotating a particular protein, the more times this GO annotation has been confirmed by different research groups, and the more credible the annotation is. By using term frequencies in the feature vectors, we can enhance the influence of those GO terms that appear more frequently, i.e. those that have been used consistently. Meanwhile, we can indirectly suppress the influence of those GO terms that appear less frequently, i.e. those whose information contradicts that of others.

3. The noisy data and outliers may exist in both the training and testing datasets, in which case their negative impact may be reduced. In our methods, we used homologous transfer to obtain the feature information for both training and testing proteins. Thus, if there are some noisy data and outliers in the GOA Database, it is possible that both the training and testing proteins contain them. In such a case, we conjecture that the noisy data and outliers may contribute to the final decision and, more interestingly, may improve the prediction performance instead of degrading it.

10.2 Analysis of single-label predictors

10.2.1 GOASVM vs FusionSVM

For single-location protein subcellular localization, this book proposes two GO-based predictors: GOASVM and FusionSVM. Although both predictors use GO information and advocate using term frequencies instead of 1-0 values for constructing GO vectors, there are clear differences between them. The differences can be viewed from the following perspectives.

  1. Retrieval of GO terms. GOASVM exploits the GO terms from the GOA database, while FusionSVM extracts the GO terms from a program called InterProScan.
  2. Searching GO terms. To guarantee that GOASVM is applicable to all of the proteins of interest, GOASVM adopts a successive-search strategy to incorporate remote yet useful homologous GO information for classification. On the other hand, FusionSVM only uses InterProScan to generate GO terms, which cannot guarantee that it is applicable to all proteins. Thus, FusionSVM needs to use other features as a back-up.
  3. Feature information. GOASVM only uses GO information as features, while FusionSVM uses both GO information and profile alignment information. The scores arising from these two kinds of features are combined to make final classification decisions.
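
The successive-search idea in item 2 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the exact procedure of Section 4.1.2: it assumes that the homologs of a query protein have already been ranked by similarity and that GO annotations are available as an in-memory dictionary. The function name `successive_search` and the toy accession numbers and annotations are hypothetical.

```python
def successive_search(ranked_accessions, go_annotations):
    """Return (accession, GO terms) for the first ranked entry that has GO terms.

    ranked_accessions: accession numbers ordered from the most to the least
                       similar homolog (assumed to be given).
    go_annotations:    dict mapping an accession number to its list of GO terms
                       (a stand-in for a lookup in the GOA Database).
    """
    for accession in ranked_accessions:
        terms = go_annotations.get(accession, [])
        if terms:  # stop at the first homolog that is annotated
            return accession, terms
    return None, []  # no annotated homolog: the GO vector would be null


# Toy data (hypothetical accessions; repeated terms mimic multiple entries).
toy_goa = {
    "P00002": [],
    "P00003": ["GO:0005634", "GO:0005634", "GO:0005737"],
}
ranked = ["P00001", "P00002", "P00003"]
hit, terms = successive_search(ranked, toy_goa)
print(hit, terms)
```

In this toy run the two top-ranked homologs have no annotations, so the search moves on to the third one, which is what allows a GO-only predictor to avoid a back-up feature.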

Besides the differences mentioned above, some other points are also worth noting. First, the numbers of subcellular locations in the datasets used for these two predictors are different. The number of subcellular locations in the eukaryotic datasets (i.e. EU16 and the novel eukaryotic dataset) for GOASVM is 16, while that for FusionSVM is only 11. Moreover, the degree of sequence similarity in the datasets is different. The sequence similarity in the datasets for evaluating GOASVM (EU16, HUM12, and the novel eukaryotic dataset) is cut off at 25%; on the other hand, the sequence similarity of the eukaryotic dataset for FusionSVM is only cut off at 50%. It is generally accepted that the more similar the proteins in the dataset of interest are, the more easily predictors can achieve high accuracy. Therefore, for the same predictor, the lower the sequence similarity cut-off, the lower the achievable accuracy. Nevertheless, even though the datasets for GOASVM are more stringent than that used for FusionSVM, GOASVM still achieves much better performance than FusionSVM (direct comparison results are not shown, but this conclusion can easily be drawn from Tables 9.3 and 9.8).

10.2.2 Can GOASVM be combined with PairProSVM?

From the experimental results in Chapter 9 and the analysis above, we can see that GOASVM performs better than FusionSVM. However, some people may wonder: if the fusion of the GO-based InterProScan and the profile-based PairProSVM can boost the prediction performance, is it possible to combine the GO-based GOASVM with PairProSVM to further boost the performance? The answer is no. This is because score-fusion methods can typically boost the performance only if the performances of the two methods of interest are comparable [108, 112], as is evident in Table 9.7, where InterProGOSVM and PairProSVM achieve comparable accuracies (72.21% and 77.05%). On the contrary, GOASVM remarkably outperforms PairProSVM, as is evident in Table 9.3, where the accuracies of GOASVM and PairProSVM are 94.55% and 54.52%, respectively. It is highly likely that fusing GOASVM and PairProSVM will perform worse than GOASVM alone. Therefore, it is unwise to combine the scores of GOASVM and PairProSVM to make predictions.
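
To make this argument concrete, a linear score fusion of two classifiers can be sketched as below. The fusion weight `w`, the function names, and the toy scores are assumptions made purely for illustration; the fusion rule actually used by FusionSVM is defined elsewhere in this book and is not reproduced here.

```python
def fuse_scores(go_scores, profile_scores, w=0.5):
    """Weighted sum of the per-class scores of two classifiers (0 <= w <= 1)."""
    if len(go_scores) != len(profile_scores):
        raise ValueError("both classifiers must score the same classes")
    return [w * g + (1.0 - w) * p for g, p in zip(go_scores, profile_scores)]


def predict(scores):
    """Index of the class with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores for three classes. The true class is assumed to be class 0:
# the stronger classifier is right, the weaker one is confidently wrong.
go = [0.9, -0.2, -0.7]
profile = [-1.5, 1.8, -0.3]
fused = fuse_scores(go, profile, w=0.5)
print(predict(go), predict(fused))
```

In this toy case the stronger classifier alone selects class 0, whereas the equally weighted fusion is dragged to class 1 by the weaker classifier. This is the kind of degradation that can be expected when the two fused methods differ greatly in accuracy.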

10.3 Advantages of mGOASVM

mGOASVM possesses several desirable properties which make it outperform Virus-mPLoc [189], iLoc-Virus [233], Plant-mPLoc [51], and iLoc-Plant [230].

10.3.1 GO-vector construction

Virus-mPLoc and Plant-mPLoc construct GO vectors by using 1-0 values to indicate the presence and absence of some predefined GO terms. This method is simple and logically plausible, but some information is inevitably lost because it quantizes the frequency of occurrences of GO terms to either 1 or 0. The GO vectors in iLoc-Virus and iLoc-Plant contain more information than those in Virus-mPLoc and Plant-mPLoc, because the former two consider not only the GO terms of the query protein but also the GO terms of its homologs. Specifically, instead of using 1-0 values, each element of the GO vectors in iLoc-Virus and iLoc-Plant represents the percentage of homologous proteins containing the corresponding GO term. However, this method ignores the fact that a GO term may be used to annotate the same protein multiple times under different entries in the GOA Database. On the contrary, mGOASVM uses the frequency of occurrences of GO terms to construct the GO vectors. Intuitively, this is because proteins of the same subcellular localization tend to be annotated by similar sets of GO terms. The advantage of using GO term frequencies as features is evident from their superior performance in Table 9.10.
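
The three construction schemes compared above can be sketched side by side. This is a simplified illustration rather than the implementation of any of the predictors: the function names, the three-term vocabulary, and the toy annotations are assumptions, and the term-frequency function only mirrors the counting idea behind equation (4.3).

```python
from collections import Counter


def binary_vector(go_terms, vocabulary):
    """1-0 scheme: 1 if the GO term is present, 0 otherwise."""
    present = set(go_terms)
    return [1 if term in present else 0 for term in vocabulary]


def homolog_fraction_vector(homolog_term_sets, vocabulary):
    """Fraction of homologs that contain each GO term."""
    n = len(homolog_term_sets)
    if n == 0:
        return [0.0] * len(vocabulary)
    return [sum(term in s for s in homolog_term_sets) / n for term in vocabulary]


def term_frequency_vector(go_terms, vocabulary):
    """Term-frequency scheme: number of occurrences of each GO term."""
    counts = Counter(go_terms)
    return [counts[term] for term in vocabulary]


# Toy example: one term is annotated three times under different entries.
vocabulary = ["GO:0005634", "GO:0005737", "GO:0005886"]
annotations = ["GO:0005634", "GO:0005634", "GO:0005634", "GO:0005737"]
homologs = [{"GO:0005634"}, {"GO:0005634", "GO:0005737"}]

print(binary_vector(annotations, vocabulary))       # [1, 1, 0]
print(homolog_fraction_vector(homologs, vocabulary))  # [1.0, 0.5, 0.0]
print(term_frequency_vector(annotations, vocabulary))  # [3, 1, 0]
```

Only the term-frequency vector retains the fact that the first term was annotated three times; the 1-0 scheme collapses this count to 1.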

10.3.2 GO subspace selection

To facilitate the sophisticated machine-learning approach for the multi-label problem, GO subspace selection is adopted. Unlike the traditional methods [51, 189, 230, 233], which use all of the GO terms in the GO annotation database to form the GO vector space, mGOASVM selects a relevant GO subspace by finding a set of distinct relevant GO terms. With the rapid growth of the GO database, the number of GO terms is also increasing. As of March 2011, the number of GO terms is 18,656, which means that without feature selection the GO vectors will have the dimension 18,656. This imposes a computational burden on the classifier, especially when leave-one-out cross validation is used for evaluation. There is no doubt that many of the GO terms in the full space are redundant, irrelevant, or even detrimental to prediction performance. By selecting a set of distinct GO terms to form a GO subspace, mGOASVM can reduce the irrelevant information and at the same time retain useful information. As can be seen from Table 9.13, for the virus dataset around 300 to 400 distinct GO terms are sufficient for good performance. Therefore, using GO subspace selection can tremendously speed up the prediction without compromising the performance.
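
A minimal sketch of GO subspace selection is given below. It assumes that the relevant subspace is simply the set of distinct GO terms found in the training proteins; the function names and the toy proteins are hypothetical, and details such as the ordering of the terms are arbitrary choices made for the example.

```python
from collections import Counter


def select_go_subspace(training_annotations):
    """Sorted list of the distinct GO terms that annotate the training proteins."""
    distinct = set()
    for terms in training_annotations.values():
        distinct.update(terms)
    return sorted(distinct)


def to_subspace_vector(go_terms, subspace):
    """Term-frequency vector restricted to the selected subspace."""
    counts = Counter(go_terms)
    return [counts[term] for term in subspace]


# Toy training set (hypothetical proteins and annotations).
training = {
    "prot_a": ["GO:0005634", "GO:0005634", "GO:0005737"],
    "prot_b": ["GO:0005737", "GO:0005886"],
}
subspace = select_go_subspace(training)

# A query term outside the subspace (here GO:0016020) is simply ignored.
query_terms = ["GO:0005886", "GO:0005886", "GO:0016020"]
print(len(subspace), to_subspace_vector(query_terms, subspace))
```

The dimension of the resulting vectors equals the number of distinct training terms (three in this toy case) rather than the total number of GO terms in the database.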

10.3.3 Capability of handling multi-label problems

An efficient way to handle multi-label problems is to first predict the number of labels for each sample, and then predict the specific label set for each sample according to the order of the scores. Let us compare mGOASVM with two other existing approaches:

– When predicting the number of subcellular locations for a query protein, iLoc-Virus and iLoc-Plant determine the number of labels of a query protein based on the number of labels of its nearest training sample. mGOASVM, on the contrary, determines the number of labels for a query protein by looking at the number of positive-class decisions among all of the one-vs-rest SVM classifiers. Therefore, the number of labels depends on the whole training set as opposed to the query protein’s nearest neighbor in the training set.

– As opposed to Virus-mPLoc and Plant-mPLoc, which require a predefined threshold, our mGOASVM adopts a machine-learning approach to solving the multi-label classification problem. The predicted class labels in mGOASVM are assigned based on the SVMs that produce positive responses to the query protein.

In summary, the superiority of mGOASVM in handling multi-label problems is evident in Tables 9.9–9.13.
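
The two ways of deciding the label set that are contrasted above can be sketched as follows. The sketch assumes that one score per subcellular location is already available. The fallback to the top-scoring class when no score is positive is an assumption added so that the function always returns at least one label; the function names and toy scores are hypothetical.

```python
def predict_labels_by_positive_scores(scores):
    """Label set = all classes whose one-vs-rest score is positive."""
    positive = [m for m, s in enumerate(scores) if s > 0]
    if positive:
        return positive
    # Assumption: if no classifier responds positively, keep the best class.
    return [max(range(len(scores)), key=scores.__getitem__)]


def predict_labels_by_neighbor_count(scores, neighbor_label_count):
    """Label set = the top-k classes, with k copied from the nearest
    training sample (the strategy attributed to iLoc-Virus and iLoc-Plant)."""
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:neighbor_label_count])


scores = [0.8, -0.3, 0.2, -1.1]
print(predict_labels_by_positive_scores(scores))    # [0, 2]
print(predict_labels_by_neighbor_count(scores, 1))  # [0]
```

In the first function the number of labels is a by-product of the classifiers' decisions, whereas in the second it is fixed in advance by a single neighbor; in this toy case a single-label neighbor forces the second location to be dropped.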

From the machine-learning perspective, prediction of multi-location proteins is a multi-label learning problem. Approaches to addressing this problem can be divided into two types: problem transformation and algorithm adaptation [203]. The multi-label KNN classifiers used in iLoc-Plant and iLoc-Virus belong to the first type, whereas our multi-label SVM classifier belongs to the second type. While our results show that multi-label SVMs perform better than multi-label KNN, further work needs to be done to compare these two types of approaches in the context of multi-label subcellular localization.

10.4 Analysis for HybridGO-Loc

10.4.1 Semantic similarity measures

For HybridGO-Loc, we compared three of the most common semantic similarity measures for subcellular localization: Lin’s measure [123], Jiang’s measure [103], and the relevance similarity measure [181]. In addition to these measures, many online tools are available for computing semantic similarity at the GO-term and gene-product levels [68, 118, 166, 237]. However, the measures provided by these tools are discrete, whereas the measures that we used are continuous. Research has shown that continuous measures are better than discrete measures in many applications [236].
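
For reference, the commonly cited forms of these three term-level measures can be sketched as below. The sketch assumes that the annotation probabilities of the two GO terms and of one chosen common ancestor are already known; the exact definitions used in this book are those of Chapter 6 and are not reproduced here. The function names and the way the ancestor is supplied by the caller are assumptions.

```python
import math


def lin_similarity(p_x, p_y, p_ancestor):
    """Lin: 2*log p(c) / (log p(x) + log p(y)), c being a common ancestor."""
    denom = math.log(p_x) + math.log(p_y)
    if denom == 0.0:  # both terms have probability 1 (e.g. the root term)
        return 1.0
    return 2.0 * math.log(p_ancestor) / denom


def jiang_similarity(p_x, p_y, p_ancestor):
    """Jiang: 1 / (1 + distance), distance = 2*log p(c) - log p(x) - log p(y)."""
    distance = 2.0 * math.log(p_ancestor) - math.log(p_x) - math.log(p_y)
    return 1.0 / (1.0 + distance)


def relevance_similarity(p_x, p_y, p_ancestor):
    """Relevance similarity: Lin's measure weighted by (1 - p(c))."""
    return lin_similarity(p_x, p_y, p_ancestor) * (1.0 - p_ancestor)


# Identical, fairly specific terms (the term is its own common ancestor).
print(lin_similarity(0.01, 0.01, 0.01))        # 1.0
print(jiang_similarity(0.01, 0.01, 0.01))      # 1.0
print(relevance_similarity(0.01, 0.01, 0.01))  # approximately 0.99
```

All three functions return real values between 0 and 1 when the ancestor is at least as probable as each term, which is the sense in which these measures are continuous.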

10.4.2 GO-frequency features vs SS features

Note that when hybridizing GO information, we do not replace the GO frequency vectors. Instead, we augment the GO frequency features with more sophisticated features, i.e. the GO SS vectors, which are combined with the GO frequency vectors. A GO frequency vector is found by counting the number of occurrences of every GO term in a set of distinct GO terms obtained from the training dataset, whereas an SS vector is constructed by computing the semantic similarity between a test protein and each of the training proteins at the gene-product level. That is, each element in an SS vector represents the semantic similarity of two GO-term groups. This can be easily seen from their definitions in equations (4.3) and (6.2)–(6.7).

The GO frequency vectors and the GO SS vectors are different in two fundamental ways:

  1. GO frequency vectors are more primitive in the sense that their elements are based on individual GO terms without considering the interterm relationship, i.e. the elements in a GO frequency vector are independent of each other.
  2. GO SS vectors are more sophisticated in the following two senses:
    1. Interterm relationship. SS vectors are based on interterm relationships. They are defined on a space in which each basis corresponds to one training protein, and the coordinate along that basis is defined by the semantic similarity between a testing protein and the corresponding training protein.
    2. Intergroup relationship. The pairwise relationships between a test protein and the training proteins are hierarchically structured. This is because each basis of the SS space depends on a group of GO terms of the corresponding training protein, and the terms are arranged in a hierarchical structure (parent-child relationship). Because the GO terms in different groups are not mutually exclusive, the bases in the SS space are not independent of each other.
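
The structure of an SS vector, with one element per training protein, can be sketched as follows. The gene-product-level similarity used here is a symmetric best-match average, which is one common choice and is only an assumption; it is not claimed to be the definition in equations (6.2)–(6.7). The toy term-level similarity, the term names, and the function names are likewise hypothetical.

```python
def group_similarity(terms_a, terms_b, term_sim):
    """Symmetric best-match average between two groups of GO terms."""
    if not terms_a or not terms_b:
        return 0.0
    a_to_b = sum(max(term_sim(x, y) for y in terms_b) for x in terms_a) / len(terms_a)
    b_to_a = sum(max(term_sim(y, x) for x in terms_a) for y in terms_b) / len(terms_b)
    return 0.5 * (a_to_b + b_to_a)


def ss_vector(query_terms, training_term_groups, term_sim):
    """One coordinate per training protein: similarity of query to that protein."""
    return [group_similarity(query_terms, g, term_sim) for g in training_term_groups]


# Toy term-level similarity: 1 for identical terms, 0.5 for terms that share
# a parent in a tiny hypothetical hierarchy, 0 otherwise.
toy_parent = {"t1": "p1", "t2": "p1", "t3": "p2"}


def toy_term_sim(x, y):
    if x == y:
        return 1.0
    parent_x = toy_parent.get(x)
    if parent_x is not None and parent_x == toy_parent.get(y):
        return 0.5
    return 0.0


training_groups = [["t1"], ["t2", "t3"], ["t3"]]
print(ss_vector(["t1"], training_groups, toy_term_sim))  # [1.0, 0.375, 0.0]
```

Unlike a GO frequency vector, the dimension of this vector equals the number of training proteins, and each coordinate depends on the hierarchy through the term-level similarity.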

10.4.3 Bias analysis

Except for the new plant dataset, we adopted LOOCV, which is considered to be the most rigorous and bias-free evaluation method [86], to examine the performance of HybridGO-Loc. Nevertheless, determining the set of distinct GO terms (see Section 4.1.3) from a dataset is by no means without bias, which may favor the LOOCV performance. This is because the set of distinct GO terms derived from a given dataset may not be representative of other datasets; in other words, the generalization capabilities of the predictors may be weakened when new GO terms outside this set are found in the test proteins.

However, we have adopted the following strategies to minimize the bias. First, the two multi-label benchmark datasets used for HybridGO-Loc were constructed based on the whole Swiss-Prot Database (although in different years), which to some extent incorporated all of the possible information regarding plant proteins or virus proteins in the database. In other words, the set of distinct GO terms was constructed based on all of the GO terms corresponding to the whole Swiss-Prot Database, which enables it to be representative of all of the distinct GO terms. Second, these two benchmark datasets were collected according to strict criteria (see Section 8.2.1), and the sequence similarity of both datasets was cut off at 25%, which enables us to use a small set of representative proteins to represent all of the proteins of the corresponding species (i.e. virus or plant) in the whole database. In other words, the set of distinct GO terms will vary from species to species, yet still be statistically representative of all of the useful GO terms for the corresponding species. Third, using this set for statistical performance evaluation is equivalent, or at least approximately equivalent, to using all of the distinct GO terms in the GOA Database. This is because other GO terms that do not correspond to the training proteins will not participate in training the linear SVMs, nor will they play essential roles in contributing to the final predictions. In other words, the generalization capabilities of HybridGO-Loc will not be weakened even if some new GO terms are found in the test proteins. A mathematical proof of this statement can be found in Appendix C.
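
The third point can be illustrated numerically. The sketch below is not the proof in Appendix C; it trains a linear SVM with a plain subgradient method, used here as a stand-in for the SVM solver employed in this book, on toy vectors whose last dimension is zero for every training sample. The trainer, its hyperparameters, and the toy data are assumptions.

```python
def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (weights start at 0)."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            violated = margin < 1
            for j in range(dim):
                grad = lam * w[j] - (label * x[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * label
    return w, b


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# The fourth dimension stands for a GO term that no training protein has.
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 0], [0, 1, 3, 0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)

test_without_new_term = [1, 1, 0, 0]
test_with_new_term = [1, 1, 0, 5]  # same protein plus an unseen GO term
print(w[3], score(w, b, test_without_new_term) == score(w, b, test_with_new_term))
```

Because every update to a weight is proportional either to the weight itself or to the corresponding training feature, the weight of a dimension that is zero in all training vectors never leaves zero, so an unseen GO term in a test protein does not change the score.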

One may argue that performance bias might arise when the whole set of distinct GO terms is used to construct the hybrid GO vectors for both training and testing during cross-validation. This is because, in each fold of the LOOCV, the training proteins and the singled-out test protein use the same set to construct the GO vectors, meaning that the SVM training algorithm can see some information about the test protein indirectly through the GO vector space defined by this set. It is possible that for a particular fold of LOOCV the GO terms of a test protein do not exist in any of the training proteins. However, we have mathematically proved that this bias will not exist during LOOCV (see Appendix C for a proof). Furthermore, the results of the independent tests (see Table 8.8), for which no such bias occurs, also strongly suggest that HybridGO-Loc outperforms other predictors by a large margin.

10.5 Analysis for RP-SVM

10.5.1 Legitimacy of using RP

As stated in [27], if R and qi in equation (7.1) satisfy the conditions of the basis pursuit theorem (i.e. both are sparse in a fixed basis), then qi can be reconstructed perfectly from a vector which lies in a lower-dimensional space.


Fig. 10.1: Histogram illustrating the distribution of the number of nonzero entries (sparseness) in the GO vectors with dimensionality 1541. The histogram is plotted up to 45 nonzero entries in the GO vectors because among the 978 proteins in the dataset, none of their GO vectors have more than 45 nonzero entries.

In fact, the GO vectors and our projection matrix R satisfy these conditions. Here we use the plant dataset (Table 8.6 in Chapter 8) as an example. There are 978 proteins distributed in 12 subcellular locations. After feature extraction, the dimension of the GO vectors is 1541. The distribution of the number of nonzero entries in the GO vectors is shown in Figure 10.1. As shown in the figure, the number of nonzero entries in the GO vectors tends to be small when compared to the dimension of the GO vectors, i.e. the vectors are sparse. Among the 978 proteins in the dataset, a majority have only 9 nonzero entries in their 1541-dimensional vectors, and the largest number of nonzero entries is only 45. These statistics suggest that the GO vectors qi in equation (7.1) are very sparse. Therefore, according to [27], RP is well suited to reducing the dimension of GO vectors.

10.5.2 Ensemble random projection for robust performance

Since R in equation (7.1) is a random matrix, the scores in equation (7.3) will be different for each application of RP. To construct a robust classifier, we fused the scores of several applications of RP and obtained an ensemble classifier, which was specified in equation (7.5). In fact, the performance achieved by a single application of random projection, or single RP, varies considerably, as is evident in Section 9.7.3. Therefore, single RP was not conducive to the final prediction. However, by combining several applications of RP, RP-SVM can outperform mGOASVM if the number of applications of RP is large enough and the projected dimensionality is no less than a certain value. These results demonstrate the significance of ensemble RP for boosting the final performance of RP-SVM. The results also reveal that there is a tradeoff between the number of applications of RP and the projected dimensionality in guaranteeing an improved performance. It is also evident that RP can be easily applied to other methods, such as R3P-Loc in Section 7.3.
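
The variability of single RP and the stabilizing effect of fusing several applications can be illustrated with a small simulation. The sketch makes several assumptions that should not be read as the definitions in equations (7.1), (7.3), and (7.5): the projection matrix has Gaussian entries scaled by the inverse square root of the projected dimension, the per-projection score is an inner product with a class prototype (a stand-in for an SVM score), and the fusion is a plain average. All names and toy vectors are hypothetical.

```python
import random


def random_matrix(rows, cols, rng):
    """Gaussian random projection matrix scaled by 1/sqrt(rows)."""
    scale = 1.0 / (rows ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]


def project(R, sparse_vec):
    """Project a sparse vector given as {index: value}."""
    return [sum(row[j] * v for j, v in sparse_vec.items()) for row in R]


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def ensemble_scores(query, prototype, dim_in, dim_out, n_projections, seed=0):
    """Per-projection scores and their average over several random projections."""
    rng = random.Random(seed)
    singles = []
    for _ in range(n_projections):
        R = random_matrix(dim_out, dim_in, rng)  # a fresh R for each application
        singles.append(inner(project(R, query), project(R, prototype)))
    return singles, sum(singles) / len(singles)


# Two sparse 1541-dimensional vectors with 9 nonzero entries each, 6 shared,
# so their inner product in the original space is 6.
query = {j: 1.0 for j in range(0, 9)}
prototype = {j: 1.0 for j in range(3, 12)}
singles, fused = ensemble_scores(query, prototype, 1541, 100, 10)
print(min(singles), max(singles), fused)
```

The individual scores scatter around the original-space value, whereas their average fluctuates less. This mirrors, in a much simplified setting, why fusing several applications of RP gives more robust decisions than a single application.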

Nevertheless, some interesting questions remain unanswered: (1) Is there a threshold for the projected dimensionality above which RP-SVM will always outperform mGOASVM, i.e. above which the performance of ensemble RP will always be superior to that of the original features? (2) How can this threshold be determined? (3) If the designated projected dimensionality is above the threshold, what is the minimum number of applications of RP needed to guarantee a better performance of RP-SVM? (4) Can the projected dimensionality and the number of applications of RP be optimized to achieve the best possible performance of RP-SVM? These are possible future directions for applying ensemble random projection to multi-label classification.

10.6 Comparing the proposed multi-label predictors

To further compare the advantages and disadvantages of all of the proposed multi-label predictors with those of state-of-the-art predictors, Table 10.1 summarizes five perspectives that contribute most to the superiority of the proposed predictors over the state-of-the-art predictors. These five perspectives are: (1) whether the predictor of interest uses term frequency, namely the term frequency GO-vector construction method (see equation (4.3) in Section 4.1.3); (2) whether the predictor of interest uses successive search, namely the new successive-search strategy to retrieve GO terms (see Section 4.1.2); (3) whether the predictor of interest does classification refinement, namely multi-label classifier improvement or trying other efficient multi-label classifiers; (4) whether the predictor of interest uses deeper features, i.e. GO semantic similarity features or hybridizing both GO frequency and semantic similarity features (see Chapter 6); and (5) whether the predictor of interest uses dimension reduction, i.e. ensemble random projection for improving the performance of predictors (see Chapter 7).

As can be seen from Table 10.1, all of the proposed predictors have adopted the new successive-search strategy to avoid null GO vectors. On the other hand, Virus-mPLoc, iLoc-Virus, Plant-mPLoc, iLoc-Plant, Euk-mPLoc 2.0, and iLoc-Euk need to use back-up methods whenever null GO vectors occur, which is likely to make the prediction performance poorer than that obtained by using only GO information. Among the proposed predictors, all except SS-Loc use term frequencies for constructing feature vectors. In terms of classifier refinement, AD-SVM and HybridGO-Loc use an adaptive decision scheme based on the multi-label SVM classifier used in mGOASVM, while mPLR-Loc and R3P-Loc use multi-label penalized logistic regression and ridge regression classifiers, respectively. For deeper features, HybridGO-Loc and SS-Loc exploit GO semantic similarity for classification. As for dimensionality reduction, RP-SVM and R3P-Loc adopt ensemble random projection to reduce the high dimensionality of feature vectors while at the same time boosting the prediction performance. As stated in Chapter 9, mining deeper into the GOA Database is probably the major contributor to the superior performance of HybridGO-Loc.

Table 10.1: Comparing the properties of the proposed multi-label predictors with state-of-the-art predictors: Virus-mPLoc [189], iLoc-Virus [233], Plant-mPLoc [51], iLoc-Plant [230], Euk-mPLoc 2.0 [49], and iLoc-Euk [52]. Term freq.: term-frequency GO-vector construction method (equation (4.3) in Section 4.1.3); Succ. search: the new successive-search strategy to retrieve GO terms (Section 4.1.2); Clas. refinement: multi-label classifier improvement or trying other efficient multi-label classifiers, compared to the baseline of multi-label SVM classifiers used in mGOASVM (Chapter 5); Deeper features: using deeper GO features, i.e. GO semantic similarity features or hybridizing both GO frequency and semantic similarity features (Chapter 6); Dim. reduction: using dimension-reduction methods, i.e. ensemble random projection for improving the performance of predictors (Chapter 7). “X” indicates that the predictor does not have the corresponding advantage; “✓” indicates that the predictor has the corresponding advantage.


10.7 Summary

This chapter presented further discussions on the advantages and limitations of the proposed multi-label predictors, including mGOASVM, AD-SVM, mPLR-Loc, SS-Loc, HybridGO-Loc, RP-SVM, and R3P-Loc. From the perspective of refinement of multi-label classifiers, mGOASVM, AD-SVM, and mPLR-Loc belong to the same category and share the same feature extraction method, but they have different multi-label classifiers. From the perspective of mining deeper into the GO information, SS-Loc and HybridGO-Loc exploit semantic similarity over GO information. From the perspective of dimension reduction, RP-SVM and R3P-Loc apply ensemble random projection to reduce dimensionality and boost the performance at the same time. Of all of the multi-label predictors, HybridGO-Loc performs the best, demonstrating that mining deeper into the GOA Database can significantly boost the prediction performance.

