9.6 Performance of HybridGO-Loc

9.6.1 Comparing different features

Figure 9.5a shows the performance of individual and hybridized GO features on the virus dataset based on leave-one-out cross validation (LOOCV). In the figure, SS1, SS2, and SS3 represent Lin’s, Jiang’s, and RS similarity measures, respectively. Hybrid1, Hybrid2, and Hybrid3 represent the hybridized features obtained from these measures. As can be seen, the hybrid features perform remarkably better than the individual features in terms of all six performance metrics. Specifically, the OAAs (the most stringent and objective metric) of all three hybrid features are at least 3% (absolute) higher than those of the individual features, which suggests that hybridizing the two types of features can significantly boost the prediction performance. Moreover, among the hybridized features, Hybrid2, which combines GO frequency features with Jiang’s GO SS features, outperforms Hybrid1 and Hybrid3. Another interesting observation is that although all of the individual GO SS features perform much worse than the GO frequency features, the three hybridized features still perform better than any of the individual features. This suggests that the GO frequency features and SS features are complementary to each other.

Similar conclusions can be drawn from the plant dataset shown in Figure 9.5b. However, a comparison between Figures 9.5a and 9.5b reveals that for the plant dataset the hybridized features outperform all of the individual features in terms of all metrics except OLA and recall, whereas for the virus dataset the hybridized features are superior in terms of all metrics. Nevertheless, the losses in these two metrics do not outweigh the significant improvement in the other metrics, especially OAA, which improves by around 3% (absolute) when hybridized features are used instead of individual features. Among the hybridized features, Hybrid2 outperforms Hybrid1 and Hybrid3 in terms of OLA, accuracy, recall, and F1-score, whereas Hybrid1 performs better than the others in terms of OAA and precision. These results demonstrate that the GO SS features obtained by Lin’s measure and Jiang’s measure are better candidates than those obtained by the RS measure for combining with the GO frequency features; however, there is no evidence suggesting which of the two is better. It is also interesting to see that the three individual GO SS features perform better than the GO frequency features, contrary to the results shown in Figure 9.5a.
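To make the trade-off between OAA and OLA concrete, the following is a minimal sketch of how the two metrics could be computed. The formal definitions are given in equations (8.9)–(8.15) and are not reproduced in this section, so the sketch assumes the usual readings: OAA counts a protein as correct only when its predicted label set matches the true label set exactly, and OLA counts correctly predicted (protein, location) pairs over all true pairs. The function names and the toy labels are hypothetical.

```python
def overall_actual_accuracy(predicted, actual):
    """Fraction of proteins whose predicted label set equals the true set exactly."""
    exact = sum(1 for p, a in zip(predicted, actual) if set(p) == set(a))
    return exact / len(actual)


def overall_locative_accuracy(predicted, actual):
    """Fraction of true (protein, location) pairs that are correctly predicted."""
    hits = sum(len(set(p) & set(a)) for p, a in zip(predicted, actual))
    total = sum(len(set(a)) for a in actual)
    return hits / total


# Toy example with made-up location labels for three proteins.
actual = [{1}, {2, 3}, {4}]
predicted = [{1}, {2}, {4, 5}]   # one under-prediction, one over-prediction

oaa = overall_actual_accuracy(predicted, actual)    # only the first protein matches exactly
ola = overall_locative_accuracy(predicted, actual)  # 3 of the 4 true locations are hit
print(oaa, ola)
```

In this toy case the over-predicted third protein still counts fully towards OLA but not towards OAA, which is one way a predictor can gain OLA and recall while losing OAA.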


Fig. 9.5: Performance of the hybrid features and individual features on the (a) virus and (b) plant datasets. Freq: GO frequency features; SS1, SS2, and SS3: GO semantic similarity features by using Lin’s measure [123], Jiang’s measure [103] and RS measure [181], respectively; Hybrid1, Hybrid2, and Hybrid3: GO hybrid features by combining GO frequency features with GO semantic similarity features based on SS1, SS2, and SS3, respectively.

9.7 Performance of RP-SVM

9.7.1 Performance of ensemble random projection

Figure 9.6a shows the performance of RP-SVM and RP-AD-SVM for different feature dimensions based on leave-one-out cross-validation on the virus dataset. The blue and black dotted lines represent the performance of mGOASVM [213] and AD-SVM [214], respectively. In other words, these two horizontal lines represent the original performance without dimension reduction for the two decision schemes. The dimensionality of the original feature vectors is 331. As can be seen, for dimensions between 50 and 300, RP-SVM performs better than mGOASVM, which demonstrates that RP can boost the classification performance even when the dimension is only about one-sixth (50/331) of the original. This suggests that the original feature vectors do contain irrelevant or redundant information. Figure 9.6a also shows that the performance of RP-AD-SVM is equivalent to, or better than, that of AD-SVM when the dimensionality is larger than 100. This result demonstrates that random projection is complementary to the adaptive decision scheme. Similar conclusions can be drawn from the plant dataset shown in Figure 9.6b. Figures 9.6a and 9.6b reveal that for the plant dataset, RP-AD-SVM outperforms AD-SVM for a wide range of feature dimensions (200 to 1541), whereas for the virus dataset, the former outperforms the latter over a much narrower range (100 to 300). This suggests that RP-AD-SVM is more robust in classifying plant proteins than in classifying virus proteins.
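As an illustration of the dimension reduction discussed above, the following sketch projects 331-dimensional vectors down to 50 dimensions with a random matrix. It is not the construction used by RP-SVM: the Gaussian entries, the scaling, and the random stand-in data are all assumptions made for the example.

```python
import numpy as np


def random_projection_matrix(original_dim, target_dim, rng):
    """Gaussian random-projection matrix (an assumed construction)."""
    # Dividing by sqrt(target_dim) keeps squared norms unchanged in expectation.
    return rng.standard_normal((original_dim, target_dim)) / np.sqrt(target_dim)


rng = np.random.default_rng(0)
X = rng.random((20, 331))                    # hypothetical stand-in for 331-dim GO vectors
R = random_projection_matrix(331, 50, rng)   # data-independent projection
X_low = X @ R                                # 20 proteins, 50 features each
print(X_low.shape)
```

Because the matrix is drawn without looking at the data, a different draw gives a different low-dimensional view of the same proteins, which is what makes an ensemble of projections possible.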

9.7.2 Comparison with other dimension-reduction methods

Figures 9.7a and 9.7b compare RP-SVM with other dimension-reduction methods on the virus dataset and the plant dataset, respectively. Here, PCA-SVM and RFE-SVM denote the methods obtained by replacing RP with principal component analysis (PCA) and recursive feature elimination (RFE) [83], respectively. As can be seen, for the virus dataset, both RP-SVM and RFE-SVM perform better than mGOASVM when the dimensionality is larger than 50, while PCA-SVM performs better than mGOASVM only when the dimensionality is larger than 100. This suggests that the former two methods are more robust than PCA-SVM. When the dimension is higher than 75, RP-SVM outperforms both RFE-SVM and PCA-SVM, although RFE-SVM performs the best when the dimension is 50. For the plant dataset, RP-SVM alone performs the best over a wide range of dimensions, while RFE-SVM and PCA-SVM perform poorly when the dimension is reduced to 200 (out of 1541).
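For contrast with random projection, the sketch below shows the PCA step that a PCA-based pipeline would substitute for RP. It is a generic SVD-based PCA on random stand-in data, not the implementation used in the experiments; the function name and the toy dimensions are hypothetical.

```python
import numpy as np


def pca_project(X, target_dim):
    """Project the rows of X onto the top principal directions of X."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:target_dim].T


rng = np.random.default_rng(0)
X = rng.random((30, 331))        # hypothetical stand-in for the feature vectors
Z = pca_project(X, 10)
variances = Z.var(axis=0)        # variance captured by each retained component
print(Z.shape)
```

Unlike RP, the projection here is estimated from the data, so it must be recomputed whenever the training set changes.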


Fig. 9.6: Performance of RP-SVM and RP-AD-SVM at different feature dimensions based on leave-one-out cross-validation (LOOCV) on (a) the virus dataset and (b) the plant dataset. The solid lines in both figures represent the performance of mGOASVM [213] and AD-SVM [214] on the two datasets, respectively.


Fig. 9.7: Comparing ensemble random projection with other dimension-reduction methods at different feature dimensions based on leave-one-out cross-validation (LOOCV) on (a) the virus dataset and (b) the plant dataset, respectively. The top black lines in both figures represent the performance of mGOASVM for the two datasets.

9.7.3 Performance of single random-projection

Figures 9.8a and 9.8b show the performance statistics of RP-SVM on the virus and the plant datasets, respectively, when the ensemble size (L in equation (7.5)) is fixed to 1, which we refer to as 1-RP-SVM for simplicity. We created ten 1-RP-SVM classifiers, each with a different RP matrix. The min OAA, max OAA, mean OAA, and median OAA represent the minimum, maximum, mean, and median OAA of these 10 classifiers. As can be seen, for both datasets even the max OAA is not always higher than that of mGOASVM, let alone the min, mean, or median OAA. This demonstrates that a single RP cannot guarantee that the original performance can be kept when the dimension is reduced. On the contrary, combining the effect of several RPs, as evidenced by Figure 9.6, can boost the performance to a level higher than any of the individual RPs.
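The difference between a single RP and an ensemble of RPs can be sketched as follows. The example assumes that the ensemble score is the equal-weight average of the scores of L classifiers, each trained on a different projection; equation (7.5) is not reproduced in this section, so this fusion rule is an assumption. A ridge-regularized least-squares scorer stands in for the SVM classifiers to keep the sketch self-contained, and all names and data are hypothetical.

```python
import numpy as np


def fit_linear_scorer(X, Y, ridge=1.0):
    """Stand-in classifier: ridge-regularized least squares, one score column per class."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)


def ensemble_rp_scores(X_train, Y_train, X_test, target_dim, ensemble_size, seed=0):
    """Average the class scores of `ensemble_size` scorers, one per random projection."""
    rng = np.random.default_rng(seed)
    total = np.zeros((X_test.shape[0], Y_train.shape[1]))
    for _ in range(ensemble_size):
        # Each ensemble member gets its own projection matrix.
        R = rng.standard_normal((X_train.shape[1], target_dim)) / np.sqrt(target_dim)
        W = fit_linear_scorer(X_train @ R, Y_train)
        total += (X_test @ R) @ W
    return total / ensemble_size


rng = np.random.default_rng(1)
X_train = rng.random((40, 331))                          # made-up training vectors
Y_train = (rng.random((40, 3)) > 0.5).astype(float)      # made-up multi-label targets
X_test = rng.random((5, 331))

single = ensemble_rp_scores(X_train, Y_train, X_test, target_dim=50, ensemble_size=1)
fused = ensemble_rp_scores(X_train, Y_train, X_test, target_dim=50, ensemble_size=10)
print(single.shape, fused.shape)
```

Setting `ensemble_size=1` corresponds to the single-projection case, whose scores depend entirely on one draw of the projection matrix; averaging over several draws reduces that dependence.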

9.7.4 Effect of dimensions and ensemble size

As an individual RP cannot guarantee good performance, it is reasonable to ask: how many applications of RP are needed, at a minimum, to guarantee that the performance of the ensemble classifier is equivalent to, or even better than, that of the one without RP (i.e. mGOASVM)? Figures 9.9a and 9.9b show the performance of RP-SVM for different dimensions and different ensemble sizes of RPs on the virus and plant datasets, respectively. The blue/red areas represent the conditions under which RP-SVM performs better/worse than mGOASVM. The yellow dotted planes in both figures represent the performance of mGOASVM on the two datasets. As can be seen, in the virus dataset, for dimensionality between 75 and 300, RP-SVM with at least three applications of RP performs better than mGOASVM; for dimensionality 50, we need at least eight applications of RP to guarantee that the performance will not deteriorate. In the plant dataset, for dimensionality from 300 to 1400, RP-SVM with at least four applications of RP can outperform mGOASVM; for dimensionality 200, we need at least five applications of RP to obtain a performance better than mGOASVM. These results suggest that the proposed RP-SVM is very robust because three or four applications of RP are sufficient to achieve good performance over most of the dimensions tested.
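The question of how many applications of RP are needed can be phrased as a small search over measured accuracies. The sketch below assumes that, for one fixed dimension, the OAA for each ensemble size is available in a dictionary; the helper name and all numbers are hypothetical and are not taken from Figure 9.9.

```python
def min_ensemble_size(oaa_by_size, baseline):
    """Smallest ensemble size L such that every tested size >= L matches or beats the baseline."""
    answer = None
    for size in sorted(oaa_by_size, reverse=True):
        if oaa_by_size[size] >= baseline:
            answer = size
        else:
            break  # a smaller size already fails, so stop here
    return answer


# Made-up OAA values for one projected dimension.
oaa_by_size = {1: 0.60, 2: 0.68, 3: 0.72, 4: 0.71, 5: 0.74}
baseline_oaa = 0.70                      # made-up OAA of the classifier without RP

needed = min_ensemble_size(oaa_by_size, baseline_oaa)
print(needed)
```

Scanning from the largest size downwards matches the wording "at least L applications": the returned size is one above the largest tested size that falls below the baseline.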

9.8 Performance of R3P-Loc

9.8.1 Performance on the compact databases

Table 9.15 compares the subcellular localization performance of R3P-Loc under two different configurations. The column “Swiss-Prot + GOA” shows the performance when Swiss-Prot and the GOA Database were used as the data sources for BLAST search and GO terms retrieval in Figure 7.4, whereas the column “ProSeq + ProSeq-GO” shows the performance when the proposed compact databases were used instead. As can be seen, the performances of the two configurations are almost the same, which clearly suggests that the compact databases can be used in place of the large Swiss-Prot and GOA databases.


Fig. 9.8: Performance of 1-RP-SVM at different feature dimensions based on leave-one-out cross-validation (LOOCV) on (a) the virus dataset and (b) the plant dataset, respectively. The black lines with circles in both figures represent the performance of mGOASVM on the two datasets. 1-RP-SVM: RP-SVM with an ensemble size of 1.


Fig. 9.9: Performance of RP-SVM at different feature dimensions and different ensemble sizes of random projection based on leave-one-out cross-validation (LOOCV) on (a) the virus dataset and (b) the plant dataset. The blue (red) areas represent the conditions for which RP-SVM outperforms (is inferior to) mGOASVM. The yellow dotted planes in both figures represent the performance of mGOASVM on the two datasets. Ensemble Size: number of applications of random projections for constructing the ensemble classifier.

Table 9.15: Performance of R3P-Loc on the proposed compact databases based on the leave-one-out cross validation (LOOCV) using the eukaryotic dataset. SCL: subcellular location; ER: endoplasmic reticulum; SPI: spindle pole body; OAA: overall actual accuracy; OLA: overall locative accuracy; F1: F1-score; HL: Hamming loss; memory requirement: memory required for loading the GO-term database; No. of database entries: number of entries in the corresponding GO-term database; No. of distinct GO terms: number of distinct GO terms found by using the corresponding GO-term database.

Label  SCL                LOOCV locative accuracy (LA)
                          Swiss-Prot + GOA      ProSeq + ProSeq-GO
 1     Acrosome           2/14 = 0.143          2/14 = 0.143
 2     Cell membrane      523/697 = 0.750       525/697 = 0.753
 3     Cell wall          46/49 = 0.939         45/49 = 0.918
 4     Centrosome         65/96 = 0.677         65/96 = 0.677
 5     Chloroplast        375/385 = 0.974       375/385 = 0.974
 6     Cyanelle           79/79 = 1.000         79/79 = 1.000
 7     Cytoplasm          1964/2186 = 0.898     1960/2186 = 0.897
 8     Cytoskeleton       50/139 = 0.360        53/139 = 0.381
 9     ER                 424/457 = 0.928       426/457 = 0.932
10     Endosome           12/41 = 0.293         12/41 = 0.293
11     Extracellular      968/1048 = 0.924      969/1048 = 0.925
12     Golgi apparatus    209/254 = 0.823       208/254 = 0.819
13     Hydrogenosome      10/10 = 1.000         10/10 = 1.000
14     Lysosome           47/57 = 0.825         47/57 = 0.825
15     Melanosome         9/47 = 0.192          10/47 = 0.213
16     Microsome          1/13 = 0.077          1/13 = 0.077
17     Mitochondrion      575/610 = 0.943       576/610 = 0.944
18     Nucleus            2169/2320 = 0.935     2157/2320 = 0.930
19     Peroxisome         103/110 = 0.936       104/110 = 0.946
20     SPI                47/68 = 0.691         42/68 = 0.618
21     Synapse            26/47 = 0.553         26/47 = 0.553
22     Vacuole            157/170 = 0.924       156/170 = 0.918

OAA                       6191/7766 = 0.797     6201/7766 = 0.799
OLA                       7861/8897 = 0.884     7848/8897 = 0.882
Accuracy                  0.859                 0.859
Precision                 0.882                 0.882
Recall                    0.899                 0.898
F1                        0.880                 0.880
HL                        0.013                 0.013

Memory requirement        22.5G                 0.6G
Time                      0.57s                 0.13s
No. of database entries   25.4 million          0.5 million
No. of distinct GO terms  10808                 10775

The bottom panel of Table 9.15 compares the implementation requirements and the number of distinct GO terms (dimension of GOA vectors) of R3P-Loc under the two configurations. In order to retrieve the GO terms in constant time (i.e. complexity O(1)) regardless of the database size, the AC to GO-terms mapping was implemented as a hash table in memory. This instantaneous retrieval, however, comes with a price: the hash table consumes a considerable amount of memory when the database size increases. Specifically, to load the whole GOA database released in March 2011, only 15 gigabytes of memory are required; the memory consumption rapidly increases to 22.5 gigabytes if the GOA Database released in July 2013 is loaded. The main reason is that this release of the GOA Database contains 25 million entries. However, as shown in Table 9.15, the number of entries reduces to half a million if ProSeq-GO is used instead, which amounts to a reduction in memory consumption by 39 times. The small number of AC entries in ProSeq-GO results in a small memory footprint. Despite the small number of entries, the number of distinct GO terms in this compact GO database is almost the same as that in the big GOA Database. This explains why using ProSeq and ProSeq-GO can achieve almost the same performance as using Swiss-Prot and the original GOA Database.
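The constant-time retrieval described above can be illustrated with an in-memory hash table. The sketch below uses a Python dictionary keyed by accession number; it is an illustration of the idea rather than the implementation used in R3P-Loc, and the accession numbers in the example are made up.

```python
from collections import defaultdict


def build_ac_to_go(entries):
    """Build an in-memory hash table mapping each AC to its set of GO terms."""
    table = defaultdict(set)
    for ac, go_term in entries:
        table[ac].add(go_term)
    return dict(table)


# Made-up (AC, GO term) pairs standing in for rows of a GO annotation database.
entries = [
    ("X00001", "GO:0005737"),
    ("X00001", "GO:0005634"),
    ("X00002", "GO:0005886"),
]
ac_to_go = build_ac_to_go(entries)

terms = ac_to_go.get("X00001", set())     # average-case O(1) lookup
missing = ac_to_go.get("X99999", set())   # unannotated AC yields an empty set
print(sorted(terms), missing)
```

Every entry is held in memory, so the footprint grows with the number of (AC, GO term) pairs, which is why a database with far fewer entries loads into far less memory.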

9.8.2 Effect of dimensions and ensemble size

Figure 9.10a shows the performance of R3P-Loc at different projected dimensions and ensemble sizes of random projection on the plant dataset. The dimensionality of the original feature vectors is 1541. The yellow dotted plane represents the performance using only multi-label ridge regression classifiers, namely the performance without random projection. For ease of comparison, we refer to it as RR-Loc. The mesh with blue (red) surfaces represents the projected dimensions and ensemble sizes at which R3P-Loc performs better (poorer) than RR-Loc. As can be seen, there is no red region across all the dimensions (200–1200) and all the ensemble sizes (2–10), which means that the ensemble R3P-Loc always performs better than RR-Loc over the tested range. The results suggest that using ensemble random projection can consistently boost the performance of RR-Loc. Similar conclusions can be drawn from Figure 9.10b, which shows the performance of R3P-Loc at different projected dimensions and ensemble sizes of random projection on the eukaryotic dataset. The difference is that the original dimension of the feature vectors is 10,775, which means that R3P-Loc performs better than RR-Loc even when the feature dimension is reduced by a factor of roughly 10 to 100.

Figure 9.11a compares the performance of R3P-Loc with mGOASVM [213] at different projected dimensions and ensemble sizes of random projection on the plant dataset. The green dotted plane represents the accuracy of mGOASVM, which is a constant for all projected dimensions and ensemble sizes. The mesh with blue (red) surfaces represents the projected dimensions and ensemble sizes at which the ensemble R3P-Loc performs better (poorer) than mGOASVM. As can be seen, R3P-Loc performs better than mGOASVM throughout all dimensions (200–1400) when the ensemble size is more than 4. On the other hand, when the ensemble size is less than 2, the performance of R3P-Loc is worse than that of mGOASVM for almost all the dimensions. These results suggest that a large enough ensemble size is important for boosting the performance of R3P-Loc. Figure 9.11b compares the performance of R3P-Loc with mGOASVM on the eukaryotic dataset. As can be seen, R3P-Loc performs better than mGOASVM when the dimension is larger than 300 and the ensemble size is no less than 3, or the dimension is larger than 500 and the ensemble size is no less than 2. These experimental results suggest that a large enough projected dimension is also necessary for improving the performance of R3P-Loc.


Fig. 9.10: Performance of R3P-Loc at different projected dimensions and ensemble sizes of random projection on (a) the plant dataset and (b) the eukaryotic dataset, respectively. The yellow dotted plane represents the performance using only multi-label ridge regression classifiers (RR-Loc for short), namely the performance without random projection. The mesh with blue surfaces represents the projected dimensions and ensemble sizes at which R3P-Loc performs better than RR-Loc. The original dimensions of the feature vectors for the plant and eukaryotic datasets are 1541 and 10775, respectively. Ensemble size: number of random projections used in the ensemble.


Fig. 9.11: Comparing R3P-Loc with mGOASVM at different projected dimensions and ensemble sizes of random projection on (a) the plant dataset and (b) the eukaryotic dataset, respectively. The green dotted plane represents the accuracy of mGOASVM [213], which is a constant for all projected dimensions and ensemble sizes. The mesh with blue (red) surfaces represents the projected dimensions and ensemble sizes at which the ensemble R3P-Loc performs better (poorer) than mGOASVM. The original dimensions of the feature vectors for the plant and eukaryotic datasets are 1541 and 10775, respectively. Ensemble size: number of random projections used in the ensemble.

9.8.3 Performance of ensemble random projection

Figure 9.12a shows the performance statistics of R3P-Loc based on leave-one-out cross-validation at different feature dimensions when the ensemble size (L in equation (7.15)) is fixed to 1, which we refer to as 1-R3P-Loc. We created ten 1-R3P-Loc classifiers, each with a different RP matrix. The result shows that even the highest accuracy of the ten 1-R3P-Loc classifiers is lower than that of R3P-Loc for all dimensions (200–1400). This suggests that ensemble random projection can significantly boost the performance of R3P-Loc. Similar conclusions can also be drawn from Figure 9.12b, which shows the performance statistics of R3P-Loc on the eukaryotic dataset.

9.9 Comprehensive comparison of proposed predictors

9.9.1 Comparison of benchmark datasets

Tables 9.16–9.18 show comprehensive comparisons of the performance of all of the proposed multi-label predictors against state-of-the-art predictors on the virus, plant, and eukaryotic datasets, respectively, based on leave-one-out cross validation. All of the predictors use the GO frequency features. HybridGO-Loc extracts the feature information not only from GO frequency features, but also from GO semantic similarity features. Virus-mPLoc [189], Plant-mPLoc [51], and Euk-mPLoc 2.0 [49] use an ensemble OET-KNN (optimized evidence-theoretic K-nearest neighbors) classifier; iLoc-Virus [233], iLoc-Plant [230], and iLoc-Euk [52] use a multi-label KNN classifier; KNN-SVM [121] uses an ensemble of classifiers combining KNN and SVM; mGOASVM [213] uses a multi-label SVM classifier; AD-SVM and mPLR-Loc use multi-label SVM and penalized logistic regression classifiers, respectively, both equipped with an adaptive decision scheme; RP-SVM and R3P-Loc use ensemble random projection for dimension reduction and use multi-label SVM and ridge regression as classifiers, respectively; and HybridGO-Loc uses a multi-label SVM classifier incorporated with an adaptive decision scheme.


Fig. 9.12: Performance of R3P-Loc at different feature dimensions on (a) the plant dataset and (b) the eukaryotic dataset, respectively. The original dimensions of the feature vectors for the plant and eukaryotic datasets are 1541 and 10775, respectively. 1-R3P-Loc: R3P-Loc with an ensemble size of 1.

Table 9.16: Comparing all proposed multi-label predictors with state-of-the-art predictors using the virus dataset. OAA: overall actual accuracy; OLA: overall locative accuracy. See equations (8.9)–(8.15) for the definitions of the performance measures. “–” means that the corresponding references do not provide the results on the respective metrics.


As can be seen from Table 9.16, all of the proposed predictors perform significantly better than Virus-mPLoc, KNN-SVM, and iLoc-Virus in terms of all performance metrics. Among the proposed predictors, HybridGO-Loc performs the best, which demonstrates that mining deeper GO information (i.e. semantic similarity information) is important for boosting the performance of predictors. Based on mGOASVM, the other proposed predictors have made different improvements, either from refinement of multi-label classifiers, dimensionality reduction, or mining deeper into the GO database for feature extraction, of which deeper feature extraction contributes most to the performance gain.

Similar conclusions can be drawn from Table 9.17, except that RP-SVM is superior to HybridGO-Loc in terms of OLA and recall, while HybridGO-Loc performs the best in terms of the other metrics. This is because the adaptive decision scheme for HybridGO-Loc tends to increase OAA but decrease OLA and recall as a compromise. The analysis of the adaptive decision scheme can be found in Section 9.5.1.

As for Table 9.18, only mGOASVM and R3P-Loc have been evaluated on the eukaryotic dataset. As can be seen from Table 9.18, R3P-Loc performs the best in terms of OAA, accuracy, precision, F1, and HL, while mGOASVM outperforms R3P-Loc in terms of OLA and recall. The results are consistent with those shown in Table 9.17.

Table 9.17: Comparing all proposed multi-label predictors with state-of-the-art predictors using the plant dataset. OAA: overall actual accuracy; OLA: overall locative accuracy. See equations (8.9)–(8.15) for the definitions of the performance measures. “–” means that the corresponding references do not provide the results on the respective metrics.


Table 9.18: Comparing all proposed multi-label predictors with state-of-the-art predictors using the eukaryotic dataset. OAA: overall actual accuracy; OLA: overall locative accuracy. See equations (8.9)–(8.15) for the definitions of the performance measures. “–” means that the corresponding references do not provide the results on the respective metrics.


9.9.2 Comparison of novel datasets

To further demonstrate the effectiveness of our proposed predictors, a new plant dataset (see Table 8.8 in Chapter 8) containing novel proteins was constructed. Because the novel proteins were recently added to Swiss-Prot, many of them have not been annotated in the GOA database. As a result, if we used the accession numbers (ACs) of these proteins to search against the GOA database, the corresponding GO vectors would contain all zeros. This suggests that we should use the ACs of their homologs as the search keys, i.e. the procedure shown in Figure 4.4 using sequences as input should be adopted. However, we observed that for some novel proteins, even the top homologs do not have any GO terms annotated to them. To overcome this limitation, the successive-search strategy (also specified in Section 4.1.2 in Chapter 4) was adopted. For the proteins whose top homologs do not have any GO terms in the GOA database, we used the second-from-the-top homolog to find the GO terms; similarly, for the proteins whose top and second-from-the-top homologs do not have any GO terms, the third-from-the-top homolog was used; and so on until every query protein corresponds to at least one GO term.
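The successive-search strategy described above can be sketched as a loop over the ranked homolog list. The sketch assumes that the BLAST hits are already available as a list of ACs ordered from the top homolog downwards and that GO annotations are held in a dictionary; the function name, the ACs, and the toy annotations are hypothetical.

```python
def successive_search(ranked_homolog_acs, ac_to_go):
    """Return (rank, AC, GO terms) of the first homolog that has any GO annotation."""
    for rank, ac in enumerate(ranked_homolog_acs, start=1):
        terms = ac_to_go.get(ac)
        if terms:                       # skip homologs with no GO terms
            return rank, ac, set(terms)
    return None                         # no annotated homolog in the list


# Made-up annotation table: the top homolog is absent, the second is annotated.
ac_to_go = {
    "A0002": {"GO:0009507"},
    "A0003": {"GO:0005739"},
}
ranked = ["A0001", "A0002", "A0003"]    # made-up BLAST ranking, best hit first

result = successive_search(ranked, ac_to_go)
print(result)
```

Returning the rank together with the GO terms makes it possible to check afterwards how far down the homolog list the search had to go for each query protein.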

Because BLAST searches were used in the above procedure, the prediction performance will depend on the closeness (degree of homology) between the training and test proteins. To determine the number of test proteins that are close homologs of the training proteins, we performed a BLAST search for each of the test proteins. The E-value threshold was set to 10, so that none of the proteins in the lists returned from BLAST have an E-value larger than 10. Then we identified the training proteins in the lists based on their accession numbers and recorded their corresponding E-values.

Figure 9.13 shows the distribution of the E-values, which quantify the closeness between the training and test proteins. If we use the common criterion that homologous proteins should have E-values less than 10⁻⁴, then 74 out of 175 test proteins are homologs of training proteins, which account for 42% of the test set. Note that this homologous relationship does not mean that using BLAST’s homology transfers can predict all of the test proteins correctly. In fact, BLAST’s homology transfers (based on the CC field of the homologous proteins) can only achieve a prediction accuracy of 26.9% (47/175). As the prediction accuracy of the proposed predictors on this test set (see Table 9.19) is significantly higher than this percentage, the extra information available from the GOA database plays a very important role in the prediction.
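The homolog count above amounts to thresholding the best E-value of each test protein. The sketch below shows that computation on made-up E-values and then checks the two percentages quoted in the text; the function name is hypothetical, and the assumption is that each test protein contributes the smallest E-value among its training-protein hits.

```python
def homolog_fraction(best_evalues, threshold=1e-4):
    """Count test proteins whose best E-value against the training set is below the threshold."""
    close = sum(1 for e in best_evalues if e < threshold)
    return close, close / len(best_evalues)


# Made-up best E-values for four test proteins.
best_evalues = [1e-30, 1e-5, 0.5, 3.0]
count, fraction = homolog_fraction(best_evalues)

# Arithmetic check of the figures quoted in the text.
homolog_percent = round(74 / 175 * 100)       # share of test proteins with close homologs
blast_percent = round(47 / 175 * 100, 1)      # accuracy of BLAST homology transfer
print(count, fraction, homolog_percent, blast_percent)
```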


Fig. 9.13: Distribution of the closeness between the new testing proteins and the training proteins. The closeness is defined as the BLAST E-values of the training proteins using the test proteins as the query proteins in the BLAST searches. Number of proteins: the number of testing proteins whose E-values fall into the interval specified under the bar. Small E-values suggest that the corresponding new proteins are close homologs of the training proteins.

Table 9.19: Comparing all proposed multi-label predictors with state-of-the-art predictors based on independent tests using the new plant dataset. OAA: overall actual accuracy; OLA: overall locative accuracy. See equations (8.9)–(8.15) for the definitions of the performance measures. “–” means that the corresponding references do not provide the results on the respective metrics.


Table 9.19 shows the performance of the proposed multi-label predictors against state-of-the-art predictors on the new plant dataset based on independent tests. As explained earlier, to ensure that these proteins are novel to the predictors, 978 proteins of the plant dataset (see Table 8.6 in Chapter 8) were used for training the classifier. The results of Plant-mPLoc [51] and iLoc-Plant [230] were obtained by inputting the novel plant proteins to the corresponding web-servers. As shown in Table 9.19, all of the proposed predictors perform significantly better than Plant-mPLoc and iLoc-Plant in terms of all the performance measures. Given the novelty and multi-label properties of these proteins and the low sequence similarity (below 25%), the OAA (55%–72%) and OLA (60%–78%) achieved by our proposed predictors are fairly high. On the other hand, due to the scarcity of data, our proposed predictors do not perform well in some subcellular locations (results not shown). However, the situation is expected to improve as more proteins become available for training our multi-label classifiers.

Among our own proposed predictors, HybridGO-Loc performs the best in terms of all the performance measures, which is consistent with the results shown in Section 9.9.1. The results further demonstrate that mining deeper GO information (i.e. semantic similarity information) contributes more to the predictions than refining classifiers.

9.10 Summary

This chapter has presented experimental results for all of the proposed predictors, including GOASVM and FusionSVM for single-location protein subcellular localization, and mGOASVM, AD-SVM, mPLR-Loc, SS-Loc, HybridGO-Loc, RP-SVM, and R3P-Loc for multi-label protein subcellular localization. In-depth analyses of the predictors were provided and their detailed properties discussed. Comprehensive comparisons of all of the proposed predictors were also provided.

