This chapter elaborates on the experimental results and related analyses for all the predictors introduced in previous chapters, including GOASVM and FusionSVM for single-location protein subcellular localization, and mGOASVM, AD-SVM, mPLR-Loc, SS-Loc, HybridGO-Loc, RP-SVM, and R3P-Loc for multi-location protein subcellular localization.
Table 9.1 shows the performance of different GO-vector construction methods on the EU16, HUM12, and the novel eukaryote (NE16) datasets, which were detailed in Tables 8.1, 8.2, and 8.3, respectively. Linear SVMs were used for all cases, and the penalty factor was set to 0.1. For the EU16 and HUM12 datasets, leave-one-out cross-validation (LOOCV) was used to evaluate the performance of GOASVM; for the NE16 dataset, the EU16 training dataset was used for training the classifier, which was subsequently used to classify proteins in the NE16 dataset. Four different GO-vector construction methods were investigated, including 1-0 value, term-frequency (TF), inverse sequence-frequency (ISF), and term-frequency inverse sequence-frequency (TF-ISF).
Dataset | GO-vector construction method | OMCC | WAMCC | ACC
EU16 | 1-0 value | 0.9252 | 0.9189 | 92.98%
EU16 | TF | 0.9419 | 0.9379 | 94.55%
EU16 | ISF | 0.9243 | 0.9191 | 92.90%
EU16 | TF-ISF | 0.9384 | 0.9339 | 94.22%
HUM12 | 1-0 value | 0.8896 | 0.8817 | 89.88%
HUM12 | TF | 0.9074 | 0.9021 | 91.51%
HUM12 | ISF | 0.8659 | 0.8583 | 87.70%
HUM12 | TF-ISF | 0.8991 | 0.8935 | 90.75%
NE16 | 1-0 value | 0.6877 | 0.6791 | 70.72%
NE16 | TF | 0.7035 | 0.6926 | 72.20%
NE16 | ISF | 0.6386 | 0.6256 | 66.12%
NE16 | TF-ISF | 0.6772 | 0.6626 | 69.74%
Evidently, for all three datasets, term frequency (TF) performs the best of the four methods, which demonstrates that the frequencies of occurrence of GO terms contain extra information that the 1-0 values do not have. The results also suggest that inverse sequence frequency (ISF) is detrimental to classification performance, despite its proven effectiveness in document retrieval. This may be due to the difference between the frequency of occurrence of common GO terms in our datasets and that of common words in document retrieval. In document retrieval, almost all documents contain the common words; as a result, the inverse document frequency is effective in suppressing the influence of these words during retrieval. However, the common GO terms do not appear in all of the proteins in our datasets; in fact, even the most commonly occurring GO term appears in only one-third of the proteins in EU16. We conjecture that this low frequency of occurrence of common GO terms makes ISF ineffective for subcellular localization.
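The four construction methods can be sketched as follows. This is an illustrative implementation rather than the authors' code: the ISF weighting is assumed to mirror the inverse document frequency of document retrieval (the logarithm of the number of proteins divided by the number of proteins annotated with the term), and all function and variable names are our own.

```python
import math

def go_vectors(protein_go_terms, vocab, mode="TF"):
    """Build GO vectors for a list of proteins.

    protein_go_terms: list of lists; the GO terms retrieved for each protein
                      (a term may occur more than once for one protein).
    vocab:            ordered list of all distinct GO terms in the corpus.
    mode:             one of "1-0", "TF", "ISF", "TF-ISF".
    """
    n = len(protein_go_terms)
    # sequence frequency: the number of proteins annotated with each term
    sf = {t: sum(1 for terms in protein_go_terms if t in terms) for t in vocab}
    vectors = []
    for terms in protein_go_terms:
        vec = []
        for t in vocab:
            tf = terms.count(t)                          # term frequency
            isf = math.log(n / sf[t]) if sf[t] else 0.0  # inverse sequence frequency
            if mode == "1-0":
                vec.append(1.0 if tf > 0 else 0.0)
            elif mode == "TF":
                vec.append(float(tf))
            elif mode == "ISF":
                vec.append(isf if tf > 0 else 0.0)
            else:  # "TF-ISF"
                vec.append(tf * isf)
        vectors.append(vec)
    return vectors
```

With this sketch, the TF vector of a protein annotated twice with one term keeps the count 2, whereas the 1-0 vector quantizes it to 1, which is exactly the information loss discussed above.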
Many existing GO-based methods, including ProLoc-GO [99], Euk-OET-PLoc [46], and Hum-PLoc [44], use the 1-0 value approach to constructing GO vectors. Table 9.1 shows that term frequency (TF) performs almost 2% better than the 1-0 value approach (72.20% vs 70.72%). Similar conclusions can also be drawn from the performance of GOASVM based on leave-one-out cross-validation on the EU16 and HUM12 training sets. The results are biologically relevant, because proteins of the same subcellular localization are expected to have a similar number of occurrences of the same GO term. In this regard, the 1-0 value approach is inferior, because it quantizes the number of occurrences of a GO term to 0 or 1.
Because of the inferior performance of the 1-0 value approach, predictors proposed in recent years typically use more advanced methods to extract features from the GOA database. For example, the GO vectors used by iLoc-Euk [52], iLoc-Hum [53], iLoc-Plant [230], iLoc-Gpos [231], iLoc-Gneg [232], and iLoc-Virus [233] depend on the frequency of occurrences of GO terms, which to some extent is similar to the TF approach.
Because the novel proteins were recently added to Swiss-Prot, many of them have not been annotated in the GOA database. As a result, if we used the accession numbers of these proteins to search against the GOA database, the corresponding GO vectors would contain all zeros. This suggests that we should use the ACs of their homologs as the search keys, i.e., the procedure shown in Figure 4.4 should be adopted. However, we observed that for some novel proteins, even the top homologs do not have any GO terms annotated to them. In particular, in the NE16 dataset, there are 169 protein sequences whose top homologs do not have any GO terms (second row of Table 9.2), causing GOASVM to be unable to make any predictions for them. As can be seen from Table 9.2, by using only the first homolog, the overall prediction accuracy of GOASVM is only 57.07% (347/608). To overcome this limitation, the following strategy was adopted. For the 169 proteins (second row of Table 9.2) whose top homologs do not have any GO terms in the GOA database, we used the second-from-the-top homolog to find the GO terms; similarly, for the 112 proteins (third row of Table 9.2) whose top and second-from-the-top homologs do not have any GO terms, the third-from-the-top homolog was used; and so on, until every query protein corresponds to at least one GO term. In the rare case where BLAST fails to find any homologs, the default E-value threshold (the -e option) can be relaxed. A detailed description of this strategy can be found in Section 4.1.2 of Chapter 4.
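The back-off strategy above can be sketched as follows. This is a minimal illustration with hypothetical names (`goa_lookup` stands in for the AC-to-GO hash table described later in this section), not the actual GOASVM implementation.

```python
def retrieve_go_terms(query_ac, blast_homologs, goa_lookup, kmax=7):
    """Return GO terms for a query protein, backing off to lower-ranked
    homologs when higher-ranked ones have no GOA annotations.

    blast_homologs: ACs of the query's homologs, ranked by BLAST E-value.
    goa_lookup:     dict mapping an AC to its list of GO terms (may be empty).
    kmax:           rank of the most distant homolog allowed (7 in Table 9.2).
    """
    # try the query's own AC first
    terms = goa_lookup.get(query_ac, [])
    if terms:
        return terms
    # otherwise walk down the homolog list until a GO-annotated one is found
    for homolog_ac in blast_homologs[:kmax]:
        terms = goa_lookup.get(homolog_ac, [])
        if terms:
            return terms
    return []  # caller may relax the BLAST E-value threshold and retry
```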
Method | kmax | No. of sequences without GO terms | OMCC | WAMCC | ACC
GOASVM | 1 | 169 | 0.5421 | 0.5642 | 57.07%
GOASVM | 2 | 112 | 0.5947 | 0.6006 | 62.01%
GOASVM | 3 | 12 | 0.6930 | 0.6834 | 71.22%
GOASVM | 4 | 7 | 0.6980 | 0.6881 | 71.71%
GOASVM | 5 | 3 | 0.7018 | 0.6911 | 72.04%
GOASVM | 6 | 3 | 0.7018 | 0.6911 | 72.04%
GOASVM | 7 | 0 | 0.7035 | 0.6926 | 72.20%
Baseline* | 7 | 0 | 0.5246 | 0.5330 | 55.43%
* Since the webserver of Euk-OET-PLoc is no longer available, we implemented it according to [46].
Table 9.2 shows the prediction performance of GOASVM on the NE16 dataset (608 novel proteins). As explained earlier, to ensure that these proteins are novel to GOASVM, 2423 proteins extracted from the training set of EU16 were used for training the classifier. For a fair comparison, Euk-OET-PLoc [46] also uses the same version of the GOA Database (March 8, 2011) to retrieve GO terms and adopts the same procedure as GOASVM to obtain GO terms from homologs. In such a case, it is unnecessary for Euk-OET-PLoc to use PseAA [35] as a backup method, because a valid GO vector can be found for every protein in this novel dataset. Also, according to Euk-OET-PLoc [46], several parameters were optimized, and only the best performance is shown here (see the last row of Table 9.2). As can be seen, GOASVM performs significantly better than Euk-OET-PLoc (72.20% vs 55.43%), demonstrating that GOASVM is more capable of predicting novel proteins than Euk-OET-PLoc. Moreover, the results clearly suggest that when more distant homologs are allowed for searching GO terms in the GOA database, we have a higher chance of finding at least one GO term for each of these novel proteins, thus improving the overall performance. In particular, when the most distant homolog has a rank of 7 (kmax = 7), GOASVM is able to find GO terms for all of the novel proteins, and the accuracy is also the highest, almost 15% (absolute) higher than when using only the top homolog. Given the novelty of these proteins and the low sequence similarity (< 25%), an accuracy of 72.2% is fairly high, suggesting that the homologs of novel proteins can provide useful GO information for protein subcellular localization.
Note that the gene-association file we downloaded from the GOA database does not provide any subcellular localization labels. This file only allows us to create a hash table storing the associations between accession numbers and their corresponding GO terms. This hash table covers all of the accession numbers in the GOA Database released on March 18, 2011, meaning that it covers the accession numbers in EU16 (dated 2005) but not those in the novel eukaryotic dataset. It is important to emphasize that, given a query protein, a match in this hash table does not mean that a subcellular localization assignment can be obtained. In fact, a match only means that a nonnull GO vector can be obtained. After that, the SVMs play an important role in classifying the nonnull GO vector.
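Building such a hash table from a gene-association (GAF) file can be sketched as below, assuming the standard GAF column layout (column 2 holds the database object ID, i.e. the AC, and column 5 the GO ID); the function name is illustrative.

```python
from collections import defaultdict

def build_ac_to_go_table(gaf_lines):
    """Build a hash table mapping accession numbers to their GO terms from
    gene-association (GAF) records.  Comment lines in GAF start with '!';
    data lines are tab-separated, with the AC in column 2 and the GO ID in
    column 5.
    """
    table = defaultdict(list)
    for line in gaf_lines:
        if line.startswith("!"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        ac, go_id = cols[1], cols[4]
        table[ac].append(go_id)
    return table
```

Looking up a query AC in this table either yields its GO terms (a nonnull GO vector can then be built) or an empty list, in which case the homolog back-off described above is needed.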
Table 9.3 shows the performance of GOASVM using different features and different SVM classifiers on the EU16 dataset. The penalty factor for training the SVMs was set to 0.1 for both linear SVMs and RBF SVMs. For RBF SVMs, the kernel parameter was set to 1. For the first four methods, the vector norm was adopted for better classification performance. GapAA [162] uses a maximum gap length of 48 (the minimum length of all the sequences is 50). As AA, PairAA, and PseAA produce low-dimensional feature vectors, the performance achieved by RBF SVMs is better than that achieved by linear SVMs; therefore, we only present the performance of RBF SVMs. As can be seen, amino acid composition and its variants are not good features for subcellular localization: the highest accuracy is only 48.66%. Moreover, although the homology-based method achieves better accuracy (54.52%) than the amino acid composition-based methods, the performance is still very poor, probably because of the low sequence similarity in this dataset. On the other hand, GOASVM achieves a significantly better performance (94.55%), almost 40% (absolute) better than the homology-based method. This suggests that the gene ontology-based method provides significantly richer information pertaining to protein subcellular localization than the other methods.
To further demonstrate the superiority of GOASVM over other state-of-the-art GO-based methods, we also conducted experiments on the EU16 and HUM12 datasets. Table 9.4 compares the performance of GOASVM against three state-of-the-art GO-based methods on these two datasets. As Euk-OET-PLoc and Hum-PLoc could not produce valid GO vectors for some proteins in EU16 and HUM12, both methods use PseAA as a backup. ProLoc-GO uses either the ACs of proteins or the ACs of homologs returned from BLAST as search keys. GOASVM also uses BLAST to find homologs, but unlike ProLoc-GO, GOASVM is more flexible in selecting the homologs: instead of using the top homolog only, GOASVM can use lower-rank homologs if the higher-rank homologs result in null GO vectors.
Table 9.4 shows that ProLoc-GO with ACs as input performs better than with sequences (ACs of homologs) as input. However, the results for GOASVM are not conclusive in this regard, because under LOOCV, using ACs as input performs better than using sequences, whereas the situation is the opposite under independent tests. Table 9.4 also shows that with either ACs or sequences as input, GOASVM performs better than Euk-OET-PLoc and ProLoc-GO on both the EU16 and HUM12 datasets.
To show that the high performance of GOASVM is not purely attributable to the homologous information obtained from BLAST, we used BLAST directly as a subcellular localization predictor. Specifically, the subcellular location of a query protein is determined by the subcellular location of its closest homolog, as found by BLAST using Swiss-Prot 2012_04 as the protein database. The subcellular locations of the homologs were obtained from their CC fields in Swiss-Prot. The results in Table 9.4 show that the performance of this approach is significantly poorer than that of the machine-learning approaches, suggesting that homologous information alone is not sufficient for subcellular localization prediction. The authors of [24] also used BLAST to find the subcellular locations of proteins, and their results likewise suggest that using BLAST alone is not sufficient for reliable prediction.
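This BLAST-only baseline amounts to label transfer from the nearest annotated homolog. A sketch, under the assumption that the homolog list is ranked by E-value and that `swissprot_cc_locations` (our name) maps an AC to the location parsed from its Swiss-Prot CC field:

```python
def blast_baseline_predict(ranked_homologs, swissprot_cc_locations):
    """Assign the query protein the subcellular location annotated in the
    CC field of its closest homolog that carries a location annotation.

    ranked_homologs:         homolog ACs, closest (lowest E-value) first.
    swissprot_cc_locations:  dict mapping AC -> location string (or None).
    """
    for ac in ranked_homologs:
        loc = swissprot_cc_locations.get(ac)
        if loc:
            return loc
    return None  # no annotated homolog found; no prediction possible
```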
Method | Input data | Feature | LOOCV accuracy (WAMCC) | Independent-test accuracy (WAMCC)
ProLoc-GO [99] | S | GO (using BLAST) | 86.6% (0.7999) | 83.3% (0.706)
ProLoc-GO [99] | AC | GO (no BLAST) | 89.0% (–) | 85.7% (0.710)
Euk-OET-PLoc [46] | S + AC | GO + PseAA | 81.6% (–) | 83.7% (–)
GOASVM | S | GO (using BLAST) | 94.68% (0.9388) | 93.86% (0.9252)
GOASVM | AC | GO (no BLAST) | 94.55% (0.9379) | 94.61% (0.9348)
BLAST [5] | S | – | 56.75% (–) | 60.39% (–)
(a) Performance on the EU16 dataset.
Method | Input data | Feature | LOOCV accuracy (WAMCC) | Independent-test accuracy (WAMCC)
ProLoc-GO [99] | S | GO (using BLAST) | 90.0% (0.822) | 88.1% (0.661)
ProLoc-GO [99] | AC | GO (no BLAST) | 91.1% (–) | 90.6% (0.724)
Hum-PLoc [44] | S + AC | GO + PseAA | 81.1% (–) | 85.0% (–)
GOASVM | S | GO (using BLAST) | 91.73% (0.9033) | 94.21% (0.9346)
GOASVM | AC | GO (no BLAST) | 91.51% (0.9021) | 94.39% (0.9367)
BLAST [5] | S | – | 68.55% (–) | 65.69% (–)
(b) Performance on the HUM12 dataset.
Although all the datasets mentioned in this section were cut off at 25% sequence similarity, the performance of GOASVM increases from 72.20% (Table 9.2) on the novel dataset (NE16) to more than 90% (Table 9.4) on both the EU16 and HUM12 datasets. This is mainly because in Table 9.4 the training and testing sets were constructed at the same time, whereas the training and testing sets in Table 9.2 were created six years apart, which causes the latter to have less similarity in GO information between the training and test sets than the former. This in turn implies that the performance of GOASVM on the novel dataset (Table 9.2) more objectively reflects the classification capabilities of the predictors.
The newer the version of the GOA Database, the more annotation information it contains. To investigate how the updated information affects the performance of GOASVM, we performed experiments using an earlier version (published in Oct. 2005) of the GOA Database and compared the results with Euk-OET-PLoc on the EU16 dataset. For proteins without a GO term in the GOA Database, pseudo amino acid composition (PseAA) [35] was used as the backup feature. Comparison between the last and second-to-last rows of Table 9.5 reveals that using newer versions of the GOA Database achieves better performance than using older versions. This suggests that annotation information is very important to the prediction. The results also show that GOASVM significantly outperforms Euk-OET-PLoc, suggesting that the GO-vector construction method and classifier used in GOASVM (term frequency and SVM) are superior to those used in Euk-OET-PLoc (1-0 value and K-NN).
Method | Feature | LOOCV accuracy | Independent-test accuracy
Euk-OET-PLoc [46] | GO (GOA2005) | 81.6% | 83.7%
GOASVM | GO (GOA2005) | 86.42% | 89.11%
GOASVM | GO (GOA2011) | 94.55% | 94.61%
Table 9.6 shows the performance of 12 different configurations of InterProGOSVM on the OE11 dataset (see Table 8.4). For ease of reference, we label these configurations GO_1, GO_2, ..., GO_12. Linear SVMs were used in all cases, and the penalty factor was set to 1. When vector normalization or geometric averaging is used to postprocess the GO vectors, inverse sequence frequency (ISF) produces more discriminative GO vectors, as evident from the higher accuracy, OMCC, and WAMCC of GO_6 and GO_10. Except for ISF, using the raw GO vectors (without postprocessing) as the SVM input achieves the best performance, as evident from the higher accuracy, OMCC, and WAMCC of GO_1, GO_3, and GO_4. This suggests that postprocessing could remove some of the subcellular localization information carried by the raw GO vectors. GO_1 achieves the best performance, suggesting that postprocessing is not necessary.
Config ID | GO vectors construction | Postprocessing | ACC | OMCC | WAMCC |
GO_1 | 1-0 value | None | 72.21% | 0.6943 | 0.6467 |
GO_2 | ISF | None | 71.89% | 0.6908 | 0.6438 |
GO_3 | TF | None | 71.99% | 0.6919 | 0.6451 |
GO_4 | TF-ISF | None | 71.15% | 0.6827 | 0.6325 |
GO_5 | 1-0 value | Vector norm | 71.25% | 0.6837 | 0.6335 |
GO_6 | ISF | Vector norm | 72.02% | 0.6922 | 0.6427 |
GO_7 | TF | Vector norm | 70.96% | 0.6806 | 0.6293 |
GO_8 | TF-ISF | Vector norm | 71.73% | 0.6890 | 0.6386 |
GO_9 | 1-0 value | Geometric mean | 70.51% | 0.6756 | 0.6344 |
GO_10 | ISF | Geometric mean | 72.08% | 0.6929 | 0.6468 |
GO_11 | TF | Geometric mean | 70.64% | 0.6771 | 0.6290 |
GO_12 | TF-ISF | Geometric mean | 71.03% | 0.6813 | 0.6391 |
Table 9.7 shows the performance of different SVMs using various features extracted from the protein sequences. The features include amino acid composition (AA) [155], amino-acid pair composition (PairAA) [155], AA composition with a maximum gap length of 59 (GapAA; the minimum length of all of the 3120 sequences is 61) [162], pseudo AA composition (PseAA) [35], profile alignment scores, and GO vectors. The penalty factor for training the SVMs was set to 1 for both linear SVMs and RBF SVMs. For RBF SVMs, the kernel parameter was set to 1. As AA and PairAA produce low-dimensional feature vectors, the performance achieved by the RBF SVM is better than that of the linear SVM; therefore, only the performance of the RBF SVM is presented.
Table 9.7 shows that amino acid composition and its variants are not good features for subcellular localization. The AA method only exploits amino acid composition information, so it performs the worst. Because PairAA, GapAA, and extended PseAA extract sequence-order information, their combination achieves a slightly better prediction performance. Among the amino acid-based methods, the highest accuracy is only 61.44%. On the other hand, the homology-based method, which exploits the homologous sequences in protein databases (via PSI-BLAST), achieves a significantly better performance. This suggests that the information pertaining to the amino acid sequences alone is limited, whereas the homology-based method PairProSVM can extract much more valuable information about protein subcellular localization. The higher OMCC and WAMCC also suggest that PairProSVM can handle imbalanced problems better. The results further suggest that InterProGOSVM outperforms the amino acid composition methods and is comparable, although slightly inferior, to PairProSVM.
Table 9.8 shows the performance of fusing InterProGOSVM and PairProSVM. The performance was obtained by optimizing the fusion weight wGO in equation (4.19) (based on the test dataset). The results show that the combination of PairProSVM and GO_10 (ISF with geometric mean) achieves the highest accuracy, 79.04%, which is significantly better than PairProSVM (77.05%) and InterProGOSVM (72.21%) alone. The results also suggest that the combination of PairProSVM and any configuration of InterProGOSVM can outperform the individual methods. This is mainly because the information obtained from homology search and from functional-domain databases reflects different perspectives and is therefore complementary.
Method | Optimal wGO | ACC | OMCC | WAMCC
GO_1 | 0.45 | 78.91% | 0.7680 | 0.7322 |
GO_2 | 0.26 | 78.56% | 0.7641 | 0.7260 |
GO_3 | 0.40 | 78.75% | 0.7662 | 0.7291 |
GO_4 | 0.37 | 78.72% | 0.7659 | 0.7285 |
GO_5 | 0.37 | 78.78% | 0.7666 | 0.7293 |
GO_6 | 0.34 | 78.78% | 0.7666 | 0.7294 |
GO_7 | 0.43 | 78.81% | 0.7670 | 0.7289 |
GO_8 | 0.29 | 78.40% | 0.7624 | 0.7234 |
GO_9 | 0.42 | 78.97% | 0.7687 | 0.7318 |
GO_10 | 0.45 | 79.04% | 0.7694 | 0.7335 |
GO_11 | 0.40 | 78.37% | 0.7620 | 0.7222 |
GO_12 | 0.37 | 78.62% | 0.7648 | 0.7263 |
Surprisingly, combining the best-performing InterProGOSVM configuration with the profile-alignment method does not give the best performance, and for different configurations, the best performance of FusionSVM is achieved at different optimal values of wGO. Since PairProSVM performs somewhat better than InterProGOSVM, it is reasonable to give a smaller weight to InterProGOSVM and a larger weight to PairProSVM.
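The weighted fusion and the search for the optimal wGO can be sketched as below. The exact form of equation (4.19) is defined in Chapter 4; here we assume a convex combination of the two per-class SVM score matrices, and all names are illustrative.

```python
import numpy as np

def fuse_and_sweep(s_go, s_profile, labels, weights):
    """Fuse per-class SVM scores from the GO-based and profile-based
    classifiers, sweeping the fusion weight w_go.

    s_go, s_profile: arrays of shape (n_proteins, n_classes).
    labels:          true class index of each protein.
    weights:         candidate values of w_go in [0, 1].
    Returns (best w_go, best accuracy).
    """
    best_w, best_acc = None, -1.0
    for w in weights:
        # convex combination of the two score matrices
        fused = w * s_go + (1.0 - w) * s_profile
        pred = fused.argmax(axis=1)          # pick the top-scoring class
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

Sweeping `weights` over a grid such as 0.0, 0.01, ..., 1.0 reproduces the kind of per-configuration optimum reported in Table 9.8.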
As mentioned above, wGO significantly influences the final performance of each configuration, so it is necessary to examine how this parameter affects the accuracy of FusionSVM. Here, we chose the configuration with the best performance, GO_10. Figure 9.1 shows the performance of combining GO_10 (InterProGOSVM with Config ID GO_10) and PairProSVM when wGO varies from 0 to 1. As can be seen, the performance changes smoothly with wGO, suggesting that wGO does not abruptly affect the final performance of the combined method and that the improvement of the combined method over PairProSVM persists over a wide range of wGO. Furthermore, to show that the improvement of the combined method over each individual method is statistically significant, we performed McNemar's test [140] on their SVM scores to compare their performance [78]. The p-value between the accuracy of the combined system (GO_10 and PairProSVM) and that of PairProSVM is 0.0055 (≪ 0.05), which suggests that the combined predictor performs significantly better than PairProSVM.
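McNemar's test compares two classifiers on the same test set using only the discordant proteins, i.e. those that one classifier gets right and the other wrong. A sketch of the exact (binomial) version of the test; the function name and interface are ours.

```python
from math import comb

def mcnemar_p(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-protein outcomes.

    correct_a, correct_b: boolean sequences indicating whether predictor A
    and predictor B classified each test protein correctly.
    """
    # discordant counts: A wrong & B right, and A right & B wrong
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(n01, n10)
    # two-sided exact binomial tail with success probability 0.5
    p = 2 * sum(comb(n, i) for i in range(0, k + 1)) / 2**n
    return min(1.0, p)
```

A small p-value (e.g. below 0.05, as with the 0.0055 reported above) indicates that the two predictors' error patterns differ significantly.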
A support vector machine (SVM) can use a linear, RBF, or polynomial function as its kernel. Some works [15, 162] have demonstrated that RBF kernels achieve better results than linear and polynomial kernels. However, our results show that linear SVMs perform better on our datasets. Table 9.9 shows the performance of mGOASVM using different types of kernel functions with different parameters, based on leave-one-out cross-validation (LOOCV) on the virus dataset. For the RBF SVM, the kernel parameter σ was selected from the set {2^-2, 2^-1, ..., 2^5}. For the polynomial SVM, the degree of the polynomial was set to either 2 or 3. The penalty parameter C was set to 0.1 in all cases. Table 9.9 shows that SVMs with the linear kernel perform better than those with RBF and polynomial kernels. This is plausible, because the dimension of the GO vectors is larger than the number of training vectors, aggravating the curse-of-dimensionality problem in nonlinear SVMs [14]. The overfitting problem becomes more severe when the degree of nonlinearity is high (small σ), leading to a reduction in performance, as demonstrated in Table 9.9. In other words, highly nonlinear SVMs become vulnerable to overfitting because of the high dimensionality of the GO vectors.
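The effect can be reproduced on synthetic data: when the feature dimension exceeds the number of samples, a highly nonlinear RBF kernel (small σ, which corresponds to a large `gamma` in scikit-learn's parameterization) overfits, while a linear kernel does not. This is an illustrative experiment on synthetic data, not the chapter's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# High-dimensional, few samples: the regime of GO vectors, where the
# feature dimension (2000) exceeds the number of training proteins (200).
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=50, random_state=0)

linear = SVC(kernel="linear", C=0.1)
# large gamma <-> small sigma: a highly nonlinear RBF kernel
rbf = SVC(kernel="rbf", C=0.1, gamma=1.0)

acc_linear = cross_val_score(linear, X, y, cv=5).mean()
acc_rbf = cross_val_score(rbf, X, y, cv=5).mean()
```

With `gamma=1.0` and 2000 features, the RBF kernel value between any two distinct points collapses toward zero, so the classifier degenerates to near-chance performance, while the linear kernel generalizes well.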
Table 9.10 compares the 1-0 value and term-frequency approaches to constructing GO vectors. Linear SVMs with penalty factor C = 0.1 were used in all cases. The results show that while the term-frequency (TF) approach is only slightly better than the 1-0 value approach in terms of locative accuracy, it performs almost 2% and 7% better than the 1-0 value approach in actual accuracy on the virus and plant datasets, respectively. This suggests that the frequency of occurrence of GO terms provides additional information about subcellular locations. The results are biologically relevant, because proteins of the same subcellular localization are expected to have a similar number of occurrences of the same GO term. In this regard, the 1-0 value approach is inferior, because it quantizes the number of occurrences of a GO term to 0 or 1. In fact, the increase in actual accuracy is even more substantial on the plant dataset than on the virus dataset. Given the large size and number of multi-label proteins in the plant dataset, the remarkable performance of the TF approach suggests that the term frequencies do possess some extra information that the 1-0 values do not have.
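The two multi-label accuracy measures used throughout this chapter can be computed as follows. This reflects our reading of their definitions in Chapter 8: actual accuracy counts a protein as correct only when its predicted label set exactly matches the true set, whereas locative accuracy counts each (protein, location) pair, so a protein with three locations contributes three locative proteins.

```python
def locative_and_actual_accuracy(true_sets, pred_sets):
    """Compute (locative accuracy, actual accuracy) for multi-label
    predictions.

    true_sets, pred_sets: lists of sets of location labels, one pair of
    sets per protein.
    """
    # locative: each correctly predicted (protein, location) pair counts once
    locative_hits = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    locative_total = sum(len(t) for t in true_sets)
    # actual: the whole predicted label set must match the true set exactly
    actual_hits = sum(1 for t, p in zip(true_sets, pred_sets) if t == p)
    return locative_hits / locative_total, actual_hits / len(true_sets)
```

This also makes clear why actual accuracy is the stricter measure: an over-predicted protein can still score locative hits while contributing nothing to actual accuracy.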
Kernel | Parameter | Locative accuracy | Actual accuracy
Linear SVM | – | 244/252 = 96.8% | 184/207 = 88.9%
RBF SVM | σ = 2^-2 | 182/252 = 72.2% | 53/207 = 25.6%
RBF SVM | σ = 2^-1 | 118/252 = 46.8% | 87/207 = 42.0%
RBF SVM | σ = 2^0 | 148/252 = 58.7% | 116/207 = 56.0%
RBF SVM | σ = 2^1 | 189/252 = 75.0% | 142/207 = 68.6%
RBF SVM | σ = 2^2 | 223/252 = 88.5% | 154/207 = 74.4%
RBF SVM | σ = 2^3 | 231/252 = 91.7% | 150/207 = 72.5%
RBF SVM | σ = 2^4 | 233/252 = 92.5% | 115/207 = 55.6%
RBF SVM | σ = 2^5 | 136/252 = 54.0% | 5/207 = 2.4%
Polynomial SVM | d = 2 | 231/252 = 91.7% | 180/207 = 87.0%
Polynomial SVM | d = 3 | 230/252 = 91.3% | 178/207 = 86.0%
GO vector construction method | Locative accuracy | Actual accuracy
1-0 value | 244/252 = 96.8% | 179/207 = 86.5%
Term frequency (TF) | 244/252 = 96.8% | 184/207 = 88.9%
(a) Performance on the viral protein dataset.

GO vector construction method | Locative accuracy | Actual accuracy
1-0 value | 1014/1055 = 96.1% | 788/978 = 80.6%
Term frequency (TF) | 1015/1055 = 96.2% | 855/978 = 87.4%
(b) Performance on the plant protein dataset.
To show that the high locative accuracies of mGOASVM are due to the capability of mGOASVM rather than to over-prediction, we investigated the distributions of the number of predicted labels in both the virus and plant datasets. For the i-th protein, the number of predicted labels and the number of true labels are as defined in equation (8.14). The distributions of the number of labels predicted by mGOASVM are shown in Table 9.11. Denote No(k), Ne(k), and Nu(k) as the number of proteins that are over-, equal-, and under-predicted by k labels (k = 0, ..., 5 for the virus dataset and k = 0, ..., 11 for the plant dataset), and denote No, Ne, and Nu as the total number of proteins that are over-, equal-, and under-predicted, respectively. Here, over-prediction, equal-prediction, and under-prediction mean that the number of predicted labels is larger than, equal to, and smaller than the number of true labels, respectively. Table 9.11 shows that proteins that are over- or under-predicted account for only a small percentage of the datasets (8.7% and 1.0% over- and under-predicted in the virus dataset; 8.7% and 1.4% in the plant dataset). Even among the proteins that are over-predicted, most are over-predicted by only one location. These include all of the 18 over-predicted proteins in the virus dataset and 83 out of 85 in the plant dataset. None of the proteins in the virus dataset are over-predicted by more than one location; only 2 out of 85 proteins in the plant dataset are over-predicted by two locations, and none are over-predicted by more than two. As for under-prediction, all of the under-predicted proteins are under-predicted by only one location in both datasets. These results demonstrate that the over- and under-prediction percentages are small, which suggests that mGOASVM can effectively determine the number of subcellular locations of a query protein.
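Counting the over-, equal-, and under-predicted proteins is straightforward once the predicted and true label sets are available; a minimal sketch with our own names:

```python
def prediction_balance(true_sets, pred_sets):
    """Count proteins whose number of predicted labels is larger than
    (over-predicted), equal to (equal-predicted), or smaller than
    (under-predicted) the number of true labels.

    Returns (No, Ne, Nu) in the notation of the text.
    """
    n_over = n_equal = n_under = 0
    for t, p in zip(true_sets, pred_sets):
        if len(p) > len(t):
            n_over += 1
        elif len(p) == len(t):
            n_equal += 1
        else:
            n_under += 1
    return n_over, n_equal, n_under
```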
Input data | #homo | Actual accuracy (l = 1) | Actual accuracy (l = 2) | Actual accuracy (l = 3) | Overall actual accuracy
AC | 0 | 154/165 = 93.3% | 34/39 = 87.2% | 3/3 = 100% | 191/207 = 92.3%
S | 1 | 148/165 = 89.7% | 33/39 = 84.6% | 3/3 = 100% | 184/207 = 88.9%
S + AC | 1 | 151/165 = 91.5% | 34/39 = 87.2% | 3/3 = 100% | 188/207 = 90.8%
(a) Performance on the viral protein dataset.

Input data | #homo | Actual accuracy (l = 1) | Actual accuracy (l = 2) | Actual accuracy (l = 3) | Overall actual accuracy
AC | 0 | 813/904 = 89.9% | 49/71 = 69.0% | 1/3 = 33.3% | 863/978 = 88.2%
S | 1 | 802/904 = 88.7% | 52/71 = 73.2% | 1/3 = 33.3% | 855/978 = 87.4%
S + AC | 1 | 811/904 = 89.7% | 47/71 = 66.2% | 1/3 = 33.3% | 859/978 = 87.8%
(b) Performance on the plant protein dataset.
Table 9.12 shows the performance of mGOASVM for multi-location proteins using different types of inputs. Denote l as the number of co-locations. As the maximum number of co-locations in both datasets is 3, the actual accuracies for l = 1, 2, 3 are shown in Table 9.12. Note that high actual accuracies for l > 1 are more difficult to achieve than for l = 1, since not only must the number of subcellular locations of a protein be correctly predicted, but the subcellular locations themselves must also be precisely predicted. As can be seen, mGOASVM achieves high performance not only for single-label proteins (the column corresponding to l = 1) but also for multi-label proteins (the columns corresponding to l = 2 and l = 3). The results demonstrate that mGOASVM can tackle multi-label problems well.
Table 9.13 shows the performance of mGOASVM with different types of inputs and different numbers of homologous proteins for the virus and plant datasets. The input data can be of three possible types: (1) accession number only, (2) sequence only and (3) both accession number and sequence. mGOASVM can extract information from these inputs by producing multiple GO vectors for each protein. Denote #homo as the number of homologous proteins, where #homo ∈ {0,1,2,4,8} for the virus dataset and #homo ∈ {0,1,2} for the plant dataset. For different combinations of inputs and numbers of homologs, the number of distinct GO terms can be different. Typically, the number of distinct GO terms increases with the number of homologs.
Input data type | #homo | Nd(GO) | Locative accuracy | Actual accuracy
AC | 0 | 331 | 244/252 = 96.8% | 191/207 = 92.3%
S | 1 | 310 | 244/252 = 96.8% | 184/207 = 88.9%
S | 2 | 455 | 235/252 = 93.3% | 178/207 = 86.0%
S | 4 | 664 | 221/252 = 87.7% | 160/207 = 77.3%
S | 8 | 1134 | 202/252 = 80.2% | 130/207 = 62.8%
S + AC | 1 | 334 | 242/252 = 96.0% | 188/207 = 90.8%
S + AC | 2 | 460 | 238/252 = 94.4% | 179/207 = 86.5%
S + AC | 4 | 664 | 230/252 = 91.3% | 169/207 = 81.6%
S + AC | 8 | 1134 | 216/252 = 85.7% | 145/207 = 70.1%
(a) Viral protein dataset.

Input data type | #homo | Nd(GO) | Locative accuracy | Actual accuracy
AC | 0 | 1532 | 1023/1055 = 97.0% | 863/978 = 88.2%
S | 1 | 1541 | 1015/1055 = 96.2% | 855/978 = 87.4%
S | 2 | 1906 | 907/1055 = 85.8% | 617/978 = 63.1%
S + AC | 1 | 1541 | 1010/1055 = 95.7% | 859/978 = 87.8%
S + AC | 2 | 1906 | 949/1055 = 90.0% | 684/978 = 70.0%
(b) Plant protein dataset.
Table 9.13 shows that the number of homologs can affect the performance of mGOASVM. The results are biologically relevant, because the homologs can provide information about the subcellular locations. However, more homologs may bring redundant or even noisy information, which is detrimental to the prediction accuracy. For example, on the plant dataset, using one homolog performs better than using two (87.4% vs 63.1%), which suggests that we should limit the number of homologs to avoid introducing irrelevant information. Moreover, as can be seen from Table 9.13, the performance achieved by mGOASVM using sequences with the top homolog is comparable to that of mGOASVM using the accession number only.
Table 9.13 shows that mGOASVM using both sequences and accession numbers as inputs performs better than using sequences only. However, if accession numbers are available, it is better to use accession numbers only, as evidenced by the superior performance shown in the first rows of Tables 9.13a and 9.13b.
To demonstrate the superiority of mGOASVM over state-of-the-art predictors, the prediction results of 10 typical novel proteins by mGOASVM and Plant-mPLoc are shown in Table 9.14. As can be seen, for the single-location protein Q8VYI3, mGOASVM correctly predicts it to be located in “peroxisome”, while Plant-mPLoc gives a wrong prediction (“chloroplast”); for the multi-location protein F4JV80, mGOASVM correctly predicts it to be located in both “chloroplast” and “mitochondrion”, while Plant-mPLoc predicts it to be a single-location protein located in “nucleus”. Similarly, for Q93YP7, Q9LK40, and Q6NPS8, mGOASVM predicts all of them correctly, while Plant-mPLoc predicts all of them wrongly. Plant-mPLoc produces partially correct predictions for some of the single-location proteins, e.g., Q3ED65, Q9LQC8, and Q8VYI1; however, it wrongly considers them to be multi-label proteins. On the other hand, mGOASVM correctly predicts these proteins as single-location proteins located in “chloroplast”, “Golgi apparatus”, and “endoplasmic reticulum”, respectively. For Q9FNY2 and Q9FJL3, Plant-mPLoc can only produce partially correct predictions, whereas mGOASVM locates both of them exactly in the right subcellular locations.
AC | Ground-truth location(s) | Plant-mPLoc [51] prediction | mGOASVM prediction
Q8VYI3 | peroxisome | chloroplast | peroxisome
F4JV80 | chloroplast, mitochondrion | nucleus | chloroplast, mitochondrion
Q93YP7 | mitochondrion | cell membrane, chloroplast, Golgi apparatus | mitochondrion
Q9LK40 | nucleus | chloroplast | nucleus
Q6NPS8 | cytoplasm, nucleus | endoplasmic reticulum | cytoplasm, nucleus
Q3ED65 | chloroplast | chloroplast, cytoplasm | chloroplast
Q9LQC8 | Golgi apparatus | endoplasmic reticulum, Golgi apparatus | Golgi apparatus
Q8VYI1 | endoplasmic reticulum | endoplasmic reticulum, vacuole | endoplasmic reticulum
Q9FNY2 | cytoplasm, nucleus | nucleus | cytoplasm, nucleus
Q9FJL3 | cytoplasm, nucleus | nucleus, vacuole | cytoplasm, nucleus
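The exact/partial/wrong distinction used when comparing the two predictors above can be made precise by comparing predicted and ground-truth label sets. A minimal sketch (the function name is ours, not from the original predictors):

```python
def compare_prediction(true_locs, pred_locs):
    """Classify a multi-label prediction as 'exact', 'partial', or 'wrong'."""
    true_set, pred_set = set(true_locs), set(pred_locs)
    if pred_set == true_set:
        return "exact"        # all and only the true locations predicted
    if pred_set & true_set:
        return "partial"      # some overlap, but sets differ
    return "wrong"            # no overlap at all

# Examples mirroring Table 9.14:
print(compare_prediction({"peroxisome"}, {"chloroplast"}))                    # wrong
print(compare_prediction({"chloroplast"}, {"chloroplast", "cytoplasm"}))      # partial
print(compare_prediction({"cytoplasm", "nucleus"}, {"cytoplasm", "nucleus"})) # exact
```

Under this criterion, a prediction that covers the true location but adds spurious ones (e.g. Plant-mPLoc on Q3ED65) is only partially correct.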
Figure 9.2 shows the performance of AD-SVM on the virus dataset and the plant dataset with respect to the adaptive-decision parameter θ (equation (5.11) in Section 5.4.2 of Chapter 5) based on leave-one-out cross-validation. As can be seen, for the virus dataset, as θ increases from 0.0 to 1.0, the overall actual accuracy increases first, reaches its peak at θ = 0.3 (with an actual accuracy of 93.2%), and then decreases. An analysis of the predicted labels suggests that the increase in actual accuracy is due to the reduction in over-prediction, i.e. the number of cases where the predicted label set is larger than the true label set has been reduced. When θ > 0.3, the benefit of reducing over-prediction diminishes, because the criterion in equation (5.9) becomes so stringent that some of the proteins were under-predicted, i.e. the number of cases where the predicted label set misses some of the true labels increases. Note that the performance at θ = 0.0 is equivalent to that of mGOASVM, and that the best actual accuracy (93.2% at θ = 0.3) obtained by the proposed decision scheme is more than 4% (absolute) higher than that of mGOASVM (88.9%).
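The role of θ can be illustrated with a simplified decision rule. This is an assumed sketch, not the exact form of equation (5.11): here a location is kept when its SVM score exceeds θ times the maximum score (and the top-scoring location is always kept, so predictions are never empty).

```python
import numpy as np

def adaptive_decision(scores, theta):
    """Sketch of an adaptive multi-label decision rule (simplified form).

    Assumption: a location is predicted when its SVM score exceeds
    theta * max-score; the exact criterion is given by equation (5.11).
    """
    scores = np.asarray(scores, dtype=float)
    s_max = scores.max()
    # At theta = 0 this reduces to "all positive scores", i.e. the mGOASVM
    # rule; larger theta is stricter, suppressing over-prediction.
    if s_max > 0:
        keep = scores > max(theta * s_max, 0.0)
    else:
        keep = np.zeros_like(scores, dtype=bool)
    keep[scores.argmax()] = True   # always predict at least the top label
    return [i for i, k in enumerate(keep) if k]

print(adaptive_decision([2.0, 1.5, -0.5], theta=0.0))  # [0, 1]
print(adaptive_decision([2.0, 1.5, -0.5], theta=0.9))  # [0]
```

The trade-off in the text is visible here: raising θ prunes weakly scored labels (less over-prediction) but, past some point, starts discarding true labels (under-prediction).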
For the plant dataset, when θ increases from 0.0 to 1.0, the overall actual accuracy increases from 87.4% and then fluctuates at around 88%. If we take the same θ as that for the virus dataset, i.e., θ = 0.3, the performance of AD-SVM is 88.3%, which is still better than that of mGOASVM at θ = 0.0.
Figure 9.3a shows the performance of mPLR-Loc on the virus dataset for different values of θ (equation (6.13)) based on leave-one-out cross-validation. In all cases, the penalty parameter ρ of the logistic regression was set to 1.0. The performance of mPLR-Loc at θ = 0.0 is not shown because, according to equations (5.20) and (6.13), every query protein would be predicted as belonging to all M subcellular locations, which defeats the purpose of prediction. As evident from Figure 9.3a, when θ increases from 0.1 to 1.0, the OAA of mPLR-Loc increases first, reaching its peak at θ = 0.5 with OAA = 0.903, which is almost 2% (absolute) higher than that of mGOASVM (0.889). The precision achieved by mPLR-Loc increases until θ = 0.5 and then remains almost unchanged for θ ≥ 0.5. By contrast, OLA and recall peak at θ = 0.1 and then decrease as θ increases toward 1.0. Among these metrics, regardless of θ, OAA is never higher than the other five measures.
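The mPLR-Loc thresholding can be sketched in the same spirit. This is an assumed simplification of equation (6.13): a location is predicted when its posterior probability is at least θ times the largest posterior, with the top-scoring location always kept. This form reproduces the behavior noted above: at θ = 0 every location passes the test.

```python
import numpy as np

def mplr_decision(probs, theta):
    """Sketch of a probability-thresholding rule (cf. equation (6.13)).

    Assumption: keep location m when P(m) >= theta * max posterior;
    the top-scoring location is always kept so predictions are non-empty.
    """
    probs = np.asarray(probs, dtype=float)
    keep = probs >= theta * probs.max()
    keep[probs.argmax()] = True
    return np.flatnonzero(keep).tolist()

print(mplr_decision([0.9, 0.5, 0.1], theta=0.0))  # [0, 1, 2] -- all M locations
print(mplr_decision([0.9, 0.5, 0.1], theta=0.5))  # [0, 1]
print(mplr_decision([0.9, 0.5, 0.1], theta=1.0))  # [0]
```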
An analysis of the predicted labels of the 207 test proteins suggests that the increase in OAA is due to the reduction in over-prediction, i.e. the number of cases where the predicted label set is larger than the true label set. When θ > 0.5, the benefit of reducing over-prediction diminishes, because the criterion in equation (5.20) becomes so stringent that some of the proteins were under-predicted, i.e. their predicted label sets miss some of the true labels. When θ increases from 0.1 to 0.5, the number of over-predicted labels decreases while the number of correctly predicted labels remains almost unchanged. In other words, the denominators of accuracy and F1-score decrease while the numerators of both metrics remain almost unchanged, leading to better performance on both. When θ > 0.5, for a reason similar to the above, the increase in under-prediction outweighs the benefit of the reduction in over-prediction, causing performance loss. For precision, when θ > 0.5, the loss due to the stringent criterion is counteracted by the gain due to the reduction in the number of predicted labels, the denominator of equation (8.10). Thus, precision increases monotonically as θ increases from 0.1 to 1. However, OLA and recall decrease monotonically with respect to θ because the denominators of these measures (see equations (8.14) and (8.11)) are independent of the number of predicted labels, while the number of correctly predicted labels in the numerator decreases as the decision criterion becomes stricter.
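The denominator argument above is easiest to see in code. The following sketch uses the standard example-based multi-label definitions, which are assumed to match equations (8.10), (8.11), and the related measures in spirit; OLA is omitted since its exact locative-protein denominator is defined elsewhere in the book.

```python
def multilabel_metrics(true_sets, pred_sets):
    """Example-based multi-label metrics over a list of label sets.

    accuracy  = mean |T ∩ P| / |T ∪ P|
    precision = mean |T ∩ P| / |P|   (denominator shrinks as theta tightens)
    recall    = mean |T ∩ P| / |T|   (denominator independent of theta)
    OAA       = fraction of exact matches over all labels
    """
    n = len(true_sets)
    acc = prec = rec = oaa = 0.0
    for T, P in zip(true_sets, pred_sets):
        inter = len(T & P)
        acc  += inter / len(T | P)
        prec += inter / len(P)
        rec  += inter / len(T)
        oaa  += float(T == P)
    return {"accuracy": acc / n, "precision": prec / n,
            "recall": rec / n, "OAA": oaa / n}

m = multilabel_metrics(
    [{"chl"}, {"cyt", "nuc"}],            # ground truth
    [{"chl", "cyt"}, {"cyt", "nuc"}])     # one over-prediction, one exact
print(m["OAA"])  # 0.5
```

Pruning the spurious "cyt" label from the first prediction would raise accuracy, precision, and OAA without touching recall, which is exactly the effect of increasing θ described above.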
Figure 9.3b shows the performance of mPLR-Loc (with ρ = 1) on the plant dataset. The trends of OLA, accuracy, precision, recall, and F1-score are similar to those on the virus dataset. However, the OAA achieved by mPLR-Loc increases monotonically with respect to θ and reaches its optimum at θ = 1.0, in contrast to the results on the virus dataset, where the OAA is almost unchanged when θ ≥ 0.5.
Figure 9.4 shows the performance of mPLR-Loc with respect to the parameter ρ (equation (5.18)) on the virus dataset. In all cases, the adaptive thresholding parameter θ was set to 0.8. As can be seen, the variations of OAA, accuracy, precision, and F1-score with respect to ρ are very similar. More importantly, all four metrics show that there is a wide range of ρ for which the performance is optimal. This suggests that introducing the penalty term in equation (5.13) not only helps to avoid numerical difficulty, but also improves performance.
Figure 9.4 also shows that OLA and recall are largely unaffected by changes in ρ. This is understandable because the role of ρ is to overcome numerical difficulty when estimating the LR parameters β. More specifically, when ρ is too small (say log(ρ) < -5), the penalty is insufficient to avoid matrix singularity in equation (5.17), which leads to extremely poor performance. When ρ is too large (say log(ρ) > 5), the matrix in equation (5.16) is dominated by ρ, which also causes poor performance. The OAA of mPLR-Loc reaches its maximum of 0.903 at log(ρ) = -1.
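The role of the ridge term ρ can be sketched with a penalized Newton update for logistic regression. This is a generic illustration consistent with the description of equations (5.16)–(5.18), not the exact mPLR-Loc implementation: the function name and toy data are ours.

```python
import numpy as np

def penalized_lr_newton(X, y, rho, n_iter=20):
    """Binary logistic regression fitted by a ridge-penalized Newton method.

    The term rho * I keeps the Hessian X^T W X + rho * I invertible even
    when X^T W X is singular, e.g. when features outnumber samples.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # sigmoid posteriors
        W = p * (1.0 - p)                             # diagonal IRLS weights
        H = X.T @ (W[:, None] * X) + rho * np.eye(d)  # penalized Hessian
        beta += np.linalg.solve(H, X.T @ (y - p))     # Newton step
    return beta

# Toy case with d > n: without the rho*I term the Hessian would be singular.
X = np.array([[1.0, 0.2, -0.3],
              [1.0, -0.5, 0.8]])
y = np.array([1.0, 0.0])
beta = penalized_lr_newton(X, y, rho=1.0)
```

The two failure regimes in the text correspond to the extremes of this update: with ρ near zero the solve can become ill-conditioned, while a very large ρ makes the Hessian essentially ρI and shrinks β toward zero regardless of the data.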