9 Results and analysis

This chapter presents the experimental results and related analyses for all the predictors introduced in previous chapters, including GOASVM and FusionSVM for single-location protein subcellular localization, and mGOASVM, AD-SVM, mPLR-Loc, SS-Loc, HybridGO-Loc, RP-SVM, and R3P-Loc for multi-location protein subcellular localization.

9.1 Performance of GOASVM

9.1.1 Comparing GO vector construction methods

Table 9.1 shows the performance of different GO-vector construction methods on the EU16, HUM12, and the novel eukaryote (NE16) datasets, which were detailed in Tables 8.1, 8.2, and 8.3, respectively. Linear SVMs were used for all cases, and the penalty factor was set to 0.1. For the EU16 and HUM12 datasets, leave-one-out cross-validation (LOOCV) was used to evaluate the performance of GOASVM; for the NE16 dataset, the EU16 training dataset was used for training the classifier, which was subsequently used to classify proteins in the NE16 dataset. Four different GO-vector construction methods were investigated, including 1-0 value, term-frequency (TF), inverse sequence-frequency (ISF), and term-frequency inverse sequence-frequency (TF-ISF).

Table 9.1: Performance of different GO-vector construction methods for GOASVM on the EU16, HUM12, and the novel eukaryotic datasets. NE16: the novel eukaryotic dataset whose proteins are distributed in 16 subcellular locations (see Table 8.3 in Chapter 8); TF: term-frequency; ISF: inverse sequence-frequency; TF-ISF: term-frequency inverse sequence-frequency. OMCC: overall MCC; WAMCC: weighted average MCC; ACC: overall accuracy. Refer to equations (8.5), (8.7), and (8.8) for the definitions of ACC, OMCC, and WAMCC. The higher these three evaluation measures, the better the performance.

Dataset GO vector construction method OMCC WAMCC ACC
EU16 1-0 value 0.9252 0.9189 92.98%
TF 0.9419 0.9379 94.55%
ISF 0.9243 0.9191 92.90%
TF-ISF 0.9384 0.9339 94.22%
HUM12 1-0 value 0.8896 0.8817 89.88%
TF 0.9074 0.9021 91.51%
ISF 0.8659 0.8583 87.70%
TF-ISF 0.8991 0.8935 90.75%
NE16 1-0 value 0.6877 0.6791 70.72%
TF 0.7035 0.6926 72.20%
ISF 0.6386 0.6256 66.12%
TF-ISF 0.6772 0.6626 69.74%

Evidently, for all three datasets, term-frequency (TF) performs the best of the four methods, which demonstrates that the frequencies of occurrence of GO terms contain extra information that the 1-0 values do not. The results also suggest that inverse sequence frequency (ISF) is detrimental to classification performance, despite its proven effectiveness in document retrieval. This may be due to the difference between the frequency of occurrence of common GO terms in our datasets and that of common words in document retrieval. In document retrieval, almost all documents contain the common words; as a result, the inverse document frequency is effective in suppressing their influence on retrieval. However, the common GO terms do not appear in all of the proteins in our datasets; for example, even the most commonly occurring GO term appears in only one-third of the proteins in EU16. We conjecture that this low frequency of occurrence of common GO terms makes ISF ineffective for subcellular localization.
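The four construction schemes can be sketched as follows (a minimal illustration; the function name and data layout are our own assumptions, not GOASVM's actual implementation):

```python
import math
from collections import Counter

def go_vectors(protein_go_terms, scheme="TF"):
    """Build GO feature vectors for a list of proteins.

    protein_go_terms: list of lists; the k-th inner list holds the GO terms
    (with repetition) retrieved for the k-th protein.
    scheme: one of "1-0", "TF", "ISF", "TF-ISF".
    """
    vocab = sorted({t for terms in protein_go_terms for t in terms})
    n = len(protein_go_terms)
    # inverse sequence frequency: log(n / number of proteins containing the term),
    # by analogy with inverse document frequency in document retrieval
    isf = {t: math.log(n / sum(1 for terms in protein_go_terms if t in terms))
           for t in vocab}
    vectors = []
    for terms in protein_go_terms:
        counts = Counter(terms)
        vec = []
        for t in vocab:
            if scheme == "1-0":          # presence/absence only
                vec.append(1.0 if counts[t] else 0.0)
            elif scheme == "TF":         # raw term frequency
                vec.append(float(counts[t]))
            elif scheme == "ISF":        # presence weighted by ISF
                vec.append((1.0 if counts[t] else 0.0) * isf[t])
            else:                        # TF-ISF: frequency weighted by ISF
                vec.append(counts[t] * isf[t])
        vectors.append(vec)
    return vocab, vectors
```

The resulting vectors would then be fed to a linear SVM, as described in the text.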

Many existing GO-based methods use the 1-0 value approach to constructing GO vectors, including ProLoc-GO [99], Euk-OET-PLoc [46], and Hum-PLoc [44]. Table 9.1 shows that term frequency (TF) performs almost 2% better than the 1-0 value approach (72.20% vs. 70.72%). Similar conclusions can also be drawn from the performance of GOASVM based on leave-one-out cross-validation on the EU16 and HUM12 training sets. The results are biologically relevant, because proteins of the same subcellular localization are expected to have similar numbers of occurrences of the same GO terms. In this regard, the 1-0 value approach is inferior, because it quantizes the number of occurrences of a GO term to 0 or 1.

Because of the inferior performance of the 1-0 value approach, predictors proposed in recent years typically use more advanced methods to extract features from the GOA database. For example, the GO vectors used by iLoc-Euk [52], iLoc-Hum [53], iLoc-Plant [230], iLoc-Gpos [231], iLoc-Gneg [232], and iLoc-Virus [233] depend on the frequency of occurrences of GO terms, which is to some extent similar to the TF approach.

9.1.2 Performance of successive-search strategy

Because the novel proteins were only recently added to Swiss-Prot, many of them have not been annotated in the GOA database. As a result, if we used the accession numbers of these proteins to search the GOA database, the corresponding GO vectors would contain all zeros. This suggests that we should use the ACs of their homologs as the search keys, i.e., the procedure shown in Figure 4.4 should be adopted. However, we observed that for some novel proteins, even the top homologs do not have any GO terms annotated to them. In particular, in the NE16 dataset, there are 169 protein sequences whose top homologs do not have any GO terms (second row of Table 9.2), making GOASVM unable to make any predictions for them. As can be seen from Table 9.2, using only the first homolog, the overall prediction accuracy of GOASVM is only 57.07% (347/608). To overcome this limitation, the following strategy was adopted. For the 169 proteins (second row of Table 9.2) whose top homologs do not have any GO terms in the GOA database, we used the second homolog to find the GO terms; similarly, for the 112 proteins (third row of Table 9.2) whose top two homologs do not have any GO terms, the third homolog was used; and so on, until every query protein corresponds to at least one GO term. In the rare case where BLAST fails to find any homologs, the default E-value threshold (the -e option) can be relaxed. A detailed description of this strategy can be found in Section 4.1.2 of Chapter 4.
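The successive-search strategy can be sketched as follows (a simplified illustration; the dictionary-based GOA lookup and the function name are our own assumptions):

```python
def fetch_go_terms(query_ac, homolog_acs, goa_lookup, k_max=7):
    """Successive-search strategy: try the query's own AC first; if it yields
    no GO terms, fall back to the 1st, 2nd, ..., k_max-th homolog in turn.

    homolog_acs: homolog ACs sorted by decreasing BLAST similarity.
    goa_lookup: dict mapping accession number -> list of GO terms (a stand-in
    for the GOA hash table described in the text).
    """
    terms = goa_lookup.get(query_ac, [])
    if terms:
        return terms
    for ac in homolog_acs[:k_max]:       # k = 1, ..., k_max
        terms = goa_lookup.get(ac, [])
        if terms:
            return terms
    return []  # caller may relax the BLAST E-value threshold and retry
```

With kmax = 7, this corresponds to the last GOASVM row of Table 9.2, where every novel protein obtains at least one GO term.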

Table 9.2: Performance of GOASVM using the successive-search strategy on the novel eukaryotic (NE16) dataset described in Table 8.3. The second column represents the upper bound of k in qi,k shown in Figure 4.5 of Section 4.1.2 in Chapter 4. For example, when kmax = 2, only the AC of the first or second homolog will be used for retrieving the GO terms. "No. of sequences without GO terms" means the number of protein sequences for which no GO terms can be retrieved. OMCC: overall MCC; WAMCC: weighted average MCC; ACC: overall accuracy. See Supplementary Materials for the definitions of these performance measures. Note that for fair comparison, the baseline shown here is the performance of Euk-OET-PLoc, which we implemented and which adopts the same procedure as GOASVM to obtain GO terms from homologs.

Method kmax No. of sequences without GO terms OMCC WAMCC ACC
GOASVM 1 169 0.5421 0.5642 57.07%
2 112 0.5947 0.6006 62.01%
3 12 0.6930 0.6834 71.22%
4 7 0.6980 0.6881 71.71%
5 3 0.7018 0.6911 72.04%
6 3 0.7018 0.6911 72.04%
7 0 0.7035 0.6926 72.20%
Baseline* 7 0 0.5246 0.5330 55.43%
* Since the webserver of Euk-OET-PLoc is no longer available, we implemented it according to [46].

Table 9.2 shows the prediction performance of GOASVM on the NE16 dataset (608 novel proteins). As explained earlier, to ensure that these proteins are novel to GOASVM, 2423 proteins extracted from the training set of EU16 were used for training the classifier. For fair comparison, Euk-OET-PLoc [46] also uses the same version of the GOA Database (March 8, 2011) to retrieve GO terms and adopts the same procedure as GOASVM to obtain GO terms from homologs. In this case, it is unnecessary for Euk-OET-PLoc to use PseAA [35] as a backup method, because a valid GO vector can be found for every protein in this novel dataset. Also, following Euk-OET-PLoc [46], several parameters were optimized, and only the best performance is shown here (see the last row of Table 9.2). As can be seen, GOASVM performs significantly better than Euk-OET-PLoc (72.20% vs. 55.43%), demonstrating that GOASVM is more capable of predicting novel proteins than Euk-OET-PLoc. Moreover, the results clearly suggest that when more distant homologs are allowed in the search for GO terms in the GOA database, there is a higher chance of finding at least one GO term for each of these novel proteins, thus improving the overall performance. In particular, when the most distant homolog has rank 7 (kmax = 7), GOASVM is able to find GO terms for all of the novel proteins, and the accuracy is also the highest, almost 15% (absolute) higher than when using only the top homolog. Given the novelty of these proteins and the low sequence similarity (< 25%), an accuracy of 72.2% is fairly high, suggesting that the homologs of novel proteins can provide useful GO information for protein subcellular localization.

Note that the gene association file we downloaded from the GOA database does not provide any subcellular localization labels. This file only allows us to create a hash table for storing the association between the accession numbers and their corresponding GO terms. This hash table covers all of the accession numbers in the GOA Database released on March 18, 2011, meaning that it will cover the EU16 (dated 2005) but not the accession numbers in the novel eukaryotic dataset. It is important to emphasize that, given a query protein, having a match in this hash table does not mean that a subcellular localization assignment can be obtained. In fact, having a match only means that a nonnull GO vector can be obtained. After that, the SVMs play an important role in classifying the nonnull GO vector.
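Building such a hash table from a gene-association file might look like the following sketch (assuming the standard GAF column layout, in which the accession number is column 2 and the GO ID is column 5; the function name is illustrative):

```python
from collections import defaultdict

def build_ac_to_go_table(gaf_lines):
    """Build an AC -> GO-term hash table from GOA gene-association (GAF) lines.

    In GAF format the accession number (DB Object ID) is column 2 and the
    GO ID is column 5 (1-indexed); comment lines start with '!'.
    """
    table = defaultdict(list)
    for line in gaf_lines:
        if line.startswith("!"):         # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        ac, go_id = cols[1], cols[4]
        table[ac].append(go_id)          # one AC may map to many GO terms
    return table
```

As the text emphasizes, a hit in this table yields only a nonnull GO vector; the SVM still has to classify that vector into a subcellular location.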

9.1.3 Comparing with methods based on other features

Table 9.3 shows the performance of GOASVM using different features and different SVM classifiers on the EU16 dataset. The penalty factor for training the SVMs was set to 0.1 for both linear SVMs and RBF-SVMs. For RBF-SVMs, the kernel parameter was set to 1. For the first four methods, the vector norm was adopted for better classification performance. GapAA [162] uses a maximum gap length of 48 (the minimum length of all the sequences is 50). As AA, PairAA, and PseAA produce low-dimensional feature vectors, the performance achieved by RBF-SVMs is better than that achieved by linear SVMs; therefore, we present only the performance of RBF-SVMs. As can be seen, amino acid composition and its variants are not good features for subcellular localization: the highest accuracy is only 48.66%. Moreover, although the homology-based method achieves better accuracy (54.52%) than the amino acid composition-based methods, the performance is still poor, probably because of the low sequence similarity in this dataset. On the other hand, GOASVM achieves a significantly better performance (94.55%), almost 40% (absolute) better than the homology-based method. This suggests that the gene ontology-based method provides significantly richer information pertaining to protein subcellular localization than the other methods.

Table 9.3: Performance of different features and different SVM classifiers on the EU16 training dataset. Features include amino acid composition (AA) [155], amino acid pair composition (PairAA) [155], AA composition with gap (length = 48) (GapAA) [162], pseudo AA composition (PseAA) [35], and profile alignment scores [132].


9.1.4 Comparing with state-of-the-art GO methods

To further demonstrate the superiority of GOASVM over other state-of-the-art GO methods, we also performed experiments on the EU16 and HUM12 datasets. Table 9.4 compares the performance of GOASVM against three state-of-the-art GO-based methods on these two datasets. As Euk-OET-PLoc and Hum-PLoc could not produce valid GO vectors for some proteins in EU16 and HUM12, both methods use PseAA as a backup. ProLoc-GO uses either the ACs of proteins as search keys or the ACs of homologs returned by BLAST. GOASVM also uses BLAST to find homologs, but unlike ProLoc-GO, GOASVM is more flexible in selecting the homologs. Specifically, instead of using the top homolog only, GOASVM can use lower-rank homologs if the higher-rank homologs result in null GO vectors.

Table 9.4 shows that ProLoc-GO performs better with ACs as input than with sequences (ACs of homologs). However, the results for GOASVM are not conclusive in this regard: under LOOCV, using ACs as input performs better than using sequences, but the situation is reversed under independent tests. Table 9.4 also shows that, whether using ACs or sequences as input, GOASVM performs better than Euk-OET-PLoc and ProLoc-GO on both the EU16 and HUM12 datasets.

To show that the high performance of GOASVM is not purely attributable to the homologous information obtained from BLAST, we used BLAST directly as a subcellular localization predictor. Specifically, the subcellular location of a query protein is determined by that of its closest homolog, as found by BLAST using Swiss-Prot 2012_04 as the protein database. The subcellular locations of the homologs were obtained from their CC fields in Swiss-Prot. The results in Table 9.4 show that this approach performs significantly worse than the machine-learning approaches, suggesting that homologous information alone is not sufficient for subcellular localization prediction. The authors of [24] also used BLAST to find the subcellular locations of proteins, and their results likewise suggest that using BLAST alone is not sufficient for reliable prediction.
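This BLAST baseline amounts to the following sketch (hypothetical names; the homolog ranking and the AC-to-location map are assumed to come from BLAST and the Swiss-Prot CC field, respectively):

```python
def blast_baseline_predict(homolog_hits, ac_to_location):
    """Nearest-homolog baseline: label a query protein with the subcellular
    location of its closest homolog that has an annotated location.

    homolog_hits: homolog ACs sorted by decreasing BLAST similarity.
    ac_to_location: dict mapping AC -> subcellular location (from the
    Swiss-Prot CC field).
    """
    for ac in homolog_hits:              # walk down the ranked hit list
        loc = ac_to_location.get(ac)
        if loc is not None:
            return loc
    return None                          # no annotated homolog found
```

Contrasting this single-nearest-neighbor rule with GOASVM highlights that the gain comes from the GO feature vectors and the SVM, not from homology search alone.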

Table 9.4: Comparing GOASVM with state-of-the-art GO-based methods on (a) the EU16 dataset and (b) the HUM12 dataset. S: sequences; AC: accession number; LOOCV: leave-one-out cross-validation. Values within parentheses are WAMCC. See equation (8.8) for the definition of WAMCC. (–) means that the corresponding references do not provide the WAMCC.

Method Input data Feature Accuracy (WAMCC)
LOOCV Independent test
ProLoc-GO [99] S GO (using BLAST) 86.6% (0.7999) 83.3% (0.706)
ProLoc-GO [99] AC GO (No BLAST) 89.0% (–) 85.7% (0.710)
Euk-OET-PLoc [46] S + AC GO + PseAA 81.6% (–) 83.7% (–)
GOASVM S GO (using BLAST) 94.68% (0.9388) 93.86% (0.9252)
GOASVM AC GO (No BLAST) 94.55% (0.9379) 94.61% (0.9348)
BLAST [5] S 56.75% (–) 60.39% (–)
(a) Performance on the EU16 dataset.
Method Input data Feature Accuracy (WAMCC)
LOOCV Independent test
ProLoc-GO [99] S GO (using BLAST) 90.0% (0.822) 88.1% (0.661)
ProLoc-GO [99] AC GO (No BLAST) 91.1% (–) 90.6% (0.724)
Hum-PLoc [44] S + AC GO + PseAA 81.1% (–) 85.0% (–)
GOASVM S GO (using BLAST) 91.73% (0.9033) 94.21% (0.9346)
GOASVM AC GO (No BLAST) 91.51% (0.9021) 94.39% (0.9367)
BLAST [5] S 68.55% (–) 65.69% (–)
(b) Performance on the HUM12 dataset.

Although all the datasets mentioned in this section were cut off at 25% sequence similarity, the performance of GOASVM increases from 72.20% (Table 9.2) on the novel dataset (NE16) to more than 90% (Table 9.4) on both the EU16 and HUM12 datasets. This is mainly because in Table 9.4 the training and test sets were constructed at the same time, whereas the training and test sets in Table 9.2 were created six years apart, which causes the latter to have less similarity in GO information between the training and test sets than the former. This in turn implies that the performance of GOASVM on the novel dataset (Table 9.2) more objectively reflects the classification capability of the predictors.

9.1.5 GOASVM using old GOA databases

The newer the version of the GOA Database, the more annotation information it contains. To investigate how the updated information affects the performance of GOASVM, we performed experiments using an earlier version (published in Oct. 2005) of the GOA Database and compared the results with Euk-OET-PLoc on the EU16 dataset. For proteins without a GO term in the GOA Database, pseudo amino acid composition (PseAA) [35] was used as the backup feature. Comparing the last and second-to-last rows of Table 9.5 reveals that using newer versions of the GOA Database achieves better performance than using older versions. This suggests that annotation information is very important to the prediction. The results also show that GOASVM significantly outperforms Euk-OET-PLoc, suggesting that the GO-vector construction method and classifier used in GOASVM (term frequency and SVMs) are superior to those used in Euk-OET-PLoc (1-0 value and K-NN).

Table 9.5: Performance of GOASVM based on different versions of the GOA database on the EU16 training dataset. The second column specifies the publication year of the GOA database being used for constructing the GO vectors. For proteins without a GO term in the GOA Database, pseudo amino acid composition (PseAA) was used as the backup feature. LOOCV: leave-one-out cross validation.

Method Feature Accuracy
LOOCV Independent test
Euk-OET-PLoc [46] GO (GOA2005) 81.6% 83.7%
GOASVM GO (GOA2005) 86.42% 89.11%
GOASVM GO (GOA2011) 94.55% 94.61%

9.2 Performance of FusionSVM

9.2.1 Comparing GO vector construction methods and normalization methods

Table 9.6 shows the performance of 12 different configurations of InterProGOSVM on the OE11 dataset (see Table 8.4). For ease of reference, we label these methods GO_1, GO_2, ..., GO_12. Linear SVMs were used in all cases, and the penalty factor was set to 1. When vector normalization or geometric averaging is used to postprocess the GO vectors, the inverse sequence frequency produces the most discriminative GO vectors, as evident in the higher accuracy, OMCC, and WAMCC of GO_6 and GO_10. For the other construction methods, using the raw GO vectors (without postprocessing) as the SVM input achieves the best performance, as evident in the higher accuracy, OMCC, and WAMCC of GO_1, GO_3, and GO_4. This suggests that postprocessing could remove some of the subcellular localization information in the raw GO vectors. Overall, GO_1 achieves the best performance, suggesting that postprocessing is not necessary.

Table 9.6: Performance of InterProGOSVM using different approaches to constructing the raw GO vectors and different postprocessing approaches to normalizing the raw GO vectors. “None” in postprocessing means that the raw GO vectors qi are used as input to the SVMs. ISF: inverse sequence frequency; TF: term frequency; TF-ISF: term frequency inverse sequence frequency.

Config ID GO vectors construction Postprocessing ACC OMCC WAMCC
GO_1 1-0 value None 72.21% 0.6943 0.6467
GO_2 ISF None 71.89% 0.6908 0.6438
GO_3 TF None 71.99% 0.6919 0.6451
GO_4 TF-ISF None 71.15% 0.6827 0.6325
GO_5 1-0 value Vector norm 71.25% 0.6837 0.6335
GO_6 ISF Vector norm 72.02% 0.6922 0.6427
GO_7 TF Vector norm 70.96% 0.6806 0.6293
GO_8 TF-ISF Vector norm 71.73% 0.6890 0.6386
GO_9 1-0 value Geometric mean 70.51% 0.6756 0.6344
GO_10 ISF Geometric mean 72.08% 0.6929 0.6468
GO_11 TF Geometric mean 70.64% 0.6771 0.6290
GO_12 TF-ISF Geometric mean 71.03% 0.6813 0.6391

9.2.2 Performance of PairProSVM

Table 9.7 shows the performance of different SVMs using various features extracted from the protein sequences. The features include amino acid composition (AA) [155], amino-acid pair composition (PairAA) [155], AA composition with the maximum gap length equal to 59 (the minimum length of all of the 3120 sequences is 61) (GapAA) [162], pseudo AA composition (PseAA) [35], profile alignment scores, and GO vectors. The penalty factor for training the SVMs was set to 1 for both linear SVMs and RBF-SVMs. For RBF-SVMs, the kernel parameter was set to 1. As AA and PairAA produce low-dimensional feature vectors, the performance achieved by the RBF-SVM is better than that of the linear SVM; therefore, only the performance of the RBF-SVM is presented.

Table 9.7 shows that amino acid composition and its variants are not good features for subcellular localization. The AA method exploits only amino acid composition information, so it performs the worst. Because PairAA, GapAA, and extended PseAA extract sequence-order information, their combination achieves slightly better prediction performance. Among the amino acid-based methods, the highest accuracy is only 61.44%. On the other hand, the homology-based method, which exploits the homologous sequences in protein databases (via PSI-BLAST), achieves a significantly better performance. This suggests that the information in the amino acid sequences alone is limited; in contrast, the homology-based PairProSVM can extract much more valuable information about protein subcellular localization than the amino acid-based methods. The higher OMCC and WAMCC also suggest that PairProSVM can better handle imbalanced problems. The results further suggest that InterProGOSVM outperforms the amino acid composition methods and is comparable, although slightly inferior, to PairProSVM.

Table 9.7: Comparing different features and different SVM classifiers on the FusionSVM dataset (3572 proteins). Performance obtained by using amino acid composition (AA) [155], amino-acid pair composition (PairAA) [155], AA composition with gap (length = 59) (GapAA) [162], pseudo AA composition (PseAA) [35], and profile alignment scores as feature vectors and different SVMs as classifiers. The last two rows correspond to the PairProSVM proposed in [132] and InterProGOSVM.


9.2.3 Performance of FusionSVM

Table 9.8 shows the performance of fusing InterProGOSVM and PairProSVM. The performance was obtained by optimizing the fusion weight wGO in equation (4.19) (based on the test dataset). The results show that the combination of PairProSVM and GO_10 (ISF with geometric mean) achieves the highest accuracy, 79.04%, which is significantly better than PairProSVM (77.05%) and InterProGOSVM (72.21%) alone. The results also suggest that the combination of PairProSVM with any configuration of InterProGOSVM outperforms the individual methods. This is mainly because the information obtained from homology search and from functional-domain databases reflects different perspectives and is therefore complementary.
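Assuming equation (4.19) takes the common form of a convex combination of the two classifiers' per-class SVM scores, the fusion can be sketched as follows (names are our own; this is an illustration, not the authors' code):

```python
def fuse_scores(go_scores, profile_scores, w_go):
    """Per-class score fusion (cf. equation (4.19)): combine the
    InterProGOSVM scores and the PairProSVM scores with weight w_go,
    then predict the class with the largest fused score.

    go_scores, profile_scores: per-class SVM scores for one query protein.
    w_go: fusion weight in [0, 1]; w_go = 1 uses only InterProGOSVM,
    w_go = 0 uses only PairProSVM.
    """
    fused = [w_go * g + (1.0 - w_go) * p
             for g, p in zip(go_scores, profile_scores)]
    return max(range(len(fused)), key=fused.__getitem__)  # arg-max class
```

The optimal wGO values in Table 9.8 (around 0.3 to 0.45) correspond to weighting PairProSVM somewhat more heavily than InterProGOSVM.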

Table 9.8: Performance of the fusion of InterProGOSVM and PairProSVM.

Method I Optimal wGO ACC OMCC WAMCC
GO_1 0.45 78.91% 0.7680 0.7322
GO_2 0.26 78.56% 0.7641 0.7260
GO_3 0.40 78.75% 0.7662 0.7291
GO_4 0.37 78.72% 0.7659 0.7285
GO_5 0.37 78.78% 0.7666 0.7293
GO_6 0.34 78.78% 0.7666 0.7294
GO_7 0.43 78.81% 0.7670 0.7289
GO_8 0.29 78.40% 0.7624 0.7234
GO_9 0.42 78.97% 0.7687 0.7318
GO_10 0.45 79.04% 0.7694 0.7335
GO_11 0.40 78.37% 0.7620 0.7222
GO_12 0.37 78.62% 0.7648 0.7263

Surprisingly, combining the best-performing InterProGOSVM configuration with the profile-alignment method does not give the best fusion performance. Moreover, different configurations achieve their best performance at different optimal values of wGO. Since PairProSVM performs somewhat better than InterProGOSVM, it is reasonable that the optimal weights assign less weight to InterProGOSVM and more to PairProSVM.

9.2.4 Effect of the fusion weights on the performance of FusionSVM

As mentioned above, wGO significantly influences the final performance of each configuration, so it is worth examining how this parameter affects the accuracy of FusionSVM. Here we chose the best-performing configuration, GO_10. Figure 9.1 shows the performance of combining GO_10 (InterProGOSVM with Config ID GO_10) and PairProSVM as wGO varies from 0 to 1. As can be seen, the performance changes smoothly with wGO, suggesting that wGO does not abruptly affect the performance of the combined method and that the improvement over PairProSVM holds for a wide range of wGO. Further, to show that the improvement of the combined method over each individual method is statistically significant, we performed McNemar's test [140] on their SVM scores to compare their performance [78]. The p-value between the accuracy of the combined system (GO_10 and PairProSVM) and that of PairProSVM is 0.0055 (≪ 0.05), which suggests that the combined predictor performs significantly better than the PairProSVM predictor alone.


Fig. 9.1: Performance of combining GO_10 and PairProSVM using different fusion weight wGO.

9.3 Performance of mGOASVM

9.3.1 Kernel selection and optimization

A support vector machine (SVM) can use a linear, RBF, or polynomial function as its kernel. Some works [15, 162] have demonstrated that RBF kernels achieve better results than linear and polynomial kernels. However, our results show that linear SVMs perform better on our datasets. Table 9.9 shows the performance of mGOASVM using different kernel functions with different parameters, based on leave-one-out cross-validation (LOOCV) on the virus dataset. For the RBF SVM, the kernel parameter σ was selected from the set {2−2, 2−1,..., 25}. For the polynomial SVM, the degree was set to either 2 or 3. The penalty parameter (C) was set to 0.1 in all cases. Table 9.9 shows that SVMs with the linear kernel perform better than those with RBF and polynomial kernels. This is plausible, because the dimension of the GO vectors is larger than the number of training vectors, aggravating the curse-of-dimensionality problem in nonlinear SVMs [14]. The overfitting problem becomes more severe when the degree of nonlinearity is high (small σ), leading to a reduction in performance, as demonstrated in Table 9.9. In other words, highly nonlinear SVMs become vulnerable to overfitting due to the high dimensionality of the GO vectors.

9.3.2 Term-frequency for mGOASVM

Table 9.10 compares the 1-0 value and term-frequency approaches to constructing GO vectors. Linear SVMs with penalty factor C = 0.1 were used in all cases. The results show that while the term-frequency (TF) approach is only slightly better than the 1-0 value approach in terms of locative accuracy, it performs almost 2% and 7% better in actual accuracy on the virus and plant datasets, respectively. This suggests that the frequency of occurrences of GO terms provides additional information about subcellular locations. The results are biologically relevant, because proteins of the same subcellular localization are expected to have similar numbers of occurrences of the same GO terms. In this regard, the 1-0 value approach is inferior, because it quantizes the number of occurrences of a GO term to 0 or 1. In fact, the increase in actual accuracy is even more substantial on the plant dataset than on the virus dataset. Given the large size and number of multi-label proteins in the plant dataset, the remarkable performance of the TF approach suggests that the term frequencies do possess extra information that the 1-0 values do not have.

Table 9.9: Performance of mGOASVM using different kernels with different parameters based on leave-one-out cross validation (LOOCV) using the virus dataset. The penalty parameter (C) was set to 0.1 for all cases. σ is the kernel parameter for the RBF SVM; d is the polynomial degree in the polynomial SVM.

Kernel Parameter Locative accuracy Actual accuracy
Linear SVM 244/252 = 96.8% 184/207 = 88.9%
RBF SVM σ =2-2 182/252 = 72.2% 53/207 = 25.6%
σ =2-1 118/252 = 46.8% 87/207 = 42.0%
σ =1 148/252 = 58.7% 116/207 = 56.0%
σ =21 189/252 = 75.0% 142/207 = 68.6%
σ =22 223/252 = 88.5% 154/207 = 74.4%
σ =23 231/252 = 91.7% 150/207 = 72.5%
σ =24 233/252 = 92.5% 115/207 = 55.6%
σ =25 136/252 = 54.0% 5/207 = 2.4%
Polynomial SVM d = 2 231/252 = 91.7% 180/207 = 87.0%
d = 3 230/252 = 91.3% 178/207 = 86.0%

Table 9.10: Performance of different GO-vector construction methods based on leave-one-out cross validation (LOOCV) for (a) the virus dataset and (b) the plant dataset.

GO vector construction methods Locative accuracy Actual accuracy
1-0 value 244/252 = 96.8% 179/207 = 86.5%
Term frequency (TF) 244/252 = 96.8% 184/207 = 88.9%
(a) Performance on the viral protein dataset
GO vector construction methods Locative accuracy Actual accuracy
1-0 value 1014/1055 = 96.1% 788/978 = 80.6%
Term frequency (TF) 1015/1055 = 96.2% 855/978 = 87.4%
(b) Performance on the plant protein dataset

9.3.3 Multi-label properties for mGOASVM

To show that the high locative accuracies of mGOASVM are due to the capability of mGOASVM rather than to over-prediction, we investigated the distributions of the number of predicted labels in both the virus and plant datasets. For the i-th protein, the numbers of predicted labels and of true labels are as defined in equation (8.14). The distributions of the number of labels predicted by mGOASVM are shown in Table 9.11. For each k (k = 0,...,5 for the virus dataset and k = 0,...,11 for the plant dataset), we count the numbers of proteins that are over-, equal-, and under-predicted by k labels, and denote by No, Ne, and Nu the total numbers of proteins that are over-, equal-, and under-predicted, respectively. Here, over-prediction, equal-prediction, and under-prediction mean that the number of predicted labels is larger than, equal to, and smaller than the number of true labels, respectively. Table 9.11 shows that proteins that are over- or under-predicted account for only a small percentage of the datasets (8.7% over- and 1.0% under-predicted in the virus dataset; 8.7% and 1.4% in the plant dataset). Even among the over-predicted proteins, most are over-predicted by only one location: all 18 such proteins in the virus dataset and 83 out of 85 in the plant dataset. No protein in the virus dataset is over-predicted by more than one location; only 2 of the 85 over-predicted proteins in the plant dataset are over-predicted by two locations, and none by more than two. As for under-prediction, all under-predicted proteins in both datasets are under-predicted by only one location. These results demonstrate that the over- and under-prediction percentages are small, which suggests that mGOASVM can effectively determine the number of subcellular locations of a query protein.
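Counting over-, equal-, and under-predictions can be sketched as follows (illustrative names; label sets stand in for the prediction and ground-truth label lists):

```python
def prediction_balance(pred_labels, true_labels):
    """Count proteins that are over-, equal-, and under-predicted, i.e.
    whether the number of predicted labels exceeds, matches, or falls short
    of the number of true labels (cf. Table 9.11).

    pred_labels, true_labels: parallel lists of label sets, one per protein.
    """
    n_over = n_equal = n_under = 0
    for pred, true in zip(pred_labels, true_labels):
        d = len(pred) - len(true)
        if d > 0:
            n_over += 1                  # over-predicted by d labels
        elif d == 0:
            n_equal += 1
        else:
            n_under += 1                 # under-predicted by -d labels
    return n_over, n_equal, n_under
```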

Table 9.11: Distribution of the number of labels predicted by mGOASVM for proteins in the virus and plant datasets. For the i-th protein (i = 1,...,Nact), the numbers of predicted and true labels are as defined in equation (8.14). Over-prediction: the number of predicted labels is larger than the number of true labels; equal-prediction: equal to the number of true labels; under-prediction: smaller than the number of true labels. For each k (k = 0,...,5 for the virus dataset and k = 0,...,11 for the plant dataset), the table lists the numbers of proteins that are over-, equal-, and under-predicted by k labels; No, Ne, and Nu denote the total numbers of proteins that are over-, equal-, and under-predicted, respectively.


Table 9.12: Performance of mGOASVM for multi-location proteins using different types of inputs on (a) the virus dataset and (b) the plant dataset. S: sequence; AC: accession number; #homo: number of homologs used in the experiments; l (l = 1,...,3): Number of colocations. #homo=0 means that only the true accession number is used.

Input data #homo Actual accuracy of protein groups Overall actual accuracy
l = 1 l = 2 l = 3
AC 0 154/165 = 93.3% 34/39 = 87.2% 3/3 = 100% 191/207 = 92.3%
S 1 148/165 = 89.7% 33/39 = 84.6% 3/3 = 100% 184/207 = 88.9%
S + AC 1 151/165 = 91.5% 34/39 = 87.2% 3/3 = 100% 188/207 = 90.8%
(a) Performance on the viral protein dataset.
Input data #homo Actual accuracy of protein groups Overall actual accuracy
l = 1 l = 2 l = 3
AC 0 813/904 = 89.9% 49/71 = 69.0% 1/3 = 33.3% 863/978 = 88.2%
S 1 802/904 = 88.7% 52/71 = 73.2% 1/3 = 33.3% 855/978 = 87.4%
S + AC 1 811/904 = 89.7% 47/71 = 66.2% 1/3 = 33.3% 859/978 = 87.8%
(b) Performance on the plant protein dataset.

Table 9.12 shows the performance of mGOASVM for multi-location proteins using different types of inputs. Denote l (l = 1,...,l_max) as the number of co-locations. As the maximum number of co-locations in both datasets is 3, the actual accuracies for l = 1,...,3 are shown in Table 9.12. Note that high actual accuracies are more difficult to achieve for l > 1 than for l = 1, since not only must the number of subcellular locations of a protein be correctly predicted, but the subcellular locations themselves must also be precisely identified. As can be seen, mGOASVM achieves high performance not only for single-label proteins (the column corresponding to l = 1) but also for multi-label proteins (the columns corresponding to l = 2 and l = 3). These results demonstrate that mGOASVM handles multi-label problems well.
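The per-group figures in Table 9.12 can be reproduced by grouping proteins by their number of true labels and scoring each group by exact set match. The following is a minimal sketch (function name and toy data are hypothetical):

```python
from collections import defaultdict

def actual_accuracy_by_group(pred_labels, true_labels):
    """Exact-match ("actual") accuracy, broken down by the number of
    co-locations l = |true label set| of each protein.

    A protein counts as correct only if its predicted label set equals
    its true label set exactly, which is why l > 1 is harder than l = 1.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, true in zip(pred_labels, true_labels):
        l = len(true)
        total[l] += 1
        correct[l] += int(pred == true)  # exact match of the whole label set
    return {l: correct[l] / total[l] for l in total}

# Toy example: two single-label and two dual-label proteins
pred = [{"nucleus"}, {"chloroplast"},
        {"nucleus", "cytoplasm"}, {"chloroplast"}]
true = [{"nucleus"}, {"chloroplast"},
        {"nucleus", "cytoplasm"}, {"chloroplast", "mitochondrion"}]
print(actual_accuracy_by_group(pred, true))  # → {1: 1.0, 2: 0.5}
```

The overall actual accuracy in the last column of Table 9.12 is simply the same exact-match count taken over all proteins regardless of l.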

9.3.4 Further analysis of mGOASVM

Table 9.13 shows the performance of mGOASVM with different types of inputs and different numbers of homologous proteins for the virus and plant datasets. The input data can be of three possible types: (1) accession number only, (2) sequence only and (3) both accession number and sequence. mGOASVM can extract information from these inputs by producing multiple GO vectors for each protein. Denote #homo as the number of homologous proteins, where #homo ∈ {0,1,2,4,8} for the virus dataset and #homo ∈ {0,1,2} for the plant dataset. For different combinations of inputs and numbers of homologs, the number of distinct GO terms can be different. Typically, the number of distinct GO terms increases with the number of homologs.

Table 9.13: Performance of mGOASVM with different types of inputs and different numbers of homologous proteins for (a) the virus dataset and (b) the plant dataset. S: sequence; AC: accession number; # homo: number of homologs used in the experiments; Nd(GO): number of distinct GO terms. # homo=0 means that only the true accession number is used.

Input data type #homo Nd(GO) Locative accuracy Actual accuracy
AC 0 331 244/252 = 96.8% 191/207 = 92.3%
S 1 310 244/252 = 96.8% 184/207 = 88.9%
S 2 455 235/252 = 93.3% 178/207 = 86.0%
S 4 664 221/252 = 87.7% 160/207 = 77.3%
S 8 1134 202/252 = 80.2% 130/207 = 62.8%
S + AC 1 334 242/252 = 96.0% 188/207 = 90.8%
S + AC 2 460 238/252 = 94.4% 179/207 = 86.5%
S + AC 4 664 230/252 = 91.3% 169/207 = 81.6%
S + AC 8 1134 216/252 = 85.7% 145/207 = 70.1%
(a) Viral protein dataset.
Input data type #homo Nd(GO) Locative accuracy Actual accuracy
AC 0 1532 1023/1055 = 97.0% 863/978 = 88.2%
S 1 1541 1015/1055 = 96.2% 855/978 = 87.4%
S 2 1906 907/1055 = 85.8% 617/978 = 63.1%
S + AC 1 1541 1010/1055 = 95.7% 859/978 = 87.8%
S + AC 2 1906 949/1055 = 90.0% 684/978 = 70.0%
(b) Plant protein dataset.

Table 9.13 shows that the number of homologs can affect the performance of mGOASVM. The results are biologically relevant, because the homologs can provide information about the subcellular locations. However, more homologs may introduce redundant or even noisy information, which is detrimental to the prediction accuracy. For example, in the plant dataset, the performance with one homolog is better than with two (87.4% vs. 63.1%), which suggests that the number of homologs should be limited to avoid bringing in irrelevant information. Moreover, as can be seen from Table 9.13, the performance achieved by mGOASVM using sequences with the top homolog is comparable to that of mGOASVM using the accession number only.

Table 9.13 also shows that mGOASVM performs better when using both sequences and accession numbers as inputs than when using sequences only. However, if accession numbers are available, it is better to use accession numbers alone, as evidenced by the superior performance shown in the first rows of Tables 9.13a and 9.13b.

9.3.5 Comparing prediction results of novel proteins

To demonstrate the superiority of mGOASVM over state-of-the-art predictors, the prediction results of 10 typical novel proteins by mGOASVM and Plant-mPLoc are shown in Table 9.14. As can be seen, for the single-location protein Q8VYI3, mGOASVM correctly predicts it to be located in the "peroxisome", while Plant-mPLoc gives a wrong prediction ("chloroplast"); for the multi-location protein F4JV80, mGOASVM correctly predicts it to be located in both the "chloroplast" and the "mitochondrion", while Plant-mPLoc predicts it to be a single-location protein located in the "nucleus". Similarly, mGOASVM predicts Q93YP7, Q9LK40, and Q6NPS8 correctly, while Plant-mPLoc gives wrong predictions for all three. Plant-mPLoc produces partially correct predictions on some of the single-location proteins, e.g., Q3ED65, Q9LQC8, and Q8VYI1; however, it wrongly considers them to be multi-label proteins. In contrast, mGOASVM correctly predicts these proteins as single-location proteins, located in the "chloroplast", "Golgi apparatus", and "endoplasmic reticulum", respectively. For Q9FNY2 and Q9FJL3, Plant-mPLoc produces only partially correct predictions, whereas mGOASVM locates both of them in exactly the right subcellular locations.

Table 9.14: Prediction results of 10 novel proteins by mGOASVM. AC: UniProtKB accession number; ground-truth location(s): the experimentally-validated actual subcellular location(s).

AC Ground-truth location(s) Prediction results
Plant-mPLoc [51] mGOASVM
Q8VYI3 peroxisome chloroplast peroxisome
F4JV80 chloroplast, mitochondrion nucleus chloroplast, mitochondrion
Q93YP7 mitochondrion cell membrane, chloroplast, Golgi apparatus mitochondrion
Q9LK40 nucleus chloroplast nucleus
Q6NPS8 cytoplasm, nucleus endoplasmic reticulum cytoplasm, nucleus
Q3ED65 chloroplast chloroplast, cytoplasm chloroplast
Q9LQC8 Golgi apparatus endoplasmic reticulum, Golgi apparatus Golgi apparatus
Q8VYI1 endoplasmic reticulum endoplasmic reticulum, vacuole endoplasmic reticulum
Q9FNY2 cytoplasm, nucleus nucleus cytoplasm, nucleus
Q9FJL3 cytoplasm, nucleus nucleus, vacuole cytoplasm, nucleus

9.4 Performance of AD-SVM

Figure 9.2 shows the performance of AD-SVM on the virus and plant datasets with respect to the adaptive-decision parameter θ (equation (5.11) in Section 5.4.2 of Chapter 5) based on leave-one-out cross-validation. As can be seen, for the virus dataset, as θ increases from 0.0 to 1.0, the overall actual accuracy first increases, reaches its peak at θ = 0.3 (with an actual accuracy of 93.2%), and then decreases. An analysis of the predicted labels suggests that the increase in actual accuracy is due to the reduction in over-prediction, i.e., the number of cases where ℓ_p(i) > ℓ_t(i) has been reduced. When θ > 0.3, the benefit of reducing over-prediction diminishes, because the criterion in equation (5.9) becomes so stringent that some proteins are under-predicted, i.e., the number of cases where ℓ_p(i) < ℓ_t(i) increases. Note that the performance at θ = 0.0 is equivalent to that of mGOASVM, and that the best actual accuracy obtained by the proposed decision scheme (93.2% at θ = 0.3) is more than 4% (absolute) higher than that of mGOASVM (88.9%).

image

Fig. 9.2: Performance of AD-SVM with respect to the adaptive-decision parameter θ based on leave-one-out cross-validation (LOOCV) on the virus and plant datasets. θ = 0 represents the performance of mGOASVM.

For the plant dataset, when θ increases from 0.0 to 1.0, the overall actual accuracy increases from 87.4% and then fluctuates at around 88%. If we use the same θ as for the virus dataset, i.e., θ = 0.3, the performance of AD-SVM is 88.3%, which is still better than that of mGOASVM at θ = 0.0.
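The behavior described above can be illustrated with a simplified sketch of an adaptive decision rule. This is an illustrative paraphrase of the idea behind equation (5.9), not necessarily its exact form: each location is kept if its SVM score reaches a threshold that tightens as θ grows, so larger θ prunes over-predicted labels but, pushed too far, starts under-predicting.

```python
def adaptive_decision(scores, theta):
    """Illustrative adaptive multi-label decision.

    scores: dict mapping location -> SVM decision score.
    theta:  adaptive-decision parameter in [0, 1]; theta = 0 roughly
            recovers a fixed-threshold scheme (cf. mGOASVM), while
            theta = 1 keeps only scores close to the maximum.
    """
    f_max = max(scores.values())
    threshold = min(1.0, theta * f_max)  # tightens with theta
    labels = {m for m, f in scores.items() if f >= threshold}
    # Always predict at least one location (the top-scoring one).
    return labels or {max(scores, key=scores.get)}

scores = {"cytoplasm": 1.2, "nucleus": 0.5, "chloroplast": -0.3}
print(adaptive_decision(scores, 0.3))  # → {'cytoplasm', 'nucleus'}
print(adaptive_decision(scores, 1.0))  # → {'cytoplasm'}
```

With a moderate θ, secondary locations with reasonably high scores survive; as θ approaches 1, only labels whose scores rival the maximum remain, which mirrors the over- to under-prediction trade-off seen in Figure 9.2.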

9.5 Performance of mPLR-Loc

9.5.1 Effect of adaptive decisions on mPLR-Loc

Figure 9.3a shows the performance of mPLR-Loc on the virus dataset for different values of θ (equation (6.13)) based on leave-one-out cross-validation. In all cases, the penalty parameter ρ of logistic regression was set to 1.0. The performance of mPLR-Loc at θ = 0.0 is not provided because, according to equations (5.20) and (6.13), all of the query proteins would be predicted as having all M subcellular locations, which defeats the purpose of prediction. As evident from Figure 9.3a, when θ increases from 0.1 to 1.0, the OAA of mPLR-Loc first increases, reaching its peak at θ = 0.5 with OAA = 0.903, which is almost 2% (absolute) higher than that of mGOASVM (0.889). The precision achieved by mPLR-Loc increases until θ = 0.5 and then remains almost unchanged for θ ≥ 0.5. In contrast, OLA and recall peak at θ = 0.1 and then drop as θ increases to 1.0. Among these metrics, regardless of θ, OAA is no higher than the other five measures.

An analysis of the predicted labels {ℓ_p(i); i = 1,...,207} suggests that the increase in OAA is due to the reduction in over-prediction, i.e., the number of cases where ℓ_p(i) > ℓ_t(i). When θ > 0.5, the benefit of reducing over-prediction diminishes, because the criterion in equation (5.20) becomes so stringent that some proteins are under-predicted, i.e., ℓ_p(i) < ℓ_t(i). When θ increases from 0.1 to 0.5, the number of cases where ℓ_p(i) > ℓ_t(i) decreases while the number of correctly predicted labels remains almost unchanged. In other words, the denominators of accuracy and F1-score decrease while the numerators of both metrics remain almost unchanged, leading to better performance on both. When θ > 0.5, for a similar reason, the increase in under-prediction outweighs the benefit of the reduction in over-prediction, causing a performance loss. For precision, when θ > 0.5, the loss due to the stringent criterion is counteracted by the gain due to the reduction in ℓ_p(i), the denominator of equation (8.10); thus, the precision increases monotonically as θ increases from 0.1 to 1. However, OLA and recall decrease monotonically with respect to θ, because the denominators of these measures (see equations (8.14) and (8.11)) are independent of ℓ_p(i), while the number of correctly predicted labels in the numerator decreases as the decision criterion becomes stricter.
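The thresholding step that drives this trade-off can be sketched as follows. This is a simplified illustration in the spirit of equation (6.13), assuming the decision keeps every location whose posterior probability reaches a fraction θ of the largest posterior (the function name is hypothetical):

```python
def mplr_decide(probs, theta):
    """Illustrative probability-thresholding decision for multi-label
    prediction: keep every location whose posterior reaches a fraction
    theta of the largest posterior.

    probs: list of posterior probabilities, one per subcellular location.
    With theta = 0 every location passes the test, which is why theta = 0
    is excluded from Figure 9.3.
    """
    p_max = max(probs)
    return [m for m, p in enumerate(probs) if p >= theta * p_max]

probs = [0.70, 0.40, 0.05]          # posteriors for three locations
print(mplr_decide(probs, 0.5))      # → [0, 1]  (two locations kept)
print(mplr_decide(probs, 1.0))      # → [0]     (only the top location)
```

Raising θ shrinks the predicted label set, which reduces the denominators of accuracy, precision, and F1-score, exactly the mechanism discussed above.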

image

Fig. 9.3: Performance of mPLR-Loc with respect to the adaptive-decision parameter θ based on leave-one-out cross-validation on (a) the virus dataset and (b) the plant dataset. See equations (8.9)–(8.15) for the definitions of the performance measures in the legend.

Figure 9.3b shows the performance of mPLR-Loc (with ρ = 1) on the plant dataset. The trends of OLA, accuracy, precision, recall, and F1-score are similar to those on the virus dataset. The figure also shows that the OAA achieved by mPLR-Loc increases monotonically with respect to θ and reaches its optimum at θ = 1.0, in contrast to the results on the virus dataset, where the OAA is almost unchanged when θ ≥ 0.5.

9.5.2 Effect of regularization on mPLR-Loc

Figure 9.4 shows the performance of mPLR-Loc with respect to the penalty parameter ρ (equation (5.18)) on the virus dataset. In all cases, the adaptive-thresholding parameter θ was set to 0.8. As can be seen, the variations of OAA, accuracy, precision, and F1-score with respect to ρ are very similar. More importantly, all four of these metrics show that there is a wide range of ρ for which the performance is optimal. This suggests that introducing the penalty term in equation (5.13) not only helps to avoid numerical difficulties but also improves performance.

Figure 9.4 also shows that OLA and recall are largely unaffected by changes in ρ. This is understandable because the purpose of ρ is to overcome numerical difficulties when estimating the LR parameters β. More specifically, when ρ is small (say, log(ρ) < −5), it is insufficient to avoid matrix singularity in equation (5.17), which leads to extremely poor performance. When ρ is too large (say, log(ρ) > 5), the matrix in equation (5.16) is dominated by ρ, which also causes poor performance. The OAA of mPLR-Loc reaches its maximum of 0.903 at log(ρ) = −1.
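The role of ρ can be seen in a generic ridge-regularized Newton (IRLS) update for logistic regression. This is a standard sketch, not a transcription of equations (5.16)–(5.18); the symbols and function name are assumed for illustration. Adding ρI to the Hessian is what keeps it invertible when X^T W X alone would be singular:

```python
import numpy as np

def lr_newton_step(beta, X, y, rho):
    """One Newton-Raphson step for L2-penalized logistic regression.

    beta: current parameter vector; X: design matrix; y: 0/1 targets;
    rho:  penalty factor.  The term rho * I keeps the Hessian invertible
    even when X^T W X is singular (the numerical difficulty noted above).
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))            # predicted probabilities
    W = np.diag(p * (1 - p))                       # IRLS weight matrix
    grad = X.T @ (y - p) - rho * beta              # penalized gradient
    hess = X.T @ W @ X + rho * np.eye(X.shape[1])  # regularized Hessian
    return beta + np.linalg.solve(hess, grad)

# Tiny separable example: intercept plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = lr_newton_step(np.zeros(2), X, y, rho=0.1)
print(beta.shape)  # → (2,)
```

With ρ = 0 and linearly separable data, the update above can diverge or hit a singular Hessian; a modest ρ stabilizes the solve, while an excessively large ρ shrinks β toward zero and degrades accuracy, matching the U-shaped behavior in Figure 9.4.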

image

Fig. 9.4: Performance of mPLR-Loc with respect to the logistic-regression penalty factor ρ in equation (5.18) based on leave-one-out cross-validation on the virus dataset. See equations (8.9)–(8.15) for the definitions of the performance measures in the legend.
