This book presents a number of GO-based predictors for the subcellular localization of both single- and multi-location proteins.
For predicting single-location proteins, two predictors, GOASVM and FusionSVM, are presented; they differ mainly in the way they retrieve GO information. GOASVM extracts GO information from the GOA database, whereas FusionSVM extracts it from InterProScan. To enhance prediction performance, FusionSVM fuses the GO-based predictor InterProGOSVM with the profile alignment-based method PairProSVM. Nevertheless, GOASVM still remarkably outperforms FusionSVM. Experimental results also show the superiority of GOASVM over other state-of-the-art single-label predictors.
For predicting multi-location proteins, several advanced predictors are proposed. First, mGOASVM is presented, which performs remarkably better than existing state-of-the-art predictors with the following advantages: (1) in terms of the GO-vector construction method, it uses term-frequency instead of conventional 1-0 values; (2) in terms of handling multi-label problems, it uses a more efficient multi-label SVM classifier; (3) in terms of GO-vector space selection, it selects a relevant GO-vector subspace by finding a set of distinct GO terms instead of using all of the GO terms to define the full space; and (4) in terms of retrieving GO information, it uses a successive-search strategy to incorporate more useful homologs instead of using back-up methods.
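The first and third points above can be illustrated with a minimal sketch. It assumes that each protein has already been mapped to a list of GO terms (with repeats when a term is retrieved more than once); the function names `build_go_space` and `go_vector` and the toy annotations are hypothetical and are not taken from the mGOASVM implementation.

```python
from collections import Counter

def build_go_space(training_annotations):
    """Collect the distinct GO terms seen in the training proteins.

    The sorted list defines the (reduced) GO-vector subspace, instead of
    using every term in the Gene Ontology.
    """
    distinct = set()
    for terms in training_annotations:
        distinct.update(terms)
    return sorted(distinct)

def go_vector(terms, go_space, use_frequency=True):
    """Map one protein's GO terms onto the subspace.

    use_frequency=True  -> term-frequency values (number of occurrences)
    use_frequency=False -> conventional 1-0 values (presence/absence)
    """
    counts = Counter(terms)
    if use_frequency:
        return [counts.get(term, 0) for term in go_space]
    return [1 if counts.get(term, 0) > 0 else 0 for term in go_space]

# Toy annotations (illustrative only, not real retrieval results).
train = [
    ["GO:0005634", "GO:0005634", "GO:0005737"],
    ["GO:0005739"],
]
space = build_go_space(train)
tf_vec = go_vector(train[0], space)                        # term-frequency
bin_vec = go_vector(train[0], space, use_frequency=False)  # 1-0 values
```

In this sketch the term-frequency vector keeps the information that a term occurred twice, which the 1-0 vector discards.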
Based on these findings, several multi-label predictors are developed, with enhancements made from different perspectives: (1) refining classifiers, such as AD-SVM, which refines the multi-label SVM classifier with an adaptive decision scheme, and mPLR-Loc, which develops a multi-label penalized logistic regression classifier; (2) exploiting deeper features, such as SS-Loc, which formulates the feature vectors from the GO semantic similarity (SS) information, and HybridGO-Loc, which hybridizes the GO frequency and GO SS features for better performance; and (3) reducing the high dimensionality of feature vectors, such as RP-SVM, which applies ensemble random projection (RP) to multi-label SVM classifiers, and R3P-Loc, which combines ensemble random projection with ridge regression classifiers for multi-label prediction. Notably, instead of using the successive-search strategy, R3P-Loc creates two compact databases, ProSeq and ProSeq-GO, to replace the traditional Swiss-Prot and GOA databases for efficient and fast retrieval of GO information.
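The idea of combining ensemble random projection with ridge regression for multi-label prediction can be sketched as follows. This is a generic illustration under stated assumptions, not the R3P-Loc algorithm itself: the Gaussian projection matrices, the closed-form ridge solution, the averaging of scores across ensemble members, the fixed threshold, and the rule that every protein receives at least one location are all choices made here for illustration, as are the function names.

```python
import numpy as np

def fit_rp_ridge_ensemble(X, Y, proj_dim, n_members, ridge=1.0, seed=0):
    """Fit one ridge regressor per random projection (hypothetical helper).

    X: (n_proteins, n_features) GO feature vectors
    Y: (n_proteins, n_locations) 0/1 label matrix
    """
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # Gaussian random projection from n_features down to proj_dim.
        R = rng.standard_normal((X.shape[1], proj_dim)) / np.sqrt(proj_dim)
        Z = X @ R
        # Closed-form ridge solution: W = (Z^T Z + ridge * I)^{-1} Z^T Y.
        W = np.linalg.solve(Z.T @ Z + ridge * np.eye(proj_dim), Z.T @ Y)
        members.append((R, W))
    return members

def predict_labels(members, X, threshold=0.5):
    """Average the member scores and threshold them per location."""
    scores = np.mean([X @ R @ W for R, W in members], axis=0)
    labels = scores > threshold
    # Assumption: each protein gets at least its top-scoring location.
    labels[np.arange(X.shape[0]), scores.argmax(axis=1)] = True
    return labels, scores

# Synthetic data, for shape checking only.
rng = np.random.default_rng(1)
X_demo = rng.random((6, 20))
Y_demo = (rng.random((6, 3)) > 0.5).astype(float)
members = fit_rp_ridge_ensemble(X_demo, Y_demo, proj_dim=5, n_members=4)
labels, scores = predict_labels(members, X_demo)
```

Each ensemble member works in a 5-dimensional projected space rather than the original 20 dimensions, so the linear system solved per member is correspondingly small.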
Experimental results based on several benchmark datasets and novel datasets from virus, plant, and eukaryotic species demonstrate that the proposed predictors can significantly outperform existing state-of-the-art predictors. In particular, among the proposed predictors, HybridGO-Loc performs the best, suggesting that mining deeper into the GO information contributes more to prediction performance than classifier refinement or dimensionality reduction.
Despite the contributions made in this book, some limitations are worth noting:
In view of the above limitations, possible future directions for protein subcellular localization prediction are as follows: