11 Conclusions and future directions

11.1 Conclusions

This book has presented a number of GO-based predictors for the subcellular localization of both single- and multi-location proteins.

For predicting single-location proteins, two predictors, GOASVM and FusionSVM, are presented. They differ mainly in how they retrieve GO information: GOASVM extracts it from the GOA database, whereas FusionSVM extracts it by means of InterProScan. To enhance prediction performance, FusionSVM fuses the GO-based predictor InterProGOSVM with the profile-alignment-based method PairProSVM. Nevertheless, GOASVM still outperforms FusionSVM by a wide margin. Experimental results also show the superiority of GOASVM over other state-of-the-art single-label predictors.

For predicting multi-location proteins, several advanced predictors are proposed. The first, mGOASVM, performs remarkably better than existing state-of-the-art predictors owing to the following advantages: (1) in terms of GO-vector construction, it uses term frequencies instead of the conventional 1-0 values; (2) in terms of handling multi-label problems, it uses a more efficient multi-label SVM classifier; (3) in terms of GO-vector space selection, it selects a relevant GO-vector subspace by finding a set of distinct GO terms instead of using all of the GO terms to define the full space; and (4) in terms of retrieving GO information, it uses a successive-search strategy to incorporate more useful homologs instead of relying on back-up methods.
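Points (1) and (3) above can be illustrated with a minimal sketch. It is not the mGOASVM implementation: the function names, the toy annotations, and the assumption that a term's frequency is simply the number of times it occurs among a protein's retrieved annotations are all introduced here for illustration only.

```python
from collections import Counter


def build_go_space(training_annotations):
    """Collect the distinct GO terms seen in the training proteins (sorted)."""
    return sorted({term for terms in training_annotations for term in terms})


def go_vector(terms, go_space, use_term_frequency=True):
    """Build a GO vector over go_space for one protein.

    With use_term_frequency=True each element is the number of occurrences
    of the GO term; otherwise it is the conventional 1-0 indicator.
    """
    counts = Counter(terms)
    if use_term_frequency:
        return [counts[t] for t in go_space]
    return [1 if counts[t] > 0 else 0 for t in go_space]


# Hypothetical annotations for two training proteins (illustrative only).
train = [
    ["GO:0005634", "GO:0005634", "GO:0003677"],
    ["GO:0005739", "GO:0005524"],
]

space = build_go_space(train)  # subspace spanned by the distinct terms only
tf_vec = go_vector(train[0], space)
bin_vec = go_vector(train[0], space, use_term_frequency=False)
print(space)
print(tf_vec, bin_vec)
```

The term-frequency vector keeps the information that one term occurs twice for the first protein, which the 1-0 vector discards.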

Based on these findings, several further multi-label predictors are developed, each enhancing mGOASVM from a different perspective: (1) refining the classifier, as in AD-SVM, which refines the multi-label SVM classifier with an adaptive decision scheme, and mPLR-Loc, which develops a multi-label penalized logistic regression classifier; (2) exploiting deeper features, as in SS-Loc, which formulates the feature vectors from GO semantic similarity (SS) information, and HybridGO-Loc, which hybridizes the GO frequency and GO SS features for better performance; and (3) reducing the high dimensionality of the feature vectors, as in RP-SVM, which applies ensemble random projection (RP) to multi-label SVM classifiers, and R3P-Loc, which combines ensemble random projection with ridge regression classifiers for multi-label prediction. In addition, instead of using the successive-search strategy, R3P-Loc creates two compact databases, ProSeq and ProSeq-GO, to replace the traditional Swiss-Prot and GOA databases for fast and efficient retrieval of GO information.
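The idea of ensemble random projection in point (3) can be sketched as follows. This is an illustration under stated assumptions rather than the RP-SVM or R3P-Loc implementation: the Gaussian projection matrices, the dimensions, the class prototypes, and the distance-based score (a stand-in for the SVM or ridge regression scores) are all hypothetical.

```python
import random


def random_matrix(out_dim, in_dim, rng):
    """Gaussian random projection matrix, scaled by 1/sqrt(out_dim)."""
    scale = out_dim ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, x):
    """Map a high-dimensional vector x into the lower-dimensional space."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def ensemble_scores(x, matrices, prototypes):
    """Average per-class scores over all random projections.

    The score is the negative squared distance to a class prototype in the
    projected space; it only stands in for a trained classifier's score.
    """
    totals = {label: 0.0 for label in prototypes}
    for matrix in matrices:
        z = project(matrix, x)
        for label, proto in prototypes.items():
            p = project(matrix, proto)
            totals[label] -= sum((a - b) ** 2 for a, b in zip(z, p))
    return {label: total / len(matrices) for label, total in totals.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_dim, out_dim, n_members = 40, 8, 5
matrices = [random_matrix(out_dim, in_dim, rng) for _ in range(n_members)]

# Hypothetical class prototypes in the original 40-dimensional space.
prototypes = {
    "nucleus": [1.0] * 20 + [0.0] * 20,
    "mitochondrion": [0.0] * 20 + [1.0] * 20,
}

query = list(prototypes["nucleus"])
scores = ensemble_scores(query, matrices, prototypes)
best = max(scores, key=scores.get)
print(best, scores)
```

Each ensemble member works in 8 dimensions instead of 40, and averaging over several independent projections reduces the dependence on any single random matrix.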

Experimental results on several benchmark datasets and on novel datasets of virus, plant, and eukaryotic proteins demonstrate that the proposed predictors significantly outperform existing state-of-the-art predictors. Among the proposed predictors, HybridGO-Loc performs the best, suggesting that mining deeper into the GO information contributes more to prediction performance than classifier refinement or dimensionality reduction.

11.2 Future directions

Despite the various contributions we made in this book, there are still some limitations that are worth noting:

  1. Although remarkable performance improvement has been achieved by the predictors proposed in this book, the biological significance of the predictors remains uncertain. This is possibly a common problem for machine-learning-based approaches, because it is usually difficult to correlate the mathematical mechanisms of machine-learning approaches with biological phenomena.
  2. For the prediction of novel proteins, although our proposed predictors perform better than many state-of-the-art predictors, the overall accuracy is lower than what can be achieved if the older (benchmark) datasets are used for evaluation. This is possibly because some novel proteins have very low sequence similarity with known proteins in sequence databases; and more importantly they may possess some new information that has not been incorporated into the current GO databases, causing incorrect prediction for these proteins. The situation may improve when the GO databases evolve further.

In view of the above limitations, possible future directions for protein subcellular localization prediction are as follows:

  1. Developing interpretable algorithms. To discover the biological significance of machine-learning-based approaches, it is desirable to develop algorithms that directly associate the prediction results with biological knowledge. One possible way is to develop algorithms that yield sparse solutions, so that the classification results can be interpreted from a small number of selected features which may possess some biological significance. For example, in the GO-based approaches, it would be interesting to develop algorithms that can (1) find the GO terms that play more important roles in determining the classification decisions, and (2) determine the extent to which the molecular function and biological process GO terms contribute to the performance.
  2. Mining deeper GO information. It is worthwhile to extract deeper features from GO databases to boost prediction performance. The results in this book suggest that performance can be improved by incorporating the GO semantic similarity information. Therefore, it is likely that extracting further GO information, either through adopting more efficient semantic similarity measures or by incorporating more GO-related information, can further boost the performance. Recently, the GO consortium introduced more complicated relationships in the GO hierarchical graphs, such as “positively-regulates”, “negatively-regulates”, “has-part”, etc. [201]. These relationships could be included in the GO feature vectors to reflect more biology-related information of GO terms, leading to better prediction performance.
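One way to picture how such relationships might enter the GO feature vectors is to enlarge a protein's set of GO terms with the terms reachable through a chosen set of relationship types. The sketch below is only an assumption about how this could be done, not a method from this book: the term names and edges are placeholders, and "is-a" and "part-of" are the standard GO relationships written in the same hyphenated style as above.

```python
from collections import deque

# Hypothetical typed edges: (source term, relationship, target term).
EDGES = [
    ("term_A", "is-a", "term_B"),
    ("term_B", "part-of", "term_C"),
    ("term_A", "positively-regulates", "term_D"),
    ("term_D", "is-a", "term_E"),
]


def expand_terms(terms, edges, relations):
    """Return terms plus every term reachable via the allowed relationships."""
    graph = {}
    for src, rel, dst in edges:
        if rel in relations:
            graph.setdefault(src, []).append(dst)
    seen = set(terms)
    queue = deque(terms)
    while queue:  # breadth-first traversal over the filtered graph
        term = queue.popleft()
        for nxt in graph.get(term, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


basic = expand_terms({"term_A"}, EDGES, {"is-a", "part-of"})
extended = expand_terms(
    {"term_A"}, EDGES, {"is-a", "part-of", "positively-regulates"}
)
print(sorted(basic))
print(sorted(extended))
```

Allowing the extra relationship type brings in two additional terms for the same protein, which would appear as extra nonzero elements in its GO vector.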