This appendix introduces the online webservers for the proposed predictors, namely GOASVM, mGOASVM, mPLR-Loc, and HybridGO-Loc. For users’ convenience, we have built an integrated webserver interface, namely PolyU-Loc, which packages these webservers for predicting the subcellular localization of single- and multi-location proteins in different species, such as Homo sapiens, Viridiplantae, Eukaryota, and Virus. The URL link for PolyU-Loc is http://bioinfo.eie.polyu.edu.hk/Book_website/.
The GOASVM webserver (see Figure A.1) predicts the subcellular localization of single-label eukaryotic or human proteins. The URL link for the GOASVM webserver is http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.
For eukaryotic proteins, GOASVM is designed to predict 16 subcellular locations: (1) cell wall; (2) centriole; (3) chloroplast; (4) cyanelle; (5) cytoplasm; (6) cytoskeleton; (7) endoplasmic reticulum; (8) extracellular; (9) Golgi apparatus; (10) lysosome; (11) mitochondrion; (12) nucleus; (13) peroxisome; (14) plasma membrane; (15) plastid; (16) vacuole.
For human proteins, GOASVM is designed to predict 12 subcellular locations: (1) centriole; (2) cytoplasm; (3) cytoskeleton; (4) endoplasmic reticulum; (5) extracellular; (6) Golgi apparatus; (7) lysosome; (8) microsome; (9) mitochondrion; (10) nucleus; (11) peroxisome; (12) plasma membrane.
GOASVM can deal with two different types of protein input (see Figure A.2): either protein accession numbers (ACs) in UniProtKB format or amino acid sequences in FASTA format. Large-scale predictions, i.e. a list of accession numbers or a batch of protein sequences, are also accepted by GOASVM. Examples of both cases are provided. More information can be found in the instructions and supplementary materials on the GOASVM webserver.
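The two input types can be told apart mechanically. The sketch below is a hypothetical pre-processing step, not part of the GOASVM server code: it guesses whether pasted text is FASTA or a list of accession numbers, relying only on the fact that FASTA records begin with a `>` header line (the accession values used are illustrative).

```python
import re

def detect_input_type(text):
    """Guess whether pasted input is FASTA sequences or accession numbers.

    Heuristic sketch: FASTA records start with a '>' header line;
    otherwise treat the input as a list of UniProtKB accessions.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if lines and lines[0].startswith(">"):
        return "fasta"
    return "accessions"

def split_accessions(text):
    """Split a pasted accession list on commas, semicolons, or whitespace."""
    return [tok for tok in re.split(r"[\s,;]+", text.strip()) if tok]

print(detect_input_type(">sp|P12345|EXAMPLE\nMKVLAA"))  # fasta
print(split_accessions("P12345, Q67890"))               # ['P12345', 'Q67890']
```

A real server would additionally validate the accession format and the amino-acid alphabet before submitting a prediction job.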
The mGOASVM webserver (see Figure A.3) predicts the subcellular localization of both single- and multi-label proteins in two species (i.e. virus and plant). The URL link for the mGOASVM server is http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/mGOASVM.html. Note that two different versions of mGOASVM are provided, one based on the GOA Database released in March 2011 and one based on the version released in July 2013. Typically, using the latest version gives a more accurate prediction of protein subcellular localization.
For virus proteins, mGOASVM is designed to predict six subcellular locations of multi-label viral proteins. The six subcellular locations include: (1) viral capsid; (2) host cell membrane; (3) host endoplasmic reticulum; (4) host cytoplasm; (5) host nucleus; (6) secreted.
For plant proteins, mGOASVM is designed to predict 12 subcellular locations of multi-label plant proteins. The 12 subcellular locations include: (1) cell membrane; (2) cell wall; (3) chloroplast; (4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) Golgi apparatus; (8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; (12) vacuole.
Like GOASVM, mGOASVM can deal with two different input types, either protein ACs or protein sequences. More information can be found in the instructions and supplementary materials on the mGOASVM webserver.
Like mGOASVM, HybridGO-Loc (see Figure A.4) is a subcellular-localization predictor that can deal with datasets containing both single- and multi-label proteins in two species (virus and plant) and with two input types (protein ACs and protein sequences). The URL link for the HybridGO-Loc server is http://bioinfo.eie.polyu.edu.hk/HybridGoServer/. The specific subcellular locations that HybridGO-Loc can predict for both species are the same as those of mGOASVM.
Unlike mGOASVM, HybridGO-Loc integrates all possible combinations of different species and input types in one interface. Users can just follow two steps to make predictions on HybridGO-Loc: (1) select the species type and the input type (virus protein ACs, virus protein sequences, plant protein ACs, or plant protein sequences); (2) input protein sequences or accession numbers.
Moreover, users can leave their email addresses in the corresponding space to receive their prediction results via email. To ease the handling of large-scale prediction results, a downloadable txt file is also provided on the webpage every time a prediction task is completed. Detailed and comprehensive supplementary materials and instructions (also on the webserver page) guide users on how to use the HybridGO-Loc server.
The mPLR-Loc webserver also possesses the capability of predicting single- and multi-location proteins in virus and plant species. Similar to HybridGO-Loc, the mPLR-Loc webserver integrates all possible inputs (combinations of species and input types) in one interface. The URL link for the mPLR-Loc server is http://bioinfo.eie.polyu.edu.hk/mPLRLocServer/. In addition to rapidly and accurately predicting the subcellular localization of single- and multi-label proteins, mPLR-Loc also provides probabilistic confidence scores for its prediction decisions.
Here a step-by-step guide on how to use the mPLR-Loc server is provided. After going to the homepage of the mPLR-Loc server, select a combination of species type and input type. Then input the query protein sequences or accession numbers, or upload a file containing a list of accession numbers or protein sequences. For example, Figure A.5 shows a screenshot that uses a plant protein sequence in FASTA format as input. After clicking the “Predict” button and waiting ca. 13 s, the prediction results, as shown in Figure A.6, and the probabilistic scores, as shown in Figure A.7, will be produced. The prediction results in Figure A.6 include the FASTA header, the BLAST E-value, and the predicted subcellular location(s). Figure A.7 shows the confidence scores for the predicted subcellular location(s). In this figure, mPLR-Loc predicts the query sequence as “Cytoplasm” and “Nucleus” with confidence scores greater than 0.8 and 0.9, respectively.
Support vector machines (SVMs), which were introduced by Vapnik [206], have become popular due to their attractive features and promising performance. Compared to conventional neural networks, in which the network weights are determined by minimizing the mean-square error between the actual and desired outputs, SVMs optimize the weights by minimizing the classification error, which removes the influence of patterns lying far from the decision boundary. Generally speaking, an SVM classifier maps a set of input patterns into a high-dimensional space and then finds the optimal separating hyperplane and the margin of separation in that space. The resulting hyperplane separates the patterns into two categories while maximizing the distance from the hyperplane to the patterns closest to it.
SVMs are normally defined in terms of a class-separating score function, or hyperplane, f(x) = wTx + b, which is determined to achieve the largest possible margin of separation. Suppose a set of labelled samples is denoted by $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,n}$, where xi is the feature vector of the i-th sample and yi ∈ {-1, +1} is the corresponding label. Denote the distance between the positive hyperplane (i.e. wTx + b = +1) and the negative hyperplane (i.e. wTx + b = -1) as 2d. It can be easily shown that $d = 1/\|\mathbf{w}\|$. SVM training aims at finding w such that the margin of separation is the largest, i.e.

$$\max_{\mathbf{w}} \frac{2}{\|\mathbf{w}\|} \tag{B.1}$$

subject to

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n. \tag{B.2}$$

Equation (B.1) is equivalent to

$$\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|^2 \tag{B.3}$$

subject to equation (B.2).
Figure B.1 illustrates how a linear SVM solves a linearly separable problem. There are four positive-class data points and seven negative-class data points. In Figure B.1b, the margin of separation is marked by the orange bidirectional arrow, which is to be maximized according to equation (B.3). The black line is the decision hyperplane, and the black dashed lines are the hyperplanes defining the margin of separation. As can be seen from Figure B.1b, a linear SVM can correctly classify all of these samples.
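The linearly separable setting above can be reproduced numerically. The following sketch trains a linear soft-margin SVM by subgradient descent on the primal objective; the toy coordinates are illustrative assumptions, not the actual data of Figure B.1, and a dedicated QP solver would be used in practice.

```python
import numpy as np

# Toy linearly separable 2-D data: 4 positive and 7 negative points
# (illustrative coordinates only, loosely mirroring Figure B.1).
X = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0], [3.5, 3.5],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0], [0.0, -2.5],
              [-3.0, -1.0], [-2.5, -2.5], [-0.5, -3.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1])

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Minimize (1/2)||w||^2 + C * sum(hinge losses) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # points inside or beyond the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print("all correctly classified:", bool((pred == y).all()))
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```

On separable data such as this, the learned hyperplane classifies every training point correctly, consistent with Figure B.1b.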
To allow linear SVMs to handle nonlinearly separable problems, slack variables are often incorporated. Specifically, the SVM optimization problem is expressed as

$$\min_{\mathbf{w}, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \tag{B.4}$$

subject to

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n,$$
where ξi (i = 1,...,n) is a set of slack variables which allow some data to violate the constraints that define the minimum safety margin required of the training data in the linearly separable case, and C is a user-defined penalty parameter that penalizes violations of the safety margin. A larger C means that a heavier penalty is imposed on the same level of violation. Moreover, when ξi = 0 for all i = 1,...,n, the problem reduces to the linearly separable case, as no slack is needed and the second term in equation (B.4) vanishes. When some ξi ≠ 0 (i ∈ {1,...,n}), the problem is linearly nonseparable, in which case some samples may fall inside the margin of separation or on the wrong side of the decision boundary.
As mentioned above, the user-defined penalty parameter C also affects the performance of an SVM classifier. Generally, a larger C imposes a heavier penalty on violations of the constraints; the margin of separation then becomes narrower so that fewer samples violate the constraints. Conversely, a smaller C imposes a lighter penalty, more points are allowed to “break the rule”, and the margin of separation becomes wider.
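The trade-off controlled by C can be made concrete by evaluating the soft-margin objective of equation (B.4) for two fixed candidate hyperplanes on toy 1-D data (all numbers below are illustrative assumptions): a small C favours the wide-margin solution despite its slack, while a large C favours the narrow-margin, no-violation solution.

```python
import numpy as np

# Fixed 1-D toy data with two points close to the boundary (illustrative).
x = np.array([ 2.0,  3.0,  0.5, -2.0, -3.0, -0.5])
y = np.array([ 1,    1,    1,   -1,   -1,   -1  ])

def soft_margin_objective(w, C, b=0.0):
    """(1/2)w^2 + C * sum of slacks xi_i = max(0, 1 - y_i (w x_i + b))."""
    slacks = np.maximum(0.0, 1.0 - y * (w * x + b))
    return 0.5 * w**2 + C * slacks.sum()

# Candidate A: wide margin (2/|w| = 4), the two inner points violate it.
# Candidate B: narrow margin (2/|w| = 1), no violations at all.
w_wide, w_narrow = 0.5, 2.0

for C in (0.01, 100.0):
    ja, jb = soft_margin_objective(w_wide, C), soft_margin_objective(w_narrow, C)
    winner = "wide" if ja < jb else "narrow"
    print(f"C={C}: wide={ja:.3f}, narrow={jb:.3f} -> {winner} margin preferred")
```

With C = 0.01 the wide-margin candidate has the smaller objective; with C = 100 the ranking flips, matching the narrative above.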
By introducing Lagrange multipliers αi (i = 1,...,n) and βi (i = 1,...,n), equation (B.4) can be expressed as a Lagrangian:

$$L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i, \tag{B.5}$$
where αi ≥ 0 and βi ≥ 0. By differentiating with respect to w and b, it can be shown [113] that equation (B.5) is equivalent to the following optimizer:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \tag{B.6}$$
subject to 0 ≤ αi ≤ C, i = 1,...,n, and $\sum_{i=1}^{n} \alpha_i y_i = 0$. At the same time, the weight vector can be obtained as

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i, \tag{B.7}$$
and the bias b can be expressed as

$$b = 1 - \mathbf{w}^T \mathbf{x}_k, \tag{B.8}$$

where xk is any support vector whose label yk = +1. With w and b, we can form the decision function f(x) = wTx + b, with which we can determine the class to which a test sample belongs.
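For a minimal worked example of the dual solution in equation (B.6) and the resulting w and b, consider two 1-D training points x1 = +1 (y1 = +1) and x2 = -1 (y2 = -1). The constraint Σi αi yi = 0 forces α1 = α2 = α, the dual objective reduces to 2α - 2α², and its maximum is at α = 0.5; the sketch below recovers w, b, and the decision function from this closed-form solution.

```python
import numpy as np

# Two 1-D training points: x1 = +1 (y = +1) and x2 = -1 (y = -1).
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])   # closed-form maximizer of the dual 2a - 2a^2

# w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
# b = 1 - w^T x_k for any support vector with y_k = +1
b = 1.0 - w @ X[0]

def f(x):
    """Decision function f(x) = w^T x + b."""
    return w @ x + b

print(w, b)                    # w = [1.], b = 0.0
print(f(np.array([2.0])))      # positive -> class +1
```

The decision function f(x) = x places the boundary at the midpoint of the two points, as geometric intuition demands.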
A linearly nonseparable example for SVM classification is shown in Figure B.2a,b. In Figure B.2a, there are 10 positive-class data points (blue squares) and 10 negative-class data points (red crosses). The decision boundary of a linear SVM is shown in Figure B.2b. As can be seen, one positive-class data point (Point 10) is misclassified as belonging to the negative class.
While equation (B.6) leads to the optimal solution for the linear case, it can be further generalized to the nonlinear case by kernelization. Specifically, equation (B.6) is extended to

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j), \tag{B.9}$$

where K(⋅,⋅) is a kernel function.
There are three common kernels: (1) the linear kernel, (2) the Gaussian radial basis function (RBF) kernel, and (3) the polynomial kernel. The latter two are nonlinear kernels.

Linear kernel:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j. \tag{B.10}$$

RBF kernel:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right), \tag{B.11}$$

where σ controls the degree of nonlinearity.

Polynomial kernel:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \left( \mathbf{x}_i^T \mathbf{x}_j + 1 \right)^p, \tag{B.12}$$

where p is the polynomial degree.
Nonlinear kernels may outperform linear kernels in some cases [15, 162], especially for low-dimensional data. Figure B.2c and d, respectively, show the decision boundaries given by a polynomial SVM and an RBF SVM on the linearly nonseparable data shown in Figure B.2a. As can be seen, all of the data points are correctly classified.
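The three kernels can be written down directly from their definitions. The sketch below follows the conventions used here (width 1/(2σ²) in the RBF kernel and a “+1” offset in the polynomial kernel); note that other texts use slightly different parameterizations, so treat these as one common convention rather than the only one.

```python
import numpy as np

def linear_kernel(x, z):
    """Linear kernel: plain inner product."""
    return x @ z

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel with width parameter sigma."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, z, degree=2):
    """Polynomial kernel of the given degree."""
    return (x @ z + 1.0) ** degree

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(linear_kernel(x, z))   # 2.0
print(rbf_kernel(x, x))      # 1.0 -- an RBF kernel of a point with itself
print(poly_kernel(x, z))     # (2 + 1)^2 = 9.0
```

Small σ makes the RBF kernel highly local (high nonlinearity), which connects directly to the overfitting discussion that follows.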
However, when the dimension of the feature vectors is larger than the number of training samples, the curse-of-dimensionality problem is aggravated in nonlinear SVMs [14]. The overfitting problem becomes more severe when the degree of nonlinearity is high (i.e. small σ in the RBF-SVM), leading to a decrease in performance, as demonstrated in [213]. In other words, highly nonlinear SVMs are more vulnerable to overfitting due to the high dimensionality of the input space. More information on SVMs can be found in [113].
The one-vs-rest approach is probably the most popular approach to multiclass classification. It constructs M binary SVM classifiers, where M is the number of classes. The m-th SVM is trained with the data from the m-th class as positive examples (labelled +1) and the data from all of the other classes as negative examples (labelled -1). Suppose the training data are given by $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,n}$, where xi is the feature vector of the i-th sample and yi ∈ {1,...,M} is the corresponding label. Then the m-th (m ∈ {1,...,M}) SVM is found by solving the following constrained optimization problem:

$$\min_{\mathbf{w}_m, \boldsymbol{\xi}^m} \frac{1}{2}\|\mathbf{w}_m\|^2 + C \sum_{i=1}^{n} \xi_i^m \tag{B.13}$$

subject to

$$y_i^m(\mathbf{w}_m^T \mathbf{x}_i + b_m) \geq 1 - \xi_i^m, \quad \xi_i^m \geq 0, \quad i = 1, \ldots, n,$$

where $y_i^m = +1$ if yi = m and $y_i^m = -1$ otherwise.
Similar to the binary SVM in Section B.1, equation (B.13) is equivalent to the following optimizer:

$$\max_{\boldsymbol{\alpha}^m} \sum_{i=1}^{n} \alpha_i^m - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i^m \alpha_j^m y_i^m y_j^m \mathbf{x}_i^T \mathbf{x}_j, \tag{B.14}$$
subject to $0 \leq \alpha_i^m \leq C$, i = 1,...,n, m = 1,...,M, and $\sum_{i=1}^{n} \alpha_i^m y_i^m = 0$, where $y_i^m = +1$ if yi = m, and $y_i^m = -1$ otherwise. Then the weight vector and the bias term of the m-th SVM can be obtained as

$$\mathbf{w}_m = \sum_{i=1}^{n} \alpha_i^m y_i^m \mathbf{x}_i \tag{B.15}$$

and

$$b_m = 1 - \mathbf{w}_m^T \mathbf{x}_k, \tag{B.16}$$

respectively, where xk is any support vector whose label $y_k^m = +1$.
Given a test sample xt, the score of the m-th SVM is

$$f_m(\mathbf{x}_t) = \sum_{i \in \mathcal{S}_m} \alpha_i^m y_i^m \mathbf{x}_i^T \mathbf{x}_t + b_m, \tag{B.17}$$

where m = 1,...,M, $\mathcal{S}_m$ is the set of support-vector indexes corresponding to the m-th SVM, and the $\alpha_i^m$ are the Lagrange multipliers.
Equation (B.14) can be further extended by kernelization as follows:

$$\max_{\boldsymbol{\alpha}^m} \sum_{i=1}^{n} \alpha_i^m - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i^m \alpha_j^m y_i^m y_j^m K(\mathbf{x}_i, \mathbf{x}_j). \tag{B.18}$$
In this case, given a test sample xt, the score of the m-th SVM is

$$f_m(\mathbf{x}_t) = \sum_{i \in \mathcal{S}_m} \alpha_i^m y_i^m K(\mathbf{x}_i, \mathbf{x}_t) + b_m, \tag{B.19}$$

where K(⋅,⋅) is a kernel function. Then, the label of the test sample xt can be determined by

$$y_t = \arg\max_{m \in \{1, \ldots, M\}} f_m(\mathbf{x}_t). \tag{B.20}$$
This appendix proves that when the whole dataset is used to construct the feature vectors (i.e. GO vectors) for both training and testing, there will be no bias during leave-one-out cross-validation (LOOCV), even if some new features (i.e. GO terms) are retrieved for a test sample (i.e. a test protein). Here “new” means that the corresponding features (or GO terms) do not exist in any of the training samples (or training proteins) but are found in the test sample(s) (or test protein(s)).
Suppose a set of labelled samples is denoted by $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,n}$, where the i-th sample xi is drawn from a d-dimensional domain and the corresponding label yi ∈ {-1, +1}. The soft-margin Lagrangian for the SVM is

$$L(\mathbf{w}, b, \boldsymbol{\xi}; \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{n} \beta_i \xi_i, \tag{C.1}$$

where αi ≥ 0 and βi ≥ 0. By differentiating with respect to w and b, it can be shown that equation (C.1) is equivalent to the following optimizer:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \tag{C.2}$$

subject to 0 ≤ αi ≤ C, i = 1,...,n, and $\sum_{i=1}^{n} \alpha_i y_i = 0$.
During LOOCV, if a new GO term is found in the test protein, then during training we extend the d-dim feature vectors to (d + 1)-dim to incorporate this GO term, with the corresponding entry being 0; namely, xi becomes $\tilde{\mathbf{x}}_i = [\mathbf{x}_i^T, 0]^T$.
Then, equation (C.2) becomes

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \tilde{\mathbf{x}}_i^T \tilde{\mathbf{x}}_j = \max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \tag{C.3}$$

subject to 0 ≤ αi ≤ C, i = 1,...,n, and $\sum_{i=1}^{n} \alpha_i y_i = 0$, because the appended zero entries leave all of the inner products unchanged, i.e. $\tilde{\mathbf{x}}_i^T \tilde{\mathbf{x}}_j = \mathbf{x}_i^T \mathbf{x}_j$. Therefore, the αi will not be affected by the extended feature vectors.
Based on this, the weight vector can be obtained as

$$\tilde{\mathbf{w}} = \sum_{i=1}^{n} \alpha_i y_i \tilde{\mathbf{x}}_i = [\mathbf{w}^T, 0]^T, \tag{C.4}$$

and the bias b can be expressed as

$$b = 1 - \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_k = 1 - \mathbf{w}^T \mathbf{x}_k, \tag{C.5}$$

where $\tilde{\mathbf{x}}_k$ is any support vector whose label yk = +1.
Therefore, for any test protein with a feature vector written as $\tilde{\mathbf{x}}_t = [\mathbf{x}_t^T, x_{t,d+1}]^T$, where the entry $x_{t,d+1} \neq 0$ corresponds to the new GO term, the SVM score is

$$f(\tilde{\mathbf{x}}_t) = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_t + b = \mathbf{w}^T \mathbf{x}_t + b = f(\mathbf{x}_t). \tag{C.6}$$
In other words, using the extended feature vectors during LOOCV will not cause bias compared to using the original vectors.
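The invariance argument can also be checked numerically: appending a zero to the weight vector and a nonzero entry (for the new GO term) to the test vector leaves the SVM score unchanged. The weight and feature values below are arbitrary illustrative numbers.

```python
import numpy as np

w = np.array([0.8, -0.3, 1.2])   # trained weights (illustrative values)
b = -0.4
x_test = np.array([1.0, 2.0, 0.5])

# Extend the training-derived weights with a 0 for the unseen GO term,
# and the test vector with a nonzero entry for that same term.
w_ext = np.append(w, 0.0)
x_ext = np.append(x_test, 5.0)

score_orig = w @ x_test + b
score_ext = w_ext @ x_ext + b
print(score_orig, score_ext)   # identical: the new term contributes 0 * 5 = 0
```

Because the weight in the new dimension is exactly zero, the extra feature cannot influence the decision, which is the substance of the proof.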
This appendix shows the derivations of equations (5.15) and (5.16).
In Section 5.5.1 of Chapter 5, to minimize E(β), we may use the Newton–Raphson algorithm to obtain equation (5.14), where the first and second derivatives of E(β) are as follows:

$$\frac{\partial E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \mathbf{X}^T (\mathbf{p} - \mathbf{y}) \tag{D.1}$$

and

$$\frac{\partial^2 E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^T} = \mathbf{X}^T \mathbf{W} \mathbf{X}. \tag{D.2}$$

In equations (D.1) and (D.2), y and p are N-dim vectors whose elements are $y_i$ and $p(\mathbf{x}_i; \boldsymbol{\beta})$, respectively, X = [x1, x2,...,xN]T, and W is a diagonal matrix whose i-th diagonal element is p(xi; β)(1 - p(xi; β)), i = 1, 2,...,N.
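These derivatives plug directly into the Newton–Raphson update β ← β − (XᵀWX)⁻¹Xᵀ(p − y). The sketch below runs this update on a tiny, non-separable illustrative data set (not from the book) and shows the cross-entropy error decreasing across iterations:

```python
import numpy as np

# Tiny non-separable 1-D data set with a bias column (illustrative only).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(beta):
    """E(beta) = -sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

beta = np.zeros(2)
losses = [cross_entropy(beta)]
for _ in range(5):
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y)              # first derivative: X^T (p - y)
    Wd = np.diag(p * (1.0 - p))       # W: diagonal of p_i (1 - p_i)
    H = X.T @ Wd @ X                  # second derivative: X^T W X
    beta -= np.linalg.solve(H, grad)  # Newton-Raphson step
    losses.append(cross_entropy(beta))

print("loss before:", losses[0], "loss after:", losses[-1])
```

On well-behaved (non-separable) data, each Newton step reduces the cross-entropy error; with separable data the maximum-likelihood weights diverge, which is one motivation for adding a penalty term in practice.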