A. Webservers for protein subcellular localization

This appendix introduces the online webservers for the proposed predictors, namely GOASVM, mGOASVM, mPLR-Loc, and HybridGO-Loc. For users’ convenience, we have built an integrated webserver interface, namely PolyU-Loc, which packages these webservers for predicting the subcellular localization of single- and multi-location proteins in different species, such as Homo sapiens, Viridiplantae, Eukaryota, and Virus. The URL for PolyU-Loc is http://bioinfo.eie.polyu.edu.hk/Book_website/.

A.1 GOASVM webserver

The GOASVM webserver (see Figure A.1) predicts the subcellular localization of single-label eukaryotic or human proteins. The URL for the GOASVM webserver is http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.

For eukaryotic proteins, GOASVM is designed to predict 16 subcellular locations: (1) cell wall; (2) centriole; (3) chloroplast; (4) cyanelle; (5) cytoplasm; (6) cytoskeleton; (7) endoplasmic reticulum; (8) extracellular; (9) Golgi apparatus; (10) lysosome; (11) mitochondrion; (12) nucleus; (13) peroxisome; (14) plasma membrane; (15) plastid; (16) vacuole.

For human proteins, GOASVM is designed to predict 12 subcellular locations: (1) centriole; (2) cytoplasm; (3) cytoskeleton; (4) endoplasmic reticulum; (5) extracellular; (6) Golgi apparatus; (7) lysosome; (8) microsome; (9) mitochondrion; (10) nucleus; (11) peroxisome; (12) plasma membrane.


Fig. A.1: The interface of the GOASVM webserver.


Fig. A.2: A snapshot of the GOASVM webserver showing that GOASVM can deal with either protein ACs or protein sequences.

GOASVM accepts two different input types (see Figure A.2): either protein accession numbers (ACs) in UniProtKB format or amino acid sequences in FASTA format. Large-scale predictions, i.e. a list of accession numbers or a batch of protein sequences, are also acceptable to GOASVM. Examples for both cases are provided on the webserver, and a brief illustration is given below. More information can be found in the instructions and supplementary materials on the GOASVM webserver.
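For illustration, an accession-number input is simply a list of UniProtKB ACs, one per line (the ACs below are merely examples and are not the webserver's own sample data):

    P12345
    Q9Y6K9

whereas a sequence input follows the FASTA format, i.e. a header line starting with “>” followed by the amino acid sequence (the header and sequence below are dummies, shown only to illustrate the format):

    >Example_protein_1
    MSTAPELKQLAQKVRGDLVKQAEELKALGINPT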

A.2 mGOASVM webserver

The mGOASVM webserver (see Figure A.3) predicts the subcellular localization of both single- and multi-label proteins in two species (i.e. virus and plant). The URL for the mGOASVM server is http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/mGOASVM.html. Note that two versions of mGOASVM are provided: one based on the GOA Database released in March 2011, and one based on the release of July 2013. Typically, using the latest version will give a more accurate prediction of protein subcellular localization.

For virus proteins, mGOASVM is designed to predict six subcellular locations of multi-label viral proteins: (1) viral capsid; (2) host cell membrane; (3) host endoplasmic reticulum; (4) host cytoplasm; (5) host nucleus; (6) secreted.

For plant proteins, mGOASVM is designed to predict 12 subcellular locations of multi-label plant proteins: (1) cell membrane; (2) cell wall; (3) chloroplast; (4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) Golgi apparatus; (8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; (12) vacuole.


Fig. A.3: The interface of the mGOASVM webserver.

Like GOASVM, mGOASVM can deal with two different input types, either protein ACs or protein sequences. More information can be found in the instructions and supplementary materials on the mGOASVM webserver.

A.3 HybridGO-Loc webserver

Like mGOASVM, HybridGO-Loc (see Figure A.4) is a subcellular-localization predictor that can deal with datasets containing both single- and multi-label proteins in two species (virus and plant) and with two input types (protein ACs and protein sequences). The URL for the HybridGO-Loc server is http://bioinfo.eie.polyu.edu.hk/HybridGoServer/. The specific subcellular locations that HybridGO-Loc can predict for both species are the same as those in mGOASVM.

Unlike mGOASVM, HybridGO-Loc integrates all possible combinations of different species and input types in one interface. Users can just follow two steps to make predictions on HybridGO-Loc: (1) select the species type and the input type (virus protein ACs, virus protein sequences, plant protein ACs, or plant protein sequences); (2) input protein sequences or accession numbers.

Moreover, users can leave their email address in the corresponding space to receive their prediction results via email. To ease the handling of large-scale prediction results, a downloadable txt file is also provided on the webpage every time a prediction task is completed. Detailed and comprehensive supplementary materials and instructions (also available on the webserver page) are provided to guide users on how to use the HybridGO-Loc server.


Fig. A.4: The interface of the HybridGO-Loc webserver.

A.4 mPLR-Loc webserver

The mPLR-Loc webserver also possesses the capability of predicting single- and multi-location proteins in virus and plant species. Similar to HybridGO-Loc, the mPLR-Loc webserver integrates all possible inputs (combinations of species and input types) in one interface. The URL for the mPLR-Loc server is http://bioinfo.eie.polyu.edu.hk/mPLRLocServer/. In addition to being able to rapidly and accurately predict the subcellular localization of single- and multi-label proteins, mPLR-Loc can also provide probabilistic confidence scores for its prediction decisions.

Here, a step-by-step guide on how to use the mPLR-Loc webserver is provided. After going to the homepage of the mPLR-Loc server, select a combination of species type and input type. Then input the query protein sequences or accession numbers, or upload a file containing a list of accession numbers or protein sequences. For example, Figure A.5 shows a screenshot that uses a plant protein sequence in FASTA format as input. After clicking the “Predict” button and waiting ca. 13 s, the prediction results, as shown in Figure A.6, and the probabilistic scores, as shown in Figure A.7, will be produced. The prediction results in Figure A.6 include the FASTA header, the BLAST E-value, and the predicted subcellular location(s). Figure A.7 shows the confidence scores for the predicted subcellular location(s). In this figure, mPLR-Loc predicts the query sequence as “Cytoplasm” and “Nucleus” with confidence scores greater than 0.8 and 0.9, respectively.
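To illustrate how such per-location confidence scores could be turned into a multi-label decision, the following minimal Python sketch applies a simple thresholding rule. The threshold value and the fallback to the top-scoring location are assumptions made for illustration; they are not necessarily the exact decision scheme implemented in mPLR-Loc.

    # Illustrative only: convert per-location confidence scores into a
    # multi-label prediction by thresholding.
    def multilabel_decision(scores, threshold=0.5):
        """scores: dict mapping subcellular location -> confidence in [0, 1]."""
        picked = [loc for loc, s in scores.items() if s >= threshold]
        if not picked:                      # keep at least the top-scoring location
            picked = [max(scores, key=scores.get)]
        return picked

    # Example mirroring Figure A.7: "Cytoplasm" and "Nucleus" both score highly.
    print(multilabel_decision({"Cytoplasm": 0.85, "Nucleus": 0.92, "Vacuole": 0.03}))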


Fig. A.5: An example of using a plant protein sequence in FASTA format as input to the mPLR-Loc server.


Fig. A.6: Prediction results of the mPLR-Loc server for the plant protein sequence input in Figure A.5.


Fig. A.7: Confidence scores of the mPLR-Loc server for the plant protein sequence input in Figure A.5.

B. Support vector machines

Support vector machines (SVMs), which were initiated by Vapnik [206], have become popular due to their attractive features and promising performance. Compared to conventional neural networks, in which the network weights are determined by minimizing the mean-square error between the actual and desired outputs, SVMs optimize the weights by minimizing a margin-based classification error, which removes the influence of patterns lying far from the decision boundary. Generally speaking, an SVM classifier maps a set of input patterns into a high-dimensional space and then finds the optimal separating hyperplane and the margin of separation in that space. The obtained hyperplane separates the patterns into two categories while maximizing the distance of the closest patterns to the hyperplane.

B.1 Binary SVM classification

SVMs are normally defined in terms of a class-separating score function, or hyperplane, f(x) = wTx + b, which is determined to achieve the largest possible margin of separation. Suppose a set of labelled samples is denoted by 𝒟 = {(xi, yi)}i=1,...,n, where xi is the feature vector of the i-th sample and yi ∈ {-1, +1} is the corresponding label. Denote the distance between the positive hyperplane (i.e., wTx + b = +1) and the negative hyperplane (i.e., wTx + b = -1) as 2d. It can easily be shown that d = 1/∥w∥. SVM training aims at finding w such that the margin of separation is the largest, i.e.

$$ \max_{\mathbf{w},b}\ \frac{2}{\|\mathbf{w}\|} \qquad \text{(B.1)} $$

subject to

$$ y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1,\ldots,n. \qquad \text{(B.2)} $$

Because maximizing 2/∥w∥ is the same as minimizing ∥w∥²/2, equation (B.1) is equivalent to

$$ \min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{(B.3)} $$

subject to equation (B.2).

Figure B.1 illustrates how a linear SVM solves a linearly separable problem. There are four positive-class data points and seven negative-class data points. In Figure B.1b, the margin of separation is marked by the orange bidirectional arrow and is maximized by solving equation (B.3). The black line is the decision hyperplane, and the black dashed lines are the hyperplanes defining the margin of separation. As can be seen from Figure B.1b, a linear SVM can correctly classify all of these samples.
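The following scikit-learn sketch reproduces this kind of hard-margin, linearly separable setting; the data points are illustrative and are not the exact points plotted in Figure B.1.

    # A toy linearly separable problem solved by a linear SVM (illustrative data).
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [3.5, 2.5],      # positive class
                  [0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 1.0],
                  [1.0, 0.0], [0.5, 0.0], [0.0, 0.5]])                 # negative class
    y = np.array([+1, +1, +1, +1, -1, -1, -1, -1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
    w, b = clf.coef_[0], clf.intercept_[0]
    print("support vectors:\n", clf.support_vectors_)
    print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))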

To allow linear SVMs to handle nonlinearly separable problems, slack variables are often incorporated. Specifically, the SVM optimization problem is expressed as

$$ \min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \qquad \text{(B.4)} $$

subject to

$$ y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1,\ldots,n, $$

where ξi (i = 1,...,n) is a set of slack variables which allow some data to violate the constraints that define the minimum safety margin required for the training data in the linearly separable case, and C is a user-defined penalty parameter that penalizes any violation of the safety margin for the training data. A larger C means that a heavier penalty is imposed on the same level of violation. Moreover, when ξi = 0 (i = 1,...,n), the problem reduces to the linearly separable case, as no slack is needed and the second term in equation (B.4) vanishes. When some ξi ≠ 0 (i = 1,...,n), the problem is linearly nonseparable, in which some samples may fall inside the margin of separation or on the wrong side of the decision boundary.


Fig. B.1: An example illustrating how a linear SVM classifier solves a linearly separable problem. In (a), there are four positive-class data points (blue squares) and seven negative-class data points (red crosses); in (b), the margin of separation is marked by the orange bidirectional arrow. The green squares and green crosses (enclosed by large circles) are the support vectors for the positive class and the negative class, respectively. The black line is the decision hyperplane, and the black dashed lines are the hyperplanes defining the margin of separation.

As mentioned above, the user-defined penalty parameter C also affects the performance of an SVM classifier. Generally, a larger C means a heavier penalty on violations of the constraints, so the margin of separation becomes narrower and fewer samples are allowed to violate the constraints. Conversely, a smaller C means a lighter penalty on violations, so more points may fail to “conform to the rule” and the margin of separation becomes wider.
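A minimal scikit-learn sketch of this effect on synthetic, overlapping data follows; the data and the particular values of C are illustrative only.

    # Effect of the penalty parameter C: larger C penalizes margin violations
    # more heavily, typically yielding a narrower margin.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        margin = 2.0 / np.linalg.norm(clf.coef_[0])
        print(f"C = {C:>6}: margin width = {margin:.3f}, "
              f"number of support vectors = {len(clf.support_)}")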

By introducing Lagrange multipliers αi (i = 1,...,n) and βi (i = 1,...,n), equation (B.4) can be expressed as a Lagrangian:

$$ L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\xi_i, \qquad \text{(B.5)} $$

where αi ≥ 0 and βi ≥ 0. By differentiating L with respect to w and b, it can be shown [113] that equation (B.5) is equivalent to the following optimizer:

$$ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \qquad \text{(B.6)} $$

subject to 0 ≤ αi ≤ C, i = 1,...,n, and ∑i αi yi = 0. At the same time, the weight vector can be obtained as

$$ \mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i, \qquad \text{(B.7)} $$

and the bias b can be expressed as

$$ b = 1 - \mathbf{w}^T\mathbf{x}_k, \qquad \text{(B.8)} $$

where xk is any support vector whose label yk = 1. With w and b, we obtain the decision function f(x) = wTx + b, with which we can determine the class to which a test sample belongs.
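Equations (B.7) and (B.8) can be checked numerically with scikit-learn, whose fitted SVC object stores the products αi yi in dual_coef_ and the support vectors in support_vectors_; the synthetic data below are illustrative only.

    # Recovering w of equation (B.7) from the dual solution of a linear SVC.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+2.0, 1.0, (20, 2)), rng.normal(-2.0, 1.0, (20, 2))])
    y = np.array([+1] * 20 + [-1] * 20)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    w = clf.dual_coef_[0] @ clf.support_vectors_        # w = sum_i alpha_i y_i x_i
    b = clf.intercept_[0]
    print(np.allclose(w, clf.coef_[0]))                 # matches sklearn's own w
    print("decision value for X[0]:", w @ X[0] + b)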


Fig. B.2: A linearly nonseparable example using different SVM kernels. In (a), there are 10 positive-class data points (blue squares) and 10 negative-class data points (red crosses); in (b), (c), and (d), the linear kernel, the polynomial kernel, and the RBF kernel are used for classification, respectively. The green squares and green crosses (enclosed in large circles) are the support vectors for the positive class and the negative class, respectively. The black line is the decision hyperplane, and the blue or red lines are the hyperplanes defining the margin of separation.

A linearly nonseparable example for SVM classification is shown in Figure B.2a,b. In Figure B.2a, there are 10 positive-class data points (blue squares) and 10 negative-class data points (red crosses). The decision boundary of a linear SVM is shown in Figure B.2b. As can be seen, one positive-class data point (Point 10) is misclassified into the negative class.

While equation (B.6) leads to the optimal solution for the linear case, it can be further generalized to the nonlinear case by kernelization. Specifically, equation (B.6) is extended to

$$ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j). \qquad \text{(B.9)} $$

There are three common kernels: (1) the linear kernel, (2) the Gaussian radial basis function (RBF) kernel, and (3) the polynomial kernel. The latter two are nonlinear kernels.

Linear kernel:

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j. \qquad \text{(B.10)} $$

RBF kernel:

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right). \qquad \text{(B.11)} $$

Polynomial kernel:

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i^T\mathbf{x}_j + 1\right)^p, \qquad \text{(B.12)} $$

where σ controls the width of the RBF kernel and p is the degree of the polynomial kernel.

Nonlinear kernels may outperform linear kernels in some cases [15, 162], especially for low-dimensional data. Figure B.2c and d, respectively, show the decision boundaries given by a polynomial SVM and an RBF SVM on the linearly nonseparable data shown in Figure B.2a. As can be seen, all of the data points are correctly classified.
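The following sketch illustrates the same point on a synthetic, linearly nonseparable data set (concentric circles rather than the exact data of Figure B.2): the linear kernel cannot separate the two classes, whereas the polynomial and RBF kernels can.

    # Comparing SVM kernels on a linearly nonseparable toy problem.
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=40, factor=0.4, noise=0.05, random_state=1)
    for kernel in ("linear", "poly", "rbf"):
        clf = SVC(kernel=kernel, C=1.0, degree=3, gamma="scale").fit(X, y)
        print(f"{kernel:>6} kernel: training accuracy = {clf.score(X, y):.2f}")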

However, when the dimension of the feature vectors is larger than the number of training samples, the curse-of-dimensionality problem will be aggravated in nonlinear SVMs [14]. The over-fitting problem becomes more severe when the degree of nonlinearity is high (i.e. small σ in the RBF-SVM), leading to a decrease in performance, as demonstrated in [213]. In other words, highly nonlinear SVMs are more vulnerable to overfitting due to the high dimensionality of the input space. More information on SVMs can be found in [113].

B.2 One-vs-rest SVM classification

The one-vs-rest approach is probably the most popular approach to multiclass classification. It constructs M binary SVM classifiers, where M is the number of classes. The m-th SVM is trained with the data from the m-th class using positive-class labels (+1) and the data from all of the other classes using negative-class labels (-1). Suppose the training data are given by 𝒟 = {(xi, yi)}i=1,...,n, where xi is the feature vector of the i-th sample and yi ∈ {1,...,M} is the corresponding label. Then, the m-th (m ∈ {1,...,M}) SVM is found by solving the following constrained optimization problem:

$$ \min_{\mathbf{w}_m, b_m, \boldsymbol{\xi}^m}\ \frac{1}{2}\|\mathbf{w}_m\|^2 + C\sum_{i=1}^{n}\xi_i^m \qquad \text{(B.13)} $$

subject to

$$ y_i^m(\mathbf{w}_m^T\mathbf{x}_i + b_m) \geq 1 - \xi_i^m, \quad \xi_i^m \geq 0, \quad i = 1,\ldots,n. $$

Similar to the binary SVM in Section B.1, equation (B.13) is equivalent to the following optimizer:

$$ \max_{\boldsymbol{\alpha}^m}\ \sum_{i=1}^{n}\alpha_i^m - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i^m\alpha_j^m y_i^m y_j^m \mathbf{x}_i^T\mathbf{x}_j \qquad \text{(B.14)} $$

subject to 0 ≤ αᵢᵐ ≤ C, i = 1,...,n, m = 1,...,M, and ∑i αᵢᵐ yᵢᵐ = 0, where yᵢᵐ = 1 if yi = m, and yᵢᵐ = -1 otherwise. Then the weight vector and the bias term of the m-th SVM can be obtained by

$$ \mathbf{w}_m = \sum_{i=1}^{n}\alpha_i^m y_i^m \mathbf{x}_i \qquad \text{(B.15)} $$

and

$$ b_m = 1 - \mathbf{w}_m^T\mathbf{x}_k, \qquad \text{(B.16)} $$

respectively, where xk is any support vector whose label is yₖᵐ = 1.

Given a test sample xt, the score of the m-th SVM is

$$ f_m(\mathbf{x}_t) = \sum_{i \in \mathcal{S}_m}\alpha_i^m y_i^m \mathbf{x}_i^T\mathbf{x}_t + b_m, \qquad \text{(B.17)} $$

where m = 1,...,M, 𝒮m is the set of support-vector indexes corresponding to the m-th SVM, and αᵢᵐ are the Lagrange multipliers.

Equation (B.14) can be further extended by kernelization as follows:

$$ \max_{\boldsymbol{\alpha}^m}\ \sum_{i=1}^{n}\alpha_i^m - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i^m\alpha_j^m y_i^m y_j^m K(\mathbf{x}_i, \mathbf{x}_j). \qquad \text{(B.18)} $$

In this case, given a test sample xt, the score for the m-th SVM is

$$ f_m(\mathbf{x}_t) = \sum_{i \in \mathcal{S}_m}\alpha_i^m y_i^m K(\mathbf{x}_i, \mathbf{x}_t) + b_m, \qquad \text{(B.19)} $$

where K(⋅,⋅) is a kernel function. Then, the label of the test data xt can be determined by

$$ y(\mathbf{x}_t) = \arg\max_{m \in \{1,\ldots,M\}} f_m(\mathbf{x}_t). \qquad \text{(B.20)} $$
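A minimal sketch of this one-vs-rest scheme, using scikit-learn's binary SVC as the underlying solver, is given below; the synthetic three-class data are illustrative only.

    # One-vs-rest multiclass SVM following equations (B.13)-(B.20):
    # train M binary SVMs and take the arg-max of their scores.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # labels 0,...,M-1
    classes = np.unique(y)

    models = []
    for m in classes:
        y_m = np.where(y == m, +1, -1)                  # m-th class vs. the rest
        models.append(SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y_m))

    x_t = X[:1]                                         # a test sample
    scores = [clf.decision_function(x_t)[0] for clf in models]    # f_m(x_t)
    print("predicted class:", classes[int(np.argmax(scores))])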

 

C. Proof of no bias in LOOCV

This appendix proves that when the whole set of GO terms, including those found only in the test proteins, is used to construct the feature vectors (i.e. GO vectors) for both training and testing, there will be no bias during leave-one-out cross-validation (LOOCV), even if some new features (i.e. GO terms) are retrieved for a test sample (i.e. a test protein). Here, “new” means that the corresponding features (or GO terms) do not exist in any of the training samples (or proteins), but are found in the test sample(s) (or test protein(s)).

Suppose a set of labelled samples is denoted by 𝒟 = {(xi, yi)}i=1,...,n, where the i-th sample xi is drawn from a d-dimensional domain 𝒳 and the corresponding label yi ∈ {-1, +1}. The soft-margin Lagrangian for the SVM is

$$ L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\beta_i\xi_i, \qquad \text{(C.1)} $$

where αi ≥ 0 and βi ≥ 0. By differentiating L with respect to w and b, it can be shown that equation (C.1) is equivalent to the following optimizer:

$$ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j \qquad \text{(C.2)} $$

subject to 0 ≤ αi ≤ C, i = 1,...,n, and ∑i αi yi = 0.

During LOOCV, if a new GO term is found in the test protein, then during the training part we extend the d-dim feature vectors to (d + 1)-dim to incorporate the new GO term, with the corresponding entry being 0; namely, xi becomes x′i = [xiT, 0]T.

Then, equation (C.2) becomes

$$ \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i'^T\mathbf{x}_j' = \max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j, \qquad \text{(C.3)} $$

subject to 0 ≤ αi ≤ C, i = 1,...,n, and ∑i αi yi = 0, because the extra entry of every training vector is 0 and thus x′iT x′j = xiT xj. Therefore, the αi will not be affected by the extended feature vectors.

Based on this, the weight vector can be obtained as

$$ \mathbf{w}' = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i' = \left[\mathbf{w}^T, 0\right]^T, \qquad \text{(C.4)} $$

and the bias b can be expressed as

$$ b = 1 - \mathbf{w}'^T\mathbf{x}_k', \qquad \text{(C.5)} $$

where x′k is any support vector whose label yk = 1.

Therefore, for any test protein whose extended feature vector is written as x′t = [xtT, αt]T, where αt ≠ 0 is the entry for the new GO term, the SVM score is

$$ f(\mathbf{x}_t') = \mathbf{w}'^T\mathbf{x}_t' + b = \mathbf{w}^T\mathbf{x}_t + 0\cdot\alpha_t + b = \mathbf{w}^T\mathbf{x}_t + b = f(\mathbf{x}_t). \qquad \text{(C.6)} $$

In other words, using the extended feature vectors during LOOCV will not cause any bias compared to using the original vectors.
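This argument can also be verified numerically: appending a feature that is zero for every training sample leaves the SVM score of a test sample unchanged, even when that feature is nonzero in the test sample. A minimal sketch with synthetic data follows.

    # Numerical check: a feature that is zero in all training samples does not
    # change the SVM score, even if it is nonzero in the test sample.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    x_t = rng.normal(size=(1, 5))

    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    score_d = clf.decision_function(x_t)[0]

    X_ext = np.hstack([X, np.zeros((30, 1))])           # "new GO term": 0 in training
    x_t_ext = np.hstack([x_t, [[3.7]]])                 # nonzero in the test protein
    clf_ext = SVC(kernel="linear", C=1.0).fit(X_ext, y)
    score_d1 = clf_ext.decision_function(x_t_ext)[0]

    print(np.isclose(score_d, score_d1))                # True: no bias introduced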

D. Derivatives for penalized logistic regression

This appendix shows the derivations of equations (5.15) and (5.16).

In Section 5.5.1 of Chapter 5, to minimize E(β), we may use the Newton–Raphson algorithm to obtain equation (5.14), where the first and second derivatives of E(β) are as follows:

$$ \frac{\partial E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -\mathbf{X}^T(\mathbf{y} - \mathbf{p}) + \rho\boldsymbol{\beta} \qquad \text{(D.1)} $$

and

$$ \frac{\partial^2 E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}^T} = \mathbf{X}^T\mathbf{W}\mathbf{X} + \rho\mathbf{I}. \qquad \text{(D.2)} $$

In equations (D.1) and (D.2), y and p are N-dim vectors whose elements are yi and p(xi; β), respectively; X = [x1, x2,...,xN]T; W is a diagonal matrix whose i-th diagonal element is p(xi; β)(1 - p(xi; β)), i = 1, 2,...,N; ρ is the penalty parameter of the penalized logistic regression; and I is the identity matrix.
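For concreteness, the following numpy sketch implements the corresponding Newton–Raphson update, assuming a ridge penalty (ρ/2)∥β∥² on the logistic-regression weights; the synthetic data, the value of ρ, and the fixed iteration count are illustrative only.

    # Newton-Raphson for penalized logistic regression (illustrative sketch).
    import numpy as np

    def newton_penalized_lr(X, y, rho=1.0, n_iter=20):
        """X: (N, D) design matrix; y: (N,) vector of 0/1 labels."""
        N, D = X.shape
        beta = np.zeros(D)
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ beta))          # p(x_i; beta)
            grad = -X.T @ (y - p) + rho * beta           # equation (D.1)
            W = np.diag(p * (1.0 - p))
            H = X.T @ W @ X + rho * np.eye(D)            # equation (D.2)
            beta -= np.linalg.solve(H, grad)             # Newton-Raphson step
        return beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(float)
    print(newton_penalized_lr(X, y))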
