5 From single- to multi-location

Instead of only determining the subcellular localization of single-label proteins, this chapter focuses on predicting both single- and multi-location proteins. The biological significance of multi-location proteins is elaborated first, followed by a brief introduction to existing algorithms for multi-label classification. Subsequently, three multi-label predictors, namely mGOASVM, AD-SVM, and mPLR-Loc, are presented for multi-location protein subcellular localization.

5.1 Significance of multi-location proteins

Previous chapters show that remarkable progress has been made in the development of computational methods for protein subcellular localization in the past decades. Unfortunately, most of the existing methods are limited to the prediction of single-location proteins. These methods generally exclude multi-label proteins or assume that multi-location proteins do not exist. The focus on predicting single-location proteins may also be driven by the data available in public databases such as UniProt, where the majority of proteins are assigned to a single location. However, there are multi-location proteins that can simultaneously reside at, or move between, two or more different subcellular locations [70, 143, 150, 242]. In fact, proteins with multiple locations play an important role in some metabolic processes that take place in more than one compartment. For example, fatty acid β-oxidation is known to occur in both the peroxisome and mitochondria, and antioxidant defense has been found in the cytosol, mitochondria, and peroxisome [147]. Another example is the glucose transporter GLUT4, which is regulated by insulin and is typically stored in the intracellular vesicles of adipocytes. However, it has also been found to translocate to the plasma membrane in response to insulin [171, 178].

5.2 Multi-label classification

Traditionally, pattern classification problems are concerned with learning from a set of patterns, where each pattern is associated with only one of the known classes. These problems are referred to as single-label multiclass classification. However, many real-world problems are not limited to single-label classification. When more than one label is assigned to a data instance, the problems are referred to as multi-label multiclass classification. In recent decades, multi-label classification has received significant attention in a wide range of problem domains, such as functional genomics prediction [11, 207, 241], text categorization [76, 105, 145, 177, 179], music classification [122, 202], video segmentation [195], and semantic annotation of images [21]. In functional genomics prediction, a gene is likely to be associated with many functions. In text categorization, a document describing politics may involve other topics, such as sports or education. Similarly, in music classification, a song may belong to more than one genre.

Compared to single-label classification, multi-label classification is more complicated due to the large number of possible combinations of labels. Existing methods for multi-label classification can be grouped into two main categories: (1) algorithm adaptation and (2) problem transformation.

5.2.1 Algorithm-adaptation methods

Algorithm adaptation methods extend specific single-label algorithms to solve multi-label classification problems. Typical methods include AdaBoost.MH [179], multi-label C4.5 [55], and hierarchical multi-label decision trees [207].

AdaBoost.MH is an extension of AdaBoost [72] for multi-label classification. It uses the one-vs-rest approach to convert an M-class problem into M two-class AdaBoost problems, in which an additional feature defined by the class labels is augmented to the input space.
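As a rough illustration of this reduction (not the exact AdaBoost.MH algorithm, which maintains its own distribution over example-label pairs), the following sketch augments each feature vector with a class index and boosts a single binary classifier, here scikit-learn's AdaBoostClassifier; the function names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def adaboost_mh_fit(X, Y, n_estimators=50):
    """One-vs-rest reduction behind AdaBoost.MH: pair every example with
    every class index m, so that ((x_i, m), +1/-1) becomes one binary
    training instance. X: (N, d) features; Y: (N, M) 0/1 label matrix."""
    N, M = Y.shape
    X_aug = np.hstack([np.repeat(X, M, axis=0),             # x_i repeated M times
                       np.tile(np.arange(M), N)[:, None]])  # augmented class feature
    y_bin = np.where(Y.ravel() > 0, 1, -1)                  # membership labels
    return AdaBoostClassifier(n_estimators=n_estimators).fit(X_aug, y_bin), M

def adaboost_mh_predict(clf, M, X):
    N = X.shape[0]
    X_aug = np.hstack([np.repeat(X, M, axis=0), np.tile(np.arange(M), N)[:, None]])
    return (clf.predict(X_aug) > 0).reshape(N, M)           # (N, M) 0/1 predictions
```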

The C4.5 algorithm [169] builds decision trees using the concept of information entropy. At each node of the tree, C4.5 chooses the feature that most effectively splits the data; in other words, the feature with the highest normalized information gain (i.e. reduction in entropy) is chosen to create a decision node. The C4.5 algorithm then recurses on the subsets obtained in the previous step, and the nodes thus obtained are added as children of the node created in the previous step. Multi-label C4.5 [55] uses the C4.5 algorithm as a baseline classifier and extends the definition of entropy to multi-label data by estimating the number of bits needed to describe the membership or nonmembership of each class. One disadvantage of this algorithm is that it only learns a set of accurate rules, not a complete classification.
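A minimal sketch of this entropy extension, assuming it is the sum of per-class binary (membership/non-membership) entropies; the function name is hypothetical:

```python
import numpy as np

def multilabel_entropy(Y):
    """Entropy of a set of multi-label examples reaching a tree node:
    the estimated number of bits needed to describe membership and
    non-membership of each class. Y: (N, M) 0/1 label matrix."""
    p = np.clip(Y.mean(axis=0), 1e-12, 1 - 1e-12)  # relative frequency of class m
    return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))
```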

In [207], class labels are organized in a hierarchy, and for each class a binary decision tree is learned in a hierarchical way. An example can belong to a class only if it also belongs to the superclasses of that class. This parent–child relationship enables the decision tree to predict multi-label instances.

5.2.2 Problem transformation methods

Problem transformation methods transform a multi-label learning problem into one or more single-label classification problems [21] so that traditional single-label classifiers can be applied without modification. Typical methods include label powerset (LP) [204], binary relevance (BR) [203], ensembles of classifier chains (ECC) [172], and compressive sensing [96].

The label powerset method reduces a multi-label task to a single-label one by treating each possible multi-label subset as a new class in the single-label classification task. This method is simple, but it is likely to generate a large number of classes, many of which are associated with only a few examples.
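A minimal sketch of the transformation (hypothetical helper names; any single-label base classifier could replace the decision tree):

```python
from sklearn.tree import DecisionTreeClassifier

def label_powerset_fit(X, label_sets):
    """Map each distinct label subset (e.g. frozenset({'cyto', 'nucleus'}))
    to a new single-label class and train an ordinary classifier on it."""
    subsets = sorted(set(label_sets), key=sorted)
    subset_to_class = {s: c for c, s in enumerate(subsets)}
    y = [subset_to_class[s] for s in label_sets]
    clf = DecisionTreeClassifier().fit(X, y)
    return clf, subsets

def label_powerset_predict(clf, subsets, X):
    # Map each predicted single-label class back to its label subset
    return [subsets[c] for c in clf.predict(X)]
```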

Binary relevance (BR) is a popular problem transformation method. It transforms a multi-label task into many binary classification tasks, one for each label. Given a query instance, its predicted labels are the union of the positive-class labels output by these binary classifiers. BR is effective, but it neglects the correlations between labels, which may carry useful information for multi-label classification.
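A minimal sketch of BR, assuming scikit-learn's LinearSVC as the base learner (scikit-learn's OneVsRestClassifier implements the same idea); the helper names are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

def binary_relevance_fit(X, Y):
    """Train one independent binary classifier per label. Y: (N, M) 0/1."""
    return [LinearSVC().fit(X, Y[:, m]) for m in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    # The predicted label set is the union of the labels that the
    # individual binary classifiers output as positive
    return np.column_stack([clf.predict(X) for clf in models])
```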

The classifier chain method is a variant of BR that takes the correlation between labels into account. Similar to BR, a set of one-vs-rest binary classifiers is trained. But unlike BR, the classifiers are linked in a chain, and the feature vector presented to the i-th classifier in the chain is augmented with binary values representing the labels of the first class to the (i-1)-th class. Therefore, label dependence is preserved through the feature space. Classification performance, however, depends on the chain order. This order dependency can be overcome by ensembles of classifier chains [172].
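A minimal sketch of a single chain, using the natural label order and scikit-learn's LogisticRegression as the base learner (both are illustrative choices); ECC [172] would train several such chains with random label orders and average their votes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def chain_fit(X, Y):
    """Classifier chain: the m-th classifier is trained on the features
    augmented with the true labels of classes 1..m-1."""
    models = []
    for m in range(Y.shape[1]):
        X_aug = np.hstack([X, Y[:, :m]])          # append earlier labels
        models.append(LogisticRegression(max_iter=1000).fit(X_aug, Y[:, m]))
    return models

def chain_predict(models, X):
    pred = np.zeros((X.shape[0], len(models)), dtype=int)
    for m, clf in enumerate(models):
        pred[:, m] = clf.predict(np.hstack([X, pred[:, :m]]))  # use predicted labels
    return pred
```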

The compressive sensing approach is motivated by the fact that when the number of classes is large, the actual labels are often sparse. In other words, a typical query instance will belong to only a few classes, even though the total number of classes is large. This approach exploits the sparsity of the output (label) space by means of compressive sensing to obtain a more efficient output coding scheme for large-scale multi-label learning problems.
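A rough sketch of the general idea; the design choices below (Gaussian sensing matrix, ridge regression for the compressed codes, orthogonal matching pursuit for decoding, a fixed sparsity level) are illustrative assumptions rather than the exact algorithm of [96]:

```python
import numpy as np
from sklearn.linear_model import Ridge, OrthogonalMatchingPursuit

def cs_fit(X, Y, k=30, seed=0):
    """Project the sparse (N, M) label matrix down to k << M dimensions
    with a random Gaussian sensing matrix A, then learn a regressor that
    maps features to the compressed label codes."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, Y.shape[1])) / np.sqrt(k)
    reg = Ridge(alpha=1.0).fit(X, Y @ A.T)
    return reg, A

def cs_predict(reg, A, X, sparsity=3):
    """Decode each predicted code by sparse recovery: find a label vector
    with at most `sparsity` nonzeros whose projection matches the code."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity, fit_intercept=False)
    Y_hat = np.array([omp.fit(A, z).coef_ for z in reg.predict(X)])
    return (Y_hat > 0.5).astype(int)
```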

Several algorithms based on support vector machines (SVMs) [205] have been proposed to tackle multi-label classification problems. In [62], a ranking algorithm for multi-label classification is proposed. It uses the ranking loss [179] as the cost function, which is defined as the average fraction of pairs of labels that are incorrectly ordered. Similar to SVMs, it finds the optimal hyperplane with the maximum margin of separation. One major disadvantage of this method is that it does not output a set of labels. The SVM classifiers in [79] adopt the BR method by extending the feature vectors of the original dataset with additional features indicating the relationships between classes.

Compared to algorithm adaptation methods, one advantage of problem transformation methods is that any algorithm that cannot deal with multi-label classification problems can be easily extended to do so via transformation. It should be pointed out that multi-label classification methods are different from multiclass classification methods, such as error-correcting output coding methods [61] and pairwise comparison methods [110]. There is probably no multiclass method that outperforms all others in all circumstances [182], and this is also the case for multi-label methods.

5.2.3 Multi-label classification in bioinformatics

In the past decades, multi-label classification methods have been increasingly applied to bioinformatics, especially to protein subcellular localization. Several multi-label predictors have been proposed to deal with the prediction of multi-label proteins in viruses, plants, and eukaryotes, which will be elaborated below.

There are a few predictors [187, 189, 233] specifically designed for viral proteins, which are generated by viruses and reside in various cellular compartments of the host cell or virus-infected cells. Studying the subcellular localization of viral proteins enables us to obtain information about their destructive tendencies and consequences [187, 189, 233]. It is also beneficial to the annotation of the functions of viral proteins and to the design of antiviral drugs. To the best of our knowledge, there are two predictors, namely Virus-mPLoc [189] and iLoc-Virus [233], capable of predicting multi-label viral proteins. iLoc-Virus performs better than Virus-mPLoc because the former has a better formulation for reflecting GO information and a better way to predict the number of subcellular location sites of a query protein [233]. Recently, a KNN-SVM ensemble classifier [121] was proposed to deal with multi-label proteins, including viral proteins. The performance of the KNN-SVM predictor was found to be comparable to iLoc-Virus and better than Virus-mPLoc.

Conventional methods specializing in plant proteins, such as TargetP [65] and Predotar [193], can only deal with single-label proteins. Plant-mPLoc [51] and iLoc-Plant [230] are state-of-the-art predictors that can deal with both single-label and multi-label plant proteins. iLoc-Plant performs better than Plant-mPLoc because of improvements similar to those of iLoc-Virus over Virus-mPLoc.

Prediction of eukaryotic proteins has been one of the focal points in the past decades. While many predictors [75, 99, 153, 162, 173] were proposed for predicting the subcellular localization of single-label eukaryotic proteins, recent years have witnessed steady progress in multi-label predictors. In particular, Euk-mPLoc 2.0 [49] and iLoc-Euk [52] are state-of-the-art predictors that can deal with both single-label and multi-label eukaryotic proteins. Similar to iLoc-Virus versus Virus-mPLoc, iLoc-Euk performs better than Euk-mPLoc 2.0.

Among these multi-label predictors, Virus-mPLoc, Plant-mPLoc, Euk-mPLoc 2.0, iLoc-Virus, iLoc-Plant, and iLoc-Euk use algorithm adaptation methods, while KNN-SVM uses problem transformation methods.

5.3 mGOASVM: A predictor for both single- and multi-location proteins

This section describes an efficient multi-label predictor, namely mGOASVM [213], for multi-label protein subcellular localization prediction. Here, the prefix "m" stands for multiple, meaning that the predictor can deal with proteins with multiple subcellular locations. mGOASVM differs from other predictors in that (1) it adopts a new decision scheme for an SVM classifier so that it can effectively deal with datasets containing both single-label and multi-label proteins; (2) it selects a set of distinct relevant GO terms to form a more informative GO subspace; and (3) it constructs the GO vectors by using the frequency of occurrences of GO terms instead of 1-0 values [51, 189, 211] indicating the presence or absence of some predefined GO terms. The results in Chapter 9 on two benchmark datasets and a newly created dataset consisting of novel proteins demonstrate that these three properties enable mGOASVM to predict multi-location proteins and to outperform state-of-the-art predictors such as iLoc-Virus and iLoc-Plant.


Fig. 5.1: Flowchart of mGOASVM for three cases: (1) using accession numbers only; (2) using sequences only; (3) using both accession numbers and sequences. AC: accession number; S: sequence. Part II does not exist for Case 1, and Part I does not exist for Case 2. Case 3 requires using both Parts I and II. The score fusion implements equation (5.1).

5.3.1 Feature extraction

The feature extraction part of mGOASVM is similar to that of GOASVM introduced in Chapter 4 (see Section 4.1.2). The only difference is that mGOASVM uses more than one homologous protein for searching the GOA database to retrieve the relevant GO terms. For GO-vector construction, mGOASVM adopts the term-frequency method introduced in Section 4.1.3, which is found to be superior to other construction methods (see Chapter 9). Compared to GOASVM, however, mGOASVM uses a more sophisticated multi-label multiclass classifier for multi-location protein subcellular localization, which is elaborated below.

5.3.2 Multi-label multiclass SVM classification

To predict the subcellular locations of datasets containing both single-label and multi-label proteins, mGOASVM adopts a multi-label support vector machine (SVM) classifier. GO vectors, which can be obtained from equation (4.3), are used for training the multi-label one-vs-rest SVMs. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary SVMs are trained, one for each class. Denote the GO vector created by using the true accession number of the i-th query protein as $\mathbf{q}_{i,0}$, and the GO vectors created by using the n homologous accession numbers as $\mathbf{q}_{i,j}$, $j = 1, \ldots, n$. Then, given the i-th query protein $\mathbb{Q}_i$, the score of the m-th SVM is

$$s_m(\mathbb{Q}_i) = \sum_{j=0}^{n} w_j \left( \sum_{r \in \mathcal{S}_m} \alpha_{m,r}\, y_{m,r}\, K(\mathbf{p}_r, \mathbf{q}_{i,j}) + b_m \right) \qquad (5.1)$$

where $\mathcal{S}_m$ is the set of support vector indexes corresponding to the m-th SVM, $\alpha_{m,r}$ are the Lagrange multipliers, $b_m$ is the bias of the m-th SVM, $K(\cdot, \cdot)$ is a kernel function, and $w_j$ are the fusion weights such that $\sum_{j=0}^{n} w_j = 1$. In mGOASVM, a linear kernel is used, i.e. $K(\mathbf{p}_r, \mathbf{q}_{i,j}) = \mathbf{p}_r^{\mathsf{T}} \mathbf{q}_{i,j}$. $y_{m,r} \in \{-1, +1\}$ are the class labels (here we call them "the transformed labels"), which are defined as follows:

  1. For a single-label training protein $\mathbf{p}_r$ belonging to class $\ell_r$:

$$y_{m,r} = \begin{cases} +1, & \text{if } \ell_r = m, \\ -1, & \text{otherwise.} \end{cases} \qquad (5.2)$$

  2. For a multi-label training protein $\mathbf{p}_r$ with label set $\mathcal{L}_r$:

$$y_{m,r} = \begin{cases} +1, & \text{if } m \in \mathcal{L}_r, \\ -1, & \text{otherwise.} \end{cases} \qquad (5.3)$$

where m ∈ {1,...,M}. Note that unlike the single-label problem, where each protein has one and only one positive transformed label, a multi-label protein can have more than one positive transformed label.
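As an illustration, the following minimal sketch (hypothetical helper name, 0-based class indexes) builds the transformed-label matrix of equations (5.2)–(5.3) from a list of per-protein label sets:

```python
import numpy as np

def transformed_labels(label_sets, M):
    """Equations (5.2)-(5.3): y[m, r] = +1 if class m (0-based) is among
    the labels of the r-th training protein, and -1 otherwise."""
    y = -np.ones((M, len(label_sets)), dtype=int)
    for r, labels in enumerate(label_sets):
        for m in labels:
            y[m, r] = +1
    return y

# Protein 0 resides in class 1 only; protein 1 in classes 1, 2, and 3:
# transformed_labels([{1}, {1, 2, 3}], M=4) has one +1 in column 0 and
# three +1's in column 1.
```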

Note that $\mathbf{p}_r$ in equation (5.1) represents the GO training vectors, which may include the GO vectors created by using the true accession numbers of the training proteins or their homologous accession numbers. We have the following three cases:

 

Case 1. If only the true accession numbers are available, then only $\mathbf{q}_{i,0}$ exists and $\mathbf{p}_r$ represents the GO training vectors created by using only the true accession numbers. In that case, $\mathbf{q}_{i,j}$ ($j = 1, \ldots, n$) do not exist; $w_0 = 1$ and $w_j = 0$ ($j = 1, \ldots, n$). Then, equation (5.1) can be written as

$$s_m(\mathbb{Q}_i) = \sum_{r \in \mathcal{S}_m} \alpha_{m,r}\, y_{m,r}\, K(\mathbf{p}_r, \mathbf{q}_{i,0}) + b_m \qquad (5.4)$$

Case 2. If only the amino acid sequences are known, then only the accession numbers of the homologous sequences can be used for retrieving GO terms and for constructing GO vectors. In that case, $\mathbf{q}_{i,0}$ does not exist and $w_0 = 0$; moreover, $\mathbf{p}_r$ represents the GO training vectors created by using only the homologous accession numbers.

Case 3. If both accession numbers and amino acid sequences are known, then both the true accession numbers and the accession numbers of the homologous sequences can be used for retrieving GO terms and for constructing GO vectors. Then, $\mathbf{q}_{i,j}$ ($j = 0, \ldots, n$) exist, and $\mathbf{p}_r$ represents the GO training vectors created by using both the true accession numbers and the homologous accession numbers.

In mGOASVM [213], 1, 2, 4, and 8 homologs were tried for the virus dataset, and 1 and 2 homologs for the plant dataset, i.e. $n \in \{1, 2, 4, 8\}$ and $n \in \{1, 2\}$ in equation (5.1), respectively. Equal weights for the true accession number and the accession numbers of homologs were adopted. Therefore, for Case 2, $w_0 = 0$ and $w_j = 1/n$, $j = 1, \ldots, n$; and for Case 3, $w_j = 1/(n+1)$, $j = 0, \ldots, n$.
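To make the score fusion concrete, here is a minimal sketch of equation (5.1) for the linear kernel. The dictionary layout (fields alpha_y for the products $\alpha_{m,r} y_{m,r}$, sv for the support vectors, and b for the bias) and the function names are hypothetical, not part of mGOASVM's released implementation:

```python
import numpy as np

def svm_score(svm, q):
    """Inner sum of equation (5.1) for a linear kernel:
    sum_r alpha_{m,r} * y_{m,r} * <p_r, q> + b_m."""
    return float(svm["alpha_y"] @ (svm["sv"] @ q) + svm["b"])

def fused_score(svm, go_vectors, weights):
    """Outer sum of equation (5.1): fuse the scores of the GO vectors
    q_{i,j} with weights w_j, where sum_j w_j = 1."""
    return sum(w * svm_score(svm, q) for w, q in zip(weights, go_vectors))

# Case 1: weights = [1.0] over [q_i0]
# Case 2: weights = [1/n] * n over the n homolog GO vectors
# Case 3: weights = [1/(n+1)] * (n+1) over [q_i0] + homolog GO vectors
```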

Figure 5.1 illustrates the whole prediction process in mGOASVM for all three cases: (1) using only accession numbers, (2) using only sequences, and (3) using both accession numbers and sequences. Part II does not exist for Case 1, and Part I does not exist for Case 2. Both Parts I and II exist for Case 3. Score fusion is the fusion of GO scores obtained from accession numbers of homologs in Case 2, or from both true accession numbers and accession numbers of homologs in Case 3.

The subcellular location(s) of the i-th query protein can be predicted as

$$\mathcal{M}(\mathbb{Q}_i) = \left\{ m : s_m(\mathbb{Q}_i) > 0,\; m = 1, \ldots, M \right\} \qquad (5.5)$$

As can be seen, $\mathcal{M}(\mathbb{Q}_i)$ is a predicted set that may have zero or more elements, which enables us to make a multi-label prediction. In case equation (5.5) does not produce a class label, i.e. $\mathcal{M}(\mathbb{Q}_i) = \emptyset$, the number of subcellular locations is set to one and the location is given by

$$\mathcal{M}(\mathbb{Q}_i) = \arg\max_{m=1}^{M} s_m(\mathbb{Q}_i) \qquad (5.6)$$

Figure 5.2 depicts an example showing how the multi-label SVM classifier works. Similar to the multiclass SVM classifier shown in Figure 4.6, four independent binary SVMs are first trained for the four-class problem, one for each class, and the training GO vectors $\{\mathbf{p}_r\}$ participate in all four binary SVMs. However, contrary to the multiclass SVM classifier, where each training vector has a positive label in only one binary SVM and negative labels in the remaining binary SVMs, a training vector in the multi-label SVM classifier may have a positive label in more than one binary SVM (see equation (5.3)).

In the testing phase, suppose we use the same query proteins as in Figure 4.6 and obtain the same SVM scores as in the multiclass SVM classifier. In the multiclass SVM classifier, e.g. GOASVM, as shown in the upper part of Figure 5.2, the decision scheme predicts the query proteins based on the maximum SVM score. Specifically, query protein I will be assigned to Class 2, because the second SVM score is the maximum among all of the SVM scores, while query protein II will be assigned to Class 3 because its third SVM score is the largest. In contrast, in the multi-label SVM classifier, e.g. mGOASVM, as shown in the lower part of Figure 5.2, the decision scheme is based on the positive SVM scores (see equation (5.5)). Specifically, for query protein I, because only the second SVM score (0.1) is positive, query protein I is considered a single-location protein and is assigned to Class 2. For query protein II, because there are three positive SVM scores, query protein II is considered a multi-label protein and is assigned to Class 2, Class 3, and Class 4.
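The decision rule of equations (5.5)–(5.6) can be sketched in a few lines; the function name and the scores are illustrative (chosen to be consistent with Figure 5.2), not values from an actual run:

```python
import numpy as np

def mgoasvm_decide(scores):
    """Equations (5.5)-(5.6): predict every class with a positive SVM
    score; if no score is positive, fall back to the single top class."""
    scores = np.asarray(scores, dtype=float)
    positive = np.flatnonzero(scores > 0)
    return set(positive) if positive.size else {int(np.argmax(scores))}

# Illustrative scores (0-based class indexes):
# mgoasvm_decide([-0.3, 0.1, -0.2, -0.6])  -> {1}        (query protein I)
# mgoasvm_decide([-0.8, 0.9, 1.6, 0.3])    -> {1, 2, 3}  (query protein II)
```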


Fig. 5.2: An example showing how a multi-label four-class SVM classifier works. The upper part is the testing phase of the single-label multiclass predictor GOASVM, and the lower part is the testing phase of the multi-label multiclass predictor mGOASVM.

5.4 AD-SVM: An adaptive decision multi-label predictor

To determine the number of subcellular locations in which a protein will reside, mGOASVM introduced in Section 5.3 compares the SVM scores with a fixed decision threshold. This simple strategy, however, may easily lead to a large number of false positives. To address this problem, this section proposes an adaptive decision (AD) scheme for multi-label SVM classifiers, leading to a more powerful multi-label classifier, namely AD-SVM. AD-SVM extends binary relevance methods with an adaptive decision scheme that essentially converts the linear SVMs in the classifier into piecewise linear SVMs, effectively reducing over-prediction while having little influence on correctly predicted labels.

AD-SVM uses feature extraction methods similar to mGOASVM. The feature vectors are classified by the proposed adaptive decision multi-label classifier, which is elaborated below.

5.4.1 Multi-label SVM scoring

GO vectors, as computed in equation (4.3), are used for training the multi-label one-vs-rest SVMs. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary SVMs are trained, one for each class. Denote the GO vector created by using the true AC of the i-th query protein as $\mathbf{q}_{i,0}$, and the GO vector created by using the accession number of the k-th homolog as $\mathbf{q}_{i,k}$, $k = 1, \ldots, k_{\max}$, where $k_{\max}$ is the number of homologs retrieved by BLAST with the default parameter setting. Then, given the i-th query protein $\mathbb{Q}_i$, the score of the m-th SVM is

$$s_m(\mathbb{Q}_i) = \sum_{r \in \mathcal{S}_m} \alpha_{m,r}\, y_{m,r}\, K(\mathbf{p}_r, \mathbf{q}_i) + b_m \qquad (5.7)$$

where

$$\mathbf{q}_i = \begin{cases} \mathbf{q}_{i,0}, & \text{if the true AC of } \mathbb{Q}_i \text{ is available,} \\ \mathbf{q}_{i,k^*}, & \text{otherwise, with } k^* \text{ the smallest } k \text{ for which } \mathbf{q}_{i,k} \text{ contains at least one GO term,} \end{cases} \qquad (5.8)$$

and $\mathcal{S}_m$ is the set of support vector indexes corresponding to the m-th SVM, $y_{m,r} \in \{-1, +1\}$ are the transformed class labels (cf. equations (5.2) and (5.3)), $\alpha_{m,r}$ are the Lagrange multipliers, $b_m$ is the bias of the m-th SVM, and $K(\cdot, \cdot)$ is a kernel function; here, the linear kernel is used. Note that $\mathbf{p}_r$ in equation (5.7) represents the GO training vectors, which may include the GO vectors created by using the true ACs of the training sequences or their homologous ACs.

5.4.2 Adaptive decision for SVM (AD-SVM)

To predict the subcellular locations of datasets containing both single-label and multi-label proteins, an adaptive decision scheme for multi-label SVM classifiers is proposed. Unlike the single-label problem, where each protein has only one predicted label, a multi-label protein could have more than one predicted label. Thus, the predicted subcellular location(s) of the i-th query protein are given by the following:

If $s_{\max}(\mathbb{Q}_i) > 0$,

$$\mathcal{M}(\mathbb{Q}_i) = \left\{ m : s_m(\mathbb{Q}_i) > 1.0 \ \text{ or } \ s_m(\mathbb{Q}_i) \geq f\big(s_{\max}(\mathbb{Q}_i)\big) \right\} \qquad (5.9)$$

otherwise,

$$\mathcal{M}(\mathbb{Q}_i) = \arg\max_{m=1}^{M} s_m(\mathbb{Q}_i) \qquad (5.10)$$

In equation (5.9), $f(s_{\max}(\mathbb{Q}_i))$ is a function of $s_{\max}(\mathbb{Q}_i)$, where $s_{\max}(\mathbb{Q}_i) = \max_{m=1}^{M} s_m(\mathbb{Q}_i)$. In [214], a linear function was used, i.e.

$$f\big(s_{\max}(\mathbb{Q}_i)\big) = \theta \, s_{\max}(\mathbb{Q}_i) \qquad (5.11)$$

where $\theta \in [0.0, 1.0]$ is a parameter. Because $f(s_{\max}(\mathbb{Q}_i))$ is linear, equations (5.9) and (5.10) turn the linear SVMs into piecewise linear SVMs. Equation (5.9) also suggests that the predicted labels depend on $s_{\max}(\mathbb{Q}_i)$, a function of the test instance (or protein). This means that the decision and the corresponding threshold are adaptive to the test protein. For ease of reference, we refer to the proposed predictor as AD-SVM.
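A minimal sketch of this adaptive decision scheme, written in the Ref = min{1.0, θ · s_max} form used in the Figure 5.4 example (at θ = 0 it reduces to the positive-score rule of mGOASVM); the function name and the example scores are hypothetical:

```python
import numpy as np

def ad_svm_decide(scores, theta):
    """Equations (5.9)-(5.11): predict every class whose score exceeds
    Ref = min(1.0, theta * s_max); if all scores are non-positive, fall
    back to the single top-scoring class (equation (5.10))."""
    scores = np.asarray(scores, dtype=float)
    s_max = scores.max()
    if s_max <= 0:                        # single-label mode
        return {int(np.argmax(scores))}
    ref = min(1.0, theta * s_max)
    return set(np.flatnonzero(scores > ref))

# With illustrative scores (s_max = 1.6, 0-based class indexes):
# ad_svm_decide([-0.8, 0.9, 1.6, 0.3], 0.0)  -> {1, 2, 3}  (same as mGOASVM)
# ad_svm_decide([-0.8, 0.9, 1.6, 0.3], 0.5)  -> {1, 2}     (Ref = 0.8)
# ad_svm_decide([-0.8, 0.9, 1.6, 0.3], 1.0)  -> {2}        (Ref = 1.0)
```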


Fig. 5.3: A 3-class example illustrating how the adaptive decision scheme changes the decision boundaries from linear to piecewise linear and how the resulting SVMs assign label(s) to test points when θ in equation (5.11) changes from 0 to 1. In (a), the solid and dashed lines represent the decision boundaries and margins of individual SVMs, respectively. In (b)–(d), the input space is divided into three 1-label regions (green, blue, and red) and three 2-label regions (green ∩ blue, blue ∩ red, and red ∩ green).

5.4.3 Analysis of AD-SVM

To facilitate discussion, let us define two terms: over-prediction and under-prediction. Specifically, over- and under-prediction mean that the number of predicted labels of a query protein is larger or smaller, respectively, than that of the ground truth. Here, both over- and under-predictions are considered incorrect predictions, which will be reflected in the "overall actual accuracy (OAA)" to be defined in Section 8.2.3.

Conventional methods use a fixed threshold to determine the predicted classes. When the threshold is too small, the prediction results are liable to over-prediction; on the other hand, when the threshold is too large, the prediction results are susceptible to under-prediction. To overcome this problem, the adaptive decision scheme uses the maximum score $s_{\max}(\mathbb{Q}_i)$ among the one-vs-rest SVMs in the classifier as a reference. In particular, $s_{\max}(\mathbb{Q}_i)$ in equation (5.9) adaptively normalizes the scores of all one-vs-rest SVMs, so that for SVMs to be considered as runner-ups, they need to have a sufficiently large score relative to the winner. This strategy effectively reduces the chance of over-prediction. The first condition in equation (5.9) ($s_m(\mathbb{Q}_i) > 1$) aims to avoid under-prediction when the winning SVM has very high confidence (i.e. $s_{\max}(\mathbb{Q}_i) \gg 1$) but the runner-ups still have enough confidence ($s_m(\mathbb{Q}_i) > 1$) to make a right decision. On the other hand, when the maximum score is small (say $0 < s_{\max}(\mathbb{Q}_i) \leq 1$), θ in the second condition of equation (5.9) can strike a balance between over- and under-prediction. When all of the SVMs have very low confidence (say $s_{\max}(\mathbb{Q}_i) < 0$), the classifier switches to single-label mode via equation (5.10).

To illustrate how this decision scheme works, an example is shown in Figure 5.3. Suppose there are four test data points ($P_1, \ldots, P_4$) which are possibly distributed into three classes: {green, blue, red}. The decision boundaries of individual SVMs and the four points are shown in Figure 5.3a. Suppose $s_m(P_i)$ is the SVM score of $P_i$ with respect to class m, where $i \in \{1, \ldots, 4\}$ and $m \in$ {green, blue, red}. Figure 5.3a suggests the following conditions:

$$\begin{aligned} & s_{\text{green}}(P_1) > 1, && s_{\text{blue}}(P_1) > 1, && s_{\text{red}}(P_1) < 0; \\ & 0 < s_{\text{green}}(P_2) < 1, && s_{\text{blue}}(P_2) > 1, && s_{\text{red}}(P_2) < 0; \\ & 0 < s_{\text{green}}(P_3) < 1, && 0 < s_{\text{blue}}(P_3) < 1, && s_{\text{red}}(P_3) < 0; \\ & s_m(P_4) < 0, && m \in \{\text{green}, \text{blue}, \text{red}\}. && \end{aligned}$$

Note that points whose scores lie between 0 and 1 are susceptible to over-prediction because they are very close to the decision boundaries of the corresponding SVMs. The decision scheme of equations (5.9)–(5.11) with θ = 0.0 leads to the decision boundaries shown in Figure 5.3b. Based on these boundaries, $P_1$, $P_2$, and $P_3$ will be assigned to class green ∩ blue, and $P_4$ will be assigned to the class with the highest SVM score (using equation (5.10)). If θ increases to 0.5, the results shown in Figure 5.3c will be obtained. The assignments of $P_1$, $P_3$, and $P_4$ remain unchanged, but $P_2$ will change from class green ∩ blue to class blue. Similarly, when θ increases to 1.0 (Figure 5.3d), the class of $P_3$ will also be determined by the SVM with the highest score. This analysis suggests that as θ increases from 0 to 1, the decision criterion becomes more stringent, which has the effect of shrinking the 2-label regions in Figure 5.3 and thus reduces over-prediction. Provided that θ is not close to 1, this reduction in over-prediction will not compromise the decisions made by the high-scoring SVMs.


Fig. 5.4: A 4-class example showing how the adaptive decision scheme works when θ in equation (5.11) changes from 0 to 1. (a) is the testing phase of the decision scheme of mGOASVM, (b)–(d) are the testing phases of the adaptive decision schemes with θ in equation (5.11) equal to 0, 0.5, and 1, respectively. Ref: the reference or the SVM score threshold, over which the corresponding class is predicted to be positive.

To further exemplify the strategy of the adaptive decision scheme, a four-class multi-label example demonstrating how the adaptive decision scheme works is shown in Figure 5.4. The training phase is similar to that of mGOASVM specified in Section 5.3.2. Here, we only use query protein II of Figure 5.2 to demonstrate the adaptive scheme. Figure 5.4a shows the testing phase of the baseline predictor, i.e. mGOASVM, and Figures 5.4b–d show the testing phases of the adaptive decision scheme with θ in equation (5.11) equal to 0, 0.5, and 1, respectively. As shown in Figure 5.4b, when θ = 0, the adaptive scheme is the same as the decision scheme of mGOASVM, with the reference (the SVM score threshold) equal to 0. In other words, if there is any positive SVM score, the query protein will be assigned to the corresponding class. Thus, in this case, the query protein is assigned to Class 2, Class 3, and Class 4. When θ = 0.5 (Figure 5.4c), the reference score becomes Ref = min{1.0, 0.5 · $s_{\max}(\mathbb{Q}_i)$} = min{1.0, 0.5 × 1.6} = 0.8, i.e. only those classes whose SVM scores are larger than 0.8 will be predicted as positive. In this case, the query protein is predicted to locate in Class 2 and Class 3. When we increase θ to 1, as shown in Figure 5.4d, the reference score becomes Ref = min{1.0, 1 · $s_{\max}(\mathbb{Q}_i)$} = min{1.0, 1 × 1.6} = 1.0. In this case, only the third SVM score is larger than 1.0. Therefore, the query protein is considered a single-location protein and is predicted to locate in Class 3.


Fig. 5.5: An example showing how the adaptive decision scheme works when θ in equation (5.11) changes from 0 to 1. The ordinate represents the SVM score, and the abscissa represents the classes.

This four-class multi-label problem can also be viewed as in Figure 5.5, where we can clearly see that when θ = 0 (dark gray solid line), which corresponds to the decision scheme of mGOASVM, three classes (Classes 2, 3, and 4) pass the criterion; when θ = 0.5 (light gray solid line), only Class 2 and Class 3 pass the criterion; and when θ increases to 1, the criterion becomes 1.0 (black solid line) and only Class 3 will be the predicted label for the query protein. This suggests that as θ increases, the decision scheme becomes more stringent and over-prediction is reduced.

5.5 mPLR-Loc: A multi-label predictor based on penalized logistic regression

Logistic regression (LR) is a powerful discriminative classifier with an explicit probabilistic interpretation built into its model [95]. Traditional logistic regression classifiers, including penalized logistic regression classifiers [191, 244, 245], are only applicable to single-label classification. This section elaborates an efficient penalized multi-label logistic regression classifier, mPLR-Loc, equipped with an adaptive decision scheme.

5.5.1 Single-label penalized logistic regression

Suppose, for a two-class single-label problem, we are given a set of training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{T+1}$ and $y_i \in \{0, 1\}$. In our case, $\mathbf{x}_i = [1 \; \mathbf{q}_i^{\mathsf{T}}]^{\mathsf{T}}$, where $\mathbf{q}_i$ is defined in equation (4.3). Denote $\Pr(Y = y_i | X = \mathbf{x}_i)$ as the posterior probability of the event that $X$ belongs to class $y_i$ given $X = \mathbf{x}_i$. In logistic regression, the posterior probability is defined as

$$\Pr(Y = 1 | X = \mathbf{x}_i) = p(\mathbf{x}_i; \boldsymbol{\beta}) = \frac{e^{\boldsymbol{\beta}^{\mathsf{T}} \mathbf{x}_i}}{1 + e^{\boldsymbol{\beta}^{\mathsf{T}} \mathbf{x}_i}} \qquad (5.12)$$

where $\boldsymbol{\beta}$ is a (T + 1)-dimensional parameter vector. When the number of training instances (N) is not significantly larger than the feature dimension (T + 1), using logistic regression without any regularization often leads to over-fitting. To avoid over-fitting, an L2 penalty term is added to the cross-entropy error function, giving the penalized error function

$$E(\boldsymbol{\beta}) = -\sum_{i=1}^{N} \left\{ y_i \ln p(\mathbf{x}_i; \boldsymbol{\beta}) + (1 - y_i) \ln\big(1 - p(\mathbf{x}_i; \boldsymbol{\beta})\big) \right\} + \frac{\rho}{2} \|\boldsymbol{\beta}\|^2 \qquad (5.13)$$

where ρ is a user-defined penalty parameter controlling the degree of regularization; its value can be determined by cross-validation.

To minimize $E(\boldsymbol{\beta})$, we may use the Newton–Raphson algorithm

$$\boldsymbol{\beta}^{\text{new}} = \boldsymbol{\beta}^{\text{old}} - \left( \frac{\partial^2 E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^{\mathsf{T}}} \right)^{-1} \frac{\partial E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \qquad (5.14)$$

where

$$\frac{\partial E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \mathbf{X}^{\mathsf{T}} (\mathbf{p} - \mathbf{y}) + \rho \boldsymbol{\beta} \qquad (5.15)$$

and

$$\frac{\partial^2 E(\boldsymbol{\beta})}{\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^{\mathsf{T}}} = \mathbf{X}^{\mathsf{T}} \mathbf{W} \mathbf{X} + \rho \mathbf{I} \qquad (5.16)$$

See Appendix D for the derivations of equations (5.15) and (5.16). In equations (5.15) and (5.16), $\mathbf{y}$ and $\mathbf{p}$ are N-dimensional vectors whose elements are $y_i$ and $p(\mathbf{x}_i; \boldsymbol{\beta})$, respectively, $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]^{\mathsf{T}}$, and $\mathbf{W}$ is a diagonal matrix whose i-th diagonal element is $p(\mathbf{x}_i; \boldsymbol{\beta})(1 - p(\mathbf{x}_i; \boldsymbol{\beta}))$, $i = 1, 2, \ldots, N$.

Substituting equations (5.15) and (5.16) into equation (5.14) gives the following iterative formula for estimating $\boldsymbol{\beta}$:

$$\boldsymbol{\beta}^{\text{new}} = \boldsymbol{\beta}^{\text{old}} - \left( \mathbf{X}^{\mathsf{T}} \mathbf{W} \mathbf{X} + \rho \mathbf{I} \right)^{-1} \left[ \mathbf{X}^{\mathsf{T}} (\mathbf{p} - \mathbf{y}) + \rho \boldsymbol{\beta}^{\text{old}} \right] \qquad (5.17)$$
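As a concrete illustration, the following minimal sketch implements the update of equation (5.17) with a fixed number of Newton–Raphson iterations (a convergence test could be used instead); the function name is hypothetical:

```python
import numpy as np

def penalized_lr_fit(X, y, rho=1.0, n_iter=20):
    """Penalized logistic regression via the Newton-Raphson updates of
    equations (5.14)-(5.17). X: (N, T+1) matrix whose rows are the
    augmented vectors x_i = [1, q_i]; y: (N,) 0/1 labels; rho: penalty."""
    N, D = X.shape
    beta = np.zeros(D)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # p(x_i; beta)
        W = p * (1.0 - p)                                 # diagonal of W
        grad = X.T @ (p - y) + rho * beta                 # equation (5.15)
        hess = (X * W[:, None]).T @ X + rho * np.eye(D)   # equation (5.16)
        beta -= np.linalg.solve(hess, grad)               # equation (5.17)
    return beta

# Multi-label extension (equation (5.18)): one one-vs-rest LR per class,
# trained on the transformed labels y_{i,m}:
# betas = [penalized_lr_fit(X, Y[:, m], rho) for m in range(M)]
```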

5.5.2 Multi-label penalized logistic regression

In an M-class multi-label problem, the training data set is written as $\{(\mathbf{x}_i, \mathcal{Y}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{T+1}$ and $\mathcal{Y}_i \subset \{1, 2, \ldots, M\}$ is a set containing one or more labels. M independent binary one-vs-rest LRs are trained, one for each class. The label sets $\{\mathcal{Y}_i\}_{i=1}^{N}$ are converted to transformed labels $y_{i,m} \in \{0, 1\}$ (similar to equations (5.2) and (5.3)), where $i = 1, \ldots, N$ and $m = 1, \ldots, M$. The two-class update formula in equation (5.17) is then extended to

$$\boldsymbol{\beta}_m^{\text{new}} = \boldsymbol{\beta}_m^{\text{old}} - \left( \mathbf{X}^{\mathsf{T}} \mathbf{W}_m \mathbf{X} + \rho \mathbf{I} \right)^{-1} \left[ \mathbf{X}^{\mathsf{T}} (\mathbf{p}_m - \mathbf{y}_m) + \rho \boldsymbol{\beta}_m^{\text{old}} \right] \qquad (5.18)$$

where $m = 1, \ldots, M$, $\mathbf{y}_m$ and $\mathbf{p}_m$ are N-dimensional vectors whose elements are $y_{i,m}$ and $p(\mathbf{x}_i; \boldsymbol{\beta}_m)$, respectively, and $\mathbf{W}_m$ is a diagonal matrix whose i-th diagonal element is $p(\mathbf{x}_i; \boldsymbol{\beta}_m)(1 - p(\mathbf{x}_i; \boldsymbol{\beta}_m))$, $i = 1, 2, \ldots, N$.

Given the GO vector $\mathbf{q}_i$ of the i-th query protein $\mathbb{Q}_i$ and the corresponding augmented vector $\mathbf{x}_i = [1 \; \mathbf{q}_i^{\mathsf{T}}]^{\mathsf{T}}$, the score of the m-th LR is given by

$$s_m(\mathbb{Q}_i) = p(\mathbf{x}_i; \boldsymbol{\beta}_m) = \frac{e^{\boldsymbol{\beta}_m^{\mathsf{T}} \mathbf{x}_i}}{1 + e^{\boldsymbol{\beta}_m^{\mathsf{T}} \mathbf{x}_i}} \qquad (5.19)$$

The probabilistic nature of logistic regression enables us to assign confidence scores to the prediction decisions. Specifically, for the m-th location, the corresponding confidence score is $s_m(\mathbb{Q}_i)$. See Appendix B for the confidence scores produced by the mPLR-Loc server.

5.5.3 Adaptive decision for LR (mPLR-Loc)

Because the LR scores of a binary LR classifier are posterior probabilities, the m-th class label will be assigned to $\mathbb{Q}_i$ only if $s_m(\mathbb{Q}_i) > 0.5$. To facilitate multi-label classification, the following decision scheme is adopted:

$$\mathcal{M}(\mathbb{Q}_i) = \left\{ m : s_m(\mathbb{Q}_i) > 0.5 \ \text{ or } \ s_m(\mathbb{Q}_i) \geq f\big(s_{\max}(\mathbb{Q}_i)\big) \right\} \qquad (5.20)$$

where $f(s_{\max}(\mathbb{Q}_i))$ is a function of $s_{\max}(\mathbb{Q}_i)$ and $s_{\max}(\mathbb{Q}_i) = \max_{m=1}^{M} s_m(\mathbb{Q}_i)$. In this work, we used a linear function as follows:

$$f\big(s_{\max}(\mathbb{Q}_i)\big) = \theta \, s_{\max}(\mathbb{Q}_i) \qquad (5.21)$$

where $\theta \in (0.0, 1.0]$ is a parameter that can be optimized by cross-validation. Note that θ cannot be 0.0, or otherwise all M labels would be assigned to $\mathbb{Q}_i$; this is because $s_m(\mathbb{Q}_i)$ is a posterior probability, which is always greater than or equal to zero. Clearly, equation (5.20) suggests that the predicted labels depend on $s_{\max}(\mathbb{Q}_i)$, a function of the test instance (or protein). This means that the decision and its corresponding threshold are adaptive to the test protein. For ease of reference, we refer to this predictor as mPLR-Loc.
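A minimal sketch of equations (5.20)–(5.21) (hypothetical function name):

```python
import numpy as np

def mplr_loc_decide(scores, theta):
    """Equations (5.20)-(5.21): predict class m when s_m > 0.5 or
    s_m >= theta * s_max. Because LR scores are posterior probabilities
    (always >= 0), theta = 0 would select all M labels."""
    scores = np.asarray(scores, dtype=float)
    keep = (scores > 0.5) | (scores >= theta * scores.max())
    return set(np.flatnonzero(keep))

# e.g. mplr_loc_decide([0.45, 0.40, 0.10], theta=0.8) -> {0, 1}: neither
# score exceeds 0.5, but both are at least 0.8 * 0.45 = 0.36.
```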

5.6 Summary

This chapter mainly focused on multi-location protein subcellular localization and presented three efficient multi-label classifiers for prediction: mGOASVM, AD-SVM, and mPLR-Loc. All of the predictors use GO information as features for classification.

From the perspective of multi-label classification, all three predictors adopt binary relevance methods to deal with multi-label problems. From the perspective of classifiers, mGOASVM and AD-SVM use SVM classifiers, while mPLR-Loc adopts logistic regression classifiers. From the perspective of decision schemes, AD-SVM and mPLR-Loc adopt an adaptive decision scheme based on the maximum score of the one-vs-rest SVM or LR classifiers, while mGOASVM uses a fixed decision scheme for final decision-making. Compared to mGOASVM and AD-SVM, another noteworthy point of mPLR-Loc is that its LR scores are posterior probabilities, which may carry more biological meaning, because a posterior score naturally indicates how likely a protein is to reside in each subcellular location.
