Typically, GO-based methods extract as many as thousands of GO terms to formulate GO vectors. The curse of dimensionality severely restricts the predictive power of GO-based multi-label classification systems. Moreover, high-dimensional feature vectors may contain redundant or irrelevant information, causing the classification systems to suffer from overfitting. To address this problem, this chapter presents a dimensionality reduction method that applies random projection (RP) to construct an ensemble of multi-label classifiers. After feature extraction, the GO vectors are projected onto lower-dimensional spaces by random projection matrices whose elements conform to a distribution with zero mean and unit variance. The transformed low-dimensional vectors are classified by an ensemble of one-vs-rest multi-label classifiers, each corresponding to one of the RP matrices. The scores obtained from the ensemble are then fused for predicting the subcellular localization of proteins.
This chapter begins with an introduction to random projection. Then, two multi-label classifiers that use ensemble RP, namely RP-SVM and R3P-Loc, are presented. The former uses multi-label SVM classifiers, and the latter uses multi-label ridge regression classifiers. Moreover, to further reduce redundant information and extract useful information, two compact databases are created for the R3P-Loc predictor.
In machine learning, high-dimensional patterns are often mapped to a lower-dimensional subspace to avoid the curse of dimensionality [127]. Reducing the dimension of the input patterns can remove redundant or irrelevant information and allow for more reliable classification in the subspace. Indeed, dimension reduction is imperative in various domains, such as text categorization [69], image retrieval [197], and gene expression microarray data analysis [80].
In the past three decades, random projection (RP) has emerged as a powerful method for dimension reduction. By using RP, the high dimensional feature vectors are transformed into much lower dimensional vectors, which preserve the original geometrical structure and contain less redundant, irrelevant, or even detrimental information that might deteriorate classification performance. RP turns out to be a computationally efficient yet sufficiently accurate method for dimensionality reduction of many high-dimensional datasets [16]. RP is particularly useful for sparse input data in high dimensions, as the original data can be reconstructed almost perfectly from data in the lower dimensional projected space [27]. RP has been widely used in various applications, such as preprocessing text data [161], indexing audio documents [114], processing images [16], and learning high-dimensional Gaussian mixture models [59]. Recently, dynamic random projection [224, 235] has been successfully applied in biometric template protection and privacy-preserving verification.
As stated in Chapter 2, conventional methods for subcellular localization prediction can be roughly divided into sequence-based and knowledge-based methods. However, whether sequence-based or annotation-based features are used, a predominant scenario is that the dimension of the available features is much larger than the number of training samples. For example, Lee et al. [116] used amino-acid sequence-based features, whose dimension (11,992-dim) is remarkably larger than the number of proteins (3017); Xiao et al. [233] used gene ontology (GO) information as the features, and the dimension of the features was 11,118 while the number of proteins was only 207. Such high-dimensional features are thus likely to contain redundant or irrelevant information, causing overfitting and worsening the prediction performance.
The key idea of RP arises from the Johnson–Lindenstrauss lemma [104]:

Lemma 1 (Johnson and Lindenstrauss [104]). Given ɛ > 0, a set of N points in ℝ^T, and a positive integer d such that d ≥ d₀ = O(log N / ɛ²), there exists a map f : ℝ^T → ℝ^d such that

(1 − ɛ)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ɛ)‖u − v‖²

for all u, v in the set of N points.
The lemma suggests that if points in a high-dimensional space are projected onto a randomly selected subspace of suitable dimension, the distances between the points are approximately preserved. A proof can be found in [71].
Specifically, the original T-dimensional data is projected onto a d-dimensional (d ≪ T) subspace using a d × T random matrix R whose columns are of unit length. In our case, for the i-th protein, the GO vector q_i (equation (4.3) in Section 4.1.3 of Chapter 4) can be projected as

q̂_i = (1/√d) R q_i,    (7.1)

where 1/√d is a scaling factor, q̂_i is the projected vector after RP, and R is a random d × T matrix.
The choice of the random matrix R has been extensively studied. Practically, as long as the elements r_{h,j} of R conform to any distribution with zero mean and unit variance, R will give a mapping that satisfies the Johnson–Lindenstrauss lemma [16]. For computational simplicity, we adopted a simple distribution proposed by Achlioptas [2] for the elements r_{h,j} as follows:

r_{h,j} = √3 × { +1 with probability 1/6;  0 with probability 2/3;  −1 with probability 1/6 }.    (7.2)

It is easy to verify that equation (7.2) conforms to a distribution with zero mean and unit variance [2].
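As an illustration, the projection of equation (7.1) with the Achlioptas distribution of equation (7.2) can be sketched in a few lines of NumPy (the function names are ours, for illustration only):

```python
import numpy as np

def achlioptas_matrix(d, T, seed=None):
    """Draw a d x T random projection matrix whose entries follow Achlioptas'
    sparse distribution: sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}.
    Each entry then has zero mean and unit variance."""
    rng = np.random.default_rng(seed)
    vals = rng.choice([1.0, 0.0, -1.0], size=(d, T), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0) * vals

def project(q, R):
    """Project a T-dim GO vector q onto d dimensions as in equation (7.1)."""
    d = R.shape[0]
    return (1.0 / np.sqrt(d)) * (R @ q)
```

Because two out of three entries are zero, the matrix-vector product needs only about one third of the multiplications of a dense Gaussian projection.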
This section proposes an ensemble SVM classifier based on random projection (RP), namely RP-SVM, for predicting the subcellular localization of multi-label proteins. By using RP, the high-dimensional feature vectors are transformed into much lower-dimensional vectors that contain less redundant, irrelevant, or even detrimental information, which might otherwise worsen classification performance. To make the classifiers more robust, the feature vectors are randomly projected a number of times, each with a different projection matrix. The resulting projected vectors are then presented to an ensemble of one-vs-rest multi-label SVM classifiers.
Figure 7.1 shows the flowchart of using ensemble random projection as a dimensionality-reduction method. As can be seen, after the GO-vector construction elaborated in Section 4.1, the ensemble random projection algorithm is applied to convert the original high-dimensional GO vector into a low-dimensional projected vector. The low-dimensional projected vector is subsequently used for multi-label SVM classification.
The projected GO vectors obtained from equation (7.1) are used for training multi-label one-vs-rest SVMs. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary SVMs are trained, one for each class. Denote the GO vector created by using the true AC of the i-th query protein as q_{i,0} and the GO vector created by using the accession number of the k-th homolog as q_{i,k}, k ∈ {1,...,k_max}, where k_max is the number of homologs retrieved by BLAST with the default parameter setting. By equation (7.1), we obtained the corresponding projected vectors q̂_{i,0} and q̂_{i,k}, respectively. Then, given the i-th query protein ℚ_i, the score of the m-th SVM is

s_m(ℚ_i) = Σ_{r∈S_m} α_{m,r} y_{m,r} K(p̂_r, q̂_i),    (7.3)

where

q̂_i ∈ {q̂_{i,0}, q̂_{i,1}, ..., q̂_{i,k_max}},    (7.4)

and S_m is the set of support vector indexes corresponding to the m-th SVM, y_{m,r} ∈ {−1,+1} are the class labels, α_{m,r} are the Lagrange multipliers, and K(⋅,⋅) is a kernel function; here, the linear kernel is used. Note that p̂_r in equation (7.3) represents the projected GO training vectors, which may include the projected GO vectors created by using the true ACs of the training sequences or their homologous ACs.
Since R is a random matrix, the scores in equation (7.3) will be different for each application of RP. To construct a robust classifier, we fused the scores of several applications of RP to obtain an ensemble classifier, whose ensemble score of the m-th SVM for the i-th query protein is given by

s_m^{ens}(ℚ_i) = Σ_{l=1}^{L} w_l s_m^{(l)}(ℚ_i),    (7.5)

where s_m^{(l)}(ℚ_i) represents the score of the m-th SVM for the i-th protein via the l-th application of RP, L is the total number of applications of RP, and w_l is the weight. For simplicity, here we set w_l = 1/L, l = 1,...,L. We refer to L as the “ensemble size” in the sequel. Unless stated otherwise, the ensemble size was set to 10 in our experiments, i.e. L = 10. Note that instead of mapping the original data into one Ld-dim vector, the ensemble RP projects it into L d-dim vectors.
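The score fusion can be made concrete with a short sketch (our own illustration; `score_fns` stands in for the per-projection one-vs-rest SVM scoring functions, which are assumptions here, not part of the published method):

```python
import numpy as np

def ensemble_scores(q, R_list, score_fns):
    """Fuse the scores of L applications of RP with equal weights w_l = 1/L.
    R_list[l] is the l-th d x T random matrix; score_fns[l] maps the l-th
    projected vector to the M per-class scores (hypothetical callables)."""
    L = len(R_list)
    fused = 0.0
    for R, score in zip(R_list, score_fns):
        d = R.shape[0]
        q_hat = (R @ q) / np.sqrt(d)                     # equation (7.1)
        fused = fused + np.asarray(score(q_hat), float)  # accumulate scores
    return fused / L                                     # equal weights w_l = 1/L
```

Keeping L separate d-dim projections, rather than one Ld-dim projection, is what allows each classifier in the ensemble to stay small.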
To predict the subcellular locations of datasets containing both single-label and multi-label proteins, a decision scheme for multi-label SVM classifiers should be used. Unlike the single-label problem, where each protein has only one predicted label, a multi-label protein should have more than one predicted label. In this work, we evaluated two decision schemes. The first decision scheme is the same as that used in mGOASVM (Section 5.3 in Chapter 5). In this scheme, the predicted subcellular location(s) of the i-th query protein are given by

ℳ(ℚ_i) = ⋃_{m=1}^{M} {m : s_m^{ens}(ℚ_i) > 0} if at least one ensemble score is positive; otherwise, ℳ(ℚ_i) = arg max_{m∈{1,...,M}} s_m^{ens}(ℚ_i).    (7.6)
The second decision scheme is an adaptive decision scheme that improves upon the first one. It is the same as that used in AD-SVM (Section 5.4 in Chapter 5). In this scheme, the predicted subcellular location(s) of the i-th query protein are given by:

If s_max(ℚ_i) > 0,

ℳ(ℚ_i) = ⋃_{m=1}^{M} {m : s_m^{ens}(ℚ_i) ≥ f(s_max(ℚ_i))};    (7.7)

otherwise,

ℳ(ℚ_i) = arg max_{m∈{1,...,M}} s_m^{ens}(ℚ_i).

In equation (7.7), f(s_max(ℚ_i)) is a function of s_max(ℚ_i), where s_max(ℚ_i) = max_{m∈{1,...,M}} s_m^{ens}(ℚ_i). In this work, we used a linear function as follows:

f(s_max(ℚ_i)) = θ ⋅ s_max(ℚ_i),    (7.8)

where θ ∈ [0.0, 1.0] is a parameter which was optimized to achieve the best performance.
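The adaptive scheme can be sketched as follows (a minimal illustration of equations (7.7)–(7.8), assuming the ensemble scores have already been computed; class indexes start at 0 here):

```python
import numpy as np

def adaptive_decide(scores, theta=0.5):
    """Adaptive decision scheme (sketch): when the maximum score is positive,
    predict every class whose score reaches f(s_max) = theta * s_max;
    otherwise fall back to the single highest-scoring class."""
    s = np.asarray(scores, dtype=float)
    s_max = s.max()
    if s_max > 0:
        return [m for m in range(len(s)) if s[m] >= theta * s_max]
    return [int(s.argmax())]
```

With theta = 1.0 the scheme degenerates to single-label argmax prediction; with theta = 0.0 it predicts every class with a non-negative score, so tuning theta trades precision against recall.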
For ease of comparison, we refer to the proposed ensemble classifier with the first and the second decision scheme as RP-SVM and RP-AD-SVM, respectively. Figure 7.2 illustrates the whole prediction process for RP-SVM and RP-AD-SVM. If we use the first decision scheme for multi-label classification, the diagram represents RP-SVM; if we use the second decision scheme, it represents RP-AD-SVM.
So far, we have presented many GO-based predictors which extract GO information from a knowledge database, i.e. the Gene Ontology Annotation (GOA) Database. However, the predominant scenarios of GO-based methods are that (1) the GOA Database is enormous and is growing exponentially, (2) the GOA Database contains redundant information, and (3) the number of features extracted from the GOA Database is much larger than the number of data samples with ground-truth labels. These properties make the extracted features liable to contain redundant or irrelevant information, causing the prediction systems to suffer from overfitting. To address these problems, this section proposes an efficient multi-label predictor, R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases have been created from the Swiss-Prot and GOA Databases. These databases possess almost the same amount of information as their full-size counterparts but are much smaller in size. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the dimensions by sevenfold and significantly outperforms state-of-the-art predictors. It was also found that the compact databases reduce the memory consumption by 39 times without causing a decrease in prediction accuracy.
This section is organized as follows. First, the limitation of using the knowledge databases is introduced. Then, the procedure of creating the two compact databases (i.e., ProSeq and ProSeq-GO) is specified. Subsequently, multi-label ridge regression classifiers equipped with random projection are presented.
Recently, several state-of-the-art multi-label predictors have been proposed, e.g. Plant-mPLoc [51], Euk-mPLoc 2.0 [49], iLoc-Plant [230], iLoc-Euk [52], mGOASVM [213], HybridGO-Loc [217], and other predictors [88, 120, 218]. They all use the GO information as the features and apply different multi-label classifiers to tackle the multi-label classification problem. However, these GO-based methods are not without their disadvantages. Currently, the predominant scenarios of GO-based methods are as follows: (1) the GOA Database is enormous and growing exponentially; (2) the GOA Database contains redundant information; and (3) the number of features extracted from the GOA Database is much larger than the number of data samples with ground-truth labels.
To tackle the problems mentioned above, this section proposes an efficient and compact multi-label predictor, R3P-Loc, which uses ridge regression and random projection for predicting subcellular localization of both single-label and multi-label proteins. Instead of using the Swiss-Prot and GOA databases, R3P-Loc uses two newly-created compact databases, namely ProSeq and ProSeq-GO, for GO information transfer. The ProSeq Database is a sequence database in which each amino acid sequence has at least one GO term annotated to it. The ProSeq-GO comprises GO terms annotated to the protein sequences in the ProSeq Database. An important property of the ProSeq and ProSeq-GO Databases is that they are much smaller than the Swiss-Prot and GOA Databases, respectively.
Given a query protein, a set of GO terms is retrieved by searching against the ProSeq-GO Database using the accession numbers of homologous proteins as the search keys, where the homologous proteins are obtained from BLAST searches, using ProSeq as the sequence database. The frequencies of GO occurrences are used to formulate frequency vectors, which are projected onto a much lower-dimensional space by random matrices whose elements conform to a distribution with zero mean and unit variance. Subsequently, the dimension-reduced feature vectors are classified by a multi-label ridge regression classifier.
Typically, for a query protein an efficient predictor should be able to deal with two possible cases: (1) the accession number (AC) is known, and (2) only the amino acid sequence is known. For proteins with known ACs, their respective GO terms are retrieved from a database containing GO terms (i.e. the GOA Database) using the ACs as the search keys. For a protein without an AC, its amino acid sequence is presented to BLAST [5] to find its homologs against a database containing protein amino acid sequences (i.e. Swiss-Prot), whose ACs are then used as keys to search against the GO-term database.
While the GOA Database allows us to associate the AC of a protein with a set of GO terms, for some novel proteins, neither their ACs nor the ACs of their top homologs have any entries in the GOA Database; in other words, no GO terms can be retrieved by their ACs or the ACs of their top homologs. In such cases, some predictors use back-up methods which rely on other features, such as pseudo amino acid composition [35] and sorting signals [153]; some predictors [213, 215] use a successive-search strategy to avoid null GO vectors. However, these strategies may lead to poor performance and increase computation and storage complexity.
To address this problem, we created two small but efficient databases: ProSeq and ProSeq-GO. The former is a sequence database, and the latter is a GO-term database. The procedures of creating these databases are shown in Figure 7.3. The procedure extracts accession numbers from two different sources: Swiss-Prot and the GOA Database. Specifically, all of the ACs in the Swiss-Prot Database and the valid ACs in the GOA Database are extracted. Here an AC is considered valid if it has at least one GO term annotated to it. Then the common ACs which appear in both sets are selected (the ∩ symbol in Figure 7.3). These ACs are regarded as “valid Swiss-Prot ACs”; each of them corresponds to at least one GO term in the GOA Database. Next, using these valid ACs, their corresponding amino acid sequences can be retrieved from the Swiss-Prot Database, constituting a new sequence database, which we call “ProSeq Database”; similarly, using these valid ACs, their corresponding GO terms can be retrieved from the GOA Database, constituting a new GO-term database, which we call “ProSeq-GO database”. In this work, we created ProSeq and ProSeq-GO Databases from the Swiss-Prot and GOA Databases released in July 2013. The ProSeq-GO Database has 513,513 entries while the GOA Database has 25,441,543 entries; the ProSeq Database has 513,513 protein sequences while the Swiss-Prot Database has 540,732 protein sequences.
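The database-construction procedure of Figure 7.3 amounts to intersecting two sets of accession numbers. A minimal dictionary-based sketch (the actual implementation details are not specified in this chapter):

```python
def build_compact_databases(swissprot_seqs, goa_terms):
    """Sketch of Figure 7.3: keep only the ACs that appear in Swiss-Prot AND
    have at least one GO term in the GOA Database.
    swissprot_seqs: dict mapping AC -> amino acid sequence
    goa_terms:      dict mapping AC -> list of GO terms (possibly empty)"""
    # "Valid" ACs: present in both sources, with at least one GO term annotated.
    valid_acs = set(swissprot_seqs) & {ac for ac, terms in goa_terms.items() if terms}
    proseq = {ac: swissprot_seqs[ac] for ac in valid_acs}   # ProSeq Database
    proseq_go = {ac: goa_terms[ac] for ac in valid_acs}     # ProSeq-GO Database
    return proseq, proseq_go
```

By construction, every sequence in ProSeq is guaranteed to yield at least one GO term from ProSeq-GO, so BLAST homology transfer never produces a null GO vector.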
Similar to mGOASVM (Section 5.3 in Chapter 5), R3P-Loc uses GO frequencies as the feature information, and the feature extraction part is the same as mGOASVM. After feature extraction, similar to RP-SVM, random projection is applied to the GO vectors (equation (7.1) in Section 7.2). Then, an ensemble multi-label ridge regression classifier is proposed for classification of the dimension-reduced vectors.
Ridge regression (RR) is a simple yet effective linear regression model which has been applied to many domains [85, 136, 163]. Here we apply RR to classification. Suppose for a two-class single-label problem we are given a set of training data {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ ℝ^{T+1} and y_i ∈ {−1, 1}. In our case, x_i = [1, q̂_i^T]^T, where q̂_i is defined in equation (7.1). Generally speaking, an RR model imposes an L2-style regularization on ordinary least squares (OLS), namely minimizing the empirical loss

l(β) = Σ_{i=1}^{N} (y_i − Σ_{j=1}^{T+1} β_j x_{i,j})²    (7.9)

subject to

Σ_{j=1}^{T+1} β_j² ≤ s,    (7.10)

where s > 0, x_{i,j} is the j-th element of x_i, and β = [β_1,...,β_j,...,β_{T+1}]^T is the ridge vector to be optimized. Equations (7.9)–(7.10) are equivalent to minimizing the following:

l(β) = Σ_{i=1}^{N} (y_i − Σ_{j=1}^{T+1} β_j x_{i,j})² + λ Σ_{j=1}^{T+1} β_j²,    (7.11)

where λ > 0 is a penalty parameter that controls the degree of regularization. After optimization, β is given by

β = (X^T X + λI)^{−1} X^T y,    (7.12)

where X = [x_1,...,x_i,...,x_N]^T, y = [y_1,...,y_i,...,y_N]^T, and I is a (T + 1) × (T + 1) identity matrix.
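Equation (7.12) has a direct closed-form implementation. A small NumPy sketch (using `numpy.linalg.solve` rather than an explicit matrix inverse, for numerical stability):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression, equation (7.12):
    beta = (X^T X + lambda * I)^(-1) X^T y.
    X: N x (T+1) design matrix; y: N labels in {-1, +1}; lam: lambda > 0."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)
```

Since X^T X + λI is (T+1) × (T+1), the cost is governed by the feature dimension, which is exactly what the RP step keeps small.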
In an M-class multi-label problem, the training data set is written as {(x_i, 𝒴_i)}_{i=1}^{N}, where x_i ∈ ℝ^{T+1} and 𝒴_i ⊂ {1,2,...,M} is a set which may contain one or more labels. M independent binary one-vs-rest RRs are trained, one for each class. The labels are converted to transformed labels (similar to equations (5.2)–(5.3)) y_{i,m} ∈ {−1,1}, where i = 1,...,N and m = 1,...,M. Then, equation (7.12) is extended to

β_m = (X^T X + λI)^{−1} X^T y_m,    (7.13)

where m = 1,...,M and y_m are vectors whose elements are y_{i,m}, i = 1,...,N.
The projected GO vectors obtained from equation (7.1) are used for training multi-label one-vs-rest ridge regression (RR) classifiers. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary RRs are trained, one for each class. Then, given the i-th query protein ℚ_i, the score of the m-th RR is

s_m(ℚ_i) = β_m^T x_i,    (7.14)

where x_i = [1, q̂_i^T]^T is the augmented projected GO vector of ℚ_i.
Since R is a random matrix, the scores in equation (7.14) will be different for each application of RP. To construct a robust classifier, we fused the scores of several applications of RP to obtain an ensemble classifier, whose ensemble score of the m-th RR for the i-th query protein is given by

s_m^{ens}(ℚ_i) = Σ_{l=1}^{L} w_l s_m^{(l)}(ℚ_i),    (7.15)

where s_m^{(l)}(ℚ_i) represents the score of the m-th RR for the i-th protein via the l-th application of RP, L is the total number of applications of RP, and w_l is the weight. For simplicity, here we set w_l = 1/L, l = 1,...,L. We refer to L as the “ensemble size” in the sequel. Unless stated otherwise, the ensemble size was set to 10 in our experiments, i.e. L = 10. Note that instead of mapping the original data into one Ld-dim vector, the ensemble RP projects it into L d-dim vectors.
To predict the subcellular locations of datasets containing both single- and multi-label proteins, a decision scheme for multi-label RR classifiers should be used. Unlike the single-label problem, where each protein has only one predicted label, a multi-label protein should have more than one predicted label. Here we used the decision scheme described in mGOASVM (Section 5.3 in Chapter 5). In this scheme, the predicted subcellular location(s) of the i-th query protein are given by

ℳ(ℚ_i) = ⋃_{m=1}^{M} {m : s_m^{ens}(ℚ_i) > 0} if at least one ensemble score is positive; otherwise, ℳ(ℚ_i) = arg max_{m∈{1,...,M}} s_m^{ens}(ℚ_i).    (7.16)
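This decision scheme can be sketched as follows (an equation-level illustration only; class indexes start at 0 here):

```python
import numpy as np

def decide(scores):
    """mGOASVM-style decision scheme (sketch): predict every location with a
    positive ensemble score; if none is positive, predict the single
    top-scoring location."""
    s = np.asarray(scores, dtype=float)
    positive = [m for m in range(len(s)) if s[m] > 0]
    return positive if positive else [int(s.argmax())]
```

The fallback branch guarantees that every query protein receives at least one predicted location, even when all ensemble scores are negative.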
For ease of comparison, we refer to the proposed ensemble classifier with this multi-label decision scheme as R3P-Loc. The flowchart of R3P-Loc is shown in Figure 7.4.
This chapter focuses on applying ensemble random projection to reduce the dimensionality of GO vectors and boost the performance of GO-based predictors. Two RP-based predictors are presented, RP-SVM and R3P-Loc, the former based on multi-label SVM classifiers and the latter on multi-label RR classifiers. In particular, for R3P-Loc, to further reduce the redundant information in the knowledge databases, two compact databases, ProSeq and ProSeq-GO, were created for fast and efficient GO information extraction.