7 Ensemble random projection for large-scale predictions

Typically, GO-based methods extract as many as thousands of GO terms to formulate GO vectors. The curse of dimensionality severely restricts the predictive power of GO-based multi-label classification systems. Besides, high-dimensional feature vectors may contain redundant or irrelevant information, causing the classification systems to suffer from overfitting. To address this problem, this chapter presents a dimensionality reduction method that applies random projection (RP) to construct an ensemble of multi-label classifiers. After feature extraction, the GO vectors are projected onto lower-dimensional spaces by random projection matrices whose elements conform to a distribution with zero mean and unit variance. The transformed low-dimensional vectors are classified by an ensemble of one-vs-rest multi-label classifiers, each corresponding to one of the RP matrices. The scores obtained from the ensemble are then fused for predicting the subcellular localization of proteins.

This chapter begins with an introduction to random projection. Then two multi-label classifiers that use ensemble RP, namely RP-SVM and R3P-Loc, are presented. The former uses multi-label SVM classifiers, and the latter uses multi-label ridge regression classifiers. Moreover, to further reduce redundant information and extract useful information, two compact databases are created for the R3P-Loc predictor.

7.1 Random projection

In machine learning, high-dimensional patterns are often mapped to a lower-dimensional subspace to avoid the curse of dimensionality [127]. Reducing the dimension of the input patterns can remove redundant or irrelevant information and allow for more reliable classification in the subspace. Indeed, dimensionality reduction is imperative in various domains, such as text categorization [69], image retrieval [197], and gene expression microarray data analysis [80].

In the past three decades, random projection (RP) has emerged as a powerful method for dimension reduction. By using RP, high-dimensional feature vectors are transformed into much lower-dimensional vectors which preserve the original geometrical structure and contain less redundant, irrelevant, or even detrimental information that might deteriorate classification performance. RP turns out to be a computationally efficient yet sufficiently accurate method for dimensionality reduction of many high-dimensional datasets [16]. RP is particularly useful for sparse input data in high dimensions, as the original data can be reconstructed almost perfectly from data in the lower-dimensional projected space [27]. RP has been widely used in various applications, such as preprocessing text data [161], indexing audio documents [114], processing images [16], and learning high-dimensional Gaussian mixture models [59]. Recently, dynamic random projection [224, 235] has been successfully applied in biometric template protection and privacy-preserving verification.

As stated in Chapter 2, conventional methods for subcellular localization prediction can be roughly divided into sequence-based and knowledge-based methods. However, no matter whether sequence-based or annotation-based features are used, a predominant scenario is that the dimension of the available features is much larger than the number of training samples. For example, Lee et al. [116] used amino-acid sequence-based features whose dimension (11,992) is remarkably larger than the number of proteins (3,017); Xiao et al. [233] used gene ontology (GO) information as the features, and the dimension of the feature vectors was 11,118 while the number of proteins was only 207. It is highly likely that such high-dimensional features contain redundant or irrelevant information, causing overfitting and worsening the prediction performance.

The key idea of RP arises from the Johnson–Lindenstrauss lemma [104]:

Lemma 1 (Johnson and Lindenstrauss [104]). Given ɛ > 0, a set 𝒳 of N points in ℝ^T, and a positive integer d such that d ≥ d0 = O(log N/ɛ²), there exists f : ℝ^T → ℝ^d such that

(1 − ɛ)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ɛ)‖u − v‖²

for all u, v ∈ 𝒳.

The lemma suggests that if points in a high-dimensional space are projected onto a randomly selected subspace of suitable dimension, the distances between the points are approximately preserved. A proof can be found in [71].
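The lemma is easy to verify numerically. The NumPy sketch below (sizes and variable names are illustrative, not taken from this chapter) projects a few high-dimensional points with a Gaussian random matrix and measures the worst-case distortion of the pairwise squared distances:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d, N = 5000, 1000, 20          # original dim, projected dim, number of points
X = rng.standard_normal((N, T))   # N points in R^T

# Gaussian random projection: i.i.d. N(0, 1) entries, scaled by 1/sqrt(d)
R = rng.standard_normal((d, T))
Y = (X @ R.T) / np.sqrt(d)        # N points in R^d

def pairwise_sq_dists(A):
    """All pairwise squared Euclidean distances between rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

orig = pairwise_sq_dists(X)
proj = pairwise_sq_dists(Y)
mask = ~np.eye(N, dtype=bool)     # ignore the zero diagonal
distortion = np.abs(proj[mask] / orig[mask] - 1.0)
print(f"worst-case distortion: {distortion.max():.3f}")
```

With these sizes the worst-case distortion is typically on the order of 0.1, i.e. all pairwise distances survive a fivefold dimensionality reduction almost unchanged.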

Specifically, the original T-dimensional data is projected onto a d-dimensional (d ≪ T) subspace, using a d × T random matrix R whose columns are of unit length. In our case, for the i-th protein, the GO vector qi (equation (4.3) in Section 4.1.3 of Chapter 4) can be projected as

q̂i = (1/√d) R qi,     (7.1)

where 1/√d is a scaling factor, q̂i is the projected vector after RP, and R is a random d × T matrix.

The choice of the random matrix R has been extensively studied. Practically, as long as the elements rh,j of R conform to any distribution with zero mean and unit variance, R will give a mapping that satisfies the Johnson–Lindenstrauss lemma [16]. For computational simplicity, we adopted a simple distribution proposed by Achlioptas [2] for the elements rh,j as follows:

rh,j = √3 × { +1, with probability 1/6; 0, with probability 2/3; −1, with probability 1/6.     (7.2)

It is easy to verify that equation (7.2) conforms to a distribution with zero mean and unit variance [2].
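For concreteness, the following sketch samples such a matrix, assuming the √3-scaled three-valued distribution of Achlioptas (+1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6), checks its moments empirically, and applies the projection of equation (7.1); all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def achlioptas_matrix(d, T, rng):
    """Sparse RP matrix: entries are sqrt(3) * {+1, 0, -1} drawn with
    probabilities 1/6, 2/3, 1/6, giving zero mean and unit variance."""
    vals = rng.choice([1.0, 0.0, -1.0], size=(d, T), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0) * vals

d, T = 500, 4000
R = achlioptas_matrix(d, T, rng)

# Empirical check of zero mean and unit variance over the 2,000,000 entries
print(f"mean = {R.mean():.4f}, var = {R.var():.4f}")

# Project a GO frequency vector q as in equation (7.1)
q = rng.random(T)
q_hat = R @ q / np.sqrt(d)
print(q_hat.shape)  # (500,)
```

Two thirds of the entries are zero, so in a real implementation the product R @ q can exploit sparse matrix storage for a further threefold speed-up.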

7.2 RP-SVM: A multi-label classifier with ensemble random projection

This section proposes an ensemble SVM classifier based on random projection (RP), namely RP-SVM, for predicting the subcellular localization of multi-label proteins. By using RP, the high-dimensional feature vectors are transformed into much lower-dimensional vectors that contain less redundant, irrelevant, or even detrimental information which might worsen classification performance. To make the classifiers more robust, random projection of the feature vectors is performed a number of times, each with a different projection matrix. The resulting projected vectors are then presented to an ensemble of one-vs-rest multi-label SVM classifiers.

Figure 7.1 shows the flowchart of using ensemble random projection as a dimensionality-reduction method. As can be seen, after the GO vector construction elaborated in Section 4.1, ensemble random projection is applied to convert the original high-dimensional GO vectors into low-dimensional projected vectors. The low-dimensional projected vectors are subsequently used for multi-label SVM classification.


Fig. 7.1: Flowchart of using ensemble random projection as a dimensionality-reduction method.

7.2.1 Ensemble multi-label classifier

The projected GO vectors obtained from equation (7.1) are used for training multi-label one-vs-rest SVMs. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary SVMs are trained, one for each class. Denote the GO vector created by using the true AC of the i-th query protein as qi,0 and the GO vector created by using the accession number of the k-th homolog as qi,k, k ∈ {1,..., kmax}, where kmax is the number of homologs retrieved by BLAST with the default parameter setting. By equation (7.1), we obtained the corresponding projected vectors q̂i,0 and q̂i,k, respectively. Then, given the i-th query protein Qi, the score of the m-th SVM is

sm(Qi) = Σ_{r∈Sm} αm,r ym,r K(q̂r, q̂i),     (7.3)

where

q̂i = q̂i,0 if the true AC of the i-th protein is known; otherwise q̂i = q̂i,k, where k is the smallest index in {1,...,kmax} whose homologous AC yields a valid GO vector,     (7.4)

and Sm is the set of support vector indexes corresponding to the m-th SVM, ym,r ∈ {−1,+1} are the class labels, αm,r are the Lagrange multipliers, and K(⋅,⋅) is a kernel function; here, the linear kernel is used. Note that q̂r in equation (7.3) represents the projected GO training vectors, which may include the projected GO vectors created by using the true AC of the training sequences or their homologous ACs.

Since R is a random matrix, the scores in equation (7.3) for each application of RP will be different. To construct a robust classifier, we fused the scores for several applications of RP and obtained an ensemble classifier, whose ensemble score of the m-th SVM for the i-th query protein is given as follows:

s̄m(Qi) = Σ_{l=1}^{L} wl sm^(l)(Qi),  where Σ_{l=1}^{L} wl = 1,     (7.5)

where sm^(l)(Qi) represents the score of the m-th SVM for the i-th protein via the l-th application of RP, L is the total number of applications of RP, and wl is the weight. For simplicity, here we set wl = 1/L, l = 1,...,L. We refer to L as “ensemble size” in the sequel. Unless stated otherwise, the ensemble size was set to 10 in our experiments, i.e. L = 10. Note that instead of mapping the original data into an Ld-dim vector, the ensemble RP projects it into L d-dim vectors.
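The score fusion of equation (7.5) can be sketched as follows. Trained SVMs are not available here, so random weight matrices stand in for the M one-vs-rest linear scoring functions; the L random matrices and the uniform weights wl = 1/L follow the text:

```python
import numpy as np

rng = np.random.default_rng(2)

L, d, T, M = 10, 100, 2000, 5     # ensemble size, projected dim, GO dim, classes

# One random projection matrix per ensemble member (Gaussian for simplicity)
R_list = [rng.standard_normal((d, T)) for _ in range(L)]

# Stand-ins for the L trained one-vs-rest linear SVMs (one M x d matrix each);
# in the real system these weights come from training on projected GO vectors
W_list = [rng.standard_normal((M, d)) for _ in range(L)]

def ensemble_scores(q, R_list, W_list, weights=None):
    """Fuse the per-member class scores: s_bar_m = sum_l w_l * s_m^(l)."""
    n = len(R_list)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
    d = R_list[0].shape[0]
    member_scores = [W @ (R @ q / np.sqrt(d)) for R, W in zip(R_list, W_list)]
    return sum(wl * s for wl, s in zip(w, member_scores))

q = rng.random(T)                 # a GO frequency vector
s_bar = ensemble_scores(q, R_list, W_list)
print(s_bar.shape)  # (5,)
```

With uniform weights the fusion is simply the mean of the L per-member score vectors, which is what makes the ensemble robust to any single unlucky draw of R.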

7.2.2 Multi-label classification

To predict the subcellular locations of datasets containing both single-label and multi-label proteins, a decision scheme for multi-label SVM classifiers should be used. Unlike the single-label problem, where each protein has only one predicted label, a multi-label protein should have more than one predicted label. In this work, we evaluated two decision schemes. The first decision scheme is the same as that used in mGOASVM (Section 5.3 in Chapter 5). In this scheme, the predicted subcellular location(s) of the i-th query protein Qi are given by

M(Qi) = {m : s̄m(Qi) > 0, 1 ≤ m ≤ M} if ∃ m such that s̄m(Qi) > 0; otherwise, M(Qi) = arg max_{1 ≤ m ≤ M} s̄m(Qi).     (7.6)
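A minimal implementation of this first decision scheme, operating on a vector of ensemble scores (the scores below are purely illustrative), could look like:

```python
import numpy as np

def predict_locations(s_bar):
    """Return every class with a positive ensemble score; if no score is
    positive, fall back to the single top-scoring class."""
    s_bar = np.asarray(s_bar)
    positive = np.flatnonzero(s_bar > 0)
    return positive.tolist() if positive.size else [int(np.argmax(s_bar))]

print(predict_locations([0.7, -0.2, 0.1]))    # [0, 2] -> multi-label prediction
print(predict_locations([-0.9, -0.1, -0.4]))  # [1]    -> single best class
```

The fallback branch guarantees that every query protein receives at least one predicted subcellular location, even when all binary SVMs reject it.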


Fig. 7.2: Flowchart of RP-SVM/RP-AD-SVM. Qi: the i-th query protein; S: protein sequence; AC: protein accession number; RP: random projection; SVM: SVM scoring (equation (7.3)); ensemble RP: ensemble random projection; w1, wl, and wL: the first, l-th, and L-th weights in equation (7.5); s̄m(Qi): the ensemble score in equation (7.5); SCLs: subcellular location(s).

The second decision scheme is an adaptive scheme that improves upon the first one. This decision scheme is the same as that used in AD-SVM (Section 5.4 in Chapter 5). In this scheme, the predicted subcellular location(s) of the i-th query protein are given by

If s̄max(Qi) > 0,

M(Qi) = {m : s̄m(Qi) ≥ f(s̄max(Qi)), 1 ≤ m ≤ M};     (7.7)

otherwise

M(Qi) = arg max_{1 ≤ m ≤ M} s̄m(Qi).     (7.8)

In equation (7.7), f(s̄max(Qi)) is a function of s̄max(Qi), where s̄max(Qi) = max_{1 ≤ m ≤ M} s̄m(Qi). In this work, we used a linear function as follows:

f(s̄max(Qi)) = θ ⋅ s̄max(Qi),     (7.9)

where θ ∈ [0.0,1.0] is a parameter which was optimized to achieve the best performance.
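The adaptive scheme above can be sketched as follows. Whether the threshold comparison is strict or non-strict follows Section 5.4, so the non-strict ≥ used here is an assumption, as are the example scores:

```python
import numpy as np

def predict_adaptive(s_bar, theta=0.5):
    """Predict all classes scoring at least theta * s_max when s_max > 0;
    otherwise fall back to the single top-scoring class."""
    s_bar = np.asarray(s_bar)
    s_max = s_bar.max()
    if s_max > 0:
        return np.flatnonzero(s_bar >= theta * s_max).tolist()
    return [int(np.argmax(s_bar))]

print(predict_adaptive([1.0, 0.6, 0.1], theta=0.5))  # [0, 1]
print(predict_adaptive([-0.3, -0.8], theta=0.5))     # [0]
```

Larger θ makes the predictor more conservative: with θ = 1 only classes tied with the top score are predicted, while small θ admits every class whose score is a modest fraction of the maximum.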

For ease of comparison, we refer to the proposed ensemble classifier with the first and the second decision scheme as RP-SVM and RP-AD-SVM, respectively. Figure 7.2 illustrates the whole prediction process for RP-SVM and RP-AD-SVM. If we use the first decision scheme for multi-label classification, the diagram represents RP-SVM; if we use the second decision scheme, it represents RP-AD-SVM.

7.3 R3P-Loc: A compact predictor based on ridge regression and ensemble random projection

So far, we have presented many GO-based predictors which extract GO information from a knowledge database, i.e. the Gene Ontology Annotation (GOA) Database. However, the predominant scenarios of GO-based methods are that (1) the GOA Database is enormous and is growing exponentially, (2) the GOA Database contains redundant information, and (3) the number of features extracted from the GOA Database is much larger than the number of data samples with ground-truth labels. These properties make the extracted features prone to containing redundant or irrelevant information, causing the prediction systems to suffer from overfitting. To address these problems, this section proposes an efficient multi-label predictor, R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases have been created from the Swiss-Prot and GOA Databases. These databases possess almost the same amount of information as their full-size counterparts but are much smaller in size. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the feature dimensions sevenfold and significantly outperforms state-of-the-art predictors. It was also found that the compact databases reduce memory consumption by a factor of 39 without causing a decrease in prediction accuracy.

This section is organized as follows. First, the limitation of using the knowledge databases is introduced. Then, the procedure of creating the two compact databases (i.e., ProSeq and ProSeq-GO) is specified. Subsequently, multi-label ridge regression classifiers equipped with random projection are presented.

7.3.1 Limitation of using current databases

Recently, several state-of-the-art multi-label predictors have been proposed, e.g. Plant-mPLoc [51], Euk-mPLoc 2.0 [49], iLoc-Plant [230], iLoc-Euk [52], mGOASVM [213], HybridGO-Loc [217], and other predictors [88, 120, 218]. They all use the GO information as the features and apply different multi-label classifiers to tackle the multi-label classification problem. However, these GO-based methods are not without their disadvantages. Currently the predominant scenarios of GO-based methods are as follows:

  1. The Gene Ontology Annotation (GOA) Database, from which these GO-based predictors extract the GO information for classification, is already enormous in size and is also growing rapidly. For example, in October 2005 the GOA Database contained 7,782,748 entries for protein annotations; in March 2011 it contained 82,632,215 entries; and in July 2013 the number of entries increased to 169,603,862, which means that in less than eight years the number of annotations in the GOA Database increased 28-fold. Even after compressing the GOA Database released in July 2013 by removing the repeated pairings of accession numbers (ACs) and GO terms, the number of distinct AC–GO term pairs is still 25,441,543. Searching a database of such an enormous and rapidly growing size is computationally expensive, which makes large-scale subcellular localization by GO-based methods inefficient and even intractable.
  2. The GOA Database contains many redundant AC entries which will never be used by typical GO-based methods. This is because given a query protein, GO-based methods search for homologous ACs from Swiss-Prot and use these ACs as keys to search against the GOA Database for retrieving relevant GO terms. Therefore, those ACs in the GOA Database which do not appear in Swiss-Prot are redundant. More than 90% of the ACs in the GOA Database are in this category. This calls for a more compact GO-term database which excludes these redundant entries.
  3. The number of extracted GO features from the GOA Database is much larger than the number of proteins relevant to the prediction task. For example, Xiao et al. [233] extracted GO information for 207 proteins from the GOA Database; the resulting feature vectors had 11,118 dimensions, which suggests that the number of features is more than 50 times the number of proteins. It is likely that among the large number of features, many of them contain redundant or irrelevant information, causing the prediction systems to suffer from overfitting and thus reducing the prediction performance.

To tackle the problems mentioned above, this section proposes an efficient and compact multi-label predictor, R3P-Loc, which uses ridge regression and random projection for predicting the subcellular localization of both single-label and multi-label proteins. Instead of using the Swiss-Prot and GOA Databases, R3P-Loc uses two newly created compact databases, namely ProSeq and ProSeq-GO, for GO information transfer. The ProSeq Database is a sequence database in which each amino acid sequence has at least one GO term annotated to it. The ProSeq-GO Database comprises the GO terms annotated to the protein sequences in the ProSeq Database. An important property of the ProSeq and ProSeq-GO Databases is that they are much smaller than the Swiss-Prot and GOA Databases, respectively.

Given a query protein, a set of GO terms is retrieved by searching against the ProSeq-GO Database using the accession numbers of homologous proteins as the search keys, where the homologous proteins are obtained from BLAST searches using ProSeq as the sequence database. The frequencies of GO occurrences are used to formulate frequency vectors, which are projected onto a much lower-dimensional space by random matrices whose elements conform to a distribution with zero mean and unit variance. Subsequently, the dimension-reduced feature vectors are classified by a multi-label ridge regression classifier.

7.3.2 Creating compact databases

Typically, for a query protein an efficient predictor should be able to deal with two possible cases: (1) the accession number (AC) is known, and (2) only the amino acid sequence is known. For proteins with known ACs, their respective GO terms are retrieved from a database containing GO terms (i.e. the GOA Database) using the ACs as the search keys. For a protein without an AC, its amino acid sequence is presented to BLAST [5] to find its homologs against a database containing protein amino acid sequences (i.e. Swiss-Prot), whose ACs are then used as keys to search against the GO-term database.

While the GOA Database allows us to associate the AC of a protein with a set of GO terms, for some novel proteins, neither their ACs nor the ACs of their top homologs have any entries in the GOA Database; in other words, no GO terms can be retrieved by their ACs or the ACs of their top homologs. In such cases, some predictors use back-up methods which rely on other features, such as pseudo amino acid composition [35] and sorting signals [153]; some predictors [213, 215] use a successive-search strategy to avoid null GO vectors. However, these strategies may lead to poor performance and increase computation and storage complexity.

To address this problem, we created two small but efficient databases: ProSeq and ProSeq-GO. The former is a sequence database, and the latter is a GO-term database. The procedures for creating these databases are shown in Figure 7.3. The procedure extracts accession numbers from two different sources: Swiss-Prot and the GOA Database. Specifically, all of the ACs in the Swiss-Prot Database and the valid ACs in the GOA Database are extracted. Here an AC is considered valid if it has at least one GO term annotated to it. Then the common ACs which appear in both sets are selected (the intersection symbol in Figure 7.3). These ACs are regarded as “valid Swiss-Prot ACs”; each of them corresponds to at least one GO term in the GOA Database. Next, using these valid ACs, their corresponding amino acid sequences can be retrieved from the Swiss-Prot Database, constituting a new sequence database, which we call the “ProSeq Database”; similarly, using these valid ACs, their corresponding GO terms can be retrieved from the GOA Database, constituting a new GO-term database, which we call the “ProSeq-GO Database”. In this work, we created the ProSeq and ProSeq-GO Databases from the Swiss-Prot and GOA Databases released in July 2013. The ProSeq-GO Database has 513,513 entries while the GOA Database has 25,441,543 entries; the ProSeq Database has 513,513 protein sequences while the Swiss-Prot Database has 540,732 protein sequences.
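The selection of valid ACs above is essentially a set intersection followed by two dictionary look-ups. A toy sketch with made-up accession numbers and truncated sequences (purely illustrative):

```python
# A minimal sketch of the ProSeq / ProSeq-GO construction in Figure 7.3,
# assuming the two source databases are given as plain Python dicts.
swissprot = {                      # AC -> amino-acid sequence
    "P12345": "MKLV...",
    "Q67890": "MTEY...",
    "A11111": "MGSS...",
}
goa = {                            # AC -> set of GO terms
    "P12345": {"GO:0005737", "GO:0005634"},
    "Q67890": {"GO:0016020"},
    "B22222": {"GO:0005739"},      # not in Swiss-Prot -> redundant entry
}

# Valid Swiss-Prot ACs: present in both sources with at least one GO term
valid_acs = set(swissprot) & {ac for ac, terms in goa.items() if terms}

proseq = {ac: swissprot[ac] for ac in valid_acs}   # compact sequence DB
proseq_go = {ac: goa[ac] for ac in valid_acs}      # compact GO-term DB

print(sorted(valid_acs))  # ['P12345', 'Q67890']
```

Entries such as "A11111" (no GO annotation) and "B22222" (absent from Swiss-Prot) are exactly the redundant records that the compact databases discard.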


Fig. 7.3: Procedures of creating compact databases (ProSeq and ProSeq-GO). AC: accession numbers; GO: gene ontology; GOA Database: Gene Ontology Annotation Database.

Similar to mGOASVM (Section 5.3 in Chapter 5), R3P-Loc uses GO frequencies as the feature information, and the feature extraction part is the same as mGOASVM. After feature extraction, similar to RP-SVM, random projection is applied to the GO vectors (equation (7.1) in Section 7.2). Then, an ensemble multi-label ridge regression classifier is proposed for classification of the dimension-reduced vectors.

7.3.3 Single-label ridge regression

Ridge regression (RR) is a simple yet effective linear regression model which has been applied in many domains [85, 136, 163]. Here we apply RR to classification. Suppose that, for a two-class single-label problem, we are given a set of training data {xi, yi}_{i=1}^{N}, where xi ∈ ℝ^{T+1} and yi ∈ {−1,1}. In our case, xi = [1, q̂iᵀ]ᵀ, where q̂i is defined in equation (7.1). Generally speaking, an RR model imposes an L2-style regularization on ordinary least squares (OLS), namely minimizing the empirical loss l(β) as

l(β) = Σ_{i=1}^{N} (yi − Σ_{j=1}^{T+1} βj xi,j)²     (7.10)

subject to

Σ_{j=1}^{T+1} βj² ≤ s,

where s > 0, xi,j is the j-th element of xi, and β = [β1,...,βj,...,βT+1]ᵀ is the ridge vector to be optimized. Equation (7.10) is equivalent to minimizing the following equation:

l(β) = Σ_{i=1}^{N} (yi − Σ_{j=1}^{T+1} βj xi,j)² + λ Σ_{j=1}^{T+1} βj²,     (7.11)

where λ > 0 is a penalty parameter that controls the degree of regularization. After optimization, β is given by

β = (XᵀX + λI)⁻¹ Xᵀy,     (7.12)

where X = [x1,...,xi,...,xN]T, y = [y1,...,yi,...,yN]T, and I is a (T + 1) × (T + 1) identity matrix.
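The closed-form solution of equation (7.12) is a one-liner in NumPy. The sketch below uses illustrative sizes and random data, and solves the linear system rather than forming the matrix inverse explicitly, which is cheaper and numerically safer:

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam*I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy two-class problem on projected vectors augmented with a bias term
N, d = 40, 8
Q_hat = rng.standard_normal((N, d))            # projected GO vectors
X = np.hstack([np.ones((N, 1)), Q_hat])        # x_i = [1, q_hat_i^T]^T
y = np.where(rng.random(N) > 0.5, 1.0, -1.0)   # labels in {-1, +1}

beta = ridge_fit(X, y, lam=1.0)
print(beta.shape)  # (9,)
```

Note that the system being solved is only (d + 1) × (d + 1), which is precisely why projecting the GO vectors down to d dimensions first makes RR training so cheap.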


Fig. 7.4: Flowchart of R3P-Loc. Qi: the i-th query protein; S: protein sequence; AC: protein accession number; ProSeq/ProSeq-GO: the proposed compact sequence and GO databases, respectively; RP: random projection; RR: ridge regression scoring (equation (7.14)); Ensemble R3P: ensemble ridge regression and random projection; w1, wl, and wL: the first, l-th, and L-th weights in equation (7.15); s̄m(Qi): the ensemble score in equation (7.15); SCLs: subcellular location(s).

7.3.4 Multi-label ridge regression

In an M-class multi-label problem, the training data set is written as {xi, ℒi}_{i=1}^{N}, where xi ∈ ℝ^{T+1} and ℒi ⊂ {1,2,...,M} is a set which may contain one or more labels. M independent binary one-vs-rest RRs are trained, one for each class. The labels ℒi are converted to transformed labels (similar to equations (5.2)–(5.3)) yi,m ∈ {−1,1}, where i = 1,...,N and m = 1,...,M. Then, equation (7.12) is extended to

βm = (XᵀX + λI)⁻¹ Xᵀym,     (7.13)

where m = 1,...,M, and ym are vectors whose elements are yi,m, i = 1,...,N.
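Because the matrix XᵀX + λI is shared by all M one-vs-rest problems, the M ridge vectors above can be obtained in a single solve by stacking the transformed label vectors as columns. A toy sketch with random data and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(4)

def ridge_fit_multilabel(X, Y, lam):
    """One ridge vector per class: solve (X^T X + lam*I) B = X^T Y, where
    column m of Y holds the transformed labels y_{i,m} in {-1, +1}."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ Y)         # columns are beta_1 ... beta_M

N, d, M = 60, 10, 4
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
Y = np.where(rng.random((N, M)) > 0.5, 1.0, -1.0)

B = ridge_fit_multilabel(X, Y, lam=1.0)        # (d+1) x M matrix
scores = X[0] @ B                              # class scores s_m = beta_m^T x_i
print(B.shape, scores.shape)  # (11, 4) (4,)
```

Sharing the factorization across classes means the multi-label extension costs barely more than a single binary RR, another reason RR suits the ensemble setting where L such classifiers must be trained.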

The projected GO vectors obtained from equation (7.1) are used for training multi-label one-vs-rest ridge regression (RR) classifiers. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary RRs are trained, one for each class. Then, given the i-th query protein Qi, the score of the m-th RR is

sm(Qi) = βmᵀ xi,  where xi = [1, q̂iᵀ]ᵀ.     (7.14)

Since R is a random matrix, the scores in equation (7.14) for each application of RP will be different. To construct a robust classifier, we fused the scores of several applications of RP and obtained an ensemble classifier whose ensemble score of the m-th RR for the i-th query protein is given as follows:

s̄m(Qi) = Σ_{l=1}^{L} wl sm^(l)(Qi),     (7.15)

where Σ_{l=1}^{L} wl = 1, sm^(l)(Qi) represents the score of the m-th RR for the i-th protein via the l-th application of RP, L is the total number of applications of RP, and wl is the weight. For simplicity, here we set wl = 1/L, l = 1,...,L. We refer to L as “ensemble size” in the sequel. Unless stated otherwise, the ensemble size was set to 10 in our experiments, i.e. L = 10. Note that instead of mapping the original data into an Ld-dim vector, the ensemble RP projects it into L d-dim vectors.

To predict the subcellular locations of datasets containing both single- and multi-label proteins, a decision scheme for multi-label RR classifiers should be used. Unlike the single-label problem, where each protein has only one predicted label, a multi-label protein should have more than one predicted label. Here we used the decision scheme described in mGOASVM (Section 5.3 in Chapter 5). In this scheme, the predicted subcellular location(s) of the i-th query protein are given by

M(Qi) = {m : s̄m(Qi) > 0, 1 ≤ m ≤ M} if ∃ m such that s̄m(Qi) > 0; otherwise, M(Qi) = arg max_{1 ≤ m ≤ M} s̄m(Qi).     (7.16)

For ease of comparison, we refer to the proposed ensemble classifier with this multi-label decision scheme as R3P-Loc. The flowchart of R3P-Loc is shown in Figure 7.4.

7.4 Summary

This chapter focuses on applying ensemble random projection to reduce the dimensions of GO vectors and to boost the performance of GO-based predictors. Two RP-based predictors are presented, RP-SVM and R3P-Loc: the former is built on the multi-label SVM classifier, and the latter on the multi-label RR classifier. In particular, for R3P-Loc, to further reduce redundant information in the knowledge databases, two compact databases, ProSeq and ProSeq-GO, were created for fast and efficient GO information extraction.
