4 Single-location protein subcellular localization

According to a recent comprehensive review [37], the establishment of a statistical protein predictor involves the following five steps: (i) construction of a valid dataset for training and testing the predictor; (ii) formulation of effective mathematical expressions for converting proteins’ characteristics to feature vectors relevant to the prediction task; (iii) development of classification algorithms for discriminating the feature vectors; (iv) evaluation of cross-validation tests for measuring the performance of the predictor; and (v) deployment of a user-friendly publicly accessible web-server for other researchers to use and validate the prediction method. These steps are further elaborated in this chapter.

This chapter will focus on predicting single-location protein subcellular localization. Single-location proteins refer to those proteins that are located in one subcellular compartment. It is well known that most proteins stay only at one subcellular location [97]. Therefore, predicting the subcellular localization of single-label proteins is of great significance. In this chapter, two GO-based predictors, GOASVM and FusionSVM, will be presented.

4.1 Extracting GO from the Gene Ontology Annotation Database

The GOASVM predictor uses either accession numbers (ACs) or amino acid (AA) sequences as input. The prediction process is divided into two stages: feature extraction (vectorization) and pattern classification. In the first stage, the query proteins are “vectorized” into high-dimensional GO vectors. In the second stage, the GO vectors are classified by one-vs-rest linear support vector machines (SVMs).

4.1.1 Gene Ontology Annotation Database

GOASVM extracts the GO information from the Gene Ontology Annotation (GOA) Database.10 The database uses standardized GO vocabularies to systematically annotate nonredundant proteins of many species in the UniProt Knowledgebase (UniProtKB) [6], which comprises Swiss-Prot [20], TrEMBL [20], and PIR-PSD [226]. The large-scale assignment of GO terms to UniProtKB entries (or ACs) was done by converting a portion of the existing knowledge held within the UniProtKB Database into GO terms [26]. The GOA Database also includes a series of cross-references to other databases. For example, the majority of UniProtKB entries contain cross-references to InterPro identification numbers in the InterPro Database maintained by the European Bioinformatics Institute (EBI) [148]. The GO-term assignments are released monthly, in accordance with a format standardized by the GO Consortium. As a result of the Gene Ontology (GO)11 Consortium annotation effort, the GOA Database has become a large and comprehensive resource for proteomics research [26].


Fig. 4.1: A screenshot of the query result of the GO Consortium website (http://www.geneontology.org) using the GO term GO:0000187 as the search key.

Because the proteins in the GOA Database have been systematically annotated by GO terms, it is possible to exploit the relationship between the accession numbers of proteins and GO terms for subcellular localization. Specifically, given the accession number of a protein, a set of GO terms can be retrieved from the GOA Database file.12 In UniProtKB, each protein has a unique accession number (AC), and in the GOA Database each AC may be associated with zero, one, or more GO terms. Conversely, one GO term may be associated with zero, one, or many different ACs. This means that the mappings between ACs and GO terms are many-to-many.

Figure 4.1 shows the query result of the GO Consortium official website using the GO term GO:0000187 as the search key. As can be seen, all of the related annotations of GO:0000187, including its GO accession, name, ontology, synonyms, and definition, are presented. Specifically, the term GO:0000187 belongs to the ontology “biological process”, and it is defined as initiating the activity of the inactive enzyme MAP kinase. Figure 4.2a shows the query result (under protein annotation) of the GOA webserver using the GO term GO:0000187 as the search key, and Figure 4.2b displays the query results using the accession number A0M8T9 as the search key. As can be seen, the same GO term, GO:0000187, can be associated with the UniProtKB ACs A0M8T9, A0MLS4, A0MNP6, etc. Likewise, the same UniProtKB AC, A0M8T9, can be associated with GO:0000187, GO:0001889, GO:0001890, etc. These two examples illustrate that the mappings between ACs and GO terms are many-to-many, which enables us to make full use of them for classifying proteins. These figures also show that GO annotations have different degrees of reliability, as indicated by their “evidence codes”.13 An evidence code reflects the information source from which the annotation was produced. The codes include EXP (Inferred from Experiment), IEA (Inferred from Electronic Annotation), ISS (Inferred from Sequence or Structural Similarity), IMP (Inferred from Mutant Phenotype), IDA (Inferred from Direct Assay), IPI (Inferred from Physical Interaction), etc.
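The many-to-many relationship can be made concrete with a short sketch. The tab-separated records below imitate the layout of a GO annotation file, assuming that the second column holds the accession number, the fifth the GO term, and the seventh the evidence code. The records are toy data assembled from the AC and GO identifiers mentioned above; the evidence codes attached to them are placeholders, not real annotations, and `build_mappings` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy records in a GAF-like tab-separated layout (assumed column order:
# column 2 = accession number, column 5 = GO term, column 7 = evidence code).
# The evidence codes are placeholders, not real annotations.
TOY_GOA = "\n".join([
    "!comment lines start with an exclamation mark",
    "UniProtKB\tA0M8T9\t-\t-\tGO:0000187\t-\tIEA",
    "UniProtKB\tA0M8T9\t-\t-\tGO:0001889\t-\tIEA",
    "UniProtKB\tA0M8T9\t-\t-\tGO:0001890\t-\tIEA",
    "UniProtKB\tA0MLS4\t-\t-\tGO:0000187\t-\tIEA",
    "UniProtKB\tA0MNP6\t-\t-\tGO:0000187\t-\tIEA",
])


def build_mappings(text):
    """Return (AC -> list of GO terms, GO term -> set of ACs)."""
    ac_to_go = defaultdict(list)  # a list keeps repeated annotations
    go_to_ac = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        ac, go_term = fields[1], fields[4]
        ac_to_go[ac].append(go_term)
        go_to_ac[go_term].add(ac)
    return ac_to_go, go_to_ac


ac_to_go, go_to_ac = build_mappings(TOY_GOA)
print(ac_to_go["A0M8T9"])             # one AC, several GO terms
print(sorted(go_to_ac["GO:0000187"]))  # one GO term, several ACs
```

Keeping a list (rather than a set) on the AC side preserves repeated annotations, which matters later if term frequencies are used as features.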


Fig. 4.2: Screenshots of the query results of the GOA webserver (http://www.ebi.ac.uk/GOA) using (a) a GO term GO:0000187 and (b) an accession number (A0M8T9) as the search key.

4.1.2 Retrieval of GO terms

Given a query protein, GOASVM can handle two possible cases: (1) the protein accession number (AC) is known, and (2) the AA sequence is known. For proteins with known ACs, their respective GO terms are retrieved from the GOA Database using the ACs as the searching keys. For a protein without an AC, its AA sequence is presented to BLAST [5] to find its homologs, whose ACs are then used as keys to search against the GOA Database. Figures 4.3 and 4.4 illustrate the prediction process of GOASVM using ACs and protein sequences as input, respectively.

While the GOA Database allows us to associate the AC of a protein with a set of GO terms, for some novel proteins neither their ACs nor the ACs of their top homologs have any entries in the GOA Database; in other words, the GO vectors constructed in Section 4.1.3 will be all-zero vectors, which are meaningless for classification. In such cases, the ACs of the homologous proteins returned by the BLAST search are used successively to search against the GOA Database until a match is found. Specifically, for a protein whose top homolog does not have any GO terms in the GOA Database, the second homolog is used to find the GO terms; if neither the top nor the second homolog has any GO terms, the third homolog is used; and so on, until every query protein corresponds to at least one GO term. With the rapid progress of the GOA Database [10], it is reasonable to assume that at least one homolog of each query protein has at least one GO term [141]. Thus, it is not necessary to use backup methods to handle the situation where no GO terms can be found. The procedures are outlined in Figure 4.5.
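The successive-search strategy can be sketched as a simple loop. This is a minimal illustration, not the GOASVM implementation: the function name `retrieve_go_terms`, the ranked homolog list, and all accession numbers and GO identifiers in the toy data are hypothetical, and the BLAST search itself is assumed to have been run already.

```python
def retrieve_go_terms(query_ac, homolog_acs, ac_to_go):
    """Return (k, terms) for the first candidate that has GO terms.

    query_ac    -- AC of the query protein, or None if only the sequence is known
    homolog_acs -- homolog ACs ranked from most to least similar (e.g. by BLAST)
    ac_to_go    -- mapping from an AC to the list of GO terms annotated to it

    k = 0 means the query's own AC was used; k >= 1 means the k-th homolog.
    (None, []) is returned when no candidate has any GO term.
    """
    candidates = [query_ac] + list(homolog_acs)
    for k, ac in enumerate(candidates):
        terms = ac_to_go.get(ac, []) if ac is not None else []
        if terms:
            return k, terms
    return None, []


# Hypothetical toy data: the top two homologs have no GO annotation.
toy_goa = {
    "HOMOLOG_3": ["GO:0000001", "GO:0000002"],
    "HOMOLOG_4": ["GO:0000003"],
}
ranked = ["HOMOLOG_1", "HOMOLOG_2", "HOMOLOG_3", "HOMOLOG_4"]
k, terms = retrieve_go_terms(None, ranked, toy_goa)
print(k, terms)
```

Returning the index of the homolog that was actually used makes it easy to check how far down the ranked list the search had to go.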


Fig. 4.3: Flowchart of GOASVM that uses protein accession numbers (AC) only as input.


Fig. 4.4: Flowchart of GOASVM that uses protein sequences only as input. AC: accession number.


Fig. 4.5: Procedures of retrieving GO terms. $\mathbb{Q}_i$: the i-th query protein; $k_{\max}$: the maximum number of homologs retrieved by BLAST with the default parameter setting; $\mathcal{G}_{i,k_i}$: the set of GO terms retrieved by BLAST using the $k_i$-th homolog for the i-th query protein $\mathbb{Q}_i$; $k_i$: the $k_i$-th homolog used to retrieve the GO terms.

4.1.3 Construction of GO vectors

According to equation (6) of [37], the characteristics of any protein can be represented by the general form of Chou’s pseudo amino acid composition [35,36]:

$$\mathbf{q}_i = [\phi_{i,1}, \ldots, \phi_{i,u}, \ldots, \phi_{i,W}]^{\mathsf{T}}, \tag{4.1}$$

where T is a transpose operator, W is the dimension of the feature vector $\mathbf{q}_i$, and the definitions of the W feature components $\phi_{i,u}$ (u = 1,...,W) depend on the feature extraction approaches elaborated below.

Given a dataset, we used the procedure described in Section 4.1.2 to retrieve the GO terms of all of its proteins. Let $\mathcal{W}$ denote the set of distinct GO terms corresponding to the dataset. $\mathcal{W}$ is constructed in two steps: (1) identifying all of the GO terms in the dataset, and (2) removing the repetitive GO terms. Suppose W distinct GO terms are found, i.e. $|\mathcal{W}| = W$; these GO terms form a GO Euclidean space with W dimensions. For each sequence in the dataset, a GO vector is constructed by matching its GO terms against $\mathcal{W}$, using the number of occurrences of individual GO terms in $\mathcal{W}$ as the coordinates. We have investigated four approaches to determining the elements of the GO vectors.14

  1. 1-0 value. In this approach, each of the W GO terms represents one canonical basis of a Euclidean space, and a protein is represented by a point in this space with coordinates equal to either 0 or 1. Specifically, the GO vector of the i-th protein $\mathbb{Q}_i$ is denoted as
    $$\mathbf{q}_i = [a_{i,1}, \ldots, a_{i,u}, \ldots, a_{i,W}]^{\mathsf{T}}, \quad a_{i,u} = \begin{cases} 1, & \text{GO hit} \\ 0, & \text{otherwise} \end{cases} \tag{4.2}$$
    where “GO hit” means that the u-th GO term appears in the GOA-search result using the AC of the i-th protein as the search key.
  2. Term frequency (TF). This approach is similar to the 1-0 value approach in that a protein is represented by a point in the W-dimensional Euclidean space. However, unlike the 1-0 approach, it uses the number of occurrences of individual GO terms as the coordinates. Specifically, the GO vector $\mathbf{q}_i$ of the i-th protein is defined as
    $$\mathbf{q}_i = [b_{i,1}, \ldots, b_{i,u}, \ldots, b_{i,W}]^{\mathsf{T}}, \quad b_{i,u} = \begin{cases} f_{i,u}, & \text{GO hit} \\ 0, & \text{otherwise} \end{cases} \tag{4.3}$$
    where $f_{i,u}$ is the number of occurrences of the u-th GO term (term frequency) in the i-th protein. The rationale is that the term frequencies may also contain important information for classification and therefore should not be quantized to either 0 or 1. Note that the $b_{i,u}$’s are analogous to the term frequencies commonly used in document retrieval.
  3. Inverse sequence frequency (ISF). In this approach, a protein is represented by a point with coordinates determined by the existence of GO terms and the inverse sequence frequency (ISF). Specifically, the GO vector $\mathbf{q}_i$ of the i-th protein is defined as
    $$\mathbf{q}_i = [c_{i,1}, \ldots, c_{i,u}, \ldots, c_{i,W}]^{\mathsf{T}}, \quad c_{i,u} = a_{i,u} \log\left(\frac{N}{|\{k : a_{k,u} \neq 0\}|}\right), \tag{4.4}$$
    where N is the number of protein sequences in the training dataset. The denominator inside the logarithm is the number of GO vectors (among all GO vectors in the dataset) having a nonzero entry in their u-th element, or equivalently the number of sequences with the u-th GO term as determined in Section 4.1.2. Note that the logarithmic term in equation (4.4) is analogous to the inverse document frequency commonly used in document retrieval. The idea is to emphasize (resp. to suppress) the GO terms that have a low (resp. high) frequency of occurrence in the protein sequences. The reason is that if a GO term occurs in every sequence, it is not very useful for classification.
  4. Term frequency-inverse sequence frequency (TF-ISF). This approach combines the term frequency (TF) and the inverse sequence frequency (ISF) mentioned above. Specifically, the GO vector $\mathbf{q}_i$ of the i-th protein is defined as
    $$\mathbf{q}_i = [d_{i,1}, \ldots, d_{i,u}, \ldots, d_{i,W}]^{\mathsf{T}}, \quad d_{i,u} = b_{i,u} \log\left(\frac{N}{|\{k : b_{k,u} \neq 0\}|}\right), \tag{4.5}$$
    where $b_{i,u}$ is defined in equation (4.3).

By correlating equations (4.2)–(4.5) with the general form of pseudo amino acid composition (equation (4.1)), we notice that W is the number of distinct GO terms of the given dataset, and the $\phi_{i,u}$’s in equation (4.1) correspond to $a_{i,u}$, $b_{i,u}$, $c_{i,u}$, and $d_{i,u}$ in equations (4.2)–(4.5), respectively.
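The four constructions can be compared side by side on a toy dataset. The sketch below is an illustration under stated assumptions: the function names and the GO identifiers are hypothetical, the natural logarithm is assumed (the base is not specified above), and each protein is given as the list of GO terms retrieved for it, with repeats kept.

```python
import math
from collections import Counter


def go_space(term_lists):
    """Sorted list of the distinct GO terms found in a dataset."""
    return sorted({t for terms in term_lists for t in terms})


def go_vectors(term_lists, method):
    """Build one GO vector per protein; method is '1-0', 'tf', 'isf' or 'tf-isf'."""
    space = go_space(term_lists)
    n = len(term_lists)
    counts = [Counter(terms) for terms in term_lists]
    # Number of proteins whose term list contains each GO term.
    seq_freq = {t: sum(1 for c in counts if c[t] > 0) for t in space}
    vectors = []
    for c in counts:
        vec = []
        for t in space:
            a = 1.0 if c[t] > 0 else 0.0      # 1-0 value
            b = float(c[t])                   # term frequency
            isf = math.log(n / seq_freq[t])   # inverse sequence frequency
            if method == "1-0":
                vec.append(a)
            elif method == "tf":
                vec.append(b)
            elif method == "isf":
                vec.append(a * isf)
            elif method == "tf-isf":
                vec.append(b * isf)
            else:
                raise ValueError("unknown method: " + method)
        vectors.append(vec)
    return space, vectors


# Hypothetical toy dataset of three proteins.
toy = [
    ["GO:0000001", "GO:0000001", "GO:0000002"],
    ["GO:0000002"],
    ["GO:0000002", "GO:0000003"],
]
for method in ("1-0", "tf", "isf", "tf-isf"):
    print(method, go_vectors(toy, method)[1])
```

In this toy dataset the second GO term occurs in every protein, so its ISF and TF-ISF coordinates are zero for all three proteins, which is the suppression effect described in item 3.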

4.1.4 Multiclass SVM classification

Support vector machines (SVMs) were originally proposed by Vapnik [206] to tackle binary classification problems. An SVM classifier maps a set of input patterns into a high-dimensional space and then finds the optimal separating hyperplane and the margin of separation in that space. The resulting hyperplane classifies the patterns into two categories while maximizing their distance from the hyperplane. To tackle multiclass problems, the one-vs-rest approach described below is typically used.

GO vectors are used for training one-vs-rest SVMs. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent SVMs are trained, one for each class. Denote the GO vector created by using the true AC of the i-th query protein as $\mathbf{q}_{i,0}$ and the GO vectors created by using the AC of the k-th homolog as $\mathbf{q}_{i,k}$, k = 1,...,n, where n is the number of homologs retrieved by BLAST with the default parameter setting. Then, given the i-th query protein $\mathbb{Q}_i$, the score of the m-th SVM is

$$s_m(\mathbb{Q}_i) = \sum_{r \in \mathcal{S}_m} \alpha_{m,r}\, y_{m,r}\, K(\mathbf{p}_r, \mathbf{q}_{i,h}) + b_m, \tag{4.6}$$

where

$$h = \min\{k \in \{0, \ldots, n\} : \|\mathbf{q}_{i,k}\|_0 \neq 0\}, \tag{4.7}$$

and $\mathcal{S}_m$ is the set of support vector indexes corresponding to the m-th SVM, $y_{m,r} = 1$ when $\mathbf{p}_r$ belongs to class m and $y_{m,r} = -1$ otherwise, $\alpha_{m,r}$ are the Lagrange multipliers, $b_m$ is the bias of the m-th SVM, and K(·, ·) is a kernel function. In this work, linear kernels were used, i.e. $K(\mathbf{p}_r, \mathbf{q}_{i,k}) = \langle \mathbf{p}_r, \mathbf{q}_{i,k} \rangle$. Note that $\mathbf{p}_r$ in equation (4.6) represents the GO training vectors, which may include the GO vectors created by using the true ACs of the training proteins or their homologous ACs. We have the following two cases:

  1. If the true ACs are available, $\mathbf{p}_r$ represents the GO training vectors created by using the true ACs only.
  2. If only the AA sequences are known, then only the ACs of the homologous sequences can be used for training the SVM and for scoring. In that case, $\mathbf{p}_r$ represents the GO training vectors created by using the homologous ACs only.
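The scoring step of equation (4.6) with a linear kernel can be sketched as follows. This is one reading of the procedure, not the authors' code: the GO vector actually scored is taken to be the first nonzero vector among those built from the true AC and the ranked homologs, and the support vectors, Lagrange multipliers, labels, and bias below are made-up numbers standing in for a trained SVM.

```python
def first_nonzero(vectors):
    """Index of the first vector with at least one nonzero element, else None."""
    for k, v in enumerate(vectors):
        if any(x != 0 for x in v):
            return k
    return None


def linear_svm_score(support_vectors, alphas, labels, bias, q):
    """Score = sum_r alpha_r * y_r * <p_r, q> + bias, using a linear kernel."""
    score = bias
    for p, alpha, y in zip(support_vectors, alphas, labels):
        inner = sum(p_j * q_j for p_j, q_j in zip(p, q))
        score += alpha * y * inner
    return score


# Candidate GO vectors of one query: index 0 = true AC, 1..n = homologs.
candidates = [[0, 0, 0], [0, 0, 0], [1, 0, 2]]
h = first_nonzero(candidates)

# Made-up parameters standing in for one trained SVM.
support_vectors = [[1, 0, 0], [0, 1, 1]]
alphas = [0.5, 0.25]
labels = [1, -1]
bias = 0.1

score = linear_svm_score(support_vectors, alphas, labels, bias, candidates[h])
print(h, score)
```

With a linear kernel the support vectors could equally be collapsed into a single weight vector beforehand; the explicit sum is kept here so that the code mirrors the form of the score.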

Then, the predicted class of the i-th query protein is given by

$$m^* = \operatorname*{arg\,max}_{m \in \{1, \ldots, M\}} s_m(\mathbb{Q}_i). \tag{4.8}$$

While a single SVM can only be a binary classifier, it is possible to create a multiclass SVM classifier by combining a number of SVMs using the one-vs-rest technique. Specifically, for an M-class problem, M independent SVMs are trained, one for each class. For the m-th SVM, the training vectors corresponding to the m-th class are assigned a positive label (+1) and the rest of the training vectors are assigned a negative label (−1).

Figure 4.6 illustrates how a multiclass SVM classifier uses the one-vs-rest technique to solve a four-class classification problem. As can be seen, there are four independent binary SVMs trained by the training GO vectors, one for each class. In the testing phase, suppose there are two query proteins, whose GO vectors are p1 and p2, respectively, as shown in Figure 4.6. The SVM scores for query protein I are (−0.5,0.1,−1.0,−2.3) and those for query protein II are (−0.8,0.9,1.3,0.5). The maximum SVM scores for the two query proteins are therefore the second and the third score, respectively. According to the decision strategy of equation (4.8), query protein I is predicted to belong to Class 2, and query protein II is predicted to belong to Class 3. More details about SVMs can be found in Appendix B.
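The one-vs-rest labelling and the maximum-score decision can be checked with a few lines of code. The sketch below uses the two score vectors quoted above; the function names and the list of training class labels are hypothetical.

```python
def one_vs_rest_labels(class_ids, m):
    """Labels for the m-th binary SVM: +1 for class m, -1 for all other classes."""
    return [1 if c == m else -1 for c in class_ids]


def predict_class(scores):
    """Return the 1-based index of the SVM with the largest score."""
    best = max(range(len(scores)), key=lambda m: scores[m])
    return best + 1


# Hypothetical class labels of five training proteins in a four-class problem.
train_classes = [1, 2, 3, 2, 4]
print(one_vs_rest_labels(train_classes, 2))

# SVM scores of the two query proteins in the example above.
scores_I = (-0.5, 0.1, -1.0, -2.3)
scores_II = (-0.8, 0.9, 1.3, 0.5)
print(predict_class(scores_I), predict_class(scores_II))
```

Note that query protein II has three positive scores; the decision rule still returns a single class because only the largest score is kept, which is what makes the predictor single-location.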


Fig. 4.6: An example showing how a four-class SVM classifier uses the one-vs-rest technique to solve multi-class classification problems. pi: the i-th (i = 1,2) testing GO vector.

4.2 FusionSVM: Fusion of gene ontology and homology-based features

This section introduces a fusion predictor, namely FusionSVM, which integrates GO features and homology-based features for classification.

4.2.1 InterProGOSVM: Extracting GO from InterProScan

Similar to GOASVM, the prediction process of InterProGOSVM is also divided into two stages: feature extraction (vectorization) and pattern classification. However, unlike GOASVM, which retrieves GO terms from the GOA Database, InterProGOSVM extracts GO terms from a program called InterProScan,15 which needs neither the ACs of proteins nor BLAST, and may produce more GO terms correlated with molecular functions.

The construction of GO vectors is divided into two steps. First, a collection of distinct GO terms is obtained by presenting all of the sequences in a dataset to InterProScan. For each query sequence, InterProScan returns a file containing the GO terms found by various protein-signature recognition algorithms (we used all the available algorithms in this work). Using the dataset described in Table 8.4 of Chapter 8, we found 1203 distinct GO terms, from GO:0019904 to GO:0016719. These GO terms form a GO Euclidean space with 1203 dimensions.

In the second step, for each sequence in the dataset we constructed a GO vector by matching its GO terms to all of the 1203 GO terms determined in the first step. As with GOASVM, the four GO-vector construction methods have been investigated.

Postprocessing of GO vectors

Although the raw GO vectors can be directly applied to support vector machines (SVMs) for classification, better performance may be obtained by postprocessing the raw vectors before SVM classification. Here we introduce two postprocessing methods: (1) vector norm and (2) geometric mean.

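  1. Vector norm. Given the i-th GO training vector $\mathbf{p}_i$, the vector is normalized as
    $$p_{i,j}^{(v)} = \frac{p_{i,j}}{\|\mathbf{p}_i\|}, \quad j = 1, \ldots, 1203,$$
    where the superscript (v) stands for vector norm, and $p_{i,j}$ is the j-th element of $\mathbf{p}_i$. In case $\|\mathbf{p}_i\| = 0$, we set all the elements of $\mathbf{p}_i^{(v)}$ to zero. Similarly, given the i-th test vector $\mathbf{q}_i$, the GO test vector is normalized as
    $$q_{i,j}^{(v)} = \frac{q_{i,j}}{\|\mathbf{q}_i\|}, \quad j = 1, \ldots, 1203.$$
  2. Geometric mean. This method involves pairwise comparison of GO vectors, followed by normalization.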

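–  Pairwise comparison. Denote $\mathbf{P} = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_T]^{\mathsf{T}}$ as a T × 1203 matrix whose rows are the raw GO vectors of T training sequences. Given the i-th GO training vector $\mathbf{p}_i$, we compute the dot products between $\mathbf{p}_i$ and each of the training GO vectors to obtain a T-dimensional vector:

$$\mathbf{x}_i = \mathbf{P}\mathbf{p}_i, \quad i = 1, \ldots, T.$$

During testing, given the i-th test vector $\mathbf{q}_i$, we compute

$$\mathbf{x}'_i = \mathbf{P}\mathbf{q}_i, \quad i = 1, \ldots, T',$$

where T' is the number of test vectors (sequences).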

–   Normalization. The j-th element of $\mathbf{x}_i$ is divided by the geometric mean of the i-th element of $\mathbf{x}_i$ and the j-th element of $\mathbf{x}_j$, leading to the normalized vectors

$$x_{i,j}^{(g)} = \frac{x_{i,j}}{\sqrt{x_{i,i}\, x_{j,j}}}, \quad i, j = 1, \ldots, T,$$

where the superscript (g) stands for geometric mean. Note that we selected only those proteins that have at least one GO term (see Section …); therefore, pairwise comparison guarantees that the elements $x_{i,i}$ and $x_{j,j}$ are nonzero for i,j = 1,...,T.

Multiclass SVM classification

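After GO vector construction and post-processing, the vectors $\mathbf{p}_i$, $\mathbf{p}_i^{(v)}$, or $\mathbf{x}_i^{(g)}$ can be used for training one-vs-rest SVMs. Specifically, for an M-class problem (here M is the number of subcellular locations), M independent SVMs are trained. During testing, given the i-th test protein $\mathbb{Q}_i$, the output of the m-th SVM is

$$s_m^{\mathrm{GO}}(\mathbb{Q}_i) = \sum_{r \in \mathcal{S}_m^{\mathrm{GO}}} \alpha_{m,r}^{\mathrm{GO}}\, y_{m,r}^{\mathrm{GO}}\, K^{\mathrm{GO}}(\mathbf{p}_r, \mathbf{q}_i) + b_m^{\mathrm{GO}},$$

where $\mathcal{S}_m^{\mathrm{GO}}$ is the set of support vector indexes corresponding to the m-th SVM, $y_{m,r}^{\mathrm{GO}} = 1$ when $\mathbf{p}_r$ belongs to class m and $y_{m,r}^{\mathrm{GO}} = -1$ otherwise, $\alpha_{m,r}^{\mathrm{GO}}$ are the Lagrange multipliers, $b_m^{\mathrm{GO}}$ is the bias, and $K^{\mathrm{GO}}(\mathbf{p}_r, \mathbf{q}_i)$ is a kernel function. The form of $K^{\mathrm{GO}}(\mathbf{p}_r, \mathbf{q}_i)$ depends on the postprocessing method being used. For example, if vector norm is used for normalization, the kernel becomes

$$K^{\mathrm{GO}}(\mathbf{p}_r, \mathbf{q}_i) = \langle \mathbf{p}_r^{(v)}, \mathbf{q}_i^{(v)} \rangle.$$

The SVM score $s_m^{\mathrm{GO}}(\mathbb{Q}_i)$ can be combined with the score of the profile alignment SVM described next.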
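The two postprocessing methods and the normalized linear kernel can be sketched as follows. The function names and the three-dimensional toy vectors are hypothetical, returning an all-zero vector when the norm is zero is an assumption, and the geometric-mean routine covers the training side only and requires every training vector to be nonzero.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def vector_norm(p):
    """Divide every element by the Euclidean norm of the vector."""
    norm = math.sqrt(dot(p, p))
    if norm == 0:
        return [0.0] * len(p)  # assumption: a zero vector stays all-zero
    return [x / norm for x in p]


def geometric_mean_normalize(train):
    """Pairwise dot products x[i][j], each divided by sqrt(x[i][i] * x[j][j])."""
    t = len(train)
    x = [[dot(p_i, p_j) for p_j in train] for p_i in train]
    return [[x[i][j] / math.sqrt(x[i][i] * x[j][j]) for j in range(t)]
            for i in range(t)]


# Hypothetical raw GO vectors of three training proteins.
train = [[1, 0, 1], [0, 2, 0], [1, 1, 0]]
normalized = geometric_mean_normalize(train)
print(normalized[0])

# Linear kernel on vector-normalized inputs (a cosine similarity).
k_02 = dot(vector_norm(train[0]), vector_norm(train[2]))
print(k_02)
```

For training vectors the two routes agree: the geometric-mean entry for a pair (i, j) equals the inner product of the two vector-normalized GO vectors, since both reduce to the dot product divided by the product of the two norms.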

4.2.2 PairProSVM: A homology-based method

Kernel techniques based on profile alignment have been used successfully in detecting remote homologous proteins [170] and in predicting subcellular locations of eukaryotic proteins [132]. Instead of extracting feature vectors directly from sequences, profile alignment methods train an SVM classifier by using the scores of local profile alignment.

This method, namely PairProSVM, extracts the features from protein sequences by aligning the profiles of the sequences with each of the training profiles [132]. A profile is a matrix in which the elements in a column (sequence position) specify the frequency of individual amino acids appearing in the corresponding position of some homologous sequences. Given a sequence, a profile can be derived by aligning it with a set of similar sequences. The similarity score between a known and an unknown sequence can be computed by aligning the profile of the known sequence with that of the unknown sequence [170]. Since the comparison involves not only two sequences but also their closely related sequences, the score is more sensitive to weak similarity between protein families.
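The notion of a profile as a position-by-residue frequency matrix can be illustrated with a toy alignment. This is a deliberately simplified sketch: it counts unweighted residue frequencies per column and skips gaps, whereas a real profile produced by a database search uses weighted observations; the function name and the four aligned sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(aligned_seqs, alphabet=AMINO_ACIDS):
    """Return one dict per alignment column mapping residue -> relative frequency.

    Characters outside the alphabet (e.g. the gap symbol '-') are ignored.
    """
    length = len(aligned_seqs[0])
    profile = []
    for pos in range(length):
        column = [s[pos] for s in aligned_seqs if s[pos] in alphabet]
        counts = Counter(column)
        total = len(column)
        profile.append({aa: (counts[aa] / total if total else 0.0)
                        for aa in alphabet})
    return profile


# Hypothetical alignment of a query with three similar sequences.
alignment = ["MKV", "MRV", "MKL", "M-V"]
profile = frequency_profile(alignment)
for pos, column in enumerate(profile):
    print(pos, {aa: round(f, 2) for aa, f in column.items() if f > 0})
```

Each column sums to one over the residues observed at that position, so two profiles of different proteins can be compared position by position during alignment.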

The profile of a sequence can be obtained by presenting the sequence to PSI-BLAST [5], which searches against a protein database for homologous sequences. The information pertaining to the aligned sequences is represented by two matrices: the position-specific scoring matrix (PSSM) and the position-specific frequency matrix (PSFM). Each entry of a PSSM represents the log-likelihood of the residue substitutions at the corresponding position in the query sequence. The PSFM contains the weighted observation frequencies of each position of the aligned sequences.

The flowchart of the PairProSVM is illustrated in Figure 2.7. Given a query sequence, we first obtain its profile by presenting it to PSI-BLAST. Then we align it with the profile of each training sequence to form an alignment score vector, which is further used as inputs to an SVM classifier for classification. The details of obtaining profiles and profile alignment methods can be found in Section 2.1.3 of Chapter 2. Mathematically, given the i-th test protein sequence, we align its profile with each of the training profiles to obtain a profile-alignment test vector qi, whose elements are then normalized by the geometric mean as follows:


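Similar to the GO method, a one-versus-rest SVM classifier was used to classify the profile-alignment vectors. Specifically, the score of the m-th profile alignment SVM for the i-th test protein $\mathbb{Q}_i$ is

$$s_m^{\mathrm{PA}}(\mathbb{Q}_i) = \sum_{r \in \mathcal{S}_m^{\mathrm{PA}}} \alpha_{m,r}^{\mathrm{PA}}\, y_{m,r}^{\mathrm{PA}}\, K^{\mathrm{PA}}(\mathbf{p}_r^{\mathrm{PA}}, \mathbf{q}_i^{\mathrm{PA}}) + b_m^{\mathrm{PA}},$$

where the superscript PA marks the profile-alignment counterparts of the quantities in the GO SVM score. This score is to be fused with the score of the GO SVM.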

4.2.3 Fusion of InterProGOSVM and PairProSVM

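Figure 4.7 illustrates the fusion of InterProGOSVM and PairProSVM. The GO and profile alignment scores produced by the GO and profile alignment SVMs are normalized by Z-norm:

$$\tilde{s}_m^{\mathrm{GO}}(\mathbb{Q}_i) = \frac{s_m^{\mathrm{GO}}(\mathbb{Q}_i) - \mu^{\mathrm{GO}}}{\sigma^{\mathrm{GO}}} \quad \text{and} \quad \tilde{s}_m^{\mathrm{PA}}(\mathbb{Q}_i) = \frac{s_m^{\mathrm{PA}}(\mathbb{Q}_i) - \mu^{\mathrm{PA}}}{\sigma^{\mathrm{PA}}},$$

where $(\mu^{\mathrm{GO}}, \sigma^{\mathrm{GO}})$ and $(\mu^{\mathrm{PA}}, \sigma^{\mathrm{PA}})$ are respectively the mean and standard deviation of the GO and profile alignment SVM scores derived from the training sequences. The advantage of Z-norm is that estimating the normalization parameters can be done off-line during training [8]. The normalized GO and profile-alignment SVM scores are fused:

$$s_m^{\mathrm{Fusion}}(\mathbb{Q}_i) = w^{\mathrm{GO}}\, \tilde{s}_m^{\mathrm{GO}}(\mathbb{Q}_i) + w^{\mathrm{PA}}\, \tilde{s}_m^{\mathrm{PA}}(\mathbb{Q}_i),$$

where $w^{\mathrm{GO}} + w^{\mathrm{PA}} = 1$. Finally, the predicted class of the test sequence is given by

$$m^* = \operatorname*{arg\,max}_{m \in \{1, \ldots, M\}} s_m^{\mathrm{Fusion}}(\mathbb{Q}_i).$$
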
For ease of reference, the fusion predictor is referred to as FusionSVM.
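The fusion step can be sketched in a few lines. The code is an illustration under stated assumptions rather than the FusionSVM implementation: the function names are hypothetical, the population standard deviation is assumed when estimating the Z-norm parameters, and all scores, means, and standard deviations in the example are made-up numbers.

```python
import statistics


def z_norm_params(train_scores):
    """Mean and (population) standard deviation of training SVM scores."""
    return statistics.mean(train_scores), statistics.pstdev(train_scores)


def z_norm(score, mean, std):
    """Shift and scale one score with parameters estimated during training."""
    return (score - mean) / std


def fuse(go_scores, pa_scores, go_params, pa_params, w_go=0.5):
    """Weighted sum of Z-normalized scores; returns (fused scores, predicted class)."""
    w_pa = 1.0 - w_go  # the two fusion weights sum to one
    fused = [w_go * z_norm(g, *go_params) + w_pa * z_norm(p, *pa_params)
             for g, p in zip(go_scores, pa_scores)]
    best = max(range(len(fused)), key=fused.__getitem__) + 1  # 1-based class
    return fused, best


# Made-up SVM scores of one test protein in a three-class problem.
go_scores = [0.2, 1.0, -0.5]
pa_scores = [2.0, -1.0, 0.0]
go_params = (0.0, 1.0)  # (mean, std) of GO SVM scores on training data
pa_params = (0.0, 2.0)  # (mean, std) of profile alignment SVM scores

fused, best = fuse(go_scores, pa_scores, go_params, pa_params, w_go=0.5)
print(fused, best)
```

Setting `w_go` to 1.0 or 0.0 recovers the GO-only or the profile-alignment-only decision, which gives a simple way to see how much each source of information contributes.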


Fig. 4.7: Flowchart of fusion of InterProGOSVM and PairProSVM.

4.3 Summary

This chapter presented two predictors for single-location protein subcellular localization, namely GOASVM and FusionSVM. Both predictors use GO information as features and SVM as classifiers for prediction. Moreover, the ways to construct GO vectors are the same for these two predictors.

However, there are three differences between GOASVM and FusionSVM: (1) the former retrieves the GO terms from the GOA Database, while the latter retrieves them from the InterProScan program; (2) the former does not post-process the GO vectors while the latter does; (3) the former uses only GO information as features and adopts a successive-search strategy to make sure this method is applicable to novel proteins, while the latter combines GO features and homology-based features for prediction.

