8 Experimental setup

As stated in a comprehensive review [37], valid dataset construction for training and testing and unbiased measurements for evaluating the performance of predictors are two indispensable steps in establishing a statistical protein predictor. This chapter focuses on these two important parts of the experimental setup: dataset construction and performance metrics for predicting single- and multi-label proteins.

8.1 Prediction of single-label proteins

This section will focus on constructing datasets and will introduce performance metrics for single-location proteins which are used in GOASVM and InterProGOSVM.

8.1.1 Dataset construction

8.1.1.1 Datasets for GOASVM

Two benchmark datasets (EU16 [46] and HUM12 [44]) and a novel dataset (NE16) were used to evaluate the performance of GOASVM. The EU16 and HUM12 datasets were created from Swiss-Prot 48.2 in 2005 and Swiss-Prot 49.3 in 2006, respectively. The EU16 comprises 4150 eukaryotic proteins (2423 in the training set and 1727 in the independent test set) with 16 classes, and the HUM12 has 2041 human proteins (919 in the training set and 1122 in the independent test set) with 12 classes. In both datasets, the sequence identity was cut off at 25% by a culling program [219]. Here we use the EU16 dataset as an example to illustrate the details of dataset construction procedures. To obtain high-quality well-defined datasets, the data was strictly screened according to the criteria described below [46]:

  1. Only protein sequences annotated with “eukaryotic” were included, since the current study only focused on eukaryotic proteins.
  2. Sequences annotated with ambiguous or uncertain terms, such as “probably”, “maybe”, “probable”, “potential”, or “by similarity”, were excluded.
  3. Those protein sequences labelled with two or more subcellular locations were excluded because of their lack of uniqueness.
  4. Sequences annotated with “fragments” were excluded, and sequences with less than 50 amino acid residues were removed, since these proteins might be fragments.
  5. To avoid any homology bias, the sequence similarity in the same subcellular location among the obtained dataset was cut off at 25% by a culling program [219] to winnow the redundant sequences.
  6. Subcellular locations (subsets) containing less than 20 protein sequences were left out because they lack statistical significance.
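The sequence-level screening steps above (criteria 2–4) can be sketched as a simple filter. This is a hypothetical sketch: the record fields ("annotation", "locations", "is_fragment", "sequence") are illustrative placeholders rather than the real Swiss-Prot schema, and the homology reduction of criterion 5 would still require an external culling program such as the one cited in [219].

```python
# Hypothetical sketch of screening criteria 2-4 applied to Swiss-Prot-like
# records. Field names are illustrative, not the real Swiss-Prot schema.

AMBIGUOUS_TERMS = ("probably", "maybe", "probable", "potential", "by similarity")

def passes_screening(record):
    """Return True if a record survives criteria 2-4."""
    annotation = record["annotation"].lower()
    # Criterion 2: exclude ambiguous or uncertain annotations.
    if any(term in annotation for term in AMBIGUOUS_TERMS):
        return False
    # Criterion 3: exclude multi-location proteins (single-label dataset).
    if len(record["locations"]) != 1:
        return False
    # Criterion 4: exclude fragments and sequences shorter than 50 residues.
    if record["is_fragment"] or len(record["sequence"]) < 50:
        return False
    return True

records = [
    {"annotation": "nucleus", "locations": ["nucleus"],
     "is_fragment": False, "sequence": "M" * 120},
    {"annotation": "nucleus (by similarity)", "locations": ["nucleus"],
     "is_fragment": False, "sequence": "M" * 120},   # fails criterion 2
    {"annotation": "cytoplasm", "locations": ["cytoplasm", "nucleus"],
     "is_fragment": False, "sequence": "M" * 120},   # fails criterion 3
]
kept = [r for r in records if passes_screening(r)]
print(len(kept))  # only the first record survives
```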

After strictly following the criteria mentioned above, 4150 protein sequences were obtained: 25 cell wall, 21 centriole, 258 chloroplast, 97 cyanelle, 718 cytoplasm, 25 cytoskeleton, 113 endoplasmic reticulum, 806 extracellular, 85 Golgi apparatus, 46 lysosome, 228 mitochondrion, 1169 nucleus, 64 peroxisome, 413 plasma membrane, 38 plastid, and 44 vacuole. This dataset was then further divided into a training dataset (2423 sequences) and a testing dataset (1727 sequences). The numbers of training and testing proteins of each kind are shown in Table 8.1. As can be seen, both the training and testing datasets are quite imbalanced, in that the numbers of proteins in different subcellular locations vary significantly (from 4 to 695). The severe imbalance in the number of proteins in different subcellular locations and the low sequence similarity make classification difficult.

The proteins in HUM12 were screened according to the same criteria mentioned above, except that instead of sequences annotated with “eukaryotic”, sequences annotated with “human” in the ID (identification) field in Swiss-Prot were collected. The breakdown of the HUM12 dataset is shown in Table 8.2.

These two datasets are good benchmarks for performance comparison, because none of the proteins has more than 25% sequence identity with any other protein in the same subcellular location. However, the training and testing sets of these two datasets were constructed during the same period of time. Therefore, the training and testing sets are likely to share similar GO information, causing over-estimation of the prediction accuracy.

Table 8.1: Breakdown of the benchmark dataset for single-label eukaryotic proteins (EU16). EU16 was extracted from Swiss-Prot 48.2. The sequence identity is below 25%.

Label Subcellular location No. of sequences
Training Testing
1 Cell wall 20 5
2 Centriole 17 4
3 Chloroplast 207 51
4 Cyanelle 78 19
5 Cytoplasm 384 334
6 Cytoskeleton 20 5
7 Endoplasmic reticulum 91 22
8 Extracellular 402 404
9 Golgi apparatus 68 17
10 Lysosome 37 9
11 Mitochondrion 183 45
12 Nucleus 474 695
13 Peroxisome 52 12
14 Plasma membrane 323 90
15 Plastid 31 7
16 Vacuole 36 8
Total   2423 1727

Table 8.2: Breakdown of the benchmark dataset for single-label human proteins (HUM12). HUM12 was extracted from Swiss-Prot 49.3. The sequence identity is below 25%.

Label Subcellular location No. of sequences
Training Testing
1 Centriole 20 5
2 Cytoplasm 155 222
3 Cytoskeleton 12 2
4 Endoplasmic reticulum 28 7
5 Extracellular 140 161
6 Golgi apparatus 33 9
7 Lysosome 32 8
8 Microsome 7 1
9 Mitochondrion 125 103
10 Nucleus 196 384
11 Peroxisome 18 5
12 Plasma membrane 153 215
Total   919 1122

To avoid over-estimating the prediction performance and to demonstrate the effectiveness of the predictors, a eukaryotic dataset (NE16) containing novel proteins was constructed using the criteria specified above. To ensure that the proteins are really novel to the predictors, their creation dates should be significantly later than those of the training proteins and also later than the release of the GOA Database. Because EU16 was created in 2005 and the GOA Database used in our experiments was released on March 8, 2011, we selected the proteins that were added to Swiss-Prot between March 8, 2011 and April 18, 2012. Moreover, only proteins with a single subcellular location that falls within the 16 classes of the EU16 dataset were selected. After limiting the sequence similarity to 25%, 608 eukaryotic proteins distributed in 14 subcellular locations (see Table 8.3) were selected.

8.1.1.2 Datasets for FusionSVM

For the fusion of InterProGOSVM and PairProSVM, the performance was evaluated on a benchmark dataset (OE11 [101]), which was created by selecting all eukaryotic proteins with annotated subcellular locations from Swiss-Prot 41.0. The OE11 dataset comprises 3572 proteins in 11 classes. The breakdown of the OE11 dataset is shown in Table 8.4. Specifically, there are 622 cytoplasm, 1188 nuclear, 424 mitochondria, 915 extracellular, 26 Golgi apparatus, 225 chloroplast, 45 endoplasmic reticulum, 7 cytoskeleton, 29 vacuole, 47 peroxisome, and 44 lysosome proteins. The sequence similarity was cut off at 50%.

Table 8.3: Breakdown of the novel single-label eukaryotic-protein dataset for GOASVM (NE16). The NE16 dataset contains proteins that were added to Swiss-Prot between March 8, 2011 and April 18, 2012. The sequence identity of the dataset is below 25%. *: no new proteins were found in the corresponding subcellular location.

Label Subcellular location No. of sequences
1 Cell wall 2
2 Centriole 0*
3 Chloroplast 51
4 Cyanelle 0*
5 Cytoplasm 77
6 Cytoskeleton 4
7 Endoplasmic reticulum 28
8 Extracellular 103
9 Golgi apparatus 14
10 Lysosome 1
11 Mitochondrion 73
12 Nucleus 57
13 Peroxisome 6
14 Plasma membrane 169
15 Plastid 5
16 Vacuole 18
Total   608

Table 8.4: Breakdown of the dataset used for FusionSVM (OE11). OE11 was extracted from Swiss-Prot 41.0, and the sequence similarity is cut off at 50%.

Label Subcellular location No. of sequences
1 Cytoplasm 622
2 Nuclear 1188
3 Mitochondria 424
4 Extracellular 915
5 Golgi apparatus 26
6 Chloroplast 225
7 Endoplasmic reticulum 45
8 Cytoskeleton 7
9 Vacuole 29
10 Peroxisome 47
11 Lysosome 44
Total   3572

Among the 3572 protein sequences, only 3120 have valid GO vectors, i.e. GO vectors with at least one nonzero element, when using InterProScan. For the remaining 452 sequences, InterProScan cannot find any GO terms. Therefore, we used only the sequences with valid GO vectors in our experiments, which reduces the dataset size to 3120 protein sequences.
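Discarding proteins whose GO vectors contain no nonzero element amounts to a simple mask over the feature matrix. This is a minimal sketch with made-up vectors, not the actual OE11 feature extraction:

```python
import numpy as np

# Keep only proteins whose GO vectors have at least one nonzero element
# (i.e. InterProScan retrieved at least one GO term). Toy vectors only.
go_vectors = np.array([
    [0, 1, 0, 2],   # valid: at least one nonzero element
    [0, 0, 0, 0],   # invalid: no GO terms found
    [3, 0, 1, 0],   # valid
])
valid_mask = np.any(go_vectors != 0, axis=1)
valid_vectors = go_vectors[valid_mask]
print(valid_vectors.shape[0])  # 2 proteins survive
```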

8.1.2 Performance metrics

Several performance measures were used, including the overall accuracy (ACC), the overall Matthews correlation coefficient (OMCC) [132], and the weighted average Matthews correlation coefficient (WAMCC) [132]. The latter two measures are based on the Matthews correlation coefficient (MCC) [138]. Specifically, denote $M \in \mathbb{R}^{C \times C}$ as the confusion matrix of the prediction results, where $C$ is the number of subcellular locations and $M_{i,j}$ ($1 \le i, j \le C$) is the number of proteins that actually belong to class $i$ but are predicted as class $j$. We further denote

$$p_c = M_{c,c}, \tag{8.1}$$

$$q_c = \sum_{i=1, i \neq c}^{C} \sum_{j=1, j \neq c}^{C} M_{i,j}, \tag{8.2}$$

$$r_c = \sum_{i=1, i \neq c}^{C} M_{i,c}, \tag{8.3}$$

$$s_c = \sum_{j=1, j \neq c}^{C} M_{c,j}, \tag{8.4}$$

where $c$ ($1 \le c \le C$) is the index of a particular subcellular location. For class $c$, $p_c$ is the number of true positives, $q_c$ is the number of true negatives, $r_c$ is the number of false positives, and $s_c$ is the number of false negatives. Based on these notations, ACC, $\mathrm{MCC}_c$ for class $c$, OMCC, and WAMCC are defined respectively as

$$\mathrm{ACC} = \frac{\sum_{c=1}^{C} M_{c,c}}{\sum_{i=1}^{C}\sum_{j=1}^{C} M_{i,j}}, \tag{8.5}$$

$$\mathrm{MCC}_c = \frac{p_c q_c - r_c s_c}{\sqrt{(p_c + r_c)(p_c + s_c)(q_c + r_c)(q_c + s_c)}}, \tag{8.6}$$

$$\mathrm{OMCC} = \frac{\hat{p}\hat{q} - \hat{r}\hat{s}}{\sqrt{(\hat{p} + \hat{r})(\hat{p} + \hat{s})(\hat{q} + \hat{r})(\hat{q} + \hat{s})}}, \tag{8.7}$$

$$\mathrm{WAMCC} = \sum_{c=1}^{C} \frac{p_c + s_c}{N}\,\mathrm{MCC}_c, \tag{8.8}$$

where $\hat{p} = \sum_{c=1}^{C} p_c$, $\hat{q} = \sum_{c=1}^{C} q_c$, $\hat{r} = \sum_{c=1}^{C} r_c$, $\hat{s} = \sum_{c=1}^{C} s_c$, and $N$ is the total number of proteins.

MCC overcomes the shortcoming of accuracy on imbalanced data, as it prevents the performance measure from being dominated by the majority classes. For example, a classifier that predicts all samples as positive cannot be regarded as a good classifier unless it can also accurately predict negative samples. In this case, the accuracy and MCC of the positive class are 100% and 0%, respectively. Therefore, MCC is a better measure for the classification of imbalanced datasets.
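As a minimal illustration of these metrics (using the notation above, and assuming the pooled-count form of OMCC), the following sketch computes ACC, OMCC, and WAMCC from a confusion matrix and reproduces the imbalance pathology just described: a degenerate majority-class predictor scores high accuracy but zero per-class MCC.

```python
import numpy as np

# Sketch of ACC, per-class MCC, OMCC, and WAMCC from a confusion matrix M,
# following the definitions above; the pooled-count OMCC is an assumption.

def class_counts(M, c):
    """Counts p_c (TP), q_c (TN), r_c (FP), s_c (FN) for class c."""
    p = M[c, c]
    s = M[c, :].sum() - p              # false negatives
    r = M[:, c].sum() - p              # false positives
    q = M.sum() - p - r - s            # true negatives
    return p, q, r, s

def mcc(p, q, r, s):
    denom = np.sqrt(float((p + r) * (p + s) * (q + r) * (q + s)))
    return (p * q - r * s) / denom if denom > 0 else 0.0

def metrics(M):
    N = M.sum()
    acc = np.trace(M) / N
    counts = [class_counts(M, c) for c in range(M.shape[0])]
    # OMCC: MCC computed on the class-wise pooled counts.
    pooled = tuple(sum(t[i] for t in counts) for i in range(4))
    omcc = mcc(*pooled)
    # WAMCC: per-class MCC weighted by class size n_c = p_c + s_c.
    wamcc = sum(((p + s) / N) * mcc(p, q, r, s) for p, q, r, s in counts)
    return acc, omcc, wamcc

# Imbalanced two-class example: predicting everything as class 0 yields
# 90% accuracy but zero per-class MCC (hence WAMCC = 0).
M = np.array([[90, 0],
              [10, 0]])
acc, omcc, wamcc = metrics(M)
print(acc, wamcc)  # 0.9 0.0
```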

8.2 Prediction of multi-label proteins

This section focuses on constructing datasets and introduces performance metrics for multi-location proteins which are used in mGOASVM, AD-SVM, mPLR-Loc, SS-Loc, HybridGO-Loc, RP-SVM, and R3P-Loc.

8.2.1 Dataset construction

For multi-location protein subcellular localization, datasets from three groups are constructed: virus, plant, and eukaryote.

The datasets used for evaluating the proposed multi-label predictors have also been used by other multi-label predictors, including Virus-mPLoc [189], KNN-SVM [121], Plant-mPLoc [51], Euk-mPLoc 2.0 [49], iLoc-Virus [233], iLoc-Plant [230], and iLoc-Euk [52].

The virus dataset was created from Swiss-Prot 57.9. It contains 207 viral proteins distributed in 6 locations (see Table 8.5). Of the 207 viral proteins, 165 belong to one subcellular location, 39 to two locations, 3 to three locations and none to four or more locations. This means that about 20% of proteins are located in more than one subcellular location. The sequence identity of this dataset was cut off at 25 %.

The plant dataset was created from Swiss-Prot 55.3. It contains 978 plant proteins distributed in 12 locations (see Table 8.6). Of the 978 plant proteins, 904 belong to one subcellular location, 71 to two locations, 3 to three locations and none to four or more locations. In other words, 8% of the plant proteins in this dataset are located in multiple locations. The sequence identity of this dataset was cut off at 25 %.

The eukaryotic dataset was created from Swiss-Prot 55.3. It contains 7766 eukaryotic proteins distributed in 22 locations (see Table 8.7). Of the 7766 eukaryotic proteins, 6687 belong to one subcellular location, 1029 to two locations, 48 to three locations, 2 to four locations and none to five or more locations. In other words, about 14% of the eukaryotic proteins in this dataset are located in multiple locations. Similar to the virus dataset, the sequence identity of this dataset was cut off at 25 %.

To further demonstrate the effectiveness of the proposed predictors, a plant dataset containing novel proteins was constructed using the criteria specified in [51, 230]. Specifically, to ensure that the proteins are really novel to the predictors, their creation dates should be significantly later than those of the training proteins (from the plant dataset) and also later than the release of the GOA database. Because the plant dataset was created in 2008 and the GOA database used in our experiments was released on March 8, 2011, we selected the proteins that were added to Swiss-Prot between March 8, 2011 and April 18, 2012. Moreover, only proteins with multiple subcellular locations that fall within the 12 classes specified in Table 8.6 were included. After limiting the sequence similarity to 25%, 175 plant proteins distributed in 12 subcellular locations (see Table 8.8) were selected. Of the 175 plant proteins, 147 belong to one subcellular location, 27 belong to two locations, 1 belongs to three locations, and none to four or more locations. In other words, 16% of the plant proteins in this novel dataset are located in multiple locations.

Table 8.5: Breakdown of the multi-label virus protein dataset. The sequence identity is cut off at 25%. The superscript v stands for the virus dataset.

Label Subcellular location No. of locative proteins
1 Viral capsid 8
2 Host cell membrane 33
3 Host endoplasmic reticulum 20
4 Host cytoplasm 87
5 Host nucleus 84
6 Secreted 20
Total number of locative proteins ($N_{\mathrm{loc}}^{v}$)   252
Total number of actual proteins ($N_{\mathrm{act}}^{v}$)   207

Table 8.6: Breakdown of the multi-label plant protein dataset. The sequence identity is cut off at 25%. The superscript p stands for the plant dataset.

Label Subcellular location No. of locative proteins
1 Cell membrane 56
2 Cell wall 32
3 Chloroplast 286
4 Cytoplasm 182
5 Endoplasmic reticulum 42
6 Extracellular 22
7 Golgi apparatus 21
8 Mitochondrion 150
9 Nucleus 152
10 Peroxisome 21
11 Plastid 39
12 Vacuole 52
Total number of locative proteins ($N_{\mathrm{loc}}^{p}$)   1055
Total number of actual proteins ($N_{\mathrm{act}}^{p}$)   978

Table 8.7: Breakdown of the multi-label eukaryotic protein dataset. The sequence identity is cut off at 25%. The superscript e stands for the eukaryotic dataset.

Label Subcellular location No. of locative proteins
1 Acrosome 14
2 Cell membrane 697
3 Cell wall 49
4 Centrosome 96
5 Chloroplast 385
6 Cyanelle 79
7 Cytoplasm 2186
8 Cytoskeleton 139
9 ER 457
10 Endosome 41
11 Extracellular 1048
12 Golgi apparatus 254
13 Hydrogenosome 10
14 Lysosome 57
15 Melanosome 47
16 Microsome 13
17 Mitochondrion 610
18 Nucleus 2320
19 Peroxisome 110
20 SPI 68
21 Synapse 47
22 Vacuole 170
Total number of locative proteins ($N_{\mathrm{loc}}^{e}$)   8897
Total number of actual proteins ($N_{\mathrm{act}}^{e}$)   7766

Here we take the new plant dataset as an example to illustrate the details of the procedures, which are specified as follows:

  1. Go to the UniProt/SwissProt official webpage (http://www.uniprot.org/).
  2. Go to the “Search” section and select “Protein Knowledgebase (UniProtKB)” (default) in the “Search in” option.
  3. In the “Query” option, select or type “reviewed: yes”.
  4. Select “AND” in the “Advanced Search” option, then select “Taxonomy [OC]” and type in “Viridiplantae”.
  5. Select “AND” in the “Advanced Search” option, then select “Fragment: no”.
  6. Select “AND” in the “Advanced Search” option, then select “Sequence length” and type in “50 - ” (no less than 50).
  7. Select “AND” in the “Advanced Search” option, then select “Date entry integrated” and type in “20110308-20120418”.
  8. Select “AND” in the “Advanced Search” option, then select “Subcellular location: XXX Confidence: Experimental”, where XXX stands for a specific subcellular location. Here it covers the 12 locations: cell membrane, cell wall, chloroplast, cytoplasm, endoplasmic reticulum, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome, plastid, and vacuole.
  9. Further exclude those proteins which are not experimentally annotated (this rechecks the proteins to guarantee that they are all experimentally annotated).

Table 8.8: Breakdown of the new multi-label plant dataset. The dataset contains proteins that were added to Swiss-Prot between 08-Mar-2011 and 18-Apr-2012. The sequence identity of the dataset is below 25%.

Label Subcellular location No. of locative proteins
1 Cell membrane 16
2 Cell wall 1
3 Chloroplast 54
4 Cytoplasm 38
5 Endoplasmic reticulum 9
6 Extracellular 3
7 Golgi apparatus 7
8 Mitochondrion 16
9 Nucleus 46
10 Peroxisome 6
11 Plastid 1
12 Vacuole 7
Total number of locative proteins 204
Total number of actual proteins 175

After selecting the proteins, BLASTClust was applied to reduce the redundancy in the dataset so that no sequence pair has a sequence identity higher than 25%.

8.2.2 Dataset analysis

Because multi-label datasets are more complicated than single-label ones, some analysis should be carried out. To better visualize the distribution of proteins in each subcellular location of these three datasets, their breakdowns are shown in Figures 8.1, 8.2, and 8.3. Figure 8.1 shows that the majority (68%) of viral proteins in the virus dataset are located in the host cytoplasm and host nucleus, while the proteins in the remaining subcellular locations together account for only around one-third. This means that this multi-label dataset is imbalanced across the six subcellular locations. Similar conclusions can be drawn from Figure 8.2, where most of the plant proteins exist in the chloroplast, cytoplasm, nucleus, and mitochondrion, while proteins in the 8 other subcellular locations together account for less than 30%. This imbalance makes prediction on these two multi-label datasets difficult. Figure 8.3 also shows the imbalance of the multi-label eukaryotic dataset, where the majority (78%) of eukaryotic proteins are located in the nucleus, cytoplasm, extracellular, cell membrane, and mitochondrion, while proteins in the 17 other subcellular locations account for less than 22%.


Fig. 8.1: Breakdown of the multi-label virus dataset. The number of proteins shown in each subcellular location represents the number of “locative proteins” [213, 233]. Here, 207 actual proteins have 252 locative proteins. The viral proteins are distributed in six subcellular locations: viral capsid, host cell membrane, host endoplasmic reticulum, host cytoplasm, host nucleus, and secreted.


Fig. 8.2: Breakdown of the multi-label plant dataset. The number of proteins shown in each subcellular location represents the number of ‘locative proteins’ [213, 233]. Here, 978 actual proteins have 1055 locative proteins. The plant proteins are distributed in 12 subcellular locations: cell membrane, cell wall, chloroplast, cytoplasm, endoplasmic reticulum, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome, plastid, and vacuole.


Fig. 8.3: Breakdown of the multi-label eukaryotic dataset. The number of proteins shown in each subcellular location represents the number of “locative proteins” [213, 233]. Here, 7766 actual proteins have 8897 locative proteins. The eukaryotic proteins are distributed in 22 subcellular locations: acrosome (ACR), cell membrane (CM), cell wall (CW), centrosome (CEN), chloroplast (CHL), cyanelle (CYA), cytoplasm (CYT), cytoskeleton (CYK), endoplasmic reticulum (ER), endosome (END), extracellular (EXT), Golgi apparatus (GOL), hydrogenosome (HYD), lysosome (LYS), melanosome (MEL), microsome (MIC), mitochondrion (MIT), nucleus (NUC), peroxisome (PER), spindle pole body (SPI), synapse (SYN), and vacuole (VAC).

More detailed statistical properties of these three datasets are listed in Table 8.9. In the table, M denotes the number of subcellular locations and N denotes the number of actual (or distinct) proteins. Besides the commonly used properties for single-label classification, the following measurements [204] are used to explicitly quantify the multi-label properties of the datasets:

  1. Label cardinality (LC). LC is the average number of labels per data instance, defined as $\mathrm{LC} = \frac{1}{N}\sum_{i=1}^{N}|\mathcal{Y}_i|$, where $\mathcal{Y}_i$ is the label set of the $i$-th protein and $|\cdot|$ denotes the cardinality of a set.
  2. Label density (LD). LD is LC normalized by the number of classes, defined as $\mathrm{LD} = \mathrm{LC}/M$.
  3. Distinct label set (DLS). DLS is the number of distinct label combinations in the dataset.
  4. Proportion of distinct label set (PDLS). PDLS is DLS normalized by the number of actual data instances, defined as $\mathrm{PDLS} = \mathrm{DLS}/N$.
  5. Total locative number (TLN). TLN is the total number of locative proteins. This concept is derived from locative proteins in [233], which will be further elaborated in Section 8.2.3.
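These dataset-level statistics can be computed directly from the per-protein label sets. The sketch below uses a toy four-protein dataset; the function name and data are ours, not from the cited work.

```python
# Minimal sketch of the multi-label dataset statistics defined above
# (TLN, LC, LD, DLS, PDLS), computed from a list of per-protein label sets.

def dataset_stats(label_sets, num_classes):
    N = len(label_sets)
    tln = sum(len(ls) for ls in label_sets)          # total locative number
    lc = tln / N                                     # label cardinality
    ld = lc / num_classes                            # label density
    dls = len({frozenset(ls) for ls in label_sets})  # distinct label sets
    pdls = dls / N                                   # proportion of DLS
    return {"TLN": tln, "LC": lc, "LD": ld, "DLS": dls, "PDLS": pdls}

# Toy dataset: 4 proteins over 3 locations; one protein is dual-located.
labels = [{"nucleus"}, {"cytoplasm"}, {"nucleus", "cytoplasm"}, {"nucleus"}]
stats = dataset_stats(labels, num_classes=3)
print(stats["LC"])  # (1 + 1 + 2 + 1) / 4 = 1.25
```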

Table 8.9: Statistical properties of the three multi-label benchmark datasets used in our experiments.

Dataset M N TLN LC LD DLS PDLS
Virus 6 207 252 1.2174 0.2029 17 0.0821
Plant 12 978 1055 1.0787 0.0899 32 0.0327
Eukaryote 22 7766 8897 1.1456 0.0521 112 0.0144

M: number of subcellular locations.
N: number of actual proteins.
LC: label cardinality.
LD: label density.
DLS: distinct label set.
PDLS: proportion of distinct label set.
TLN: total locative number.

Among these measurements, LC measures the degree of multi-labels in a dataset. For a single-label dataset, LC = 1; for a multi-label dataset, LC > 1, and the larger the LC, the higher the degree of multi-labels. LD takes into consideration the number of classes in the classification problem: for two datasets with the same LC, the lower the LD, the more difficult the classification. DLS represents the number of possible label combinations in the dataset; the higher the DLS, the more complicated the composition. PDLS represents the degree of distinct labels in a dataset; the larger the PDLS, the more likely it is that the individual label sets differ from each other. From Table 8.9, we can see that although the virus dataset (N = 207, TLN = 252) has fewer proteins than the plant dataset (N = 978, TLN = 1055) and the eukaryotic dataset (N = 7766, TLN = 8897), the former (LC = 1.2174, LD = 0.2029) is a denser multi-label dataset than the latter two (LC = 1.0787, LD = 0.0899 and LC = 1.1456, LD = 0.0521).

8.2.3 Performance metrics

Compared to traditional single-label classification, multi-label classification requires more complicated performance metrics to better reflect the multi-label capabilities of classifiers. Conventional single-label measures need to be modified for multi-label classification. These measures include accuracy, precision, recall, F1-score (F1), and Hamming loss (HL) [60, 74]. Specifically, denote $\mathcal{Y}_i$ and $\mathcal{Z}_i$ as the true label set and the predicted label set for the $i$-th protein $\mathbf{q}_i$ ($i = 1, \ldots, N$), respectively. Then the five measurements are defined as follows:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Y}_i \cup \mathcal{Z}_i|}, \tag{8.9}$$

$$\mathrm{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Z}_i|}, \tag{8.10}$$

$$\mathrm{Recall} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Y}_i|}, \tag{8.11}$$

$$\mathrm{F1} = \frac{1}{N}\sum_{i=1}^{N}\frac{2\,|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Y}_i| + |\mathcal{Z}_i|}, \tag{8.12}$$

$$\mathrm{HL} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cup \mathcal{Z}_i| - |\mathcal{Y}_i \cap \mathcal{Z}_i|}{M}, \tag{8.13}$$

where | · | means counting the number of elements in the set therein and ∩ represents the intersection of sets.

Accuracy, precision, recall, and F1 indicate the classification performance: the higher these measures, the better the prediction performance. Among them, accuracy is the most commonly used criterion. F1 is the harmonic mean of precision and recall, which allows us to compare the performance of classification systems by taking the trade-off between precision and recall into account. The Hamming loss (HL) [60, 74] is different from the other metrics. As can be seen from equation (8.13), when all of the proteins are correctly predicted, i.e. $\mathcal{Z}_i = \mathcal{Y}_i$ for all $i$, then HL = 0, whereas the other metrics equal 1. On the other hand, when the predictions of all proteins are completely wrong, i.e. $\mathcal{Y}_i \cap \mathcal{Z}_i = \emptyset$ and $\mathcal{Y}_i \cup \mathcal{Z}_i$ covers all $M$ locations, then HL = 1, whereas the other metrics equal 0. Therefore, the lower the HL, the better the prediction performance.
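The set-based definitions above translate almost literally into code. This is a minimal sketch using toy label sets (the location names are made up):

```python
# Sketch of the five multi-label measures: accuracy, precision, recall,
# F1, and Hamming loss. Y holds true label sets, Z predicted label sets,
# and M is the number of subcellular locations.

def multilabel_metrics(Y, Z, M):
    N = len(Y)
    acc = sum(len(y & z) / len(y | z) for y, z in zip(Y, Z)) / N
    prec = sum(len(y & z) / len(z) for y, z in zip(Y, Z)) / N
    rec = sum(len(y & z) / len(y) for y, z in zip(Y, Z)) / N
    f1 = sum(2 * len(y & z) / (len(y) + len(z)) for y, z in zip(Y, Z)) / N
    # Hamming loss: size of the symmetric difference, normalized by N and M.
    hl = sum(len(y ^ z) for y, z in zip(Y, Z)) / (N * M)
    return acc, prec, rec, f1, hl

# Toy example: second protein is dual-located but only partially predicted.
Y = [{"nuc"}, {"cyt", "mit"}]
Z = [{"nuc"}, {"cyt"}]
acc, prec, rec, f1, hl = multilabel_metrics(Y, Z, M=3)
print(acc)  # (1 + 0.5) / 2 = 0.75
```

Note that precision is undefined when a predicted label set is empty; a full implementation would have to guard against that case.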

Two additional measurements [213, 233] are often used in multi-label subcellular localization prediction. They are overall locative accuracy (OLA) and overall actual accuracy (OAA). The former is given by

$$\mathrm{OLA} = \frac{1}{\sum_{i=1}^{N}|\mathcal{Y}_i|}\sum_{i=1}^{N}|\mathcal{Y}_i \cap \mathcal{Z}_i|, \tag{8.14}$$

and the overall actual accuracy (OAA) is

$$\mathrm{OAA} = \frac{1}{N}\sum_{i=1}^{N}\Delta(\mathcal{Y}_i, \mathcal{Z}_i), \tag{8.15}$$

where

$$\Delta(\mathcal{Y}_i, \mathcal{Z}_i) = \begin{cases}1, & \text{if } \mathcal{Y}_i = \mathcal{Z}_i,\\ 0, & \text{otherwise.}\end{cases} \tag{8.16}$$

To explain equations (8.14) and (8.15), the concepts of locative proteins and actual proteins are introduced here. If a protein exists in two different subcellular locations, it is counted as two locative proteins; if a protein coexists in three locations, it is counted as three locative proteins; and so forth. But no matter how many subcellular locations a protein simultaneously resides in, it is counted as only one actual protein. Mathematically, denote $N_{\mathrm{loc}}$ as the total number of locative proteins, $N_{\mathrm{act}}$ as the total number of actual proteins, $M$ as the number of subcellular locations, and $n_{\mathrm{act}}(m)$ ($m = 1, \ldots, M$) as the number of actual proteins coexisting in $m$ subcellular locations. Then $N_{\mathrm{act}}$ and $N_{\mathrm{loc}}$ can be expressed as

$$N_{\mathrm{act}} = \sum_{m=1}^{M} n_{\mathrm{act}}(m) \tag{8.17}$$

and

$$N_{\mathrm{loc}} = \sum_{m=1}^{M} m \cdot n_{\mathrm{act}}(m). \tag{8.18}$$

In the virus dataset, M = 6; in the plant dataset, M = 12; and in the eukaryotic dataset, M = 22. Then, from equations (8.17) and (8.18), we obtain

$$N_{\mathrm{act}}^{v} = 165 + 39 + 3 = 207, \tag{8.19}$$

$$N_{\mathrm{loc}}^{v} = 1 \times 165 + 2 \times 39 + 3 \times 3 = 252, \tag{8.20}$$

$$N_{\mathrm{act}}^{p} = 904 + 71 + 3 = 978, \tag{8.21}$$

$$N_{\mathrm{loc}}^{p} = 1 \times 904 + 2 \times 71 + 3 \times 3 = 1055, \tag{8.22}$$

$$N_{\mathrm{act}}^{e} = 6687 + 1029 + 48 + 2 = 7766, \tag{8.23}$$

$$N_{\mathrm{loc}}^{e} = 1 \times 6687 + 2 \times 1029 + 3 \times 48 + 4 \times 2 = 8897, \tag{8.24}$$

where the superscripts v, p, and e stand for the virus, plant, and eukaryotic datasets, respectively. Thus, for the virus dataset, 207 actual proteins correspond to 252 locative proteins; for the plant dataset, 978 actual proteins correspond to 1055 locative proteins; and for the eukaryotic dataset, 7766 actual proteins correspond to 8897 locative proteins.
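Equations (8.17) and (8.18) can be checked against all three datasets using the per-multiplicity counts quoted above:

```python
# Sketch verifying the actual/locative protein counts: n_act maps m (the
# number of locations a protein resides in) to the number of such proteins.

def actual_and_locative(n_act):
    n_actual = sum(n_act.values())                        # equation (8.17)
    n_locative = sum(m * count for m, count in n_act.items())  # (8.18)
    return n_actual, n_locative

# Multiplicity counts taken from the virus, plant, and eukaryotic datasets.
virus = {1: 165, 2: 39, 3: 3}
plant = {1: 904, 2: 71, 3: 3}
euk = {1: 6687, 2: 1029, 3: 48, 4: 2}

print(actual_and_locative(virus))  # (207, 252)
```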

According to equation (8.14), a locative protein is considered to be correctly predicted if any of the predicted labels matches any labels in the true label set. On the other hand, equation (8.15) suggests that an actual protein is considered to be correctly predicted only if all of the predicted labels exactly match those in the true label set. For example, for a protein coexisting in, say, three subcellular locations, if only two of the three are correctly predicted, or if the predicted result contains a location not belonging to the three, the prediction is considered to be incorrect. In other words, only when all of the subcellular locations of a query protein are exactly predicted, without any overprediction or underprediction, is the prediction considered correct. Therefore, OAA is a more stringent measure than OLA. OAA is also more objective than OLA, because locative accuracy is liable to give biased performance measures when the predictor tends to overpredict, i.e. giving large $|\mathcal{Z}_i|$ for many proteins $\mathbf{q}_i$. In the extreme case, if every protein is predicted to have all of the M subcellular locations, then according to equation (8.14) the OLA is 100%, even though the predictions are obviously wrong and meaningless. On the contrary, OAA is 0% in this extreme case, which reflects the real performance.
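The overprediction pathology described above is easy to reproduce: with toy label sets, predicting every location for every protein drives OLA to 100% while OAA drops to 0%. This sketch uses our own minimal implementations of the two measures.

```python
# Sketch of OLA and OAA, illustrating the overprediction pathology.

def ola(Y, Z):
    # A locative protein counts as correct if its location is predicted.
    hits = sum(len(y & z) for y, z in zip(Y, Z))
    n_locative = sum(len(y) for y in Y)
    return hits / n_locative

def oaa(Y, Z):
    # An actual protein counts as correct only on an exact label-set match.
    return sum(1 for y, z in zip(Y, Z) if y == z) / len(Y)

ALL = {"cap", "mem", "cyt", "nuc"}        # all M locations (toy M = 4)
Y = [{"cyt"}, {"nuc", "cyt"}, {"mem"}]    # true label sets
Z_all = [ALL] * len(Y)                    # degenerate: predict everything

print(ola(Y, Z_all), oaa(Y, Z_all))  # 1.0 0.0
```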

Among all the metrics mentioned above, OAA is the most stringent and objective. This is because if only some (but not all) of the subcellular locations of a query protein are correctly predicted, the numerators of the other measures (equations (8.9)–(8.12) and (8.14)) are nonzero, whereas the numerator of OAA in equation (8.15) is 0 and thus contributes nothing to the frequency count.

8.3 Statistical evaluation methods

In statistical prediction, three methods are often used for testing the generalization capabilities of predictors: independent tests, subsampling tests (or K-fold cross-validation), and leave-one-out cross-validation (LOOCV).

In independent tests, the training set and the test set are fixed, enabling us to obtain a fixed accuracy for the predictors. This kind of method can directly demonstrate the capability of predictors. However, the selection of an independent dataset often bears some sort of arbitrariness [54], which inevitably leads to accuracy estimates that are not bias-free.

For subsampling tests, we use fivefold cross-validation as an example. The whole dataset is randomly divided into five disjoint parts of approximately equal size [142]; the last part may have up to four more examples than the other parts so that every example is evaluated on the model. One part is used as the test set, and the remaining parts are jointly used as the training set. This procedure is repeated five times, each time with a different part chosen as the test set. However, the number of ways of dividing the benchmark dataset is astronomical even for a small dataset; different partitions lead to different results for the same benchmark dataset, so subsampling tests are still liable to statistical arbitrariness. Subsampling tests with a smaller K work faster than those with a larger K, and are thus faster than LOOCV, which can be regarded as N-fold cross-validation, where N is the number of samples in the dataset (N > K). At the same time, subsampling is statistically acceptable and usually regarded as less biased than independent tests.

In LOOCV, every protein in the benchmark dataset is singled out in turn and tested by the classifier trained on the remaining proteins. In each fold of LOOCV, one of the N proteins in the dataset is singled out as the test protein and the remaining (N − 1) proteins are used as the training data. This procedure is repeated N times, each fold with a different protein selected as the test protein, which ensures that every sequence in the dataset is tested. In this case, arbitrariness is avoided, because LOOCV yields a unique outcome for the predictors. Therefore, LOOCV is considered to be the most rigorous and bias-free method [86]. Note that the jackknife cross-validation used in iLoc-Plant and its variants is the same as LOOCV, as mentioned in [54, 230]. Because the term jackknife also refers to methods that estimate the bias and variance of an estimator [1], to avoid confusion we use only the term LOOCV throughout this book.
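The three evaluation schemes differ mainly in how the dataset is partitioned. Below is a minimal sketch of K-fold index generation, with LOOCV as the special case K = N. The contiguous-fold layout (with the last fold absorbing the remainder) is one simple choice for illustration; in practice the division is randomized, as noted above.

```python
# Sketch of K-fold partitioning; LOOCV is the special case K = N.

def kfold_indices(n, k):
    """Split range(n) into k contiguous disjoint folds; the last fold
    absorbs the remainder so that every example is evaluated once."""
    base = n // k
    folds, start = [], 0
    for i in range(k):
        size = base if i < k - 1 else n - base * (k - 1)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def loocv_indices(n):
    return kfold_indices(n, n)  # LOOCV = N-fold cross-validation

folds = kfold_indices(23, 5)
print([len(f) for f in folds])  # [4, 4, 4, 4, 7]
```

In each fold, the listed indices form the test set and all remaining indices form the training set.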

For both single- and multi-label datasets, LOOCV was used for the benchmark datasets, and independent tests were carried out on the novel datasets, with the benchmark datasets of the corresponding species used as the training data.

8.4 Summary

This chapter introduced the experimental setups for both single- and multi-label protein subcellular localization. Dataset construction and performance metrics were presented for the single- and multi-label cases. For both cases, three benchmark datasets and one novel dataset were introduced, and different performance metrics were presented for predicting single- and multi-label proteins. Generally speaking, the datasets and performance metrics for the multi-label case are much more sophisticated than those for the single-label case. Finally, statistical methods for evaluation were elaborated.
