8 Experimental setup

As stated in a comprehensive review [37], valid dataset construction for training and testing and unbiased measurements for evaluating the performance of predictors are two indispensable steps in establishing a statistical protein predictor. This chapter focuses on these two important parts of the experimental setup: dataset construction and performance metrics for predicting single- and multi-label proteins.

8.1 Prediction of single-label proteins

This section will focus on constructing datasets and will introduce performance metrics for single-location proteins which are used in GOASVM and InterProGOSVM.

8.1.1 Dataset construction

8.1.1.1 Datasets for GOASVM

Two benchmark datasets (EU16 [46] and HUM12 [44]) and a novel dataset (NE16) were used to evaluate the performance of GOASVM. The EU16 and HUM12 datasets were created from Swiss-Prot 48.2 in 2005 and Swiss-Prot 49.3 in 2006, respectively. The EU16 comprises 4150 eukaryotic proteins (2423 in the training set and 1727 in the independent test set) with 16 classes, and the HUM12 has 2041 human proteins (919 in the training set and 1122 in the independent test set) with 12 classes. In both datasets, the sequence identity was cut off at 25% by a culling program [219]. Here we use the EU16 dataset as an example to illustrate the details of dataset construction procedures. To obtain high-quality well-defined datasets, the data was strictly screened according to the criteria described below [46]:

  1. Only protein sequences annotated with “eukaryotic” were included, since the current study only focused on eukaryotic proteins.
  2. Sequences annotated with ambiguous or uncertain terms, such as “probably”, “maybe”, “probable”, “potential”, or “by similarity”, were excluded.
  3. Those protein sequences labelled with two or more subcellular locations were excluded because of their lack of uniqueness.
  4. Sequences annotated with “fragments” were excluded, and sequences with less than 50 amino acid residues were removed, since these proteins might be fragments.
  5. To avoid any homology bias, the sequence similarity in the same subcellular location among the obtained dataset was cut off at 25% by a culling program [219] to winnow the redundant sequences.
  6. Subcellular locations (subsets) containing less than 20 protein sequences were left out because they lack statistical significance.
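The sequence-level screening steps above (criteria 2–4) can be sketched as a simple filter. This is a hypothetical sketch: the record fields ("annotation", "locations", "is_fragment", "sequence") are illustrative placeholders rather than the real Swiss-Prot schema, and the homology reduction of criterion 5 would still require an external culling program such as the one cited in [219].

```python
# Hypothetical sketch of screening criteria 2-4 applied to Swiss-Prot-like
# records. Field names are illustrative, not the real Swiss-Prot schema.

AMBIGUOUS_TERMS = ("probably", "maybe", "probable", "potential", "by similarity")

def passes_screening(record):
    """Return True if a record survives criteria 2-4."""
    annotation = record["annotation"].lower()
    # Criterion 2: exclude ambiguous or uncertain annotations.
    if any(term in annotation for term in AMBIGUOUS_TERMS):
        return False
    # Criterion 3: exclude multi-location proteins (single-label dataset).
    if len(record["locations"]) != 1:
        return False
    # Criterion 4: exclude fragments and sequences shorter than 50 residues.
    if record["is_fragment"] or len(record["sequence"]) < 50:
        return False
    return True

records = [
    {"annotation": "nucleus", "locations": ["nucleus"],
     "is_fragment": False, "sequence": "M" * 120},
    {"annotation": "nucleus (by similarity)", "locations": ["nucleus"],
     "is_fragment": False, "sequence": "M" * 120},   # fails criterion 2
    {"annotation": "cytoplasm", "locations": ["cytoplasm", "nucleus"],
     "is_fragment": False, "sequence": "M" * 120},   # fails criterion 3
]
kept = [r for r in records if passes_screening(r)]
print(len(kept))  # only the first record survives
```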

After strictly following the criteria mentioned above, 4150 protein sequences were obtained: 25 cell wall, 21 centriole, 258 chloroplast, 97 cyanelle, 718 cytoplasm, 25 cytoskeleton, 113 endoplasmic reticulum, 806 extracellular, 85 Golgi apparatus, 46 lysosome, 228 mitochondrion, 1169 nucleus, 64 peroxisome, 413 plasma membrane, 38 plastid, and 44 vacuole. This dataset was then further divided into a training dataset (2423 sequences) and a testing dataset (1727 sequences). The numbers of training and testing proteins of each kind are shown in Table 8.1. As can be seen, both the training and testing datasets are quite imbalanced, in that the numbers of proteins in different subcellular locations vary significantly (from 4 to 695). The severe imbalance in the number of proteins in different subcellular locations and the low sequence similarity make classification difficult.

The proteins in HUM12 were screened according to the same criteria mentioned above, except that instead of sequences annotated with “eukaryotic”, sequences annotated with “human” in the ID (identification) field in Swiss-Prot were collected. The breakdown of the HUM12 dataset is shown in Table 8.2.

These two datasets are good benchmarks for performance comparison, because none of the proteins has more than 25% sequence identity with any other protein in the same subcellular location. However, the training and testing sets of these two datasets were constructed during the same period of time. Therefore, the training and testing sets are likely to share similar GO information, causing over-estimation of the prediction accuracy.

Table 8.1: Breakdown of the benchmark dataset for single-label eukaryotic proteins (EU16). EU16 was extracted from Swiss-Prot 48.2. The sequence identity is below 25%.

Label Subcellular location No. of sequences
Training Testing
1 Cell wall 20 5
2 Centriole 17 4
3 Chloroplast 207 51
4 Cyanelle 78 19
5 Cytoplasm 384 334
6 Cytoskeleton 20 5
7 Endoplasmic reticulum 91 22
8 Extracellular 402 404
9 Golgi apparatus 68 17
10 Lysosome 37 9
11 Mitochondrion 183 45
12 Nucleus 474 695
13 Peroxisome 52 12
14 Plasma membrane 323 90
15 Plastid 31 7
16 Vacuole 36 8
Total   2423 1727

Table 8.2: Breakdown of the benchmark dataset for single-label human proteins (HUM12). HUM12 was extracted from Swiss-Prot 49.3. The sequence identity is below 25%.

Label Subcellular location No. of sequences
Training Testing
1 Centriole 20 5
2 Cytoplasm 155 222
3 Cytoskeleton 12 2
4 Endoplasmic reticulum 28 7
5 Extracellular 140 161
6 Golgi apparatus 33 9
7 Lysosome 32 8
8 Microsome 7 1
9 Mitochondrion 125 103
10 Nucleus 196 384
11 Peroxisome 18 5
12 Plasma membrane 153 215
Total   919 1122

To avoid over-estimating the prediction performance and to demonstrate the effectiveness of the predictors, a eukaryotic dataset (NE16) containing novel proteins was constructed using the criteria specified above. To ensure that the proteins are really novel to the predictors, their creation dates should be significantly later than those of the training proteins and also later than the release of the GOA Database. Because EU16 was created in 2005 and the GOA Database used in our experiments was released on March 8, 2011, we selected the proteins that were added to Swiss-Prot between March 8, 2011 and April 18, 2012. Moreover, only proteins with a single subcellular location that falls within the 16 classes of the EU16 dataset were selected. After limiting the sequence similarity to 25%, 608 eukaryotic proteins distributed in 14 subcellular locations (see Table 8.3) were selected.

8.1.1.2 Datasets for FusionSVM

For the fusion of InterProGOSVM and PairProSVM, the performance was evaluated on a benchmark dataset (OE11 [101]), which was created by selecting all eukaryotic proteins with annotated subcellular locations from Swiss-Prot 41.0. The OE11 dataset comprises 3572 proteins in 11 classes. The breakdown of the OE11 dataset is shown in Table 8.4. Specifically, there are 622 cytoplasm, 1188 nuclear, 424 mitochondria, 915 extracellular, 26 Golgi apparatus, 225 chloroplast, 45 endoplasmic reticulum, 7 cytoskeleton, 29 vacuole, 47 peroxisome, and 44 lysosome proteins. The sequence similarity was cut off at 50%.

Table 8.3: Breakdown of the novel single-label eukaryotic-protein dataset for GOASVM (NE16). The NE16 dataset contains proteins that were added to Swiss-Prot between March 8, 2011 and April 18, 2012. The sequence identity of the dataset is below 25%. *: no new proteins were found in the corresponding subcellular location.

Label Subcellular location No. of sequences
1 Cell wall 2
2 Centriole 0*
3 Chloroplast 51
4 Cyanelle 0*
5 Cytoplasm 77
6 Cytoskeleton 4
7 Endoplasmic reticulum 28
8 Extracellular 103
9 Golgi apparatus 14
10 Lysosome 1
11 Mitochondrion 73
12 Nucleus 57
13 Peroxisome 6
14 Plasma membrane 169
15 Plastid 5
16 Vacuole 18
Total   608

Table 8.4: Breakdown of the dataset used for FusionSVM (OE11). OE11 was extracted from Swiss-Prot 41.0, and the sequence similarity is cut off at 50%.

Label Subcellular location No. of sequences
1 Cytoplasm 622
2 Nuclear 1188
3 Mitochondria 424
4 Extracellular 915
5 Golgi apparatus 26
6 Chloroplast 225
7 Endoplasmic reticulum 45
8 Cytoskeleton 7
9 Vacuole 29
10 Peroxisome 47
11 Lysosome 44
Total   3572

Among the 3572 protein sequences, only 3120 have valid GO vectors, i.e. GO vectors with at least one nonzero element, when using InterProScan. For the remaining 452 sequences, InterProScan cannot find any GO terms. Therefore, we used only the sequences with valid GO vectors in our experiments, which reduces the dataset size to 3120 protein sequences.
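Discarding proteins whose GO vectors contain no nonzero element amounts to a simple mask over the feature matrix. This is a minimal sketch with made-up vectors, not the actual OE11 feature extraction:

```python
import numpy as np

# Keep only proteins whose GO vectors have at least one nonzero element
# (i.e. InterProScan retrieved at least one GO term). Toy vectors only.
go_vectors = np.array([
    [0, 1, 0, 2],   # valid: at least one nonzero element
    [0, 0, 0, 0],   # invalid: no GO terms found
    [3, 0, 1, 0],   # valid
])
valid_mask = np.any(go_vectors != 0, axis=1)
valid_vectors = go_vectors[valid_mask]
print(valid_vectors.shape[0])  # 2 proteins survive
```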

8.1.2 Performance metrics

Several performance measures were used, including the overall accuracy (ACC), the overall Matthews correlation coefficient (OMCC) [132], and the weighted average Matthews correlation coefficient (WAMCC) [132]. The latter two measures are based on the Matthews correlation coefficient (MCC) [138]. Specifically, denote $M \in \mathbb{R}^{C \times C}$ as the confusion matrix of the prediction results, where $C$ is the number of subcellular locations and $M_{i,j}$ ($1 \le i, j \le C$) is the number of proteins that actually belong to class $i$ but are predicted as class $j$. We further denote

$$p_c = M_{c,c}, \tag{8.1}$$

$$q_c = \sum_{i=1, i \neq c}^{C} \sum_{j=1, j \neq c}^{C} M_{i,j}, \tag{8.2}$$

$$r_c = \sum_{i=1, i \neq c}^{C} M_{i,c}, \tag{8.3}$$

$$s_c = \sum_{j=1, j \neq c}^{C} M_{c,j}, \tag{8.4}$$

where $c$ ($1 \le c \le C$) is the index of a particular subcellular location. For class $c$, $p_c$ is the number of true positives, $q_c$ is the number of true negatives, $r_c$ is the number of false positives, and $s_c$ is the number of false negatives. Based on these notations, ACC, $\mathrm{MCC}_c$ for class $c$, OMCC, and WAMCC are defined respectively as

$$\mathrm{ACC} = \frac{\sum_{c=1}^{C} M_{c,c}}{\sum_{i=1}^{C}\sum_{j=1}^{C} M_{i,j}}, \tag{8.5}$$

$$\mathrm{MCC}_c = \frac{p_c q_c - r_c s_c}{\sqrt{(p_c + r_c)(p_c + s_c)(q_c + r_c)(q_c + s_c)}}, \tag{8.6}$$

$$\mathrm{OMCC} = \frac{\hat{p}\hat{q} - \hat{r}\hat{s}}{\sqrt{(\hat{p} + \hat{r})(\hat{p} + \hat{s})(\hat{q} + \hat{r})(\hat{q} + \hat{s})}}, \tag{8.7}$$

$$\mathrm{WAMCC} = \sum_{c=1}^{C} \frac{p_c + s_c}{N}\,\mathrm{MCC}_c, \tag{8.8}$$

where $\hat{p} = \sum_{c=1}^{C} p_c$, $\hat{q} = \sum_{c=1}^{C} q_c$, $\hat{r} = \sum_{c=1}^{C} r_c$, $\hat{s} = \sum_{c=1}^{C} s_c$, and $N$ is the total number of proteins.

MCC overcomes the shortcoming of accuracy on imbalanced data, as it prevents the performance measure from being dominated by the majority classes. For example, a classifier that predicts all samples as positive cannot be regarded as a good classifier unless it can also accurately predict negative samples. In this case, the accuracy and MCC of the positive class are 100% and 0%, respectively. Therefore, MCC is a better measure for the classification of imbalanced datasets.
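As a minimal illustration of these metrics (using the notation above, and assuming the pooled-count form of OMCC), the following sketch computes ACC, OMCC, and WAMCC from a confusion matrix and reproduces the imbalance pathology just described: a degenerate majority-class predictor scores high accuracy but zero per-class MCC.

```python
import numpy as np

# Sketch of ACC, per-class MCC, OMCC, and WAMCC from a confusion matrix M,
# following the definitions above; the pooled-count OMCC is an assumption.

def class_counts(M, c):
    """Counts p_c (TP), q_c (TN), r_c (FP), s_c (FN) for class c."""
    p = M[c, c]
    s = M[c, :].sum() - p              # false negatives
    r = M[:, c].sum() - p              # false positives
    q = M.sum() - p - r - s            # true negatives
    return p, q, r, s

def mcc(p, q, r, s):
    denom = np.sqrt(float((p + r) * (p + s) * (q + r) * (q + s)))
    return (p * q - r * s) / denom if denom > 0 else 0.0

def metrics(M):
    N = M.sum()
    acc = np.trace(M) / N
    counts = [class_counts(M, c) for c in range(M.shape[0])]
    # OMCC: MCC computed on the class-wise pooled counts.
    pooled = tuple(sum(t[i] for t in counts) for i in range(4))
    omcc = mcc(*pooled)
    # WAMCC: per-class MCC weighted by class size n_c = p_c + s_c.
    wamcc = sum(((p + s) / N) * mcc(p, q, r, s) for p, q, r, s in counts)
    return acc, omcc, wamcc

# Imbalanced two-class example: predicting everything as class 0 yields
# 90% accuracy but zero per-class MCC (hence WAMCC = 0).
M = np.array([[90, 0],
              [10, 0]])
acc, omcc, wamcc = metrics(M)
print(acc, wamcc)  # 0.9 0.0
```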

8.2 Prediction of multi-label proteins

This section focuses on constructing datasets and introduces performance metrics for multi-location proteins which are used in mGOASVM, AD-SVM, mPLR-Loc, SS-Loc, HybridGO-Loc, RP-SVM, and R3P-Loc.

8.2.1 Dataset construction

For multi-location protein subcellular localization, datasets from three groups are constructed: virus, plant, and eukaryote.

The datasets used for evaluating the proposed multi-label predictors have also been used by other multi-label predictors, including Virus-mPLoc [189], KNN-SVM [121], Plant-mPLoc [51], Euk-mPLoc 2.0 [49], iLoc-Virus [233], iLoc-Plant [230], and iLoc-Euk [52].

The virus dataset was created from Swiss-Prot 57.9. It contains 207 viral proteins distributed in 6 locations (see Table 8.5). Of the 207 viral proteins, 165 belong to one subcellular location, 39 to two locations, 3 to three locations and none to four or more locations. This means that about 20% of proteins are located in more than one subcellular location. The sequence identity of this dataset was cut off at 25 %.

The plant dataset was created from Swiss-Prot 55.3. It contains 978 plant proteins distributed in 12 locations (see Table 8.6). Of the 978 plant proteins, 904 belong to one subcellular location, 71 to two locations, 3 to three locations and none to four or more locations. In other words, 8% of the plant proteins in this dataset are located in multiple locations. The sequence identity of this dataset was cut off at 25 %.

The eukaryotic dataset was created from Swiss-Prot 55.3. It contains 7766 eukaryotic proteins distributed in 22 locations (see Table 8.7). Of the 7766 eukaryotic proteins, 6687 belong to one subcellular location, 1029 to two locations, 48 to three locations, 2 to four locations and none to five or more locations. In other words, about 14% of the eukaryotic proteins in this dataset are located in multiple locations. Similar to the virus dataset, the sequence identity of this dataset was cut off at 25 %.

To further demonstrate the effectiveness of the proposed predictors, a plant dataset containing novel proteins was constructed using the criteria specified in [51, 230]. Specifically, to ensure that the proteins are really novel to the predictors, their creation dates should be significantly later than those of the training proteins (from the plant dataset) and also later than the release of the GOA database. Because the plant dataset was created in 2008 and the GOA database used in our experiments was released on March 8, 2011, we selected the proteins that were added to Swiss-Prot between March 8, 2011 and April 18, 2012. Moreover, only proteins with multiple subcellular locations that fall within the 12 classes specified in Table 8.6 were included. After limiting the sequence similarity to 25%, 175 plant proteins distributed in 12 subcellular locations (see Table 8.8) were selected. Of the 175 plant proteins, 147 belong to one subcellular location, 27 belong to two locations, 1 belongs to three locations, and none to four or more locations. In other words, 16% of the plant proteins in this novel dataset are located in multiple locations.

Table 8.5: Breakdown of the multi-label virus protein dataset. The sequence identity is cut off at 25%. The superscript v stands for the virus dataset.

Label Subcellular location No. of locative proteins
1 Viral capsid 8
2 Host cell membrane 33
3 Host endoplasmic reticulum 20
4 Host cytoplasm 87
5 Host nucleus 84
6 Secreted 20
Total number of locative proteins ($N_{\mathrm{loc}}^{v}$)   252
Total number of actual proteins ($N_{\mathrm{act}}^{v}$)   207

Table 8.6: Breakdown of the multi-label plant protein dataset. The sequence identity is cut off at 25%. The superscript p stands for the plant dataset.

Label Subcellular location No. of locative proteins
1 Cell membrane 56
2 Cell wall 32
3 Chloroplast 286
4 Cytoplasm 182
5 Endoplasmic reticulum 42
6 Extracellular 22
7 Golgi apparatus 21
8 Mitochondrion 150
9 Nucleus 152
10 Peroxisome 21
11 Plastid 39
12 Vacuole 52
Total number of locative proteins ($N_{\mathrm{loc}}^{p}$)   1055
Total number of actual proteins ($N_{\mathrm{act}}^{p}$)   978

Table 8.7: Breakdown of the multi-label eukaryotic protein dataset. The sequence identity is cut off at 25%. The superscript e stands for the eukaryotic dataset.

Label Subcellular location No. of locative proteins
1 Acrosome 14
2 Cell membrane 697
3 Cell wall 49
4 Centrosome 96
5 Chloroplast 385
6 Cyanelle 79
7 Cytoplasm 2186
8 Cytoskeleton 139
9 ER 457
10 Endosome 41
11 Extracellular 1048
12 Golgi apparatus 254
13 Hydrogenosome 10
14 Lysosome 57
15 Melanosome 47
16 Microsome 13
17 Mitochondrion 610
18 Nucleus 2320
19 Peroxisome 110
20 SPI 68
21 Synapse 47
22 Vacuole 170
Total number of locative proteins ($N_{\mathrm{loc}}^{e}$)   8897
Total number of actual proteins ($N_{\mathrm{act}}^{e}$)   7766

Here we take the new plant dataset as an example to illustrate the details of the procedures, which are specified as follows:

  1. Go to the UniProt/SwissProt official webpage (http://www.uniprot.org/).
  2. Go to the “Search” section and select “Protein Knowledgebase (UniProtKB)” (default) in the “Search in” option.
  3. In the “Query” option, select or type “reviewed: yes”.
  4. Select “AND” in the “Advanced Search” option, then select “Taxonomy [OC]” and type in “Viridiplantae”.
  5. Select “AND” in the “Advanced Search” option, then select “Fragment: no”.
  6. Select “AND” in the “Advanced Search” option, then select “Sequence length” and type in “50 - ” (no less than 50).
  7. Select “AND” in the “Advanced Search” option, then select “Date entry integrated” and type in “20110308-20120418”.
  8. Select “AND” in the “Advanced Search” option, then select “Subcellular location: XXX Confidence: Experimental”, where XXX stands for a specific subcellular location. Here it covers the 12 locations: cell membrane, cell wall, chloroplast, cytoplasm, endoplasmic reticulum, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome, plastid, and vacuole.
  9. Further exclude those proteins which are not experimentally annotated (this rechecks the proteins to guarantee that they are all experimentally annotated).

Table 8.8: Breakdown of the new multi-label plant dataset. The dataset contains proteins that were added to Swiss-Prot between 08-Mar-2011 and 18-Apr-2012. The sequence identity of the dataset is below 25%.

Label Subcellular location No. of locative proteins
1 Cell membrane 16
2 Cell wall 1
3 Chloroplast 54
4 Cytoplasm 38
5 Endoplasmic reticulum 9
6 Extracellular 3
7 Golgi apparatus 7
8 Mitochondrion 16
9 Nucleus 46
10 Peroxisome 6
11 Plastid 1
12 Vacuole 7
Total number of locative proteins 204
Total number of actual proteins 175

After selecting the proteins, BLASTClust was applied to reduce the redundancy in the dataset so that no sequence pair has a sequence identity higher than 25%.

8.2.2 Dataset analysis

Because multi-label datasets are more complicated than single-label ones, some analysis should be carried out. To better visualize the distribution of proteins in each subcellular location of these three datasets, their breakdowns are shown in Figures 8.1, 8.2, and 8.3. Figure 8.1 shows that the majority (68%) of viral proteins in the virus dataset are located in the host cytoplasm and host nucleus, while the proteins in the remaining subcellular locations together account for only around one-third. This means that this multi-label dataset is imbalanced across the six subcellular locations. Similar conclusions can be drawn from Figure 8.2, where most of the plant proteins exist in the chloroplast, cytoplasm, nucleus, and mitochondrion, while proteins in the 8 other subcellular locations together account for less than 30%. This imbalance makes prediction on these two multi-label datasets difficult. Figure 8.3 also shows the imbalance of the multi-label eukaryotic dataset, where the majority (78%) of eukaryotic proteins are located in the nucleus, cytoplasm, extracellular, cell membrane, and mitochondrion, while proteins in the 17 other subcellular locations account for less than 22%.


Fig. 8.1: Breakdown of the multi-label virus dataset. The number of proteins shown in each subcellular location represents the number of “locative proteins” [213, 233]. Here, 207 actual proteins have 252 locative proteins. The viral proteins are distributed in six subcellular locations: viral capsid, host cell membrane, host endoplasmic reticulum, host cytoplasm, host nucleus, and secreted.


Fig. 8.2: Breakdown of the multi-label plant dataset. The number of proteins shown in each subcellular location represents the number of ‘locative proteins’ [213, 233]. Here, 978 actual proteins have 1055 locative proteins. The plant proteins are distributed in 12 subcellular locations: cell membrane, cell wall, chloroplast, cytoplasm, endoplasmic reticulum, extracellular, Golgi apparatus, mitochondrion, nucleus, peroxisome, plastid, and vacuole.


Fig. 8.3: Breakdown of the multi-label eukaryotic dataset. The number of proteins shown in each subcellular location represents the number of “locative proteins” [213, 233]. Here, 7766 actual proteins have 8897 locative proteins. The eukaryotic proteins are distributed in 22 subcellular locations: acrosome (ACR), cell membrane (CM), cell wall (CW), centrosome (CEN), chloroplast (CHL), cyanelle (CYA), cytoplasm (CYT), cytoskeleton (CYK), endoplasmic reticulum (ER), endosome (END), extracellular (EXT), Golgi apparatus (GOL), hydrogenosome (HYD), lysosome (LYS), melanosome (MEL), microsome (MIC), mitochondrion (MIT), nucleus (NUC), peroxisome (PER), spindle pole body (SPI), synapse (SYN), and vacuole (VAC).

More detailed statistical properties of these three datasets are listed in Table 8.9. In the table, M denotes the number of subcellular locations and N denotes the number of actual (or distinct) proteins. Besides the commonly used properties for single-label classification, the following measurements [204] are used to explicitly quantify the multi-label properties of the datasets:

  1. Label cardinality (LC). LC is the average number of labels per data instance, defined as $\mathrm{LC} = \frac{1}{N}\sum_{i=1}^{N}|\mathcal{Y}_i|$, where $\mathcal{Y}_i$ is the label set of the $i$-th protein and $|\cdot|$ denotes the cardinality of a set.
  2. Label density (LD). LD is LC normalized by the number of classes, defined as $\mathrm{LD} = \mathrm{LC}/M$.
  3. Distinct label set (DLS). DLS is the number of distinct label combinations in the dataset.
  4. Proportion of distinct label set (PDLS). PDLS is DLS normalized by the number of actual data instances, defined as $\mathrm{PDLS} = \mathrm{DLS}/N$.
  5. Total locative number (TLN). TLN is the total number of locative proteins. This concept is derived from locative proteins in [233], which will be further elaborated in Section 8.2.3.
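These dataset-level statistics can be computed directly from the per-protein label sets. The sketch below uses a toy four-protein dataset; the function name and data are ours, not from the cited work.

```python
# Minimal sketch of the multi-label dataset statistics defined above
# (TLN, LC, LD, DLS, PDLS), computed from a list of per-protein label sets.

def dataset_stats(label_sets, num_classes):
    N = len(label_sets)
    tln = sum(len(ls) for ls in label_sets)          # total locative number
    lc = tln / N                                     # label cardinality
    ld = lc / num_classes                            # label density
    dls = len({frozenset(ls) for ls in label_sets})  # distinct label sets
    pdls = dls / N                                   # proportion of DLS
    return {"TLN": tln, "LC": lc, "LD": ld, "DLS": dls, "PDLS": pdls}

# Toy dataset: 4 proteins over 3 locations; one protein is dual-located.
labels = [{"nucleus"}, {"cytoplasm"}, {"nucleus", "cytoplasm"}, {"nucleus"}]
stats = dataset_stats(labels, num_classes=3)
print(stats["LC"])  # (1 + 1 + 2 + 1) / 4 = 1.25
```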

Table 8.9: Statistical properties of the three multi-label benchmark datasets used in our experiments.

Dataset M N TLN LC LD DLS PDLS
Virus 6 207 252 1.2174 0.2029 17 0.0821
Plant 12 978 1055 1.0787 0.0899 32 0.0327
Eukaryote 22 7766 8897 1.1456 0.0521 112 0.0144

M: number of subcellular locations.
N: number of actual proteins.
LC: label cardinality.
LD: label density.
DLS: distinct label set.
PDLS: proportion of distinct label set.
TLN: total locative number.

Among these measurements, LC measures the degree of multi-labels in a dataset. For a single-label dataset, LC = 1; for a multi-label dataset, LC > 1, and the larger the LC, the higher the degree of multi-labels. LD takes into consideration the number of classes in the classification problem: for two datasets with the same LC, the lower the LD, the more difficult the classification. DLS represents the number of possible label combinations in the dataset; the higher the DLS, the more complicated the composition. PDLS represents the degree of distinct labels in a dataset; the larger the PDLS, the more likely it is that the individual label sets differ from each other. From Table 8.9, we can see that although the virus dataset (N = 207, TLN = 252) has fewer proteins than the plant dataset (N = 978, TLN = 1055) and the eukaryotic dataset (N = 7766, TLN = 8897), the former (LC = 1.2174, LD = 0.2029) is a denser multi-label dataset than the latter two (LC = 1.0787, LD = 0.0899 and LC = 1.1456, LD = 0.0521).

8.2.3 Performance metrics

Compared to traditional single-label classification, multi-label classification requires more complicated performance metrics to better reflect the multi-label capabilities of classifiers. Conventional single-label measures need to be modified for multi-label classification. These measures include accuracy, precision, recall, F1-score (F1), and Hamming loss (HL) [60, 74]. Specifically, denote $\mathcal{Y}_i$ and $\mathcal{Z}_i$ as the true label set and the predicted label set for the $i$-th protein $\mathbf{q}_i$ ($i = 1, \ldots, N$), respectively. Then the five measurements are defined as follows:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Y}_i \cup \mathcal{Z}_i|}, \tag{8.9}$$

$$\mathrm{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Z}_i|}, \tag{8.10}$$

$$\mathrm{Recall} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Y}_i|}, \tag{8.11}$$

$$\mathrm{F1} = \frac{1}{N}\sum_{i=1}^{N}\frac{2\,|\mathcal{Y}_i \cap \mathcal{Z}_i|}{|\mathcal{Y}_i| + |\mathcal{Z}_i|}, \tag{8.12}$$

$$\mathrm{HL} = \frac{1}{N}\sum_{i=1}^{N}\frac{|\mathcal{Y}_i \cup \mathcal{Z}_i| - |\mathcal{Y}_i \cap \mathcal{Z}_i|}{M}, \tag{8.13}$$

where | · | means counting the number of elements in the set therein and ∩ represents the intersection of sets.

Accuracy, precision, recall, and F1 indicate the classification performance: the higher these measures, the better the prediction performance. Among them, accuracy is the most commonly used criterion. F1 is the harmonic mean of precision and recall, which allows us to compare the performance of classification systems by taking the trade-off between precision and recall into account. The Hamming loss (HL) [60, 74] is different from the other metrics. As can be seen from equation (8.13), when all of the proteins are correctly predicted, i.e. $\mathcal{Z}_i = \mathcal{Y}_i$ for all $i$, then HL = 0, whereas the other metrics equal 1. On the other hand, when the predictions of all proteins are completely wrong, i.e. $\mathcal{Y}_i \cap \mathcal{Z}_i = \emptyset$ and $\mathcal{Y}_i \cup \mathcal{Z}_i$ covers all $M$ locations, then HL = 1, whereas the other metrics equal 0. Therefore, the lower the HL, the better the prediction performance.
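The set-based definitions above translate almost literally into code. This is a minimal sketch using toy label sets (the location names are made up):

```python
# Sketch of the five multi-label measures: accuracy, precision, recall,
# F1, and Hamming loss. Y holds true label sets, Z predicted label sets,
# and M is the number of subcellular locations.

def multilabel_metrics(Y, Z, M):
    N = len(Y)
    acc = sum(len(y & z) / len(y | z) for y, z in zip(Y, Z)) / N
    prec = sum(len(y & z) / len(z) for y, z in zip(Y, Z)) / N
    rec = sum(len(y & z) / len(y) for y, z in zip(Y, Z)) / N
    f1 = sum(2 * len(y & z) / (len(y) + len(z)) for y, z in zip(Y, Z)) / N
    # Hamming loss: size of the symmetric difference, normalized by N and M.
    hl = sum(len(y ^ z) for y, z in zip(Y, Z)) / (N * M)
    return acc, prec, rec, f1, hl

# Toy example: second protein is dual-located but only partially predicted.
Y = [{"nuc"}, {"cyt", "mit"}]
Z = [{"nuc"}, {"cyt"}]
acc, prec, rec, f1, hl = multilabel_metrics(Y, Z, M=3)
print(acc)  # (1 + 0.5) / 2 = 0.75
```

Note that precision is undefined when a predicted label set is empty; a full implementation would have to guard against that case.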

Two additional measurements [213, 233] are often used in multi-label subcellular localization prediction. They are overall locative accuracy (OLA) and overall actual accuracy (OAA). The former is given by

$$\mathrm{OLA} = \frac{1}{\sum_{i=1}^{N}|\mathcal{Y}_i|}\sum_{i=1}^{N}|\mathcal{Y}_i \cap \mathcal{Z}_i|, \tag{8.14}$$

and the overall actual accuracy (OAA) is

$$\mathrm{OAA} = \frac{1}{N}\sum_{i=1}^{N}\Delta(\mathcal{Y}_i, \mathcal{Z}_i), \tag{8.15}$$

where

$$\Delta(\mathcal{Y}_i, \mathcal{Z}_i) = \begin{cases}1, & \text{if } \mathcal{Y}_i = \mathcal{Z}_i,\\ 0, & \text{otherwise.}\end{cases} \tag{8.16}$$

To explain equations (8.14) and (8.15), the concepts of locative proteins and actual proteins are introduced here. If a protein exists in two different subcellular locations, it is counted as two locative proteins; if a protein coexists in three locations, it is counted as three locative proteins; and so forth. But no matter how many subcellular locations a protein simultaneously resides in, it is counted as only one actual protein. Mathematically, denote $N_{\mathrm{loc}}$ as the total number of locative proteins, $N_{\mathrm{act}}$ as the total number of actual proteins, $M$ as the number of subcellular locations, and $n_{\mathrm{act}}(m)$ ($m = 1, \ldots, M$) as the number of actual proteins coexisting in $m$ subcellular locations. Then $N_{\mathrm{act}}$ and $N_{\mathrm{loc}}$ can be expressed as

$$N_{\mathrm{act}} = \sum_{m=1}^{M} n_{\mathrm{act}}(m) \tag{8.17}$$

and

$$N_{\mathrm{loc}} = \sum_{m=1}^{M} m \cdot n_{\mathrm{act}}(m). \tag{8.18}$$

In the virus dataset, M = 6; in the plant dataset, M = 12; and in the eukaryotic dataset, M = 22. Then, from equations (8.17) and (8.18), we obtain

$$N_{\mathrm{act}}^{v} = 165 + 39 + 3 = 207, \tag{8.19}$$

$$N_{\mathrm{loc}}^{v} = 1 \times 165 + 2 \times 39 + 3 \times 3 = 252, \tag{8.20}$$

$$N_{\mathrm{act}}^{p} = 904 + 71 + 3 = 978, \tag{8.21}$$

$$N_{\mathrm{loc}}^{p} = 1 \times 904 + 2 \times 71 + 3 \times 3 = 1055, \tag{8.22}$$

$$N_{\mathrm{act}}^{e} = 6687 + 1029 + 48 + 2 = 7766, \tag{8.23}$$

$$N_{\mathrm{loc}}^{e} = 1 \times 6687 + 2 \times 1029 + 3 \times 48 + 4 \times 2 = 8897, \tag{8.24}$$

where the superscripts v, p, and e stand for the virus, plant, and eukaryotic datasets, respectively. Thus, for the virus dataset, 207 actual proteins correspond to 252 locative proteins; for the plant dataset, 978 actual proteins correspond to 1055 locative proteins; and for the eukaryotic dataset, 7766 actual proteins correspond to 8897 locative proteins.
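Equations (8.17) and (8.18) can be checked against all three datasets using the per-multiplicity counts quoted above:

```python
# Sketch verifying the actual/locative protein counts: n_act maps m (the
# number of locations a protein resides in) to the number of such proteins.

def actual_and_locative(n_act):
    n_actual = sum(n_act.values())                        # equation (8.17)
    n_locative = sum(m * count for m, count in n_act.items())  # (8.18)
    return n_actual, n_locative

# Multiplicity counts taken from the virus, plant, and eukaryotic datasets.
virus = {1: 165, 2: 39, 3: 3}
plant = {1: 904, 2: 71, 3: 3}
euk = {1: 6687, 2: 1029, 3: 48, 4: 2}

print(actual_and_locative(virus))  # (207, 252)
```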

According to equation (8.14), a locative protein is considered to be correctly predicted if any of the predicted labels matches any labels in the true label set. On the other hand, equation (8.15) suggests that an actual protein is considered to be correctly predicted only if all of the predicted labels exactly match those in the true label set. For example, for a protein coexisting in, say, three subcellular locations, if only two of the three are correctly predicted, or if the predicted result contains a location not belonging to the three, the prediction is considered to be incorrect. In other words, only when all of the subcellular locations of a query protein are exactly predicted, without any overprediction or underprediction, is the prediction considered correct. Therefore, OAA is a more stringent measure than OLA. OAA is also more objective than OLA, because locative accuracy is liable to give biased performance measures when the predictor tends to overpredict, i.e. giving large $|\mathcal{Z}_i|$ for many proteins $\mathbf{q}_i$. In the extreme case, if every protein is predicted to have all of the M subcellular locations, then according to equation (8.14) the OLA is 100%, even though the predictions are obviously wrong and meaningless. On the contrary, OAA is 0% in this extreme case, which reflects the real performance.
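The overprediction pathology described above is easy to reproduce: with toy label sets, predicting every location for every protein drives OLA to 100% while OAA drops to 0%. This sketch uses our own minimal implementations of the two measures.

```python
# Sketch of OLA and OAA, illustrating the overprediction pathology.

def ola(Y, Z):
    # A locative protein counts as correct if its location is predicted.
    hits = sum(len(y & z) for y, z in zip(Y, Z))
    n_locative = sum(len(y) for y in Y)
    return hits / n_locative

def oaa(Y, Z):
    # An actual protein counts as correct only on an exact label-set match.
    return sum(1 for y, z in zip(Y, Z) if y == z) / len(Y)

ALL = {"cap", "mem", "cyt", "nuc"}        # all M locations (toy M = 4)
Y = [{"cyt"}, {"nuc", "cyt"}, {"mem"}]    # true label sets
Z_all = [ALL] * len(Y)                    # degenerate: predict everything

print(ola(Y, Z_all), oaa(Y, Z_all))  # 1.0 0.0
```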

Among all the metrics mentioned above, OAA is the most stringent and objective. This is because if only some (but not all) of the subcellular locations of a query protein are correctly predicted, the numerators of the other measures (equations (8.9)–(8.12) and (8.14)) are nonzero, whereas the numerator of OAA in equation (8.15) is 0 and thus contributes nothing to the frequency count.

8.3 Statistical evaluation methods

In statistical prediction, three methods are often used for testing the generalization capabilities of predictors: independent tests, subsampling tests (or K-fold cross-validation), and leave-one-out cross-validation (LOOCV).

In independent tests, the training set and the test set are fixed, enabling us to obtain a fixed accuracy for the predictors. This kind of method can directly demonstrate the capability of predictors. However, the selection of an independent dataset often bears some sort of arbitrariness [54], which inevitably leads to accuracy estimates that are not bias-free.

For subsampling tests, we use fivefold cross-validation as an example. The whole dataset is randomly divided into five disjoint parts of approximately equal size [142]; the last part may have up to four more examples than the other parts so that every example is evaluated on the model. One part is used as the test set, and the remaining parts are jointly used as the training set. This procedure is repeated five times, each time with a different part chosen as the test set. However, the number of ways of dividing the benchmark dataset is astronomical even for a small dataset; different partitions lead to different results for the same benchmark dataset, so subsampling tests are still liable to statistical arbitrariness. Subsampling tests with a smaller K work faster than those with a larger K, and are thus faster than LOOCV, which can be regarded as N-fold cross-validation, where N is the number of samples in the dataset (N > K). At the same time, subsampling is statistically acceptable and usually regarded as less biased than independent tests.

In LOOCV, every protein in the benchmark dataset is singled out in turn and tested by the classifier trained on the remaining proteins. In each fold of LOOCV, one of the N proteins in the dataset is singled out as the test protein and the remaining (N − 1) proteins are used as the training data. This procedure is repeated N times, each fold with a different protein selected as the test protein, which ensures that every sequence in the dataset is tested. In this case, arbitrariness is avoided, because LOOCV yields a unique outcome for the predictors. Therefore, LOOCV is considered to be the most rigorous and bias-free method [86]. Note that the jackknife cross-validation used in iLoc-Plant and its variants is the same as LOOCV, as mentioned in [54, 230]. Because the term jackknife also refers to methods that estimate the bias and variance of an estimator [1], to avoid confusion we use only the term LOOCV throughout this book.
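The three evaluation schemes differ mainly in how the dataset is partitioned. Below is a minimal sketch of K-fold index generation, with LOOCV as the special case K = N. The contiguous-fold layout (with the last fold absorbing the remainder) is one simple choice for illustration; in practice the division is randomized, as noted above.

```python
# Sketch of K-fold partitioning; LOOCV is the special case K = N.

def kfold_indices(n, k):
    """Split range(n) into k contiguous disjoint folds; the last fold
    absorbs the remainder so that every example is evaluated once."""
    base = n // k
    folds, start = [], 0
    for i in range(k):
        size = base if i < k - 1 else n - base * (k - 1)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def loocv_indices(n):
    return kfold_indices(n, n)  # LOOCV = N-fold cross-validation

folds = kfold_indices(23, 5)
print([len(f) for f in folds])  # [4, 4, 4, 4, 7]
```

In each fold, the listed indices form the test set and all remaining indices form the training set.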

For both single- and multi-label datasets, LOOCV was used for the benchmark datasets, and independent tests were carried out on the novel datasets, with the benchmark datasets of the corresponding species used as the training data.

8.4 Summary

This chapter introduced the experimental setups for both single- and multi-label protein subcellular localization. Dataset construction and performance metrics were presented for the single- and multi-label cases. For both cases, three benchmark datasets and one novel dataset were introduced, and different performance metrics were presented for predicting single- and multi-label proteins. Generally speaking, the datasets and performance metrics for the multi-label case are much more sophisticated than those for the single-label case. Finally, statistical methods for evaluation were elaborated.
