Chapter 6

Review of Recent Protein-Protein Interaction Techniques

Maad Shatnawi    Higher Colleges of Technology, Abu Dhabi, UAE

Abstract

Protein-protein interactions (PPIs) play a crucial role in cellular functions and biological processes in all organisms. The identification of protein interactions can lead to a better understanding of infection mechanisms and to the development of new drugs and optimized treatments. Several physicochemical experimental techniques have been applied to identify PPIs. However, these techniques are expensive and time consuming, and they have covered only a small portion of the complete PPI networks. As a result, the need for computational techniques to validate experimental results and to predict undiscovered PPIs has increased. This chapter investigates and compares most of the recent computational PPI prediction techniques and discusses the technical challenges and open issues in this domain.

Keywords

Protein-protein interaction (PPI) prediction

protein-protein interaction (PPI)

protein sequences

computational techniques

1 Introduction

Proteins are the building blocks of all living organisms. The primary structure of a protein is the linear sequence of its amino acid (AA) units, starting from the amino-terminal residue (N-terminal) to the carboxyl-terminal residue (C-terminal). Amino acids consist of carbon, hydrogen, oxygen, and nitrogen atoms that are clustered into functional groups. All amino acids have the same general structure, but each has a different R group. The carbon atom to which the R group is connected is called the alpha carbon. There are 20 amino acids in proteins; they are connected by a chemical reaction in which a molecule of water is removed, leaving two amino acid residues connected by a peptide bond. These 20 amino acids are alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine. They are represented by the one-letter abbreviations A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, and V, respectively (Berg et al., 2002).

The secondary structure of a protein is the general three-dimensional form of its local parts. The most common secondary structures are alpha (α) helices and beta (β) sheets. The α-helix is a right-handed spiral array, while the β-sheet is made of beta strands connected crosswise by two or more hydrogen bonds, forming a twisted, pleated sheet. These secondary structures are linked by tight turns and loose, flexible loops (Berg et al., 2002). Protein domains are the basic functional units of protein tertiary structures. A protein domain is a conserved part of a protein sequence that can evolve, function, and exist independently. Each domain forms a three-dimensional structure and can be stable and folded independently. Several domains are joined in different combinations to form multidomain protein sequences (Chothia, 1992; Yoo et al., 2008).

A protein interacts with other proteins to perform certain tasks. Protein-protein interaction (PPI) occurs at almost every level of cell function. The identification of interactions among proteins provides a global picture of cellular functions and biological processes. Since most biological processes involve one or more protein-protein interactions, the accurate identification of the set of interacting proteins in an organism is very useful for deciphering the molecular mechanisms underlying given biological functions, as well as for assigning functions to unknown proteins based on their interacting partners (Zaki et al., 2009; Xenarios and Eisenberg, 2001; Kim et al., 2002). Protein interaction prediction is also a fundamental step in the construction of PPI networks for humans and other organisms. The identification of possible viral-host protein interactions can lead to a better understanding of infection mechanisms and, in turn, to the development of new drugs and optimized treatments. Abnormal PPIs are implicated in several neurological disorders, including Creutzfeldt-Jakob and Alzheimer’s diseases (Bader and Hogue, 2002; Von Mering et al., 2002; Qi et al., 2006). Therefore, the development of accurate and reliable methods for identifying PPIs has an important impact on several areas of protein research.

This chapter, as an extension of a previous paper (Shatnawi, 2014), provides a comprehensive and comparative study and categorization of the existing computational approaches in PPI prediction. It also discusses the technical challenges and open issues in this field. The rest of this chapter is organized as follows: The next section addresses the key technical challenges that face PPI prediction and the open issues in this field. Section 3 discusses the performance measures that are typically used in PPI prediction. Section 4 provides a comprehensive description and comparison of the most current computational PPI predictors. Concluding remarks are presented in Section 5.

2 Technical challenges and open issues

Several technical challenges face the computational analysis of protein sequences in general and PPI prediction in particular. First, a huge number of new protein sequences have been discovered in the post-genomic era. Second, protein chains are typically long, which makes them difficult, time consuming, and expensive to characterize by experimental methods. Third, large, comprehensive, and accurate benchmark data sets are required for the training and evaluation of prediction methods. Fourth, appropriate performance measures are needed to evaluate the significance of the predictors, so that the number of false positives and false negatives can be minimized. Fifth, it is difficult to distinguish between novel interactions and false positives. Sixth, computational PPI methods are based on experimentally collected data, and therefore, any error in the experimental data will affect the computational PPI predictions.

One of the challenges of protein prediction methods is protein representation. Protein prediction methods vary in protein representation and feature extraction in order to build their classification models. Two kinds of models have generally been used to represent protein samples: the sequential model and the discrete model. The simplest sequential model for a protein is its entire AA sequence. However, this representation does not work well when the query protein does not have high sequence similarity to any attribute-known proteins. Several nonsequential models, or discrete models, have been proposed. The simplest discrete model is AA composition, which is the normalized occurrence frequencies of the 20 native amino acids in a protein. However, all the sequence-order knowledge is lost with this representation, which in turn negatively affects the prediction quality (Chou, 2011). Some approaches use AA physicochemical properties, while others use pairwise similarity. Some approaches are template-based, while others are statistical or machine learning (ML)-based.

ML-based protein interaction prediction methods face various challenges. Selecting the best ML approach is itself a great challenge: the available techniques vary in accuracy, robustness, complexity, computational cost, data diversity, susceptibility to overfitting, and ability to deal with missing attributes and different features. Most ML approaches for protein sequences are computationally expensive and often suffer from low prediction accuracy. They are also susceptible to overfitting (Melo et al., 2003).

Most PPI prediction approaches have achieved a reasonable performance on balanced data sets containing an equal number of interacting and noninteracting protein pairs. However, this ratio is highly imbalanced in nature, and these approaches have not been comprehensively assessed with respect to the effect of the large number of noninteracting pairs in realistic data sets. In addition, since highly imbalanced distributions usually lead to large data sets, more efficient prediction methods, algorithmic optimizations, and continued improvements in hardware performance are required to handle such challenging tasks.

3 Performance measures

There are several performance measures that are used to evaluate a PPI predictor and compare it with other approaches. The most frequently used evaluation measures in this field are accuracy, sensitivity, specificity, precision, F-measure (F1), Matthews correlation coefficient (MCC), receiver operating characteristic (ROC), and AUC (Area Under the ROC Curve).

Accuracy (Ac) is the proportion of correctly predicted interacting and noninteracting protein pairs to all of the protein pairs listed in the data set. Sensitivity, or recall (R), is the proportion of correctly predicted interacting protein pairs to all of the interacting protein pairs listed in the data set. Precision (P) is the proportion of correctly predicted interacting protein pairs to all of the predicted interacting protein pairs. Specificity (Sp) is the proportion of correctly predicted noninteracting protein pairs to all the noninteracting protein pairs listed in the data set. These metrics can be represented mathematically as follows:

\[
\mathrm{Ac} = \frac{TP + TN}{TP + TN + FN + FP} \tag{6.1}
\]

\[
R = \frac{TP}{TP + FN} \tag{6.2}
\]

\[
P = \frac{TP}{TP + FP} \tag{6.3}
\]

\[
\mathrm{Sp} = \frac{TN}{TN + FP}, \tag{6.4}
\]

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

F1 is an evaluation metric that combines precision and recall into a single value. It is defined as the harmonic mean of precision and recall (Sasaki, 2007; Powers, 2011):

\[
F_1 = \frac{2PR}{P + R}. \tag{6.5}
\]

The Matthews correlation coefficient (MCC) is a measure that balances prediction sensitivity and specificity. MCC ranges from −1, indicating an inverse prediction, through 0, which corresponds to a random classifier, to 1 for perfect prediction, and is calculated as follows:

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}. \tag{6.6}
\]
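As an illustration, the measures in Eqs. (6.1) to (6.6) can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `ppi_metrics` and the convention of returning 0 when a denominator is empty are assumptions made for this example and are not part of any reviewed method.

```python
import math


def ppi_metrics(tp, tn, fp, fn):
    """Compute the measures of Eqs. (6.1)-(6.6) from confusion-matrix counts."""

    def ratio(num, den):
        # Assumed convention: return 0.0 when the denominator is empty
        # (e.g., when no pair was predicted as interacting).
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fn + fp)           # Eq. (6.1)
    recall = ratio(tp, tp + fn)                            # Eq. (6.2)
    precision = ratio(tp, tp + fp)                         # Eq. (6.3)
    specificity = ratio(tn, tn + fp)                       # Eq. (6.4)
    f1 = ratio(2 * precision * recall, precision + recall)  # Eq. (6.5)
    mcc_den = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)                # Eq. (6.6)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }
```

For example, `ppi_metrics(tp=80, tn=90, fp=10, fn=20)` gives an accuracy of 0.85, a recall of 0.80, and a specificity of 0.90.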

The receiver operating characteristic (ROC) curve is created by plotting the true positive rate (Recall) against the false positive rate (1 − Specificity) at various threshold settings. AUC is the area under the ROC curve, and it represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUC can also be interpreted as the average recall over the entire range of possible specificity, or the average specificity over the entire range of possible recalls (Fawcett, 2006; Hanley and McNeil, 1982; Metz, 1978).
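The ranking interpretation of AUC can be made concrete with a short sketch that counts, over all positive-negative pairs, how often the positive instance receives the higher score. The function name `auc_by_ranking` and the choice of counting ties as one half are assumptions for this illustration; the quadratic loop is meant for clarity, not for large data sets.

```python
def auc_by_ranking(pos_scores, neg_scores):
    """Estimate AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties are counted as half a correct ranking (an assumed convention).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

With scores `[0.9, 0.8, 0.4]` for interacting pairs and `[0.7, 0.3, 0.2]` for noninteracting pairs, eight of the nine pairs are ranked correctly, so the estimate is 8/9.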

4 Computational approaches

PPI prediction has been studied extensively, and a large number of approaches have been proposed. These approaches can be classified into physicochemical experimental and computational approaches. Physicochemical experimental techniques identify the physicochemical interactions between proteins, which, in turn, are used to predict the functional relationships between them. These techniques include yeast two-hybrid based methods (Bartel and Fields, 1997), mass spectrometry (Gavin et al., 2002), tandem affinity purification (Rigaut et al., 1999), protein chips (Zhu et al., 2001), and hybrid approaches (Tong et al., 2002). Although these techniques have succeeded in identifying several important interacting proteins in several species such as Saccharomyces cerevisiae (yeast), Drosophila, and Helicobacter pylori (Shen et al., 2007), they are expensive and time consuming, and so far the identified PPIs cover only a small portion of the complete PPI network. As a result, the need for computational tools to validate physicochemical experimental results and to predict undiscovered PPIs has increased (Zaki et al., 2009; Szilágyi et al., 2005).

Several computational methods have been proposed for PPI prediction. They can be classified into sequence-based and structure-based methods according to the protein features they use. Sequence-based methods utilize AA features and can be further categorized into statistical and ML-based methods. Structure-based methods use three-dimensional structural features (Porollo and Meller, 2012) and can be categorized into template-based, statistical, and ML-based methods. This section provides an overview and discussion of some of the current computational sequence-based and structure-based PPI prediction approaches.

4.1 Sequence-based approaches

Sequence-based PPI prediction methods utilize AA features such as hydrophobicity, physiochemical properties, evolutionary profiles, AA composition, AA mean, or weighted average over a sliding window (Porollo and Meller, 2012). Sequence-based methods can be categorized into statistical and ML-based methods. This section presents and evaluates some of the existing sequence-based approaches.

4.1.1 Statistical sequence-based approaches

This section presents and describes several existing statistical sequence-based PPI prediction approaches.

4.1.1.1 Mirror tree method

Pazos and Valencia (2001) introduced the mirror tree method based on the comparison of the evolutionary distances between the sequences of the associated protein families and using the topological similarity of phylogenetic trees to predict PPIs. These distances were calculated as the average value of the residue similarities taken from the McLachlan amino acid homology matrix (McLachlan, 1971). The similarity between trees was calculated as the correlation between the distance matrices used to build the trees.

The mirror tree method does not require the creation of the phylogenetic trees; only the underlying distance matrices are analyzed, and therefore, this approach is independent of any given tree-construction method. Although the mirror tree method does not require the presence of fully sequenced genomes, it requires orthologous proteins in all the species under consideration. As a result, as more species’ genomes are included, fewer proteins satisfy this requirement and can be analyzed. In addition, the method is restricted to cases where at least 11 sequences from the same species are available for both proteins. This minimum limit was set empirically as a compromise between being small enough to provide enough cases and large enough for the matrices to contain sufficient information. The approach can be improved by collecting sequences from a larger number of genomes, which increases the number of protein pairs that can be tested. Further, since the distance matrices are not a perfect representation of the corresponding phylogenetic trees, it is possible that some inaccuracies are introduced by comparing distance matrices instead of the real phylogenetic trees.
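The core comparison in the mirror tree method, the correlation between two distance matrices, can be sketched as follows. This is a minimal illustration assuming that both matrices are symmetric, list the same species in the same order, and that the linear (Pearson) correlation coefficient over the upper triangles is used; the function names are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle entries of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mirror_tree_score(dist_a, dist_b):
    """Pearson correlation between the pairwise distances of two protein families.

    dist_a and dist_b are square distance matrices whose rows and columns
    refer to the same species in the same order (an assumption of this sketch).
    """
    x = upper_triangle(dist_a)
    y = upper_triangle(dist_b)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("matrices must have the same size, with at least 3 species")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant distances
    return cov / math.sqrt(var_x * var_y)
```

A score close to 1 indicates that the two families have evolved in a similar way, which the method takes as evidence of a possible interaction.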

4.1.1.2 PIPE

Pitre et al. (2006) introduced the Protein-protein Interaction Prediction Engine (PIPE) to estimate the likelihood of interactions between pairs of Saccharomyces cerevisiae proteins using protein primary structure information. PIPE is based on the assumption that interactions between proteins are mediated by a finite number of short polypeptide sequences observed in a database of known interacting protein pairs. These sequences are typically shorter than the classical domains and reoccur in different proteins within the cell. PIPE estimates the likelihood of a PPI by measuring the reoccurrence of these short polypeptides within known interacting protein pairs. To determine whether two proteins, A and B, interact, the two proteins are scanned for similarity to a database of known interacting protein pairs. For each known interacting pair (X, Y), PIPE uses sliding windows to compare the AA residues in protein A against those in X and the residues in protein B against those in Y, and then measures how many times a window of protein A finds a match in X while, at the same time, a window in protein B matches a window in Y. These matches are counted and added up in a two-dimensional matrix. A positive protein interaction is predicted when the reoccurrence count in certain cells of the matrix exceeds a predefined threshold value. PIPE was evaluated on a randomly selected set of 100 interacting yeast protein pairs and 100 noninteracting pairs from the database of interacting proteins (DIP; http://dip.doe-mbi.ucla.edu; also see Salwinski et al., 2004) and Munich Information Center for Protein Sequences (MIPS; Mewes et al., 2002) databases. PIPE showed a prediction sensitivity of 0.61 and specificity of 0.89.
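The window-matching procedure described above can be sketched in a few lines. This is a simplified illustration and not the PIPE implementation: the window width, the similarity test (a minimum number of identical positions), and all function names are assumptions made for the example.

```python
def windows(seq, width):
    """All contiguous windows of the given width in a sequence."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


def similar(win_a, win_b, min_identity):
    """Assumed similarity test: enough identical residues at the same positions."""
    return sum(a == b for a, b in zip(win_a, win_b)) >= min_identity


def pipe_matrix(prot_a, prot_b, known_pairs, width=5, min_identity=4):
    """Count co-occurring window matches against known interacting pairs.

    counts[i][j] is the number of known pairs (X, Y) for which window i of
    protein A matches some window of X and window j of protein B matches
    some window of Y.
    """
    wins_a = windows(prot_a, width)
    wins_b = windows(prot_b, width)
    counts = [[0] * len(wins_b) for _ in wins_a]
    for x, y in known_pairs:
        wins_x = windows(x, width)
        wins_y = windows(y, width)
        hits_a = [i for i, wa in enumerate(wins_a)
                  if any(similar(wa, wx, min_identity) for wx in wins_x)]
        hits_b = [j for j, wb in enumerate(wins_b)
                  if any(similar(wb, wy, min_identity) for wy in wins_y)]
        for i in hits_a:
            for j in hits_b:
                counts[i][j] += 1
    return counts


def predict_interaction(counts, threshold):
    """Predict an interaction if any cell of the matrix exceeds the threshold."""
    return any(c > threshold for row in counts for c in row)
```

Because every query rescans all known pairs, the cost grows with the size of the interaction database, which reflects why repeated scanning is expensive.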

Because PIPE relies on protein primary structure information alone, without any previous knowledge about the higher structure, domain composition, evolutionary conservation, or function of the target proteins, it can identify interactions of protein pairs for which limited structural information is available. PIPE has the following limitations. First, it is computationally intensive and requires hours of computation per protein pair, as it rescans the interaction library for every query. Second, PIPE is weak at detecting novel interactions in genomewide, large-scale data sets, as it reported a large number of false positives. Third, PIPE was evaluated on uncertain interaction data that were determined using several methods, each with limited accuracy.

Pitre et al. (2008) then developed PIPE2 as an improved and more efficient version of PIPE, which showed a specificity of 0.999. PIPE2 represents AA sequences in a binary code, which speeds up searching the similarity matrix. Unlike the original PIPE that scans the interaction database repeatedly every time, PIPE2 precomputes all window comparisons in advance and stores them on a local disk.

Although PIPE2 achieves high specificity, its sensitivity is only 0.146, which means that it misses most true interactions. The rate of false positives can be reduced further by incorporating other information about the target protein pairs, including subcellular localization or functional annotation. A major limitation of PIPE2 is that it relies exclusively on a database of preexisting interaction pairs for the identification of reoccurring short polypeptide sequences; so in the absence of sufficient data, PIPE2 will be ineffective. PIPE2 is also less effective for motifs that span discontinuous primary sequence, as it does not account for gaps within the short polypeptide sequences.

4.1.1.3 CD

Liu et al. (2013) introduced a sequence-based coevolution PPI prediction method for human proteins. The authors defined coevolutionary divergence (CD) based on two assumptions. First, interacting protein pairs may have similar substitution rates. Second, protein interactions are more likely to be conserved across related species. CD is defined as the absolute value of the substitution rate difference between two proteins. It can be used to predict PPIs, as the CD values of interacting protein pairs are expected to be smaller than those of noninteracting pairs. The method was evaluated using 172,338 protein sequences obtained from the Evola database (Matsuya et al., 2008) for Homo sapiens and their orthologous protein sequences in 13 different vertebrates. The PPI data set was downloaded from the Human Protein Reference Database (Prasad et al., 2009). Pairwise alignment of the orthologous proteins was performed with the ClustalW2 software. The CD values of protein pairs were then used to construct the likelihood ratio table of interacting protein pairs.
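The definition of CD can be illustrated with a small sketch. The absolute difference follows the definition given above; how the substitution rate itself is estimated is an assumption of this example, which simply uses the fraction of differing residues over gap-free columns of a pairwise alignment between a protein and its ortholog. The function names are hypothetical.

```python
def substitution_rate(aligned_a, aligned_b):
    """Fraction of gap-free alignment columns whose residues differ.

    This simple estimate is an assumption of the sketch; the rate model
    used by the original authors may differ.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    return sum(a != b for a, b in columns) / len(columns)


def coevolutionary_divergence(rate_a, rate_b):
    """CD: absolute difference between the substitution rates of two proteins."""
    return abs(rate_a - rate_b)
```

Under the first assumption of the method, a pair whose two rates are nearly equal (a small CD) is a more plausible interacting pair than one whose rates differ widely.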

The CD method combines coevolutionary information of interacting protein pairs from many species. The method does not use multiple alignments and therefore takes less time than alignment-based methods such as the mirror tree method. It is also not limited to proteins that have orthologs in all species under consideration. However, increasing the number of species provides more information and can improve the accuracy of the CD method. Although this method can rank the likelihood of interaction for a given pair of proteins, it does not infer specific features of the interaction, such as the interacting residues in the interfaces.

Table 6.1 summarizes these statistical sequence-based approaches including the features that are used, the technique and/or the tools applied, and the validation data sets used.

Table 6.1

Statistical Sequence-based PPI Prediction Approaches

| Approach | Extracted Features | Technique/Tool | Data Sets |
|---|---|---|---|
| Mirror tree (Pazos and Valencia, 2001) | Similarity of phylogenetic trees | Evolutionary distance, McLachlan AA homology matrix | Escherichia coli protein (Dandekar et al., 1998) |
| PIPE (Pitre et al., 2006), PIPE2 (Pitre et al., 2008) | Short AA polypeptides | Similarity measure | Yeast protein (DIP and MIPS) |
| CD (Liu et al., 2013) | Coevolutionary information | Pairwise alignment, ClustalW2 | Human protein (Matsuya et al., 2008; Prasad et al., 2009) |

4.1.2 ML sequence-based approaches

This section describes several existing ML sequence-based PPI prediction approaches.

4.1.2.1 Auto covariance

Guo et al. (2008) proposed a sequence-based method using auto covariance (AC) and support vector machines (SVMs). AA residues were represented by seven physicochemical properties: hydrophobicity, hydrophilicity, volumes of side chains, polarity, polarizability, solvent-accessible surface area, and net charge index of AA side chains. AC accounts for the interactions between residues that are a certain distance apart in the sequence. AA physicochemical properties were analyzed by AC based on the calculation of covariance. A protein sequence was characterized by a series of ACs that covered the information of interactions between each AA residue and its 30 vicinal residues in the sequence. Finally, an SVM model with a radial basis function (RBF) kernel was constructed using the vectors of AC variables as input. The optimization experiment demonstrated that the interactions between one AA residue and its 30 vicinal AAs contribute to characterizing the PPI information. The software and data sets are available at http://www.scucic.cn/Predict_PPI/index.htm. A data set of 11,474 yeast PPIs extracted from DIP (Xenarios et al., 2002) was used to evaluate the model, and the average prediction accuracy, sensitivity, and precision achieved were 0.86, 0.85, and 0.87, respectively.
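To make the AC descriptor concrete, the sketch below computes it for a single property: for each lag, it averages the product of the mean-centered property values of residues that are that many positions apart. The scale shown is the Kyte-Doolittle hydropathy index, used here only as an illustrative property and not necessarily the scale chosen by the authors; the function name and the absence of any further normalization are likewise assumptions.

```python
# Kyte-Doolittle hydropathy index, used only as an example property scale.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(sequence, scale, max_lag):
    """Auto covariance of one property along a sequence for lags 1..max_lag.

    AC(lag) = 1/(n - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean),
    where x_i is the property value of residue i and n is the sequence length.
    """
    values = [scale[aa] for aa in sequence]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    return [
        sum((values[i] - mean) * (values[i + lag] - mean)
            for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]
```

With seven properties and a maximum lag of 30, concatenating the per-property results would give 7 × 30 = 210 values per protein under this sketch.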

One of the advantages of this approach is that AC includes long-range interaction information of AA residues, which is important in PPI identification. The use of SVM as a predictor is another advantage. SVM is a state-of-the-art ML technique that overcomes many limitations of other techniques. It has strong foundations in statistical learning theory (Cristianini and Shawe-Taylor, 2000) and has been successfully applied in various classification problems (Zaki et al., 2011). SVM also offers several computational advantages, such as the lack of local minima in the optimization (Vapnik, 1998).

4.1.2.2 Pairwise similarity

Zaki et al. (2009) proposed a PPI predictor based on pairwise similarity of protein primary structure. Each protein sequence was represented by a vector of pairwise similarities against large AA subsequences created by a sliding window that passes over concatenated protein training sequences. Each coordinate of this vector is the E-value of the Smith-Waterman (SW) score (Smith and Waterman, 1981). These vectors were then used to compute the kernel matrix, which was exploited in conjunction with an RBF-kernel SVM. The underlying idea is that whether two proteins interact can be inferred from the similarity scores they produce (Zaki et al., 2006; Zaki, 2007). To measure sequence identity, each sequence in the testing set was aligned against each sequence in the training set, the number of positions with identical residues was counted, and this number was divided by the total length of the alignment.
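The idea of representing a protein by its similarity scores against a set of reference subsequences can be sketched with a basic Smith-Waterman local alignment. This is a simplified illustration: it uses fixed match, mismatch, and linear gap scores instead of a substitution matrix, and it returns raw scores rather than E-values. The function names and parameter values are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_vector(query, references):
    """Represent a protein by its local alignment scores against reference subsequences."""
    return [smith_waterman_score(query, ref) for ref in references]
```

Vectors built this way for each protein could then be passed to a kernel classifier; only the two most recent rows of the scoring table are kept, so memory grows with the length of one sequence rather than with the product of both lengths.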

The method was evaluated on a data set of yeast (S. cerevisiae) proteins that was created by Chen and Liu (2005) and contains 4917 interacting protein pairs and 4000 noninteracting pairs. The method achieved an accuracy of 0.78, a sensitivity of 0.81, a specificity of 0.744, and an ROC score of 0.85.

The SW alignment score provides a relevant measure of similarity between proteins because protein sequence similarity typically implies homology, which in turn may imply structural and functional similarity (Liao and Noble, 2003). SW score parameters have been optimized over the past two decades to provide relevant measures of similarity between sequences, and they now represent core tools in computational biology (Saigo et al., 2004). The use of SVM as a predictor is another advantage. This work can be improved by combining knowledge about gene ontology, interdomain linker regions, and interacting sites to achieve more accurate prediction.

4.1.2.3 AA composition

Roy et al. (2009) examined the role of amino acid composition (AAC) in PPI prediction and its performance against well-known features such as domains, the tuple feature, and the signature product feature. Every protein pair was represented by AAC and domain features. AAC was represented by monomer and dimer features. Monomer features capture composition of individual amino acids, whereas dimer features capture composition of pairs of consecutive AAs. To generate the monomer features, a 20-dimensional vector representing the normalized proportion of the 20 AAs in a protein was created. The real-valued composition was then discretized into 25 bits, producing a set of 500 binary features. To generate the dimer features, a 400-dimensional vector of all possible AA pairs was extracted from the protein sequence and discretized into 10 bits, producing a set of 4000 binary features. The domains were represented as binary features with each feature identified by a domain name. To compare AAC against other nondomain sequence-based features, tuple features (Gomez et al., 2003) and signature products (Martin et al., 2005) were obtained. The tuple features were created by grouping AAs into six categories based on their biochemical properties, and then all possible strings of length 4 were created using these categories. The signature products were obtained by first extracting signatures of length 3 from the individual protein sequences. Each signature consists of a middle letter and two flanking AAs represented in alphabetical order. Thus, two 3-tuples with the first and third amino acid letter permuted have the same signature. The signatures were used to construct a signature kernel specifying the inner product between two proteins.
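The monomer and dimer features can be sketched as follows. The composition vectors follow the description above; the discretization step is an assumption of this example (equal-width bins on [0, 1] encoded as one-hot bits), since the exact binning scheme is not specified here. All names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
DIMERS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def monomer_composition(seq):
    """20-dimensional vector of normalized amino acid frequencies."""
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dimer_composition(seq):
    """400-dimensional vector of normalized frequencies of consecutive AA pairs."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIMERS, 0)
    total = len(seq) - 1
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing nonstandard residues
            counts[pair] += 1
    return [counts[d] / total for d in DIMERS]


def one_hot_bins(values, bins):
    """Assumed discretization: one-hot encode each value into equal-width bins."""
    features = []
    for v in values:
        index = min(int(v * bins), bins - 1)
        features.extend(1 if k == index else 0 for k in range(bins))
    return features
```

Under these assumptions, `one_hot_bins(monomer_composition(seq), 25)` yields 500 binary features and `one_hot_bins(dimer_composition(seq), 10)` yields 4000, matching the feature counts reported above.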

The proposed approach was examined using three ML classifiers (logistic regression, SVM, and the naive Bayes classifier) on PPI data sets from yeast, worm, and fly. Three data sets for S. cerevisiae were extracted from the General Repository for Interaction Datasets (GRID) database (Stark et al., 2006): Yeast Two-hybrid (TWOHYB), Affinity pull down with Mass Spectrometry (AFFMS), and protein complementation assay. In addition, one data set each was used for worm, Caenorhabditis elegans (Biogrid data set; see Li et al., 2004), and fly, Drosophila melanogaster (Stark et al., 2006). The authors reported that AAC features contributed almost as much as domain knowledge across different data sets and classifiers, which indicates that AAC captures significant information for identifying PPIs. AAC is simple, computationally cheap, and applicable to any protein sequence, and it can be used when domain information is lacking. AAC can also be combined with other features to enhance PPI prediction.

4.1.2.4 AA Triad

Yu et al. (2010) proposed a probability-based approach of estimating triad significance to alleviate the effect of AA distribution in nature. The relaxed variable kernel density estimator (RVKDE; see Oyang et al., 2005) was employed to predict PPIs based on AA triad information. The method is summarized as follows. Each protein sequence was represented as AA triads by considering every three consecutive residues in the protein sequence as a unit. To reduce the dimensionality of the feature vector, the 20 AA types were categorized into seven groups based on their dipole strength and side chain volumes (Shen et al., 2007). The triads were then scanned one by one along the sequence, and each scanned triad was counted in an occurrence vector, O. Subsequently, a significance vector, S, was proposed to represent a protein sequence by estimating the probability of observing fewer occurrences of each triad than the number actually observed in O. Each PPI pair was then encoded as a feature vector by concatenating the significance vectors of the two individual proteins. Finally, the feature vector was used to train an RVKDE PPI predictor. The method was evaluated on 37,044 interacting pairs within 9,441 proteins from the Human Protein Reference Database (HPRD) (Peri et al., 2003; Mishra et al., 2006). Data sets with different positive-to-negative ratios (from 1:1 to 1:15) were generated with the same positive instances and distinct negative sets, which were obtained by randomly sampling from the negative instances. The authors concluded that the degree of data set imbalance is important to PPI predictor behavior. With a 1:1 positive-to-negative ratio, the proposed method achieved 0.81 sensitivity, 0.79 specificity, 0.79 precision, and 0.8 F1. These measures change as the data become more imbalanced: with a 1:15 positive-to-negative ratio, sensitivity, precision, and F1 drop to 0.39, 0.495, and 0.44, respectively, while specificity rises to 0.97.
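The construction of the occurrence vector O can be sketched as follows. The seven-group assignment below is the grouping commonly attributed to Shen et al. (2007) and should be treated as an assumption of this example; the significance vector S is not computed because it depends on a background model that is not detailed here. The names are hypothetical.

```python
# Seven residue groups (assumed to follow Shen et al., 2007).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: index for index, members in enumerate(GROUPS) for aa in members}


def triad_occurrences(seq):
    """Occurrence vector O: counts of each of the 7 * 7 * 7 = 343 group triads."""
    classes = [GROUP_OF[aa] for aa in seq]
    counts = [0] * 343
    for i in range(len(classes) - 2):
        a, b, c = classes[i:i + 3]
        counts[a * 49 + b * 7 + c] += 1
    return counts


def pair_features(seq_a, seq_b):
    """Encode a protein pair by concatenating the two per-protein vectors."""
    return triad_occurrences(seq_a) + triad_occurrences(seq_b)
```

Grouping the residues reduces the number of possible triads from 20³ = 8000 to 343, so a protein pair is described by 686 values in this sketch.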

RVKDE is an ML algorithm that constructs an RBF neural network to approximate the probability density function of each class of objects in the training data set. One distinctive feature of RVKDE is that its model training process has an average time complexity of O(n log n), where n is the number of instances in the training set. To improve prediction efficiency, RVKDE considers only a limited number of nearest instances within the training data set to compute the kernel density estimator of each class. One important advantage of RVKDE over SVM is that the learning algorithm generally takes far less training time with an optimized parameter setting. In addition, the number of training samples remaining after a data reduction mechanism is applied is very close to the number of support vectors of the SVM algorithm. Unlike SVM, RVKDE is capable of classifying data with more than two classes in a single run (Oyang et al., 2005).

4.1.2.5 UNISPPI

Valente et al. (2013) introduced the Universal In Silico Predictor of Protein-Protein Interactions (UNISPPI). The authors examined both the frequency and composition of the physicochemical properties of the 20 protein AAs to train a decision tree PPI classifier. The frequency feature set included the percentages of each of the 20 AAs in the protein sequence. The composition feature set was obtained by grouping each AA of a protein into one of three different groups related to seven physicochemical properties and calculating the percentage of each group for each feature, ending up with a total of 21 composition features. The seven physicochemical properties are hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility. When tested on a data set of PPI pairs from 20 different species (including eukaryotes, prokaryotes, viruses, and parasite-host associations), UNISPPI correctly classified 0.79 of known PPIs and 0.73 of non-PPIs. The authors concluded that using only the AA frequencies was sufficient to predict PPIs. They further concluded that the AA frequencies of asparagine (N), cysteine (C), and isoleucine (I) are important features for distinguishing between interacting and noninteracting protein pairs.

The main advantages of UNISPPI are its simplicity and low computational cost, because only a small number of features were used to train the decision tree classifier, which is fast to build and has few parameters to tune. Decision trees can be easily analyzed, and their features can be ranked according to their ability to distinguish PPIs from non-PPIs. However, decision tree classifiers are prone to overfitting.

4.1.2.6 ETB-Viterbi

Kern et al. (2013) proposed Early Traceback Viterbi (ETB-Viterbi), a decoding algorithm with an early traceback mechanism for Interaction Profile Hidden Markov Models (ipHMMs) (Friedrich et al., 2006). ETB-Viterbi was designed to optimally incorporate long-distance correlations between interacting AA residues in input sequences. The method was evaluated with real data from the 3did database (Stein et al., 2005), along with simulated data generated from 3did data containing different degrees of correlation and reversed sequence orientation. ETB-Viterbi was able to capture the long-distance correlations for improved prediction accuracy and was not much affected by sequence orientation.

The hidden Markov model (HMM) is a powerful probabilistic modeling tool for analyzing and simulating sequences of symbols that are emitted from underlying states that are not directly observable (Rabiner and Juang, 1986). The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states. However, this algorithm is expensive in terms of memory and computing time. HMM training involves repeated iterations of the Viterbi algorithm, which makes it quite slow. An HMM may not converge on a truly optimal parameter set for a given training set, as it can be trapped in local maxima, and it can suffer from overfitting (Krogh et al., 1994, 1998; Eddy, 1998; Yoon, 2009).
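To make the decoding step concrete, the following is a minimal sketch of the standard Viterbi algorithm, not of ETB-Viterbi and its early traceback mechanism. The two-state toy model (N for a noninterface residue, I for an interface residue), the observation symbols, and all probabilities are invented for illustration only.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) of the most likely hidden-state path."""
    if not observations:
        raise ValueError("observations must not be empty")
    # best[t][s]: probability of the best path that ends in state s at step t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of state s on that best path.
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: every number below is an assumption.
states = ("N", "I")
start_p = {"N": 0.6, "I": 0.4}
trans_p = {"N": {"N": 0.7, "I": 0.3}, "I": {"N": 0.4, "I": 0.6}}
emit_p = {
    "N": {"polar": 0.5, "small": 0.4, "hydrophobic": 0.1},
    "I": {"polar": 0.1, "small": 0.3, "hydrophobic": 0.6},
}
path, prob = viterbi(["polar", "small", "hydrophobic"],
                     states, start_p, trans_p, emit_p)
```

The table `best` holds one entry per state and position, which is the source of the memory cost noted above. For long sequences, an implementation would normally work with log probabilities to avoid numerical underflow.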

Table 6.2 summarizes these ML sequence-based approaches and compares them in terms of features, techniques, tools, and validation data sets.

Table 6.2

ML Sequence-based PPI Prediction Approaches

| Approach | Extracted Features | Technique/Tool | Data Sets |
|---|---|---|---|
| AC (Guo et al., 2008) | AA physicochemical properties | Auto covariance, SVM | Yeast protein (DIP and MIPS) |
| Pairwise similarity (Zaki et al., 2009) | Pairwise similarity | SVM | Yeast protein |
| AA composition (Roy et al., 2009) | AAC | Logistic regression, SVM, Naive Bayes | Yeast protein (GRID), worm protein (Li et al., 2004), fly protein (Biogrid) |
| AA triad (Yu et al., 2010) | AA triad information | RVKDE | Human protein (HPRD) |
| UNISPPI (Valente et al., 2013) | Frequency and composition of AA physicochemical properties | Decision tree | 20 different eukaryotic species |
| ETB-Viterbi (Kern et al., 2013) | AA residues | HMMs, ETB-Viterbi | 3did database |


4.2 Structure-based approaches

Structure-based PPI prediction methods use three-dimensional structural features such as domain information, solvent accessibility, secondary structure states, and hydrophobic and polar surface locations (Porollo and Meller, 2012). Structure-based PPI prediction methods can be categorized into template-based, statistical, and ML-based methods. This section presents and evaluates some of the state-of-the-art structure-based approaches.

4.2.1 Template structure-based approaches

4.2.1.1 PRISM

Tuncbag et al. (2011) developed PRISM (Protein Interactions by Structural Matching) as a template-based PPI prediction method based on information regarding the interaction surface of crystalline complex structures. The two sides of a template interface are compared with the surfaces of two target monomers by structural alignment. If regions of the target surfaces are similar to the complementary sides of the template interface, then these two targets are predicted to interact with each other through the template interface architecture. The method can be summarized as follows. First, interacting surface residues of target chains are extracted using the Naccess software program (Hubbard and Thornton, 1993). Second, complementary chains of template interfaces are separated and structurally compared with each of the target surfaces by using MultiProt (Shatsky et al., 2004). Third, the structural alignment results are filtered according to threshold values, and the resulting set of target surfaces is transformed into the corresponding template interfaces to form a complex. Finally, the FiberDock algorithm (Mashiach et al., 2010) is used to refine the interactions by introducing flexibility, to compute the global energy of the complex, and to rank the solutions according to their energies. When the computed energy of a protein pair is less than a threshold of −10 kcal/mol, the pair is predicted to interact.

PRISM has been applied for predicting PPIs in a human apoptosis pathway (Acuner Ozbabacan et al., 2012) and a p53 protein-related pathway (Tuncbag et al., 2009), and has contributed to the understanding of the structural mechanisms underlying some types of signal transduction. PRISM obtained a precision of 0.231 when applied to a human apoptosis pathway that consisted of 57 proteins.

4.2.1.2 PrePPI

Zhang et al. (2012) proposed Predicting Protein-Protein Interaction (PrePPI) as a structural alignment PPI predictor based on geometric relationships between secondary structure elements. Given a pair of query proteins, A and B, representative structures for the individual subunits (MA and MB) are taken from the Protein Data Bank (PDB) (Berman et al., 2000) or from the ModBase (Pieper et al., 2006) and SkyBase (Mirkovic et al., 2007) homology model databases. Close and remote structural neighbors are found for each subunit. A template for the interaction exists if a PDB or PQS (Protein Quaternary Structure) (Henrick and Thornton, 1998) structure contains interacting pairs that are structural neighbors of MA and MB. A model is constructed by superposing the individual subunits, MA and MB, on their corresponding structural neighbors. The likelihood that each model represents a true interaction is then calculated using a Bayesian network trained on data sets of 11,851 yeast interactions and 7409 human interactions. Finally, the structure-derived score is combined with nonstructural information, including coexpression and functional similarity, in a naive Bayes classifier.
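The final combination step can be sketched under a common naive Bayes formulation in which each evidence source contributes a likelihood ratio and the sources are treated as conditionally independent. This formulation, the function names, and the numbers are assumptions made for illustration; the chapter does not give PrePPI's actual scoring equations.

```python
from math import prod


def combine_likelihood_ratios(likelihood_ratios):
    """Multiply per-source likelihood ratios (naive independence assumption)."""
    return prod(likelihood_ratios)


def posterior_probability(likelihood_ratios, prior_odds):
    """Convert prior odds and evidence into a posterior interaction probability."""
    odds = prior_odds * combine_likelihood_ratios(likelihood_ratios)
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios for structural, coexpression, and
# functional-similarity evidence, with hypothetical prior odds of 1:10.
combined = combine_likelihood_ratios([2.0, 5.0])
posterior = posterior_probability([2.0, 5.0], prior_odds=0.1)
```

Under this formulation, weak evidence from several independent sources can add up to a confident prediction, which is the motivation for adding nonstructural information to the structure-derived score.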

Although template-based methods can achieve high prediction accuracy when close templates are retrieved, the accuracy significantly decreases when the sequence identity of target and template is low.

4.2.2 Statistical structure-based approaches

4.2.2.1 PID matrix score

Kim et al. (2002) presented the potentially interacting domain (PID) pair matrix as a domain-based PPI prediction algorithm. The PID matrix score was constructed as a measure of interactability (interaction probability) between domains. The algorithm analysis was based on the DIP, which contains more than 10,000 mostly experimentally verified interacting protein pairs. Domain information was extracted from InterPro (Apweiler et al., 2001), an integrated database of protein families, domains, and functional sites. Cross-validation was performed with subsets of DIP data (positive data sets) and randomly generated protein pairs from the TrEMBL/SwissProt database (negative data sets). The method achieved 0.50 sensitivity and 0.98 specificity. The authors reported that the PID matrix can also be used in the mapping of the genome-wide interaction networks.

4.2.2.2 PreSPI

Han et al. (2003, 2004) proposed PreSPI, a domain combination-based method that considers all possible domain combinations as the basic units of protein interactions. The domain combination interaction probability is based on the number of interacting protein pairs containing the domain combination pair and the number of domain combinations in each protein. The method considers the possibility of domain combinations appearing in both interacting and noninteracting sets of protein pairs. The ranking of multiple protein pairs was decided by the interacting probabilities computed through the interacting probability equation.

The method was evaluated using an interacting set of protein pairs in yeast acquired from the DIP (Salwinski et al., 2004) and a randomly generated, noninteracting set of protein pairs. The domain information for the proteins was extracted from the PDB (http://www.ebi.ac.uk/proteome/; see Berman et al., 2000; Apweiler et al., 2001). PreSPI achieved a sensitivity of 0.77 and a specificity of 0.95.

PreSPI suffers from several limitations, though. First, this method ignores other domain-domain interaction information between the protein pairs. Second, it assumes that one domain combination is independent of another. Third, the method is computationally expensive, as all possible domain combinations are considered.
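The computational cost mentioned above can be illustrated with a minimal sketch that enumerates domain combinations. It assumes that a domain combination is any nonempty subset of a protein's domains; the function names and domain labels are hypothetical, and the interacting probability equation itself is not reproduced.

```python
from itertools import combinations


def domain_combinations(domains):
    """Return every nonempty subset of a protein's distinct domains."""
    unique = sorted(set(domains))
    return [
        frozenset(subset)
        for size in range(1, len(unique) + 1)
        for subset in combinations(unique, size)
    ]


def combination_pairs(domains_a, domains_b):
    """Return all domain-combination pairs for a pair of proteins."""
    return [
        (a, b)
        for a in domain_combinations(domains_a)
        for b in domain_combinations(domains_b)
    ]


# Hypothetical proteins with three and two domains, respectively.
combos = domain_combinations(["D1", "D2", "D3"])
pairs = combination_pairs(["D1", "D2", "D3"], ["D4", "D5"])
```

A protein with n distinct domains yields 2^n − 1 combinations, so the number of combination pairs for a protein pair grows exponentially with the number of domains.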

4.2.2.3 DCC

Jang et al. (2012) proposed a domain cohesion and coupling (DCC)–based PPI prediction method using the information of intraprotein domain interactions and interprotein domain interactions. The method aims to identify which domains are involved in a PPI by determining the probability that the domains cause the proteins to interact regardless of the number of participating domains. The coupling powers of all domain interaction pairs are stored in an interaction significance (IS) matrix, which is used to predict PPIs. The method was evaluated on S. cerevisiae proteins and achieved 0.82 sensitivity and 0.83 specificity. The domain information for the proteins was extracted from Pfam (http://pfam.sanger.ac.uk) (Punta et al., 2011), a protein domain family database that contains multiple sequence alignments of common domain families.

4.2.2.4 MEGADOCK

Ohue et al. (2013a) developed MEGADOCK as a protein-protein docking software package using the real Pairwise Shape Complementarity (rPSC) score. First, they conducted rigid-body docking calculations based on a simplified energy function considering shape complementarities, electrostatics, and hydrophobic interactions for all possible binary combinations of proteins in the target set. Using this process, a group of high-scoring docking complexes was obtained for each pair of proteins. Then ZRANK (Pierce and Weng, 2007) was applied for more advanced binding energy calculation, and the docking results were reranked based on ZRANK energy scores. The deviation of the selected docking scores from the score distribution of high-ranked complexes was determined as a standardized score (Z-score) and was used to assess possible interactions. Potential complexes that had no other high-scoring interactions nearby were rejected using structural differences. Thus, binding pairs that had at least one populated area of high-scoring structures were considered. MEGADOCK was applied to PPI prediction for 13 proteins of a bacterial chemotaxis pathway (Ohue et al., 2012; Matsuzaki et al., 2013), and a precision of 0.4 was obtained. MEGADOCK is available at http://www.bi.cs.titech.ac.jp/megadock.

One of the limitations of this approach is the generation of false positives in cases in which no similar structures are seen in known complex structure databases.
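The standardized score used by MEGADOCK to assess possible interactions can be sketched as an ordinary Z-score. This is a minimal illustration with invented scores; the function name is hypothetical, and the choice of reference distribution and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def standardized_score(score, reference_scores):
    """Return the Z-score of a docking score against a reference distribution."""
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    return (score - mu) / sigma


# Hypothetical reranked scores of high-ranked complexes for one protein pair.
scores = [10.0, 12.0, 14.0, 16.0, 28.0]
z = standardized_score(max(scores), scores)
```

A pair whose best complex stands far above the rest of its score distribution receives a large Z-score and becomes a candidate interaction.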

4.2.2.5 Meta approach

Ohue et al. (2013b) proposed a PPI prediction approach that combines template-based and docking methods. The approach applies PRISM (Tuncbag et al., 2011) as a template-matching method and MEGADOCK (Ohue et al., 2013a) as a docking method. A protein pair is considered to be interacting only if both PRISM and MEGADOCK predict that the pair interacts. When applied to the human apoptosis signaling pathway, the method obtained a precision of 0.333, which is higher than that achieved by either method alone (0.231 for PRISM and 0.145 for MEGADOCK), while maintaining an F1 of 0.285, comparable to that of the individual methods (0.296 for PRISM and 0.220 for MEGADOCK).

Meta approaches have already been used in the field of protein tertiary structure prediction (Zhou et al., 2009), and critical experiments have demonstrated improved performance of Meta predictors when compared with individual methods. The Meta approach has also provided favorable results in protein domain prediction (Saini and Fischer, 2005) and the prediction of disordered regions in proteins (Ishida and Kinoshita, 2008). Although some true positives may be dropped by this method, the remaining predicted pairs are expected to have higher reliability because of the consensus between two prediction methods that have different characteristics.
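The consensus rule and its effect on precision and recall can be shown with a small sketch. The protein names, the predictions, and the set of known interactions are invented toy data, and the helper names are hypothetical.

```python
def pair(a, b):
    """Represent an unordered protein pair."""
    return frozenset((a, b))


def consensus(predictions_a, predictions_b):
    """Keep only the pairs predicted to interact by both methods."""
    return predictions_a & predictions_b


def precision_recall_f1(predicted, true_pairs):
    """Compute precision, recall, and F1 for a set of predicted pairs."""
    tp = len(predicted & true_pairs)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: all pairs below are hypothetical.
template_hits = {pair("A", "B"), pair("A", "C"), pair("B", "D")}
docking_hits = {pair("A", "B"), pair("C", "D"), pair("B", "D"),
                pair("A", "D"), pair("B", "C")}
known = {pair("A", "B"), pair("C", "D")}

agreed = consensus(template_hits, docking_hits)
metrics = precision_recall_f1(agreed, known)
```

In this toy case the consensus set has a precision of 0.5, higher than either method alone (about 0.33 and 0.4), while one true pair found only by the docking method is lost.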

4.2.3 ML structure-based approaches

4.2.3.1 Random Forest

Chen and Liu (2005) introduced a domain-based Random Forest PPI predictor. Protein pairs were characterized by the domains existing in each protein. The protein domain information was collected from the Pfam database (Bateman et al., 2004). Each protein pair was represented by a vector of features in which each feature corresponded to a Pfam domain. If a domain existed in both proteins, then the associated feature value was 2. If the domain existed in only one of the two proteins, then its associated feature value was 1. If a domain existed in neither protein, then the feature value was 0. These domain features were used to train a Random Forest classifier. The Random Forest constructs many decision trees, each grown from a different subset of training samples and a random subset of features; the final classification of a given protein pair is determined by majority vote among the classes decided by the forest of trees.
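The 2/1/0 encoding described above can be written in a few lines. The function name and the domain identifiers are placeholders, not real Pfam accessions.

```python
def pair_features(domains_a, domains_b, domain_vocabulary):
    """Encode a protein pair as one feature per domain.

    2: the domain occurs in both proteins
    1: the domain occurs in exactly one protein
    0: the domain occurs in neither protein
    """
    a, b = set(domains_a), set(domains_b)
    return [int(d in a) + int(d in b) for d in domain_vocabulary]


# Placeholder domain identifiers (hypothetical).
vocabulary = ["DOM_1", "DOM_2", "DOM_3"]
features = pair_features({"DOM_1", "DOM_2"}, {"DOM_2"}, vocabulary)
```

Vectors of this form, one per protein pair, could then be passed to any Random Forest implementation for training.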

When evaluated on a data set containing 9834 yeast protein interaction pairs among 3713 proteins and 8000 randomly generated negative samples, the method achieved a sensitivity of 0.8 and a specificity of 0.64. Yeast PPI data were collected from the DIP (Salwinski et al., 2004; Deng et al., 2002; Schwikowski et al., 2000). The data set of Deng et al. (2002) combines interaction data obtained experimentally through two-hybrid assays on S. cerevisiae by Uetz et al. (2000) and Ito et al. (2000). Schwikowski et al. (2000) gathered their data from yeast two-hybrid, biochemical, and genetic data.

The Random Forest classifier has several advantages. It is relatively fast, simple, robust to outliers and noise, and easily parallelized; it resists overfitting; and it performs well in many classification problems (Breiman, 2001; Caruana et al., 2008). Random Forest shows a significant performance improvement over single-tree classifiers. It quantifies the importance of the features using measures such as mean decrease in accuracy or Gini importance (Chang and Yang, 2013). Random Forest benefits from the randomization of decision trees, as individual trees have low bias and high variance. Random Forest has few parameters to tune and is less dependent on tuning parameters (Izmirlian, 2004; Qi, 2012). However, the computational cost of Random Forest increases as the number of generated trees increases. One limitation of this approach is that PPI prediction depends on domain knowledge, so proteins without domain information cannot contribute to the prediction. Therefore, the method excluded the pairs in which at least one of the proteins had no domain information.

4.2.3.2 Struct2Net

Singh et al. (2006) introduced Struct2Net as a structure-based PPI predictor. The method predicts interactions by threading each pair of protein sequences into potential structures in the PDB (Berman et al., 2000). Given two protein sequences (or one sequence against all sequences of a species), Struct2Net threads the sequence to all the protein complexes in the PDB and then chooses the best potential match. Based on this match, it uses the logistic regression technique to predict whether the two proteins interact.

Later, Singh et al. (2010) introduced Struct2Net as a web server with multiple querying options; it is available at http://struct2net.csail.mit.edu. Users can retrieve yeast, fly, and human PPI predictions by gene name or identifier, while they can query for proteins of other organisms by AA sequence in FASTA format. Struct2Net returns a list of interacting proteins if one protein sequence is provided and an interaction prediction if two sequences are provided. When evaluated on yeast and fly protein pairs, Struct2Net achieves a recall of 0.80, with a precision of 0.30.

A common limitation of all structure-based PPI prediction approaches is low coverage, because the number of known protein structures is much smaller than the number of known protein sequences. Therefore, such approaches fail when there is no structural template available for the queried protein pair. Table 6.3 summarizes these structure-based approaches and compares them in terms of features, techniques, tools, and validation data sets.

Table 6.3

Structure-based PPI Prediction Approaches

| Approach | Extracted Features | Technique/Tool | Data Sets |
|---|---|---|---|
| PRISM (Tuncbag et al., 2011) | Interaction surface of crystalline complex structures | Naccess, MultiProt, FiberDock | Human protein (Acuner Ozbabacan et al., 2012; Tuncbag et al., 2009) |
| PrePPI (Zhang et al., 2012) | Secondary structure | Bayesian networks, naive Bayes | Yeast protein, human protein |
| PID matrix score (Kim et al., 2002) | Potentially interacting domain pairs | PID matrix | DIP, InterPro, TrEMBL/SwissProt |
| PreSPI (Han et al., 2003, 2004) | Domain combination interaction probability | Interacting probability equation | Yeast protein (DIP), PDB |
| DCC (Jang et al., 2012) | Intraprotein and interprotein domain interactions | Interaction significance matrix | S. cerevisiae protein, Pfam |
| MEGADOCK (Ohue et al., 2013a) | Shape complementarities, electrostatics, and hydrophobic interactions | rPSC, ZRANK | Bacterial protein (Ohue et al., 2012; Matsuzaki et al., 2013) |
| Meta approach (Ohue et al., 2013b) | Interaction surface of crystalline complex structures, shape complementarities, electrostatics, and hydrophobic interactions | PRISM, MEGADOCK | Human protein |
| Random Forest (Chen and Liu, 2005) | Existence of similar domains | Random Forest | DIP; Deng et al., 2002; Schwikowski et al., 2000; Pfam |
| Struct2Net (Singh et al., 2006, 2010) | Homology with known protein complexes in PDB | Logistic regression | Yeast, fly, and human protein |


5 Conclusion

This chapter provided a review of the computational techniques for PPI prediction, including the open issues and main challenges in this domain. We investigated several relevant existing approaches and provided a categorization and comparison of them. It is clear that PPI prediction still needs much more research to achieve reasonable prediction accuracy. One issue with current PPI prediction methods is that they do not use a uniform data set and evaluation measure. We recommend creating a freely available standard benchmark data set that takes into consideration the biological properties of proteins, and examining the performance of all these methods on this benchmark using well-defined evaluation measures. This will allow researchers to compare the performance of these prediction methods in a fair and uniform fashion. This work can be extended by investigating more recently published PPI prediction techniques, analyzing them in depth, and comparing their performance on a uniform data set according to a uniform evaluation metric. More focus should be given to the techniques that incorporate biological knowledge into the prediction process.

References

Acuner Ozbabacan SE, Keskin O, Nussinov R, Gursoy A. Enriching the human apoptosis pathway by predicting the structures of protein–protein complexes. J. Struct. Biol. 2012;179(3):338–346.

Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MDR, et al. The interpro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 2001;29(1):37–40.

Bader GD, Hogue CW. Analyzing yeast protein–protein interaction data obtained from different sources. Nat. Biotechnol. 2002;20(10):991–997.

Bartel PL, Fields S. The Yeast Two-Hybrid System. Oxford University Press; 1997.

Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al. The pfam protein families database. Nucleic Acids Res. 2004;32(Suppl. 1):D138–D141.

Berg JM, Tymoczko JL, Stryer L. Protein Structure and Function. 2002.

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res. 2000;28(1):235–242.

Breiman L. Random forests. Mach. Learn. 2001;45(1):5–32.

Caruana R, Karampatziakis N, Yessenalina A. An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning; ACM; 2008:96–103.

Chang KY, Yang JR. Analysis and prediction of highly effective antiviral peptides based on random forests. PLoS One. 2013;8(8):e70166.

Chen XW, Liu M. Prediction of protein–protein interactions using random decision forest framework. Bioinformatics. 2005;21(24):4394–4400.

Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357(6379):543.

Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 2011;273(1):236–247.

Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press; 2000.

Dandekar T, Snel B, Huynen M, Bork P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 1998;23(9):324–328.

Deng M, Mehta S, Sun F, Chen T. Inferring domain–domain interactions from protein–protein interactions. Genome Res. 2002;12(10):1540–1548.

Eddy SR. Profile hidden markov models. Bioinformatics. 1998;14(9):755–763.

Fawcett T. An introduction to roc analysis. Pattern Recogn. Lett. 2006;27(8):861–874.

Friedrich T, Pils B, Dandekar T, Schultz J, Müller T. Modelling interaction sites in protein domains with interaction profile hidden markov models. Bioinformatics. 2006;22(23):2851–2857.

Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, Bauer, Schultz J, Rick JM, Michon AM, Cruciat CM, et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002;415(6868):141–147.

Gomez SM, Noble WS, Rzhetsky A. Learning to predict protein–protein interactions from protein sequences. Bioinformatics. 2003;19(15):1875–1881.

Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 2008;36(9):3025–3030.

Han D, Kim HS, Seo J, Jang W. A domain combination based probabilistic framework for protein-protein interaction prediction. Genome Inform. 2003;250–260.

Han DS, Kim HS, Jang WH, Lee SD, Suh JK. Prespi: a domain combination based prediction system for protein–protein interaction. Nucleic Acids Res. 2004;32(21):6312–6320.

Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology. 1982;143(1):29–36.

Henrick K, Thornton JM. Pqs: a protein quaternary structure file server. Trends Biochem. Sci. 1998;23(9):358–361.

Hubbard SJ, Thornton JM. Naccess. Computer program. Department of Biochemistry and Molecular Biology, University College London; 1993;vol. 2 no. 1.

Ishida T, Kinoshita K. Prediction of disordered regions in proteins based on the meta approach. Bioinformatics. 2008;24(11):1344–1348.

Ito T, Tashiro K, Muta S, Ozawa R, Chiba T, Nishizawa M, Yamamoto K, Kuhara S, Sakaki Y. Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. 2000;97(3):1143–1147.

Izmirlian G. Application of the random forest classification algorithm to a seldi-tof proteomics study in the setting of a cancer prevention trial. Ann. N. Y. Acad. Sci. 2004;1020(1):154–174.

Jang WH, Jung SH, Han DS. A computational model for predicting protein interactions based on multidomain collaboration. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012;9(4):1081–1090.

Kern C, Gonzalez AJ, Liao L, Vijay-Shanker K. Predicting interacting residues using long-distance information and novel decoding in hidden markov models. IEEE Trans. Nanosci. 2013;12(13):158–164.

Kim WK, Park J, Suh JK, et al. Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair. Genome Inform. 2002;42–50.

Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. Hidden markov models in computational biology: applications to protein modeling. J. Mol. Biol. 1994;235(5):1501–1531.

Krogh A, et al. An introduction to hidden markov models for biological sequences. New Compr. Biochem. 1998;32:45–63.

Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JDJ, Chesneau A, Hao T, et al. A map of the interactome network of the metazoan c. elegans. Science. 2004;303(5657):540–543.

Liao L, Noble WS. Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J. Comput. Biol. 2003;10(6):857–868.

Liu CH, Li KC, Yuan S. Human protein–protein interaction prediction by a novel sequence-based co-evolution method: co-evolutionary divergence. Bioinformatics. 2013;29(1):92–98.

Martin S, Roe D, Faulon JL. Predicting protein–protein interactions using signature products. Bioinformatics. 2005;21(2):218–226.

Mashiach E, Nussinov R, Wolfson HJ. FiberDock: flexible induced-fit backbone refinement in molecular docking. Proteins Struct. Funct. Bioinform. 2010;78(6):1503–1519.

Matsuya A, Sakate R, Kawahara Y, Koyanagi KO, Sato Y, Fujii, Yamasaki C, Habara T, Nakaoka H, Todokoro F, et al. Evola: ortholog database of all human genes in h-invdb with manual curation of phylogenetic trees. Nucleic Acids Res. 2008;36(Suppl. 1):D787–D792.

Matsuzaki Y, Ohue M, Uchikoga N, Akiyama Y. Protein-protein interaction network prediction by using rigid-body docking tools: application to bacterial chemotaxis. Protein Pept. Lett. 2013.

McLachlan AD. Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551. J. Mol. Biol. 1971;61(2):409–424.

Melo JC, Cavalcanti G, Guimaraes K. PCA feature extraction for protein structure prediction. In: Proceedings of the International Joint Conference on Neural Networks, 2003. vol. 4. IEEE; 2003:2952–2957.

Metz CE. Basic principles of ROC analysis. Semin. Nucl. Med. 1978;8(4):283–298.

Mewes HW, Frishman D, Güldener U, Mannhaupt G, Mayer K, Mokrejs, Morgenstern B, Münsterkötter M, Rudd S, Weil. Mips: a database for genomes and protein sequences. Nucleic Acids Res. 2002;30(1):31–34.

Mirkovic N, Li Z, Parnassa A, Murray D. Strategies for high-throughput comparative modeling: applications to leverage analysis in structural genomics and protein family organization. Proteins Struct. Funct. Bioinform. 2007;66(4):766–777.

Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, et al. Human protein reference database - 2006 update. Nucleic Acids Res. 2006;34(Suppl. 1):D411–D414.

Ohue M, Matsuzaki Y, Ishida T, Akiyama Y. Improvement of the protein–protein docking prediction by introducing a simple hydrophobic interaction model: an application to interaction pathway analysis. In: Pattern Recognition in Bioinformatics. Springer; 2012:178–187.

Ohue M, Matsuzaki Y, Uchikoga N, Ishida T, Akiyama Y. Megadock: an all-to-all protein-protein interaction prediction system using tertiary structure data. Protein Pept. Lett. 2013a.

Ohue M, Matsuzaki Y, Shimoda T, Ishida T, Akiyama Y. Highly precise protein-protein interaction prediction based on consensus between template-based and de novo docking methods. BMC Proc. 2013b;7(Suppl. 7).

Oyang YJ, Hwang SC, Ou YY, Chen CY, Chen ZW. Data classification with radial basis function networks based on a novel kernel density estimation algorithm. IEEE Trans. Neural Netw. 2005;16(1):225–236.

Pazos F, Valencia A. Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng. 2001;14(9):609–614.

Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi T, Gronborg M, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13(10):2363–2371.

Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D, et al. Modbase: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res. 2006;34(Suppl. 1):D291–D295.

Pierce B, Weng Z. ZRANK: reranking protein docking predictions with an optimized energy function. Proteins Struct. Funct. Bioinform. 2007;67(4):1078–1086.

Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, Greenblatt J, Jessulat M, Krogan N, et al. Pipe: a protein-protein interaction prediction engine based on the reoccurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics. 2006;7(1):365.

Pitre S, North C, Alamgir M, Jessulat M, Chan A, Luo X, Green, Dumontier M, Dehne F, Golshani A. Global investigation of protein–protein interactions in yeast saccharomyces cerevisiae using reoccurring short polypeptide sequences. Nucleic Acids Res. 2008;36(13):4286–4294.

Porollo A, Meller J. Computational methods for prediction of protein-protein interaction sites. In: Cai W, Hong H, eds. Protein-Protein Interactions - Computational and Experimental Tools. vol. 472. InTech; 2012:3–26.

Powers D. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Techn. 2011;2(1):37–63.

Prasad TK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database - 2009 update. Nucleic Acids Res. 2009;37(Suppl. 1):D767–D772.

Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, et al. The pfam protein families database. Nucleic Acids Res. 2011;gkr1065.

Qi Y. Random forest for bioinformatics. In: Ensemble Machine Learning. Springer; 2012:307–323.

Qi Y, Bar-Joseph Z, Klein-Seetharaman J. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins Struct. Funct. Bioinform. 2006;63(3):490–500.

Rabiner L, Juang BH. An introduction to hidden markov models. IEEE ASSP Mag. 1986;3(1):4–16.

Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Séraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nat. Biotechnol. 1999;17(10):1030–1032.

Roy S, Martinez D, Platero H, Lane T, Werner-Washburne M. Exploiting amino acid composition for predicting protein-protein interactions. PLoS One. 2009;4(11):e7813.

Saigo H, Vert JP, Ueda N, Akutsu T. Protein homology detection using string alignment kernels. Bioinformatics. 2004;20(11):1682–1689.

Saini HK, Fischer D. Meta-dp: domain prediction meta-server. Bioinformatics. 2005;21(12):2917–2920.

Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(Suppl. 1):D449–D451.

Sasaki Y. The truth of the f-measure. Teach Tutor Mater. 2007;1–5.

Schwikowski B, Uetz P, Fields S. A network of protein–protein interactions in yeast. Nat. Biotechnol. 2000;18(12):1257–1261.

Shatnawi M. Computational methods for protein-protein interaction prediction. In: BIOCOMP’14. 2014.

Shatsky M, Nussinov R, Wolfson HJ. A method for simultaneous alignment of multiple protein structures. Proteins Struct. Funct. Bioinform. 2004;56(1):143–156.

Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang. Predicting protein–protein interactions based only on sequences information. Proc. Natl. Acad. Sci. 2007;104(11):4337–4341.

Singh R, Xu J, Berger B. Struct2Net: integrating structure into protein-protein interaction prediction. In: Pacific Symposium on Biocomputing. vol. 11. Citeseer; 2006:403–414.

Singh R, Park D, Xu J, Hosur R, Berger B. Struct2net: a web service to predict protein–protein interactions using a structure-based approach. Nucleic Acids Res. 2010;38(Suppl. 2):W508–W515.

Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147(1):195–197.

Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(Suppl. 1):D535–D539.

Stein A, Russell RB, Aloy P. 3did: interacting protein domains of known three-dimensional structure. Nucleic Acids Res. 2005;33(Suppl. 1):D413–D417.

Szilágyi A, Grimm V, Arakaki AK, Skolnick J. Prediction of physical protein–protein interactions. Phys. Biol. 2005;2(2):S1.

Tong AHY, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L, Evangelista M, Ferracuti S, Nelson B, Paoluzi S, et al. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science. 2002;295(5553):321–324.

Tuncbag N, Kar G, Gursoy A, Keskin O, Nussinov R. Towards inferring time dimensionality in protein–protein interaction networks by integrating structures: the p53 example. Mol. Biosyst. 2009;5(12):1770–1778.

Tuncbag N, Gursoy A, Nussinov R, Keskin O. Predicting protein-protein interactions on a proteome scale by matching evolutionary and structural similarities at interfaces using PRISM. Nat. Protoc. 2011;6(9):1341–1354.

Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature. 2000;403(6770):623–627.

Valente GT, Acencio ML, Martins C, Lemke N. The development of a universal in silico predictor of protein-protein interactions. PLoS One. 2013;8(5):e65587.

Vapnik VN. Statistical Learning Theory (Adaptive and Learning Systems for Signal Processing, Communications and Control Series). 1998.

Von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein–protein interactions. Nature. 2002;417(6887):399–403.

Xenarios I, Eisenberg D. Protein interaction databases. Curr. Opin. Biotechnol. 2001;12(4):334–339.

Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002;30(1):303–305.

Yoo PD, Sikder AR, Taheri J, Zhou BB, Zomaya AY. DomNet: protein domain boundary prediction using enhanced general regression network and new profiles. IEEE Trans. Nanobiosci. 2008;7(2):172–181.

Yoon BJ. Hidden Markov models and their applications in biological sequence analysis. Curr. Genomics. 2009;10(6):402.

Yu CY, Chou LC, Chang DT. Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. BMC Bioinformatics. 2010;11(1):167.

Zaki N. Protein-protein interaction prediction using homology and inter-domain linker region information. Adv. Electr. Eng. Comput. Sci. 2007;67(4):635–645.

Zaki N, Deris S, Alashwal H. Protein-protein interaction detection based on substring sensitivity measure. Int. J. Biomed. Sci. 2006;2(1):148–154.

Zaki N, Lazarova-Molnar S, El-Hajj W, Campbell P. Protein-protein interaction based on pairwise similarity. BMC Bioinformatics. 2009;10(1):150.

Zaki N, Wolfsheimer S, Nuel G, Khuri S. Conotoxin protein classification using free scores of words and support vector machines. BMC Bioinformatics. 2011;12(1):217.

Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, Thu CA, Bisikirska B, Lefebvre C, Accili D, Hunter T, et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature. 2012;490(7421):556–560.

Zhou H, Pandit SB, Skolnick J. Performance of the pro-sp3-TASSER server in CASP8. Proteins Struct. Funct. Bioinform. 2009;77(S9):123–127.

Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T, et al. Global analysis of protein activities using proteome chips. Science. 2001;293(5537):2101–2105.
