Chapter 6

Review of Recent Protein-Protein Interaction Techniques

Maad Shatnawi, Higher Colleges of Technology, Abu Dhabi, UAE

Abstract

Protein-protein interactions (PPIs) play a crucial role in cellular functions and biological processes in all organisms. The identification of protein interactions can lead to a better understanding of infection mechanisms and to the development of new drugs and optimized treatments. Several physicochemical experimental techniques have been applied to identify PPIs. However, these techniques are expensive and significantly time consuming, and they have covered only a small portion of the complete PPI networks. As a result, the need for computational techniques to validate experimental results and to predict undiscovered PPIs has increased. This chapter investigates and compares most of the recent computational PPI prediction techniques and discusses the technical challenges and open issues in this domain.

Keywords

Protein-protein interaction (PPI) prediction

protein-protein interaction (PPI)

protein sequences

computational techniques

1 Introduction

Proteins are the building blocks of all living organisms. The primary structure of a protein is the linear sequence of its amino acid (AA) units, starting from the amino-terminal residue (N-terminal) to the carboxyl-terminal residue (C-terminal). Amino acids consist of carbon, hydrogen, oxygen, and nitrogen atoms that are clustered into functional groups. All amino acids have the same general structure, but each has a different R group. The carbon atom to which the R group is connected is called the alpha carbon. There are 20 amino acids in proteins; they are connected by a chemical reaction in which a molecule of water is removed, leaving two amino acid residues connected by a peptide bond. These 20 amino acids are alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine. They are represented by the one-letter abbreviations A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, and V, respectively (Berg et al., 2002).

The secondary structure of a protein is the general three-dimensional form of its local parts. The most common secondary structures are alpha (α) helices and beta (β) sheets. The α-helix is a right-handed spiral array, while the β-sheet is made of beta strands connected crosswise by two or more hydrogen bonds, forming a twisted, pleated sheet. These secondary structures are linked by tight turns and loose, flexible loops (Berg et al., 2002). Protein domains are the basic functional units of protein tertiary structures. A protein domain is a conserved part of a protein sequence that can evolve, function, and exist independently. Each domain forms a three-dimensional structure and can be stable and folded independently. Several domains are joined in different combinations to form multidomain protein sequences (Chothia, 1992; Yoo et al., 2008).

A protein interacts with other proteins to perform certain tasks. Protein-protein interaction (PPI) occurs at almost every level of cell function. The identification of interactions among proteins provides a global picture of cellular functions and biological processes. Since most biological processes involve one or more protein-protein interactions, the accurate identification of the set of interacting proteins in an organism is very useful for deciphering the molecular mechanisms underlying given biological functions, as well as for assigning functions to unknown proteins based on their interacting partners (Zaki et al., 2009; Xenarios and Eisenberg, 2001; Kim et al., 2002). Protein interaction prediction is also a fundamental step in the construction of PPI networks for humans and other organisms. The identification of possible viral-host protein interactions can lead to a better understanding of infection mechanisms and, in turn, to the development of new drugs and optimized treatments. Abnormal PPIs are implicated in several neurological disorders, including Creutzfeldt-Jakob and Alzheimer’s diseases (Bader and Hogue, 2002; Von Mering et al., 2002; Qi et al., 2006). Therefore, the development of accurate and reliable methods for identifying PPIs has an important impact on several protein research areas.

This chapter, as an extension of a previous paper (Shatnawi, 2014), provides a comprehensive and comparative study and categorization of the existing computational approaches in PPI prediction. It also discusses the technical challenges and open issues in this field. The rest of this chapter is organized as follows: The next section addresses the key technical challenges that face PPI prediction and the open issues in this field. Section 3 discusses the performance measures that are typically used in PPI prediction. Section 4 provides a comprehensive description and comparison of the most current computational PPI predictors. Concluding remarks are presented in Section 5.

2 Technical challenges and open issues

Several technical challenges face the computational analysis of protein sequences in general and PPI prediction in particular. First, a huge number of new protein sequences have been discovered in the postgenomic era. Second, protein chains are typically long, which makes them difficult, time consuming, and expensive to characterize by experimental methods. Third, large, comprehensive, and accurate benchmark data sets are required for the training and evaluation of prediction methods. Fourth, appropriate performance measures are needed to evaluate the significance of the predictors and to minimize the number of false positives and false negatives. Fifth, it is difficult to distinguish between novel interactions and false positives. Sixth, computational PPI methods are based on experimentally collected data, and therefore any error in the experimental data will affect the computational PPI predictions.

One of the challenges of protein prediction methods is protein representation. Protein prediction methods vary in protein representation and feature extraction in order to build their classification models. Two kinds of models have generally been used to represent protein samples: the sequential model and the discrete model. The simplest sequential model for a protein is its entire AA sequence. However, this representation does not work well when the query protein does not have high sequence similarity to any attribute-known proteins. Several nonsequential models, or discrete models, have been proposed. The simplest discrete model is AA composition, which is the normalized occurrence frequencies of the 20 native amino acids in a protein. However, all the sequence-order knowledge is lost with this representation, which in turn negatively affects the prediction quality (Chou, 2011). Some approaches use AA physicochemical properties, while others use pairwise similarity. Some approaches are template-based, while others are statistical-based or machine learning (ML)–based.
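The AA composition model described above can be sketched in a few lines of Python. This is only an illustration: the function name `aa_composition` and the choice to ignore nonstandard symbols are assumptions, not part of any cited method.

```python
from collections import Counter

# One-letter codes of the 20 native amino acids.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def aa_composition(sequence: str) -> list:
    """Return the normalized occurrence frequency of each of the 20 amino acids."""
    counts = Counter(sequence.upper())
    # Assumption: symbols outside the 20 standard codes are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Two different sequences with identical composition: the order information is lost.
print(aa_composition("ACCA") == aa_composition("CAAC"))  # prints True
```

The final line shows the drawback noted in the text: two sequences with different residue order map to the same 20-dimensional vector.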

ML-based protein interaction prediction methods face various challenges. Selecting the best ML approach is itself a great challenge, because the available techniques vary in accuracy, robustness, complexity, computational cost, data diversity, overfitting, and ability to deal with missing attributes and different features. Most ML approaches to protein sequence analysis are computationally expensive and often suffer from low prediction accuracy. They are also susceptible to overfitting (Melo et al., 2003).

Most PPI prediction approaches have achieved a reasonable performance on balanced data sets containing an equal number of interacting and noninteracting protein pairs. However, this ratio is highly imbalanced in nature, and these approaches have not been comprehensively assessed with respect to the effect of the large number of noninteracting pairs in realistic data sets. In addition, since highly imbalanced distributions usually lead to large data sets, more efficient prediction methods, algorithmic optimizations, and continued improvements in hardware performance are required to handle such challenging tasks.

3 Performance measures

There are several performance measures that are used to evaluate a PPI predictor and compare it with other approaches. The most frequently used evaluation measures in this field are accuracy, sensitivity, specificity, precision, F-measure (F1), Matthews correlation coefficient (MCC), receiver operating characteristic (ROC), and AUC (Area Under the ROC Curve).

Accuracy (Ac) is the proportion of correctly predicted interacting and noninteracting protein pairs to all of the protein pairs listed in the data set. Sensitivity, or recall (R), is the proportion of correctly predicted interacting protein pairs to all of the interacting protein pairs listed in the data set. Precision (P) is the proportion of correctly predicted interacting protein pairs to all of the predicted interacting protein pairs. Specificity (Sp) is the proportion of correctly predicted noninteracting protein pairs to all the noninteracting protein pairs listed in the data set. These metrics can be represented mathematically as follows:

$Ac = \frac{TP + TN}{TP + TN + FN + FP}$

(6.1)

$R = \frac{TP}{TP + FN}$

(6.2)

$P = \frac{TP}{TP + FP}$

(6.3)

$Sp = \frac{TN}{TN + FP},$

(6.4)

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

F1 is an evaluation metric that combines precision and recall into a single value. It is defined as the harmonic mean of precision and recall (Sasaki, 2007; Powers, 2011):

$F1 = \frac{2PR}{P + R}.$

(6.5)

Matthews correlation coefficient (MCC) is a measure that balances prediction sensitivity and specificity. MCC ranges from −1, indicating an inverse prediction, through 0, which corresponds to a random classifier, to 1 for perfect prediction, and is calculated as follows:

$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}.$

(6.6)
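As a worked illustration of Equations (6.1) to (6.6), the following Python sketch computes all six measures from the four confusion-matrix counts. The function name `ppi_metrics` and the example counts are hypothetical, and the sketch assumes that every denominator except the one in MCC is nonzero.

```python
import math

def ppi_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Eqs. (6.1)-(6.6) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fn + fp)          # Eq. (6.1)
    recall = tp / (tp + fn)                             # Eq. (6.2), sensitivity
    precision = tp / (tp + fp)                          # Eq. (6.3)
    specificity = tn / (tn + fp)                        # Eq. (6.4)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (6.5)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # Assumption: MCC is reported as 0 when its denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Eq. (6.6)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts for 100 interacting and 100 noninteracting pairs.
for name, value in ppi_metrics(tp=80, tn=70, fp=30, fn=20).items():
    print(f"{name}: {value:.3f}")
```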

The receiver operating characteristic (ROC) curve is created by plotting the true positive rate (Recall) against the false positive rate (1 − Specificity) at various threshold settings. AUC is the area under the ROC curve, and it represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. The AUC can also be interpreted as the average recall over the entire range of possible specificity, or the average specificity over the entire range of possible recalls (Fawcett, 2006; Hanley and McNeil, 1982; Metz, 1978).
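The probabilistic interpretation of AUC can be checked directly by comparing every positive score with every negative score. The sketch below is a minimal illustration; the function name `auc_by_ranking` is hypothetical, and counting ties as one half is the usual convention rather than something stated in the text.

```python
def auc_by_ranking(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores for interacting and noninteracting pairs.
print(auc_by_ranking([0.9, 0.4], [0.5, 0.1]))  # 3 of 4 comparisons won -> 0.75
```

This quadratic-time comparison is meant for clarity; on large data sets a rank-based computation would be preferable.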

4 Computational approaches

PPI prediction has been studied extensively by several researchers and a large number of approaches have been proposed. These approaches can be classified into physicochemical experimental and computational approaches. Physicochemical experimental techniques identify the physicochemical interactions between proteins which, in turn, are used to predict the functional relationships between them. These techniques include yeast two-hybrid based methods (Bartel and Fields, 1997), mass spectrometry (Gavin et al., 2002), tandem affinity purification (Rigaut et al., 1999), protein chips (Zhu et al., 2001), and hybrid approaches (Tong et al., 2002). Although these techniques have succeeded in identifying several important interacting proteins in several species such as Saccharomyces cerevisiae (yeast), Drosophila, and Helicobacter pylori (Shen et al., 2007), they are expensive and significantly time consuming, and so far the identified PPIs have covered only a small portion of the complete PPI network. As a result, the need for computational tools to validate physicochemical experimental results and to predict undiscovered PPIs has increased (Zaki et al., 2009; Szilágyi et al., 2005).

Several computational methods have been proposed for PPI prediction. They can be classified into sequence-based and structure-based methods according to the protein features used. Sequence-based methods utilize AA features and can be further categorized into statistical and ML-based methods. Structure-based methods use three-dimensional structural features (Porollo and Meller, 2012) and can be categorized into template-based, statistical, and ML-based methods. This section provides an overview and discussion of some of the current computational sequence-based and structure-based PPI prediction approaches.

4.1 Sequence-based approaches

Sequence-based PPI prediction methods utilize AA features such as hydrophobicity, physicochemical properties, evolutionary profiles, AA composition, AA mean, or weighted average over a sliding window (Porollo and Meller, 2012). Sequence-based methods can be categorized into statistical and ML-based methods. This section presents and evaluates some of the existing sequence-based approaches.

4.1.1 Statistical sequence-based approaches

This section presents and describes several existing statistical sequence-based PPI prediction approaches.

4.1.1.1 Mirror tree method

Pazos and Valencia (2001) introduced the mirror tree method based on the comparison of the evolutionary distances between the sequences of the associated protein families and using the topological similarity of phylogenetic trees to predict PPIs. These distances were calculated as the average value of the residue similarities taken from the McLachlan amino acid homology matrix (McLachlan, 1971). The similarity between trees was calculated as the correlation between the distance matrices used to build the trees.

The mirror tree method does not require the creation of the phylogenetic trees; only the underlying distance matrices are analyzed, and therefore this approach is independent of any given tree-construction method. Although the mirror tree method does not require the presence of fully sequenced genomes, it requires orthologous proteins in all the species under consideration. As a result, as more species are included, fewer proteins satisfy this requirement and can be analyzed. In addition, the method is restricted to cases where sequences from at least 11 species in common are available for both proteins. This minimum limit was set empirically as a compromise between being small enough to provide enough cases and large enough for the matrices to contain sufficient information. The approach can be improved by increasing the number of possible interactions through collecting sequences from a larger number of genomes. Further, since the distance matrices are not a perfect representation of the corresponding phylogenetic trees, some inaccuracies may be introduced by comparing distance matrices instead of the real phylogenetic trees.
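The core comparison in the mirror tree method, the correlation between two distance matrices, can be sketched as follows. This is a minimal illustration that assumes both matrices are symmetric, are indexed by the same species in the same order, and have nonconstant entries; the function names are hypothetical, and the Pearson coefficient is used here as the correlation measure.

```python
import math

def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def mirror_tree_score(dist_a, dist_b):
    """Pearson correlation between the pairwise distances of two protein families."""
    x, y = upper_triangle(dist_a), upper_triangle(dist_b)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("matrices must cover the same set of at least 3 species")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Toy distance matrices over three species (hypothetical values).
family_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
family_b = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
print(mirror_tree_score(family_a, family_b))
```

A score close to 1 indicates that the two families have evolved with similar relative distances, which the method takes as evidence of interaction.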

4.1.1.2 PIPE

Pitre et al. (2006) introduced the Protein-protein Interaction Prediction Engine (PIPE) to estimate the likelihood of interactions between pairs of Saccharomyces cerevisiae proteins using protein primary structure information. PIPE is based on the assumption that interactions between proteins occur by a finite number of short polypeptide sequences observed in a database of known interacting protein pairs. These sequences are typically shorter than the classical domains and reoccur in different proteins within the cell. PIPE estimates the likelihood of a PPI by measuring the reoccurrence of these short polypeptides within known interacting protein pairs. To determine whether two proteins, A and B, interact, the two proteins are scanned for similarity to a database of known interacting protein pairs. For each known interacting pair (X, Y), PIPE uses sliding windows to compare the AA residues in protein A against those in X and the residues in protein B against those in Y, and then measures how many times a window of protein A finds a match in X while, at the same time, a window in protein B matches a window in Y. These matches are counted and added up in a two-dimensional matrix. A positive protein interaction is predicted when the reoccurrence count in certain cells of the matrix exceeds a predefined threshold value. PIPE was evaluated on a randomly selected set of 100 interacting yeast protein pairs and 100 noninteracting protein pairs from the Database of Interacting Proteins (DIP; http://dip.doe-mbi.ucla.edu; also see Salwinski et al., 2004) and Munich Information Center for Protein Sequences (MIPS; Mewes et al., 2002) databases. PIPE showed a prediction sensitivity of 0.61 and specificity of 0.89.
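The window-matching idea behind PIPE can be illustrated with a simplified Python sketch. It is not the published implementation: the window size, the similarity test (a count of identical positions), the threshold, and all function names are assumptions made for illustration.

```python
def windows(seq, w):
    """All contiguous windows of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def similar(u, v, min_identical):
    """Assumed similarity test: at least min_identical identical positions."""
    return sum(a == b for a, b in zip(u, v)) >= min_identical

def pipe_matrix(a, b, known_pairs, w=4, min_identical=4):
    """Cell (i, j) counts the known pairs (X, Y) in which window i of A
    matches some window of X while window j of B matches some window of Y."""
    wa, wb = windows(a, w), windows(b, w)
    matrix = [[0] * len(wb) for _ in wa]
    for x, y in known_pairs:
        wx, wy = windows(x, w), windows(y, w)
        rows = [i for i, u in enumerate(wa)
                if any(similar(u, v, min_identical) for v in wx)]
        cols = [j for j, u in enumerate(wb)
                if any(similar(u, v, min_identical) for v in wy)]
        for i in rows:
            for j in cols:
                matrix[i][j] += 1
    return matrix

def predict_interaction(a, b, known_pairs, threshold, **kwargs):
    """Predict an interaction when any cell count exceeds the threshold."""
    matrix = pipe_matrix(a, b, known_pairs, **kwargs)
    return max((c for row in matrix for c in row), default=0) > threshold

# Toy database of known interacting pairs (hypothetical sequences).
known = [("MKTAYIAK", "GGSLLPQ"), ("AAKTAYGG", "TTSLLPA")]
print(predict_interaction("WWKTAYWW", "HHSLLPHH", known, threshold=1))
```

In this toy case the windows KTAY and SLLP co-occur in both known pairs, so the corresponding cell reaches a count of 2 and exceeds the threshold of 1.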

Since PIPE is based on protein primary structure information without any previous knowledge about the higher structure, domain composition, evolutionary conservation, or function of the target proteins, it can identify interactions of protein pairs for which limited structural information is available. The limitations of PIPE are as follows. First, PIPE is computationally intensive and requires hours of computation per protein pair, as it scans the interaction library repeatedly every time. Second, PIPE is weak at detecting novel interactions in genome-wide, large-scale data sets, as it reported a large number of false positives. Third, PIPE was evaluated on uncertain interaction data that were determined using several methods, each with limited accuracy.

Pitre et al. (2008) then developed PIPE2 as an improved and more efficient version of PIPE, which showed a specificity of 0.999. PIPE2 represents AA sequences in a binary code, which speeds up searching the similarity matrix. Unlike the original PIPE that scans the interaction database repeatedly every time, PIPE2 precomputes all window comparisons in advance and stores them on a local disk.

Although PIPE2 achieves high specificity, its sensitivity is only 0.146, so most true interactions are missed. Moreover, in genome-wide scans even a small false positive rate can yield a large number of false positives, because noninteracting pairs vastly outnumber interacting ones. The rate of false positives can be reduced by incorporating other information about the target protein pairs, including subcellular localization or functional annotation. A major limitation of PIPE2 is that it relies exclusively on a database of preexisting interaction pairs for the identification of reoccurring short polypeptide sequences; in the absence of sufficient data, PIPE2 will be ineffective. PIPE2 is also less effective for motifs that span discontinuous primary sequence, as it does not account for gaps within the short polypeptide sequences.

4.1.1.3 CD

Liu et al. (2013) introduced a sequence-based coevolution PPI prediction method for human proteins. The authors defined coevolutionary divergence (CD) based on two assumptions. First, PPI pairs may have similar substitution rates. Second, protein interactions are more likely to be conserved across related species. CD is defined as the absolute value of the substitution rate difference between two proteins. It can be used to predict PPIs, as the CD values of interacting protein pairs are expected to be smaller than those of noninteracting pairs. The method was evaluated using 172,338 protein sequences obtained from the Evola database (Matsuya et al., 2008) for Homo sapiens and their orthologous protein sequences in 13 different vertebrates. The PPI data set was downloaded from the Human Protein Reference Database (Prasad et al., 2009). Pairwise alignment of the orthologous proteins was made with ClustalW2 software. The absolute value of the substitution rate difference between two proteins was used to measure the CDs of protein pairs, which were then used to construct the likelihood ratio table of interacting protein pairs.

The CD method combines coevolutionary information of interacting protein pairs from many species. The method does not use multiple alignments, and thus takes less time than other alignment methods such as the mirror tree method. The method is not limited to proteins with orthologs across all species under consideration. However, increasing the number of species will provide more information to improve the accuracy of the CD method. Although this method can rank the likelihood of interaction for a given pair of proteins, it does not infer specific features of the interaction, such as the interacting residues in the interfaces.

Table 6.1 summarizes these statistical sequence-based approaches including the features that are used, the technique and/or the tools applied, and the validation data sets used.

Table 6.1

Statistical Sequence-based PPI Prediction Approaches

Approach	Extracted Features	Technique/Tool	Data Sets
Mirror tree (Pazos and Valencia, 2001)	Similarity of phylogenetic trees	Evolutionary distance, McLachlan AA homology matrix	Escherichia coli protein (Dandekar et al., 1998)
PIPE (Pitre et al., 2006), PIPE2 (Pitre et al., 2008)	Short AA polypeptides	Similarity measure	Yeast protein (DIP and MIPS)
CD (Liu et al., 2013)	Coevolutionary information	Pairwise alignment, ClustalW2	Human protein (Matsuya et al., 2008; Prasad et al., 2009)


4.1.2 ML sequence-based approaches

This section describes several existing ML sequence-based PPI prediction approaches.

4.1.2.1 Auto covariance

Guo et al. (2008) proposed a sequence-based method using auto covariance (AC) and support vector machines (SVMs). AA residues were represented by seven physicochemical properties: hydrophobicity, hydrophilicity, volumes of side chains, polarity, polarizability, solvent-accessible surface area, and net charge index of AA side chains. AC accounts for the interactions between residues that are a certain distance apart in the sequence. AA physicochemical properties were analyzed by AC based on the calculation of covariance. A protein sequence was characterized by a series of ACs that covered the information of interactions between each AA residue and its 30 vicinal residues in the sequence. Finally, an SVM model with a radial basis function (RBF) kernel was constructed using the vectors of AC variables as input. The optimization experiment demonstrated that the interactions of one AA residue and its 30 vicinal AAs would contribute to characterizing the PPI information. The software and data sets are available at http://www.scucic.cn/Predict_PPI/index.htm. A data set of 11,474 yeast PPIs extracted from DIP (Xenarios et al., 2002) was used to evaluate the model, and the average prediction accuracy, sensitivity, and precision achieved were 0.86, 0.85, and 0.87, respectively.
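The AC transformation can be sketched as follows, assuming the common definition in which the covariance at a given lag is averaged over all residue pairs that far apart. The function names and the property tables passed in are hypothetical; the published method also normalizes each property scale beforehand, which is omitted here.

```python
def auto_covariance(values, max_lag):
    """AC of one property profile for lags 1..max_lag.

    AC(lag) = 1/(n - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean)
    """
    n = len(values)
    if max_lag >= n:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]

def ac_features(sequence, property_tables, max_lag=30):
    """Concatenate the AC vectors of every property for one protein.

    property_tables is a list of dicts mapping one-letter AA codes to
    numeric property values (supplied by the caller; hypothetical here).
    """
    features = []
    for table in property_tables:
        profile = [table[aa] for aa in sequence]
        features.extend(auto_covariance(profile, max_lag))
    return features

print(auto_covariance([1.0, 2.0, 3.0, 4.0], max_lag=2))
```

With seven properties and a maximum lag of 30, this sketch yields 7 × 30 = 210 values per protein, independent of the sequence length.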

One of the advantages of this approach is that AC includes long-range-interaction information of AA residues, which are important in PPI identification. The use of SVM as a predictor is another advantage. SVM is a state-of-the-art ML technique with many benefits; it overcomes many limitations of other techniques. SVM has strong foundations in statistical learning theory (Cristianini and Shawe-Taylor, 2000) and has been successfully applied in various classification problems (Zaki et al., 2011). SVM offers several related computational advantages, such as the lack of local minima in the optimization (Vapnik, 1998).

4.1.2.2 Pairwise similarity

Zaki et al. (2009) proposed a PPI predictor based on pairwise similarity of protein primary structure. Each protein sequence was represented by a vector of pairwise similarities against large AA subsequences created by a sliding window that passes over concatenated protein training sequences. Each coordinate of this vector is the E-value of the Smith-Waterman (SW) score (Smith and Waterman, 1981). These vectors were then used to compute the kernel matrix, which was exploited in conjunction with an RBF-kernel SVM. The premise is that the interaction of two proteins can be inferred from the similarity scores they produce (Zaki et al., 2006; Zaki, 2007). Each sequence in the testing set was aligned against each sequence in the training set, the number of positions with identical residues was counted, and this number was divided by the total length of the alignment.
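The identity measure in the last sentence above can be sketched for a pair of already aligned sequences. The alignment step itself (for example, Smith-Waterman) is not shown; the function name and the use of `-` as the gap symbol are assumptions.

```python
def identity_fraction(aligned_a, aligned_b, gap="-"):
    """Identical positions divided by the total length of the alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return identical / len(aligned_a)

# Hypothetical alignment of six columns with four identical residues.
print(identity_fraction("MKT-AY", "MRTQAY"))
```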

The method was evaluated on a data set of yeast S. cerevisiae proteins created by Chen and Liu (2005) that contains 4917 interacting protein pairs and 4000 noninteracting pairs. The method achieved an accuracy of 0.78, a sensitivity of 0.81, a specificity of 0.744, and an ROC score of 0.85.

The SW alignment score provides a relevant measure of similarity between proteins. Protein sequence similarity typically implies homology, which in turn may imply structural and functional similarity (Liao and Noble, 2003). SW score parameters have been optimized over the past two decades to provide relevant measures of similarity between sequences, and they now represent core tools in computational biology (Saigo et al., 2004). The use of SVM as a predictor is another advantage. This work can be improved by combining knowledge about gene ontology, interdomain linker regions, and interacting sites to achieve more accurate prediction.

4.1.2.3 AA composition

Roy et al. (2009) examined the role of amino acid composition (AAC) in PPI prediction and its performance against well-known features such as domains, the tuple feature, and the signature product feature. Every protein pair was represented by AAC and domain features. AAC was represented by monomer and dimer features. Monomer features capture composition of individual amino acids, whereas dimer features capture composition of pairs of consecutive AAs. To generate the monomer features, a 20-dimensional vector representing the normalized proportion of the 20 AAs in a protein was created. The real-valued composition was then discretized into 25 bits, producing a set of 500 binary features. To generate the dimer features, a 400-dimensional vector of all possible AA pairs was extracted from the protein sequence and discretized into 10 bits, producing a set of 4000 binary features. The domains were represented as binary features with each feature identified by a domain name. To compare AAC against other nondomain sequence-based features, tuple features (Gomez et al., 2003) and signature products (Martin et al., 2005) were obtained. The tuple features were created by grouping AAs into six categories based on their biochemical properties, and then all possible strings of length 4 were created using these categories. The signature products were obtained by first extracting signatures of length 3 from the individual protein sequences. Each signature consists of a middle letter and two flanking AAs represented in alphabetical order. Thus, two 3-tuples with the first and third amino acid letter permuted have the same signature. The signatures were used to construct a signature kernel specifying the inner product between two proteins.
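The dimer composition and the signature extraction described above can be sketched in Python. The function names are hypothetical, and the discretization of the real-valued compositions into bits is not shown because the text does not specify the binning scheme.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# All 400 ordered pairs of amino acids.
DIMERS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dimer_composition(sequence):
    """400-dimensional normalized composition of consecutive AA pairs."""
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = sum(counts[d] for d in DIMERS)
    if total == 0:
        raise ValueError("sequence contains no standard dimers")
    return [counts[d] / total for d in DIMERS]

def signatures(sequence):
    """Count length-3 signatures: a middle letter plus its two flanking
    residues in alphabetical order, so ACD and DCA share one signature."""
    found = Counter()
    for i in range(1, len(sequence) - 1):
        flanks = "".join(sorted(sequence[i - 1] + sequence[i + 1]))
        found[(sequence[i], flanks)] += 1
    return found

print(signatures("ACD") == signatures("DCA"))  # prints True
```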

The proposed approach was examined using three ML classifiers (logistic regression, SVM, and the naive Bayes classifier) on PPI data sets from yeast, worm, and fly. Three data sets for S. cerevisiae were extracted from the General Repository for Interaction Datasets (GRID) database (Stark et al., 2006): yeast two-hybrid (TWOHYB), affinity pull-down with mass spectrometry (AFFMS), and protein complementation assay. In addition, one data set each for the worm Caenorhabditis elegans (BioGRID data set; see Li et al., 2004) and the fly Drosophila melanogaster (Stark et al., 2006) was used. The authors reported that AAC features made contributions almost equivalent to those of domain knowledge across different data sets and classifiers, which indicated that AAC captures significant information for identifying PPIs. AAC is simple, computationally cheap, and applicable to any protein sequence, and it can be used when there is a lack of domain information. AAC can be combined with other features to enhance PPI prediction.

4.1.2.4 AA Triad

Yu et al. (2010) proposed a probability-based approach of estimating triad significance to alleviate the effect of AA distribution in nature. The relaxed variable kernel density estimator (RVKDE; see Oyang et al., 2005) was employed to predict PPIs based on AA triad information. The method is summarized as follows. Each protein sequence was represented as AA triads by considering every three consecutive residues in the protein sequence as a unit. To reduce the dimensionality of the feature vector, the 20 AA types were categorized into seven groups based on their dipole strength and side chain volumes (Shen et al., 2007). The triads were then scanned one by one along the sequence, and each scanned triad was counted in an occurrence vector, O. Subsequently, a significance vector, S, was proposed to represent a protein sequence by estimating the probability of observing fewer occurrences of each triad than the number actually observed in O. Each PPI pair was then encoded as a feature vector by concatenating the two significance vectors of the two individual proteins. Finally, the feature vector was used to train an RVKDE PPI predictor. The method was evaluated on 37,044 interacting pairs within 9,441 proteins from the Human Protein Reference Database (HPRD) (Peri et al., 2003; Mishra et al., 2006). Data sets with different positive-to-negative ratios (from 1:1 to 1:15) were generated with the same positive instances and distinct negative sets, which were obtained by randomly sampling from the negative instances. The authors concluded that the degree of data set imbalance is important to PPI predictor behavior. With a 1:1 positive-to-negative ratio, the proposed method achieved 0.81 sensitivity, 0.79 specificity, 0.79 precision, and 0.80 F1. These evaluation measures changed as the data became more imbalanced, reaching 0.39 sensitivity, 0.97 specificity, 0.495 precision, and 0.44 F1 with a 1:15 positive-to-negative ratio.
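The construction of the occurrence vector O can be sketched as below. The seven-group partition is the one commonly attributed to Shen et al. (2007) and should be treated as an assumption here, as should the function name and the index layout. The significance vector S is not shown because its estimation procedure is not detailed in the text.

```python
# Assumed seven-class grouping of the 20 amino acids (after Shen et al., 2007).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
GROUP_OF = {aa: index for index, members in enumerate(GROUPS) for aa in members}

def triad_occurrences(sequence):
    """Occurrence vector O: counts of the 7 * 7 * 7 = 343 group triads."""
    occurrences = [0] * 343
    for i in range(len(sequence) - 2):
        # Map three consecutive residues to their group indices.
        a, b, c = (GROUP_OF[aa] for aa in sequence[i:i + 3])
        occurrences[a * 49 + b * 7 + c] += 1
    return occurrences

# Hypothetical short sequence: two overlapping triads, AGV and GVC.
vector = triad_occurrences("AGVC")
print(len(vector), sum(vector))
```

Grouping reduces the triad space from 20³ = 8000 to 343 dimensions per protein, so a concatenated protein pair has 686 features. Residues outside the 20 standard codes raise a `KeyError` in this sketch.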

RVKDE is an ML algorithm that constructs an RBF neural network to approximate the probability density function of each class of objects in the training data set. One main distinct feature of RVKDE is that its model training process has an average time complexity of O(n log n), where n is the number of instances in the training set. In order to improve the prediction efficiency, RVKDE considers only a limited number of nearest instances within the training data set to compute the kernel density estimator of each class. One important advantage of RVKDE, in comparison with SVM, is that the learning algorithm generally takes far less training time with an optimized parameter setting. In addition, the number of training samples remaining after a data reduction mechanism is applied is very close to the number of support vectors of the SVM algorithm. Unlike SVM, RVKDE is capable of classifying data with more than two classes in a single run (Oyang et al., 2005).

4.1.2.5 UNISPPI

Valente et al. (2013) introduced the Universal In Silico Predictor of Protein-Protein Interactions (UNISPPI). The authors examined both the frequency and composition of the physicochemical properties of the 20 protein AAs to train a decision tree PPI classifier. The frequency feature set included the percentages of each of the 20 AAs in the protein sequence. The composition feature set was obtained by grouping each AA of a protein into one of three different groups related to seven physicochemical properties and calculating the percentage of each group for each feature, ending up with a total of 21 composition features. The seven physicochemical properties are hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility. When tested on a data set of PPI pairs of 20 different species (including eukaryotes, prokaryotes, viruses, and parasite-host associations), UNISPPI correctly classified 0.79 of known PPIs and 0.73 of non-PPIs. The authors concluded that using only the AA frequencies was sufficient to predict PPIs. They further concluded that the AA frequencies of asparagine (N), cysteine (C), and isoleucine (I) are important features for distinguishing between interacting and noninteracting protein pairs.

The main advantages of UNISPPI are its simplicity and low computational cost: only a small number of features is used to train the decision tree classifier, which is fast to build and has few parameters to tune. Decision trees can be easily analyzed, and their features can be ranked according to their ability to distinguish PPIs from non-PPIs. However, decision tree classifiers are prone to overfitting.

4.1.2.6 ETB-Viterbi

Kern et al. (2013) proposed the Early Traceback Viterbi (ETB-Viterbi) as a decoding algorithm with an early traceback mechanism in Interaction Profile Hidden Markov Models (ipHMMs) (Friedrich et al., 2006), which was designed to optimally incorporate long-distance correlations between interacting AA residues in input sequences. The method was evaluated with real data from the 3did database (Stein et al., 2005), along with simulated data generated from 3did data containing different degrees of correlation and reversed sequence orientation. ETB-Viterbi was able to capture the long-distance correlations for improved prediction accuracy and was not much affected by sequence orientation. The hidden Markov model (HMM) is a powerful probabilistic modeling tool for analyzing and simulating sequences of symbols that are emitted from underlying states and not directly observable (Rabiner and Juang, 1986). The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states. However, this algorithm is expensive in terms of memory and computing time. HMM training involves repeated iterations of the Viterbi algorithm, which makes it quite slow. HMM training may not converge on a truly optimal parameter set for a given training set, as it can be trapped in local maxima, and the resulting model can suffer from overfitting (Krogh et al., 1994; 1998; Eddy, 1998; Yoon, 2009).
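For reference, the standard Viterbi recursion can be sketched as follows. This is the plain textbook algorithm, not the early traceback variant of Kern et al. (2013), and the two-state model with its probabilities is a made-up toy example, not an ipHMM.

```python
import math

def _log(p):
    """Log-probability that tolerates zero probabilities."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its log-probability)."""
    if not observations:
        return [], 0.0

    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: _log(start_p[s]) + _log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + _log(trans_p[r][s])) for r in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + _log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Traceback from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy model (hypothetical numbers): N = non-interface, I = interface.
states = ["N", "I"]
start_p = {"N": 0.6, "I": 0.4}
trans_p = {"N": {"N": 0.7, "I": 0.3}, "I": {"N": 0.4, "I": 0.6}}
emit_p = {
    "N": {"polar": 0.5, "neutral": 0.4, "hydrophobic": 0.1},
    "I": {"polar": 0.1, "neutral": 0.3, "hydrophobic": 0.6},
}
path, log_prob = viterbi(["polar", "neutral", "hydrophobic"],
                         states, start_p, trans_p, emit_p)
```

The tables `best` and `back` grow with the sequence length times the number of states, which is the memory cost mentioned above; working in log space avoids numerical underflow on long sequences.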

Table 6.2 summarizes these ML sequence-based approaches and compares them in terms of features, techniques, tools, and validation data sets.

Table 6.2

ML Sequence-based PPI Prediction Approaches

Approach	Extracted Features	Technique/Tool	Data Sets
AC (Guo et al., 2008)	AA physicochemical properties	Auto covariance, SVM	Yeast protein (DIP and MIPS)
Pairwise similarity (Zaki et al., 2009)	Pairwise similarity	SVM	Yeast protein
AA composition (Roy et al., 2009)	AAC	Logistic regression, SVM, Naive Bayes	Yeast protein (GRID), worm protein (Li et al., 2004), fly protein (Biogrid)
AA triad (Yu et al., 2010)	AA triad information	RVKDE	Human protein (HPRD)
UNISPPI (Valente et al., 2013)	Frequency and composition of AA physicochemical properties	Decision tree	20 different species
ETB-Viterbi (Kern et al., 2013)	AA residues	HMMs, ETB-Viterbi	3did database

4.2 Structure-based approaches

Structure-based PPI prediction methods use three-dimensional structural features such as domain information, solvent accessibility, secondary structure states, and hydrophobic and polar surface locations (Porollo and Meller, 2012). Structure-based PPI prediction methods can be categorized into template-based, statistical, and ML-based methods. This section presents and evaluates some of the state-of-the-art structure-based approaches.

4.2.1 Template structure-based approaches

4.2.1.1 PRISM

Tuncbag et al. (2011) developed PRISM (Protein Interactions by Structural Matching) as a template-based PPI prediction method based on information regarding the interaction surface of crystalline complex structures. The two sides of a template interface are compared with the surfaces of two target monomers by structural alignment. If regions of the target surfaces are similar to the complementary sides of the template interface, then these two targets are predicted to interact with each other through the template interface architecture. The method can be summarized as follows. First, interacting surface residues of target chains are extracted using the Naccess software program (Hubbard and Thornton, 1993). Second, complementary chains of template interfaces are separated and structurally compared with each of the target surfaces by using MultiProt (Shatsky et al., 2004). Third, the structural alignment results are filtered according to threshold values, and the resulting set of target surfaces is transformed into the corresponding template interfaces to form a complex. Finally, the FiberDock algorithm (Mashiach et al., 2010) is used to refine the interactions by introducing flexibility, compute the global energy of the complex, and rank the solutions according to their energies. When the computed energy of a protein pair is less than a threshold of −10 kcal/mol, the pair is predicted to interact.

PRISM has been applied for predicting PPIs in a human apoptosis pathway (Acuner Ozbabacan et al., 2012) and a p53 protein-related pathway (Tuncbag et al., 2009), and has contributed to the understanding of the structural mechanisms underlying some types of signal transduction. PRISM obtained a precision of 0.231 when applied to a human apoptosis pathway that consisted of 57 proteins.

4.2.1.2 PrePPI

Zhang et al. (2012) proposed Predicting Protein-Protein Interaction (PrePPI) as a structural alignment PPI predictor based on geometric relationships between secondary structure elements. Given a pair of query proteins, A and B, representative structures for the individual subunits (MA; MB) are taken from the Protein Data Bank (PDB) (Berman et al., 2000) or from the ModBase (Pieper et al., 2006) and SkyBase (Mirkovic et al., 2007) homology model databases. Close and remote structural neighbors are found for each subunit. A template for the interaction exists if the PDB or the PQS (Protein Quaternary Structure) database (Henrick and Thornton, 1998) contains an interacting pair whose members are structural neighbors of MA and MB. A model is constructed by superposing the individual subunits, MA and MB, on their corresponding structural neighbors. The likelihood that each model represents a true interaction is then calculated using a Bayesian network trained on data sets of 11,851 yeast interactions and 7409 human interactions. Finally, the structure-derived score is combined with nonstructural information, including coexpression and functional similarity, in a naive Bayes classifier.

Although template-based methods can achieve high prediction accuracy when close templates are retrieved, the accuracy significantly decreases when the sequence identity of target and template is low.

4.2.2 Statistical structure-based approaches

4.2.2.1 PID matrix score

Kim et al. (2002) presented the potentially interacting domain (PID) pair matrix as a domain-based PPI prediction algorithm. The PID matrix score was constructed as a measure of interactability (interaction probability) between domains. The algorithm analysis was based on the DIP, which contains more than 10,000 mostly experimentally verified interacting protein pairs. Domain information was extracted from InterPro (Apweiler et al., 2001), an integrated database of protein families, domains, and functional sites. Cross-validation was performed with subsets of DIP data (positive data sets) and randomly generated protein pairs from the TrEMBL/SwissProt database (negative data sets). The method achieved 0.50 sensitivity and 0.98 specificity. The authors reported that the PID matrix can also be used in the mapping of the genome-wide interaction networks.

4.2.2.2 PreSPI

Han et al. (2003, 2004) proposed a domain combination-based method that considers all possible domain combinations as the basic units of protein interactions. The domain combination interaction probability is based on the number of interacting protein pairs containing the domain combination pair and the number of domain combinations in each protein. The method considers the possibility of domain combinations appearing in both interacting and noninteracting sets of protein pairs. The ranking of multiple protein pairs was decided by the interacting probabilities computed through the interacting probability equation.

The method was evaluated using an interacting set of protein pairs in yeast acquired from the DIP (Salwinski et al., 2004) and a randomly generated, noninteracting set of protein pairs. The domain information for the proteins was extracted from the PDB (http://www.ebi.ac.uk/proteome/; see Berman et al., 2000; Apweiler et al., 2001). PreSPI achieved a sensitivity of 0.77 and a specificity of 0.95.

PreSPI suffers from several limitations, though. First, this method ignores other domain-domain interaction information between the protein pairs. Second, it assumes that one domain combination is independent of another. Third, the method is computationally expensive, as all possible domain combinations are considered.

4.2.2.3 DCC

Jang et al. (2012) proposed a domain cohesion and coupling (DCC)–based PPI prediction method using the information of intraprotein domain interactions and interprotein domain interactions. The method aims to identify which domains are involved in a PPI by determining the probability that the domains cause the proteins to interact regardless of the number of participating domains. The coupling powers of all domain interaction pairs are stored in an interaction significance (IS) matrix, which is used to predict PPIs. The method was evaluated on S. cerevisiae proteins and achieved 0.82 sensitivity and 0.83 specificity. The domain information for the proteins was extracted from Pfam (http://pfam.sanger.ac.uk) (Punta et al., 2011), a protein domain family database that contains multiple sequence alignments of common domain families.

4.2.2.4 MEGADOCK

Ohue et al. (2013a) developed MEGADOCK as a protein-protein docking software package using the real Pairwise Shape Complementarity (rPSC) score. First, they conducted rigid-body docking calculations based on a simplified energy function considering shape complementarities, electrostatics, and hydrophobic interactions for all possible binary combinations of proteins in the target set. Using this process, a group of high-scoring docking complexes was obtained for each pair of proteins. Then ZRANK (Pierce and Weng, 2007) was applied for a more advanced binding energy calculation, and the docking results were reranked based on ZRANK energy scores. The deviation of the selected docking scores from the score distribution of high-ranked complexes was expressed as a standardized score (Z-score) and was used to assess possible interactions. Potential complexes that had no other high-scoring interactions nearby were rejected using structural differences. Thus, only binding pairs that had at least one populated area of high-scoring structures were considered. MEGADOCK was applied to PPI prediction for 13 proteins of a bacterial chemotaxis pathway (Ohue et al., 2012; Matsuzaki et al., 2013), and a precision of 0.4 was obtained. MEGADOCK is available at http://www.bi.cs.titech.ac.jp/megadock.
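The standardization step can be illustrated with a short sketch. It is only a schematic of the Z-score idea, not MEGADOCK's scoring code: it assumes that a higher score means a better docking pose, standardizes the best score against all retained scores for the pair, and uses a hypothetical cutoff `z_threshold` that does not come from the source.

```python
from statistics import mean, pstdev

def z_score(selected_score, reference_scores):
    """Standardize one score against a reference score distribution."""
    mu = mean(reference_scores)
    sigma = pstdev(reference_scores)
    if sigma == 0:
        raise ValueError("reference scores have zero variance")
    return (selected_score - mu) / sigma

def predict_interaction(docking_scores, z_threshold=2.0):
    """Predict an interaction when the best score stands out.

    docking_scores: scores of the high-ranked docking poses of one
    protein pair (assumed: higher is better).
    z_threshold: hypothetical cutoff, chosen only for illustration.
    """
    if len(docking_scores) < 2:
        raise ValueError("need at least two docking scores")
    best = max(docking_scores)
    return z_score(best, docking_scores) >= z_threshold

# Made-up score lists: one clear outlier versus an even spread.
outlier_pair = [0.0] * 9 + [10.0]
flat_pair = [1.0, 2.0, 3.0, 4.0, 5.0]
```

A pair whose best pose scores far above the rest of its own distribution receives a large Z-score, whereas an evenly spread score list does not, which is the intuition behind using the deviation rather than the raw score.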

One of the limitations of this approach is the generation of false positives in cases in which no similar structures are seen in known complex structure databases.

4.2.2.5 Meta approach

Ohue et al. (2013b) proposed a PPI prediction approach based on combining the template-based and docking methods. The approach applies PRISM (Tuncbag et al., 2011) as a template-matching method and MEGADOCK (Ohue et al., 2013a) as a docking method. A protein pair is considered to be interacting only if both PRISM and MEGADOCK predict that the pair interacts. When applied to the human apoptosis signaling pathway, the method obtained a precision of 0.333, which is higher than that achieved using the individual methods (0.231 for PRISM and 0.145 for MEGADOCK), while maintaining an F1 of 0.285, comparable to that obtained using the individual methods (0.296 for PRISM and 0.220 for MEGADOCK).
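The consensus rule and the reported measures can be sketched as set operations. The protein names and prediction sets below are invented toy data, not results from PRISM or MEGADOCK; the metric definitions are the standard ones, with F1 = 2PR / (P + R).

```python
def pair(a, b):
    """Order-independent representation of a protein pair."""
    return frozenset((a, b))

def consensus(predictions_a, predictions_b):
    """Keep only the pairs predicted by both methods."""
    return predictions_a & predictions_b

def precision_recall_f1(predicted, true_pairs):
    """Standard precision, recall, and F1 over sets of pairs."""
    tp = len(predicted & true_pairs)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data: hypothetical proteins P1..P6 and made-up predictions.
true_pairs = {pair("P1", "P2"), pair("P3", "P4"), pair("P5", "P6")}
template_pred = {pair("P1", "P2"), pair("P3", "P4"),
                 pair("P1", "P3"), pair("P2", "P4")}
docking_pred = {pair("P1", "P2"), pair("P5", "P6"),
                pair("P1", "P4"), pair("P2", "P6")}

meta_pred = consensus(template_pred, docking_pred)
meta_scores = precision_recall_f1(meta_pred, true_pairs)
```

In this toy case each method alone has a precision of 0.5, while the intersection keeps a single correct pair, so precision rises and recall falls. This is the trade-off described above: false positives that the two methods do not share are filtered out, at the cost of dropping true positives found by only one method.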

Meta approaches have already been used in the field of protein tertiary structure prediction (Zhou et al., 2009), and critical experiments have demonstrated improved performance of Meta predictors when compared with individual methods. The Meta approach has also provided favorable results in protein domain prediction (Saini and Fischer, 2005) and the prediction of disordered regions in proteins (Ishida and Kinoshita, 2008). Although some true positives may be dropped by this method, the remaining predicted pairs are expected to have higher reliability because of the consensus between two prediction methods that have different characteristics.

4.2.3 ML structure-based approaches

4.2.3.1 Random Forest

Chen and Liu (2005) introduced a domain-based Random Forest PPI predictor. Protein pairs were characterized by the domains existing in each protein. The protein domain information was collected from the Pfam database (Bateman et al., 2004). Each protein pair was represented by a vector of features in which each feature corresponded to a Pfam domain. If a domain existed in both proteins, then the associated feature value was 2. If the domain existed in only one of the two proteins, then the feature value was 1. If the domain existed in neither protein, then the feature value was 0. These domain features were used to train a Random Forest classifier. The Random Forest constructs many decision trees, each grown from a different subset of training samples and a random subset of features; the final classification of a given protein pair is determined by a majority vote among the classes decided by the forest of trees.
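The 0/1/2 encoding and the voting step described above can be sketched as follows. The domain identifiers are placeholders rather than real Pfam accessions, and the function names are hypothetical; training the trees themselves is left to a library of choice.

```python
from collections import Counter

def domain_features(domains_a, domains_b, vocabulary):
    """Encode a protein pair as one feature per domain in `vocabulary`.

    2: the domain occurs in both proteins
    1: the domain occurs in exactly one protein
    0: the domain occurs in neither protein
    """
    a, b = set(domains_a), set(domains_b)
    return [int(d in a) + int(d in b) for d in vocabulary]

def majority_vote(tree_votes):
    """Final class is the one predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]

# Placeholder domain identifiers (not real Pfam accessions).
vocabulary = ["D1", "D2", "D3", "D4"]
features = domain_features({"D1", "D2"}, {"D2", "D3"}, vocabulary)
decision = majority_vote([1, 0, 1])
```

Because the encoding sums membership over the two proteins, it is symmetric: swapping the proteins of a pair yields the same feature vector.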

When evaluated on a data set containing 9834 yeast protein interaction pairs among 3713 proteins and 8000 randomly generated negative samples, the method achieved a sensitivity of 0.8 and a specificity of 0.64. Yeast PPI data were collected from the DIP (Salwinski et al., 2004; Deng et al., 2002; Schwikowski et al., 2000). The data set of Deng et al. (2002) combines interaction data obtained experimentally through two-hybrid assays on S. cerevisiae by Uetz et al. (2000) and Ito et al. (2000). Schwikowski et al. (2000) gathered their data from yeast two-hybrid, biochemical, and genetic data.

The Random Forest classifier has several advantages. It is relatively fast, simple, robust to outliers and noise, and easily parallelized; it resists overfitting; and it performs well in many classification problems (Breiman, 2001; Caruana et al., 2008). Random Forest shows a significant performance improvement over single-tree classifiers. It can rank the importance of features using measures such as mean decrease in accuracy or Gini importance (Chang and Yang, 2013). Random Forest benefits from the randomization of decision trees because individual trees have low bias and high variance, and averaging many of them reduces the variance. Random Forest has few parameters to tune and is less dependent on tuning parameters (Izmirlian, 2004; Qi, 2012). However, the computational cost of Random Forest increases as the number of generated trees increases. One limitation of this approach is that PPI prediction depends on domain knowledge, so proteins without domain information cannot contribute anything useful to the prediction. Therefore, the method excluded pairs in which at least one of the proteins had no domain information.

4.2.3.2 Struct2Net

Singh et al. (2006) introduced Struct2Net as a structure-based PPI predictor. The method predicts interactions by threading each pair of protein sequences into potential structures in the PDB (Berman et al., 2000). Given two protein sequences (or one sequence against all sequences of a species), Struct2Net threads the sequence to all the protein complexes in the PDB and then chooses the best potential match. Based on this match, it uses the logistic regression technique to predict whether the two proteins interact.

Later, Singh et al. (2010) introduced Struct2Net as a web server with multiple querying options; it is available at http://struct2net.csail.mit.edu. Users can retrieve yeast, fly, and human PPI predictions by gene name or identifier, while they can query for proteins of other organisms by AA sequence in FASTA format. Struct2Net returns a list of interacting proteins if one protein sequence is provided and an interaction prediction if two sequences are provided. When evaluated on yeast and fly protein pairs, Struct2Net achieves a recall of 0.80, with a precision of 0.30.

A common limitation of all structure-based PPI prediction approaches is the low coverage as the number of known protein structures is much smaller than the number of known protein sequences. Therefore, such approaches fail when there is no structural template available for the queried protein pair. Table 6.3 summarizes these structure-based approaches and compares them in terms of features, techniques, tools, and validation data sets.

Table 6.3

Structure-based PPI Prediction Approaches

Approach	Extracted Features	Technique/Tool	Data Sets
PRISM (Tuncbag et al., 2011)	Interaction surface of crystalline complex structures	Naccess, MultiProt, FiberDock	Human protein (Acuner Ozbabacan et al., 2012; Tuncbag et al., 2009)
PrePPI (Zhang et al., 2012)	Secondary structure	Bayesian networks, naive Bayes	Yeast protein, human protein
PID matrix score (Kim et al., 2002)	Potentially interacting domain pairs	PID matrix	DIP, InterPro, TrEMBL/SwissProt
PreSPI (Han et al., 2003, 2004)	Domain combination interaction probability	Interacting probability equation	Yeast protein (DIP), PDB
DCC (Jang et al., 2012)	Intraprotein and interprotein domain interactions	Interaction significance matrix	S. cerevisiae protein, Pfam
MEGADOCK (Ohue et al., 2013a)	Shape complementarities, electrostatics, and hydrophobic interactions	rPSC, ZRANK	Bacterial protein (Ohue et al., 2012; Matsuzaki et al., 2013)
Meta approach (Ohue et al., 2013b)	Interaction surface of crystalline complex structures, shape complementarities, electrostatics, and hydrophobic interactions	PRISM, MEGADOCK	Human protein
Random Forest (Chen and Liu, 2005)	Existence of similar domains	Random Forest	DIP (Deng et al., 2002; Schwikowski et al., 2000), Pfam
Struct2Net (Singh et al., 2006, 2010)	Homology with known protein complexes in PDB	Logistic regression	Yeast, fly, and human protein

5 Conclusion

This chapter provided a review of the computational techniques for PPI prediction, including the open issues and main challenges in this domain. We investigated several relevant existing approaches and provided a categorization and comparison of them. It is clear that PPI prediction still needs much more research to achieve reasonable prediction accuracy. One issue is that PPI prediction methods do not use a uniform data set or evaluation measure. We recommend creating a freely available standard benchmark data set that takes into consideration the biological properties of proteins, and examining the performance of all these methods on this benchmark using well-defined evaluation measures. This will allow researchers to compare the performance of these prediction methods in a fair and uniform fashion. This work can be extended by investigating more recently published PPI prediction techniques, analyzing them in depth, and comparing their performance on a uniform data set according to a uniform evaluation metric. More focus should be given to the techniques that incorporate biological knowledge into the prediction process.
