Chapter 24

SMIR

A Web Server to Predict Residues Involved in the Protein Folding Core

Ruben Acuña1; Zoé Lacroix1; Jacques Chomilier2,3; Nikolaos Papandreou4
1 Scientific Data Management Laboratory, Arizona State University, Tempe, AZ, USA
2 IMPMC, Sorbonne Universités, Université Pierre et Marie Curie, CNRS, MNHN, IRD, Paris, France
3 RPBS, Université Paris Diderot, Paris, France
4 Department of Biotechnology, Agricultural University of Athens, Athens, Greece

Abstract

Motivation: Protein folding is the critical spontaneous phase when the protein gains its structural conformation, and hence, its functional shape. Should any error in the process affect its folding, the protein structure may fail to fold properly and perform its function. In some cases, such misfolded proteins can cause disease. Mutations are typical causes of protein misfolding, but some residues are more likely than others to affect the folding process when mutated. This chapter presents a new method, called SMIR, that identifies the residues involved in the core of proteins, thus more sensitive to mutations.

Results: A Monte Carlo algorithm is used to simulate the early steps of protein folding, and the mean number of spatial, noncovalently bound neighbors is calculated after 10^6 steps. Residues surrounded by many others may play a role in the compactness of the protein and thus are called Most Interacting Residues (MIR). The original MIR method was updated and extended with a new smoothing method using hydrophobic-based residue neighborhood analysis. The resulting SMIR method is implemented and available as a server that supports the submission and the analysis of protein structures with MIR2.0 and SMIR. The server offers a dynamic interface with the display of results in a two-dimensional (2D) graph.

Availability: SMIR is free and open to all users as a function of the Structural Prediction for pRotein fOlding UTility System (SPROUTS) with no login requirement at http://sprouts.rpbs.univ-paris-diderot.fr/mir.html. The new server also offers a user-friendly interface and unlimited access to results stored in a database.

Supplementary information: The MIR method and SMIR extension are described in great detail in the supplementary material available at Bioinformatics online.

Acknowledgments

We acknowledge Pierre Tufféry for his help with using the RPBS resources, Dirk Stratmann for exciting discussions on benchmarks and method comparison and integration, and Elodie Duprat for sharing her results on the beta/gamma-crystallin superfamily. Mathieu Lonquety and Christophe Legendre contributed to the SPROUTS database where SMIR results are stored, and Fayez Hadji tested a preliminary version of the server. They are all thanked for their help. We also wish to acknowledge our collaborators at ASU: Rida Bazzi, who is working with us on issues related to scientific workflow updates; Antonia Papandreou-Suppappola and Anna Malin, who have worked on an alternative MIR method; and Banu Ozkan, for evaluating SPROUTS functionalities and discussing future improvements.

Funding: This work was partially supported by the National Science Foundation (grants IIS 0431174, IIS 0551444, IIS 0612273, IIS 0738906, IIS 0832551, IIS 0944126, and CNS 0849980), and by an invitation of the Université Pierre et Marie Curie.

Disclaimer: Any opinion, finding, and conclusion or recommendation expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

1 Introduction

Amino acids involved in interresidue contacts may play a role in the compactness of the protein; hence are called Most Interacting Residues (MIR). The MIR method was first introduced to simulate the origin of protein folding (Chomilier et al., 2004). Starting from a random conformation, the folding process can be dynamically simulated in a discrete space (a lattice). Successive residues that collapse and form a local compact structure (linked to another one by an extended polypeptide chain) form a fragment. The MIR method focuses exclusively on the early steps of the folding process. In its very first implementation, it aimed to delineate the fragments formed at this stage. For this reason, the method was calibrated with time limits to maximize the number of fragments before the folding process reaches a single compact domain. It assigned a score between 2 and 8 to each residue, corresponding to the mean number of noncovalent neighbors in the lattice. A high score indicates that the residue is buried, thus belongs to a fragment. A low score indicates a low interacting residue belonging to a piece of the chain that links two consecutive fragments. A correspondence between fragments and regular secondary-structure elements (SSEs) was demonstrated on a set of 42 proteins, representative of various folds (Chomilier et al., 2004). However, it has been shown that a pertinent analysis of globular protein structures with respect to folding properties consists of describing them as an ensemble of contiguous closed loops (Berezovsky et al., 2000) or tightened end fragments (TEFs; Lamarine et al., 2001). This description reveals that the ends of TEFs are fold elements crucial for the formation of stable structures and for navigating the very process of protein folding. Meanwhile, the MIR algorithm evolved, and newer versions (including the actual presented one) aim to locate individual residues with a very high mean number of neighbors (typically ≥ 6), which are called MIRs. 
At the other extreme, individual residues with a low mean number of neighbors (typically 2) are Least Interacting Residues (LIRs). The residues identified as MIRs therefore tend to be buried at the early stages of the folding process. The comparison of MIR positions with the positions of the limits of closed loops, in proteins of known three-dimensional (3D) structures, showed a statistically significant agreement. MIRs also significantly correlate with topohydrophobic positions; i.e., positions in multiple alignments of sequences of common fold occupied only by hydrophobic amino acids, and correlated to the folding nucleus (Poupon and Mornon, 1998), thereby giving a route to simulations of the protein folding process (Papandreou et al., 2004). Thus, MIR is a potential method for an ab initio estimation of the residues that are important for folding and consequently, significantly sensitive to mutations.

It is important to keep in mind the difference between a protein core and a nucleus. Core is a static concept, and it results from the fact that a globular protein is a micelle, with an internal phase of hydrophobic character, and an external phase of hydrophilic character, statistically. The core of a protein can be derived by a simple accessible surface area calculation, or with more sophisticated methods (Bottini et al., 2013). In contrast, nucleus is a dynamic concept that relies on a model of folding—namely, the nucleation condensation model (Abkevich et al., 1994, Itzhaki et al., 1995). In a few words, a small set of dispersed amino acids come into contact during the folding because of the thermal vibrations of the molecule. They are hydrophobic, and once they form such a nucleus, the rest of the structure can be formed. Among proteins sharing the same fold, part of the nucleus is conserved. In addition, it is now documented that nonnative contacts are necessary for the folding, and they disappear once the stability is sufficient. Figure 24.1 illustrates the difference between core and nucleus in the case of a fibronectin.

Figure 24.1 The difference between the core (left) and nucleus (right) of type III fibronectin (Lappalainen et al., 2008; Billings et al., 2008).

Knowledge of the residues constituting the folding nucleus is important, for instance, in the annotation of misfolding-related pathologies, but their experimental determination is not fully reliable. Prediction is therefore a valuable complementary approach. The literature commonly accepts that the number of residues involved in the folding nucleus is typically less than 10% of the sequence length, roughly one-third of the hydrophobic residues. The initial MIR calculation slightly overpredicts the nucleus. One way to improve the prediction is to smooth the curve of the number of noncovalent neighbors (NCN) as a function of position in the sequence. This is one of the major improvements proposed with the Smoothed Most Interacting Residues (SMIR) method.

The SMIR method presented in this chapter aims at improving the accuracy of MIR in the prediction of residues involved in the folding nucleus. Indeed, it has been shown that MIR overestimates the folding nucleus of numerous proteins. The SMIR method is implemented and available as a server that supports the submission and the analysis of protein structures with MIR2.0 and SMIR. The server offers a dynamic interface with the display of results in a 2D graph.

2 Methods

The MIR method is an extension of previous simulations performed on cubic lattices, devoted to the complete folding of globular domains (Papandreou et al., 1998). The MIR algorithm is a topological calculation resulting from a series of energy-driven simulations of a protein backbone, where the mean number of noncovalent contacts is deduced for each residue. The analysis is performed at the early steps of folding and provides the number of noncovalent neighbors (NCNs) for each residue in the sequence.

The simulation of the early steps of the folding is designed in the following manner. First, an extended initial conformation is produced for an alpha-carbon-only simplified representation of the polypeptide chain. Each alpha carbon is placed at random (while constrained as a chain) on the nodes of a lattice. An extension of a cubic lattice, namely (2, 1, 0), originally proposed by Skolnick and Kolinski (1991), is used (see Figure 24.2). Compared to the simple cubic lattice, it allows a wider range of backbone angles, from 66° to 143°, among three contiguous alpha carbons. The number of first neighbors is also higher: 24 instead of 6. Side chains are discarded in the present simulation. Folding is produced by randomly selecting one amino acid and submitting it to one of two available moves: an end move for the N or C terminal positions, or a corner move otherwise. The crankshaft move is no longer permitted with the (2, 1, 0) lattice. The new position can be occupied if it was previously empty, and the energy of the new conformation is computed by means of a statistical potential of mean force taken from the literature (Miyazawa and Jernigan, 1996). The Metropolis criterion is applied to accept or reject the new conformation.
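The acceptance step described above can be illustrated with a minimal sketch of the standard Metropolis criterion. This is not the MIR Fortran code: the function name, the use of reduced energy units (Boltzmann constant set to 1), and the `temperature` parameter are assumptions made for illustration only.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Decide whether a trial move is accepted (standard Metropolis rule).

    delta_e     -- energy of the new conformation minus the old one
                   (hypothetical reduced units, k_B = 1)
    temperature -- simulation temperature in the same reduced units
    rng         -- any object with a random() method returning [0, 1)
    """
    # Moves that lower (or keep) the energy are always accepted.
    if delta_e <= 0:
        return True
    # Uphill moves are accepted with probability exp(-delta_e / T).
    return rng.random() < math.exp(-delta_e / temperature)
```

In a simulation loop, a rejected move simply leaves the chain in its previous conformation before the next residue is selected.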

Figure 24.2 Details of the (2, 1, 0) lattice, with respect to the underlying cubic lattice. The dotted line indicates a possible move to a free node (Acuña et al., 2014).

The process is stopped when roughly 10^6 to 10^7 Monte Carlo steps are reached, depending on the length of the query sequence. The full process is repeated 100 times, starting from 100 different initial conformations. The number of NCNs is recorded during each complete simulation. Two noncovalently bound residues are considered to interact if the distance between their respective alpha carbons does not exceed the upper limit of 5.9 Å. The mean NCN is calculated at the end of the process and for all the initial conformations. The distribution of NCN along the sequence presents maxima and minima. We paid most of our attention to the maxima because we were aiming to predict the core contacting residues, expected to be crucial for the formation of secondary structures (Kirster and Gelfand, 2009) and whose prediction allows for determining the fold (Jones et al., 2012). Therefore, a residue i is accepted as a MIR if NCN(i) is equal to or higher than 6. The result is that more than 90% of the MIRs are hydrophobic (Acuña et al., 2014).
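The contact rule and the MIR threshold above can be sketched as follows. The 5.9 Å cutoff and the threshold of 6 come from the text; the function names, the exclusion of chain neighbors i and i + 1 as "covalent", and the LIR cutoff of a mean NCN of 2 or less are assumptions for illustration, not the server's implementation.

```python
import math

CONTACT_CUTOFF = 5.9  # Å, upper limit for two alpha carbons to interact


def count_ncn(coords, cutoff=CONTACT_CUTOFF):
    """Count noncovalent neighbors per residue for one conformation.

    coords is a list of (x, y, z) alpha-carbon positions in Å.
    Residues adjacent in the chain are skipped (assumed covalent).
    """
    counts = [0] * len(coords)
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


def mean_ncn(counts_per_run):
    """Average per-residue NCN over several simulations."""
    runs = len(counts_per_run)
    return [sum(column) / runs for column in zip(*counts_per_run)]


def classify(mean_values, mir_threshold=6.0, lir_threshold=2.0):
    """Return (is_mir, is_lir) flag lists; lir_threshold is an assumption."""
    is_mir = [value >= mir_threshold for value in mean_values]
    is_lir = [value <= lir_threshold for value in mean_values]
    return is_mir, is_lir
```

The quadratic pair loop is acceptable here because the chains are short; a cell list would be the usual choice for much longer chains.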

It has been demonstrated that for each protein, residues identified as MIRs constitute a nontrivial subset of the hydrophobic residues. Among families of folds (several domains per family, similar structure, potentially different functions, and very divergent sequences), MIRs occupy equivalent positions in the multiple alignments. Therefore, among families, a small number of hydrophobic positions are conserved as hydrophobic. They are compulsory for the folding to occur; they are deeply buried. For these reasons, it seems reasonable to question whether they constitute the folding nucleus of the various folds. The answer is positive, as proposed by the presently available studies. They concern a very small number of families because experimental evidence of the folding nucleus is not obvious and can show strong biases. Demonstration has been extensively proposed on two complete families, the immunoglobulins (56 structures of divergent sequences) and flavodoxins (43 structures).

One limitation of the MIR algorithm was the number of MIRs identified by the threshold (typically around 15% of the amino acids), while the fraction of amino acids expected to belong to the folding nucleus lies roughly in the range of 5% to 10%. This limitation also relates to the overall sharp variation in the graph. The SMIR extension addresses these issues by using a Pascal triangle method to smooth the results. We also adjust the maxima that are identified in the smoothed graph to nearby (within three residues) hydrophobic positions based on the accepted precision of the algorithm (Chomilier et al., 2006). This is consistent with the expected accuracy for protein residue contact prediction of the contact prediction session of the Critical Assessment of (protein) Structure Prediction (CASP) experiments (Eickholt and Cheng, 2013). Hence, we continue to identify minima with a threshold but validate the extrema against the actual amino acids.
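A Pascal triangle smoothing can be read as a moving average whose weights are one row of binomial coefficients. The sketch below illustrates that idea together with the shift of a maximum to the nearest hydrophobic position within three residues. The window (row 4, weights 1 4 6 4 1), the renormalization at the chain ends, the tie-breaking toward the N terminus, and the use of FILMVYW as the hydrophobic set are all assumptions; the chapter does not specify them at this point.

```python
from math import comb

HYDROPHOBIC = frozenset("FILMVYW")  # assumed hydrophobic class


def pascal_weights(order):
    """Row `order` of Pascal's triangle, e.g. order=4 -> [1, 4, 6, 4, 1]."""
    return [comb(order, k) for k in range(order + 1)]


def smooth(values, order=4):
    """Binomially weighted moving average (order should be even).

    Near the chain ends the window is truncated and the remaining
    weights are renormalized (an assumption of this sketch).
    """
    weights = pascal_weights(order)
    half = order // 2
    smoothed = []
    for i in range(len(values)):
        total, norm = 0.0, 0
        for k, weight in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(values):
                total += weight * values[j]
                norm += weight
        smoothed.append(total / norm)
    return smoothed


def snap_to_hydrophobic(position, sequence, max_shift=3):
    """Return the nearest hydrophobic index within max_shift, else None.

    Indices are 0-based; ties are resolved toward the N terminus.
    """
    for shift in range(max_shift + 1):
        for j in (position - shift, position + shift):
            if 0 <= j < len(sequence) and sequence[j] in HYDROPHOBIC:
                return j
    return None
```

A maximum of the smoothed curve that has no hydrophobic residue within the allowed shift returns `None` here; how such a case is handled by the server is not stated in the text.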

3 Results

3.1 Model

We model a protein as a chain of evenly spaced Cα atoms placed on a lattice (Chomilier et al., 2004). We define a lattice unit (lu) to be 1.7 Å. Cα atoms are connected by vectors of the form (2, 1, 0); these vectors are √5 lu in length, which corresponds to 3.8 Å, the mean distance between adjacent Cα atoms. This results in 24 immediate neighbor positions for each point in the lattice. This represents the intersection of a 4 × 4 × 4 segmented cube with a sphere of radius 3.8 Å (√5 lu).
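The geometry of the (2, 1, 0) lattice can be checked with a few lines of code. The sketch below enumerates every permutation and sign combination of (2, 1, 0) and converts the bond length to ångströms using the 1.7 Å lattice unit given above; the function and constant names are illustrative.

```python
import math
from itertools import permutations, product

LATTICE_UNIT = 1.7  # Å per lattice unit, as defined in the text


def lattice_vectors():
    """All distinct bond vectors obtained from (2, 1, 0) by permuting
    the coordinates and flipping signs."""
    vectors = set()
    for perm in permutations((2, 1, 0)):
        for signs in product((1, -1), repeat=3):
            vectors.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(vectors)


def bond_length_angstrom(vector):
    """Length of a lattice vector in Å."""
    return math.sqrt(sum(c * c for c in vector)) * LATTICE_UNIT
```

Calling `len(lattice_vectors())` gives the number of first neighbors, and `bond_length_angstrom((2, 1, 0))` gives the Cα to Cα distance implied by the lattice unit.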

The model does not take into account the presence of side chains; therefore, the required separation is modeled with the 3.8 Å minimum distance requirement. Based on chain geometry, we limit the angle formed by the Cα atoms at positions i, i + 1, and i + 2 by requiring the distance between residues i and i + 2 to range from 4.1 to 7.2 Å (or from √6 to √18 lu). This corresponds to angles from 66° to 143°, which is closer to the real angles in alpha and beta conformations. This is illustrated in Figure 24.3, where a residue i is fixed at [0, 0, 0] and all 24 possible positions for residue i + 1 are represented as black vectors. There is a choice of 23 possible vectors for residue i + 2. For the sake of clarity, only one position [0, 1, 2] (the green and red vectors) is shown. Red vectors are those that violate the distance (angle) restriction.
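The link between the distance restriction and the angle range follows from the law of cosines: with two bonds of squared length 5 lu², a squared i to i + 2 distance d² gives cos θ = (10 − d²)/10 for the angle θ at residue i + 1. The sketch below applies this to the lattice vectors; the function names are illustrative, and the bounds 6 and 18 are the squared limits quoted in the text.

```python
import math
from itertools import permutations, product

# The 24 bond vectors of the (2, 1, 0) lattice.
VECTORS = sorted({
    tuple(s * c for s, c in zip(signs, perm))
    for perm in permutations((2, 1, 0))
    for signs in product((1, -1), repeat=3)
})


def squared_span(v1, v2):
    """Squared distance (lu^2) between residues i and i + 2 when the
    bonds i -> i+1 and i+1 -> i+2 are v1 and v2."""
    return sum((a + b) ** 2 for a, b in zip(v1, v2))


def is_allowed(v1, v2, min_sq=6, max_sq=18):
    """Distance restriction from the text, in squared lattice units."""
    return min_sq <= squared_span(v1, v2) <= max_sq


def bond_angle_degrees(v1, v2):
    """Angle at residue i + 1 between the bonds to i and to i + 2."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return math.degrees(math.acos(-dot / 5.0))


def allowed_successors(v1):
    """Second-bond vectors compatible with the restriction for v1."""
    return [v2 for v2 in VECTORS if is_allowed(v1, v2)]
```

Taking the minimum and maximum of `bond_angle_degrees` over `allowed_successors((2, 1, 0))` reproduces the two ends of the angle range stated above.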

Figure 24.3 The first five initial models (Acuña et al., 2014).

To initiate the simulations, 100 different starting models within this lattice are used. Figures 24.3 and 24.4 display a sample of five and all models, respectively, as a comprehensive plot. These models were computed randomly offline for chains of 1100 residues. For the initial models, our only requirement is that they have some level of noncompactness (Papandreou et al., 2004). Starting from the first residue, located at position [0, 0, 0], the first n positions in the seed model will be used for an input model with n residues—any additional residues in the random sequence will be discarded.

Figure 24.4 All initial models (Acuña et al., 2014).

3.2 SMIR

The MIR method was first developed in 2004 (Chomilier et al., 2004; Papandreou et al., 2004), and MIR 1.0 was first made available online as a function of the Ressource Parisienne en Bioinformatique Structurale (RPBS) server in 2005 (Alland et al., 2005). The present SMIR server exploits MIR2.2 implemented with Fortran for server-side simulation and a SMIR JavaScript front end for interactive analysis. The input to the MIR algorithm is a FASTA file containing a sequence using standard amino acids. The input file is either provided by the user or automatically retrieved from RCSB via a Protein Data Bank (PDB) ID. The output consists of a table associated with each residue: AA letter, NCN, MIRs, and LIRs. NCN is an integer while MIR and LIR are Boolean flags.

Server-side computation time is quadratic in the length of the sequence and may be modeled by 6.062 × 10^−5 x^2 − 0.0138x + 0.843 hours, where x is the number of residues, on an Intel Core 2 Duo E6600 computer. The results are stored in a MySQL 5.1 database running on Ubuntu 13.04 LTS. The new SMIR smoothing method is implemented in JavaScript with D3 (Bostock et al., 2011) and has been primarily tested in Google Chrome 33. Firefox 18 and Safari 6 have also been tested. Microsoft Internet Explorer is not currently supported. It has been found that the computation time for SMIR, once MIR results are available, is negligible on Intel Core 2 Duo–based computers. A browser-based implementation allows users to retrieve this new analysis for any existing protein without the need to resubmit the entry to our submission server.
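As a worked example of the timing model, the quadratic can be evaluated directly. The coefficients are those quoted above for the Intel Core 2 Duo E6600; the function name is illustrative, and the estimate should not be expected to transfer to other hardware.

```python
def estimate_hours(n_residues):
    """Estimated server-side MIR run time in hours for a chain of
    n_residues, using the quadratic fit quoted in the text."""
    return 6.062e-5 * n_residues ** 2 - 0.0138 * n_residues + 0.843
```

For a 1000-residue chain the three terms are 60.62, −13.8, and 0.843, so the model gives about 47.7 hours, which shows how quickly the quadratic term dominates for long sequences.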

3.3 Submitting a protein

The SMIR interface illustrated in Figure 24.5 supports the submission of a PDB ID, a list of PDB IDs, or a FASTA file. In the latter case, the user will also enter a four-letter alphanumeric code to identify the submission and later retrieve the results. The submission of an e-mail address is optional. Should one be submitted, it will be used only for the purpose of informing the user of the availability of the results in the database with a reminder of the code. After submission, the server returns a SMIR status window (see Figure 24.6). Here, the window displays the status for five proteins of PDB codes: 1AMM, 1DX5, 1I5I, 1QUC, and 1ZAC. At the top of the status windows are listed the PDB IDs that have already been analyzed by MIR. Each PDB (e.g., 1AMM, 1DX5, or 1I5I) is listed with a bullet. If a protein has more than one chain, each available chain will be listed on that PDB’s line and enclosed in parentheses [e.g., 1DX5(A), 1DX5(I), etc.]. The middle part of the window lists invalid retrieval PDB codes (e.g., 1QUC). The last part consists of the proteins that will be submitted to the server (e.g., 1ZAC). In this case, the PDB IDs will be added to the server queue for processing. Each protein submitted to the server is displayed in the status window with the retrieval link to access the data once the execution is completed. If an e-mail address was entered on the previous screen, a notification with a link will be sent upon completion. The proteins listed at the top of the SMIR status window are immediately viewable with a 2D graph (see Figure 24.7). If a PDB ID is not in the list of available proteins, it will be automatically submitted for analysis. Once the user’s protein is ready to be analyzed, the server downloads the information associated with that PDB ID from the PDB and runs MIR. After execution, the user may use the retrieval link or return to MIR query mode and enter that PDB ID to access the SMIR results. 
Additionally, the information that was generated for the new PDB ID is now available to other researchers for further use.

Figure 24.5 MIR interface.
Figure 24.6 MIR submission status.
Figure 24.7 MIR results for protein 1AMM(A).

The graphical representation illustrated in Figure 24.7 is composed of three areas: the legend for the MIR interface (top left), the 2D display graph (top right), and data download (bottom). On recent browsers, such as Chrome 24, the data can be downloaded as a comma-separated values (CSV) file. Alternatively, they can be copied and pasted from a text box. The MIR analysis for protein 1amm(A), shown in Figure 24.8, displays MIR residues in blue as a 2D graph. Dark blue vertical bars indicate which residues are MIRs, while dark red bars indicate LIRs (note that no LIR is shown in Figure 24.8). All the bars plot the NCN count at a position on the vertical axis. When the mouse hovers over the graph, a black pop-up information box displays the amino acid name, its exact position in the protein (with respect to the FASTA file the protein is associated with), the number of NCNs, and the MIR status. The orange regions in the background indicate TEFs (Chomilier et al., 2004), which overlap on slightly darker orange areas. The SMIR method is activated with a check box. When SMIR is selected, the 2D graph will show dynamically how MIR predictions (see the top left of Figure 24.8) are replaced by SMIR predictions (see the bottom right of Figure 24.8). When in smooth mode (i.e., when SMIR is selected), the dark blue and dark red bars indicate SMIRs and SLIRs (smoothed LIRs, which are minima in the NCN curves), respectively. Note that two SMIRs are shown for protein 1amm(A) in Figure 24.8.

Figure 24.8 SMIR dynamic results (bottom right) activated for 1amm(A).

3.4 Use case and discussion

John Orban’s group has demonstrated how a single point mutation can have a transformative impact on the protein fold (He et al., 2005, 2012; Alexander et al., 2007, 2009). They first considered two short domains, GA and GB, with low sequence similarity (because their sequence identity is 16%, we will use the notations GA16 and GB16, respectively, for GA and GB in the rest of this discussion). The two wild types (WTs) GA16 and GB16 were not entered in the PDB, but their sequences, shown in Table 24.1, were published by He and colleagues in 2005. They engineered two proteins, 2LHC (also referred to as GA98) and 2LHD (also referred to as GB98), of length 56, with 98% sequence identity. Their sequences differ by only one residue, at position 45, as displayed in Table 24.2. Although these two sequences show very high sequence identity, the proteins adopt significantly different structures: 2LHC contains three alpha helices, while 2LHD contains a four-stranded beta sheet and one alpha helix, as illustrated in Figure 24.9. A single mutation at the 45th residue (leucine to tyrosine) dramatically changes the folded conformation.

Table 24.1

Sequences of WT proteins GA (ga16A) and GB (gb16A) and Engineered Proteins Until 98% Sequence Identity, Reached with GA98 (2lhcA) and GB98 (2lhdA)

%   A (albumin-binding)                                       ID     B (IgG-binding)                                           ID
98  TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTLKDEIKTFTVTE  2lhcA  TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTYKDEIKTFTVTE  2lhdA
95  TTYKLILNLKQAKEEAIKELVDAGTAEKYIKLIANAKTVEGVWTLKDEIKTFTVTE  2kdlA  TTYKLILNLKQAKEEAIKEAVDAGTAEKYFKLIANAKTVEGVWTYKDEIKTFTVTE  2kdmA
91  TTYKLILNLKQAKEEAIKELVDAGTAEKYIKLIANAKTVEGVWTLKDEILTFTVTE  ga91a  TTYKLILNLKQAKEEAIKEAVDAGTAEKYFKLIANAKTVEGVWTYKDEIKTFTVTE  2kdmA
88  TTYKLILNLKQAKEEAIKELVDAGIAEKYIKLIANAKTVEGVWTLKDEILTFTVTE  2jwsA  TTYKLILNLKQAKEEAITEAVDAGTAEKYFKLYANAKTVEGVWTYKDEIKTFTVTE  gb88A
77  TTYKLILNLKQAKEEAIKELVDAGIAEKYIKLIANAKTVEGVWTLKDEILKATVTE  ga77A  TTYKLILNGKQLKEEAITEAVDAATAEKYFKLYANAKTVEGVWTYKDETKTFTVTE  gb77A
59  MYYLVVNKQQNAFYEVLNMPNLNEDQRNAFIQSLKDDPSQSANVLAEAQKLNDVQA  ga59A  MYYLVVNKGQNAFYETLTKAVDAETARNAFIQSLKDDGVQGVWTYDDATKTFTVQA  gb59A
WT  MDNKFNKEQQNAFYEVLNMPNLNEDQRNGFIQSLKDDPSQSANVLAEAQKLNDAQA  ga16A  MTYKLVINGKTLKGETTTKAVDAETAEKAFKQYANDNGVDGVWTYDDATKTFTVTE  gb16A


Table 24.2

Sequences of 2lhc(A) and 2lhd(A) with 98% Sequence Identity, with a Single Mutation on Residue 45

2LHC(A)  TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTLKDEIKTFTVTE
2LHD(A)  TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTYKDEIKTFTVTE
Figure 24.9 PDB structures for 2LHC (GA98), on the left, and 2LHD (GB98), on the right. The colored dot indicates the position of the 45th residue in both sequences.
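The 98% figure and the location of the single substitution can be verified directly from the two sequences of Table 24.2. The sketch below is only an illustration; the function names are hypothetical, and identity is computed as a simple position-by-position comparison of equal-length sequences, without alignment.

```python
# Sequences of 2LHC(A) and 2LHD(A) as listed in Table 24.2.
GA98 = "TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTLKDEIKTFTVTE"
GB98 = "TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTYKDEIKTFTVTE"


def differing_positions(seq_a, seq_b):
    """1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def percent_identity(seq_a, seq_b):
    """Percentage of identical positions (no gaps, no alignment)."""
    mismatches = len(differing_positions(seq_a, seq_b))
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)
```

With 55 identical positions out of 56, the identity is 55/56, or about 98.2%, and the only mismatch is the Leu/Tyr substitution at position 45.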

We conducted a SMIR analysis on the 14 proteins presented, including the 2 WTs, 10 intermediate engineered proteins, and 2LHC and 2LHD. The SMIRs and SLIRs collected are displayed in Table 24.3 and illustrated in Figure 24.10. The SMIR method is sensitive to structure conservation, which relies more on the concept of a protein seen as a micelle, with a hydrophobic core and an outer shell, mainly hydrophilic. Structure conservation, when transferred into sequence space, can be approached as conservation of a class of amino acids. Usually, two classes are proposed, called H and P classes in the literature. Class H is the class of hydrophobic (FILMVYW) amino acids. The reasons for the transformation from one fold to another have been the object of much conjecture, for instance in the so-called Paracelsus challenge in the 1990s. Following this idea, John Orban and colleagues proposed to keep the initial fold of a set of two proteins, each one being mutated with the purpose of increasing the sequence identity with its partner. In practice, they used two domains of 56 amino acids each, from the Streptococcus cell surface protein G. In a series of papers, starting from the WT pair at 16% identity, they progressively increased identity until a single mutation separated the two sequences, while preserving the two folds: three helices on one hand (called GA) and a four-stranded sheet plus one helix on the other (called GB).

Table 24.3

SMIR results for 14 proteins: WTs GA16 and GB16 (He et al., 2005) with 16% sequence identity, intermediate engineered proteins with pairwise increased sequence identity GA59 and GB59 (He et al., 2005), followed by GA77, GB77, GA88, GB88, GA91, GB91, GA95, GB95 (Alexander et al., 2009) and the final sequences GA98 (2LHC) and GB98 (2LHD) (He et al., 2012) with 98% sequence identity.

Protein    SLIR residue position    SMIR residue position
GA16 or GA
GB16 or GB
25 381 13 45 51
1 5 7 43 52 54
GA59
GB59
25 37
24
 16 45 51 54
 14 45
GA77
GB77
14 37 7 45 49
 7 32 45 52 54
GA88
GB88
14 37
14 37
 7 49
 7 32 49
GA91
GB91
14 37
14
 7 49
 7 32 49
GA95
GB95
13 37
14
 7 45
 7 32 49
GA98
GB98
14 37
13 37
 7 45 49
 7 49
Figure 24.10 SMIR results for 14 proteins: WTs GA16 and GB16 (He et al., 2005) with 16% sequence identity are shown at the top (GA16 on the left, GB16 on the right), intermediate engineered proteins with pairwise increased sequence identity GA59 and GB59 (He et al., 2005), followed by GA77, GB77, GA88, GB88, GA91, GB91, GA95, and GB95 (Alexander et al., 2009) and the final sequences GA98 (2lhc) and GB98 (2lhd) (He et al., 2012), with 98% sequence identity.

Figure 24.10 shows the smoothed NCN for the pairs of mutated proteins, starting from the WT and ending with the two sequences at 98% identity (i.e., different by a single mutation). In the left column of the figure, corresponding to the fold with three helices, some peaks appear in the NCN distribution, but one peak remains at the same location across all the proteins: the one at 30–31. Since some structures are available, such as 1ZXG for the protein GA59, one can see that positions 30–31 are in the middle of the second helix. The second observation that one can make on the set of proteins from the mixed fold (right column of Figure 24.10) is that most of the peaks are displaced when the number of mutations increases, with the exception of the one around 30–31, which remains a maximum in the NCN distribution throughout the set of proteins. Analysis of the structure with the PDB code 1ZXH, corresponding to the sequence GB59 in Figure 24.10, shows that positions 30–31 are also in the middle of the sole helix of this fold. Therefore, we may hypothesize that these positions are crucial for the formation of a helix, independent of the rest of the structure. It is no coincidence that position 30 is occupied by a hydrophobic residue, F, in both WTs. Moreover, if one looks closely at this position across the sequences of either the A or the B fold (shown in Figure 24.10), there is either F or I at this position, and it is known to belong to the core of the GA protein (He et al., 2012). One must also remember that these lab proteins are rather unstable, as noted by the authors.

For a possible understanding of the SMIR prediction in the context of this set of experiments, one may consider a degenerate alphabet. We used the one most common in the field of lattice simulation of folding, with two classes of amino acids: FILMVYW as the hydrophobic group (Callebaut et al., 1997), named H, and the rest denoted as P. It is noticeable that the profiles of smoothed NCN are sensitive to point mutation. Considering the A fold scenario, in which the number of mutations is increased in order to resemble the B fold, one starts with a landscape of three main and clear peaks, roughly centered on the three alpha helices. As mutations accumulate, a new peak appears at both terminal ends, although it is much clearer at the N terminus. This is the bulk of the discussion proposed by John Orban’s group in successive papers. There is an equilibrium between the two folds, at least when one single mutation separates the two proteins, and the two ends play a critical role. If the folding nucleates around position 30, producing a helix under the local interactions, the rest of the fold will depend on long-range interactions. In the case of a Leu at position 45, as is the case for GA98, this will favor the formation of a helix because Leu propensity is highest for helices. In addition, this is the most frequent residue predicted as a MIR. This second helix formation will guide the rest of the fold toward the alpha type. Otherwise, if there is a Tyr, the small difference in propensity to form helices and to be involved in numerous side chain interactions is sufficient to drive the folding toward a beta fold. MIR simulation on its own is insufficient to provide a complete and correct prediction; nevertheless, it can help in obtaining a better understanding of the steps followed by the proteins along their folding pathways.
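The two-class alphabet described above is easy to apply to the sequences of Table 24.2. The sketch below is illustrative only: the function name is hypothetical, and the H class is taken to be exactly FILMVYW as stated in the text.

```python
HYDROPHOBIC = frozenset("FILMVYW")  # class H; every other residue is P

# Sequences of GA98 (2LHC) and GB98 (2LHD) from Table 24.2.
GA98 = "TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTLKDEIKTFTVTE"
GB98 = "TTYKLILNLKQAKEEAIKELVDAGTAEKYFKLIANAKTVEGVWTYKDEIKTFTVTE"


def to_hp(sequence):
    """Translate a one-letter amino acid sequence into an H/P string."""
    return "".join(
        "H" if residue in HYDROPHOBIC else "P" for residue in sequence.upper()
    )
```

Because Leu and Tyr both belong to class H, `to_hp(GA98)` and `to_hp(GB98)` are the same string. The two-letter encoding alone therefore cannot separate the two folds, which is in line with the discussion above: the difference has to come from residue-specific effects such as helix propensity and side chain interactions.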

As a particular case of interpreting the SMIR results for generations of similar sequences, consider A91–A95. The main effect in A95 compared to A91 occurs at the place of the mutation, 50: the peak disappears, enhancing the 45 peak. In A95, position 30 is mutated, which is a peak in the smoothed NCN. The mutation is I30F. The frequencies of Ile and Phe as MIR (Acuña et al., 2014) are similar; thus, they should not significantly modify the number of local interactions. According to Alexander et al., the core of A95 contains 9 AA, with 5 hydrophobic interacting residues located at 16, 20, 30, 33, and 49. If we assume that we have a low resolution, it means that we have three interacting clusters: 20, 30, and 45. It is also indicated that the three positions 20, 30, and 45 can either produce a core and therefore an alpha class (with LIL at these positions) or a sheet with AFY. One can assume that the two leucines, which have a higher propensity for helices, drive the pathway. We predict these peaks, although we have an extra one at the N terminus. Looking at the structure (PDB 2KDL) for A95, we see that I6 is very close in distance from V39 (6.1 Å), even if it is in a loop. This is noted by Alexander et al. since both ends are loops in the A form and strands in the B form.

4 Conclusion

Based on previous work (Chomilier et al., 2004; Papandreou et al., 2004), we have presented the fundamental MIR algorithm and a method for increasing the readability and accuracy of residue interaction data. Our contribution over the previous MIR implementations is twofold: we have presented SMIR, an algorithm involving Pascal Triangle smoothing and hydrophobic residue analysis to calculate smoothed data. We have also implemented this algorithm in a new, dynamic 2D graphical interface. Users may now view the smoothed MIR data for all proteins already existing in the SPROUTS database without needing to resubmit the protein for processing. These contributions refine the MIR technique to make MIR results more intuitive and useful to the scientific community.

One practical use of MIR prediction for wet-lab biologists arises when they face the production of inclusion bodies during expression and purification. One way to circumvent this difficulty is to introduce random mutations. In that setting, the server can suggest positions that should not be mutated, namely those (specifically MIR) suspected to be important for the structure and, consequently, for the function. MIR and SMIR methods are also integrated into the SPROUTS workflow, where they can be compared with stability analysis (Acuña et al., 2015).

SMIR is free and open to all users as a functionality of the Structural Prediction for pRotein fOlding UTility System (SPROUTS) with no login requirement at http://sprouts.rpbs.univ-paris-diderot.fr/mir.html. SMIR is hosted at the Université Paris Diderot on the RPBS server, which provides scientists with a large range of resources devoted to the analysis of protein structure (Alland et al., 2005). SMIR is also available for integrated analyses of the effect of point mutations on protein structure (Lonquety et al., 2008).

References

Abkevich VI, Gutin AM, Shakhnovich EI. Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry. 1994;33(33):10026–10036.

Acuña R, Lacroix Z, Papandreou N, Chomilier J. Protein intrachain contact prediction with most interacting residues (MIR). Bio. Algorithm Med.-Syst. 2014;10(4):227–242.

Acuña R, Lacroix Z, Chomilier J. SPROUTS 2.0: a database and workflow to predict protein stability upon point mutation. In: European Conference on Computational Biology 2014, 7 - 10 Sep 2014; 2015 E01.

Alexander PA, He Y, Chen Y, Orban J, Bryan PN. The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc. Natl. Acad. Sci. U. S. A. 2007;104(29):11963–11968.

Alexander PA, He Y, Chen Y, Orban J, Bryan PN. A minimal sequence code for switching protein structure and function. Proc. Natl. Acad. Sci. U. S. A. 2009;106(50):21149–21154.

Alland C, Moreews F, Boens D, Carpentier M, Chiusa S, Lonquety M, Renault N, Wong Y, Cantalloube H, Chomilier J, et al. RPBS: a web resource for structural bioinformatics. Nucleic Acids Res. 2005;33:W44–W49.

Berezovsky IN, Grosberg AY, Trifonov EN. Closed loops of nearly standard size: common basic element of protein structure. FEBS Lett. 2000;466:283–286.

Billings K, Best R, Rutherford T, Clarke J. Crosstalk between the protein surface and hydrophobic core in a swapped fibronectin type III domain. J. Mol. Biol. 2008;375:560–571.

Bostock M, Ogievetsky V, Heer J. D3: Data-Driven Documents. IEEE Trans. Vis. Comput. Graph. 2011;17:2301–2309.

Bottini S, Bernini A, De Chiara M, Garlaschelli D, Spiga O, Dioguardi M, Vannuccini E, Tramontano A, Niccolai N. ProCoCoA: a quantitative approach for analyzing protein core composition. Comput. Biol. Chem. 2013;43:29–34.

Callebaut I, Labesse G, Durand P, Poupon A, Canard L, Chomilier J, Henrissat B, Mornon JP. Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell. Mol. Life Sci. 1997;53:621–645.

Chomilier J, Lamarine M, Mornon JP, Torres JH, Eliopoulos E, Papandreou N. Analysis of fragments induced by simulated lattice protein folding. C. R. Biol. 2004;327:431–443.

Chomilier J, Lonquety M, Papandreou N, Berezovsky I. Towards the prediction of residues involved in the folding nucleus of proteins. In: Proc. DIMACS Workshop on Sequence, Structure and System Approaches to Predict Protein Function, May 3-5, 2006; Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) Center, Rutgers University; 2006. http://dimacs.rutgers.edu/Workshops/ProteinFunction/slides/chomilier.pdf.

Eickholt J, Cheng J. A study and benchmark of DNcon: a method for protein residue contact prediction using deep networks. BMC Bioinformatics. 2013;14(Suppl 14):S12.

Fersht A, Sato S. Φ-value analysis and the nature of protein folding transition states. Proc. Natl. Acad. Sci. U. S. A. 2004;101:7976–7981.

Garbuzynskiy SO, Finkelstein AV, Galzitskaya OV. On the prediction of folding nuclei in globular proteins. Mol. Biol. 2005;39(6):906–914.

Hamill S, Steward A, Clarke J. The folding of an immunoglobulin like Greek key protein is defined by a common core nucleus and regions constrained by topology. J. Mol. Biol. 2000;297:165–178.

He Y, Yeh DC, Alexander PA, Bryan PN, Orban J. Solution NMR structures of IgG binding domains with artificially evolved high levels of sequence identity but different folds. Biochemistry. 2005;44(43):14055–14061.

He Y, Chen Y, Alexander PA, Bryan PN, Orban J. Mutational tipping points for switching protein folds and functions. Structure. 2012;20(2):283–291.

Itzhaki LS, Otzen DE, Fersht AR. The structure of the transition state for folding of chymotrypsin inhibitor 2 analysed by protein engineering methods: evidence for a nucleation condensation mechanism for protein folding. J. Mol. Biol. 1995;254:260–288.

Jones D, Buchan D, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012;28:184–190.

Kister A, Gelfand I. Finding of residues crucial for supersecondary structure formation. Proc. Natl. Acad. Sci. U. S. A. 2009;106:18996–19000.

Lamarine M, Mornon JP, Berezovsky IN, Chomilier J. Distribution of tightened end fragments of globular proteins statistically matches that of topohydrophobic positions: towards an efficient punctuation of protein folding? Cell. Mol. Life Sci. 2001;58:492–498.

Lappalainen I, Hurley MG, Clarke J. Plasticity within the obligatory folding nucleus of an immunoglobulin-like domain. J. Mol. Biol. 2008;375:547–559.

Lonquety M, Lacroix Z, Papandreou N, Chomilier J. SPROUTS: a database for the evaluation of protein stability upon point mutation. Nucleic Acids Res. 2008;37(Database issue):D374–D379.

Miyazawa S, Jernigan RL. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 1996;256:623–644.

Papandreou N, Kanehisa M, Chomilier J. Folding of the human protein FKBP. Lattice Monte-Carlo simulations. Comptes Rendus de l’Académie des Sciences Série III-Sciences de la Vie-Life Sciences. 1998;321:835–843.

Papandreou N, Berezovsky IN, Lopes A, Eliopoulos E, Chomilier J. Universal positions in globular proteins - From observation to simulation. Eur. J. Biochem. 2004;271:4762–4768.

Poupon A, Mornon JP. Populations of hydrophobic amino acids within protein globular domains: Identification of conserved “topohydrophobic” positions. Proteins-Structure Function and Genetics. 1998;33:329–342.

Skolnick J, Kolinski A. Dynamic Monte Carlo Simulations of a New Lattice Model of Globular Protein Folding, Structure and Dynamics. J. Mol. Biol. 1991;221:499–531.
