7
Encoder-Decoder Models for Protein Secondary Structure Prediction

Ashish Kumar Sharma and Rajeev Srivastava

Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India

Abstract

Proteins are linear sequences of amino acids joined by peptide bonds, where each peptide bond links the amino group of one amino acid with the carboxyl group of the next. Protein secondary structure formation is governed by biophysical and biochemical properties, much as natural languages are governed by grammatical rules. The proposed model therefore predicts the secondary structure from the protein primary sequence using an encoder-decoder machine translation approach built on a long short-term memory (LSTM) network. Training and testing were performed on the publicly available CullPDB and data1199 datasets, on which the proposed model achieves Q3 accuracies of 84.87% and 87.39%, respectively. The proposed work was further evaluated by comparing its performance with other methods that predict secondary structure from a single sequence alone; the encoder-decoder model outperforms these single-sequence-based methods.

Keywords: Protein structure prediction, amino acids sequence, proteomics, one hot encoding, encoder-decoder, long short-term memory

7.1 Introduction

Proteins are important biomolecules that help build the cells of all living organisms. Within organisms, proteins mainly act as transporters, catalysts, receptors, and hormones [1]. The primary sequence of a protein is a linear chain of the twenty naturally occurring amino acids, each represented by a one- or three-letter code. Protein structure is mainly described at three levels [2]. The primary structure is the chain of amino acids linked by peptide bonds. The secondary structure is defined by local segments, which are categorized into eight classes: alpha-helix (H), 3-10 helix (G), pi helix (I), beta-sheet (E), beta bridge (B), turn (T), bend (S), and loop (L) [3]. These eight classes are further simplified into three classes: helix, strand, and loop/coil. The tertiary structure is the three-dimensional structure. Predicting the secondary structure of a protein from its primary sequence is therefore a sequence-labeling task. The secondary structure helps in estimating a protein's function, which is essential for drug design and protein engineering.

In natural language processing, there are several character-based methods for classification and prediction. These methods are mainly classified, according to how they use character-level information, into three categories: bag-of-n-gram, tokenization-based, and end-to-end models [4]. Neural machine translation has become popular with the improved performance of character-level models. Encoder-decoder models that use character-based deep neural networks have been applied to several problems such as machine translation, question answering, and speech recognition [5].

Protein sequences are formed from amino acids according to biophysical and biochemical principles, which play a role similar to the grammar of natural languages. This similarity motivates treating protein sequences as the output of a specific biological language and developing methods based on natural language processing, such as the encoder-decoder method, to discover the functions encoded within protein sequences.

The contributions made by this work are: (1) the proposed model maps the primary sequence to the secondary structure sequence as a language translation task; (2) one LSTM layer encodes the primary sequence and passes its internal state to a second LSTM layer, which works as a decoder for secondary structure prediction; and (3) the proposed encoder-decoder model was evaluated on two datasets, CullPDB and data1199. The experiments show that the proposed encoder-decoder model better captures the features of amino acid sequences needed to predict their secondary structure.

7.2 Literature Review

In the last few decades, various methods have been proposed for mapping a protein's primary sequence to its secondary structure sequence. Protein secondary structure prediction methods combine sequence homology searches [6-8] with basic features such as physicochemical properties of the primary sequence [9] and backbone torsion angles [10]. These combined features are then fed into neural networks [11, 12] or deep neural networks [13, 14] that produce the protein secondary structure. Widely used prediction methods include PSIPRED [15] and JPRED [16], which have an average Q3 accuracy of approximately 80-85% on benchmark datasets. PSIPRED was the first secondary structure prediction tool to use a PSI-BLAST search to improve prediction accuracy; it derives sequence profiles from the UNIREF90 protein database and passes them to a two-layer neural network. JPRED feeds PSI-BLAST and HMMer [17] sequence profiles into a neural network. Currently available tools with Q3 accuracy around 85-90% include SSpro [18], DeepCNF [19], PORTER [11], and PSRSM [20]. SSpro uses template data obtained through sequence homology; when homologous sequences are not available, a neural network predicts the secondary structure class. DeepCNF and PORTER combine protein sequence profiles with deep conditional neural fields and convolutional neural networks, respectively. Methods that predict protein secondary structure from a single sequence alone include PSIPRED-Single and SPIDER3-Single [21-24]; these methods use only the information extracted from a single sequence to determine the protein secondary structure.

7.3 Experimental Work

7.3.1 Data Set

To train the model, protein data was downloaded from CullPDB [25] in February 2017, with a resolution better than 2.5 Å, an R-factor below 1.0, and at most 30% sequence redundancy. Sequences with incomplete information, sequences shorter than 30 residues, and sequences with more than 30% similarity were removed using BlastClust [26]. The publicly available dataset data1199, consisting of 1,199 non-redundant sequences, was used as the test set [27].

7.3.2 Proposed Methodology

The proposed protein secondary structure prediction model is an encoder-decoder architecture based on a long short-term memory network. The input is the primary sequence and the output is the protein's secondary structure. Both the input primary sequence and the output secondary structure sequence naturally have equal lengths, since each amino acid character has a corresponding secondary structure character. The proposed model estimates the conditional probability of a secondary structure sequence (y1, ..., ym) given an input amino acid sequence (x1, ..., xn):

P(y_1, \ldots, y_m \mid x_1, \ldots, x_n) = \prod_{t=1}^{m} P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)    (7.1)

The proposed encoder-decoder model first encodes the input protein primary sequence, representing each amino acid character with a latent vector, and then decodes it into a protein secondary structure sequence.

7.3.3 Data Preprocessing

The protein sequences are split into their twenty amino acid characters, and each amino acid is assigned an integer value in the range 1-20. The amino acid characters are then represented as one-hot vectors. Protein sequences vary in length, but the deep learning model requires a fixed length: sequences longer than the fixed length are truncated, and shorter sequences are padded with zeros, as illustrated in the sketch below.
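As an illustration, the following minimal sketch (not the authors' original code) encodes amino acid characters as integers, pads or truncates to an assumed fixed length of 700 residues, and converts the result to one-hot vectors with Keras utilities; the maximum length and padding convention are assumptions made for this example.

# Minimal preprocessing sketch (assumed maximum length of 700 residues and the
# standard 20-letter amino acid alphabet; index 0 is reserved for padding).
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                         # 20 standard residues
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 1..20, 0 = padding
MAX_LEN = 700                                                # assumed fixed length

def encode_sequences(sequences):
    """Map residue characters to integers, truncate/pad, then one-hot encode."""
    integer_seqs = [[AA_TO_INT.get(ch, 0) for ch in seq] for seq in sequences]
    padded = pad_sequences(integer_seqs, maxlen=MAX_LEN,
                           padding="post", truncating="post")
    return to_categorical(padded, num_classes=len(AMINO_ACIDS) + 1)

x = encode_sequences(["MKTAYIAKQR", "GSSGSSG"])
print(x.shape)   # (2, 700, 21)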

7.3.4 Long Short Term Memory

Long short-term memory (LSTM) [28] uses an input gate (it), a forget gate (ft), and an output gate (ot) to control the flow of information, performing selective read, selective forget, and selective write. To use information efficiently and discard what is unnecessary, the three gates combine the current input with the previous state and output. The gates use the sigmoid (logistic) function as their activation, while the block input and the cell output use the hyperbolic tangent. The flow of information depends on the memory blocks used in the hidden layer. The governing equations for the LSTM are as follows:

a_t = \tanh(W_{xa} x_t + U_{ha} Out_{t-1} + b_a)    (7.2)
i_t = \sigma(W_{xi} x_t + U_{hi} Out_{t-1} + b_i)    (7.3)
f_t = \sigma(W_{xf} x_t + U_{hf} Out_{t-1} + b_f)    (7.4)
o_t = \sigma(W_{xo} x_t + U_{ho} Out_{t-1} + b_o)    (7.5)
c_t = f_t \odot c_{t-1} + i_t \odot a_t    (7.6)
Out_t = o_t \odot \tanh(c_t)    (7.7)

where W_{xa}, W_{xi}, W_{xf}, and W_{xo} are the weight matrices for the input vector x_t; U_{ha}, U_{hi}, U_{hf}, and U_{ho} are the weight matrices for the previous output Out_{t-1}; and b_a, b_i, b_f, and b_o are the bias terms for the block input a_t and the gates i_t, f_t, and o_t. ⊙ denotes the Hadamard (element-wise) product and σ is the sigmoid function.
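For clarity, the following NumPy sketch performs a single LSTM time step exactly as written in Eqs. (7.2)-(7.7); the weight values are randomly initialized purely for illustration, and the dimensions (21 input features, 256 hidden units) mirror the setup described in this chapter.

# Single LSTM time step following Eqs. (7.2)-(7.7); weights are randomly
# initialized here purely for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, out_prev, c_prev, params):
    W, U, b = params["W"], params["U"], params["b"]
    a_t = np.tanh(W["a"] @ x_t + U["a"] @ out_prev + b["a"])   # block input, Eq. (7.2)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ out_prev + b["i"])   # input gate,  Eq. (7.3)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ out_prev + b["f"])   # forget gate, Eq. (7.4)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ out_prev + b["o"])   # output gate, Eq. (7.5)
    c_t = f_t * c_prev + i_t * a_t                             # cell state,  Eq. (7.6)
    out_t = o_t * np.tanh(c_t)                                 # output,      Eq. (7.7)
    return out_t, c_t

d_in, d_hid = 21, 256
rng = np.random.default_rng(0)
params = {"W": {k: rng.normal(size=(d_hid, d_in)) * 0.01 for k in "aifo"},
          "U": {k: rng.normal(size=(d_hid, d_hid)) * 0.01 for k in "aifo"},
          "b": {k: np.zeros(d_hid) for k in "aifo"}}
out, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), params)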

The encoder-decoder model mapping the primary protein sequence to the secondary structure is depicted in Figure 7.1. In the model, the encoder and decoder each use an LSTM with 256 units. The state vectors obtained from the encoder layer serve as the initial state of the decoder LSTM. The encoder LSTM encodes the primary sequence, while the decoder LSTM predicts the secondary structure sequence. The encoder LSTM processes the individual amino acid characters in a sequence and finally produces a state vector, which is passed to the decoder. After the complete input primary sequence has been processed, the decoder layer uses the secondary structure characters and the encoder state vectors to produce the secondary structure sequence. The decoder maximizes the log-likelihood of the predicted secondary structure sequence given the encoder state vectors and the past secondary structure characters.


Figure 7.1 Encoder–decoder model for protein primary sequence to secondary structure.

The encoder layer processes the protein primary sequence one amino acid character at a time and finally produces the state vectors, while the decoder layer uses the secondary structure sequence together with the encoder state vectors, as shown in the sketch below.
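A minimal Keras sketch of this encoder-decoder architecture is given below. The 256-unit LSTMs follow the description above, while the vocabulary sizes (21 input symbols and 4 output symbols, i.e., 20 amino acids and 3 secondary structure classes plus padding) and the teacher-forcing decoder input are illustrative assumptions rather than details taken from the original implementation.

# Minimal Keras sketch of the encoder-decoder architecture (256 LSTM units);
# the 21-symbol input and 4-symbol output vocabularies are assumptions.
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

NUM_AA = 21       # 20 amino acids + padding
NUM_SS = 4        # helix, strand, coil + padding
UNITS = 256

# Encoder: reads the one-hot primary sequence and keeps only its final states.
encoder_inputs = Input(shape=(None, NUM_AA))
_, state_h, state_c = LSTM(UNITS, return_state=True)(encoder_inputs)

# Decoder: initialized with the encoder states, predicts the secondary
# structure sequence (teacher forcing on the shifted target during training).
decoder_inputs = Input(shape=(None, NUM_SS))
decoder_lstm = LSTM(UNITS, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=[state_h, state_c])
decoder_outputs = Dense(NUM_SS, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()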

7.4 Results and Discussion

Q3 accuracy is used as the performance metric for evaluating secondary structure prediction. The proposed sequence-to-sequence model is implemented in Python 3.6.7, using Keras as the front end and TensorFlow as the back end. Keras is an open-source application programming interface for deep learning, and TensorFlow provides excellent computational ability.
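Q3 accuracy is the fraction of residues whose three-state label (helix, strand, or coil) is predicted correctly. A small illustrative helper, which skips padded positions, might look as follows (the pad symbol is an assumption):

# Q3 accuracy: fraction of residues whose three-state (H/E/C) label is
# predicted correctly; padded positions are excluded from the count.
def q3_accuracy(y_true, y_pred, pad_label="-"):
    correct = total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t == pad_label:
                continue
            total += 1
            correct += int(t == p)
    return correct / total

print(q3_accuracy(["HHEEC"], ["HHECC"]))   # 0.8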

The dataset was split 70/30 into training and testing sets to develop a generalized model. The encoder-decoder model is based on a long short-term memory network with several hidden layers, each containing a number of processing cells, so a large amount of data has to be processed. RMSProp optimization with a small mini-batch size of 64 is used to speed up training. The weights and biases are updated using the categorical cross-entropy loss, i.e., the negative log-likelihood. A sketch of this training setup is shown after this paragraph.
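The following hedged sketch of the training setup reuses the `model` defined in the previous listing; the random stand-in arrays, epoch count, and use of scikit-learn's train_test_split are illustrative assumptions, not details from the original work.

# Training setup sketch: RMSProp optimizer, mini-batch size 64, categorical
# cross-entropy loss, and a 70/30 train/test split. The random arrays below
# stand in for the preprocessed encoder inputs, decoder inputs (shifted
# targets), and decoder targets described earlier.
import numpy as np
from sklearn.model_selection import train_test_split

n_seq, seq_len = 1000, 700
encoder_x = np.random.rand(n_seq, seq_len, 21)   # one-hot primary sequences
decoder_x = np.random.rand(n_seq, seq_len, 4)    # shifted secondary structure inputs
decoder_y = np.random.rand(n_seq, seq_len, 4)    # target secondary structures

(enc_tr, enc_te, dec_in_tr, dec_in_te,
 dec_out_tr, dec_out_te) = train_test_split(
    encoder_x, decoder_x, decoder_y, test_size=0.30, random_state=42)

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit([enc_tr, dec_in_tr], dec_out_tr,
          batch_size=64,
          epochs=50,                 # assumed epoch count
          validation_data=([enc_te, dec_in_te], dec_out_te))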

The performance of the proposed encoder-decoder model on the CullPDB dataset is compared in Table 7.1 with other methods that use only a single sequence, namely SPIDER3-Single [22] and PSIpred-Single [21]. The proposed model with a single LSTM performs better than both SPIDER3-Single [22] and PSIpred-Single [21].

SPIDER3-Single combines a one-hot feature vector with a bidirectional LSTM for protein secondary structure prediction, while PSIpred-Single considers the statistics of significant amino acids by calculating their correlation in each segment.

Table 7.1 Comparison of performance of various single-sequence based prediction on Cullpdb.

Methods             Q3 (%)
Seq2Seq (LSTM)      84.87
SPIDER3-Single      73.24
PSIpred-Single      70.21

To show the effectiveness of the sequence-to-sequence model, its performance on the data1199 dataset is compared in Table 7.2 with the single-sequence-based methods SPIDER3 [22], JPred4 [29], and RaptorX [30].

Table 7.2 Comparison of various single-sequence based prediction on data1199.

Methods             Q3 (%)
Seq2Seq Model       87.39
SPIDER3             83.3
JPred4              79.3
RaptorX             81.5

SPIDER3 uses a bidirectional long short-term memory network, JPred4 uses the JNet procedure, and RaptorX uses deep convolutional neural fields to predict protein secondary structure. To estimate the performance of our model fairly, the protein sequences in the testing set have low similarity to those in the training set. The Q3 accuracy of our sequence-to-sequence model is 87.39%.

7.5 Conclusion

In the proposed work, primary protein sequences are translated to their secondary structures using a sequence-to-sequence model. The protein primary sequences are represented by one-hot encoding and fed directly into the LSTM-based sequence-to-sequence model. The proposed model shows comparatively better performance on two publicly available datasets, CullPDB and data1199. Despite its simplicity, the LSTM-based sequence-to-sequence model readily captures the complex relationship between amino acids and their secondary structures.

References

1. Z. Wang, F. Zhao, J. Peng, J. Xu, Protein 8-class secondary structure prediction using conditional neural fields, Proteomics 11 (2011) 3786–3792. https://doi.org/10.1002/pmic.201100196.
2. Q. Jiang, X. Jin, S.-J. Lee, S. Yao, Protein secondary structure prediction: A survey of the state of the art, J. Mol. Graph. Model. 76 (2017) 379–402. https://doi.org/10.1016/j.jmgm.2017.07.015.
3. Y. Wang, H. Mao, Z. Yi, Protein secondary structure prediction by using deep learning method, Knowledge-Based Syst. (2017). https://doi.org/10.1016/j.knosys.2016.11.015.
4. H. Schütze, S. Schütze, H. Adel, E. Asgari, arXiv:1610.00479v3 [cs.CL], 2017.
5. E. Vylomova, T. Cohn, X. He, G. Haffari, Word Representation Models for Morphologically Rich Languages in Neural Machine Translation, in: Proc. First Workshop on Subword and Character Level Models in NLP, Association for Computational Linguistics (ACL), 2018, pp. 103–108. https://doi.org/10.18653/v1/w17-4115.
6. A. Drozdetskiy, C. Cole, J. Procter, G.J. Barton, JPred4: A protein secondary structure prediction server, Nucleic Acids Res. 43 (2015) W389–W394. https://doi.org/10.1093/nar/gkv332.
7. D. Ganguly, D. Roy, M. Mitra, G.J.F. Jones, Word Embedding based Generalized Language Model for Information Retrieval, 2015. https://doi.org/10.1145/2766462.2767780.
8. M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Gene ontology: Tool for the unification of biology, Nat. Genet. (2000). https://doi.org/10.1038/75556.
9. J. Chen, G. Liu, V. Pantalone, Q. Zhong, Physicochemical properties of proteins extracted from four new Tennessee soybean lines, J. Agric. Food Res. 2 (2020) 100022. https://doi.org/10.1016/j.jafr.2020.100022.
10. M.N. Faraggi, A. Arnau, V.M. Silkin, Role of band structure and local-field effects in the low-energy collective electronic excitation spectra of 2H-NbSe2, Phys. Rev. B 86 (2012) 035115. https://doi.org/10.1103/PhysRevB.86.035115.
11. C. Mirabello, G. Pollastri, Porter, PaleAle 4.0: High-accuracy prediction of protein secondary structure and relative solvent accessibility, Bioinformatics 29 (2013) 2056–2058. https://doi.org/10.1093/bioinformatics/btt344.
12. A. Yaseen, Y. Li, Context-based features enhance protein secondary structure prediction accuracy, J. Chem. Inf. Model. 54 (2014) 992–1002. https://doi.org/10.1021/ci400647u.
13. R. Heffernan, K. Paliwal, J. Lyons, A. Dehzangi, A. Sharma, J. Wang, A. Sattar, Y. Yang, Y. Zhou, Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning, Sci. Rep. 5 (2015) 1–11. https://doi.org/10.1038/srep11476.
14. S. Wang, J. Peng, J. Ma, J. Xu, Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields, Sci. Rep. 6 (2016). https://doi.org/10.1038/srep18962.
15. L.J. McGuffin, K. Bryson, D.T. Jones, The PSIPRED protein structure prediction server, Bioinformatics 16 (2000) 404–405. https://doi.org/10.1093/bioinformatics/16.4.404.
16. J.A. Cuff, M.E. Clamp, A.S. Siddiqui, M. Finlay, G.J. Barton, JPred: A consensus secondary structure prediction server, Bioinformatics 14 (1998) 892–893. https://doi.org/10.1093/bioinformatics/14.10.892.
17. R.D. Finn, J. Clements, S.R. Eddy, HMMER web server: Interactive sequence similarity searching, Nucleic Acids Res. 39 (2011) W29. https://doi.org/10.1093/nar/gkr367.
18. C.N. Magnan, P. Baldi, SSpro/ACCpro 5: Almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity, Bioinformatics 30 (2014) 2592–2597. https://doi.org/10.1093/bioinformatics/btu352.
19. S. Wang, J. Peng, J. Ma, J. Xu, Protein Secondary Structure Prediction Using Deep Convolutional Neural Fields, Sci. Rep. 6 (2016). https://doi.org/10.1038/srep18962.
20. Y. Ma, Y. Liu, J. Cheng, Protein secondary structure prediction based on data partition and semi-random subspace method, Sci. Rep. 8 (2018). https://doi.org/10.1038/s41598-018-28084-8.
21. Z. Aydin, Y. Altunbasak, M. Borodovsky, Protein secondary structure prediction for a single-sequence using hidden semi-Markov models, BMC Bioinformatics 7 (2006). https://doi.org/10.1186/1471-2105-7-178.
22. R. Heffernan, K. Paliwal, J. Lyons, J. Singh, Y. Yang, Y. Zhou, Single-sequence-based prediction of protein secondary structures and solvent accessibility by deep whole-sequence learning, J. Comput. Chem. 39 (2018) 2210–2216. https://doi.org/10.1002/jcc.25534.
23. A.K. Sharma, R. Srivastava, Protein Secondary Structure Prediction using Character bi-gram Embedding and Bi-LSTM, (2019) 1–7.
24. A.K. Sharma, R. Srivastava, Variable Length Character N-Gram Embedding of Protein Sequences for Secondary Structure Prediction, Protein Pept. Lett. 28 (2020) 501–507. https://doi.org/10.2174/0929866527666201103145635.
25. G. Wang, R.L. Dunbrack, PISCES: a protein sequence culling server, Bioinformatics 19 (2003) 1589–1591. https://doi.org/10.1093/bioinformatics/btg224.
26. D. Wei, Q. Jiang, Y. Wei, S. Wang, A novel hierarchical clustering algorithm for gene sequences, BMC Bioinformatics 13 (2012) 174. https://doi.org/10.1186/1471-2105-13-174.
27. R. Heffernan, Y. Yang, K. Paliwal, Y. Zhou, Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility, Bioinformatics 33 (2017) 2842–2849. https://doi.org/10.1093/bioinformatics/btx218.
28. S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Comput. (1997). https://doi.org/10.1162/neco.1997.9.8.1735.
29. A. Drozdetskiy, C. Cole, J. Procter, G.J. Barton, JPred4: A protein secondary structure prediction server, Nucleic Acids Res. (2015). https://doi.org/10.1093/nar/gkv332.
30. S. Wang, W. Li, S. Liu, J. Xu, RaptorX-Property: a web server for protein structure property prediction, Nucleic Acids Res. 44 (2016) W430–W435. https://doi.org/10.1093/nar/gkw306.

Note

  1. Corresponding author: [email protected]