Chapter 23

An Updated Covariance Model for Rapid Annotation of Noncoding RNA

Yinglei Song1; Junfeng Qu2    1 School of Electronics and Information Science, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, China
2 Department of Computer Science and Information Technology, Clayton State University, Morrow, GA, USA

Abstract

Noncoding RNA (ncRNA) plays a fundamental role in many biological processes. Models that can accurately describe the secondary structure of ncRNA molecules, therefore, are important for recognizing them in DNA genomes. The covariance model (CM) is one of the most accurate models that can describe the base pairs in the secondary structure of an ncRNA molecule. Methods based on CMs can achieve great accuracy for genome search. However, methods based on CM are in general computationally inefficient. In this chapter, we develop a new model to describe the secondary structure of an ncRNA molecule. We show that a significant improvement in time complexity can be achieved with our model for the sequence-structure alignment of ncRNAs. Testing results show that our new model with those of the conventional CM.

Keywords

Noncoding RNA (ncRNA)

sequence-structure alignment

covariance model (CM)

structure unit

updated covariance model

1 Introduction

Noncoding RNA (ncRNA) plays an important role in a variety of biological processes, including RNA modification, gene regulation, and choromosome replication (Frank and Pace, 1998; Luo and Zhou, 2001; Mitchels and Bensaude, 2001). In general, the biological functions of ncRNA are determined by its secondary structure. Recently, due to the large amount of available genome data, computational approaches have been developed to search genomes to identify new ncRNAs (Jones and Eddy, 2001; Lowe and Eddy, 1997). Most of these approaches use a structure model to describe the secondary structure of an ncRNA family and use sequence-structure alignment to determine whether a given sequence segment belongs to the ncRNA family.

Most software tools (Lowe and Eddy, 1997; Klein and Eddy, 2003) that can search genomes for ncRNAs use the covariance model (CM) (Eddy and Durbin, 1994) to model the secondary structure of an ncRNA family. Similar to the Hidden Markov model (HMM), CM uses the statistical profiles of all single and paired positions in the sequence to describe both the primary sequence and the secondary structure of an ncRNA family. The sequence-structure alignment between a sequence and a structure model can be performed by a dynamic programming algorithm (Eddy and Durbin, 1994). Based on the statistical profiles of single and paired positions, the algorithm can find the alignment with the maximum probability. The value of this probability is then used to decide whether the sequence is part of the family. The CM-based sequence structure alignment has been successfully used to identify a variety of ncRNAs in the genomes of many species.

The dynamic programming algorithm that can optimally align a sequence to a CM needs a computation time of O(WN3), where N is the number of nucleotides in the sequence and W is the size of the CM. The size of the CM is often proportional to the length of the sequences in the ncRNA family that it models. When the length of the sequence is long, CM-based searching becomes inefficient due to the time complexity needed to perform the sequence-structure alignment and the size of the genome. For example, a genome of moderate size contains around one million nucleotides and the annotation of an ncRNA that contains 300 nucleotides in such a genome may need a few days of computation time.

In this chapter, we develop a new model to describe the secondary structure of an ncRNA molecule. In particular, based on the length distribution of a structure unit in the secondary structure, we restrict the number of insertions and deletions that may occur in the structure unit during the evolution. We show that the computation time needed to align a sequence to this new model can be significantly reduced. The increase in speed that can be achieved by our model is significant for sequences that contain a large number of nucleotides.

We implemented this approach using a computer program and tested its performance in searching a few ncRNAs in genomes. We compared both the accuracy and computational efficiency of our approach with those of the CM-based search. The testing results showed that our approach can achieve search accuracy comparable with that of the CM-based search, while significantly reducing the amount of computation time needed to search through a genome.

2 Method

2.1 Conventional CM

A conventional CM usually contains two types of elements: states and transition rules that define the possible transitions among states in the model. Each state is used to describe a single unpaired position or two paired positions in the secondary structure. Each state is associated with a set of emission probability values that describe the probability for each nucleotide or base pair to appear in the position or positions. Each transition rule is also associated with a probability value that describes the probability for the transition to occur. A well-known fact regarding the conventional CM is that it is an extension to the HMM, such that a base pair in a secondary structure can be described by emitting the nucleotides that form the base pair simultaneously. A CM is equivalent to stochastic context-free grammar (SCFG). The optimal alignment between a sequence and a conventional CM is thus computed with a dynamic programming algorithm that is similar to the CYK algorithm.

2.2 The updated CM

In our new structural model, the secondary structure of an RNA family contains a few basic structure units, each of which is a stem or a loop. Each structure unit is associated with a pair of integers. One of the integers is the minimum length of the structure unit, and the other is its maximum length. In other words, we restricted the length of each structure unit to a small interval. We describe the method that we used to determine the length restrictions for a structure unit later in this chapter. In addition to the length restrictions for the structure units, we used a CM to model the secondary structure of the family.

2.3 Sequence-structure alignment

The sequence-structure alignment between a sequence L and a structure model M can be performed with a dynamic programming algorithm similar to the CYK algorithm used to align a sequence to a conventional CM. Since each structure unit has its own length restrictions, we are able to compute for each state in the CM the range of the lengths of the subsequences that can be derived from it. The dynamic programming for each state is then performed only within the range. For the conventional CM, the dynamic programming algorithm for sequence-structure alignment computes the probabilities for each state to derive all subsequences of L. Our structure model, therefore, can lead to a significant improvement in the computational efficiency of sequence-structure alignment.

2.4 Computing the length restrictions

Each structure unit in the structure model is associated with a pair of integers that describe the range of its length in an ncRNA sequence. Given the training sequences in an ncRNA family, we could construct an HMM M for each structure unit S in the secondary structure of the family. In addition, we computed the average length and the base composition of the subsequences that form S in the training sequences. We used A to denote the average length of S in the training sequence, and using the base composition to randomly generate N subsequences of length L, we then aligned all these generated subsequences to M and obtained N different alignment scores. These N alignment scores together formed a distribution of the alignment scores of subsequences from the given base composition. We computed the mean m and the standard deviation d of these N alignment scores.

We were then ready to compute the upper bounds and lower bounds for the length of S. Next, we describe an approach that can compute both bounds given a statistical confidence value p. In particular, starting with L' = L + 1, we randomly generated N different subsequences of length L' using the same base composition and align each of them to M. N alignment scores could be generated by performing these alignments. These N alignment scores formed a new distribution. We computed the mean of these alignment scores and used m' to denote its value. We then checked whether the difference between this new distribution and the one obtained on subsequences of length L was statistically significant. To this end, we checked whether the difference between m and m' is larger than cd or not, where c was a constant that depends on the statistical confidence value p. If that were the case, L' would be the upper bound for the structure unit. Otherwise, we incremented the value of L' by 1. We repeated this procedure until we found an L' that could lead to a distribution statistically different from the one obtained on subsequences of length L. The lower bound can be obtained with a similar approach.

3 Test results

We performed experiments to test the accuracy and efficiency of our approach and compared the performance of our approach with that of the conventional CM. The training data were obtained from the Rfam database. For each family, we chose up to 60 sequences, with their pairwise identities lower than 80% of the total number of nucleotides in each sequence. We inserted several RNA sequences from the same family into a random background generated with the same base composition as the sequences in the family. We then used both our algorithm and a CM-based searching algorithm to search for the inserted sequences.

In order to compute the length restrictions for each structure unit, we assumed that the lengths of each structure unit form a normal distribution and choose a confidence value of p = 0.01. The value of constant c, thus, is 3.0. We compared the sensitivity and specificity of both approaches used with several different RNA families, and the results are shown in Table 23.1. It is not difficult to see from this table that the search accuracy of our approach is comparable to that of the CM-based searching algorithm in terms of sensitivity and specificity. Table 23.2 compares the computation time of our approach with that of the CM-based search algorithm. It is not difficult to see that our approach is significantly faster than the CM-based approach for all tested ncRNA families. In particular, our approach only needs around 1.1% of the computation time needed by the CM-based search algorithm to search for Lin_4, while achieving the same search accuracy in terms of sensitivity and specificity.

Table 23.1

Sensitivity and Specificity of Our Approach and the CM-based Search Algorithm

RNAAverage LengthSensitivity of Our ApproachSpecificity of Our ApproachSensitivity of CM-based Search AlgorithmSpecificity of CM-based Search Algorithm
Entero_CRE610.780.970.811
Entero_OriR730.95111
Let_7841111
Lin_4721111
Purine1030.960.970.931
SECIS680.950.8910.97
S_box1120.93111
Tymo_tRNA-like861110.97

t0010

Table 23.2

Computation Time Needed to Search All ncRNA Families

RNAComputation Time for CM-based Search Algorithm (s)Computation Time for Our Approach (s)The Amount of Speed-up Using Our Approach
Entero_CRE57.962.4124.05X
Entero_OriR103.083.7227.71X
Let_7157.1110.8114.53X
Lin_4132.511.4690.76X
Purine179.293.6255.00X
SECIS185.217.2525.54X
S_box756.2724.5230.84X
Tymo_tRNA-like185.053.2357.29X

t0015

4 Conclusions

In this chapter, we developed a new structure model that can be used to significantly speed up the sequence-structure alignment for ncRNAs. This new model is based on the conventional CM, which describes the sequences of an ncRNA family and their secondary structure with a statistical model. In addition to employing a CM to describe the primary sequence content and the secondary structure of sequences in an ncRNA family. Our model imposes length restrictions on each structure unit in the secondary structure. Using the length restrictions of structure units, the dynamic programming algorithm for sequence-structure alignment is significantly faster than that of the conventional CM. We showed that the length restrictions of a structure unit can be computed from the training data based on a statistical confidence value. Our testing results showed that this approach can significantly speed up the search of ncRNAs in genomes while achieving a search accuracy comparable to that of the conventional CM-based approach.

One disadvantage of our approach is that it cannot be used to model the ncRNA family, where some sequences do not contain certain structure units that are present in other sequences. In other words, some structure units are missing in these sequences. Although a missing structure unit can be accurately modeled by a conventional CM, our model may have difficulty doing that. The difficulty is largely because a missing structure unit can generate a sudden change in the range of lengths for some states in the CM. This sudden change cannot be accurately described by the length restrictions without significantly increasing the amount of computation time. The development of new approaches or models that can also handle the missing structure units is one of the directions we are taking in our future work.

So far, our approach has been used only to search for ncRNAs that do not contain pseudoknots. Pseudoknots contain crossing stems and are more difficult to model than secondary structure that does not contain crossing stems. Our previous work (Song et al., 2006; Song and Chi, 2015; Song et al., to come, 2014) has developed methods and models that can estimate the parameters associated with structural models and search for ncRNAs that contain pseudoknots. Our future may extend this approach to modeling ncRNAs that contain pseudoknot structures. In addition, approaches based on data mining and statistical principles (Qu et al., 2007, 2008, 2009; Qu and Arabnia, 2005a, 2005b; Song et al., 2014; Song and Qu, 2014) have been proved to be effective for solving problems from a large number of areas. Combining our techniques with the techniques that are already available in these areas to further improve the accuracy of our approach is another possible direction for our future work.

References

Eddy S, Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Res. 1994;22:2079–2088.

Frank DN, Pace NR. Ribonuclease p: unity and diversity in a tRNA procesing ribozyme. Annu. Rev. Biochem. 1998;67:153–180.

Jones ER, Eddy SR. Computational identification of noncoding RNAs in E. Coli by comparative genomics. Curr. Biol. 2001;11:1369–1373.

Klein RJ, Eddy SR. Rsearch: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003;4:44.

Lowe TM, Eddy SR. tRNASCAN-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964.

Luo ZY, Zhou Q. The 7sk small nuclear rna inhibits the cdk9/cyclin t1 kinase to control transcriptions. Nature. 2001;414:317–322.

Mitchels V, Bensaude O. 7sk small nuclear rna binds to and inhibits the activity of cdk9/cyclin t complexes. Nature. 2001;414:322–325.

Qu J, Arabnia HR. Mining structural changes in financial time series with gray system. In: Proceedings of the 2005 International Conference on Data Mining; 2005a:173–180.

Qu J, Arabnia HR. A novel short-term stock price predicting system. In: Proceedings of the 2005 International Conference on Information & Knowledge Engineering; 2005b:25–31.

Qu J, Arabnia HR, Song Y, Rasheed K, Houston J. Time series similarity with a new distance measure. In: Proceedings of the 2008 International Conference on Information & Knowledge Engineering; 2007:183–189.

Qu J, Song Y, Arabnia HR, Jeff B. Knowledge retrieval in financial domain. In: Proceedings of the 2008 International Conference on Information & Knowledge Engineering; 2008:140–144.

Qu J, Rahman MA, Song Y, Zhu L, Wei Y, Hong W, Arabnia HR. PaperGuard: a support vector machine approach for screening machine generated papers. In: Proceedings of the 2009 International Conference on Information & Knowledge Engineering; 2009:433–438.

Song Y, Chi A. A new approach for parameter estimation in the sequence-structure alignment in noncoding RNAs. J. Inform. Sci. Eng. 2015;31(2):593–607.

Song Y, Qu J. A graph theoretic approach for protein threading. In: Proceedings of the 10th International Conference on Intelligent Computing; 2014:501–507.

Song Y, Liu C, Malmberg RL, Pan F, Cai L. Tree decomposition based fast search of RNA structures including pseudoknots in genomes. In: Proceedings of the Fourth International IEEE Computational Systems Bioinformatics Conferences; 2005:223–234.

Song Y, Liu C, Huang X, Malmberg RL, Xu Y, Cai L. Efficient parameterized algorithms for biopolymer structure-sequence alignment. IEEE/ACM Trans. Comput. Biol. Bioinform. 2006;3(4):423–432.

Song Y, Wang C, Qu J. A parameterized algorithm for predicting transcription factor binding sites. In: Proceedings of the 10th International Conference on Intelligent Computing; 2014:339–350.

Song Y., Liu C., Wang, Z., A machine learning based approach for accurate annotation of noncoding RNAs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, to appear.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset