Chapter 13

The Big ORF Theory

Algorithmic, Computational, and Approximation Approaches to Open Reading Frames in Short- and Medium-Length dsDNA Sequences

Steven M. Carr1,2,4 [email protected]; H. Dawn Marshall1; Todd Wareham2; Donald Craig3
1 Department of Biology, Memorial University of Newfoundland, St. John’s NL, Canada
2 Department of Computer Science, Memorial University of Newfoundland, St. John’s NL, Canada
3 eHealth Research Unit (Faculty of Medicine), Memorial University of Newfoundland, St. John’s NL, Canada
4 Terra Nova Genomics, Inc., St. John’s NL, Canada

Abstract

In the genomic data-mining era of genetics and bioinformatics, a frequent task is the exploration of double-stranded DNA (dsDNA) sequence data for the occurrence of protein-coding regions. The expectation is that five of the six possible three-letter reading frames will be closed by “stop” triplets in the Genetic Code, and the sixth will be an “Open Reading Frame” (ORF) comprising a series of ‘coding’ triplets without “stops” that specifies a polypeptide sequence. Here, we evaluate properties of the set of such “5&1” solutions for dsDNA sequences of length L base pairs. We provide an approximation of the “5&1” condition based on a simplified model of the three-stop Universal Genetic Code in which S = 3 of the 64 triplets are ‘stops’ and the proportion of ‘coding’ triplets is C = (64 - S)/64. We show that the probability of exactly one ORF solution is maximized at pmax = 0.40 for fragments of length 111 base pairs, and that pmax is constant for alternative Genetic Codes with S = 1, 2, or 4 ‘stops’. Monte Carlo sampling of DNA sequences and evaluation for “5&1” solutions confirms the validity of the triplet approximation. Demonstration that short random DNA sequences over a predictable size range will include ORFs at high probability bears on the evolution of alternative Genetic Codes, and the availability of such fragments to evolutionary recruitment and modification by natural selection.

Keywords

DNA

messenger RNA (mRNA)

genetic code

stop codons

Open Reading Frames (ORFs)

data mining

evolution

premature termination

readthrough

environmental DNA

Acknowledgments

S. M. Carr, H. D. Marshall, and T. Wareham were supported by NSERC Discovery Grants during the preparation of this chapter. We thank K. Tahlan for pointing out the implications for lateral gene transfer in bacteria. We thank H. Arabnia for his leadership of the BioComp conferences, and his support for the preparation of this manuscript. SMC dedicates this chapter to Professor William D. Stansfield of California Polytechnic State University, San Luis Obispo, in recognition of his long service in genetics education.

1 Introduction

The cracking of the genetic code by means of a rapid series of experiments and logical inferences is arguably the first instance of a “big science” approach in the history of molecular genetics (Judson, 1996). Theoretical considerations had already indicated that any nucleic acid code words must comprise a minimum of three letters (Crick, 1966). After it was demonstrated in 1961 that an artificial poly-U RNA template directs incorporation of the amino acid phenylalanine into a polypeptide, and thus that UUU was the code for Phe, Marshall Nirenberg’s lab had by 1963 deduced an incomplete “dictionary” of 50 three-letter code words (Nirenberg et al., 1965), and a substantially complete genetic code table was created by 1965 (Nirenberg et al., 1966; also see Figure 13.1). The iconic 4 × 4 × 4 table is now a standard feature of biology textbooks and has been incorporated into bioinformatic computational schemes as a fundamental feature.

Figure 13.1 The genetic code, 1965. Note that uncertainties still existed as to the coding properties of UGA (a TERM or stop codon) and UGG (a Leu codon).

In this chapter, we consider properties of short segments of the genetic code that are of interest both theoretically, as unexplored computational challenges, and practically, bearing on the evolution and function of the code and coding molecules. Taken together, the solution of these challenges at the intersection of computational and biological science provides reciprocal illumination to each.

2 Molecular genetic and bioinformatic considerations

2.1 Molecular genetics of DNA → RNA → Protein

DNA is famously a double-stranded molecule (dsDNA) that comprises two polymeric sequences of four bases (A, C, G, and T) in an aperiodic order that conveys bioinformation. The two strands are arranged in antiparallel 5’→3’ directions that are implicit in the deoxyribose component. The strands are held together by noncovalent hydrogen bonds between paired A + T or C + G base pairs. The antiparallel arrangement and base pairing rules ensure that the alternative strands are complementary to each other. This relationship is the basis of DNA as a self-replicating molecule.

One DNA strand, designated the template strand, serves as a template for 5’→3’ synthesis (transcription) of a complementary messenger RNA (mRNA) molecule, where RNA differs from DNA in being single-stranded and substituting base U for T. The mRNA molecule is translated in the 5’→3’ direction into a polymer comprising a sequence of amino acids (a polypeptide), according to a genetic code (Figure 13.1). In the code, each of the 64 possible three-letter base sequences (codons) reads 5’→3’ and specifies a particular amino acid, except that three codons (UAA, UAG, and UGA) do not specify any amino acid and therefore serve as terminators (known as stops) to polypeptide synthesis. A common genetic code is universal for the nuclear genomes of all organisms.

2.2 Bioinformatic data-mining

Because the mRNA sequence is complementary to that of the DNA template strand, it necessarily has the same base sequence in the same 5’→3’ direction as the DNA strand complementary to the template strand, except for the substitution of U for T. This DNA strand, designated the sense strand, may therefore be read directly from the genetic code table, substituting T for U. As a bioinformatic process, it is straightforward to read the polypeptide sequence directly from the DNA sense strand, without the intermediate molecular steps of mRNA transcription and subsequent translation via tRNA. (By definition, codons occur only in mRNA: the equivalent three-letter sequences in the DNA sense strand are designated as triplets. Hereafter in this discussion, we adopt the National Center for Biotechnology Information (NCBI) bioinformatic convention and use a DNA triplet alphabet.)

Any dsDNA molecule may be read from six potential starting points, designated as reading frames (RFs), which are three-base windows that commence at the first, second, or third base from the 5’ end of one strand, after which each frame repeats; or from the 5’ end of the other strand starting at the opposite end of the molecule. Full-length DNA sequences of several hundred to more than a thousand bases that specify protein sequences that are hundreds of amino acids long are expected to show that only one of these RFs is an Open Reading Frame (ORF); that is, that it does not include a stop triplet over the required length of the polypeptide. As three out of 64 triplets are stops (TAA, TAG, and TGA), the five alternative RFs are expected to include multiple random stops at expected intervals of about 20 triplets: the first occurrence of a stop closes the RF. We designate this the 5&1 condition. Commercial DNA software programs perform this process as a matter of routine, either from novel data or data mined from online resources such as GenBank.
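The six-frame reading described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the software referred to in the text: the function names are hypothetical, a frame is taken to consist of the complete triplets only (trailing bases are ignored), and a frame is counted as open when none of its triplets is TAA, TAG, or TGA.

```python
STOPS = {"TAA", "TAG", "TGA"}               # stop triplets of the universal code
COMPLEMENT = str.maketrans("ACGT", "TGCA")  # base-pairing rules


def reverse_complement(seq):
    """Return the complementary strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def reading_frames(seq):
    """Yield the six reading frames, each as a list of complete triplets."""
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            yield [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]


def open_frame_count(seq):
    """Count the frames that contain no stop triplet."""
    return sum(1 for frame in reading_frames(seq) if not STOPS.intersection(frame))


def satisfies_5_and_1(seq):
    """True when exactly one frame is open and the other five are closed."""
    return open_frame_count(seq) == 1


print(open_frame_count("TAATTAATTAA"))   # 1: five frames closed, one open
print(satisfies_5_and_1("TAATTAATTAA"))  # True
```

In the 11-base example, all three frames of the given strand contain TAA, while the complementary strand TTAATTAATTA contains TAA in two frames and none in the third, so the sequence satisfies the 5&1 condition.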

3 Algorithmic and programming considerations

An introduction to the theory of data mining for such ORFs typically begins with the propounding of short dsDNA sequence exemplars of length L = 15 ~ 25 base pairs that are constrained by the 5&1 condition. A practical algorithmic generator of such exemplars must be able to access the entire space of dsDNA sequences that satisfy the 5&1 condition for a specified L, sample that space in an at least approximately random manner, and be efficient in terms of both central processing unit (CPU) run time and required memory space. We developed two such algorithms (Carr et al., 2014a), the first based on a two-level recursive search that generates a dsDNA skeleton with at least one stop codon in each of five frames, and then completes the remainder of the dsDNA sequence by adding bases at random to the skeleton so as to produce an ORF exemplar in which the 5&1 condition is maintained. An app that generates dsDNA sequence exemplars that satisfy the 5&1 condition for L ≤ 100 is available at http://www.ucs.mun.ca/~donald/orf/biocomp/. We provide a more complete discussion of the pedagogical use of the web application in a previous study (Carr et al., 2014b). The second algorithm used an exhaustive search that enumerated all those dsDNA sequences of length L that satisfied the 5&1 condition without storing the results as exemplars.

The recursive and exhaustive algorithms show that there are no solutions for L = 5 ~ 10, and 96 for L = 11 (Figure 13.2, after Carr et al., 2014a). Enumerations from the two methods agree for 11 ≤ L ≤ 19, at which point the recursive algorithm succumbs to memory limitations. For L < 22, CPU usage for the exhaustive algorithm was measured on a single, quad-core PC. For L ≥ 22, CPU usage was measured over a network of such machines: by L = 25, exact CPU usage is obscured by competing demands from other users on the same network. Calculation of the number of 5&1 solutions for L > 25 with the resources available to us would require several days.
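A brute-force enumeration in the spirit of the exhaustive algorithm can be sketched as follows. This is not the authors' implementation: it is a minimal Python sketch with hypothetical function names that assumes a frame consists of complete triplets only, and it makes no attempt at the efficiency needed for larger L.

```python
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def open_frames(seq):
    """Count the reading frames (three per strand) without a stop triplet."""
    count = 0
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            starts = range(offset, len(strand) - 2, 3)
            if not any(strand[i:i + 3] in STOPS for i in starts):
                count += 1
    return count


def count_5_and_1(length):
    """Enumerate all 4**length sequences and count those with exactly one open frame."""
    return sum(
        1
        for bases in product("ACGT", repeat=length)
        if open_frames("".join(bases)) == 1
    )


for length in range(5, 9):
    print(length, count_5_and_1(length))  # no solutions at these lengths
```

The work grows as 4^L, so each added base multiplies the run time by four; L = 11 already requires about 4.2 million sequences, which is slow but feasible in pure Python.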

Figure 13.2 Semilogarithmic plot of the enumerated number of ORF exemplars of length L ($N_{ORF_L}$) for L = 11 ~ 25. The total number of possible dsDNA sequences of length L is $4^L$ (▪). Required CPU time for the exhaustive algorithm is given in seconds (♦); CPU time is log-linear with respect to L, as $\log_{10}$(CPU) = 0.613 L − 11.736 ($r^2$ = 0.9998). After Figure 3 in Carr et al. (2014a).

4 Analytical and random sampling solutions to L > 25 sequences: Triplet-based approximations

Given these limitations, we have developed a simplified analytical formulation of the 5&1 problem, in which the 64 triplets in the universal genetic code comprise 61 coding and S = 3 stop triplets. If we disregard the actual nucleotide composition of coding and noncoding triplets and the overlapping nature of the six RFs, the probability that any given triplet is a coding triplet is C = (64 − S)/64 = 61/64. Next, the probability that a string of T triplets will be an ORF is simply calculated as

\[ p(\mathrm{ORF}_T) = C^T, \tag{13.1} \]

and the probability that such a string will include at least one stop is calculated as

\[ p(\mathrm{stop}) = 1 - C^T. \tag{13.2} \]

Then, an approximation of the probability that a string of triplets satisfies the 5&1 condition, $p(5\&1_T)$, is the joint probability that RF1 is open and RFs 2-6 are all closed, or that any one of RF2, RF3, … RF6 is open and the other five RFs are closed. Thus,

\[ p(5\&1_T) = 6\,C^T \left(1 - C^T\right)^5 \tag{13.3} \]

Figure 13.3 shows a simultaneous plot of Eqs. (13.1), (13.2), and (13.3). Because Eq. (13.3) has a constant factor K = 6 and p(stop) enters the function as its fifth power, $p(5\&1_T)$ initially tracks p(stop) toward the enumerable limit of L = 25 as observed, but the function reaches its maximum at T = 37 (L = 111), where the probability of a 5&1 solution is p = 0.40.
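The three functions plotted in Figure 13.3 are simple enough to evaluate directly. The following Python sketch (function and parameter names are illustrative) computes Eqs. (13.1) to (13.3) for whole numbers of triplets and locates the maximum by scanning T = 1…150.

```python
def p_orf(T, stops=3):
    """Eq. (13.1): probability that a string of T triplets has no stop."""
    C = (64 - stops) / 64
    return C ** T


def p_stop(T, stops=3):
    """Eq. (13.2): probability that a string of T triplets has at least one stop."""
    return 1 - p_orf(T, stops)


def p_5_and_1(T, stops=3):
    """Eq. (13.3): one frame open and the other five closed, frames treated as independent."""
    return 6 * p_orf(T, stops) * p_stop(T, stops) ** 5


best_T = max(range(1, 151), key=p_5_and_1)
print(best_T, 3 * best_T, round(p_5_and_1(best_T), 3))  # 37 111 0.402
```

The scan is restricted to integer T, which is why it returns T = 37 rather than the continuous optimum near T = 37.3.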

Figure 13.3 Plot of the three components of the triplet approximation of the 5&1 solution in Eq. (13.3) [p(5&1)], for T = 1…150 (L = 3…450). Note that p(ORF) as a simple exponential ($C^T$) starts high but declines log-linearly toward zero as T increases, and p(stop) = ($1 - C^T$) starts low but converges on 1.

Thus, and counterintuitively, the scarcity of 5&1 solutions for smaller values of T (T < 37, L < 111) is determined by the low probability of exactly five simultaneously stopped frames, $(1 - C^T)^5$, rather than the relative scarcity of ORFs ($C^T/4^L$). For larger T >> 37, any given ORF is almost certainly accompanied by five frames with multiple stops.

We evaluated Eq. (13.3) as an estimator of p(5&1) by sampling, for each L = 3 ~ 450 divisible by 3, a set of $10^6$ random dsDNA sequences, and ascertaining the fraction that satisfied the 5&1 condition under the universal genetic code. Figure 13.4 shows that Eq. (13.3) very slightly overestimates the proportion of 5&1 solutions in the Monte Carlo simulation for L < 37. This is to be expected, given that the triplet approximation ignores constraints such as triplet assignments and overlapping RFs; otherwise the equation provides a close upper bound.
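A small-scale version of this sampling experiment might look like the sketch below. It is an assumption-laden illustration rather than the code used for Figure 13.4: bases are drawn uniformly and independently, frames consist of complete triplets only, the sample size is far smaller than $10^6$, and the function names and the seed are arbitrary.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def open_frame_count(seq):
    """Count the reading frames (three per strand) that contain no stop triplet."""
    count = 0
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            starts = range(offset, len(strand) - 2, 3)
            if not any(strand[i:i + 3] in STOPS for i in starts):
                count += 1
    return count


def sampled_p_5_and_1(length, samples, rng):
    """Fraction of random sequences of the given length with exactly one open frame."""
    hits = 0
    for _ in range(samples):
        seq = "".join(rng.choices("ACGT", k=length))
        if open_frame_count(seq) == 1:
            hits += 1
    return hits / samples


def approx_p_5_and_1(length, stops=3):
    """Triplet approximation of Eq. (13.3) with T = length // 3."""
    c_t = ((64 - stops) / 64) ** (length // 3)
    return 6 * c_t * (1 - c_t) ** 5


rng = random.Random(13)  # fixed seed so that repeated runs give the same estimates
for length in (30, 60, 111, 210):
    print(length, round(approx_p_5_and_1(length), 3), sampled_p_5_and_1(length, 2000, rng))
```

With only 2000 samples per length the estimates carry a standard error of roughly 0.01, so they indicate the general agreement between the two curves rather than the small differences visible in Figure 13.4.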

Figure 13.4 Probability of a 5&1 solution in triplet strings of length T = 6…100, as estimated by the approximation in Eq. (13.3) (red) and by random sampling of $10^6$ dsDNA sequences of length L = 3T (black). Crosses indicate exact enumerations for T = 6, 7, and 8, corresponding to L = 18, 21, and 24 in Figure 13.2.

5 Alternative genetic codes

Besides the universal code with three stop triplets (Figure 13.1), there are several variant codes with one, two, or four stops (Itzkovitz and Alon, 2007). Figure 13.5 shows simultaneous plots of Eq. (13.3) with S = 1, 2, or 4, such that C = 63/64, 62/64, and 60/64, respectively. All variants have the same pmax = 0.40 as the three-stop code, at L = 342, 168, and 84, respectively. This maximum occurs where the first derivative of Eq. (13.3) is zero (horizontal slope), that is, when $C^T$ = 1/6. Substituting this back into Eq. (13.3) gives p = $(5/6)^5$ ≈ 0.40. Because C is a constant for any one model of the code and co-occurs with T only in the form $C^T$, the maximum of $p(5\&1_T)$ is necessarily identical for different values of C. The equation for pmax can then be rearranged and solved to predict T, as

\[ T = \log(1/6)/\log C = -0.7782/\log C \tag{13.4} \]
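The condition $C^T$ = 1/6 can be checked with a short derivation. The sketch below substitutes $x = C^T$ (a symbol introduced here only for illustration); because x decreases monotonically with T, the maximum with respect to x is also the maximum with respect to T.

```latex
\begin{align*}
p(x) &= 6x(1-x)^5, \qquad x = C^T \\
\frac{dp}{dx} &= 6(1-x)^5 - 30x(1-x)^4 = 6(1-x)^4(1-6x) \\
\frac{dp}{dx} = 0 &\;\Rightarrow\; x = C^T = \tfrac{1}{6} \qquad (0 < x < 1) \\
p_{\max} &= 6 \cdot \tfrac{1}{6} \left(\tfrac{5}{6}\right)^5 = \left(\tfrac{5}{6}\right)^5 \approx 0.402 \\
T &= \frac{\log(1/6)}{\log C} \approx 37.3 \quad \text{for } C = 61/64
\end{align*}
```

The value of pmax contains no C, which is why it is the same for every number of stops; only the T at which it is reached changes.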

Figure 13.5 (a) Probability function $p(5\&1_T)$ for alternative genetic codes with different numbers of stop codons. The universal code has three stops (black), for which a sequence of T = 37 triplets (L = 111) has the highest probability (pmax = 0.4) of providing a 5&1 solution. (b) Upper and lower bounds for $p(5\&1_T)$ = 0.1, and pmax = 0.4, for alternative genetic codes with S = 4, 3, 2, and 1 stop triplets, respectively.

The upper bound on a 10% cutoff for p(5&1) increases rapidly as the number of stops decreases: for example, the upper 10% probability bound lies at a length more than three times greater with a one-stop code (T = 254, L = 762) than with a three-stop code (T = 83, L = 249).
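The optima and the 10% bounds for the alternative codes can be recovered by scanning Eq. (13.3) over integer T for each number of stops. The following Python sketch does so; the function names, the scan limit of 2000 triplets, and the convention that a bound is the smallest or largest whole T with p ≥ 0.1 are assumptions made for illustration.

```python
def p_5_and_1(T, stops):
    """Eq. (13.3) for a code with the given number of stop triplets."""
    c_t = ((64 - stops) / 64) ** T
    return 6 * c_t * (1 - c_t) ** 5


def summarize(stops, cutoff=0.1, max_T=2000):
    """Return (T at the maximum, smallest T, largest T) with p >= cutoff."""
    probs = {T: p_5_and_1(T, stops) for T in range(1, max_T + 1)}
    best_T = max(probs, key=probs.get)
    above = [T for T, p in probs.items() if p >= cutoff]
    return best_T, min(above), max(above)


for stops in (4, 3, 2, 1):
    best_T, low_T, high_T = summarize(stops)
    print(f"S={stops}: optimum L={3 * best_T}, 10% range L={3 * low_T}..{3 * high_T}")
```

For S = 3 the scan gives an optimum at T = 37 and an upper 10% bound at T = 83; for S = 1 the corresponding values are T = 114 and T = 254.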

6 Implications for the evolution of ORF size

Broad conservation of the universal code in nuclear genomes indicates that a three-stop code optimizes some selective advantage (Itzkovitz and Alon, 2007), whereas retention of an unstopped TGA in the common ancestor of all Metazoan mitochondrial DNA (mtDNA) codes suggests that there is some advantage for a two-stop code, and the relatively recent evolution of the four-stop code in Chordata suggests that it offers some advantage over a three-stop intermediate (Cannarozzi and Schneider, 2012). We have shown here that short random DNA sequences have a high probability of including a single ORF over certain size ranges, and that the sequence lengths over which this probability is high are inversely related to the number of stops in the genetic code used. Might size variation of ORF coding sequences across genetic codes be subject to natural selection?

A recent model of stop codon evolution (Johnson et al., 2011) proposes that multi-stop codes provide a backstop against readthrough, balanced against an increased probability of random stop mutations. Like ours, the model predicts an inverse relationship between the number of stop triplets and the length of coding sequences. Consistent with this, their sampling of NCBI data shows a marked (though nonsignificant) relationship between longer coding sequence and fewer stops, for pairs of genomes in the same taxon alternatively decoded with one- versus two-, one- versus three-, or two- versus three-stop codes. There is no such trend for the two- versus four-stop Chordata comparison. Johnson et al. (2011) note a previous suggestion that reassignment of TGA from sense to stop has occurred frequently in association with the evolutionary reduction of genome size in mtDNA genomes, in apparent contradiction to the predicted direction. However, a phylogenetic perspective on the various mtDNA codes shows that this reassignment has occurred only once, in the shared ancestral code of all Animalia and Yeast (Opisthokonta); this will be considered elsewhere.

In their data, coding sequences for genomes with three-stop codes are in the range of 250 ~ 400 bp, with animal mtDNA at about 300 bp: these are rather longer than our optimal of 111 or 84 bp for S = 3 or 4, respectively, but they are well within the range of reasonable probability (Figure 13.5). A longer coding sequence might also be assembled from several shorter single-ORF fragments, so long as the individual ORFs were assembled in the same RF. Recall that fragments shorter than the optimum are more likely to have multiple ORFs. Selection could then act to modify the function of the corresponding polypeptide product while maintaining a single ORF. DNA sequences 5’ or 3’ to the ORF region can be added easily, since there is a high probability that any 3-bp sequence added in the open frame also will be open [~$(61/64)^3$ = 0.87], while the other five frames are already stopped. The fewer the stops, the longer the likely candidate sequences are. For example, in a four-stop code, there is less than 1% chance that an approximately 300-bp DNA will contain one and only one ORF, whereas for a one-stop code, there is a far greater than 10% chance that a sequence of many hundreds of base pairs will do so.
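The two closing figures can be checked against Eq. (13.3) with a short worked calculation, taking a 300-bp fragment as T = 100 triplets for both codes (the choice of the same length for the one-stop case is made here only for comparison):

```latex
\begin{align*}
S = 4,\ T = 100: \quad & C^T = (60/64)^{100} \approx 0.0016, \qquad
  p = 6\,C^T \left(1 - C^T\right)^5 \approx 0.0094 \\
S = 1,\ T = 100: \quad & C^T = (63/64)^{100} \approx 0.207, \qquad
  p = 6\,C^T \left(1 - C^T\right)^5 \approx 0.39
\end{align*}
```

Under the triplet approximation, the four-stop value falls just below 1%, while the one-stop value at the same length is close to pmax.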

Are short, random DNA sequences with single ORFs of utility in evolution? It has recently been demonstrated that some free-living bacteria can take up ex vivo, fragmented DNA from the environment and incorporate it into their genomes by replication-dependent transformation (Overballe-Petersen et al., 2013). Fragments of 20 ~ 100 bp were transformed at higher rates than larger fragments. We have shown that random fragments of just this size are most likely to include a single ORF, which might mediate the success of any such horizontal transfer and its incorporation into the host genome as a functional coding sequence. Overballe-Petersen et al. (2013) hypothesize that “rates of molecular evolution in naturally transformable species may be influenced by the diversity of free environmental DNA.” Our results suggest that one type of evolutionary diversity in random DNA may be the varying high probability that small fragments of various lengths will include unique ORFs subject to modification by natural selection.

References

Cannarozzi G, Schneider A. Codon Evolution: Mechanisms and Models. Oxford: Oxford University Press; 2012.

Carr SM, Wareham HT, Craig D. An algorithmic and computational approach to open reading frames in short dsDNA sequences: evaluation of “Carr’s Conjecture”. In: Arabnia HR, Tran Q.-N., Yang MQ, eds. Proceedings of the International Conference on Bioinformatics and Computational Biology; 2014a:37–43.

Carr SM, Craig D, Wareham HT. A web application for generation of DNA sequence exemplars with open and closed reading frames in genetics and bioinformatics education. CBE Life Sci. Educ. 2014b;13:373–374.

Crick FHC. The genetic code, yesterday, today, and tomorrow. Cold Spring Harbor Symp. Quant. Biol. 1966;31:3–9.

Itzkovitz S, Alon U. The genetic code is nearly optimal for allowing additional information within protein-coding sequences. Genome Res. 2007;17:405–412.

Johnson L, Cotton J, Lichtenstein C, Elgar G, Nichols R, Polly D, Le Comber S. Stops making sense: translational trade-offs and stop codon reassignment. BMC Evol. Biol. 2011;11:227.

Judson H. The Eighth Day of Creation. second ed. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press; 1996.

Nirenberg M, Leder P, Bernfield M, Brimacombe R, Trupin J, Rottman F, O’Neal C. RNA codewords and protein synthesis VII. On the general nature of the RNA code. Proc. Natl. Acad. Sci. U. S. A. 1965;53:1161–1168.

Nirenberg M, Caskey T, Marshall R, Brimacombe R, Kellogg D, Doctor B, Hatfield D, Levin J, Rottman F, Pestka S, Wilcox M, Anderson F. The RNA code and protein synthesis. Cold Spring Harbor Symp. Quant. Biol. 1966;31:11–24.

Overballe-Petersen S, Harms K, Orlando L, Mayar J, Rasmussen S, Dahl T, Rosing M, Poole A, Sicheritz-Ponten T, Brunak S, Inselmann S, de Vries J, Wackernagel W, Pybus OG, Nielsen R, Johnsen P, Nielsen K, Willerslev E. Bacterial natural transformation by highly fragmented and damaged DNA. Proc. Natl. Acad. Sci. U. S. A. 2013;110:19860–19865.
