The biologist knows that, given a sequence of DNA, it is necessary to examine all six reading frames of the DNA to find the coding regions the cell uses to make proteins.
Very often you won't know where, in the DNA you're studying, the cell actually begins translating the DNA into protein. Only about 1-1.5% of human DNA is in genes, which are the parts of DNA used for the translation into proteins. Furthermore, genes very often occur in pieces that are spliced together during the transcription/translation process.
If you don't know where the translation starts, you have to consider the six possible reading frames. Since the codons are three bases long, the translation happens in three "frames," for instance starting at the first base, or the second, or perhaps the third. (The fourth would be the same as starting from the first.) Each starting place gives a different series of codons, and, as a result, a different series of amino acids.
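The three frames can be made concrete with a minimal sketch. The helper name `frame_codons` and the short test sequence are assumptions for illustration only; they are not part of the subroutines developed in this book.

```perl
use strict;
use warnings;

# frame_codons (hypothetical helper): split DNA into complete codons,
# starting at base 1, 2, or 3 (positions numbered from 1).
sub frame_codons {
    my ($dna, $frame) = @_;

    # Drop the bases that come before the chosen frame
    my $shifted = substr($dna, $frame - 1);

    # In list context, the match returns every complete three-base group;
    # one or two leftover bases at the end are ignored
    return $shifted =~ /(...)/g;
}

my $dna = 'ATGCCGTAAGC';    # made-up sequence, for illustration only

for my $frame (1 .. 3) {
    my @codons = frame_codons($dna, $frame);
    print "Frame $frame: @codons\n";
}
```

Each frame yields a different list of codons from the same bases, which is why each one has to be translated separately.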
Also, transcription and translation can happen on either strand of the DNA; that is, either the DNA sequence, or its reverse complement, might contain DNA code that is actually translated. The reverse complement can also be read in any one of three frames. So a total of six reading frames have to be considered when looking for coding regions, the parts of the DNA that encode proteins.
It is therefore quite common to examine all six reading frames of a DNA sequence and to look at the resulting protein translations for long stretches of amino acids that lack stop codons.
The stop codons are definite breaks in the DNA→protein translation process. During translation (actually of RNA to protein, but I'm being deliberately informal and vague about the biochemistry), if a stop codon is reached, the translation stops, and the growing peptide chain grows no more.
Long stretches of DNA that don't contain any stop codons are called open reading frames (ORFs) and are important clues to the presence of a gene in the DNA under study. So gene finder programs need to perform the type of reading frame analysis we'll do in this chapter.
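As a rough sketch of this idea, the code below finds the longest stop-free stretch in an already translated peptide. It assumes that stop codons are written as an underscore in the peptide string; the name `longest_open_stretch` is hypothetical. A real gene finder does considerably more (for example, looking for start codons), so treat this only as an illustration of scanning for stretches without stops.

```perl
use strict;
use warnings;

# longest_open_stretch (hypothetical helper): return the longest run of
# amino acids that contains no stop codon.
# Assumption: stop codons appear as '_' in the peptide string.
sub longest_open_stretch {
    my ($peptide) = @_;
    my $longest = '';

    # Splitting on '_' yields the pieces between consecutive stop codons
    for my $stretch (split /_/, $peptide) {
        if (length($stretch) > length($longest)) {
            $longest = $stretch;
        }
    }
    return $longest;
}

my $peptide = 'RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA';
my $open    = longest_open_stretch($peptide);
print "Longest stop-free stretch (", length($open), " residues): $open\n";
```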
Based on the facts just presented, let's write some code that translates the DNA in all six reading frames.
In the real world, you'd look around for some subroutines that are already written to do that task. Given the basic nature of the task—something anyone who studies DNA has to do—you'd likely find something. But this is a tutorial, not the real world, so let's soldier on.
This problem doesn't sound too daunting. So, take stock of the subroutines at your disposal, think of where you are and how you can get to your destination.
Looking through the subroutines we've already written, recall dna2peptide. You may recall considering adding some arguments to specify starting and end points. Let's do this now.
Remember that although we calculated reverse complements back in Chapter 4, we never made a subroutine out of it. So let's start there:
    # revcom
    #
    # A subroutine to compute the reverse complement of DNA sequence

    sub revcom {
        my($dna) = @_;

        # First reverse the sequence
        my $revcom = reverse $dna;

        # Next, complement the sequence, dealing with upper and lower case
        # A->T, T->A, C->G, G->C
        $revcom =~ tr/ACGTacgt/TGCAtgca/;

        return $revcom;
    }
Now, a little pseudocode to sketch an idea for the subroutine that will translate specific ranges of DNA:
    Given DNA sequence

    subroutine translate_frame ( DNA, start, end) {
        return dna2peptide( substr( DNA, start, end - start + 1 ) )
    }
That went well! Luckily, the substr built-in Perl function made it easy to apply the desired start and end points, while passing the DNA into the already written dna2peptide subroutine.
Note that the length of the sequence is end - start + 1. To give a small example: if you start at position 3 and end at position 5, you've got the bases at positions 3, 4, and 5, three bases in all, which is exactly what 5 - 3 + 1 equals.
Dealing with indices like this has to be done carefully, or the code won't work. For many programs, this is the worst the mathematics gets.
You have to decide whether to number positions from 0, which is Perl's way to do it, or to put the first character of the sequence in position 1, which is the biologist's way to do it. Let's do it the biologist's way. The start position will be decreased by one when passed to the Perl function substr, which, of course, does it Perl's way.
The corrected pseudocode looks like this:
    Given DNA sequence

    subroutine translate_frame ( DNA, start, end) {
        # start and end are numbering the sequence from 1 to length
        return dna2peptide( substr( DNA, start - 1, end - start + 1 ) )
    }
The length of the desired sequence doesn't change with the change in indices, since:
(end - 1) - (start - 1) + 1 = end - start + 1
as we know from algebra. So let's write this subroutine:
    # translate_frame
    #
    # A subroutine to translate a frame of DNA

    sub translate_frame {
        my($seq, $start, $end) = @_;

        my $protein;

        # To make the subroutine easier to use, you won't need to specify
        # the end point--it will just go to the end of the sequence
        # by default.
        unless($end) {
            $end = length($seq);
        }

        # Finally, calculate and return the translation
        return dna2peptide ( substr ( $seq, $start - 1, $end - $start + 1) );
    }
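The index arithmetic inside translate_frame can be checked on its own, without any translation. The sketch below is only an illustration: the helper name `subseq_1based` and the test sequence are assumptions, not part of the book's module.

```perl
use strict;
use warnings;

# subseq_1based (hypothetical helper): extract the bases from $start to $end,
# inclusive, where positions are numbered from 1 (the biologist's way).
sub subseq_1based {
    my ($seq, $start, $end) = @_;

    # substr counts from 0, so the offset is $start - 1;
    # the number of bases is $end - $start + 1
    return substr($seq, $start - 1, $end - $start + 1);
}

my $dna = 'ACGTTGCA';    # made-up sequence, for illustration only

# Positions 3, 4, and 5 hold G, T, and T
print subseq_1based($dna, 3, 5), "\n";
```

Starting at position 1 and ending at the length of the sequence returns the whole sequence, which is a quick sanity check that neither end is off by one.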
Example 8-4 translates the DNA in all six reading frames.
Example 8-4. Translate a DNA sequence in all six reading frames
    #!/usr/bin/perl
    # Translate a DNA sequence in all six reading frames

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Initialize variables
    my @file_data = (  );
    my $dna = '';
    my $revcom = '';
    my $protein = '';

    # Read in the contents of the file "sample.dna"
    @file_data = get_file_data("sample.dna");

    # Extract the sequence data from the contents of the file "sample.dna"
    $dna = extract_sequence_from_fasta_data(@file_data);

    # Translate the DNA to protein in six reading frames
    #   and print the protein in lines 70 characters long

    print "\n -------Reading Frame 1--------\n\n";
    $protein = translate_frame($dna, 1);
    print_sequence($protein, 70);

    print "\n -------Reading Frame 2--------\n\n";
    $protein = translate_frame($dna, 2);
    print_sequence($protein, 70);

    print "\n -------Reading Frame 3--------\n\n";
    $protein = translate_frame($dna, 3);
    print_sequence($protein, 70);

    # Calculate reverse complement
    $revcom = revcom($dna);

    print "\n -------Reading Frame 4--------\n\n";
    $protein = translate_frame($revcom, 1);
    print_sequence($protein, 70);

    print "\n -------Reading Frame 5--------\n\n";
    $protein = translate_frame($revcom, 2);
    print_sequence($protein, 70);

    print "\n -------Reading Frame 6--------\n\n";
    $protein = translate_frame($revcom, 3);
    print_sequence($protein, 70);

    exit;
Here's the output of Example 8-4:
     -------Reading Frame 1--------

    RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAEWSVQVRGSLAGVVRE
    CAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCDNCNEWFHGDCIRITEKMAKA
    IREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLAR
    GSASPHKSSPQPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCR
    LRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATA
    TPEPLSDEDL

     -------Reading Frame 2--------

    DGGAEGSWGL_AGHLLVCSGDDAWGLRNRSTLPGRRD_KRK_LWAPLQPPGTPPSGLCRFAGRWRGS_GS
    APGAEIWREMVQTQSLQMPGRTASPRMGRMRPSTASAANRTSTAS_SGVTTAMSGSMGTASGSLRRWPRP
    SGSGTVGSAERKTPS_RFAIGTRSHGSGMAMSGTAVSPGMRVEGARGLSLIQTCSAGQGQGQGLGPCLLG
    ALLRPTNPLRSPWWPHPASITSSSSSRSNGQPACVVSVRHVGALRTVVTVISVGT_RSSGAPTRSGRSAG
    CASASCGPGNRTSTSLPRSHQ_RPQSPCQGPAGHCPPNSSHSHHRS_GASVKMRGQWRHQQSRSLLRLQP
    HLSHSQMRT

     -------Reading Frame 3--------

    MAALRGLGGSRPATYWFAAETTHGACAIGVRCLGGVTRSGSSCGRLCNRLGRRRVVCAGSRVAGGGREGV
    RRERRYGGRWFRPRASRCRGGQQVREWGECAHLLHLPQTGHQLLHDRV_QLQ_VVPWGLHPDH_EDGQGH
    PGVVLSGVQRERPQARDSLSAQEVTGAGWQ_AGQQ_APG_GWRAQEACP_SRPAAPGRVRDRGWGHACSG
    LCFAPQILSAALGGHTQPASPAAAAADQTVSPHVW_V_GMSAH_GLWSL_FLSGHEEVRGPQQDPAEVPA
    APVPAAGPGIVQVLPFLALTSDALRVPAKAPPATAHPTAATAITEVRAHP_R_GGSGVINSQGAS_GYSH
    T_ATLR_GP

     -------Reading Frame 4--------

    _VLI_EWLRCGCSLRRLLDC__RHCPLIFTDAP_LL_WLWLLLGGQWPAGPWQGL_GRHW_ERGREVLVR
    FPGPQLALAQPALLPDLVGAPELLHVPTEITVTTVLSAPTCLTLTTHAG_PFDLLLLLLVMLAGCGHQGL
    RRGFVGRSRAPSKHGPNPCP_PCPALQVWIRDRPLAPSTLIPGLTAVPLIAIPLP_LLVPIANL_LGVFL
    SALPTVPLPDGLGHLLSDPDAVPMEPLIAVVTPDHEAVDVRFAADAVDGRILPILGLAVLPGIWRLWV_T
    ISLHISAPGALPHDPRQRPANLHRPLGGVPGGCKGAHNYFRF_SRLPGSVLLLRRPHASSPLQTSRWPA_
    SPQDPSAPPS

     -------Reading Frame 5--------

    RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDSEGVTGESEEGKYLYD
    SRARSWHWRSRHFCRILLGPPNFFMSRQKSQ_PQSSVRRHASHSPHMRADRLICCCCCW_CWLGVATKGC
    GEDLWGEAEPRASMAPTPVPDPARRCRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLCR_RISSLGSFS
    LHSRQYHSRMALAIFSVIRMQSPWNHSLQLSHPIMKQLMSGLRQMQ_MGAFSPFSDLLSSPASGGSGSEP
    SPSISPLPAHSLTTPASDPRTCTDHSAASQAVAKAPTTTSASSHASQAAYSYCAGPMRRLRCKPVGGRPR
    APKTPQRRH

     -------Reading Frame 6--------

    GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTLRASLVRARKGSTCTI
    PGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADMPHTHHTCGLTV_SAAAAAGDAGWVWPPRAA
    ERICGAKQSPEQAWPQPLSLTLPGAAGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCADSESLAWGLSL
    CTPDSTTPGWPWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCPPRHLEALGLNH
    LPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRTPIAQAPCVVSAANQ_VAGLE
    PPRPLSAAI
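For readers who want to try the six-frame idea without the BeginPerlBioinfo module or a sequence file, here is a self-contained sketch. Everything in it is an assumption made for illustration: `toy_dna2peptide` is a stand-in for the book's dna2peptide (built from the standard genetic code, with `_` for stop codons and `X` for any codon it does not recognize), `six_frames` is a hypothetical helper, and the test sequence is made up.

```perl
use strict;
use warnings;

# Build the standard genetic code: bases in T, C, A, G order,
# first base varying slowest; '_' marks the three stop codons.
my %genetic_code;
{
    my @bases = qw(T C A G);
    my @amino_acids = split //,
        'FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
    for my $b1 (@bases) {
        for my $b2 (@bases) {
            for my $b3 (@bases) {
                $genetic_code{"$b1$b2$b3"} = shift @amino_acids;
            }
        }
    }
}

# toy_dna2peptide (stand-in for dna2peptide): translate complete codons,
# writing 'X' for anything that is not a recognized codon.
sub toy_dna2peptide {
    my ($dna) = @_;
    my $peptide = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = uc substr($dna, $i, 3);
        $peptide .= exists $genetic_code{$codon} ? $genetic_code{$codon} : 'X';
    }
    return $peptide;
}

# six_frames (hypothetical helper): return the six translations in order,
# frames 1-3 from the sequence and frames 4-6 from its reverse complement.
sub six_frames {
    my ($dna) = @_;

    my $revcom = reverse $dna;
    $revcom =~ tr/ACGTacgt/TGCAtgca/;

    my @translations;
    for my $strand ($dna, $revcom) {
        for my $frame (1 .. 3) {
            push @translations, toy_dna2peptide(substr($strand, $frame - 1));
        }
    }
    return @translations;
}

my @frames = six_frames('ATGCCGTAAGC');    # made-up sequence
for my $n (1 .. 6) {
    print "Frame $n: $frames[$n - 1]\n";
}
```

Looping over the two strands and the three offsets keeps the six cases in one place, at the cost of being a little less explicit than the step-by-step listing in Example 8-4.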