Example 8-1 shows how the new codon2aa subroutine can be used to translate a whole DNA sequence into protein.
Example 8-1. Translate DNA into protein
#!/usr/bin/perl
# Translate DNA into protein

use strict;
use warnings;
use BeginPerlBioinfo;     # see Chapter 6 about this module

# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;

# Translate each three-base codon into an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $codon = substr($dna,$i,3);
    $protein .= codon2aa($codon);
}

print "I translated the DNA $dna into the protein $protein\n";

exit;
To make this work, you'll need the BeginPerlBioinfo.pm module, which holds your subroutines, in a separate file the program can find, as discussed in Chapter 6. You also have to add the codon2aa subroutine to it. Alternatively, you can add the code for the subroutine codon2aa directly to the program in Example 8-1 and remove the reference to the BeginPerlBioinfo.pm module.
Here's the output from Example 8-1:
I translated the DNA CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC into the protein RRLRTGLARVGR
You've seen all the elements in Example 8-1 before, except for the way it loops through the DNA with this statement:
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
Recall that a for loop has three parts, delimited by the two semicolons. The first part initializes a counter: my $i=0 statically scopes the $i variable so it's visible only inside this block, and any other $i elsewhere in the code (well, in this case, there aren't any, but it can happen) is now invisible inside the block. The third part of the for loop increments the counter after all the statements in the block are executed and before returning to the beginning of the loop:
$i += 3
Since you're trying to march through the DNA three bases at a shot, you increment by three.
The second, middle part of the for loop tests whether the loop should continue:
$i < (length($dna) - 2)
The point is that if there are zero, one, or two bases left, you should quit, because there aren't enough to make a codon. Now, the positions in a string of DNA of a certain length are numbered from 0 to length-1. So if the position counter $i has reached length-2, there are only two more bases (at positions length-2 and length-1), and you should quit. Only if the position counter $i is less than length-2 will you still have at least three bases left, enough for a codon. So the test succeeds only if:
$i < (length($dna) - 2)
(Notice also how the whole expression to the right of the less-than sign is enclosed in parentheses; we'll discuss this in Chapter 9 in Section 9.3.1.)
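To see the loop test at work, here is a small sketch that collects the positions at which the loop would extract a codon, for strings of 6, 7, and 8 bases. The subroutine name codon_starts is made up for this illustration; it isn't part of BeginPerlBioinfo.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# codon_starts is a hypothetical helper, used only for this illustration.
# It returns the positions at which the loop from Example 8-1 extracts a codon.
sub codon_starts {
    my($dna) = @_;
    my @starts = ();

    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        push(@starts, $i);
    }

    return @starts;
}

# Try strings with zero, one, and two bases left over after the last codon
foreach my $dna ('ACGTAC', 'ACGTACG', 'ACGTACGT') {
    my @starts = codon_starts($dna);
    print length($dna), " bases: codons start at positions @starts\n";
}
```

In all three cases the loop visits positions 0 and 3 only. The one or two extra bases at the end of the longer strings are never passed along as a partial codon.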
The line of code:
$codon = substr($dna,$i,3);
actually extracts the 3-base codon from the DNA. The call to the substr function returns the substring of $dna that starts at position $i and has length 3, and the assignment saves it in the variable $codon.
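As a quick check of how substr behaves, the following sketch pulls a few codons out of the DNA from Example 8-1. It also shows what happens near the end of a string that isn't a multiple of three bases long; the short string 'ACGTA' is an invented example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';

# Positions are numbered from 0, so the first codon starts at position 0
print substr($dna, 0, 3), "\n";     # CGA
print substr($dna, 3, 3), "\n";     # CGT

# The DNA has 36 bases, so the last codon starts at position 33
print substr($dna, 33, 3), "\n";    # CGC

# If fewer than three bases remain, substr returns what is left.
# 'ACGTA' is a made-up string with two bases left over after the first codon.
my $short = 'ACGTA';
print substr($short, 3, 3), "\n";   # TA
```

The last call is the situation the loop test guards against: without the test, a fragment such as TA would be handed on as if it were a codon.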
If you know you'll need to do this DNA-to-protein translation a lot, you can turn
Example 8-1 into a subroutine.
Whenever you write a subroutine, you have to think about which arguments you may want to give it. For instance, there may come a time when you'll have some large DNA sequence but only want to translate a given part of it. Should you add two arguments to the subroutine as beginning and end points? You could, but here you decide not to. It's a judgment call, part of the art of decomposing a collection of code into useful fragments. It might be better to have a subroutine that just translates; then you can make it part of a larger subroutine that picks endpoints in the sequence, if needed. The thinking is that you'll usually just translate the whole thing, and always typing in 0 for the start and length($dna)-1 for the end would be an annoyance. Of course, this depends on what you're doing, so this particular choice just illustrates your thinking when you write the code.
You should also remove the informative print statement at the end, because it's more suited to a main program than a subroutine.
Anyway, you've now thought through the design and just want a subroutine that takes one argument containing DNA and returns a peptide translation:
# dna2peptide
#
# A subroutine to translate DNA sequence into a peptide

sub dna2peptide {

    my($dna) = @_;

    use strict;
    use warnings;
    use BeginPerlBioinfo;     # see Chapter 6 about this module

    # Initialize variables
    my $protein = '';

    # Translate each three-base codon to an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa( substr($dna,$i,3) );
    }

    return $protein;
}
Now add subroutine dna2peptide to the BeginPerlBioinfo.pm module.
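The design decision discussed earlier, keeping dna2peptide simple and letting a larger subroutine pick the endpoints, might look something like the following sketch. The name translate_region and its inclusive, 0-based $start and $end arguments are assumptions made for this illustration. So that the sketch runs on its own, it also includes a cut-down stand-in for codon2aa that knows only the codons in the example DNA; the real subroutine handles the whole genetic code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for codon2aa: an assumption for this sketch only.
# It covers just the codons that occur in the example DNA below.
sub codon2aa {
    my($codon) = @_;

    my %genetic_code = (
        'CGA' => 'R', 'CGT' => 'R', 'CGC' => 'R',   # Arginine
        'CTT' => 'L', 'CTA' => 'L',                 # Leucine
        'ACG' => 'T',                               # Threonine
        'GGA' => 'G', 'GGT' => 'G',                 # Glycine
        'GCT' => 'A',                               # Alanine
        'GTC' => 'V',                               # Valine
    );

    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }
    die "Codon $codon is not in this cut-down table\n";
}

# dna2peptide, as in the text, without the use of BeginPerlBioinfo
sub dna2peptide {
    my($dna) = @_;
    my $protein = '';

    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa( substr($dna,$i,3) );
    }

    return $protein;
}

# translate_region: hypothetical wrapper that picks the endpoints.
# $start and $end are positions numbered from 0, and both are included.
sub translate_region {
    my($dna, $start, $end) = @_;

    return dna2peptide( substr($dna, $start, $end - $start + 1) );
}

my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';

# Translate positions 3 through 11, that is, the codons CGT CTT CGT
print translate_region($dna, 3, 11), "\n";
```

The wrapper does nothing but choose the substring, so dna2peptide stays a one-argument subroutine for the common case of translating everything.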
Notice that you've eliminated one of the variables in making the subroutine out of Example 8-1: the variable $codon. Why?
Well, one reason is that you can. In Example 8-1, you were using substr to extract the codon from $dna, saving it in the variable $codon, and then passing it into the subroutine codon2aa. This new way eliminates the middleman. The call to substr that extracts the codon is now the argument to the subroutine codon2aa, so the value is passed in just as before, but without having to copy it to the variable $codon first.
This has somewhat improved efficiency and speed. Since copying strings is one of the slower things computer programs do, eliminating a bunch of string copies is an easy and effective way to speed up a program.
But has it made the program less readable? You be the judge. I think it has, a little, but the comment right before the loop seems to make everything clear enough, for me, anyway. It's important to have readable code, so if you really need to boost the speed of a subroutine, but find it makes the code harder to read, be sure to include enough comments for the reader to be able to understand what's going on.
For the first time, use statements are being included in a subroutine instead of the main program:
use strict;
use warnings;
use BeginPerlBioinfo;
This may be redundant with the calls in the main program, but it doesn't do any harm (Perl checks and loads a module only once). If this subroutine should be called from a module that doesn't already load the modules, it's done some good after all.
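If you'd like to convince yourself that a repeated use is harmless, here is a minimal sketch. It uses the standard List::Util module in place of BeginPerlBioinfo so that it runs anywhere, and the subroutine name total is invented for the illustration. Perl records each module file it has loaded in the special hash %INC, and a later use of the same module finds that record instead of loading the file again.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util;          # loaded here by the main program

# total is a hypothetical subroutine, used only for this sketch
sub total {
    use List::Util;      # redundant with the main program, but harmless
    return List::Util::sum(@_);
}

print total(1, 2, 3), "\n";

# %INC has a single record for the module file, wherever it was used
if(exists $INC{'List/Util.pm'}) {
    print "List::Util was loaded from $INC{'List/Util.pm'}\n";
}
```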