Example 7-1 gave you the
tools you'll need to mutate DNA. In the following examples, you'll represent DNA, as
usual, by a string made out of the alphabet A, C, G, and T. You'll randomly select
positions in the string and then use the substr
function to alter the DNA.
This time, let's go about things a little differently and first compose some of the useful subroutines you'll need before showing the whole program.
Starting with simple pseudocode, here's a design for a subroutine that mutates a random position in DNA to a random nucleotide:
Select a random position in the string of DNA.
Choose a random nucleotide.
Substitute the random nucleotide into the random position in the DNA.
This seems short and to the point. So you decide to make each of the first two sentences into a subroutine.
How can you randomly
select a position in a string? Recall that the built-in function length
returns the length of a string. Also recall that positions in
strings are numbered from 0
to length-1
, just like positions in arrays. So
you can use the same general idea as in Example 7-1, and make a
subroutine:
# randomposition # # A subroutine to randomly select a position in a string. # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomposition { my($string) = @_; # This expression returns a random number between 0 and length-1, # which is how the positions in a string are numbered in Perl. return int(rand(length($string))); }
randomposition
is a very short
function, if you don't count the comments. It's a subroutine to select a
string position, very similar to the idea in Example 7-1 to select a random
array element.
While writing subroutines, you frequently want to write a little test to see if your subroutine works as you intend:
#!/usr/bin/perl -w # Test the randomposition subroutine my $dna = 'AACCGTTAATGGGCATCGATGCTATGCGAGCT'; srand(time|$$); for (my $i=0 ; $i < 20 ; ++$i ) { print randomposition($dna), " "; } print " "; exit; sub randomposition { my($string) = @_; return int rand length $string; }
Here's some representative output of the test (your results should vary):
28 26 20 1 29 7 1 27 2 24 8 1 23 7 13 14 2 12 13 27
Notice the new look of the for
loop:
for (my $i=0 ; $i < 20 ; ++$i ) {
This shows how you can localize the counter variables (in this case,
$i
) to the loop by declaring them
with my
inside the for
loop.
Next, let's write a subroutine that randomly chooses one of the four nucleotides:
# randomnucleotide # # A subroutine to randomly select a nucleotide # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomnucleotide { my(@nucs) = @_; # scalar returns the size of an array. # The elements of the array are numbered 0 to size-1 return $nucs[rand @nucs]; }
Again, this subroutine is short and sweet. (Most useful subroutines are; although writing a short subroutine is no guarantee it will be useful. In fact, you'll see in a bit how you can improve this one.)
Let's test this one too:
#!/usr/bin/perl -w # Test the randomnucleotide subroutine my @nucleotides = ('A', 'C', 'G', 'T'), srand(time|$$); for (my $i=0 ; $i < 20 ; ++$i ) { print randomnucleotide(@nucleotides), " "; } print " "; exit; sub randomnucleotide { my(@nucs) = @_; return $nucs[rand @nucs]; }
Here's some typical output (it's random, of course, so there's a high probability your output will differ):
C A A A A T T T T T A C A C T A A G G G
Now for the third and final subroutine, that actually does the mutation. Here's the code:
# mutate # # A subroutine to perform a mutation in a string of DNA # sub mutate { my($dna) = @_; my(@nucleotides) = ('A', 'C', 'G', 'T'), # Pick a random position in the DNA my($position) = randomposition($dna); # Pick a random nucleotide my($newbase) = randomnucleotide(@nucleotides); # Insert the random nucleotide into the random position in the DNA. # The substr arguments mean the following: # In the string $dna at position $position change 1 character to # the string in $newbase substr($dna,$position,1,$newbase); return $dna; }
Here, again, is a short subroutine. As you look it over, notice that it's
relatively easy to read and understand. You mutate by picking a random
position, then selecting a nucleotide at random and substituting that
nucleotide at that position in the string. (If you've forgotten how
substr
works, refer to Appendix
B or other Perl documentation. If you're like me, you probably
have to look things up a lot, especially to get the order of the arguments
right.)
There's a slightly different style used here for declaring variables. Whereas you've been declaring them at the beginning of a program, here you're declaring each variable the first time it's used. There are pros and cons for each programming style. Having all the variables at the top of the program gives good organization and can help in reading; declaring them on-the-fly can seem like a more natural way to write. The choice is yours.
Also, notice how this subroutine is mostly built from other subroutines, with a little bit added. That has a lot to do with its readability. At this point, you may be thinking that you've actually decomposed the problem pretty well, and the pieces are fairly easy to build and, in the end, they fit together well. But do they?
You're about to pat yourself on the back for writing the program so quickly, but you notice something. You keep
having to declare that pesky @nucleotides
array and then pass it in to the randomnucleotide
subroutine. But the only place you use the array is inside the
randomnucleotide
subroutine. So why not
change your design a little? Here's a new try:
# randomnucleotide # # A subroutine to randomly select a nucleotide # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomnucleotide { my(@nucs) = ('A', 'C', 'G', 'T'), # scalar returns the size of an array. # The elements of the array are numbered 0 to size-1 return $nucs[rand @nucs]; }
Notice that this function now has no arguments. It's called like so:
$randomnucleotide = randomnucleotide( );
It's asking for a random element from a very specific set. Of course, you're always thinking, and you say, "It'd be handy to have a subroutine that randomly selects an element from any array. I might not need it right now, but I bet I'll need it soon!" So you define two subroutines instead of one:
# randomnucleotide # # A subroutine to randomly select a nucleotide # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomnucleotide { my(@nucleotides) = ('A', 'C', 'G', 'T'), # scalar returns the size of an array. # The elements of the array are numbered 0 to size-1 return randomelement(@nucleotides); } # randomelement # # A subroutine to randomly select an element from an array # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomelement { my(@array) = @_; return $array[rand @array]; }
Look back and notice that you didn't have to change your subroutine mutate
; just the internal workings of randomnucleotide
changed, not its
behavior.
Now you've got all your ducks in place, so you write your main program as in Example 7-2 and see if your new subroutine works.
Example 7-2. Mutate DNA
#!/usr/bin/perl # Mutate DNA # using a random number generator to randomly select bases to mutate use strict; use warnings; # Declare the variables # The DNA is chosen to make it easy to see mutations: my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'; # $i is a common name for a counter variable, short for "integer" my $i; my $mutant; # Seed the random number generator. # time|$$ combines the current time with the current process id srand(time|$$); # Let's test it, shall we? $mutant = mutate($DNA); print " Mutate DNA "; print " Here is the original DNA: "; print "$DNA "; print " Here is the mutant DNA: "; print "$mutant "; # Let's put it in a loop and watch that bad boy accumulate mutations: print " Here are 10 more successive mutations: "; for ($i=0 ; $i < 10 ; ++$i) { $mutant = mutate($mutant); print "$mutant "; } exit; ################################################################################ # Subroutines for Example 7-2 ################################################################################ # Notice, now that we have a fair number of subroutines, we # list them alphabetically # A subroutine to perform a mutation in a string of DNA # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub mutate { my($dna) = @_; my(@nucleotides) = ('A', 'C', 'G', 'T'), # Pick a random position in the DNA my($position) = randomposition($dna); # Pick a random nucleotide my($newbase) = randomnucleotide(@nucleotides); # Insert the random nucleotide into the random position in the DNA # The substr arguments mean the following: # In the string $dna at position $position change 1 character to # the string in $newbase substr($dna,$position,1,$newbase); return $dna; } # A subroutine to randomly select an element from an array # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomelement { my(@array) = @_; return $array[rand @array]; } # randomnucleotide # # A subroutine to select at random one of the four nucleotides # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomnucleotide { my(@nucleotides) = ('A', 'C', 'G', 'T'), # scalar returns the size of an array. # The elements of the array are numbered 0 to size-1 return randomelement(@nucleotides); } # randomposition # # A subroutine to randomly select a position in a string. # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomposition { my($string) = @_; # Notice the "nested" arguments: # # $string is the argument to length # length($string) is the argument to rand # rand(length($string))) is the argument to int # int(rand(length($string))) is the argument to return # But we write it without parentheses, as permitted. # # rand returns a decimal number between 0 and its argument. # int returns the integer portion of a decimal number. # # The whole expression returns a random number between 0 and length-1, # which is how the positions in a string are numbered in Perl. # return int rand length $string; }
Here's some typical output from Example 7-2:
Mutate DNA Here is the original DNA: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Here is the mutant DNA: AAAAAAAAAAAAAAAAAAAAGAAAAAAAAA Here are 10 more successive mutations: AAAAAAAAAAAAAAAAAAAAGACAAAAAAA AAAAAAAAAAAAAAAAAAAAGACAAAAAAA AAAAAAAAAAAAAAAAAAAAGACAAAAAAA AAAAAAAAAAAAAACAAAAAGACAAAAAAA AAAAAAAAAAAAAACAACAAGACAAAAAAA AAAAAAAAAAAAAACAACAAGACAAAAAAA AAAAAAAAAGAAAACAACAAGACAAAAAAA AAAAAATAAGAAAACAACAAGACAAAAAAA AAAAAATAAGAAAACAACAAGACAAAAAAA AAAAAATTAGAAAACAACAAGACAAAAAAA
Example 7-2 was something of a programming challenge, but you end up with the satisfaction of seeing your (simulated) DNA mutate. How about writing a graphical display for this, so that every time a base gets mutated, it makes a little explosion and the color gets highlighted, so you can watch it happening in real-time?
Before you scoff, you should know how important good graphical displays are for the success of most programs. This may be a trivial-sounding graphic, but if you can demonstrate the most common mutations in, for instance, the BRCA breast cancer genes in this way, it might be useful.
To return to the business at hand, you may have noticed something when you looked over the output from Example 7-2. Look at the first two lines of the "10 more successive mutations." They are exactly the same! Could it be that after patting yourself on the back and telling yourself what a good bit of work you'd done, you've discovered a bug?
How can you track it down? You may want to step through the running of the program with the Perl debugger, which you saw in Chapter 6. However, this time, you stop and think about your design instead. You're replacing the bases at random positions with randomly chosen bases. Aha! Sometimes the base at the position you randomly choose is exactly the same as the base you randomly choose to plug into its place! You're replacing a base with itself on occasion![3]
Let's say you decide that behavior is not useful. At each successive mutation,
you need to see one base change. How can you alter your code to ensure that?
Let's start with some pseudocode for the mutate
subroutine:
Select a random position in the string of DNA Repeat: Choose a random nucleotide Until: random nucleotide differs from the nucleotide in the random position Substitute the random nucleotide into the random position in the DNA
This seems like something that should work, so you alter the mutate
subroutine, calling it the mutate_better
subroutine:
# mutate_better # # Subroutine to perform a mutation in a string of DNA--version 2, in which # it is guaranteed that one base will change on each call # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub mutate_better { my($dna) = @_; my(@nucleotides) = ('A', 'C', 'G', 'T'), # Pick a random position in the DNA my($position) = randomposition($dna); # Pick a random nucleotide my($newbase); do { $newbase = randomnucleotide(@nucleotides); # Make sure it's different than the nucleotide we're mutating }until ( $newbase ne substr($dna, $position,1) ); # Insert the random nucleotide into the random position in the DNA # The substr arguments mean the following: # In the string $dna at position $position change 1 character to # the string in $newbase substr($dna,$position,1,$newbase); return $dna; }
When you plug this subroutine in place of mutate
and run the code, you get the following output:
Mutate DNA Here is the original DNA: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Here is the mutant DNA: AAAAAAAAAAAAATAAAAAAAAAAAAAAAA Here are 10 more successive mutations: AAAAAAAAAAAAATAAAAAAAACAAAAAAA AAAAATAAAAAAATAAAAAAAACAAAAAAA AAATATAAAAAAATAAAAAAAACAAAAAAA AAATATAAAAAAATAAAAAAAACAACAAAA AATTATAAAAAAATAAAAAAAACAACAAAA AATTATTAAAAAATAAAAAAAACAACAAAA AATTATTAAAAAATAAAAAAAACAACACAA AATTATTAAAAAGTAAAAAAAACAACACAA AATTATTAAAAAGTGAAAAAAACAACACAA AATTATTAAAAAGTGATAAAAACAACACAA
which seems to indeed make a real change on every iteration.
Notice one more thing about
declaring variables. In this code for mutate_better
, if you'd declared $newbase
within the loop, since the loop is enclosed in a block,
the variable $newbase
would not then be
visible outside of that loop. In particular, it wouldn't be available in the
substr
call that does the actual base
change for the mutation. So, in mutate_better
, you had to declare the variable outside of the
loop.
This is a frequent source of confusion for programmers who like to declare variables on the fly and a powerful argument for getting into the habit of collecting variable definitions together at the top of the program.
Even so, there are often times when you want to hide a variable within a block, because that's the only place where you will use it. Then you may want to do the declaration in the block . (Perhaps at the top of the block, if it's a long one?)
[3] How often? In DNA that's all one base, it's happening 1/4 of the time. In DNA that's equally populated with the four bases, it's happening...1/4 of the time!