In this final example dealing with randomization, you'll collect some statistics on DNA in order to answer the question: on average, what percentage of bases are the same between two random DNA sequences? Although some simple mathematics can answer the question for you, the point of the program is to show that you now have the necessary programming ability to ask and answer questions about your DNA sequences. (If you were using real DNA, say a collection of some particular gene as it appears in several organisms in slightly different forms, the answer would be somewhat more interesting. You may want to try that later.)
So let's generate a set of random DNA, all the same length, then ask the following question about the set. What's the average percentage of positions that are the same between pairs of DNA sequences in this set?
As usual, let's try to sketch an idea of the program in pseudocode:
Generate a set of random DNA sequences, all the same length For each pair of DNA sequences How many positions in the two sequences are identical as a fraction? } Report the mean of the preceding calculations as a percentage
Clearly, to write this code, you can reuse at least some of the work you've already done. You certainly know how to generate a set of random DNA sequences. Also, although you don't have a subroutine that compares, position by position, the bases in two sequences, you know how to look at the positions in DNA strings. So that subroutine shouldn't be hard to write. In fact, let's write some pseudocode that compares each nucleotide in one sequence with the nucleotides in the same positions in another sequence:
assuming DNA1 is the same length as DNA2, for each position from 1 to length(DNA) if the character at that position is the same in DNA_1 and DNA_2 ++$count } } return count/length
The whole problem now seems eminently do-able. You also have to write the code that picks each pair of sequences, collects the results, and finally takes the mean of the results and report it as a percentage. That can all go into the main program. Example 7-4 gives it a try, all in one shot.
Example 7-4. Calculate average % identity between pairs of random DNA sequences
#!/usr/bin/perl # Calculate the average percentage of positions that are the same # between two random DNA sequences, in a set of 10 sequences. use strict; use warnings; # Declare and initialize the variables my $percent; my @percentages; my $result; # An array, initialized to the empty list, to store the DNA in my @random_DNA = ( ); # Seed the random number generator. # time|$$ combines the current time with the current process id srand(time|$$); # Generate the data set of 10 DNA sequences. @random_DNA = make_random_DNA_set( 10, 10, 10 ); # Iterate through all pairs of sequences for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) { for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) { # Calculate and save the matching percentage $percent = matching_percentage($random_DNA[$k], $random_DNA[$i]); push(@percentages, $percent); } } # Finally, the average result: $result = 0; foreach $percent (@percentages) { $result += $percent; } $result = $result / scalar(@percentages); #Turn result into a true percentage $result = int ($result * 100); print "In this run of the experiment, the average percentage of "; print "matching positions is $result% "; exit; ################################################################################ # Subroutines ################################################################################ # matching_percentage # # Subroutine to calculate the percentage of identical bases in two # equal length DNA sequences sub matching_percentage { my($string1, $string2) = @_; # we assume that the strings have the same length my($length) = length($string1); my($position); my($count) = 0; for ($position=0; $position < $length ; ++$position) { if(substr($string1,$position,1) eq substr($string2,$position,1)) { ++$count; } } return $count / $length; } # make_random_DNA_set # # Subroutine to make a set of random DNA # # Accept parameters setting the maximum and minimum length of # each string of DNA, and the number of DNA strings to make # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub make_random_DNA_set { # Collect arguments, declare variables my($minimum_length, $maximum_length, $size_of_set) = @_; # length of each DNA fragment my $length; # DNA fragment my $dna; # set of DNA fragments my @set; # Create set of random DNA for (my $i = 0; $i < $size_of_set ; ++$i) { # find a random length between min and max $length = randomlength ($minimum_length, $maximum_length); # make a random DNA fragment $dna = make_random_DNA ( $length ); # add $dna fragment to @set push( @set, $dna ); } return @set; } # randomlength # # A subroutine that will pick a random number from # $minlength to $maxlength, inclusive. # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomlength { # Collect arguments, declare variables my($minlength, $maxlength) = @_; # Calculate and return a random number within the # desired interval. # Notice how we need to add one to make the endpoints inclusive, # and how we first subtract, then add back, $minlength to # get the random number in the correct interval. return ( int(rand($maxlength - $minlength + 1)) + $minlength ); } # make_random_DNA # # Make a string of random DNA of specified length. # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub make_random_DNA { # Collect arguments, declare variables my($length) = @_; my $dna; for (my $i=0 ; $i < $length ; ++$i) { $dna .= randomnucleotide( ); } return $dna; } # randomnucleotide # # Select at random one of the four nucleotides # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomnucleotide { my(@nucleotides) = ('A', 'C', 'G', 'T'), # scalar returns the size of an array. # The elements of the array are numbered 0 to size-1 return randomelement(@nucleotides); } # randomelement # # randomly select an element from an array # # WARNING: make sure you call srand to seed the # random number generator before you call this function. sub randomelement { my(@array) = @_; return $array[rand @array]; }
If the code in Example 7-4 seems somewhat repetitive of code from previous examples, it is. In the interest of presentation, I included the subroutine code in the program. (You'll start using modules in Chapter 8 as a way to avoid this repetition.)
Here's the output of Example 7-4:
In this run of the experiment, the average number of matching positions is 24%
Well, that seems reasonable. You might say, it's obvious: a quarter of the positions match, and there are four bases. But the point isn't to verify elementary probability, it's to show you have enough programming under your belt to write some programs that ask and answer questions about DNA sequences.
Notice in the main program that when it calls:
@random_DNA = make_random_DNA_set( 10, 10, 10 );
you don't need to declare and initialize variables such as $minimum_length
. You can just fill in the actual
numbers when you call the subroutine. (However it's often a good idea to put
such things in variables declared at the top of the program, where it's easy to
find and change them.) Here, you set the maximum and minimum lengths to 10 and
ask for 10 sequences.
Let's restate the problem we just solved. You have to compare all pairs of DNA, and for each pair, calculate the percentage of positions that have the same nucleotides. Then, you have to take the mean of these percentages.
Here's the code that accomplishes this in the main program of Example 7-4:
# Iterate through all pairs of sequences for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) { for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) { # Calculate and save the matching percentage $percent = matching_percentage($random_DNA[$k], $random_DNA[$i]); push(@percentages, $percent); } }
To look at each pair, you use a nested loop. A nested loop is simply a loop within another loop. These are fairly common in programming but must be handled with care. They may seem a little complex; take some time to see how the nested loop works, because it's common to have to select all combinations of two (or more) elements from a set.
The nested loop involves looking at (n * (n-1)) / 2 pairs of sequences, which is a square function of the size of the data set. This can get very big! Try gradually increasing the size of the data set and rerunning the program, and you'll see your compute time increase, and more than gradually.
See how the looping works? First sequence 0 (indexed by $k
) is paired with sequences 1,2,3,...,9, in turn
(indexed by $i
). Then sequence 1 is paired
with 2,3,...,9, etc. Finally, 8 is paired with 9. (Recall that array elements
are numbered starting at 0, so the last element of an array with 10 elements is
numbered 9. Also recall that scalar @random_DNA
returns the number of elements in the array.)
You might find it a worthwhile exercise to let the number of sequences be some
small value, say 3 or 4, and think through (paper and pencil in hand) how the
nested loops and the variables $k
and
$i
evolve during the running of the
program. Or you can use the Perl
debugger to watch how it happens.