It's not necessary to explode a string into an array in order to look at each character. In fact, sometimes you'd want to avoid that. A large string takes up a large amount of memory in your computer. So does a large array. When you explode a string into an array, the original string is still there, and you also have to make a copy of each character for the elements of the new array you're creating. If you have a large string, that already uses a good portion of available memory, creating an additional array can cause you to run out of memory. When you run out of memory, your computer performs poorly; it can slow to a crawl, crash, or freeze ("hang"). These haven't been worrisome considerations up to now, but if you use large data sets (such as the human genome), you have to take these things into account.
So let's say you'd like to avoid making a copy of the DNA sequence data into
another variable. Is there a way to just look at the string $DNA
and
count the bases from it directly? Yes. Here's some pseudocode, followed
by a Perl program:
read in the DNA from a file join the lines of the file into a single string of $DNA # initialize the counts count_of_A = 0 count_of_C = 0 count_of_G = 0 count_of_T = 0 for each base at each position in $DNA if base is A count_of_A = count_of_A + 1 if base is C count_of_C = count_of_C + 1 if base is G count_of_G = count_of_G + 1 if base is T count_of_T = count_of_T + 1 done print count_of_A, count_of_C, count_of_G, count_of_T
Example 5-6 shows a program that examines each base in a string of DNA.
Example 5-6. Determining frequency of nucleotides, take 2
#!/usr/bin/perl -w # Determining frequency of nucleotides, take 2 # Get the DNA sequence data print "Please type the filename of the DNA sequence data: "; $dna_filename = <STDIN>; chomp $dna_filename; # Does the file exist? unless ( -e $dna_filename) { print "File "$dna_filename" doesn't seem to exist!! "; exit; } # Can we open the file? unless ( open(DNAFILE, $dna_filename) ) { print "Cannot open file "$dna_filename" "; exit; } @DNA = <DNAFILE>; close DNAFILE; $DNA = join( '', @DNA); # Remove whitespace $DNA =~ s/s//g; # Initialize the counts. # Notice that we can use scalar variables to hold numbers. $count_of_A = 0; $count_of_C = 0; $count_of_G = 0; $count_of_T = 0; $errors = 0; # In a loop, look at each base in turn, determine which of the # four types of nucleotides it is, and increment the # appropriate count. for ( $position = 0 ; $position < length $DNA ; ++$position ) { $base = substr($DNA, $position, 1); if ( $base eq 'A' ) { ++$count_of_A; } elsif ( $base eq 'C' ) { ++$count_of_C; } elsif ( $base eq 'G' ) { ++$count_of_G; } elsif ( $base eq 'T' ) { ++$count_of_T; } else { print "!!!!!!!! Error - I don't recognize this base: $base "; ++$errors; } } # print the results print "A = $count_of_A "; print "C = $count_of_C "; print "G = $count_of_G "; print "T = $count_of_T "; print "errors = $errors "; # exit the program exit;
Here's the output of Example 5-6:
Please type the filename of the DNA sequence data: small.dna !!!!!!!! Error - I don't recognize this base: V A = 40 C = 27 G = 24 T = 17 errors = 1
In Example 5-6, I added a line of code to see if the file exists:
unless ( -e $dna_filename) {
There are file test operators for several conditions; see Appendix B or the Perl documentation under
-X
. Note that files have several
attributes, such as size, permission, location in the filesystem, and
type of file, and that many of these things can be tested for easily with the file
test operators.
Notice, also, that I have kept the detailed comments about the regular expression, because regular expressions can be hard to read, and a little commenting here helps a reader to skim the code.
Everything else is familiar,
until you hit the for
loop; it
requires a little explanation:
for ( $position = 0 ; $position < length $DNA ; ++$position ) { # the statements in the block }
This for
loop is the equivalent of this
while
loop:
$position = 0; while( $position < length $DNA ) { # the same statements in the block, plus ... ++$position; }
Take a moment and compare these two loops. You'll see the same statements but in different locations.
As you can see, the for
loop brings the
initialization and increment of a counter ($position
) into the loop statement, whereas in the while
loop, they are separate statements. In the
for
loop, both the initialization and the
increment statement are placed between parentheses, whereas you find only the
conditional test in the while
loop. In the
for
loop, you can put initializations
before the first semicolon and increment statements after the second semicolon. The
initialization statement is done just once before starting the loop, and the
increment statement is done at the end of each iteration through the block before
going back to the conditional test. It's really just a shorthand for the equivalent
while
loop as just shown.
The conditional test checks to see if the position reached in the string is less
than the length of the string. It uses the length
Perl function. Obviously, you don't want to check characters beyond the
length of the string. But a word is in order here about the numbering of positions in strings and arrays.
By default, Perl assumes that a string begins at position 0
and its last character is at a position that's numbered one less
than the length of the string. Why do it this way instead of numbering the positions
from 1 up to and including the length of the string? There are reasons, but they're
somewhat abstruse; see the documentation for enlightenment. If it's any comfort,
many other programming languages make the same choice. (However, many do it the
intuitive way, starting at 1. Ah well.)
This way of numbering is important to biologists because they are used to
numbering sequences beginning with 1, not with 0
the way Perl does it. You sometimes have to add 1 to a position before printing out
results so they'll make sense to nonprogrammers. It's mildly annoying, but you'll
get used to it.
The same holds true for numbering the elements of an array. The first element of
an array is element 0
; the last is element
$length-1
.
Anyway, you see that the conditional test evaluates to true
while the value of $position
is length-1
or less and fails when $position
reaches the same value as the length of the
string. For example, say you have a string that contains the text "seeing". This has
a length of six characters. The "s" is at position 0, and the "g" is at position 5,
which is one less than the string length 6.
Back in the block, you call the substr
function to look into the string:
$base = substr($DNA, $position, 1);
This is a fairly general-purpose function for working with strings; you can also
insert and delete things. Here, you look at just one character, so you call
substr
on the string $DNA
, ask it to look in position $position
for one character, and save the result in
scalar variable $base
. Then you proceed to
accumulate the
count as in the preceding version of the program, Example 5-4.