Over the fairly short history of bioinformatics, several different biologists and programmers have invented several ways to format sequence data in computer files, and so bioinformaticians must deal with these different formats. We need to extract the sequence data and the annotations from these files, which requires writing code to deal with each different format.
There are many such formats, perhaps as many as 20 in regular use for DNA alone. The very multiplicity of these formats can be an annoyance when you're analyzing a sequence in the lab: it becomes necessary to translate from one format to another for the various programs you use to examine the sequence. Here are some of the most popular:
The FASTA and Basic Local Alignment Search Technique (BLAST) programs are popular; they both use the FASTA format. Because of its simplicity, the FASTA format is perhaps the most widely used of all formats, aside from GenBank.
GenBank is a collection of all publicly released genetic data. It includes lots of information in addition to the DNA sequence. It's very important, and we'll be looking closely at GenBank files in Chapter 10.
The EMBL database has substantially the same data as the GenBank and the DDBJ (DNA Data Bank of Japan), but the format is somewhat different.
This is DNA sequence data that has no formatting whatsoever, just the characters that represent the bases; it is output into files by the sequencing machines from ABI and from other machines and programs.
The GCG program (a.k.a. the GCG Wisconsin package) from Accelrys is used at many large research institutions. Data must be in GCG format to be usable by their programs.
Of these six sequence formats, GenBank and FASTA are by far the most common. The next few sections take you through the process of reading and manipulating data in FASTA.
Let's write a subroutine that can handle FASTA-style data. This is useful in its own right and as a warm-up for the upcoming chapters on GenBank, PDB, and BLAST.
FASTA format is basically just lines of sequence data with newlines at the end so it can be printed on a page or displayed on a computer screen. The length of the lines isn't specified, but for compatibility, it's best to limit them to at most 80 characters. There is also header information , a line or lines at the beginning of the file that start with the greater-than > character, that can contain any text whatsoever (or no text). Typically, a header line contains the name of the DNA or the gene it comes from, often separated by a vertical bar from additional information about the sequence, the experiment that produced it, or other, nonsequence information of that nature.
Much FASTA-aware software insists that there must be only one header line; others permit several lines. Our subroutine will accept either one or several header lines plus comments beginning with #.
The following is a FASTA file. We'll call it sample.dna and use it in several programs. You should copy it, download it from this book's web site, or make up your own file with your own data.
> sample dna | (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc acacctgagccactctcagatgaggaccta
In Chapter 4, you learned how to read in sequence data; here, you just have to extend that method to deal with the header lines. You'll also learn how to discard empty lines and lines that begin with the pound sign #, i.e., comments in Perl and other languages and file formats. (These don't appear in the FASTA file sample.dna just shown.)
There are two choices when reading in the data. You can read from the open file one line at a time, making decisions as you go. Or, you can slurp the whole file into an array and then operate on the array. For very big files, it's sometimes best to read them one line at a time, especially if you're looking for some small bit of information. (This is because reading a large file into an array uses a large amount of memory. If your system isn't robust enough, it may crash.)
For smaller, normal-sized files, the advantage to reading all the data into an array is that you can then easily look through at the data and do operations on it. That's what we'll do with our subroutine, but remember, this approach can cause memory space problems with larger files, and there are other ways of proceeding.
Let's write a subroutine that, given as an argument a filename containing FASTA-formatted data, returns the sequence data.
Before doing so you should think about whether you should have just one subroutine, or perhaps one subroutine that opens and reads a file, called by another subroutine that extracts the sequence data. Let's use two subroutines, keeping in mind that you can reuse the subroutine that deals with arbitrary files every time you need to write such a program for other formats.
Let's start with some pseudocode:
subroutine get data from a file argument = filename open file if can't open, print error message and exit read in data and return @data } Subroutine extract sequence data from fasta file argument = array of file data in fasta format Discard all header lines (and blank and comment lines for good measure) If first character of first line is >, discard it Read in the rest of the file, join in a scalar, edit out nonsequence data return sequence }
In the first subroutine that gets data from a file, there's a question as to what's the best thing to do when the file can't be read. Here, we're taking the drastic approach: yelling "Fire!" and exiting. But you wouldn't necessarily want your program to just stop whenever it can't open a file. Maybe you're asking for filenames from the user at the keyboard or on a web page, and you'd like to give them three chances to type in the filename correctly. Or maybe, if the file can't be opened, you want a default file instead.
Maybe you can return a false
value, such as
an empty array, if you can't open the file. Then a program that calls this
subroutine can exit, try again, or whatever it wants. But what if you
successfully open the file, but it was absolutely empty? Then you'd have
succeeded and returned an empty array, and the program calling this subroutine
would think, incorrectly, that the file couldn't be opened. So, that wouldn't
work.
There are other options, such as returning the special "undefined" value. Let's keep what we've got, but it's important to remember that handling errors can be an important, and sometimes tricky, part of writing robust code, code that responds well in unusual circumstances.
The second subroutine takes the array of FASTA-formatted sequence and returns just the unformatted sequence in a string.
Now that you've thought about the problem, written some pseudocode, considered alternate ways of designing the subroutines and the costs and benefits of the choices, you're ready to code:
# get_file_data # # A subroutine to get data from a file given its filename sub get_file_data { my($filename) = @_; use strict; use warnings; # Initialize variables my @filedata = ( ); unless( open(GET_FILE_DATA, $filename) ) { print STDERR "Cannot open file "$filename" "; exit; } @filedata = <GET_FILE_DATA>; close GET_FILE_DATA; return @filedata; } # extract_sequence_from_fasta_data # # A subroutine to extract FASTA sequence data from an array sub extract_sequence_from_fasta_data { my(@fasta_file_data) = @_; use strict; use warnings; # Declare and initialize variables my $sequence = ''; foreach my $line (@fasta_file_data) { # discard blank line if ($line =~ /^s*$/) { next; # discard comment line } elsif($line =~ /^s*#/) { next; # discard fasta header line } elsif($line =~ /^>/) { next; # keep line, add to sequence string } else { $sequence .= $line; } } # remove non-sequence data (in this case, whitespace) from $sequence string $sequence =~ s/s//g; return $sequence; }
Notice that nowhere in the code for
extract_sequence_from_fasta_data
do you
check to see what's in the file: is it really DNA or protein sequence data in
FASTA format? Of course, you can write a subroutine—call it is_fasta
—that checks the data to see if it's what
we expect. But I'll leave that for the exercises.
A few comments about the extract_sequence_from_fasta_data
subroutine should be made. The
following line includes a
variable declaration as it is used in a loop:
foreach my $line (@fasta_file_data) {
You've seen this in for
loops as well. It's
often convenient to declare a loop's "index variable" (like $line
) as a my
variable, right on the spot in the loop, as they tend to have common names and
aren't used outside the loop.
Some of the regular expressions deserve brief comment. In this line:
if ($line =~ /^s*$/) {
the s
matches whitespace, that is, space, tab, formfeed, carriage return, or
newline. s*
matches any amount of whitespace
(even none). The ^
matches the beginning of
the line, and the $
matches the end of the
line. So altogether, this regular expression matches blank lines with nothing or
only whitespace in them.
This regular expression also has nothing or only whitespace at the beginning of the line, up to a pound sign:
} elsif($line =~ /^s*#/) {
This expression matches a greater-than sign at the beginning of the line:
} elsif($line =~ /^>/) {
Finally, the following statement removes whitespace, including newlines:
$sequence =~ s/s//g;
We've placed these two new subroutines into our BeginPerlBioinfo.pm module. Now let's write a main program for these subroutines and look at the output. First, there's one more subroutine to write that handles the printing of long sequences.
When
you try to print the "raw" sequence data, it can be a problem if the data is
much longer than the width of the page. For most practical purposes, 80
characters is about the maximum length you should try to fit across a page.
Let's write a print_sequence
subroutine
that takes as its arguments some sequence and a line length and prints out the
sequence, breaking it up into lines of that length. It will have a strong
similarity to the dna2peptide
subroutine.
Here it is:
# print_sequence # # A subroutine to format and print sequence data sub print_sequence { my($sequence, $length) = @_; use strict; use warnings; # Print sequence in lines of $length for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) { print substr($sequence, $pos, $length), " "; } }
The code depends on the behavior of substr
, which gives the partial substring at the end of the
string, even if it's less than the requested length. You can see there's a new
print_sequence
subroutine in the
BeginPerlBioinfo.pm module (see Chapter 6). We remembered to keep the
statement 1;
as the last line of the module.
Example 8-2 shows the main
program.
Example 8-2. Read a FASTA file and extract the sequence data
#!/usr/bin/perl # Read a fasta file and extract the sequence data use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my @file_data = ( ); my $dna = ''; # Read in the contents of the file "sample.dna" @file_data = get_file_data("sample.dna"); # Extract the sequence data from the contents of the file "sample.dna" $dna = extract_sequence_from_fasta_data(@file_data); # Print the sequence in lines 25 characters long print_sequence($dna, 25); exit;
Here's the output of Example 8-2:
agatggcggcgctgaggggtcttgg gggctctaggccggccacctactgg tttgcagcggagacgacgcatgggg cctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagt agttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgc aggttcgcgggtcgctggcgggggt cgtgagggagtgcgccgggagcgga gatatggagggagatggttcagacc cagagcctccagatgccggggagga cagcaagtccgagaatggggagaat gcgcccatctactgcatctgccgca aaccggacatcaactgcttcatgat cgggtgtgacaactgcaatgagtgg ttccatggggactgcatccggatca ctgagaagatggccaaggccatccg ggagtggtactgtcgggagtgcaga gagaaagaccccaagctagagattc gctatcggcacaagaagtcacggga gcgggatggcaatgagcgggacagc agtgagccccgggatgagggtggag ggcgcaagaggcctgtccctgatcc agacctgcagcgccgggcagggtca gggacaggggttggggccatgcttg ctcggggctctgcttcgccccacaa atcctctccgcagcccttggtggcc acacccagccagcatcaccagcagc agcagcagcagatcaaacggtcagc ccgcatgtgtggtgagtgtgaggca tgtcggcgcactgaggactgtggtc actgtgatttctgtcgggacatgaa gaagttcgggggccccaacaagatc cggcagaagtgccggctgcgccagt gccagctgcgggcccgggaatcgta caagtacttcccttcctcgctctca ccagtgacgccctcagagtccctgc caaggccccgccggccactgcccac ccaacagcagccacagccatcacag aagttagggcgcatccgtgaagatg agggggcagtggcgtcatcaacagt caaggagcctcctgaggctacagcc acacctgagccactctcagatgagg accta
Now, one final program for this section. Let's add to the preceding program a translation from DNA to protein and print out the protein instead. Notice how short Example 8-3 is! As you accumulate useful subroutines in modules, programs get easier and easier to write.
Example 8-3. Read a DNA FASTA file, translate to protein, and format output
#!/usr/bin/perl # Read a fasta file and extract the DNA sequence data # Translate it to protein and print it out in 25-character-long lines use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # Initialize variables my @file_data = ( ); my $dna = ''; my $protein = ''; # Read in the contents of the file "sample.dna" @file_data = get_file_data("sample.dna"); # Extract the sequence data from the contents of the file "sample.dna" $dna = extract_sequence_from_fasta_data(@file_data); # Translate the DNA to protein $protein = dna2peptide($dna); # Print the sequence in lines 25 characters long print_sequence($protein, 25); exit;
Here's the output of Example 8-3:
RWRR_GVLGALGRPPTGLQRRRRMG PAQ_EYAAWEA_LEAEVVVGAFATA WDAAEWSVQVRGSLAGVVRECAGSG DMEGDGSDPEPPDAGEDSKSENGEN APIYCICRKPDINCFMIGCDNCNEW FHGDCIRITEKMAKAIREWYCRECR EKDPKLEIRYRHKKSRERDGNERDS SEPRDEGGGRKRPVPDPDLQRRAGS GTGVGAMLARGSASPHKSSPQPLVA TPSQHHQQQQQQIKRSARMCGECEA CRRTEDCGHCDFCRDMKKFGGPNKI RQKCRLRQCQLRARESYKYFPSSLS PVTPSESLPRPRRPLPTQQQPQPSQ KLGRIREDEGAVASSTVKEPPEATA TPEPLSDEDL