Now that you've successfully extracted the sequence, let's look at parsing the annotations of a GenBank file.
Looking at a GenBank record, it's interesting to think about how to extract the useful information. The FEATURES table is certainly a key part of the story. It has considerable structure: what should be preserved, and what is unnecessary? For instance, sometimes you just want to see if a word such as "endonuclease" appears anywhere in the record. For this, you just need a subroutine that searches for any regular expression in the annotation. Sometimes this is enough, but when detailed surgery is necessary, Perl has the necessary tools to make the operation successful.
Example 10-3 parses a few pieces of information from the annotations in a GenBank file. It does this using the data in the form of an array.
Example 10-3. Parsing GenBank annotations using arrays
#!/usr/bin/perl # Parsing GenBank annotations using arrays use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my @genbank = ( ); my $locus = ''; my $accession = ''; my $organism = ''; # Get GenBank file data @genbank = get_file_data('record.gb'), # Let's start with something simple. Let's get some of the identifying # information, let's say the locus and accession number (here the same # thing) and the definition and the organism. for my $line (@genbank) { if($line =~ /^LOCUS/) { $line =~ s/^LOCUSs*//; $locus = $line; }elsif($line =~ /^ACCESSION/) { $line =~ s/^ACCESSIONs*//; $accession = $line; }elsif($line =~ /^ ORGANISM/) { $line =~ s/^s*ORGANISMs*//; $organism = $line; } } print "*** LOCUS *** "; print $locus; print "*** ACCESSION *** "; print $accession; print "*** ORGANISM *** "; print $organism; exit;
Here's the output from Example 10-3:
*** LOCUS *** AB031069 2487 bp mRNA PRI 27-MAY-2000 *** ACCESSION *** AB031069 *** ORGANISM *** Homo sapiens
Now let's slightly extend that program to handle the DEFINITION field. Notice that the DEFINITION field can extend over
more than one line. To collect that field, use a trick you've already seen in
Example 10-1: set a flag when you're in the "state" of collecting a definition. The flag
variable is called, unsurprisingly, $flag
.
Example 10-4. Parsing GenBank annotations using arrays, take 2
#!/usr/bin/perl # Parsing GenBank annotations using arrays, take 2 use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my @genbank = ( ); my $locus = ''; my $accession = ''; my $organism = ''; my $definition = ''; my $flag = 0; # Get GenBank file data @genbank = get_file_data('record.gb'), # Let's start with something simple. Let's get some of the identifying # information, let's say the locus and accession number (here the same # thing) and the definition and the organism. for my $line (@genbank) { if($line =~ /^LOCUS/) { $line =~ s/^LOCUSs*//; $locus = $line; }elsif($line =~ /^DEFINITION/) { $line =~ s/^DEFINITIONs*//; $definition = $line; $flag = 1; }elsif($line =~ /^ACCESSION/) { $line =~ s/^ACCESSIONs*//; $accession = $line; $flag = 0; }elsif($flag) { chomp($definition); $definition .= $line; }elsif($line =~ /^ ORGANISM/) { $line =~ s/^s*ORGANISMs*//; $organism = $line; } } print "*** LOCUS *** "; print $locus; print "*** DEFINITION *** "; print $definition; print "*** ACCESSION *** "; print $accession; print "*** ORGANISM *** "; print $organism; exit;
Example 10-4 outputs:
*** LOCUS *** AB031069 2487 bp mRNA PRI 27-MAY-2000 *** DEFINITION *** Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds. *** ACCESSION *** AB031069 *** ORGANISM *** Homo sapiens
This use of flags to remember which part of the file you're in, from one iteration of a loop to the next, is a common technique when extracting information from files that have multiline sections. As the files and their fields get more complex, the code must keep track of many flags at a time to remember which part of the file it's in and what information needs to be extracted. It works, but as the files become more complex, so does the code. It becomes hard to read and hard to modify. So let's look at regular expressions as a vehicle for parsing annotations.
We've used
two methods to parse GenBank files: regular expressions and looping through arrays of lines and setting
flags. We used both methods to separate the annotation from the sequence in a
previous section of this chapter. Both methods were equally well suited, since
in GenBank files, the annotation is followed by the sequence, clearly delimited
by an ORIGIN
line: a simple structure.
However, parsing the annotations seems a bit more complicated; therefore, let's
try to use regular expressions to accomplish the task.
To begin, let's wrap the code we've been working on into some convenient subroutines to focus on parsing the annotations. You'll want to fetch GenBank records one at a time from a library (a file containing one or more GenBank records), extract the annotations and the sequence, and then if desired parse the annotations. This would be useful if, say, you were looking for some motif in a GenBank library. Then you can search for the motif, and, if found, you can parse the annotations to look for additional information about the sequence.
As mentioned previously, we'll use the file library.gb, which you can download from this book's web site.
Since dealing with annotation data is somewhat complex, let's take a minute to break our tasks into convenient subroutines. Here's the pseudocode:
sub open_file given the filename, return the filehandle sub get_next_record given the filehandle, get the record (we can get the offset by first calling "tell") sub get_annotation_and_dna given a record, split it into annotation and cleaned-up sequence sub search_sequence given a sequence and a regular expression, return array of locations of hits sub search_annotation given a GenBank annotation and a regular expression, return array of locations of hits sub parse_annotation separate out the fields of the annotation in a convenient form sub parse_features given the features field, separate out the components
The idea is to make a subroutine for each important task you want to accomplish and then combine them into useful programs. Some of these can be combined into other subroutines: for instance, perhaps you want to open a file and get the record from it, all in one subroutine call.
You're designing these subroutines to work with library files, that is, files
with multiple GenBank records. You pass the filehandle into the subroutines as
an argument, so that your subroutines can access open library files as
represented by the filehandles. Doing so enables you to have a
get_next_record
function, which is handy in a
loop. Using the Perl function tell
also allows you to save the byte offset of any record of interest,
and then return later and extract the record at that byte offset very quickly.
(A byte offset is just the number of characters into the
file where the information of interest lies.) The operating system supports Perl
in letting you go immediately to any byte offset location in even huge files,
thus bypassing the usual way of opening the file and reading from the beginning
until you get where you want to be.
Using a byte offset is important when you're dealing with large files. Perl
gives you built-in functions such as seek
that allow you, on an open file, to go immediately to any location in the
file
. The idea is that when you find something in a file, you can save
the byte offset using the Perl function
tell
. Then, when you want to return to that point in the file,
you can just call the Perl function seek
with the byte offset as an argument. You'll see this later in this chapter when
you build a DBM file to look up records based on their accession numbers. But
the main point is that with a 250-MB file, it takes too long to find something
by searching from the beginning, and there are ways of getting around it.
The parsing of the data is done in three steps, according to the design:
You'll separate out the annotation and the sequence (which you'll clean up by removing whitespace, etc., and making it a simple string of sequence). Even at this step, you can search for motifs in the sequence, as well as look for text in the annotation.
Extract out the fields.
Parse the features table.
These steps seem natural, and, depending on what you want to do, allow you to parse to whatever depth is needed.
Here's a main program in pseudocode that shows how to use those subroutines:
open_file while ( get_next_record ) get_annotation_and_dna if ( search_sequence for a motif AND search_annotation for chromosome 22 ) parse_annotation parse_features to get sizes of exons, look for small sizes } } return accession numbers of records meeting the criteria
This example shows how to use subroutines to answer a question such as: what are the genes on chromosome 22 that contain a given motif and have small exons?
Let's test these subroutines with Example 10-5, which has some subroutine definitions that will be added to the BeginPerlBioinfo.pm module:
Example 10-5. GenBank library subroutines
#!/usr/bin/perl # - test program of GenBank library subroutines use strict; use warnings; # Don't use BeginPerlBioinfo # Since all subroutines defined in this file # use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my $fh; # variable to store filehandle my $record; my $dna; my $annotation; my $offset; my $library = 'library.gb'; # Perform some standard subroutines for test $fh = open_file($library); $offset = tell($fh); while( $record = get_next_record($fh) ) { ($annotation, $dna) = get_annotation_and_dna($record); if( search_sequence($dna, 'AAA[CG].')) { print "Sequence found in record at offset $offset "; } if( search_annotation($annotation, 'homo sapiens')) { print "Annotation found in record at offset $offset "; } $offset = tell($fh); } exit; ################################################################################ # Subroutines ################################################################################ # open_file # # - given filename, set filehandle sub open_file { my($filename) = @_; my $fh; unless(open($fh, $filename)) { print "Cannot open file $filename "; exit; } return $fh; } # get_next_record # # - given filehandle for GenBank file, get GenBank record sub get_next_record { my($fh) = @_; my($offset); my($record) = ''; my($save_input_separator) = $/; $/ = "// "; $record = <$fh>; $/ = $save_input_separator; return $record; } # get_annotation_and_dna # # - given GenBank record, get annotation and DNA sub get_annotation_and_dna { my($record) = @_; my($annotation) = ''; my($dna) = ''; # Now separate the annotation from the sequence data ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGINs* )(.*)// /s); # clean the sequence of any whitespace or / characters # (the / has to be written / in the character class, because # / is a metacharacter, so it must be "escaped" with ) $dna =~ s/[s/d]//g; return($annotation, $dna) } # search_sequence # # - search sequence with regular expression sub search_sequence { my($sequence, $regularexpression) = @_; my(@locations) = ( ); while( $sequence =~ /$regularexpression/ig ) { push( @locations, pos ); } return (@locations); } # search_annotation # # - search annotation with regular expression sub search_annotation { my($annotation, $regularexpression) = @_; my(@locations) = ( ); # note the /s modifier--. matches any character including newline while( $annotation =~ /$regularexpression/isg ) { push( @locations, pos ); } return (@locations); }
Example 10-5 generates the following output on our little GenBank library:
Sequence found in record at offset 0 Annotation found in record at offset 0 Sequence found in record at offset 6256 Annotation found in record at offset 6256 Sequence found in record at offset 12366 Annotation found in record at offset 12366 Sequence found in record at offset 17730 Annotation found in record at offset 17730 Sequence found in record at offset 22340 Annotation found in record at offset 22340
The tell
function reports the byte offset of the file up to the point where
it's been read; so you want to first call tell
and then read the record to get the proper offset
associated with the beginning of the
record.
Now let's parse the annotations.
There is a document from NCBI we mentioned earlier that gives the details of the structure of a GenBank record. This file is gbrel.txt and is part of the GenBank release, available at the NCBI web site and their FTP site. It's updated with every release (every two months at present), and it includes notices of changes to the format. If you program with GenBank records, you should read this document and keep a copy around for reference use, and check periodically for announced changes in the GenBank record format.
If you look back at the complete GenBank record earlier in this chapter, you'll see that the annotations have a certain structure. You have some fields, such as LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, REFERENCE, FEATURES, and BASE COUNT that start at the beginning of a line. Some of these fields have subfields, especially the FEATURES field, which has a fairly complex structure.
But for now, let's just extract the top-level fields. You will need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.
Here's a regular expression that matches our definition of a field:
/^[A-Z].* (^s.* )*/m
What does this regular expression say? First of all, it has the /m
pattern matching modifier, which means the
caret ^
and the dollar sign $
also match around embedded newlines (not just at
the beginning and end of the entire string, which is the default
behavior).
The first part of the regular expression:
^[A-Z].*
matches a capital letter at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. That's a good description of the first lines of the fields you're trying to match.
The second part of the regular expression:
(^s.* )*
matches a space or tab s
at the beginning
of a line, followed by any number of characters (except newlines), followed by a
newline. This is surrounded by parentheses and followed by a *, which means 0 or
more such lines. This matches succeeding lines in a field, lines that start with
whitespace. A field may have no extra lines of this sort or several such
lines.
So, the two parts of the regular expression combined match the fields with their optional additional lines.
Example 10-6 shows a subroutine that, given the annotations section of a GenBank record stored in a scalar variable, returns a hash with keys equal to the names of the top-level fields and values equal to the contents of those fields.
Example 10-6. Parsing GenBank annotation
#!/usr/bin/perl # - test program for parse_annotation subroutine use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my $fh; my $record; my $dna; my $annotation; my %fields; my $library = 'library.gb'; # Open library and read a record $fh = open_file($library); $record = get_next_record($fh); # Parse the sequence and annotation ($annotation, $dna) = get_annotation_and_dna($record); # Extract the fields of the annotation %fields = parse_annotation($annotation); # Print the fields foreach my $key (keys %fields) { print "******** $key ********* "; print $fields{$key}; } exit; ################################################################################ # Subroutine ################################################################################ # parse_annotation # # given a GenBank annotation, returns a hash with # keys: the field names # values: the fields sub parse_annotation { my($annotation) = @_; my(%results) = ( ); while( $annotation =~ /^[A-Z].* (^s.* )*/gm ) { my $value = $&; (my $key = $value) =~ s/^([A-Z]+).*/$1/s; $results{$key} = $value; } return %results; }
In the
subroutine parse_annotation
,
note how the variables $key
and $value
are scoped within the while
block. One benefit of this is that you don't
have to reinitialize the variables each time through the loop. Also note that
the key is the name of the field, and the value is the whole field.
You should take the time to understand the regular expression that extracts the field name for the key:
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;
This first assigns $key
the value $value
. It then replaces everything in $key
(note the /s
modifier for embedded newlines) with $1
, which is a special variable that remembers the pattern that
matched between the first pair of parentheses ([A-Z]+
). This pattern is one or more capital letters (anchored to
the beginning of the string, i.e., the field name), so it sets $key
to the value of the first word in the field
name.
You get the following output from Example 10-6 (the test just fetches the first record in the GenBank library):
******** SOURCE ********* SOURCE Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to mRNA. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. ******** DEFINITION ********* DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds. ******** KEYWORDS ********* KEYWORDS . ******** VERSION ********* VERSION AB031069.1 GI:8100074 ******** FEATURES ********* FEATURES Location/Qualifiers source 1..2487 /organism="Homo sapiens" /db_xref="taxon:9606" /sex="male" /cell_line="HuS-L12" /cell_type="lung fibroblast" /dev_stage="embryo" gene 229..2199 /gene="PCCX1" CDS 229..2199 /gene="PCCX1" /note="a nuclear protein carrying a PHD finger and a CXXC domain" /codon_start=1 /product="protein containing CXXC domain 1" /protein_id="BAA96307.1" /db_xref="GI:8100075" /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR" ******** REFERENCE ********* REFERENCE 2 (bases 1 to 2487) AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and Takano,T. TITLE Direct Submission JOURNAL Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases. Tadahiro Fujino, Keio University School of Medicine, Department of Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan (E-mail:[email protected], Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508) ******** ACCESSION ********* ACCESSION AB031069 ******** LOCUS ********* LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000 ******** ORIGIN ********* ORIGIN ******** BASE ********* BASE COUNT 564 a 715 c 768 g 440 t
As you see, the method is working, and apart from the difficulty of reading the regular expressions (which will become easier with practice), the code is very straightforward, just a few short subroutines.
Let's take this
one step further and parse the features table to its next level,
composed of the source
, gene
, and CDS
features keys. (See later in this section for a more
complete list of these features keys.) In the exercises at the end of the
chapter, you'll be challenged to descend further into the FEATURES table.
To study the FEATURES table, you should first look over the NCBI gbrel.txt document mentioned previously. Then you should study the most complete documentation for the FEATURES table, available at http://www.ncbi.nlm.nih.gov/collab/FT/index.html.
Although our GenBank entry is fairly simple and includes only three features, there are actually quite a few of them. Notice that the parsing code will find all of them, because it's just looking at the structure of the document, not for specific features.
The following is a list of the features defined for GenBank records. Although lengthy, I think it's important to read through it to get an idea of the range of information that may be present in a GenBank record.
Obsolete; see variation
feature key
Sequence related to transcription termination
Span of the C immunological feature
CAAT box in eukaryotic promoters
Sequence coding for amino acids in protein (includes stop codon)
Independent sequence determinations differ
Displacement loop
Span of the D immunological feature
Cis-acting enhancer of promoter function
Region that codes for part of spliced mRNA
Region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned
GC box in eukaryotic promoters
Intervening DNA eliminated by recombination
Transcribed region excised by mRNA splicing
Span of the J immunological feature
Long terminal repeat
Mature peptide coding region (doesn't include stop codon)
Miscellaneous binding site
Miscellaneous difference feature
Region of biological significance that can't be described by any other feature
Miscellaneous recombination feature
Miscellaneous transcript feature not defined by other RNA keys
Miscellaneous signal
Miscellaneous DNA or RNA structure
The indicated base is a modified nucleotide
Messenger RNA
Obsolete: see variation
feature key
Span of the N immunological feature
Presented sequence revises a previous version
Signal for cleavage and polyadenylation
Site at which polyadenine is added to mRNA
Any RNA species that isn't yet the mature RNA product
Primary (unprocessed) transcript
Primer binding region used with PCR
Noncovalent primer binding site
A region involved in transcription initiation
Noncovalent protein binding site on DNA or RNA
Ribosome binding site
Replication origin for duplex DNA
Sequence containing repeated subsequences
One repeated unit of a repeat_region
Ribosomal RNA
Span of the S immunological feature
Satellite repeated sequence
Small cytoplasmic RNA
Signal peptide coding region
Small nuclear RNA
Biological source of the sequence data represented by a
GenBank record; mandatory feature, one or more per record; for
organisms that have been incorporated within the NCBI taxonomy
database, an associated /db_xref="taxon:NNNN"
qualifier will be present
(where NNNNN
is the numeric
identifier assigned to the organism within the NCBI taxonomy
database)
Hairpin loop structure in DNA or RNA
Sequence Tagged Site: operationally unique sequence that identifies the combination of primer spans used in a PCR assay
TATA box in eukaryotic promoters
Sequence causing transcription termination
Transit peptide coding region
Transposable element (TN)
Transfer RNA
Authors are unsure about the sequence in this region
Span of the V immunological feature
A related population contains stable mutation
Placeholder (hyphen)
Pribnow box in prokaryotic promoters
-35 box in prokaryotic promoters
3'-most region of a precursor transcript removed in processing
3' untranslated region (trailer)
5'-most region of a precursor transcript removed in processing
5' untranslated region (leader)
These feature keys can have their own additional features, which you'll see here and in the exercises.
Example 10-7 finds whatever features are present and returns an array populated with them. It doesn't look for the complete list of features as presented in the last section; it finds just the features that are actually present in the GenBank record and returns them for further use.
It's often the case that there are multiple instances of the same feature in a record. For instance, there may be several exons specified in the FEATURES table of a GenBank record. For this reason we'll store the features as elements in an array, rather than in a hash keyed on the feature name (as a hash would allow you to store, for instance, only one value with key 'exon').
Example 10-7. Testing subroutine parse_features
#!/usr/bin/perl # - main program to test parse_features use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my $fh; my $record; my $dna; my $annotation; my %fields; my @features; my $library = 'library.gb'; # Get the fields from the first GenBank record in a library $fh = open_file($library); $record = get_next_record($fh); ($annotation, $dna) = get_annotation_and_dna($record); %fields = parse_annotation($annotation); # Extract the features from the FEATURES table @features = parse_features($fields{'FEATURES'}); # Print out the features foreach my $feature (@features) { # extract the name of the feature (or "feature key") my($featurename) = ($feature =~ /^ {5}(S+)/); print "******** $featurename ********* "; print $feature; } exit; ############################################################################ # Subroutine ############################################################################ # parse_features # # extract the features from the FEATURES field of a GenBank record sub parse_features { my($features) = @_; # entire FEATURES field in a scalar variable # Declare and initialize variables my(@features) = (); # used to store the individual features # Extract the features while( $features =~ /^ {5}S.* (^ {21}S.* )*/gm ) { my $feature = $&; push(@features, $feature); } return @features; }
Example 10-7 gives the output:
******** source ********* source 1..2487 /organism="Homo sapiens" /db_xref="taxon:9606" /sex="male" /cell_line="HuS-L12" /cell_type="lung fibroblast" /dev_stage="embryo" ******** gene ********* gene 229..2199 /gene="PCCX1" ******** CDS ********* CDS 229..2199 /gene="PCCX1" /note="a nuclear protein carrying a PHD finger and a CXXC domain" /codon_start=1 /product="protein containing CXXC domain 1" /protein_id="BAA96307.1" /db_xref="GI:8100075" /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR
In subroutine parse_features
of Example 10-7, the regular
expression that extracts the features is much like the regular expression
used in Example 10-6 to parse
the top level of the annotations. Let's look at the essential parsing code
of Example 10-7:
while( $features =~ /^ {5}S.* (^ {21}S.* )*/gm ) {
On the whole, and in brief, this regular expression finds features formatted with the first lines beginning with 5 spaces, and optional continuation lines beginning with 21 spaces.
First, note that the pattern modifier /m
enables the ^
metacharacter to match after embedded newlines. Also, the {5}
and {21}
are quantifiers that specify there should be exactly 5,
or 21, of the preceding item, which in both cases is a space.
The regular expression is in two parts, corresponding to the first line
and optional continuation lines of the feature. The first part ^
{5}S.*
means that the beginning of a
line (^
) has 5 spaces ({5}
), followed by a non-whitespace character
(S
) followed by any number of
non-newlines (.*
) followed by a newline
(
). The second part of the regular
expression, (^
{21}S.*
)*
means the beginning of a
line (^
) has 21 spaces ({21}
) followed by a non-whitespace character
(S
) followed by any number of
non-newlines (.*
) followed by a newline
(
); and there may be 0
or more such lines, indicated by the
()*
around the whole
expression.
The main program has a short regular expression along similar lines to extract the feature name (also called the feature key) from the feature.
So, again, success. The FEATURES table is now decomposed or "parsed" in some detail, down to the level of separating the individual features. The next stage in parsing the FEATURES table is to extract the detailed information for each feature. This includes the location (given on the same line as the feature name, and possibly on additional lines); and the qualifiers indicated by a slash, a qualifier name, and if applicable, an equals sign and additional information of various kinds, possibly continued on additional lines.
I'll leave this final step for the exercises. It's a fairly straightforward extension of the approach we've been using to parse the features. You will want to consult the documentation from the NCBI web site for complete details about the structure of the FEATURES table before trying to parse the location and qualifiers from a feature.
The method I've used to parse the FEATURES table maintains the structure
of the information. However, sometimes you just want to see if some word
such as "endonulease" appears anywhere in the record. For this, recall that
you created a
search_annotation
subroutine in Example 10-5 that searches for
any regular expression in the entire annotation; very often, this is all you
really need. As you've now seen, however, when you really need to dig into
the FEATURES table, Perl has its own features that make the job possible
and even fairly easy.