So why parse BLAST output? One reason is to see if your DNA has any new matches against the DNA stored in the constantly growing databases. You can write a program to automatically perform a daily BLAST search and compare its results with those of the previous day by parsing the summary list of hits and comparing it with the previous day's summary list. You can then have the program email you if something new has turned up.
Example 12-1 consists of
a main program and two new subroutines. The
subroutines—parse_blast
and
parse_blast_alignment
—use regular expressions to extract the various bits of data from a
scalar string. I chose this method because the data, although structured, does
not clearly identify each line with its function. (See the discussion in Chapter 10 and Chapter 11.)
Example 12-1. Extract annotation and alignments from BLAST output file
#!/usr/bin/perl # Extract annotation and alignments from BLAST output file use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # declare and initialize variables my $beginning_annotation = ''; my $ending_annotation = ''; my %alignments = ( ); my $filename = 'blast.txt'; parse_blast($beginning_annotation, $ending_annotation, \%alignments, $filename); # Print the annotation, and then # print the DNA in new format just to check if we got it okay. print $beginning_annotation; foreach my $key (keys %alignments) { print "$key XXXXXXXXXXXX ", $alignments{$key}, " XXXXXXXXXXX "; } print $ending_annotation; exit; ################################################################################ # Subroutines for Example 12-1 ################################################################################ # parse_blast # # --parse beginning and ending annotation, and alignments, # from BLAST output file sub parse_blast { my($beginning_annotation, $ending_annotation, $alignments, $filename) = @_; # $beginning_annotation--reference to scalar # $ending_annotation --reference to scalar # $alignments --reference to hash # $filename --scalar # declare and initialize variables my $blast_output_file = ''; my $alignment_section = ''; # Get the BLAST program output into an array from a file $blast_output_file = join( '', get_file_data($filename)); # Extract the beginning annotation, alignments, and ending annotation ($$beginning_annotation, $alignment_section, $$ending_annotation) = ($blast_output_file =~ /(.*^ALIGNMENTS )(.*)(^ Database:.*)/ms); # Populate %alignments hash # key = ID of hit # value = alignment section %$alignments = parse_blast_alignment($alignment_section); } # parse_blast_alignment # # --parse the alignments from a BLAST output file, # return hash with # key = ID # value = text of alignment sub parse_blast_alignment { my($alignment_section) = @_; # declare and initialize variables my(%alignment_hash) = ( ); # loop through the scalar containing the BLAST alignments, # extracting the ID and the alignment and storing in a hash # # The regular expression matches a line beginning with >, # and containing the ID between the first pair of | characters; # followed by any number of lines that don't begin with > while($alignment_section =~ /^>.* (^(?!>).* )+/gm) { my($value) = $&; my($key) = (split(/|/, $value)) [1]; $alignment_hash{$key} = $value; } return %alignment_hash; }
The main program does little more than call the parsing subroutine and print the results. The arguments, initialized as empty, are passed by reference (see Chapter 6).
The subroutine parse_blast
does the
top-level parsing job of separating the three sections of a BLAST output file:
the annotation at the beginning, the alignments in the middle, and the
annotation at the end. It then calls the parse_blast_alignment
subroutine to extract the individual
alignments from that middle alignment section. The data is first read in from
the named file with our old friend the get_file_data
subroutine from Chapter 8. Use the join
function to store the array of file data into a scalar string.
The three sections of the BLAST output file are separated by the following statement:
($$beginning_annotation, $alignment_section, $$ending_annotation) = ($blast_output_file =~ /(.*^ALIGNMENTS )(.*)(^ Database:.*)/ms);
The pattern match contains three parenthesized expressions:
(.*^ALIGNMENTS )
which is returned into $$beginning_annotation;
(.*)
which is saved in $alignment_section;
and:
(^ Database:.*)
which is saved in $$ending_annotation
.
The use of $$
instead of $
at the beginning of two of these variables
indicates that they are references to scalar variables. Recall that they were
passed in as arguments to the subroutine, where they were preceded by a
backslash, like so:
parse_blast($beginning_annotation, $ending_annotation, \%alignments, $filename);
You've seen references to variables before, starting in Chapter 6. Let's review them briefly.
Within the parse_blast
subroutine, those
variables with only one $
are references to
the scalar variables. They need an additional $
to represent actual scalar variables. This is how references
work; they need an additional special character to indicate what kinds of
variables they are references to. So a reference to a scalar variable needs to
start with $$
, a reference to an array
variable needs to start with @$
, and a
reference to a hash variable needs to start with %$
.
The regular expression in the previous code snippet matches everything up to
the word ALIGNMENTS
at the end of a line
(.*^ALIGNMENTS
)
; then everything for a
while (.*)
; then a line that begins with two
spaces and the word Database:
followed by the
rest of the file (^
Database:.*)
. These three expressions in
parentheses correspond to the three desired parts of the BLAST output file; the
beginning annotation, the alignment section, and the ending annotation.
The alignments saved in $alignment_section
are separated out by the subroutine parse_blast_alignment
. This subroutine has one important
loop:
while($alignment_section =~ /^>.* (^(?!>).* )+/gm) { my($value) = $&; my($key) = (split(/|/, $value)) [1]; $alignment_hash{$key} = $value; }
You're probably thinking that this regular expression looks downright evil. At first glance, regular expressions do sometimes seem incomprehensible, so let's take a closer look. There are a few new things to examine.
The five lines comprise a while
loop, which
(due to the global /g
modifier on the pattern
match in the while
loop) keeps matching the
pattern as many times as it appears in the string. Each time the program cycles
through the loop, the pattern match finds the value (the entire alignment), then
determines the key. The key and values are saved in the hash %alignment_hash
.
Here's an example of one of the matches that's found by this while
loop when parsing the BLAST output shown in
Section 12.3:
>emb|AL353748.13|AL353748 Human DNA sequence from clone RP11-317B17 on chromosome 9, complete sequence [Homo sapiens] Length = 179155 Score = 38.2 bits (19), Expect = 4.1 Identities = 22/23 (95%) Strand = Plus / Plus Query: 192 ggcgggggtcgtgagggagtgcg 214 |||| |||||||||||||||||| Sbjct: 48258 ggcgtgggtcgtgagggagtgcg 48280
This text starts with a line beginning with a >
character. In the complete BLAST output, sections like these
follow one another. What you want to do is start matching from a line beginning
with >
and include all following adjacent
lines that don't start with a >
character.
You also want to extract the identifier, which appears between the first and
second vertical bar | characters on the first line (e.g., AL353748.13
in this alignment).
Let's dissect the regular expression:
$alignment_section =~ /^>.* (^(?!>).* )+/gm
This pattern match, which appears in a while
loop within the code, has the modifier m
for multiline. The m
modifier allows ^
to match
any beginning-of-line inside the multiline string, and $
to match any end-of-line.
The regular expression breaks down as follows. The first part is:
^>.*
It looks for >
at the beginning of the
BLAST output line, followed by .*
, which
matches any quantity of anything (except newlines), up to the first newline. In
other words, it matches the first line of the alignment.
Here's the rest of the regular expression:
(^(?!>).* )+
After the ^
which matches the beginning of
the line, you'll see a negative lookahead assertion,
(?!>)
, which ensures that a >
doesn't follow. Next, the .*
matches all non-newline characters, up to the
final
at the end of the line. All of that
is wrapped in parentheses with a surrounding +
, to match all available such lines (at least one).
Now that you've matched the entire alignment, you want to extract the key and
populate the hash with your key and value. Within the while
loop, the alignment that you just matched is automatically
set by Perl as the value of the special variable $&
and saved in the variable $value
. Now you need to extract your key from the alignment. It
can be found on the first line of the alignment stored in $value
, between the first and second |
symbols.
Extracting this identifying key is done using the split
function, which breaks the string into an array. The call
to split
:
split(/|/, $value)
splits $value
into pieces delimited by
|
characters. That is, the |
symbol is used to determine where one list
element ends and the next one begins. (Remember that the vertical bar |
is a metacharacter and must be escaped as
|
.) By surrounding the call to split
with parentheses and adding an array offset ([1]
), you can isolate the key and save it into $key
.
Let's step back now and look at Example 12-1 in its entirety. Notice that it's very short—barely more than two pages, including comments. Although it's not an easy program, due to the complexity of the regular expressions involved, you can make sense of it if you put a little effort into examining the BLAST output files and the regular expressions that parse it.
Regular expressions have lots of complex features, but as a result, they can do lots of useful things. As a Perl programmer, the effort you put into learning them is well worth it and can have significant payoffs down the road.
Let's take the parsing of the BLAST output file a little further. Notice that some of the alignments include more than one aligned string—for instance, the alignment for ID AK017941.1, shown again here:
>dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN full-length enriched library, clone:5830420C16, full insert sequence Length = 1461 Score = 210 bits (106), Expect = 5e-52 Identities = 151/166 (90%) Strand = Plus / Plus Query: 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294 |||||||| ||||||| || ||||| ||||||||||| ||||||||||| ||||||||| Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107 Query: 295 gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354 ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| || Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167 Query: 355 tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400 |||||||||||||| |||||||||||||| |||||||||||||||| Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213 Score = 44.1 bits (22), Expect = 0.066 Identities = 25/26 (96%) Strand = Plus / Plus Query: 118 gcggaagtagttgtgggcgcctttgc 143 ||||||||||||| |||||||||||| Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260
To parse these alignments, we have to parse out each of the matched strings, which in BLAST terminology are called high-scoring pairs (HSPs).
Each HSP also contains some annotation, and then the HSP itself. Let's parse each HSP into annotation, query string, and subject string, together with the starting and ending positions of the strings. More parsing is possible; you can extract specific features of the annotation, as well as the locations of identical and nonidentical bases in the HSP, for instance.
Example 12-2 includes a pair of subroutines; one to parse the alignments into their HSPs, and the second to extract the sequences and their end positions. The main program extends Example 12-1 using these new subroutines.
Example 12-2. Parse alignments from BLAST output file
#!/usr/bin/perl # Parse alignments from BLAST output file use strict; use warnings; use BeginPerlBioinfo; # see Chapter 6 about this module # declare and initialize variables my $beginning_annotation = ''; my $ending_annotation = ''; my %alignments = ( ); my $alignment = ''; my $filename = 'blast.txt'; my @HSPs = ( ); my($expect, $query, $query_range, $subject, $subject_range) = ('','','','',''), parse_blast($beginning_annotation, $ending_annotation, \%alignments, $filename); $alignment = $alignments{'AK017941.1'}; @HSPs = parse_blast_alignment_HSP($alignment); ($expect, $query, $query_range, $subject, $subject_range) = extract_HSP_information($HSPs[1]); # Print the results print " -> Expect value: $expect "; print " -> Query string: $query "; print " -> Query range: $query_range "; print " -> Subject String: $subject "; print " -> Subject range: $subject_range "; exit; ################################################################################ # Subroutines for Example 12-2 ################################################################################ # parse_blast_alignment_HSP # # --parse beginning annotation, and HSPs, # from BLAST alignment # Return an array with first element set to the beginning annotation, # and each successive element set to an HSP sub parse_blast_alignment_HSP { my($alignment ) = @_; # declare and initialize variables my $beginning_annotation = ''; my $HSP_section = ''; my @HSPs = ( ); # Extract the beginning annotation and HSPs ($beginning_annotation, $HSP_section ) = ($alignment =~ /(.*?)(^ Score =.*)/ms); # Store the $beginning_annotation as the first entry in @HSPs push(@HSPs, $beginning_annotation); # Parse the HSPs, store each HSP as an element in @HSPs while($HSP_section =~ /(^ Score =.* )(^(?! Score =).* )+/gm) { push(@HSPs, $&); } # Return an array with first element = the beginning annotation, # and each successive element = an HSP return(@HSPs); } # extract_HSP_information # # --parse a HSP from a BLAST output alignment section # - return array with elements: # Expect value # Query string # Query range # Subject string # Subject range sub extract_HSP_information { my($HSP) = @_; # declare and initialize variables my($expect) = ''; my($query) = ''; my($query_range) = ''; my($subject) = ''; my($subject_range) = ''; ($expect) = ($HSP =~ /Expect = (S+)/); $query = join ( '' , ($HSP =~ /^Query(.*)/gm) ); $subject = join ( '' , ($HSP =~ /^Sbjct(.*)/gm) ); $query_range = join('..', ($query =~ /(d+).*D(d+)/s)); $subject_range = join('..', ($subject =~ /(d+).*D(d+)/s)); $query =~ s/[^acgt]//g; $subject =~ s/[^acgt]//g; return ($expect, $query, $query_range, $subject, $subject_range); }
Example 12-2 gives the following output:
-> Expect value: 5e-52 -> Query string: ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgacaactgcaatgagt ggttccatggggactgcatccggatca -> Query range: 235..400 -> Subject String: ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattggatgtgacaactgcaacgagt ggttccatggagactgcatccggatca -> Subject range: 1048..1213
Let's discuss the new features of Example 12-2 and its subroutines. First notice that the two new subroutines from Example 12-1 have been placed into the BeginPerlBioinfo.pm module, so they aren't printed again here.
The main program, Example 12-2,
starts the same as Example 12-1;
it calls the parse_blast
subroutine to separate the annotation from the alignments in the
BLAST output file.
The next line fetches one of the alignments from the %alignments
hash, which is then used as the argument to the parse_blast_alignment_HSP
subroutine, which then returns an array of annotation (as the first
element) and HSPs in @HSPs
.
Finally, Example 12-2 does the
lower-level parsing of an individual HSP by calling the extract_HSP_information
subroutine, and the extracted parts of one of the HSPs are
printed.
Example 12-2 shows a certain inconsistency in our design. Some subroutines call their arguments by reference; others call them by value (see Chapter 6). You may ask: is this a bad thing?
The answer is: not necessarily. The subroutine parse_blast
mixes several arguments, and one of them is not a
scalar type. Recall that this is a potentially good place to use
call-by-reference in Perl. The other subroutines don't mix argument
types this way. However, they can be designed to call their arguments by
reference.
Continuing with the code, let's examine the subroutine parse_blast_alignment_HSP
. This takes one of the alignments from the BLAST output and
separates out the individual HSP string matches. The technique used is, once
again, regular expressions operating on a single string that contains all the
lines of the alignment given as the input argument.
The first regular expression parses out the annotation and the section containing the HSPs:
($beginning_annotation, $HSP_section ) = ($alignment =~ /(.*?)(^ Score =.*)/ms);
The first parentheses in the regular expression is (.*?)
This is the nongreedy or minimal matching; it matches the shortest
string possible. By default, * is greedy, matching the longest string possible.
Here it gobbles up everything before the first line that begins Score
=
(without the ?
after the *
, it would gobble
everything until the final line that begins Score
=
). This is the exact dividing line between
the beginning annotation and the HSP string matches.
The next loop and regular expression separates the individual HSP string matches:
while($HSP_section =~ /(^ Score =.* )(^(?! Score =).* )+/gm) { push(@HSPs, $&); }
This is the same kind of
global string match in a while
loop you've seen before; it keeps iterating as long as the match can be found.
The other modifier /m
is the multiline modifier, which enables the metacharacters
$
and ^
to
match before and after embedded newlines, respectively.
The expression within the first pair of parentheses—(^
Score
=.*
)
—matches a line that begins Score =
, which is the kind of line that begins an
HSP string match section.
The code within the second pair of parentheses—(^(?!
Score
=).*
)+
—matches one or more (the +
following the outer parentheses) lines that do
not begin with Score
=
. The ?!
at the beginning of the embedded parentheses is the negative lookahead assertion
you encountered in Example 12-1.
So, in total, the regular expression captures a line beginning with Score
=
and
all succeeding
adjacent lines that don't
begin with Score
=
.