Perl makes it easy to start other programs and collect their output, all from within your Perl program. This is an extremely useful capability; for most programs, Perl makes it fairly simple to accomplish.
You may need to run some particular program many times, for instance over every file in PDB to extract secondary structure information. The program itself may not have a way to tell it "run yourself over all these files." Also, the output of the program may have all sorts of extraneous information. What you need is a much simpler report that just presents the information that interests you—perhaps in a format that could then be input to another program! With Perl you can write a program to do exactly this.
An important kind of program to automate is a web site that provides some useful program or data online. Using the appropriate Perl modules, you can connect to the web site, send it your input, collect the output, and then parse and reformat as you wish. It's actually not hard to do! O'Reilly's Perl Cookbook, a companion volume to Programming Perl, is an excellent source of short programs and helpful descriptions to get you started.
Perl is a great way to automate other programs. The next section shows an example of a Perl program that starts another program and collects, parses, reformats, and outputs the results. This program will control another program on the same computer. The example will be from a Unix or Linux environment; consult your Perl documentation on how to get the same functionality from your Windows or older Macintosh platform.
We will use an external program to
calculate the secondary structure from the 3D coordinates of a PDB
file. As a secondary structure assignment engine, I use a a program called
stride that outputs a secondary structure report. stride is available from EMBL (http://www.embl-heidelberg.de/stride/stride_info.html) and runs on
Unix, Linux, Windows, Macintosh, and VMS systems. The program works very simply;
just give it a command-line argument of a PDB filename. I collect the output in
the subroutine call_stride
that follows.
Example 11-7 is the entire program: two subroutines and a main program, followed by a discussion.
Example 11-7. Call another program for secondary structure prediction
#!/usr/bin/perl # Call another program to perform secondary structure prediction use strict; use warnings; # Call "stride" on a file, collect the report my(@stride_output) = call_stride('pdb/c1/pdb1c1f.ent'), # Parse the stride report into primary sequence, and secondary # structure prediction my($sequence, $structure) = parse_stride(@stride_output); # Print out the beginnings of the sequence and the secondary structure print substr($sequence, 0, 80), " "; print substr($structure, 0, 80), " "; exit; ################################################################################ # Subroutine for Example 11-7 ################################################################################ # call_stride # # --given a PDB filename, return the output from the "stride" # secondary structure prediction program sub call_stride { use strict; use warnings; my($filename) = @_; # The stride program options my($stride) = '/usr/local/bin/stride'; my($options) = ''; my(@results) = ( ); # Check for presence of PDB file unless ( -e $filename ) { print "File "$filename" doesn't seem to exist! "; exit; } # Start up the program, capture and return the output @results = `$stride $options $filename`; return @results; } # parse_stride # #--given stride output, extract the primary sequence and the # secondary structure prediction, returning them in a # two-element array. sub parse_stride { use strict; use warnings; my(@stridereport) = @_; my($seq) = ''; my($str) = ''; my $length; # Extract the lines of interest my(@seq) = grep(/^SEQ /, @stridereport); my(@str) = grep(/^STR /, @stridereport); # Process those lines to discard all but the sequence # or structure information for (@seq) { $_ = substr($_, 10, 50) } for (@str) { $_ = substr($_, 10, 50) } # Return the information as an array of two strings $seq = join('', @seq); $str = join('', @str); # Delete unwanted spaces from the ends of the strings. # ($seq has no spaces that are wanted, but $str may) $seq =~ s/(s+)$//; $length = length($1); $str =~ s/s{$length}$//; return( ($seq, $str) ); }
As you can see in the
subroutine call_stride
,
variables have been made for the program name ($stride
) and for the options you may want to pass ($options
). Since these are parts of the program
you may want to change, put them as variables near the top of the code, to make
them easy to find and alter. The argument to the subroutine is the PDB filename
($filename
). (Of course, if you expect
the options to change frequently, you can make them another argument to the
subroutine.)
Since you're dealing with a program that takes a file, do a little error
checking to see if a file by that name actually exists. Use the
-e
file test operator. Or you can omit this
and let the stride program figure it out,
and capture its error output. But that requires parsing the stride output for
its error output, which involves figuring out how stride reports errors. This can get complicated, so I'd stick
with using the -e
file test operator.
The actual running of the program and collecting its output happen in just one line. The program to be run is enclosed in backticks, which run the program (first expanding variables) and return the output as an array of lines.
There are other ways to run a program. One common way is the system
function call. It behaves differently from the backticks: it
doesn't return the output of the command it calls (it just returns the exit
status, an integer indicating success or failure of the command). Other
methods include qx
, the open
system call, and the
fork
and exec
functions.
I don't go into too much detail here about parsing the output of stride. Let's just exhibit some code that extracts the primary sequence and the secondary structure prediction. See the exercises at the end of the chapter for a challenge to extract the secondary structure information from a PDB file's HELIX, SHEET, and TURN record types and output the information in a similar format as stride does here.
Here is a typical section of a stride output (not the entire output):
SEQ 1 MDKNELVQKAKLAEQAERYDDMAACMKSVTEQGAELSNEERNLLSVAYKN 50 1A4O STR HHHHHHHHHHHHHH HHHHHHHHHHHHHTTT HHHHHHHHHHHHH 1A4O REM 1A4O REM . . . . . 1A4O SEQ 51 VVGARRSSWRVVSSIEQKEKKQQMAREYREKIETELRDICNDVLSLLEKF 100 1A4O STR HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHT 1A4O REM 1A4O REM . . . . . 1A4O SEQ 101 LIPNAAESKVFYLKMKGDYYRYLAEVAAGDDKKGIVDQSQQAYQEAFEIS 150 1A4O STR TTTTT HHHHHHHHHHHHHHHHHHHH HHHHHHHHHHHHHHHHHHHHH 1A4O REM 1A4O REM . . . . 1A4O SEQ 151 KKEMIRLGLALNFSVFYYACSLAKTAFDEAIAELLIMQLLRDNLTLW 197 1A4O STR TTTTHHHHHHHHHHHH HHHHHHHHHHHHH HHHHHHHHHH 1A4O
Notice that each line is prefaced by an identifier, which should make
collecting the different record types easy. Without even consulting the
documentation (a slightly dangerous but sometimes expedient approach), you can
see that the primary sequence has keyword SEQ
, the structure prediction has keyword STR
, and the data of interest lies from position 11 up to
position 60 on each line. (We'll ignore everything else for now.)
The following list shows the one-letter secondary structure codes used by stride:
H |
Alpha helix |
G |
3-10 helix |
I |
PI helix |
E |
Extended conformation |
B or b |
Isolated bridge |
T |
Turn |
C |
Coil (none of the above) |
Now let's examine the subroutine parse_stride
in
Example 11-7 that takes
stride output and returns an array of
two strings, the primary sequence and the structure prediction.
This is a very "Perlish" subroutine that uses some features that manipulate text. What's interesting is the brevity of the program, which some of Perl's built-in functions make possible.
First, you receive the output of the stride program in the subroutine argument @_
. Next, use the grep
function to extract those lines of interest, which are easy to
identify in this output, as they begin with clear identifiers SEQ
and STR
.
Next, you want to save just those positions (or columns) of these lines that
have the sequence or structure information; you don't need the keywords,
position numbers, or the PDB entry name at the end of the lines. Using the
substr
function, the two for
loops alter each line of the two arrays by
saving the 11th to the 60th positions of those strings. This is where the
desired information lies.
Finally, join the arrays into single strings. Here, there's one detail to handle; you need to remove any unneeded spaces from the ends of the strings. Notice that stride sometimes leaves spaces in the structure prediction, and in this example, has left some at the end of the structure prediction. So you shouldn't throw away all the spaces at the ends of the strings. Instead, throw away all the spaces at the end of the sequence string, because they are just superfluous spaces on the line. Now, see how many spaces that was, and throw the equal amount away at the end of the structure prediction string, thus preserving spaces that correspond to undetermined secondary structure.
Example 11-7 contains a main program that calls two subroutines, which, since they are short, are all included (so there's no need here for the BeginPerlBioinfo module). Here's the output of Example 11-7:
GGLQVKNFDFTVGKFLTVGGFINNSPQRFSVNVGESMNSLSLHLDHRFNYGADQNTIVMNSTLKGDNGWETEQRSTNFTL TTTTTTBTTT EEEEEEETTTT EEEEEEEEETTEEEEEEEEEEEETTEEEEEEEEEETTGGG B EEE
The first line shows the amino acids, and the second line shows the prediction of the secondary structure. The exercises that follow ask you to write a subroutine that will improve that output.