Finding Needles in Haystacks

One of the most powerful parts of the UNIX operating system is its capability to understand complex and sophisticated regular expressions. Combined with wildcards, it's an entire language for describing just what you're looking for or seeking to match, and the grep command offers just the tool you need in order to exploit this new language.

Task 9.1: Filename Wildcards

By now you are doubtless tired of typing every letter of each filename into your system for each example. There is a better and easier way! Just as the special card in poker can have any value, UNIX has special characters that the various shells (the command-line interpreter programs) all interpret as wildcards. This allows for much easier typing of patterns.


There are two wildcards to learn here: * acts as a match for any number and sequence of characters, and ? acts as a match for any single character. In the broadest sense, a lone * acts as a match for all files in the current directory (in other words, echo * prints the same names that ls lists), whereas a single ? acts as a match for all one-character-long filenames in a directory (for instance, ls ? will list only those filenames that are one character long).

  1. Start by using ls to list your home directory:

    % ls -CF
    Archives/               OWL/                    keylime.pie
    InfoWorld/              bin/                    src/
    Mail/                   bitnet.mailing-lists.Z  temp/
    News/                   drop.text.hqx           testme
    
  2. To experiment with wildcards, it's easiest to use the echo command. If you recall, echo repeats anything given to it, but—and here's the secret to its value—the shell interprets anything that is entered before the shell lets echo see it. That is, the * is expanded before the shell hands the arguments over to the command.

    % echo *
    Archives InfoWorld Mail News OWL bin bitnet.mailing-lists.Z
    drop.text.hqx keylime.pie src temp testme
    

    Using the * wildcard enables me to easily reference all files in the directory. This is quite helpful.

  3. A wildcard is even more helpful than the example suggests, because it can be embedded in the middle of a word or otherwise used to limit the number of matches. To see all files that begin with the letter t, use t*:

    % echo t*
    temp testme
    

    Try echo b* to see all your files that start with the letter b.

  4. Variations are possible too. I could use wildcards to list all files or directories that end with the letter s:

    % echo *s
    Archives News
    

    Watch what happens if I try the same command using the ls command rather than the echo command:

    % ls -CF *s
    Archives:
    Interleaf.story   Tartan.story.Z        nextstep.txt.Z
    Opus.story        interactive.txt.Z     rae.assist.infoworld.Z
    
    News:
    mailing.lists.usenet usenet.1              usenet.alt
    

    Using the ls command here makes UNIX think I want it to list the contents of the two directories, not just their names. This is where the -d flag to ls could prove helpful to force a listing of the directories rather than of their contents.

  5. Notice that, in the News directory, I have three files with the word usenet somewhere in their names. The wildcard pattern usenet* would match two of the files, and *usenet would match one. A valuable aspect of the * wildcard is that it can match zero or more characters, so the pattern *usenet* will match all three filenames:

    % echo News/*usenet*
    News/mailing.lists.usenet News/usenet.1 News/usenet.alt
    

    Also notice that wildcards can be embedded in a filename or pathname. In this example, I specified that I was interested in files in the News directory.

  6. Could you match a single character? To see how this can be helpful, it's time to move into a different directory, OWL on my system:

    % cd OWL
    % ls -CF
    Student.config   owl.c            owl.o
    WordMap/         owl.data         simple.editor.c
    owl*             owl.h            simple.editor.o
    

    If I request owl*, which files will be listed?

    % echo owl*
    owl owl.c owl.data owl.h owl.o
    

    What do I do if I am interested only in the source, header, and object files, which are here indicated by a .c, .h, or .o suffix? Using a wildcard that matches zero or more letters won't work; I don't want to see owl or owl.data. One possibility would be to use the pattern owl.* (by adding the period, I can eliminate the owl file itself). What I really want, however, is to be able to specify all files that start with the four characters owl. and have exactly one more character. This is a situation in which the ? wildcard works:

    % echo owl.?
    owl.c owl.h owl.o
    

    Because no files have exactly one letter following the three letters owl, watch what happens when I specify owl? as the pattern:

    % echo owl?
    echo: No match.
    

    This leads to a general observation. If you want to have echo return a question to you (output a question mark), you have to do it carefully because the shell interprets the question mark as a wildcard:

    % echo are you listening?
    echo: No match.
    

    To accomplish this, you simply need to surround the entire question with single quotation marks:

    % echo 'are you listening?'
    are you listening?
    

It won't surprise you that there are more complex ways of using wildcards to build filename patterns. What likely will surprise you is that the vast majority of UNIX users don't even know about the * and ? wildcards! This knowledge gives you a definite advantage.
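
Here is a minimal sketch that replays the two wildcards in a scratch directory. It assumes a Bourne-style shell (sh or bash) rather than the C shell shown in the listings above, and the filenames are simply borrowed from my OWL directory as stand-ins. One difference to expect: when nothing matches, a Bourne-style shell usually hands the pattern to echo unchanged rather than reporting No match.

```shell
# Assumption: run with sh or bash, not csh. File names are illustrative only.
LC_ALL=C; export LC_ALL              # keep the order of expanded names predictable
demo=$(mktemp -d) || exit 1          # scratch directory, removed at the end
cd "$demo" || exit 1
touch owl owl.c owl.data owl.h owl.o simple.editor.c

star=$(echo owl*)                    # * matches zero or more characters
one=$(echo owl.?)                    # ? matches exactly one character
quoted=$(echo 'are you listening?')  # single quotes turn wildcard expansion off

echo "$star"      # owl owl.c owl.data owl.h owl.o
echo "$one"       # owl.c owl.h owl.o
echo "$quoted"    # are you listening?

cd / && rm -rf "$demo"
```

Because the shell does the expanding, the same three patterns behave identically with ls, cat, or any other command in place of echo.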


Task 9.2: Advanced Filename Wildcards

Earlier, you learned about two special wildcard characters that can help you when you're specifying files for commands in UNIX. The first was ?, which matches any single character, and the other was *, which matches zero or more characters.


“Zero or more characters,” I can hear you asking. “Why would I need that?” The answer is that sometimes you want to have a pattern that might or might not contain a specific character.

There are more special wildcards for the shell you can use when specifying filenames, and it's time to learn about another of them. This new notation is known as a character range, serving as a wildcard less general than the question mark.

  1. A pair of square brackets denotes a range of characters, which can be either explicitly listed or indicated as a range with a dash between them. I'll start with a list of files in my current directory:

    % ls
    Archives/    News/        bigfiles     owl.c        src/
    InfoWorld/   OWL/         bin/         sample       temp/
    Mail/        awkscript    keylime.pie  sample2      tetme
    

    If I want to see both bigfiles and the bin directory, I can use b* as a file pattern:

    % ls -ld b*
    -rw-rw----  1 taylor        165 Dec  3 16:42 bigfiles
    drwx------  2 taylor        512 Oct 13 10:45 bin/
    

    If I want to see all entries that start with a lowercase letter, I can explicitly type each one:

    % ls -ld a* b* k* o* s* t*
    -rw-rw----  1 taylor        126 Dec  3 16:34 awkscript
    -rw-rw----  1 taylor        165 Dec  3 16:42 bigfiles
    drwx------  2 taylor        512 Oct 13 10:45 bin/
    -rw-rw----  1 taylor      12556 Nov 16 09:49 keylime.pie
    -rw-rw----  1 taylor       8729 Dec  2 21:19 owl.c
    -rw-rw----  1 taylor        199 Dec  3 16:11 sample
    -rw-rw----  1 taylor        207 Dec  3 16:11 sample2
    drwx------  2 taylor        512 Oct 13 10:45 src/
    drwxrwx---  2 taylor        512 Nov  8 22:20 temp/
    -rw-rw----  1 taylor        582 Nov 27 18:29 tetme
    

    That's clearly quite awkward. Instead, I can specify a subrange of characters to match. I specify the range by listing them all tucked neatly into a pair of square brackets:

    % ls -ld [abkost]*
    -rw-rw----  1 taylor        126 Dec  3 16:34 awkscript
    -rw-rw----  1 taylor        165 Dec  3 16:42 bigfiles
    drwx------  2 taylor        512 Oct 13 10:45 bin/
    -rw-rw----  1 taylor      12556 Nov 16 09:49 keylime.pie
    -rw-rw----  1 taylor       8729 Dec  2 21:19 owl.c
    -rw-rw----  1 taylor        199 Dec  3 16:11 sample
    -rw-rw----  1 taylor        207 Dec  3 16:11 sample2
    drwx------  2 taylor        512 Oct 13 10:45 src/
    drwxrwx---  2 taylor        512 Nov  8 22:20 temp/
    -rw-rw----  1 taylor        582 Nov 27 18:29 tetme
    

    In this case, the shell matches all files that start with a, b, k, o, s, or t. This notation is still a bit clunky and would be more so if more files were involved.

  2. The ideal is to specify a range of characters by using the hyphen character in the middle of a range:

    % ls -ld [a-z]*
    -rw-rw----  1 taylor        126 Dec  3 16:34 awkscript
    -rw-rw----  1 taylor        165 Dec  3 16:42 bigfiles
    drwx------  2 taylor        512 Oct 13 10:45 bin/
    -rw-rw----  1 taylor      12556 Nov 16 09:49 keylime.pie
    -rw-rw----  1 taylor       8729 Dec  2 21:19 owl.c
    -rw-rw----  1 taylor        199 Dec  3 16:11 sample
    -rw-rw----  1 taylor        207 Dec  3 16:11 sample2
    drwx------  2 taylor        512 Oct 13 10:45 src/
    drwxrwx---  2 taylor        512 Nov  8 22:20 temp/
    -rw-rw----  1 taylor        582 Nov 27 18:29 tetme
    

    In this example, the shell will match any file that begins with a lowercase letter, ranging from a to z, as specified.

  3. Space is critical in all wildcard patterns, too. Watch what happens if I accidentally add a space between the closing bracket of the range specification and the asterisk following:

    % ls -CFd [a-z] *
    Archives/    News/        bigfiles     owl.c        src/
    InfoWorld/   OWL/         bin/         sample       temp/
    Mail/        awkscript    keylime.pie  sample2      tetme
    

    This time, the shell tried to match all files whose names were one character long and lowercase, and then it tried to match all files that matched the asterisk wildcard, which, of course, included every file and directory in the current directory.

  4. The combination of character ranges, single-character wildcards, and multicharacter wildcards can be tremendously helpful. If I move to another directory, I can easily search for all files that contain a digit, a dot, or an underscore in the name:

    % cd Mail
    % ls -CF
    71075.446         emilyc            mailbox           sartin
    72303.2166        gordon_hat        manley            sent
    bmcinern          harrism           mark              shalini
    bob_gull          j=taylor          marmi             siob_n
    cennamo           james             marv              steve
    dan_some          jeffv             matt_ruby         tai
    dataylor          john_welch        mcwillia          taylor
    decc              john_prage        netnews.postings  v892127
    disserli          kcs               raf               wcenter
    druby             lehman            rexb              windows
    dunlaplm          lenz              rock              xd1f
    ean_huts          mac               rustle
    
    % ls *[0-9._]*
    71075.446         ean_huts          matt_ruby     xd1f
    72303.2166        gordon_hat        netnews.postings
    bob_gull          john_welch        siob_n
    dan_some          john_prage        v892127
    

I think that the best way to learn about pervasive features of UNIX such as shell filename wildcards is just to use them. If you flip through this book, you immediately notice that the examples are building on earlier information. This will continue to be the case, and the filename range notation shown here will be used again and again, in combination with the asterisk and question mark, to specify groups of files or directories.


Remember that if you want to experiment with filename wildcards, you can most easily use the echo command because it dutifully prints the expanded version of any pattern you specify.
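
The range notation and the stray-space pitfall can be tried safely in a scratch directory. The sketch below assumes sh or bash rather than csh, and the file and directory names are stand-ins chosen for the example. Two caveats: the LC_ALL=C setting is there because in some locales [a-z] can also match capital letters, and a Bourne-style shell leaves an unmatched pattern such as [a-z] on the command line as literal text.

```shell
# Assumption: sh or bash rather than csh; the names are hypothetical stand-ins.
LC_ALL=C; export LC_ALL            # in other locales [a-z] can also match capitals
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir Archives Mail bin
touch awkscript bigfiles keylime.pie sample2 v892127

listed=$(echo [abks]*)             # explicit list of first letters
ranged=$(echo [a-z]*)              # the same idea written as a range
digits=$(echo *[0-9]*)             # any name with a digit somewhere in it
spaced=$(echo [a-z] *)             # stray space: two patterns, and * matches everything

echo "$listed"    # awkscript bigfiles bin keylime.pie sample2
echo "$ranged"    # awkscript bigfiles bin keylime.pie sample2 v892127
echo "$digits"    # sample2 v892127
echo "$spaced"    # [a-z] Archives Mail awkscript bigfiles bin keylime.pie sample2 v892127

cd / && rm -rf "$demo"
```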

Task 9.3: Creating Sophisticated Regular Expressions

A regular expression can be as simple as a word to be matched letter for letter, such as acme, or as complex as '(^[a-zA-Z]|:wi)', which matches all lines that begin with an upper- or lowercase letter or that contain :wi.


The language of regular expressions is full of punctuation characters and other letters used in unusual ways. It is important to remember that regular expressions are different from shell wildcard patterns. It's unfortunate, but it's true. In the C shell, for example, a* lists any file that starts with the letter a. Regular expressions aren't left rooted, which means that you need to specify ^a if you want to match only lines that begin with the letter a. The shell pattern a* matches only filenames that start with the letter a, and the * has a different interpretation completely when used as part of a regular expression: a* is a pattern that matches zero or more occurrences of the letter a. The notation for regular expressions is shown in Table 9.1. The egrep command has additional notation, which you will learn shortly.

Table 9.1. Summary of Regular Expression Notation
Notation   Meaning
c          Matches the character c
\c         Forces c to be read as the letter c, not as another meaning the character might have
^          Beginning of the line
$          End of the line
.          Any single character
[xy]       Any single character in the set specified
[^xy]      Any single character not in the set specified
c*         Zero or more occurrences of character c

The notation isn't as complex as it looks in this table. The most important things to remember about regular expressions are that the * denotes zero or more occurrences of the preceding character, and . is any single character. Remember that shell patterns use * to match any set of zero or more characters independent of the preceding character, and ? to match a single character.
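
To make the contrast concrete, here is a small sketch that gives the same two characters, a*, first to the shell and then to grep. It assumes sh or bash, and the fruit names and the fruit.txt file are invented for the example.

```shell
# Assumption: sh or bash; the files created here are hypothetical.
LC_ALL=C; export LC_ALL
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
touch apple avocado banana
printf 'apple\nbanana\ncherry\n' > fruit.txt

as_glob=$(echo a*)                       # shell pattern: names that start with a
as_regex=$(grep -c 'a*' fruit.txt)       # regular expression: zero or more a's, so every line counts
rooted=$(grep '^a' fruit.txt)            # "starts with a" needs the ^ anchor in a regular expression
any_one=$(grep -c '^.herry$' fruit.txt)  # . does the job the shell gives to ?

echo "$as_glob"     # apple avocado
echo "$as_regex"    # 3
echo "$rooted"      # apple
echo "$any_one"     # 1

cd / && rm -rf "$demo"
```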

  1. The easy searches with grep are those that search for specific words without any special regular expression notation:

    % grep taylor /etc/passwd
    taylorj:?:1048:1375:James Taylor:/users/taylorj:/bin/csh
    mtaylor:?:760:1375:Mary Taylor:/users/mtaylor:/usr/local/bin/tcsh
    dataylor:?:375:518:Dave Taylor:/users/dataylor:/usr/local/lib/msh
    taylorjr:?:203:1022:James Taylor:/users/taylorjr:/bin/csh
    taylorrj:?:668:1042:Robert Taylor:/users/taylorrj:/bin/csh
    taylorm:?:869:1508:Melanie Taylor:/users/taylorm:/bin/csh
    taylor:?:1989:1412:Dave Taylor:/users/taylor:/bin/csh
    

    I searched for all entries in the passwd file that contain the pattern taylor.

  2. I've found more matches than I wanted, though. If I'm looking for my own account, I don't want to see all these alternatives. Using the ^ character before the pattern left-roots the pattern:

    % grep '^taylor' /etc/passwd
    taylorj:?:1048:1375:James Taylor:/users/taylorj:/bin/csh
    taylorjr:?:203:1022:James Taylor:/users/taylorjr:/bin/csh
    taylorrj:?:668:1042:Robert Taylor:/users/taylorrj:/bin/csh
    taylorm:?:869:1508:Melanie Taylor:/users/taylorm:/bin/csh
    taylor:?:1989:1412:Dave Taylor:/users/taylor:/bin/csh
    

    Now I want to narrow the search further. I want to specify a pattern that says “show me all lines that start with taylor, followed by a character that is not a lowercase letter.”

  3. To accomplish this, I use the [^xy] notation, which indicates an exclusion set, or set of characters that cannot match the pattern:

    % grep '^taylor[^a-z]' /etc/passwd
    taylor:?:1989:1412:Dave Taylor:/users/taylor:/bin/csh
    

    It worked! You can specify a set in two ways: You can either list each character or use a hyphen to specify a range starting with the character to the left of the hyphen and ending with the character to the right of the hyphen. That is, a-z is the range beginning with a and ending with z, and 0-9 includes all digits.

  4. To see which accounts were excluded, remove the ^ inside the brackets to search for an inclusion range, which is a set of characters of which one must match the pattern:

    % grep '^taylor[a-z]' /etc/passwd
    taylorj:?:1048:1375:James Taylor:/users/taylorj:/bin/csh
    taylorjr:?:203:1022:James Taylor:/users/taylorjr:/bin/csh
    taylorrj:?:668:1042:Robert Taylor:/users/taylorrj:/bin/csh
    taylorm:?:869:1508:Melanie Taylor:/users/taylorm:/bin/csh
    
  5. To see some other examples, I use head to view the first 10 lines of the password file:

    % head /etc/passwd
    root:?:0:0:root:/:/bin/csh
    news:?:6:11:USENET News:/usr/spool/news:/bin/ksh
    ingres:*?:7:519:INGRES Manager:/usr/ingres:/bin/csh
    usrlimit:?:8:800:(1000 user system):/mnt:/bin/false
    vanilla:*?:20:805:Vanilla Account:/mnt:/bin/sh
    charon:*?:21:807:The Ferryman:/users/tomb:
    actmaint:?:23:809:Maintenance:/usr/adm/actmaint:/bin/ksh
    pop:*?:26:819::/usr/spool/pop:/bin/csh
    lp:*?:70:10:System V Lp Adminuniverse(att):/usr/spool/lp:
    trouble:*?:97:501:Report Facility:/usr/mrg/trouble:/usr/local/lib/msh
    

    Now I'll specify a pattern that tells grep to search for all lines that contain zero or more occurrences of the letter z.

    % grep 'z*' /etc/passwd | head
    root:?:0:0:root:/:/bin/csh
    news:?:6:11:USENET News:/usr/spool/news:/bin/ksh
    ingres:*?:7:519:INGRES Manager:/usr/ingres:/bin/csh
    usrlimit:?:8:800:(1000 user system):/mnt:/bin/false
    vanilla:*?:20:805:Vanilla Account:/mnt:/bin/sh
    charon:*?:21:807:The Ferryman:/users/tomb:
    actmaint:?:23:809:Maintenance:/usr/adm/actmaint:/bin/ksh
    pop:*?:26:819::/usr/spool/pop:/bin/csh
    lp:*?:70:10:System V Lp Adminuniverse(att):/usr/spool/lp:
    trouble:*?:97:501:Report Facility:/usr/mrg/trouble:/usr/local/lib/msh
    Broken pipe
    

    The result is identical to the preceding command, but it shouldn't be a surprise. Specifying a pattern that matches zero or more occurrences will match every line! Specifying only the lines that have one or more z's is accomplished with an odd-looking pattern:

    % grep 'zz*' /etc/passwd | head
    marg:?:724:1233:Guyzee:/users/marg:/bin/ksh
    axy:?:1272:1233:martinez:/users/axy:/bin/csh
    wizard:?:1560:1375:Oz:/users/wizard:/bin/ksh
    zhq:?:2377:1318:Zihong:/users/zhq:/bin/csh
    mm:?:7152:1233:Michael Kenzie:/users/mm:/bin/ksh
    tanzm:?:7368:1140:Zhen Tan:/users/tanzm:/bin/csh
    mendozad:?:8176:1233:Don Mendoza:/users/mendozad:/bin/csh
    pavz:?:8481:1175:Mary L. Pavzky:/users/pavz:/bin/csh
    hurlz:?:9189:1375:Tom Hurley:/users/hurlz:/bin/csh
    tulip:?:9222:1375:Liz Richards:/users/tulip:/bin/csh
    Broken pipe
    
  6. Earlier I found that a couple of lines in the /etc/passwd file were for accounts that didn't specify a login shell. Each line in the password file must have a certain number of colons, and the last character on the line for these accounts will be a colon, an easy grep pattern:

    % grep ':$' /etc/passwd
    charon:*?:21:807:The Ferryman:/users/tomb:
    lp:*?:70:10:System V Lp Adminuniverse(att):/usr/spool/lp:
    
  7. Consider this. I get a call from my accountant, and I need to find a file containing a message about a $100 outlay of cash to buy some software. I can use grep to search for all files that contain a dollar sign, followed by a one, followed by one or more zeros:

    % grep '$100*' * */*
    Mail/bob_gale:     Unfortunately, our fees are currently $100 per test drive, budgets
    Mail/dan_sommer:We also pay $100 for Test Drives, our very short "First Looks" section. We often
    Mail/james:has been dropped, so if I ask for $1000 is that way outta line
    Mail/john_spragens:time testing things since it's a $100 test drive: I'm willing to
    Mail/john_spragens:     Finally, I'd like to request $200 rather than $100 for
    Mail/mac:again: expected pricing will be $10,000 - $16,000 and the BriteLite LX with
    Mail/mark:I'm promised $1000 / month for a first
    Mail/netnews.postings: Win Lose or Die, John Gardner (hardback) $10
    Mail/netnews.postings:I'd be willing to pay, I dunno, $100 / year for the space? I would
    Mail/sent:to panic that they'd want their $10K advance back, but the good news is
    Mail/sent:That would be fine. How about $100 USD for both, to include any
    Mail/sent:      Amount: $100.00
    

    That's quite a few matches. Notice that among the matches are $1000, $10K, and $10. To match the specific value $100, of course, I can use $100 as the search pattern.

    This pattern demonstrates the sophistication of UNIX with regular expressions. For example, the $ character is a special character that can be used to indicate the end of a line, but only if it is placed at the end of the pattern. Because I did not place it at the end of the pattern, the grep program reads it as the $ character itself.

You can use the shell to expand files not just in the current directory, but one level deeper into subdirectories, too: * matches the files in the current directory, and */* extends the search to all files contained one directory below the current point. If you have lots of files, you might instead see the error arg list too long; that's where the find command proves handy.
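
If the subdirectories go deeper than one level, or the argument list grows too long, a sketch along these lines hands the file walking to find instead of the shell. It assumes sh or bash, and the Mail tree and its contents are invented for the example. The -type f and -exec ... {} + options are standard find options, and the backslash in front of the dollar sign keeps it literal wherever it sits in the pattern.

```shell
# Assumption: sh or bash; the directory tree below is made up for illustration.
LC_ALL=C; export LC_ALL
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir -p Mail/old/1993
printf '%s\n' 'our fees are currently $100 per test drive' > Mail/sent
printf '%s\n' 'Amount: $100.00' > Mail/old/1993/invoice
printf '%s\n' 'nothing about money here' > Mail/old/1993/notes

# find walks every level; grep -l prints only the names of the matching files.
found=$(find . -type f -exec grep -l '\$100' {} + | sort)
echo "$found"

cd / && rm -rf "$demo"
```

The {} + form passes the files to grep in batches, which is what sidesteps the argument-length limit.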


  8. Here's one more example. In the old days, when people were tied to typewriters, an accepted convention for writing required that you put two spaces after the period at the end of a sentence even though only one space followed the period of an abbreviation such as J. D. Salinger. Nowadays, with more text being produced through word processing and desktop publishing, the two-space convention is less accepted, and indeed, when submitting work for publication, I often have to be sure that I don't have two spaces after punctuation lest I get yelled at! The grep command can help ferret out these inappropriate punctuation sequences, fortunately; but the pattern needed is tricky.

    To start, I want to see whether, anywhere in the file dickens.note, I have used a period followed by a single space:

    % grep '. ' dickens.note
                                    A Tale of Two Cities
                                          Preface
    When I was acting, with my children and friends, in Mr Wilkie
    Collins's drama of The Frozen Deep, I first conceived the main idea of this
    story. A strong desire came upon me then, to
    embody it in my own person;
    and I traced out in my fancy, the state of mind of which it would
    necessitate the presentation
    to an observant spectator, with particular
    care and interest.
    As the idea became familiar to me, it gradually shaped itself into
    its present form. Throughout its execution, it has had complete
    possession of me; I have so far verified what
    is done and suffered in these pages,
    as that I have certainly done and suffered it all myself.
    Whenever any reference (however slight) is made here to the condition
    of the Danish people before or during the Revolution, it is truly
    made, on the faith of the most trustworthy
    witnesses. It has been one of my hopes to add
    something to the popular and picturesque means of
    understanding that terrible time, though no one can hope
    to add anything to the philosophy of Mr Carlyle's wonderful book.
    Tavistock House
    November 1859
    

    What's happening here? The first line doesn't have a period in it, so why does grep say it matches the pattern? In grep, the period is a special character that matches any single character, not specifically the period itself. Therefore, my pattern matches any line that contains a space preceded by any character.

    To avoid this interpretation, I must preface the special character with a backslash (\) if I want it to be read as the . character itself:

    % grep '\. ' dickens.note
    story. A strong desire came upon me then, to
    its present form. Throughout its execution, it has had complete
    witnesses. It has been one of my hopes to add
    

    Ahhh, that's better. Notice that all three of these lines have two spaces after each period.

With the relatively small number of notations available in regular expressions, you can create quite a variety of sophisticated patterns to find information in a file.
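
The patterns from this task can be replayed against small sample files without touching the real /etc/passwd. This is only a sketch under a few assumptions: a Bourne-style shell (sh or bash), and two hypothetical files, passwd.sample and note.sample, filled with a handful of lines modeled on the listings above.

```shell
# Assumption: sh or bash; passwd.sample and note.sample are hypothetical test files.
LC_ALL=C; export LC_ALL
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
printf '%s\n' \
  'taylorj:?:1048:1375:James Taylor:/users/taylorj:/bin/csh' \
  'mtaylor:?:760:1375:Mary Taylor:/users/mtaylor:/usr/local/bin/tcsh' \
  'taylor:?:1989:1412:Dave Taylor:/users/taylor:/bin/csh' \
  'charon:*?:21:807:The Ferryman:/users/tomb:' \
  'wizard:?:1560:1375:Oz:/users/wizard:/bin/ksh' > passwd.sample
printf '%s\n' 'A Tale of Two Cities' 'story. A strong desire' > note.sample

rooted=$(grep -c '^taylor' passwd.sample)      # left-rooted: taylorj and taylor
exact=$(grep '^taylor[^a-z]' passwd.sample)    # next character is not a lowercase letter
no_shell=$(grep ':$' passwd.sample)            # the line ends with a colon
any_z=$(grep -c 'z*' passwd.sample)            # zero or more z's: every line matches
some_z=$(grep -c 'zz*' passwd.sample)          # one or more z's
loose=$(grep -c '. ' note.sample)              # any character, then a space
strict=$(grep -c '\. ' note.sample)            # a real period, then a space

echo "$rooted $any_z $some_z $loose $strict"   # 2 5 1 2 1
echo "$exact"       # the taylor account only
echo "$no_shell"    # the charon account only

cd / && rm -rf "$demo"
```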


Task 9.4: Searching Files Using grep

Two commonly used commands are the key to your becoming a power user and becoming comfortable with the capabilities of the system. The ls command is one example, and the grep command is another. The oddly named grep command makes it easy to find lost files or to find files that contain specific text.


The grep command not only has a ton of command options, but has two variations in UNIX systems, too. These variations are egrep, for specifying more complex patterns (regular expressions), and fgrep, for using file-based lists of words as search patterns.

After laborious research and countless hours debating with UNIX developers, I am reasonably certain that the derivation of the name grep is as follows: Before this command existed, UNIX users would use a crude line-based editor called ed to find matching text. As you know, search patterns in UNIX are called regular expressions. To search throughout a file, the user prefixed the command with global. Once a match was made, the user wanted to have it listed to the screen with print. To put it all together, the operation was global/regular expression/print. That phrase was pretty long, however, so users shortened it to g/re/p. Thereafter, when a command was written, grep seemed to be a natural, if odd and confusing, name.


You could spend the next 100 pages learning all the obscure and weird options to the grep family of commands. When you boil it down, however, you're probably going to use only the simplest patterns and maybe a useful flag or two. Think of it this way: Just because there are more than 500,000 words in the English language (according to the Oxford English Dictionary) doesn't mean that you must learn them all to communicate effectively.

With this in mind, you'll learn the basics of grep this hour, but you'll pick up more insight into the program's capabilities and options during the next few hours. A few of the most important grep command flags are listed in Table 9.2.

Table 9.2. The Most Helpful GREP Flags
Flag   Function
-c     List a count of matching lines only.
-i     Ignore the case of the letters in the pattern.
-l     List filenames of files that contain the specified pattern.
-n     Include line numbers.

  1. Begin by making sure you have a test file to work with. The example shows the testme file from the previous uniq examples:

    % cat testme
    Archives/               OWL/                    keylime.pie
    InfoWorld/              bin/                    src/
    Mail/                   bitnet.mailing-lists.Z  temp/
    News/                   drop.text.hqx           testme
    
  2. The general form of grep is to specify the command, any flags you want to add, the pattern, and a filename:

    % grep bitnet testme
    Mail/                   bitnet.mailing-lists.Z  temp/
    

    As you can see, grep easily pulled out the line in the testme file that contained the pattern bitnet.

  3. Be aware that grep finds patterns in a case-sensitive manner:

    % grep owl testme
    %
    

    Note that OWL was not found because the pattern specified with the grep command was all lowercase, owl.

    But that's where the -i flag can be helpful, which causes grep to ignore case:

    % grep -i owl testme
    Archives/               OWL/                    keylime.pie
    
  4. For the next few examples, I'll move into the /etc directory because some files there have lots of lines. The wc command shows that the file /etc/passwd has almost 4,000 lines:

    % cd /etc
    % wc -l /etc/passwd
       3877 /etc/passwd
    

    My account is taylor. I'll use grep to see my account entry in the password file:

    % grep taylor /etc/passwd
    taylorj:?:1048:1375:James Taylor:/users/taylorj:/bin/csh
    mtaylor:?:760:1375:Mary Taylor:/users/mtaylor:/usr/local/bin/tcsh
    dataylor:?:375:518:Dave Taylor:/users/dataylor:/usr/local/lib/msh
    taylorjr:?:203:1022:James Taylor:/users/taylorjr:/bin/csh
    taylorrj:?:668:1042:Robert Taylor:/users/taylorrj:/bin/csh
    taylorm:?:869:1508:Melanie Taylor:/users/taylorm:/bin/csh
    taylor:?:1989:1412:Dave Taylor:/users/taylor:/bin/csh
    

    Try this on your system too.

  5. As you can see, many accounts contain the pattern taylor.

    A smarter way to see how often the taylor pattern appears is to use the -c flag to grep, which reports how many lines contain a case-sensitive match instead of displaying the lines themselves:

    % grep -c taylor /etc/passwd
    7
    

    The command located seven matches. Count the listing in instruction 4 to confirm this.

  6. With 3,877 lines in the password file, it could be interesting to see whether all the Taylors started their accounts at about the same time. (This presumably would mean they all appear in the file at about the same point.) To do this, I'll use the -n flag to number the output lines:

    % grep -n taylor /etc/passwd
    319:taylorj:?:1048:1375:James Taylor:/users/taylorj:/bin/csh
    1314:mtaylor:?:760:1375:Mary Taylor:/users/mtaylor:/usr/local/bin/tcsh
    1419:dataylor:?:375:518:Dave Taylor:/users/dataylor:/usr/local/lib/msh
    1547:taylorjr:?:203:1022:James Taylor:/users/taylorjr:/bin/csh
    1988:taylorrj:?:668:1042:Robert Taylor:/users/taylorrj:/bin/csh
    2133:taylorm:?:869:1508:Melanie Taylor:/users/taylorm:/bin/csh
    3405:taylor:?:1989:1412:Dave Taylor:/users/taylor:/bin/csh
    

    This is a great example of a default separator adding incredible confusion to the output of a command. Normally, a line number followed by a colon would be no problem, but in the passwd file (which is already littered with colons), it's confusing. Compare this output with the output obtained in instruction 4 with the grep command alone to see what has changed.

    You can see that my theory about when the Taylors started their accounts was wrong. If proximity in the passwd file is an indicator that accounts are assigned at similar times, then no Taylors started their accounts even within the same week.

These examples of how to use grep barely scratch the surface of how this powerful and sophisticated command can be used. Explore your own file system using grep to search files for specific patterns.


Armed with wildcards, you now can try the -l flag to grep, which, as you recall, indicates the names of the files that contain a specified pattern, rather than printing the lines that match the pattern. If I go into my electronic mail archive directory—Mail—I can easily, using the command grep -l -i chicago Mail/*, search for all files that contain Chicago. Try using grep -l to search across all files in your home directory for words or patterns.
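
The four flags from Table 9.2 can be seen side by side in a short sketch. It assumes sh or bash, a cut-down copy of the testme file, and a hypothetical Mail directory holding two invented messages.

```shell
# Assumption: sh or bash; testme and the Mail files are recreated here as small samples.
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
mkdir Mail
printf '%s\n' 'Archives/  OWL/  keylime.pie' 'Mail/  bitnet.mailing-lists.Z  temp/' > testme
printf '%s\n' 'Lunch in Chicago next week?' > Mail/james
printf '%s\n' 'No travel planned.' > Mail/mark

# grep exits non-zero when nothing matches, hence the "|| true" on the first search.
exact_case=$(grep -c owl testme || true)   # -c counts matching lines; lowercase owl finds none
any_case=$(grep -c -i owl testme)          # -i ignores case, so OWL/ now counts
numbered=$(grep -n bitnet testme)          # -n prefixes the line number and a colon
names=$(grep -l -i chicago Mail/*)         # -l lists only the names of matching files

echo "$exact_case"   # 0
echo "$any_case"     # 1
echo "$numbered"     # 2:Mail/  bitnet.mailing-lists.Z  temp/
echo "$names"        # Mail/james

cd / && rm -rf "$demo"
```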


Task 9.5: For Complex Expressions, Try egrep

Sometimes a single regular expression can't locate what you seek. For example, perhaps you're looking for lines that have either one pattern or a second pattern. That's where the egrep command proves helpful. The command gets its name from “extended grep,” and it has a notational scheme more powerful than that of grep, as shown in Table 9.3.


Table 9.3. Regular Expression Notation for EGREP
Notation   Meaning
c          Matches the character c
\c         Forces c to be read as the letter c, not as another meaning the character might have
^          Beginning of the line
$          End of the line
.          Any single character
[xy]       Any single character in the set specified
[^xy]      Any single character not in the set specified
c*         Zero or more occurrences of character c
c+         One or more occurrences of character c
c?         Zero or one occurrences of character c
a|b        Either a or b
(a)        Groups the regular expression a

  1. Now I'll search the password file to demonstrate egrep. A pattern that seemed a bit weird was the one used with grep to search for lines containing one or more occurrences of the letter z: 'zz*'. With egrep, this search is much easier:

    % egrep 'z+' /etc/passwd | head
    marg:?:724:1233:Guyzee:/users/marg:/bin/ksh
    axy:?:1272:1233:martinez:/users/axy:/bin/csh
    wizard:?:1560:1375:Oz:/users/wizard:/bin/ksh
    zhq:?:2377:1318:Zihong:/users/zhq:/bin/csh
    mm:?:7152:1233:Michael Kenzie:/users/mm:/bin/ksh
    tanzm:?:7368:1140:Zhen Tan:/users/tanzm:/bin/csh
    mendozad:?:8176:1233:Don Mendoza:/users/mendozad:/bin/csh
    pavz:?:8481:1175:Mary L. Pavzky:/users/pavz:/bin/csh
    hurlz:?:9189:1375:Tom Hurley:/users/hurlz:/bin/csh
    tulip:?:9222:1375:Liz Richards:/users/tulip:/bin/csh
    Broken pipe
    
  2. To search for lines that have either a z or a q, I can use the following:

    % egrep '(z|q)' /etc/passwd | head
    aaq:?:528:1233:Don Kid:/users/aaq:/bin/csh
    abq:?:560:1233:K Laws:/users/abq:/bin/csh
    marg:?:724:1233:Guyzee:/users/marg:/bin/ksh
    ahq:?:752:1233:Andy Smith:/users/ahq:/bin/csh
    cq:?:843:1233:Rob Till:/users/cq:/usr/local/bin/tcsh
    axy:?:1272:1233:martinez:/users/axy:/bin/csh
    helenq:?:1489:1297:Helen Schoy:/users/helenq:/bin/csh
    wizard:?:1560:1375:Oz:/users/wizard:/bin/ksh
    qsc:?:1609:1375:Enid Grim:/users/qsc:/usr/local/bin/tcsh
    zhq:?:2377:1318:Zong Qi:/users/zhq:/bin/csh
    Broken pipe
    
  3. Now I can try a more complicated egrep pattern, and it should make sense to you:

    % egrep '(^[a-zA-Z]|:wi)' /etc/printcap | head
    aglw:
            :wi=AG 23:wk=multiple Apple LaserWriter IINT:
    aglw1:
            :wi=AG 23:wk=Apple LaserWriter IINT:
    aglw2:
            :wi=AG 23:wk=Apple LaserWriter IINT:
    aglw3:
            :wi=AG 23:wk=Apple LaserWriter IINT:
    aglw4:
            :wi=AG 23:wk=Apple LaserWriter IINT:
    Broken pipe
    

    Now you can see that the pattern specified looks either for lines that begin (^) with an upper- or lowercase letter ([a-zA-Z]) or for lines that contain the pattern :wi.
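
To see which lines the alternation rejects as well as which it keeps, here is a small sketch in Bourne shell syntax. The printcap fragment and the word list are invented for illustration, and grep -E is used as the POSIX equivalent of egrep.

```shell
# A made-up printcap fragment; the entries are illustrative only.
printcap='# campus printers
aglw:
        :lp=/dev/null:
        :wi=AG 23:wk=Apple LaserWriter IINT:
lobby:
        :wi=LIB 2:wk=HP LaserJet:'

# Keep a line if it starts with a letter OR contains ":wi" anywhere.
# The comment line and the :lp= line satisfy neither branch.
names_and_locations=$(printf '%s\n' "$printcap" | grep -E '(^[a-zA-Z]|:wi)')
echo "$names_and_locations"

# When every alternative is a single character, a set does the same job:
# (z|q) and [zq] select the same lines.
words='quartz
maple
zebra'
with_alternation=$(printf '%s\n' "$words" | grep -E '(z|q)')
with_set=$(printf '%s\n' "$words" | grep -E '[zq]')
```

The alternation is needed only when at least one branch is longer than a single character, as with ^[a-zA-Z] and :wi here.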

Any time you want to look for lines that match any one of several patterns, egrep is the best command to use.


Task 9.6: Searching for Multiple Patterns at Once with fgrep

Sometimes it's helpful to look for many patterns at once. For example, you might want to have a file of patterns and invoke a UNIX command that searches for lines that contain any of the patterns in that file. That's where the fgrep, or fixed-string grep, command comes into play. Each line of the pattern file is matched literally, as a plain string (which means, unfortunately, that you can use neither the regular expression notation of grep nor the additional notation available in egrep), and the file is specified with the -f file option.
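
One point worth checking for yourself is that fgrep takes each line of the pattern file literally. The sketch below, in Bourne shell syntax, runs the same pattern file through grep -F (the POSIX spelling of fgrep) and through plain grep. The pattern file and the sample text are hypothetical.

```shell
demo=$(mktemp -d)
# Hypothetical pattern file: one string per line.
printf 'a.c\nx*\n' > "$demo/patterns"

text='a.c is listed
abc is not
literal x* here
xxx alone'

# grep -F (fgrep): each pattern line is matched character for character.
fixed=$(printf '%s\n' "$text" | grep -F -f "$demo/patterns")

# Plain grep reads the same file as regular expressions: "." matches any
# character, and "x*" can match zero characters, so every line qualifies.
regex=$(printf '%s\n' "$text" | grep -f "$demo/patterns")

echo "$fixed"
rm -rf "$demo"
```

With -F only the two lines that really contain a.c or x* are printed, whereas the regular expression reading matches all four lines.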


  1. I use fgrep with wrongwords, an alias and file that contains a list of words I commonly misuse. Here's how it works:

    % alias wrongwords fgrep -i -f .wrongwords
    % cat .wrongwords
    effect
    affect
    insure
    ensure
    idea
    thought
    

    Any time I want to check a file, for example dickens.note, to see whether it has any of these commonly misused words, I simply enter the following:

    % wrongwords dickens.note
    drama of The Frozen Deep, I first conceived the main idea of this
    As the idea became familiar to me, it gradually shaped itself into its
    

    I need to determine whether these are ideas or thoughts. It's a subtle distinction I often forget in my writing.

  2. Here's another sample file that contains a couple of words from wrongwords:

    % cat sample3
    At the time I was hoping to insure that the cold weather
    would avoid our home, so I, perhaps foolishly, stapled the
    weatherstripping along the inside of the sliding glass
    door in the back room. I was surprised how much affect it
    had on our enjoyment of the room, actually.
    

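    Can you see the two incorrectly used words in that passage? (It should read ensure rather than insure, and effect rather than affect.) The spell program can't, because both words are spelled correctly: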

    % spell sample3
    

    The wrongwords alias, on the other hand, can detect these words:

    % wrongwords sample3
    At the time I was hoping to insure that the cold weather
    door in the back room. I was surprised how much affect it
    
  3. This would be a bit more useful if it could show just the individual words matched, rather than the entire lines. That way I wouldn't have to figure out which words were incorrect. To do this, I can use the awk command, a powerful pattern-processing language that understands regular expressions. The command uses a for loop: it starts from the initial state (i=1), runs the print statement, and keeps adding one to the counter (i++) for as long as the loop condition holds (i<=NF), stopping once i exceeds NF: '{ for (i=1;i<=NF;i++) print $i} '. NF is the number of fields (here, words) in the current line, so each line seen by awk is printed one word at a time.

    Here is a short example:

    % echo 'this is a sample sentence' | awk '{ for (i=1;i<=NF;i++) print $i} '
    this
    is
    a
    sample
    sentence
    
  4. I could revise my alias, but trying to get the quotation marks correct is a nightmare. It would be much easier to make this a simple shell script instead:

    % cat bin/wrongwords
    #!/bin/sh
    # wrongwords - show a list of commonly misused words in the file
    
    cat "$@" |
      awk '{ for (i=1;i<=NF;i++) print $i} ' |
      fgrep -i -f $HOME/.wrongwords
    

    To make this work correctly, I need to remove the existing alias for wrongwords by using the C shell unalias command, add execute permission to the shell script, and then use rehash to ensure that the C shell can find the command when requested:

    % unalias wrongwords
    % chmod +x bin/wrongwords
    % rehash
    

    Now it's ready to use:

    % wrongwords sample3
    insure
    affect
    
  5. With the -v flag, the fgrep command can also exclude words that appear in a list. If you have been using the spell command, you've quickly discovered that the program doesn't know anything about acronyms or some other correctly spelled words you might use in your writing. That's where fgrep can be a helpful companion. Build a list of words you commonly use that aren't misspelled but that spell reports as being misspelled:

    % alias myspell    'spell \!* | fgrep -v -i -f $HOME/.dictionary'
    % cat $HOME/.dictionary
    BBS
    FAX
    Taylor
    Utech
    Zygote
    

    Now spell can be more helpful:

    % spell newsample
    FAX
    illetterate
    Letteracy
    letteracy
    letterate
    Papert
    pre
    rithmetic
    Rs
    Taylor
    Utech
    Zygote
    % myspell newsample
    illetterate
    Letteracy
    letteracy
    letterate
    Papert
    pre
    rithmetic
    Rs
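
One caveat about the exclusion list in step 5: fgrep -v drops any line that merely contains a listed word. The sketch below, in Bourne shell syntax, compares that behavior with the -x flag, which requires the whole line to equal a listed word. The temporary dictionary file stands in for $HOME/.dictionary, the reported words stand in for the output of spell, and Taylorism is a made-up entry added to show the difference.

```shell
demo=$(mktemp -d)
# Stand-in for $HOME/.dictionary, using the word list from step 5.
printf 'BBS\nFAX\nTaylor\nUtech\nZygote\n' > "$demo/dictionary"

# Stand-in for what spell might report; "Taylorism" is a made-up extra entry.
reported='FAX
illetterate
Taylor
Taylorism
Utech'

# -v inverts the match: keep only lines that contain none of the listed words.
loose=$(printf '%s\n' "$reported" | grep -F -v -i -f "$demo/dictionary")

# Adding -x requires the whole line to equal a listed word,
# so "Taylorism" is no longer hidden by the entry "Taylor".
exact=$(printf '%s\n' "$reported" | grep -F -v -i -x -f "$demo/dictionary")

printf 'without -x:\n%s\nwith -x:\n%s\n' "$loose" "$exact"
rm -rf "$demo"
```

Whether -x is the better choice depends on how you want longer words to be handled, so treat it as an option to consider rather than a required change to the alias.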
    

You have now met the entire family of grep commands. For most of your searches for information, you can use the grep command itself. Sometimes, though, it's nice to have options, particularly if you decide to customize some of your commands as shown in the scripts and aliases explored in this hour.
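
As one more customization to experiment with, the word-by-word check from Task 9.6 can be sketched as a Bourne shell function that takes the word list as its first argument. The function name wrongwords_in is hypothetical, and the -x flag is a swapped-in choice that reports only whole-word matches. It would not flag a longer word such as effective, and it also misses a listed word with punctuation attached.

```shell
# wrongwords_in: print each word of the named files that appears in a word list.
# Usage: wrongwords_in LISTFILE FILE...
# (The function name and argument order are assumptions for this sketch.)
wrongwords_in() {
  list=$1
  shift
  # awk prints one word per line; grep -F -i -x then keeps only the lines
  # that exactly equal an entry in the list, ignoring case.
  cat "$@" |
    awk '{ for (i = 1; i <= NF; i++) print $i }' |
    grep -F -i -x -f "$list"
}
```

With the files from this hour, it could be invoked as wrongwords_in $HOME/.wrongwords sample3.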

