Chapter 3. Regular Expressions Primer

Regular expressions (regex) are a powerful method for describing a text pattern to be matched by various tools. There is only one place in bash where regular expressions are valid, using the =~ comparison in the [[ compound command, as in an if statement. However, regular expressions are a crucial part of the larger toolkit for commands like grep, awk, and sed in particular. They are powerful and thus worth knowing. Once you’ve mastered regular expressions, you’ll wonder how you ever got along without them.
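As a minimal sketch of the =~ comparison mentioned above, the following tests a string against a pattern inside [[ and reads the matched text from bash's BASH_REMATCH array. The variable names (line, re, leading_number) and the sample text are illustrative only.

```shell
#!/usr/bin/env bash
# Illustrative only: test a string against a regex inside [[ ... ]].
line='5    To where it bent in the undergrowth;'
re='^[0-9]+'   # keeping the pattern in a variable avoids quoting surprises

if [[ $line =~ $re ]]; then
    leading_number=${BASH_REMATCH[0]}   # text matched by the whole pattern
    echo "line starts with the number $leading_number"
else
    leading_number=''
    echo "no leading number"
fi
```

Storing the pattern in a variable and leaving it unquoted on the right side of =~ is a common habit, because quoting the pattern there makes bash treat it as a literal string.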

For many of the examples in this chapter, we will be using the file frost.txt with its seven—yes seven—lines of text; see Example 3-1.

Example 3-1. frost.txt
1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost

The content of frost.txt will be used to demonstrate the power of regular expressions to process text data. This text was chosen because it requires no prior technical knowledge to understand.

Commands in Use

We introduce the grep family of commands to demonstrate the basic regex patterns.

grep

The grep command searches the content of the files for a given pattern and prints any line where the pattern is matched. To use grep, you need to provide it with a pattern and one or more filenames (or piped data).

Common command options

-c

Count the number of lines that match the pattern.

-E

Enable extended regular expressions.

-f

Read the search pattern from a provided file. A file can contain more than one pattern, with each line containing a single pattern.

-i

Ignore character case.

-l

Print only the filename and path where the pattern was found.

-n

Print the line number of the file where the pattern was found.

-P

Enable the Perl regular expression engine.

-R, -r

Recursively search subdirectories.

Command example

In general, grep is used like this: grep options pattern filenames

To search the /home directory and all subdirectories for files containing the word password, regardless of uppercase/lowercase distinctions:

grep -R -i 'password' /home
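To try several of these options on a small scale, the following sketch builds a scratch copy of frost.txt from Example 3-1 and runs a few searches against it. The directory and variable names are hypothetical, and a grep that supports -R (such as GNU grep) is assumed.

```shell
# Build a scratch copy of frost.txt (Example 3-1) so the sketch is self-contained.
workdir=$(mktemp -d)
cat > "$workdir/frost.txt" <<'EOF'
1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost
EOF

# -c: count matching lines. Lowercase "the" appears only on line 5.
count_the=$(grep -c 'the' "$workdir/frost.txt")

# -i: ignore case, so "The" on line 7 now matches as well.
count_the_nocase=$(grep -c -i 'the' "$workdir/frost.txt")

# -n: prefix each matching line with its line number in the file.
numbered=$(grep -n 'Frost' "$workdir/frost.txt")

# -R with -l: list only the names of files under workdir that contain a match.
matching_files=$(grep -R -l 'yellow' "$workdir")

echo "$count_the $count_the_nocase"
echo "$numbered"
echo "$matching_files"
rm -r "$workdir"
```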

grep and egrep

The grep command supports some variations, notably extended syntax for the regex patterns (we discuss the regex patterns next). There are three ways to tell grep that you want special meaning on certain characters: 1) by preceding those characters with a backslash; 2) by telling grep that you want the special syntax (without the need for a backslash) by using the -E option when you invoke grep; or 3) by using the command named egrep, which is a script that simply invokes grep as grep -E so you don’t have to.

The only characters affected by the extended syntax are ? + { | ( and ). In the examples that follow, we use grep and egrep interchangeably; they are the same binary underneath. We choose the one that seems most appropriate based on which special characters we need. The special characters, or metacharacters, are what make grep so powerful. Here is what you need to know about the most powerful and frequently used metacharacters.

Regular Expression Metacharacters

Regular expressions are patterns that are created using a series of characters and metacharacters. Metacharacters such as the question mark (?) and asterisk (*) have special meaning beyond their literal meanings in regex.

The “.” Metacharacter

In regex, the period (.) represents a single wildcard character. It will match on any single character except for a newline. As you can see in the following example, if we try to match on the pattern T.o, the first line of the frost.txt file is returned because it contains the word Two:

$ grep 'T.o' frost.txt

1    Two roads diverged in a yellow wood,

Note that line 5 is not returned even though it contains the word To. This pattern allows any character to appear between the T and o, but as written, there must be a character in between. Regex patterns are also case sensitive, which is why line 3 of the file is not returned even though it contains the string too. If you want to treat this metacharacter as a period character rather than a wildcard, precede it with a backslash (\.) to escape its special meaning.
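The difference between the wildcard and an escaped period can be seen with a small sketch. The two sample lines and the variable names are made up for illustration.

```shell
# Hypothetical sample data: one real decimal number and one look-alike.
sample=$(printf '%s\n' 'pi is 3.14' 'part 3x14')

# Unescaped, the dot matches any character, so both lines match.
wildcard_hits=$(printf '%s\n' "$sample" | grep -c '3.14')

# Escaped, the dot matches only a literal period.
literal_hits=$(printf '%s\n' "$sample" | grep -c '3\.14')

echo "wildcard: $wildcard_hits, literal: $literal_hits"
```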

The “?” Metacharacter

In regex, the question mark (?) character makes any item that precedes it optional; it matches it zero or one time. By adding this metacharacter to the previous example, you can see that the output is different:

$ egrep 'T.?o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;

This time, both lines 1 and 5 are returned. This is because the ? that follows the . metacharacter makes it optional. This pattern will match on any three-character sequence that begins with T and ends with o as well as the two-character sequence To.

Notice that we are using egrep here. We could have used grep -E or we could have used “plain” grep with a slightly different pattern: T.\?o, putting the backslash on the question mark to give it the extended meaning.
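A quick way to convince yourself that the two spellings are equivalent is to run both and compare the results. This sketch assumes GNU grep, where a backslashed question mark in basic mode takes on the extended meaning; the sample lines are three lines from frost.txt, and the variable names are illustrative.

```shell
# Three lines taken from frost.txt (Example 3-1).
lines=$(printf '%s\n' \
    '1    Two roads diverged in a yellow wood,' \
    '5    To where it bent in the undergrowth;' \
    '7 Excerpt from The Road Not Taken by Robert Frost')

# Extended syntax: ? is special without a backslash.
extended=$(printf '%s\n' "$lines" | grep -E 'T.?o')

# Basic syntax: the backslash gives ? its special meaning (GNU grep).
basic=$(printf '%s\n' "$lines" | grep 'T.\?o')

if [ "$extended" = "$basic" ]; then
    echo "both forms return the same lines"
fi
echo "$extended"
```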

The “*” Metacharacter

In regex, the asterisk (*) is a special character that matches the preceding item zero or more times. It is similar to ?, the main difference being that the previous item may appear more than once. Here is an example:

$ grep 'T.*o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The .* in the preceding pattern allows any number of any character to appear between the T and o. Thus, the last line also matches because it contains the pattern The Ro.
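To see exactly how much text .* consumes, one option is the -o option (available in GNU and BSD grep, and not covered in the option list above), which prints only the matched portion of each line. In this sketch the sample lines come from frost.txt and the variable names are illustrative. The match runs to the last o on each line rather than the first.

```shell
# Three lines taken from frost.txt (Example 3-1).
lines=$(printf '%s\n' \
    '1    Two roads diverged in a yellow wood,' \
    '5    To where it bent in the undergrowth;' \
    '7 Excerpt from The Road Not Taken by Robert Frost')

# -o prints only the text that matched instead of the whole line.
matched=$(printf '%s\n' "$lines" | grep -o 'T.*o')
echo "$matched"
# Two roads diverged in a yellow woo
# To where it bent in the undergro
# The Road Not Taken by Robert Fro
```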

The “+” Metacharacter

The plus sign (+) metacharacter is the same as the * except it requires the preceding item to appear at least once. In other words, it matches the preceding item one or more times:

$ egrep 'T.+o' frost.txt

1    Two roads diverged in a yellow wood,
5    To where it bent in the undergrowth;
7 Excerpt from The Road Not Taken by Robert Frost

The preceding pattern specifies one or more of any character to appear in between the T and o. The first line of output matches because of Two: the w is one character between the T and the o. The second line of output (line 5 of the file) doesn’t match on the To, as it did in the previous example; rather, the pattern matches a much larger string, all the way to the o in undergrowth. The last line also matches because it contains the pattern The Ro.
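The three repetition metacharacters can be compared side by side on a tiny word list. The words and variable names below are made up for illustration; each word sits on its own line so that a longer match elsewhere on the line cannot hide the difference.

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' 'To' 'Two' 'Taco')

# ? allows zero or one character between T and o: To, Two.
count_optional=$(printf '%s\n' "$words" | grep -E -c 'T.?o')

# * allows zero or more characters: To, Two, Taco.
count_star=$(printf '%s\n' "$words" | grep -c 'T.*o')

# + requires at least one character: Two, Taco (but not To).
count_plus=$(printf '%s\n' "$words" | grep -E -c 'T.+o')

echo "?: $count_optional  *: $count_star  +: $count_plus"
```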

Grouping

We can use parentheses to group characters. Among other things, this allows us to treat the characters appearing inside the parentheses as a single item that we can later reference. Here is an example of grouping:

$ egrep 'And be one (stranger|traveler), long I stood' frost.txt

3    And be one traveler, long I stood

In the preceding example, we use parentheses and the Boolean OR operator (|) to create a pattern that will match on line 3. Line 3 as written has the word traveler in it, but this pattern would match even if traveler was replaced by the word stranger.
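The parentheses matter here because | otherwise applies to everything on either side of it. The following sketch contrasts the grouped pattern with an ungrouped one; the candidate lines and variable names are invented for the comparison.

```shell
# Hypothetical candidate lines: the original, a variant, and a near miss.
candidates=$(printf '%s\n' \
    'And be one traveler, long I stood' \
    'And be one stranger, long I stood' \
    'The traveler, long I stood')

# Grouped: only the word in the middle of the line may vary.
grouped_hits=$(printf '%s\n' "$candidates" |
    grep -E -c 'And be one (stranger|traveler), long I stood')

# Ungrouped: | splits the whole pattern into "And be one stranger"
# OR "traveler, long I stood", so the near miss matches too.
ungrouped_hits=$(printf '%s\n' "$candidates" |
    grep -E -c 'And be one stranger|traveler, long I stood')

echo "grouped: $grouped_hits, ungrouped: $ungrouped_hits"
```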

Brackets and Character Classes

In regex, the square brackets, [ ], are used to define character classes and lists of acceptable characters. Using this construct, you can list exactly which characters are matched at this position in the pattern. This is particularly useful when trying to perform user-input validation. As shorthand, you can specify ranges with a dash, such as [a-j]. These ranges are in your locale’s collating sequence and alphabet. For the C locale, the pattern [a-j] will match one of the letters a through j. Table 3-1 provides a list of common examples when using character classes and ranges.

Table 3-1. Regex character ranges
Example        Meaning
[abc]          Match only the character a or b or c
[1-5]          Match on digits in the range 1 to 5
[a-zA-Z]       Match any lowercase or uppercase a to z
[0-9+*/-]      Match on digits or the four mathematical symbols + * / and - (the hyphen is placed last so that it is taken literally rather than as a range)
[0-9a-fA-F]    Match a hexadecimal digit

Warning

Be careful when defining a range for digits; the range can at most go from 0 to 9. For example, the pattern [1-475] does not match on numbers between 1 and 475; it matches on any one of the digits (characters) in the range 1–4 or the character 7 or the character 5.
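The warning is easy to verify. In the sketch below, the -x option (which requires the pattern to match the entire line) is used so that each input line is tested as a single value; the inputs and variable names are illustrative. The second search shows that a literal hyphen is safest placed last inside the brackets.

```shell
# Hypothetical inputs, one value per line.
inputs=$(printf '%s\n' 3 6 7 475)

# [1-475] means one character: 1, 2, 3, 4, 7, or 5. So 3 and 7 match,
# 6 does not, and 475 fails because it is three characters long.
single_char_hits=$(printf '%s\n' "$inputs" | grep -x '[1-475]')
echo "$single_char_hits"

# With the hyphen last, the class matches the four arithmetic symbols
# (and digits) but not the percent sign.
operator_hits=$(printf '%s\n' '+' '-' '*' '/' '%' | grep -c -x '[0-9+*/-]')
echo "$operator_hits"
```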

There are also predefined character classes known as shortcuts. These can be used to indicate common character classes such as numbers or letters. See Table 3-2 for a list of shortcuts.

Table 3-2. Regex shortcuts
Shortcut    Meaning
\s          Whitespace
\S          Not whitespace
\d          Digit
\D          Not digit
\w          Word character (letter, digit, or underscore)
\W          Not a word character
\xhh        Character with hexadecimal code hh (e.g., \x5F)

Note that egrep does not support all of these shortcuts; \d in particular is not recognized. To use them, you must use grep with the -P option, which enables the Perl regular expression engine. For example, you can use the following to find any line in frost.txt that contains a digit:

$ grep -P '\d' frost.txt

1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost
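Every line is returned because every line of frost.txt begins with a digit. As a hedged comparison, the sketch below counts the matching lines with the Perl shortcut and with the bracket range [0-9], which needs no -P. It assumes a grep built with Perl regex support (GNU grep usually is); the file is a scratch copy of Example 3-1 and the variable names are hypothetical.

```shell
# Build a scratch copy of frost.txt (Example 3-1).
workdir=$(mktemp -d)
cat > "$workdir/frost.txt" <<'EOF'
1    Two roads diverged in a yellow wood,
2    And sorry I could not travel both
3    And be one traveler, long I stood
4    And looked down one as far as I could
5    To where it bent in the undergrowth;
6
7 Excerpt from The Road Not Taken by Robert Frost
EOF

# Perl shortcut; stays empty if this grep was built without -P support.
pcre_count=$(grep -c -P '\d' "$workdir/frost.txt" 2>/dev/null)

# Bracket range from Table 3-1; works in every grep mode.
range_count=$(grep -c '[0-9]' "$workdir/frost.txt")

echo "with -P: ${pcre_count:-unavailable}, with [0-9]: $range_count"
rm -r "$workdir"
```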

Other character classes (with a more verbose syntax) are valid only within the bracket syntax, as shown in Table 3-3. They match a single character, so if you need to match many in a row, use the * or + to get the repetition you need.

Table 3-3. Regex character classes in brackets
Character class    Meaning
[:alnum:]          Any alphanumeric character
[:alpha:]          Any alphabetic character
[:cntrl:]          Any control character
[:digit:]          Any digit
[:graph:]          Any graphical character
[:lower:]          Any lowercase character
[:print:]          Any printable character
[:punct:]          Any punctuation
[:space:]          Any whitespace
[:upper:]          Any uppercase character
[:xdigit:]         Any hex digit

To use one of these classes, it has to be inside the brackets, so you end up with two sets of brackets. For example, grep '[[:cntrl:]]' large.data will look for lines containing control characters (ASCII 0–31 and 127). Here is another example:

grep 'X[[:upper:][:digit:]]' idlist.txt

This will match any line with an X followed by any uppercase letter or digit. It would match these lines:

User: XTjohnson
an XWing model 7
an X7wing model

Each has an uppercase X followed immediately by either another uppercase letter or by a digit.
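The same search can be reproduced without an idlist.txt file by piping a few lines into grep. In this sketch the fourth line (with a lowercase w after the X) is an added non-matching case, and the second search shows + turning the single-character class into a run; the -o option prints only the matched text. All variable names are illustrative.

```shell
# The three matching lines from the text plus one hypothetical non-match.
idlist=$(printf '%s\n' \
    'User: XTjohnson' \
    'an XWing model 7' \
    'an X7wing model' \
    'an Xwing model')

# X must be followed immediately by an uppercase letter or a digit.
hits=$(printf '%s\n' "$idlist" | grep -c 'X[[:upper:][:digit:]]')
echo "$hits"

# A class matches one character; + (extended syntax) matches a run of them.
run=$(printf '%s\n' 'id: AB12cd' | grep -E -o '[[:upper:][:digit:]]+')
echo "$run"
```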

Back References

Regex back references are one of the most powerful and often confusing regex operations. Consider the following file, tags.txt:

1    Command
2    <i>line</i>
3    is
4    <div>great</div>
5    <u>!</u>

Suppose you want to write a regular expression that will extract any line that contains a matching pair of complete HTML tags. The start tag has an HTML tag name; the ending tag has the same tag name but with a leading slash. <div> and </div> are a matching pair. You can search for these by writing a lengthy regex that contains all possible HTML tag values, or you can focus on the format of an HTML tag and use a regex back reference, as follows:

$ egrep '<([A-Za-z]*)>.*</\1>' tags.txt

2    <i>line</i>
4    <div>great</div>
5    <u>!</u>

In this example, the back reference is the \1 appearing in the latter part of the regular expression. It is referring back to the expression enclosed in the first set of parentheses, [A-Za-z]*, which has two parts. The letter range in brackets denotes a choice of any letter, uppercase or lowercase. The * that follows it means to repeat that zero or more times. Therefore, the \1 refers to whatever was matched by that pattern in parentheses. If [A-Za-z]* matches div, then the \1 also matches the text div.

The overall regular expression, then, can be described as follows: a literal less-than sign (<), the first character in the regex; followed by zero or more letters; then a greater-than sign (>); then zero or more of any character, because . matches any character and * means zero or more of the previous item; followed by another < and a slash (/); then the same sequence that was matched by the expression within the parentheses; and finally a > character. If this sequence matches any part of a line from our text file, egrep will print that line.

You can have more than one back reference in an expression and refer to each with \1 or \2 or \3, depending on its order in the regular expression. A \1 refers to the first set of parentheses, \2 to the second, and so on. Note that the parentheses are metacharacters; they have a special meaning. If you just want to match a literal parenthesis, you need to escape its special meaning by preceding it with a backslash, as in sin\([0-9.]*\) to match expressions like sin(6.2) or sin(3.14159).
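The back reference and the escaped parentheses can both be exercised with a short sketch. It assumes GNU grep, which accepts back references with -E. The scratch file reproduces tags.txt with one added line 6, a deliberately mismatched pair of tags that is not in the original file; the variable names are hypothetical.

```shell
# Scratch copy of tags.txt; line 6 is an added mismatched pair.
workdir=$(mktemp -d)
cat > "$workdir/tags.txt" <<'EOF'
1    Command
2    <i>line</i>
3    is
4    <div>great</div>
5    <u>!</u>
6    <b>oops</i>
EOF

# \1 must repeat whatever the parentheses matched, so <b>...</i> fails.
paired=$(grep -E -c '<([A-Za-z]*)>.*</\1>' "$workdir/tags.txt")
echo "$paired"

# Escaped parentheses are literal characters in extended syntax.
sin_hits=$(printf '%s\n' 'sin(6.2)' 'sin 6.2' | grep -E -c 'sin\([0-9.]*\)')
echo "$sin_hits"
rm -r "$workdir"
```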

Note

Valid HTML doesn’t have to be all on one line; the end tag can be several lines away from the start tag. Moreover, some single tags can indicate both a start and an end, such as <br/> for a break, or <p/> for an empty paragraph. We would need a more sophisticated approach to include such things in our search.

Quantifiers

Quantifiers specify the number of times an item must appear in a string. Quantifiers are defined by curly braces { }. For example, the pattern T{5} means that the letter T must appear consecutively exactly five times. The pattern T{3,6} means that the letter T must appear consecutively three to six times. The pattern T{5,} means that the letter T must appear five or more times.
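These three quantifiers can be checked against runs of T of different lengths. In the sketch below, -x requires the whole line to match; without it, T{5} would also be found inside a longer run, as the last search shows. The input lines and variable names are illustrative.

```shell
# Hypothetical runs of 2, 3, 5, and 7 letters.
runs=$(printf '%s\n' 'TT' 'TTT' 'TTTTT' 'TTTTTTT')

exactly_five=$(printf '%s\n' "$runs" | grep -E -x 'T{5}')    # TTTTT
three_to_six=$(printf '%s\n' "$runs" | grep -E -x 'T{3,6}')  # TTT, TTTTT
five_or_more=$(printf '%s\n' "$runs" | grep -E -x 'T{5,}')   # TTTTT, TTTTTTT

# Without -x, a run of seven still contains five in a row, so two lines match.
unanchored_count=$(printf '%s\n' "$runs" | grep -E -c 'T{5}')

echo "$exactly_five"
echo "$three_to_six"
echo "$five_or_more"
echo "$unanchored_count"
```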

Anchors and Word Boundaries

You can use anchors to specify that a pattern must exist at the beginning or the end of a string. The caret (^) character is used to anchor a pattern to the beginning of a string. For example, ^[1-5] means that a matching string must start with one of the digits 1 through 5, as the first character on the line. The $ character is used to anchor a pattern to the end of a string or line. For example, [1-5]$ means that a string must end with one of the digits 1 through 5.
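Anchoring both ends of a pattern is the usual way to validate input, because it leaves no room for stray characters before or after the part you care about. The following is a minimal sketch; is_hex is a hypothetical helper name, and the candidates are made-up inputs.

```shell
# Hypothetical helper: succeed only if the argument consists entirely of
# hexadecimal digits. ^ and $ force the class to cover the whole line;
# -q suppresses output so only the exit status is used.
is_hex() {
    printf '%s\n' "$1" | grep -E -q '^[0-9a-fA-F]+$'
}

for candidate in '5F' '0x5F' 'dead beef'; do
    if is_hex "$candidate"; then
        echo "$candidate: valid"
    else
        echo "$candidate: rejected"
    fi
done
# 5F: valid
# 0x5F: rejected
# dead beef: rejected
```

Without the anchors, 0x5F would be accepted, because the unanchored class matches the 0 (and the 5F) somewhere in the string.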

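In addition, you can use \b to identify a word boundary, that is, the position between a word character (a letter, digit, or underscore) and a non-word character such as a space or punctuation mark, or the start or end of the line. The pattern \b[1-5]\b will match on any of the digits 1 through 5, where the digit appears as its own word.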
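A short sketch shows which digits count as their own word. It assumes GNU grep, which supports \b in its default mode; the sample lines and variable names are invented.

```shell
# Hypothetical sample lines.
notes=$(printf '%s\n' \
    'room 3 is free' \
    'room 35 is free' \
    'item3' \
    'take 5.')

# \b[1-5]\b needs a boundary on both sides of the digit. "35" and "item3"
# fail because the digit touches another word character; "5." passes
# because a period is not a word character.
standalone=$(printf '%s\n' "$notes" | grep '\b[1-5]\b')
echo "$standalone"
# room 3 is free
# take 5.
```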

Summary

Regular expressions are extremely powerful for describing patterns and can be used in coordination with other tools to search and process data.

The uses and full syntax of regex far exceed the scope of this book. You can visit the following resources for additional information and utilities related to regex:

In the next chapter, we review some of the high-level principles of cybersecurity to ensure a common understanding of offensive and defensive operations.

Workshop

  1. Write a regular expression that matches a floating-point number (a number with a decimal point) such as 3.14. There can be digits on either side of the decimal point, but there need not be any on one side or the other. Allow the regex to match just a decimal point by itself, too.

  2. Use a back reference in a regular expression to match a number that appears on both sides of an equals sign. For example, it should match “314 is = to 314” but not “6 = 7.”

  3. Write a regular expression that looks for a line that begins with a digit and ends with a digit, with anything occurring in between.

  4. Write a regular expression that uses grouping to match on the following two IP addresses: 10.0.0.25 and 10.0.0.134.

  5. Write a regular expression that will match if the hexadecimal string 0x90 occurs more than three times in a row (i.e., 0x90 0x90 0x90).

Visit the Cybersecurity Ops website for additional resources and the answers to these questions.
