3.21. Search Line by Line

Problem

Traditional grep tools apply your regular expression to one line of text at a time, and display the lines matched (or not matched) by the regular expression. You have an array of strings, or a multiline string, that you want to process in this way.

Solution

C#

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

string[] lines = Regex.Split(subjectString, "\r?\n");

Then, iterate over the lines array:

Regex regexObj = new Regex("regex pattern");
for (int i = 0; i < lines.Length; i++) {
    if (regexObj.IsMatch(lines[i])) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}

VB.NET

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

Dim Lines = Regex.Split(SubjectString, "\r?\n")

Then, iterate over the lines array:

Dim RegexObj As New Regex("regex pattern")
For i As Integer = 0 To Lines.Length - 1
    If RegexObj.IsMatch(Lines(i)) Then
        'The regex matches Lines(i)
    Else
        'The regex does not match Lines(i)
    End If
Next

Java

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

String[] lines = subjectString.split("\r?\n");

Then, iterate over the lines array:

Pattern regex = Pattern.compile("regex pattern");
Matcher regexMatcher = regex.matcher("");
for (int i = 0; i < lines.length; i++) {
    regexMatcher.reset(lines[i]);
    if (regexMatcher.find()) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}

JavaScript

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

var lines = subject.split(/\r?\n/);

Then, iterate over the lines array:

var regexp = /regex pattern/;
for (var i = 0; i < lines.length; i++) {
    if (lines[i].match(regexp)) {
        // The regex matches lines[i]
    } else {
        // The regex does not match lines[i]
    }
}

PHP

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

$lines = preg_split('/\r?\n/', $subject);

Then, iterate over the $lines array:

foreach ($lines as $line) {
    if (preg_match('/regex pattern/', $line)) {
        // The regex matches $line
    } else {
        // The regex does not match $line
    }
}

Perl

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

@lines = split(m/\r?\n/, $subject);

Then, iterate over the @lines array:

foreach $line (@lines) {
    if ($line =~ m/regex pattern/) {
        # The regex matches $line
    } else {
        # The regex does not match $line
    }
}

Python

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

import re

lines = re.split("\r?\n", subject)

Then, iterate over the lines array:

reobj = re.compile("regex pattern")
for line in lines:
    if reobj.search(line):
        # The regex matches line
        pass
    else:
        # The regex does not match line
        pass

Ruby

If you have a multiline string, split it into an array of strings first, with each string in the array holding one line of text:

lines = subject.split(/\r?\n/)

Then, iterate over the lines array:

re = /regex pattern/
lines.each { |line|
    if line =~ re
        # The regex matches line
    else
        # The regex does not match line
    end
}

Discussion

When working with line-based data, you can save yourself a lot of trouble if you split the data into an array of lines, instead of trying to work with one long string with embedded line breaks. Then, you can apply your actual regex to each string in the array, without worrying about matching more than one line. This approach also makes it easy to keep track of the relationship between lines. For example, you could easily iterate over the array using one regex to find a header line and then another to find the footer line. With the delimiting lines found, you can then use a third regex to find the data lines you’re interested in. Though this may seem like a lot of work, it’s all very straightforward, and will yield code that performs well. Trying to craft a single regex to find the header, data, and footer all at once will be a lot more complicated, and will result in a much slower regex.
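The header, footer, and data-line approach described above can be sketched in Python as follows. The BEGIN and END markers, the key=value data format, and the name extract_data are hypothetical, chosen only for illustration; substitute the regexes that fit your own data.

```python
import re

# Hypothetical markers and data format, chosen only for illustration.
header_re = re.compile(r"^BEGIN\b")
footer_re = re.compile(r"^END\b")
data_re = re.compile(r"^(\w+)=(\d+)$")

def extract_data(lines):
    """Return (key, value) pairs from data lines between a header and a footer."""
    inside = False
    found = []
    for line in lines:
        if not inside:
            # Outside a block: only look for the header line
            if header_re.search(line):
                inside = True
        elif footer_re.search(line):
            # The footer line closes the block
            inside = False
        else:
            # Inside a block: keep only the lines that look like data
            m = data_re.search(line)
            if m:
                found.append((m.group(1), int(m.group(2))))
    return found

sample = ("ignored=1\nBEGIN report\nalpha=10\nnot a data line\n"
          "beta=20\nEND report\ngamma=30")
pairs = extract_data(re.split(r"\r?\n", sample))
print(pairs)
```

Each of the three regexes only ever sees a single line, so none of them has to deal with line breaks or with the other two patterns.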

Processing a string line by line also makes it easy to negate a regular expression. Regular expressions don’t provide an easy way of saying “match a line that does not contain this or that word.” Only character classes can be easily negated. But if you’ve already split your string into lines, finding the lines that don’t contain a word becomes as easy as doing a literal text search in all the lines, and removing the ones in which the word can be found.
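As a minimal sketch of this kind of negation in Python, the function below keeps only the lines that do not contain a given word. The name lines_without and the sample text are assumptions made for this example. Note that a literal text search also finds the word inside longer words; use a regex with word boundaries if that matters.

```python
import re

def lines_without(subject, word):
    """Return the lines of subject that do not contain word as literal text."""
    lines = re.split(r"\r?\n", subject)
    # A plain substring test is all the "negation" we need
    return [line for line in lines if word not in line]

text = "first error here\r\nall good\nanother error\nfine too"
kept = lines_without(text, "error")
print(kept)
```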

Recipe 3.19 shows how you can easily split a string into an array. The regular expression \r\n matches a pair of CR and LF characters, which delimit lines on the Microsoft Windows platforms. \n matches an LF character, which delimits lines on Unix and its derivatives, such as Linux and even OS X. Since these two regular expressions are essentially plain text, you don’t even need to use a regular expression. If your programming language can split strings using literal text, by all means split the string that way.
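In Python, for instance, the built-in str.split method splits on literal text, so no regular expression is needed when you know the line break style. The sample strings below are made up for illustration.

```python
windows_text = "one\r\ntwo\r\nthree"
unix_text = "one\ntwo\nthree"

# Split on the literal line break used by each platform
windows_lines = windows_text.split("\r\n")
unix_lines = unix_text.split("\n")

print(windows_lines)
print(unix_lines)
```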

If you’re not sure which line break style your data uses, you could split it using the regular expression \r?\n. By making the CR optional, this regex matches either a CRLF Windows line break or an LF Unix line break.
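The following Python sketch, using a made-up string with mixed line breaks, shows the difference between splitting with the regex \r?\n and splitting on a literal LF.

```python
import re

mixed = "windows line\r\nunix line\nlast line"

# The optional CR lets one regex handle both line break styles
lines = re.split(r"\r?\n", mixed)

# Splitting on a literal LF leaves a stray CR at the end of the first line
literal_lines = mixed.split("\n")

print(lines)
print(literal_lines)
```

A stray CR at the end of a line is easy to overlook, and it can make a regex that ends with a $ anchor fail unexpectedly.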

Once you have your strings in the array, you can easily loop over it. Inside the loop, follow Recipe 3.5 to check which lines match, and which don’t.

See Also

This recipe uses techniques introduced by two earlier recipes. Recipe 3.11 shows code to iterate over all the matches a regex can find in a string. Recipe 3.19 shows code to split a string into an array or list using a regular expression.
