3.20. Split a String, Keeping the Regex Matches

Problem

You want to split a string using a regular expression. After the split, you will have an array or list of strings with the text between the regular expression matches, as well as the regex matches themselves.

Suppose you want to split a string with HTML tags in it along the HTML tags, and also keep the HTML tags. Splitting “I like <b>bold</b> and <i>italic</i> fonts” should result in an array of nine strings: “I like ”, “<b>”, “bold”, “</b>”, “ and ”, “<i>”, “italic”, “</i>”, and “ fonts”.

Solution

C#

You can use the static call when you process only a small number of strings with the same regular expression:

string[] splitArray = Regex.Split(subjectString, "(<[^<>]*>)");

Construct a Regex object if you want to use the same regular expression with a large number of strings:

Regex regexObj = new Regex("(<[^<>]*>)");
string[] splitArray = regexObj.Split(subjectString);

VB.NET

You can use the static call when you process only a small number of strings with the same regular expression:

Dim SplitArray = Regex.Split(SubjectString, "(<[^<>]*>)")

Construct a Regex object if you want to use the same regular expression with a large number of strings:

Dim RegexObj As New Regex("(<[^<>]*>)")
Dim SplitArray = RegexObj.Split(SubjectString)

Java

List<String> resultList = new ArrayList<String>();
Pattern regex = Pattern.compile("<[^<>]*>");
Matcher regexMatcher = regex.matcher(subjectString);
int lastIndex = 0;
while (regexMatcher.find()) {
    resultList.add(subjectString.substring(lastIndex,
                                           regexMatcher.start()));
    resultList.add(regexMatcher.group());
    lastIndex = regexMatcher.end();
}
resultList.add(subjectString.substring(lastIndex));

JavaScript

result = subject.split(/(<[^<>]*>)/);

XRegExp

result = XRegExp.split(subject, /(<[^<>]*>)/);

PHP

$result = preg_split('/(<[^<>]*>)/', $subject, -1,
                     PREG_SPLIT_DELIM_CAPTURE);

Perl

@result = split(m/(<[^<>]*>)/, $subject);

Python

If you have only a few strings to split, you can use the global function:

result = re.split("(<[^<>]*>)", subject)

To use the same regex repeatedly, use a compiled object:

reobj = re.compile("(<[^<>]*>)")
result = reobj.split(subject)

Ruby

list = []
lastindex = 0
subject.scan(/<[^<>]*>/) {|match|
    # Exclusive range: yields "" (not the whole string) when a match starts at index 0
    list << subject[lastindex...$~.begin(0)]
    list << $&
    lastindex = $~.end(0)
}
list << subject[lastindex..subject.length()]

Discussion

.NET

In .NET, the Regex.Split() method includes the text matched by capturing groups into the array. .NET 1.0 and 1.1 include only the first capturing group. .NET 2.0 and later include all capturing groups as separate strings into the array. If you want to include the overall regex match into the array, place the whole regular expression inside a capturing group. For .NET 2.0 and later, all other groups should be noncapturing, or they will be included in the array.

The capturing groups are not included in the string count that you can pass to the Split() function. If you call regexObj.Split(subject, 4) with the example string and regex of this recipe, you’ll get an array with seven strings. Those will be the four strings with the text before, between, and after the first three regex matches, plus three strings between them with the regex matches, as captured by the only capturing group in the regular expression. Simply put, you’ll get an array with: “I like ”, “<b>”, “bold”, “</b>”, “ and ”, “<i>”, and “italic</i> fonts”. If your regex has 10 capturing groups and you’re using .NET 2.0 or later, regexObj.Split(subject, 4) returns an array with 34 strings: the 4 pieces of text plus 10 captures for each of the 3 matches.

.NET does not provide an option to exclude the capturing groups from the array. Your only solution is to replace all named and numbered capturing groups with noncapturing groups. An easy way to do this in .NET is to use RegexOptions.ExplicitCapture, and replace all named groups with normal groups (i.e., just a pair of parentheses) in your regular expression.

Java

Java’s Pattern.split() method does not provide the option to add the regex matches to the resulting array. Instead, we can adapt Recipe 3.12 to add the text between the regex matches along with the regex matches themselves to a list. To get the text between the matches, we use the match details explained in Recipe 3.8.
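For comparison, the same bookkeeping can be sketched in Python, where re.finditer() plays the role of the Matcher.find() loop. The function name split_keeping_matches is hypothetical, and the sample string assumes the example sentence is written with ordinary spaces; this is only an illustration of the technique, not a replacement for the Java code in the Solution.

```python
import re

def split_keeping_matches(pattern, subject):
    """Return the text between matches interleaved with the matches themselves."""
    pieces = []
    last_index = 0
    for match in re.finditer(pattern, subject):
        # Text between the end of the previous match and the start of this one.
        pieces.append(subject[last_index:match.start()])
        # The regex match itself.
        pieces.append(match.group())
        last_index = match.end()
    # Whatever remains after the last match.
    pieces.append(subject[last_index:])
    return pieces

subject = "I like <b>bold</b> and <i>italic</i> fonts"
pieces = split_keeping_matches(r"<[^<>]*>", subject)
print(pieces)
```

Note that the regex needs no capturing group here, because the loop adds each match explicitly. For this input the result is the same list that re.split() produces when the whole regex is wrapped in a capturing group.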

JavaScript

JavaScript’s string.split() function does not provide an option to control whether regex matches should be added to the array. According to the JavaScript standard, all capturing groups should have their matches added to the array.

All the major web browsers now implement String.prototype.split() correctly. Older browsers did not always correctly add capturing groups to the returned array. If you want an implementation of String.prototype.split() that follows the standard and also works with all browsers, Steven Levithan has a solution for you at http://blog.stevenlevithan.com/archives/cross-browser-split.

XRegExp

When using XRegExp in JavaScript, call XRegExp.split(subject, regex) instead of subject.split(regex) for standards-compliant results in all browsers.

PHP

Pass PREG_SPLIT_DELIM_CAPTURE as the fourth parameter to preg_split() to include the text matched by capturing groups in the returned array. You can use the | operator to combine PREG_SPLIT_DELIM_CAPTURE with PREG_SPLIT_NO_EMPTY.
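For readers comparing languages: Python’s re.split() takes no flag like PREG_SPLIT_NO_EMPTY, so a rough equivalent is to filter the result afterward. The sketch below assumes a sample string with adjacent tags (not the recipe’s example) to show where the empty strings come from.

```python
import re

# Adjacent tags leave empty strings between the matches.
subject = "<b><i>bold italic</i></b>"

pieces = re.split(r"(<[^<>]*>)", subject)
print(pieces)

# Drop the empty strings, similar in effect to PREG_SPLIT_NO_EMPTY.
non_empty = [piece for piece in pieces if piece]
print(non_empty)
```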

The capturing groups are not included in the string count that you specify as the third argument to the preg_split() function. If you set the limit to four with the example string and regex of this recipe, you’ll get an array with seven strings. Those will be the four strings with the text before, between, and after the first three regex matches, plus three strings between them with the regex matches, as captured by the only capturing group in the regular expression. Simply put, you’ll get an array with: “I like ”, “<b>”, “bold”, “</b>”, “ and ”, “<i>”, and “italic</i> fonts”.

Perl

Perl’s split() function includes the text matched by all capturing groups into the array. If you want to include the overall regex match into the array, place the whole regular expression inside a capturing group.

The capturing groups are not included in the string count that you can pass to the split() function. If you call split(/(<[^<>]*>)/, $subject, 4) with the example string and regex of this recipe, you’ll get an array with seven strings. Those will be the four strings with the text before, between, and after the first three regex matches, plus three strings between them with the regex matches, as captured by the only capturing group in the regular expression. Simply put, you’ll get an array with: “I like ”, “<b>”, “bold”, “</b>”, “ and ”, “<i>”, and “italic</i> fonts”. If your regex has 10 capturing groups, split($regex, $subject, 4) returns an array with 34 strings: the 4 pieces of text plus 10 captures for each of the 3 matches.

Perl does not provide an option to exclude the capturing groups from the array. Your only solution is to replace all named and numbered capturing groups with noncapturing groups.

Python

Python’s split() function includes the text matched by all capturing groups in the returned list. If you want to include the overall regex match in the list, place the whole regular expression inside a capturing group.

The capturing groups do not affect the number of times the string is split. If you call reobj.split(subject, 3) with the example string and regex of this recipe, you’ll get a list with seven strings. The string is split three times, which results in four pieces of text between the matches, plus three pieces of text matched by the capturing group. Simply put, you’ll get a list with: “I like ”, “<b>”, “bold”, “</b>”, “ and ”, “<i>”, and “italic</i> fonts”. If your regex has 10 capturing groups, reobj.split(subject, 3) returns a list with 34 strings: the 4 pieces of text plus 10 captures for each of the 3 splits.
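The counting can be checked with a short sketch. It assumes the example sentence is written with ordinary spaces. The second regex, with a nested group for the optional slash, is a hypothetical variation added only to show how each extra group adds one string per split.

```python
import re

subject = "I like <b>bold</b> and <i>italic</i> fonts"

# One capturing group around the whole regex: each split adds one match.
one_group = re.compile(r"(<[^<>]*>)")
limited = one_group.split(subject, maxsplit=3)
print(len(limited))  # 4 pieces of text + 3 matches
print(limited)

# A second, nested group adds one more string per split.
two_groups = re.compile(r"(<(/?)[a-z]+>)")
limited_two = two_groups.split(subject, maxsplit=3)
print(len(limited_two))  # 4 pieces of text + 3 splits * 2 groups
print(limited_two)
```

In both cases the unsplit remainder, including the closing </i> tag, is returned as the last string.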

Python does not provide an option to exclude the capturing groups from the list. Your only solution is to replace all named and numbered capturing groups with noncapturing groups.
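As a minimal sketch of that replacement, the regexes below are hypothetical variations on this recipe’s regex that match only b and i tags. They differ solely in which groups capture, and the sample string assumes ordinary spaces.

```python
import re

subject = "I like <b>bold</b> and <i>italic</i> fonts"

# The inner capturing group (b|i) adds the tag name after every tag.
with_inner_group = re.split(r"(</?(b|i)>)", subject)
print(with_inner_group)

# Making the inner group noncapturing keeps only the tags themselves.
tags_only = re.split(r"(</?(?:b|i)>)", subject)
print(tags_only)

# With no capturing groups at all, only the text between the tags remains.
text_only = re.split(r"</?(?:b|i)>", subject)
print(text_only)
```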

Ruby

Ruby’s String.split() method does not provide the option to add the regex matches to the resulting array. Instead, we can adapt Recipe 3.11 to add the text between the regex matches along with the regex matches themselves to a list. To get the text between the matches, we use the match details explained in Recipe 3.8.

See Also

Recipe 2.9 explains capturing and noncapturing groups. Recipe 2.11 explains named capturing groups. Some programming languages also add text matched by capturing groups to the array when splitting a string.

Recipe 3.19 shows code that splits a string into an array without adding the regex matches to the array.
