You want to split a string using a regular expression. After the split, you will have an array or list of strings with the text between the regular expression matches, as well as the regex matches themselves.
Suppose you want to split a string with HTML tags in it along the HTML tags, and also keep the HTML tags. Splitting I●like●<b>bold</b>●and●<i>italic</i>●fonts should result in an array of nine strings: I●like●, <b>, bold, </b>, ●and●, <i>, italic, </i>, and ●fonts.
You can use the static call when you process only a small number of strings with the same regular expression:
string[] splitArray = Regex.Split(subjectString, "(<[^<>]*>)");
Construct a Regex object if you want to use the same regular expression with a large number of strings:
Regex regexObj = new Regex("(<[^<>]*>)");
string[] splitArray = regexObj.Split(subjectString);
You can use the static call when you process only a small number of strings with the same regular expression:
Dim SplitArray = Regex.Split(SubjectString, "(<[^<>]*>)")
Construct a Regex object if you want to use the same regular expression with a large number of strings:
Dim RegexObj As New Regex("(<[^<>]*>)")
Dim SplitArray = RegexObj.Split(SubjectString)
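// Requires java.util.ArrayList, java.util.List, java.util.regex.Matcher, and java.util.regex.Pattern
List<String> resultList = new ArrayList<String>();
Pattern regex = Pattern.compile("<[^<>]*>");
Matcher regexMatcher = regex.matcher(subjectString);
int lastIndex = 0;
while (regexMatcher.find()) {
    // Text between the previous match and this one
    resultList.add(subjectString.substring(lastIndex, regexMatcher.start()));
    // The regex match itself
    resultList.add(regexMatcher.group());
    lastIndex = regexMatcher.end();
}
// Text after the last match
resultList.add(subjectString.substring(lastIndex));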
In .NET, the Regex.Split() method includes the text matched by capturing groups in the array. .NET 1.0 and 1.1 include only the first capturing group. .NET 2.0 and later include all capturing groups as separate strings in the array. If you want to include the overall regex match in the array, place the whole regular expression inside a capturing group. For .NET 2.0 and later, all other groups should be noncapturing, or they will be included in the array.
The capturing groups are not included in the string count that you can pass to the Split() function. If you call regexObj.Split(subject, 4) with the example string and regex of this recipe, you’ll get an array with seven strings. Those will be the four strings with the text before, between, and after the first three regex matches, plus three strings between them with the regex matches, as captured by the only capturing group in the regular expression. Simply put, you’ll get an array with: I●like●, <b>, bold, </b>, ●and●, <i>, and italic</i>●fonts. If your regex has 10 capturing groups and you’re using .NET 2.0 or later, regexObj.Split(subject, 4) returns an array with 34 strings: the 4 pieces of text, plus 10 captured strings for each of the 3 matches.
.NET does not provide an option to exclude the capturing groups from the array. Your only solution is to replace all named and numbered capturing groups with noncapturing groups. An easy way to do this in .NET is to use RegexOptions.ExplicitCapture, and replace all named groups with normal groups (i.e., just a pair of parentheses) in your regular expression.
Java’s Pattern.split()
method does not provide the
option to add the regex matches to the resulting array. Instead, we
can adapt Recipe 3.12 to add the text
between the regex matches along with the regex matches themselves to a
list. To get the text between the matches, we use the match details
explained in Recipe 3.8.
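The loop itself is not specific to Java. As a rough illustration only, here is the same idea sketched in Python: walk over the matches, and for each one store the text since the previous match, then the match itself. The function name split_keeping_matches is made up for this sketch and is not part of any library.

```python
import re

def split_keeping_matches(pattern, subject):
    """Return the text between the regex matches, interleaved with the matches."""
    result = []
    last_index = 0
    for match in re.finditer(pattern, subject):
        # Text between the previous match (or the start) and this match
        result.append(subject[last_index:match.start()])
        # The regex match itself
        result.append(match.group())
        last_index = match.end()
    # Text after the last match
    result.append(subject[last_index:])
    return result

pieces = split_keeping_matches(r"<[^<>]*>", "I like <b>bold</b> and <i>italic</i> fonts")
print(pieces)
```

Because the matches are collected by hand, the regex needs no capturing group around it, and any groups it does contain have no effect on the result.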
JavaScript’s string.split() function does not provide an option to control whether regex matches should be added to the array. According to the JavaScript standard, all capturing groups should have their matches added to the array.
All the major web browsers now implement String.prototype.split()
correctly. Older
browsers did not always correctly add capturing groups to the returned
array. If you want an implementation of String.prototype.split()
that follows the
standard and also works with all browsers, Steven Levithan has a
solution for you at http://blog.stevenlevithan.com/archives/cross-browser-split.
When using XRegExp in JavaScript, call XRegExp.split(subject, regex)
instead of subject.split(regex)
for standards-compliant
results in all browsers.
Pass PREG_SPLIT_DELIM_CAPTURE as the fourth parameter to preg_split() to include the text matched by capturing groups in the returned array. You can use the | operator to combine PREG_SPLIT_DELIM_CAPTURE with PREG_SPLIT_NO_EMPTY.
The capturing groups are not included in the string count that you specify as the third argument to the preg_split() function. If you set the limit to four with the example string and regex of this recipe, you’ll get an array with seven strings. Those will be the four strings with the text before, between, and after the first three regex matches, plus three strings between them with the regex matches, as captured by the only capturing group in the regular expression. Simply put, you’ll get an array with: I●like●, <b>, bold, </b>, ●and●, <i>, and italic</i>●fonts.
Perl’s split() function includes the text matched by all capturing groups in the array. If you want to include the overall regex match in the array, place the whole regular expression inside a capturing group.
The capturing groups are not included in the string count that you can pass to the split() function. If you call split(/(<[^<>]*>)/, $subject, 4) with the example string and regex of this recipe, you’ll get an array with seven strings. Those will be the four strings with the text before, between, and after the first three regex matches, plus three strings between them with the regex matches, as captured by the only capturing group in the regular expression. Simply put, you’ll get an array with: I●like●, <b>, bold, </b>, ●and●, <i>, and italic</i>●fonts. If your regex has 10 capturing groups, split($regex, $subject, 4) returns an array with 34 strings: the 4 pieces of text, plus 10 captured strings for each of the 3 matches.
Perl does not provide an option to exclude the capturing groups from the array. Your only solution is to replace all named and numbered capturing groups with noncapturing groups.
Python’s split() function includes the text matched by all capturing groups in the array. If you want to include the overall regex match in the array, place the whole regular expression inside a capturing group.
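As a minimal sketch of this behavior, the following applies re.split() to the example string of this recipe, once with the whole regex inside a capturing group and once without. The variable names are chosen for illustration, and ordinary spaces stand in for the ● characters shown in the text.

```python
import re

subject = "I like <b>bold</b> and <i>italic</i> fonts"

# The whole regex sits inside one capturing group, so each matched tag
# is returned along with the text between the tags.
with_tags = re.split(r"(<[^<>]*>)", subject)
print(with_tags)
# ['I like ', '<b>', 'bold', '</b>', ' and ', '<i>', 'italic', '</i>', ' fonts']

# Without a capturing group, only the text between the tags is returned.
without_tags = re.split(r"<[^<>]*>", subject)
print(without_tags)
# ['I like ', 'bold', ' and ', 'italic', ' fonts']
```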
The capturing groups do not affect the number of times the string is split. If you call re.split("(<[^<>]*>)", subject, maxsplit=3) with the example string and regex of this recipe, you’ll get an array with seven strings. The string is split three times, which results in four pieces of text between the matches, plus three pieces of text matched by the capturing group. Simply put, you’ll get an array with: I●like●, <b>, bold, </b>, ●and●, <i>, and italic</i>●fonts. If your regex has 10 capturing groups, re.split(regex, subject, maxsplit=3) returns an array with 34 strings: the 4 pieces of text, plus 10 captured strings for each of the 3 splits.
Python does not provide an option to exclude the capturing groups from the array. Your only solution is to replace all named and numbered capturing groups with noncapturing groups.
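The effect of the split limit and of extra capturing groups can be checked with a short sketch like the one below. The inner group that captures the tag name is an assumption added purely for illustration; it is not part of the regex used in this recipe.

```python
import re

subject = "I like <b>bold</b> and <i>italic</i> fonts"

# Three splits: four pieces of text plus three captured tags.
limited = re.split(r"(<[^<>]*>)", subject, maxsplit=3)
print(len(limited))   # 7

# A second (inner) capturing group adds one more string per split:
# 4 pieces of text + 3 splits * 2 groups = 10 strings.
two_groups = re.split(r"(</?(\w+)[^<>]*>)", subject, maxsplit=3)
print(len(two_groups))   # 10

# Turning the inner group into a noncapturing group, (?:...), keeps
# only the overall match in the result again.
one_group = re.split(r"(</?(?:\w+)[^<>]*>)", subject, maxsplit=3)
print(len(one_group))   # 7
```

Passing the limit as the keyword argument maxsplit makes it clear that the number counts splits rather than resulting strings.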
Ruby’s String.split()
method does not provide the
option to add the regex matches to the resulting array. Instead, we
can adapt Recipe 3.11 to add the text
between the regex matches along with the regex matches themselves to a
list. To get the text between the matches, we use the match details
explained in Recipe 3.8.
Recipe 2.9 explains capturing and noncapturing groups. Recipe 2.11 explains named capturing groups. Some programming languages also add text matched by capturing groups to the array when splitting a string.
Recipe 3.19 shows code that splits a string into an array without adding the regex matches to the array.