Matching groups in Regex

The last main topic that I have left out until now is groups. However, in order to work with groups, we have to move back into a JavaScript console, as this will provide the actual results object that we will need to look at.

Groups show how we can extract data from the input provided. Without groups, you can check whether there is a match, or whether a given input text follows a specific pattern. However, you can't pull out the parts of the text that matched the looser portions of the pattern. The syntax is fairly simple: you wrap the part of the pattern you want inside parentheses, and the text it matches will be extracted into its own entry in the results.

Grouping characters together to create a clause

Let's start with something basic—a person's name—in standard JavaScript. If you had a string with someone's name, you would probably split it by the space character and check whether there are two or three components in it. In case there are two, the first would consist of the first name and the second would consist of the last name; however, if there are three components, then the second component would include the middle name and the third would include the last name.

Instead of imposing a condition like this, we can create a simple pattern as shown:

/(\S+) (\S*) ?(\S+)/

The first group contains a mandatory non-space word. The plus sign again repeats the preceding token one or more times. Next, we want a space followed by a second word; this time, I've used the asterisk to denote that it could be of length zero. After this, we have another space, though this time it's optional, and finally a third group for the last word.

Note

If there is no middle name, there won't be a second space, which is why that space is optional. The final group still uses the plus sign, so we make sure that a last word is always present.

Now, open up a JavaScript console (in Chrome) and create a variable for this pattern:

var pattern = /(\S+) (\S*) ?(\S+)/;

Then, try running the exec method on this pattern with different names, with and without a middle name, and take a look at the resulting output.

Whether or not the string has a middle name, the result always has three groups that we can assign to variables. This means that we can replace code such as this:

var res = name.split(" ");
var first_name = res[0];
var middle_name, last_name;

if (res.length == 2) {
   middle_name = "";
   last_name = res[1];
} else {
   middle_name = res[1];
   last_name = res[2];
}

We can remove the conditional statements (if-else) from the preceding code and write something similar to this:

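var res = /^(\S+) (\S*?) ?(\S+)$/.exec(name);

var first_name = res[1];
var middle_name = res[2];
var last_name = res[3];

If the middle name is left out, our expression will still have the group; it will just be an empty string. Note that this relies on two small additions to the earlier pattern: the anchors (^ and $) and the lazy \S*? in the middle group. With the pattern exactly as first written, the greedy \S* grabs the whole last name and gives back only the single character that the final \S+ requires, so "John Smith" comes out as John, Smit, and h.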

Another thing worth mentioning is that the indexes of the groups start at 1: the first group is at index 1 of the result, and index 0 holds the entire match.
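
As a rough sketch of the points above, the following snippet runs an anchored, lazy variant of the name pattern against a few inputs and then shows, for comparison, what the greedy pattern returns. The parseName helper and the sample names are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper for illustration: splits a name using capture groups.
// The anchors (^ and $) and the lazy \S*? are assumptions made here so that
// a two-word name leaves the middle group empty.
function parseName(name) {
  var res = /^(\S+) (\S*?) ?(\S+)$/.exec(name);
  if (res === null) {
    return null; // exec returns null when the pattern does not match
  }
  // res[0] is the entire match; the groups start at index 1.
  return { first: res[1], middle: res[2], last: res[3] };
}

console.log(parseName("John Smith"));        // middle is ""
console.log(parseName("John Quincy Smith")); // middle is "Quincy"
console.log(parseName("Cher"));              // null: no space, so no match

// For comparison, the greedy pattern without anchors splits the last name:
var greedy = /(\S+) (\S*) ?(\S+)/.exec("John Smith");
console.log(greedy.slice(1)); // ["John", "Smit", "h"]
```

Returning null for a non-matching input is a design choice in this sketch; calling code that indexes the result directly would otherwise throw.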

Capture and noncapture groups

In the first chapter, we saw an example where we wanted to parse some kind of XML tokens, and we said that we needed an extra constraint where the closing tag had to match the opening tag for it to be valid. So, for example, this should be parsed:

<duration>5 Minutes</duration>

However, this should not be parsed:

<duration>5 Minutes</title>

Since the closing tag doesn't match the opening tag, the second line is invalid. The way to reference previous groups in your pattern is by using a backslash character, followed by the group's index number. As an example, let's write a small script that will accept a line-delimited series of XML tags and convert it into a JavaScript object.

To start with, let's create an input string:

var xml = [
   "<title>File.js</title>",
   "<size>36 KB</size>",
   "<language>JavaScript</language>",
   "<modified>5 Minutes</name>"
].join("\n");

Here, we have four properties, but the last property does not have a valid closing tag, so it should not be picked up. Next, we will cycle through the lines of this string and set the properties of a data object:

var data = {};

xml.split("\n").forEach(function(line){
   var match = /<(\w+)>([^<]*)<\/\1>/.exec(line);
   if (match) {
      var tag = match[1];
      data[tag] = match[2];
   }
});

If we output data in a console, you will see that we do, in fact, get only the three valid properties: title, size, and language.

Let's take a moment to examine the pattern. We look for an opening tag with a name inside it, and we then pick up all the characters except for an opening angle bracket, using a negated range. After this, we look for a closing tag, using a back reference (\1) to make sure it matches the opening one. You may have also noticed that we needed to escape the forward slash, so that it isn't treated as the end of the regular expression literal.

Note

A back reference allows you to refer back to a sub-pattern earlier in the same pattern, so that the value matched by the sub-pattern is remembered and used as part of the matching. For example, /(no)\1/ matches nono in nono. The \1 is replaced with the value matched by the first sub-pattern, here no, so as to form the final pattern.
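
To make the back reference concrete, here is a minimal sketch that tests the /(no)\1/ example and then applies the same idea to the tag check. The isBalancedTag helper is a hypothetical name used only in this example, and the anchors it adds assume that each line holds exactly one tag.

```javascript
// /(no)\1/ matches "no" followed by whatever group 1 captured, i.e. "nono".
console.log(/(no)\1/.test("nono"));  // true
console.log(/(no)\1/.test("noyes")); // false

// Hypothetical helper: true when a line is a single tag whose closing name
// repeats the opening name captured by group 1.
function isBalancedTag(line) {
  return /^<(\w+)>[^<]*<\/\1>$/.test(line);
}

console.log(isBalancedTag("<duration>5 Minutes</duration>")); // true
console.log(isBalancedTag("<duration>5 Minutes</title>"));    // false
```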

All the groups we have seen so far have been capture groups, which tell the regular expression engine to extract this portion of the pattern into its own entry in the results. However, parentheses have other uses that provide even more functionality; the first of these is the noncapture group.

Matching noncapture groups

A noncapture group groups a part of a pattern, but it does not extract this data into the results array, and it cannot be used in back references. One benefit of this is that it allows you to apply quantifiers to whole sections of your pattern. For example, if we want a pattern that repeats the word world any number of times, we can write it as this:

/(?:world)*/

This will match world as well as worldworldworld and so on (and, because the asterisk allows zero repetitions, the empty string too). The syntax for a noncapture group is similar to a standard group, except that you start it with a question mark and a colon (?:). Grouping allows us to treat the entire thing as a single unit and apply quantifiers, which usually only work on individual characters.
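
The difference between the two kinds of group is easiest to see in the result array. The following is a small sketch; the variable names are made up for the example.

```javascript
// A capture group adds an entry to the result array...
var captured = /(world)+/.exec("worldworldworld");
console.log(captured.length); // 2: the full match plus group 1
console.log(captured[1]);     // "world" (the last repetition)

// ...whereas a noncapture group only groups, so just the full match remains.
var grouped = /(?:world)+/.exec("worldworldworld");
console.log(grouped.length); // 1
console.log(grouped[0]);     // "worldworldworld"

// The asterisk also allows zero repetitions, so the empty string matches.
var empty = /(?:world)*/.exec("hello");
console.log(empty[0] === ""); // true
```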

The other most common use for noncapture groups (which can be done with capture groups as well) works in conjunction with the pipe character. A pipe character allows you to list multiple alternatives one after the other inside your pattern. For example, in a situation where we want to match either yes or no, we can create this pattern:

/yes|no/

Most of the time, though, this set of options will only be a small piece of your pattern. For example, if we are parsing log messages, we may want to extract the log level and the message. The log level can be one of only a few options (such as debug, info, error, and so on), but the message will always be there. Now, you could write out every alternative in full, like this:

/\[info\] - .*|\[debug\] - .*|\[error\] - .*/

Instead, we can extract the common part into its own noncapture group:

/\[(?:info|debug|error)\] - .*/

By doing this we remove a lot of the duplicate code.
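
As an illustration, the sketch below applies the grouped pattern to a few log lines and, since the stated goal was to extract the level and the message, also shows a variant that swaps in capture groups. The log lines, the anchors, and the names isLogLine and parseLog are assumptions made for this example.

```javascript
// Hypothetical log lines, made up for this example.
var lines = [
  "[info] - Server started",
  "[debug] - Cache warmed",
  "[trace] - Not a recognised level"
];

// The noncapture group holds the alternatives and is not extracted.
var isLogLine = /^\[(?:info|debug|error)\] - .*$/;
// Swapping in capture groups extracts the level and the message.
var parseLog = /^\[(info|debug|error)\] - (.*)$/;

lines.forEach(function (line) {
  var m = parseLog.exec(line);
  console.log(isLogLine.test(line), m ? { level: m[1], message: m[2] } : null);
});
```

The third line is rejected by both patterns because trace is not one of the listed alternatives.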

Matching lookahead groups

The last kind of group you can have in your pattern is the lookahead group. These groups allow us to set a constraint on a pattern without including this constraint in the actual match. With noncapture groups, JavaScript will not create a separate index for a section, although it will include it in the full match (the result's first element). With lookahead groups, we want to be able to make sure there is or isn't some text after our match, but we don't want this text in the results.

For example, let's say we have some input text and we want to parse out all .com domain names. We might not necessarily want .com in the match, just the actual domain name. In this case, we can create this pattern:

/\w+(?=\.com)/g

The group starting with ?= means that this text must follow our match, but it is not included in the match itself; we also have to escape the period, since it is a special character. Now, we can use this pattern to extract the domains:

text.match(/\w+(?=\.com)/g)

Here, we assume that text is a variable holding the input we want to search.
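
A minimal sketch of the lookahead in action follows. The sample text and its domain names are placeholders invented for this example.

```javascript
// Hypothetical sample input; the domains are placeholders.
var text = "Visit example.com, docs.example.org and mysite.com today";

// \w+ matches the name; (?=\.com) requires ".com" next without consuming it.
var domains = text.match(/\w+(?=\.com)/g);
console.log(domains); // ["example", "mysite"]
```

Note that \w does not match hyphens or periods, so this simple pattern only picks up the label directly before .com.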

Using a negative lookahead

Finally, if we wanted to use a negative lookahead, that is, a lookahead group that makes sure that the given text does not follow the match, we can simply use an exclamation point instead of an equals sign:

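var text = "Mr. Smith & Mrs. Doe";

text.match(/\b\w+\b(?!\.)/g);

This will match all the words that are not followed by a period, that is, it will pull out the names Smith and Doe from this text. The word boundaries (\b) are needed because, without them, the engine backtracks and also returns fragments of the titles such as M and Mr, which are not themselves followed by a period.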
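
The effect of backtracking on a negative lookahead can be checked directly. This short sketch compares the pattern with and without word boundaries on the same text.

```javascript
var text = "Mr. Smith & Mrs. Doe";

// Without word boundaries the engine backtracks to a partial word that is
// not followed by a period, so fragments of the titles slip through.
var loose = text.match(/\w+(?!\.)/g);
console.log(loose); // ["M", "Smith", "Mr", "Doe"]

// With \b on both sides only whole words are considered.
var strict = text.match(/\b\w+\b(?!\.)/g);
console.log(strict); // ["Smith", "Doe"]
```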