3.3. Create Regular Expression Objects

Problem

You want to instantiate a regular expression object or otherwise compile a regular expression so you can use it efficiently throughout your application.

Solution

C#

If you know the regex to be correct:

Regex regexObj = new Regex("regex pattern");

If the regex is provided by the end user (UserInput being a string variable):

try {
    Regex regexObj = new Regex(UserInput);
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

VB.NET

If you know the regex to be correct:

Dim RegexObj As New Regex("regex pattern")

If the regex is provided by the end user (UserInput being a string variable):

Try
    Dim RegexObj As New Regex(UserInput)
Catch ex As ArgumentException
    'Syntax error in the regular expression
End Try

Java

If you know the regex to be correct:

Pattern regex = Pattern.compile("regex pattern");

If the regex is provided by the end user (userInput being a string variable):

try {
	Pattern regex = Pattern.compile(userInput);
} catch (PatternSyntaxException ex) {
	// Syntax error in the regular expression
}

To be able to use the regex on a string, create a Matcher:

Matcher regexMatcher = regex.matcher(subjectString);

To use the regex on another string, you can create a new Matcher, as just shown, or reuse an existing one:

regexMatcher.reset(anotherSubjectString);

JavaScript

Literal regular expression in your code:

var myregexp = /regex pattern/;

Regular expression retrieved from user input, as a string stored in the variable userinput:

var myregexp = new RegExp(userinput);

XRegExp

If you want to use XRegExp’s extended regular expression syntax in JavaScript, you need to create an XRegExp object from a string:

var myregexp = XRegExp("regex pattern");

Perl

$myregex = qr/regex pattern/;

Regular expression retrieved from user input, as a string stored in the variable $userinput:

$myregex = qr/$userinput/;

Python

reobj = re.compile("regex pattern")

Regular expression retrieved from user input, as a string stored in the variable userinput:

reobj = re.compile(userinput)
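
The C# and Java solutions above catch the exception thrown for a malformed pattern. The same precaution applies in Python, where re.compile() raises re.error on a syntax error. A minimal sketch, assuming a hypothetical helper named compile_user_regex that is not part of the original recipe:

```python
import re

def compile_user_regex(userinput):
    """Return (compiled_regex, None) on success or (None, message) on failure.

    compile_user_regex is a hypothetical helper name used for illustration.
    """
    try:
        return re.compile(userinput), None
    except re.error as ex:
        # Syntax error in the regular expression; ex describes the problem
        return None, str(ex)

reobj, problem = compile_user_regex("(unclosed group")
if reobj is None:
    print("Invalid regular expression:", problem)
```

Returning the message rather than printing it inside the helper leaves it to the caller to decide how to ask the user for a corrected regex.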

Ruby

Literal regular expression in your code:

myregexp = /regex pattern/

Regular expression retrieved from user input, as a string stored in the variable userinput:

myregexp = Regexp.new(userinput)

Discussion

Before the regular expression engine can match a regular expression to a string, the regular expression has to be compiled. This compilation happens while your application is running. The regular expression constructor or compile function parses the string that holds your regular expression and converts it into a tree structure or state machine. The function that does the actual pattern matching will traverse this tree or state machine as it scans the string. Programming languages that support literal regular expressions do the compilation when execution reaches the regular expression operator.
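
Because compilation happens at run time, a syntax error in a regex surfaces only when the line that compiles it actually executes. The following Python sketch illustrates this; the function name build_broken_regex and the variable compiled_ok are invented for the example:

```python
import re

def build_broken_regex():
    # The pattern has an unbalanced parenthesis, but nothing checks it
    # until this line actually runs.
    return re.compile("(oops")

# Defining the function above raised nothing. The error appears on the call.
try:
    build_broken_regex()
    compiled_ok = True
except re.error:
    compiled_ok = False

print(compiled_ok)
```

This is why a hardcoded regex should be exercised at least once by a test run before you rely on it.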

.NET

In C# and VB.NET, the .NET class System.Text.RegularExpressions.Regex holds one compiled regular expression. The simplest constructor takes just one parameter: a string that holds your regular expression.

If there’s a syntax error in the regular expression, the Regex() constructor will throw an ArgumentException. The exception message will indicate exactly which error was encountered. It is important to catch this exception if the regular expression is provided by the user of your application. Display the exception message and ask the user to correct the regular expression. If your regular expression is a hardcoded string literal, you can omit catching the exception if you use a code coverage tool to make sure the line is executed without throwing an exception. There are no possible changes to state or mode that could cause the same literal regex to compile in one situation and fail to compile in another. Note that if there is a syntax error in your literal regex, the exception will occur when your application is run, not when your application is compiled.

You should construct a Regex object if you will be using the regular expression inside a loop or repeatedly throughout your application. Constructing the regex object involves no extra overhead. The static members of the Regex class that take the regex as a string parameter construct a Regex object internally anyway, so you might just as well do it in your own code and keep a reference to the object.

If you plan to use the regex only once or a few times, you can use the static members of the Regex class instead, to save a line of code. The static Regex members do not throw away the internally constructed regular expression object immediately; instead, they keep a cache of the 15 most recently used regular expressions. You can change the cache size by setting the Regex.CacheSize property. The cache is keyed on your regular expression string, so each static call has to look up that string. But don’t go overboard with the cache. If you need lots of regex objects frequently, keep a cache of your own that you can look up more efficiently than with a string search.

Java

In Java, the Pattern class holds one compiled regular expression. You can create objects of this class with the Pattern.compile() class factory, which requires just one parameter: a string with your regular expression.

If there’s a syntax error in the regular expression, the Pattern.compile() factory will throw a PatternSyntaxException. The exception message will indicate exactly which error was encountered. It is important to catch this exception if the regular expression is provided by the user of your application. Display the exception message and ask the user to correct the regular expression. If your regular expression is a hardcoded string literal, you can omit catching the exception if you use a code coverage tool to make sure the line is executed without throwing an exception. There are no possible changes to state or mode that could cause the same literal regex to compile in one situation and fail to compile in another. Note that if there is a syntax error in your literal regex, the exception will occur when your application is run, not when your application is compiled.

Unless you plan to use a regex only once, you should create a Pattern object instead of using the convenience methods of the String class, such as matches(), replaceAll(), and split(), or the static Pattern.matches() method. Though it takes a few lines of extra code, that code will run more efficiently. The convenience calls recompile your regex each and every time. In fact, Java provides such calls for only a few very basic regex tasks.

A Pattern object only stores a compiled regular expression; it does not do any actual work. The actual regex matching is done by the Matcher class. To create a Matcher, call the matcher() method on your compiled regular expression. Pass the subject string as the only argument to matcher().

You can call matcher() as many times as you like to use the same regular expression on multiple strings, and you can work with multiple matchers using the same regex at the same time. Pattern objects are immutable and safe to share between threads, but Matcher objects are not thread-safe. If you want to use the same regex in multiple threads, share the compiled Pattern and call matcher() separately in each thread so that every thread has its own Matcher.

If you’re done applying a regex to one string and want to apply the same regex to another string, you can reuse the Matcher object by calling reset(). Pass the next subject string as the only argument. This is more efficient than creating a new Matcher object. reset() returns the same Matcher you called it on, allowing you to easily reset and use a matcher in one line of code—for example, regexMatcher.reset(nextString).find().

JavaScript

The notation for literal regular expressions shown in Recipe 3.1 already creates a new regular expression object. To use the same object repeatedly, simply assign it to a variable.

If you have a regular expression stored in a string variable (e.g., because you asked the user to type in a regular expression), use the RegExp() constructor to compile the regular expression. Notice that the regular expression inside the string is not delimited by forward slashes. Those slashes are part of JavaScript’s notation for literal RegExp objects, rather than part of the regular expression itself.

Tip

Since assigning a literal regex to a variable is trivial, most of the JavaScript solutions in this chapter omit this line of code and use the literal regular expression directly. In your own code, when using the same regex more than once, you should assign the regex to a variable and use that variable instead of pasting the same literal regex multiple times into your code. This increases performance and makes your code easier to maintain.

XRegExp

If you want to use XRegExp’s enhancements to JavaScript’s regular expression syntax, you have to use the XRegExp() constructor to compile the regular expression. For best performance when using the same regular expression repeatedly, you should assign it to a variable. Pass that variable to methods of the XRegExp class when using the regular expression.

In situations where it isn’t practical to keep a variable around to hold the XRegExp object, you can use the XRegExp.cache() method to compile the regular expression. This method will compile each regular expression only once. Each time you call it with the same parameters, it will return the same XRegExp instance.

PHP

PHP does not provide a way to store a compiled regular expression in a variable. Whenever you want to do something with a regular expression, you have to pass it as a string to one of the preg functions.

The preg functions keep a cache of up to 4,096 compiled regular expressions. Although the hash-based cache lookup is not as fast as referencing a variable, the performance hit is not as dramatic as having to recompile the same regular expression over and over. When the cache is full, the regex that was compiled the longest ago is removed.

Perl

You can use the “quote regex” operator to compile a regular expression and assign it to a variable. It uses the same syntax as the match operator described in Recipe 3.1, except that it starts with the letters qr instead of the letter m.

Perl is generally quite efficient at reusing previously compiled regular expressions. Therefore, we don’t use qr// in the code samples in this chapter. Only Recipe 3.5 demonstrates its use.

qr// is useful when you’re interpolating variables in the regular expression or when you’ve retrieved the whole regular expression as a string (e.g., from user input). With qr/$regexstring/, you can control when the regex is recompiled to reflect the new contents of $regexstring. m/$regexstring/ would recompile the regex every time, whereas m/$regexstring/o never recompiles it. Recipe 3.4 explains /o.

Python

The compile() function in Python’s re module takes a string with your regular expression, and returns an object with your compiled regular expression.

You should call compile() explicitly if you plan to use the same regular expression repeatedly. All the functions in the re module first call compile(), and then call the function you wanted on the compiled regular expression object.

The re module keeps a cache of recently compiled regular expressions: 100 at the time of writing, though the limit and the eviction behavior are implementation details that vary between Python versions. This reduces the recompilation of any cached regular expression to a dictionary lookup. In the versions described here, the cache is cleared out entirely when it is full.

If performance is not an issue, the cache works well enough that you can use the functions in the re module directly. But when performance matters, calling compile() is a good idea.
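
As a rough illustration of the two approaches, the sketch below compiles a pattern once and reuses it in a loop, then repeats the work with the module-level function. The pattern, the sample strings, and the variable names are made up for the example:

```python
import re

# Compile once, outside the loop; the pattern here is only an example.
word_re = re.compile(r"\b\w+\b")

lines = ["one two", "three", "four five six"]
counts = [len(word_re.findall(line)) for line in lines]

# The module-level function gives the same result, but has to find the
# pattern in the re module's cache on every call.
same_counts = [len(re.findall(r"\b\w+\b", line)) for line in lines]

print(counts, same_counts)
```

Both lists hold the same numbers. The difference is only in how much work is repeated per call, which is why the explicit compile() pays off mainly in tight loops.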

Ruby

The notation for literal regular expressions shown in Recipe 3.1 already creates a new regular expression object. To use the same object repeatedly, simply assign it to a variable.

If you have a regular expression stored in a string variable (e.g., because you asked the user to type in a regular expression), use the Regexp.new() factory or its synonym Regexp.compile() to compile the regular expression. Notice that the regular expression inside the string is not delimited by forward slashes. Those slashes are part of Ruby’s notation for literal Regexp objects and are not part of the regular expression itself.

Tip

Since assigning a literal regex to a variable is trivial, most of the Ruby solutions in this chapter omit this line of code and use the literal regular expression directly. In your own code, when using the same regex more than once, you should assign the regex to a variable and use the variable instead of pasting the same literal regex multiple times into your code. This increases performance and makes your code easier to maintain.

Compiling a Regular Expression Down to CIL

C#

Regex regexObj = new Regex("regex pattern", RegexOptions.Compiled);

VB.NET

Dim RegexObj As New Regex("regex pattern", RegexOptions.Compiled)

Discussion

When you construct a Regex object in .NET without passing any options, the regular expression is compiled in the way described in the Discussion above. If you pass RegexOptions.Compiled as a second parameter to the Regex() constructor, the Regex class does something rather different: it compiles your regular expression down to CIL, also known as MSIL. CIL stands for Common Intermediate Language, a low-level programming language that is closer to assembly than to C# or Visual Basic. All .NET compilers produce CIL. When your application runs, the .NET just-in-time compiler compiles the CIL further down to machine code suitable for the user’s computer.

The benefit of compiling a regular expression with RegexOptions.Compiled is that it can run up to 10 times faster than a regular expression compiled without this option. The drawback is that this compilation can be up to two orders of magnitude slower than simply parsing the regex string into a tree. The CIL code also becomes a permanent part of your application until it is terminated. CIL code is not garbage collected.

Use RegexOptions.Compiled only if a regular expression is either so complex or needs to process so much text that the user experiences a noticeable wait during operations using the regular expression. The compilation and assembly overhead is not worth it for regexes that do their job in a split second.

See Also

Recipe 3.1 explains how to insert regular expressions as literal strings into source code.

Recipe 3.2 explains how to import the regular expression library into your source code. Some programming languages require this extra step before you can create regular expression objects.

Recipe 3.4 explains how to set regular expression options, which is done as part of literal regular expressions in some programming languages.
