You want to instantiate a regular expression object or otherwise compile a regular expression so you can use it efficiently throughout your application.
In C#, if you know the regex to be correct:
Regex regexObj = new Regex("regex pattern");
If the regex is provided by the end user (UserInput being a string variable):
try {
    Regex regexObj = new Regex(UserInput);
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
In VB.NET, if you know the regex to be correct:
Dim RegexObj As New Regex("regex pattern")
If the regex is provided by the end user (UserInput being a string variable):
Try
    Dim RegexObj As New Regex(UserInput)
Catch ex As ArgumentException
    'Syntax error in the regular expression
End Try
In Java, if you know the regex to be correct:
Pattern regex = Pattern.compile("regex pattern");
If the regex is provided by the end user (userInput being a string variable):
try {
    Pattern regex = Pattern.compile(userInput);
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
To be able to use the regex on a string, create a Matcher:
Matcher regexMatcher = regex.matcher(subjectString);
To use the regex on another string, you can create a new Matcher, as just shown, or reuse an existing one:
regexMatcher.reset(anotherSubjectString);
In JavaScript, a literal regular expression in your code:
var myregexp = /regex pattern/;
Regular expression retrieved from user input, as a string stored in the variable userinput:
var myregexp = new RegExp(userinput);
If you want to use XRegExp’s extended regular expression syntax in JavaScript, you need to create an XRegExp object from a string:
var myregexp = XRegExp("regex pattern");
In Perl, a literal regular expression in your code:
$myregex = qr/regex pattern/
Regular expression retrieved from user input, as a string stored in the variable $userinput:
$myregex = qr/$userinput/
Before the regular expression engine can match a regular expression to a string, the regular expression has to be compiled. This compilation happens while your application is running. The regular expression constructor or compile function parses the string that holds your regular expression and converts it into a tree structure or state machine. The function that does the actual pattern matching will traverse this tree or state machine as it scans the string. Programming languages that support literal regular expressions do the compilation when execution reaches the regular expression operator.
In C# and VB.NET, the .NET class System.Text.RegularExpressions.Regex holds one compiled regular expression. The simplest constructor takes just one parameter: a string that holds your regular expression.
If there’s a syntax error in the regular expression, the Regex() constructor will throw an ArgumentException. The exception message will indicate exactly which error was encountered. It is important to catch this exception if the regular expression is provided by the user of your application. Display the exception message and ask the user to correct the regular expression. If your regular expression is a hardcoded string literal, you can omit catching the exception if you use a code coverage tool to make sure the line is executed without throwing an exception. There are no possible changes to state or mode that could cause the same literal regex to compile in one situation and fail to compile in another. Note that if there is a syntax error in your literal regex, the exception will occur when your application is run, not when your application is compiled.
You should construct a Regex object if you will be using the regular expression inside a loop or repeatedly throughout your application. Constructing the regex object involves no extra overhead. The static members of the Regex class that take the regex as a string parameter construct a Regex object internally anyway, so you might just as well do it in your own code and keep a reference to the object.
If you plan to use the regex only once or a few times, you can use the static members of the Regex class instead, to save a line of code. The static Regex members do not throw away the internally constructed regular expression object immediately; instead, they keep a cache of the 15 most recently used regular expressions. You can change the cache size by setting the Regex.CacheSize property. The cache is searched by your regular expression string. But don’t go overboard with the cache. If you need lots of regex objects frequently, keep a cache of your own that you can look up more efficiently than with a string search.
In Java, the Pattern class holds one compiled regular expression. You can create objects of this class with the Pattern.compile() class factory, which requires just one parameter: a string with your regular expression.
If there’s a syntax error in the regular expression, the Pattern.compile() factory will throw a PatternSyntaxException. The exception message will indicate exactly which error was encountered. It is important to catch this exception if the regular expression is provided by the user of your application. Display the exception message and ask the user to correct the regular expression. If your regular expression is a hardcoded string literal, you can omit catching the exception if you use a code coverage tool to make sure the line is executed without throwing an exception. There are no possible changes to state or mode that could cause the same literal regex to compile in one situation and fail to compile in another. Note that if there is a syntax error in your literal regex, the exception will occur when your application is run, not when your application is compiled.
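As a minimal sketch of that advice, the following Java class checks a user-supplied pattern and returns a message that could be shown to the user. The class name UserRegexValidator and the method describeError() are hypothetical names chosen for illustration; getDescription() and getIndex() are standard methods of PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class UserRegexValidator {
    /**
     * Returns null if the pattern compiles; otherwise returns a short
     * description of the syntax error that can be shown to the user.
     * (Hypothetical helper, not part of java.util.regex.)
     */
    public static String describeError(String userInput) {
        try {
            Pattern.compile(userInput);
            return null;
        } catch (PatternSyntaxException ex) {
            // getDescription() holds the error text; getIndex() is the
            // approximate position in the pattern, or -1 if unknown.
            return ex.getDescription() + " (near index " + ex.getIndex() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeError("a(b"));   // unclosed group: prints an error description
        System.out.println(describeError("a(b)"));  // valid pattern: prints "null"
    }
}
```

Returning the description instead of rethrowing keeps the calling code free of regex-specific exception handling, which is one possible design among several.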
Unless you plan to use a regex only once, you should create a Pattern object instead of using the static members of the String class. Though it takes a few lines of extra code, that code will run more efficiently. The static calls recompile your regex each and every time. In fact, Java provides static calls for only a few very basic regex tasks.
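The difference can be sketched as follows. The class CompileOnce and its method countNumeric() are hypothetical names used only for this illustration; the pattern is compiled a single time and reused for every item in the list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CompileOnce {
    // Compiled once when the class is loaded, then reused for every call.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Counts the strings that consist entirely of digits. */
    public static int countNumeric(List<String> items) {
        int count = 0;
        for (String item : items) {
            // Gives the same result as item.matches("\\d+"), which would
            // compile the pattern again on every iteration.
            if (DIGITS.matcher(item).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNumeric(Arrays.asList("42", "abc", "007", "4x"))); // prints 2
    }
}
```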
A Pattern object only stores a compiled regular expression; it does not do any actual work. The actual regex matching is done by the Matcher class. To create a Matcher, call the matcher() method on your compiled regular expression. Pass the subject string as the only argument to matcher().
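A short sketch of this division of labor, assuming a hypothetical class FindAllWords whose words() method collects every run of word characters in a string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllWords {
    /** Returns every run of word characters found in the subject string. */
    public static List<String> words(String subject) {
        Pattern regex = Pattern.compile("\\w+");          // holds the compiled regex only
        Matcher regexMatcher = regex.matcher(subject);    // performs the actual matching
        List<String> result = new ArrayList<>();
        while (regexMatcher.find()) {
            result.add(regexMatcher.group());             // text of the current match
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("one two, three")); // prints [one, two, three]
    }
}
```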
You can call matcher() as many times as you like to use the same regular expression on multiple strings. You can work with multiple matchers using the same regex at the same time. A compiled Pattern is immutable and can safely be shared between threads, but a Matcher holds the state of a match attempt and is not thread-safe. If you want to use the same regex in multiple threads, share the Pattern and call matcher() separately in each thread.
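Using several matchers on one compiled regex at the same time could look like the sketch below, which runs in a single thread. The class TwoMatchers and the method sameNumbers() are hypothetical; the method walks two strings in step and checks whether they contain the same sequence of numbers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoMatchers {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Returns true if both strings contain the same numbers in the same order. */
    public static boolean sameNumbers(String a, String b) {
        // Two independent matchers created from the same compiled regex.
        Matcher first = NUMBER.matcher(a);
        Matcher second = NUMBER.matcher(b);
        while (true) {
            boolean foundA = first.find();
            boolean foundB = second.find();
            if (foundA != foundB) {
                return false;   // one string has more numbers than the other
            }
            if (!foundA) {
                return true;    // both strings are exhausted
            }
            if (!first.group().equals(second.group())) {
                return false;   // numbers at this position differ
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(sameNumbers("a1 b22", "x1 y22 z")); // prints true
        System.out.println(sameNumbers("1 2", "1 3"));         // prints false
    }
}
```

Each matcher keeps its own position in its own subject string, so advancing one does not affect the other.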
If you’re done applying a regex to one string and want to apply the same regex to another string, you can reuse the Matcher object by calling reset(). Pass the next subject string as the only argument. This is more efficient than creating a new Matcher object. reset() returns the same Matcher you called it on, allowing you to easily reset and use a matcher in one line of code, for example regexMatcher.reset(nextString).find().
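A minimal sketch of this reuse, assuming a hypothetical class ReuseMatcher that counts the lines containing an ISO-style date. One Matcher is created up front with an empty subject and then reset for each line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReuseMatcher {
    /** Counts the lines that contain at least one yyyy-mm-dd style date. */
    public static int countLinesWithDate(String[] lines) {
        Pattern regex = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
        Matcher regexMatcher = regex.matcher("");   // placeholder subject, replaced by reset()
        int count = 0;
        for (String line : lines) {
            // reset() returns the same Matcher, so the call can be chained.
            if (regexMatcher.reset(line).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {"due 2024-01-31", "no date", "2023-12-01 and 2023-12-02"};
        System.out.println(countLinesWithDate(lines)); // prints 2
    }
}
```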
In JavaScript, the notation for literal regular expressions shown in Recipe 3.1 already creates a new regular expression object. To use the same object repeatedly, simply assign it to a variable.
If you have a regular expression stored in a string variable (e.g., because you asked the user to type in a regular expression), use the RegExp() constructor to compile the regular expression. Notice that the regular expression inside the string is not delimited by forward slashes. Those slashes are part of JavaScript’s notation for literal RegExp objects, rather than part of the regular expression itself.
Since assigning a literal regex to a variable is trivial, most of the JavaScript solutions in this chapter omit this line of code and use the literal regular expression directly. In your own code, when using the same regex more than once, you should assign the regex to a variable and use that variable instead of pasting the same literal regex multiple times into your code. This increases performance and makes your code easier to maintain.
If you want to use XRegExp’s enhancements to JavaScript’s regular expression syntax, you have to use the XRegExp() constructor to compile the regular expression. For best performance when using the same regular expression repeatedly, you should assign it to a variable. Pass that variable to methods of the XRegExp class when using the regular expression.
In situations where it isn’t practical to keep a variable around to hold the XRegExp object, you can use the XRegExp.cache() method to compile the regular expression. This method will compile each regular expression only once. Each time you call it with the same parameters, it will return the same XRegExp instance.
PHP does not provide a way to store a compiled regular expression in a variable. Whenever you want to do something with a regular expression, you have to pass it as a string to one of the preg functions.
The preg functions keep a cache of up to 4,096 compiled regular expressions. Although the hash-based cache lookup is not as fast as referencing a variable, the performance hit is far smaller than that of recompiling the same regular expression over and over. When the cache is full, the regex that was compiled the longest ago is removed.
In Perl, you can use the “quote regex” operator to compile a regular expression and assign it to a variable. It uses the same syntax as the match operator described in Recipe 3.1, except that it starts with the letters qr instead of the letter m.
Perl is generally quite efficient at reusing previously compiled regular expressions. Therefore, we don’t use qr// in the code samples in this chapter. Only Recipe 3.5 demonstrates its use.
qr// is useful when you’re interpolating variables in the regular expression or when you’ve retrieved the whole regular expression as a string (e.g., from user input). With qr/$regexstring/, you can control when the regex is recompiled to reflect the new contents of $regexstring. m/$regexstring/ would recompile the regex every time, whereas m/$regexstring/o never recompiles it. Recipe 3.4 explains /o.
The compile() function in Python’s re module takes a string with your regular expression, and returns an object with your compiled regular expression.
You should call compile() explicitly if you plan to use the same regular expression repeatedly. All the functions in the re module first call compile(), and then call the function you wanted on the compiled regular expression object.
The compile() function keeps a reference to the last 100 regular expressions that it compiled. This reduces the recompilation of any of the last 100 used regular expressions to a dictionary lookup. When the cache is full, it is cleared out entirely.
If performance is not an issue, the cache works well enough that you can use the functions in the re module directly. But when performance matters, calling compile() is a good idea.
In Ruby, the notation for literal regular expressions shown in Recipe 3.1 already creates a new regular expression object. To use the same object repeatedly, simply assign it to a variable.
If you have a regular expression stored in a string variable (e.g., because you asked the user to type in a regular expression), use the Regexp.new() factory or its synonym Regexp.compile() to compile the regular expression. Notice that the regular expression inside the string is not delimited by forward slashes. Those slashes are part of Ruby’s notation for literal Regexp objects and are not part of the regular expression itself.
Since assigning a literal regex to a variable is trivial, most of the Ruby solutions in this chapter omit this line of code and use the literal regular expression directly. In your own code, when using the same regex more than once, you should assign the regex to a variable and use the variable instead of pasting the same literal regex multiple times into your code. This increases performance and makes your code easier to maintain.
When you construct a Regex object in .NET without passing any options, the regular expression is compiled in the way described in the Discussion section of this recipe. If you pass RegexOptions.Compiled as a second parameter to the Regex() constructor, the Regex class does something rather different: it compiles your regular expression down to CIL, also known as MSIL. CIL stands for Common Intermediate Language, a low-level programming language that is closer to assembly than to C# or Visual Basic. All .NET compilers produce CIL. The first time your application runs, the .NET Framework compiles the CIL further down to machine code suitable for the user’s computer.
The benefit of compiling a regular expression with RegexOptions.Compiled is that it can run up to 10 times faster than a regular expression compiled without this option. The drawback is that this compilation can be up to two orders of magnitude slower than simply parsing the regex string into a tree. The CIL code also becomes a permanent part of your application until it is terminated. CIL code is not garbage collected.
Use RegexOptions.Compiled only if a regular expression is either so complex or needs to process so much text that the user experiences a noticeable wait during operations using the regular expression. The compilation and assembly overhead is not worth it for regexes that do their job in a split second.
Recipe 3.1 explains how to insert regular expressions as literal strings into source code.
Recipe 3.2 explains how to import the regular expression library into your source code. Some programming languages require this extra step before you can create regular expression objects.
Recipe 3.4 explains how to set regular expression options, which is done as part of literal regular expressions in some programming languages.