This chapter explains how to implement regular expressions with your programming language of choice. The recipes in this chapter assume you already have a working regular expression at your disposal; the previous chapters can help in that regard. Now you face the job of putting a regular expression into your source code and actually making it do something.
We’ve done our best in this chapter to explain exactly how and why each piece of code works the way it does. Because of the level of detail in this chapter, reading it from start to finish may get a bit tedious. If you’re reading Regular Expression Cookbook for the first time, we recommend you skim this chapter to get an idea of what can or needs to be done. Later, when you want to implement one of the regular expressions from the following chapters, come back here to learn exactly how to integrate the regexes with your programming language of choice.
Chapters 4 through 9 use regular expressions to solve real-world problems. Those chapters focus on the regular expressions themselves, and many recipes in those chapters don’t show any source code at all. To make the regular expressions you find in those chapters work, simply plug them into one of the code snippets in this chapter.
Because the other chapters focus on regular expressions, they present their solutions for specific regular expression flavors, rather than for specific programming languages. Regex flavors do not correspond one-on-one with programming languages. Scripting languages tend to have their own regular expression flavor built-in, and other programming languages rely on libraries for regex support. Some libraries are available for multiple languages, while certain languages have multiple libraries available for them.
Many Flavors of Regular Expressions describes all the regular expression flavors covered in this book. Many Flavors of Replacement Text lists the replacement text flavors, used for searching and replacing with a regular expression. All of the programming languages covered in this chapter use one of these flavors.
This chapter covers eight programming languages. Each recipe has separate solutions for all eight programming languages, and many recipes also have separate discussions for all eight languages. If a technique applies to more than one language, we repeat it in the discussion for each of those languages. We’ve done this so you can safely skip the discussions of programming languages that you’re not interested in:
C# uses the Microsoft .NET Framework. The System.Text.RegularExpressions
classes
use the “.NET” regular expression and replacement text flavor.
This book covers C# 1.0 through 4.0, or Visual Studio 2002 until
Visual Studio 2010.
This book uses VB.NET and Visual Basic.NET to refer
to Visual Basic 2002 and later, to distinguish these versions from
Visual Basic 6 and earlier. Visual Basic now uses the Microsoft
.NET Framework. The System.Text.RegularExpressions
classes
use the “.NET” regular expression and replacement text flavor.
This book covers Visual Basic 2002 until Visual Basic 2010.
Java 4 is the first Java release to provide built-in
regular expression support through the java.util.regex
package. The java.util.regex
package uses the “Java” regular expression and replacement text
flavor. This book covers Java 4, 5, 6, and 7.
This is the regex flavor used in the programming language commonly known as JavaScript. All modern web browsers implement it: Internet Explorer (as of version 5.5), Firefox, Opera, Safari, and Chrome. Many other applications also use JavaScript as a scripting language.
Strictly speaking, in this book we use the term JavaScript to indicate the programming language defined in versions 3 and 5 of the ECMA-262 standard. This standard defines the ECMAScript programming language, which is better known through its implementations JavaScript and JScript in various web browsers.
ECMA-262v3 and ECMA-262v5 also define the regular expression and replacement text flavors used by JavaScript. Those flavors are labeled as “JavaScript” in this book.
XRegExp is an open source JavaScript library developed by Steven Levithan. You can download it at http://xregexp.com. XRegExp extends JavaScript’s regular expression syntax. XRegExp also provides replacement functions for JavaScript’s regex matching functions for better cross-browser consistency, as well as new higher-level functions that make tasks such as iterating over all matches easier.
Most recipes in this chapter do not have separate JavaScript and XRegExp solutions. You can use the standard JavaScript solutions with regular expressions created by XRegExp. In situations where XRegExp’s methods offer a significantly better solution, we show code for both standard JavaScript, as well as JavaScript with XRegExp.
PHP has three sets of regular expression functions.
We strongly recommend using the preg
functions. Therefore, this book only covers the preg
functions, which are built into PHP as of version 4.2.0. This book
covers PHP 4 and 5. The preg
functions are PHP wrappers around the PCRE library. The PCRE regex
flavor is indicated as “PCRE” in this book. Since PCRE does not
include search-and-replace functionality, the PHP developers
devised their own replacement text syntax for preg_replace
. This replacement text flavor
is labeled “PHP” in this book.
The mb_ereg
functions are part of PHP’s
“multibyte” functions, which are designed to work well with
languages that are traditionally encoded with multibyte character
sets, such as Japanese and Chinese. In PHP 5, the mb_ereg
functions use the Oniguruma regex
library, which was originally developed for Ruby. The Oniguruma
regex flavor is indicated as “Ruby 1.9” in this book. Using the
mb_ereg
functions is recommended only if you
have a specific requirement to deal with multibyte code pages and you’re
already familiar with the mb_
functions in PHP.
The ereg
group of functions is the oldest set of PHP regex functions, and
are officially deprecated as of PHP 5.3.0. They don’t depend on
external libraries, and implement the POSIX ERE flavor. This
flavor offers only a limited feature set, and is not discussed in this book.
POSIX ERE is a strict subset of the Ruby 1.9 and PCRE flavors. You
can take the regex from any ereg
function call and use it with mb_ereg
or preg
.
For preg
,
you have to add Perl-style delimiters (Recipe 3.1).
Perl’s built-in support for regular expressions is
the main reason why regexes are popular today. The regular
expression and replacement text flavors used by Perl’s m//
and s///
operators are labeled as “Perl” in this book. This book covers
Perl 5.6, 5.8, 5.10, 5.12, and 5.14.
Python supports regular expressions through its
re
module. The
regular expression and replacement text flavor used by this module
are labeled “Python” in this book. This book covers Python 2.4
until 3.2.
Ruby has built-in support for regular expressions. This book covers Ruby 1.8 and Ruby 1.9. These two versions of Ruby have different default regular expression engines. Ruby 1.9 uses the Oniguruma engine, which has more regex features than the classic engine in Ruby 1.8. Regex Flavors Covered by This Book has more details on this.
In this chapter, we don’t talk much about the differences between Ruby 1.8 and 1.9. The regular expressions in this chapter are very basic, and they don’t use the new features in Ruby 1.9. Because the regular expression support is compiled into the Ruby language itself, the Ruby code you use to implement your regular expressions is the same, regardless of whether you’ve compiled Ruby using the classic regex engine or the Oniguruma engine. You could recompile Ruby 1.8 to use the Oniguruma engine if you need its features.
The programming languages in the following list aren’t covered by this book, but they do use one of the regular expression flavors in this book. If you use one of these languages, you can skip this chapter, but all the other chapters are still useful:
ActionScript is Adobe’s implementation of the ECMA-262 standard. As of version 3.0, ActionScript has full support for ECMA-262v3 regular expressions. This regex flavor is labeled “JavaScript” in this book. The ActionScript language is also very close to JavaScript. You should be able to adapt the JavaScript examples in this chapter for ActionScript.
C can use a wide variety of regular expression libraries. The open source PCRE library is likely the best choice out of the flavors covered by this book. You can download the full C source code at http://www.pcre.org. The code is written to compile with a wide range of compilers on a wide range of platforms.
C++ can use a wide variety of regular expression libraries. The open source PCRE library is likely the best choice out of the flavors covered by this book. You can either use the C API directly or use the C++ class wrappers included with the PCRE download itself (see http://www.pcre.org).
On Windows, you could import the VBScript 5.5 RegExp COM object, as explained later for Visual Basic 6. That could be useful for regex consistency between a C++ backend and a JavaScript frontend.
C++ TR1 defines a <regex>
header file that defines
functions such as regex_search()
, regex_match()
, and regex_replace()
that you can use to search
through strings, validate strings, and search-and-replace through
strings with regular expressions. The regular expression support
in C++ TR1 is based on the Boost.Regex library. You can use the
Boost.Regex library if your C++ compiler does not support TR1. You
can find full documentation at http://www.boost.org/libs/regex/.
Delphi XE was the first version of Delphi to have
built-in support for regular expressions. The regex features are
unchanged in Delphi XE2. The RegularExpressionsAPI
unit is a thin wrapper
around the PCRE library. You won’t use this unit directly.
The RegularExpressionsCore
unit implements the
TPerlRegEx
class. It provides a full set of
methods to search, replace, and split strings using regular
expressions. It uses the UTF8String
type for all strings, as PCRE is
based on UTF-8. You can use the TPerlRegEx
class in situations where you
want full control over when strings are converted to and from
UTF-8, or if your data is in UTF-8 already. You can also use this
unit if you’re porting code from an older version of Delphi that
used Jan Goyvaerts’s TPerlRegEx
class. The RegularExpressionsCore
unit
is based on code that Jan Goyvaerts donated to Embarcadero.
The RegularExpressions
unit is the one you’ll
use most for new code. It implements records such as TRegex
and TMatch
that have names and
methods that closely mimic the regular expression classes in the
.NET Framework. Because they’re records, you don’t have to worry
about explicitly creating and destroying them. They provide many
static methods that allow you to use a regular expression with
just a single line of code.
If you are using an older version of Delphi, your best
choice is Jan Goyvaerts’s TPerlRegEx class. You can download the
full source code at http://www.regexp.info/delphi.html. It is open
source under the Mozilla Public License. The latest release of
TPerlRegEx
is fully
compatible with the RegularExpressionsCore
unit in Delphi XE.
For new code written in Delphi 2010 or earlier, using the latest
release of TPerlRegEx is strongly recommended. If you later
migrate your code to Delphi XE, all you have to do is replace
PerlRegEx
with
RegularExpressionsCore
in the uses clause of
your units. When compiled with Delphi 2009 or Delphi 2010, the
PerlRegEx
unit uses
UTF8String
and
fully supports Unicode. When compiled with Delphi 2007 or earlier,
the unit uses AnsiString
and does not support
Unicode.
Another popular PCRE wrapper for Delphi is the TJclRegEx
class part of the JCL library at
http://www.delphi-jedi.org. It is also open
source under the Mozilla Public License.
In Delphi Prism, you can use the regular expression
support provided by the .NET Framework. Simply add System.Text.RegularExpressions
to the
uses
clause of any Delphi Prism unit in which you want to use regular
expressions.
Once you’ve done that, you can use the same techniques shown in the C# and VB.NET code snippets in this chapter.
You can use regular expressions in Groovy with the
java.util.regex
package, just as you can in Java. In fact, all of the Java
solutions in this chapter should work with Groovy as well.
Groovy’s own regular expression syntax merely provides notational
shortcuts. A literal regex delimited with forward slashes is an
instance of java.lang.String
and the =~
operator
instantiates java.util.regex.Matcher
. You can freely
mix the Groovy syntax with the standard Java syntax—the classes
and objects are all the same.
PowerShell is Microsoft’s shell-scripting language,
based on the .NET Framework. PowerShell’s built-in -match
and
-replace
operators use the .NET regex flavor and replacement text as
described in this book.
The R Project supports regular expressions via the
grep
,
sub
,
and regexpr
functions in the base
package. All these
functions take an argument labeled perl
, which is FALSE
if you omit it. Set it to TRUE
to use the PCRE regex
flavor as described in this book. The regular expressions shown
for PCRE 7 work with R 2.5.0 and later. For earlier versions of R,
use the regular expressions marked as “PCRE 4 and later” in this
book. The “basic” and “extended” flavors supported by R are older
and limited regex flavors not discussed in this book.
REALbasic has a built-in RegEx
class. Internally, this class uses the UTF-8 version of the PCRE
library. This means that you can use PCRE’s Unicode support, but
you have to use REALbasic’s TextConverter
class to convert non-ASCII
text into UTF-8 before passing it to the RegEx
class.
All regular expressions shown in this book for PCRE 7 will
work with REALbasic 2011. One caveat is that in REALbasic, the
“case insensitive” (Regex.Options.CaseSensitive
) and “^ and
$ match at line breaks” (Regex.Options.TreatTargetAsOneLine
)
options are on by default. If you want to use a regular expression
from this book that does not tell you to turn on these matching
modes, you have to turn them off explicitly in
REALbasic.
Scala provides built-in regex support through the
scala.util.matching
package. This
support is built on the regular expression engine in Java’s
java.util.regex
package. The regular expression and replacement text flavors used
by Java and Scala are labeled “Java” in this book.
Visual Basic 6 is the last version of Visual Basic that does not require the .NET Framework. That also means Visual Basic 6 cannot use the excellent regular expression support of the .NET Framework. The VB.NET code samples in this chapter won’t work with VB 6 at all.
Visual Basic 6 does make it very easy to use the functionality provided by ActiveX and COM libraries. One such library is Microsoft’s VBScript scripting library, which has decent regular expression capabilities starting with version 5.5. The scripting library implements the same regular expression flavor used in JavaScript, as standardized in ECMA-262v3. This library is part of Internet Explorer 5.5 and later. It is available on all computers running Windows XP or Vista, and previous versions of Windows if the user has upgraded to IE 5.5 or later. That includes almost every Windows PC that is used to connect to the Internet.
To use this library in your Visual Basic application, select Project|References in the VB IDE’s menu. Scroll down the list to find the item “Microsoft VBScript Regular Expressions 5.5”, which is immediately below the “Microsoft VBScript Regular Expressions 1.0” item. Make sure to tick the 5.5 version. The 1.0 version is only provided for backward compatibility, and its capabilities are less than satisfactory.
After adding the reference, you can see which classes and class members the library provides. Select View|Object Browser in the menu. In the Object Browser, select the “VBScript_RegExp_55” library in the drop-down list in the upper-left corner.