The .NET Framework Class Library (FCL) includes the System.Text.RegularExpressions namespace, which is devoted to creating regular expressions, executing them against a string, and obtaining the results.
Regular expressions take the form of a pattern that can be matched to zero or more characters within a string. The simplest of these patterns, such as .* (match any run of characters, which by default stops at a newline) and [A-Za-z] (match any single ASCII letter), are easy to learn, but more advanced patterns can be difficult to learn and even more difficult to implement correctly. Learning and understanding regular expressions can take considerable time and effort, but the work will pay off.
A regular expression pattern can be as simple as a single word or character, or it can be much more complex. The more complex patterns can recognize and match such things as the year portion of a date, all of the <SCRIPT> tags in an ASP page, or a phrase in a sentence that varies with each use. The .NET regular expression classes provide a very flexible and powerful way to do such things as recognize text, replace text within a string, and split up text into individual sections based on one or more complex delimiters.
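The three tasks named above (recognizing text, replacing text, and splitting text) can be sketched with the static Regex.Match, Regex.Replace, and Regex.Split methods. This is only an illustration: the class name RegexTasksSketch, the method names, and the patterns themselves are assumptions chosen for the example, not code from this chapter.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; names and patterns are illustrative assumptions.
public static class RegexTasksSketch
{
    // Recognize text: return the four-digit year of a date such as "12/25/2003",
    // or null if the input does not contain a date in that form.
    public static string ExtractYear(string date)
    {
        Match m = Regex.Match(date, @"\b\d{1,2}/\d{1,2}/(\d{4})\b");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Replace text: collapse every run of whitespace into a single space.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ");
    }

    // Split text: the delimiter is one or more commas or semicolons,
    // together with any whitespace around them.
    public static string[] SplitFields(string text)
    {
        return Regex.Split(text, @"\s*[,;]+\s*");
    }
}
```

For example, ExtractYear("12/25/2003") yields "2003", and SplitFields("red, green;;blue") yields the three fields red, green, and blue.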
Despite the complexity of regular expression patterns, the regular expression classes in the FCL are easy to use in your applications. Executing a regular expression consists of the following steps:
1. Create an instance of the Regex class that contains the regular expression pattern along with any options for executing that pattern.
2. Retrieve a reference to a Match object by calling the Match instance method if you want only the first match found, or to a MatchCollection object by calling the Matches instance method if you want more than just the first match found.
3. If you’ve called the Matches method to retrieve a MatchCollection object, iterate over the MatchCollection using a foreach loop. Each iteration gives access to the next Match object that the regular expression produced.
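The steps above can be sketched as follows. The class name RegexStepsSketch, its method names, and the word-matching pattern [A-Za-z]+ are assumptions made for illustration; only Regex, Match, Matches, and MatchCollection come from the FCL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class illustrating the three steps.
public static class RegexStepsSketch
{
    // Returns the first run of letters in the input, or an empty string.
    public static string FirstWord(string input)
    {
        // Step 1: create the Regex with its pattern and options.
        Regex wordRegex = new Regex(@"[A-Za-z]+", RegexOptions.None);

        // Step 2: Match returns only the first match found.
        Match first = wordRegex.Match(input);
        return first.Success ? first.Value : string.Empty;
    }

    // Returns every run of letters in the input, in order of appearance.
    public static List<string> AllWords(string input)
    {
        Regex wordRegex = new Regex(@"[A-Za-z]+");

        // Step 2: Matches returns a MatchCollection holding all matches.
        MatchCollection matches = wordRegex.Matches(input);

        // Step 3: each iteration of the foreach loop yields one Match.
        List<string> words = new List<string>();
        foreach (Match m in matches)
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

Checking Match.Success before reading Value matters in step 2, because Match returns an unsuccessful Match object (not null) when nothing is found.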
You need to find one or more substrings corresponding to a particular pattern within a string. You need to be able to inform the searching code to return either all matching substrings or only the matching substrings that are unique within the set of all matched strings.
Call the FindSubstrings method, which executes a regular expression and obtains all matching text. This method returns either all matching results or only the unique matches; this behavior is controlled by the findAllUnique parameter. Note that if the findAllUnique parameter is set to true, the unique matches are returned sorted alphabetically. Its source code is as follows:
using System;
using System.Collections;
using System.Text.RegularExpressions;

public static Match[] FindSubstrings(string source, string matchPattern,
                                     bool findAllUnique)
{
    SortedList uniqueMatches = new SortedList();
    Match[] retArray = null;

    Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
    MatchCollection theMatches = RE.Matches(source);

    if (findAllUnique)
    {
        // Keep only the first Match for each distinct matched string;
        // the SortedList orders the entries by that string.
        for (int counter = 0; counter < theMatches.Count; counter++)
        {
            if (!uniqueMatches.ContainsKey(theMatches[counter].Value))
            {
                uniqueMatches.Add(theMatches[counter].Value,
                                  theMatches[counter]);
            }
        }

        retArray = new Match[uniqueMatches.Count];
        uniqueMatches.Values.CopyTo(retArray, 0);
    }
    else
    {
        // Return every match in the order it was found.
        retArray = new Match[theMatches.Count];
        theMatches.CopyTo(retArray, 0);
    }

    return (retArray);
}
The following method searches for any tags in an XML string; it does this by searching for a block of text that begins with the < character and ends with the > character.
This method first displays all unique tag matches present in the XML string and then displays all tag matches within the string:
public static void TestFindSubstrings()
{
    string matchPattern = "<.*>";

    // Each element sits on its own line, so the greedy <.*> pattern
    // yields one match per line.
    string source = @"<?xml version='1.0' encoding='UTF-8'?>
    <!-- my comment -->
    <![CDATA[<escaped> <><chars>>>>>]]>
    <Window ID='Main'>
        <Control ID='TextBox'>
            <Property Top='0' Left='0' Text='BLANK'/>
        </Control>
        <Control ID='Label'>
            <Property Top='0' Left='0' Caption='Enter Name Here'/>
        </Control>
        <Control ID='Label'>
            <Property Top='0' Left='0' Caption='Enter Name Here'/>
        </Control>
    </Window>";

    Console.WriteLine("UNIQUE MATCHES");
    Match[] x1 = FindSubstrings(source, matchPattern, true);
    foreach (Match m in x1)
    {
        Console.WriteLine(m.Value);
    }

    Console.WriteLine();
    Console.WriteLine("ALL MATCHES");
    Match[] x2 = FindSubstrings(source, matchPattern, false);
    foreach (Match m in x2)
    {
        Console.WriteLine(m.Value);
    }
}
The following text will be displayed:
UNIQUE MATCHES
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
</Control>
</Window>
<?xml version='1.0' encoding='UTF-8'?>
<Control ID='Label'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
<Property Top='0' Left='0' Text='BLANK'/>
<Window ID='Main'>

ALL MATCHES
<?xml version='1.0' encoding='UTF-8'?>
<!-- my comment -->
<![CDATA[<escaped> <><chars>>>>>]]>
<Window ID='Main'>
<Control ID='TextBox'>
<Property Top='0' Left='0' Text='BLANK'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
<Control ID='Label'>
<Property Top='0' Left='0' Caption='Enter Name Here'/>
</Control>
</Window>
As you can see, the regular expression classes in the FCL are quite easy to use. The first step is to create an instance of the Regex class that contains the regular expression pattern along with any options for running this pattern. The second step is to get a reference to a Match object, if you need only the first match found, or a MatchCollection object, if you need more than just the first match. To get this reference, call one of two instance methods on the Regex object created in the first step: Match returns a single match object (Match), and Matches returns a collection of match objects (MatchCollection).
The FindSubstrings method returns an array of Match objects that can be used by the calling code. You might have noticed that the unique elements are returned sorted, and the nonunique elements are not sorted. A SortedList, which is used by the FindSubstrings method to store unique strings that match the regular expression pattern, keeps its items sorted by key as they are added.
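The sorting behavior can be seen in isolation with a small sketch. The class name SortedListSketch and the method UniqueSortedKeys are hypothetical; the sketch stores null values because only the keys matter here, whereas FindSubstrings stores the Match object as the value.

```csharp
using System;
using System.Collections;

// Hypothetical helper class showing how a SortedList orders its keys.
public static class SortedListSketch
{
    // Returns the distinct strings from the input, ordered by the SortedList.
    public static string[] UniqueSortedKeys(string[] values)
    {
        SortedList unique = new SortedList();
        foreach (string v in values)
        {
            // Add throws an ArgumentException on a duplicate key,
            // so check with ContainsKey first.
            if (!unique.ContainsKey(v))
            {
                unique.Add(v, null);
            }
        }

        // Keys come back in sorted order regardless of insertion order.
        string[] keys = new string[unique.Count];
        unique.Keys.CopyTo(keys, 0);
        return keys;
    }
}
```

Passing the array { "pear", "apple", "pear", "fig" } returns apple, fig, pear: the duplicate is dropped and the insertion order is not preserved.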
The regular expression used in the TestFindSubstrings method is very simplistic and will work in most, but not all, conditions. For example, if two tags are on the same line, as shown here:
<tagData></tagData>
the regular expression will catch the entire line, not each tag separately. You could change the regular expression from <.*> to <[^>]*> to match only up to the closing > ([^>]* matches everything that is not a >). However, this will fail in the CDATA section, matching <![CDATA[<escaped>, <>, and <chars> instead of <![CDATA[<escaped> <><chars>>>>>]]>. The more complicated @"(<!\[CDATA.*>|<[^>]*>)" will match either <!\[CDATA.*> (a greedy match for everything within the CDATA section) or <[^>]*>, described previously. Note that the [ after <! must be escaped as \[; left unescaped, it would open a character class rather than match a literal bracket.
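The difference between the three patterns can be checked with a short sketch. The class name TagPatternSketch, its constant names, and the MatchValues helper are assumptions made for illustration. The opening bracket of the CDATA marker is written as \[ in the third pattern, because an unescaped [ would start a character class instead of matching a literal bracket.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class comparing the three tag patterns.
public static class TagPatternSketch
{
    // Greedy: runs from the first < to the last > on a line.
    public const string Greedy = "<.*>";

    // Stops at the first > after each <.
    public const string StopAtFirstClose = "<[^>]*>";

    // Tries a greedy CDATA match first, then falls back to the tag pattern.
    public const string CDataAware = @"(<!\[CDATA.*>|<[^>]*>)";

    // Returns the text of every match of the pattern in the source.
    public static List<string> MatchValues(string source, string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match m in Regex.Matches(source, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }
}
```

Against <tagData></tagData>, Greedy finds one match while the other two patterns find two. Against the CDATA line from the test string, StopAtFirstClose finds three fragments while CDataAware finds the whole section as a single match.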