Text processing is, together with mathematics, the most important basic discipline in programming. Almost all data produced by humans will, at some point, be represented as text, or strings, in your system.
As you learned in Chapter 2, D provides basic string types, making the definition and usage of strings a breeze. An additional feature of strings is that they already are encoded as 8-, 16-, or 32-bit Unicode. This means that you can use D immediately in internationalized environments. You also can manipulate the encoding of strings directly.
Considering that Tango is a general-purpose library, text processing is given a prominent place via the tango.text
package, and this chapter will cover most of the important functionality available there.
In this chapter, we'll begin by discussing the basic string-manipulation utilities present in Tango. Then we'll describe the Text
class that Tango provides as a string wrapper; discuss conversions between strings, numeric types, and dates and times; and finally, explain how to do text formatting using Tango's Layout
class.
The string-manipulation utilities in Tango are functions that perform common operations on strings. You will find all of the basic string operation utilities in tango.text.Util
. Other modules contain more advanced functionality, such as regular-expression handling and encoding-specific operations. This section focuses on the operations in the Util
module and the tango.text.stream
package.
tango.core.Array
contains generalized operations for arrays of all types, not only strings.
All of the functionality in tango.text.Util
is presented through function templates that can be instantiated implicitly and called in the same manner as any other free functions. They can generally be divided into three main groups of functionality:
Functions that can be used to modify a string's content in various ways
Functions that help you search text
Functions that let you split a string into several components, especially for iteration
Let's explore each of these groups, further subdividing the third into splitting strings and splitting streams.
The trim
and strip
functions are mainly used to clean up strings that may be padded at the start or the end, either with whitespace or other characters.
trim
is the simplest operation, removing whitespace from the beginning and end of the input text, so that you're left with the content only. The returned string is a slice of your original, so you will need to duplicate it if you want to manipulate it further.
strip
is a generalized version of trim
, where you can pass it any character that should be removed from either end of the input string. As with trim
, a clean slice of the original argument is returned.
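As a minimal sketch of how the two functions combine, the following assumes a field padded with both a filler character and whitespace. The helper name cleanField is hypothetical and only serves the illustration; trim and strip are the functions described above.

```d
import tango.text.Util;

// Hypothetical helper: remove padding from a field and return an independent copy.
char[] cleanField (char[] raw, char padding)
{
    // strip returns a slice of raw without the padding character at either end
    char[] unpadded = strip (raw, padding);
    // trim removes the surrounding whitespace, again returning a slice
    char[] content = trim (unpadded);
    // duplicate the slice so that callers can modify the result freely
    return content.dup;
}

void main ()
{
    char[] raw = "**  title  **";
    char[] field = cleanField (raw, '*');
    assert (field == "title");

    field[0] = 'T';                  // changes the copy only
    assert (raw == "**  title  **");
}
```

Because both functions return slices, the single dup at the end is the only allocation in this sketch.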
Several functions let you edit the content itself. replace
and substitute
both replace occurrences of a specified pattern with a different one. replace
replaces single characters with a new character. substitute
replaces a substring of the input with a new string. Both of these operations do the replacement in place to avoid allocations. If you want to keep the original, duplicate it before passing it to either of these functions, as in the following example:
char[] original = "A string to have some letters replaced";
auto result = replace (original.dup, 'r', 'l');
The end result of this example is the string "A stling to have some lettels leplaced."
The utilities include two simple operations to build strings: join
and repeat
. join
concatenates a list of strings into a single big one. An optional postfix can be passed to the function that will be appended to each of the joined strings except the last. If you want to control the output buffer, you can pass your own to the function. repeat
builds a string by repeating a pattern n times. This function also accepts an optional output buffer.
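The following sketch shows both builders in their simplest form, without a custom output buffer. The helper names makePath and rule are hypothetical, and the strings are placed in char[] variables first so that the function templates can be instantiated implicitly without surprises.

```d
import tango.text.Util;

// Hypothetical helper: join path segments with a slash between them.
char[] makePath (char[][] segments)
{
    char[] separator = "/";
    // the postfix is appended to every segment except the last
    return join (segments, separator);
}

// Hypothetical helper: a horizontal rule of the given width.
char[] rule (uint width)
{
    char[] dash = "-";
    return repeat (dash, width);
}

void main ()
{
    // "usr"[] makes the first element a char[], so the literal is an array of char[]
    char[][] segments = ["usr"[], "local", "bin"];
    assert (makePath (segments) == "usr/local/bin");
    assert (rule (5) == "-----");
}
```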
The next set of operations provides querying capabilities to help you find the location of certain substrings or to determine whether they are present.
contains
and containsPattern
check an input string for an embedded character or string, respectively. They both will return true
if a match is found.
locate
, locatePrior
, locatePattern
, and locatePatternPrior
operate in a similar manner, except that they return the position in the input string of the first match. The first two look for a character. locate
searches from the start of the string or starts at the optional start parameter. locatePrior
starts at the end going toward the beginning or starts at the optional starting position. locatePattern
and locatePatternPrior
do the same, but they look for substrings (patterns) instead. In the case that no match is found, the length of the source string is returned instead. Here is an example:
locatePatternPrior ("ababababaaaab", "aba", 8);
This example returns 6
as the first position where the pattern aba
starts, going backward from index 8
.
Typically, most libraries return -1
when a pattern has not been found. That Tango does not do this means that the result in all cases can be used as a valid index into a slice.
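To see why returning the length is convenient, consider this small sketch. The helper name beforeComment is hypothetical; it cuts a line at the first '#' and needs no special case for lines that contain none.

```d
import tango.text.Util;

// Hypothetical helper: the part of a line that precedes the first '#'.
char[] beforeComment (char[] line)
{
    // locate yields line.length when '#' is absent,
    // so the slice is valid in both cases
    return line[0 .. locate (line, '#')];
}

void main ()
{
    assert (beforeComment ("key = 1 # note") == "key = 1 ");
    assert (beforeComment ("key = 1") == "key = 1");
}
```

With a library that returns -1 for a failed search, the same helper would need an explicit test before slicing.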
If you need to check if a given character is whitespace, pass it to isSpace
, which will return true
if it is.
matching
, indexOf
, and mismatch
provide highly efficient routines that you can use for testing certain aspects of your strings. matching
returns true
if both of the strings are equal up to the specified length. indexOf
returns the index of the first match up to the specified length. In the case where no match is found, this length is returned. The returned index itself is zero-based. mismatch
does the opposite, returning the first index where the strings no longer match. If the strings are actually matching, the length is returned instead.
In the cases where you need to separate strings on a given pattern, several approaches are available through Tango. The first approach uses the operation delimit
, split
, or splitLines
, returning an array of strings. delimit
takes an array of delimiting characters, each of which results in a new string being put into the result array where one of those delimiters is found in the source string. split
does the same thing, except that it looks for one given pattern. splitLines
returns an array of distinct lines (as given by the presence of \n or \r\n
in the source string). All of these functions remove from the resulting arrays the characters or patterns used.
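A short sketch of the array-returning functions follows. The helper name fields and the sample strings are assumptions made for the illustration.

```d
import tango.text.Util;

// Hypothetical helper: the fields of one comma-separated record.
char[][] fields (char[] record)
{
    char[] separator = ", ";
    // split looks for the whole pattern; the pattern itself is dropped
    return split (record, separator);
}

void main ()
{
    char[] record = "one, two, three";
    char[][] parts = fields (record);
    assert (parts.length == 3);
    assert (parts[0] == "one" && parts[2] == "three");

    // delimit treats every character in its second argument as a delimiter
    char[] mixed = "a,b;c";
    char[] set = ",;";
    assert (delimit (mixed, set).length == 3);
}
```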
If you prefer to iterate over the resulting components instead of receiving an array of them, tango.text.Util
provides some efficient alternatives, using slices to make sure no allocations are made during the split operation itself. lines
, delimiters
, patterns
, and quotes
are all entities that can be used in foreach
. Here is an example:
foreach (segment; patterns("one, two, three", ", ")) { . . . }
This example will loop on each segment found in the string passed; in this case, one
, two
, and three
.
If you would rather replace the delimiting pattern with a new one, the new pattern can be passed as an additional string to patterns
. lines
will let you iterate over the lines in your string, similar to splitLines
. delimiters
is the iterator version of delimit
, whereas quotes
will ignore delimiters found inside a pair of quotation marks, such as in the following example:
foreach (segment; quotes ("one two 'three four' five", " ")) { . . . }
The results of this example are the segments one
, two
, three four
, and five
.
In Tango, you will also find proper support for splitting operations over streams, whether you stream data from a file, a socket, a pipe, or any other class implementing the InputStream
interface (see Chapter 7). Four stream iterators are available, all based on the StreamIterator
superclass, and you can further extend this number by creating other subclasses. The iterators can all be found in the tango.text.stream
package.
The following example demonstrates iterating over the elements of a stream:
import tango.io.FileConduit, tango.io.Console;
import tango.text.stream.LineIterator;

foreach (line; new LineIterator!(char) (new FileConduit ("filename")))
    Cout (line).newline;
This example uses the LineIterator
on a stream from a file. The iterators are templated for each of the three character types in D. In this example, char
is used. Each line is simply output to the console.
The other iterators are SimpleIterator
, which takes a delimiter parameter in addition to the stream; QuoteIterator
, which ignores delimiters within quotation marks; and RegexIterator
, which lets you specify a regular expression to use as the delimiting pattern.
Regular expressions are patterns that usually are more complex than those that match a literal string. Regular expressions can be considered a language of their own.
Tango's Text
class is in tango.text.Text
. This class wraps a string array for you, keeps a current selection of where you last interacted with the string, and ensures proper and efficient operation.
The Text
class provides an object-oriented approach to the most common functionality in the basic text-processing routines. It abstracts away the potentially complex parts of string manipulation, especially where there may be a danger of slicing into the middle of a character that spans several Unicode code units.
When instantiating Text
, the char
type is used if the string needs to be specified. Text
also implements the interface TextView
, which can be used as a read-only gateway to the string. TextView
's superinterface is UniText
, which enables conversion of the string to one of the other Unicode encodings.
The main means of operating on the string wrapped by Text
is to select a portion of it and then perform an operation on that selection.
The methods in the read-only view of Text
mainly let you query the text for various information and compare it to other instances, whether they are of the TextView
type or a D string type.
The length of the text, a hash of it, and the encoding (represented as a TypeInfo
instance) are available through the properties length
, toHash
, and encoding
.
Both opCmp
and opEquals
are overloaded, and the same functionality is also available through the methods compare
and equals
. In addition, you can check whether your text starts with or ends with a given substring.
The other methods of the read-only view are slice
, copy
, and comparator
. If you use slice
, you get the underlying array as a slice. The array is not in itself safe, as there is no way in D 1.0 to restrict its use, and so you are expected to respect that it should not be changed through this interface.
With comparator
, you can set the algorithms that are used in different comparison methods, and copy
accepts a target array into which you can copy the text's content.
After instantiating Text
, either with or without content, the content can be set/reset using two set methods. One takes a D string of the type that the instance was created with, and the other takes a TextView
instance. Both have an optional parameter that can be set to false
if the instance is not intended for modification of the contents.
Almost all of the methods in Text
return an instance of the enclosing class, making it possible to chain calls, as in this example:
auto text = new Text!(char)("The couple danced the rumba");
text.select("couple");
text.replace("party").prepend("large ").select("rumba");
text.replace("tango");
This sequence results in the content "The large party danced the tango."
Since selecting a portion of the text to operate on is important, the Text
class provides several ways to do this. To select a part of the text, you can use either an explicit or implicit approach.
select(int, int)
allows you to perform an explicit selection, as you can set the start and end index of your selection.
To perform an implicit selection, you pass a character, a pattern (D string), or a TextView
instance to select
or selectPrior
. These will search for the given argument, either toward the end or toward the front from the current selection, and set the selection to the match. If the argument wasn't found, the method you called will return false
.
When you need to see what is selected, you can obtain the selected slice via the selection
property. You can also get the starting index and length by calling selectionSpan
.
Most of the modifying methods of Text
operate based on the current selection. By using append
, you can append some text directly to the current selection. This text can be a character (or the same character multiple times), a D string, a TextView
instance, or a number. The append
methods that accept numeric arguments can take additional formatting hints. If you need to append text that is not in the same Unicode encoding as your text, append and transcode it by using the encode
method.
Similar to append
, prepend
allows you to put some text immediately ahead of the current selection.
A selection can also be replaced by a string by using replace
with the string you want to use as the replacement. To completely remove a selection, use remove
. You can also truncate
the text, in which case the default behavior is to truncate at the end of the current selection.
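Since an implicit selection can fail, it is worth checking the result of select before modifying anything. The following is a minimal sketch of that pattern; the helper name swapWord is hypothetical, and only the select, replace, and slice methods described in this section are used.

```d
import tango.text.Text;

// Hypothetical helper: replace the first occurrence of a word, if there is one.
char[] swapWord (char[] content, char[] from, char[] to)
{
    auto text = new Text!(char) (content);
    // select returns false when the pattern is not found,
    // in which case the content is left as it was
    if (text.select (from))
        text.replace (to);
    return text.slice;
}

void main ()
{
    assert (swapWord ("hello world", "world", "tango") == "hello tango");
    assert (swapWord ("hello world", "rumba", "tango") == "hello world");
}
```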
For additional power, Text
can be combined with the utilities in tango.text.Util
, as in the following example, where the line iterator is used to iterate over lines in a Text
instance.
auto source = new Text!(char)("one\ntwo\nthree");
foreach (line; Util.lines(source.slice)) { . . . }
Similarly, all instances of a word in a block of text can be replaced with another:
auto dst = new Text!(char);
foreach (element; Util.patterns ("all cows eat grass", "eat", "chew"))
{
    dst.append (element);
}
An important facet of text processing is the conversion of text to and from different representations, such as numeric values. Tango has a full set of operations to efficiently convert from text into the various numeric types in the language, as well as support for parsing date and time information.
Tango provides this conversion through three modules (tango.text.convert.Integer, Float, and TimeStamp) and makes sure that the conversion can be done without any heap activity unless strictly necessary for other reasons.
Each of these modules has two common operations that function as the main workhorses of the converters: parse
to convert from a text to some numeric type, and format
to convert from a number into a textual representation. These give you full control of the process.
Following the Tango convention, format
requires that you pass along an output buffer for the result. Here is an example where the number 15 is formatted:
char[10] output; auto result = format (output, 15);
In this example, the formatted number is placed into output
, which, in this case, is allocated on the stack. The result returned is a slice of output
containing the exact string. Many of the format
functions provide additional parameters so you can better control the formatted output.
Creating a value in a numeric type based on a textual representation is the opposite operation to formatting. parse
supports additional parameters to control the conversion, such as the radix to use.
The Integer
module lets you convert to or from the types short
, ushort
, int
, uint
, long
, and ulong
. In this module, parse
can take three arguments:
The first argument is the string to be parsed.
The second argument can be a uint
describing the radix. If omitted, 10
is the default.
As a third argument, you can pass a pointer to a uint
, which will be set to a number representing how much of the string was processed to create the result.
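The three arguments can be put to work together, as in the following sketch, which accepts a hexadecimal string only if all of it was consumed. The helper name parseHex is hypothetical, and the radix is passed explicitly rather than relying on the default.

```d
import Integer = tango.text.convert.Integer;

// Hypothetical helper: parse hexadecimal text, rejecting trailing garbage.
bool parseHex (char[] text, out long value)
{
    uint radix = 16;
    uint ate;       // set by parse to the number of characters it processed
    value = Integer.parse (text, radix, &ate);
    // succeed only if the whole, non-empty string was used
    return text.length > 0 && ate == text.length;
}

void main ()
{
    long value;
    assert (parseHex ("ff", value) && value == 255);
    assert (! parseHex ("ffzz", value));
}
```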
In the same vein, format
has two additional parameters with default values. The first two are always the output buffer and the number to be formatted. The third is a style identifier that says how the number is to be represented in the resulting text. You can choose from the values shown in Table 6-1.
Table 6.1. Style Identifiers Available for Integer.format
Style Identifier | Description |
---|---|
Style.Unsigned | Format as unsigned decimal |
Style.Signed | Format as signed decimal (the default) |
Style.Octal | Format as an octal number |
Style.Hex | Format as a lowercase hexadecimal number |
Style.HexUpper | Format as an uppercase hexadecimal number |
Style.Binary | Format as a binary number |
In addition to specifying how the number should be formatted, you can pass along a style flag, as shown in Table 6-2.
Table 6.2. Style Flags Available for Integer.format
Style Flag | Description |
---|---|
Flags.None | No modifiers are applied (the default) |
Flags.Prefix | Prefix the conversion with a radix specifier |
Flags.Plus | Prefix positive numbers with a plus sign (+) |
Flags.Space | Prefix positive numbers with a space |
Flags.Zero | Pad with zeros on the left |
The following example shows how you can format a number into hexadecimals including a prefix:
import Integer = tango.text.convert.Integer;
auto text = Integer.format (new char[32], 12345L, Integer.Style.HexUpper, Integer.Flags.Prefix);
In the example of formatting numbers as hexadecimals, tango.text.convert.Integer
has been imported using renamed imports, so that the operations in the module can be used through the Integer
namespace.
Along with parse
and format
, the Integer
module contains some convenience functions. toString
, toString16
, and toString32
will format a number for you, allocating the necessary storage as it goes along. toInt
and toLong
will parse a string, assuming that it is fully parsable.
You can also use convert
, which does not look for a radix in the input text. The trim
function will extract optional signs and the radix, while removing extraneous space at the start, leaving the digits ready for parsing.
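For the common case where you neither need a custom buffer nor partial parsing, the convenience functions are enough. The sketch below round-trips a number through its textual form; the helper name roundTrip is hypothetical.

```d
import Integer = tango.text.convert.Integer;

// Hypothetical helper: format a number as text and parse it back.
int roundTrip (int number)
{
    // toString allocates the storage for the result
    char[] text = Integer.toString (number);
    // toInt assumes that the whole string is a number
    return Integer.toInt (text);
}

void main ()
{
    assert (Integer.toString (1234) == "1234");
    assert (roundTrip (1234) == 1234);
    assert (roundTrip (-7) == -7);
}
```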
For the various floating-point types in D (float, double, and real), you should use tango.text.convert.Float
to convert between them and their text representations. This module is somewhat simpler to use than Integer
, as there are fewer variations in how the results can be formatted.
parse
normally takes only the string representing the number, but can take an additional pointer to a uint
, which will represent how much of the string was processed (or eaten) to create the resulting numeric type.
When formatting a floating-point number using format
, you can customize it using two additional parameters. The first parameter after the number to be formatted is the number of decimals to be used. The default is 6
. The second parameter is an int
indicating how many exponent places should be emitted, effectively saying at which point a number should start being formatted in scientific notation. You can use 0
for always and 2
for numbers larger than 100 or smaller than 0.01. The default is 10
. Here is an example of formatting a floating-point number:
auto text = Float.format (new char[64], 223.1456667, 5, 2);
This will convert the number into a string using five decimal places and scientific notation: 2.23145e+02
. The result will be a slice from the buffer, which is created on the heap in this example. Also in this example, the renaming of imports is used to create the Float
namespace.
As with Integer
, Float
has wrappers for the most common setup, negating the need to preallocate a buffer for the output. toString
, toString16
, and toString32
wrap format, whereas toDouble
wraps the parsing, requiring that the full string is parsable as a number.
Strings representing points in time can also be converted into a numeric value, a so-called timestamp, by importing tango.text.convert.TimeStamp
. The resulting value is usually the number of time units that have passed since a particular point, called an epoch. A widespread convention in modern computing counts seconds or milliseconds since the start of 1970 (the Unix epoch).
You can pass a string to parse
in one of the formats specified in Table 6-3, getting a timestamp in return. The one additional optional parameter is a pointer to a uint
saying how much of the string was parsed to create the timestamp. If the parsing fails, the predefined value InvalidEpoch
will be returned instead.
Table 6.3. Some Timestamp Formats Handled by tango.text.convert.TimeStamp
Format | Example |
---|---|
RFC 1123 | Sun, 06 Nov 1994 08:49:37 GMT |
RFC 850 | Sunday, 06-Nov-94 08:49:37 GMT |
asctime | Sun Nov 6 08:49:37 1994 |
DOS time | 12-31-06 08:49AM |
ISO-8601 | 2006-01-31 14:49:30,001 |
Formatting a timestamp using format
will yield a string in the RFC 1123 format. toString
, toString16
, and toString32
wrap this for the cases where you don't want to or don't need to preallocate a buffer for the output. The following example shows a string that is parsed before the resulting value is formatted.
auto date = "Sun, 06 Nov 1994 08:49:37 GMT";
auto msSinceJan1st0001 = TimeStamp.parse (date);
auto text = TimeStamp.format (new char[64], msSinceJan1st0001);
dostime
and iso8601
can be used to convert from the DOS time and ISO-8601 formats, respectively. rfc1123
, rfc850
, and asctime
do the work for parse
, and can be used directly if you are sure of the format of the timestamp's textual representation.
See tango.time.ISO8601
for a more complete module for ISO-8601 parsing.
As part of the Tango text-processing functionality, you will find a powerful text-formatting framework. It replaces what has typically been done by printf
in C and related languages, and is similar to the formatting frameworks of .NET and ICU. Tango's formatter is more flexible in how it can format. Tango itself uses the flexibility of the formatting system to extend it for locale support.
The formatter is accessible through several levels in Tango, depending on your needs. You will find the core functionality in tango.text.convert.Layout
, whereas tango.io.Stdout
provides formatting to the console, similar to printf
. Stdout
again utilizes tango.io.Print
, which wraps Layout
for all cases where you need to send formatted output to a stream. In addition, you can also use tango.text.convert.Sprint
, which wraps Layout
for heapless formatting, reusing memory from the construction of the Sprint
instance, whether this is allocated on the heap or the stack. Layout
is also supported in the tango.util.log
package and via tango.io.stream.FormatStream
for generic stream output.
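As a minimal sketch of the Sprint wrapper, the following formats repeatedly into one buffer that is allocated when the instance is constructed. The helper name progress and the buffer size are assumptions made for the illustration.

```d
import tango.text.convert.Sprint;

// Hypothetical helper: format a progress message into the Sprint buffer.
char[] progress (Sprint!(char) sprint, int done, int total)
{
    // the result is a slice of the buffer owned by sprint
    return sprint ("{} of {}", done, total);
}

void main ()
{
    // one buffer of 64 characters, reused by every call
    auto sprint = new Sprint!(char) (64);

    // copy the first result, since the next call reuses the same memory
    char[] first = progress (sprint, 1, 3).dup;
    char[] second = progress (sprint, 2, 3);

    assert (first == "1 of 3");
    assert (second == "2 of 3");
}
```

The dup is needed only because both results are kept; output that is consumed immediately can use the returned slice directly.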
In this section, first we will cover the format string and how it can be composed, then the Layout
class, and finally, the Locale
extension.
If you are going to print a formatted string to the console, you will typically do so using Stdout
, as in the following example:
import tango.io.Stdout;
Stdout.format ("Printing the value {} to the {}", 5, "console").newline;
Here, the first string passed to format
is the format string. It describes how you want your output to look, with the braces ({}
) as placeholders for dynamic content. In this example, the first pair of braces will be substituted with 5
and the second with console
. Typically, you will want to specify your output in more detail and also say something about where in the template the arguments should be put. The newline
at the end emits a newline and flushes Stdout
so that the message is seen on the console immediately.
A number within the braces functions as a zero-based index into format
's argument list. Thus, the previous format
call is equivalent to this:
Stdout.format ("Printing the value {0} to the {1}", 5, "console").newline;
Since the order for the example already is implicit, there is not much point to using this technique here. The line can be rewritten to the following:
Stdout.format ("Printing the value {1} to the {0}", "console", 5).newline;
The output will be the same, but now the second argument after the format string is used first. This becomes particularly useful when internationalizing your application. By passing format strings from different languages, and having a fixed order of the arguments, you can use the indices to make the words of the different languages be printed in a sane and correct order. The following example shows the idea with an English phrase in two word orders, "I can see Bill" and "Bill I can see" (the second form tends to be more common in poetry).
char[] s = "I";     // subject
char[] o = "Bill";  // object
Stdout.format ("{0} can see {1}.", s, o).newline;
Stdout.format ("{1} {0} can see.", s, o).newline;
An index can also be reused several times so that a particular value can be repeated, as follows:
Stdout.format ("Printing the value {0} and then the same again {0}", 5);
If you need to print a brace, escape it with another one, like this {{
.
A pair of braces is called a format item, and can contain up to three components, all of which are optional. The first one is the index (as shown in the previous example), the second is an alignment component, and the third is a format string component. Let's take a closer look at the latter two components.
You can use the alignment component of a format item to specify how much space a format item should take in the resulting string. The default is that the formatter uses as much space as is needed. If the alignment component specifies less than is actually needed, the specified value will be ignored. If more than needed is specified, the remaining space will be padded.
Alignment is specified by a number directly preceded by a comma, thus an index plus alignment component will become {index,alignment}
. Here's an example:
char[] myFName = "Johnny";
Stdout.formatln("First Name = |{0,15}|", myFName);
Stdout.formatln("Last Name = |{0,15}|", "Foo de Bar");
Stdout.formatln("First Name = |{0,-15}|", myFName);
Stdout.formatln("Last Name = |{0,-15}|", "Foo de Bar");
Stdout.formatln("First name = |{0,5}|", myFName);
An additional aspect of the alignment component is shown in the third and fourth calls to formatln.
(Stdout.formatln
emits an additional newline at the end of the output, but is otherwise the same as format
; this is an alternative to adding newline
at the end.) By negating the alignment, the printed text will be left-adjusted instead of being put to the right (the default). In other words, negation will pad behind, while the default behavior is to pad in front of the value. The output of the preceding example will look like this:
First Name = |         Johnny|
Last Name = |     Foo de Bar|
First Name = |Johnny         |
Last Name = |Foo de Bar     |
First name = |Johnny|
On the last line of this output, you can see how the alignment is ignored, as the amount of space specified wasn't enough to output the value Johnny
.
To apply additional cues to the formatter, you can use a format string component within the braces of the format item. When specifying it, prepend it with a colon, as in the following example:
Stdout.formatln ("I have {:G} birds on the roof", 100);
In the example, the letter G
is used, which stands for General and is also the default used if nothing is specified.
The format string component does not need to be only one letter. It can also be a string to indicate a more complex specification. Subclasses of Layout
, such as Locale
, can add support for more format string components or reinterpret existing ones, such as for localization purposes.
Table 6-4 shows which format components are supported for Layout
.
Table 6.4. Supported Format String Components in tango.text.convert.Layout
Format | Description |
---|---|
d | Decimal format (default) |
x | Hexadecimal format |
X | Uppercase hexadecimal format |
e | Scientific notation |
Appending a positive number immediately after one of the format components listed in Table 6-4 will give the minimum number of digits used to format the number. Here is an example:
Stdout.formatln ("A hexadecimal number follows: {0:X9}", 0xafe0000);
This will print an uppercase hexadecimal number using nine digits: 00AFE0000
.
When you are formatting using Tango, tango.text.convert.Layout
does most of the work in its convert
method, or sprint
when you want to format to a preallocated buffer. For sprint
, the first argument should be the output buffer, and the format string should be the second argument. For convert
, the format string is first. The arguments following the format string are those that will be formatted and substituted into the template string.
If you need to format a string, but don't want to print it to the console, you can instantiate Layout
directly. convert
is aliased to Layout
's opCall
, thus you can format as in this example:
import tango.text.convert.Layout;
auto layout = new Layout!(char); // Need to specify encoding you are going to use
auto result = layout("A format {}", "string");
If you want to format text in a different encoding, you can use the fromUtf8
, fromUtf16
, or fromUtf32
method.
tango.text.convert.Format
contains a Layout
already instantiated for UTF-8.
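Here is a small sketch that uses the ready-made Format instance to build a string, combining an index with an alignment component in each format item. The helper name row and the column widths are assumptions made for the illustration.

```d
import tango.text.convert.Format;

// Hypothetical helper: one row of a two-column report.
char[] row (char[] name, char[] value)
{
    // {0,-10}: first argument, left-adjusted in 10 columns
    // {1,6}:   second argument, right-adjusted in 6 columns
    return Format ("|{0,-10}|{1,6}|", name, value);
}

void main ()
{
    assert (row ("tango", "D") == "|tango     |     D|");
}
```

Nothing is printed here; the formatted string is returned so that it can be stored or sent elsewhere.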
The locale support works by subclassing Layout
, hooking into the various format string components, and specifying a lot more functionality. To use Locale
, you just need to instantiate it instead of Layout
.
import tango.text.locale.Locale;
auto locale = new Locale;
Note that Locale
has UTF-8 set as its default encoding. To enable localized output using Stdout
, set it as the layout engine using Stdout
's layout
property, as follows:
Stdout.layout = new Locale;
If nothing more is specified, Locale
will try to look up the locale settings of the user's computer and format according to what it finds. If you want to specify a specific locale setting, do this via Locale
's constructor, as follows:
auto locale = new Locale(Culture.getCulture("fr_FR"));
The format string components that are understood by Locale
when given a numeric value are shown in Table 6-5.
Table 6.5. Format String Components for tango.text.locale.Locale for Numeric Values
Format | Description |
---|---|
| General (default) |
| Decimal |
| Lowercase and uppercase hexadecimal |
| Binary string |
| Currency |
| Fixed-point |
| Number with a delimiter (, or .) every three digits |
All the format string components support an additional number to specify the precision of the formatted number.
The other kind of values localized by Locale
are tango.time.Time.Time
objects. Table 6-6 shows which format string components are available as shortcuts when a Time
instance is passed to the formatter. These shortcuts are substituted with a pattern in the formatter.
Table 6.6. Format String Components for tango.text.locale.Locale for Time Values
Format | Description | Pattern[a] |
---|---|---|
d | Short date | dd/MM/yyyy |
D | Long date | dddd dd MMMM yyyy |
f | Long date and short time | dddd dd MMMM yyyy HH':'mm |
F | Full date and time | dddd dd MMMM yyyy HH':'mm':'ss |
g | Short date and short time | dd/MM/yyyy HH':'mm |
G | Short date and long time | dd/MM/yyyy HH':'mm':'ss |
m, M | Day in month | MM MMMM |
r, R | RFC 1123 | ddd, dd MMM yyyy HH':'mm':'ss 'GMT' |
s | A sortable date time | yyyy'-'MM'-'dd'T'HH':'mm':'ss |
t | Short time | HH':'mm |
T | Long time | HH':'mm':'ss |
y, Y | Month in year | MMMM yyyy |
[a] This format is locale-independent. If a locale for a specific culture is chosen, the format may vary.
If the format string component for a Time
instance is more than one character, it means that it is a custom format, which may include the components shown in Table 6-7. The patterns shown in Table 6-6 are predefined examples of such patterns. Elements in these patterns that are enclosed in single quotation marks are printed as is. This is mainly used to escape parts of the pattern that otherwise may be interpreted in some capacity by the formatter.
Table 6.7. Custom Formatting Options for Time Instances
Format | Description |
---|---|
dddd | Full name of the day in the week |
ddd | The name of day in the week, three letters |
dd | The day in the month, two digits |
MMMM | Full name of month |
MMM | Short name of month, four letters |
MM | Month in the year, two digits |
yy, yyyy | Year, two or four digits |
HH | Hours |
mm | Minutes |
ss | Seconds |
The following is an example of printing a Time
instance to the console in a customized format, using direct console output via tango.io.Console.Cout
.
import tango.io.Console;
import tango.time.WallClock;
import tango.text.locale.Locale;
auto layout = new Locale;
Cout (layout ("{:ddd, dd MMMM yyyy HH':'mm':'ss z}", WallClock.now)).newline;
You should by now have a good understanding of the basic and intermediate routines present in Tango's text-processing functionality. In the next chapter, you'll learn about Tango's input/output packages.