Chapter 5
Cleaning Text Data

Preparing text data takes place on a number of levels. From the bottom up, there are technical (encoding) issues; string issues, which relate to the structuring of the storage format, case folding, and so on; and semantic issues, which relate to the meaning of text.

Cleaning on the technical level mostly concerns solving encoding issues. That is, can we properly interpret the sequence of bytes comprising a file as a set of symbols? We will find that there is no definite answer to this but there are tools that implement effective heuristics that work in many cases.

At the string level, we find issues like checking the file format (is a file proper CSV or HTML?), cleaning up punctuation or spurious white spaces, and localizing typos by approximate matching against a list of known or allowed terms. It also includes extracting the interesting portions of a text string, for example, by harvesting the URL that is stored in a <a href="[url]"> tag of an HTML file. Activities at the string level therefore consist of more or less classical techniques for finding, replacing, and substituting substrings, as well as approximate matching of strings based on string metrics. We shall find that methods based on pattern matching and methods based on approximate matching are in a way complementary approaches to common text cleaning problems and that it is effective to combine them.

The semantic domain involves activities that make use of the meaning of text. Here, one may distinguish many sublevels again by looking at the meaning of single terms, terms in the context of a sentence, sentences in the context of a text, and so on. Following this path, we quickly move into the domain of semantic text analysis, which is beyond the scope of this book. However, we discuss a few tools and techniques that are readily available, such as reducing verbs to their stem and language detection.

Notation of alphabets and strings: We introduce a few concepts and notation that will be used throughout this chapter. We already encountered the term alphabet in Section 3.2.1, where it was defined as a nonempty finite set of abstract symbols. Alphabets are denoted with the Greek capital letter Σ. Examples include the binary alphabet {0, 1} or the complete set of Unicode characters. Any finite nonempty subset of an alphabet is also an alphabet.

The symbols of an alphabet can be concatenated to form strings. The set of all strings that consist of zero or any finite number of characters is denoted as Σ*. Elements of Σ* are denoted as s, t, or u, and string literals will be printed between quotes. For example, if Σ = {a, b}, then "abba" is an element of Σ*. The string consisting of zero characters is a special case with its own notation: ε. The length of a string is the number of characters it is composed of (in Unicode: code points). We will use the notation |s| to indicate the string length. We have |ε| = 0 and |"abba"| = 4. The set of all strings of length n is denoted as Σ^n.

The definition of Σ^n gives rise to a more formal definition of Σ*, namely,

5.1   Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ ⋯

In fact, we may consider * as an operator that generates concatenations. We will encounter this operator again when we discuss regular expressions.

5.1 Character Normalization

Normalization is the process of making choices consistent. That is, in many cases, the same information may be represented in more than one way and before any further processing it is convenient to represent the same information in the same way. For example, the strings "Sarah", "sarah", and "SARAH" all refer to the same person but in a different representation.

Character normalization comes in two flavors: normalization of the encoding (technical representation of text) and normalization of symbols (character conversion). Both cases are discussed in the next section.

5.1.1 Encoding Conversion and Unicode Normalization

A fundamental step in text cleaning is making sure that all character data is stored in a single encoding scheme. In principle, any encoding that supports the alphabet the text is written in suffices. However, because of its broad support across many systems and software, and because it extends ASCII, it is a good idea to use UTF-8, at least for the internal representation of text.

Detecting in what encoding a file is stored was discussed in Section 3.2.7. Also recall that text data read in R is always internally stored as latin1 (Windows) or UTF-8 (others). Base R has several tools available to convert encoding, all of them based on an implementation of the iconv specification (The Open Group, 2004). To convert all elements of a character vector to UTF-8, one can use enc2utf8(). Converting between any pair of encoding schemes can be done with iconv().

  y <- iconv(x, to="[encoding]", from="[encoding]")

The encoding specifications that are supported depend on the system running R and can be requested by typing the following command.

  iconvlist()

The stringi package is not based on iconv but utilizes the ICU library (ICU, 2014) instead, thus guaranteeing platform independence. The stringi equivalent of enc2utf8 is stri_enc_toutf8(); general conversion between encoding schemes is done with stri_encode().

  y <- stringi::stri_encode(str, from="[encoding]", to="[encoding]")

The valid encoding specifications can be obtained with the following command:

  stringi::stri_enc_list()

The stringi package supports fewer encoding schemes than iconv on most or all systems supported by R but, unlike base R, it does guarantee platform independence.

As discussed in Section 3.2.3, some symbols can be represented in more than one way in Unicode. As an example, consider singly accented Latin characters. These may be represented by a single code point representing the accented character, or by two code points, where the first determines the base character and the second adds the accent. Before any further processing it is a good idea to normalize such cases since, for example, find-and-replace operations performed later on in the text may be sensitive to the underlying representation.

Unicode has several normal forms. As a first example, we normalize to the NFC form which is recommended by the Internet Engineering Task Force [IETF, Klensin and Padlipski (2008)]. To demonstrate Unicode normalization, we first define the letter ‘ö’ in two ways:

  x <- c("uf6"        # small latin 'o' with diaereses
       , "u6fu0308") # small latin 'o' followed by combining diaereses
  x
  ## [1] "ö" "ö"

The stored byte sequences can be revealed with iconv.

  iconv(x,toRaw=TRUE)
  ## [[1]]
  ## [1] c3 b6
  ##
  ## [[2]]
  ## [1] 6f cc 88

We now normalize each element of x and again inspect the byte sequences.

  y <- stringi::stri_trans_nfc(x)
  iconv(y,toRaw=TRUE)
  ## [[1]]
  ## [1] c3 b6
  ##
  ## [[2]]
  ## [1] c3 b6

So in NFC the collapsed version of ‘ö’ is preferred.

The abbreviation NFC stands for Normalization Form Canonical Composition. The algorithm behind it first decomposes, then recomposes (accented) characters in a predefined, unique order. Not all combined characters are decomposed by NFC. For example, ligatures such as ‘ﬀ’ are left alone by NFC (a ligature combines two or more letters into a single symbol, e.g., "f", "f", "i" becomes "ﬃ"). The NFKC normalization (Normalization Form Compatibility Composition) also takes those cases into account. Compare the following results:

  x <- "ufb00" # Latin small ligature ff
  x
  ## [1] "ff"
  nchar(x)
  ## [1] 1
  nchar(stringi::stri_trans_nfc(x))
  ## [1] 1
  nchar(stringi::stri_trans_nfkc(x))
  ## [1] 2

In the first case, the normalized x still consists of a single character representing the ligature ‘ﬀ’. In the second case, the ligature character is split into two separate f's. As a consequence, the letter ‘f’ will only be detected after NFKC normalization.

  grepl("f", stringi::stri_trans_nfc(x))
  ## [1] FALSE
  grepl("f", stringi::stri_trans_nfkc(x))
  ## [1] TRUE

Here, we used grepl to determine whether the letter ‘f’ occurs in the string.

Finally, we note that there are two more normal forms which are less interesting from a text cleaning point of view, namely, NFD (Normalization Form Canonical Decomposition), where (e.g., accented) symbols are decomposed into their constituent parts, and NFKD (Normalization Form Compatibility Decomposition), where combined (e.g., ligature) characters are decomposed as well.
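
A small sketch of NFD at work on the composed ‘ö’ from the earlier example; after decomposition the string consists of two code points.

  x <- "\uf6"   # 'ö' in composed form
  nchar(x)
  ## [1] 1
  nchar(stringi::stri_trans_nfd(x))
  ## [1] 2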

5.1.2 Character Conversion and Transliteration

Character conversion is a simple process where a user-specified set of symbols is replaced by another set of symbols. Mathematically, it can be interpreted as a function sending symbols from an alphabet Σ to an alphabet Σ′,

5.2   f : Σ → Σ′.

Here, f can be one-to-one or many-to-one, and Σ and Σ′ may or may not be equal. Since there is no such thing as an ‘empty character’, mappings that send a character to ‘nothing’ cannot be used to remove characters from a string. Character removal should be interpreted as an operation on strings, where a substring (possibly of length 1) is replaced by the empty string. Such operations are discussed in the next section (Section 5.2).

In R, general conversion maps can be specified with the function chartr.

  chartr(old = "[character]", new = "[character]", x = "[character]")

It takes as input a string of characters to convert, a string of characters to convert to, and a character vector of strings to convert. As an example, consider the soundex algorithm (see also Procedure 5.4.1). One step of this algorithm involves replacing every vowel (of the ASCII alphabet) by the small latin letter ‘a’. Using chartr this can be specified as follows:

  chartr("eiou","aaaa", c("hello world","Olá Mundo"))
  ## [1] "halla warld" "Olá Manda"

For each old character, a new character must be provided. The chartr function is general enough to define any map of the form (5.2), that is, any translation that can be done character by character. This example shows that chartr requires a precise specification: neither the capital letter ‘O’ nor the accented ‘á’ is converted; they must be specified explicitly.
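
As a small sketch, the map can be extended explicitly with the capital vowels and the accented ‘á’ (the set of characters added here is illustrative, not exhaustive):

  # also map capital vowels and the accented 'á' explicitly
  chartr("eiouEIOU\u00e1", "aaaaAAAAa", c("hello world","Olá Mundo"))
  ## [1] "halla warld" "Ala Manda"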

There are two character translation jobs that are so common that they are accessible as a separate functionality: case conversion (also called case folding) and transliteration. The base R functions toupper and tolower convert a string to completely upper or lowercase. The mapping between upper and lowercase is determined by the locale setting (LC_CTYPE, see Section 3.4) and need not be specified explicitly by the user.

  toupper(c("hello world","Olá Mundo"))
  ## [1] "HELLO WORLD" "OLÁ MUNDO"
  tolower(c("hello world","Olá Mundo"))
  ## [1] "hello world" "olá mundo"

The stringi package also offers a function to translate words to upper, lower, or title case.

  stringi::stri_trans_tolower(c("hello world","Olá Mundo"))
  ## [1] "hello world" "olá mundo"
  stringi::stri_trans_toupper(c("hello world","Olá Mundo"))
  ## [1] "HELLO WORLD" "OLÁ MUNDO"
  stringi::stri_trans_totitle(c("hello world","Olá Mundo"))
  ## [1] "Hello World" "Olá Mundo"

The advantages of the stringi package over base R are the ease of switching between locales (by the argument locale) and guaranteed consistency across operating systems. Moreover, the title case conversion, which necessarily involves word detection, has detailed controls for word boundary detection.
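
A brief sketch of these controls; the locale identifier "tr_TR" and the break iterator option used here are assumptions based on the stringi documentation.

  # Turkish locale: lowercase 'i' maps to the dotted capital 'İ'
  stringi::stri_trans_toupper("i", locale = "tr_TR")
  ## [1] "İ"
  # title case using sentence boundaries instead of word boundaries
  stringi::stri_trans_totitle("hello world",
    opts_brkiter = stringi::stri_opts_brkiter(type = "sentence"))
  ## [1] "Hello world"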

Transliteration is a mapping of the form (5.2), where Σ′ is either a subset of Σ or an altogether different alphabet. A common use of transliteration is to replace accented characters by characters without accent. The actual transformations are governed by library-specific transliteration rules that determine how to convert from one alphabet to another. In base R, transliteration can be achieved by appending //TRANSLIT to the encoding specification in iconv. The stringi package offers a specialized function called stri_trans_general.

  iconv(c("hello world","Olá Mundo"), to="ASCII//TRANSLIT")
  ## [1] "hello world" "Ola Mundo"
  stringi::stri_trans_general(c("hello world","Olá Mundo"),
  id="latin-ascii")
  ## [1] "hello world" "Ola Mundo"

In both cases, the non-ASCII character á has been replaced by a.

5.2 Pattern Matching with Regular Expressions

The ability to recognize a substring within a larger string lies at the heart of many string cleaning actions. Actions like trimming of white spaces, punctuation removal, or splitting strings into separate lines or words all rely on the fact that patterns in strings can be recognized.

The theory of pattern recognition in strings is closely tied to early developments in theoretical computer science. In fact, the so-called regular expressions, which may be considered one of the most basic tools in text processing, were first developed by Kleene (1951) in a paper on “Representation of events in nerve nets and finite automata”. The algebraic theory of regular expressions is nowadays well understood and described in standard textbooks on the foundations of computer science, such as the excellent book by Hopcroft et al. (2007). In this chapter, we give a compact overview of the most important features of regular expressions, focusing on their use within R. For a very thorough and complete overview of practical regular expressions, we recommend the book by Friedl (2006).

For our purposes, a regular expression can be interpreted as a short notation for defining a range of strings over an alphabet. Computer programs such as grep or sed, which are a standard part of Unix-like operating systems, accept regular expressions and are able to very efficiently find, or find and replace, substrings in larger pieces of text.

5.2.1 Basic Regular Expressions

A regular expression is a short notation that indicates a subset of Σ*, the set of all strings that can be composed of the symbols in an alphabet Σ. Regular expressions are capable of denoting finite or infinite subsets of Σ*, but not every subset of Σ* can be denoted by a regular expression.

All regular expressions can, in principle, be built up out of a few basic elements. As an example, consider the following regular expression that already contains all basic operations:

5.3   (a|b)c*

This expression stands for the set of strings that start with an "a" or a "b", followed by zero or more "c"'s. In other words, it corresponds to the set

5.4   L = {"a", "b", "ac", "bc", "acc", "bcc", "accc", ...}.

To understand the notation, we read expression (5.3) from left to right. Between brackets, we find a|b, where the vertical bar | is the familiar ‘or’ operator. Since these are the first substrings in the expression, this means all strings indicated by (5.3) must begin with either "a" or "b". Next, we find a c, followed by an asterisk *. The asterisk operator may be read as ‘zero or more of the previous’, where ‘the previous’ extends only to the character immediately preceding it, in this case the c.

Before continuing to the formal definition of these operators, let us look at an example of how regular expressions can be used in R. The function grepl accepts two main arguments: a regular expression and a character vector. It returns a logical vector indicating for each element of the character vector whether a substring matches the regular expression.

  str <- c("a","xacc","xdcc","xbccccxyac")
  grepl("(a|b)c*",str)
  ## [1]  TRUE  TRUE FALSE  TRUE

Here, the first element of str matches since the whole string ("a") is in the set L of Eq. (5.4). The second string, "xacc", is not in L, but it has a substring "acc" that is, so it matches. The same holds for the fourth element, while the third element "xdcc" has no substring that is in L.

The function sub replaces the first substring that matches a regular expression with another string. For example, to replace matching substrings from the example with "H", we do the following:

  sub("(a|b)c*", "H", str)
  ## [1] "H"      "xH"     "xdcc"   "xHxyac"

Observe that, indeed, while the fourth element of str has two substrings matching the regular expression, only the first is replaced. The function gsub replaces all occurrences.

  gsub("(a|b)c*", "H", str)
  ## [1] "H"     "xH"    "xdcc"  "xHxyH"

Formal Regular Expressions

Put together in a set, all regular expressions form an algebra. That is, regular expressions can be built up and combined systematically by applying a few basic operations, possibly over and over again. Each such operation yields a new and valid regular expression.

The first three rules for constructing regular expressions just introduce some notation and relate regular expressions to sets of strings.

  1. Rule 1 The symbols ε and ∅ are regular expressions, representing, respectively, the empty string "" and the empty subset of Σ*.
  2. Rule 2 If a is a symbol in Σ, then a is a regular expression representing {"a"}.
  3. Rule 3 If r is a regular expression, then so is (r).

The first rule defines mathematical cases that are necessary for a complete algebraic development of regular expressions. They will not be important for our purpose. Brackets have exactly the effect one expects of them: they determine the order in which combining operations are executed, just like in normal arithmetic. The next set of rules determines how regular expressions can be operated upon.

The fourth and fifth rules introduce concatenation and ‘or’-ing of regular expressions.

  1. Rule 4 If r and s are regular expressions, then so is rs.
  2. Rule 5 If r and s are regular expressions, then so is r|s.

The | notation indicates that a position in a string can be occupied by either of its operands. For example, the regular expression a(a|ab)a indicates the set {"aaa", "aaba"}. The set of strings defined by the concatenation of two regular expressions is the concatenation of all strings of the first with all strings of the second set. In other words, if L_r and L_s are the sets of strings defined by regular expressions r and s, then the set defined by rs is given by {vw : v ∈ L_r and w ∈ L_s}. For example, take

  r = a|b and s = cd, so that L_r = {"a", "b"} and L_s = {"cd"}.

Concatenating r with s, we get

  rs = (a|b)cd, with set of strings {"acd", "bcd"}.

The operations introduced thus far only allow us to define finite sets of strings. The star operator, introduced in the last rule, changes this.

  1. Rule 6 If r is a regular expression, then so is r*.

The star operator indicates ‘repeat the preceding symbol zero or more times’. The operator was first introduced by Kleene (1951), and it is therefore often referred to as the Kleene (star) operator. For example, the regular expression a*ba*c corresponds to the set

  {"bc", "abc", "bac", "aabc", "abac", "baac", ...}.

To complete our discussion of formal regular expressions, we need to define the rules for operator precedence. That is, we need to define rules so we understand whether, for example, a|b* should be read as (a|b)* or as a|(b*). The rule is that the * goes first, then comes concatenation, and finally |. In other words, the expression ab|c* must be read as (ab)|(c*).

Limits of Regular Expressions

As stated before, regular expressions are finite notations indicating subsets of Σ*. In particular, the Kleene operator allows us to indicate infinite sets of strings with a regular expression of finite size. A natural question is whether every possible subset of Σ* can be denoted with a finite regular expression. The answer to this question is ‘no’. In fact, subsets of Σ* that can be defined by a regular expression are called regular.

There are several ways to prove this. One way is to show that the number of subsets of Σ* is much larger (uncountably infinite) than the number of sets that can be indicated by regular expressions (countably infinite). A second way is by giving a counterexample. For example, the set of strings defined by

  L = {"ab", "aabb", "aaabbb", ...}

cannot be denoted by a finite regular expression. The actual proof of this statement relies on automata theory and is beyond the scope of this book. It can be found in many textbooks on languages and automata theory, such as Hopcroft et al. (2007). However, the result can be made intuitively clear by attempting to build a regular expression for L. For example, the expression aa*bb* fails, since this also includes strings not in L, such as "abb" and "aaab". Another natural candidate would be ab|aabb|aaabbb|⋯. However, this expression is clearly not finite and so contradicts our demand that L be expressed with a finite regular expression. The main intuition here is that regular expressions lack a certain sense of memory. To be able to tell whether a string is in L or not, one needs to count the number of "a"'s and then verify whether the ensuing number of "b"'s is the same. Since we have no way to count the number of characters, it is not possible to define a regular expression that defines L. In the next section, we shall see that extended regular expressions (EREs) add behavior and operators to relieve this limitation.

Because sets of strings can be defined in many ways, there is no easy general technique or rule to determine whether an infinite set of strings is definable by a regular expression or not. Rather, there is a set of mathematical techniques that have to be applied on a case-by-case basis. An easy-to-remember rule of thumb, however, is that a set of strings is not regular when, for the strings in the set, the structure at one position depends directly on the structure at another position.

Exercises for Section 5.2.1

5.2.2 Practical Regular Expressions

Thus far, we have discussed regular expressions in the more or less theoretical context of very simple alphabets such as {0, 1} or {a, b, c}. For larger alphabets it quickly becomes cumbersome to express string sets such as ‘any string that begins with four arbitrary symbols’, since that would involve ‘or’-ing all characters of the alphabet.

For this reason, programs working with regular expressions include extra operators that indicate ranges of characters (called character classes), special versions of the Kleene operator to indicate (possibly finite) repetition, or the beginning or ending of a string.

The extensions that are supported ultimately depend on the software one uses, and there are many programs and programming languages around that support regular expressions. However, EREs specified in the Single Unix Specification (The Open Group, 1997) are supported as a minimum in many applications, including R. Perhaps most notably, extensions added by the Perl programming language on top of ERE have become very popular and widely supported as well. Indeed, R also supports Perl regular expressions out of the box. The stringi package of Gagolewski and Tartanus (2014) also comes with an implementation of regular expressions. It is based on the ICU library (ICU, 2014), which offers yet another extension of ERE regular expressions.

To complicate things a little further, the actual alphabet used to interpret regular expressions may depend on locale settings. For example, in R, the LC_CTYPE setting determines the alphabet (e.g., the Unicode alphabet) while LC_COLLATE determines how character equivalence classes are defined.

Character Ranges

There are four ways to indicate that a string must have any of a range of characters at a certain position. The first we have already seen: it involves using the ‘or’ operator. Secondly, to indicate a position where any single character may occur, one uses the . (period) as a ‘wildcard’ character: it stands for any symbol from the alphabet. For example, the regular expression a.b stands for all strings starting with "a", followed by an arbitrary character, followed by "b".

Third, expressions surrounded by square brackets [ ] can be used to indicate custom character ranges. So the range [AaBb012] is equivalent to A|a|B|b|0|1|2. If the closing bracket ] is part of the range, it must be included as the first character after the opening bracket, e.g., []ABC]. A range of consecutive characters can be indicated with a dash -, so, for example, [abcd] is equivalent to [a-d]. If the dash itself is part of the range, it must be included as the final character in the range, that is, -|a|b|c|d can be expressed as [a-d-]. It is also possible to negate a range of characters. For example, to express strings that start with "a" and then have any character except one in the range c-f, one uses the caret ^ at the beginning of the range: a[^c-f].

Finally, a number of common character classes have been given predefined names. For example, the class [:digit:] is equivalent to 0-9. Note that named character ranges have to be used within a bracketed expression, so a valid expression detecting any single digit would look like this: [[:digit:]]. An overview of named character classes is given in Table 5.1. Here, we give a few more examples.
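
As a first small example, the following checks whether a string contains at least one digit.

  grepl("[[:digit:]]", c("abc", "a1c", "42"))
  ## [1] FALSE  TRUE  TRUE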

Consider the problem of detecting a class of e-mail addresses from domains ending with .com, .org, or .net. With the operators and character classes just discussed, one may construct a regular expression like this:

5.5   [[:alnum:]][[:alnum:]]*@[[:alnum:]][[:alnum:]]*\.(com|net|org)

This expression detects strings that start with one or more alphanumeric characters, followed by a @, followed again by one or more alphanumeric characters, a period, and then com, net, or org. Note that the period is prepended with a backslash to indicate that the period must not be interpreted as a wildcard but as a simple period.
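
A small sketch of expression (5.5) at work in R, with made-up addresses; the extra backslash required in the R string literal is explained later in this section.

  pat <- "[[:alnum:]][[:alnum:]]*@[[:alnum:]][[:alnum:]]*\\.(com|net|org)"
  grepl(pat, c("john@example.com", "jane@example.co.uk"))
  ## [1]  TRUE FALSE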

Table 5.1 Character ranges in the POSIX standard (version 2), with ranges corresponding to symbols of the ASCII alphabet

Range symbol ASCII range Description
. (period) All Any single character
[:alpha:] [a-zA-Z] Letters
[:digit:] [0-9] Numbers
[:alnum:] [a-zA-Z0-9] Letters and numbers
[:lower:] [a-z] Lowercase letters
[:upper:] [A-Z] Uppercase letters
[:cntrl:] Characters 0–31 Control characters
[:space:] Space, horizontal and vertical tab, line and form feed, and carriage return Space characters
[:print:] [:alnum:], [:punct:], and space Printable characters
[:blank:] Space and tab Blank characters
[:graph:] [:alnum:] and [:punct:] Graphical characters
[:punct:] ! " # $ % & ' () * + , - . / : ; < = > ? @ [ ] ∧ _ { | } c05-math-088 Punctuation
[:xdigit:] [0-9a-fA-F] Hexadecimal numbers

The actual interpretation depends on the alphabet used, which in R depends on the locale settings.

Suppose, as a second example, that we want to detect filenames that end with .csv. The following regular expression detects patterns consisting of one or more arbitrary characters, followed by .csv.

5.6   ..*\.csv$

Here, we also introduced the dollar sign as a special character. The dollar stands for ‘the end of the string’. Recall from Section 5.2.1 that R functions like grepl match any string that has a substring matching the regular expression. By appending the dollar sign, we explicitly specify that the .csv must occur at the end of the string.
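
A brief sketch of expression (5.6) at work, with made-up filenames.

  grepl("..*\\.csv$", c("data.csv", "data.csv.bak", "csv"))
  ## [1]  TRUE FALSE FALSE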

There are two more types of character classes that can be defined with bracketed expressions, both of them related to collation and character equivalence classes (see Section 3.2.8). For example, in one language, the "ä" may be a separate letter of the alphabet, whereas in another language it may be considered equivalent to "a". The special notation for such equivalence classes is [= =]. To indicate a particular class, one specifies a single representative from an equivalence class between [= and =]. For example, if "o", "ó","ô", and "ö" belong to a single class, then the expression [[=o=]p] is equivalent to [oóôöp].

In Section 3.2.8, we also discussed the example from the Czech alphabet, where the combination "CH" is considered a single character sorting between "H" and "I". To distinguish such cases within a bracketed expression, one places multicharacter collations between [. and .]. So, for example, in [[.CH.]abc], the "CH" is treated as a single unit.

Repetitions

The Kleene star operator adds the ability to express zero or more repetitions of a pattern in a regular expression. In the extended regular expressions of the Single Unix Specification, there are several variants that facilitate other numbers of repetitions, all of them having the same precedence order as the Kleene star. An overview is given in Table 5.2.

To demonstrate their purpose, we express the regular expressions of examples (5.5) and (5.6) more compactly as follows:

  [[:alnum:]]+@[[:alnum:]]+\.(com|net|org)   and   .+\.csv$

The combined use of character classes and repetitions can be very powerful. For example, to detect phone numbers consisting of 10 digits, the following regular expression will do.

  [[:digit:]]{10}

If we want to allow phone numbers that may have a dash between the second and third digit, we might use

  [[:digit:]]{2}-?[[:digit:]]{8}

If we also want to include phone numbers where a blank character is used instead of a dash, we can expand the expression further.

  [[:digit:]]{2}[-[:blank:]]?[[:digit:]]{8}
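
As a hedged sketch, the last expression can be tried out as follows (the phone numbers are fictitious).

  pat <- "[[:digit:]]{2}[-[:blank:]]?[[:digit:]]{8}"
  grepl(pat, c("0612345678", "06-12345678", "06 12345678", "061234"))
  ## [1]  TRUE  TRUE  TRUE FALSE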

Table 5.2 Extended regular expression notation for repetition operators

Operator Description
* Zero or more
? Zero or one
+ One or more
{n} Precisely n
{n,} n or more
{n,m} Minimum n to maximally m

Recognizing the Beginning or Ending of Lines

EREs have two special characters that anchor a pattern to the beginning or ending of a line.

^ indicates the beginning of a string
$ indicates the end of a string.

For formal regular expressions, such operators are not needed since they only define a precise range of strings, from beginning to end. However, programs (and R functions) such as grep recognize strings that have a substring matching a regular expression. As an example, compare the two calls to grepl given here.

  x <- c("aa","baa")
  grepl("a(a|b)", x)
  ## [1] TRUE TRUE
  grepl("∧a(a|b)", x)
  ## [1]  TRUE FALSE

The first regular expression is interpreted by grep to match any string having a(a|b) as a substring. Therefore, both strings are matched. The second regular expression, ^a(a|b), explicitly requires the string to begin with an "a".

Special Symbols and Escaping them in R

In the previous sections we introduced several characters that have a special interpretation. This means that to recognize such characters simply as themselves, we have to do something extra, that is, we need to escape their special interpretation. As a reminder, we thus far encountered the following characters with a special interpretation.

  . | ( ) [ ] { } * + ? ^ $ -

To interpret them as literal characters in EREs, one can either precede them by a backslash or include them in a bracketed expression [ ] (but see the section on character ranges on how to properly include ] and -). So, for example, the expression a\.b would recognize "a.b" but not "aHb", and a\\b recognizes "a\b".

In R, however, there is a catch. Recall from Section 3.2.4 that in string literals the backslash already has a special meaning: it can be used to define characters using their Unicode code point. A double backslash indicates an actual single backslash, preventing the interpretation of "\u" as a command.

  c("uf6","\uf6")
  ## [1] "ö"     "\uf6"

Observe that R prints both backslashes at the command prompt. However, one may check, for example by exporting the string to a text file or by checking the string length with nchar, that the actual string contains a single backslash. For string literals that are to be interpreted as regular expressions, a double backslash must be included to indicate a single escape character.

  # detect a.b, not a<wildcard>b
  grepl("a\\.b", c("a.b", "aHb"))
  ## [1]  TRUE FALSE

To recognize a single backslash, a double backslash should be used in an ERE. In R, this means using two escaped backslashes. So, to recognize the string "a\b" (typed in R as "a\\b"), we can do the following:

  # recognize "a" and not "ab"
  grepl("a\\b", c("a\b","ab"))
  ## [1]  TRUE FALSE

Such constructions quickly become unreadable as regular expressions get larger. A better alternative is to use bracketed expressions as follows:

  grepl("a[\]b", c("a\b","ab"))
  ## [1]  TRUE FALSE

This does not reduce the length of an expression, but it does enhance readability. More importantly, the latter statement is compatible with extended regular expressions according to the Single Unix Specification.

Finally, we note that R has an option to escape regular expressions altogether. Functions like sub, gsub, grep, and grepl have an option called fixed. This option allows one to interpret the search pattern as is and not as a regular expression.

  x <- c("a.b", "aHb", "a.b.*")
  grepl("a.b.*", x)
  ## [1] TRUE TRUE TRUE
  grepl("a.b.*", x, fixed=TRUE)
  ## [1] FALSE FALSE  TRUE

Being Lazy, Being Greedy

When a regular expression contains one or more of the repetition operators + or *, it is possible for a string to contain matches of varying length. For example, suppose we wish to remove the HTML tags in the following string.

  The <em>hello world</em> programme.

As a first guess, we might attempt to use the pattern <.*> to find the tags. The problem is that there are several substrings matching this pattern: "<em>", "</em>" and "<em>hello world</em>". If one is only interested in whether a substring occurs or not, this is not a problem. It does become a problem when specific substrings need to be replaced or removed.

For this reason, EREs include a syntax to specify whether the shortest (lazy) or the longest (greedy) match should be used. The default behavior is to use the greedy match. Here is a simple demonstration using R's gsub function, which replaces every occurrence of a match with a replacement string.

  s <- "The <em>hello world</em> programme"
  gsub(pattern="<.*>", replacement="", x=s)
  ## [1] "The  programme"

Here, we replace each occurrence of the pattern <.*> with the empty string "". Since the default behavior is to be greedy, the pattern matches everything from the first occurrence of "<" to the last occurrence of ">". The question mark operator introduced in Table 5.2 switches the behavior to lazy matching.

  gsub(pattern="<.*?>", replacement="", x=s)
  ## [1] "The hello world programme"

The pattern <.*?> translates to: first a "<", followed by zero or more characters, to end with ">". Since we use gsub (the g stands for global), every such occurrence is replaced.

Regular Expressions with Memory: Groups and Back Referencing

We noted in Section 5.2.1 that regular expressions are limited since they lack a sense of memory, and this is true for ‘pure’ regular expressions. In EREs, this restriction is relieved by introducing the concept of groups. The idea is that the strings that match a bracketed expression are stored for later reference. For example, to replace the string

  The <em>hello world</em> programme

with

  The <b>hello world</b> programme

we can devise a regular expression that stores everything between the <em> and </em> tags and places it between <b> and </b> tags.

   s <- "the <em>hello world</em> programme"
   gsub("<em>(.*?)</em>", "<b>\1</b>", s)
  ## [1] "the <b>hello world</b> programme"

Here, the \1 refers back to the first bracketed expression. Most regular expression engines, including R's, allow for back referencing up to nine grouped expressions in this way: \1, \2, ..., \9 (recall that R requires the double backslash, since the first gets processed by R prior to passing it to the regular expression engine).

A second use of groups and references is to use the reference within the search pattern. For example, to match any HTML tag that is closed properly, one can use the expression <(.*?)>.*?</\1>. Here, the first group, defined by (.*?), is referenced in the closing tag.

  grepl("<(.*?)>.*?</\1>", c("<li> hello","<em>hello world</em>"))
  ## [1] FALSE  TRUE

Finally, we note that some popular regular expression formats like the Perl Compatible Regular Expressions PCRE (2015) support named groups and back references. In R, support for Perl-like syntax is available through the perl=TRUE option.

  grepl("<(?P<tag>.*?)>.*?</(?P=tag)>"
    , c("<li> hello","<em>hello world</em>")
    , perl=TRUE)
  ## [1] FALSE  TRUE

Here, we use the syntax (?P<[groupname]>[pattern]) to define a named group. The syntax (?P=[groupname]) is used for the back reference. Support for named groups is somewhat limited, since group names cannot be used in the replacement string of (g)sub, even when using the perl=TRUE switch.

Since any modern (read: extended) regular expression engine interprets a bracketed expression as a group to be captured, one might wonder if this capturing may be switched off. This is indeed the case, and the syntax for that is (?:[pattern]). One use of this is to avoid adding to the back-reference counter for groups that are not reused.
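
A minimal sketch illustrating that a non-capturing group does not advance the back-reference counter (here with perl=TRUE).

  # (?:a) is not captured, so \1 refers to the group (b)
  grepl("(?:a)(b)\\1", c("abb", "aba"), perl=TRUE)
  ## [1]  TRUE FALSE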

A Few Tips on Building and Maintaining Regular Expressions

Regular expressions are powerful tools that allow for fast text processing in a compactly defined way. In practice, regular expressions have the property of being fairly easy to build up, but they are typically also hard to decipher and debug.

When building regular expressions, it is a good idea to work step by step: setting up a simple expression, trying it out, checking the cases you missed, and expanding it further. When expressions get too long to read comfortably, it may be a good idea to split your job in two, applying two filters consecutively. In any case, it is a good idea to spend a few lines of comments explaining what you are trying to accomplish with a regular expression.

For many well-known patterns, such as zip codes, e-mail addresses, or social security numbers, regular expressions may well already be around. There are many websites that list regular expression solutions to common pattern matching problems. We strongly advise testing regular expressions against examples that should, and should not, match.

Finally, we emphasize that regular expressions should not be regarded as the answer to all text recognition problems. It was noted before that regular expressions have their limitations. For example, notwithstanding the example we have used, validating all possible e-mail addresses conforming to the standard that defines valid notation is notoriously hard to do with regular expressions (see, e.g., Friedl (2006)). Another example would be a regular expression that recognizes valid IPv4 addresses. Such addresses have the form a.b.c.d, where a, b, c, and d are numbers between 0 and 255. Since the allowed values of the second and third digits of each number depend on the value of the first (099 is valid but 299 is not), regular expressions are not a logical choice. As an alternative, one could split the string along the periods, convert each substring to numbers, and check their ranges.
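
A sketch of this alternative approach; the helper name is_valid_ipv4 is ours and not part of any package.

  is_valid_ipv4 <- function(x) {
    # split on the literal period, convert to integer, check the ranges
    parts <- strsplit(x, ".", fixed = TRUE)
    vapply(parts, function(p) {
      n <- suppressWarnings(as.integer(p))
      length(n) == 4 && !any(is.na(n)) && all(n >= 0 & n <= 255)
    }, logical(1))
  }
  is_valid_ipv4(c("192.168.0.1", "192.168.0.299"))
  ## [1]  TRUE FALSE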

Exercises for Section 5.2.2

5.2.3 Generating Regular Expressions in R

The rex package of Ushey and Hester (2014) offers tools to generate regular expressions from a human readable syntax. The idea is to replace regular expression operators and character ranges by readable functions or keywords from which the actual regular expression is then derived. For example, the regular expression (a|b)c* can also be generated with the following statement.

  rex::rex( "a" %or% "b", zero_or_more("c"))
  ## (?:a|b)(?:c)*

Observe that rex by default generates non-capturing groups (denoted by (?:[regex])).

The rex function takes a variable sequence of arguments, where each subsequent argument describes a new part of the pattern to be matched. Such a structure makes it much easier to alter and debug a regular expression. Moreover, since R expressions may be spread over several lines, it is much easier to comment separate parts of a pattern. The output is an object of class regex that holds a (generalized) regular expression that can be used with any function accepting regular expressions, such as sub or grep. A regex object is like a string literal, but it differs in how it is printed to screen. Most notably, in regex objects, the extra backslashes required by R are not printed to screen.

  # look for . or ,:
  r <- rex::rex("." %or% ",")
  # the double backslash necessary in R (\\.) is not printed.
  r
  ## (?:\.|,)
  # however, using str, we see that it is actually there
  str(r)
  ## Class 'regex'  chr "(?:\\.|,)"

Let us look at a more extended example. Suppose we want to extract numbers from strings. The numbers may be integers, they may be written in floating point notation with either period or comma separator (e.g., 34.05 or 34,05), or in scientific notation, with E or e separating the mantissa from the exponent. With rex, we may specify this as follows. We only look for numbers that are surrounded by blanks.

  r <- rex::rex(blank
   , one_or_more(digit)
   , maybe( "." %or% ",", one_or_more(digit))
   , maybe( one_of("e","E"), one_or_more(digit))
   , blank
  )

This statement is so obviously readable that it is hardly necessary to provide comment. Just pronouncing the arguments of rex, replacing every comma by ‘followed by’, gives a good impression of the pattern being matched. Compare this with the regular expression that is produced (we break it into two lines here for presentation).

  [[:blank:]](?:[[:digit:]])+(?:(?:\.|,)
    (?:[[:digit:]])+)?(?:[eE](?:[[:digit:]])+)?[[:blank:]]

This example also shows a few of the operators offered by rex. The function maybe replaces the ? operator, the comma stands for concatenation, one_of stands for a bracketed range [ ], and one_or_more stands for the + operator. The full list of range functions can be found by typing

  ?rex::character_class

at the command line. We already saw the %or% operator. The following operators are supported as well.

%if_next_is% %if_prev_is%
%if_next_isnt% %if_prev_isnt%

Again, the names are so obvious that they hardly require an explanation. These operators are actually part of the generalized regular expression syntax and have no equivalent operator in the regular expressions discussed until now.

Standard character ranges like those in Table 5.1 are supported as well. The complete list is quite extensive and can be found by typing

  ?rex::shortcuts

at the command line. In this example, we used blank ([:blank:]) and digit ([:digit:]).

The string literals provided to rex are interpreted as is. It is also possible to reuse previously defined regular expressions by passing them as arguments to rex. For example, the extended example above could also have been written as follows:

  nr <- rex::rex(one_or_more(digit))
  Ee <- rex::rex(one_of("E","e"))
  r <- rex::rex(blank, nr, maybe("." %or% "," , nr), maybe(Ee,nr), blank)
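
A short sketch of the generated pattern in use; we pass perl=TRUE since the generated expression contains (?: ) groups.

  grepl(r, " the value was 12.5e3 today ", perl=TRUE)
  ## [1] TRUE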

5.3 Common String Processing Tasks in R

Among the most basic tasks in text string processing are pattern matching (detection of substrings) and localization, pattern replacement, pattern extraction, and string splitting. All of these are commonly performed with tools based on regular expressions or extensions thereof.

The traditional text processing tools in R are functions like (g)sub, grep(l), substr, regexpr, and strsplit. With these functions combined, one can perform any of the basic tasks mentioned. However, the more recently published stringi package of Gagolewski and Tartanus (2014) makes many tasks, especially string extraction, a lot easier. The functionality of the package is richer than base R's and it uses a clear naming convention. Moreover, since it is based on the ICU library (ICU, 2014), it guarantees platform-independent results. At the time of writing, stringi's popularity is increasing rapidly. For example, the latest versions of text processing tools like stringr (Wickham, 2010) and qdapRegex (Rinker, 2014) have switched to using stringi as backend for their higher level functions. Table 5.3 compares a selection of basic text processing functions from base R and stringi.

Table 5.3 A selection of basic text processing functions in base R and in stringi.

Base R stringi Description
sub stri_replace_first Substitute the first matching instance of a pattern
stri_replace_last Substitute the last matching instance of a pattern
gsub stri_replace_all Substitute all matching instances of a pattern
grep Match strings against a pattern; returns index in vector
grepl stri_detect Match strings against a pattern; returns a logical vector
regexpr Match strings against pattern; returns index in vector and location in strings
stri_match Match strings against a pattern; returns the first matched pattern. Also stri_match_all, _first, _last
stri_locate Match strings against a pattern; returns location of first match in string. Also stri_locate_all, _first, _last
substr Extract substring from start to stop index
stri_extract Extract first occurrence of a pattern from a string. Also stri_extract_all, _first, _last
strsplit stri_split Split a string into substrings, using a pattern to define the separators
stri_reverse Reverse a string

Replacing or Removing Substrings

The qdapRegex package exposes many functions (all starting with rm_) that allow one to remove common patterns. For example, rm_number removes numeric values from strings. It recognizes both integers and numbers with decimal separators.

  qdapRegex::rm_number("Sometimes 12,5 is denoted 12.5 but we may
  round it to 13")
  ## [1] "Sometimes 12,5 is denoted but we may round it to"

Scientific notation, for example, 1.25E1 is not recognized at the time of writing. The package contains similar functions for removal of e-mail addresses, urls, several date formats, twitter tags, and more.
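
For instance, a sketch using rm_url (assuming that function name; see the package documentation for the complete list of rm_ functions).

  qdapRegex::rm_url("see http://www.example.com for details")
  ## [1] "see for details"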

The tm package includes the function removePunctuation, which removes occurrences of characters defined in the [:punct:] range (see Table 5.1). It has an option to prevent intra-word dashes from being removed.

  x <- "'Is that an intra-word dash?', she asked"
  tm::removePunctuation(x)
  ## [1] "Is that an intraword dash she asked"
  tm::removePunctuation(x, preserve_intra_word_dashes=TRUE)
  ## [1] "Is that an intra-word dash she asked"

The tm package is focused on analyses of texts, which in this case means that many functions are overloaded to work on both character data and tm-defined objects that store text with some metadata.

The Regular Expression Interface of stringr

The stringr package has a utility function for removing blanks at the start and/or end of a string.

  library(stringr)
  str_trim("  hello world ", side='both')
  ## [1] "hello world"

The side argument controls whether leading and/or trailing white spaces are removed.
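
For example:

  str_trim("  hello world ", side='left')
  ## [1] "hello world "
  str_trim("  hello world ", side='right')
  ## [1] "  hello world"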

There are many options to replace substrings in R. The stringr function str_replace differs from R's default sub function in that it accepts a vector of regular expressions and a vector of replacements, where the shorter arguments are recycled to match the longest.

  pat <- c("[[:digit:]]+","[∧[:digit:]]+")
  str_replace("Hello 12 34", pattern = pat, replacement = "")
  ## [1] "Hello  34" "12 34"

Here, we apply two regular expressions to the (single) input string and replace the matched patterns with the empty string (hence removing the matched substrings). In the first case, we remove the first sequence of one or more digits; in the second, we remove the first sequence of one or more non-digit characters. This function is a handy replacement for sub: it only replaces the first match. The package offers a similar replacement for gsub, which is called stringr::str_replace_all. For example, to collapse every run of two or more blanks into a single space, we could do the following:

  pat <- "[[:blank:]]{2,}" # two or more blank characters
  rep <- " "               # replace with single space
  str_replace_all("Hello   many    spaces", pattern = pat, replace = " ")
  ## [1] "Hello many spaces"

Their consistent naming and vectorization over input strings, patterns, and replacements make stringr's functions more readable and versatile than base R functions. Moreover, since they always have the data (here, the string) as the first input, it is possible to combine them in statements that include the %>% pipe operator of the magrittr package.
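
A small sketch of such a pipeline, chaining the replacement and trimming steps shown above (assuming magrittr is installed).

  library(magrittr)
  "  Hello   many    spaces " %>%
    str_replace_all("[[:blank:]]{2,}", " ") %>%
    str_trim()
  ## [1] "Hello many spaces"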

Like in base R, stringr allows one to escape the interpretation of a pattern as regular expression and just interpret it as is. This is done with the fixed function. Compare the following example with the example on page 94.

  x <- c("a.b", "aHb", "a.b.*")
  stringr::str_detect(x, "a.b.*")
  ## [1] TRUE TRUE TRUE
  stringr::str_detect(x, stringr::fixed("a.b.*"))
  ## [1] FALSE FALSE  TRUE

Here, we use the str_detect function (a replacement for grepl), but any function in the stringr package that accepts a regular expression will obey the fixed directive. The fixed function also allows for case-insensitive matching.

  x <- c("a.b", "aHb", "A.B.*")
  str_detect(x, fixed("a.b.*"))
  ## [1] FALSE FALSE FALSE
  str_detect(x, fixed("a.b.*", ignore_case=TRUE))
  ## [1] FALSE FALSE  TRUE

The fixed function causes its argument to be compared byte by byte, ignoring possible multibyte encoding. Since case-insensitive matching depends on case folding rules, which are locale-dependent, the fixed option may not always return the expected result. If you are sure that your matching pattern is in the same encoding as the string you are searching in, this is a safe option. The coll function makes sure that collation rules defined by the current or custom-defined locale are used. The following example from the documentation of stringr illustrates this clearly.

  i <- c("I", "u0130", "i")
  i
  ## [1] "I" "İ" "i"
  str_detect(i, fixed("i",ignore_case=TRUE))
  ## [1]  TRUE FALSE  TRUE
  str_detect(i, coll("i", ignore_case=TRUE))
  ## [1]  TRUE FALSE  TRUE
  str_detect(i, coll("i", ignore_case=TRUE, locale = "tr"))
  ## [1] FALSE  TRUE  TRUE

Here, the Turkish locale ("tr") forces a different case folding, mapping "I" to the dotless "ı". This can also be made visible with stringr's case folding functions.

  str_to_lower(i)
  ## [1] "i"  "i̇" "i"
  str_to_lower(i, locale="tr")
  ## [1] "ı" "i" "i"

Versatile Substitutions with gsubfn

A seemingly simple, but very powerful, generalization of gsub is implemented by the gsubfn package. It allows replacements of matched patterns by a (user-defined) function of those patterns. For example, in the following statement, the numbers in the strings "(1,2)" and "x-23-y" are multiplied by two.

  gsubfn::gsubfn("[0-9]+", function(x) 2*as.numeric(x), c("(1,2)","x-23-y"))
  ## [1] "(2,4)"  "x-46-y"

Note that the result of the user-defined function is automatically converted to character.

Extracting or Splitting Substrings

The most convenient way of substring extraction in R is offered by the stringr package. Functions str_extract and str_extract_all return the first, respectively all, matches of a regular expression.

  x <- c("date: 11-11-2011", "Either 10-01-2001 or 07-03-2012")
  stringr::str_extract(x,"[0-9]{2}-[0-9]{2}-[0-9]{2}")
  ## [1] "11-11-20" "10-01-20"
  stringr::str_extract_all(x,"[0-9]{2}-[0-9]{2}-[0-9]{2}")
  ## [[1]]
  ## [1] "11-11-20"
  ##
  ## [[2]]
  ## [1] "10-01-20" "07-03-20"

Observe that str_extract_all returns a list, with one element for each entry in x. A nice feature of str_extract_all is the simplify option, which returns the results as a character matrix instead.

  stringr::str_extract_all(x,"[0-9]{2}-[0-9]{2}-[0-9]{2}", simplify=TRUE)
  ##      [,1]       [,2]
  ## [1,] "11-11-20" ""
  ## [2,] "10-01-20" "07-03-20"

If it is known what positions in a string contain the information to extract, substr from base R can be used. For example, each IBAN account number starts with a two-letter (ISO 3166-1) country code which may be extracted.

  iban <- "NLABNA987654321"
  substr(iban, start = 1, stop = 2)
  ## [1] "NL"

It is possible to split a string at each match of a regular expression using str_split. The result is a vector with the original string cut in pieces and the matches left out.

  x <- c("10-10 2010","12.12-2012")
  stringr::str_split(x,"[-\. ]")
  ## [[1]]
  ## [1] "10"   "10"   "2010"
  ##
  ## [[2]]
  ## [1] "12"   "12"   "2012"

Again, the result is a list with one entry for each element in x. There is also a variant called str_split_fixed, which splits each string into a fixed number of pieces and returns the result as a character matrix.
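
A brief sketch, continuing with the x defined above.

  stringr::str_split_fixed(x, "[-\\. ]", n = 3)
  ##      [,1] [,2] [,3]
  ## [1,] "10" "10" "2010"
  ## [2,] "12" "12" "2012"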

5.4 Approximate Text Matching

Approximate or ‘fuzzy’ text matching is in some ways complementary to the techniques discussed in the previous sections. Where character normalization and pattern extraction and substitution techniques are focused on bringing text closer to some standardized form, approximate text matching methods allow one to leave some slack between text and a standardized form.

As an example we consider the task of matching some input data against a list of standardized categories.

  # standard categories
  codes <- c("male", "female")
  # vector of 'raw' input data
  gender <- c("M", "male ", "Female", "fem.","female")

Here, we wish to assign each value in gender to a category in codes. If the entries in gender were perfect, R's native match function could be used.

  # lookup of gender in the 'codes' lookup table
  match(x = gender, table = codes)
  ## [1] NA NA NA NA  2

We get a single match: the last element of gender matches the second element of codes. We can do a little better by cleaning up gender.

  # remove trailing whitespace, cast to lowercase
  gender_clean <- stringi::stri_replace_all(gender,"",
  regex="[[:blank:]]+$")
  gender_clean <- stringi::stri_trans_tolower(gender_clean)
  gender_clean
  ## [1] "m"      "male"   "female" "fem."   "female"
  match(x = gender_clean, table = codes)
  ## [1] NA  1  2 NA  2

We now get three exact matches. The amatch function of the stringdist package (van der Loo, 2014) also matches strings, but it allows for a certain amount of slack between the data to match and entries in the lookup table.

  stringdist::amatch(x = gender, table = codes, maxDist = 3)
  ## [1] NA  1  2  2  2
  stringdist::amatch(x = gender_clean, table = codes, maxDist = 3)
  ## [1] 1 1 2 2 2

For the raw data, four matches are found, while for the cleaned data we find a match for each data item in gender. The results are summarized in Figure 5.1.


Figure 5.1 Results of exact or inexact matching against a lookup table with raw and normalized text data. The combination of normalization and approximate matching works best.

In the call to amatch, we specified a maximum distance (maxDist) of three. In this case it means that for two elements to match, the optimal string alignment distance between them may be at most 3. Roughly, this distance counts the number of allowed operations necessary to turn one string into another. The four operations include substitution, deletion, or insertion of a single character, and transposition of neighboring characters. For example, "fem." can be turned into "female" by substituting "a" for "." and inserting "l" and "e", giving a total distance of three.
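
This can be verified directly with the stringdist function, whose default method is the optimal string alignment distance.

  stringdist::stringdist("fem.", "female")
  ## [1] 3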

If no list of fixed codes is available, one may attempt to cluster strings with any method that accepts a distance matrix. For example, with R's hierarchical clustering implementation we can do the following:

  # some 'raw' data to be clustered.
  subject <- c("mathematics", "physics",
       "astrophysics", "math", "numerical math")
  # compute distance matrix, adding strings as row/column names
  M <- stringdist::stringdistmatrix(subject, subject, useNames=TRUE)
  # Convert to 'dist' object and cluster hierarchically
  h <- hclust(as.dist(M), method='single')

The resulting clustering is shown in Figure 5.2. In this case, the topics are clustered more or less as one would expect. The clustering is, of course, heavily dependent on both the chosen distance metric and the clustering method. For example, the same method gives unsatisfactory results for the gender example we introduced earlier: the category names "male" and "female" are closer to each other than some of the polluted data items are to their correct category.


Figure 5.2 Hierarchical clustering (‘single’ method) of five strings based on the optimal string alignment distance matrix.

5.4.1 String Metrics

The purpose of a string metric is to introduce a sense of distance between elements of Σ*. Formally, a distance function on Σ* is defined as a function that assigns to each pair of strings a real number,

5.7   d : Σ* × Σ* → ℝ.

Furthermore, a formal distance function satisfies the following rules:

5.8a   d(s, t) ≥ 0
5.8b   d(s, t) = 0 if and only if s = t
5.8c   d(s, t) = d(t, s)
5.8d   d(s, u) ≤ d(s, t) + d(t, u)

These rules are easily intuitively understood: the first rule states that distances cannot be negative; the second rule states that the distance between two strings is zero if and only if the two strings are the same; and the third rule states that taking a walk through Σ* from s to t is as far as going from t to s. The fourth rule says that the value of the distance function is the value of the shortest route between two strings. Going from s to u via a third string t may not be shorter than going from s to u directly.

Over the years, many ‘string metrics’ have been developed. Perhaps surprisingly, not all of them fulfill the natural conditions laid down in Eq. (5.8). For example, in the following sections we show that q-gram-based distances do not always obey the identity property. That is, two strings may have distance zero between them while being different. Obviously, this means one needs to take care when choosing a distance function, for example, when clustering strings. Indeed, some of the ‘metrics’ we are going to discuss are not metrics in the mathematical sense. However, since it is common in the literature to talk about string distances or string metrics regardless of which formal rules are satisfied, we shall apply the same convention here.

The principles from which metrics are derived vary strongly. We generally distinguish between edit-based distances, q-gram-based distances, heuristic distances, kernel-based distances, and distances based on phonetic algorithms. Here, we discuss these distances, pointing out their most important properties, including which, if any, of the rules in Eq. (5.8) are violated. There are more string distance measures than can be discussed in this book; since this chapter is concerned with text cleaning, we concentrate on metrics that are commonly used in that area.

Edit-Based Distances

These distances are based on determining a sequence of basic transformations necessary to turn one string into another. The set of basic operations (‘edits’) includes one or more of the following:

  • Deletion of a single character, for example, "abc" → "ac";
  • Insertion of a single character, for example, "ac" → "abc";
  • Substitution of a single character, for example, "abc" → "abd";
  • Transposition of two adjacent characters, for example, "abc" → "acb".

A distance function is defined by determining the shortest sequence of basic operations that turns one string into another and taking the length of that sequence. Alternatively, one may assign a weight to each type of transformation and determine the sequence of least total weight. Such distances are referred to as generalized distances (Boytsov, 2011).

First note that when the set of basic operations is chosen too limited, not every pair of strings from Σ* is connected by a sequence of allowed operations. Formally, such functions are not total functions on Σ*. We will follow the same convention as Navarro (2001) and define the distance between such unconnected pairs to be infinite. The advantage of this convention is that numerical comparisons are then valid between all computed distances.

It can be shown that unweighted edit-based distances all obey the properties of a proper distance (Eq. 5.8), see, for example, Boytsov (2011). When different weights are assigned to insertion and deletion, distances become asymmetric: walking from "abc" to "ac" requires a deletion, while walking from "ac" to "abc" requires an insertion. Distance functions with weighted insertion or deletion are symmetric under simultaneous swapping of both the input strings and the deletion and insertion weights.

Table 5.4 Edit-based string distances

Distance                       Operation
                               del   ins   sub   trn
Hamming                         –     –     ✓     –
Longest common subsequence      ✓     ✓     –     –
Levenshtein                     ✓     ✓     ✓     –
Optimal string alignment        ✓     ✓     ✓     ✓a
Damerau–Levenshtein             ✓     ✓     ✓     ✓

a Each character can take part in maximally a single transposition.

Table 5.4 gives an overview of commonly used edit-based string distances and their allowed operations. The distance of Hamming (1950) is the simplest and allows only for substitution of characters. Hence, it is finite only when computed between strings of the same length. For example, we have c05-math-151 and c05-math-152. Since there is only one allowed operation, there is no use in defining weights. Since it is defined only on strings of equal lengths, in the context of text processing, the Hamming distance is most useful for standardized strings, such as phone numbers, product codes, or other identifying numbers.
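As a quick illustration of this restriction, consider the following hypothetical product codes; the calls are a sketch assuming the stringdist package.

  # two positions differ, so the Hamming distance is 2
  stringdist::stringdist("A1234", "A1243", method = "hamming")
  # strings of unequal length are at infinite Hamming distance
  stringdist::stringdist("A1234", "A12345", method = "hamming")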

The longest common subsequence distance (lcs distance) of Needleman and Wunsch (1970) is the simplest generalization of the Hamming distance: it replaces the substitution operation by allowing insertion and deletion of characters. This means that the distance function is a complete function on Σ*. In fact, it is easy to see that, given two strings s and t, their distance is largest when they have no characters in common. In that case, the distance from s to t is simply |s| + |t|, the cost of deleting one by one all characters from s and then inserting all characters of t. As an example, consider the lcs distance between "fu" and "foo". Using only deletion and insertion, it takes three steps to go from one to the other.

"fu" → "f" → "fo" → "foo"

This also shows that there is not necessarily a unique shortest path from the start point to the end. Unlike in Euclidean geometry, for example, we could have chosen a different path through Σ* (e.g., "fu" → "fou" → "foou" → "foo") and obtained a path of the same length.

As implied by its name, the lcs distance has a second interpretation. Given two strings, we pair common characters, passing through the strings from left to right but without changing their order. The distance between the two strings is then given by the number of unpaired characters in both strings. The diagram given here demonstrates this process.

[Diagram: pairing of the common characters of two strings; the unpaired characters determine the lcs distance.]

Indeed, there are three unpaired characters in the two strings. Note that a subsequence is not the same as a substring. If s is a string, a substring is a consecutive sequence of characters s_i s_{i+1} ⋯ s_j. A subsequence is a sequence of characters s_{i_1} s_{i_2} ⋯ s_{i_k} that preserves order (i_1 < i_2 < ⋯ < i_k) but need not be consecutive. Hence, "fbar" is a subsequence of both "fubar" and "foobar" but not a substring of either. Applications of the lcs distance lie mainly outside of data cleaning, although it can be interpreted as a crude (and fast) approach to typo detection. In principle, it is possible to define a generalized lcs distance by assigning different weights to insertion and deletion, but this does not seem very common in practice.
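The lcs distance of the "fu"/"foo" example can be reproduced as follows; this is a small check assuming the stringdist package.

  # only the "f"s can be paired; "u", "o", and "o" remain unpaired,
  # so the lcs distance should be 3
  stringdist::stringdist("fu", "foo", method = "lcs")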

The Levenshtein distance (Levenshtein, 1966) is like the longest common subsequence distance, but it is more flexible since it also allows for direct substitution of a character, which would take two steps in the lcs distance. This extra flexibility leads to shorter minimal distances. For example, the Levenshtein distance between "fu" and "foo" equals two:

5.9   "fu" → "fo" → "foo"

The generalized Levenshtein distance satisfies the properties of Eq. (5.8) except when the weights for insertion and deletion are different. In that case, the distance is no longer symmetric, which is easily illustrated with an example. Suppose we weigh an insertion with 0.1 and deletion and substitution with 1. We have (denoting the cost of each step in parentheses)

"fu" →(1) "fo" →(0.1) "foo"

giving a distance of 1.1. Going the other way around, however, we get

"foo" →(1) "fo" →(1) "fu"

giving a distance of 2.
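The asymmetry can be checked with base R's adist, which computes the generalized Levenshtein distance; the weights below mirror the example above (insertions 0.1, deletions and substitutions 1), and the expected values are given as comments.

  # "fu" -> "foo": one substitution (1) and one insertion (0.1), so 1.1
  adist("fu", "foo", costs = c(insertions = 0.1, deletions = 1, substitutions = 1))
  # "foo" -> "fu": one deletion (1) and one substitution (1), so 2
  adist("foo", "fu", costs = c(insertions = 0.1, deletions = 1, substitutions = 1))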

The optimal string alignment distance, which is also referred to as the restricted Damerau–Levenshtein distance, allows for insertions, deletions, substitutions, and limited transpositions of adjacent characters. The limitation is that each substring may be edited only once.

The most important consequence is that the optimal string alignment distance does not satisfy the triangle inequality. The following counterexample is taken from Boytsov (2011). If we take s = "ab", t = "ba", and u = "bca", then the distances d(s, t) and d(t, u) are realized by

"ab" →(trn) "ba" →(ins) "bca"

giving a total distance of 2. The intuitive direct path from "ab" to "bca" would consist of the same two steps,

5.10   "ab" → "ba" → "bca"

However, under the optimal string alignment distance this path is not allowed, since it edits the substring "ba" twice: once by transposing "a" and "b" and once by inserting a "c" between "b" and "a". In fact, it turns out that there is no way to go from "ab" to "bca" in fewer than three allowed steps, and so one has to resort to a sequence such as

"ab" →(del) "b" →(ins) "bc" →(ins) "bca"

The optimal string alignment distance between a string variable and a known fixed string may be interpreted as the number of typing errors made in the string variable.

The Damerau–Levenshtein distance (Lowrance and Wagner, 1975) does allow for arbitrary transpositions of characters. That is, subsequences may be edited more than once, allowing the path of Eq. (5.10).
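The counterexample, with the strings as reconstructed above, can be checked numerically; the comments give the values one should expect under the optimal string alignment ("osa") and full Damerau–Levenshtein ("dl") methods of the stringdist package.

  stringdist::stringdist("ab", "ba",  method = "osa")  # 1: one transposition
  stringdist::stringdist("ba", "bca", method = "osa")  # 1: one insertion
  stringdist::stringdist("ab", "bca", method = "osa")  # 3: triangle inequality violated
  stringdist::stringdist("ab", "bca", method = "dl")   # 2: arbitrary transpositions allowed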

The maximum distance between two strings s and t, as measured by any of the discussed distances that allow for character substitution, equals max(|s|, |t|). Distances that allow for a larger number of basic operations generally yield numerically smaller distances. The given diagram summarizes the relation between edit-based distances and their limits.

Note that the maxima may be used to scale distances between 0 (exactly equal) and 1 (completely different) should a method based on a string metric require it.

Finally, some notes on the computational time it takes to compute edit-based distances. The Hamming distance can obviously be computed in O(|s|) time. The other edit-based distances are usually implemented using dynamic programming techniques that run in O(|s| |t|) time. See Navarro (2001) or Boytsov (2011) for an overview.

The Jaro–Winkler Distance

The Jaro distance was originally developed at the US Bureau of the Census as part of an application called UNIMATCH. The purpose of this program was to match records based on inexact text fields, mostly name and address data. The first description appeared in the manual of this software (Jaro, 1978), which may be a reason why it is not widely described in the computer science literature. Jaro (1989), however, reports successful application of this distance for matching inexact name and address fields.

The Jaro distance is an expression based on the intuition that character mismatches and character transpositions are caused by human typing errors and that transpositions of characters that are further apart are less likely to occur. It can be expressed as follows:

5.12   d_jaro(s, t) = 1 − (1/3) [ w_1 m/|s| + w_2 m/|t| + w_3 (m − τ)/m ]

Here, m is the number of matching characters and τ the number of transpositions, counted in a way to be explained below, and the w_i are nonnegative weight parameters which are usually chosen to be 1. The distance ranges between 0 (s and t are identical) and 1 (complete mismatch). It is zero if both s and t are the empty string, and it equals 1 if there are no matches while at least one string is nonempty. Otherwise, the distance depends on a weighted sum of character matches and transpositions.

The number of matches is computed as follows. First, pair characters from s and t as you would for the longest common subsequence distance. Whether a pair is considered a match then depends on the distance between the positions of the paired characters. In particular, if the character at position i in s is paired with the character at position j in t, this is considered a match only when

5.13   |i − j| ≤ ⌊max(|s|, |t|)/2⌋ − 1

where ⌊·⌋ indicates the floor function and each character can be included in no more than one match. For example, for c05-math-193 and c05-math-194, the number of matches is c05-math-195.

The number of transpositions c05-math-196 is computed as follows. First take the subsequences c05-math-197 and c05-math-198 of c05-math-199 and c05-math-200 consisting only of the matching characters. Then c05-math-201 is the number of characters in c05-math-202 that have to change position to turn c05-math-203 into c05-math-204. For example, take c05-math-205 and c05-math-206, we have c05-math-207 and c05-math-208 and c05-math-209. To turn c05-math-210 into c05-math-211, we need two operations:

equation

The term transposition here is used differently from the definition in Section 5.4.1 where it is defined for transposition of two characters. Here, it should be interpreted as transposing a single character with a substring. For example, in the first step of the abovementioned example, we may interpret the relocation of "d" as swapping c05-math-212 with "cb".

Unfortunately, there seems to be some confusion in the literature over the precise definition of the Jaro distance. Cohen et al. (2003), for example, use c05-math-213 as an upper limit for c05-math-214. The definition used here, which is also implemented in the stringdist package, follows the original definition of Jaro (1978).

Winkler (1990) proposes a simple variant of the Jaro distance that rests on the observation that typing errors are less likely to occur at the beginning of a human-typed string than near the end. If differences occur at the beginning, the heuristic is that the strings are likely to be truly dissimilar. The Jaro–Winkler distance takes advantage of this phenomenon by multiplying the Jaro distance with a penalty factor that depends on the length of the common prefix of the two strings, considering at most the first four characters. The expression is as follows:

5.14   d_jw(s, t) = (1 − p ℓ) d_jaro(s, t)

Here, ℓ is the length of the longest common prefix of s and t, up to a maximum of 4. The value of p is a user-defined adjustable parameter, but it must be between 0 and 1/4 to ensure that the distance stays between 0 and 1. The value of p determines how strongly the length of the common prefix weighs into the total distance, and the Jaro distance emerges as a special case when p = 0. Both Winkler (1990) and Cohen et al. (2003) report good results in matching name–address data when setting p = 0.1.

As far as this author knows, the properties as described in Eq. (5.8) are not reported anywhere except in van der Loo (2014). They are however easily derived, so we will state them in the following observation.

q-Gram-Based Distances

The edit-based distances and the Jaro–Winkler distance described above are typically used for the detection of misspelling errors in fairly short strings. They are based on a precise description of what operations are allowed to turn one string into another and therefore have natural paths from s to t through Σ* associated with them.

Distances based on q-grams, on the other hand, are computed by first deriving an integer vector (the so-called q-gram profile) from the strings s and t and then defining a distance measure on the space of such vectors. Well-known examples include the Manhattan distance and the cosine distance.

A q-gram is a string of q characters. Given a string s with |s| ≥ q, the associated q-gram profile is a table counting the occurrences in s of all q-grams allowed by the alphabet. For example, the nonzero entries of the 2-gram profile of "banana" look like this:

"an": 2,   "ba": 1,   "na": 2
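q-gram profiles can be tabulated with the qgrams function of the stringdist package; the following sketch reproduces the 2-gram counts of "banana" (the order of the columns may differ).

  # count 2-grams: "an" and "na" occur twice, "ba" once
  stringdist::qgrams("banana", q = 2)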

In the literature on text classification, the term n-gram is often used instead of q-gram. Since that term is also used to indicate word sequences rather than character sequences, we follow the terminology of Ukkonen (1992) and use the term q-gram to avoid confusion.

We will denote the q-gram profile of a string s as a vector-valued function v_q(s). The range of v_q is the set of nonnegative integer vectors whose dimension equals the number of possible q-grams allowed by the alphabet. In general, v_q is not a complete function on Σ*. In particular, it is undefined when |s| < q, and we also need to decide what to do when q = 0. Here, we adopt the following conventions:

equation

Intuitively, this corresponds to the idea that all distances are zero as long as we are not comparing anything. In particular, it means that c05-math-279 for any q-gram-based distance.

The traditional q-gram distance is computed as the Manhattan distance between two q-gram profiles:

d_qgram(s, t) = Σ_i |v_q(s)_i − v_q(t)_i|

Observe that since the entries of v_q are counts rather than frequencies, this distance expresses both a difference in q-gram distribution and a difference in string length. It is not hard to show that the maximum q-gram distance between two strings s and t is equal to

(|s| − q + 1) + (|t| − q + 1)

To accommodate the effect of differences in string length, one can base the distance on the (cosine of the) angle between the q-gram vectors:

d_cos(s, t) = 1 − v_q(s) · v_q(t) / (‖v_q(s)‖ ‖v_q(t)‖)

This distance varies between 0 and 1, where 0 means complete similarity and 1 means complete dissimilarity (no overlap in occurring q-grams).

A third way to compare q-gram profiles is the Jaccard distance,

d_jaccard(s, t) = 1 − |Q(s) ∩ Q(t)| / |Q(s) ∪ Q(t)|

where Q(s) denotes the set of all q-grams occurring in s. The Jaccard distance scales from 0, indicating complete similarity, to 1, indicating complete dissimilarity.

The term ‘complete similarity’ is not the same as equality. We already saw the case of q = 0, which defines all strings as equal (there is nothing to make them differ when q = 0). The fact that unequal strings can have a q-gram-based distance of zero is not, however, limited to q = 0. For example, if q = 1, the table of q-gram counts is the same for two words that are anagrams of each other. Ukkonen (1992) has shown that for any value of q one can derive transformations on strings of a certain structure that leave the q-gram profile intact. For example, the reader can check that the strings "aFOObBARa" and "bBARaFOOb" have the same 2-gram profile.
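This can be verified with the stringdist package: the two strings above differ, yet their 2-gram distance is zero.

  # identical 2-gram profiles, so the q-gram distance should be 0
  stringdist::stringdist("aFOObBARa", "bBARaFOOb", method = "qgram", q = 2)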

We have seen that q-gram distances do not obey the identity property of distance metrics. The other properties depend on the metric chosen over the nonnegative integer vectors. The distances discussed here are all nonnegative and symmetric. Both the Jaccard and the q-gram distance also satisfy the triangle inequality. The cosine distance does not satisfy the triangle inequality. This is easily seen by taking three q-gram profiles x, y, and z in a common plane such that the angle between x and y and the angle between y and z both equal θ, so that the angle between x and z equals 2θ. We then have d_cos(x, y) = d_cos(y, z) = 1 − cos θ and d_cos(x, z) = 1 − cos 2θ. To show that the cosine distance does not satisfy the triangle inequality, we only need to find a θ for which 1 − cos 2θ > 2(1 − cos θ). This is the case for every θ between 0 and π/2.

q-gram profiles have been widely used as a tool for text classification. Applications and methodology have been reported where texts are classified according to the language used (Cavnar and Trenkle, 1994), the topics under discussion (Clement and Sharp, 2003), or author style and identity (Kešelj et al., 2003), to name but a few examples. In many cases, the use of bi- or trigrams seems an appropriate choice for natural language classification (Fürnkranz, 1998).

Distances Based on String Kernels

Just like q-gram-based methods, string kernels are interesting for text categorization or clustering purposes.

A string kernel is defined by a map φ that assigns to each string s in Σ* a real vector φ(s) with nonnegative coefficients. The normalized kernel function k(s, t) of two strings is defined as

k(s, t) = φ(s) · φ(t) / √[(φ(s) · φ(s)) (φ(t) · φ(t))]

where ‘·’ is the standard inner product on a real vector space. The right-hand side is the cosine of the angle between the feature vectors φ(s) and φ(t), and thus a measure of distance can be obtained as

d(s, t) = 1 − k(s, t)

and a suitable map φ thus defines a sense of distance on Σ*.

Several authors have defined string kernels and tested their effectiveness. Karatzoglou and Feinerer (2007) propose two string kernels. In the first kernel (simply called the string kernel), each element c05-math-327 of c05-math-328 counts the number of occurrences of subsequences in c05-math-329.

equation

The second kernel, called the full string kernel, is defined by concatenating the feature vectors defined by every subsequence up to length c05-math-330.

equation

Lodhi et al. (2002) propose a string kernel that penalizes the contribution of each subsequence u of s by the number of positions traversed when going from the first to the last character of its occurrence in s:

5.15   φ_u(s) = Σ_{i : s[i] = u} λ^{l(i)}

where the sum is over every occurrence of the subsequence u in s, and l(i) is the length of the substring of s spanned by the index vector i: l(i) = i_{|u|} − i_1 + 1.

Phonetic Algorithms

Phonetic algorithms aim to match strings that sound similar when they are pronounced. Phonetic algorithms transform a string to some kind of phonetic string that represents pronunciation features. The distance between two strings is then defined as a distance between the phonetic strings. In the simplest case, the distance is 0 when two phonetic strings are equal and 1 otherwise.

One of the most famous phonetic algorithms, called soundex, was patented for the first time in the early twentieth century (Russell, 1918, 1922). It is aimed at names written in the ASCII alphabet and pronounced in English. One widely used version of the algorithm is published by the United States National Archives and Records Administration (NARA, 2015).

The soundex algorithm encodes a string to a phonetic string consisting of a letter and three numbers. The NARA version of the algorithm can be summarized as follows (implementations differ in some details, such as the treatment of letters separated by "h" or "w"):

  1. Retain the first letter of the string.
  2. From the remaining letters, remove all occurrences of "a", "e", "i", "o", "u", "y", "h", and "w".
  3. Replace the remaining letters with digits: "b", "f", "p", and "v" become 1; "c", "g", "j", "k", "q", "s", "x", and "z" become 2; "d" and "t" become 3; "l" becomes 4; "m" and "n" become 5; and "r" becomes 6.
  4. When two or more adjacent letters would receive the same digit, retain only a single digit.
  5. Append the digits to the retained first letter and truncate or pad with zeros so that the result consists of one letter followed by three digits.

For example, the names "van der loo" and "de jonge" map to "V536" and "D252", respectively (after removing spaces). Phonetically similar names map to similar results. For example, "john" and "joan" both map to "J500", and similarly "ashley" and "ashly" map to "A240".
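These codes can be reproduced with the phonetic function of the stringdist package (discussed further in Section 5.4.2); as in the example above, the spaces are removed first.

  # soundex codes; expected: "V536" "D252" "J500" "J500" "A240" "A240"
  stringdist::phonetic(c("vanderloo", "dejonge", "john", "joan", "ashley", "ashly"))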

The soundex algorithm has been adapted and expanded over the years by many researchers and practitioners. Like the original algorithm, such extensions are often strongly tied to a particular purpose. For example, the Metaphone algorithms by Philips (2000) started as an improvement on the soundex algorithm for better matching of English names, but they have been extended to other languages, including Spanish and Brazilian Portuguese. Holmes and McCabe (2002) focus on the original purpose of English name matching, using a mix of different phonetic techniques. Mokotov (1997), and later Beider and Morse (2008), developed alternatives aimed at matching Eastern European and Ashkenazic Jewish surnames, respectively. Finally, Pinto et al. (2012) report on a soundex-like algorithm for strings typically occurring in short text messages (sms).

5.4.2 String Metrics and Approximate Text Matching in R

Base R comes with a few basic capabilities for approximate text matching. The function adist (of the standard utils package) computes the generalized Levenshtein distance. By default, the weights for insertion, deletion, and substitution are all equal to one, but this may be changed with the costs argument.

  adist("fu","foo")
  ##      [,1]
  ## [1,]    2
  adist("fu","foo", cost=c(i=0.1, d=1, s=1))
  ##      [,1]
  ## [1,]  1.1

The elements of costs must be named so that they (partially) match "insertions", "deletions", and "substitutions". If adist is provided with vector(s) of strings, the matrix of distances between the strings is computed.

  adist(c("hello","world"),c("olá mundo"))
  ##      [,1]
  ## [1,]    8
  ## [2,]    8

The agrep function is based on the work of Laurikari (2001) who combines the idea of approximate matching with regular expression search. Recall that a regular expression defines a subset of all possible strings that can be built up out of a certain alphabet. The function grep takes a regular expression and a string x and determines whether any substring of x is in the set defined by the regular expression. The function agrep is even more flexible and searches whether there is a substring in x that is maximally some Levenshtein distance away from an element in the set defined by the regular expression.

  # a pattern to search for
  re <- "abc"
  # two strings to search in
  x <- c("FOOabcBAR","FOOaXcBAR")
  # grep finds only one match (in the first element)
  grep(pattern=re, x=x)
  ## [1] 1
  # agrep finds both
  agrep(pattern=re, x=x, max.distance=1)
  ## [1] 1 2

In this simple example, the set of strings matched by the regular expression consists only of "abc". Thus, grep returns 1, the index into x where a match is found. The argument max.distance tells agrep to search for strings that are at most Levenshtein distance 1 removed from the substrings defined by the pattern "abc". Therefore, agrep also matches the substring "aXc" since it is a single substitution away from "abc".

The concept behind agrep is extremely powerful and elegant. However, do note that it can be hard to visualize what strings are defined by a regular expression. Combining complex regular expressions with the flexibility of agrep may yield unexpected matches. So, just like when working with grep and related tools, it is strongly recommended to build up regular expressions piece by piece, and to create test sets of strings with cases that you expect to match (or not).

Beyond the Levenshtein Distance with stringdist

The stringdist package offers a number of string metrics, which makes it easy to try and test several distance measures for your application. Metrics can be computed using the stringdist function, whose default method is the optimal string alignment distance.

  library(stringdist)
  stringdist("hello",c("hallo","wereld"))
  ## [1] 1 4

Here, the shortest argument ("hello") is recycled and the result is a vector of length 2, measuring the distance between "hello" and "hallo" and between "hello" and "wereld". So unlike adist, this function does not return a matrix. A matrix of distances can be computed with the stringdistmatrix function.

  stringdistmatrix("hello",c("hallo","wereld"), useNames="strings")
  ##       hallo wereld
  ## hello     1      4

To perform approximate matching, the amatch function can be used. Here, amatch is similar to R's match function. It is given some strings x and a lookup table table and returns an index in table for each element of x that can be matched. The difference is that amatch allows for a certain string distance.

  match("hello",c("hallo", "wereld"), nomatch=0)
  ## [1] 0
  amatch("hello",c("hallo", "wereld"), nomatch=0, maxDist=1)
  ## [1] 1

Here, match finds no match for "hello" in the vector c("hallo", "wereld"), while amatch matches "hello" with "hallo". The strings "hello" and "hallo" are one substitution away from each other, which is equal to the maximally allowed distance.

In R, there is a variant of match called %in%, which returns a logical vector that indicates which elements of x occur in the lookup table. The stringdist package provides an approximate matching equivalent called ain.

  "hello" %in% c("hallo", "wereld")
  ## [1] FALSE
  ain("hello",c("hallo", "wereld"), maxDist=1)
  ## [1] TRUE

Each of the stringdist functions discussed so far (stringdist, stringdistmatrix, amatch, and ain) has a method argument that allows one to switch between string metrics.

  stringdist("hello",c("hallo","wereld"),method="hamming")
  ## [1]   1 Inf
  amatch("hello",c("hallo", "wereld"), nomatch=0,method="hamming",
  maxDist=1)
  ## [1] 1

Table 5.5 gives an overview of the available distance metrics.

Table 5.5 Methods supported by the stringdist package

method Distance
"hamming" Hamming distance
"lcs" Longest common subsequence
"lv" Generalized Levenshtein
"dl" Generalized Damerau–Levenshtein
"osa" Generalized optimal string alignment (default)
"jw" Jaro, Jaro–Winkler distance
"qgram" c05-math-346-gram distance
"cosine" Cosine distance (based on q-gram profiles)
"jaccard" Jaccard distance (based onc05-math-347-gram profiles)
"soundex" Soundex distance

The term ‘generalized’ indicates that edit operations can be weighted.

Depending on the chosen distance, there are some options to set. The edit-based distances (Levenshtein, optimal string alignment, and Damerau–Levenshtein) are implemented in the generalized form, so weights for different edits may be specified in the order deletion, insertion, substitution, and transposition (if applicable).

  stringdist("hello","hallo", method="osa", weight=c(1,1,0.1,1))
  ## [1] 0.1

For the Jaro–Winkler distance, there are weights for matched characters in the first string, matched characters in the second string, and a weight for the contribution of the transpositions [see Eq. (5.12)]. These can be passed through the same weight argument. If method is set to "jw", Eq. (5.14) is used to compute the Jaro–Winkler distance, where the default value of Winkler's penalty factor p is set to 0. That is, the pure Jaro distance is computed by default. This can be controlled with the parameter p.

  # Compute the Jaro distance
  stringdist("marhta", "martha", method="jw")
  ## [1] 0.05555556
  # Compute the Jaro-Winkler distance with p=0.1
  stringdist("marhta", "martha", method="jw", p=0.1)
  ## [1] 0.03888889

Like the method parameter, these extra options are available for amatch, ain, and stringdistmatrix as well. So to do an approximate lookup using the Jaro–Winkler distance with p = 0.1, one simply does the following.

  # raw data
  x <- c("Stan Marhs","Kyle Brovlofski")
  # lookup table
  sp_char <- c("Kenny McCormick","Kyle Broflovski","Stan Marsh",
  "Eric Cartman")
  # find matches
  amatch(x=x, table=sp_char, method="jw", p=0.1, maxDist=0.2)
  ## [1] 3 2

For distances based on q-grams there is an optional parameter q, which by default is set to 1.

  amatch(x, table=sp_char, method="qgram", q=2, maxDist=5)
  ## [1] 3 2
  stringdistmatrix(x, sp_char, method="qgram", q=2, useNames="strings")
  ##               Kenny McCormick Kyle Broflovski Stan Marsh Eric Cartman
  ## Stan Marhs                   21             23          4           16
  ## Kyle Brovlofski              28              4         23           25

Finally, we mention that soundex codes can be computed with the phonetic function.

  phonetic(sp_char)
  ## [1] "K552" "K416" "S356" "E626"

In stringdist, the soundex distance between two strings is equal to 0 if they have the same soundex code and 1 otherwise.

  stringdistmatrix(x, sp_char, method="soundex", useNames="strings")
  ##               Kenny McCormick Kyle Broflovski Stan Marsh Eric Cartman
  ## Stan Marhs                    1              1          0            1
  ## Kyle Brovlofski               1              0          1            1

String Similarity

For some applications, it can be more convenient to think in terms of string similarity than string distance. For any given distance measure, a similarity measure can be defined, but the procedure to do so depends on the distance used. In particular, it depends on the range of the distance function, and we distinguish two cases.

The stringdist package implements a function that maps any string distance between two strings to a value in [0, 1], where 0 means complete dissimilarity and 1 means complete similarity. Note that for distances depending on q-grams, complete similarity does not necessarily mean equality since those distances do not satisfy the identity property of Eq. (5.8). If d(s, t) is the distance between two strings, the string similarity is defined as

5.16   sim(s, t) = 1 − d(s, t)/d_max

That is, the distance is divided by the maximum possible distance d_max between two strings of the same lengths as the input strings. How this maximum is computed depends on the distance used. In exceptional cases where the maximum equals 0 (e.g., q-gram distances and c05-math-356), the ratio in the abovementioned equation equals 0/0, which is defined as equal to one for this purpose. This definition includes the simple case where distances by definition yield values in [0, 1], such as the Jaro–Winkler distance. In that case, Eq. (5.16) reduces to 1 − d.

Computing string similarities is done with the stringsim function, which works similarly to the stringdist function.

  stringdist::stringsim("hello",c("hallo", "olá"), method="lcs")
  ## [1] 0.80 0.25

Here, the lcs distance between "hello" and "hallo" equals 2 since the characters "e" and "a" are unmatched. The maximum lcs distance for two strings of length 5 equals 10 [see Eq. (5.11)], so the string similarity equals 1 − 2/10 = 0.8.

The advantage of such a normalized string similarity over a raw string distance is that similarities can be compared more easily across different string metrics. The normalization by the maximum possible distance yields a more neutral degree of similarity.
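As an illustration, the sketch below computes the similarity between the same pair of strings under three different metrics; the values in the comments are approximate.

  stringdist::stringsim("hello", "hallo", method = "osa")           # 0.80
  stringdist::stringsim("hello", "hallo", method = "jw")            # about 0.87
  stringdist::stringsim("hello", "hallo", method = "qgram", q = 2)  # 0.50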

q-Gram-Based Distances in the textcat Package

The textcat package offers a number of q-gram-based distances that are especially interesting for text categorization purposes such as language detection. Following the method of Cavnar and Trenkle (1994), the profile computed for a string s is the concatenation of the profiles for q = 1 up to q = 5. The function textcat_xdist computes a distance matrix over a character vector based on these profiles.

  sp_char <- c("Kenny McCormick","Kyle Broflovski","Stan Marsh",
  "Eric Cartman")
  library(textcat)
  textcat_xdist(sp_char)
  ##      [,1] [,2] [,3] [,4]
  ## [1,]    0 1891  954 1286
  ## [2,] 1894    0  980 1349
  ## [3,]  910  934    0  509
  ## [4,] 1244 1334  557    0

Using the method argument, several distance functions between c05-math-366-gram profiles can be chosen. The default method is the out-of-place measure of Cavnar and Trenkle (1994), which is based on a simple rank statistic over the profiles.

textcat supports a number of distance measures on the Cavnar–Trenkle profiles, including measures based on the Kullback–Leibler divergence and the cosine distance.

Computing String Kernel Distances with kernlab

To compute a string kernel distance, one proceeds in two steps. The function stringdot creates a kernel function for subsequences of a given length.

  library(kernlab)
  sk <- stringdot()

Here, sk is a function that computes the cosine between two string kernel profiles. Since the cosine expresses a similarity, we can compute a distance between two strings as follows.

  1 - sk("kyle brovlofski", "kyle broflovski")
  ## [1] 0.5384615

The default values for λ and the subsequence length are 1.1 and 4, respectively. These parameters may be altered with the lambda and length parameters.

  sk3 <- stringdot(length=3,lambda=0.5)
  1 - sk3("kyle brovlofski", "kyle broflovski")
  ## [1] 0.4285714

Deciding on Matching Method and Parameters

The metric and parameter set to be used depend both on the application and on the type of strings being matched. Both q-gram-based metrics and string kernels have been widely researched in the context of categorization of text corpora. For example, Cavnar and Trenkle (1994), Huffman (1995), Huffman and Damashek (1995), Lodhi et al. (2002), and Karatzoglou and Feinerer (2007) all categorize texts based on combinations of q-gram profiles spanning multiple values of q.

In the context of data cleaning, the most relevant application is the lookup of polluted items in a list of known valid values. A closely related application is statistical matching, where approximate key matching is an important subtask. Several authors have assessed the accuracy of (combinations of) string metrics in the context of finding spelling mistakes (Zamora et al., 1981), matching bibliographic records (Lawrence et al., 1999), name matching (Cohen et al., 2003; Winkler, 1990), and toponym matching (Recchia and Louwerse, 2013; Sehgal et al., 2006). It is hard to point out a single method that is optimal for each case, let alone for all cases. For example, Cohen et al. (2003) report good results using the Jaro–Winkler distance (setting p = 0.1), while Sehgal et al. (2006) obtain better results using edit-based distances.

As a result, the choice of matching algorithm will be a practical trade-off between performance, necessary accuracy, and the amount of resources that can be invested in developing a matching methodology. There are however a few basic considerations.

If strings have an imposed structure such as a fixed length (think of phone numbers, IBAN bank account numbers, or credit card numbers), the Hamming distance may be of use. If not, (word) order, the supposed error mechanism, string length, and performance come into consideration.

For some strings, like name data or short product descriptions, the order of words may be of less relevance. For example, the strings "Farnsworth, Hubert J." and "Hubert J. Farnsworth" represent the same person. To compare such strings, a measure that is less dependent on (word) order is to be preferred. As an example, compare the cosine similarity (based on c05-math-374-gram profiles) with the Jaro–Winkler similarity.

  stringsim("Farnsworth, Hubert J.", "Hubert J. Farnsworth",
  method="cosine", q=2)
  ## [1] 0.88396
  stringsim("Farnsworth, Hubert J.", "Hubert J. Farnsworth",
  method="jw", p=0.1)
  ## [1] 0.465873

Clearly, the fact that q-gram profiles are to a large extent independent of the order in which the q-grams appear in the string results in a high similarity. The Jaro–Winkler distance, which is geared toward a certain type of typing mistakes, results in a much smaller similarity.

When there is reason to suspect that strings differ in only a few characters, an edit-based distance or the Jaro–Winkler distance is a good option. For edit-based distances, long strings may be prohibitive as the computational time increases with the product of the lengths of the strings compared. In Figure 5.3 we compare the computation time necessary to compute the lower triangular distance matrix of 400 randomly generated names with lengths ranging from 8 to 28 characters. The boxplots show timing distributions over 250 repetitions of the experiment, conducted with the stringdist package. One immediately recognizes why the Jaro–Winkler distance has become so popular: it is even faster than the fairly simple longest common subsequence distance. If a more accurate edit-based distance is necessary, one may choose the optimal string alignment distance. It is just a little slower than the Levenshtein method but much faster than the full Damerau–Levenshtein distance. The q-gram distance takes about 40% longer than the optimal string alignment distance in this benchmark, but for longer strings this disadvantage quickly disappears: the running time of q-gram-based distances depends on the sum of the input string lengths, which makes them ideal candidates for ploughing through long texts.


Figure 5.3 Benchmark of times to compute string distance measures in the stringdist package (smaller is faster). In this benchmark, the lower triangular part of a 400 × 400 distance matrix was computed 250 times using stringdist::stringdistmatrix. The durations were measured with the microbenchmark package. The strings consisted of randomly generated names of lengths varying between 8 and 28 characters.

Figure 5.4 shows a benchmark for string distance calculations between pairs of equally sized random strings of lengths ranging from 8 to 256 characters. At a length of 200 characters, the asymptotic computational complexity has been reached. The soundex distance is a little faster to compute than the Hamming distance. The full Damerau–Levenshtein distance is slowest in all cases, while the q-gram-based method becomes faster than the optimal string alignment, Levenshtein, and lcs distances somewhere between 30 and 50 characters. At about 175 characters, q-gram-based distances are computed faster than the Jaro–Winkler distance, which in turn is faster to compute than any edit-based distance for any string length. In general, accuracy is probably of more concern than performance; but if accuracy is of little or no concern, Figure 5.4 provides a guide to what distance measure to choose. The fact that the Jaro–Winkler distance is about 3–7 times faster than the optimal string alignment distance while providing acceptable accuracy probably explains its wide popularity in the record linkage community [see, e.g., Cohen et al. (2003)]. Finally, we note that the functionality of the stringdist package automatically uses multiple cores, so performance can be improved by simply switching to a larger machine.
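A minimal sketch of such a timing experiment is given below; it assumes the microbenchmark package mentioned in the figure captions and uses a single pair of random 100-character strings, so the numbers will differ from those shown in Figures 5.3 and 5.4.

  library(microbenchmark)
  set.seed(1)
  # two random strings of 100 characters each
  x <- paste(sample(letters, 100, replace = TRUE), collapse = "")
  y <- paste(sample(letters, 100, replace = TRUE), collapse = "")
  # compare timings of three distance methods
  microbenchmark(
    osa   = stringdist::stringdist(x, y, method = "osa"),
    jw    = stringdist::stringdist(x, y, method = "jw"),
    qgram = stringdist::stringdist(x, y, method = "qgram", q = 2),
    times = 100
  )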


Figure 5.4 Benchmarks of computing string distance measures in the stringdist package (smaller is faster). In this benchmark, c05-math-382 lower triangular distance matrices were computed between random strings of lengths varying from 8 to 256 characters. The lines represent median relative times over 10 repetitions. Results have been normalized to the maximum median time. Results for the soundex distance were not plotted as its curve nearly coincides with that of the Hamming distance.

Measuring Matching Accuracy

Suppose we have a list X of polluted strings and a lookup table Y. The task is to match as many polluted strings as possible correctly to elements in Y. We assume that for every element in X there should be a corresponding element in Y. This matching may be one-to-many, that is, multiple elements of X might match the same element of Y. For example, X can be a long list of possibly misspelled place names and Y a short list of official place names, with |X| ≥ |Y|. In general, a matching algorithm generates a subset of pairs M ⊆ X × Y. Suppose that the true match is given by the set T ⊆ X × Y. The precision of the approximate matching algorithm is defined as the fraction of elements in M that are correctly matched,

precision = |M ∩ T| / |M|

The recall is defined as the fraction of elements in T that are correctly matched:

recall = |M ∩ T| / |T|

Observe that both measures scale from 0 to 1, where 0 corresponds to not a single correct match (M ∩ T = ∅). A precision equal to 1 means that every found pair is also a true match (M ⊆ T), but not necessarily that every possible match has been found. A recall equal to 1 means that every true match is in M (T ⊆ M), but not necessarily that every pair in M represents a true match.

The precision and recall can usefully be combined into what is called the F_1 measure, which is defined as

5.17   F_1 = 2 · precision · recall / (precision + recall)

Here, we silently agree that F_1 = 0 if the precision and/or the recall equals zero. It is easy to see that F_1 = 1 if and only if both the precision and the recall equal 1, and that F_1 decreases to zero when either the precision or the recall approaches zero.
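The following helper is a sketch of how these quantities may be computed when the found and true matchings are stored as data frames of index pairs (x, y); the function name and data layout are illustrative only.

  match_quality <- function(found, truth) {
    # represent each pair as a single key so that set operations can be used
    key       <- function(d) paste(d$x, d$y, sep = "\r")
    correct   <- length(intersect(key(found), key(truth)))
    precision <- if (nrow(found) > 0) correct / nrow(found) else 0
    recall    <- if (nrow(truth) > 0) correct / nrow(truth) else 0
    F1 <- if (precision + recall > 0) {
      2 * precision * recall / (precision + recall)
    } else {
      0
    }
    c(precision = precision, recall = recall, F1 = F1)
  }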

If test sets with known matches are available, the abovementioned measures can easily be computed for different metrics and parameters. In general, it is a good strategy to use cross-validation techniques to evaluate a matching methodology: split the data into training and test sets, and evaluate c05-math-407 on the test sets. Discussion of cross-validation techniques is beyond the scope of this book, but there are several textbooks that introduce the topic. James et al. (2013) discuss basic cross-validation techniques with applications in R, while Hastie et al. (2001) focus on theory. A comprehensive overview of cross-validation theory and techniques is given in the paper by Arlot et al. (2010).

Encoding Issues

The accuracy of string distance computation may decrease when encoding is not taken into account. In particular, some implementations of (edit-based) string metrics compute distances between the byte-wise representations of strings rather than taking account of possible multi-byte characters. Both the distance functions in base R and those in the stringdist package take account of encoding by default and have the option to ignore it.

  stringdist(enc2utf8("Motörhead"), "Motorhead", method="lv")
  ## [1] 1
  stringdist(enc2utf8("Motörhead"), "Motorhead", method="lv", useBytes=TRUE)
  ## [1] 2

Here, we use enc2utf8 to ensure that the results are the same on all operating systems supported by R (R assumes the OS's native encoding if it is not explicitly specified, see Section 3.2.4). In the first example, the Levenshtein distance between "Motörhead" and "Motorhead" equals 1, corresponding to the substitution of "ö" by "o". In the second example, the strings are treated byte-wise. Since "ö" is stored as a two-byte character in UTF-8, going from "ö" to "o" amounts to removing one byte and substituting another. The situation is even worse than this, since UTF-8 has a second way to represent "ö", namely, as an "o" followed by a combining character that adds the umlaut to the previous character. The two different representations are generally not automatically recognized by implementations of distance metrics.

Therefore, before computing string distances, one should ensure at least that all strings are stored in the same encoding. Here, UTF-8 is a good choice since it both supports (nearly) every symbol invented and is widely supported across different software. Second, one needs to ensure that the encoding is normalized so that (combined) characters have a unique byte-wise representation. Techniques for Unicode normalization and encoding conversion were discussed in Section 5.1.1.
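The following sketch illustrates the problem of combining characters and one possible remedy; it assumes the stringi package for Unicode normalization.

  o1 <- "\u00f6"     # precomposed o-umlaut
  o2 <- "o\u0308"    # "o" followed by a combining diaeresis
  # both render as the same symbol, but they are two code points apart
  stringdist(o1, o2, method = "lv")
  # after NFC normalization both strings have the same representation
  stringdist(stringi::stri_trans_nfc(o1),
             stringi::stri_trans_nfc(o2), method = "lv")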
