Chapter 5
Cleaning Text Data

Preparing text data takes place on a number of levels. From the bottom up, there are technical (encoding) issues; string issues, which relate to the structuring of the storage format, case folding, and so on; and semantic issues, which relate to the meaning of text.

Cleaning on the technical level mostly concerns solving encoding issues. That is, can we properly interpret the sequence of bytes comprising a file as a set of symbols? We will find that there is no definite answer to this but there are tools that implement effective heuristics that work in many cases.

At the string level, we find issues like checking the file format (is a file proper CSV or HTML?), cleaning up punctuation or spurious white spaces, and localizing typos by approximate matching against a list of known or allowed terms. It also includes extracting the interesting portions of a text string, for example, by harvesting the URL that is stored in a <a href="[url]"> tag of an HTML file. Activities at the string level therefore consist of more or less classical techniques for finding, replacing, and substituting substrings, as well as approximate matching of strings based on string metrics. We shall find that methods based on pattern matching and methods based on approximate matching are in a way complementary approaches to common text cleaning problems and that it is effective to combine them.

The semantic domain involves activities that make use of the meaning of text. Here, one may distinguish many sublevels again by looking at the meaning of single terms, terms in the context of a sentence, sentences in the context of a text, and so on. Following this path, we quickly move into the domain of semantic text analysis, which is beyond the scope of this book. However, we discuss a few tools and techniques that are readily available, such as reducing verbs to their stem and language detection.

Notation of alphabets and strings: We introduce a few concepts and notation that will be used throughout this chapter. We already encountered the term alphabet in Section 3.2.1, where it was defined as a nonempty finite set of abstract symbols. Alphabets are denoted with the Greek capital letter Σ. Examples include the binary alphabet {0, 1} or the complete set of Unicode characters. Any finite nonempty subset of an alphabet is also an alphabet.

The symbols of an alphabet can be concatenated to form strings. The set of all strings that consist of zero or any finite number of characters is denoted as Σ*. Elements of Σ* are denoted as s, t, or u, and string literals will be printed between quotes. For example, if Σ = {a, b}, then "abba" is an element of Σ*. The string consisting of zero characters is a special case with its own notation: ε. The length of a string is the number of characters it is composed of (in Unicode: code points). We will use the notation |s| to indicate the string length. We have |ε| = 0 and |"abba"| = 4. The set of all strings of length n is denoted as Σ^n.

The definition of Σ^n gives rise to a more formal definition of Σ*, namely,

5.1   Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ ⋯

In fact, we may consider * as an operator that generates concatenations. We will encounter this operator again when we discuss regular expressions.

5.1 Character Normalization

Normalization is the process of making choices consistent. That is, in many cases, the same information may be represented in more than one way and before any further processing it is convenient to represent the same information in the same way. For example, the strings "Sarah", "sarah", and "SARAH" all refer to the same person but in a different representation.

Character normalization comes in two flavors: normalization of the encoding (technical representation of text) and normalization of symbols (character conversion). Both cases are discussed in the next section.

5.1.1 Encoding Conversion and Unicode Normalization

A fundamental step in text cleaning is making sure that all character data is stored in a single encoding scheme. In principle, any encoding that supports the alphabet the text is written in suffices. However, because of its broad support across many systems and software, and because it extends ASCII, it is a good idea to use UTF-8, at least for the internal representation of text.

Detecting in what encoding a file is stored was discussed in Section 3.2.7. Also recall that text data read in R is always internally stored as latin1 (Windows) or UTF-8 (others). Base R has several tools available to convert encoding, all of them based on an implementation of the iconv specification (The Open Group, 2004). To convert all elements of a character vector to UTF-8, one can use enc2utf8(). Converting between any pair of encoding schemes can be done with iconv().

  y <- iconv(x, to="[encoding]", from="[encoding]")

The encoding specifications that are supported depend on the system running R and can be requested by typing the following command.

  iconvlist()

The stringi package is not based on iconv but utilizes the ICU library (ICU, 2014) instead, thus guaranteeing platform independence. The stringi equivalent of enc2utf8 is stri_enc_toutf8(); general conversion between encoding schemes is done with stri_encode().

  y <- stringi::stri_encode(str, from="[encoding]", to="[encoding]")

The valid encoding specifications can be obtained with the following command:

  stringi::stri_enc_list()

The stringi package supports fewer encoding schemes than iconv on most or all systems supported by R but, unlike base R, it does guarantee platform independence.

As discussed in Section 3.2.3, some symbols can be represented in more than one way in Unicode. As an example, consider singly accented Latin characters. These may be represented by a single code point representing the accented character, or by two code points, where the first determines the base character and the second adds the accent. Before any further processing it is a good idea to normalize such cases since, for example, find-and-replace operations performed later on in the text may be sensitive to the underlying representation.

Unicode has several normal forms. As a first example, we normalize to the NFC form which is recommended by the Internet Engineering Task Force [IETF, Klensin and Padlipski (2008)]. To demonstrate Unicode normalization, we first define the letter ‘ö’ in two ways:

  x <- c("uf6"        # small latin 'o' with diaereses
       , "u6fu0308") # small latin 'o' followed by combining diaereses
  x
  ## [1] "ö" "ö"

The stored byte sequences can be revealed with iconv.

  iconv(x,toRaw=TRUE)
  ## [[1]]
  ## [1] c3 b6
  ##
  ## [[2]]
  ## [1] 6f cc 88

We now normalize each element of x and again inspect the byte sequences.

  y <- stringi::stri_trans_nfc(x)
  iconv(y,toRaw=TRUE)
  ## [[1]]
  ## [1] c3 b6
  ##
  ## [[2]]
  ## [1] c3 b6

So in NFC the collapsed version of ‘ö’ is preferred.

The abbreviation NFC stands for Normalization Form Canonical Composition. The algorithm behind it first decomposes, then recomposes (accented) characters in a predefined, unique order. Not all combined characters are decomposed by NFC. For example, ligatures such as ‘ﬀ’ are left alone by NFC (a ligature combines two or more letters into a single symbol, e.g., "f", "f", "i" becomes "ﬃ"). The NFKC normalization (Normalization Form Compatibility Composition) also takes those cases into account. Compare the following results:

  x <- "ufb00" # Latin small ligature ff
  x
  ## [1] "ff"
  nchar(x)
  ## [1] 1
  nchar(stringi::stri_trans_nfc(x))
  ## [1] 1
  nchar(stringi::stri_trans_nfkc(x))
  ## [1] 2

In the first case, the normalized x still consists of a single character representing the ligature ‘ﬀ’. In the second case, the ligature character is split into two separate f's. As a consequence, the letter ‘f’ will only be detected after NFKC normalization.

  grepl("f", stringi::stri_trans_nfc(x))
  ## [1] FALSE
  grepl("f", stringi::stri_trans_nfkc(x))
  ## [1] TRUE

Here, we used grepl to determine whether the letter ‘f’ occurs in the string.

Finally, we note that there are two more normal forms which are less interesting from a text cleaning point of view, namely, NFD (Normalization Form Canonical Decomposition), where (e.g., accented) symbols are decomposed into their constituent parts, and NFKD (Normalization Form Compatibility Decomposition), where combined (e.g., ligature) characters are decomposed as well.
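
A small sketch of NFD at work on the composed ‘ö’ from the earlier example; after decomposition the string consists of two code points.

  x <- "\uf6"   # 'ö' in composed form
  nchar(x)
  ## [1] 1
  nchar(stringi::stri_trans_nfd(x))
  ## [1] 2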

5.1.2 Character Conversion and Transliteration

Character conversion is a simple process where a user-specified set of symbols is replaced by another set of symbols. Mathematically, it can be interpreted as a function sending symbols from an alphabet Σ to an alphabet Σ′,

5.2   f : Σ → Σ′.

Here, f can be one-to-one or many-to-one, and Σ and Σ′ may or may not be equal. Since there is no such thing as an ‘empty character’, mappings that send a character to ‘nothing’ cannot be used to remove characters from a string. Character removal should be interpreted as an operation on strings, where a substring (possibly of length 1) is replaced by the empty string. Such operations are discussed in the next section (Section 5.2).

In R, general conversion maps can be specified with the function chartr.

  chartr(old = "[character]", new = "[character]", x = "[character]")

It takes as input a string of characters to convert, a string of characters to convert to, and a character vector of strings to convert. As an example, consider the soundex algorithm (see also Procedure 5.4.1). One step of this algorithm involves replacing every vowel (of the ASCII alphabet) by the small latin letter ‘a’. Using chartr this can be specified as follows:

  chartr("eiou","aaaa", c("hello world","Olá Mundo"))
  ## [1] "halla warld" "Olá Manda"

For each old character, a new character must be provided. The chartr function is general enough to define any map of the form (5.2), that is, any translation that can be done character by character. This example shows that chartr requires a precise specification: neither the capital letter ‘O’ nor the accented ‘á’ is converted; they must be specified explicitly.
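
As a small sketch, the map can be extended explicitly with the capital vowels and the accented ‘á’ (the set of characters added here is illustrative, not exhaustive):

  # also map capital vowels and the accented 'á' explicitly
  chartr("eiouEIOU\u00e1", "aaaaAAAAa", c("hello world","Olá Mundo"))
  ## [1] "halla warld" "Ala Manda"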

There are two character translation jobs that are so common that they are accessible as a separate functionality: case conversion (also called case folding) and transliteration. The base R functions toupper and tolower convert a string to completely upper or lowercase. The mapping between upper and lowercase is determined by the locale setting (LC_CTYPE, see Section 3.4) and need not be specified explicitly by the user.

  toupper(c("hello world","Olá Mundo"))
  ## [1] "HELLO WORLD" "OLÁ MUNDO"
  tolower(c("hello world","Olá Mundo"))
  ## [1] "hello world" "olá mundo"

The stringi package also offers a function to translate words to upper, lower, or title case.

  stringi::stri_trans_tolower(c("hello world","Olá Mundo"))
  ## [1] "hello world" "olá mundo"
  stringi::stri_trans_toupper(c("hello world","Olá Mundo"))
  ## [1] "HELLO WORLD" "OLÁ MUNDO"
  stringi::stri_trans_totitle(c("hello world","Olá Mundo"))
  ## [1] "Hello World" "Olá Mundo"

The advantages of the stringi package over base R are the ease of switching between locales (by the argument locale) and guaranteed consistency across operating systems. Moreover, the title case conversion, which necessarily involves word detection, has detailed controls for word boundary detection.
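
A brief sketch of these controls; the locale identifier "tr_TR" and the break iterator option used here are assumptions based on the stringi documentation.

  # Turkish locale: lowercase 'i' maps to the dotted capital 'İ'
  stringi::stri_trans_toupper("i", locale = "tr_TR")
  ## [1] "İ"
  # title case using sentence boundaries instead of word boundaries
  stringi::stri_trans_totitle("hello world",
    opts_brkiter = stringi::stri_opts_brkiter(type = "sentence"))
  ## [1] "Hello world"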

Transliteration is a mapping of the form (5.2), where Σ′ is either a subset of Σ or an altogether different alphabet. A common use of transliteration is to replace accented characters by characters without accent. The actual transformations are governed by library-specific transliteration rules that determine how to convert from one alphabet to another. In base R, transliteration can be achieved by appending //TRANSLIT to the encoding specification in iconv. The stringi package offers a specialized function called stri_trans_general.

  iconv(c("hello world","Olá Mundo"), to="ASCII//TRANSLIT")
  ## [1] "hello world" "Ola Mundo"
  stringi::stri_trans_general(c("hello world","Olá Mundo"),
  id="latin-ascii")
  ## [1] "hello world" "Ola Mundo"

In both cases, the non-ASCII character á has been replaced by a.

5.2 Pattern Matching with Regular Expressions

The ability to recognize a substring within a larger string lies at the heart of many string cleaning actions. Actions like trimming of white spaces, punctuation removal, or splitting strings into separate lines or words all rely on the fact that patterns in strings can be recognized.

The theory of pattern recognition in strings is closely tied to early developments in theoretical computer science. In fact, the so-called regular expressions, which may be considered one of the most basic tools in text processing, were first developed by Kleene (1951) in a paper on “Representation of events in nerve nets and finite automata”. The algebraic theory of regular expressions is nowadays well understood and described in standard textbooks on the foundations of computer science, such as the excellent book by Hopcroft et al. (2007). In this chapter, we give a compact overview of the most important features of regular expressions, focusing on their use within R. For a very thorough and complete overview of practical regular expressions, we recommend the book by Friedl (2006).

For our purposes, a regular expression can be interpreted as a short notation for defining a range of strings over an alphabet. Computer programs such as grep or sed, which are a standard part of Unix-like operating systems, accept regular expressions and are able to very efficiently find, or find and replace, substrings in larger pieces of text.

5.2.1 Basic Regular Expressions

A regular expression is a short notation that indicates a subset of Σ*, the set of all strings that can be composed of the symbols in an alphabet Σ. Regular expressions are capable of denoting finite or infinite subsets of Σ*, but not every subset of Σ* can be denoted by a regular expression.

All regular expressions can, in principle, be built up out of a few basic elements. As an example, consider the following regular expression that already contains all basic operations:

5.3   (a|b)c*

This expression stands for the set of strings that start with an "a" or a "b", followed by zero or more "c"'s. In other words, it corresponds to the set

5.4   L = {"a", "b", "ac", "bc", "acc", "bcc", "accc", ...}.

To understand the notation, we read expression (5.3) from left to right. Between brackets, we find a|b, where the vertical bar | is the familiar ‘or’ operator. Since these are the first substrings in the expression, this means all strings indicated by (5.3) must begin with either "a" or "b". Next, we find a c, followed by an asterisk *. The asterisk operator may be read as ‘zero or more of the previous’, where ‘the previous’ extends only to the character immediately preceding it, in this case the c.

Before continuing to the formal definition of these operators, let us look at an example of how regular expressions can be used in R. The function grepl accepts two main arguments: a regular expression and a character vector. It returns a logical vector indicating for each element of the character vector whether a substring matches the regular expression.

  str <- c("a","xacc","xdcc","xbccccxyac")
  grepl("(a|b)c*",str)
  ## [1]  TRUE  TRUE FALSE  TRUE

Here, the first element of str matches since the whole string ("a") is in the set L of Eq. (5.4). The second string, "xacc", is not in L, but it has a substring "acc" that is, so it matches. The same holds for the fourth element, while the third element "xdcc" has no substring that is in L.

The function sub replaces the first substring that matches a regular expression with another string. For example, to replace matching substrings from the example with "H", we do the following:

  sub("(a|b)c*", "H", str)
  ## [1] "H"      "xH"     "xdcc"   "xHxyac"

Observe that, indeed, while the fourth element of str has two substrings matching the regular expression, only the first is replaced. The function gsub replaces all occurrences.

  gsub("(a|b)c*", "H", str)
  ## [1] "H"     "xH"    "xdcc"  "xHxyH"

Formal Regular Expressions

Put together in a set, all regular expressions form an algebra. That is, regular expressions can be built up and combined systematically by applying a few basic operations, possibly over and over again. Each such operation yields a new and valid regular expression.

The first three rules for constructing regular expressions just introduce some notation and relate regular expressions to sets of strings.

  1. Rule 1 The symbols ε and ∅ are regular expressions, representing, respectively, the empty string "" and the empty subset of Σ*.
  2. Rule 2 If a is a symbol in Σ, then a is a regular expression representing {"a"}.
  3. Rule 3 If r is a regular expression, then so is (r).

The first rule defines mathematical cases that are necessary for a complete algebraic development of regular expressions. They will not be important for our purpose. Brackets have exactly the effect one expects of them: they determine the order in which combining operations are executed, just like in normal arithmetic. The next set of rules determines how regular expressions can be operated upon.

The fourth and fifth rules introduce concatenation and ‘or’-ing of regular expressions.

  1. Rule 4 If r and s are regular expressions, then so is rs.
  2. Rule 5 If r and s are regular expressions, then so is r|s.

The | notation indicates that a position in a string can be occupied by either of its operands. For example, the regular expression a(a|ab)a indicates the set {"aaa", "aaba"}. The set of strings defined by the concatenation of two regular expressions is the concatenation of all strings of the first with all strings of the second set. In other words, if L_r and L_s are the sets of strings defined by regular expressions r and s, then the set defined by rs is given by {vw : v ∈ L_r and w ∈ L_s}. For example, take

  r = a|b and s = cd, so that L_r = {"a", "b"} and L_s = {"cd"}.

Concatenating r with s, we get

  rs = (a|b)cd, with set of strings {"acd", "bcd"}.

The operations introduced thus far only allow us to define finite sets of strings. The star operator, introduced in the last rule, changes this.

  1. Rule 6 If r is a regular expression, then so is r*.

The star operator indicates ‘repeat the preceding symbol zero or more times’. The operator was first introduced by Kleene (1951), and it is therefore often referred to as the Kleene (star) operator. For example, the regular expression a*ba*c corresponds to the set

  {"bc", "abc", "bac", "aabc", "abac", "baac", ...}.

To complete our discussion of formal regular expressions, we need to define the rules for operator precedence. That is, we need to define rules so we understand whether, for example, a|b* should be read as (a|b)* or as a|(b*). The rule is that the * goes first, then comes concatenation, and finally |. In other words, the expression ab|c* must be read as (ab)|(c*).

Limits of Regular Expressions

As stated before, regular expressions are finite notations indicating subsets of Σ*. In particular, the Kleene operator allows us to indicate infinite sets of strings with a regular expression of finite size. A natural question is whether every possible subset of Σ* can be denoted with a finite regular expression. The answer to this question is ‘no’. In fact, subsets of Σ* that can be defined by a regular expression are called regular.

There are several ways to prove this. One way is to show that the number of subsets of Σ* is much larger (uncountably infinite) than the number of sets that can be indicated by regular expressions (countably infinite). A second way is by giving a counterexample. For example, the set of strings defined by

  L = {"ab", "aabb", "aaabbb", ...}

cannot be denoted by a finite regular expression. The actual proof of this statement relies on automata theory and is beyond the scope of this book. It can be found in many textbooks on languages and automata theory, such as Hopcroft et al. (2007). However, the result can be made intuitively clear by attempting to build a regular expression for L. For example, the expression aa*bb* fails, since this also includes strings not in L, such as "abb" and "aaab". Another natural candidate would be ab|aabb|aaabbb|⋯. However, this expression is clearly not finite and so contradicts our demand that L be expressed with a finite regular expression. The main intuition here is that regular expressions lack a certain sense of memory. To be able to tell whether a string is in L or not, one needs to count the number of "a"'s and then verify whether the ensuing number of "b"'s is the same. Since we have no way to count the number of characters, it is not possible to define a regular expression that defines L. In the next section, we shall see that extended regular expressions (EREs) add behavior and operators to relieve this limitation.

Because sets of strings can be defined in many ways, there is no easy general technique or rule to determine whether an infinite set of strings is definable by a regular expression or not. Rather, there is a set of mathematical techniques that have to be applied on a case-by-case basis. An easy-to-remember rule of thumb, however, is that a set of strings is not regular when, for the strings in the set, the structure at one position depends directly on the structure at another position.

Exercises for Section 5.2.1

5.2.2 Practical Regular Expressions

Thus far, we have discussed regular expressions in the more or less theoretical context of very simple alphabets such as {0, 1} or {a, b, c}. For larger alphabets it quickly becomes cumbersome to express string sets such as ‘any string that begins with four arbitrary symbols’, since that would involve ‘or’-ing all characters of the alphabet.

For this reason, programs working with regular expressions include extra operators that indicate ranges of characters (called character classes), special versions of the Kleene operator to indicate (possibly finite) repetition, or the beginning or ending of a string.

The extensions that are supported ultimately depend on the software one uses, and there are many programs and programming languages around that support regular expressions. However, EREs specified in the Single Unix Specification (The Open Group, 1997) are supported as a minimum in many applications, including R. Perhaps most notably, extensions added by the Perl programming language on top of ERE have become very popular and widely supported as well. Indeed, R also supports Perl regular expressions out of the box. The stringi package of Gagolewski and Tartanus (2014) also comes with an implementation of regular expressions. It is based on the ICU library (ICU, 2014), which offers yet another extension of ERE regular expressions.

To complicate things a little further, the actual alphabet used to interpret regular expressions may depend on locale settings. For example, in R, the LC_CTYPE setting determines the alphabet (e.g., the Unicode alphabet) while LC_COLLATE determines how character equivalence classes are defined.

Character Ranges

There are four ways to indicate that a string must have any of a range of characters at a certain position. The first we have already seen: it involves using the ‘or’ operator. Secondly, to indicate a position where any single character may occur, one uses the . (period) as a ‘wildcard’ character: it stands for any symbol from the alphabet. For example, the regular expression a.b stands for all strings starting with "a", followed by an arbitrary character, followed by "b".

Third, expressions surrounded by square brackets [ ] can be used to indicate custom character ranges. So the range [AaBb012] is equivalent to A|a|B|b|0|1|2. If the closing bracket ] is part of the range, it must be included as the first character after the opening bracket, e.g., []ABC]. A range of consecutive characters can be indicated with a dash -, so, for example, [abcd] is equivalent to [a-d]. If the dash itself is part of the range, it must be included as the final character in the range, that is, -|a|b|c|d can be expressed as [a-d-]. It is also possible to negate a range of characters. For example, to express strings that start with "a" and then have any character except one in the range c-f, one uses the caret ^ at the beginning of the range: a[^c-f].

Finally, a number of common character classes have been given predefined names. For example, the class [:digit:] is equivalent to 0-9. Note that named character ranges have to be used within a bracketed expression, so a valid expression detecting any single digit would look like this: [[:digit:]]. An overview of named character classes is given in Table 5.1. Here, we give a few more examples.
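
As a first small example, the following checks whether a string contains at least one digit.

  grepl("[[:digit:]]", c("abc", "a1c", "42"))
  ## [1] FALSE  TRUE  TRUE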

Consider the problem of detecting a class of e-mail addresses from domains ending with .com, .org, or .net. With the operators and character classes just discussed, one may construct a regular expression like this:

5.5   [[:alnum:]][[:alnum:]]*@[[:alnum:]][[:alnum:]]*\.(com|net|org)

This expression detects strings that start with one or more alphanumeric characters, followed by a @, followed again by one or more alphanumeric characters, a period, and then com, net, or org. Note that the period is prepended with a backslash to indicate that the period must not be interpreted as a wildcard but as a simple period.
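
A small sketch of expression (5.5) at work in R, with made-up addresses; the extra backslash required in the R string literal is explained later in this section.

  pat <- "[[:alnum:]][[:alnum:]]*@[[:alnum:]][[:alnum:]]*\\.(com|net|org)"
  grepl(pat, c("john@example.com", "jane@example.co.uk"))
  ## [1]  TRUE FALSE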

Table 5.1 Character ranges in the POSIX standard (version 2), with ranges corresponding to symbols of the ASCII alphabet

Range symbol ASCII range Description
. (period) All Any single character
[:alpha:] [a-zA-Z] Letters
[:digit:] [0-9] Numbers
[:alnum:] [a-zA-Z0-9] Letters and numbers
[:lower:] [a-z] Lowercase letters
[:upper:] [A-Z] Uppercase letters
[:cntrl:] Characters 0–31 Control characters
[:space:] Space, horizontal and vertical tab, line and form feed, and carriage return Space characters
[:print:] [:alnum:], [:punct:], and space Printable characters
[:blank:] Space and tab Blank characters
[:graph:] [:alnum:] and [:punct:] Graphical characters
[:punct:] ! " # $ % & ' () * + , - . / : ; < = > ? @ [ ] ∧ _ { | } c05-math-088 Punctuation
[:xdigit:] [0-9a-fA-F] Hexadecimal numbers

The actual interpretation depends on the alphabet used, which in R depends on the locale settings.

Suppose, as a second example, that we want to detect filenames that end with .csv. The following regular expression detects patterns consisting of one or more arbitrary characters, followed by .csv.

5.6   ..*\.csv$

Here, we also introduced the dollar sign as a special character. The dollar stands for ‘the end of the string’. Recall from Section 5.2.1 that R functions like grepl match any string that has a substring matching the regular expression. By appending the dollar sign, we explicitly specify that the .csv must occur at the end of the string.
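
A brief sketch of expression (5.6) at work, with made-up filenames.

  grepl("..*\\.csv$", c("data.csv", "data.csv.bak", "csv"))
  ## [1]  TRUE FALSE FALSE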

There are two more types of character classes that can be defined with bracketed expressions, both of them related to collation and character equivalence classes (see Section 3.2.8). For example, in one language, the "ä" may be a separate letter of the alphabet, whereas in another language it may be considered equivalent to "a". The special notation for such equivalence classes is [= =]. To indicate a particular class, one specifies a single representative from an equivalence class between [= and =]. For example, if "o", "ó","ô", and "ö" belong to a single class, then the expression [[=o=]p] is equivalent to [oóôöp].

In Section 3.2.8, we also discussed the example from the Czech alphabet, where the combination "CH" is considered a single character sorting between "H" and "I". To distinguish such cases within a bracketed expression, one places multicharacter collations between [. and .]. So, for example, in [[.CH.]abc], the "CH" is treated as a single unit.

Repetitions

The Kleene star operator adds the ability to express zero or more repetitions of a pattern in a regular expression. In the extended regular expressions of the Single Unix Specification, there are several variants that facilitate other numbers of repetitions, all of them having the same precedence order as the Kleene star. An overview is given in Table 5.2.

To demonstrate their purpose, we express the regular expressions of examples (5.5) and (5.6) more compactly as follows:

  [[:alnum:]]+@[[:alnum:]]+\.(com|net|org)   and   .+\.csv$

The combined use of character classes and repetitions can be very powerful. For example, to detect phone numbers consisting of 10 digits, the following regular expression will do.

  [[:digit:]]{10}

If we want to allow phone numbers that may have a dash between the second and third digit, we might use

  [[:digit:]]{2}-?[[:digit:]]{8}

If we also want to include phone numbers where a blank character is used instead of a dash, we can expand the expression further.

  [[:digit:]]{2}[-[:blank:]]?[[:digit:]]{8}
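
As a hedged sketch, the last expression can be tried out as follows (the phone numbers are fictitious).

  pat <- "[[:digit:]]{2}[-[:blank:]]?[[:digit:]]{8}"
  grepl(pat, c("0612345678", "06-12345678", "06 12345678", "061234"))
  ## [1]  TRUE  TRUE  TRUE FALSE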

Table 5.2 Extended regular expression notation for repetition operators

Operator Description
* Zero or more
? Zero or one
+ One or more
{n} Precisely n
{n,} n or more
{n,m} Minimum n to maximally m

Recognizing the Beginning or Ending of Lines

EREs have two special characters that anchor a pattern to the beginning or ending of a line.

^ indicates the beginning of a string
$ indicates the end of a string.

For formal regular expressions, such operators are not needed since they only define a precise range of strings, from beginning to end. However, programs (and R functions) such as grep recognize strings that have a substring matching a regular expression. As an example, compare the two calls to grepl given here.

  x <- c("aa","baa")
  grepl("a(a|b)", x)
  ## [1] TRUE TRUE
  grepl("∧a(a|b)", x)
  ## [1]  TRUE FALSE

The first regular expression is interpreted by grep to match any string having a(a|b) as a substring. Therefore, both strings are matched. The second regular expression, ^a(a|b), explicitly requires the string to begin with an "a".

Special Symbols and Escaping them in R

In the previous sections we introduced several characters that have a special interpretation. This means that to recognize such characters simply as themselves, we have to do something extra, that is, we need to escape their special interpretation. As a reminder, we thus far encountered the following characters with a special interpretation.

  . | ( ) [ ] { } * + ? ^ $ -

To interpret them as literal characters in EREs, one can either precede them by a backslash or include them in a bracketed expression [ ] (but see the section on character ranges on how to properly include ] and -). So, for example, the expression a\.b would recognize "a.b" but not "aHb", and a\\b recognizes "a\b".

In R, however, there is a catch. Recall from Section 3.2.4 that in string literals the backslash already has a special meaning: it can be used to define characters using their Unicode code point. A double backslash indicates an actual single backslash, preventing the interpretation of "\u" as a command.

  c("uf6","\uf6")
  ## [1] "ö"     "\uf6"

Observe that R prints both backslashes at the command prompt. However, one may check, for example by exporting the string to a text file or by checking the string length with nchar, that the actual string contains a single backslash. For string literals that are to be interpreted as regular expressions, a double backslash must be included to indicate a single escape character.

  # detect a.b, not a<wildcard>b
  grepl("a\\.b", c("a.b", "aHb"))
  ## [1]  TRUE FALSE

To recognize a single backslash, a double backslash should be used in an ERE. In R, this means using two escaped backslashes. So, to recognize the string "a\b" (typed in R as "a\\b"), we can do the following:

  # recognize "a" and not "ab"
  grepl("a\\b", c("a\b","ab"))
  ## [1]  TRUE FALSE

Such constructions quickly become unreadable as regular expressions get larger. A better alternative is to use bracketed expressions as follows:

  grepl("a[\]b", c("a\b","ab"))
  ## [1]  TRUE FALSE

This does not reduce the length of an expression, but it does enhance readability. More importantly, the latter statement is compatible with extended regular expressions according to the Single Unix Specification.

Finally, we note that R has an option to escape regular expressions altogether. Functions like sub, gsub, grep, and grepl have an option called fixed. This option allows one to interpret the search pattern as is and not as a regular expression.

  x <- c("a.b", "aHb", "a.b.*")
  grepl("a.b.*", x)
  ## [1] TRUE TRUE TRUE
  grepl("a.b.*", x, fixed=TRUE)
  ## [1] FALSE FALSE  TRUE

Being Lazy, Being Greedy

When a regular expression contains one or more of the repetition operators + or *, it is possible for a string to contain matches of varying length. For example, suppose we wish to remove the HTML tags in the following string.

  The <em>hello world</em> programme.

As a first guess, we might attempt to use the pattern <.*> to find the tags. The problem is that there are several substrings matching this pattern: "<em>", "</em>" and "<em>hello world</em>". If one is only interested in whether a substring occurs or not, this is not a problem. It does become a problem when specific substrings need to be replaced or removed.

For this reason, EREs include a syntax to specify whether the shortest (lazy) or the longest (greedy) match should be used. The default behavior is to use the greedy match. Here is a simple demonstration using R's gsub function, which replaces every occurrence of a match with a replacement string.

  s <- "The <em>hello world</em> programme"
  gsub(pattern="<.*>", replacement="", x=s)
  ## [1] "The  programme"

Here, we replace each occurrence of the pattern <.*> with the empty string "". Since the default behavior is to be greedy, the pattern matches everything from the first occurrence of "<" to the last occurrence of ">". The question mark operator introduced in Table 5.2 switches the behavior to lazy matching.

  gsub(pattern="<.*?>", replacement="", x=s)
  ## [1] "The hello world programme"

The pattern <.*?> translates to: first a "<", followed by zero or more characters, to end with ">". Since we use gsub (the g stands for global), every such occurrence is replaced.

Regular Expressions with Memory: Groups and Back Referencing

We noted in Section 5.2.1 that regular expressions are limited since they lack a sense of memory, and this is true for ‘pure’ regular expressions. In EREs, this restriction is relieved by introducing the concept of groups. The idea is that the strings that match a bracketed expression are stored for later reference. For example, to replace the string

  The <em>hello world</em> programme

with

  The <b>hello world</b> programme

we can devise a regular expression that stores everything between the <em> and </em> tags and places it between <b> and </b> tags.

   s <- "the <em>hello world</em> programme"
   gsub("<em>(.*?)</em>", "<b>\1</b>", s)
  ## [1] "the <b>hello world</b> programme"

Here, the \1 refers back to the first bracketed expression. Most regular expression engines, including R's, allow for back referencing up to nine grouped expressions in this way: \1, \2, ..., \9 (recall that R requires the double backslash, since the first gets processed by R prior to passing it to the regular expression engine).

A second use of groups and references is to use the reference within the search pattern. For example, to match any HTML tag that is closed properly, one can use the expression <(.*?)>.*?</\1>. Here, the first group, defined by (.*?), is referenced in the closing tag.

  grepl("<(.*?)>.*?</\1>", c("<li> hello","<em>hello world</em>"))
  ## [1] FALSE  TRUE

Finally, we note that some popular regular expression formats like the Perl Compatible Regular Expressions PCRE (2015) support named groups and back references. In R, support for Perl-like syntax is available through the perl=TRUE option.

  grepl("<(?P<tag>.*?)>.*?</(?P=tag)>"
    , c("<li> hello","<em>hello world</em>")
    , perl=TRUE)
  ## [1] FALSE  TRUE

Here, we use the syntax (?P<[groupname]>[pattern]) to define a named group. The syntax (?P=[groupname]) is used for the back reference. Support for named groups is somewhat limited, since group names cannot be used in the replacement string of (g)sub, even when using the perl=TRUE switch.

Since any modern (read: extended) regular expression engine interprets a bracketed expression as a group to be captured, one might wonder if this capturing may be switched off. This is indeed the case, and the syntax for that is (?:[pattern]). One use of this is to avoid adding to the back-reference counter for groups that are not reused.
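
A minimal sketch illustrating that a non-capturing group does not advance the back-reference counter (here with perl=TRUE).

  # (?:a) is not captured, so \1 refers to the group (b)
  grepl("(?:a)(b)\\1", c("abb", "aba"), perl=TRUE)
  ## [1]  TRUE FALSE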

A Few Tips on Building and Maintaining Regular Expressions

Regular expressions are powerful tools that allow for fast text processing in a compactly defined way. In practice, regular expressions have the property of being fairly easy to build up, but they are typically also hard to decipher and debug.

When building regular expressions, it is a good idea to work step by step: setting up a simple expression, trying it out, checking the cases you missed, and expanding it further. When expressions get too long to read comfortably, it may be a good idea to split your job in two, applying two filters consecutively. In any case, it is a good idea to spend a few lines of comments explaining what you are trying to accomplish with a regular expression.

For many well-known patterns, such as zip codes, e-mail addresses, or social security numbers, regular expressions may well already be around. There are many websites that list regular expression solutions to common pattern matching problems. We strongly advise testing regular expressions against examples that should, and should not, match.

Finally, we emphasize that regular expressions should not be regarded as the answer to all text recognition problems. It was noted before that regular expressions have their limitations. For example, notwithstanding the example we have used, validating all possible e-mail addresses conforming to the standard that defines valid notation is notoriously hard to do with regular expressions (see, e.g., Friedl (2006)). Another example would be a regular expression that recognizes valid IPv4 addresses. Such addresses have the form a.b.c.d, where a, b, c, and d are numbers between 0 and 255. Since the allowed values of the second and third digits of each number depend on the value of the first (099 is valid but 299 is not), regular expressions are not a logical choice. As an alternative, one could split the string along the periods, convert each substring to numbers, and check their ranges.
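
A sketch of this alternative approach; the helper name is_valid_ipv4 is ours and not part of any package.

  is_valid_ipv4 <- function(x) {
    # split on the literal period, convert to integer, check the ranges
    parts <- strsplit(x, ".", fixed = TRUE)
    vapply(parts, function(p) {
      n <- suppressWarnings(as.integer(p))
      length(n) == 4 && !any(is.na(n)) && all(n >= 0 & n <= 255)
    }, logical(1))
  }
  is_valid_ipv4(c("192.168.0.1", "192.168.0.299"))
  ## [1]  TRUE FALSE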

Exercises for Section 5.2.2

5.2.3 Generating Regular Expressions in R

The rex package of Ushey and Hester (2014) offers tools to generate regular expressions from a human readable syntax. The idea is to replace regular expression operators and character ranges by readable functions or keywords from which the actual regular expression is then derived. For example, the regular expression (a|b)c* can also be generated with the following statement.

  rex::rex( "a" %or% "b", zero_or_more("c"))
  ## (?:a|b)(?:c)*

Observe that rex by default generates non-capturing groups (denoted by (?:[regex])).

The rex function takes a variable sequence of arguments, where each subsequent argument describes a new part of the pattern to be matched. Such a structure makes it much easier to alter and debug a regular expression. Moreover, since R expressions may be spread over several lines, it is much easier to comment separate parts of a pattern. The output is an object of class regex that holds a (generalized) regular expression that can be used with any function accepting regular expressions, such as sub or grep. A regex object is like a string literal, but it differs in how it is printed to screen. Most notably, in regex objects, the extra backslashes required by R are not printed to screen.

  # look for . or ,:
  r <- rex::rex("." %or% ",")
  # the double backslash necessary in R (\\.) is not printed.
  r
  ## (?:\.|,)
  # however, using str, we see that it is actually there
  str(r)
  ## Class 'regex'  chr "(?:\\.|,)"

Let us look at a more extended example. Suppose we want to extract numbers from strings. The numbers may be integers, they may be written in floating point notation with either period or comma separator (e.g., 34.05 or 34,05), or in scientific notation, with E or e separating the mantissa from the exponent. With rex, we may specify this as follows. We only look for numbers that are surrounded by blanks.

  r <- rex::rex(blank
   , one_or_more(digit)
   , maybe( "." %or% ",", one_or_more(digit))
   , maybe( one_of("e","E"), one_or_more(digit))
   , blank
  )

This statement is so obviously readable that it is hardly necessary to provide comment. Just pronouncing the arguments of rex, replacing every comma by ‘followed by’, gives a good impression of the pattern being matched. Compare this with the regular expression that is produced (we break it into two lines here for presentation).

  [[:blank:]](?:[[:digit:]])+(?:(?:\.|,)
    (?:[[:digit:]])+)?(?:[eE](?:[[:digit:]])+)?[[:blank:]]

This example also shows a few of the operators offered by rex. The function maybe replaces the ? operator, the comma stands for concatenation, one_of stands for a bracketed range [ ], and one_or_more stands for the + operator. The full list of range functions can be found by typing

  ?rex::character_class

at the command line. We already saw the %or% operator. The following operators are supported as well.

%if_next_is% %if_prev_is%
%if_next_isnt% %if_prev_isnt%

Again, the names are so obvious that they hardly require an explanation. These operators are actually part of the generalized regular expression syntax and have no equivalent operator in the regular expressions discussed until now.

Standard character ranges like those in Table 5.1 are supported as well. The complete list is quite extensive and can be found by typing

  ?rex::shortcuts

at the command line. In this example, we used blank ([:blank:]) and digit ([:digit:]).

The string literals provided to rex are interpreted as is. It is also possible to reuse previously defined regular expressions by passing them as arguments to rex. For example, the extended example above could also have been written as follows:

  nr <- rex::rex(one_or_more(digit))
  Ee <- rex::rex(one_of("E","e"))
  r <- rex::rex(blank, nr, maybe("." %or% "," , nr), maybe(Ee,nr), blank)
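
A short sketch of the generated pattern in use; we pass perl=TRUE since the generated expression contains (?: ) groups.

  grepl(r, " the value was 12.5e3 today ", perl=TRUE)
  ## [1] TRUE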

5.3 Common String Processing Tasks in R

Among the most basic tasks in text string processing are pattern matching (detection of substrings) and localization, pattern replacement, pattern extraction, and string splitting. All of these are commonly performed with tools based on regular expressions or extensions thereof.

The traditional text processing tools in R are functions like (g)sub, grep(l), substr, regexpr, and strsplit. With these functions combined, one can perform any of the basic tasks mentioned. However, the more recently published stringi package of Gagolewski and Tartanus (2014) makes many tasks, especially string extraction, a lot easier. The functionality of the package is richer than base R's and it uses a clear naming convention. Moreover, since it is based on the ICU library (ICU, 2014), it guarantees platform-independent results. At the time of writing, stringi's popularity is increasing rapidly. For example, the latest versions of text processing tools like stringr (Wickham, 2010) and qdapRegex (Rinker, 2014) have switched to using stringi as backend for their higher level functions. Table 5.3 compares a selection of basic text processing functions from base R and stringi.

Table 5.3 A selection of basic text processing functions in base R and in stringi.

Base R stringi Description
sub stri_replace_first Substitute the first matching instance of a pattern
stri_replace_last Substitute the last matching instance of a pattern
gsub stri_replace_all Substitute all matching instances of a pattern
grep Match strings against a pattern; returns index in vector
grepl stri_detect Match strings against a pattern; returns a logical vector
regexpr Match strings against pattern; returns index in vector and location in strings
stri_match Match strings against a pattern; returns the first matched pattern. Also stri_match_all, _first, _last
stri_locate Match strings against a pattern; returns location of first match in string. Also stri_locate_all, _first, _last
substr Extract substring from start to stop index
stri_extract Extract first occurrence of a pattern from a string. Also stri_extract_all, _first, _last
strsplit stri_split Split a string into substrings, using a pattern to define the separators
stri_reverse Reverse a string

Replacing or Removing Substrings

The qdapRegex package exposes many functions (all starting with rm_) that allow one to remove common patterns. For example, rm_number removes numeric values from strings. It recognizes both integers and numbers with decimal separators.

  qdapRegex::rm_number("Sometimes 12,5 is denoted 12.5 but we may
  round it to 13")
  ## [1] "Sometimes 12,5 is denoted but we may round it to"

Scientific notation, for example, 1.25E1 is not recognized at the time of writing. The package contains similar functions for removal of e-mail addresses, urls, several date formats, twitter tags, and more.
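
For instance, a sketch using rm_url (assuming that function name; see the package documentation for the complete list of rm_ functions).

  qdapRegex::rm_url("see http://www.example.com for details")
  ## [1] "see for details"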

The tm package includes the function removePunctuation, which removes occurrences of characters defined in the [:punct:] range (see Table 5.1). It has an option to prevent intra-word dashes from being removed.

  x <- "'Is that an intra-word dash?', she asked"
  tm::removePunctuation(x)
  ## [1] "Is that an intraword dash she asked"
  tm::removePunctuation(x, preserve_intra_word_dashes=TRUE)
  ## [1] "Is that an intra-word dash she asked"

The tm package is focused on analyses of texts, which in this case means that many functions are overloaded to work on both character data and tm-defined objects that store text with some metadata.

The Regular Expression Interface of stringr

The stringr package has a utility function for removing blanks at the start and/or end of a string.

  library(stringr)
  str_trim("  hello world ", side='both')
  ## [1] "hello world"

The side argument controls whether leading and/or trailing white spaces are removed.
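
For example:

  str_trim("  hello world ", side='left')
  ## [1] "hello world "
  str_trim("  hello world ", side='right')
  ## [1] "  hello world"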

There are many options to replace substrings in R. The stringr function str_replace differs from R's default sub function in that it accepts a vector of regular expressions and a vector of replacements, where the shorter arguments are recycled to match the longest.

  pat <- c("[[:digit:]]+","[∧[:digit:]]+")
  str_replace("Hello 12 34", pattern = pat, replacement = "")
  ## [1] "Hello  34" "12 34"

Here, we apply two regular expressions to the (single) input string and replace the matched patterns with the empty string (hence removing the matched substrings). In the first case, we remove the first sequence of one or more digits; in the second, we remove the first sequence of one or more non-digit characters. This function is a handy replacement for sub: it only replaces the first match. The package offers a similar replacement for gsub, which is called stringr::str_replace_all. For example, to collapse every run of two or more blanks into a single space, we could do the following:

  pat <- "[[:blank:]]{2,}" # two or more blank characters
  rep <- " "               # replace with single space
  str_replace_all("Hello   many    spaces", pattern = pat, replace = " ")
  ## [1] "Hello many spaces"

Their consistent naming and vectorization over input strings, patterns, and replacements make stringr's functions more readable and versatile than base R functions. Moreover, since they always have the data (here, the string) as the first input, it is possible to combine them in statements that include the %>% pipe operator of the magrittr package.
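
A small sketch of such a pipeline, chaining the replacement and trimming steps shown above (assuming magrittr is installed).

  library(magrittr)
  "  Hello   many    spaces " %>%
    str_replace_all("[[:blank:]]{2,}", " ") %>%
    str_trim()
  ## [1] "Hello many spaces"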

Like in base R, stringr allows one to escape the interpretation of a pattern as regular expression and just interpret it as is. This is done with the fixed function. Compare the following example with the example on page 94.

  x <- c("a.b", "aHb", "a.b.*")
  stringr::str_detect(x, "a.b.*")
  ## [1] TRUE TRUE TRUE
  stringr::str_detect(x, stringr::fixed("a.b.*"))
  ## [1] FALSE FALSE  TRUE

Here, we use the str_detect function (a replacement for grepl), but any function in the stringr package that accepts a regular expression will obey the fixed directive. The fixed function also allows for case-insensitive matching.

  x <- c("a.b", "aHb", "A.B.*")
  str_detect(x, fixed("a.b.*"))
  ## [1] FALSE FALSE FALSE
  str_detect(x, fixed("a.b.*", ignore_case=TRUE))
  ## [1] FALSE FALSE  TRUE

The fixed function causes its argument to be compared byte by byte, ignoring possible multibyte encoding. Since case-insensitive matching depends on case folding rules, which are locale-dependent, the fixed option may not always return the expected result. If you are sure that your matching pattern is in the same encoding as the string you are searching in, this is a safe option. The coll function makes sure that collation rules defined by the current or custom-defined locale are used. The following example from the documentation of stringr illustrates this clearly.

  i <- c("I", "u0130", "i")
  i
  ## [1] "I" "İ" "i"
  str_detect(i, fixed("i",ignore_case=TRUE))
  ## [1]  TRUE FALSE  TRUE
  str_detect(i, coll("i", ignore_case=TRUE))
  ## [1]  TRUE FALSE  TRUE
  str_detect(i, coll("i", ignore_case=TRUE, locale = "tr"))
  ## [1] FALSE  TRUE  TRUE

Here, the Turkish locale ("tr") forces a different case folding, mapping "I" to the dotless "ı". This can also be made visible with stringr's case folding functions.

  str_to_lower(i)
  ## [1] "i"  "i̇" "i"
  str_to_lower(i, locale="tr")
  ## [1] "ı" "i" "i"

Versatile Substitutions with gsubfn

A seemingly simple, but very powerful, generalization of gsub is implemented by the gsubfn package. It allows replacements of matched patterns by a (user-defined) function of those patterns. For example, in the following statement, the numbers in the strings "(1,2)" and "x-23-y" are multiplied by two.

  gsubfn::gsubfn("[0-9]+", function(x) 2*as.numeric(x), c("(1,2)","x-23-y"))
  ## [1] "(2,4)"  "x-46-y"

Note that the result of the user-defined function is automatically converted to character.

Extracting or Splitting Substrings

The most convenient way of substring extraction in R is offered by the stringr package. Functions str_extract and str_extract_all return the first, respectively all, matches of a regular expression.

  x <- c("date: 11-11-2011", "Either 10-01-2001 or 07-03-2012")
  stringr::str_extract(x,"[0-9]{2}-[0-9]{2}-[0-9]{2}")
  ## [1] "11-11-20" "10-01-20"
  stringr::str_extract_all(x,"[0-9]{2}-[0-9]{2}-[0-9]{2}")
  ## [[1]]
  ## [1] "11-11-20"
  ##
  ## [[2]]
  ## [1] "10-01-20" "07-03-20"

Observe that str_extract_all returns a list, with one element for each entry in x. A nice feature of str_extract_all is the simplify option, which returns the results as a character matrix instead.

  stringr::str_extract_all(x,"[0-9]{2}-[0-9]{2}-[0-9]{2}", simplify=TRUE)
  ##      [,1]       [,2]
  ## [1,] "11-11-20" ""
  ## [2,] "10-01-20" "07-03-20"

If it is known what positions in a string contain the information to extract, substr from base R can be used. For example, each IBAN account number starts with a two-letter (ISO 3166-1) country code which may be extracted.

  iban <- "NLABNA987654321"
  substr(iban, start = 1, stop = 2)
  ## [1] "NL"

It is possible to split a string at each match of a regular expression using str_split. The result is a vector with the original string cut in pieces and the matches left out.

  x <- c("10-10 2010","12.12-2012")
  stringr::str_split(x,"[-\. ]")
  ## [[1]]
  ## [1] "10"   "10"   "2010"
  ##
  ## [[2]]
  ## [1] "12"   "12"   "2012"

Again, the result is a list with one entry for each element in x. There is also a variant called str_split_fixed, which splits each string into a fixed number of pieces and returns the result as a character matrix.
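
A brief sketch, continuing with the x defined above.

  stringr::str_split_fixed(x, "[-\\. ]", n = 3)
  ##      [,1] [,2] [,3]
  ## [1,] "10" "10" "2010"
  ## [2,] "12" "12" "2012"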

5.4 Approximate Text Matching

Approximate or ‘fuzzy’ text matching is in some ways complementary to the techniques discussed in the previous sections. Where character normalization and pattern extraction and substitution techniques are focused on bringing text closer to some standardized form, approximate text matching methods allow one to leave some slack between text and a standardized form.

As an example we consider the task of matching some input data against a list of standardized categories.

  # standard categories
  codes <- c("male", "female")
  # vector of 'raw' input data
  gender <- c("M", "male ", "Female", "fem.","female")

Here, we wish to assign each value in gender to a category in codes. If the entries in gender were perfect, R's native match function could be used.

  # lookup of gender in the 'codes' lookup table
  match(x = gender, table = codes)
  ## [1] NA NA NA NA  2

We get a single match: the last element of gender matches the second element of codes. We can do a little better by cleaning up gender.

  # remove trailing whitespace, cast to lowercase
  gender_clean <- stringi::stri_replace_all(gender,"",
  regex="[[:blank:]]+$")
  gender_clean <- stringi::stri_trans_tolower(gender_clean)
  gender_clean
  ## [1] "m"      "male"   "female" "fem."   "female"
  match(x = gender_clean, table = codes)
  ## [1] NA  1  2 NA  2

We now get three exact matches. The amatch function of the stringdist package (van der Loo, 2014) also matches strings, but it allows for a certain amount of slack between the data to match and entries in the lookup table.

  stringdist::amatch(x = gender, table = codes, maxDist = 3)
  ## [1] NA  1  2  2  2
  stringdist::amatch(x = gender_clean, table = codes, maxDist = 3)
  ## [1] 1 1 2 2 2

For the raw data, four matches are found, while for the cleaned data we find a match for each data item in gender. The results are summarized in Figure 5.1.


Figure 5.1 Results of exact or inexact matching against a lookup table with raw and normalized text data. The combination of normalization and approximate matching works best.

In the call to amatch, we specified a maximum distance (maxDist) of three. In this case it means that for two elements to match, the optimal string alignment distance between them may be at most 3. Roughly, this distance counts the number of allowed operations necessary to turn one string into another. The four operations include substitution, deletion, or insertion of a single character, and transposition of neighboring characters. For example, "fem." can be turned into "female" by substituting "a" for "." and inserting "l" and "e", giving a total distance of three.
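
This can be verified directly with the stringdist function, whose default method is the optimal string alignment distance.

  stringdist::stringdist("fem.", "female")
  ## [1] 3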

If no list of fixed codes is available, one may attempt to cluster strings with any method that accepts a distance matrix. For example, with R's hierarchical clustering implementation we can do the following:

  # some 'raw' data to be clustered.
  subject <- c("mathematics", "physics",
       "astrophysics", "math", "numerical math")
  # compute distance matrix, adding strings as row/column names
  M <- stringdist::stringdistmatrix(subject, subject, useNames=TRUE)
  # Convert to 'dist' object and cluster hierarchically
  h <- hclust(as.dist(M), method='single')

The resulting clustering is shown in Figure 5.2. In this case, the topics are clustered more or less as one would expect. The clustering is, of course, heavily dependent on both the chosen distance metric and the clustering method. For example, the same method gives unsatisfactory results for the gender example we introduced earlier: the category names "male" and "female" are closer to each other than some of the polluted data items are to their correct category.


Figure 5.2 Hierarchical clustering (‘single’ method) of five strings based on the optimal string alignment distance matrix.

5.4.1 String Metrics

The purpose of a string metric is to introduce a sense of distance between elements of Σ*. Formally, a distance function on Σ* is defined as a function that assigns to each pair of strings a real number,

5.7   d : Σ* × Σ* → ℝ.

Furthermore, a formal distance function satisfies the following rules:

5.8a   d(s, t) ≥ 0
5.8b   d(s, t) = 0 if and only if s = t
5.8c   d(s, t) = d(t, s)
5.8d   d(s, u) ≤ d(s, t) + d(t, u)

These rules are easily intuitively understood: the first rule states that distances cannot be negative; the second rule states that the distance between two strings is zero if and only if the two strings are the same; and the third rule states that taking a walk through Σ* from s to t is as far as going from t to s. The fourth rule says that the value of the distance function is the value of the shortest route between two strings. Going from s to u via a third string t may not be shorter than going from s to u directly.

Over the years, many ‘string metrics’ have been developed. Perhaps surprisingly, not all of them fulfill the natural conditions laid down in Eq. (5.8). For example, in the following sections we show that q-gram-based distances do not always obey the identity property. That is, two strings may have distance zero between them while being different. Obviously, this means one needs to take care when choosing a distance function, for example, when clustering strings. Indeed, some of the ‘metrics’ we are going to discuss are not metrics in the mathematical sense. However, since it is common in the literature to talk about string distances or string metrics regardless of which formal rules are satisfied, we shall apply the same convention here.

The principles from which metrics are derived vary strongly. We generally distinguish between edit-based distances, q-gram-based distances, heuristic distances, kernel-based distances, and distances based on phonetic algorithms. Here, we discuss these distances, pointing out their most important properties, including which, if any, of the rules in Eq. (5.8) are violated. There are more string distance measures than can be discussed in this book; since this chapter is concerned with text cleaning, we concentrate on metrics that are commonly used in that area.

Edit-Based Distances

These distances are based on determining a sequence of basic transformations necessary to turn one string into another. The set of basic operations (‘edits’) includes one or more of the following:

  • Deletion of a single character, for example, "abc" → "ac";
  • Insertion of a single character, for example, "ac" → "abc";
  • Substitution of a single character, for example, "abc" → "abd";
  • Transposition of two adjacent characters, for example, "abc" → "acb".

A distance function is defined by determining the shortest sequence of basic operations that turns one string into another and taking the length of that sequence. Alternatively, one may assign a weight to each type of transformation and determine the sequence of least total weight. Such distances are referred to as generalized distances (Boytsov, 2011).

First note that when the set of basic operations is chosen too limited, not every pair of strings from Σ* is connected by a sequence of allowed operations. Formally, such functions are not total functions on Σ*. We will follow the same convention as Navarro (2001) and define the distance between such unconnected pairs to be infinite. The advantage of this convention is that numerical comparisons are then valid between all computed distances.

It can be shown that unweighted edit-based distances all obey the properties of a proper distance (Eq. 5.8), see, for example, Boytsov (2011). When different weights are assigned to insertion and deletion, distances become asymmetric: walking from "abc" to "ac" requires a deletion, while walking from "ac" to "abc" requires an insertion. Distance functions with weighted insertion or deletion are symmetric under simultaneous swapping of both the input strings and the deletion and insertion weights.

Table 5.4 Edit-based string distances

Distance                       Operation
                               del   ins   sub   trn
Hamming                         –     –     ✓     –
Longest common subsequence      ✓     ✓     –     –
Levenshtein                     ✓     ✓     ✓     –
Optimal string alignment        ✓     ✓     ✓     ✓a
Damerau–Levenshtein             ✓     ✓     ✓     ✓

a Each character can take part in maximally a single transposition.

Table 5.4 gives an overview of commonly used edit-based string distances and their allowed operations. The distance of Hamming (1950) is the simplest and allows only for substitution of characters. Hence, it is finite only when computed between strings of the same length. For example, we have c05-math-151 and c05-math-152. Since there is only one allowed operation, there is no use in defining weights. Since it is defined only on strings of equal lengths, in the context of text processing, the Hamming distance is most useful for standardized strings, such as phone numbers, product codes, or other identifying numbers.
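As a quick illustration of this restriction, consider the following hypothetical product codes; the calls are a sketch assuming the stringdist package.

  # two positions differ, so the Hamming distance is 2
  stringdist::stringdist("A1234", "A1243", method = "hamming")
  # strings of unequal length are at infinite Hamming distance
  stringdist::stringdist("A1234", "A12345", method = "hamming")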

The longest common subsequence distance (lcs distance) of Needleman and Wunsch (1970) is the simplest generalization of the Hamming distance: it replaces the substitution operation by allowing insertion and deletion of characters. This means that the distance function is a complete function on Σ*. In fact, it is easy to see that, given two strings s and t, their distance is largest when they have no characters in common. In that case, the distance from s to t is simply |s| + |t|, the cost of deleting one by one all characters from s and then inserting all characters of t. As an example, consider the lcs distance between "fu" and "foo". Using only deletion and insertion, it takes three steps to go from one to the other.

"fu" → "f" → "fo" → "foo"

This also shows that there is not necessarily a unique shortest path from the start point to the end. Unlike in Euclidean geometry, for example, we could have chosen a different path through Σ* (e.g., "fu" → "fou" → "foou" → "foo") and obtained a path of the same length.

As implied by its name, the lcs distance has a second interpretation. Given two strings, we pair common characters, passing through the strings from left to right but without changing their order. The distance between the two strings is then given by the number of unpaired characters in both strings. The diagram given here demonstrates this process.

[Diagram: pairing of the common characters of two strings; the unpaired characters determine the lcs distance.]

Indeed, there are three unpaired characters in the two strings. Note that a subsequence is not the same as a substring. If s is a string, a substring is a consecutive sequence of characters s_i s_{i+1} ⋯ s_j. A subsequence is a sequence of characters s_{i_1} s_{i_2} ⋯ s_{i_k} that preserves order (i_1 < i_2 < ⋯ < i_k) but need not be consecutive. Hence, "fbar" is a subsequence of both "fubar" and "foobar" but not a substring of either. Applications of the lcs distance lie mainly outside of data cleaning, although it can be interpreted as a crude (and fast) approach to typo detection. In principle, it is possible to define a generalized lcs distance by assigning different weights to insertion and deletion, but this does not seem very common in practice.
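The lcs distance of the "fu"/"foo" example can be reproduced as follows; this is a small check assuming the stringdist package.

  # only the "f"s can be paired; "u", "o", and "o" remain unpaired,
  # so the lcs distance should be 3
  stringdist::stringdist("fu", "foo", method = "lcs")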

The Levenshtein distance (Levenshtein, 1966) is like the longest common subsequence distance, but it is more flexible since it also allows for direct substitution of a character, which would take two steps in the lcs distance. This extra flexibility leads to shorter minimal distances. For example, the Levenshtein distance between "fu" and "foo" equals two:

5.9   "fu" → "fo" → "foo"

The generalized Levenshtein distance satisfies the properties of Eq. (5.8) except when the weights for insertion and deletion are different. In that case, the distance is no longer symmetric, which is easily illustrated with an example. Suppose we weigh an insertion with 0.1 and deletion and substitution with 1. We have (denoting the cost of each step in parentheses)

"fu" →(1) "fo" →(0.1) "foo"

giving a distance of 1.1. Going the other way around, however, we get

"foo" →(1) "fo" →(1) "fu"

giving a distance of 2.
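The asymmetry can be checked with base R's adist, which computes the generalized Levenshtein distance; the weights below mirror the example above (insertions 0.1, deletions and substitutions 1), and the expected values are given as comments.

  # "fu" -> "foo": one substitution (1) and one insertion (0.1), so 1.1
  adist("fu", "foo", costs = c(insertions = 0.1, deletions = 1, substitutions = 1))
  # "foo" -> "fu": one deletion (1) and one substitution (1), so 2
  adist("foo", "fu", costs = c(insertions = 0.1, deletions = 1, substitutions = 1))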

The optimal string alignment distance, which is also referred to as the restricted Damerau–Levenshtein distance, allows for insertions, deletions, substitutions, and limited transpositions of adjacent characters. The limitation is that each substring may be edited only once.

The most important consequence is that the optimal string alignment distance does not satisfy the triangle inequality. The following counterexample is taken from Boytsov (2011). If we take s = "ab", t = "ba", and u = "bca", then the distances d(s, t) and d(t, u) are realized by

"ab" →(trn) "ba" →(ins) "bca"

giving a total distance of 2. The intuitive direct path from "ab" to "bca" would consist of the same two steps,

5.10   "ab" → "ba" → "bca"

However, under the optimal string alignment distance this path is not allowed, since it edits the substring "ba" twice: once by transposing "a" and "b" and once by inserting a "c" between "b" and "a". In fact, it turns out that there is no way to go from "ab" to "bca" in fewer than three allowed steps, and so one has to resort to a sequence such as

"ab" →(del) "b" →(ins) "bc" →(ins) "bca"

The optimal string alignment distance between a string variable and a known fixed string may be interpreted as the number of typing errors made in the string variable.

The Damerau–Levenshtein distance (Lowrance and Wagner, 1975) does allow for arbitrary transpositions of characters. That is, subsequences may be edited more than once, allowing the path of Eq. (5.10).
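The counterexample, with the strings as reconstructed above, can be checked numerically; the comments give the values one should expect under the optimal string alignment ("osa") and full Damerau–Levenshtein ("dl") methods of the stringdist package.

  stringdist::stringdist("ab", "ba",  method = "osa")  # 1: one transposition
  stringdist::stringdist("ba", "bca", method = "osa")  # 1: one insertion
  stringdist::stringdist("ab", "bca", method = "osa")  # 3: triangle inequality violated
  stringdist::stringdist("ab", "bca", method = "dl")   # 2: arbitrary transpositions allowed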

The maximum distance between two strings s and t, as measured by any of the discussed distances that allow for character substitution, equals max(|s|, |t|). Distances that allow for a larger number of basic operations generally yield numerically smaller distances. The given diagram summarizes the relation between edit-based distances and their limits.

Note that the maxima may be used to scale distances between 0 (exactly equal) and 1 (completely different) should a method based on a string metric require it.

Finally, some notes on the computational time it takes to compute edit-based distances. The Hamming distance can obviously be computed in O(|s|) time. The other edit-based distances are usually implemented using dynamic programming techniques that run in O(|s| |t|) time. See Navarro (2001) or Boytsov (2011) for an overview.

The Jaro–Winkler Distance

The Jaro distance was originally developed at the US Bureau of the Census as part of an application called UNIMATCH. The purpose of this program was to match records based on inexact text fields, mostly name and address data. The first description appeared in the manual of this software (Jaro, 1978), which may be a reason why it is not widely described in the computer science literature. Jaro (1989), however, reports successful application of this distance for matching inexact name and address fields.

The Jaro distance is an expression based on the intuition that character mismatches and character transpositions are caused by human typing errors and that transpositions of characters that are further apart are less likely to occur. It can be expressed as follows:

5.12   d_jaro(s, t) = 1 − (1/3) [ w_1 m/|s| + w_2 m/|t| + w_3 (m − τ)/m ]

Here, m is the number of matching characters and τ the number of transpositions, counted in a way to be explained below, and the w_i are nonnegative weight parameters which are usually chosen to be 1. The distance ranges between 0 (s and t are identical) and 1 (complete mismatch). It is zero if both s and t are the empty string, and it equals 1 if there are no matches while at least one string is nonempty. Otherwise, the distance depends on a weighted sum of character matches and transpositions.

The number of matches is computed as follows. First, pair characters from s and t as you would for the longest common subsequence distance. Whether a pair is considered a match then depends on the distance between the positions of the paired characters. In particular, if the character at position i in s is paired with the character at position j in t, this is considered a match only when

5.13   |i − j| ≤ ⌊max(|s|, |t|)/2⌋ − 1

where ⌊·⌋ indicates the floor function and each character can be included in no more than one match. For example, for c05-math-193 and c05-math-194, the number of matches is c05-math-195.

The number of transpositions c05-math-196 is computed as follows. First take the subsequences c05-math-197 and c05-math-198 of c05-math-199 and c05-math-200 consisting only of the matching characters. Then c05-math-201 is the number of characters in c05-math-202 that have to change position to turn c05-math-203 into c05-math-204. For example, take c05-math-205 and c05-math-206, we have c05-math-207 and c05-math-208 and c05-math-209. To turn c05-math-210 into c05-math-211, we need two operations:

equation

The term transposition here is used differently from the definition in Section 5.4.1 where it is defined for transposition of two characters. Here, it should be interpreted as transposing a single character with a substring. For example, in the first step of the abovementioned example, we may interpret the relocation of "d" as swapping c05-math-212 with "cb".

Unfortunately, there seems to be some confusion in the literature over the precise definition of the Jaro distance. Cohen et al. (2003), for example, use c05-math-213 as an upper limit for c05-math-214. The definition used here, which is also implemented in the stringdist package, follows the original definition of Jaro (1978).

Winkler (1990) proposes a simple variant of the Jaro distance that rests on the observation that typing errors are less likely to occur at the beginning of a human-typed string than near the end. If differences occur at the beginning, the heuristic is that the strings are likely to be truly dissimilar. The Jaro–Winkler distance takes advantage of this phenomenon by multiplying the Jaro distance with a penalty factor that depends on the length of the common prefix of the two strings, considering at most the first four characters. The expression is as follows:

5.14   d_jw(s, t) = (1 − p ℓ) d_jaro(s, t)

Here, ℓ is the length of the longest common prefix of s and t, up to a maximum of 4. The value of p is a user-defined adjustable parameter, but it must be between 0 and 1/4 to ensure that the distance stays between 0 and 1. The value of p determines how strongly the length of the common prefix weighs into the total distance, and the Jaro distance emerges as a special case when p = 0. Both Winkler (1990) and Cohen et al. (2003) report good results in matching name–address data when setting p = 0.1.

As far as this author knows, the properties as described in Eq. (5.8) are not reported anywhere except in van der Loo (2014). They are however easily derived, so we will state them in the following observation.

q-Gram-Based Distances

The edit-based distances and the Jaro–Winkler distance described above are typically used for the detection of misspelling errors in fairly short strings. They are based on a precise description of what operations are allowed to turn one string into another and therefore have natural paths from s to t through Σ* associated with them.

Distances based on q-grams, on the other hand, are computed by first deriving an integer vector (the so-called q-gram profile) from the strings s and t and then defining a distance measure on the space of such vectors. Well-known examples include the Manhattan distance and the cosine distance.

A q-gram is a string of q characters. Given a string s with |s| ≥ q, the associated q-gram profile is a table counting the occurrences in s of all q-grams allowed by the alphabet. For example, the nonzero entries of the 2-gram profile of "banana" look like this:

"an": 2,   "ba": 1,   "na": 2
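q-gram profiles can be tabulated with the qgrams function of the stringdist package; the following sketch reproduces the 2-gram counts of "banana" (the order of the columns may differ).

  # count 2-grams: "an" and "na" occur twice, "ba" once
  stringdist::qgrams("banana", q = 2)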

In the literature on text classification, the term n-gram is often used instead of q-gram. Since that term is also used to indicate word sequences rather than character sequences, we follow the terminology of Ukkonen (1992) and use the term q-gram to avoid confusion.

We will denote the q-gram profile of a string s as a vector-valued function v_q(s). The range of v_q is the set of nonnegative integer vectors whose dimension equals the number of possible q-grams allowed by the alphabet. In general, v_q is not a complete function on Σ*. In particular, it is undefined when |s| < q, and we also need to decide what to do when q = 0. Here, we adopt the following conventions:

equation

Intuitively, this corresponds to the idea that all distances are zero as long as we are not comparing anything. In particular, it means that c05-math-279 for any q-gram-based distance.

The traditional q-gram distance is computed as the Manhattan distance between two q-gram profiles:

d_qgram(s, t) = Σ_i |v_q(s)_i − v_q(t)_i|

Observe that since the entries of v_q are counts rather than frequencies, this distance expresses both a difference in q-gram distribution and a difference in string length. It is not hard to show that the maximum q-gram distance between two strings s and t is equal to

(|s| − q + 1) + (|t| − q + 1)

To accommodate the effect of differences in string length, one can base the distance on the (cosine of the) angle between the q-gram vectors:

d_cos(s, t) = 1 − v_q(s) · v_q(t) / (‖v_q(s)‖ ‖v_q(t)‖)

This distance varies between 0 and 1, where 0 means complete similarity and 1 means complete dissimilarity (no overlap in occurring q-grams).

A third way to compare q-gram profiles is the Jaccard distance,

d_jaccard(s, t) = 1 − |Q(s) ∩ Q(t)| / |Q(s) ∪ Q(t)|

where Q(s) denotes the set of all q-grams occurring in s. The Jaccard distance scales from 0, indicating complete similarity, to 1, indicating complete dissimilarity.

The term ‘complete similarity’ is not the same as equality. We already saw the case of q = 0, which defines all strings as equal (there is nothing to make them differ when q = 0). The fact that unequal strings can have a q-gram-based distance of zero is not, however, limited to q = 0. For example, if q = 1, the table of q-gram counts is the same for two words that are anagrams of each other. Ukkonen (1992) has shown that for any value of q one can derive transformations on strings of a certain structure that leave the q-gram profile intact. For example, the reader can check that the strings "aFOObBARa" and "bBARaFOOb" have the same 2-gram profile.
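This can be verified with the stringdist package: the two strings above differ, yet their 2-gram distance is zero.

  # identical 2-gram profiles, so the q-gram distance should be 0
  stringdist::stringdist("aFOObBARa", "bBARaFOOb", method = "qgram", q = 2)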

We have seen that q-gram distances do not obey the identity property of distance metrics. The other properties depend on the metric chosen over the nonnegative integer vectors. The distances discussed here are all nonnegative and symmetric. Both the Jaccard and the q-gram distance also satisfy the triangle inequality. The cosine distance does not satisfy the triangle inequality. This is easily seen by taking three q-gram profiles x, y, and z in a common plane such that the angle between x and y and the angle between y and z both equal θ, so that the angle between x and z equals 2θ. We then have d_cos(x, y) = d_cos(y, z) = 1 − cos θ and d_cos(x, z) = 1 − cos 2θ. To show that the cosine distance does not satisfy the triangle inequality, we only need to find a θ for which 1 − cos 2θ > 2(1 − cos θ). This is the case for every θ between 0 and π/2.

q-gram profiles have been widely used as a tool for text classification. Applications and methodology have been reported where texts are classified according to the language used (Cavnar and Trenkle, 1994), the topics under discussion (Clement and Sharp, 2003), or author style and identity (Kešelj et al., 2003), to name but a few examples. In many cases, the use of bi- or trigrams seems an appropriate choice for natural language classification (Fürnkranz, 1998).

Distances Based on String Kernels

Just like q-gram-based methods, string kernels are interesting for text categorization or clustering purposes.

A string kernel is defined by a map φ that assigns to each string s in Σ* a real vector φ(s) with nonnegative coefficients. The normalized kernel function k(s, t) of two strings is defined as

k(s, t) = φ(s) · φ(t) / √[(φ(s) · φ(s)) (φ(t) · φ(t))]

where ‘·’ is the standard inner product on a real vector space. The right-hand side is the cosine of the angle between the feature vectors φ(s) and φ(t), and thus a measure of distance can be obtained as

d(s, t) = 1 − k(s, t)

and a suitable map φ thus defines a sense of distance on Σ*.

Several authors have defined string kernels and tested their effectiveness. Karatzoglou and Feinerer (2007) propose two string kernels. In the first kernel (simply called the string kernel), each element c05-math-327 of c05-math-328 counts the number of occurrences of subsequences in c05-math-329.

equation

The second kernel, called the full string kernel, is defined by concatenating the feature vectors defined by every subsequence up to length c05-math-330.

equation

Lodhi et al. (2002) propose a string kernel that penalizes the contribution of each subsequence u of s by the number of positions traversed when going from the first to the last character of its occurrence in s:

5.15   φ_u(s) = Σ_{i : s[i] = u} λ^{l(i)}

where the sum is over every occurrence of the subsequence u in s, and l(i) is the length of the substring of s spanned by the index vector i: l(i) = i_{|u|} − i_1 + 1.

Phonetic Algorithms

Phonetic algorithms aim to match strings that sound similar when they are pronounced. Phonetic algorithms transform a string to some kind of phonetic string that represents pronunciation features. The distance between two strings is then defined as a distance between the phonetic strings. In the simplest case, the distance is 0 when two phonetic strings are equal and 1 otherwise.

One of the most famous phonetic algorithms, called soundex, was patented for the first time in the early twentieth century (Russell, 1918, 1922). It is aimed at names written in the ASCII alphabet and pronounced in English. One widely used version of the algorithm is published by the United States National Archives and Records Administration (NARA, 2015).

The soundex algorithm encodes a string to a phonetic string consisting of a letter and three numbers. The NARA version of the algorithm can be summarized as follows (implementations differ in some details, such as the treatment of letters separated by "h" or "w"):

  1. Retain the first letter of the string.
  2. From the remaining letters, remove all occurrences of "a", "e", "i", "o", "u", "y", "h", and "w".
  3. Replace the remaining letters with digits: "b", "f", "p", and "v" become 1; "c", "g", "j", "k", "q", "s", "x", and "z" become 2; "d" and "t" become 3; "l" becomes 4; "m" and "n" become 5; and "r" becomes 6.
  4. When two or more adjacent letters would receive the same digit, retain only a single digit.
  5. Append the digits to the retained first letter and truncate or pad with zeros so that the result consists of one letter followed by three digits.

For example, the names "van der loo" and "de jonge" map to "V536" and "D252", respectively (after removing spaces). Phonetically similar names map to similar results. For example, "john" and "joan" both map to "J500", and similarly "ashley" and "ashly" map to "A240".
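These codes can be reproduced with the phonetic function of the stringdist package (discussed further in Section 5.4.2); as in the example above, the spaces are removed first.

  # soundex codes; expected: "V536" "D252" "J500" "J500" "A240" "A240"
  stringdist::phonetic(c("vanderloo", "dejonge", "john", "joan", "ashley", "ashly"))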

The soundex algorithm has been adapted and expanded over the years by many researchers and practitioners. Like the original algorithm, such extensions are often strongly tied to a particular purpose. For example, the Metaphone algorithms by Philips (2000) started as an improvement on the soundex algorithm for better matching of English names, but they have been extended to other languages, including Spanish and Brazilian Portuguese. Holmes and McCabe (2002) focus on the original purpose of English name matching, using a mix of different phonetic techniques. Mokotov (1997), and later Beider and Morse (2008), developed alternatives aimed at matching Eastern European and Ashkenazic Jewish surnames, respectively. Finally, Pinto et al. (2012) report on a soundex-like algorithm for strings typically occurring in short text messages (sms).

5.4.2 String Metrics and Approximate Text Matching in R

Base R comes with a few basic capabilities for approximate text matching. The function adist (of the standard utils package) computes the generalized Levenshtein distance. By default, the weights for insertion, deletion, and substitution are all equal to one, but this may be changed with the costs argument.

  adist("fu","foo")
  ##      [,1]
  ## [1,]    2
  adist("fu","foo", cost=c(i=0.1, d=1, s=1))
  ##      [,1]
  ## [1,]  1.1

The elements of costs must be named so that they (partially) match "insertions", "deletions", and "substitutions". If adist is provided with vector(s) of strings, the matrix of distances between the strings is computed.

  adist(c("hello","world"),c("olá mundo"))
  ##      [,1]
  ## [1,]    8
  ## [2,]    8

The agrep function is based on the work of Laurikari (2001) who combines the idea of approximate matching with regular expression search. Recall that a regular expression defines a subset of all possible strings that can be built up out of a certain alphabet. The function grep takes a regular expression and a string x and determines whether any substring of x is in the set defined by the regular expression. The function agrep is even more flexible and searches whether there is a substring in x that is maximally some Levenshtein distance away from an element in the set defined by the regular expression.

  # a pattern to search for
  re <- "abc"
  # two strings to search in
  x <- c("FOOabcBAR","FOOaXcBAR")
  # grep finds only one match (in the first element)
  grep(pattern=re, x=x)
  ## [1] 1
  # agrep finds both
  agrep(pattern=re, x=x, max.distance=1)
  ## [1] 1 2

In this simple example, the set of strings matched by the regular expression consists only of "abc". Thus, grep returns 1, the index into x where a match is found. The argument max.distance tells agrep to search for strings that are at most Levenshtein distance 1 removed from the substrings defined by the pattern "abc". Therefore, agrep also matches the substring "aXc" since it is a single substitution away from "abc".

The concept behind agrep is extremely powerful and elegant. However, do note that it can be hard to visualize what strings are defined by a regular expression. Combining complex regular expressions with the flexibility of agrep may yield unexpected matches. So, just like when working with grep and related tools, it is strongly recommended to build up regular expressions piece by piece, and to create test sets of strings with cases that you expect to match (or not).

Beyond the Levenshtein Distance with stringdist

The stringdist package offers a number of string metrics, which makes it easy to try and test several distance measures for your application. Metrics can be computed using the stringdist function, whose default method is the optimal string alignment distance.

  library(stringdist)
  stringdist("hello",c("hallo","wereld"))
  ## [1] 1 4

Here, the shortest argument ("hello") is recycled and the result is a vector of length 2, measuring the distance between "hello" and "hallo" and between "hello" and "wereld". So unlike adist, this function does not return a matrix. A matrix of distances can be computed with the stringdistmatrix function.

  stringdistmatrix("hello",c("hallo","wereld"), useNames="strings")
  ##       hallo wereld
  ## hello     1      4

To perform approximate matching, the amatch function can be used. Here, amatch is similar to R's match function. It is given some strings x and a lookup table table and returns an index in table for each element of x that can be matched. The difference is that amatch allows for a certain string distance.

  match("hello",c("hallo", "wereld"), nomatch=0)
  ## [1] 0
  amatch("hello",c("hallo", "wereld"), nomatch=0, maxDist=1)
  ## [1] 1

Here, match finds no match for "hello" in the vector c("hallo", "wereld"), while amatch matches "hello" with "hallo". The strings "hello" and "hallo" are one substitution away from each other, which is equal to the maximally allowed distance.

In R, there is a variant of match called %in%, which returns a logical vector that indicates which elements of x occur in the lookup table. The stringdist package provides an approximate matching equivalent called ain.

  "hello" %in% c("hallo", "wereld")
  ## [1] FALSE
  ain("hello",c("hallo", "wereld"), maxDist=1)
  ## [1] TRUE

Each of the stringdist functions discussed so far (stringdist, stringdistmatrix, amatch, and ain) has a method argument that allows one to switch between string metrics.

  stringdist("hello",c("hallo","wereld"),method="hamming")
  ## [1]   1 Inf
  amatch("hello",c("hallo", "wereld"), nomatch=0,method="hamming",
  maxDist=1)
  ## [1] 1

Table 5.5 gives an overview of the available distance metrics.

Table 5.5 Methods supported by the stringdist package

method Distance
"hamming" Hamming distance
"lcs" Longest common subsequence
"lv" Generalized Levenshtein
"dl" Generalized Damerau–Levenshtein
"osa" Generalized optimal string alignment (default)
"jw" Jaro, Jaro–Winkler distance
"qgram" c05-math-346-gram distance
"cosine" Cosine distance (based on q-gram profiles)
"jaccard" Jaccard distance (based onc05-math-347-gram profiles)
"soundex" Soundex distance

The term ‘generalized’ indicates that edit operations can be weighted.

Depending on the chosen distance, there are some options to set. The edit-based distances (Levenshtein, optimal string alignment, and Damerau–Levenshtein) are implemented in the generalized form, so weights for different edits may be specified in the order deletion, insertion, substitution, and transposition (if applicable).

  stringdist("hello","hallo", method="osa", weight=c(1,1,0.1,1))
  ## [1] 0.1

For the Jaro–Winkler distance, there are weights for matched characters in the first string, matched characters in the second string, and a weight for the contribution of the transpositions [see Eq. (5.12)]. These can be passed through the same weight argument. If method is set to "jw", Eq. (5.14) is used to compute the Jaro–Winkler distance, where the default value of Winkler's penalty factor p is set to 0. That is, the pure Jaro distance is computed by default. This can be controlled with the parameter p.

  # Compute the Jaro distance
  stringdist("marhta", "martha", method="jw")
  ## [1] 0.05555556
  # Compute the Jaro-Winkler distance with p=0.1
  stringdist("marhta", "martha", method="jw", p=0.1)
  ## [1] 0.03888889

Like the method parameter, these extra options are available for amatch, ain, and stringdistmatrix as well. So to do an approximate lookup using the Jaro–Winkler distance with p = 0.1, one simply does the following.

  # raw data
  x <- c("Stan Marhs","Kyle Brovlofski")
  # lookup table
  sp_char <- c("Kenny McCormick","Kyle Broflovski","Stan Marsh",
  "Eric Cartman")
  # find matches
  amatch(x=x, table=sp_char, method="jw", p=0.1, maxDist=0.2)
  ## [1] 3 2

For distances based on q-grams there is an optional parameter q, which by default is set to 1.

  amatch(x, table=sp_char, method="qgram", q=2, maxDist=5)
  ## [1] 3 2
  stringdistmatrix(x, sp_char, method="qgram", q=2, useNames="strings")
  ##               Kenny McCormick Kyle Broflovski Stan Marsh Eric Cartman
  ## Stan Marhs                   21             23          4           16
  ## Kyle Brovlofski              28              4         23           25

Finally, we mention that soundex codes can be computed with the phonetic function.

  phonetic(sp_char)
  ## [1] "K552" "K416" "S356" "E626"

In stringdist, the soundex distance between two strings is equal to 0 if they have the same soundex code and 1 otherwise.

  stringdistmatrix(x, sp_char, method="soundex", useNames="strings")
  ##               Kenny McCormick Kyle Broflovski Stan Marsh Eric Cartman
  ## Stan Marhs                    1              1          0            1
  ## Kyle Brovlofski               1              0          1            1

String Similarity

For some applications, it can be more convenient to think in terms of string similarity than string distance. For any given distance measure, a similarity measure can be defined, but the procedure to do so depends on the distance used. In particular, it depends on the range of the distance function, and we distinguish two cases.

The stringdist package implements a function that maps any string distance between two strings to a value in [0, 1], where 0 means complete dissimilarity and 1 means complete similarity. Note that for distances depending on q-grams, complete similarity does not necessarily mean equality since those distances do not satisfy the identity property of Eq. (5.8). If d(s, t) is the distance between two strings, the string similarity is defined as

5.16   sim(s, t) = 1 − d(s, t)/d_max

That is, the distance is divided by the maximum possible distance d_max between two strings of the same lengths as the input strings. How this maximum is computed depends on the distance used. In exceptional cases where the maximum equals 0 (e.g., q-gram distances and c05-math-356), the ratio in the abovementioned equation equals 0/0, which is defined as equal to one for this purpose. This definition includes the simple case where distances by definition yield values in [0, 1], such as the Jaro–Winkler distance. In that case, Eq. (5.16) reduces to 1 − d.

Computing string similarities is done with the stringsim function, which works similarly to the stringdist function.

  stringdist::stringsim("hello",c("hallo", "olá"), method="lcs")
  ## [1] 0.80 0.25

Here, the lcs distance between "hello" and "hallo" equals 2 since the characters "e" and "a" are unmatched. The maximum lcs distance for two strings of length 5 equals 10 [see Eq. (5.11)], so the string similarity equals 1 − 2/10 = 0.8.

The advantage of such a normalized string similarity over a raw string distance is that similarities can be compared more easily across different string metrics. The normalization by the maximum possible distance yields a more neutral degree of similarity.
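As an illustration, the sketch below computes the similarity between the same pair of strings under three different metrics; the values in the comments are approximate.

  stringdist::stringsim("hello", "hallo", method = "osa")           # 0.80
  stringdist::stringsim("hello", "hallo", method = "jw")            # about 0.87
  stringdist::stringsim("hello", "hallo", method = "qgram", q = 2)  # 0.50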

q-Gram-Based Distances in the textcat Package

The textcat package offers a number of q-gram-based distances that are especially interesting for text categorization purposes such as language detection. Following the method of Cavnar and Trenkle (1994), the profile computed for a string s is the concatenation of the profiles for q = 1 up to q = 5. The function textcat_xdist computes a distance matrix over a character vector based on these profiles.

  sp_char <- c("Kenny McCormick","Kyle Broflovski","Stan Marsh",
  "Eric Cartman")
  library(textcat)
  textcat_xdist(sp_char)
  ##      [,1] [,2] [,3] [,4]
  ## [1,]    0 1891  954 1286
  ## [2,] 1894    0  980 1349
  ## [3,]  910  934    0  509
  ## [4,] 1244 1334  557    0

Using the method argument, several distance functions between c05-math-366-gram profiles can be chosen. The default method is the out-of-place measure of Cavnar and Trenkle (1994), which is based on a simple rank statistic over the profiles.

textcat supports a number of distance measures on the Cavnar–Trenkle profiles, including measures based on the Kullback–Leibler divergence and the cosine distance.

Computing String Kernel Distances with kernlab

To compute a string kernel distance, one proceeds in two steps. The function stringdot creates a kernel function for subsequences of a given length.

  library(kernlab)
  sk <- stringdot()

Here, sk is a function that computes the cosine between two string kernel profiles. Since the cosine expresses a similarity, we can compute a distance between two strings as follows.

  1 - sk("kyle brovlofski", "kyle broflovski")
  ## [1] 0.5384615

The default values for λ and the subsequence length are 1.1 and 4, respectively. These parameters may be altered with the lambda and length parameters.

  sk3 <- stringdot(length=3,lambda=0.5)
  1 - sk3("kyle brovlofski", "kyle broflovski")
  ## [1] 0.4285714

Deciding on Matching Method and Parameters

The metric and parameter set to be used depend both on the application and on the type of strings being matched. Both q-gram-based metrics and string kernels have been widely researched in the context of categorization of text corpora. For example, Cavnar and Trenkle (1994), Huffman (1995), Huffman and Damashek (1995), Lodhi et al. (2002), and Karatzoglou and Feinerer (2007) all categorize texts based on combinations of q-gram profiles spanning multiple values of q.

In the context of data cleaning, the most relevant application is the lookup of polluted items in a list of known valid values. A closely related application is statistical matching, where approximate key matching is an important subtask. Several authors have assessed the accuracy of (combinations of) string metrics in the context of finding spelling mistakes (Zamora et al., 1981), matching bibliographic records (Lawrence et al., 1999), name matching (Cohen et al., 2003; Winkler, 1990), and toponym matching (Recchia and Louwerse, 2013; Sehgal et al., 2006). It is hard to point out a single method that is optimal for each case, let alone for all cases. For example, Cohen et al. (2003) report good results using the Jaro–Winkler distance (setting p = 0.1), while Sehgal et al. (2006) obtain better results using edit-based distances.

As a result, the choice of matching algorithm will be a practical trade-off between performance, necessary accuracy, and the amount of resources that can be invested in developing a matching methodology. There are however a few basic considerations.

If strings have an imposed structure such as a fixed length (think of phone numbers, IBAN bank account numbers, or credit card numbers), the Hamming distance may be of use. If not, (word) order, the supposed error mechanism, string length, and performance come into consideration.

For some strings, like name data or short product descriptions, the order of words may be of less relevance. For example, the strings "Farnsworth, Hubert J." and "Hubert J. Farnsworth" represent the same person. To compare such strings, a measure that is less dependent on (word) order is to be preferred. As an example, compare the cosine similarity (based on c05-math-374-gram profiles) with the Jaro–Winkler similarity.

  stringsim("Farnsworth, Hubert J.", "Hubert J. Farnsworth",
  method="cosine", q=2)
  ## [1] 0.88396
  stringsim("Farnsworth, Hubert J.", "Hubert J. Farnsworth",
  method="jw", p=0.1)
  ## [1] 0.465873

Clearly, the fact that q-gram profiles are to a large extent independent of the order in which the q-grams appear in the string results in a high similarity. The Jaro–Winkler distance, which is geared toward a certain type of typing mistakes, results in a much smaller similarity.

When there is reason to suspect that strings differ in only a few characters, an edit-based distance or the Jaro–Winkler distance is a good option. For edit-based distances, long strings may be prohibitive as the computational time increases with the product of the lengths of the strings compared. In Figure 5.3 we compare the computation time necessary to compute the lower triangular distance matrix of 400 randomly generated names with lengths ranging from 8 to 28 characters. The boxplots show timing distributions over 250 repetitions of the experiment, conducted with the stringdist package. One immediately recognizes why the Jaro–Winkler distance has become so popular: it is even faster than the fairly simple longest common subsequence distance. If a more accurate edit-based distance is necessary, one may choose the optimal string alignment distance. It is just a little slower than the Levenshtein method but much faster than the full Damerau–Levenshtein distance. The q-gram distance takes about 40% longer than the optimal string alignment distance in this benchmark, but for longer strings this disadvantage quickly disappears: the running time of q-gram-based distances depends on the sum of the input string lengths, which makes them ideal candidates for ploughing through long texts.


Figure 5.3 Benchmark of times to compute string distance measures in the stringdist package (smaller is faster). In this benchmark, the lower triangular part of a 400 × 400 distance matrix was computed 250 times using stringdist::stringdistmatrix. The durations were measured with the microbenchmark package. The strings consisted of randomly generated names of lengths varying between 8 and 28 characters.

Figure 5.4 shows a benchmark for string distance calculations between pairs of equally sized random strings of lengths ranging from 8 to 256 characters. At a length of 200 characters, the asymptotic computational complexity has been reached. The soundex distance is a little faster to compute than the Hamming distance. The full Damerau–Levenshtein distance is slowest in all cases, while the q-gram-based method becomes faster than the optimal string alignment, Levenshtein, and lcs distances somewhere between 30 and 50 characters. At about 175 characters, q-gram-based distances are computed faster than the Jaro–Winkler distance, which in turn is faster to compute than any edit-based distance for any string length. In general, accuracy is probably of more concern than performance; but if accuracy is of little or no concern, Figure 5.4 provides a guide to what distance measure to choose. The fact that the Jaro–Winkler distance is about 3–7 times faster than the optimal string alignment distance while providing acceptable accuracy probably explains its wide popularity in the record linkage community [see, e.g., Cohen et al. (2003)]. Finally, we note that the functionality of the stringdist package automatically uses multiple cores, so performance can be improved by simply switching to a larger machine.
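A minimal sketch of such a timing experiment is given below; it assumes the microbenchmark package mentioned in the figure captions and uses a single pair of random 100-character strings, so the numbers will differ from those shown in Figures 5.3 and 5.4.

  library(microbenchmark)
  set.seed(1)
  # two random strings of 100 characters each
  x <- paste(sample(letters, 100, replace = TRUE), collapse = "")
  y <- paste(sample(letters, 100, replace = TRUE), collapse = "")
  # compare timings of three distance methods
  microbenchmark(
    osa   = stringdist::stringdist(x, y, method = "osa"),
    jw    = stringdist::stringdist(x, y, method = "jw"),
    qgram = stringdist::stringdist(x, y, method = "qgram", q = 2),
    times = 100
  )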


Figure 5.4 Benchmarks of computing string distance measures in the stringdist package (smaller is faster). In this benchmark, c05-math-382 lower triangular distance matrices were computed between random strings of lengths varying from 8 to 256 characters. The lines represent median relative times over 10 repetitions. Results have been normalized to the maximum median time. Results for the soundex distance were not plotted as its curve nearly coincides with that of the Hamming distance.

Measuring Matching Accuracy

Suppose we have a list X of polluted strings and a lookup table Y. The task is to match as many polluted strings as possible correctly to elements in Y. We assume that for every element in X there should be a corresponding element in Y. This matching may be one-to-many, that is, multiple elements of X might match the same element of Y. For example, X can be a long list of possibly misspelled place names and Y a short list of official place names, with |X| ≥ |Y|. In general, a matching algorithm generates a subset of pairs M ⊆ X × Y. Suppose that the true match is given by the set T ⊆ X × Y. The precision of the approximate matching algorithm is defined as the fraction of elements in M that are correctly matched,

precision = |M ∩ T| / |M|

The recall is defined as the fraction of elements in T that are correctly matched:

recall = |M ∩ T| / |T|

Observe that both measures scale from 0 to 1, where 0 corresponds to not a single correct match (M ∩ T = ∅). A precision equal to 1 means that every found pair is also a true match (M ⊆ T), but not necessarily that every possible match has been found. A recall equal to 1 means that every true match is in M (T ⊆ M), but not necessarily that every pair in M represents a true match.

The precision and recall can usefully be combined into what is called the F_1 measure, which is defined as

5.17   F_1 = 2 · precision · recall / (precision + recall)

Here, we silently agree that F_1 = 0 if the precision and/or the recall equals zero. It is easy to see that F_1 = 1 if and only if both the precision and the recall equal 1, and that F_1 decreases to zero when either the precision or the recall approaches zero.
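The following helper is a sketch of how these quantities may be computed when the found and true matchings are stored as data frames of index pairs (x, y); the function name and data layout are illustrative only.

  match_quality <- function(found, truth) {
    # represent each pair as a single key so that set operations can be used
    key       <- function(d) paste(d$x, d$y, sep = "\r")
    correct   <- length(intersect(key(found), key(truth)))
    precision <- if (nrow(found) > 0) correct / nrow(found) else 0
    recall    <- if (nrow(truth) > 0) correct / nrow(truth) else 0
    F1 <- if (precision + recall > 0) {
      2 * precision * recall / (precision + recall)
    } else {
      0
    }
    c(precision = precision, recall = recall, F1 = F1)
  }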

If test sets with known matches are available, the abovementioned measures can easily be computed for different metrics and parameters. In general, it is a good strategy to use cross-validation techniques to evaluate a matching methodology: split the data into training and test sets, and evaluate c05-math-407 on the test sets. Discussion of cross-validation techniques is beyond the scope of this book, but there are several textbooks that introduce the topic. James et al. (2013) discuss basic cross-validation techniques with applications in R, while Hastie et al. (2001) focus on theory. A comprehensive overview of cross-validation theory and techniques is given in the paper by Arlot et al. (2010).

Encoding Issues

The accuracy of string distance computation may decrease when encoding is not taken into account. In particular, some implementations of (edit-based) string metrics compute distances between the byte-wise representations of strings rather than taking account of possible multi-byte characters. Both the distance functions in base R and those in the stringdist package take account of encoding by default and have the option to ignore it.

  stringdist(enc2utf8("Motörhead"), "Motorhead", method="lv")
  ## [1] 1
  stringdist(enc2utf8("Motörhead"), "Motorhead", method="lv", useBytes=TRUE)
  ## [1] 2

Here, we use enc2utf8 to ensure that the results are the same on all operating systems supported by R (R assumes the OS's native encoding if it is not explicitly specified, see Section 3.2.4). In the first example, the Levenshtein distance between "Motörhead" and "Motorhead" equals 1, corresponding to the substitution of "ö" by "o". In the second example, the strings are treated byte-wise. Since "ö" is stored as a two-byte character in UTF-8, going from "ö" to "o" amounts to removing one byte and substituting another. The situation is even worse than this, since UTF-8 has a second way to represent "ö", namely, as an "o" followed by a combining character that adds the umlaut to the previous character. The two different representations are generally not automatically recognized by implementations of distance metrics.

Therefore, before computing string distances, one should ensure at least that all strings are stored in the same encoding. Here, UTF-8 is a good choice since it both supports (nearly) every symbol invented and is widely supported across different software. Second, one needs to ensure that the encoding is normalized so that (combined) characters have a unique byte-wise representation. Techniques for Unicode normalization and encoding conversion were discussed in Section 5.1.1.
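The following sketch illustrates the problem of combining characters and one possible remedy; it assumes the stringi package for Unicode normalization.

  o1 <- "\u00f6"     # precomposed o-umlaut
  o2 <- "o\u0308"    # "o" followed by a combining diaeresis
  # both render as the same symbol, but they are two code points apart
  stringdist(o1, o2, method = "lv")
  # after NFC normalization both strings have the same representation
  stringdist(stringi::stri_trans_nfc(o1),
             stringi::stri_trans_nfc(o2), method = "lv")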
