Processing text using regular expressions

The web consists predominantly of unstructured text. One of the main tasks in web scraping is to collect the relevant information from heaps of textual data. Within the unstructured text we are often interested in specific pieces of information, especially when we want to analyze the data using quantitative methods. Such information can include phone numbers, zip codes, latitude and longitude values, or addresses.

First, we gather the unstructured text; next, we determine the recurring patterns behind the information we are looking for; and then we apply these patterns to the unstructured text to extract the information. When we are web scraping, we have to identify and extract those parts of the document that contain the relevant information. Ideally, we can do so using XPath, although sometimes the crucial information is hidden within attribute values or scattered across an HTML document. We need to write regular expressions to retrieve data from such documents. Regular expressions provide us with a syntax for systematically accessing patterns in text.

Let's see some basic string manipulation using the stringr package:

  • To extract a sub-string of a string, let's use str_extract(string, pattern):
    install.packages("stringr")
    library(stringr)
    simpleString <- "Lets learn about text. Ahh! learn text mining."
    str_extract(simpleString, "learn")
    [1] "learn" # returned the match
    
    str_extract(simpleString, "learning")
    
    [1] NA # could not find match
    
  • str_extract_all() returns all the occurrences of the given pattern as a list:
    str_extract_all(simpleString, "learn")
    unlist(str_extract_all(simpleString, "learn"))
    
    [1] "learn" "learn"
    
  • Character matching is case sensitive. Thus, capital letters in regular expressions are different from lowercase letters:
    str_extract(simpleString, "LEARN")
    
    [1] NA # could not find match
    
  • We can change this behavior by wrapping the pattern in regex() with ignore_case = TRUE (older versions of stringr used ignore.case() for this):
    str_extract(simpleString, regex("LEARN", ignore_case = TRUE))
    
  • A string is simply a sequence of characters. We can match single characters or sequences of characters, including white space:
    unlist(str_extract_all(simpleString, "rn"))
    unlist(str_extract_all(simpleString, "Lets le"))
    
  • When we need to find out whether a string starts or ends with specific characters, there are two simple additions we can make to our regular expression to specify locations. The caret symbol (^) at the beginning of a regular expression marks the beginning of a string, and the dollar sign ($) at the end marks the end of a string (see the example after this list for $):
    str_extract_all(c("apple","apricot","peach"),"^a")
    
    [[1]]
    [1] "a" # apple starts with 'a'
    
    [[2]]
    [1] "a" # apricot starts with 'a'
    
    [[3]] # no 'a' at the beginning
    character(0)
    
  • Pipe (|): This character is treated as an OR operator; the function returns all matches to the expressions before and after the pipe:
    str_extract_all(simpleString, "learn|text")
    
  • In order to write more flexible, generalized search queries, we can use the following expressions. The period character matches any character:
    str_extract(simpleString,"t.xt")
    
  • A character class means that any of the characters within the brackets will be matched:
    str_extract(simpleString,"t[ei]xt")
    
  • The previous code extracts the word text, as the character e is part of the character class [ei]. We can add ranges of characters using a dash (-). In this case, any character from e to i is a valid match:
    str_extract(simpleString,"t[e-i]xt")
    
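The $ anchor works analogously at the end of a string. As a quick illustration (this call is not part of the original example set, but reuses the fruit vector from the bullet above), only peach ends with the letter h:

str_extract_all(c("apple","apricot","peach"),"h$")

[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "h"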

The following table lists predefined character classes in R regular expressions:

[:digit:]    Digits: 0 1 2 3 4 5 6 7 8 9
[:lower:]    Lowercase characters: a–z
[:upper:]    Uppercase characters: A–Z
[:alpha:]    Alphabetic characters: a–z and A–Z
[:alnum:]    Digits and alphabetic characters
[:punct:]    Punctuation characters: ., ;, and so on
[:graph:]    Graphical characters: [:alnum:] and [:punct:]
[:blank:]    Blank characters: space and tab
[:space:]    Space characters: space, tab, newline, and other space characters
[:print:]    Printable characters: [:alnum:], [:punct:], and [:space:]

In order to use the predefined classes, we have to enclose them in brackets. Otherwise, R assumes that we are specifying a character class consisting of the constituent characters. The following code matches sequences that start with t, end with t, and contain exactly two alphabetic characters in between, such as the word text:

str_extract(simpleString,"t[[:alpha:]][[:alpha:]]t")

A more readable way is:

str_extract(simpleString,"t[[:alpha:]]{2}t")

The following table lists the quantifiers in regular expressions:

?        The preceding item is optional and will be matched at most once
*        The preceding item will be matched zero or more times
+        The preceding item will be matched one or more times
{n}      The preceding item is matched exactly n times
{n,}     The preceding item is matched n or more times
{n,m}    The preceding item is matched at least n times, but not more than m times

These quantifiers are very powerful and come in handy when we construct regular expressions.
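As a quick illustration of quantifiers in action (these calls are not part of the original example set, but reuse simpleString defined above):

str_extract(simpleString, "Ah+")

[1] "Ahh" # + matches one or more of the preceding h

unlist(str_extract_all(simpleString, "[[:alpha:]]{6,}"))

[1] "mining" # the only run of at least six consecutive letters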

The following table lists symbols with special meaning:

\w    Word characters: [[:alnum:]_]
\W    Non-word characters: [^[:alnum:]_]
\s    Space characters: [[:blank:]]
\S    Non-space characters: [^[:blank:]]
\d    Digits: [[:digit:]]
\D    Non-digits: [^[:digit:]]
\b    Word edge
\B    No word edge
\<    Word beginning
\>    Word end

Note that inside an R character string the backslash itself has to be escaped, so \w is written as "\\w", \d as "\\d", and so on.
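As a quick illustration (the first call reuses simpleString from above; the phone string is made up for illustration):

unlist(str_extract_all(simpleString, "\\w+"))

[1] "Lets"   "learn"  "about"  "text"   "Ahh"    "learn"  "text"   "mining"

str_extract("phone: 555-0113", "\\d{3}-\\d{4}")

[1] "555-0113"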

Now that we have understood some basics of regular expressions, let's take the following messy text and extract the names and numbers out of it:

"3456-2345tom hank(999) 555-0113 Geo, James555-6542Dr. Paul lee485 0945ted frank345-234-56879Steel, Peter5553642Mr. Bond"

The code would be as follows:

mixedString <- "3456-2345tom hank(999) 555-0113 Geo, James555-6542Dr. Paul lee485 0945ted frank345-234-56879Steel, Peter5553642Mr. Bond"
name <- unlist(str_extract_all(mixedString, "[[:alpha:]., ]{2,}"))

The output is as follows:

[1] "tom hank"     " Geo, James"  "Dr. Paul lee" "ted frank"    "Steel, Peter"
[6] "Mr. Bond"
  • To extract the names, we used the regular expression [[:alpha:]., ]{2,}:
    • We used the character class [:alpha:], which tells us that we are looking for alphabetic characters
    • Names can also contain periods, commas, and spaces, which we add to the character class to get [[:alpha:]., ]
    • Finally, the quantifier {2,} restricts the results to matches of at least length two
  • To get the numbers, we can use the following regex (note that the backslashes have to be doubled inside an R string):
    numbers <- unlist(str_extract_all(mixedString,"\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))
    
  • If we want the location of a match in a given string, we use the str_locate() or str_locate_all() functions:
    str_locate(simpleString,"text")
    
  • We can extract a substring using str_sub(); for example, the first ten characters:
    str_sub(simpleString, start = 1, end = 10)
    
  • For replacements we can use str_replace() and str_replace_all():
    str_replace(simpleString, pattern = "text", replacement = "data")
    
  • To split a string into several smaller strings, we can use str_split():
    str_split(simpleString,"[[:blank:]]")
    
  • In order to detect a pattern in a string, we can use the str_detect() function:
    str_detect(simpleString,"text")
    
  • If we need to know the number of occurrences of a pattern, we can use str_count():
    str_count(simpleString,"text")
    
  • To add characters to the edges of strings or to trim blank spaces, we can use str_pad() and str_trim():
    str_trim(simpleString)
    
  • We can join strings using the str_c() function:
    words <- c("lets","learn","text","mining")
    sentence <- str_c(words, collapse = " ")
    
  • One way to deal with messy text data is the agrep() function, which provides approximate matching via the Levenshtein distance. agrep() matches substrings of each element of the vector that is searched, just as grep() does, rather than requiring a match of the entire element:
    agrep("Rahul Dravid","Rahul k Dravid", max.distance = list(all = 3))
    
  • We can change the maximum allowed distance between pattern and string by adjusting the max.distance and costs parameters. The higher the max.distance parameter, the more approximate matches the function will find. Using the costs parameter, we can adjust the costs of the individual operations (insertions, deletions, and substitutions) necessary to transform one string into the other; see the sketch after this list.
  • When dealing with text, encoding plays a significant role. Using the function iconv(), we can translate a string from one encoding scheme to another:
    text.utf8 <- iconv("text", from = "windows-1252", to = "UTF-8")
    
  • Explore the tau package for R; it has a lot of useful methods related to encoding and translation.
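To illustrate the max.distance and costs parameters mentioned above, here is a small sketch; the players vector is made up for illustration, and which elements are returned depends on the distances computed:

# hypothetical candidate strings
players <- c("Rahul Dravid", "Rahul K Dravid", "Raul Dravid", "Sachin")

# allow at most two transformations in total, and make deletions twice as
# costly as insertions or substitutions when computing the Levenshtein cost
agrep("Rahul Dravid", players,
      max.distance = list(all = 2),
      costs = list(insertions = 1, deletions = 2, substitutions = 1),
      value = TRUE)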

Tokenization and segmentation

In this topic, we will learn techniques for the tokenization and segmentation of text, which help us analyze a text and extract useful information from it.

Word tokenization

Tokenization is the process of breaking up a stream of text, a character sequence, or a defined document unit into phrases, words, symbols, or other meaningful elements called tokens. The goal of tokenization is the exploration of the words in a sentence. Before we do any kind of analysis on the text using a language processor, we need to normalize the words. When we do quantitative analysis on text, we treat it as a bag of words and extract the key words, their frequency of occurrence, and the importance of each word in the text. Tokenizing provides various kinds of information about a text, such as the number of words or tokens it contains and its vocabulary, that is, the set of distinct words; a small example follows the terminology list below.

Some terminologies we need to know are listed as follows:

  • Sentence: Unit of written language
  • Utterance: Unit of spoken language
  • Word form: The inflected form as it actually appears in the corpus
  • Lemma: An abstract form, shared by word forms having the same stem and part of speech
  • Word sense: A particular meaning of a word
  • Types: Number of distinct words in a corpus (vocabulary size)
  • Tokens: Total number of words
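As a quick illustration of the difference between types and tokens (a minimal sketch in base R, not from the original text):

s1 <- "lets learn text mining lets learn"
tokens <- unlist(strsplit(s1, "[[:space:]]+"))
length(tokens)          # number of tokens: 6
length(unique(tokens))  # number of types (vocabulary size): 4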

Use the following code to find all the data sets available in the installed packages:

data(package = .packages(all.available = TRUE))

If you need to check the data sets available in a specific package, you may specify the package name as the argument.
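For example, to list only the data sets shipped with the tm package (used in the next step):

data(package = "tm")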

Let's import a data set available in the tm package and build a term-document matrix:

library(tm)
data(acq)

Create a term-document matrix:

tdm <- TermDocumentMatrix(acq)

Access the document IDs, terms, and their counts in the term-document matrix:

Docs(tdm)
nDocs(tdm)
nTerms(tdm)
Terms(tdm)

Operations on a document-term matrix

We will now see how various operations on a term-document matrix work:

  • Frequent terms: From the term-document matrix created in the previous section, let's find the terms that occur at least 30 times:
     findFreqTerms(tdm,30)
    
    
    
     [1] "and"     "company" "dlrs"    "for"     "from"    "has"     "its"     "mln"    
     [9] "pct"     "reuter"  "said"    "shares"  "stock"   "that"    "the"     "was"    
    [17] "will"    "with"    "would"
    
  • Term association: Correlation is a quantitative measure of the co-occurrence of words in multiple documents. We need to provide a term-document matrix and a correlation limit as inputs to the function; for example, if we provide a correlation limit of 0.70, the function will return the terms whose co-occurrence correlation with the search term is at least 0.70.

Let's find the words that correlate with the word stock with a correlation of at least 0.70:

findAssocs(tdm, "stock", 0.70)


  stock
believe    0.80
several    0.80
would      0.75
all        0.71
business.  0.71
partial    0.71
very       0.71

Let us see a couple of tokenizers from different packages in R:

  • The generated term-document matrix may be huge. The size of our matrix is 2103 x 50, and 96% of its entries are zero; that is, most words appear in only a few documents. We can reduce the sparsity of the matrix for computational efficiency. Let's inspect it with inspect(tdm):
    <<TermDocumentMatrix (terms: 2103, documents: 50)>>
    Non-/sparse entries: 4135/101015
    Sparsity: 96%
    Maximal term length: 21
    Weighting: term frequency (tf)
    
  • The output shows that 96% of the entries are zero, so the term-document matrix becomes large even for a small data set. We need to remove the sparse terms:
    inspect(removeSparseTerms(tdm, 0.3))
    <<TermDocumentMatrix (terms: 5, documents: 50)>>
    Non-/sparse entries: 231/19
    Sparsity           : 8%
    Maximal term length: 6
    Weighting          : term frequency (tf)
    
  • Sparsity is reduced from 96% to 8%. We can also pre-process the corpus (convert to lowercase, strip whitespace, and remove English stop words) and then try the MC_tokenizer() from the tm package on a sample sentence:
    library(tm)
    
    # convert to lower case
    acq <- tm_map(acq, content_transformer(tolower))
    
    #remove whitespaces
    acq <- tm_map(acq, stripWhitespace)
    
    #remove stop words(english)
    acq <- tm_map(acq, removeWords, stopwords("english"))
    
    s <- "i am learning text mining. This is exciting . lot to explore Mr. Paul !"
    
    
    MC_tokenizer(s)
    
    
    [1] "i"        "am"       "learning" "text"     "mining"   ""         "This"     "is"       "exciting"
    [10] ""         ""         "lot"      "to"       "explore"  "Mr"       ""         "Paul"     ""  
    
  • This tokenizer produces 18 tokens; we can see that it has removed the punctuation marks and replaced each of them with an empty token:
    scan_tokenizer(s)
    
    
    [1] "i"        "am"       "learning" "text"     "mining."  "This"     "is"       "exciting" "."       
    [10] "lot"      "to"       "explore"  "Mr."      "Paul"     "!"       
    
  • This tokenizer produces 15 tokens, with the punctuation retained
  • The RWeka package provides word, n-gram, and alphabetic tokenizers:
    install.packages("RWeka")
    library(RWeka)
    
    WordTokenizer(s, control = NULL)
    
    
     [1] "i"        "am"       "learning" "text"     "mining"   "This"     "is"       "exciting" "lot"     
    [10] "to"       "explore"  "Mr"       "Paul" 
    
  • When we run the previous tokenizer, we get 13 tokens with all the punctuation removed:
    NGramTokenizer(s, control = NULL) 
     [1] "i am learning"        "am learning text"     "learning text mining"
     [4] "text mining This"     "mining This is"       "This is exciting"    
     [7] "is exciting lot"      "exciting lot to"      "lot to explore"      
    [10] "to explore Mr"        "explore Mr Paul"      "i am"                
    [13] "am learning"          "learning text"        "text mining"         
    [16] "mining This"          "This is"              "is exciting"         
    [19] "exciting lot"         "lot to"               "to explore"          
    [22] "explore Mr"           "Mr Paul"              "i"                   
    [25] "am"                   "learning"             "text"                
    [28] "mining"               "This"                 "is"                  
    [31] "exciting"             "lot"                  "to"                  
    [34] "explore"              "Mr"                   "Paul"  
    
    
    
     AlphabeticTokenizer(s, control = NULL)
     [1] "i"        "am"       "learning" "text"     "mining"   "This"     "is"      
     [8] "exciting" "lot"      "to"       "explore"  "Mr"       "Paul"
    

    Note

    kRp.POS.tags is a function in the koRpus package that can be used to get the set of POS tags for a given language. It supports English (en), German (de), Spanish (es), French (fr), Italian (it), and Russian (ru).

  • Install TreeTagger and set the environment to tell the tree-tagger function in koRpus where it is installed:
    install.packages("koRpus")
    library(koRpus)
    set.kRp.env(TT.cmd="~/bin/treetagger/cmd/tree-tagger", lang="en")
    get.kRp.env(TT.cmd=TRUE)
    

We can extract the vocabulary of a sentence by filtering out repeated words and abbreviations, and by counting word forms that share the same lemma as one type. We can also build our own tokenizer, as sketched below.
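A minimal sketch of a hand-rolled tokenizer in base R (the function name and the pattern are illustrative choices, not from the original text):

# keep runs of letters, digits, and apostrophes as tokens
simpleTokenizer <- function(text) {
  regmatches(text, gregexpr("[[:alnum:]']+", text))[[1]]
}

simpleTokenizer("Bangalore's weather isn't bad!")

[1] "Bangalore's" "weather"     "isn't"       "bad"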

There are various challenges in tokenization; some of them are listed as follows:

  • New York: Should we keep it as one token or two?
  • TEXT: Should we convert all tokens to lowercase?
  • MS, PhD: How do we handle abbreviations?
  • Ernst and Young: Should we consider this as one token?
  • Bangalore's weather: How do we handle the apostrophe? Should we convert it to Bangalore or Bangalores?
  • I'm, isn't: Should we expand these to "I am" and "is not"?

Sentence segmentation

Sentence segmentation is the process of dividing text into sentences, the largest units of words. This task involves determining sentence boundaries, and most languages have punctuation to mark the end of a sentence. Sentence segmentation is also referred to as sentence boundary disambiguation or sentence boundary detection. Some of the factors that affect sentence segmentation are the language, character set, algorithm, application, and data source. Sentences in most languages are delimited by punctuation marks, but the rules for punctuation can vary dramatically, and sentences and sub-sentences are punctuated differently in different languages. So, for successful sentence segmentation, it is important to understand how punctuation is used in the language at hand.

Let's consider English as the language. Recognizing boundaries seems fairly simple, since English has a rich punctuation system with periods, question marks, and exclamation marks. But the period can be quite ambiguous, since it is also used in abbreviations such as Mr. and in decimal numbers such as 1.2.

Let's look at the R openNLP package and its Maxent_Sent_Token_Annotator() function.

Generate an annotator which computes sentence annotations using the Apache OpenNLP Maxent sentence detector:

install.packages("openNLP")
library(NLP)
library(openNLP)
s <- as.String("I am learning text mining. This is exciting . lot to explore Mr. Paul !")
sentence.boundaries <- annotate(s, Maxent_Sent_Token_Annotator(language = "en", probs = FALSE, model = NULL))

sentences <- s[sentence.boundaries]

The output of the detected sentences is as follows:

 id type     start end features
  1 sentence     1  26 
  2 sentence    28  45 
  3 sentence    47  71 