The web consists predominantly of unstructured text. One of the main tasks in web scraping is to collect the relevant information from heaps of textual data. Within the unstructured text we are often interested in specific pieces of information, especially when we want to analyze the data using quantitative methods. Such information includes phone numbers, zip codes, latitude and longitude values, and addresses.
First, we gather the unstructured text; next, we determine the recurring patterns behind the information we are looking for; and then we apply these patterns to the unstructured text to extract the information. When we are web scraping, we have to identify and extract those parts of the document that contain the relevant information. Ideally, we can do so using XPath, although sometimes the crucial information is hidden within attribute values, and sometimes it is scattered across the HTML document. In such cases, we write regular expressions to retrieve the data. Regular expressions provide a syntax for systematically describing and accessing patterns in text.
Let's see some basic string manipulation using the stringr package. str_extract(string, pattern) returns the first match of a pattern:

install.packages("stringr")
library(stringr)
simpleString <- "Lets learn about text. Ahh! learn text mining."
str_extract(simpleString, "learn")
[1] "learn" # returned the match
str_extract(simpleString, "learning")
[1] NA # could not find a match
str_extract_all() returns all matches:

str_extract_all(simpleString, "learn")
unlist(str_extract_all(simpleString, "learn"))
[1] "learn" "learn"
Matching is case sensitive by default:

str_extract(simpleString, "LEARN")
[1] NA # could not find a match
To ignore the case of the pattern, we can wrap it in ignore.case():

str_extract(simpleString, ignore.case("LEARN"))
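Note that ignore.case() has been removed from recent versions of stringr; if the call above fails, the equivalent call (assuming stringr 1.0 or later) is:

str_extract(simpleString, regex("LEARN", ignore_case = TRUE))
[1] "learn"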
Patterns do not have to be whole words; any substring can be matched:

unlist(str_extract_all(simpleString, "rn"))
unlist(str_extract_all(simpleString, "Lets le"))
The caret ^ anchors a pattern to the beginning of a string:

str_extract_all(c("apple", "apricot", "peach"), "^a")
[[1]]
[1] "a" # apple starts with 'a'
[[2]]
[1] "a" # apricot starts with 'a'
[[3]] # no 'a' at the beginning
character(0)
Using the OR operator |, the function returns all matches to the expressions before and after the pipe:

str_extract_all(simpleString, "learn|text")
str_extract(simpleString,"t.xt")
str_extract(simpleString,"t[ei]xt")
e
is part of the character class [ei]
. We can add ranges of characters using a dash -
.In this case, any characters from e
to i
are valid matches:str_extract(simpleString,"t[e-i]xt")
The following table lists predefined character classes in R regular expressions:

[:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9
[:lower:] | Lowercase characters: a–z
[:upper:] | Uppercase characters: A–Z
[:alpha:] | Alphabetic characters: a–z and A–Z
[:alnum:] | Digits and alphabetic characters
[:punct:] | Punctuation characters: ., ;, and so on
[:graph:] | Graphical characters: [:alnum:] and [:punct:]
[:blank:] | Blank characters: space and tab
[:space:] | Space characters: space, tab, newline, and other space characters
[:print:] | Printable characters: [:alnum:], [:punct:], and space
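As a quick illustration of two of these classes (a small sketch reusing the stringr functions from above):

unlist(str_extract_all("Call 555-0113 now!", "[[:digit:]]+"))
[1] "555"  "0113"
unlist(str_extract_all("Call 555-0113 now!", "[[:punct:]]"))
[1] "-" "!"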
In order to use the predefined classes, we have to enclose them in an additional set of brackets. Otherwise, R assumes that we are specifying a character class consisting of the constituent characters. The following code matches four-character words that begin with t and end with t:
str_extract(simpleString,"t[[:alpha:]][[:alpha:]]t")
A more readable way is:
str_extract(simpleString,"t[[:alpha:]]{2}t")
The following table lists the quantifiers in regular expressions:

?     | The preceding item is optional and will be matched at most once
*     | The preceding item will be matched zero or more times
+     | The preceding item will be matched one or more times
{n}   | The preceding item is matched exactly n times
{n,}  | The preceding item is matched n or more times
{n,m} | The preceding item is matched at least n times, but not more than m times
These quantifiers are very powerful and come in handy when we construct regular expressions.
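For instance, a short sketch of the quantifiers applied to simpleString:

str_extract(simpleString, "Ah+")              # one or more h
[1] "Ahh"
str_extract(simpleString, "[[:alpha:]]{4,6}") # four to six letters
[1] "Lets"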
The following table lists symbols with special meaning:

\w | Word characters: [[:alnum:]_]
\W | Non-word characters
\s | Space characters: [[:space:]]
\S | Non-space characters
\d | Digits: [[:digit:]]
\D | Non-digits
\b | Word edge
\B | No word edge
\< | Word beginning
\> | Word end
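A brief sketch of these shorthands; note that backslashes must be doubled inside R strings:

unlist(str_extract_all("He paid 30 dollars for 2 books", "\\d+"))
[1] "30" "2"
str_extract(simpleString, "\\btext\\b")
[1] "text"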
Now that we have understood some basics of regular expressions, let's take the following messy text and extract the names and numbers out of it:
"3456-2345tom hank(999) 555-0113 Geo, James555-6542Dr. Paul lee485 0945ted frank345-234-56879Steel, Peter5553642Mr. Bond"
The code would be as follows:
mixedString <- "3456-2345tom hank(999) 555-0113 Geo, James555-6542Dr. Paul lee485 0945ted frank345-234-56879Steel, Peter5553642Mr. Bond"
name <- unlist(str_extract_all(mixedString, "[[:alpha:]., ]{2,}"))
The output is as shown following:

[1] "tom hank"     " Geo, James"  "Dr. Paul lee" "ted frank"
[5] "Steel, Peter" "Mr. Bond"

Let's break the pattern [[:alpha:]., ]{2,} down. The innermost part, [:alpha:], tells us that we are looking for alphabetic characters. The enclosing class [[:alpha:]., ] additionally allows periods, commas, and spaces, and the quantifier {2,} asks for at least two such characters in a row.
To extract the numbers, we have to escape the backslashes inside the R string (the unescaped \d would throw an 'unrecognized escape' error), allow an optional area code in parentheses, and allow separators that can be a dash, a space, or nothing:

numbers <- unlist(str_extract_all(mixedString, "\\(?(\\d{3})?\\)?(-| )?\\d{3}(-| )?\\d{4}"))
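With this corrected pattern, the extraction should yield something like the following (note that the leading 3 of the first number is dropped, since a four-digit prefix is not covered by the pattern):

numbers
[1] "456-2345"       "(999) 555-0113" "555-6542"       "485 0945"
[5] "345-234-5687"   "5553642"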
To locate the position of a pattern within a string, we use str_locate() or str_locate_all():

str_locate(simpleString, "text")
str_sub() extracts a substring by position; using the start and end positions returned above:

str_sub(simpleString, start = 18, end = 21)
To replace a pattern, we use str_replace() and str_replace_all():

str_replace(simpleString, pattern = "text", replacement = "data")
str_split() splits a string at every occurrence of a pattern, here at blanks:

str_split(simpleString, "[[:blank:]]")
To simply check whether a pattern occurs in a string, we use the str_detect() function:

str_detect(simpleString, "text")
str_count() counts the number of matches in a string:

str_count(simpleString, "text")
To pad strings with spaces or trim surrounding whitespace, stringr offers str_pad() and str_trim(). To remove leading and trailing spaces, we can use:

str_trim(simpleString)
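Conversely, str_pad() pads a string to a given width; a minimal example:

str_pad("text", width = 10, side = "both", pad = "#")
[1] "###text###"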
To join several strings into one, we use the str_c() function:

words <- c("lets", "learn", "text", "mining")
sentence <- str_c(words, collapse = " ")
For approximate matching, base R offers the agrep() function, which matches via the Levenshtein distance. agrep() matches substrings of each element of the searched string, just like grep(), and not only the entire word:

agrep("Rahul Dravid", "Rahul k Dravid", max.distance = list(all = 3))
The amount of fuzziness is controlled by max.distance and the costs parameter. The higher the max.distance parameter, the more approximate matches agrep() will find. Using the costs parameter, you can adjust the costs of the different edit operations needed to turn one string into the other.

Using iconv(), we can translate a string from one encoding scheme to another:

text.utf8 <- iconv("text", from = "windows-1252", to = "UTF-8")
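For example, the byte \xe9, which encodes é in windows-1252, is converted to its UTF-8 representation (assuming a UTF-8 locale for printing):

iconv("caf\xe9", from = "windows-1252", to = "UTF-8")
[1] "café"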
In this topic, we will learn techniques for the tokenization and segmentation of text, which let us analyze a text and extract useful information from it.

Tokenization is the process of breaking up a stream of text, a character sequence, or a defined document unit into phrases, words, symbols, or other meaningful elements called tokens. The goal of tokenization is the exploration of the words in a sentence. Before we do any kind of analysis on the text using a language processor, we need to normalize the words. When we do quantitative analysis on text, we treat it as a bag of words and extract the key words, their frequency of occurrence, and the importance of each word in the text. Tokenizing provides various kinds of information about a text, such as the number of words or tokens it contains and its vocabulary, the set of distinct word types.
Some terminology we need to know: a token is an instance of a sequence of characters grouped together as a useful semantic unit; a type is the class of all tokens containing the same character sequence; and the vocabulary of a text is its set of types.
Use the following code to find all the data sets available in the installed packages:
data(package = .packages(all.available = TRUE))
If you need to check the data sets available in a specific package, you may specify the package name as the argument.
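For instance, to list only the data sets shipped with the tm package:

data(package = "tm")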
Let's import a data set available in the tm package and build a term-document matrix:
library(tm)
data(acq)
Create a term-document matrix:
tdm <- TermDocumentMatrix(acq)
Access the document IDs, terms, and their counts in the term-document matrix:
Docs(tdm)
nDocs(tdm)
nTerms(tdm)
Terms(tdm)
We will now see how various operations on the term-document matrix work. findFreqTerms() returns the terms that occur at least a given number of times:

findFreqTerms(tdm, 30)
 [1] "and"     "company" "dlrs"    "for"     "from"    "has"     "its"     "mln"
 [9] "pct"     "reuter"  "said"    "shares"  "stock"   "that"    "the"     "was"
[17] "will"    "with"    "would"
Let's find the words that correlate with the word stock with a correlation of at least 0.70:
findAssocs(tdm, "stock", 0.70)
$stock
  believe   several     would       all business.   partial      very
     0.80      0.80      0.75      0.71      0.71      0.71      0.71
Inspecting the matrix shows that most of its entries are empty; removing sparse terms shrinks the matrix considerably and improves efficiency. inspect(tdm):

<<TermDocumentMatrix (terms: 2103, documents: 50)>>
Non-/sparse entries: 4135/101015
Sparsity: 96%
Maximal term length: 21
Weighting: term frequency (tf)

inspect(removeSparseTerms(tdm, 0.3))
<<TermDocumentMatrix (terms: 5, documents: 50)>>
Non-/sparse entries: 231/19
Sparsity: 8%
Maximal term length: 6
Weighting: term frequency (tf)

Before tokenizing, we normalize the corpus:

library(tm)
# convert to lower case
acq <- tm_map(acq, content_transformer(tolower))
# remove whitespace
acq <- tm_map(acq, stripWhitespace)
# remove stop words (English)
acq <- tm_map(acq, removeWords, stopwords("english"))

Let us now see a couple of tokenizers from different packages in R. The tm package itself ships MC_tokenizer() and scan_tokenizer():

s <- "i am learning text mining. This is exciting . lot to explore Mr. Paul."
MC_tokenizer(s)
 [1] "i" "am" "learning" "text" "mining" "" "This" "is" "exciting"
[10] "" "" "lot" "to" "explore" "Mr" "" "Paul" ""
scan_tokenizer() splits on whitespace, keeping punctuation attached to the tokens:

scan_tokenizer(s)
 [1] "i" "am" "learning" "text" "mining." "This" "is" "exciting" "."
[10] "lot" "to" "explore" "Mr." "Paul."
The RWeka package provides word, n-gram, and alphabetic tokenizers:

install.packages("RWeka")
library(RWeka)
WordTokenizer(s, control = NULL)
 [1] "i" "am" "learning" "text" "mining" "This" "is" "exciting" "lot"
[10] "to" "explore" "Mr" "Paul"

NGramTokenizer(s, control = NULL)
 [1] "i am learning"        "am learning text"     "learning text mining"
 [4] "text mining This"     "mining This is"       "This is exciting"
 [7] "is exciting lot"      "exciting lot to"      "lot to explore"
[10] "to explore Mr"        "explore Mr Paul"      "i am"
[13] "am learning"          "learning text"        "text mining"
[16] "mining This"          "This is"              "is exciting"
[19] "exciting lot"         "lot to"               "to explore"
[22] "explore Mr"           "Mr Paul"              "i"
[25] "am"                   "learning"             "text"
[28] "mining"               "This"                 "is"
[31] "exciting"             "lot"                  "to"
[34] "explore"              "Mr"                   "Paul"

AlphabeticTokenizer(s, control = NULL)
 [1] "i" "am" "learning" "text" "mining" "This" "is"
 [8] "exciting" "lot" "to" "explore" "Mr" "Paul"
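The control argument lets us restrict the n-gram range via RWeka's Weka_control interface; for instance, a sketch that keeps bigrams only:

NGramTokenizer(s, Weka_control(min = 2, max = 2))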
The koRpus package relies on the external TreeTagger tool for tokenizing and tagging; after installing it, we tell koRpus where the tree-tagger command is installed:

install.packages("koRpus")
library(koRpus)
set.kRp.env(TT.cmd = "~/bin/treetagger/cmd/tree-tagger", lang = "en")
get.kRp.env(TT.cmd = TRUE)
We can extract the vocabulary of a sentence by filtering out repeated words and abbreviations, and by counting words that share the same lemma as one. We can also build our own tokenizer, as sketched below.
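A minimal sketch of such a custom tokenizer using only base R; the name myTokenizer and the exact normalization steps are illustrative assumptions, not a standard API:

myTokenizer <- function(x) {
  # illustrative custom tokenizer: lowercase, strip punctuation, split on whitespace
  x <- tolower(x)
  # replace everything except letters, digits, and spaces
  x <- gsub("[^[:alnum:][:space:]]", " ", x)
  # split on runs of whitespace and drop empty tokens
  tokens <- unlist(strsplit(x, "[[:space:]]+"))
  tokens[tokens != ""]
}
myTokenizer(s)
 [1] "i" "am" "learning" "text" "mining" "this" "is"
 [8] "exciting" "lot" "to" "explore" "mr" "paul"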
There are various challenges to tokenization; some of them are: punctuation attached to words, hyphenated and compound words, contractions and apostrophes (for example, don't or O'Neill), abbreviations that end in a period, and languages such as Chinese or Japanese that do not delimit words with spaces.
Sentence segmentation is the process of determining the longest units of text, the sentences. This task involves determining sentence boundaries, and most languages have punctuation marks that define the end of a sentence. Sentence segmentation is also referred to as sentence boundary disambiguation or sentence boundary detection. Factors that affect sentence segmentation include the language, the character set, the algorithm, the application, and the data source. Sentences in most languages are delimited by punctuation marks, but the punctuation rules can vary dramatically, and sentences and sub-sentences are punctuated differently in different languages. So, for successful sentence segmentation, understanding how punctuation is used in the target language is important.
Let's consider English as the language. Recognizing boundaries should be fairly simple, since English has a rich punctuation system with periods, question marks, and exclamation marks. But the period is ambiguous: it is also used in abbreviations like Mr. and in decimal numbers like 1.2.
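To see the ambiguity concretely, a naive splitter that breaks on every period followed by a space mishandles the abbreviation (a base R sketch, not a real sentence detector):

unlist(strsplit("This is exciting. Lot to explore Mr. Paul.", "\\. "))
[1] "This is exciting"  "Lot to explore Mr" "Paul."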
Let's look at the R openNLP package and its Maxent_Sent_Token_Annotator() function.
Generate an annotator that computes sentence annotations using the Apache OpenNLP Maxent sentence detector:
install.packages("openNLP")
library(NLP)    # provides annotate() and the String class
library(openNLP)
s <- as.String("I am learning text mining. This is exciting.lot to explore Mr. Paul!")
sentence.boundaries <- annotate(s, Maxent_Sent_Token_Annotator(language = "en", probs = FALSE, model = NULL))
sentences <- s[sentence.boundaries]
The output of the detected sentences is as shown following:

 id type     start end features
 1  sentence     1  26
 2  sentence    28  45
 3  sentence    47  71