Normalization of text refers to the standardization or canonicalization of the tokens that we derived from documents in the previous step. In the simplest scenario, the query tokens exactly match tokens in the document; however, this is not always the case. The intent of normalization is to bring the query and index terms into the same form. For instance, if you query U.K., you might also expect documents containing UK.
Token normalization can be performed either by implicitly creating equivalence classes or by maintaining relations between unnormalized tokens. Superficial differences in the character sequences of tokens make it difficult to match query and index terms. Consider the words anti-disciplinary and antidisciplinary. If both are mapped to a single term, named after one member of the set (for example, antidisciplinary), retrieval becomes more effective: a query on either term fetches the documents containing either of them. We will deal with asymmetric query expansion in detail in upcoming chapters.
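The implicit equivalence-class approach can be sketched in a few lines of base R. The normalize_token() helper below is hypothetical and not part of any package; it assumes that case, periods, and hyphens are the only superficial differences we want to ignore, which is far simpler than what a real system would need.

```r
# Hypothetical helper: map tokens to a canonical form by lowercasing
# and dropping periods and hyphens (an assumed, deliberately simple rule).
normalize_token <- function(tokens) {
  gsub("[.-]", "", tolower(tokens))
}

index_terms <- c("U.K.", "anti-disciplinary", "Antidisciplinary")
query_terms <- c("UK", "antidisciplinary")

# Tokens that share a normalized form fall into the same equivalence class.
classes <- split(index_terms, normalize_token(index_terms))

# A query term matches when its normalized form equals that of an index term.
matches <- normalize_token(query_terms) %in% normalize_token(index_terms)
```

Because the same function is applied to both the index terms and the query terms, neither side needs to know which surface forms the other side used.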
Languages allow the use of different forms of the same word, as well as derivationally related words with similar meanings, such as develop, developing, and developed. Lemmatization and stemming share a similar objective: reducing inflectional forms and mapping derived words to a common base form.
Stemming is the process of chopping off the ends of words, mostly derivational affixes. Lemmatization is a more principled process that uses a vocabulary and morphological analysis of words, and removes only the inflectional endings to return the base form of a word as output.
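The difference between the two approaches can be illustrated with a toy sketch in base R. Both functions below are hypothetical: naive_stem() strips an assumed, very short list of suffixes, and toy_lemmatize() looks words up in a tiny hand-made dictionary. Neither reflects how a real stemmer or a WordNet-based lemmatizer is implemented.

```r
# Crude stemmer: strip a few common suffixes with no vocabulary check.
# The suffix list is an assumption chosen for this illustration only.
naive_stem <- function(words) {
  sub("(ing|ed|s)$", "", words)
}

# Toy lemmatizer: look each word up in a small hand-made dictionary
# and fall back to the word itself when it is not listed.
lemma_dict <- c(developing = "develop", developed = "develop", ponies = "pony")
toy_lemmatize <- function(words) {
  lemmas <- unname(lemma_dict[words])
  ifelse(is.na(lemmas), words, lemmas)
}

words  <- c("develop", "developing", "developed", "ponies")
stems  <- naive_stem(words)     # "ponies" becomes "ponie", not a real word
lemmas <- toy_lemmatize(words)  # "ponies" becomes "pony", a dictionary form
```

The stemmer needs no linguistic resources but can return strings that are not words, whereas the lemmatizer returns valid base forms only for words its vocabulary covers.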
RWeka provides stemming functions to remove the common derivational affixes:
> IteratedLovinsStemmer("cars", control = NULL)
[1] "car"
> LovinsStemmer("ponies", control = NULL)
[1] "pon"
The wordnet package can be used to perform lemmatization effectively. The WordNet database needs to be downloaded and installed, and the installation path needs to be specified before the R wordnet package can be used.
After downloading WordNet, set the WNHOME environment variable to the path where the WordNet installation can be found:
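> Sys.setenv(WNHOME = "~/WordNet-3.0")
> library(wordnet)
> initDict("~/WordNet-3.0/dict/")
> setDict("~/WordNet-3.0/dict")
Index terms are fetched with the following function:
getIndexTerms(pos, maxLimit, filter)
Here, pos is the part-of-speech type, which can be "NOUN", "VERB", "ADJECTIVE", or "ADVERB"; maxLimit is the maximum number of results to return; and filter is a term filter created with getTermFilter(). The available filter types are listed by getFilterTypes():
> getFilterTypes()
[1] "ContainsFilter"   "EndsWithFilter"   "ExactMatchFilter" "RegexFilter"
[5] "SoundFilter"      "StartsWithFilter" "WildcardFilter"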
You can use any of these filter types to fetch the index terms:
> if (initDict()) {
    filter <- getTermFilter("StartsWithFilter", "beer", TRUE)
    getIndexTerms("NOUN", 3, filter)
  }
[[1]]
[1] "Java-Object{Lemma: beer POS: noun Tag-Sense-Count: 1 List of Synsets (1) #1: 7886849 List of Pointers (3) #1: @ (Hypernym) #2: ~ (Hyponym) #3: + (Derivationally related form)}"

[[2]]
[1] "Java-Object{Lemma: beer barrel POS: noun Tag-Sense-Count: 0 List of Synsets (1) #1: 2823335 List of Pointers (1) #1: @ (Hypernym)}"

[[3]]
[1] "Java-Object{Lemma: beer bottle POS: noun Tag-Sense-Count: 1 List of Synsets (1) #1: 2823428 List of Pointers (1) #1: @ (Hypernym)}"

> if (initDict()) {
    filter <- getTermFilter("EndsWithFilter", "beer", TRUE)
    getIndexTerms("NOUN", 3, filter)
  }
[[1]]
[1] "Java-Object{Lemma: beer POS: noun Tag-Sense-Count: 1 List of Synsets (1) #1: 7886849 List of Pointers (3) #1: @ (Hypernym) #2: ~ (Hyponym) #3: + (Derivationally related form)}"

[[2]]
[1] "Java-Object{Lemma: birch beer POS: noun Tag-Sense-Count: 0 List of Synsets (1) #1: 7927716 List of Pointers (1) #1: @ (Hypernym)}"

[[3]]
[1] "Java-Object{Lemma: bock beer POS: noun Tag-Sense-Count: 0 List of Synsets (1) #1: 7887461 List of Pointers (1) #1: @ (Hypernym)}"
Let's look at the following example:
> if (initDict()) {
    filter <- getTermFilter("EndsWithFilter", "organisation", TRUE)
    getIndexTerms("NOUN", 3, filter)
  }
[[1]]
[1] "Java-Object{Lemma: business organization POS: noun Tag-Sense-Count: 0 List of Synsets (1) #1: 8061042 List of Pointers (5) #1: @ (Hypernym) #2: ~ (Hyponym) #3: %m (Member meronym) #4: ; ([Unknown]) #5: - ([Unknown])}"

[[2]]
[1] "Java-Object{Lemma: disorganization POS: noun Tag-Sense-Count: 0 List of Synsets (2) #1: 14500341 #2: 552922 List of Pointers (2) #1: @ (Hypernym) #2: + (Derivationally related form)}"

[[3]]
[1] "Java-Object{Lemma: european law enforcement organisation POS: noun Tag-Sense-Count: 0 List of Synsets (1) #1: 8210042 List of Pointers (1) #1: @ (Hypernym)}"

Alternatively, the synonyms() function returns the synonyms of a word for a given part of speech:
> synonyms("organisation", "NOUN")
 [1] "administration" "arrangement"    "brass"          "constitution"
 [5] "establishment"  "formation"      "governance"     "governing body"
 [9] "organisation"   "organization"   "system"