Normalizing texts

Normalization of text refers to the standardization, or canonicalization, of the tokens we derived from documents in the previous step. In the simplest scenario, query tokens exactly match the tokens in a document, but often that is not the case. The intent of normalization is to bring query terms and index terms into the same form. For instance, if you query U.K., you probably also expect documents containing UK.

Token normalization can be performed either by implicitly creating equivalence classes or by maintaining relations between unnormalized tokens. Superficial differences in the character sequences of tokens make matching query terms to index terms difficult. Consider the words anti-disciplinary and antidisciplinary. If both are mapped to a single term named after one member of the class, for example antidisciplinary, retrieval becomes far more effective: a query on either form fetches the documents containing both. We will deal with asymmetric query expansion in detail in upcoming chapters.
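The equivalence-class idea can be sketched with a tiny normalization function. This is a minimal illustration, not part of any package, and it assumes that stripping periods and hyphens and lowercasing is an acceptable classing rule for the collection at hand:

```r
# Map superficially different tokens to one canonical index term by
# removing periods and hyphens, then lowercasing the result.
normalize_token <- function(token) {
  tolower(gsub("[.-]", "", token))
}

normalize_token("U.K.")               # "uk"
normalize_token("anti-disciplinary")  # "antidisciplinary"
```

With such a mapping applied at both indexing and query time, a query on either surface form retrieves documents containing the other.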

Lemmatization and stemming

Grammar in every language allows derivationally related words with similar meanings, which are simply different forms of the same word, such as develop, developing, and developed. Lemmatization and stemming share a similar objective: reducing inflectional forms and mapping derived words to a common base form.

Stemming is a process of chopping off the ends of words, mostly derivational affixes. Lemmatization is a more principled process: it uses a vocabulary and the morphological analysis of words, removing only the inflectional endings to return the base form, or lemma, of a word.

Stemming

RWeka provides stemming functions that remove common derivational affixes:

> IteratedLovinsStemmer("cars", control = NULL)
[1] "car"

> LovinsStemmer("ponies", control = NULL)
[1] "pon"
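The Lovins-family stemmers are fairly aggressive, as the "pon" output shows. When only inflectional endings need trimming, the Porter stemmer from the SnowballC package is a common lighter-weight alternative (an assumption here: SnowballC is a separate package that must be installed first):

```r
library(SnowballC)

# Porter stemming reduces inflectional variants to a common stem.
wordStem(c("develop", "developing", "developed"), language = "english")
# [1] "develop" "develop" "develop"
```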

Lemmatization

The wordnet package can be used to perform lemmatization. The WordNet database needs to be downloaded and installed, and the installation path must be specified before the R wordnet package can be used.

After downloading WordNet, set the WNHOME environment variable to use the path where WordNet installation can be found:

Sys.setenv(WNHOME = "~/WordNet-3.0")

initDict("~/WordNet-3.0/dict")

setDict("~/WordNet-3.0/dict")

getIndexTerms(pos, maxLimit, filter)

where pos is the part-of-speech type, one of "NOUN", "VERB", "ADJECTIVE", or "ADVERB"; maxLimit is the maximum number of index terms to return; and filter is a term filter created with getTermFilter(). The available filter types can be listed as follows:

> getFilterTypes()

[1] "ContainsFilter"   "EndsWithFilter"   "ExactMatchFilter" "RegexFilter"     
[5] "SoundFilter"      "StartsWithFilter" "WildcardFilter"

You can use any of these filter types to fetch index terms:

> if(initDict()) {
     filter <- getTermFilter("StartsWithFilter", "beer", TRUE)
     getIndexTerms("NOUN", 3, filter)
 }

[[1]]
[1] "Java-Object{Lemma: beer  POS: noun  Tag-Sense-Count: 1
List of Synsets (1)
  #1: 7886849
List of Pointers (3)
  #1: @ (Hypernym)
  #2: ~ (Hyponym)
  #3: + (Derivationally related form)}"

[[2]]
[1] "Java-Object{Lemma: beer barrel  POS: noun  Tag-Sense-Count: 0
List of Synsets (1)
  #1: 2823335
List of Pointers (1)
  #1: @ (Hypernym)}"

[[3]]
[1] "Java-Object{Lemma: beer bottle  POS: noun  Tag-Sense-Count: 1
List of Synsets (1)
  #1: 2823428
List of Pointers (1)
  #1: @ (Hypernym)}"


> if(initDict()) {
     filter <- getTermFilter("EndsWithFilter", "beer", TRUE)
     getIndexTerms("NOUN", 3, filter)
 }


[[1]]
[1] "Java-Object{Lemma: beer  POS: noun  Tag-Sense-Count: 1
List of Synsets (1)
  #1: 7886849
List of Pointers (3)
  #1: @ (Hypernym)
  #2: ~ (Hyponym)
  #3: + (Derivationally related form)}"

[[2]]
[1] "Java-Object{Lemma: birch beer  POS: noun  Tag-Sense-Count: 0
List of Synsets (1)
  #1: 7927716
List of Pointers (1)
  #1: @ (Hypernym)}"

[[3]]
[1] "Java-Object{Lemma: bock beer  POS: noun  Tag-Sense-Count: 0
List of Synsets (1)
  #1: 7887461
List of Pointers (1)
  #1: @ (Hypernym)}"

Synonyms

Consider the following example, which fetches noun index terms ending with organisation:

> if(initDict()) {
      filter <- getTermFilter("EndsWithFilter", "organisation", TRUE)
      getIndexTerms("NOUN", 3, filter)
  }

[[1]]
[1] "Java-Object{Lemma: business organization  POS: noun  Tag-Sense-Count: 0
List of Synsets (1)
  #1: 8061042
List of Pointers (5)
  #1: @ (Hypernym)
  #2: ~ (Hyponym)
  #3: %m (Member meronym)
  #4: ; ([Unknown])
  #5: - ([Unknown])}"

[[2]]
[1] "Java-Object{Lemma: disorganization  POS: noun  Tag-Sense-Count: 0
List of Synsets (2)
  #1: 14500341
  #2: 552922
List of Pointers (2)
  #1: @ (Hypernym)
  #2: + (Derivationally related form)}"

[[3]]
[1] "Java-Object{Lemma: european law enforcement organisation  POS: noun  Tag-Sense-Count: 0
List of Synsets (1)
  #1: 8210042
List of Pointers (1)
  #1: @ (Hypernym)}"

Alternatively, the synonyms() function fetches the synonyms of a term directly:

> synonyms("organisation", "NOUN")

 [1] "administration" "arrangement"    "brass"          "constitution"  
 [5] "establishment"  "formation"      "governance"     "governing body"
 [9] "organisation"   "organization"   "system"
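Such a synonym list can drive simple query expansion at retrieval time. The helper below is a hypothetical sketch (expand_query is not part of the wordnet package) and assumes the dictionary has already been initialized with setDict():

```r
library(wordnet)

# Hypothetical helper: expand a query term with its WordNet noun
# synonyms, falling back to the term alone when none are found.
expand_query <- function(term) {
  syns <- tryCatch(synonyms(term, "NOUN"),
                   error = function(e) character(0))
  unique(c(term, syns))
}

expand_query("organisation")
# includes "organization", "administration", "governing body", ...
```

A query on organisation would then also match documents indexed under organization or system, at the cost of some loss in precision.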