Text tokenization

To perform feature extraction, we first need to split the original text into the individual words, or tokens, that compose it. However, we do not need to keep every word or character. We can, for example, directly skip punctuation or unimportant words such as prepositions or articles, which mostly do not carry any useful information.

Furthermore, a common practice is to normalize tokens to a common representation. This can include unifying characters (for example, using only lowercase characters, removing diacritics, or using a common character encoding such as UTF-8) and reducing words to a common form (so-called stemming; for example, "cry"/"cries"/"cried" is represented by "cry").
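As a rough illustration of character-level unification, the following minimal sketch uses only the Python standard library; the helper name normalize_token is our own, and stemming itself (which typically requires a library such as NLTK) is omitted here:

```python
import unicodedata

def normalize_token(token):
    # Lowercase the token so that "Because" and "because" map to the same form.
    token = token.lower()
    # Decompose accented characters and drop the combining marks (diacritics),
    # for example "café" -> "cafe".
    token = unicodedata.normalize("NFKD", token)
    token = token.encode("ascii", "ignore").decode("ascii")
    return token

print(normalize_token("Café"))  # prints: cafe
```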

In our example, we will perform this process using the following steps:

  1. Lowercase all words ("Because" and "because" are the same word).
  2. Remove punctuation symbols with a regular expression function.
  3. Remove stopwords. These are essentially common connector words, such as in, at, the, and, and so on, which add no contextual meaning to the review that we want to classify.
  4. Find "rare tokens" that occur fewer than three times in total in our corpus of reviews.
  5. Finally, remove all "rare tokens."

Each of the steps in the preceding sequence represents a best practice when doing sentiment classification on text. For your own situation, you may not want to lowercase all words (for example, "Python", the language, and "python", the snake, is an important distinction!). Furthermore, your stopword list, if you choose to include one, may be different and incorporate more business logic given your task. One website that has done a fine job of collecting lists of stopwords is http://www.ranks.nl/stopwords.
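To make the five steps concrete, here is a minimal sketch in plain Python. The STOPWORDS set is a tiny illustrative subset rather than a full stopword list, and the tokenize helper and the sample reviews are purely illustrative, not code from our project:

```python
import re
from collections import Counter

# A tiny placeholder stopword set; in practice you would use a much larger list,
# for example one of the lists collected at http://www.ranks.nl/stopwords.
STOPWORDS = {"in", "at", "the", "and", "a", "an", "of", "to", "is", "was", "it", "i"}

def tokenize(review):
    # Steps 1-3: lowercase, strip punctuation with a regular expression, drop stopwords.
    words = re.sub(r"[^a-z0-9\s]", " ", review.lower()).split()
    return [w for w in words if w not in STOPWORDS]

reviews = [
    "Because the movie was great, I loved it!",
    "The movie was great and the cast was great.",
]

tokenized = [tokenize(r) for r in reviews]

# Steps 4-5: count token occurrences over the whole corpus and drop "rare tokens"
# that appear fewer than three times in total.
counts = Counter(token for tokens in tokenized for token in tokens)
tokenized = [[t for t in tokens if counts[t] >= 3] for tokens in tokenized]

print(tokenized)  # tokens surviving all five steps: [['great'], ['great', 'great']]
```

Note that with such a small sample corpus almost everything counts as a "rare token"; on a real review dataset the threshold of three occurrences removes only the long tail of misspellings and one-off words.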