Bag of words representation

scikit-learn has a handy module called feature_extraction that allows us to, as the name suggests, extract features from data such as text in a format supported by machine learning algorithms. This module provides the methods we will use when working with text.

Going forward, we may refer to our text data as a corpus, meaning a collection of text content or documents.

The most common method for transforming a corpus into a numerical representation, a process known as vectorization, is the bag-of-words approach. The basic idea behind bag of words is that documents are described by their word occurrences while completely ignoring the positions of words within the document. In its simplest form, text is represented as a bag, that is, a multiset: grammar and word order are disregarded, but multiplicity (how many times each word occurs) is kept. A bag of words representation is achieved in the following three steps:

  • Tokenizing
  • Counting
  • Normalizing

Let's start with tokenizing. This process uses whitespace and punctuation to separate words from each other, turning them into tokens. Each distinct token is then given an integer ID.
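To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer; the two-sentence corpus is invented for illustration. Fitting the vectorizer builds the token-to-ID mapping, and build_analyzer() exposes the tokenizing function so we can inspect the tokens themselves:

from sklearn.feature_extraction.text import CountVectorizer

# A tiny corpus invented for illustration.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps.",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# Each distinct (lowercased) token has been assigned an integer ID.
print(vectorizer.vocabulary_)
# e.g. {'the': 8, 'quick': 6, 'brown': 0, 'fox': 2, ...}

# build_analyzer() returns the preprocessing and tokenizing function itself.
analyze = vectorizer.build_analyzer()
print(analyze("The dog sleeps."))  # ['the', 'dog', 'sleeps']

By default, CountVectorizer lowercases the text and keeps only tokens of two or more word characters, which is why "The" becomes "the" in the output.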

Next comes counting. This step simply counts the occurrences of tokens within a document.
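Continuing the sketch with the same invented corpus, fit_transform performs tokenizing and counting in one step, producing a sparse document-term matrix in which entry (i, j) is the number of times token j occurs in document i:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix of shape (2, 9)

# Column labels for the matrix (scikit-learn >= 1.0).
print(vectorizer.get_feature_names_out())
print(counts.toarray())
# "the" occurs twice in the first document, so its column holds a 2 there.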

Last comes normalizing, meaning that tokens are weighted with diminishing importance when they occur in the majority of documents. 
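As a final sketch, TfidfTransformer applies this kind of normalization to the count matrix: tokens that occur in most documents receive a lower inverse-document-frequency weight, and by default each document vector is rescaled to unit length:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog sleeps.",
]

counts = CountVectorizer().fit_transform(corpus)

# Down-weight tokens with high document frequency (tf-idf) and
# rescale each row to unit length (the default settings).
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray())

Note that scikit-learn also provides TfidfVectorizer, which combines all three steps, tokenizing, counting, and normalizing, into a single estimator.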

Let's consider a couple more methods for vectorizing. 
