CountVectorizer parameters

A few parameters that we will go over include:

  • stop_words
  • min_df
  • max_df
  • ngram_range
  • analyzer

stop_words is a frequently used parameter in CountVectorizer. You can pass the string 'english' to this parameter, and a built-in stop word list for English is used. You can also specify a list of words yourself. These words will then be removed from the tokens and will not appear as features in your data.

Here is an example:

vect = CountVectorizer(stop_words='english')  # removes a set of english stop words (if, a, the, etc)
_ = vect.fit_transform(X)
print _.shape

(99989, 105545)

You can see that the number of feature columns has gone down from 105,849 when stop words were not used to 105,545 with English stop words set. The idea behind using stop words is to remove noise from your features by taking out words that occur so often that there is little meaning to garner from them in your models.
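You can also peek at the built-in English list, or supply your own list when the built-in one doesn't fit your domain. Here is a minimal sketch on a small, made-up corpus (the corpus and the stop word list below are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

print(len(ENGLISH_STOP_WORDS))  # size of the built-in English stop word list
docs = ["the tweet was great", "a great day for a tweet"]  # hypothetical toy corpus
vect = CountVectorizer(stop_words=['the', 'a', 'for'])  # custom stop word list
vect.fit(docs)
print(vect.get_feature_names())  # use get_feature_names_out() on newer scikit-learn
# ['day', 'great', 'tweet', 'was'] -- the custom stop words no longer appear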

Another parameter is called min_df. This parameter is used to trim the number of features by ignoring terms that have a document frequency lower than the given threshold or cut-off.

Here is an implementation of our CountVectorizer with min_df:

vect = CountVectorizer(min_df=.05)  # only include words that occur in at least 5% of the documents
# used to trim the number of features
_ = vect.fit_transform(X)
print _.shape

(99989, 31)

This is a handy way to significantly reduce the number of features created.
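Note that min_df also accepts an absolute document count instead of a proportion: a float is read as a fraction of documents, while an integer is read as a minimum number of documents. Here is a minimal sketch on a made-up corpus (illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["data science is fun", "data science is hard", "pizza is fun"]  # hypothetical toy corpus
vect = CountVectorizer(min_df=2)  # keep only words appearing in at least 2 documents
vect.fit(docs)
print(vect.get_feature_names())  # use get_feature_names_out() on newer scikit-learn
# ['data', 'fun', 'is', 'science'] -- 'hard' and 'pizza' appear in only one document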

There is also a parameter called max_df:

vect = CountVectorizer(max_df=.8)  # only include words that occur in at most 80% of the documents
# used to detect corpus-specific stop words
_ = vect.fit_transform(X)
print _.shape

(99989, 105849)

This is similar to automatically detecting stop words: terms that appear in more than 80% of the documents are considered too common to be informative and are dropped. In this corpus, no term appears in more than 80% of the documents, so the feature count is unchanged at 105,849.
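After fitting, you can see exactly which terms were thrown out by the min_df/max_df cut-offs through the fitted vectorizer's stop_words_ attribute. A minimal sketch on a made-up corpus (illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["is this good", "is this bad", "is this it"]  # hypothetical toy corpus
vect = CountVectorizer(max_df=0.8)  # drop words appearing in more than 80% of the documents
vect.fit(docs)
print(vect.stop_words_)  # terms removed as corpus-specific stop words
# 'is' and 'this' appear in every document, so they are dropped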

Next, let's look at the ngram_range parameter. This parameter takes a tuple giving the lower and upper bounds of the n-values for the n-grams to be extracted. N-grams represent phrases: an n of one is a single token, while an n of two is two consecutive tokens taken together. As you can imagine, this will expand our feature set quite significantly:

vect = CountVectorizer(ngram_range=(1, 5))  # also includes phrases up to 5 words
_ = vect.fit_transform(X)
print _.shape  # explodes the number of features

(99989, 3219557)

We now have 3,219,557 features. Since phrases (sets of consecutive words) can carry more meaning than single words, using n-gram ranges can be useful for modeling.
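To make the n-gram idea concrete, here is a minimal sketch on a single made-up sentence (illustrative only), showing the unigram and bigram features that come out:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning is fun"]  # hypothetical toy corpus
vect = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vect.fit(docs)
print(vect.get_feature_names())  # use get_feature_names_out() on newer scikit-learn
# ['fun', 'is', 'is fun', 'learning', 'learning is', 'machine', 'machine learning']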

You can also set an analyzer as a parameter in CountVectorizer. The analyzer determines whether features should be made of word or character n-grams. Word is the default:

vect = CountVectorizer(analyzer='word')  # default analyzer, decides to split into words
_ = vect.fit_transform(X)
print _.shape

(99989, 105849)

Given that word is the default, the number of feature columns stays at 105,849, the same as the original.
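Setting the analyzer to 'char' (or 'char_wb', which keeps n-grams inside word boundaries) switches to character n-grams instead, typically combined with an ngram_range. A minimal sketch on a made-up corpus (illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["cool tweet"]  # hypothetical toy corpus
vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))  # character 2- and 3-grams within words
vect.fit(docs)
print(vect.get_feature_names())  # fragments such as 'co', 'coo', 'oo', 'ool', ...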

We can even create our own custom analyzer. Conceptually, words are built from root words, or stems, and we can construct a custom analyzer that accounts for this.

Stemming is a common natural language processing technique that shrinks our vocabulary by converting words to their roots. The Natural Language Toolkit (NLTK) provides several packages for working with text data; one of them is a stemmer.

Let's see how it works:

  1. First, import our stemmer and then initialize it:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
  2. Now, let's see how some words are stemmed: 
stemmer.stem('interesting')
u'interest'
  3. So, the word interesting is reduced to the stem interest. We can now use this to create a function that tokenizes words into their stems:
# define a function that accepts text and returns a list of stems
def word_tokenize(text):
    words = text.split(' ')  # tokenize into words
    return [stemmer.stem(word) for word in words]  # stem each word
  4. Let's see what our function outputs:
word_tokenize("hello you are very interesting")

[u'hello', u'you', u'are', u'veri', u'interest']
  5. We can now place this tokenizer function into our analyzer parameter:
vect = CountVectorizer(analyzer=word_tokenize)
_ = vect.fit_transform(X)
print _.shape  # note: a custom analyzer bypasses the built-in lowercasing and token pattern

(99989, 154397)

Interestingly, the feature count rises to 154,397 rather than falling. Stemming does collapse related words into a single root, but a custom analyzer bypasses CountVectorizer's built-in preprocessing (lowercasing and the default token pattern), so case and punctuation variants of the same word now count as separate features, which more than offsets the reduction from stemming.
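To see the collapsing effect of stemming on a small scale, with lowercasing added back in by hand, here is a minimal sketch on a made-up corpus (the corpus and the stem_tokenize helper are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def stem_tokenize(text):
    # lowercase by hand, since a custom analyzer bypasses the built-in preprocessing
    return [stemmer.stem(word) for word in text.lower().split(' ')]

docs = ["you are interesting", "you are interested"]  # hypothetical toy corpus
vect = CountVectorizer(analyzer=stem_tokenize)
vect.fit(docs)
print(vect.get_feature_names())  # both 'interesting' and 'interested' collapse to the single stem 'interest'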

CountVectorizer is a very useful tool for converting text into numerical features and expanding our feature set. There is another common vectorizer that we will look into next.
