Extending the vectorizer with NLTK's stemmer

We need to stem the posts before we feed them into CountVectorizer. The class provides several hooks for customizing its preprocessing and tokenization stages. The preprocessor and tokenizer can be passed as parameters to the constructor. We do not want to place the stemmer in either of them, because we would then have to do the tokenization and normalization ourselves. Instead, we override the build_analyzer method:

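import nltk.stem
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the parent's analyzer (lowercasing, tokenization, stop-word removal)
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        # Stem every token that the parent's analyzer yields
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vect_engl_stem = StemmedCountVectorizer(min_df=1, stop_words='english')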

This performs the following steps for each post:

  1. Lowercase the raw post in the preprocessing step (done in the parent class).
  2. Extract all individual words in the tokenization step (done in the parent class).
  3. Convert each word into its stemmed version (done in our build_analyzer).

As a result, we now have one fewer feature, because images and imaging have collapsed into one:

['actual', 'capabl', 'contain', 'data', 'databas', 'imag', 'interest', 'learn', 'machin', 'perman', 'post', 'provid', 'save', 'storag', 'store', 'stuff', 'toy']

Running our new stemmed vectorizer over our posts, we see that collapsing imaging and images reveals that Post 2 is actually the most similar post to our new post, as it contains the concept image twice:

    === Post 0 with dist=1.41:
        'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
    === Post 1 with dist=0.86:
        'Imaging databases provide storage capabilities.'
    === Post 2 with dist=0.63:
        'Most imaging databases save images permanently.'
    === Post 3 with dist=0.77:
        'Imaging databases store data.'
    === Post 4 with dist=0.77:
        'Imaging databases store data. Imaging databases store data. Imaging databases store data.'
    
    ==> Best post is 2 with dist=0.63 