Exercises

  • Do you think all of the top 500 word tokens contain valuable information? If not, can you supply an additional list of stop words to filter out the uninformative ones?
  • Can you use stemming instead of lemmatization to process the newsgroups data?
  • Can you increase max_features in CountVectorizer from 500 to 5000 and see how the t-SNE visualization is affected?
  • Try visualizing documents from six topics (similar or dissimilar) and tweak parameters so that the formed clusters look reasonable.
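
As a possible starting point for the stop-word and max_features exercises, here is a minimal sketch. It assumes scikit-learn is installed, and it uses a three-sentence toy corpus in place of the newsgroups data. The extra stop words and the helper fit_vocabulary are hypothetical names chosen only for illustration; which tokens count as uninformative in the real data is for you to decide.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Toy corpus standing in for the newsgroups posts (made-up data).
docs = [
    "The subject line says the rocket launch was delayed again",
    "Subject: the new graphics card renders images faster",
    "The lines of code in the article render the orbit of the rocket",
]

# Hypothetical extra stop words: frequent tokens we judge uninformative.
extra_stop_words = {"subject", "line", "lines", "says"}

# Combine the built-in English list with our additions.
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))


def fit_vocabulary(max_features):
    """Fit a CountVectorizer and return the set of tokens it keeps."""
    vectorizer = CountVectorizer(stop_words=stop_words,
                                 max_features=max_features)
    vectorizer.fit(docs)
    return set(vectorizer.vocabulary_)


# A small cap keeps only the most frequent tokens; a large cap keeps all.
small_vocab = fit_vocabulary(5)
large_vocab = fit_vocabulary(5000)

print(sorted(small_vocab))
print(len(large_vocab))
```

On the real dataset, you would replace docs with the cleaned newsgroups posts and pass the resulting count matrix to t-SNE as before, then compare the plots for the two max_features settings.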