Clustering – Finding Related Posts

Until now, we have always considered training as learning a function that maps some data to some labels. For the tasks in this chapter, we may not possess labels that we can use to learn a classification model. This could be, for example, because they were too expensive to collect. Just imagine the cost if the only way to obtain millions of labels was to ask humans to annotate the data manually. What could we do in that case?

We can look for patterns within the data itself. This is what we will do in this chapter, where we again consider the challenge of a question and answer website. When a user is browsing our site, perhaps because they are searching for specific information, the search engine will most likely point them to a specific answer. If that answer is not what they are looking for, the website should at least present related answers so that they can quickly see what else is available and hopefully stay on our site.

The naïve approach would be to simply take the post, calculate its similarity to all other posts, and display the top n most similar posts as links on the page. This would quickly become very costly, because every page view requires a comparison against every other post, so the work grows with the size of the site. Instead, we need a method that quickly finds all the related posts.
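To make the cost of the naïve approach concrete, here is a minimal sketch of it. It is not the method we will end up with; the word-count representation, the cosine similarity measure, and the names bag_of_words, cosine_similarity, and top_n_related are all assumptions made for illustration only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a post into word counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_related(query, posts, n=2):
    """Compare the query against every post and return the n best indices."""
    query_vec = bag_of_words(query)
    scored = [
        (cosine_similarity(query_vec, bag_of_words(post)), index)
        for index, post in enumerate(posts)
    ]
    # Highest similarity first; ties are broken by the original post order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:n]]


# Toy posts, invented for this example.
posts = [
    "how to format a hard disk",
    "hard disk format problems",
    "best pasta recipes for dinner",
]
related = top_n_related("format hard disk", posts, n=2)
```

Note that top_n_related touches every post for every query. With three posts this is harmless, but the same loop over millions of posts is exactly the cost we want to avoid.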

We will achieve this goal in this chapter by clustering features we have extracted from text. Clustering is a method of arranging items so that similar items are in one cluster and dissimilar items are in distinct ones. The tricky thing that we must tackle first is how to turn text into something from which we can calculate similarity. With such a similarity measurement, we will then investigate how to use it to quickly arrive at a cluster that contains similar posts. Once there, we only have to check the documents that also belong to that cluster. To achieve this, we will introduce you to the marvelous scikit-learn library, which comes with diverse machine learning methods that we will also use in the following chapters.
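As a rough preview of where this is heading, the following sketch clusters a handful of posts and then restricts the search for related posts to a single cluster. It assumes scikit-learn is installed and, purely for illustration, picks TF-IDF features and k-means with two clusters; the toy posts and these choices are assumptions, not necessarily the setup developed later in the chapter.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy posts, invented for this example: two about disks, two about pasta.
posts = [
    "format hard disk partition",
    "hard disk partition errors",
    "pasta sauce recipe",
    "tomato pasta sauce",
]

# Step 1: turn text into numeric feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)

# Step 2: group similar vectors into clusters (2 is an assumed cluster count).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)

# Step 3: for a new post, find its cluster and keep only that cluster's posts.
new_post = "disk partition errors"
new_label = km.predict(vectorizer.transform([new_post]))[0]
candidates = [i for i, label in enumerate(km.labels_) if label == new_label]
```

Only the posts listed in candidates need a detailed similarity check, which is the saving over comparing the new post against the whole collection.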
