How to do it

More robust than edit distance is the so-called bag-of-words approach. It ignores the order of words and uses only word counts as its basis. For each word in a post, its occurrences are counted and recorded in a vector. Not surprisingly, this step is also called vectorization. The vector is typically huge, as it contains one element for every distinct word that occurs in the whole dataset. The two example posts mentioned previously would then have the following word counts:

Word	Occurrences in post 1	Occurrences in post 2
disk	1	1
format	1	1
how	1	0
hard	1	1
my	1	0
problems	0	1
to	1	0
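The counting step behind the table can be sketched in a few lines of plain Python. This is a minimal illustration, not the vectorizer used later on: the wording of the two posts is an assumption reconstructed from the counts in the table, and `tokenize` and `vectorize` are hypothetical helper names.

```python
from collections import Counter

# Assumed wording of the two example posts, reconstructed from the table's counts.
posts = [
    "How to format my hard disk",
    "Hard disk format problems",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# The vocabulary holds every distinct word in the whole dataset, in a fixed order.
vocabulary = sorted({word for post in posts for word in tokenize(post)})

def vectorize(text, vocabulary):
    """Return one count per vocabulary word; word order in the text is ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [vectorize(post, vocabulary) for post in posts]

for word, in_post_1, in_post_2 in zip(vocabulary, *vectors):
    print(word, in_post_1, in_post_2)
```

Each post ends up with a vector of the same length as the vocabulary, which is why these vectors grow so large on a real dataset.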

The columns "Occurrences in post 1" and "Occurrences in post 2" can now be treated as vectors. We could simply calculate the Euclidean distance between the vectors of all posts and take the nearest one, but as we found out earlier, that is too slow. Instead, we can use them later as our feature vectors in the clustering steps, according to the following procedure:

Extract salient features from each post and store them as a vector per post
Cluster the vectors
Determine the cluster for the post in question
From this cluster, fetch a handful of posts with varying similarity to the post in question. This increases diversity
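The four steps above can be sketched end to end with a tiny, dependency-free k-means. Everything here is an illustrative assumption rather than the implementation used later: the toy count vectors are made up, the helper names `distance`, `nearest`, `kmeans`, and `related_posts` are hypothetical, and the centroids are seeded with the first `k` vectors only to keep the result deterministic.

```python
import math

def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(vector, centroids):
    """Index of the centroid closest to vector."""
    return min(range(len(centroids)), key=lambda i: distance(vector, centroids[i]))

def kmeans(vectors, k, iterations=10):
    """Tiny k-means; seeds the centroids with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(v, centroids) for v in vectors]
    return centroids, labels

def related_posts(new_vector, vectors, centroids, labels, how_many=3):
    """Indices of posts from the new post's cluster, spread over the similarity range."""
    cluster = nearest(new_vector, centroids)  # step 3: find the cluster
    members = sorted(
        (i for i, label in enumerate(labels) if label == cluster),
        key=lambda i: distance(new_vector, vectors[i]),
    )
    if how_many < 2 or len(members) <= how_many:
        return members[:how_many]
    # Step 4: pick evenly spaced ranks, from most to least similar, for diversity.
    step = (len(members) - 1) / (how_many - 1)
    return [members[round(j * step)] for j in range(how_many)]

# Step 1: made-up count vectors, one per post (four-word vocabulary).
vectors = [
    [1, 1, 0, 0], [0, 0, 1, 1], [2, 1, 0, 0], [0, 0, 2, 1],
    [1, 2, 0, 0], [3, 3, 0, 0], [0, 0, 1, 2],
]

# Step 2: cluster the vectors.
centroids, labels = kmeans(vectors, k=2)

# Steps 3 and 4: look up the cluster of a new post and fetch related posts.
new_vector = [2, 0, 0, 0]
print(related_posts(new_vector, vectors, centroids, labels, how_many=3))
```

Only the posts in one cluster are compared against the new post, which is what avoids the cost of measuring the distance to every post in the dataset.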

There is more work to do before we get there, though, and first we need some data to work on.
