How to do it

More robust than edit distance is the so-called bag-of-words approach. It ignores the order of words and uses only word counts as its basis. For each word in a post, its occurrences are counted and recorded in a vector. Not surprisingly, this step is also called vectorization. The vector is typically huge, as it contains one element for every distinct word that occurs in the whole dataset. The two example posts mentioned previously would then have the following word counts:

Word        Occurrences in post 1    Occurrences in post 2
disk        1                        1
format      1                        1
how         1                        0
hard        1                        1
my          1                        0
problems    0                        1
to          1                        0
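
As a minimal sketch of the vectorization step, the counts in the table can be reproduced with the standard library alone. The two post texts are an assumption reconstructed from the word counts above, and `tokenize` and `vectorize` are hypothetical helper names used only for illustration. A real dataset would need a more careful tokenizer than splitting on whitespace.

```python
from collections import Counter

# Assumed example posts, reconstructed from the word counts in the table.
posts = [
    "How to format my hard disk",
    "Hard disk format problems",
]


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberate simplification)."""
    return text.lower().split()


# The vocabulary holds every distinct word in the whole dataset, in a fixed order.
vocabulary = sorted({word for post in posts for word in tokenize(post)})


def vectorize(text, vocabulary):
    """Return the word counts of text, one element per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [vectorize(post, vocabulary) for post in posts]

# Print one row per word: the word, then its count in each post.
for word, *occurrences in zip(vocabulary, *vectors):
    print(word, occurrences)
```

Because the vocabulary is sorted alphabetically here, the rows appear in a different order than in the table, but the counts per word are the same. Every post is mapped onto the same vocabulary, so all vectors have the same length and can be compared element by element.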

 

The columns Occurrences in post 1 and Occurrences in post 2 can now be treated as vectors. We could simply calculate the Euclidean distance between the vector of a new post and the vectors of all other posts and take the nearest one, but, as we found out earlier, that is too slow. Instead, we will use them later as feature vectors in the clustering steps, according to the following procedure:

  1. Extract salient features from each post and store them as a vector per post
  2. Cluster the vectors
  3. Determine the cluster for the post in question
  4. From this cluster, fetch a handful of posts with varying similarity to the post in question. This will increase diversity
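
Before any clustering is involved, the brute-force comparison mentioned earlier (the Euclidean distance between count vectors) can be sketched as follows. The vectors are copied from the word-count table. `euclidean_distance` and `nearest_post` are hypothetical helper names, not part of any library, and this is only an illustration of the idea rather than the approach the procedure above ends up using.

```python
import math

# Count vectors for the two example posts, in the word order of the table:
# disk, format, how, hard, my, problems, to
post_1 = [1, 1, 1, 1, 1, 0, 1]
post_2 = [1, 1, 0, 1, 0, 1, 0]


def euclidean_distance(v1, v2):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def nearest_post(query, candidates):
    """Return the index of the candidate vector closest to the query vector."""
    return min(
        range(len(candidates)),
        key=lambda i: euclidean_distance(query, candidates[i]),
    )


# The posts differ in four words (how, my, problems, to), each by a count of 1.
print(euclidean_distance(post_1, post_2))
```

`nearest_post` has to compute one distance per stored post for every query, which is why this direct comparison becomes too slow on a large dataset and why the posts are clustered first.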

But there is more work to be done before we get there, and before we can do that work, we need some data to work on.
