Strengths and weaknesses

K-means is a simple algorithm, both to understand and to implement. Furthermore, it usually converges relatively quickly and requires little computing power. Nonetheless, it has some disadvantages. The first is its sensitivity to initial conditions. Depending on the examples chosen as the first cluster centers, it can require more iterations to converge. For example, the following diagram presents three initial points that put the algorithm at a disadvantage. In fact, in the third iteration, two cluster centers happen to coincide:

An example of unfortunate initial cluster centers
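
The effect of the starting centers on the number of iterations can also be sketched in code. The following is a minimal, pure-Python illustration and not a reference implementation: the toy dataset, the names `kmeans`, `spread_init`, and `crowded_init`, and the rule that an empty cluster keeps its previous center are all assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means; returns (centers, labels, passes).

    `passes` counts every assignment/update pass, including the final
    one that only confirms the centers no longer move.
    """
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for passes in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest center
        # (ties go to the center with the lowest index).
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(old_center)
        if new_centers == centers:
            return centers, labels, passes
        centers = new_centers
    return centers, labels, max_iter


# Toy data (hypothetical): three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]

spread_init = [(0, 0), (10, 0), (5, 10)]   # one starting center per group
crowded_init = [(0, 0), (0, 1), (1, 0)]    # all starting centers in one group

spread_centers, _, spread_passes = kmeans(points, spread_init)
crowded_centers, _, crowded_passes = kmeans(points, crowded_init)

print("spread start: ", spread_passes, "passes")
print("crowded start:", crowded_passes, "passes")
```

On this toy data the spread-out start settles in 2 passes and the crowded start needs 3. Both reach the same centers here, which need not happen on data that is less cleanly separated.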

Thus, the algorithm does not produce clusters deterministically: when the initial centers are chosen at random, separate runs on the same data can follow different paths and end in different clusters.

Another major problem is the number of clusters. This is a parameter that the data analyst must choose. There are usually three different solutions to this problem. The first applies when some prior knowledge about the problem domain exists. Examples are datasets where the goal is to uncover the structure of something already known, such as the driving factor behind athletes who improve their performance during a season, given their statistics. In this example, a sports coach could advise that athletes either improve drastically, stay the same, or deteriorate. Thus, the analyst could choose 3 as the number of clusters. Another possible solution is to experiment with different values of K and measure the appropriateness of each value. This approach does not require any prior knowledge about the problem domain, but it introduces the problem of measuring the appropriateness of each solution. We will see how we can solve these problems in the rest of this chapter.
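
Experimenting with different values of K can be sketched as a simple loop. The text does not name a measure of appropriateness at this point, so the sketch below assumes one common choice, the within-cluster sum of squared distances (lower means tighter clusters). The toy dataset and the helper `farthest_first_centers`, a deterministic way of picking starting centers that keeps the output repeatable, are likewise hypothetical and not part of K-means as described above.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k):
    """Hypothetical deterministic start: begin at the first point, then keep
    adding the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means; returns (centers, labels)."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members
        # (assumption: an empty cluster keeps its previous center).
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else old_center
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def within_cluster_ss(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))


# Toy data (hypothetical): three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]

scores = {}
for k in range(1, 6):
    centers, labels = kmeans(points, farthest_first_centers(points, k))
    scores[k] = within_cluster_ss(points, centers, labels)
    print(k, round(scores[k], 2))
```

On this toy data the score falls sharply up to K = 3 and only marginally afterwards, which matches the three visible groups. Real data rarely gives such a clean signal, and this score alone always favors larger K, so it is best read as a trend rather than minimized outright.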
