k-nearest neighbors

The k-nearest neighbor (k-NN) classification is one of the easiest classification methods to understand, particularly when there is little or no prior knowledge about the distribution of the data. It stores all the known cases and classifies new cases based on a similarity measure (for example, the Euclidean distance function). Because of this simplicity, the k-NN algorithm is popular in statistical estimation and pattern recognition.

For 1-nearest neighbor (1-NN), the label of a test point is simply the label of its nearest training point. When you extend this to a higher value of k, the label of a test point is decided by a majority vote among its k nearest training points. The k-NN algorithm is considered a lazy learning algorithm because the optimization is done locally and the computations are delayed until classification.
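
The following is a minimal sketch of this voting rule in plain Python. The helper name knn_classify, the point format, the toy training data, and the use of the Euclidean distance are assumptions made only for illustration:

from collections import Counter
from math import sqrt

def knn_classify(test_point, training_points, labels, k=3):
  # Distance from the test point to every training point, paired with its label
  distances = [(sqrt(sum((a - b) ** 2 for a, b in zip(test_point, p))), lbl)
               for p, lbl in zip(training_points, labels)]
  # Keep the labels of the k closest points and take a majority vote
  nearest_labels = [lbl for _, lbl in sorted(distances)[:k]]
  return Counter(nearest_labels).most_common(1)[0][0]

# Toy example: each point is (shape category, weight in ounces)
training_points = [(1.5, 6.5), (2.0, 6.8), (3.0, 5.5), (4.5, 8.0), (4.0, 7.5)]
labels = ['apple', 'apple', 'pear', 'banana', 'banana']
print(knn_classify((2.8, 5.8), training_points, labels, k=3))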

There are advantages and disadvantages to this method. The advantages are high accuracy, insensitivity to outliers, and no assumptions about the data. The disadvantages are that k-NN is computationally expensive and requires a lot of memory.

Any of several distance metrics could be used, such as the Euclidean distance, the Manhattan (city block) distance, or the more general Minkowski distance.
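
As a brief illustration (these helper functions and their names are not from the original text), the three metrics can be written in Python as follows:

from math import sqrt

def euclidean_distance(p, q):
  # Straight-line distance: square root of the summed squared differences
  return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
  # City block distance: sum of absolute differences
  return sum(abs(a - b) for a, b in zip(p, q))

def minkowski_distance(p, q, r=3):
  # Generalization: r=1 gives Manhattan, r=2 gives Euclidean
  return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

print(euclidean_distance((1, 2), (4, 6)))   # prints 5.0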

Let's consider an example where we are given a big basket of fruits containing only apples, bananas, and pears. We will assume that the apples are red, not green. One characteristic that distinguishes these fruits from one another is color: apples are red, bananas are yellow, and pears are green. The fruits can also be characterized by their weight and shape. The following assumptions are made for the purpose of illustrating this example:

The shape and weight characteristics are categorized as follows:

  • For an apple, the shape value lies between 1 and 3, whereas the weight lies between 6 and 7 ounces
  • For a pear, the shape value lies between 2 and 4, whereas the weight lies between 5 and 6 ounces
  • For a banana, the shape value lies between 3 and 5, whereas the weight lies between 7 and 9 ounces

The data about the fruits in the basket is stored in a CSV file (fruits_data.csv in the following code), where each row holds a fruit's shape category, its weight in ounces, and its name (Apple, Pear, or Banana).
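
Since the contents of fruits_data.csv are not reproduced here, the following is a hypothetical sketch that generates a comparable file from the shape and weight ranges listed above. The output path, the number of rows per fruit, and the random seed are assumptions for illustration only, not the actual data:

import csv
import random

random.seed(0)  # assumed seed, only so the sketch is repeatable

# (shape range, weight range) per fruit, taken from the assumptions above
ranges = {
  'Apple':  ((1, 3), (6, 7)),
  'Pear':   ((2, 4), (5, 6)),
  'Banana': ((3, 5), (7, 9)),
}

with open('fruits_data.csv', 'w', newline='') as csvf:
  writer = csv.writer(csvf)
  writer.writerow(['shape', 'weight', 'fruit'])   # header row (skipped later by count > 0)
  for fruit, ((s_lo, s_hi), (w_lo, w_hi)) in ranges.items():
    for _ in range(15):                           # 15 rows per fruit, purely illustrative
      writer.writerow([round(random.uniform(s_lo, s_hi), 2),
                       round(random.uniform(w_lo, w_hi), 2),
                       fruit])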

If we have an unlabeled fruit with a known weight and shape category, then applying the k-nearest neighbor method (with any distance metric) will find its k nearest neighbors; if those neighbors are mostly green, red, or yellow, the unlabeled fruit is most likely a pear, an apple, or a banana, respectively. The following code demonstrates the k-nearest neighbor algorithm using the shape and weight of fruits:

import csv
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

count = 0
x = []   # shape category numbers
y = []   # weights in ounces
z = []   # plot color for each fruit ('r' = apple, 'g' = pear, 'y' = banana)

with open('/Users/myhome/fruits_data.csv', 'r') as csvf:
  reader = csv.reader(csvf, delimiter=',')
  for row in reader:
    if count > 0:                       # skip the header row
      x.append(float(row[0]))
      y.append(float(row[1]))
      if row[2] == 'Apple':   z.append('r')
      elif row[2] == 'Pear':  z.append('g')
      else:                   z.append('y')
    count += 1

plt.figure(figsize=(11, 11))

recs = []
classes = ['Apples', 'Pears', 'Bananas']
class_colours = ['r', 'g', 'y']
plt.title("Apples, Bananas and Pears by Weight and Shape", fontsize=18)

plt.xlabel("Shape category number", fontsize=14)
plt.ylabel("Weight in ounces", fontsize=14)

plt.scatter(x, y, s=600, c=z)

# Build a legend that maps each color back to its fruit type
for i in range(len(class_colours)):
  recs.append(mpatches.Rectangle((0, 0), 1, 1, fc=class_colours[i]))
plt.legend(recs, classes, loc=4)

plt.show()
The preceding code produces a scatter plot of apples, bananas, and pears by weight and shape, with each fruit type shown in its own color.

Let's pick four unlabeled fruits with their x (shape) and y (weight) values as A(3.5, 6.2), B(2.75, 6.2), C(2.9, 7.6), and D(2.4, 7.2), and determine their labels with the following code:

from math import sqrt

def determineFruit(xv, yv, threshold_radius):
  # Distance from the unlabeled point (xv, yv) to every known fruit;
  # 99 acts as a sentinel value meaning "outside the threshold radius"
  dist = []
  for i in range(len(x)):
    xdif = (x[i] - xv) ** 2
    ydif = (y[i] - yv) ** 2
    sqrtdist = sqrt(xdif + ydif)
    if sqrtdist < threshold_radius:
      dist.append(sqrtdist)
    else:
      dist.append(99)
  # Count the neighbors of each fruit type that fall within the radius
  pear_count = 0
  apple_count = 0
  banana_count = 0
  for i in range(len(dist)):
    if dist[i] < threshold_radius:
      if z[i] == 'g': pear_count += 1
      if z[i] == 'r': apple_count += 1
      if z[i] == 'y': banana_count += 1
  # The most frequent fruit type among the neighbors wins the vote
  if apple_count >= pear_count and apple_count >= banana_count:
    return "apple"
  elif pear_count >= apple_count and pear_count >= banana_count:
    return "pear"
  else:
    return "banana"

determine = determineFruit(3.5, 6.2, 1)
print(determine)

pear
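
The remaining unlabeled fruits B, C, and D can be classified in the same way; here is a brief sketch (the predicted labels depend entirely on the contents of fruits_data.csv, so no specific output is shown here):

# Classify the remaining unlabeled fruits with the same threshold radius
points = {'B': (2.75, 6.2), 'C': (2.9, 7.6), 'D': (2.4, 7.2)}
for name, (xv, yv) in points.items():
  print(name, determineFruit(xv, yv, 1))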