Data understanding and preparation

Let's start by loading the R packages that we will need for this chapter. As always, make sure that you have installed them first:

> library(cluster) #conduct cluster analysis
> library(compareGroups) #build descriptive statistic tables
> library(HDclassif) #contains the dataset
> library(NbClust) #cluster validity measures
> library(sparcl) #colored dendrogram
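Before calling library(), it can help to check which of these packages are actually present. The following is a minimal sketch using only base R; the variable names pkgs and missing_pkgs are illustrative, and the install call is left commented out because it assumes the packages are available from your configured repository:

```r
# Packages used in this chapter
pkgs <- c("cluster", "compareGroups", "HDclassif", "NbClust", "sparcl")

# Compare against the packages found in the current library paths
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install (assumes repository access)
} else {
  message("All packages for this chapter are installed.")
}
```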

The dataset is in the HDclassif package, which we just loaded. So, we can load the data and examine its structure with the str() function:

> data(wine)

> str(wine)
'data.frame':	178 obs. of  14 variables:
 $ class: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V1   : num  14.2 13.2 13.2 14.4 13.2 ...
 $ V2   : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ V3   : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ V4   : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ V5   : int  127 100 101 113 118 112 96 121 97 98 ...
 $ V6   : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ V7   : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ V8   : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ V9   : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ V10  : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ V11  : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ V12  : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ V13  : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The data consists of 178 wines with 13 variables describing their chemical composition and one variable, class, which labels the cultivar or plant variety. We won't use this label in the clustering; instead, we will hold it back as a test of model performance. The variables V1 through V13 are the measures of the chemical composition, as follows:

  • V1: alcohol
  • V2: malic acid
  • V3: ash
  • V4: alkalinity of ash
  • V5: magnesium
  • V6: total phenols
  • V7: flavonoids
  • V8: non-flavonoid phenols
  • V9: proanthocyanins
  • V10: color intensity
  • V11: hue
  • V12: OD280/OD315
  • V13: proline

The variables are all quantitative. We should rename them to something meaningful for our analysis. This is easily done with the names() function:

> names(wine) = c("Class", "Alcohol", "MalicAcid", "Ash", "Alk_ash", "magnesium", "T_phenols", "Flavanoids", "Non_flav", "Proantho", "C_Intensity", "Hue", "OD280_315", "Proline")

> names(wine)
 [1] "Class"       "Alcohol"     "MalicAcid"   "Ash"        
 [5] "Alk_ash"     "magnesium"   "T_phenols"   "Flavanoids" 
 [9] "Non_flav"    "Proantho"    "C_Intensity" "Hue"        
[13] "OD280_315"   "Proline"    

As the variables are not on a common scale, we will standardize them with the scale() function. This first centers the data by subtracting the column mean from each value in the column. Then, the centered values are divided by the corresponding column's standard deviation. In the same step, we can drop Class by keeping only columns 2 through 14, and convert the result, which scale() returns as a matrix, back to a data frame. This can all be done with one line of code:

> df = as.data.frame(scale(wine[,-1]))

Now, check the structure to make sure that it all worked according to plan:

> str(df)
'data.frame':	178 obs. of  13 variables:
 $ Alcohol    : num  1.514 0.246 0.196 1.687 0.295 ...
 $ MalicAcid  : num  -0.5607 -0.498 0.0212 -0.3458 0.2271 ...
 $ Ash        : num  0.231 -0.826 1.106 0.487 1.835 ...
 $ Alk_ash    : num  -1.166 -2.484 -0.268 -0.807 0.451 ...
 $ magnesium  : num  1.9085 0.0181 0.0881 0.9283 1.2784 ...
 $ T_phenols  : num  0.807 0.567 0.807 2.484 0.807 ...
 $ Flavanoids : num  1.032 0.732 1.212 1.462 0.661 ...
 $ Non_flav   : num  -0.658 -0.818 -0.497 -0.979 0.226 ...
 $ Proantho   : num  1.221 -0.543 2.13 1.029 0.4 ...
 $ C_Intensity: num  0.251 -0.292 0.268 1.183 -0.318 ...
 $ Hue        : num  0.361 0.405 0.317 -0.426 0.361 ...
 $ OD280_315  : num  1.843 1.11 0.786 1.181 0.448 ...
 $ Proline    : num  1.0102 0.9625 1.3912 2.328 -0.0378 ...
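To see what the centering and scaling described above actually does, here is a small sketch on a toy data frame. The values in toy are made up for illustration and are not taken from the wine data. It compares scale() with the manual calculation and confirms that each standardized column has a mean of zero and a standard deviation of one:

```r
# Toy data with two columns on very different scales (hypothetical values)
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 60))

# scale() centers each column and divides by its standard deviation
scaled <- scale(toy)

# The same transformation done by hand, column by column
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# Both approaches should agree up to floating-point tolerance
all.equal(as.numeric(scaled), as.numeric(manual))

# After standardizing, column means are (numerically) 0 and sds are 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

The same two checks, colMeans() and apply(..., 2, sd), can be run on df as an additional sanity check.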

Before moving on, let's build a quick table to see how the wines are distributed across the cultivars, that is, Class:

> table(wine$Class)

 1  2  3 
59 71 48
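The counts are easier to compare as proportions. As a small sketch, the vector below simply re-enters the three counts from the table above so that the snippet runs on its own; with the data loaded, prop.table(table(wine$Class)) should give the same result:

```r
# Class counts copied from the table above
counts <- c("1" = 59, "2" = 71, "3" = 48)

# Total number of wines; should match the 178 observations
total <- sum(counts)

# Share of each cultivar, rounded to three decimals
props <- prop.table(counts)
round(props, 3)
```

The classes are not perfectly balanced, but none of the three cultivars dominates the sample.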

We can now move on to the modeling step of the process.
