Let's start with loading the R packages that we will need for this chapter. As always, make sure that you have installed them first:
> library(cluster) #conduct cluster analysis
> library(compareGroups) #build descriptive statistic tables
> library(HDclassif) #contains the dataset
> library(NbClust) #cluster validity measures
> library(sparcl) #colored dendrogram
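If you are unsure whether all five packages are installed, a quick check along the following lines can help. This is only a sketch: the names pkgs and not_installed are ours, and it reports the missing packages rather than installing them, leaving that step to you:

```r
# Packages used in this chapter
pkgs <- c("cluster", "compareGroups", "HDclassif", "NbClust", "sparcl")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!is_installed]

# Report what still needs install.packages()
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
} else {
  message("All packages are available")
}
```

Anything listed can then be passed to install.packages() before calling library().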
The dataset is in the HDclassif package, which we installed. So, we can load the data and examine the structure with the str() function:
> data(wine)
> str(wine)
'data.frame':178 obs. of 14 variables:
 $ class: int 1 1 1 1 1 1 1 1 1 1 ...
 $ V1 : num 14.2 13.2 13.2 14.4 13.2 ...
 $ V2 : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ V3 : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ V4 : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ V5 : int 127 100 101 113 118 112 96 121 97 98 ...
 $ V6 : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ V7 : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ V8 : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ V9 : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ V10 : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ V11 : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ V12 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ V13 : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
The data consists of 178 wines with 13 variables describing the chemical composition and one variable, class, which labels the cultivar or plant variety. We won't use this label in the clustering; instead, we will hold it back as a test of model performance. The variables V1 through V13 are the measures of the chemical composition, as follows:
V1: alcohol
V2: malic acid
V3: ash
V4: alkalinity of ash
V5: magnesium
V6: total phenols
V7: flavonoids
V8: non-flavonoid phenols
V9: proanthocyanins
V10: color intensity
V11: hue
V12: OD280/OD315
V13: proline

The variables are all quantitative. We should rename them to something meaningful for our analysis. This is easily done with the names() function:
> names(wine) = c("Class", "Alcohol", "MalicAcid", "Ash", "Alk_ash", "magnesium", "T_phenols", "Flavanoids", "Non_flav", "Proantho", "C_Intensity", "Hue", "OD280_315", "Proline")
> names(wine)
 [1] "Class"       "Alcohol"     "MalicAcid"   "Ash"
 [5] "Alk_ash"     "magnesium"   "T_phenols"   "Flavanoids"
 [9] "Non_flav"    "Proantho"    "C_Intensity" "Hue"
[13] "OD280_315"   "Proline"
As the variables are not scaled, we will need to do this using the scale() function. This first centers the data by subtracting the column mean from each observation in the column, and then divides the centered values by the corresponding column's standard deviation. In the same step, we can drop the first column, Class, so that only the 13 chemical measures (columns 2 through 14) are kept, and put the result in a data frame. This can all be done with one line of code:
> df = as.data.frame(scale(wine[,-1]))
Now, check the structure to make sure that it all worked according to plan:
> str(df)
'data.frame':178 obs. of 13 variables:
 $ Alcohol : num 1.514 0.246 0.196 1.687 0.295 ...
 $ MalicAcid : num -0.5607 -0.498 0.0212 -0.3458 0.2271 ...
 $ Ash : num 0.231 -0.826 1.106 0.487 1.835 ...
 $ Alk_ash : num -1.166 -2.484 -0.268 -0.807 0.451 ...
 $ magnesium : num 1.9085 0.0181 0.0881 0.9283 1.2784 ...
 $ T_phenols : num 0.807 0.567 0.807 2.484 0.807 ...
 $ Flavanoids : num 1.032 0.732 1.212 1.462 0.661 ...
 $ Non_flav : num -0.658 -0.818 -0.497 -0.979 0.226 ...
 $ Proantho : num 1.221 -0.543 2.13 1.029 0.4 ...
 $ C_Intensity: num 0.251 -0.292 0.268 1.183 -0.318 ...
 $ Hue : num 0.361 0.405 0.317 -0.426 0.361 ...
 $ OD280_315 : num 1.843 1.11 0.786 1.181 0.448 ...
 $ Proline : num 1.0102 0.9625 1.3912 2.328 -0.0378 ...
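To see what scale() is doing under the hood, the centering and scaling can be reproduced by hand. The following is a minimal sketch on a small made-up data frame (toy, scaled, and manual are illustrative names, not part of the wine data), so it runs without any of the packages loaded earlier:

```r
# Toy data standing in for wine[, -1]; the values are made up for illustration
toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 40, 50))

# Built-in standardization, as used on the wine data
scaled <- as.data.frame(scale(toy))

# The same computation by hand: subtract the column mean, divide by the column sd
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Largest absolute difference between the two results (should be essentially zero)
max(abs(as.matrix(scaled) - as.matrix(manual)))

# After scaling, each column should have mean 0 and standard deviation 1
colMeans(scaled)
apply(scaled, 2, sd)
```

The same two checks, colMeans() and apply(..., 2, sd), can be run on df to confirm that the wine variables were standardized as intended.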
Before moving on, let's do a quick table to see the distribution of the cultivars, or Class:
> table(wine$Class)

 1  2  3
59 71 48
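It can be easier to judge the balance of the cultivars as proportions rather than counts. As a small sketch, the counts from the table can be typed in directly (class_counts and class_props are our own names) and passed to prop.table(); with the data loaded, prop.table(table(wine$Class)) should give the same figures:

```r
# Class counts copied from the table(wine$Class) output
class_counts <- c("1" = 59, "2" = 71, "3" = 48)

# Sanity check: the counts should add up to the 178 wines in the data
sum(class_counts)

# Share of each cultivar
class_props <- prop.table(class_counts)
round(class_props, 3)
```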
We can now move on to the modeling step of the process.