Data understanding and preparation

Let's start by loading the R packages that we will need for this chapter. As always, make sure that you have installed them first:

> library(cluster) #conduct cluster analysis
> library(compareGroups) #build descriptive statistic tables
> library(HDclassif) #contains the dataset
> library(NbClust) #cluster validity measures
> library(sparcl) #colored dendrogram
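
Before calling library(), it can help to see which of these packages are already installed. The following is a minimal sketch that assumes only base R; the variable names pkgs, installed, and missing_pkgs are illustrative and not part of the original text, and the install line is left commented out because it needs network access:

```r
# Packages used in this chapter (names taken from the library() calls above)
pkgs <- c("cluster", "compareGroups", "HDclassif", "NbClust", "sparcl")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # one possible way to install them
} else {
  message("All packages are available.")
}
```

If the message lists any packages, install those before running the library() calls.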

The dataset ships with the HDclassif package, which we have just loaded, so we can load the data with data() and examine its structure with the str() function:

> data(wine)

> str(wine)
'data.frame': 178 obs. of  14 variables:
 $ class: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V1   : num  14.2 13.2 13.2 14.4 13.2 ...
 $ V2   : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ V3   : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ V4   : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ V5   : int  127 100 101 113 118 112 96 121 97 98 ...
 $ V6   : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ V7   : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ V8   : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ V9   : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ V10  : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ V11  : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ V12  : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ V13  : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The data consists of 178 wines, each described by 13 chemical-composition variables plus one variable, class, that labels the cultivar (plant variety). We won't use the label in the clustering itself; instead, we hold it back as a test of model performance. The variables V1 through V13 measure the chemical composition as follows:

V1: alcohol
V2: malic acid
V3: ash
V4: alkalinity of ash
V5: magnesium
V6: total phenols
V7: flavonoids
V8: non-flavonoid phenols
V9: proanthocyanins
V10: color intensity
V11: hue
V12: OD280/OD315
V13: proline

The variables are all quantitative. We should rename them to something meaningful for our analysis. This is easily done with the names() function:

> names(wine) = c("Class", "Alcohol", "MalicAcid", "Ash", "Alk_ash", "magnesium", "T_phenols", "Flavanoids", "Non_flav", "Proantho", "C_Intensity", "Hue", "OD280_315", "Proline")

> names(wine)
 [1] "Class"       "Alcohol"     "MalicAcid"   "Ash"        
 [5] "Alk_ash"     "magnesium"   "T_phenols"   "Flavanoids" 
 [9] "Non_flav"    "Proantho"    "C_Intensity" "Hue"        
[13] "OD280_315"   "Proline"
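
Assigning a vector of the wrong length to names() is easy to miss, because a vector that is too short leaves the remaining columns with NA names rather than raising an error. A small sketch of a length check, using a made-up three-column data frame (toy and new_names are hypothetical stand-ins, not objects from the chapter):

```r
# Toy stand-in with the same kind of generic column names as the wine data
toy <- data.frame(class = c(1, 2), V1 = c(14.2, 13.2), V2 = c(1.71, 1.78))

new_names <- c("Class", "Alcohol", "MalicAcid")

# Guard against a silent mismatch before overwriting the names
stopifnot(length(new_names) == ncol(toy))
names(toy) <- new_names

names(toy)
```

The same stopifnot() check could be placed before the names(wine) assignment, comparing against ncol(wine).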

The variables are measured on very different scales, so we will standardize them with the scale() function. It first centers the data by subtracting each column's mean from every value in that column, and then divides the centered values by the corresponding column's standard deviation. In the same step, we can keep only columns 2 through 14, which drops Class, and convert the result back to a data frame, because scale() returns a matrix. This can all be done with one line of code:

> df = as.data.frame(scale(wine[,-1]))
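
To make the two steps concrete, the sketch below reproduces what scale() does by hand on a small made-up matrix and compares the result. The matrix x and the names centered, manual, and auto are illustrative assumptions; only scale(), sweep(), colMeans(), and sd() from base R are used:

```r
# Small made-up matrix; the values are illustrative only
x <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 60))

# Step 1: subtract each column's mean
centered <- sweep(x, 2, colMeans(x), "-")

# Step 2: divide each centered column by its sample standard deviation
manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# scale() performs both steps; ignore its extra attributes when comparing
auto <- scale(x)
all.equal(manual, auto, check.attributes = FALSE)
```

The comparison ignores attributes because scale() attaches the column means and standard deviations it used to its result.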

Now, check the structure to make sure that it all worked according to plan:

> str(df)
'data.frame': 178 obs. of  13 variables:
 $ Alcohol    : num  1.514 0.246 0.196 1.687 0.295 ...
 $ MalicAcid  : num  -0.5607 -0.498 0.0212 -0.3458 0.2271 ...
 $ Ash        : num  0.231 -0.826 1.106 0.487 1.835 ...
 $ Alk_ash    : num  -1.166 -2.484 -0.268 -0.807 0.451 ...
 $ magnesium  : num  1.9085 0.0181 0.0881 0.9283 1.2784 ...
 $ T_phenols  : num  0.807 0.567 0.807 2.484 0.807 ...
 $ Flavanoids : num  1.032 0.732 1.212 1.462 0.661 ...
 $ Non_flav   : num  -0.658 -0.818 -0.497 -0.979 0.226 ...
 $ Proantho   : num  1.221 -0.543 2.13 1.029 0.4 ...
 $ C_Intensity: num  0.251 -0.292 0.268 1.183 -0.318 ...
 $ Hue        : num  0.361 0.405 0.317 -0.426 0.361 ...
 $ OD280_315  : num  1.843 1.11 0.786 1.181 0.448 ...
 $ Proline    : num  1.0102 0.9625 1.3912 2.328 -0.0378 ...
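
Beyond inspecting the structure, a numeric check can confirm the standardization: every column should have a mean of zero (up to floating-point error) and a standard deviation of one. The sketch below uses the built-in iris measurements as a stand-in so that it runs without the HDclassif package; chk is a hypothetical name, and the same two calls could be applied to df:

```r
# Stand-in for df: built-in iris measurements, standardized the same way
chk <- as.data.frame(scale(iris[, 1:4]))

# Column means should be zero up to floating-point error
round(colMeans(chk), 10)

# Column standard deviations should all be 1
sapply(chk, sd)
```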

Before moving on, let's build a quick table to see how the wines are distributed across the cultivars (Class):

> table(wine$Class)

 1  2  3 
59 71 48
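
The counts are easier to compare as proportions. As a sketch, the labels can be rebuilt from the counts shown above so that the example runs without the dataset; cls and shares are illustrative names, and with the data loaded, prop.table(table(wine$Class)) would do the same job:

```r
# Rebuild the class labels from the counts shown above (59, 71, 48)
cls <- rep(1:3, times = c(59, 71, 48))

# prop.table() converts counts to shares of the total
shares <- round(prop.table(table(cls)), 3)
shares
```

The three cultivars are of broadly similar size, with the second class the largest at roughly 40 percent of the wines.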

We can now move on to the modeling step of the process.
