Principal component analysis (PCA) transforms the attributes of unlabeled data by rotating the coordinate axes to align with the directions of greatest variation. Examining the data in this rotated representation can reveal opportunities for dimensionality reduction. For instance, if a dataset looks like an ellipse tilted at an angle to the axes, then in the transformed representation the variation runs along the x axis while there is almost no variation along the y axis, and that y dimension may be safe to ignore.
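The tilted-ellipse idea can be sketched numerically: generate a correlated 2-D Gaussian cloud (an ellipse at an angle to the axes) and rotate it onto its principal axes using the eigenvectors of the covariance matrix. The synthetic data below is illustrative, not from the survey discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D Gaussian: an elliptical cloud tilted relative to the axes
x = rng.normal(0, 3, 500)
y = 0.5 * x + rng.normal(0, 0.3, 500)
data = np.column_stack([x, y])

# Rotate onto the principal axes via eigenvectors of the covariance matrix
cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
rotated = data @ eigvecs

# After rotation, nearly all the variance lies along a single axis;
# the other axis carries so little variation it could be dropped
var = rotated.var(axis=0)
print(var / var.sum())
```

Printing the per-axis share of variance shows one axis dominating, which is exactly the situation in which discarding the other dimension loses little information.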
k-means clustering is well suited to clustering unlabeled data. One can sometimes use PCA to project the data onto a much lower-dimensional space first, and then apply other methods, such as k-means, in that smaller, reduced space.
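A minimal sketch of this PCA-then-k-means pipeline, using the iris dataset (which also appears later in this section); the choice of three clusters here is an assumption matching the three iris species:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Standardize the features, project onto two principal components,
# then run k-means in the reduced two-dimensional space
X = StandardScaler().fit_transform(load_iris().data)
X_2d = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```

Clustering in the reduced space is cheaper and often less noisy than clustering in the full feature space, at the cost of whatever variation the discarded components carried.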
However, dimensionality reduction must be performed carefully, because any reduction can lose information; it is crucial that the algorithm preserve the useful part of the data while discarding the noise. Here, we will motivate PCA from at least two perspectives and explain why preserving maximal variability makes sense:
Suppose we collected data about students on a campus, with details such as gender, height, weight, TV time, sports time, study time, GPA, and so on. While surveying students along these dimensions, we found that the correlation between height and weight suggests an interesting pattern (usually, the taller the student, the greater the weight, due in part to bone mass, and vice versa). This may well not hold in a larger population (greater weight does not necessarily imply greater height). The correlation can also be visualized as follows:
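Before plotting, the strength of such a relationship can be quantified with the Pearson correlation coefficient. The sample below is synthetic (the heights, weights, and noise levels are made-up numbers, not the survey data used in the next listing):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sample: heights in inches, weights loosely tracking height
heights = rng.normal(67, 3, 200)
weights = 3.5 * heights - 110 + rng.normal(0, 12, 200)

# Pearson correlation coefficient between the two attributes
r = np.corrcoef(heights, weights)[0, 1]
print(round(r, 2))
```

A coefficient near 1 indicates a strong positive linear relationship; values near 0 would indicate that height tells us little about weight.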
import matplotlib.pyplot as plt
import csv

gender = []
x = []
y = []

with open('/Users/kvenkatr/height_weight.csv', 'r') as csvf:
    reader = csv.reader(csvf, delimiter=',')
    count = 0
    for row in reader:
        if count > 0:          # skip the header row
            if row[0] == "f":
                gender.append(0)
            else:
                gender.append(1)
            height = float(row[1])
            weight = float(row[2])
            x.append(height)
            y.append(weight)
        count += 1

plt.figure(figsize=(11, 11))
plt.scatter(y, x, c=gender, s=300)
plt.grid(True)
plt.xlabel('Weight', fontsize=18)
plt.ylabel('Height', fontsize=18)
plt.title("Height vs Weight (College Students)", fontsize=20)
plt.show()
Using sklearn again, with the preprocessing, datasets, and decomposition packages, you can write a simple visualization code as follows:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

data = load_iris()
X = data.data

# convert the first feature column from cm to inches
X[:, 0] /= 2.54
# convert the second feature column from cm to meters
X[:, 1] /= 100

def scikit_pca(X):
    # Standardize the features
    X_std = StandardScaler().fit_transform(X)

    # PCA
    sklearn_pca = PCA(n_components=2)
    X_transf = sklearn_pca.fit_transform(X_std)

    # Plot the data
    plt.figure(figsize=(11, 11))
    plt.scatter(X_transf[:, 0], X_transf[:, 1], s=600, color='#8383c4', alpha=0.56)
    plt.title('PCA via scikit-learn (using SVD)', fontsize=20)
    plt.xlabel('First Principal Component', fontsize=15)
    plt.ylabel('Second Principal Component', fontsize=15)
    plt.show()

scikit_pca(X)
This plot shows PCA using the scikit-learn package:
The following command installs the scikit-learn package:
$ conda install scikit-learn
Fetching package metadata: ....
Solving package specifications: .
Package plan for installation in environment /Users/myhomedir/anaconda:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    nose-1.3.7                 |           py27_0         194 KB
    setuptools-18.0.1          |           py27_0         341 KB
    pip-7.1.0                  |           py27_0         1.4 MB
    scikit-learn-0.16.1        |       np19py27_0         3.3 MB
    ------------------------------------------------------------
                                           Total:         5.2 MB

The following packages will be UPDATED:

    nose:         1.3.4-py27_1      --> 1.3.7-py27_0
    pip:          7.0.3-py27_0      --> 7.1.0-py27_0
    scikit-learn: 0.15.2-np19py27_0 --> 0.16.1-np19py27_0
    setuptools:   17.1.1-py27_0     --> 18.0.1-py27_0

Proceed ([y]/n)? y

Fetching packages ...
For Anaconda, since the command-line interface is all via conda, you install it using conda. Otherwise, the default approach is usually pip install. In either case, you should check the documentation for installation instructions. As the scikit-learn packages are quite popular and have been around for a while, not much has changed. Now, in the following section, we will explore k-means clustering to conclude this chapter.