SVD versus PCA components

Before we move on with our hotel data, let's do a quick experiment with our iris data to see whether our SVD and PCA really give us the same components:

  1. Let's start by grabbing our iris data and creating both a centered and a scaled version:
# import the Iris dataset and the scaler from scikit-learn
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# load the Iris dataset
iris = load_iris()

# separate the features and response variable
iris_X, iris_y = iris.data, iris.target

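# centered version: subtract each column's mean but keep its original variance
X_centered = StandardScaler(with_std=False).fit_transform(iris_X)
# scaled version: subtract the mean and divide by the standard deviation
X_scaled = StandardScaler().fit_transform(iris_X)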
  1. Let's continue by instantiating an SVD and a PCA object:
# test if we get the same components by using PCA and SVD
from sklearn.decomposition import PCA, TruncatedSVD

svd = TruncatedSVD(n_components=2)
pca = PCA(n_components=2)
  1. Now, let's apply both SVD and PCA to our raw iris data, centered version, and scaled version to compare:
# check whether the components of PCA and TruncatedSVD are the same for the raw dataset
# by subtracting the two matrices and seeing if, on average, the elements are very close to 0
print((pca.fit(iris_X).components_ - svd.fit(iris_X).components_).mean())

0.130183123094 # not close to 0
# matrices are NOT the same


# check whether the components of PCA and TruncatedSVD are the same for a centered dataset
print((pca.fit(X_centered).components_ - svd.fit(X_centered).components_).mean())

1.73472347598e-18 # close to 0
# matrices ARE the same


# check whether the components of PCA and TruncatedSVD are the same for a scaled dataset
print((pca.fit(X_scaled).components_ - svd.fit(X_scaled).components_).mean())

-1.59160878921e-16 # close to 0
# matrices ARE the same
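The mean of the element-wise difference is a quick check, but positive and negative differences can cancel out, and the sign of a component can flip between solvers or library versions. As a minimal sketch of a stricter comparison, the helper below (the name `same_components` and the tolerance are our own assumptions, not part of scikit-learn) compares the absolute values of the two component matrices with `np.allclose`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler


def same_components(X, n_components=2, atol=1e-6):
    """Return True if PCA and TruncatedSVD find the same axes for X, ignoring sign.

    Hypothetical helper for illustration; the tolerance is an assumption.
    """
    pca = PCA(n_components=n_components).fit(X)
    svd = TruncatedSVD(n_components=n_components, random_state=0).fit(X)
    # compare absolute values so that a flipped sign does not count as a difference
    return bool(np.allclose(np.abs(pca.components_),
                            np.abs(svd.components_), atol=atol))


iris_X = load_iris().data
X_centered = StandardScaler(with_std=False).fit_transform(iris_X)
X_scaled = StandardScaler().fit_transform(iris_X)

results = {
    "raw": same_components(iris_X),
    "centered": same_components(X_centered),
    "scaled": same_components(X_scaled),
}
print(results)
```

The reason centering matters is that PCA subtracts the column means internally before decomposing the data, while TruncatedSVD decomposes the matrix exactly as it is given. Once the means are already zero, the two procedures operate on the same matrix.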
  1. This shows us that the SVD module returns the same components as PCA when our data is centered (scaling also centers it), but different components when we use the raw, uncentered data. Let's continue with our hotel data:
svd = TruncatedSVD(n_components=1000)
svd.fit(tfidf_transformed)

The output is as follows:

TruncatedSVD(algorithm='randomized', n_components=1000, n_iter=5,
       random_state=None, tol=0.0)
  1. Let's make a scree plot as we would with our PCA module to see the explained variance of our SVD components:
# Scree Plot

plt.plot(np.cumsum(svd.explained_variance_ratio_))

This gives us the following plot:

We can see that 1,000 components capture about 30% of the variance. Now, let's set up our LSA pipeline. 
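Before moving on, note that the cumulative curve in the scree plot can also be read numerically. The sketch below is only an illustration: it uses a small random sparse matrix as a hypothetical stand-in for `tfidf_transformed`, and `components_needed` is a helper name of our own, not a scikit-learn function. It returns the smallest number of components whose cumulative explained variance ratio reaches a given threshold, or `None` if the fitted components never reach it.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD


def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold.

    Returns None when even all of the fitted components fall short.
    """
    idx = int(np.searchsorted(cumulative, threshold))
    return idx + 1 if idx < len(cumulative) else None


# hypothetical stand-in for tfidf_transformed: a small random sparse matrix
X_demo = sparse_random(200, 50, density=0.1, random_state=0, format="csr")

svd_demo = TruncatedSVD(n_components=20, random_state=0).fit(X_demo)
cumulative = np.cumsum(svd_demo.explained_variance_ratio_)

# how many components does it take to explain 30% of the variance?
print(components_needed(cumulative, 0.30))
```

With the hotel data, the same helper could be applied to `np.cumsum(svd.explained_variance_ratio_)` to find where the curve crosses a chosen threshold.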
