Principal component analysis

One of the most commonly used methods of dimensionality reduction is Principal Component Analysis (PCA). Conceptually, PCA computes the axes along which the variation in the data is greatest. You may recall that in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, we calculated the eigenvalues of the adjacency matrix of a dataset to perform spectral clustering. In PCA, we also want to find the eigenvalues of the dataset, but here, instead of an adjacency matrix, we will use the covariance matrix of the data, which captures the relative variation within and between columns. The covariance for columns xi and xj in the data matrix X is given by:

\mathrm{cov}(x_i, x_j) = \frac{1}{n}\sum_{k=1}^{n}\left(x_{ki} - \bar{x}_i\right)\left(x_{kj} - \bar{x}_j\right)

This is the average product of the offsets from the mean column values. We saw this value before when we computed the correlation coefficient in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, as it is the numerator of the Pearson coefficient. Let us use a simple example to illustrate how PCA works. We will make a dataset in which the six columns are derived from the same underlying normal distribution, three of which are reversed in sign, using the following commands:

>>> syn_1 = np.random.normal(0,1,100)
>>> syn_2 = -1*syn_1
>>> syn_data = [ syn_1, syn_1, syn_1, syn_2, syn_2, syn_2]

Note that each of our columns has mean 0 and standard deviation 1. If this were not the case, we could use the scikit-learn utility StandardScaler, as we discussed in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, when we normalized data for use in k-means clustering. We might simply center the variables at 0 and use the resulting covariance matrix if we believe that the differences in scale of the variables are important to our problem. Otherwise, differences in scale will tend to be reflected in the differing variance values within the columns of the data, so the resulting PCA will reflect not only the correlations among variables but also their differences in magnitude. If we do not want to emphasize these differences and are only interested in the relative correlation among variables, we can also divide each column of the data by its standard deviation to give each column a variance of 1. We could also run PCA not on the covariance matrix but on the Pearson correlation matrix between variables, which is already naturally scaled to a constant range of values (from -1 to 1) (Kromrey, Jeffrey D., and Lynn Foster-Johnson. Mean centering in moderated multiple regression: Much ado about nothing. Educational and Psychological Measurement 58.1 (1998): 42-67.). For now, we can compute the covariance matrix of our data with the following command:

>>> syn_cov = np.cov(syn_data)
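As a quick check on the covariance formula above, we can reproduce np.cov by centering each row of our data and averaging the products of the offsets; the np.corrcoef call at the end is the correlation-matrix alternative mentioned previously. This is a minimal sketch reusing the syn_data and syn_cov variables we just created:

>>> X = np.array(syn_data)                       # 6 variables (rows) by 100 observations
>>> X_centered = X - X.mean(axis=1, keepdims=True)
>>> # np.cov uses the unbiased estimator, dividing by n-1 rather than n
>>> manual_cov = np.dot(X_centered, X_centered.T) / (X.shape[1] - 1)
>>> np.allclose(manual_cov, syn_cov)
True
>>> syn_corr = np.corrcoef(syn_data)             # scaled covariance, ranging from -1 to 1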

Recalling our discussion of spectral clustering in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, if we consider the covariance matrix as a stretching operation on a vector, then finding the vectors that lie along these directions of distortion in a sense gives us the axes that define the variation in the data. If we then compare the eigenvalues of these vectors, we can determine whether one or more of these directions reflects a greater proportion of the overall variation of the data. Let us compute the eigenvalues and eigenvectors of the covariance matrix using:

>>> [eigenvalues, eigenvectors] = np.linalg.eig(syn_cov)

This gives the following eigenvalues:

array([  0.00000000e+00,   4.93682786e+00,   1.23259516e-32,
         1.50189461e-16,   0.00000000e+00,  -9.57474477e-34])

You can see that most of the eigenvalues are effectively zero, except the second. This reflects the fact that the data we constructed, despite having six columns, is effectively derived from only one dataset (a normal distribution). Another important property of these eigenvectors is that they are orthogonal, which means that they are at right angles to each other in n-dimensional space: if we were to take a dot product between them, it would be 0, and they thus represent independent vectors that, when linearly combined, can be used to represent the dataset.
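As a quick sanity check on this orthogonality property, we can take the dot product between the dominant eigenvector and one of the others, which should be numerically close to zero. This is a minimal sketch reusing the eigenvectors variable computed above:

>>> # eigenvectors are stored as columns; the dominant one is in column 1
>>> np.dot(eigenvectors[:, 1], eigenvectors[:, 0])
>>> # for a symmetric matrix such as a covariance matrix, np.linalg.eigh is the
>>> # specialized routine that returns an orthonormal set of eigenvectors
>>> eigenvalues_h, eigenvectors_h = np.linalg.eigh(syn_cov)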

If we were to multiply the data by the eigenvector corresponding to this second eigenvalue, we would project the data from a six-dimensional to a one-dimensional space:

>>> plt.hist(np.dot(np.array(syn_data).transpose(),np.array(eigenvectors[:,1])))

Note that we needed to transpose the data to obtain 100 rows and 6 columns, as we initially constructed it as a list of 6 columns, which NumPy instead interprets as having 6 rows and 100 columns. The resulting histogram is shown in the following figure:

[Figure: histogram of the synthetic data projected onto the eigenvector with the largest eigenvalue]

In other words, by projecting the data onto the axis of greatest variance, we have recovered the fact that this six-column dataset was actually generated from a single distribution. Now, if we instead use scikit-learn's PCA, we get a similar result:

>>> syn_pca = PCA().fit(np.array(syn_data))

When we extract the explained_variance_ratio_, we see that the algorithm has effectively taken the preceding eigenvalues, ordered them by magnitude, and divided each by their total, so that the ratios sum to 1, giving:

array([  1.00000000e+000,   6.38413622e-032,   2.02691244e-063,
         2.10702767e-094,   3.98369984e-126,   5.71429334e-157])
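We can verify this relationship directly by normalizing the explained variances so that they sum to 1 and comparing the result to explained_variance_ratio_; a minimal sketch reusing the syn_pca object we just fit:

>>> # each ratio is that component's variance divided by the total variance
>>> ratios = syn_pca.explained_variance_ / syn_pca.explained_variance_.sum()
>>> np.allclose(ratios, syn_pca.explained_variance_ratio_)
True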

If we plot these values as a bar chart, we obtain a visualization known as a scree plot, which can help us determine how many underlying components are represented in our data:

>>> scree, ax = plt.subplots()
>>> plt.bar(np.arange(0,6),syn_pca.explained_variance_ratio_)
>>> ax.set_xlabel('Component Number')
>>> ax.set_ylabel('Variance Explained')
>>> plt.show()

This generates the following plot:

[Figure: scree plot of the variance explained by each component of the synthetic data]

Evidently, only the first component carries any variance, represented by the height of the bar, with all other components being near 0 and so not appearing in the plot. This sort of visual analysis is comparable to how we looked for an elbow in the inertia function for k-means as a function of k in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, to determine how many clusters were present in the data. We can also extract the first principal component itself and see a histogram similar to the one shown previously when we projected the data onto an eigenvector of the covariance matrix:

>>> plt.hist(syn_pca.components_[0])
[Figure: histogram of the first principal component of the synthetic data]

Why are they not exactly identical? While conceptually PCA computes the eigenvalues of the covariance matrix, in practice most packages do not actually implement the calculation we illustrated previously, for reasons of numerical efficiency. Instead, they employ a matrix operation known as the Singular Value Decomposition (SVD), which seeks to represent the covariance matrix of X as a product of lower-dimensional row and column matrices:

\mathrm{cov}(X) \approx W \sigma W^{T}

Where, if X is n by m, W may be m by k, with k << m, and σ represents a matrix that is 0 everywhere except along the diagonal, which contains non-zero entries. Thus, the covariance matrix is represented as the product of two smaller matrices and a scaling factor given by the diagonal elements of σ. Instead of calculating all eigenvectors of the covariance matrix, as we did previously, we can ask only for the k columns of W (the rows of WT) that we think are likely to be significant, judged by the sort of scree plot analysis we demonstrated above. However, when we project the data onto the principal components we obtain through this method, the calculation of the SVD can potentially give different signs to the projection of the data on the principal components, even if the relative magnitudes and signs within these components remain the same. Thus, when we look at the scores assigned to a given row of data after projecting it onto the first k principal components, we should analyze them relative to other values in the dataset, just as when we examined the coordinates produced by Multidimensional Scaling in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning. Details of the SVD calculation used by the default scikit-learn implementation of PCA are given in (Tipping, Michael E., and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61.3 (1999): 611-622.).
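To see this connection concretely, we can apply NumPy's SVD to the covariance matrix we built earlier; for a symmetric, positive semi-definite matrix such as this one, the singular values coincide with the eigenvalues. This is a minimal sketch reusing syn_cov and syn_data:

>>> U, s, Vt = np.linalg.svd(syn_cov)
>>> s[0]     # the largest singular value matches the dominant eigenvalue (~4.94)
>>> # projecting the data onto the first column of U recovers, up to sign,
>>> # the same one-dimensional representation as the eigenvector projection above
>>> proj = np.dot(np.array(syn_data).transpose(), U[:, 0])

Note that scikit-learn's PCA applies the SVD to the centered data matrix itself rather than forming the covariance matrix explicitly, but the two views are equivalent for our purposes.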

Now that we have examined conceptually what PCA calculates, let us see if it can help us reduce the dimensionality of our text dataset. Let us run PCA on the n-gram feature set from above, asking for 10 components. Note that because the original dataset is a sparse matrix and PCA requires a dense matrix as an input, we need to convert it using toarray(). Also, to retain the right dimensionality for use with the PCA fit function, we need to transpose the result:

>>> pca_text = PCA(n_components=10).fit(np.transpose(count_vect_sparse.toarray()))

If we make a scree plot of the variance explained by each of the first 10 principal components of this dataset, we see that we will probably require a relatively large number of components to capture the variation in our data, since the variance explained declines only gradually from one component to the next:

>>> scree, ax = plt.subplots()
>>> plt.bar(np.arange(0,10),pca_text.explained_variance_ratio_)
>>> ax.set_xlabel('Component Number')
>>> ax.set_ylabel('Variance Explained')
>>> plt.show()
[Figure: scree plot of the variance explained by the first 10 principal components of the n-gram data]

We could also visualize this by looking at the cumulative variance explained using k components using the following curve:

>>> scree, ax = plt.subplots()
>>> plt.plot(pca_text.explained_variance_ratio_.cumsum())
>>> ax.set_xlabel('Number of Components')
>>> ax.set_ylabel('Cumulative Variance Explained')
>>> plt.show()  
[Figure: cumulative variance explained by the first 10 principal components of the n-gram data]

A word on normalization: in practice, for document data, we might not want to scale the data by subtracting the mean and dividing by the variance, as the data is mostly binary. Instead, we would just apply the SVD to the binary matrix, or perhaps to the tf-idf scores we computed previously, an approach also known as Latent Semantic Indexing (LSI) (Berry, Michael W., Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review 37.4 (1995): 573-595; Landauer, Thomas K., Darrell Laham, and Peter Foltz. Learning human-like knowledge by singular value decomposition: A progress report. Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference. Vol. 10. MIT Press, 1998.).
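Scikit-learn's TruncatedSVD estimator is a convenient way to experiment with this LSI-style approach, since it operates directly on sparse matrices without any mean-centering. The following minimal sketch assumes the count_vect_sparse matrix from earlier in the chapter; the tf-idf matrix could be substituted in exactly the same way:

>>> from sklearn.decomposition import TruncatedSVD
>>> # no densification or centering needed: the sparse count matrix is used as-is
>>> lsi = TruncatedSVD(n_components=10).fit(count_vect_sparse)
>>> lsi.explained_variance_ratio_.sum()   # variance captured by the 10 LSI components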

CUR decomposition and nonnegative matrix factorization

What drawbacks might there be to using PCA to reduce the dimensionality of a dataset? For one, the components (covariance matrix eigenvectors) generated by PCA are still essentially mathematical entities: the patterns in variables represented by these axes might not actually correspond to any element of the data, but rather to a linear combination of them. This representation is not always easily interpretable, which can be a particular problem when trying to convey the results of such analyses to domain experts in order to generate subject-matter-specific insights. Second, PCA produces negative values in its eigenvectors even for positive-only data such as text (where a term cannot be negatively present in a document, only 0, 1, a count, or a frequency); this is possible because the data is linearly combined using these factors, so positive and negative values can cancel when we project the data onto its components through matrix multiplication, still yielding an overall positive value for the projection. Again, it may be preferable to have factors that give some insight into the structure of the data itself, for example, a factor that consists of binary indicators for a group of words that tend to co-occur in a particular group of documents. These goals are addressed by two other matrix factorization techniques: CUR decomposition and Non-negative Matrix Factorization.

Like the SVD used in PCA, CUR attempts to represent a matrix of data X as a product of lower dimensional matrices. Here, instead of eigenvectors, the CUR decomposition attempts to find the set of columns and rows of the matrix that best represent the dataset as:

X \approx CUR

Where C is a matrix of c columns of the original dataset, R is a set of r rows from the original dataset, and U is a matrix of scaling factors. The c columns and r rows used in this reconstruction are sampled from the columns and rows of the original matrix, with probability proportional to the leverage score, given by:

lv_j = \frac{1}{k}\sum_{\xi=1}^{k}\left(v_j^{\xi}\right)^{2}

Where lvj is the statistical leverage for column (or row) j, k is the number of components in the SVD of X, and the vj are the jth elements of these k component vectors. Thus, columns (rows) are sampled with high probability if they contribute significantly to the top singular vectors of the matrix, meaning they also have a major influence on the reconstruction error of the SVD (that is, how well the SVD approximates the original matrix) (Chatterjee, Samprit, and Ali S. Hadi. Sensitivity analysis in linear regression. Vol. 327. John Wiley & Sons, 2009; Bodor, András, et al. rCUR: an R package for CUR matrix decomposition. BMC Bioinformatics 13.1 (2012): 1).
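A minimal sketch of this calculation, assuming we compute column leverage scores from the top k right singular vectors of our document-term matrix (row scores would use the left singular vectors in the same way, and for a large matrix we would use a truncated SVD rather than the full decomposition shown here):

>>> k = 10
>>> # rows of Vt are the right singular vectors of the document-term matrix
>>> U, s, Vt = np.linalg.svd(count_vect_sparse.toarray(), full_matrices=False)
>>> lev = (Vt[:k, :] ** 2).sum(axis=0) / k   # one leverage score per column (term)
>>> lev.sum()                                # the scores sum to 1, so they form sampling probabilities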

While this decomposition is not expected to approximate the original dataset with the same accuracy as the SVD approach used in PCA, the resulting factors may be easier to interpret since they are actual elements of the original dataset.

Note

Please note that while we use the SVD to determine the sampling probabilities for the columns and rows, the final CUR factorization does not itself use the SVD.

There are many algorithms for generating a CUR decomposition (Mahoney, Michael W., and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences 106.3 (2009): 697-702; Boutsidis, Christos, and David P. Woodruff. Optimal CUR matrix decompositions. Proceedings of the 46th Annual ACM Symposium on Theory of Computing. ACM, 2014). CUR decomposition is implemented in the pymf library, and we can call it using the following commands:

>>> import pymf
>>> cur = pymf.CUR(count_vect_sparse.toarray(),crank=100,rrank=100)
>>> cur.factorize()

The crank and rrank parameters indicate how many rows and columns, respectively, should be chosen from the original matrix in the process of performing the decomposition. We can then examine which columns (words from the vocabulary) were chosen in this reconstruction; the indices of these significant words are contained in the cur object's ._cid (column index) element. First, we need to collect a list of all words in the vocabulary of our spam dataset:

>>> vocab = CountVectorizer().fit(spam.text).vocabulary_
>>> vocab_array = ['']*len(vocab.values())
>>> for k,v in vocab.items():
…      vocab_array[v]=k
>>> vocab_array = np.array(vocab_array)

Since the vocabulary_ variable returned by the CountVectorizer is a dictionary giving the positions of terms in the array to which they are mapped, we need to construct our array by placing the word at the position given by this dictionary. Now we can print the corresponding words using:

>>> for i in cur._cid:
…      print(vocab_array[i])

Like CUR, non-negative matrix factorization (NMF) attempts to find a set of non-negative components that represent the structure of a dataset (Lee, Daniel D., and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401.6755 (1999): 788-791; Lee, Daniel D., and H. Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems. 2001; Paatero, Pentti, and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5.2 (1994): 111-126; Anttila, Pia, et al. Source identification of bulk wet deposition in Finland by positive matrix factorization. Atmospheric Environment 29.14 (1995): 1705-1718.). Similarly, it tries to reconstruct the data as:

X \approx WH

Where W and H are lower-dimensional matrices that, when multiplied, reconstruct X, and all three of W, H, and X are constrained to have no negative values. Thus, the columns of X are linear combinations of the columns of W, using H as the coefficients. For example, if the rows of X are words and the columns are documents, then each document in X is represented as a linear combination of the underlying document types in W, with weights given by H. Like the elements returned by the CUR decomposition, the components W from non-negative matrix factorization are potentially more interpretable than the eigenvectors we get from PCA.

There are several algorithms to compute W and H, with one of the simplest being through multiplicative updates (Lee, Daniel D., and H. Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in neural information processing systems. 2001). For example, if we want to minimize the Euclidean distance between X and WH:

\lVert X - WH \rVert^{2} = \sum_{i,j}\left(X_{ij} - (WH)_{ij}\right)^{2}

We can calculate the derivative of this value with respect to W:

\nabla_{W}\lVert X - WH \rVert^{2} = 2\left(WHH^{T} - XH^{T}\right)

Then, to update W at each step, we multiply it element-wise by a ratio derived from this gradient:

W \leftarrow W \circ \frac{XH^{T}}{WHH^{T}}

And the same for H:

H \leftarrow H \circ \frac{W^{T}X}{W^{T}WH}
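The following is a minimal NumPy sketch of these two multiplicative updates on a small, randomly generated non-negative matrix; the number of components k, the fixed number of iterations, and the small constant added to the denominators (to avoid division by zero) are illustrative choices:

>>> X_toy = np.abs(np.random.normal(size=(20, 30)))   # toy non-negative data matrix
>>> k, eps = 5, 1e-9
>>> W = np.abs(np.random.normal(size=(20, k)))
>>> H = np.abs(np.random.normal(size=(k, 30)))
>>> for step in range(200):
…       W *= np.dot(X_toy, H.T) / (np.dot(W, np.dot(H, H.T)) + eps)
…       H *= np.dot(W.T, X_toy) / (np.dot(np.dot(W.T, W), H) + eps)
>>> np.linalg.norm(X_toy - np.dot(W, H))              # reconstruction error shrinks over the updates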

In practice, these update steps are repeated until the values of W and H converge. Let us now examine what components we retrieve from our text data when we use scikit-learn's NMF implementation to extract components:

>>> from sklearn.decomposition import NMF
>>> nmf_text = NMF(n_components=10).fit(count_vect_sparse.toarray())

We can then look at the words represented by each NMF component: these are the words with large values in the corresponding row of the components matrix resulting from the decomposition.
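One way to do this is to sort each row of the components matrix and print the highest-weighted terms, reusing the vocab_array lookup we built for the CUR example; a minimal sketch (the exact words will vary with the random initialization of the NMF fit):

>>> # print the ten highest-weighted terms for each of the 10 components
>>> for i in range(10):
…      print(vocab_array[np.argsort(nmf_text.components_[i])[-10:]])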

We can see that they appear to capture distinct groups of words, but are any correlated with distinguishing spam versus nonspam? We can transform our original data using the NMF decomposition, which will give the weights for linearly combining these features (for example, the weights to linearly combine the 10 basis documents we get from the decomposition to reconstruct the message) using the command:

>>> nmf_text_transform = nmf_text.transform(count_vect_sparse.toarray())

Now let us plot the average weight assigned to each of these NMF factors for the normal and spam messages. We can do this by plotting a bar chart in which the x axis shows the 10 NMF factors and the y axis shows the average weight assigned to each factor for a subset of documents:

>>> plt.bar(range(10),nmf_text_transform[spam.label=='spam'].mean(0))
[Figure: average NMF factor weights for the spam messages]
>>> plt.bar(range(10),nmf_text_transform[spam.label=='ham'].mean(0))
[Figure: average NMF factor weights for the ham (nonspam) messages]

Promisingly, factors 8 and 9 seem to have very different average weights between these two classes of messages. In fact, we may need even fewer than 10 factors to represent the data, since these two factors may well correspond to the underlying spam and nonspam message classes.
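These differences suggest that the NMF weights themselves could serve as input features for a classifier. The following is a minimal sketch of this idea using the nmf_text_transform weights computed above; the logistic regression is purely illustrative, and any of the classifiers from Chapter 5, Putting Data in its Place – Classification Methods and Analysis, could be substituted:

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import cross_val_score
>>> # use the 10 NMF document weights as features to predict the spam/ham label
>>> cross_val_score(LogisticRegression(), nmf_text_transform, spam.label, cv=5).mean()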

Latent Dirichlet Allocation

A related method of decomposing data into an interpretable set of features is Latent Dirichlet Allocation (LDA), a method initially developed for textual and genetics data that has since been extended to other areas (Blei, David M., Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003): 993-1022; Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics 155.2 (2000): 945-959.). Unlike the methods we looked at previously, in which the data is represented as a set of lower-dimensional matrices that, when multiplied, approximate the original data, LDA uses a probability model. This model is often explained using a plate diagram that illustrates the dependencies among the variables, as shown in the following diagram:

[Figure: plate diagram of the LDA generative model]

What exactly does this diagram describe? It is what is known as a generative model: a set of instructions by which to generate a probability distribution over documents. The idea is comparable to a distribution such as the Gaussian 'bell curve' you are probably familiar with, except that here, instead of drawing real numbers from the distribution, we sample documents. Generative models may be contrasted with the predictive methods we have seen in previous chapters, which attempt to fit the data to a response (as in the regression or classification models we studied in Chapter 4, Connecting the Dots with Models – Regression Methods, and Chapter 5, Putting Data in its Place – Classification Methods and Analysis), rather than simply generating samples of the data according to a distribution. The plate diagram represents the components of this generative model, and we can think of this model as the following series of steps to generate a document:

  1. Initialize a Dirichlet distribution to choose from a set of topics. These topics are analogous to the components we found in NMF, and can be thought of as basis documents representing groups of commonly co-occurring words. The Dirichlet distribution is given by the following formula:

P(\theta \mid \alpha) = \frac{1}{B(\alpha)}\prod_{k=1}^{K}\theta_k^{\alpha_k - 1}

The preceding formula gives the probability of observing a given distribution of items (here, topics) among K classes, and can be used to sample a vector of K class memberships (for example, a random vector giving the fraction of a document that belongs to each topic). The alpha parameter in the Dirichlet distribution is used as an exponent of the K category probabilities and increases the weight ascribed to a particular component (for example, a more frequent topic). The term B is the beta function, which is simply a normalization term.

  2. Use the Dirichlet distribution from step 1 to generate a per-topic probability distribution for each document i: a series of weights that sum to 1, giving the relative probability that the document belongs to each topic (see the sampling sketch after this list). This is the parameter θ in the plate diagram, and M represents the number of documents in our dataset.

  3. For each of the N word positions in the document, choose a topic Z from the distribution θ.
  4. Each of the K topics has its own Dirichlet distribution with parameter β, this time giving per-word probabilities, denoted ϕ. Use the distribution ϕ for the chosen topic to pick the word at each of the N positions in the document.
  5. Repeat steps 2–4 for each document in the dataset to generate a group of documents.
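As a small illustration of what step 2 produces, we can sample per-document topic proportions from a Dirichlet distribution directly with NumPy; the values of K, M, and alpha here are purely illustrative:

>>> K, M, alpha = 10, 5, 0.1
>>> # each row is one document's topic distribution: K non-negative weights summing to 1
>>> theta = np.random.dirichlet([alpha] * K, size=M)
>>> theta.sum(axis=1)    # every row sums to 1
>>> # smaller alpha values concentrate each document's weight on fewer topics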

In the preceding diagram, the letters (M, N, K) inside the rectangles indicate the number of times that the variables represented by circles are generated in the generative model. Thus, the words w, being innermost, are generated N × M times. You will also notice that the rectangles enclose variables that are generated the same number of times, while arrows indicate dependence among variables during this data generation process. You can also now appreciate where the name of this model comes from, as a document is latently allocated among many topics, just as we used the factors in NMF to find linear combinations of 'basis documents' that could reconstruct our observed data.

This recipe can also be run in reverse to find the set of topics (that is, word probability distributions) that best fits a dataset, assuming the model described previously was used to generate the documents. Without going into the full details of the derivation, we randomly initialize a fixed number K of topics and run the model as described previously, sampling each document's topic distribution given those of all other documents, and each word's topic given the assignments of all other words in the document. We then update the parameters of the model based on the observed data and use the updated probabilities to generate the data again. Over many iterations, this process, known as Gibbs sampling, converges from the randomly initialized values to a set of model parameters that best fits the observed document data. Let us now fit an LDA model to the spam dataset using the following commands:

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> lda = LatentDirichletAllocation(n_topics=10).fit(count_vect_sparse)

As with NMF, we can examine the highest probability words for each topic using:

>>> for i in range(10):
…    print(vocab_array[np.argsort(lda.components_[i])[-10:]])

Likewise, we can see if these topics represent a meaningful separation between the spam and nonspam messages. First we find the topic distribution among the 10 latent topics for each document using the following transform command:

>>> topic_dist = lda.transform(count_vect_sparse)

This is analogous to the weights we calculated in NMF. We can now plot the average topic weight for each message class as follows:

>>> plt.bar(range(10),topic_dist[spam.label=='ham'].mean(0))
[Figure: average topic weights for the ham (nonspam) messages]
>>> plt.bar(range(10),topic_dist[spam.label=='spam'].mean(0))
[Figure: average topic weights for the spam messages]

Again, promisingly, we find a different average weight for topic 5 for spam than nonspam, indicating that the LDA model has successfully separated out the axis of variation we are interested in for classification purposes.

Using dimensionality reduction in predictive modeling

The analysis we have outlined previously has been largely devoted to extracting a lower-dimensional representation of a text collection by finding a smaller set of components that capture the variation among individual documents. In some cases, this sort of analysis can be useful as an exploratory data analysis tool which, like the clustering techniques we described in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, allows us to understand the structure in a dataset. We might even combine clustering and dimensionality reduction, which is in essence the idea of the spectral clustering we examined in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning: use the SVD to reduce the adjacency matrix to a more compact representation, and then cluster this reduced space to yield a cleaner separation between data points.
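A minimal sketch of this reduce-then-cluster combination using scikit-learn components, applied here to our document-term matrix count_vect_sparse rather than an adjacency matrix; the choice of 10 components and 2 clusters is illustrative rather than tuned:

>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.cluster import KMeans
>>> # reduce the sparse document-term matrix to 10 dimensions, then cluster the compact representation
>>> reduced = TruncatedSVD(n_components=10).fit_transform(count_vect_sparse)
>>> doc_clusters = KMeans(n_clusters=2).fit_predict(reduced)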

Like the groups assigned through clustering, we can also potentially use the components derived from these dimensionality reduction methods as features in a predictive model. For example, the NMF components we extracted previously could be used as inputs to a classification model to separate spam from nonspam messages. We have even seen this usage earlier: the online news popularity dataset we used in Chapter 4, Connecting the Dots with Models – Regression Methods, had columns derived from LDA topics. Like the regularization methods we saw in Chapter 4, Connecting the Dots with Models – Regression Methods, dimensionality reduction can help reduce overfitting by extracting the underlying correlations among variables, since these lower-dimensional features are often less noisy than the full feature space. Now that we have seen how dimensionality reduction can help us find structure in textual data, let us examine another class of potentially high-dimensional data found in images.
