Data volume and high dimensionality pose a significant challenge in text-mining tasks. Inherent noise and the computational cost of processing huge datasets make it even more arduous. The science of dimensionality reduction lies in the art of losing only a small amount of information while still reducing the high-dimensional space to a manageable size.
Before classification and clustering techniques can be applied to text data for various natural language processing tasks, we need to reduce the dimensionality and noise in the data. Representing each document with fewer dimensions significantly reduces the noise that can hinder performance.
In this chapter, we will learn different dimensionality reduction techniques and their implementations in R:
Topic modeling and document clustering are common text mining activities, but text data can be very high-dimensional, which gives rise to a phenomenon called the curse of dimensionality. Some literature also refers to the related effect as concentration of measure:
Distance concentration is a phenomenon associated with high-dimensional space wherein pairwise distances or dissimilarities between points appear indistinguishable. All vectors in high dimensions appear to be nearly orthogonal to each other, and the distances from each data point to its neighbors, nearest or farthest, become almost equal. This severely undermines the utility of methods that rely on distance-based measures.
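This effect is easy to observe empirically. The following is a minimal, illustrative sketch (not from the chapter): it draws random points in the unit hypercube and measures the relative contrast between the farthest and nearest neighbor of a point; as the number of dimensions grows, this contrast shrinks toward zero.

```r
# Illustrative sketch of distance concentration.
set.seed(42)
concentration <- function(d, n = 100) {
  x <- matrix(runif(n * d), nrow = n)      # n random points in [0,1]^d
  dists <- as.matrix(dist(x))[1, -1]       # distances from point 1 to all others
  (max(dists) - min(dists)) / min(dists)   # relative contrast of distances
}
sapply(c(2, 10, 100, 1000), concentration)
```

As `d` increases the returned ratio drops steadily, meaning the nearest and farthest neighbors become nearly equidistant, which is exactly why nearest-neighbor searches and distance-based clustering degrade in high dimensions.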
Let's consider that the number of samples is n and the number of dimensions is d. If d is very large, the number of samples may prove insufficient to accurately estimate the parameters. For a dataset with d dimensions, the covariance matrix has on the order of d^2 parameters. In an ideal scenario, n should be much larger than d^2 to avoid overfitting.
In general, there is an optimal number of dimensions to use for a given fixed number of samples. It may feel like a good idea to engineer more features when we cannot solve a problem with fewer of them, but the computational cost and model complexity increase with the number of dimensions. For instance, if n samples are dense enough for a one-dimensional feature space, then roughly n^k samples would be required to achieve the same density in a k-dimensional feature space.
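The exponential growth in the required sample size can be tabulated directly. This short sketch (illustrative numbers, not from the chapter) assumes n = 100 samples suffice for one dimension and applies the n^k rule:

```r
# Sketch: samples needed to keep the same density as dimensionality grows.
n <- 100                                  # samples that cover 1 dimension densely
k <- c(1, 2, 3, 5, 10)                    # candidate feature-space dimensions
data.frame(dimensions = k, samples_needed = n^k)
```

Even at k = 10, the requirement reaches 10^20 samples, which is why dimensionality reduction is usually a far more practical remedy than collecting more data.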