Chapter 4. Dimensionality Reduction

Data volume and high dimensionality pose a formidable challenge in text-mining tasks. Inherent noise and the computational cost of processing huge datasets make these tasks even more arduous. The science of dimensionality reduction lies in the art of reducing a high-dimensional space to a manageable size while losing only a comparatively small amount of information.

Before classification and clustering techniques, or other natural language processing activities, can be applied to text data, we need to reduce the dimensions and noise in the data so that each document can be represented using fewer dimensions, significantly reducing the noise that can hinder performance.

In this chapter, we will learn different dimensionality reduction techniques and their implementations in R:

  • The curse of dimensionality
  • Dimensionality reduction
  • Correspondence analysis
  • Singular value decomposition
  • ISOMAP – moving toward non-linearity

The curse of dimensionality

Topic modeling and document clustering are common text mining activities, but text data can be very high-dimensional, which gives rise to a phenomenon called the curse of dimensionality. Some literature also refers to it as the concentration of measure:

  • Distance measures treat all dimensions alike and assume each of them contributes equally to the distance. The higher the dimensionality, the more similar all points appear to each other.
  • Similarity measures do not take into account associations between attributes, which may result in inaccurate distance estimates.
  • The number of samples required per attribute increases exponentially with the number of dimensions.
  • Many dimensions may be highly correlated with each other, causing multicollinearity.
  • Extra dimensions cause a rapid increase in volume, which results in high sparsity – a major issue for any method that requires statistical significance. It also causes large variance in estimates, near duplicates, and poor predictors (see the simulation sketch after this list).
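
The volume effect is easy to demonstrate with a short simulation. The following R sketch (a toy example, not from any particular package, with an assumed sample size of 10,000 uniform points) measures the fraction of points in a hypercube that fall inside the inscribed unit ball; that fraction collapses as the number of dimensions grows, which is exactly the sparsity described above:

    # Toy simulation: fraction of uniform points inside the inscribed unit ball
    set.seed(42)
    n <- 10000
    for (d in c(2, 5, 10, 20)) {
      # n points drawn uniformly from the hypercube [-1, 1]^d
      x <- matrix(runif(n * d, min = -1, max = 1), nrow = n, ncol = d)
      # proportion of points whose Euclidean norm is at most 1
      inside <- mean(sqrt(rowSums(x^2)) <= 1)
      cat(sprintf("d = %2d : fraction inside the unit ball = %.4f\n", d, inside))
    }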

Distance concentration and computational infeasibility

Distance concentration is a phenomenon of high-dimensional space wherein pairwise distances or dissimilarities between points appear indistinguishable. All vectors in high dimensions appear to be orthogonal to each other, and the distances from each data point to its nearest and farthest neighbors become nearly equal. This severely undermines the utility of methods that rely on distance-based measures.
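
A quick way to observe distance concentration is to simulate it. The following R sketch (a toy example with assumed values for the sample size and the set of dimensions) computes the relative contrast between the farthest and nearest pairwise distances for random uniform data; the contrast shrinks towards zero as the dimensionality grows:

    # Toy simulation of distance concentration
    set.seed(123)
    n <- 100
    for (d in c(2, 10, 100, 1000)) {
      # n points drawn uniformly from the unit hypercube [0, 1]^d
      x <- matrix(runif(n * d), nrow = n)
      # all pairwise Euclidean distances
      dists <- as.vector(dist(x))
      # relative contrast: how far apart the extremes are, relative to the minimum
      contrast <- (max(dists) - min(dists)) / min(dists)
      cat(sprintf("d = %4d : relative contrast = %.3f\n", d, contrast))
    }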

Let's consider that the number of samples is n and the number of dimensions is d. If d is very large, the number of samples may prove insufficient to accurately estimate the parameters. For a dataset with d dimensions, the number of parameters in the covariance matrix is d^2. In an ideal scenario, n should be much larger than d^2 to avoid overfitting.
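
To see why this matters in practice, consider estimating a covariance matrix from too few samples. The following R sketch (a toy simulation with an assumed sample size of 50 Gaussian observations) shows that as soon as d exceeds n, the sample covariance matrix becomes rank-deficient, even though it contains d^2 entries to estimate:

    # Toy simulation: covariance estimation with a fixed, small sample size
    set.seed(1)
    n <- 50
    for (d in c(10, 40, 60, 100)) {
      # n standard normal observations in d dimensions
      x <- matrix(rnorm(n * d), nrow = n, ncol = d)
      S <- cov(x)
      # the covariance matrix has d^2 entries, but its rank cannot exceed n - 1
      cat(sprintf("d = %3d : entries in covariance matrix = %5d, rank = %3d\n",
                  d, d * d, qr(S)$rank))
    }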

In general, there is an optimal number of dimensions to use for a given fixed number of samples. Engineering more features may seem like a good idea when a problem cannot be solved with fewer features, but the computational cost and model complexity increase with the number of dimensions. For instance, if n samples are dense enough for a one-dimensional feature space, then roughly n^k samples would be required to achieve the same density in a k-dimensional feature space.
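
As a rough back-of-the-envelope illustration (assuming, purely hypothetically, that 10 samples are enough to cover a single dimension), the required sample count grows exponentially with k:

    # Hypothetical coverage requirement: 10 samples per dimension
    samples_per_dim <- 10
    for (k in 1:6) {
      cat(sprintf("k = %d dimensions : ~%s samples needed\n",
                  k, format(samples_per_dim^k, big.mark = ",", scientific = FALSE)))
    }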
