From unstructured text data to a matrix

An issue with text data is that words and sentences are messy, and algorithms for data mining usually do not work out of the box, as they are designed to operate on abstractions of the data, usually in matrix form. So we need to find a way to represent our messy text data as a matrix. 

One of the most commonly used matrix representations in practice is the bag of words model. This is a very simple and intuitive way of extracting information from text. There are some caveats to it, which will be discussed later. 

A bag of words representation of a text consists of:

  • A vocabulary of known words
  • A numerical measure associated with the presence of such words

For example, suppose we have a corpus (a collection of documents) consisting of three sentences:

  • "My sentence"
  • "Your sentence"
  • "My sentence, your sentence, our sentences"

The vocabulary (ignoring the comma) is the collection ("My","sentence","sentences","Your","your","our"). As for the numerical measure, a natural option would be the count function. So the matrix representation would be:

             My  Your  sentence  sentences  your  our
Document 1    1     0         1          0     0    0
Document 2    0     1         1          0     0    0
Document 3    1     0         2          1     1    1

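One way to build such a count matrix by hand is sketched below in Python (a minimal illustration using only the standard library; the variable names are my own, and the columns come out ordered by first appearance rather than as in the table above):

```python
from collections import Counter

# The toy corpus from the example above.
corpus = [
    "My sentence",
    "Your sentence",
    "My sentence, your sentence, our sentences",
]

# Split on whitespace and strip the comma, keeping the original case for now.
tokenized = [[w.strip(",") for w in doc.split()] for doc in corpus]

# Vocabulary: every distinct token, in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for w in tokens:
        if w not in vocabulary:
            vocabulary.append(w)

# Count matrix: one row per document, one column per vocabulary word.
matrix = [[Counter(tokens)[w] for w in vocabulary] for tokens in tokenized]

print(vocabulary)   # ['My', 'sentence', 'Your', 'your', 'our', 'sentences']
for row in matrix:
    print(row)      # [1, 1, 0, 0, 0, 0], [0, 1, 1, 0, 0, 0], [1, 2, 0, 1, 1, 1]
```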
This is now something we can operate with. A few comments are in order. First, we ignored the comma, which does not seem like a big deal, but it is, and we will come back to it. Second, we are treating "Your" and "your" as different words, which is probably not desirable. Note also that "sentence" and "sentences" carry essentially the same information: if we want to infer the content of a document, it might suffice to keep just one of them.

To deal with the capitalization issue, we can simply convert all the words to lower case before passing to the matrix form. Dealing with plurals and other words derived from a common root is done through algorithms called stemmers; the most common choice is Porter's stemming algorithm. Stemming is not always a good idea, though; whether it helps depends on the language and the context of your problem.
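As a small illustration, here is how lower-casing followed by Porter stemming could be applied in Python (this sketch assumes the NLTK library, one common implementation of Porter's algorithm):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Sentence", "sentences", "Your", "your"]

# Lower-case first, then stem, so that case and plural variants
# collapse into a single vocabulary entry.
print([stemmer.stem(w.lower()) for w in words])
# ['sentenc', 'sentenc', 'your', 'your']
```

Note that the stem ("sentenc" here) need not be a dictionary word; it only has to be the same for all the variants.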

Depending on the context, common words such as pronouns are often better omitted. Lists of such words, called stop words, are readily available on the internet. So before creating the matrix representation, you filter out those words, which also reduces the dimension of the problem.
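A minimal sketch of stop word filtering follows (the list below is a tiny hand-picked one, purely for illustration; real lists, such as the ones shipped with NLTK or scikit-learn, contain a few hundred entries):

```python
# Tiny illustrative stop word list.
stop_words = {"my", "your", "our", "i", "the", "was", "in"}

doc = "My sentence, your sentence, our sentences"
tokens = [w.strip(",").lower() for w in doc.split()]
filtered = [w for w in tokens if w not in stop_words]

print(filtered)  # ['sentence', 'sentence', 'sentences']
```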

A problem with scoring word frequency is that highly frequent words will dominate the matrix representation, even though this domination may be useless from the information point of view. What is often done instead is to use alternative representations such as TF-IDF, which stands for term frequency-inverse document frequency. There are different ways to calculate it, all roughly equivalent. If w is a word, D is the set of documents, and d is a document in D (in the preceding example, one of the sentences), then:

tfidf(w, d, D) = tf(w, d) · log( |D| / |{d in D : w appears in d}| )

where tf(w, d) is the frequency of w in d (the number of times it appears), and the denominator inside the logarithm is the number of documents that contain w.

You can play around with this definition: for example, instead of the frequency of the word in the numerator, you can use the characteristic function (0 if the word is not there, 1 if it is), or the logarithm of the frequency. Similarly, you can consider different possibilities for the denominator.
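The formula above can be computed directly; the following Python sketch uses raw counts for the term frequency and the natural logarithm (both are choices, as just discussed):

```python
import math

# The example corpus, already lower-cased and tokenized.
corpus = [
    ["my", "sentence"],
    ["your", "sentence"],
    ["my", "sentence", "your", "sentence", "our", "sentences"],
]

def tfidf(word, doc, docs):
    tf = doc.count(word)                        # raw term frequency in this document
    df = sum(1 for d in docs if word in d)      # number of documents containing the word
    return tf * math.log(len(docs) / df)

# "sentence" appears in every document, so its idf (and hence its tf-idf) is 0;
# "our" appears in a single document and gets a higher weight.
print(tfidf("sentence", corpus[2], corpus))   # 0.0
print(tfidf("our", corpus[2], corpus))        # log(3) ≈ 1.10
```

In practice you would typically rely on a library implementation (for instance, scikit-learn's TfidfVectorizer); be aware that such implementations often add smoothing and normalization, so the exact numbers differ from the raw formula above.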

The issue with removing the punctuation is a bit more subtle, and it has to do with the main disadvantage of the bag of words approach: meaning is completely lost. Even without considering punctuation, sentences like "Alice loves pizza" and "Pizza loves Alice" would be represented identically, although they have different meanings. With punctuation we can get completely opposite meanings: the sentences "Pardon, impossible execution" and "Pardon impossible, execution" mean opposite things.
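This loss of word order is easy to verify: after lower-casing and tokenizing, both sentences yield exactly the same bag of words, as the following snippet (names chosen purely for illustration) shows:

```python
from collections import Counter

a = Counter("alice loves pizza".split())
b = Counter("pizza loves alice".split())
print(a == b)   # True: the two sentences have identical bag of words representations
```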

Context is also lost, and relations between words might be lost as well. For instance, the documents "I was in Paris" and "I saw the Eiffel tower" are clearly related, but (once stop words such as "I" and "the" are removed) they share no terms, so they would appear as orthogonal documents in a bag of words representation. We will address some of these issues in later chapters.
