Chapter 6. Words and Pixels – Working with Unstructured Data

Most of the data we have looked at thus far is composed of rows and columns with numerical or categorical values. This sort of information fits in both traditional spreadsheet software and the interactive Python notebooks used in the previous exercises. However, data is increasingly available in both this form, usually called structured data, and more complex formats such as images and free text. These other data types, also known as unstructured data, are more challenging than tabular information to parse and transform into features that can be used in machine learning algorithms.

What makes unstructured data challenging to use? Largely the fact that images and text are extremely high dimensional, consisting of a much larger number of columns or features than we have seen previously. A document may contain thousands of words, and an image thousands of individual pixels, and each of these components, individually or in complex combinations, may serve as a feature for our algorithms. To use these data types in prediction, we therefore need to distill this extremely complex data into common features or trends that can be used effectively in a model, which often involves both removing noise and finding simpler representations. At the same time, the greater inherent complexity of these data types potentially captures more information than is available in tabular datasets, or may reveal information that is not available in any other source.

In this chapter, we will explore unstructured data by:

  • Cleaning raw text through stemming, stop word removal, and other normalizations
  • Using tokenization and n-grams to find common patterns in textual data
  • Normalizing image data and removing noise
  • Decomposing images into lower dimensional features through several common matrix factorization algorithms

Working with textual data

In the following example, we will consider the problem of separating text messages sent between cell phone users. Some of these messages are spam advertisements, and the objective is to separate these from normal communications (Almeida, Tiago A., José María G. Hidalgo, and Akebo Yamakami. Contributions to the study of SMS spam filtering: new collection and results. Proceedings of the 11th ACM symposium on Document engineering. ACM, 2011). By looking for patterns of words that are typically found in spam advertisements, we could potentially derive a smart filter that would automatically remove these messages from a user's inbox. However, while in previous chapters we were concerned with fitting a predictive model for this kind of problem, here we will be shifting focus to cleaning up the data, removing noise, and extracting features. Once these tasks are done, either simple or lower-dimensional features can be input into many of the algorithms we have already studied.

Cleaning textual data

Let us start by loading and inspecting the data using the following commands. Note that we need to supply column names for this data ourselves:

>>> import pandas as pd
>>> spam = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t', header=None)
>>> spam.columns = ['label','text']
>>> spam.head()

This gives the following output:

[Output: the first five rows of the DataFrame, showing the label and text columns]

The dataset consists of two columns: the first contains the label (spam or ham) indicating whether the message is an advertisement or a normal message, respectively. The second column contains the text of the message. Right at the start, we can see a number of problems with using this raw text as input to an algorithm to predict the spam/nonspam label:

  • The text of each message contains a mixture of upper and lower case letters, but this capitalization does not affect the meaning of a word.
  • Many words (to, he, the, and so on) are common, but tell us relatively little about the message.

Other issues are subtler:

  • When we compare words such as larger and largest, the most information about the meaning of the words is carried by the root, large. Differentiating between the two forms may actually prevent us from capturing common information about the presence of the word large in a text, since the count of this stem in the message will be divided between the variants.
  • Looking only at individual words does not tell us about the context in which they are used. Indeed, it may be more informative to consider sets of words.
  • Even for words that do not fall into the common category, such as and, the, and to, it is sometimes unclear whether a word appears in a document because it is common across all documents or because it carries special information about that particular document. For example, in a set of online movie reviews, words such as character and film will appear frequently, but do not help to distinguish one review from another since they are common across all reviews.
  • Because the English language has a large vocabulary, the size of the resulting feature set could be enormous.

Let us start by cleaning up the text before delving into the other feature issues. We can begin by lowercasing all the text using the following function:

>>> def clean_text(text):
…       # lowercase every character in the message
…       return "".join([c.lower() for c in text])

We then apply this function to each message using the map function we have seen in previous examples:

>>> spam.text = spam.text.map(lambda x: clean_text(x))
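
As an aside, pandas also provides vectorized string methods, so the same lowercasing can be achieved without defining a helper function; a minimal equivalent would be:

>>> spam.text = spam.text.str.lower()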

Inspecting the resulting column, we can verify that all the letters are now indeed lowercase:

[Output: the first few messages, now entirely in lowercase]

Next, we want to remove common words and trim the remaining vocabulary to just the stem of each word, which carries the most useful information for predictive modeling. We do this using the Natural Language Toolkit (NLTK) library (Bird, Steven. NLTK: The Natural Language Toolkit. Proceedings of the COLING/ACL on Interactive Presentation Sessions. Association for Computational Linguistics, 2006.). The list of stop words is part of the data distributed with this library; if this is your first time using NLTK, you can run the nltk.download() command to open a graphical user interface (GUI) where you can select the content you wish to copy to your local machine, using the following commands:

>>> import nltk
>>> nltk.download()
>>> from nltk.corpus import stopwords
>>> stop_words = stopwords.words('english')
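
If you prefer to skip the GUI, individual resources can also be downloaded by name; the stop word list and the punkt tokenizer data used by nltk.word_tokenize below can be fetched directly (resource names assumed for recent NLTK releases):

>>> nltk.download('stopwords')
>>> nltk.download('punkt')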

We then define a function to perform stemming:

>>> stemmer = nltk.stem.porter.PorterStemmer()
>>> def stem_text(text):
…       # tokenize the message, drop stop words, and reduce each remaining token to its stem
…       return " ".join([stemmer.stem(t) for t in nltk.word_tokenize(text)
…                        if t not in stop_words])

Finally, we again use a lambda function to perform this operation on each message, and visually inspect the results:

>>> spam.text = spam.text.map(lambda x: stem_text(x))

[Output: the first few messages after stop word removal and stemming]

For example, you can see the stem joke has been extracted from joking, and avail from available.

Now that we have performed lowercasing and stemming, the messages are in a relatively clean form, and we can proceed to generate features for predictive modeling from this data.

Extracting features from textual data

For perhaps the simplest possible text feature, we use a binary vector of 0s and 1s to simply record the presence or absence of each word in our vocabulary in each message. To do this, we can use the CountVectorizer class in the scikit-learn library with the following commands:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect_sparse = CountVectorizer().fit_transform(spam.text)

By default, the result is stored as a sparse matrix, which means that only the non-zero elements are held in memory. To calculate the total size of the feature vector for the first message, we need to transform it back into a dense representation (where all elements, even 0s, are stored in memory):

>>> count_vect_sparse[0].todense().size

By checking the length of the feature vector created for the first message, we can see that a vector of length 7,468 is produced for each message, with 1 and 0 indicating the presence or absence, respectively, of a particular word from the overall vocabulary of the collection.
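
Strictly speaking, CountVectorizer records how many times each word occurs rather than a pure 0/1 indicator; for short messages such as these, the two are usually identical, but if we want a guaranteed binary presence/absence representation, we can pass the binary argument (a minimal variant of the command above):

>>> binary_vect_sparse = CountVectorizer(binary=True).fit_transform(spam.text)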

We can check that this length is in fact the same as the vocabulary (union of all unique words in the messages) using the following command to extract the vocabulary_ element of the vectorizer, which also gives a value of 7,468:

>>> len(CountVectorizer().fit(spam.text).vocabulary_)

Recall from earlier that individual words might not be informative features if their meaning depends on the context given by other words in a sentence. Thus, if we want to expand our feature set to potentially more powerful features, we could also consider n-grams, sets of n co-occurring words (for example, the phrase the red house contains the 2-grams the red and red house, and the 3-gram the red house). These features are calculated in much the same way as above, by supplying the ngram_range argument to the CountVectorizer constructor:
>>> count_vect_sparse = CountVectorizer(ngram_range=(1, 3)).fit_transform(spam.text)

We can see that this increases the size of the resulting feature vector roughly tenfold by again inspecting the length of the first row:

>>> count_vect_sparse[0].todense().size
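
If we want to see which n-grams were actually generated, we can fit the vectorizer separately and inspect its vocabulary; a minimal sketch (the exact terms listed will depend on the cleaning steps applied above):

>>> ngram_vect = CountVectorizer(ngram_range=(1, 3)).fit(spam.text)
>>> list(ngram_vect.vocabulary_.keys())[:10]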

However, even after calculating n-grams, we still have not accounted for the fact that some words or n-grams might be common across all messages and thus provide little information in distinguishing spam from nonspam. To account for this, instead of simply recording the presence or absence of a word (or n-gram), we might compare the frequency of words within a document to their frequency across all documents. This ratio, the term frequency-inverse document frequency (tf-idf), is calculated in its simplest form as:

$$\mathrm{tf\text{-}idf}(t_i, d_j) = \frac{\sum_{v_k \in V_j} \mathbf{1}_{v_k = t_i}}{\frac{1}{D}\sum_{j'=1}^{D} \mathbf{1}_{t_i \in V_{j'}}}$$

Where t_i is a particular term (word or n-gram), d_j is a particular document, D is the number of documents, V_j is the set of words in document j, and v_k is a particular word in document j. The subscripted 1 in this formula is known as an indicator function, which returns 1 if the subscripted condition is true and 0 otherwise. In essence, this formula compares the frequency (count) of a word within a document to the fraction of documents that contain this word. As the number of documents containing the word decreases, the denominator decreases, and the overall value becomes larger because we are dividing by a quantity much less than 1. This is balanced by the frequency of the word within the document in the numerator. Thus, the tf-idf score more heavily weights words that occur at greater frequency within a document than across all documents, and which might therefore be indicative of special features of a particular message.
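
To make this concrete, here is a small worked example with hypothetical numbers: in a collection of 5,000 messages, a word that appears twice in a given message but occurs in only 50 of the messages receives a far higher weight than a word that also appears twice but occurs in 4,000 of the messages:

$$\frac{2}{50/5000} = 200 \qquad \text{versus} \qquad \frac{2}{4000/5000} = 2.5$$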

Note that the formula above represents only the simplest version of this expression. There are also variants in which we might logarithmically transform the counts (to offset the bias from large documents), or scale the numerator by the maximum frequency found for any term within a document (again, to offset the bias whereby longer documents tend to have higher term frequencies than shorter ones simply because they contain more words) (Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Scoring, term weighting and the vector space model. Introduction to Information Retrieval 100 (2008): 2-4.). We can apply tf-idf to the spam data using the following commands:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tf_idf = TfidfVectorizer().fit_transform(spam.text)

We can see the effect of this transformation by taking the maximum value across rows using:

>>> tf_idf.todense().max(1)

Here, the '1' argument to max indicates that the maximum is taken across each row (columns would be specified with '0'). When our features were simple counts (mostly 0s and 1s for these short messages), the row maxima were integers, but we can see that they are now float values produced by the tf-idf weighting.
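
If we also want to see which term received the highest weight in a given message, we can map the column index back to a term through the fitted vectorizer; a minimal sketch (get_feature_names_out is available in recent scikit-learn releases; older versions use get_feature_names instead):

>>> tfidf_vect = TfidfVectorizer().fit(spam.text)
>>> first_row = tfidf_vect.transform(spam.text[:1]).toarray().ravel()
>>> tfidf_vect.get_feature_names_out()[first_row.argmax()]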

The final text feature we will discuss is concerned with condensing our feature set. Simply put, as we consider larger and larger vocabularies, we will encounter many words that are so infrequent as to almost never appear. However, from a computational standpoint, even a single instance of a word in one document is enough to expand the number of columns in our text features for all documents. Given this, instead of directly recording whether a word is present, we might think of compressing this space requirement so that we use fewer columns to represent the same dataset. While in some cases two words might map to the same column, in practice this happens infrequently enough, due to the long-tailed distribution of word frequencies, that it can serve as a handy way to reduce the dimensionality of our text data.

To perform this mapping, we make use of a hash function that takes a word as input and outputs a number (the column location) that is determined by the value of that string. The number of columns we ultimately map to in our transformed dataset is controlled by the n_features argument to the HashingVectorizer, which we can apply to our dataset using the following commands:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> h = HashingVectorizer(n_features=1024).fit_transform(spam.text)
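
To make the idea concrete, here is a minimal sketch of the hashing trick itself. This is not scikit-learn's internal implementation (HashingVectorizer uses MurmurHash3 and a sign trick to lessen the effect of collisions), but it illustrates how a token is deterministically mapped to a fixed column index:

>>> import hashlib
>>> def hash_column(token, n_features=1024):
…       # hash the token string and reduce it modulo the desired number of
…       # columns to obtain a column index
…       digest = hashlib.md5(token.encode('utf-8')).hexdigest()
…       return int(digest, 16) % n_features
>>> hash_column('free')   # the same token always maps to the same column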

Using dimensionality reduction to simplify datasets

Even though using the HashingVectorizer allows us to reduce the data to a set of 1,024 columns from a feature set that was much larger, we are still left with many variables in our dataset. Intuition tells us that some of these features, either before or after the application of the HashingVectorizer, are probably correlated. For example, a set of words may co-occur in a document that is spam. If we use n-grams and the words are adjacent to one another, we could pick up on this feature, but not if the words are simply present in the message but separated by other text. The latter might occur, for example, if some common terms are in the first sentence of the message, while others are near the end.

More broadly, given a large set of variables such as we have already seen for textual data, we might ask whether we could represent these data using a more compact set of features. In other words, is there an underlying pattern to the variation in thousands of variables that may be extracted by calculating a much smaller number of features representing patterns of correlation between individual variables? In a sense, we already saw several examples of this idea in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, in which we reduced the complexity of a dataset by aggregating individual datapoints into clusters. In the following examples, we have a similar goal, but rather than aggregating individual datapoints, we want to capture groups of correlated variables.

While we might achieve this goal in part through variable selection techniques such as regularization, which we discussed in Chapter 4, Connecting the Dots with Models – Regression Methods, we do not necessarily want to remove variables, but rather capture their common patterns of variation.

Let us examine some of the common methods of dimensionality reduction and how we might choose between them for a given problem.
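
As a preview of the kind of transformation we are after, here is a minimal sketch using scikit-learn's TruncatedSVD, one matrix factorization method that works directly on sparse inputs such as our tf-idf features (the choice of 10 components here is arbitrary and purely illustrative):

>>> from sklearn.decomposition import TruncatedSVD
>>> svd = TruncatedSVD(n_components=10)
>>> reduced = svd.fit_transform(tf_idf)
>>> reduced.shape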
